Case Study: Developing An Automatic Keyword Assignment Tool for the National Library of Estonia

TEXTA · Jun 28, 2023

Introduction

About 1,900 periodicals are published in Estonia annually; broken down into single issues, the number is far larger. Under the Legal Deposit Copy Act, the National Library receives the files of most periodical publications published in Estonia. They are stored and made available as individual articles in the Estonian digital archive (hereinafter DEA). Today, DEA holds about 7 million articles, and it is constantly growing.

Although the compilation of DEA began in 2014, the database also includes newspapers and magazines published before World War II and in the period between the restoration of independence and the enactment of the Legal Deposit Copy Act. The database contains the full texts of articles along with some identifiers like title, author, date and publication. In other respects, the database is unorganized, i.e. articles can be found only by searching for words in the text or in the metadata. This has created a situation where it gets harder and harder to find the necessary information in the constantly growing database.

An Example:

Let’s consider a scenario where a user wants to find articles talking specifically about fish. They type the word “fish” (“kala”) into DEA’s search engine and, as a result, get:

a) Articles that actually cover the topic of fish in a way that is relevant to their interests;

b) Articles that contain the word “fish” but are in no way related to the topic the user is actually interested in, e.g.: “You know what they say — plenty of fish in the sea!”

As the non-relevant articles may outnumber the relevant ones by a huge margin, it will take the user a lot of time and effort to navigate the results to find what they are actually looking for. Furthermore, some relevant articles might not be discovered at all if they do not contain the search word explicitly; e.g. an article about salmon migration might not even mention the word “fish”.

How can we make navigating the information easier?

Add keywords!

But now we have stumbled upon a couple of new problems:

1. Manual keyword assignment is time-consuming;

2. Manual keyword assignment is always subjective to a degree.

To defeat both of those adversaries, we can develop a tool that adds the keywords automatically. That, however, doesn’t mean our problems are over. Far from it, actually: much like energy, problems just change from one form into another.

Data-driven problems & solutions

1. Time

Although every project has to race against time, time carries another layer of meaning in this one: the oldest records in the dataset are from the 19th century. Like hoop skirts, slavery and horseback riding as a primary mode of transportation, a lot of the language used back then has fallen out of fashion. Along with shifts in terminology, the orthography of many words also differs from their contemporary counterparts. In addition, the documents are digitized by first scanning them and then applying optical character recognition (OCR), which tends to produce at least a few (and in the worst case many) typographical errors.

An example of a scanned document from 1920

Solution: With enough training examples, the model(s) could still learn, even from outdated texts containing spelling errors. Unfortunately, we were unable to meet this condition (more on that in the next section). Although there are ways to correct the errors and replace words with contemporary synonyms, this would have required a lot more time than we had for this project. So the solution to this problem was similar to dealing with a middle school bully: ignore it, at least for the time being.

2. Lack of training data

At first sight, this might seem to contradict everything covered in the introduction. Didn’t we just claim that the amount of data we are dealing with is extremely huge? Unfortunately, one relevant element is still missing: the labels. In order to get a model to predict suitable keywords, it first has to learn the features corresponding to those keywords, and that requires a mapping between the texts and the keywords. However, the number of labeled documents was only around 50,000. While this might not seem too bad (50,000 is still a pretty solid amount, after all), we have to take into account that it is necessary to have enough training examples for each keyword. Unfortunately, the average number of examples per keyword was only 2, while the practical minimum is at least a few dozen.

Solution: We trained supervised machine learning models for the keywords that had enough examples and used unsupervised methods for predicting the rest.
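
To make the split concrete, here is a minimal sketch of the “enough examples” check; the article structure (a list of dicts with a keywords field) is a hypothetical stand-in for the real DEA data layout, and the threshold is illustrative:

```python
# Hypothetical sketch: split keywords into "trainable" vs. under-represented
# based on how many labeled articles each keyword appears in.
from collections import Counter

def split_by_coverage(articles, min_examples=30):
    """Count training examples per keyword and split the keyword set."""
    counts = Counter(kw for article in articles for kw in article["keywords"])
    trainable = {kw for kw, n in counts.items() if n >= min_examples}
    sparse = set(counts) - trainable
    return trainable, sparse

# Toy usage: only "kalad" (fish) has enough examples here.
articles = [{"keywords": ["kalad", "raalid"]}, {"keywords": ["kalad"]}]
trainable, sparse = split_by_coverage(articles, min_examples=2)
print(trainable, sparse)  # {'kalad'} {'raalid'}
```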

3. A huge number of possible labels

While the average number of keywords per article is about 5, the total number of possible keywords in the taxonomy (EMS) used by the National Library of Estonia is around 40,000. A simplified baseline solution would mean applying a model for keyword 1 to the text and getting the resulting probability, then applying a model for keyword 2 and getting its probability, and so on, until we have the results for all 40,000 keywords. Then we could just select the keywords with the highest probabilities as the final labels. Easy! But… let’s assume that applying the model for one keyword takes approximately 1 second. With sequential application, labeling a single article would take about 40,000 seconds, i.e. roughly 11 hours!
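
As a toy illustration of that naive baseline (the per-keyword models registry and its predict_proba call are hypothetical stand-ins, not any real library):

```python
# Hypothetical sketch of the naive baseline: one binary model per keyword,
# applied sequentially to the same text.
def naive_predict(text, models, top_k=5):
    """Score the text against every keyword model and keep the best ones."""
    scores = {}
    for keyword, model in models.items():   # ~40,000 iterations
        scores[keyword] = model.predict_proba(text)
    # At ~1 second per model, this loop alone costs ~40,000 s, i.e. ~11 hours.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```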

If we had unlimited hardware and as much computational power as we wished, this wouldn’t be a huge issue: we could simply apply the models in parallel and the application would be fast. Unfortunately, this is rarely the case in practice, so we still need a smarter algorithmic solution.

Solution: We used Texta Hybrid Tagger. It reduces the number of possible label candidates by comparing the source document to the ones in an existing database, gathering only the labels of similar documents as potential candidates, ranking them by frequency and finally taking only the n most frequent ones into consideration. As n is usually set to around 20, we can predict the final set of keywords in 20 seconds instead of 11 hours, which makes it 2,000x faster than the naive solution.
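
For illustration, here is a simplified sketch of that candidate-reduction idea (not the actual Hybrid Tagger implementation; the search_similar similarity search and the models registry are hypothetical stand-ins):

```python
# Hypothetical sketch: score only the labels of similar documents instead
# of all ~40,000 keyword models.
from collections import Counter

def hybrid_predict(text, search_similar, models, n=20, threshold=0.5):
    """Predict keywords by scoring only the labels of similar documents."""
    similar_docs = search_similar(text, size=50)  # e.g. BM25 or embeddings
    label_counts = Counter(
        kw for doc in similar_docs for kw in doc["keywords"]
    )
    candidates = [kw for kw, _ in label_counts.most_common(n)]
    # Only ~n model applications instead of ~40,000.
    return [
        kw for kw in candidates
        if kw in models and models[kw].predict_proba(text) >= threshold
    ]
```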

4. Different categories of keywords

So far we have been talking about keywords on a very general level, but in the case of the National Library of Estonia, the keywords have to be divided into strict categories: persons (e.g. “Barack Obama”), organizations (e.g. “North Atlantic Treaty Organization”), titles (e.g. “Top Gun: Maverick”), time keywords (e.g. “19th century”), topic keywords (e.g. “economy”), genres (e.g. “horror”) et cetera.

Once again, with an unlimited number of examples for each keyword, this would not be a huge problem: the model would easily learn to distinguish articles written about “Harry Potter and the Chamber of Secrets” (the book) from the ones written about “Harry Potter and the Chamber of Secrets” (the movie). However, considering the actual amount of data, we would be lucky to make the model understand that the article contains any information about “Harry Potter” at all.

Solution: Use different methods for different types of keywords, e.g. named entity recognition with frequency thresholds for persons and organizations, a combination of machine learning and keyword extraction for topic keywords, a customized algorithm for time keywords et cetera.
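
As an example of the first of these, here is a hedged sketch of frequency-threshold filtering for person keywords; extract_entities is a hypothetical placeholder for whatever NER component is used, returning (surface form, label) pairs:

```python
# Hypothetical sketch: keep only persons mentioned often enough to
# plausibly be a topic of the article rather than a passing reference.
from collections import Counter

def person_keywords(text, extract_entities, min_mentions=2):
    """Return person names whose mention count passes the threshold."""
    mentions = Counter(
        name for name, label in extract_entities(text) if label == "PER"
    )
    return [name for name, n in mentions.items() if n >= min_mentions]
```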

5. The keywords HAVE to belong to a preexisting taxonomy

This might seem like a made-up problem from the perspective of supervised machine learning models, as we are already using labels from the preexisting taxonomy for training them. But it gets more complicated if we want to use unsupervised methods: although there are plenty of unsupervised keyword extraction methods, there is no guarantee that the predicted keywords actually belong to the taxonomy, and it gets even more complex with morphologically rich languages like Estonian. Some issues include:

1. The predicted keyword being in a different form or case than its counterpart in EMS, e.g. “hobustega” (with horses) vs. “hobused” (horses).

2. The predicted keyword being outdated and replaced with a more contemporary synonym in EMS, e.g. “raalid” (computers) vs. “arvutid” (computers).

3. The predicted keyword not existing in the taxonomy at all, e.g. “treeningandmed” (training data).

Furthermore, how do we link a detected NER entity to the correct person? It might seem easy on paper, as it isn’t too complicated to link, for example, “Barack Obama” with its counterpart in the database. But what about different persons with the same name? Without careful entity linking, we might assign the label “Johann Sebastian Bach” to “Sebastian Bach”, and this might be enough for some poor kid’s music history report to state that Johann Sebastian Bach achieved mainstream success as the frontman of the hard rock band Skid Row.

Solution: Use lemmatization and fuzzy matching to link the extracted keywords with the ones in the taxonomy (a sketch of this step follows the image below). However, disambiguating persons with the same (or overlapping) names would require considering the overall context as well, and for that, we would still need enough data to train supervised machine learning models.

Harris Dickinson (an English actor) vs. Steve Harris and Bruce Dickinson (members of Iron Maiden).
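
Here is the promised minimal sketch of the lemmatize-then-fuzzy-match linking step; lemmatize is a hypothetical placeholder for an Estonian morphological analyzer (e.g. one backed by EstNLTK), while the matching uses the real rapidfuzz library:

```python
# Hypothetical sketch: normalize an extracted keyword to its lemma, then
# fuzzy-match it against the EMS taxonomy terms.
from rapidfuzz import fuzz, process

def link_to_taxonomy(keyword, taxonomy_terms, lemmatize, cutoff=90):
    """Map a raw extracted keyword to its closest EMS term, or None."""
    lemma = lemmatize(keyword)  # e.g. "hobustega" -> "hobune"
    match = process.extractOne(
        lemma, taxonomy_terms, scorer=fuzz.ratio, score_cutoff=cutoff
    )
    return match[0] if match else None  # extractOne yields (term, score, index)
```

Note that fuzzy matching with a high cutoff also tolerates small differences between the lemmatizer output and the taxonomy’s preferred form, such as singular vs. plural.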

Final solution

We used Texta Hybrid Tagger for training supervised prediction models for around 1,000 keywords, RaKUn and lexical matching for unsupervised keyword detection, and fuzzy matching for linking the keywords with the taxonomy. We used NER along with entity linking for detecting persons and organizations, and a customized algorithm for predicting time keywords. The tool is available at https://marta.nlib.ee/ (but currently only in Estonian and for Estonian data).
