Data Forensics Case Study — Ministry of Education and Research

TEXTA · Apr 25, 2023

Estonia is widely known as a pioneer in building a digital state, having created the world's first paperless government: e-Estonia. Today 99 percent of all Estonian state services can be accessed online, 98 percent of Estonian citizens have an electronic ID card, and more than 45 percent of the population votes online in elections. Part of the e-Estonia concept is that all public documents should be stored and published in Electronic Document and Records Management Systems (DMS). Furthermore, unless deemed otherwise, documents should be made publicly available and visible. But this also means that before documents are published, someone has to check whether they contain confidential information.

Within the area of government of the Ministry of Education and Research there are several state agencies, foundations and institutions which the Ministry manages strategically: defining their goals, analysing their results, determining their budgets, exercising supervision, and so on. The Ministry of Education and Research also owns the DMS that all of these subsidiaries use. This means that around 500 independent schools and kindergartens upload documents to it daily, and they are also the ones responsible for ensuring that all sensitive personal data is marked as confidential. With no automatic checks this is prone to human error, and that is exactly what happened.

Journalists discovered that personal and extremely sensitive data about students had somehow been made public by accident. There were documents describing health issues and visits to the doctor, grades, inquiries to the police, evaluations of behaviour and much more, all with names and other identifying information.

When the leak was made public, the Ministry of Education and Research reacted immediately. They shut down the DMS and started auditing it. The problem was that they had no idea how many such documents there were, or how to find them among the millions of documents in the system. This is when they turned to TEXTA.

Considering that the data was highly sensitive, it was crucial that it would not leave the ministry. That was no problem for us, because TEXTA Toolkit can run on premise or be hosted by us; in fact, it does not even need a connection to the outside world to function. As their choice was on premise, we gave them the requirements for running our software and they created a virtual server for us. TEXTA Toolkit is dockerised, so once the server was up and running, setting the Toolkit up took almost no time. And once that was done, we could move on to the part we know best: the data.
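To give a sense of how little is involved once the virtual server exists, here is a rough sketch of such a setup step in Python. The compose file name and the internal health-check URL are assumptions for illustration, not the actual deployment scripts or endpoints used at the ministry.

```python
# Illustrative sketch only: bring up a dockerised service and wait for it to
# answer on the internal network. File name and URL are assumed, not real.
import subprocess
import time
import urllib.request

COMPOSE_FILE = "docker-compose.yml"       # assumed compose file
HEALTH_URL = "http://10.0.0.5:8000/"      # assumed internal address

# Start the containers on the on-premise virtual server.
subprocess.run(["docker", "compose", "-f", COMPOSE_FILE, "up", "-d"], check=True)

# Poll until the service responds; no outside connectivity is required.
for _ in range(30):
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            if resp.status == 200:
                print("Toolkit is up")
                break
    except OSError:
        time.sleep(10)
```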

Our first task with the data was to make it understandable for the machine. Fortunately, over the years we have created tools for almost every text-analytics task. In this case we could use our document parser. In very simple terms, the document parser takes all the documents you point it at, whether on a server, a hard drive or anywhere else, and makes them analysable and searchable in the TEXTA Toolkit. It can handle emails, attachments, Word documents, Excel files, etc.
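As a rough illustration of what document parsing means in practice (and not our actual parser, which handles far more formats and edge cases), a minimal Python version of the idea might look like this. The folder name is made up.

```python
# Minimal sketch of document parsing: walk a folder, pull plain text out of a
# few common formats, and collect records that a search engine could index.
import zipfile
import email
from email import policy
from pathlib import Path
from xml.etree import ElementTree


def docx_to_text(path: Path) -> str:
    """Extract visible text from a .docx file (a zip containing XML)."""
    with zipfile.ZipFile(path) as zf:
        xml_bytes = zf.read("word/document.xml")
    root = ElementTree.fromstring(xml_bytes)
    # Every piece of body text sits in a <w:t> element.
    ns = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
    return " ".join(node.text or "" for node in root.iter(f"{ns}t"))


def eml_to_text(path: Path) -> str:
    """Extract the plain-text body of an .eml email message."""
    msg = email.message_from_bytes(path.read_bytes(), policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    return body.get_content() if body else ""


def parse_folder(folder: str) -> list[dict]:
    records = []
    for path in Path(folder).rglob("*"):
        if path.suffix == ".docx":
            text = docx_to_text(path)
        elif path.suffix == ".eml":
            text = eml_to_text(path)
        elif path.suffix == ".txt":
            text = path.read_text(errors="ignore")
        else:
            continue  # a real parser also handles Excel, PDF, attachments, ...
        records.append({"path": str(path), "text": text})
    return records


if __name__ == "__main__":
    for doc in parse_folder("documents"):   # "documents" is a made-up folder name
        print(doc["path"], len(doc["text"]))
```

In the real pipeline, records like these are loaded into the Toolkit's search index so they can be searched and preprocessed further.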

Having all the documents parsed is just the first step. Next we had to run all the preprocessing tasks that were necessary from a machine-learning and linguistic point of view. This was needed because not all of the documents were in Estonian (there are also Russian-speaking schools, for example). And even within a single language there are complications: Estonian has 14 grammatical cases, and it would be impossible to spell out all of them when building more complex searches. We have a tool for this task as well. It is called MLP, or Multilingual Processor. It is a delicate piece of software that automatically detects languages, lemmatises text (the process of grouping together different inflected forms of the same word) and tokenises it (a process that splits paragraphs and sentences into smaller units that can be more easily assigned meaning), but it also does some fancier machine-learning work. For example, it automatically extracts all the names, organisations and locations from the text. This is very useful information once you start doing the actual investigative work, which was the last step in this project.
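To make this concrete, here is a small sketch of the same kind of preprocessing using publicly available libraries (langdetect for language detection and Stanza for Estonian tokenisation and lemmatisation). It is not MLP itself, and the example sentence is invented.

```python
# Sketch of MLP-style preprocessing with off-the-shelf libraries, not MLP itself:
# detect the language, then tokenise and lemmatise Estonian text.
from langdetect import detect
import stanza

stanza.download("et")                                   # Estonian models, one-time download
nlp_et = stanza.Pipeline("et", processors="tokenize,pos,lemma")


def preprocess(text: str) -> dict:
    lang = detect(text)                                 # e.g. "et" or "ru"
    if lang != "et":
        # Other languages would get their own pipelines in a full setup.
        return {"lang": lang, "tokens": [], "lemmas": []}
    doc = nlp_et(text)
    tokens = [word.text for sent in doc.sentences for word in sent.words]
    lemmas = [word.lemma for sent in doc.sentences for word in sent.words]
    return {"lang": lang, "tokens": tokens, "lemmas": lemmas}


# Invented example: "The student's grades were written down in the document."
print(preprocess("Õpilase hinded olid dokumendis kirjas."))
```

Lemmatisation is exactly what makes the 14 Estonian cases manageable: whatever inflected form "õpilane" (student) appears in, the lemma stays the same, so one search term covers them all.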

After running all of these automatic tasks we could finally start finding the documents containing sensitive information. To do that, we had to bring all our natural language processing tricks into play. Having an idea of what to look for, we put together complex searches: searching for phrases, excluding certain words, allowing spelling mistakes and so on. After this approach was exhausted, machine learning was the next logical step. We trained embeddings (language models) to find contextually similar words and extend the searches, we trained machine learning models using the documents we had already found as training data, and we clustered documents together.
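As an illustration of the machine-learning side (using generic open-source libraries, not our production models, and with made-up toy data), query expansion with embeddings and a simple classifier could look roughly like this:

```python
# Sketch of the two machine-learning steps described above, on toy data:
# 1) train word embeddings and use them to suggest search-term expansions,
# 2) train a classifier on documents already labelled during the manual search.
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Lemmatised documents from the preprocessing step (toy examples, in English
# here for readability; the real data was Estonian and Russian).
docs = [
    "student grade report doctor visit diagnosis",
    "student behaviour evaluation police inquiry",
    "school budget meeting agenda",
    "kindergarten menu week plan",
]
labels = [1, 1, 0, 0]          # 1 = contains sensitive personal data

# 1) Embeddings for query expansion: words that appear in similar contexts end
#    up close together, so they can be added to the searches.
emb = Word2Vec(sentences=[d.split() for d in docs],
               vector_size=50, window=3, min_count=1, epochs=200)
print(emb.wv.most_similar("doctor", topn=3))

# 2) A classifier trained on the documents found so far, used to rank the
#    remaining documents by how likely they are to contain sensitive data.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(docs, labels)
print(clf.predict_proba(["grade list with doctor notes"])[0][1])
```

In the real audit the expanded terms went back into the search queries, and clustering grouped similar documents together so whole batches could be reviewed at once.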

Does all this seem like a lot? Actually, it was not. With the help of TEXTA Toolkit we conducted the audit in approximately two weeks and handed over a list of more than 1,500 documents that should never have been in public view.
