Anonymizing Identifying Information in Court Cases / Case Study

TEXTA · Nov 25, 2022

Project Description

This was a project we did for the Centre of Registers and Information Systems (Registrite ja Infosüsteemide Keskus, RIK). The goal of the project was to anonymize data from Estonian court decisions using machine learning models.

These court decisions contain identifying information such as:

  • people’s names (like J.T. or Jaan-Juhan Tamm-Lepale)
  • ID codes
  • birth dates (like 4.10.2022 or sünd. 4. oktoobril 2022)
  • organizations’ names (like Texta or OÜsse Texta.)
  • business registry codes
  • phone numbers
  • IBAN bank accounts
  • email accounts
  • addresses (like Õnne 13, Morna linn, Raplamaa or Õnne 13.)

Removing this information by hand takes a long time, especially since some court decisions can run to more than 50 pages. Luckily, most of this identifying information can be found automatically with named entity recognition (NER): a named entity can refer to, for example, a person, location, organization, or product.

We trained models using two types of machine learning, statistics-based conditional random fields (CRF) and the neural-network-based Stanford Stanza NER, and built additional rule-based extractors to find and replace identifying data in the documents. Using these, we can anonymize whole documents. Anonymization in this case means replacing identifying information with placeholders.

So if in a court document we found a line like this (a name, date of birth, ID code, and address):

Jaan Tamm, sünd. 4.10.2022, id-kood 62210040000, aadress Õnne 13, Morna linn, Raplamaa

we could replace it with:

K.L, sünd. X, id-kood Y, aadress Z

Our Workflow

1. Pre-processing the data and using our standard NER models.
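
As a rough illustration of this step, this is how tagging text with a Stanza NER pipeline looks in Python. Whether a ready-made Estonian (“et”) NER model is available depends on the Stanza version, and our pipeline uses custom-trained models, so the snippet is a sketch of the API rather than our production code.

    import stanza

    # Sketch of NER tagging with Stanza; illustrates the API shape only.
    stanza.download("et")  # fetch Estonian models on first run
    nlp = stanza.Pipeline(lang="et", processors="tokenize,ner")

    doc = nlp("Jaan Tamm elab aadressil Õnne 13, Morna linn.")
    for ent in doc.ents:
        print(ent.text, ent.type)  # e.g. "Jaan Tamm PER"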

2. Creating rule-based extractors for finding formulaic data.

Telephone numbers, Estonian ID codes, e-mails, and IBAN bank account numbers follow standard formats that we could capture with regular expressions. For example, email addresses contain @ or a replacer such as (ät) and can’t be too long, so j.tamm@mail.ee occurring in a document is probably an email.
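
A minimal sketch of such extractors in Python could look like the following; the patterns and names are simplified illustrations, not our production rules:

    import re

    # Simplified patterns for formulaic data; the real extractors handle
    # more edge cases and entity types.
    PATTERNS = {
        # emails, also allowing the spelled-out "(ät)" replacer
        "email": re.compile(r"\b[\w.+-]+(?:@|\(ät\))[\w-]+\.[a-z]{2,}\b", re.IGNORECASE),
        # Estonian personal ID code: century/sex digit + YYMMDD + 4 more digits
        "id_code": re.compile(r"\b[1-6]\d{2}[01]\d[0-3]\d{5}\b"),
        # Estonian IBAN: "EE" followed by 18 digits
        "iban": re.compile(r"\bEE\d{18}\b"),
    }

    def extract_formulaic(text):
        """Yield (label, match, start, end) for every rule-based hit."""
        for label, pattern in PATTERNS.items():
            for m in pattern.finditer(text):
                yield label, m.group(), m.start(), m.end()

    for hit in extract_formulaic("Kontakt: j.tamm(ät)mail.ee, id-kood 62210040000"):
        print(hit)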

Most court decisions come with metadata (personal details) that specifies which people or organizations involved in the case to redact or keep. This metadata can serve as input for rule-based extractors to find data the models missed.

3. Creating annotation tasks using our Annotator tool.

[Screenshot of the Annotator tool in action, annotating organizations in news stories.]

Our partners from RIK annotated documents containing people’s names, organizations’ names, birth dates, and addresses. We additionally annotated documents containing educational facilities ourselves.

Compared to TartuNLP EstNER, our court case dataset has 99% more people’s names and three times as many organizations’ names annotated.

4. Training models for named entities based on annotated data.

Precision shows how many of the entities the model found were actually the correct entities we were looking for.
Recall shows how many of all the correct entities the model managed to find.
F1 score is the harmonic mean of precision and recall, balancing the two.

All of these scores range from 0 to 1, with 1 being the best score and 0 the worst.
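
As a toy example of how these are computed (illustrative values, not our actual results):

    # Toy illustration of precision, recall, and F1, not our evaluation code.
    true_entities = {"Jaan Tamm", "Õnne 13", "62210040000"}  # what should be found
    predicted = {"Jaan Tamm", "Õnne 13", "Morna linn"}       # what the model found

    tp = len(true_entities & predicted)  # correctly found entities: 2
    precision = tp / len(predicted)      # 2 / 3
    recall = tp / len(true_entities)     # 2 / 3
    f1 = 2 * precision * recall / (precision + recall)
    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")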

5. Clustering and linking found data.

Clustering means that if, for example, we have found the addresses Õnne 13 and Õnne 13, Morna, we can assume they refer to the same address.

Linking means that if the address Õnne 13, Morna appears in the metadata, we can use the metadata to decide whether to replace or keep this address.

Clustering and linking were used to compile statistics on the found entities at the end of the anonymization process.
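
The clustering idea can be sketched with greedy prefix matching; our actual pipeline matches mentions more robustly, so the function below is only an illustration:

    def cluster_mentions(mentions):
        """Group mentions where one is a prefix of another, so that
        'Õnne 13' and 'Õnne 13, Morna' land in the same cluster."""
        clusters = []
        for mention in sorted(mentions, key=len, reverse=True):
            for cluster in clusters:
                if any(c.startswith(mention) or mention.startswith(c) for c in cluster):
                    cluster.append(mention)
                    break
            else:
                clusters.append([mention])
        return clusters

    print(cluster_mentions(["Õnne 13", "Õnne 13, Morna", "Pikk 2"]))
    # [['Õnne 13, Morna', 'Õnne 13'], ['Pikk 2']]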

6. Replacing found data.

If present, the metadata was used to determine which data to keep and which to redact. Otherwise, all identified people’s names were replaced by randomly generated initials. The rest of the found entities were replaced by arbitrary placeholders (such as “X”).
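
A rough sketch of this replacement logic, with a simplified entity map and hypothetical function names:

    import random
    import string

    def random_initials():
        """Generate placeholder initials such as 'K.L'."""
        return ".".join(random.choices(string.ascii_uppercase, k=2))

    def anonymize(text, entities, keep=frozenset()):
        """Replace found entities unless the metadata says to keep them.
        `entities` maps surface form -> entity type."""
        for surface, etype in entities.items():
            if surface in keep:
                continue
            placeholder = random_initials() if etype == "person" else "X"
            text = text.replace(surface, placeholder)
        return text

    doc = "Jaan Tamm, aadress Õnne 13, Morna linn"
    print(anonymize(doc, {"Jaan Tamm": "person", "Õnne 13, Morna linn": "address"}))
    # e.g. K.L, aadress X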
