Hundreds of millions of rows of sensitive text: how can personal information be removed?

Advances in AI-based methods have enabled large scale use of free text data in knowledge-based management, research, service improvement, and the development of new innovations. At the same time, the EU General Data Protection Regulation (GDPR), the European Health Data Space (EHDS) Regulation and the Finnish Act on the Secondary Use of Health and Social Data impose strict requirements on the responsible and secure handling of sensitive text data:

Data may only be used for clearly predefined purposes.
Identifying information must be minimized before being used for analytics and AI development.

In practice, research and AI projects dealing with large amounts of data have either a) focused only on structured data, omitting unstructured text, or b) utilized texts containing personal information, potentially breaching regulations.

The rationale for both approaches has been the same: the automated removal of identifying information has not been feasible for large scale free text datasets. However, this is no longer the case. Organizations can now ensure the appropriate and compliant use of free text data across different research and AI use cases.

The right tools for processing large volumes of text

Manual anonymization involving human intervention is slow, costly, and in many cases impossible when dealing with hundreds of millions of texts. For datasets of this scale, anonymization using large language models is also too expensive and slow.

To address this challenge, Tieto Caretech has developed a solution designed specifically to detect identifying information in free text. The solution is based on a small language model trained on synthetic social and healthcare texts to identify and remove both direct and indirect identifiers in free form text.

The solution offers significant advantages:

Direct and indirect identifiers are detected with high accuracy.
Large scale free text data can be processed cost effectively and quickly.
The small language model does not hallucinate; it operates deterministically in the anonymization process.

This makes the large-scale removal of identifying information both scalable and realistic, even for massive text volumes.

When processing large free text datasets, the costs of large language models increase significantly compared to small language models.

When processing large free text datasets, the costs of large language models increase significantly compared to small language models.

Safer data opens the door to AI adoption

Once sensitive information has been removed and reliably masked, organisations can use large free text datasets safely and in compliance with legislation. This enables:

broader use of textual data in research
more responsible AI development
public disclosure of legal decisions and other official documents
wider internal use of customer feedback and other texts sources containing identifying information
the development of new services and innovations using data where the risk of identifying individuals has been minimized.

Is your organization thinking about utilizing sensitive free text data? Are you unsure whether your current free text processing practices meet requirements? Get in touch — we are continuously developing and expanding our capabilities in processing different types of free text data.

Simo Ihalainen

Senior Data Scientist, PhD, Tieto Caretech

Simo Ihalainen has a strong background in developing AI solutions and text analytics methods. He holds a doctorate degree in biomechanics and is a data science specialist, with a particular focus on the automated processing of large-scale free text datasets. His work centres on developing methods and models to minimise identifying information in sensitive text data.

Hundreds of millions of rows of sensitive text: how can personal information be removed?

Table of contents

The right tools for processing large volumes of text

Safer data opens the door to AI adoption