Text is everywhere, and everything is text. More textual data than ever before are available to computational social scientists—be it in the form of digitized books, communication traces on social media platforms, or digital scientific articles. Researchers in academia and industry increasingly use text data to understand human behavior and to measure patterns in language. Techniques from natural language processing have created a fertile soil to perform these tasks and to make inferences based on text data on a large scale.
However, a central obstacle prevalent across research areas—particularly in computational social science and social science as a whole—is access to sensitive data. Arguably, most existing text datasets are never shared, primarily as a result of data protection restrictions. Privacy regulations are in place—quite rightly—to protect the identity of persons referred to in text data, but the consequence of this is that many of the datasets of greatest potential value cannot be studied. For example, police forces, businesses and health care organizations hold massive amounts of data with an untapped potential to solve hard problems such as identifying trends in police reports or understanding language change in patient interviews at scale. In short, while there is huge potential in text data, the most valuable datasets hardly reach the researchers with the ability to analyze, interpret and generate societal impact from such data.
In addition, isolated data sharing agreements between stakeholders and universities hardly solve the problem because they prohibit the sharing of data among researchers, which makes follow-up research and replication studies unlikely and ultimately slows scientific progress. The lack of access to sensitive data not only inhibits research but potentially also creates a bias towards questions that focus on readily available data (e.g. Twitter), instead of questions driven by real-world importance. As a whole, the problem of accessing sensitive text data poses a major obstacle to progress in computational social science. Text Wash aims to solve the underlying dilemma: how to research genuinely pressing issues using sensitive data without violating privacy and data protection regulation?
Solving the dilemma
If the primary impediment to the sharing of text data is the presence of sensitive and identifying information (e.g. names, dates), a straightforward solution lies in anonymizing text data by removing identifiable and sensitive information. However, current approaches to data anonymization require either cost- and time-intensive manual anonymization by human experts, or the automatic manipulation of texts by replacing identifying information with generic and context-independent terms (e.g. by replacing all names and dates in a text with the phrase “XXX”). The latter, for example, is the approach employed by the anonymization tool provided by the UK Data Service. However, since both the syntactic and the semantic characteristics of texts are essential for current text processing and information extraction methods to work successfully, such approaches make texts unusable for proper linguistic analyses. To overcome this problem, we propose a fully automated text anonymization tool that removes traceable and confidential information from English texts and substitutes such information with meaningful replacements. In doing so, both the syntactic correctness and the semantic meaning of each sentence will be preserved.
Preserving meaning in text anonymization
In order to illustrate the importance of preserving the semantics when altering texts for linguistic analyses, let us consider the following example. Assume we would like to anonymize the statement
“Alice was very happy about Jamie’s termination of employment with TextWash Inc. last Monday. Three days later, Jamie sued TextWash Inc.”
If Alice, Jamie and TextWash Inc. were real entities, such a statement would be difficult to share freely, since it would expose both Alice’s pleasure at Jamie’s firing and the fact that Jamie sued their former employer. Nevertheless, from a scientific perspective, this statement might contain valuable information and could be used, for example, to automatically extract the emotional valence of Alice’s reaction to Jamie’s firing. If we anonymized this text by simply replacing identifiable information with the generic phrase “XXX”, we would obtain the sentence
“XXX was very happy about XXX termination of employment with XXX last XXX. XXX later, XXX sued XXX”.
Such an approach would make it impossible to extract meaningful information from this statement. Neither humans nor computational algorithms could understand which of the de-identified entities correspond to the same person or organization, thus making it impossible to retain the semantic relationships between individual entities in the statement (in the second sentence, for example, we would not be able to understand that Jamie sued TextWash Inc. since all the references to previous occurrences of the same entity are lost).
Now if we consider an entity-specific anonymization method that i) differentiates between individual categories of identifiable information (e.g. persons, locations, organizations, dates) and ii) consistently replaces instances of these categories throughout the text, we would obtain the sentence
“[PERSON_1] was very happy about [PERSON_2] termination of employment with [ORGANISATION_1] last [DATE_1]. [DATE_2] later, [PERSON_2] sued [ORGANISATION_1]”.
The resulting text no longer contains any sensitive information and could hence be shared openly online without violating confidentiality. In addition, because the method consistently replaces each entity and all of its instances, we could confidently analyze this statement in an automated way without giving up valuable textual characteristics. The following table summarizes the characteristics of both anonymization procedures.
Entity-specific anonymization: categorizes anonymized entities; identifies reoccurrences of the same concept and thus preserves co-references between entities.
Naive anonymization with “XXX”: does not categorize anonymized entities; does not account for repeated occurrences of anonymized concepts.
This example demonstrates the importance of context-preserving text anonymization when the anonymized texts are used for linguistic analysis. To enable researchers to anonymize text in this manner, Text Wash aims to provide an easy-to-use tool that consistently replaces identifiable information, preserving both the syntactic structure and the semantic representation of texts.
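The contrast between the two procedures can be sketched in a few lines of Python. This is a minimal illustration and not the Text Wash implementation: the entity list is labeled by hand (a real tool would have to detect entities automatically), the function names naive_anonymize and consistent_anonymize are hypothetical, and mentions are matched as plain strings.

```python
import re
from collections import defaultdict

# Hand-labeled (mention, category) pairs. This is an assumption made for the
# sketch; a real tool would detect these entities automatically.
ENTITIES = [
    ("Alice", "PERSON"),
    ("Jamie", "PERSON"),
    ("TextWash Inc.", "ORGANISATION"),
    ("Monday", "DATE"),
    ("Three days", "DATE"),
]

TEXT = (
    "Alice was very happy about Jamie's termination of employment with "
    "TextWash Inc. last Monday. Three days later, Jamie sued TextWash Inc."
)


def _substitute(text, replacements):
    """Replace every key of `replacements` found in `text` in a single pass."""
    if not replacements:
        return text
    # Try longer mentions first so a short mention never splits a longer one.
    mentions = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(m) for m in mentions))
    return pattern.sub(lambda match: replacements[match.group(0)], text)


def naive_anonymize(text, entities):
    """Replace every mention with the same generic token."""
    return _substitute(text, {mention: "XXX" for mention, _ in entities})


def consistent_anonymize(text, entities):
    """Replace each distinct entity with a numbered, category-specific tag."""
    counters = defaultdict(int)
    placeholders = {}
    # Number entities per category in order of first appearance in the text.
    for mention, category in sorted(entities, key=lambda e: text.find(e[0])):
        if mention not in placeholders:
            counters[category] += 1
            placeholders[mention] = f"[{category}_{counters[category]}]"
    return _substitute(text, placeholders)


print(naive_anonymize(TEXT, ENTITIES))
print(consistent_anonymize(TEXT, ENTITIES))
```

Because the same mention always maps to the same placeholder, the second output still shows that the person who was dismissed is the person who sued, while the first output does not. Unlike the examples above, this sketch leaves the possessive “'s” in place after a replaced name.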
A bottom-up tool for text anonymization
Aside from preserving context and retaining the usefulness of anonymized texts for follow-up analyses, the success of text anonymization efforts crucially depends on stakeholder needs. Previous anonymization efforts have either ignored the researchers’ needs (e.g. by fully redacting texts with XXX) or have neglected the anonymization requirements from data owners (so that the data may still not meet the criteria for sharing).
We therefore collaborate closely with stakeholders from police forces and governmental data protection officials to define precisely the requirements of our text anonymization procedure. Once these requirements are defined, our system will use state-of-the-art methods from natural language processing and information extraction to anonymize texts sufficiently that they can eventually be shared openly across research communities. Our aim is to maximize impact by ensuring that the needs of both data holders and researchers are built into the tool from the start.
Text Wash is currently under development thanks to a SAGE Concept Grant. The tool will be available as an open-source library for R for the research community and as a stand-alone offline version for data owners.