As the leader of a data science team at the Urban Institute, I get to work on interesting issues that intersect data science and social science every day. By data science, I mean technical tools, architectures, and processes that are borrowed from computer science and are atypical in the social sciences. This is a slightly more limited definition than most would have for the term data science, but because so much of what defines a data scientist at Urban also defines a researcher (cleaning data, analyzing it, visualizing results, etc.), my definition draws a finer line.
For example, in my definition of data science, I would include machine learning models like random forests, big data platforms like Apache Spark, and text analytics techniques like named entity recognition. But I would not include techniques like linear regression, programming platforms like Stata, or CART modeling (otherwise known as decision trees), which are much more familiar to social scientists.
Looking back on my work over the past two years, and looking forward over the next few, here are the three things I’m most excited about at the intersection of data science and social science:
1. Using “Big Data” systems to project thousands of what-if scenarios to better understand the impact of proposed policies in real time
2. Combining “Big Data” with traditional data sources to produce real-time, cost-effective early warning systems
3. Using new techniques to unlock novel sources of data trapped in various text formats that help facilitate new lines of research and understanding
I’ll dive into each of these briefly and link to additional reading below.
1. Big Data to project “what if?” scenarios
At the Urban Institute, we maintain a number of models that project what-if scenarios. These microsimulation models take policy changes as input and provide information on how that policy will affect people and the economy. For example, if Congress proposes a tax change, as they did at the end of 2017, our Tax Policy Center microsimulation model can calculate how those changes will affect different types of people and the government’s bottom line. And we don’t just do this for taxes; we have models for health insurance, Social Security, and other benefit programs as well.
I would argue that this approach to modeling, while innovative, does not make it into my definition of data science. While this approach is tremendously useful for decision makers and the public, it could be improved using modern data science techniques.
Currently, decision makers must first propose a law, or ask an agency to evaluate it, without a quantitative measure of how each change in the proposal affects different groups of people. So they iterate: submitting a request for a model run and getting the result, tweaking the request and getting another result, and so on, which can be quite time-consuming.
But with a massively parallelized setup enabled by cloud computing, we have the technology to enable decision makers to submit a request with the outcomes they’d like to see the policy achieve, and have the analysts provide them with a set of potential policy changes that could achieve those outcomes, according to a rigorous model.
Given recent advances in cloud computing — the cost-efficient availability of large amounts of compute resources on demand — we now have the potential to use big data systems to run millions of these simulation models at the same time. These systems could provide decision makers with all the potential outcomes of a wide range of policy changes with lightning speed, and report only those outcomes closest to those desired by the decision maker.
We’re working on this today. For each individual run of a microsimulation model, we will supply a slightly different what-if scenario. By running hundreds or even thousands of scenarios and analyzing their output, we can give lawmakers a “menu” of policy choices with their estimated effects on people and government budgets before they pass a law, so they can choose the most effective policy within their given political constraints.
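To make the idea of a policy "menu" concrete, here is a minimal Python sketch. It is not our microsimulation model: the `simulate` function, the household incomes, and the flat-tax-with-deduction policy are all hypothetical stand-ins, and the thread pool stands in for the cloud workers that would run real model runs in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical household incomes (toy data, not real microdata).
INCOMES = [15_000, 30_000, 45_000, 60_000, 90_000, 150_000, 400_000]


def simulate(policy):
    """Toy stand-in for one model run: a flat tax on income above a deduction."""
    rate, deduction = policy
    taxes = [rate * max(0, income - deduction) for income in INCOMES]
    return {
        "rate": rate,
        "deduction": deduction,
        "revenue": sum(taxes),
        "lowest_income_tax": taxes[0],
    }


def build_menu(rates, deductions, revenue_target, top_n=3):
    """Run every rate/deduction combination and keep the runs closest to the target."""
    scenarios = list(product(rates, deductions))
    # Each scenario is independent, so the runs can be farmed out to workers.
    # Threads are used here only for illustration; a real setup would spread
    # the runs across many machines.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(simulate, scenarios))
    results.sort(key=lambda run: abs(run["revenue"] - revenue_target))
    return results[:top_n]


menu = build_menu(
    rates=[0.10, 0.15, 0.20, 0.25],
    deductions=[0, 10_000, 20_000],
    revenue_target=100_000,
)
for option in menu:
    print(option)
```

The key design point is that the decision maker supplies the desired outcome (here, a revenue target) and the system searches the scenario grid, rather than the decision maker requesting one run at a time.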
2. Big data and traditional data produce early warning systems
To produce an accurate estimate of a population, as when conducting a census, researchers must field surveys that are expensive in both time and money. But the growing use of big data sources, such as data on mobile phone usage (location, length of calls, etc.), combined with much smaller-scale survey data will allow us to conduct the same census with similar accuracy, but faster and at a lower cost. Using the relationships observed between the big data and the small survey, and after conducting validation tests for bias, data scientists can extrapolate the likely responses of the much larger population covered by the big data source.
As Princeton Professor Matt Salganik writes in his new book, Bit by Bit, researchers in Rwanda did just that. The authors found that by pairing geolocation information and frequency and duration of calls from mobile phone data with a small sample of survey data, they were able to produce results comparable in accuracy to the Demographic and Health Survey, but at a fraction of the cost and in a fraction of the time needed by the traditional survey.
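The pairing idea can be sketched in a few lines of Python. This is an illustration only, not the method used in the Rwanda study: the subscriber IDs, call minutes, wealth index, and the single-predictor linear model are all assumptions chosen to keep the example small, and the survey values are constructed to be exactly linear so the result is easy to check.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


# Hypothetical "big" source: monthly call minutes for every subscriber.
call_minutes = {
    "a": 120, "b": 300, "c": 80, "d": 450,
    "e": 200, "f": 60, "g": 380, "h": 150,
}
# Hypothetical small survey: a wealth index measured for only four subscribers.
survey_wealth = {"a": 2.2, "b": 4.0, "d": 5.5, "f": 1.6}

# Learn the relationship where both sources overlap.
surveyed_ids = sorted(survey_wealth)
slope, intercept = fit_line(
    [call_minutes[person] for person in surveyed_ids],
    [survey_wealth[person] for person in surveyed_ids],
)

# Keep observed survey values; predict the index for everyone else.
estimates = {
    person: survey_wealth.get(person, slope * minutes + intercept)
    for person, minutes in call_minutes.items()
}
print(estimates)
```

In practice, part of the survey would be held out to test the predictions for bias before extrapolating to the full population.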
This type of work has its drawbacks (mobile phone data needs to be carefully calibrated against the survey to be representative, for example), but the potential is immense. This pairing methodology could be applied to measuring crucial national issues like opioid usage, gentrification, segregation, or food insecurity in real time, providing policymakers and the public with an effective early warning system and, hopefully, allowing for a more rapid response.
3. Webscraping and text extraction to transform research
Programming languages such as R and Python now have a number of tools capable of extracting text from PDFs, scraping data from websites, and using advanced text analytics to process the resulting text into meaningful components for analysis. Combined, these tools can unlock new datasets from websites and text, and lead to new lines of research inquiry.
For example, data on trials and convictions locked in a court website can be scraped to help answer a host of new research questions across disciplines from criminal justice to housing. At Urban, we used webscraping techniques to collect court records from an online court case search tool to better understand how access to jobs is affected by criminal background checks. I used a similar process to study whether evictions are on the rise in DC.
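As a rough illustration of the scraping step, the Python sketch below turns an HTML results table into structured records and counts eviction filings by month. The HTML snippet, case numbers, and column names are invented for the example; a real court website would have its own layout, and the page would first need to be downloaded in line with the site's terms of use.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical search-results page (invented records, not real cases).
SAMPLE_HTML = """
<table>
  <tr><th>Case</th><th>Filed</th><th>Type</th></tr>
  <tr><td>2017-LTB-001</td><td>2017-01-05</td><td>Eviction</td></tr>
  <tr><td>2017-CIV-014</td><td>2017-01-09</td><td>Contract</td></tr>
  <tr><td>2017-LTB-002</td><td>2017-02-11</td><td>Eviction</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Ignore whitespace between tags; keep only text inside cells.
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())


parser = TableParser()
parser.feed(SAMPLE_HTML)

# The first row is the header; the rest become one dict per case.
header, *records = parser.rows
cases = [dict(zip(header, record)) for record in records]

evictions_by_month = Counter(
    case["Filed"][:7] for case in cases if case["Type"] == "Eviction"
)
print(evictions_by_month)
```

Once the records are in this form, questions such as whether eviction filings are rising become a matter of ordinary tabulation.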
Researchers — or, more frequently, their research assistants — spend an incredible amount of time searching for and summarizing research. Tools that enable website scraping, text extraction, and advanced text analytics are setting the stage for automating this task.
Some fields, like health research, are leading the way, and these tools are already being used to automate some of the literature review process. One recent example is the Chan Zuckerberg Initiative’s acquisition of the company Meta, an “AI” tool built to help researchers automate the literature search process and therefore enable them to learn from each other efficiently (Meta is still in alpha release). Over the coming years, systematic literature reviews in the social sciences will benefit from the work being done in health care today to create reliable, automated processes for summarizing scientific progress in real time.
I’m excited about many other innovations, which I’ll explore in future posts. For example, using software applications in the field can provide researchers with real-time experimental data, and using big data and text analytics can facilitate more accurate data linking. But, as I hope this discussion makes clear, many exciting opportunities exist right now for data science and social science to push the frontiers of knowledge and help us make better public policy.