
No More Tradeoffs: The Era of Big Data Analysis Has Come

July 16, 2019

For centuries, being a scientist has meant learning to live with limited data. People only share so much on a survey form. Experiments don’t account for all the conditions of real-world situations. Field research and interviews can only be generalized so far. Network analyses don’t tell us everything we want to know about the ties among people. And text, content, and document analysis methods either let us dive deep into a small set of documents or give us a shallow understanding of a larger archive. Never both. So far, the truly great scientists have had to apply many of these approaches to help us better see the world through a kaleidoscope of imperfect lenses.

This post was originally published on our sister site, SAGE Ocean, under the title “No more tradeoffs: The era of big data content analysis has come,” by Nick Adams.

But at least in the area of text analysis (AKA content analysis, or natural language processing), the old limits are crumbling, opening up a frontier of big social data fertile enough to support hundreds of research projects and the next generation of scientists.

The shift, like so many others, is coming thanks to advances in data science technology. But it does not represent a total break with the past. For decades, social scientists have used expert document-labeling tools, often called CAQDAS (Computer-Assisted Qualitative Data Analysis Software), like ATLAS.ti, NVivo, and MAXQDA, that allow a highly trained scholar to apply a custom set of labels to a few hundred documents. Initially built for labeling (AKA hand-coding) interview transcripts and meeting minutes, these tools are great for smaller projects. But training up a team to use them on a larger set of documents has always been a costly process of transferring expertise to research assistants and closely supervising their work to ensure its consistency. Lately, tools like Dedoose, DiscoverText, LightTag, and TagTog have refactored this expert labeling approach into cloud-based software that is cheaper to deliver, update, and maintain, but their users still face the same bottleneck of expertise transfer. Even when applied to the simplest labeling tasks, CAQDAS tools require face-to-face training and oversight of a research team. In the university context, this management burden is grossly compounded by the high turnover of the transient student researcher workforce.

Crowds teaching machines

More recently, machine learning researchers have given social scientists hope. They have developed a number of new technologies able to harness the efforts of online workers to label data like images and video. The online workers go to a crowd work marketplace such as Amazon’s Mechanical Turk, where they can find brief ‘Human Intelligence Tasks’ to perform through their web browsers for a couple of dollars per task. These crowd workers tend to be above average in education, and many do the work to earn supplemental household income. Already, they have completed billions of tasks applying highly accurate labels to images.
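To make that workflow concrete, here is a minimal sketch, using Amazon’s boto3 library, of how a researcher might post one such Human Intelligence Task programmatically. The task text, reward, and form below are illustrative assumptions, not details drawn from this article or from TagWorks.

```python
import boto3

# A hypothetical sketch of posting one Human Intelligence Task (HIT) to
# Amazon Mechanical Turk with the boto3 library. The task text, reward,
# and form are illustrative assumptions, not details from the article.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    # Sandbox endpoint: test the workflow without paying real workers.
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# A production form would also echo back the assignmentId field; this
# stripped-down version just shows the shape of a labeling question.
html_question = """
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <form action="https://www.mturk.com/mturk/externalSubmit" method="post">
      <p>Does this sentence describe a protest event?</p>
      <p>"Hundreds gathered outside city hall on Tuesday."</p>
      <label><input type="radio" name="label" value="yes"> Yes</label>
      <label><input type="radio" name="label" value="no"> No</label>
      <input type="submit" value="Submit">
    </form>
  ]]></HTMLContent>
  <FrameHeight>400</FrameHeight>
</HTMLQuestion>
"""

response = mturk.create_hit(
    Title="Label one sentence from a news article",
    Description="Read one short sentence and answer a yes/no question.",
    Reward="0.05",                    # USD, passed as a string
    MaxAssignments=3,                 # three independent workers per item
    LifetimeInSeconds=86400,          # the task stays available for a day
    AssignmentDurationInSeconds=300,  # each worker gets five minutes
    Question=html_question,
)
print("Created HIT:", response["HIT"]["HITId"])
```

Requesting several independent judgments per item (MaxAssignments above) is what makes quality control possible: agreement signals reliable labels, and disagreement flags items that need review.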

But the research community has been waiting a long time for a way to enlist the crowd in the much more difficult analysis of textual data like news articles, court proceedings, historical archives, and more. Computers, originally built to apply logical and mathematical operations to numeric, ordinal, and categorical data, still can’t faithfully understand human language. Computer scientists’ text mining (AKA natural language processing) approaches have been especially inadequate when it comes to applying, measuring, and testing the rich social theory scholars have built up over centuries. The clamor for a high-scale, high-fidelity content labeling tool has only intensified since Benoit, Conway, and Laver published their landmark paper showing that crowds of independently working internet laborers could produce results as good as, or better than, those of experts.
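The key mechanism behind that result is aggregation: each text unit is labeled by several independent workers, and their answers are combined. The sketch below, using invented data rather than the paper’s replication code, shows the simplest version, a majority vote over categorical labels.

```python
from collections import Counter

# A minimal sketch of the aggregation idea behind crowd labeling: several
# independent workers label each text unit, and their answers are combined.
# The data below are invented; the published study aggregated scale
# placements by averaging, but for categorical labels a majority vote
# illustrates the same principle.
crowd_labels = {
    "sent-001": ["protest", "protest", "no-protest"],
    "sent-002": ["no-protest", "no-protest", "no-protest"],
    "sent-003": ["protest", "no-protest", "protest"],
}

def majority_vote(labels):
    """Return the most common label and the share of workers who chose it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

for sent_id, labels in crowd_labels.items():
    label, agreement = majority_vote(labels)
    print(f"{sent_id}: {label} (agreement {agreement:.0%})")
```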

Finally, the wait is over. My new company, Thusly Inc., is now providing the first and only full-service, end-to-end online textual data labeling factory, one that can be custom-rigged for researchers’ use in a matter of hours and lets them go as deep and as broad as they wish with their text labeling projects. No more compromises.

New, different, bigger, better

Notably, this new tool, called TagWorks, has not come from a billion-dollar tech company like Amazon or Figure Eight. I imagined, designed, and developed it as a social scientist and data scientist intimately familiar with the headaches of managing annotation projects using the older, expert-dependent software. My vision of vast, powerful, richly labeled archives combined with my burning frustration as I tried to use other tools at high scale, and I became obsessed with the goal of providing researchers with customizable annotation tools producing world-class, research-grade data in months, not years. So, while industry software developers are encouraged to “move fast and break things,” the TagWorks team and I have carefully ensured that researchers’ data will be accurate, reviewable, and reproducible, and that it will articulate with even their most nuanced and elaborate theories about the social world.

TagWorks is one of a kind. Big tech companies like Figure Eight sometimes offer to build custom annotation tools for their customers over the course of four to six months, at great expense. (The folks at LightTag can help you see the prohibitive cost of building even simple tools.) But only TagWorks is optimized for efficient, assembly-line data flows and automated crowd task management. And only TagWorks provides interfaces specifically designed to guide non-expert crowd workers in the production of highly accurate data without face-to-face training.

[Screenshot: the TagWorks homepage]

Available now

For researchers who want high-quality data labels at high scale, the era of tradeoffs is over. You can label as many documents as you have with as many tags as you want. You can go deep. And you can go big. You can finally know and index everything important about your giant archive of documents. In a matter of months, you could have the world’s most enriched archive in your discipline, or create the best labeled training data set, and with it the best text classification algorithm, in your domain. To receive TagWorks’ free hybrid text analysis tips or schedule a free consultation to learn more about how you can advance the frontier in your field, email us at office@thusly.co or go to https://tag.works/get-started to get started.
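As a concrete illustration of that last point, here is a minimal sketch, assuming scikit-learn and a handful of invented example sentences, of the standard downstream step: a labeled archive becomes training data for a text classifier. This is not TagWorks code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A minimal sketch, assuming scikit-learn and invented example sentences,
# of how crowd-produced labels become training data for a text classifier.
documents = [
    "Hundreds marched downtown demanding higher wages.",
    "The city council approved the annual budget on Monday.",
    "Protesters blocked the highway for several hours.",
    "Quarterly earnings beat analyst expectations.",
]
labels = ["protest", "no-protest", "protest", "no-protest"]

# TF-IDF features plus logistic regression: a standard, strong baseline.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(documents, labels)

print(classifier.predict(["Demonstrators gathered outside the capitol."]))
```

With a real crowd-labeled archive in place of the four toy sentences, the same pipeline scales to thousands of labels and serves as a baseline against which fancier models can be judged.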



Nick Adams is an expert in social science methods and natural language processing, and the CEO of Thusly Inc., which provides TagWorks as a service. He holds a doctorate in sociology from the University of California, Berkeley, and is the founder and Chief Scientist of the Goodly Labs, a tech-for-social-good nonprofit based in Oakland, CA.
