Business and Management INK

“Everything Not Saved Will Be Lost.” –Nintendo “Quit Screen” Message

April 8, 2025

In this post, authors Richard F.J. Haans and Marc J. Mertens reflect on the inspiration behind their research article, “The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data,” published in Organizational Research Methods.

Websites are more than just digital storefronts. They are vital for engaging with customers, attracting talent, appealing to investors, and communicating with other stakeholders. The richness of websites also makes them a potential treasure trove for researchers interested in understanding organizations and their leadership. Additionally, websites are dynamic. They constantly evolve in design, content, and functionality—reflecting the changing priorities, strategies, and market conditions of the organizations they represent.

However, this dynamism also poses significant challenges for researchers: Websites are regularly updated, and older content gets removed or replaced as a result—preventing researchers from leveraging these valuable data unless they happen to have saved them in time.

This is where the Wayback Machine, a digital archive of the World Wide Web operated by the Internet Archive, comes into play. By capturing and preserving billions of webpages as they appeared at various points in time, the Wayback Machine allows researchers to retrieve historical snapshots of websites. This offers a solution to the problem of data loss over time.
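As a rough illustration of what retrieving a historical snapshot can look like in practice, the sketch below builds queries for the Internet Archive's public availability endpoint, which returns the archived capture closest to a requested date. This is not the codebase described in the article: the function names, the example domain, and the choice of one query per year anchored at 1 January are assumptions made purely for illustration, and the sketch only constructs and parses requests rather than sending them.

```python
from typing import List, Optional
from urllib.parse import urlencode

# Public Wayback Machine availability endpoint (returns JSON).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(site: str, timestamp: str) -> str:
    """Build a query for the capture closest to `timestamp` (format YYYYMMDD)."""
    params = urlencode({"url": site, "timestamp": timestamp})
    return f"{AVAILABILITY_ENDPOINT}?{params}"


def closest_snapshot(response: dict) -> Optional[str]:
    """Return the URL of the closest archived capture, or None if none exists.

    `response` is the parsed JSON body of an availability query.
    """
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def yearly_queries(site: str, first_year: int, last_year: int) -> List[str]:
    """One query per year, anchored at 1 January (an illustrative choice)."""
    return [
        availability_query(site, f"{year}0101")
        for year in range(first_year, last_year + 1)
    ]


# Hand-written sample in the shape the endpoint returns (illustrative values only).
sample_response = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20060101000000/http://example.com/",
            "timestamp": "20060101000000",
            "status": "200",
        }
    }
}

print(availability_query("example.com", "20060101"))
print(closest_snapshot(sample_response))
```

To run this against the live archive, each query URL could be fetched with `urllib.request.urlopen` and decoded with `json.load` before being passed to `closest_snapshot`. High-volume collection would also need rate limiting and retries, which this sketch leaves out.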

Yet, despite its promise, the Wayback Machine remains underutilized in organization and management research. This motivated us to systematically assess the quality and coverage of the Wayback Machine and develop innovative methods and tools to access and analyze its data. In our paper published in Organizational Research Methods, we introduce an open-source codebase that facilitates high-volume access to historical website data of organizations using the Wayback Machine. Specifically, we lay out a comprehensive, four-step tutorial—complete with code and best-practice examples—enabling researchers to systematically collect and utilize longitudinal website data.

We also created the CompuCrawl database, a freely accessible dataset featuring historical websites of over 11,000 North American firms listed in Compustat, spanning from 1996 to 2020. The CompuCrawl database contains more than 1.6 million webpages, serving as a powerful resource for future organizational research.

In conclusion, while websites are a rich and largely untapped data source for organizational research, the challenges of accessing and preserving historical content have limited their use. Archives like the Wayback Machine, in combination with the methodologies presented in our paper, open up new avenues for longitudinal research that can deepen our understanding of organizational dynamics over time. To dive deeper into our approach, explore the codebase, and access the CompuCrawl database, we invite you to read our article in Organizational Research Methods and visit our project website at https://haans-mertens.github.io/.

Richard F.J. Haans (PhD) is the Director of Full-time Doctoral Education and Associate Professor at Rotterdam School of Management, Erasmus University. He received his PhD from Tilburg University and has research interests in competitive dynamics and methodological advances. Marc J. Mertens is a research associate and PhD candidate at the University of Mannheim, Germany. He has research interests in stakeholder strategy, financial activism, impression management theory, and optimal distinctiveness theory. He also recently received the Best Paper Award at the International Conference of the Global Research Foundation for Corporate Governance.

