
Sustainable Discovery and Google Scholar’s Comprehensive Coverage

Discovery is a little easier when you know where to start looking. (Image: Detail from Walter Withers’s “Panning for Gold,” 1893)

This article by Max Kemman originally appeared on the LSE Impact of Social Sciences blog as “Standing on the shoulders of the Google giant: Sustainable discovery and Google Scholar’s comprehensive coverage” and is reposted under the Creative Commons license (CC BY 3.0).
Impact on the scholarly workflow
This shift from the library to Google surely must impact the results of the scholarly workflow. HighWire recently summarized this in the following points:
- Search is the new browse
- Full text indexing of current articles plus significant backfiles joined with relevance ranking to change how we looked and what we did.
- “Articles stand on their own merit”
- “Bring all researchers to the frontier”
- “So much more you can actually read”
One study found that all this resulted in doctoral students citing more literature since 2004, calling this ‘the Google effect.’ The Google Scholar team themselves found that with the availability of academic search engines, the impact of non-elite journals has grown, as well as the impact of older articles. Both results can partially be explained by the way Google Scholar presents literature: articles are represented by their titles and search snippets, putting less emphasis on the journals in which they are published. Moreover, Google Scholar ranks articles with more citations higher, meaning older articles, which have had more time to accumulate citations, have an advantage over newer ones. One study concluded that by doing so Scholar introduces a Matthew effect in the impact of older articles. The same study found another notable ranking choice: Scholar puts a high weight on words occurring in the article’s title.
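To make the two ranking observations concrete, here is a minimal toy sketch in Python. It is not Google Scholar's actual algorithm, which is not public; the `Article` class, the `toy_score` function, and all three weights are hypothetical and chosen only for illustration. It simply assumes that title matches count more than abstract matches and that citation counts add to the score.

```python
import math
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    abstract: str
    citations: int
    year: int


# Hypothetical weights, chosen only for illustration.
TITLE_WEIGHT = 3.0     # a query word in the title counts three times as much
BODY_WEIGHT = 1.0      # as the same word in the abstract
CITATION_WEIGHT = 1.0  # multiplier for the (log-scaled) citation count


def toy_score(article: Article, query: str) -> float:
    """Score an article for a query; higher means ranked earlier."""
    terms = query.lower().split()
    title_words = article.title.lower().split()
    body_words = article.abstract.lower().split()
    title_hits = sum(title_words.count(t) for t in terms)
    body_hits = sum(body_words.count(t) for t in terms)
    text_score = TITLE_WEIGHT * title_hits + BODY_WEIGHT * body_hits
    if text_score == 0:
        return 0.0  # articles that do not match the query at all score zero
    # log1p dampens very large citation counts but keeps them an advantage.
    return text_score + CITATION_WEIGHT * math.log1p(article.citations)


def rank(articles, query):
    """Return the articles sorted from highest to lowest toy score."""
    return sorted(articles, key=lambda a: toy_score(a, query), reverse=True)


old = Article("Open access discovery", "We study search", 500, 2005)
new = Article("Discovery systems today", "A recent survey", 5, 2015)
body_only = Article("Search engines", "A note on discovery", 5, 2015)

ranked = rank([body_only, new, old], "discovery")
```

Under these assumed weights, the heavily cited 2005 article outranks the 2015 article with an equally good title match, and both outrank the article that only mentions the query word in its abstract. Any ranking that rewards accumulated citations will show the same tilt toward older work, whatever the exact weights.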
Other changes in the scholar’s workflow summarized by HighWire include a growth in the number of articles clicked on, especially to read abstracts, as well as a growth in diversity of areas clicked on. One interesting consequence of these two changes is that scholars might want to write more accessible abstracts for the wider audience that finds their article through keyword searches and who might be interested despite not being an expert in the author’s field. In short, not only does Google Scholar have a known effect on discovery and citation of articles, it could have an unknown effect on the writing by authors as articles are increasingly ranked and evaluated on their titles and abstracts first.
Sustainable search
As is usually the case when scholars depend on an external entity for a very important task, the sustainability of Google Scholar has long been a worry. When Google Scholar lagged behind the new Google logo design, scholars expressed concern on Twitter. The co-founder of Google Scholar, Anurag Acharya, has made the case, however, that Scholar is really not in danger. For Google, Scholar is a relatively easy search problem with a small user base, so maintaining the academic search engine comes at a small cost. Moreover, since many Googlers are ex-academics, Scholar gets a lot of sympathy from within the company.
Maybe the concern over Google Scholar’s sustainability is thus not needed. Still, we might ask whether it is entirely desirable that Google plays such an important role in the scholarly workflow, and as such in science in general. A question remains over why it is so difficult to replace Google Scholar with an alternative. Numerous features can be used to compare Google Scholar with its competitors, but the one feature with which Scholar stands out is comprehensiveness. Although it is unclear how much is in Scholar, it is clear that it has the largest coverage of all the available discovery systems. There are three reasons for this:
- First, Google Scholar is essentially built on top of Google, meaning it is not limited to specific databases, but can work with a heuristic to decide whether or not to include something in Scholar: if it cites other academic work, and if other academic work cites it, it probably is academic work.
- Second, because of this, in contrast to search engines offered by publishers, Scholar indexes works that are available on the web in open access (or possibly illegal) form. This means Scholar usually is able to offer the author’s version of a paper.
- Finally, because Google is such a giant, it has managed to get publishers to agree to be crawled, so it can also index the full-text of the publisher’s version. This final point is what makes it difficult for scholars themselves to come up with a powerful alternative. For example, the recently launched Semantic Scholar looks interesting, but is limited to publicly available online articles. This means that such an undertaking misses a vast amount of literature, and is thus from the start already less attractive, whether its functionality is better than Google’s or not.
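The inclusion heuristic from the first point above can be illustrated with a small Python sketch. This is only a toy reading of the rule as described here ("if it cites other academic work, and if other academic work cites it, it probably is academic work"), not Google's implementation; the function name `looks_academic`, the `cites` mapping, and the document identifiers are all hypothetical.

```python
def looks_academic(doc_id, cites, known_academic):
    """Toy heuristic: a document is probably academic if it cites known
    academic work and is itself cited by known academic work.

    cites maps a document id to the set of document ids it references.
    known_academic is a set of ids already accepted as academic.
    """
    outgoing = cites.get(doc_id, set())
    cites_academic = any(ref in known_academic for ref in outgoing)
    cited_by_academic = any(
        doc_id in cites.get(other, set()) for other in known_academic
    )
    return cites_academic and cited_by_academic


# Hypothetical citation graph found by a web crawl.
cites = {
    "paper_a": {"paper_b"},
    "paper_b": {"paper_a", "preprint_x"},
    "preprint_x": {"paper_a"},  # an author's version on a personal site
    "blog_post": {"paper_a"},   # cites academic work, but nothing cites it
}
known_academic = {"paper_a", "paper_b"}

preprint_included = looks_academic("preprint_x", cites, known_academic)
blog_included = looks_academic("blog_post", cites, known_academic)
```

In this sketch the preprint is included because it sits inside the citation network in both directions, while the blog post is left out. The point is that such a rule needs no list of approved databases, which is why a web-scale crawler can pick up author versions wherever they are hosted.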
If we as scholars are genuinely concerned with Google Scholar’s sustainability, as well as with Google’s dominance in the scholarly workflow, it seems to me the only solution is to push for open access availability not only of new articles but also of old ones. Only once a comprehensive open access database of academic literature has been developed can we really open up the space for competition with Google Scholar. Although Academia.edu allows (some) search engines to crawl its website after a request for permission, there is debate over whether this would be the best approach for open access. Whether open access will lead to better offerings of course remains to be seen, as comprehensiveness is not the only aspect of interest. But I would be very interested to see what we can come up with as discovery systems once the data is available. Maybe this is an unreachable dream, but if there is one thing we can learn from Google Scholar, it is that the publishers’ monopoly over access to academic literature can be disrupted.