The last two decades have seen dramatic changes in scholarly communication. Journals have moved online, many of the world’s documents have been digitized, and comprehensive, relevance-ranked search has become available to almost everyone. Researchers almost everywhere can now find, and often access, nearly all of the world’s scholarly outputs. This has profound implications for the research process, for academic careers (and wellbeing), and for attempts to measure and incentivise research (including its ‘impact’ on other research, and on policy or practice).
In the next three articles, I’d like to summarise what I’ve discovered about one of the key services making this happen, Google Scholar. Many of the findings apply to other platforms, too, and indeed the wider citation and research ‘impact’ system. But to understand some of the politics and power of the current system, as well as the unanswered questions, we need to deep-dive into the details. In addition to testing the Scholar service, I reviewed some of the relevant literature and interviewed in-the-know experts. As I’m a policy person, not an academic, this is written from that perspective.
The past and present of Google Scholar
Scholar was founded in 2004 by Anurag Acharya and Alex Verstak with a mission to “make the world’s problem solvers [researchers] 10 percent more efficient.” Its current slogan is “Stand on the shoulders of giants” (a quote attributed to Bernard of Chartres). The problem that Acharya first started trying to fix while at university in India was of global access to research. The service has always been focused on serving actual researching academics – or scholars. By contrast, most undergraduates use libraries or Google Web Search for research, rather than Scholar.
Metric-tide surfer professor James Wilsdon, of the University of Sheffield, told me that Scholar is “part of the air researchers breathe,” supporting them but also potentially marginalizing them if their work’s not there. It’s a key access point for many researchers, much as PubMed is for clinicians. Like it or not, most researchers care about citations. Interviews of 12 Stanford authors for HighWire publishers found that although any big number “matters” to them as authors (downloads, Tweets etc.), only citations really matter to them as readers (as they act as a quality filter – along with author names, institutions, and journal titles). This is where Scholar comes into its own, but also where issues are magnified.
Google Scholar features and benefits
Scholar is a free search engine for a bibliographic database. Although its size isn’t public, it’s estimated to be the world’s biggest academic search engine, with over 390 million documents (80-90 percent coverage of all English-language articles, and 50 percent of full-text scientific documents). That’s a wider scope than competitors such as Web of Science or Scopus. And because it’s less curated, Scholar is also more inclusive: for example, it includes documents without a Digital Object Identifier (DOI), documents that appeared in a non-indexed publication, and documents that didn’t appear in any journal at all.
Scholar is separate from the catch-all Google Web Search, and stopped being a clearly visible option there in 2011 (to get to it now, you need to select Apps>More>Even more), though apparently usage continues to grow. Scholar’s also distinct from Google Books, which often doesn’t include the metadata required to – for example – search for specific articles within specific journal issues.
The current version of Google Scholar searches most peer-reviewed online academic journals (though often abstract/citation details only, due to paywalls), and many books (though not all), conference papers and articles, theses and dissertations, preprints, abstracts, technical reports, other scholarly literature (including webpages), patents, as well as (US federal or state) case law and court opinions. Its Googley web-crawlers gather full text and/or metadata about digital or physical documents, online or in libraries. That information not only feeds the algorithms of the more focused Scholar service, but also those of the general Google Web Search.
That so much of the world’s academic literature is visible, and in many cases fully accessible, is due to the countless deals Acharya and colleagues have sealed with publishers and others over the past 15 years, mostly behind closed doors. Those still wishing to maintain paywalls around their actual content are nearly always sold on the idea of making the existence of that content more visible to potential customers.
Scholar users have a relatively narrow range of intentions, and so the service is focused on discovering scholarly documents along with their authors, topic(s), and sources. The (relatively) systematic use of citations in academic literature allows for highly focused ranking algorithms, showing how authors and documents are linked. Citation analysis of this kind was the original inspiration for Google Search’s competitively advantageous PageRank algorithm.
Scholar is quick and easy to use, with a lot of hard maths behind the scenes, just as you’d expect of a Google service. In addition to advanced searches, auto-suggestions as you type, and different language options, you can also log in via Scholar to up to five library services – though often this requires being on campus within an institutional network, or using a proxy server.
Other current Scholar features include allowing users to access multiple “Versions” of the same document, some of which may be paywalled and some open access – say on an author’s personal website. You can also see “Related articles” and the raw number of times a document has been cited.
The “My library” feature allows users to “Save” personal collections of articles they can then search and organise with tags. You can also import/export citations for software such as EndNote, BibTeX, RefMan, or RefWorks.
Users can set up and edit author profiles for themselves, including their institutional affiliation, five areas of interest, and their outputs and citations. Only profiles with verified academic email addresses appear in search results. By all accounts three-quarters of search results show links to authors’ public profiles. You can also “Follow” an author’s profile for email alerts of new articles, citations or related articles.
All of a user’s manual actions help the automated algorithms to operate, better linking authors to papers, topics and each other. There are also rumours that the profile feature was introduced so that people can navigate to the right content themselves, reducing the traffic (and processing demands) on the search function.
Scholar’s algorithms automatically calculate and display the following three citation metrics for an author over time:
- Citations, the total raw count of how many times an author’s research is cited by other researchers. Note that this count has been shown to be vulnerable to gaming and to being skewed by outliers, e.g. one-hit wonders.
- h-index, the author-level metric named after physicist Jorge Hirsch, who proposed it in 2005 in an attempt to better show an author’s overall contribution to a field (productivity and citation ‘impact’ beyond a one-hit wonder). It is the largest number h such that the author has h papers each cited at least h times by other papers. It can also be applied to groups of researchers or publications (see below).
- i10-index, a 2011 Google Scholar invention which counts the number of an author’s publications that have been cited at least ten times.
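To make the three definitions concrete, here is a minimal Python sketch that computes them from a list of per-paper citation counts. The function names and the sample numbers are hypothetical, chosen for illustration; this is not Scholar’s own code, which isn’t public.

```python
def total_citations(citations):
    """Raw count: the sum of citations across all of an author's papers."""
    return sum(citations)


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have >= rank citations
        else:
            break
    return h


def i10_index(citations):
    """Number of papers with at least ten citations."""
    return sum(1 for count in citations if count >= 10)


# Hypothetical authors: a steady publisher and a one-hit wonder.
steady = [120, 33, 12, 10, 9, 4, 0]
one_hit = [500, 1, 0]

print(total_citations(steady), h_index(steady), i10_index(steady))
print(total_citations(one_hit), h_index(one_hit), i10_index(one_hit))
```

With these made-up numbers the one-hit wonder has far more total citations (501 against 188) but a much lower h-index (1 against 5), which is the skew the h-index was designed to dampen.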
For five years now a “Metrics” feature has shown “top” ranked publications within a field, using two metrics:
- h5-index, the h-index for all of a publication’s articles from the last five years
- h5-median, the median number of citations for the articles that make up a publication’s h5-index
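The two publication-level metrics can be sketched in Python as follows. This is an illustration under stated assumptions: the `(year, citations)` input format, the function name, and the sample journal are hypothetical, and the exact way Scholar defines its five-year window is not reproduced here.

```python
from statistics import median


def h5_metrics(articles, first_year, last_year):
    """Return (h5_index, h5_median) for one publication.

    `articles` is a list of (publication_year, citation_count) pairs;
    only articles published between first_year and last_year (inclusive)
    are considered. This input shape is an assumption for the example.
    """
    recent = sorted(
        (count for year, count in articles if first_year <= year <= last_year),
        reverse=True,
    )
    # h5-index: largest h such that h recent articles have >= h citations.
    h5 = 0
    for rank, count in enumerate(recent, start=1):
        if count >= rank:
            h5 = rank
        else:
            break
    # h5-median: median citations of the h5 articles counted above.
    h5_median = median(recent[:h5]) if h5 else 0
    return h5, h5_median


# Hypothetical journal; the 2012 article falls outside the window.
journal = [(2012, 90), (2013, 40), (2014, 25), (2015, 18),
           (2016, 7), (2017, 5), (2017, 3)]
print(h5_metrics(journal, 2013, 2017))
```

For this made-up journal, five in-window articles have at least five citations each, so the h5-index is 5, and the median of those five counts (40, 25, 18, 7, 5) is 18.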
There is an overall ranking of publications (topped by Nature), as well as eight categories each comprising numerous subcategories e.g. the one for social science has fifty-two subcategories. Some subcategories – such as history – appear twice, in social sciences and humanities. Scholar also displays a list of “Classic papers” in each field and subcategory, showing the top-10 cited papers that have “stood the test of time”.
Scholar’s two metrics are very close to the oft-critiqued and abused journal impact factor (JIF), which is the average number of citations received in a given year by the articles a journal published in the previous two years. This is often used as a proxy for a journal’s ‘impact’ or importance in its field, but is also vulnerable to outliers, and is often used to compare journals in different disciplines when it really shouldn’t be. This publication-level metric is also often used to evaluate the impact of individual authors, which again it shouldn’t be.
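A simplified Python sketch shows why an average of this kind is so sensitive to outliers. It assumes every article from the previous two years counts as a citable item (the real JIF is fussier about what goes in the denominator), and the citation counts are invented for illustration.

```python
from statistics import median


def impact_factor_style_average(citations_this_year):
    """Mean citations received this year per article published in the
    previous two years (a simplification of the JIF calculation)."""
    if not citations_this_year:
        return 0.0
    return sum(citations_this_year) / len(citations_this_year)


# Hypothetical journal: six modestly cited articles and one blockbuster.
typical = [0, 1, 1, 2, 2, 3]
with_outlier = typical + [200]

print(impact_factor_style_average(typical))       # 1.5
print(impact_factor_style_average(with_outlier))  # roughly 29.9
print(median(with_outlier))                       # 2
```

One heavily cited article lifts the average from 1.5 to almost 30, while the median article in the same journal is still cited only twice. That gap is one reason a journal-level average says little about any individual paper or author.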
Note that citations (including the h-index and JIF) are not used (officially) by the panels in the UK’s six-yearly c.£2bn p.a. Research Excellence Framework (REF). But as Wilsdon points out, REF subject panels can ask about peer review, which can be informed by citations where relevant.
After this overview of what Google Scholar is, the next post will take a look at some of the criticisms of the research citation system and Scholar’s part in it.