The literature on severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and coronavirus disease 2019 (COVID-19) is vast and ever-expanding. As of February 2021, databases tracking academic literature estimate that there have been over 95,000 articles published on COVID-19 since October 2019, and at least one database estimates that there may be up to 150,000 scholarly publications on COVID-19.¹,² The National Center for Biotechnology Information (NCBI) also estimates that, on average, roughly 2,000 articles on COVID-19 have been published each week since April 2020.¹ Given the size and rapidly evolving nature of the COVID-19 literature, making sense of it is both vital and potentially difficult. We can use bibliometric analysis to begin this sense-making process, answering questions like “Where is the COVID-19 literature coming from?” and “What is the COVID-19 literature about?”, as well as finding potential pitfalls or problems within the literature base that a reader may need to look out for.

Where is it coming from?

We begin by analyzing data from the 97,129 publications related to COVID-19 in PubMed. The world map below shows the number of publications per country, with darker countries having more publications. The top five countries by number of publications are the United States (91,810), China (88,259), Italy (59,022), France (25,431), and India (22,742).

Map of world's scientific production, heat map by country

These numbers make sense, as they include some of the countries most affected by COVID-19 and countries with the largest scientific outputs across topics. However, the numbers for the top five countries add up to far more than the number of articles included in our analysis. That is because international collaborations make up a large percentage of COVID-19 papers.
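To illustrate how per-country totals can exceed the number of articles, here is a minimal Python sketch. It assumes each article is counted once for every country among its author affiliations; the exact counting rule behind the map is not stated in this article. The three records are invented for illustration, and `count_by_country` is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Hypothetical records, invented for illustration (not the PubMed data set).
# Each article lists the distinct countries of its authors' affiliations.
articles = [
    {"id": "A1", "countries": {"United States", "China"}},
    {"id": "A2", "countries": {"Italy"}},
    {"id": "A3", "countries": {"United States", "Italy", "France"}},
]

def count_by_country(records):
    """Count each article once per distinct country among its affiliations."""
    counts = Counter()
    for record in records:
        counts.update(record["countries"])
    return counts

per_country = count_by_country(articles)
total_articles = len(articles)
sum_of_country_counts = sum(per_country.values())

print(per_country)
print(total_articles, sum_of_country_counts)
```

In this toy data set, three articles produce six country-level counts, because the two collaborative articles are counted under every participating country.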

Web visual mapping international collaborations for COVID studies

The graphic above shows international collaborations between different countries. The strong red line indicates that the United States and China have the highest number of collaborations at 1,398 publications. There are also a large number of collaborations between the United States and Italy (1,090 publications), Canada (909 publications), and Australia (713 publications). Strong ties also exist between Italy and Spain (629 publications), Germany (581 publications), and France (555 publications). The strongest ties between countries appear to be largely dependent on proximity, cultural ties, or the extent of the COVID-19 outbreak in each country.
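Collaboration counts like these can be thought of as co-occurrence counts over country pairs. The sketch below shows one way to compute them in Python, assuming a collaboration means two countries appearing on the same article. The records and the helper name `collaboration_counts` are hypothetical, and this is not necessarily how the graphic itself was produced.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records, invented for illustration.
# Each set holds the countries appearing on one article.
articles = [
    {"United States", "China"},
    {"United States", "China", "Italy"},
    {"Italy", "Spain"},
    {"Canada"},
]

def collaboration_counts(country_sets):
    """Count, for each unordered country pair, the articles listing both countries."""
    pairs = Counter()
    for countries in country_sets:
        # Sorting gives each pair a single canonical key,
        # e.g. ("China", "United States") regardless of input order.
        for pair in combinations(sorted(countries), 2):
            pairs[pair] += 1
    return pairs

pair_counts = collaboration_counts(articles)
for (a, b), n in pair_counts.most_common(3):
    print(f"{a} - {b}: {n}")
```

Single-country articles contribute no pairs, while an article with three countries contributes three, so heavily collaborative papers weigh more in this kind of network.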

Bar graph of most relevant COVID literature sources

We can also consider the question of where COVID-19 articles are being published. The articles in our analysis come from over 5,800 sources, including journals, books, and other information sources. The 25 sources with the highest numbers of publications include some of the most highly cited journals in the world of biomedical sciences, including BMJ, Nature, the New England Journal of Medicine, and JAMA. This set also includes journals specializing in infectious diseases, epidemiology, and public health, as one would expect given the questions arising due to the COVID-19 pandemic.

A third set of publication sources in this list includes preprint servers like medRxiv. Seeing these preprint servers is unusual in a list of common resources for any topic, and it is an indicator of how quickly COVID-19 research is emerging and how strong the desire is to get information regarding COVID-19 into the hands of medical experts as quickly as possible. This desire cannot always be accommodated by the traditional journal publishing process, which may lead researchers to post publications to preprint servers prior to their journal publication.

While it may be useful to get information into the hands of experts as quickly as possible during a rapidly evolving global pandemic, a reader of these publications should keep in mind that articles found on a preprint server have not yet been through the full peer review and editing process that a published journal article has, and they may be more prone to errors. Publications taken from a preprint server need to be read with a critical eye, keeping a lookout for errors, sources of bias, or misprints.

What is it about?

There are a few ways to determine what the COVID-19 literature is about. One of the easiest ways is to divide the literature up into major subject categories. The WHO COVID-19 database² divides all of its literature into a number of broad categories describing what each article is about. The largest category includes studies about the risk factors for COVID-19, with 18,189 publications, followed by prognostic studies (17,520 publications), diagnosis (12,442 publications), etiology (9,813 publications), observational studies (8,449 publications), case reports (4,445 publications), qualitative studies (3,984 publications), clinical practice guidelines (3,687 publications), screening studies (3,252 publications), prevalence studies (2,867 publications), and controlled clinical trials (2,652 publications).

However, this doesn’t tell us what types of literature are being published. We can use PubMed’s indexing features to determine what types of studies or other types of literature are being published at the highest rates. This analysis shows that the largest category of COVID-19 publications is letters and editorials (20,663 publications). Publications describing COVID-19 studies make up only 7% of the published literature on COVID-19 (6,872 publications), while literature reviews (including systematic reviews and meta-analyses) make up 13% of the literature (12,618 publications). These numbers tell us that a large number of the publications in each of the WHO’s categories are likely not primary studies, but may be reviews, letters, or opinion-based publications.
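The percentages above can be checked with a short Python sketch. It assumes the denominator is the 97,129 PubMed publications analyzed in this article; the counts are those reported in the text, while the dictionary labels and the helper `share_percent` are illustrative names only.

```python
# Assumed denominator: the 97,129 PubMed publications analyzed in the article.
TOTAL_PUBLICATIONS = 97_129

# Counts as reported in the text; the labels are shorthand chosen for this sketch.
type_counts = {
    "letters and editorials": 20_663,
    "literature reviews": 12_618,
    "primary studies": 6_872,
}

def share_percent(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)

shares = {
    label: share_percent(count, TOTAL_PUBLICATIONS)
    for label, count in type_counts.items()
}

for label, pct in shares.items():
    print(f"{label}: {pct}%")
```

Under that assumption, the sketch reproduces the roughly 7% and 13% figures quoted above and puts letters and editorials at about 21% of the analyzed publications.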

One final way to examine what the COVID-19 literature is about is to consider the number of times a particular word is used in a title or abstract in the literature base. The word cloud below shows the 100 most common words in COVID-19 titles and abstracts, with larger words occurring more often than smaller words. Unsurprisingly, words like COVID, patients, pandemic, and disease top the list. However, other words in the list can indicate areas of interest across publication types or subjects, including words describing when and where the virus emerged (China, March), topics that may be important to a reader (symptoms, severe, syndrome, management), and commonly reported outcomes (mortality, pneumonia, transmission). Commonly occurring words such as public, healthcare, and system may also indicate the impact of COVID-19 not only on the individual patient level, but on the world’s population and global healthcare resources.

Rainbow-colored word cloud, COVID and patients are two largest words
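The counting behind a word cloud like this one can be sketched in a few lines of Python. The titles, the stop-word list, and the helper `top_words` below are all invented for illustration; this is not the exact procedure used to build the figure.

```python
import re
from collections import Counter

# Hypothetical titles, invented for illustration.
titles = [
    "Clinical features of patients with COVID-19 pneumonia",
    "Mortality in patients with severe COVID-19",
    "Transmission of COVID-19 during the pandemic",
]

# A tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"of", "with", "in", "the", "during"}

def top_words(texts, n):
    """Return the n most frequent words across texts, ignoring case and stop words."""
    counts = Counter()
    for text in texts:
        # Keep runs of letters only, so "COVID-19" contributes the token "covid".
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)

print(top_words(titles, 2))
```

Choices such as lowercasing, dropping digits, and which stop words to remove all change the resulting ranking, so word frequencies are best read as a rough guide to themes rather than a precise measure.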

How do I make sense of it?

As the volume of COVID-19 literature continues to grow, making sense of it may be difficult, even for experienced clinicians or researchers. This issue is compounded by the urgent information needs of medical practitioners: the pressure to publish COVID-19 data as quickly as possible may lead to errors or biases in that information. We can use bibliometric analysis to help make sense of the broader landscape of COVID-19 research, giving us a sense of what is being published, where, and on what topics. However, fully making sense of this emerging topic requires an in-depth review of specific, focused questions within the literature, conducted by experts in locating, evaluating, and synthesizing this literature. Bibliometric analysis can help start this process by giving us a general sense of the literature and helping to generate these initial, formative questions.

Note: All analyses were performed using the bibliometrix R package.³ Data were obtained using the Canadian Agency for Drugs and Technologies in Health (CADTH) peer-reviewed COVID-19 search strategy in PubMed.⁴

1. National Center for Biotechnology Information. LitCovid. U.S. National Library of Medicine. Published 2020. Accessed February 3, 2021.
2. World Health Organization. Global research on coronavirus disease (COVID-19). World Health Organization. Published 2020. Accessed February 3, 2021.
3. Aria M, Cuccurullo C. bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics. 2017;11(4):959-975.
4. Canadian Agency for Drugs and Technologies in Health. CADTH COVID-19 search strings. Canadian Agency for Drugs and Technologies in Health. Published 2020. Accessed February 3, 2021.