Do paper citations and indices correlate?

Evaluating the impact of research activity is a complex issue that is guaranteed to stir heated debate whatever the audience or the context. The way evaluators access research output is mostly via publications, whatever the type – articles, books, conference proceedings, technical reports, etc. It is important to note that in many fields of research, publications are not (or should not be) themselves the output of research. In particular, in the natural sciences and mathematics, the output of science comes in many guises, such as theorems, software, datasets, techniques, chemical compounds and materials, patents, etc., the publications being only a report on the research activity and its outcome. Nevertheless, a large part of evaluation relies on the publications. The most obvious way to evaluate them is by reading them – the so-called peer review. This is what is done before a manuscript is accepted for publication in a scientific journal (and, increasingly, after it is published). However, to assess funding applications, project achievements, and individual and institutional performance, most evaluations rely in part on the analysis of publication impact.

[Small digression. Let’s be clear about something. Everyone claiming that peer review of papers is used to evaluate funding applications, individuals for positions or promotions, or institute performance is either a hypocrite or has never been part of such an evaluation committee. This never happens, for two reasons, one negative and one positive. The negative one is that nobody would have the time to perform such an exercise. Members of evaluation panels are often senior researchers, chosen for their recognized track records. They lead research groups and are completely over-committed. Reading a paper seriously, understanding its content and its novelty, takes a significant amount of time. The notion that we read dozens of papers from dozens of scientists for a given panel is just a fantasy. The second, positive, reason is that members of evaluation committees have very limited collective expertise. For instance, I am part of a committee covering the totality of the research spectrum. In this committee, there are only a handful of people covering the entirety of life sciences! It is VERY FORTUNATE that we are not actually judging the papers ourselves!]

To improve on arbitrary judgments based on unconscious biases triggered by journal names, and to complement evaluation by external reviewers, people try to use quantitative metrics developed by the field of bibliometrics, in particular citation analysis. For instance, the UK Research Excellence Framework (REF) provides guidance on the use of citation data. It is important to note that these metrics are not sufficient on their own, and the REF is actively assessing how best to use them.

In the natural sciences, a variety of citation metrics are used to evaluate the impact of articles, individuals and institutes, including citation counts, h-indices and impact factors (yes, this is very wrong: impact factors are meant for journals, not for papers or authors). Recently, a new metric has been proposed to assess the impact of a given article, the Relative Citation Ratio (RCR).

Scientists are inherently navel-gazing (or maybe it is just me), and I was curious to see how all these correlated for me. So I collated my bibliometric data using Google Scholar. First, let’s look at the classic measurements. If I plot the citations of each paper versus the impact factor of the journal it was published in for the year of its publication, the correlation is not overwhelming…


The paper describing SBML is clearly an outlier and makes it hard to judge the rest of the plot, so let’s discard it for the time being (yes, I should also discard the outliers in the other direction, but hey, this is a blog post, not a research paper…).

Now the correlation is clear, but still not overwhelming. It seems to disappear for the highest impact factors, above 18. However, there is an obvious correction to apply to citation counts: recent papers have had less time to accumulate citations than older ones. Because I am now a senior scientist, I tend to publish a bit more in journals with high impact factors – for example, papers reporting the results of large collaborations, and invited reviews. So we need to correct for paper age by dividing the counts by the number of years elapsed since publication.
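This age correction is straightforward to sketch in a few lines of Python. All the numbers below are made up for illustration; they are not my actual citation data:

```python
def citations_per_year(year_published, total_citations, current_year=2020):
    """Age-corrected citation count: total citations divided by
    the number of years elapsed since publication."""
    years_elapsed = max(current_year - year_published, 1)  # avoid dividing by zero for same-year papers
    return total_citations / years_elapsed

# Hypothetical papers: (year published, total citations)
papers = [(2003, 850), (2010, 120), (2015, 40), (2018, 12)]

corrected = [citations_per_year(y, c) for y, c in papers]
print(corrected)  # [50.0, 12.0, 8.0, 6.0]
```

The old, heavily cited 2003 paper and the recent 2018 one become directly comparable on a citations-per-year scale.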


Indeed, the correlation is clearer. But there is still a lot of noise. I would not say that choosing a journal with a higher impact factor is a foolproof way to get more citations. And I would certainly not say that a paper in a high-impact-factor journal necessarily has a big impact!

Let’s now turn to the Relative Citation Ratio. How does it compare to the impact factor?


Well, the correlation is quasi-identical to the one with average citations per year. Which of course leads us to the main comparison: between the RCR and the citation counts.


The correlation is much better. The outlier with 37 citations and an RCR of 0 is actually an artifact of Google Scholar. Of course, the RCR offers more than just an improved citation count. For instance, it also compares a paper’s impact to the impact of all papers reporting NIH-funded research. A problem with the current tool, though, is that its citation data comes from the Web of Science databases. Those databases do not contain all scientific journals, they do not record citations in books, and of course they are not open. The RCR is a neat tool, but considering its strong correlation with raw citations, at least in my case, I think plain citation counts are a good, easy-to-use proxy for impact.

All that focused on per-article impact. But would total citations be a good proxy for evaluating individual researchers? Continuing the navel-gazing exercise, I extracted the data for people in my institute who have set up a Google Scholar profile. I omitted the PhD students, because their publication records and citations are too noisy. I divided the positions into department heads, tenured group leaders, tenure-track group leaders (5-year positions, most often a first experience as group leader), senior research associates (indefinite contracts but not group leaders) and post-doctoral fellows.


The correlation between total citations and h-index is quite impressive. This is probably because we do not have distortions due to anomalous papers (e.g. BLAST or Clustal in bioinformatics). The occasional highly cited papers (e.g. SBML in my case) are just averaged out. And what comes out clearly is that in the majority of cases, positions match publication impact. Are total citations or h-index the better predictor? We can plot each person’s rank in both classifications.
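For readers unfamiliar with how the h-index is defined, it is easy to compute from a researcher’s list of per-paper citation counts. A minimal sketch, using invented counts:

```python
def h_index(citation_counts):
    """h-index: the largest h such that at least h papers
    have at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank  # the paper at this rank still has enough citations
        else:
            break
    return h

# Hypothetical researcher with seven papers
print(h_index([100, 50, 30, 8, 5, 4, 3]))  # 5
```

Note how the one heavily cited paper (100 citations) barely moves the h-index, which is exactly why anomalous papers get averaged out in this metric.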


The h-index seems to correlate a bit better with tenured, SRA and tenure-track positions. The separation between tenure-track group leaders and post-docs is blurrier, because some post-docs are quite senior and have impressive CVs. But overall, the separations are quite clear, and so is the message. In my institute, there is little hope of becoming tenure-track with fewer than 1000 citations and a single-digit h-index. For tenure, the bar would be close to 3000 citations and an h-index in the mid-twenties. When it comes to department heads, we’re talking 10000 citations and an h-index of 50.

Now, all that is of course very focused on my field of research. Molecular, cellular and systems biology is a very peculiar community. The publication habits, the criteria of excellence – everything is very homogeneous, almost military. It is also a fairly inward-looking community. Not only is there very little contact with other sciences, there is very little contact with the other components of life sciences as well. A fair number of its members are actually convinced that all scientists in all fields think and act alike. They would be surprised, and dismayed, to witness what I once saw at a conference: German computer science students impersonating us, exchanging pompous sentences about journal articles, impact factors and citations. They had the time of their lives. Very humbling.

All that to say that everything in this blog post should be taken with more than a grain of salt.


The right to know

In the debates we often see about author-pays vs reader-pays publishing, one parameter is regularly absent: the ability to access science. As demonstrated by the MMR scare or the “Climategate” affair in the UK, the general population sometimes needs access to the primary scientific literature. And it cannot get it. With the spread of 23andMe and similar companies, biological research has invaded the life of every citizen. When one obtains one’s SNP analysis and is told that one has an increased risk of x % of getting a disease, the first reflex is to “google” the disease and search for scientific information (at least this is what I’d do, and what I do each time I, or someone close to me, faces a health issue). At best, one can get the abstracts of papers. Those generally report only a biased view of the story (or even the opposite of the paper’s conclusion. I have examples of that, including one where the authors changed a conclusion following peer review but forgot to change the abstract. For years the paper was cited for the wrong conclusion stated in the abstract. That says a lot about citations in the life sciences, by the way…). And finally, this week I reviewed job applications and was not able to access many of the applicants’ papers. This certainly did not help the applicants.

But most importantly, the citizens already paid for this research. They should not pay again to access the results. That publishers want to sell nice journals with added value such as layout, news and views, etc., is all fine. But trying to forbid the deposition of the manuscript – the raw product of the researchers’ activity – in an open repository, as they do with the Research Works Act, is criminal. It is pure robbery of the citizen. Note that Nature Publishing Group is very different from, let’s say, the AAAS and Elsevier in that respect: authors retain their copyright and can do whatever they want with their own production. Of course, all that still falls short of true Open Access publication.

Update, February 28: Elsevier has withdrawn its support for the RWA. However, in the withdrawal statement, they clearly re-state that they will continue the fight and still support the idea behind the act, i.e. prohibiting the mandatory deposition of manuscripts in open repositories.