Do paper citations and indices correlate?

Evaluating the impact of research activity is a complex issue that is guaranteed to stir heated debate, whatever the audience or the context. Evaluators mostly access research output via publications of whatever type – articles, books, conference proceedings, technical reports etc. It is important to note that in many fields of research, publications are not (or should not be) themselves the output of research. In particular, in natural sciences and mathematics, the output of science comes under many guises such as theorems, software, datasets, techniques, chemical compounds and materials, patents etc., the publications being only a report on the research activity and its outcome. Nevertheless, a large part of the evaluation relies on the publications. The most obvious way to do so is by reading the publications, the so-called peer review. This is what is done before a manuscript is accepted for publication in scientific journals (and increasingly now after it is published). However, to assess funding applications, project achievements, and individual and institutional performance, most evaluations rely in part on the analysis of publication impact.

[Small digression. Let’s be clear about something. Everyone claiming that peer-review of papers is being used to evaluate funding applications, individuals for positions or promotions, or institute performance, is either a hypocrite or has never been part of such an evaluation committee. This never happens, for two reasons, one negative and one positive. The first one is that nobody would have time to perform such an exercise. Members of evaluation panels are often senior researchers, chosen for their recognized track records. They lead research groups and are completely over-committed. Reading a paper seriously, understanding its content and its novelty, takes a significant amount of time. The notion that we read dozens of papers from dozens of scientists for a given panel is just a fantasy. The second, positive, reason is that members of evaluation committees present a very limited collective expertise. For instance I am part of a committee covering the totality of the research spectrum. In this committee, there are only a handful of people covering the entirety of life sciences! It is VERY FORTUNATE that we are not actually judging the papers ourselves!]

To improve on arbitrary judgments based on unconscious bias triggered by journal names, and to complement evaluation by external reviewers, people try to use quantitative metrics, developed by the field called bibliometrics, and in particular citation analysis. For instance, the UK Research Excellence Framework (REF) provides guidance for the use of citation data. It is important to note that those metrics are not sufficient, and REF is actively assessing how best to use them.

In the natural sciences, a variety of citation metrics are used to evaluate the impact of articles, individuals and institutes, including citation counts, h-indices and impact factors (yes, this is very wrong, IFs are meant for journals, not for papers and authors). Recently, a new metric has been proposed to assess the impact of a given article, the Relative Citation Ratio.

Scientists are inherently navel gazing (or maybe it is just me), and I was curious to see how all these correlated for me. So I collated my bibliometrics data using Google Scholar.  First let’s look at the classic measurements. If I plot the citations of each paper versus the impact factor of the journal it was published in for the year of its publication, the correlation is not overwhelming …


The paper describing SBML is clearly an outlier and makes it hard to judge the rest of the plot, so let’s discard it for the time being (yes, I should also discard the outliers in the other direction, but hey, this is a blog post, not a research paper …)

Now, the correlation is clear, but still not overwhelming. The correlation seems to disappear for the highest impact factors, above 18. However, there is an obvious correction to bring to citation counts: recent papers are less cited than old papers. Because I am now a senior scientist, I tend to publish a bit more in journals with high impact factors. Examples are papers reporting the results of large collaborations and invited reviews. So we need to correct for paper age by dividing the counts by the number of years elapsed since their publication.
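For the curious, here is a minimal sketch of that age correction in Python. The paper records are invented purely for illustration (they are not my Google Scholar data), and the reference year and function names are assumptions of the sketch:

```python
from math import sqrt

# Hypothetical records: (publication year, journal impact factor, total citations).
# These values are made up for illustration only.
papers = [
    (2003, 6.0, 420),
    (2006, 9.5, 610),
    (2010, 4.2, 95),
    (2013, 12.0, 160),
    (2015, 30.0, 75),
]
CURRENT_YEAR = 2017  # assumed reference year


def citations_per_year(year, citations, now=CURRENT_YEAR):
    """Divide the citation count by the number of years elapsed since publication."""
    elapsed = max(now - year, 1)  # avoid dividing by zero for papers from this year
    return citations / elapsed


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


impact_factors = [jif for _, jif, _ in papers]
raw_counts = [cites for _, _, cites in papers]
per_year = [citations_per_year(year, cites) for year, _, cites in papers]

print("raw citations vs IF:  r =", round(pearson(impact_factors, raw_counts), 2))
print("citations/year vs IF: r =", round(pearson(impact_factors, per_year), 2))
```

With real data one would simply replace the `papers` list; a rank correlation might be a more robust choice than Pearson when a few papers dominate the counts.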


Indeed, the correlation is clearer. But there is still a lot of noise. I would not say that choosing a higher impact factor is a foolproof way of getting more citations. And I would certainly not say that a paper in a high impact factor journal necessarily has a big impact!

Let’s now turn to the Relative Citation Ratio. How does it compare to the impact factor?


Well, the correlation is quasi-identical to the one with the average citations per year. Which of course leads us to the main comparison, which is between the RCR and the citation counts.


The correlation is much better. The outlier with 37 citations and an RCR of 0 is actually an artifact of Google Scholar. Of course, the RCR offers more than just an improved citation count. For instance, it also compares a paper’s impact to the impact of all papers reporting research funded by the NIH. A problem of the current tool, though, is that its citation data comes from the Web of Science databases. Those databases do not contain all the scientific journals. They do not record citations in books. And of course they are not open. The RCR is a neat tool, but considering the strong correlation with pure citations, at least in my case, I think just looking at the citation counts is actually a good, easy-to-use proxy for impact.

All that focused on article-by-article impact. But would total citations be a good proxy to evaluate individual researchers? Continuing the navel gazing exercise, I extracted the data for people in my institute who set up a Google Scholar profile. I omitted the PhD students, because their publication records and citations are too noisy. I divided the positions into department heads, tenured group leaders, tenure track group leaders (5 year positions, most often a first experience of group leader), senior research associates (indefinite contracts but not group leaders) and post-doctoral fellows.


The correlation between total citations and h-index is quite impressive. This is probably due to the fact that we do not have distortions due to anomalous papers (e.g. BLAST or Clustal in bioinformatics). The occasional highly cited papers (e.g. SBML in my case) are just averaged out. And what comes out clearly is that in the majority of cases, positions match publication impact.  Are total citations or h-index the best predictor? We can plot the rank in both classifications.


The h-index seems to correlate a bit better with tenure, SRA and tenure track positions. The separation between tenure track and post-docs is more blurry because some post-docs are quite senior and have impressive CVs. But overall, the separations are quite clear. And so is the message. In my institute, there is little hope of becoming tenure track if you have fewer than 1000 citations and a single digit h-index. For tenure, the bar would be close to 3000 citations and an h-index in the mid-teens. When it comes to department heads we’re talking 10000 citations and an h-index of 50.
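As a reminder of why the h-index is less sensitive than total citations to one anomalous paper, here is a small sketch using the standard definition (the largest h such that h papers have at least h citations each). The citation record is hypothetical, chosen only to include one heavily cited outlier:

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical record with one highly cited outlier (think of a standards paper).
record = [2500, 40, 33, 20, 12, 9, 7, 3, 1]
without_outlier = record[1:]

print("total citations:", sum(record), "vs", sum(without_outlier))
print("h-index:        ", h_index(record), "vs", h_index(without_outlier))
```

In this made-up record the outlier multiplies the total citations by about twenty, while the h-index only moves by one.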

Now all that is of course very focused on my field of research. Molecular, cellular and systems biology is a very peculiar community. The publication habits, the criteria of excellence, everything is very homogeneous, almost military. It is also a fairly inward looking community. Not only is there very little contact with other sciences, but there is very little contact with the other components of life sciences as well. A fair number of its members are actually convinced all scientists in all fields are thinking and acting alike. They would be surprised, and dismayed, to witness what I once saw at a conference: German computer science students impersonating us, exchanging pompous sentences about journal articles, impact factors and citations. They had the time of their lives. Very humbling.

All that to say that everything in this blog post should be taken with more than a grain of salt.

Selection panels, follow the “rule of thirds”

Over the past decade or so, I have been part of quite a few grant panels, for national and international funding agencies. Each time, the funding agency felt compelled to reinvent all the procedures from scratch, and ignore whatever experience has been gained from thousands of such exercises in the past. The reason is always to be more efficient, and serve science by selecting the best projects, with the highest likelihood of impact. Invariably, the system put in place achieves the exact opposite.

One aspect in particular severely impacts the resulting outcome: the success rate. The success rate varies widely from one funding scheme to another. The most highly sought after funding sources, such as grants from the Human Frontier Science Program, barely reach a success rate of a few percent. For such competitive schemes, multi-stage selection systems are often put in place, with only a fraction of the projects going from one stage to the next. Now the interesting – and slightly depressing – facts are:

  • the fraction of projects moving from one stage to another seems completely random, and disconnected from the final success rate;
  • the length of the documentation required in the application is disconnected from the number of stages and success rate;
  • the number of panel members looking at the documentation also varies seemingly in a random fashion, and is certainly not related to the number and size of proposals.

One of the worst examples I saw of that situation was the selection of Horizon 2020 collaborative projects in 2015, where a first step selected 30% of the projects, and a second step selected 5% of these. Such fractions were wrong not only because they were unequal, with the selection pressure increasing between stages, but also because they were the wrong way around: 70% of scientists wrote short documents without success for the first stage, and 95% of the remaining scientists wrote very long applications without success for the second stage. An even worse example is the ERC Advanced Grants, where all applicants are asked for both a short and a long project description. But the panel selects the projects during step 1 using only the short description! Since only 1/3 of projects are sent for external review, 2/3 of the applicants wrote a long application that will never be read by anyone!!!

What are the consequences?

  1. frustration and dispiriting of scientists, that compounded the lack of research done while they were working on the grant applications;
  2. increased workload on panel members, who had to read and evaluate a lot of documentation, 95% for nothing;
  3. enormous waste of taxpayers money on both sides of the fence;
  4. funded projects that are almost certainly not the best.

Waste of money

So, what does such a process cost? Let’s look at the panel side. Evaluating a 3-6 page document that outlines a project takes maybe one hour per project. Let’s assume a project was read by two panel members. The H2020 call I was talking about above had 355 applications, of which 108 were selected for the second stage, 5 of them being funded. So, we are talking 710 hours of reading for the first stage. To which we need to add the panel meeting. We’ll assume a panel of 10 members, meeting once for 2 days (8 hours a day) plus travel (~10 h return). So 260 more hours. Total is 970 hours. This represents GBP 48500 (I took a very average salary for a PI, costing their institution GBP 50 per hour). To which we need to add travelling, accommodation and catering costs, about 5000 (again super conservative). Of these 53500, 35700 are wasted on failed applications.

A complete application of 50-100 pages would require half a day (4 hours), hence 864 hours of reading for the lot, plus the panel meeting. Total is 1124 hours, that is GBP 56200 plus 5000 of meeting. But … hold on, I forgot someone! For this second stage, the opinion of external experts will be sought. Now, I am not going to overestimate their amount of work. They are assumed to spend half a day on each proposal. But I will only count 2 hours. And each proposal is evaluated by 3 experts, which adds 648 hours, or GBP 32400. So the total is GBP 93600, of which 89300 are wasted on failed applications.
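The panel-side arithmetic for the two stages can be checked with a few lines of Python. This is only a sketch of the back-of-the-envelope estimate above; the function and variable names are mine, and all figures are the assumptions already stated in the text (GBP 50 per hour, two readers per proposal, a 10-member panel, GBP 5000 of meeting expenses):

```python
HOURLY_COST = 50            # GBP per hour, the assumed cost of a PI to their institution
MEETING_EXPENSES = 5000     # GBP for travel, accommodation and catering per meeting
PANEL_SIZE = 10
MEETING_HOURS = 2 * 8 + 10  # two 8-hour days plus ~10 h of return travel, per member


def panel_stage_cost(proposals, hours_per_read, readers=2,
                     external_reviewers=0, external_hours=0):
    """Cost in GBP of one selection stage with a panel meeting, seen from the panel side."""
    reading = proposals * hours_per_read * readers
    meeting = PANEL_SIZE * MEETING_HOURS
    external = proposals * external_reviewers * external_hours
    return (reading + meeting + external) * HOURLY_COST + MEETING_EXPENSES


stage1 = panel_stage_cost(355, hours_per_read=1)
stage2 = panel_stage_cost(108, hours_per_read=4,
                          external_reviewers=3, external_hours=2)

# Share of the second-stage cost spent on the 103 proposals that were not funded.
wasted_stage2 = stage2 * (108 - 5) / 108

print("stage 1:", stage1)
print("stage 2:", stage2)
print("wasted in stage 2 (rounded):", round(wasted_stage2, -2))
```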

But those are the costs on the panel side, the ones directly supported by some funding agencies (most do not fund reviewers’ time and some do not support panel members’ time).

Now, on the applicant side, the one superbly ignored by the funding agencies … For the first stage, I will assume 10 people are involved, spending a day for most and a week for 2 of them (coordinator, grant officer). They also have a meeting to which 7 people travel and spend a night. The total spent for the 355 projects is 4 million, of which 2.8 million are spent on failed applications. For the second stage, more people are involved, spending more time, let’s say 12 people spending a week on the project, and 3 spending 3 weeks. 10 people travel to a preparatory meeting. We are talking of a total expense of 9.2 million, of which 5 million are wasted on failed applications. The funding bodies could not care less about this money. They do not pay for this side of the process. The institutions of the applicants (and therefore other funding bodies) do so.

Adding panel and applicant spending, 9.4 MILLION pounds of taxpayers’ money have been spent on failed applications! Now the interesting fact is that this particular call had a total budget of 30 million euros, that is a bit more than 20 million GBP. In other words, to distribute 2 of their pounds, the taxpayer spent another pound! ONE THIRD of this public money was spent without any scientific research being done.

Random selection of projects

Now, that’s for the efficiency. Let’s move to the efficacy. Surely this very expensive process selected the best possible scientific projects? Being super selective means only the “crème de la crème” are selected? Not at all! This is misunderstanding how grant applications are selected.

1) Within a panel, grant applications are distributed to a few of the panel members, sometimes called “introducing members”. This is generally (but not always) based on the expertise of those members, who can then evaluate the proposal and select suitable external reviewers. These introducing members have an enormous power. They are generally the only ones reading an application attentively enough to detect flaws. They are giving the initial score to a project, that will decide how it will be discussed in the panel meeting. Panel members have different habits to score projects. Some will provide a Gaussian distribution of scores. Some will only give highest scores to projects they want to discuss and lowest for the one they do not like. This will affect the global score, drawn from the combination of scores from various introducing members.

Introducing members defend or destroy the application during the meeting. If the introducing member is negative, you’re doomed. If the introducing member is an expert in your field, you’re doomed. If the panel member is a shy individual, you’re doomed. If the panel member cannot be bothered or was depressed, you’re doomed. If the introducing member is not an expert but saw an interesting talk in the domain a couple of weeks ago, you’re saved. If the panel member has a big voice, you’re saved. If the panel member is competitive and wants “his” projects to be funded so he beats the other panel members, you’re saved. So there is an enormous bias towards boasting, competitive, vocal introducing members.

As with every process in the universe, the noise (the non-scientific component of the selection) increases as a function of the square root of the signal (the proportion of projects funded). If only a very few projects are selected among plenty, the effect of the introducing members on the whole selection will be proportionally bigger (although for any given project, it does not matter).

Visual rendering of the selection of 30, 10, 3 and 1 % of proposals.

2) Discussing a lot of projects during a panel meeting leads to temporal bias. We are more lenient at the beginning of the day, and more severe towards the end of the day. Not only do we get tired, nervous and dehydrated, we also tend to wield an axe rather than clippers to prune the good from the bad. While we find excuses and side interests for a lot of projects at the beginning of the day, the slightest error or clumsy statement is damning when we reach tea time. Now, the more projects, the less likely it is that they will be discussed several times in a day, and therefore the more sensitive the process will be to the panel’s physiology.

3) Recognising excellence from a grant application is not that easy. And the excellence of the projects is in general not linear. Many projects will be totally rubbish (oh come on! I am not the only one having been in a grant panel, am I?). But many will be excellent as well. With a few in between. Imagine a “sigmoid curve”. Selecting between the very best projects is very difficult. One needs more information to distinguish between close competitors (green box). While we do not need much to eliminate the hopeless ones (red box).


So, how do we fix this?

A proposal: remember the rule of thirds

This idea is based on the way we actually rank proposals. Whatever selection I have to do among competitive pieces, I make three piles: NO, YES, MAYBE. The NO pile is made up of projects I think should be rejected no matter what. The YES pile is made up of projects that I would be proud to have proposed. They’re excellent, and they should be funded. The MAYBE pile is … well maybe. We need more discussion, it depends on the funding etc. Because each project is read by several reviewers/panel members, there will be variation of scoring. But this noise should happen at the edge of the groups. One should then discuss the bottom of the YES pile, and the top of the MAYBE pile (see blue box on the excellence plot).


So, choosing which projects to fund should obey the rule of thirds: accept at least a third of them. If there is not enough money for at least 1/3, then a 2 stage process must be organised. If the money is too short to fund 1/9 of them, then a 3 stage process must be organised etc. At each stage, three equal piles are drawn, YES, NO, MAYBE.
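The rule can be turned into a tiny calculation: keep adding stages until passing one third at each stage is enough to reach the final success rate. The sketch below is my own formalisation of the rule, with hypothetical function names; it uses exact fractions so that boundary cases such as exactly 1/9 are not disturbed by rounding:

```python
from fractions import Fraction


def stages_needed(funded, submitted):
    """Smallest number of stages such that keeping 1/3 per stage reaches the success rate."""
    rate = Fraction(funded, submitted)
    stages = 1
    while Fraction(1, 3) ** stages > rate:
        stages += 1
    return stages


def pile_sizes(funded, submitted):
    """Number of proposals entering each stage, keeping a third (rounded up) each time."""
    sizes = [submitted]
    while -(-sizes[-1] // 3) > funded:     # ceiling division by 3
        sizes.append(-(-sizes[-1] // 3))
    sizes.append(funded)
    return sizes


print(stages_needed(1, 3), stages_needed(1, 9), stages_needed(1, 10))
print(stages_needed(5, 355), pile_sizes(5, 355))
```

For a call funding 5 projects out of 355 submissions, this gives four stages with piles of 355, 119, 40 and 14 proposals before the final 5.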


The first stage should be strategic. For instance, each project is only described in a one page document. The panel chair and co-chair select within ALL proposals the ones that are suitable for the call. That way, since they see all proposals, they can balance topics, senior vs junior, gender etc. according to the policies of the funding bodies. This can be done very quickly, in a few days of intense work.

The second stage involves panel members. A project description must then include the science, track records etc. Each panel member has several projects, each project is evaluated by several members. Each member must have a significant share. That should be done fairly quickly since the descriptions are short, and no external opinion is sought.

The final stage involves external scientists. Only then does one require the full project descriptions.

Note that the pile is the same height at each step: the fewer the proposals, the longer the descriptions.

How does the progressive selection look?

What are the costs for the EU call we used as example before?

Panel side: The first step is done on 1 page. It involves 10 min per proposal for each of the chair and co-chair. So, we are talking 119 hours of reading for the first stage. There is no panel meeting. The total expense is then ~5900.

The second stage is equivalent to the first stage previously. Evaluating a 3-6 page document that outlines a project takes maybe one hour per project. Let’s assume a project was read by two panel members. One third of the 355 projects are evaluated, that is 119. So, we are talking 238 hours of reading for the second stage. To which we need to add the panel meeting. We’ll assume a panel of 10 members, meeting once for 2 days (8 hours a day) plus travel (~10 h return). So 260 more hours. Total is 498 hours. This represents GBP 24900. To which we need to add travelling, accommodation and catering costs, about 5000 (again super conservative), giving GBP 29900 for this stage.

The complete application still requires half a day (4 hours). But we have only 40 of them, each read by two panel members, hence 320 hours of reading plus the panel meeting. Total is 580 hours, that is GBP 29000 plus 5000 of meeting. For this third stage, the opinion of external experts will be sought. As before, I assume they will spend 2 hours per proposal. And each proposal is evaluated by 3 experts. That adds 240 hours, or GBP 12000, so this stage costs GBP 46000.

Total for the panel side is 5900+29900+46000. That is 81800, not a huge saving on the previous situation (still one year of PhD salary …).
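Here is a sketch of the same panel-side estimate for the three-stage process, using only the assumptions listed in the text (GBP 50 per hour, 260 hours per panel meeting, GBP 5000 of meeting expenses). Variable names are mine, and the first stage is computed from 10 minutes per proposal rather than from the rounded hour count, so the total is rounded to the nearest hundred at the end:

```python
HOURLY_COST = 50                    # GBP per hour
MEETING_HOURS = 10 * (2 * 8 + 10)   # 10 members, 2 days of 8 h plus ~10 h travel
MEETING_EXPENSES = 5000             # GBP per panel meeting

# Stage 1: chair and co-chair each spend 10 minutes per one-page proposal, no meeting.
stage1 = 355 * 2 * (10 / 60) * HOURLY_COST

# Stage 2: 119 short proposals, 1 hour each, two readers, one panel meeting.
stage2 = (119 * 1 * 2 + MEETING_HOURS) * HOURLY_COST + MEETING_EXPENSES

# Stage 3: 40 full proposals, 4 hours each, two readers, one panel meeting,
# plus 3 external experts spending 2 hours per proposal.
stage3 = (40 * 4 * 2 + MEETING_HOURS + 40 * 3 * 2) * HOURLY_COST + MEETING_EXPENSES

total = stage1 + stage2 + stage3
print("stage 2:", stage2, "stage 3:", stage3)
print("panel-side total (rounded to the nearest 100):", round(total, -2))
```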

Now, on the applicant side, this is a completely different story. For the first stage, only one person is involved, the coordinator, spending 1 day. The total is therefore 142000 for the 355 projects. The second stage is now what was previously the first stage, except only 119 projects are involved. The total spent is 1356600. The third stage is now like the second stage previously, except only 40 projects are evaluated. The total is 1920000.

The total for the applicant side is therefore 142000+1356600+1920000 = 3418600

Adding panel and applicant spending, only 3500400 pounds of taxpayers’ money have been spent, a saving of about 2/3!
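A quick check of that comparison, taking the totals quoted in this post at face value (the 9.4 million figure for the original process is itself a rounded estimate, so the result is only indicative):

```python
panel_side = 81_800          # GBP, three-stage process, panel side
applicant_side = 3_418_600   # GBP, three-stage process, applicant side
old_total = 9_400_000        # GBP, rounded total quoted for the original process

new_total = panel_side + applicant_side
saving = 1 - new_total / old_total

print("new total:", new_total)
print("saving: {:.0%}".format(saving))
```

The saving comes out a little above 60%, which is where the "about two thirds" comes from.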

Now, should it have stopped here? No. This process was still not good, because only 12.5% of the projects have been selected during the last round. An even better process would have been to add yet another layer of selection. The second layer would have involved the panel members, but without a meeting. The third layer (panel members and extended discussions during a meeting) would have selected 14 projects. The last exercise, involving external reviewers, would have selected 5 amongst those. Only 42 reviewers would be needed (14*3) instead of a whopping 350 or so.

The process would perhaps be a bit longer (“perhaps”, because most of the time lost in those processes is NOT due to the evaluation, but to administrative treatment of applications and unnecessary delays between the different stages). But so much effort, money and anxiety saved! And so much more time for scientists to do research!

What to do and not to do in advanced modelling courses

I previously introduced our in silico systems biology course. After 5 years of this course, I collected a few lessons that are probably applicable to any advanced course. Nothing very new or surprising, but worth keeping in mind when organising these teaching events.

Select the students well

Beware of the wrong expectations, and of the students who do not find what they thought they would. Disappointed students can wreck the atmosphere of a course. Beware that terminologies are different in different domains. One of the most overloaded terms is “model”. 3D structure model, hidden Markov model, general linear model, chemical kinetics model, all those are models. But they address different populations. Systems Biology itself is problematic. Choose also the level of the course and stick to it when selecting the students. Even if there is not the expected number of applicants (fortunately not a problem for our in silico systems biology course anymore), do not be tempted to select inadequate candidates. Better to take on fewer students than to have a few students bored or unable to follow. Our course is advanced, and covers quite a lot of ground. We cannot expect all students to be expert in every aspect of the course. However, by selecting students who are skilled in at least one aspect of the course (and balancing the expertise), we liven up the lessons (more interesting questions and discussions) and students themselves become “associated trainers”.

More hands-on, practicals, tutorials

Students learn with their fingers. A demo will never replace an actual hands-on, where the students make the mistakes and fix them (with the help of trainers). And of course, keep the lecturers from diving into their own research and giving scientific presentations. This is a course, not a conference. If needed, organise special scientific presentations a few times during the course, but not in the lessons.

Focus on concrete applications of tools

Avoid lengthy descriptions of the theoretical basis of algorithms. It is good that students learn what is under the bonnet, and can choose solutions. But (in general) they are here to learn how to use those tools for their research, not to develop the next generation of them. Two complementary approaches are 1) building toy examples, that illustrate specific uses, and 2) using famous simple examples from the literature.

Do not try to cram too much in the course

It is better to explain a typical set of techniques well than to cover the whole field inadequately. It is generally not possible to present all the approaches used in a field of computational biology. Even a seasoned researcher in the field does not master all of them. Introduce the common basis very carefully. And then move on to a few examples of more advanced approaches. If the basics are well understood, and the students are really using the content of the course for their research, they will be able to continue training on their own.

Engage the students

It is very important that the students feel part of the course. Those events last only one week or two. The students need to bond with the organisers, the trainers and each other immediately. Make them present their work the first day, maybe with one slide each. Organise poster sessions. Real poster sessions, where students are kept around the posters. Drinks and snacks are a good method if they are located in the same place and keep the students there. If you selected the students wisely (see first point), they should be interested in each other’s research.

Try to keep trainers around

So they can interact with students outside of their presentation/tutorials. It is very difficult. You choose the best trainers, so they are obviously very busy people. But sometimes it is better to choose better trainers than better scientists. Also select your trainers even more carefully than your students. You want good presenters, but also good interactors. Bad trainers will arrive just before their course, spend the coffee breaks reading their emails, and leave just after. Those people do not like teaching, and frankly they don’t deserve your students. Do not hesitate to replace them, even if they are famous. Observe them also outside the classroom. This is very sad to say, but some trainers cannot behave when interacting with young adults.

These are only a few pieces of advice. I am sure there are plenty of others. What are your experiences?

“What is systems biology” – the students talk

This year was the 5th instalment of our Wellcome-Trust / EMBL-EBI course “in silico systems biology“.

This course finds its origin a few years ago in a workshop of the EBI industry programme on “Pathways and models”. The workshop, which lasted 2 days, was praised by the attendees. However, the time limitation caused a bit of frustration and made us skip entire aspects we would have liked to cover. I therefore decided to try making it into a full-blown course with the help of Vicky Schneider, then responsible for training at the EBI.

The first course, supported by EMBO, lasted 4 days. It was well received. However, we tried to cover too much, from functional genomics and network reconstruction to quantitative modelling of biological processes. Fortunately, the existence of another EBI course, “Networks and pathways“, allowed us to focus on modelling. We progressively improved the programme through 1 FEBS course and 3 Wellcome-Trust advanced courses. Without boasting, the current course, co-organised with Julio Saez-Rodriguez and Laura Emery, has almost reached perfection. The programme always evolves, but the changes slowed down with time, and we are now more in an optimisation/refinement phase. One of the big advantages is that we kept a core of trainers, who help improve the consistency and quality of the content. We are now happy to see our first generations of students having become active figures in systems biology. Some group leaders who attended the course in the past now send their own students every year. A forthcoming post will discuss a few things I learnt from organising those courses.

Besides the regular training, we always have a few group activities. This year, the students were split into small groups at the beginning, and had to answer a few questions. One of them was …

What is systems biology?

Everyone has their own idea about that one, including myself (for more on the history, nature and challenges of systems biology). Here I provide you with the unfiltered and unclustered responses of 25 students (repetitions originate from different groups coming up with the same answers):

  • Mechanisms on different levels
  • Wholistic view (tautology intended)
  • Dynamics of biological systems
  • Fun
  • Mathematical modeling
  • Insight to the systems
  • Predictions
  • Looking at the system as a whole and not per component
  • Should also be: formal, unambiguous
  • Holistic approach
  • Using modelling to answer biological questions
  • understanding dynamics of a system in terms of predictability
  • Mechanistic insight
  • A tool to complement experimental data
  • Experiments-modeling cycle leading to discovery
  • formalisms
  • Technology+bio data+ in silico
  • integrating levels of biological processes
  • reaching the experimentally unapproachable

Interesting, isn’t it? At first it looks pretty much all over the place. Let me re-order the answers and group them:

  1. Entire systems
    • Wholistic view (tautology intended)
    • Looking at the system as a whole and not per component
    • Holistic approach
  2. Mechanisms
    • Insight to the systems
    • Mechanistic insight
    • Mechanisms on different levels
    • integrating levels of biological processes
  3. Dynamics
    • Dynamics of biological systems
    • understanding dynamics of a system in terms of predictability
  4. Modeling
    • Mathematical modeling
    • Should also be: formal, unambiguous
    • formalisms
    • Using modelling to answer biological questions
  5. Complement the observation
    • A tool to complement experimental data
    • reaching the experimentally unapproachable
    • Experiments-modeling cycle leading to discovery
    • Predictions
    • Technology+bio data+ in silico
  6. And of course
    • Fun

We basically fall back on the two global positions in the field: a philosophical statement about life sciences (1,2,3), and a set of techniques (4,5). That reminds me a lot of the discussions we had about molecular biology at university a few decades ago …

Turns out I am genetically British and not Gallish!

On January 28th 2014, I became a British citizen (while still retaining my original French citizenship). This blog post is not about the reasons behind that decision, that I might or might not expose someday. But I just discovered that I might already be British by descent!

A year ago, I sent some spit to a US company called 23andMe. This company amplified my DNA and looked for all the sites that are variable in the human population (e.g. some people present at a given position on a chromosome the base pair G-C while others have the base pair A-T). We call those SNPs (for Single Nucleotide Polymorphisms, pronounced “snips”), and the procedure SNP genotyping. My aim was to look for increased risks for some diseases. If we have the genotype (the state of all the SNPs) of many people, patients and controls, we can determine if some SNPs are statistically more frequent in people with a disease than in the general population. I thus discovered that I had a higher risk than the general population of having Venous Thromboembolism, Alzheimer’s Disease and Exfoliation Glaucoma. Conversely I have a lower risk of Gout, Colorectal Cancer and Age-related Macular Degeneration (so less chance of becoming blind with macular degeneration but more with glaucoma, hmm). I also have a higher sensitivity to the blood thinner warfarin and a decreased sensitivity to treatments for type II diabetes and hepatitis C. This health related activity of 23andMe was recently blocked by the Food and Drug Administration, a decision that led to many interesting and heated discussions (google “23andMe” “FDA” to have an idea of the divide that decision brought to the community of biologists).

However, there is another use of SNP genotyping: genealogy. Because mutations are very rare, most SNPs tend to be conserved between parents and children. As a result, I share around 50% of my SNPs with each of my parents and my children (49.9% with my daughter who was also 23andMeed), 25% with my grandparents etc. We have a lot of SNPs, several millions. That means we can trace genealogical relationships very far. If we know the geographical location of the 23andMe clients, and assume most of them did not travel away from where their ancestors lived, we can use the genotyping to follow genetic migrations.

23andMe offers tools to do just that, and surprise … I am mainly of British ancestry! We are talking here about ancient ancestry, before fast travel and genetic mixing. 23andMe provides three levels of statistical analysis, from the most conservative (less informative, but more robust) to the most speculative (more informative but less solid).




Whatever the level of significance, what stands out is “British & Irish”. Another tool allows one to visualise this DNA ancestry on a global Similarity Map.


Each square represents a genotype, i.e. a person, significantly related to me. My genotype is represented by a green “callout”. My contacts on 23andMe (people I contacted or who contacted me, because we are perhaps distant relatives) are represented by black callouts. As you can see on the maps, my contacts and I cluster better with the British cloud than the French one.

What could that mean? I can imagine three reasons, with increasing likelihood.

1) Most British clients of 23andMe are of French descent. Therefore the apparent result is the opposite of the truth, the cluster labelled “British” being made of French genotypes. This is not completely insane. After all, William the Conqueror and his army came from France. However, he and his army were Norman, i.e. Viking (and since Normandy is close to Brittany, that could instead explain the 1.3% Scandinavian).

2) There was a recent influx of British DNA in my ancestry lines. This is certainly possible, but more than 50%? The influx should be recent and multiple (affecting both lines of ancestry). And indeed looking at the standard and speculative mapping above, we see British ancestry on both sets of chromosomes.

3) Some theories postulate that invasions do not affect the general population, only aristocracy being replaced. They received support from genotyping. Principal component analysis of genotypes from a pan-European cohort reproduced the geographical origin of its members. In other words, the genotypic distances followed the geographical distances.

Is it always true though? Both my parents are “Bretons”, coming from Brittany. Brittany was invaded repeatedly between the 3rd and 6th centuries by Britons fleeing Picts and Anglo-Saxons. Some legends state that these Britons killed the Celtic men and impregnated the women (the same legends state that the women’s tongues were cut so that they could not teach their own dialect to their children). Could it be that in Brittany the gene pool was replaced?

I would be very interested to hear from any other Breton who has had their genotype done.

Modelling success stories (4) Birth of synthetic biology 2000

For the fourth entry in the series, I will not introduce one, but two papers, published back to back in a January 2000 issue of Nature. The particularity of these articles is not that the described models presented novel features or revealed new biological insights. However, they can be considered as marking the birth of Synthetic Biology as a defined subfield of bioengineering and an applied face of Systems Biology. It is quite revealing that they focused on systems exhibiting the favourite behaviours of computational systems biologists: oscillation and multistability.

Both papers were published back to back in a Nature issue of January 2000.

Elowitz MB, Leibler S (2000) A synthetic oscillatory network of transcriptional regulators. Nature, 403:335-338.

This paper presents a model, called the repressilator, formed by three repressors in tandem. Each of them is constitutively expressed, this expression being repressed by one of the others. Deterministic and stochastic simulations show that for a high transcription rate and a sufficiently high protein turnover, the system oscillates, the three repressors being expressed in sequence.


Stability of the repressilator. See Elowitz and Leibler for the legend.
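For readers who want to see the oscillation for themselves, here is a minimal deterministic sketch of the repressilator in its dimensionless form (mRNA m_i and protein p_i for each of the three repressors). The parameter values, the initial conditions and the hand-written Runge-Kutta integrator are illustrative assumptions of mine, not the authors' code; the curated model on BioModels Database remains the reference.

```python
def repressilator_rhs(state, alpha, alpha0, beta, n):
    """Right-hand side of the dimensionless repressilator equations.

    state = [m0, m1, m2, p0, p1, p2]; gene i is repressed by protein i-1
    (index 0 = lacI, 1 = tetR, 2 = cI, so cI represses lacI, and so on).
    """
    m, p = state[:3], state[3:]
    dm = [-m[i] + alpha / (1.0 + p[(i - 1) % 3] ** n) + alpha0 for i in range(3)]
    dp = [-beta * (p[i] - m[i]) for i in range(3)]
    return dm + dp

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f([s + 0.5 * dt * k for s, k in zip(state, k1)])
    k3 = f([s + 0.5 * dt * k for s, k in zip(state, k2)])
    k4 = f([s + dt * k for s, k in zip(state, k3)])
    return [s + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

def simulate(t_end=800.0, dt=0.02, sample_every=10,
             alpha=216.0, alpha0=0.216, beta=0.2, n=2.0):
    """Integrate from an asymmetric start; parameter values are assumptions."""
    f = lambda s: repressilator_rhs(s, alpha, alpha0, beta, n)
    state = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]  # unequal proteins break the symmetry
    times, trajectory = [0.0], [state]
    for step in range(1, int(round(t_end / dt)) + 1):
        state = rk4_step(f, state, dt)
        if step % sample_every == 0:
            times.append(step * dt)
            trajectory.append(state)
    return times, trajectory

def count_peaks(values):
    """Number of strict local maxima in a sampled time course."""
    return sum(1 for i in range(1, len(values) - 1)
               if values[i - 1] < values[i] > values[i + 1])

if __name__ == "__main__":
    times, trajectory = simulate()
    lacI_protein = [s[3] for s in trajectory]
    print("peaks of LacI protein:", count_peaks(lacI_protein))
```

Time is expressed in units of the mRNA lifetime and beta is the ratio of protein to mRNA decay rates. Starting all three proteins at the same level would keep the system on its symmetric (non-oscillating) trajectory, hence the unequal initial values.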

The authors implemented the model in bacteria, using the lactose repressor of E. coli (LacI), a repressor from a tetracycline-resistance transposon (TetR) and a lambda phage repressor (CI).


The various biochemical reactions involved in the repressilator implementation (SBGN Process Description language).

They indeed observed an oscillation, detected by a reporter plasmid under the control of a TetR sensitive promoter. Interestingly, the period of the oscillation is longer than the duplication time and a full oscillation spans several generations of bacteria. You can download a curated version of the repressilator in different formats from BioModels Database (BIOMD0000000012).

Gardner TS, Cantor CR, Collins JJ (2000) Construction of a genetic toggle switch in Escherichia coli. Nature, 403: 339-342.

The second paper builds on a bistable switch, formed by two mutual repressors (each constitutively expressed in the absence of the other). If the strengths of the promoters are balanced, the system naturally forms a bistable switch, where only one of the repressors is expressed at a given time (stochastic simulations can display switches between the two stable states).


Stability of the repressor switch. See Gardner et al for the legend.
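The bistability can be sketched with the two-variable dimensionless model of the toggle switch, where u and v are the concentrations of the two repressors. The parameter values (balanced promoters, cooperativity of 2), the starting points and the integrator below are assumptions chosen for illustration, not the values fitted by the authors.

```python
def toggle_rhs(state, alpha1, alpha2, beta, gamma):
    """Dimensionless toggle switch: each repressor inhibits synthesis of the other."""
    u, v = state
    du = alpha1 / (1.0 + v ** beta) - u
    dv = alpha2 / (1.0 + u ** gamma) - v
    return [du, dv]

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f([s + 0.5 * dt * k for s, k in zip(state, k1)])
    k3 = f([s + 0.5 * dt * k for s, k in zip(state, k2)])
    k4 = f([s + dt * k for s, k in zip(state, k3)])
    return [s + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

def settle(u0, v0, alpha1=10.0, alpha2=10.0, beta=2.0, gamma=2.0,
           t_end=50.0, dt=0.01):
    """Integrate until the system has relaxed; returns the final (u, v)."""
    f = lambda s: toggle_rhs(s, alpha1, alpha2, beta, gamma)
    state = [u0, v0]
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(f, state, dt)
    return state[0], state[1]

if __name__ == "__main__":
    # Same parameters, two different histories, two different outcomes.
    print("start with more u:", settle(3.0, 1.0))
    print("start with more v:", settle(1.0, 3.0))
```

The only difference between the two runs is the initial condition, which is the whole point of a switch: the state it ends in records which repressor had the upper hand.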

The authors built two versions of this switch, in a way that allowed external signals to disable one of the repressions, thereby specifically stabilising one state. Interestingly, the authors built their switches in E. coli using the same repressors as Elowitz and Leibler.


Structure of the repressor based toggle switches

A curated version of the toggle switch is available in different formats from BioModels Database (BIOMD0000000507).

Both papers became milestones in synthetic biology (as witnessed by over 2000 citations each according to Google Scholar as of January 2014). The models they describe are also classic examples used in biological modelling courses to explore oscillatory and multistable systems, simulated by deterministic and stochastic approaches.

Can we simulate a whole cell at the atomistic level? I don’t think so

[disclaimer: Most of this was written in 2008. My main opinion has not changed, but some of the data backing up the arguments might seem dated]

Over the last 15 years, it has become fashionable to launch “Virtual Cell Projects”. Some of those are sensible, and based on sound methods (one of the best recent examples being the fairly complete model of an entire Mycoplasma cell, if we except membrane processes and spatial considerations). However, some call for “whole-cell simulation at atomic resolution”. Is it a reasonable goal to pursue? Can we count on increased computing power to help us?

I do not think so. Not only do I believe that whole-cell simulations at atomic resolution are not conceivable in 10 or 15 years, but IMHO they are not conceivable in any foreseeable future. I actually consider such claims damaging, by 1) feeding wrong expectations to funders and the public, 2) diverting funding from feasible, even if less ambitious, projects and 3) belittling the achievements of real scientific modelling efforts (see my series on modelling success stories).

Two types of problems appear when one wants to model cellular functions at the atomic scale: practical and theoretical. Let’s get the practical ones out of the way first, because I think they are insurmountable and therefore less interesting. As of spring 2008, the largest molecular dynamics simulation I had heard of involved ~1 million atoms during 50 nanoseconds (molecular dynamics of the tobacco mosaic virus capsid). Even this simulation used massive computing power (>30 years of a desktop CPU). With much smaller systems (10 000 atoms), people succeeded in going up to half a millisecond (as of 2008). In terms of spatial size, we are very far from even the smallest cells. The simulation of an E. coli sized cell would require simulating roughly 1 000 000 000 000 atoms, that is 1 million times what we do today. But the problem is that molecular dynamics does not scale linearly. Even with space discretisation, long-range interactions (e.g. electrostatic) mean we would need far more than 1 million times more power, several orders of magnitude more. In addition, we are talking about 50 nanoseconds here. To model a simple cellular behaviour, we need to reach the second time scale, a further factor of 20 million. So in summary, we are talking about an increase of several orders of magnitude more than 10 to the power of 14. Even if the corrected Moore’s law (doubling every 2 years) stayed valid, we would be talking of more than a century here, not a couple of decades!
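The back-of-the-envelope arithmetic above can be checked in a few lines. This sketch only uses the figures quoted in the paragraph (1 million atoms for 50 ns versus 10^12 atoms for 1 s, and a doubling of computing power every 2 years); the function names are mine, and the scaling is deliberately assumed to be linear, which is the optimistic case.

```python
import math

def required_speedup(atoms_now, atoms_target, sim_time_now_s, sim_time_target_s):
    """Speed-up needed if cost grew only linearly with atom count and simulated time."""
    return (atoms_target / atoms_now) * (sim_time_target_s / sim_time_now_s)

def years_of_doubling(speedup, doubling_period_years=2.0):
    """Years needed to gain `speedup` if computing power doubles every period."""
    return math.log2(speedup) * doubling_period_years

if __name__ == "__main__":
    linear = required_speedup(1e6, 1e12, 50e-9, 1.0)
    print(f"linear speed-up: {linear:.1e}")
    print(f"years at a doubling every 2 years: {years_of_doubling(linear):.0f}")
    # The rounded figure quoted in the text, before adding the non-linear penalty.
    print(f"years for a factor 1e14: {years_of_doubling(1e14):.0f}")
```

The purely linear estimate comes out at about 2 x 10^13, i.e. close to 90 years of doubling; every extra order of magnitude due to long-range interactions adds roughly another 6 to 7 years.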

Now, IMHO the real problems are the theoretical ones. The point is that we do not really know how to perform those simulations. The force fields I am aware of (the ones I fiddled with in the past), AMBER, CHARMM and GROMACS, are perfectly fine to describe fine movements of atoms, formation of hydrogen bonds, rotation of side-chains etc. We learnt a lot from such molecular dynamics simulations, and a Nobel prize was granted for them in 2013. But as far as I know, those methods do not allow us to describe (adequately) the large scale movements of large atomic assemblies such as protein secondary structure elements, and even less the formation of such structural elements. We cannot simulate the opening of an ion channel or the large movements of motor proteins (although we can predict them, for instance using normal modes). Therefore, even if we could simulate milliseconds of biochemistry, the result would most probably be fairly inaccurate.

There are (at least) three ways out of this, and they all require leaving the atomic level. Plus they also all bump into computation problems.

* Coarse-grain simulations: We lump several atoms into one particle. That worked in many cases, and this is a very promising approach, particularly if the timescales of atoms and atom ensembles are quite different. See for instance the work being done on the tobacco mosaic virus mentioned above. However, (in 2008 and according to my limited knowledge) the methods are even more inaccurate than atomic resolution molecular dynamics. And we are just pushing the computation problem further: even with very coarse models (note that the accuracy decreases with the coarseness) we are only gaining a few orders of magnitude. One severe problem here is that one cannot rely solely on physics principles (Newtonian laws or quantum physics) to design the methods. But we are still at scales that make real time experimental quantitative measurements very difficult.

* Standard Computational Systems Biology approaches. We model the cellular processes at macroscopic levels, using differential equations to represent reaction diffusion processes. The big advantage is that we can measure the constants, the concentrations etc. That worked well in the past (think about Hodgkin-Huxley predicting ion channels, Dennis Noble predicting the heart pacemaker and Goldbeter and Koshland predicting the MAP kinase cascade), and still works well. But does it work for whole cell simulation? No, it does not really. It does not because of what we call combinatorial explosion. If you have a protein that possesses several state variables such as phosphorylation sites, you have to enumerate all the possible states. If you take the example of Calcium/calmodulin kinase II, and you decide to model only the main features, binding of ATP and calmodulin, phosphorylation at T286 and T306, activity, and the fact that it is a dodecamer, you need 2 to the power of 60 different states, that is about a billion billion (roughly 10 to the power of 18) ordinary differential equations. In a cell, you would have thousands of such cases (think of the EGF receptor with its 38 phosphorylation sites!).

* Agent-based modelling (aka single-particle simulations or mesoscopic modelling). Here we abstract the molecules to their main features, far far above the atomic level, and we represent each molecule as an agent that knows its state. That avoids the combinatorial explosion described above. But those simulations are still super-heavy. We simulated hundreds of molecules moving and interacting in a 3D block of 1 micrometer for seconds. Those simulations take days to months to run on the cluster (and they spit out terabytes of data, but that is another problem). However, the problem is that they scale even worse than molecular dynamics. Dominic does not simulate the molecules he is not interested in. If he did simulate all the molecules of the dendritic spine, it would take all the CPUs of the planet for years.
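The combinatorial explosion mentioned for Calcium/calmodulin kinase II is easy to reproduce. The sketch below simply counts states, assuming that each of the five features listed above (ATP binding, calmodulin binding, phosphorylation at T286, phosphorylation at T306, activity) is a yes/no property of each of the 12 subunits; the function name is mine.

```python
def count_states(binary_features_per_subunit, subunits):
    """Number of distinct states of a complex whose subunits each carry
    a set of independent yes/no features."""
    return 2 ** (binary_features_per_subunit * subunits)

if __name__ == "__main__":
    # 5 binary features on each of the 12 subunits of the dodecamer.
    equation_based = count_states(5, 12)
    print(f"states to enumerate with ODEs: {equation_based} (~{equation_based:.2e})")
    # An agent only stores its own current state: 5 flags per subunit.
    print("flags stored per agent:", 5 * 12)
```

This is the contrast between the last two approaches: the equation-based description needs one variable per possible state, whereas an agent carries 60 flags and only ever visits the states it actually reaches.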

So where is the solution? The solution is in multiscale simulations. Let’s simulate at atomic level when we need atomic level, and at higher levels when we need higher level descriptions. A simulation should always be done at a level where we can gain useful insights and possess experimental information to set up the model and validate its predictions. The Nobel committee did not miss it when it awarded the 2013 chemistry prize “for the development of multiscale models for complex chemical systems”.

Update 19 December 2013

Here we go again, this time with the magic names of Stanford and Google. What they achieved with their “Exacycle cloud computing system” is of the same order of magnitude as what was done in 2008. So no extraordinary feat here: 2.5 milliseconds of 60 000 atoms. But that does not stop them from launching the “whole-cell-at-atomic-resolution” idea again.