Why not use R more often as a backend for modeling and simulation?

I have a confession to make. I never learnt how to use MatLab. Despite having been in the business of developing and using dynamic models for more than a decade now, I managed to avoid fiddling with the master software used for that purpose. I used it once to run scripts written by one of my students. And I toyed a bit with Octave and SciLab, which are free replacements. But I was lucky enough to develop models in the infancy of systems biology, when it was perfectly acceptable to re-invent the numerical wheel in C (or even Perl!). However, this shameful omission will hopefully remain unnoticed by all but the readers of this blog. Indeed, there is a massive haemorrhage of academic MatLab-based tools towards R. R is widely used in bioinformatics, and the continual incompatibilities between successive versions of MatLab make maintaining software based on that tool really hard. The difficulty of obtaining MatLab trial licences for courses does not help either: that was a decisive criterion for a massive recoding from MatLab to R at the EBI, so that we could properly run our in silico systems biology course. Anyway, I could learn R directly, and nobody would notice how clueless I was.

And then comes my second confession. Despite having spent 8+ years at the EBI, where R is used daily by battalions of bioinformaticians, and where Bioconductor was in part developed, I did not learn to use this magical tool until recently. I have more excuses here. R was developed primarily as a tool for statistics, and was not initially strong on the computational modelling side. For instance, despite the existence of two packages handling SBML in R, RSBML and SBMLR, very few of the models in BioModels Database, if any, were developed using R. But my group is now heavily involved in the Drug Disease Model Repository (DDMoRe) project of the Innovative Medicines Initiative (IMI). And several of DDMoRe’s tools will be based on R. Now comes the most embarrassing part of the confession. It turns out that my wife is an experimental biologist, doing a lot of measurements and statistics. And she followed an R course organised at the Babraham Institute (where incidentally I will move half of my group from October 2012). She gave me her course materials, and I spent a few evenings making more productive use of my time than usual (how much of the impetus was due to embarrassment will be left uninvestigated). By coincidence, I was reading at the time “Dynamic models in biology” by Ellner and Guckenheimer. While this excellent book was accompanied by exercises in MatLab, the authors later converted them to R, and produced a very useful “An introduction to R for dynamic models in biology”. That document, plus the material provided with the R package deSolve, convinced me that R was not only very powerful for everything we do in computational systems biology, it was also a lot of fun!

While I was happily rejuvenating the geek in me by writing scripts, I still wondered why there were not more GUIs to help people use the simulation capabilities of R. Or, taking the problem the other way around: why do people bother to implement the computing and statistical layers in end-user software? They could just concentrate on the user experience, and re-use R packages in the background. If those packages are not powerful, flexible or versatile enough, they could contribute to their development, and make the world better. Yes, simulations in R are maybe a bit slower than the equivalent methods implemented in dedicated software written in C. But the slow part of a simulation task often comes from the choice of the wrong numerical solver for the task at hand, or even from the dialogue between solver and GUI (I recently used a tool where I could *see* the curves being drawn when simulating the model of Tyson 1991 (BIOMD0000000005), while any decent simulator should provide the results instantly). Moreover, entire domains of research use MatLab (physics, engineering …). The optimisation argument does not hold for long. First, if we want perfect optimisation we should write assembly code directly, and second, how can computational biologists believe they are better at coding simulation tools than people who have been doing that in physics for the last 50 years or so?

To be honest, the time spent running simulations is a very small part of the activity of a typical student or post-doc in computational systems biology. Most of the time is spent developing the model, reading and data-mining. And a significant amount of time is spent (wasted?) developing Yet Another Simulation Software. This software is most often suboptimal and undocumented, and it dies when the main developer moves on. Even when the project is maintained, it is pretty hard to provide versions working on all operating systems, a problem that will grow worse as smartphones and tablets emerge as proper computing tools.

Several international projects have aimed at providing a reusable infrastructure for modeling and simulation in biology. Historical examples are the Systems Biology Workbench (whose side effect was the development of SBML, so in a sense it was the most influential software project of the last 15 years in biological modeling) and Bio-SPICE. A modern version is the GARUDA project. I really hope these projects decide to avoid re-inventing the wheel and use existing infrastructures. For all that involves calculations, I think it should be R. Don't you?
