SBGN-ML, SBML visual packages, CellDesigner format, what are they? When should I use them?

Visual representation of biochemical pathways has long been a key tool for understanding cellular and molecular systems. Any knowledge integration project involves a jigsaw puzzle step, where different pieces have to be put together. When Feynman cheekily wrote on his blackboard shortly before his death “What I cannot create I do not understand”, he meant that he only fully understood a system once he had derived a (mathematical) model of it. Interestingly, Feynman is also famous for one of the earliest standard graphical representations of reaction networks, namely the Feynman diagrams used to represent subatomic particle interactions. The earliest metabolic “map” I possess comes from the 3rd edition of “Outlines of Biochemistry” by Gortner, published in 1949. I would be happy to hear if you have older ones.

[Figure: metabolic map from the 3rd edition of Gortner’s “Outlines of Biochemistry” (1949)]

(I will let you find all the inconsistencies, confusions and error-generating features in this map. This might be food for another text, but I believe it is a great example to support the creation of standards, best practices and software tools!)

Until recently, those diagrams were mostly drawn by hand, initially on paper, then using drawing software. Not much thought was given to consistency, visual semantics or interoperability. This changed in the 1990s, as part of Systems Biology’s revival. The other thing that changed in the 1990s was the widespread use of computers and software tools to build and analyse models. The child of both trends was the development of standard computer-readable formats to represent biological networks.

When drawing a knowledge representation map, one can divide the decision-making process, and therefore the things we need to encode in order to share the map, into three parts:

What – How can people identify what I represent? A biochemical map is a network made up of nodes linked by arcs. The network may contain only one type of node, for instance a protein-protein interaction network or an influence network, or be a bipartite graph, like a reaction network – one type of node representing the pools involved in the reactions, the other representing the reactions themselves. One decision is the shape to use for each node, so that it carries visual information about the nature of what it represents. Another concerns the arcs linking the nodes, which can also carry visual clues, such as directionality, sign, type of influence etc. All this must be encoded in some way, either semantically (a code identifying the type of glyph, from an agreed-upon list of codes) or graphically (embedding an image or describing the node’s appearance).

Where – Once the glyphs are chosen, one needs to place them. The relative position of glyphs should not normally carry meaning, but there are some cases where it must, e.g. members of complexes, or inclusion in compartments. And there is no denying that the relative position of glyphs is also used to convey more subjective information. For instance, a linear chain of reactions conveys the idea of a flow much better than a set of reactions going randomly up and down, right and left. Another unwritten convention is to represent membrane signal transduction at the top of the map, with the “end result”, often an effect on gene expression, at the bottom, suggesting a cascading flux of information. The coordinates of the glyphs must then be shared as well.

How – Finally, the impact of a visual representation also depends on aesthetic factors. The relative size of glyphs and labels, the thickness of arcs, the colours, shades and textures all influence the ease with which viewers absorb the information contained in a map. Relying on such aspects to interpret the meaning of a map should be avoided, in particular if the map is to be shared between different media, where rendering could affect the final appearance. But wanting to keep this appearance as close as possible to the original makes sense.

A bit of history

Different formats have been developed over the years to cover these different aspects with different accuracy and constraints. In order to understand why we have such a variety of description formats on offer, a bit of history might be useful. Being able to encode graphical representations of models in SBML was mentioned as early as 2000 (Andrew Finney. Possible Extensions to the Systems Biology Markup Language. 27 November 2000).

In 2002, the group of Hiroaki Kitano presented a graphical editor for the Systems Biology Markup Language (SBML, Hucka et al 2003), called SBedit, and proposed extensions to SBML necessary for encoding maps (Tanimura et al. Proposal for SBEdit’s extension of SBML-Level-1. 8 July 2002). This software later became CellDesigner (Funahashi et al 2003), a full-featured model development environment using SBML as its native format. All graphical information is encoded in CellDesigner-specific annotations, using the SBML extension system. In addition to the layout (the where), CellDesigner proposed a set of standardised glyphs to use for representing different types of molecular entities and different relationships (the what) (Kitano 2003). At the same time, Herbert Sauro developed an extension to SBML to encode the maps designed in the software JDesigner (Herbert Sauro. JDesigner SBML Annotation. 8 January 2003). Both CellDesigner and JDesigner annotations could also encode the appearance of glyphs (the how).

In 2003, Gauges et al (Gauges et al. Including Layout information in SBML files. 13 May 2003) proposed to split the description of the layout (the where) and the rendering (the what and the how), and to focus on the layout part in SBML (Gauges et al 2006). Eventually, this effort led to the development of two SBML Level 3 Packages, Layout (Gauges et al 2015) and Render (Bergmann et al 2017).

Once the SBML Layout annotations were finalised, the SBML and BioPAX communities came together to standardise visual representations for biochemical pathways. This led to the Systems Biology Graphical Notation, a set of three standard graphical languages with agreed-upon symbols and rules to assemble them (the what, Le Novère et al 2009). While the shape of SBGN glyphs determines their meaning, neither their placement in the map nor their graphical attributes (colour, texture, edge thickness, the how) affect the map semantics. SBGN maps are ultimately images and can be exchanged as such, either as bitmaps or as vector graphics. They are also graphs, and can be exchanged using graph formats such as GraphML. However, it was felt that sharing and editing SBGN maps would be much easier if more semantics was encoded rather than graphical details. This led to the development of SBGN-ML (van Iersel et al 2012), which encodes not only the SBGN part of SBGN maps, but also the layout and size of graph elements.

So we have at least three solutions to encode biochemical maps using XML standards from the COMBINE community (Hucka et al 2015): 1) SBGN-ML, 2) SBML with Layout extension (controlled Layout annotations in Level 2 and the Layout package in Level 3) and 3) SBML with proprietary extensions. Regarding the latter, we will only consider CellDesigner, for two reasons. Firstly, CellDesigner is the most used graphical model designer in systems biology (at the time of writing, the articles describing the software have been cited over 1000 times). Secondly, CellDesigner’s SBML extensions are used in other software tools. These solutions are not equivalent; they present different advantages and disadvantages, and round-tripping is in general not possible.

SBGN-ML

Curiously, despite its name, SBGN-ML does not explicitly describe the SBGN part of the maps (the what). Since the shape of each node is standardised, it is only necessary to mention its type, and any supporting software will know which symbol to use. For instance, SBGN-ML will not specify that a protein X must be represented by a round-corner rectangle. It will only say that there is a macromolecule X at a certain position, with a given width and height. Any SBGN-supporting software must know that a macromolecule is represented by a round-corner rectangle. The consequence is that SBGN-ML cannot be used to encode maps using non-SBGN symbols. However, a software tool could decide to use a different symbol for a given class of SBGN objects when rendering the map. Instead of using a round-corner rectangle each time the class of a glyph is macromolecule, it could use a star. The resulting image would not be an SBGN map. But if modified, and saved back in SBGN-ML, it could be recognised by another supporting tool. Such behaviour is not to be encouraged if we want people to get used to SBGN symbols, but it provides a certain level of interoperability.

What is explicitly described in SBGN-ML instead are the parts that are not regulated by SBGN itself, but are specific to the map. That includes the size of the glyphs (bounding boxes), the textual labels, and the positions of glyphs (the where). SBGN-ML currently does not encode rendering properties such as text size, colours and textures (the how). But the language provides an extension element, analogous to the SBML annotation, which allows the language to be augmented. One can use this element to extend each glyph, or to encode styles, and the community has started to do so in an agreed-upon manner.
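To make this concrete, here is a minimal sketch of how a single macromolecule might be written in SBGN-ML (the identifiers, coordinates and namespace version are invented for the example; the schema published with the specification is the authority on the exact syntax):

```xml
<sbgn xmlns="http://sbgn.org/libsbgn/0.2">
  <map language="process description">
    <!-- class says what the node is; bbox says where it sits and how big it is.
         The round-corner rectangle itself is never described: it follows from
         class="macromolecule". -->
    <glyph id="glyph_X" class="macromolecule">
      <label text="X"/>
      <bbox x="120" y="80" w="100" h="40"/>
    </glyph>
  </map>
</sbgn>
```

An extension element could be added inside the glyph to carry, for instance, rendering styles agreed upon by the community.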

Note that SBGN-ML only encodes the graph. While there is a certain amount of biological semantics, linked to the identity of the glyphs, it is not a general-purpose format that would encode advanced semantics of regulatory features, as BioPAX does (Demir et al 2010), or mathematical relationships, as SBML does. However, users can distribute SBML files alongside SBGN-ML files, for instance in a COMBINE Archive (Bergmann et al 2014). Unfortunately, there is currently no blessed way to map an SBML element, such as a particular species, to a given SBGN-ML glyph.
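As an illustration of the archive approach, the manifest of such a COMBINE archive could look roughly like this (the file names are hypothetical, and the format URIs should be checked against the current COMBINE specifications):

```xml
<omexManifest xmlns="http://identifiers.org/combine.specifications/omex-manifest">
  <!-- the same pathway shipped twice: the mathematical model and its visual map -->
  <content location="./model.xml"    format="http://identifiers.org/combine.specifications/sbml"/>
  <content location="./map.sbgn"     format="http://identifiers.org/combine.specifications/sbgn"/>
  <content location="./manifest.xml" format="http://identifiers.org/combine.specifications/omex-manifest"/>
</omexManifest>
```

Even packaged together like this, nothing inside the two files states which glyph corresponds to which species, which is the missing mapping mentioned above.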

SBML Level 3 + Layout and Render packages

As mentioned before, SBML Level 3 provides two packages helping with the visual representation of networks: Layout (the where) and Render (the how). Contrary to SBGN-ML, which is meant to describe maps in a standard graphical notation, the SBML Level 3 packages do not restrict the way one represents biochemical networks. This provides more flexibility to the user, but decreases the “stand-alone” semantic content of the representations. That is, if non-standard symbols are used, their meaning must be defined in an external legend. It is of course possible to use only SBGN glyphs to encode maps. The visual rendering of such a file will be SBGN, but automatic analysis of the underlying format will be harder.

The SBML Layout package allows one to encode the position of objects, points, curves and bounding boxes. Curves can have complex shapes, encoded as Bézier curves. The package distinguishes between different general types of nodes, such as compartments, molecular species, reactions and text. However, there is little biological semantics encoded by these structures, either regarding the nodes (e.g. nothing distinguishes a simple chemical from a protein) or the edges (one cannot distinguish an inhibition from a stimulation). In addition, the SBML Render package allows one to define styles that can be applied to types of glyphs. This includes colours and gradients, geometric shapes, properties of text, lines, line endings etc. Render can encode a wide variety of graphical properties, and bridges the gap to generic graphical formats such as SVG.
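To give an idea of the structure, here is a simplified sketch of a Layout fragment describing a single species glyph (the prefix, identifiers and coordinates are illustrative, and several required attributes and elements are omitted; the Layout specification gives the authoritative syntax). A Render style would be attached in a similar way, referring to glyphs by identifier or by type.

```xml
<layout:layout layout:id="layout_1">
  <layout:listOfSpeciesGlyphs>
    <!-- a glyph for species X: only its position and size are given;
         nothing here says whether X is a protein or a simple chemical -->
    <layout:speciesGlyph layout:id="glyph_X" layout:species="X">
      <layout:boundingBox>
        <layout:position layout:x="120" layout:y="80"/>
        <layout:dimensions layout:width="100" layout:height="40"/>
      </layout:boundingBox>
    </layout:speciesGlyph>
  </layout:listOfSpeciesGlyphs>
</layout:layout>
```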

If we are trying to visualise a model, one advantage of using the SBML packages is that all the information is included in a single file, providing an easy mapping between the model constructs and their representation. This goes a long way towards solving the issue of biological semantics mentioned above, since it can be retrieved from the SBML Core elements linked to the Layout elements. Note that while SBML Layout+Render do not encode the nature of the objects represented by the glyphs (the what) using specific structures, this can be retrieved via the sboTerm attributes of the corresponding SBML Core elements, using the appropriate values from the Systems Biology Ontology (Courtot et al 2011).
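In practice, the species glyph above points to an SBML Core species, whose sboTerm can carry the missing what, for instance (the SBO identifier shown, “polypeptide chain”, is given as an example and should be checked against the ontology):

```xml
<!-- SBML Core: the sboTerm states that X is a polypeptide chain, i.e. a macromolecule,
     which a renderer can translate into the corresponding SBGN symbol -->
<species id="X" compartment="cytosol" sboTerm="SBO:0000252"
         hasOnlySubstanceUnits="false" boundaryCondition="false" constant="false"/>
```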

CellDesigner notation

CellDesigner uses SBML (currently Level 2) as its native language. However, it extends it with its own proprietary annotations, keeping the SBML perfectly valid (which is also the way software tools such as JDesigner operate). Visually, the CellDesigner notation is close to SBGN Process Descriptions, having been the strongest inspiration for the community effort. CellDesigner also offers an SBGN-View mode, which produces maps closer to pure SBGN PD.

CellDesigner’s SBML extensions increase the semantics of SBML elements such as molecular species or regulatory arcs, in a way not dissimilar to SBGN-ML. In addition, they provide a description of each glyph linked to the SBML elements, covering the same ground as SBML Layout and Render. Being specific to CellDesigner, these extensions do not offer the flexibility of SBML Render. However, the more limited spectrum of possibilities might make support easier.
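For comparison, here is a heavily simplified sketch of the kind of annotation CellDesigner embeds in a species (element names are reproduced from memory and abridged, not a verbatim excerpt; real files carry much more detail, and the glyph’s position and size live in a separate speciesAlias element in the model-level annotation):

```xml
<species id="X" name="X" compartment="cytosol">
  <annotation>
    <celldesigner:extension xmlns:celldesigner="http://www.sbml.org/2001/ns/celldesigner">
      <!-- the semantic part: X is to be drawn with the CellDesigner protein glyph -->
      <celldesigner:speciesIdentity>
        <celldesigner:class>PROTEIN</celldesigner:class>
        <celldesigner:proteinReference>pr_X</celldesigner:proteinReference>
      </celldesigner:speciesIdentity>
    </celldesigner:extension>
  </annotation>
</species>
```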

In summary:

                                       | CellDesigner notation | SBML Layout+Render   | SBGN-ML
Encode the what                        | Yes                   | Partly (via sboTerm) | Yes
Encode the where                       | Yes                   | Yes                  | Yes
Encode the how                         | Yes                   | Yes                  | Not yet
Contain the mathematical model part    | Yes                   | Yes                  | No
Writing supported by more than 1 tool  | Partly                | Yes                  | Yes
Reading supported by more than 1 tool  | Yes                   | Yes                  | Yes
Is a community standard                | No                    | Yes                  | Yes

Examples of usages and conversions

Now let’s see the three formats in action. We start with SBGN-ML. First, we can load a model, for instance from BioModels (Chelliah et al 2015), into CellDesigner (version 4.4 at the time of writing). Here we will use the model BIOMD0000000010, an SBML version of the MAP kinase model described by Kholodenko (2000).

[Screenshot: BIOMD0000000010 auto-laid-out in CellDesigner]

From an SBML file that does not contain any visual representation, CellDesigner created one using its auto-layout functions. One can then export an SBGN-ML file. This SBGN-ML file can be imported, for instance, into Cytoscape 2.8 (Shannon et al 2003) using the CySBGN plugin (Gonçalves et al 2013).

[Screenshot: the same map imported into Cytoscape via CySBGN]

The position and size of the nodes are conserved, but the edges have a different size (and the catalysis glyph is wrong). The same SBGN-ML file can be opened in the online SBGN editor Newt.

[Screenshot: the same SBGN-ML map opened in Newt]

An alternative to CellDesigner to produce the SBGN-ML map could be Vanted (Junker et al 2006, version 2.6.4 at the time of writing). Using the same model from BioModels, we can auto-layout the map (we used the organic layout here) and then convert the graph to SBGN using the SBGN-ED plugin (Czauderna et al 2010).

[Screenshot: the map laid out in VANTED and converted to SBGN with SBGN-ED]

The map can then be saved as SBGN-ML, and as before opened in Newt.

[Screenshot: the VANTED-produced SBGN-ML map opened in Newt]

The positions of the nodes are conserved, but the connection of the edges is a bit different. In this case, Newt is slightly more SBGN-compliant.

Now, let’s start with a vanilla SBML file. We can import our BIOMD0000000010 model into COPASI (Hoops et al 2006, version 4.22 at the time of writing). COPASI now offers auto-layout capabilities, with the possibility of manually editing the resulting maps.

[Screenshot: the model auto-laid-out in COPASI]

Now, when we export the model to SBML, it will contain the map encoded with the Layout and Render packages. When the model is loaded into any software tool supporting these packages, we will retrieve the map. For instance, we can use the SBML Layout Viewer. Note that while the layout is conserved, the rendering is not.

[Screenshot: the exported SBML Layout displayed in the SBML Layout Viewer]

Alternatively, we can load the model into CellDesigner and manually generate a nice map (NB: a CellDesigner plugin that can read SBML Layout was implemented during Google Summer of Code 2014; it is part of the JSBML project).

[Screenshot: the map manually drawn in CellDesigner]

We can create an SBML Layout using CellDesigner’s layout converter. When we import the model into COPASI, we can visualise the map encoded in Layout. NB: the difference in appearance here is due to a problem in CellDesigner’s converter, not in COPASI.

[Screenshot: the converted layout displayed in COPASI]

The same model can be loaded in the SBML Layout Viewer.

[Screenshot: the same model displayed in the SBML Layout Viewer]

How do I choose between the formats?

There is unfortunately no unique solution at the moment. The main question one has to ask is what do we want to do with the visual maps?

Are they meant to be a visual representation of an underlying model, the model being the important part that needs to be exchanged? If that is the case, the SBML packages or the CellDesigner notation should be used.

Does the project mostly or only involve graphical representations, and must those be exchanged? CellDesigner or SBGN-ML would then be better.

Does the rendering of graphical elements matter? In that case, the SBML packages or the CellDesigner notation are currently better (but that is going to change soon).

Is standardisation important for the project, in addition to immediate interoperability? If yes, SBML packages or SBGN-ML would be the way to go.

All those questions and more have to be clearly spelled out at the beginning of a project. The choice of format will then quickly emerge from the answers.

Acknowledgements

Thanks to Frank Bergmann, Andreas Dräger, Akira Funahashi, Sarah Keating and Herbert Sauro for help and corrections.

References

Bergmann FT, Adams R, Moodie S, Cooper J, Glont M, Golebiewski M, Hucka M, Laibe C, Miller AK, Nickerson DP, Olivier BG, Rodriguez N, Sauro HM, Scharm M, Soiland-Reyes S, Waltemath D, Yvon F, Le Novère N (2014) COMBINE archive and OMEX format: one file to share all information to reproduce a modeling project. BMC Bioinformatics 15, 369. doi:10.1186/s12859-014-0369-z

Bergmann FT, Keating SM, Gauges R, Sahle S, Wengler K (2017) Render, Version 1 Release 1. Available from COMBINE <http://identifiers.org/combine.specifications/sbml.level-3.version-1.render.version-1.release-1>

Chelliah V, Juty N, Ajmera I, Raza A, Dumousseau M, Glont M, Hucka M, Jalowicki G, Keating S, Knight-Schrijver V, Lloret-Villas A, Natarajan K, Pettit J-B, Rodriguez N, Schubert M, Wimalaratne S, Zhou Y, Hermjakob H, Le Novère N, Laibe C (2015)  BioModels: ten year anniversary. Nucleic Acids Res 43(D1), D542-D548. doi:10.1093/nar/gku1181

Courtot M, Juty N, Knüpfer C, Waltemath D, Zhukova A, Dräger A, Dumontier M, Finney A, Golebiewski M, Hastings J, Hoops S, Keating S, Kell DB, Kerrien S, Lawson J, Lister A, Lu J, Machne R, Mendes P, Pocock M, Rodriguez N, Villeger A, Wilkinson DJ, Wimalaratne S, Laibe C, Hucka M, Le Novère N (2011) Controlled vocabularies and semantics in Systems Biology. Mol Syst Biol 7, 543. doi:10.1038/msb.2011.77

Czauderna T, Klukas C, Schreiber F (2010) Editing, validating and translating of SBGN maps. Bioinformatics 26(18), 2340-2341. doi:10.1093/bioinformatics/btq407

Demir E, Cary MP, Paley S, Fukuda K, Lemer C, Vastrik I, Wu G, D’Eustachio P, Schaefer C, Luciano J, Schacherer F, Martinez-Flores I, Hu Z, Jimenez-Jacinto V, Joshi-Tope G, Kandasamy K, Lopez-Fuentes AC, Mi H, Pichler E, Rodchenkov I, Splendiani A, Tkachev S, Zucker J, Gopinathrao G, Rajasimha H, Ramakrishnan R, Shah I, Syed M, Anwar N, Babur O, Blinov M, Brauner E, Corwin D, Donaldson S, Gibbons F, Goldberg R, Hornbeck P, Luna A, Murray-Rust P, Neumann E, Ruebenacker O, Samwald M, van Iersel M, Wimalaratne S, Allen K, Braun B, Carrillo M, Cheung KH, Dahlquist K, Finney A, Gillespie M, Glass E, Gong L, Haw R, Honig M, Hubaut O, Kane D, Krupa S, Kutmon M, Leonard J, Marks D, Merberg D, Petri V, Pico A, Ravenscroft D, Ren L, Shah N, Sunshine M, Tang R, Whaley R, Letovksy S, Buetow KH, Rzhetsky A, Schachter V, Sobral BS, Dogrusoz U, McWeeney S, Aladjem M, Birney E, Collado-Vides J, Goto S, Hucka M, Le Novère N, Maltsev N, Pandey A, Thomas P, Wingender E, Karp PD, Sander C, Bader GD  (2010) The BioPAX Community Standard for Pathway Data Sharing. Nat Biotechnol, 28, 935–942. doi:10.1038/nbt.1666

Funahashi A, Morohashi M, Kitano H, Tanimura N (2003) CellDesigner: a process diagram editor for gene-regulatory and biochemical networks. Biosilico 1 (5), 159-162

Gauges R, Rost U, Sahle S, Wegner K (2006) A model diagram layout extension for SBML. Bioinformatics 22(15), 1879-1885. doi:10.1093/bioinformatics/btl195

Gauges R, Rost U, Sahle S, Wengler K, Bergmann FT (2015) The Systems Biology Markup Language (SBML) Level 3 Package: Layout, Version 1 Core. J Integr Bioinform 12(2), 267. doi:10.2390/biecoll-jib-2015-267

Gonçalves E, van Iersel M, Saez-Rodriguez J (2013) CySBGN: A Cytoscape plug-in to integrate SBGN maps. BMC Bioinfo 14, 17. doi:10.1186/1471-2105-14-17

Hoops S, Sahle S, Gauges R, Lee C, Pahle J, Simus N, Singhal M, Xu L, Mendes P, Kummer U (2006) COPASI-a COmplex PAthway SImulator. Bioinformatics 22(24), 3067-3074. doi:10.1093/bioinformatics/btl485

Hucka M, Bolouri H, Finney A, Sauro HM, Doyle JC, Kitano H, Arkin AP, Bornstein BJ, Bray D, Cornish-Bowden A, Cuellar AA, Dronov S, Ginkel M, Gor V, Goryanin II, Hedley WJ, Hodgman TC, Hunter PJ, Juty NS, Kasberger JL, Kremling A, Kummer U, Le Novère N, Loew LM, Lucio D, Mendes P, Mjolsness ED, Nakayama Y, Nelson MR, Nielsen PF, Sakurada T, Schaff JC, Shapiro BE, Shimizu TS, Spence HD, Stelling J, Takahashi K, Tomita M, Wagner J, Wang J (2003) The Systems Biology Markup Language (SBML): A Medium for Representation and Exchange of Biochemical Network Models. Bioinformatics, 19, 524-531. doi:10.1093/bioinformatics/btg015

Hucka M, Nickerson DP, Bader G, Bergmann FT, Cooper J, Demir E, Garny A, Golebiewski M, Myers CJ, Schreiber F, Waltemath D, Le Novère N (2015) Promoting coordinated development of community-based information standards for modeling in biology: the COMBINE initiative. Frontiers Bioeng Biotechnol 3, 19. doi:10.3389/fbioe.2015.00019

Junker BH, Klukas C, Schreiber F (2006) VANTED: A system for advanced data analysis and visualization in the context of biological networks. BMC Bioinfo 7, 109. doi:10.1186/1471-2105-7-109

Kholodenko BN (2000) Negative feedback and ultrasensitivity can bring about oscillations in the mitogen-activated protein kinase cascades. Eur J Biochem.267(6), 1583-1588. doi:10.1046/j.1432-1327.2000.01197.x

Kitano H (2003) A graphical notation for biochemical networks. Biosilico 1 (5), 169-176. doi:10.1016/S1478-5382(03)02380-1

Le Novère N, Hucka M, Mi H, Moodie S, Schreiber F, Sorokin A, Demir E, Wegner K, Aladjem M, Wimalaratne S, Bergmann FT, Gauges R, Ghazal P, Kawaji H, Li L, Matsuoka Y, Villéger A, Boyd SE, Calzone L, Courtot M, Dogrusoz U, Freeman T, Funahashi A, Ghosh S, Jouraku A, Kim S, Kolpakov F, Luna A, Sahle S, Schmidt E, Watterson S, Goryanin I, Kell DB, Sander C, Sauro H, Snoep JL, Kohn K, Kitano H (2009) The Systems Biology Graphical Notation. Nat Biotechnol 27, 735-741. doi:10.1038/nbt.1558

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13, 2498-2504. doi:10.1101/gr.1239303

van Iersel MP, Villéger AC, Czauderna T, Boyd SE, Bergmann FT, Luna A, Demir E, Sorokin A, Dogrusoz U, Matsuoka Y, Funahashi A, Aladjem MI, Mi H, Moodie SL, Kitano H, Le Novère N, Schreiber F (2012) Software support for SBGN maps: SBGN-ML and LibSBGN. Bioinformatics 28, 2016-2021. doi:10.1093/bioinformatics/bts270


Setting things straight

If you are reading that, you are probably aware that I have been convicted and sentenced for a crime related to accessing indecent images on the internet. Following the sentence, there was quite a lot of media exposure, in particular follow-up articles in the Cambridge News and the BBC. These reports were quite misleading, misusing dates, and choosing words in order to maximise the shock effect.

I hoped that in time those reports would wane, but this was a pipe dream. I also cannot expect people to forget what they read. Although I cannot fight the free press, I can communicate on my own. This post aims at giving precise details on the what, when and how, and at setting things straight.

First of all, I want to assure any reader that I am not seeking to minimise my crime, or trying to find excuses for it. I made mistakes, horrible ones, and I am paying for them. A side effect of the story is that it made me think deeply about what those crimes are, why and how they are committed.

Back to the facts. In August 2017, the police seized 14 electronic devices from my home, including computers, phones and hard drives. In May 2018, I was summoned for an interview under caution at the Peterborough police station. The police gave me back 13 of the devices, which were found clean.

On an old Iomega hard disk, which I used as a home backup until a few years ago, they found two indecent moving images. The images were in the cache of a peer-to-peer software called Limewire. They were not readable, and were identified through digital forensic investigation. I was therefore charged with making images but not with possession (i.e. I did not have these images and could not give them to anyone). Note that of course making an image is not at all the same as taking a picture. In the present case, it means creating a computer file of an image.

The police determined that the two files were opened only once, at the date of creation in May 2008, and never again. This was the charge for which I was sentenced. The police told me at the time that the offence was “at the lowest end of the scale”, and that I would probably only get a fine.

Now, during the interview under caution in May 2018, I was asked when I started using the Limewire software. I answered “in the early 2000s, for downloading music”. I have since determined that I started using Limewire sometime in 2005 (in fact, I only had broadband at home from the end of 2003).

During the interview, the police also asked me about my use of chatrooms and forums. At one point, I was asked when I last saw an indecent image in those chatrooms. I answered “about a year ago”. Now, the specific chatrooms we were talking about were not illegal, or dedicated to the kind of material leading to the interview. And one does not choose what to see in forums and chatrooms, which is why people are not prosecuted for that (*).

These facts were then transformed through a game of Chinese whispers. The police tape (which was never heard after the interview) led to a written report transmitted to the prosecution, which ordered a pre-sentencing report. This pre-sentencing report was sent back to the prosecution (different judges ordered the report and sentenced me). And finally, all that was described by the judge in front of journalists who took hand-written notes.

As a consequence, the facts described above became, in the news reports, downloading child pornography between 2000 and 2016, and having two indecent moving images. Again, I am not trying to minimise my crime or to find excuses for it. But both these statements were untrue. And I believe the portrait those articles painted of me instilled a climate of shock and fear that is not foreign to some reactions I am, and my family is, still suffering from.

I was sentenced in Cambridge crown court on 4th September. The judge was severe, and the sentence unexpectedly strong, taking my solicitor, barrister, the police, the probation services, and of course myself, by surprise. I was sentenced to 8 months custody, suspended for 2 years. During this period, I will be monitored by the probation services. I also have to complete a 30-day rehabilitation activity. This sentence will be spent after 4 years and 8 months.

In parallel to the sentence, I have to be on the sex offender registry for 10 years. In brief, this means that the police need to know where I am at any time, plus a few other things. I also received a Sexual Harm Prevention Order, which forbids me to hide or delete my internet history and allows the police to monitor my electronic devices at any time without a warrant. However, this SHPO does not apply to electronic devices provided by employers.

Finally, I was fired from my job at the Babraham Institute and expelled from any board, panel, collaboration, community I was part of.

So, I made mistakes, and I am paying for them, dearly. I need to atone, and I will. There is no need to inflate my errors, and mislead people around me into fearing a monstrous predator.

(*) It is estimated that half a million adult UK males have viewed such material in the past, which points to a societal problem that is large and complex, and is unlikely to be solved by casting out a few individuals. In other words, if you know 60 adult males, then it is very likely that one of them has accessed this type of content. If you work in an institute employing 180 adult males, then 3 of them could have done so. And the proportion increases with the use of the internet.

So, is it rare? “No”. Is it “normal”? No! Should the society deal with the problem? Yes. Is there redemption for the 1.7% of the population? I would hope so.

The Stop It Now campaign from the Lucy Faithfull Foundation is a good source of information and help. If you are accessing this type of material, or thinking about it, you can contact them.

Tips for writing senior grant applications involving systems modelling

As I get older, I am asked to sit on more senior grant panels. Larger projects generally mean more diverse research approaches. I am also asked to sit on multi-disciplinary panels, where the projects must feature different approaches, or even different fields. Finally, those panels are often made up of more “mature” scientists. As a result of all that:

1) Panel members have less expertise, if any, in some of the fields covered by the applications. Even in their own field, they mostly master only part of the landscape. They left university decades ago, and keep abreast of developments by reading the scientific literature. Since there are only 24 h in a day, we tend to focus our reading largely on what directly impacts our own research. When it comes to techniques, most senior panel members left actual experiments/programming/equations behind quite a while ago, and although they have an “academic” knowledge of the methods, they might not know the state of the art, or master the subtleties and pitfalls of given techniques.

2) Panel members/reviewers have a shorter attention span. This is not only because of age (although, let’s face it, there is some decrease in focus and stamina), but also because senior scientists have more commitments and are always chasing deadlines (there is also the increasing number of projects to evaluate; in my latest panel, I was in charge of 25 projects). Finally, there might also be a bit of arrogance and a “cannot be bothered” attitude.

Applicants for large grants should take these facts into consideration. One area where it is particularly important is Systems Biology. I am not going to dwell on what Systems Biology is, but a subset of it includes the development of mathematical models and numerical simulations to reproduce the behaviour of biological systems. This is an area where I have encountered systematic strategic errors in the way grant applications are written, errors that decreased their chances of being funded compared with projects in other areas. Below are a few pieces of advice that could allow senior panels to appreciate your projects better. This is only my opinion, and I might be wrong about the impact of those mistakes. Also, some of this advice seems obvious. But nevertheless, it is not systematically followed.

What is the question?

Building models is great. Models are integrators of knowledge, and building a model that can reproduce the known behaviours of a system is the best way to see if we understand it. Models can also suggest new avenues of research or treatment. But models must fulfill a purpose. This purpose does not necessarily have to be part of the submitted project itself, but it must be mentioned there. Why are you building the model? What questions do you want to answer with the model simulations and analysis? Note that this is not specific to computational modelling. I have seen the rejection of projects describing very complex experimental techniques whose use was not justified. Technical activity must be commensurate with the expected benefits.

Have a colleague read your proposal, and ask them afterwards “can you tell me what we are going to do with this model?” (Do not ask the question before, while handing them the application. Your colleague is clever and friendly. They will find hints here and there, add their own ideas and put something together.)

Clarity of the research plan

I cannot count the number of times when, after a lengthy introduction and a detailed experimental part, a project ends up with “and we will build a model”, or even worse, the dreadful “we will use a systems biology approach”. WTF is a “systems biology approach”? I have called myself a systems biologist for the best part of two decades and I have not a clue. It does not mean anything, or it means too many things.

Explain what you are going to do and how. Which kind of modeling approach will you use? Why? Is this the best modeling approach considering the data you have and the questions you ask? If building a model, how will you build it? What will be the variables (e.g. which molecular species will be represented)? How will you relate them? Will you use numerical or logic approaches? Will you incorporate uncertainty? Will you study steady-states or kinetics? Will you use deterministic or stochastic simulations? Which software will you use? How are you going to analyse the data? How are you going to link the model to experimental evidence? Do you have a plan for parameterisation?

Don’t overdo it. One does not describe generic molecular biology kits or culture media in senior grant applications (except if this is at the core of the project), so we do not need to describe technical details that will not affect the results and their interpretation. But give enough details to convince the reviewers or panel members who might actually know what all that is about. An experimental plan that did not specify the organism or cell line to be used would have almost no chance of getting through. The same goes for modelling! And actually, you also need to add enough explanation to allow non-specialists to understand what you are going to do. For instance, many people are baffled by genome-scale constraint-based modelling of metabolism, confusing flux balance analysis and metabolic flux analysis, and therefore misunderstanding the (absent) role of metabolite concentrations. They also confuse them with ODE models, concluding that they are too big to be parameterised.

Have a colleague read your proposal, and ask them afterwards “Can you tell me which modelling method will be used and why it is the best for this project?”

Provide preliminary data

Almost any experimental project comes packed with preliminary data that show 1) why the investigated question is interesting, and 2) why the work plan is feasible. It is no different with modelling. Why would people believe you? A past track record on other projects is not sufficient. At best it can show that you can do modelling. Good. But this is not enough. Remember, this blog post is particularly focused on large projects, with multiple lines of investigation and requesting large amounts of money. Often these projects have been written over many months or even a year. I know a few such projects for which the production of preliminary data required another, smaller, dedicated grant. So there is no excuse not to spend a sufficient effort benchmarking the modelling approaches you will use, and getting preliminary results, hopefully exciting and justifying a scale-up.

Describe the validation steps

This is a very important part, even more important than for experimental projects. A modeller cannot say “let the data talk”. Any number of models can lead to reasonable simulation results. That does not mean the model is the correct one, or that the results mean anything. You must convince the panel that you have a plan to check that your results are valid and are not a bunch of random numbers or graphs. How will you validate the results of your simulations and analysis? How will this validation feed back into your model design? How precisely will your predictions lead to new experiments? Who will do the validation? Where? When?

Have a colleague read your proposal, and ask them afterwards “Can you draw me a workflow of the modelling part of this project, and identify the points of contact with the rest of the project?”

These were just some tips I came up with. Do you disagree with them? Would you add others?

Do paper citations and indices correlate?

Evaluating the impact of research activity is a complex issue that is guaranteed to stir hot debates whatever the audience or the context. The way evaluators have access to research output is mostly via publications, whatever the type – articles, books, conference proceedings, technical reports etc. It is important to note that in many fields of research, publications are not (or should not be) themselves the output of research. In particular, in natural sciences and mathematics, the output of science comes under many guises such as theorems, software, datasets, techniques, chemical compounds and materials, patents etc., the publications being only a report on the research activity and its outcome. Nevertheless, as a consequence, a large part of the evaluation relies on the publications. The most obvious way to do so is by reading the publications, the so-called peer review. This is what is done before a manuscript is accepted for publication in scientific journals (and increasingly now after it is published). However, to assess funding applications, project achievements, and individual and institutional performance, most evaluations rely in part on the analysis of publication impact.

[Small digression. Let’s be clear about something. Everyone claiming that peer review of papers is being used to evaluate funding applications, individuals for positions or promotions, or institute performance, is either a hypocrite or has never been part of such an evaluation committee. This never happens, for two reasons, one negative and one positive. The first one is that nobody would have time to perform such an exercise. Members of evaluation panels are often senior researchers, chosen for their recognised track records. They lead research groups and are completely over-committed. Reading a paper seriously, understanding its content and its novelty, takes a significant amount of time. The notion that we read dozens of papers from dozens of scientists for a given panel is just a fantasy. The second, positive, reason is that members of evaluation committees have very limited collective expertise. For instance, I am part of a committee covering the totality of the research spectrum. In this committee, there are only a handful of people covering the entirety of the life sciences! It is VERY FORTUNATE that we are not actually judging the papers ourselves!]

To improve on arbitrary judgments based on unconscious bias triggered by journal names, and to complement evaluation by external reviewers, people try to use quantitative metrics, developed by the field called bibliometrics, and in particular citation analysis. For instance, the UK Research Excellence Framework (REF) provides guidance for the use of citation data. It is important to note that those metrics are not sufficient, and the REF is actively assessing how best to use them.

In the field of natural sciences, a variety of citation metrics are used to evaluate the impact of articles, individuals and institutes, including citation counts, h-indices and impact factors (yes, this is very wrong, impact factors are meant for journals, not for papers and authors). Recently, a new metric has been proposed to assess the impact of a given article, the Relative Citation Ratio.

Scientists are inherently navel-gazing (or maybe it is just me), and I was curious to see how all these correlated for me. So I collated my bibliometrics data using Google Scholar. First, let’s look at the classic measurements. If I plot the citations of each paper versus the impact factor of the journal it was published in for the year of its publication, the correlation is not overwhelming …

[Figure: citations per paper versus journal impact factor]

The paper describing SBML is clearly an outlier and makes it hard to judge the rest of the plot, so let’s discard it for the time being (yes, I should also discard the outliers in the other direction, but hey, this is a blog post, not a research paper …)

[Figure: citations per paper versus journal impact factor, without the SBML paper]

Now the correlation is clear, but still not overwhelming. The correlation seems to disappear for the highest impact factors, above 18. However, there is an obvious correction to apply to citation counts: recent papers are less cited than old papers. Because I am now a senior scientist, I tend to publish a bit more in journals with high impact factors. Examples are papers reporting the results of large collaborations, and invited reviews. So we need to correct for paper age by dividing the counts by the number of years elapsed since publication.

[Figure: citations per year versus journal impact factor]

Indeed, the correlation is clearer. But there is still a lot of noise. I would not say that choosing a higher impact factor is a foolproof way of getting more citations. And I would certainly not say that a paper in a high impact factor journal necessarily has a big impact!

Let’s now turn to the Relative Citation Ratio. How does it compare to the impact factor?

[Figure: Relative Citation Ratio versus journal impact factor]

Well, the correlation is quasi-identical to the one with the average citations per year. Which of course leads us to the main comparison, which is between the RCR and the citation counts.

[Figure: Relative Citation Ratio versus citation count]

The correlation is much better. The outlier with 37 citations and an RCR of 0 is actually an artifact of Google Scholar. Of course, the RCR offers more than just an improved citation count. For instance, it also compares a paper’s impact to the impact of all papers reporting research funded by the NIH. A problem of the current tool, though, is that its citation data come from the Web of Science databases. Those databases do not contain all the scientific journals. They do not record citations in books. And of course they are not open. The RCR is a neat tool, but considering the strong correlation with pure citations, at least in my case, I think citation counts are a good, easy-to-use proxy for impact.

All that focused on article-by-article impact. But would total citations be a good proxy to evaluate individual researchers? Continuing the navel-gazing exercise, I extracted the data for people in my institute who set up a Google Scholar profile. I omitted the PhD students, because their publication records and citations are too noisy. I divided the positions into department heads, tenured group leaders, tenure-track group leaders (5-year positions, most often a first experience as group leader), senior research associates (indefinite contracts but not group leaders) and post-doctoral fellows.

[Figure: h-index versus total citations for researchers in my institute, by position]

The correlation between total citations and h-index is quite impressive. This is probably due to the fact that we do not have distortions due to anomalous papers (e.g. BLAST or Clustal in bioinformatics). The occasional highly cited papers (e.g. SBML in my case) are just averaged out. And what comes out clearly is that in the majority of cases, positions match publication impact. Are total citations or the h-index the better predictor? We can plot the rank in both classifications.

[Figure: rank by total citations versus rank by h-index, by position]

The h-index seems to correlate a bit better with tenured, SRA and tenure-track positions. The separation between tenure track and post-docs is blurrier, because some post-docs are quite senior and have impressive CVs. But overall, the separations are quite clear. And so is the message. In my institute, there is little hope of becoming tenure track with fewer than 1000 citations and a single-digit h-index. For tenure, the bar would be close to 3000 citations and an h-index in the mid-teens. When it comes to department heads, we are talking 10000 citations and an h-index of 50.

Now, all that is of course very focused on my field of research. Molecular, cellular and systems biology is a very peculiar community. The publication habits, the criteria of excellence, everything is very homogeneous, almost military. It is also a fairly inward-looking community. Not only is there very little contact with other sciences, there is also very little contact with the other components of the life sciences. A fair number of its members are actually convinced that all scientists in all fields think and act alike. They would be surprised, and dismayed, to witness what I once saw at a conference: German computer science students impersonating us, exchanging pompous sentences about journal articles, impact factors and citations. They had the time of their life. Very humbling.

All that to say that everything in this blog post should be taken with more than a grain of salt.

Selection panels, follow the “rule of thirds”

Over the past decade or so, I have been part of quite a few grant panels, for national and international funding agencies. Each time, the funding agency felt compelled to reinvent all the procedures from scratch, and to ignore whatever experience had been gained from thousands of such exercises in the past. The reason is always to be more efficient, and to serve science by selecting the best projects, with the highest likelihood of impact. Invariably, the system put in place achieves the exact opposite.

One aspect in particular severely impacts the resulting outcome: the success rate. The success rate varies widely from one funding scheme to another. The most highly sought-after funding sources, such as grants from the Human Frontier Science Program, barely reach a success rate of a few per cent. For such competitive schemes, multi-stage selection systems are often put in place, with only a fraction of the projects going from one stage to the next. Now the interesting – and slightly depressing – facts are:

  • the fraction of projects moving from one stage to another seems completely random, and disconnected from the final success rate;
  • the length of the documentation required in the application is disconnected from the number of stages and success rate;
  • the number of panel members looking at the documentation also varies seemingly in a random fashion, and is certainly not related to the number and size of proposals.

One of the worst examples I saw of that situation was the selection of Horizon 2020 collaborative projects in 2015, where a first step selected 30% of the projects, and a second step selected 5% of those. Such fractions were wrong not only because they were unequal, with the selection pressure increasing at each stage, but because they were the wrong way around: 70% of scientists wrote small documents without success at the first stage, and 95% of scientists wrote very long applications without success at the second stage. An even worse example is the ERC Advanced Grants, where all applicants are asked for both a short and a long project description. But the panel selects the projects during step 1 using only the short description! Since only 1/3 of projects are sent for external review, 2/3 of the applicants wrote a long application that will never be read by anyone!!!

What are the consequences:

  1. frustration and demoralisation of scientists, compounded by the research not done while they were working on the grant applications;
  2. increased workload on panel members, who had to read and evaluate a lot of documentation, 95% for nothing;
  3. enormous waste of taxpayers money on both sides of the fence;
  4. funded projects that are almost certainly not the best.

Waste of money

So, what does such a process cost? Let’s look at the panel side. Evaluating a 3-6 pages document, that outlines a project, takes maybe one hour per project. Let’s assume a project was read by two panel members. The H2020 call I was talking about above had 355 applications, of which 108 were selected for the second stage, 5 of them being funded. So, we are talking 710 hours of reading for the first stage. To which we need to add the panel meeting. We’ll assume a panel of 10 members, meeting once for 2 days (8 hours a day) plus travel (~10 h return). So 260 more hours. Total is 970 hours. This represents GBP 48500 (I took a very average salary for a PI, costing their institution GBP 50 per hour). To which we need to add travelling, accommodation and catering costs, about 5000 (again super conservative). Of these 53500, 35700 are wasted on failed applications.

A complete application of 50-100 pages would require half a day (4 hours), hence 864 hours of reading for the lot, plus the panel meeting. Total is 1124, that is GBP 56200 plus 5000 of meeting. But … hold on, I forgot someone! For this second stage, the opinion of external experts will be sought. Now, I am not going to overestimate their amount of work. They are assumed to spend half a day on each proposal. But I will only count 2 hours. And each proposal is evaluated by 3 experts. So total is 93600, of which 89300 are wasted on failed applications.

But those are the costs on the panel side, the ones directly supported by some funding agencies (most do not fund reviewers’ time and some do not support panel members’ time).

Now, on the applicant side, the one superbly ignored by the funding agencies … For the first stage, I will assume 10 people are involved, most spending a day and 2 of them a week (coordinator, grant officer). They also have a meeting to which 7 people travel and spend a night. The total spent for the 355 projects is 4 million, of which 2.8 million are spent on failed applications. For the second stage, more people are involved, spending more time, let’s say 12 people spending a week on the project, and 3 spending 3 weeks. 10 people travel to a preparatory meeting. We are talking about a total expense of 9.2 million, of which 5 million are wasted on failed applications. The funding bodies could not care less about this money. They do not pay for this side of the process. The institutions of the applicants (and therefore other funding bodies) do.

Adding panel and applicant spending, 9.4 MILLION pounds of taxpayers’ money have been spent on failed applications! Now the interesting fact is that this particular call had a total budget of 30 million Euros, that is a bit more than 20 million GBP. In other words, to distribute 2 of their pounds, the taxpayer spent another pound! ONE THIRD of this public money was spent without any scientific research being done.

Random selection of projects

Now, that’s for the efficiency. Let’s move to the efficacy. Surely this very expensive process selected the best possible scientific projects? Being super selective means only the “crème de la crème” are selected? Not at all! That is a misunderstanding of how grant applications are selected.

1) Within a panel, grant applications are distributed to a few of the panel members, sometimes called “introducing members”. This is generally (but not always) based on the expertise of those members, who can then evaluate the proposal and select suitable external reviewers. These introducing members have an enormous power. They are generally the only ones reading an application attentively enough to detect flaws. They give the initial score to a project, which will decide how it is discussed in the panel meeting. Panel members have different habits when scoring projects. Some will provide a Gaussian distribution of scores. Some will only give the highest scores to projects they want to discuss, and the lowest to the ones they do not like. This will affect the global score, drawn from the combination of scores from various introducing members.

Introducing members defend or destroy the application during the meeting. If the introducing member is negative, you’re doomed. If the introducing member is an expert in your field, you’re doomed. If the panel member is a shy individual, you’re doomed. If the panel member cannot be bothered or was depressed, you’re doomed. If the introducing member is not an expert but saw an interesting talk in the domain a couple of weeks ago, you’re saved. If the panel member has a big voice, you’re saved. If the panel member is competitive and wants “his” projects to be funded so he beats the other panel members, you’re saved. So there is an enormous bias towards boasting, competitive, vocal introducing members.

As with every process in the universe, the noise (the non-scientific component of the selection) increases as the square root of the signal (the proportion of projects funded). If only very few projects are selected among plenty, the effect of the introducing members on the whole selection will be proportionally bigger (although for any given project, it does not matter).

[Figure: visual rendering of the selection of 30, 10, 3 and 1% of proposals]

2) Discussing a lot of projects during a panel meeting leads to temporal bias. We are more lenient at the beginning of the day, and more severe towards the end of the day. Not only do we get tired, nervous and dehydrated, we also tend to wield an axe rather than clippers to separate the good from the bad. While we find excuses and side interests for a lot of projects at the beginning of the day, the slightest error or clumsy statement is damning when we reach tea time. Now, the more projects there are, the less likely it is that they will be discussed several times in a day, and therefore the more sensitive the process will be to the panel’s physiology.

3) Recognising excellence from a grant application is not that easy. And the excellence of the projects is in general not linear. Many projects will be totally rubbish (oh come on! I am not the only one who has been on a grant panel, am I?). But many will be excellent as well. With a few in between. Imagine a “sigmoid curve”. Selecting between the very best projects is very difficult. One needs more information to distinguish between close competitors (green box), while we do not need much to eliminate the hopeless ones (red box).

[Figure: distribution of project excellence, with the hopeless projects (red box) and the close competitors (green box)]

So, how do we fix this?

A proposal: remember the rule of thirds

This idea is based on the way we actually rank proposals. Whatever selection I have to make among competitive pieces, I make three piles: NO, YES, MAYBE. The NO pile is made up of projects I think should be rejected no matter what. The YES pile is made up of projects that I would be proud to have proposed. They’re excellent, and they should be funded. The MAYBE pile is … well, maybe. We need more discussion, it depends on the funding etc. Because each project is read by several reviewers/panel members, there will be variation in scoring. But this noise should happen at the edge of the groups. One should then discuss the bottom of the YES pile, and the top of the MAYBE pile (see blue box on the excellence plot).

[Figure: sorting proposals into NO, MAYBE and YES piles]

So, choosing which projects to fund should obey the rule of thirds: accept at least a third of them. If there is not enough money for at least 1/3, then a 2-stage process must be organised. If the money is too short to fund 1/9 of them, then a 3-stage process must be organised, etc. At each stage, three equal piles are drawn: YES, NO, MAYBE.

[Figure: multi-level sorting, keeping one third of the proposals at each stage]

The first stage should be strategic. For instance, each project is only described in a one-page document. The panel chair and co-chair select, from ALL proposals, the ones that are suitable for the call. That way, since they see all proposals, they can balance topics, senior vs junior, gender etc. according to the policies of the funding bodies. This can be done very quickly, in a few days of intense work.

The second stage involves panel members. A project description must then include the science, track records etc. Each panel member has several projects, each project is evaluated by several members. Each member must have a significant share. That should be done fairly quickly since the descriptions are short, and no external opinion is sought.

The final stage involves external scientists. Only then does one require the full project descriptions.

Note that the pile is the same height at each step: the fewer proposals, the longer the descriptions.

How does the progressive selection look?
[Figure: cascade of selections, keeping one third of the proposals at each successive stage]


What are the costs for the EU call we used as example before?

Panel side: The first step is done on 1 page. It involves 10 min each by the chair and co-chair. So we are talking 119 hours of reading for the first stage. There is no panel meeting. The total expense is then ~5900.

The second stage is equivalent to the first stage previously. Evaluating a 3-6 page document that outlines a project takes maybe one hour per project. Let’s assume a project is read by two panel members. One third of the 355 projects are evaluated, that is 119. So we are talking 238 hours of reading for this stage. To which we need to add the panel meeting. We’ll assume a panel of 10 members, meeting once for 2 days (8 hours a day) plus travel (~10 h return). So 260 more hours. Total is 498 hours. This represents GBP 24900. To which we need to add travelling, accommodation and catering costs, about 5000 (again super conservative).

The complete application still requires half a day (4 hours) per reader. But we have only 40 of them, read by two panel members each, hence 320 hours of reading, plus the panel meeting (260 hours). The total is 580 hours, that is GBP 29000, plus 5000 of meeting costs. For this third stage, the opinion of external experts is also sought. As before, I assume they spend 2 hours per proposal, and each proposal is evaluated by 3 experts, so 240 more hours, that is GBP 12000.

The total for the panel side is 5900 + 29900 + 46000, that is GBP 81800, not a huge saving compared with the previous situation (still one year of PhD salary …).
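For the record, here is a small sketch that reproduces the panel-side arithmetic above. The hourly rate of GBP 50 for scientists’ time and the 260-hour panel meeting are my reading of the figures quoted in this post, so treat the script as illustrative rather than definitive:

```python
RATE = 50                          # GBP per hour of scientist time (implied by the totals above)
MEETING_HOURS = 10 * (2 * 8 + 10)  # panel of 10: 2 days of 8 h plus ~10 h travel each = 260 h
MEETING_COSTS = 5000               # travel, accommodation and catering per meeting (GBP)

# Stage 1: chair and co-chair read a 1-page outline, 10 minutes each, no meeting
stage1 = 355 * 2 * (10 / 60) * RATE                              # ~GBP 5900

# Stage 2: 119 short proposals, 1 h each, read by 2 panel members, plus the meeting
stage2 = (119 * 1 * 2 + MEETING_HOURS) * RATE + MEETING_COSTS    # 498 h -> GBP 29900

# Stage 3: 40 full applications, 4 h each, 2 panel members, the meeting,
# and 3 external experts spending 2 h per proposal
stage3_hours = 40 * 4 * 2 + MEETING_HOURS + 40 * 2 * 3           # 580 + 240 = 820 h
stage3 = stage3_hours * RATE + MEETING_COSTS                     # GBP 46000

print(round(stage1 + stage2 + stage3))                           # ~81800
```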

Now, on the applicant side, this is a completely different story. For the first stage, only one person is involved, the coordinator, spending 1 day; the total is therefore GBP 142000 for the 355 projects. The second stage is now what was previously the first stage, except only 119 projects are involved; the total spent is GBP 900600. The third stage is now like the previous second stage, except only 40 projects are evaluated; the total is GBP 1920000.

The total for the applicant side is therefore 142000 + 900600 + 1920000 = GBP 2962600.

Adding panel and applicant spending, only about GBP 3 million (2962600 + 81800) of taxpayers’ money has been spent, a saving of roughly two thirds!
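The applicant side and the grand total follow the same pattern; a short sketch, assuming a person-day of GBP 400 (8 hours at the same rate) and reusing the stage totals quoted above:

```python
DAY = 8 * 50      # GBP 400 per person-day, i.e. 8 h at the GBP 50/h rate assumed above

applicant_stage1 = 355 * 1 * DAY   # one coordinator-day per outline -> GBP 142000
applicant_stage2 = 900_600         # 119 short proposals (figure quoted above)
applicant_stage3 = 1_920_000       # 40 full applications (figure quoted above)

applicant_total = applicant_stage1 + applicant_stage2 + applicant_stage3
print(applicant_total)             # 2962600
print(applicant_total + 81_800)    # ~3 million GBP for panel plus applicants
```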

Now, should it have stopped there? No. This process was still not good, because only 12.5% of the projects were selected during the last round. An even better process would have added yet another layer of selection. The second layer would have involved the panel members, but without a meeting. The third layer (panel members and extended discussions during a meeting) would have selected 14 projects. The last exercise, involving external reviewers, would have selected 5 amongst those. Only 42 reviewers would be needed (14 × 3) instead of a whopping 350 or so.

The process would perhaps be a bit longer (“perhaps”, because most of the time lost in these processes is NOT due to the evaluation, but to the administrative treatment of applications and unnecessary delays between the different stages). But so much effort, money and anxiety would be saved! And so much more time for scientists to do research!

What to do and not to do in advanced modelling courses

I previously introduced our in silico systems biology course. After 5 years of running this course, I have collected a few lessons that are probably applicable to any advanced course. Nothing very new or surprising, but worth keeping in mind when organising these teaching events.

Select the students well

Beware of wrong expectations, and of students who do not find what they thought they would. Disappointed students can wreck the atmosphere of a course. Beware that terminologies differ between domains. One of the most overloaded terms is “model”: 3D structure model, hidden Markov model, general linear model, chemical kinetics model, all of those are models, but they address different populations. “Systems Biology” itself is problematic. Also choose the level of the course and stick to it when selecting the students. Even if there are fewer applicants than expected (fortunately not a problem for our in silico systems biology course anymore), do not be tempted to select inadequate candidates. Better to take on fewer students than to have a few students bored or unable to follow. Our course is advanced, and covers quite a lot of ground. We cannot expect all students to be experts in every aspect of the course. However, by selecting students who are skilled in at least one aspect of the course (and balancing those areas of expertise), we liven up the lessons (more interesting questions and discussions) and the students themselves become “associated trainers”.

More hands-on, practicals, tutorials

Students learn with their fingers. A demo will never replace an actual hands-on session, where the students make the mistakes and fix them (with the help of trainers). And of course, keep the lecturers from diving into their own research and giving scientific presentations. This is a course, not a conference. If needed, organise special scientific presentations a few times during the course, but not within the lessons.

Focus on concrete applications of tools

Avoid lengthy descriptions of the theoretical basis of algorithms. It is good that students learn what is under the bonnet and can choose between solutions. But (in general) they are here to learn how to use those tools for their research, not to develop the next generation of them. Two complementary approaches are 1) building toy examples that illustrate specific uses, and 2) using famous simple examples from the literature.

Do not try to cram too much in the course

It is better to explain a typical set of techniques well than to cover the whole field inadequately. It is generally not possible to present all the approaches used in a field of computational biology; even a seasoned researcher in the field does not master all of them. Introduce the common basics very carefully, and then move on to a few examples of more advanced approaches. If the basics are well understood, and the students really use the content of the course for their research, they will be able to continue training on their own.

Engage the students

It is very important that the students feel part of the course. These events last only one or two weeks. The students need to bond with the organisers, the trainers and each other immediately. Make them present their work on the first day, maybe with one slide each. Organise poster sessions. Real poster sessions, where students are kept around the posters. Drinks and snacks are a good method if they are located in the same place and keep the students there. If you selected the students wisely (see the first point), they should be interested in each other’s research.

Try to keep trainers around

So they can interact with students outside of their presentations/tutorials. This is very difficult. You chose the best trainers, so they are obviously very busy people. But sometimes it is better to choose the better trainer over the better scientist. Also, select your trainers even more carefully than your students. You want good presenters, but also good interactors. Bad trainers will arrive just before their course, spend the coffee breaks reading their mail, and leave just after. Those people do not like teaching, and frankly they don’t deserve your students. Do not hesitate to replace them, even if they are famous. Observe them outside the classroom too. This is very sad to say, but some trainers cannot behave when interacting with young adults.

These are only a few pieces of advice. I am sure there are plenty of others. What are your experiences?

“What is systems biology” – the students talk

This year was the 5th instalment of our Wellcome-Trust / EMBL-EBI course “in silico systems biology“.

This course finds its origin, a few years ago, in a workshop of the EBI industry programme on “Pathways and models”. The workshop, which lasted 2 days, was praised by the attendees. However, the time limitation caused a bit of frustration and made us skip entire aspects we would have liked to cover. I therefore decided to try turning it into a full-blown course with the help of Vicky Schneider, then responsible for training at the EBI.

The first course, supported by EMBO, lasted 4 days. It was well received. However, we tried to cover too much, from functional genomics and network reconstruction to quantitative modelling of biological processes. Fortunately, the existence of another EBI course, “Networks and pathways“, allowed us to focus on modelling. We progressively improved the programme through 1 FEBS course and 3 Wellcome-Trust advanced courses. Without boasting, the current course, co-organised with Julio Saez-Rodriguez and Laura Emery, has come close to perfection. The programme always evolves, but the changes have slowed down with time, and we are now more in an optimisation/refinement phase. One of the big advantages is that we kept a core of trainers, who help improve the consistency and quality of the content. We are now happy to see our first generations of students having become active figures in systems biology. Some group leaders who attended the course in the past now send their own students every year. A forthcoming post will discuss a few things I learnt from organising those courses.

Besides the regular training, we always have a few group activities. This year, the students were split into small groups at the beginning and had to answer a few questions. One of them was …

What is systems biology?

Everyone has their own idea about that one, including myself (for more on the history, nature and challenges of systems biology). Here I provide you with the unfiltered and unclustered responses of 25 students (repetitions originate from different groups coming up with the same answers):

  • Mechanisms on different levels
  • Wholistic view (tautology intended)
  • Dynamics of biological systems
  • Fun
  • Mathematical modeling
  • Insight to the systems
  • Predictions
  • Looking at the system as a whole and not per component
  • Should also be: formal, unambiguous
  • Holistic approach
  • Using modelling to answer biological questions
  • understanding dynamics of a system in terms of predictability
  • Mechanistic insight
  • A tool to complement experimental data
  • Experiments-modeling cycle leading to discovery
  • formalisms
  • Technology+bio data+ in silico
  • integrating levels of biological processes
  • reaching the experimentally unapproachable

Interesting, isn’t it? At first it looks pretty much all over the place. Let me re-order the answers and group them:

  1. Entire systems
    • Wholistic view (tautology intended)
    • Looking at the system as a whole and not per component
    • Holistic approach
  2. Mechanisms
    • Insight to the systems
    • Mechanistic insight
    • Mechanisms on different levels
    • integrating levels of biological processes
  3. Dynamics
    • Dynamics of biological systems
    • understanding dynamics of a system in terms of predictability
  4. Modeling
    • Mathematical modeling
    • Should also be: formal, unambiguous
    • formalisms
    • Using modelling to answer biological questions
  5. Complement the observation
    • A tool to complement experimental data
    • reaching the experimentally unapproachable
    • Experiments-modeling cycle leading to discovery
    • Predictions
    • Technology+bio data+ in silico
  6. And of course
    • Fun

We basically fall back on the two global positions in the field: a philosophical statement about the life sciences (1, 2, 3), and a set of techniques (4, 5). That reminds me a lot of the discussions we had about molecular biology at university a few decades ago …