Extending SBML using the annotation element

One of the frequent complaints I hear from end-users (modelers) about SBML is that the language does not provide structures to encode all types of models, or even all kinds of data. This is partially true. Indeed, SBML does not provide specific structures (elements or attributes) to encode everything one could want to store during a modeling and simulation activity. How could it? However, SBML provides a generic construct that allows almost arbitrary extensions. This is the annotation element, which can be added to all SBML classes inherited from SBase (which means most of the SBML elements). In an annotation element, one can put any XML content, as long as there is only one top-level element in a given namespace.


<annotation xmlns:ns1="http://www.namespace1.org" >
<ns1:elementA>
<ns1:elementA1 attribute="foo" />
</ns1:elementA>
<elementB xmlns="http://mynamespaces.net/namespace2">
<elementB1 attributeC1="value" />
<elementB2 attributeC2="anotherValue" />
</elementB>
</annotation>

In the example above, the namespace of the first extension (namespace1) is declared in an attribute of the annotation element itself, so all its subelements must carry the prefix ns1. In contrast, the namespace of the other extension, namespace2, is declared as the default namespace on the relevant top-level element, and all its children are automatically in that namespace. One specific type of SBML annotation is described in the SBML specification. This controlled annotation can be used to fulfil the requirements of MIRIAM. Other annotations are not standard; they are proprietary to a given software, and are used for instance to encode information not (yet) part of SBML. We will describe a few of them below. Annotations are a great mechanism to try out proposed extensions of the language.
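To see that the two declaration styles are equivalent for a parser, here is a minimal sketch using Python's standard xml.etree.ElementTree module. It is only an illustration: the helper names namespace_of and one_top_element_per_namespace are made up for this example, and the annotation is parsed on its own, outside any SBML document.

```python
import xml.etree.ElementTree as ET

# Copy of the annotation shown above.
ANNOTATION = """
<annotation xmlns:ns1="http://www.namespace1.org">
  <ns1:elementA>
    <ns1:elementA1 attribute="foo" />
  </ns1:elementA>
  <elementB xmlns="http://mynamespaces.net/namespace2">
    <elementB1 attributeC1="value" />
    <elementB2 attributeC2="anotherValue" />
  </elementB>
</annotation>
"""


def namespace_of(element):
    """Return the namespace URI of an element ('' if it has none)."""
    # ElementTree expands every qualified name to the form "{uri}localname".
    if element.tag.startswith("{"):
        return element.tag[1:].split("}", 1)[0]
    return ""


def one_top_element_per_namespace(annotation_root):
    """Check that no namespace is used by two top-level children (hypothetical helper)."""
    uris = [namespace_of(child) for child in annotation_root]
    return len(uris) == len(set(uris))


root = ET.fromstring(ANNOTATION)

# Prefixed (ns1) and default-namespace declarations both resolve to plain URIs.
top_level = [namespace_of(child) for child in root]
print(top_level)

# Children of elementB inherit its default namespace without any prefix.
child_namespaces = {namespace_of(child) for child in root[1]}
print(child_namespaces)

print(one_top_element_per_namespace(root))
```

Whether a prefix or a default namespace is used is therefore a matter of taste for the writer; the reader only sees namespace URIs.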

Controlled annotations

SBML provides a set of controlled annotations, based on other XML vocabularies such as the Resource Description Framework (RDF), vCard, the Dublin Core Metadata and BioModels qualifiers. SBML controlled annotations are used to store two types of information. 1) Clerical information about the model generation, such as who created or modified a model element and when.


<rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:vCard="http://www.w3.org/2001/vcard-rdf/3.0#">
<rdf:Description rdf:about="#metaid_0000001">
<dcterms:contributor rdf:parseType="Resource">
<vCard:N rdf:parseType="Resource">
<vCard:Family>Le Novère</vCard:Family>
<vCard:Given>Nicolas</vCard:Given>
</vCard:N>
<vCard:EMAIL>lenov@ebi.ac.uk</vCard:EMAIL>
<vCard:ORG>
<vCard:Orgname>EMBL-EBI</vCard:Orgname>
</vCard:ORG>
</dcterms:contributor>
<dcterms:created rdf:parseType="Resource">
<dcterms:W3CDTF>2005-05-23T17:11:24</dcterms:W3CDTF>
</dcterms:created>
<dcterms:modified rdf:parseType="Resource">
<dcterms:W3CDTF>2005-05-23T23:11:45</dcterms:W3CDTF>
</dcterms:modified>
</rdf:Description>
</rdf:RDF>

The attribute “rdf:about” on the element rdf:Description points to the metaid of the containing SBML element. The Dublin Core elements contributor, created and modified record who created the containing SBML element, when it was created, and when it was last modified.
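A tool reading such a history does not need an RDF library for simple cases. The sketch below uses Python's standard xml.etree.ElementTree module on a shortened copy of the annotation above; the function name read_history and the keys of the returned dictionary are assumptions made for this illustration, not part of any SBML library.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "vCard": "http://www.w3.org/2001/vcard-rdf/3.0#",
}

# Shortened copy of the model history shown above.
HISTORY_RDF = """
<rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:vCard="http://www.w3.org/2001/vcard-rdf/3.0#">
  <rdf:Description rdf:about="#metaid_0000001">
    <dcterms:contributor rdf:parseType="Resource">
      <vCard:N rdf:parseType="Resource">
        <vCard:Family>Le Novère</vCard:Family>
        <vCard:Given>Nicolas</vCard:Given>
      </vCard:N>
    </dcterms:contributor>
    <dcterms:created rdf:parseType="Resource">
      <dcterms:W3CDTF>2005-05-23T17:11:24</dcterms:W3CDTF>
    </dcterms:created>
    <dcterms:modified rdf:parseType="Resource">
      <dcterms:W3CDTF>2005-05-23T23:11:45</dcterms:W3CDTF>
    </dcterms:modified>
  </rdf:Description>
</rdf:RDF>
"""


def read_history(rdf_xml):
    """Extract who/when information from a history annotation (hypothetical helper)."""
    root = ET.fromstring(rdf_xml)
    description = root.find("rdf:Description", NS)
    # rdf:about holds "#" followed by the metaid of the annotated SBML element.
    about = description.get("{%s}about" % NS["rdf"])
    return {
        "metaid": about[1:] if about.startswith("#") else about,
        "family": description.findtext(
            "dcterms:contributor/vCard:N/vCard:Family", namespaces=NS),
        "given": description.findtext(
            "dcterms:contributor/vCard:N/vCard:Given", namespaces=NS),
        "created": description.findtext(
            "dcterms:created/dcterms:W3CDTF", namespaces=NS),
        "modified": description.findtext(
            "dcterms:modified/dcterms:W3CDTF", namespaces=NS),
    }


history = read_history(HISTORY_RDF)
print(history)
```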

2) Cross-references to external resources, such as entries in databases or terms of controlled vocabularies.


<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
<rdf:Description rdf:about="#_274092">
<bqbiol:hasPart>
<rdf:Bag>
<rdf:li rdf:resource="http://identifiers.org/uniprot/P62158"/>
<rdf:li rdf:resource="http://identifiers.org/obo.chebi/CHEBI:29108"/>
</rdf:Bag>
</bqbiol:hasPart>
</rdf:Description>
</rdf:RDF>

This annotation describes the fact that both calmodulin (UniProt P62158) and calcium ion (ChEBI 29108) are part of the biological entity represented by the annotated SBML element. Those cross-references had a tremendous effect on the way people use SBML encoded files. It is not too much of an exaggeration to say that a subfield of computational systems biology was made possible thanks to them, dealing with the automatic processing of SBML encoded pathways and models. One can see for instance the software suite SBMLsemantics, which allows one to annotate, compare and merge models. Another use of those cross-references is to provide additional information for converting SBML into other formats. See for instance SBML2BioPAX.
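Automatic processing of this kind starts with pulling the qualifiers and resource URIs out of the annotation. The following sketch does so with Python's standard xml.etree.ElementTree module. It works on a well-formed copy of the annotation, with the ChEBI reference written as http://identifiers.org/obo.chebi/CHEBI:29108 (the exact URI form is an assumption here); the function name cross_references is invented for the example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BQBIOL_NS = "http://biomodels.net/biology-qualifiers/"

# Well-formed copy of the cross-reference annotation (URI forms are assumed).
XREF_RDF = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
  <rdf:Description rdf:about="#_274092">
    <bqbiol:hasPart>
      <rdf:Bag>
        <rdf:li rdf:resource="http://identifiers.org/uniprot/P62158"/>
        <rdf:li rdf:resource="http://identifiers.org/obo.chebi/CHEBI:29108"/>
      </rdf:Bag>
    </bqbiol:hasPart>
  </rdf:Description>
</rdf:RDF>
"""


def cross_references(rdf_xml):
    """Map each biology qualifier (e.g. 'hasPart') to its list of resource URIs."""
    root = ET.fromstring(rdf_xml)
    refs = {}
    for description in root.findall("{%s}Description" % RDF_NS):
        for qualifier in description:
            # Keep only elements from the BioModels biology-qualifiers namespace.
            if not qualifier.tag.startswith("{%s}" % BQBIOL_NS):
                continue
            name = qualifier.tag.split("}", 1)[1]
            for item in qualifier.iter("{%s}li" % RDF_NS):
                uri = item.get("{%s}resource" % RDF_NS)
                refs.setdefault(name, []).append(uri)
    return refs


refs = cross_references(XREF_RDF)
for qualifier_name, uris in refs.items():
    for uri in uris:
        # The last path segment of the URI is the database identifier.
        print(qualifier_name, uri.rsplit("/", 1)[1])
```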

Proprietary annotations

While the section above dealt with controlled annotations, following a syntax described in the SBML specification, the power of SBML annotations is by no means limited to them. Annotation elements can be used for instance to encode aspects of the model that are not yet supported by SBML. The spatial simulator MesoRD was one of the first tools to make full use of them in that respect. The MesoRD “format” is valid SBML, and all models developed in MesoRD can be imported into other SBML-supporting tools such as COPASI. However, only well-stirred biochemistry will then be simulated. The information describing the spatial component of the model is stored in proprietary annotations. The following annotations (taken from the model MODEL5974712823 from BioModels Database, encoding the model of Fange et al 2006) describe the creation of a compartment “cytosol” whose capsule shape is the union of a cylinder and two spheres, and specify the diffusion constants of a molecule in the cytosol and the plasma membrane.


<compartment metaid="_303076" id="cytosol">
<annotation xmlns:MesoRD="http://www.icm.uu.se" xmlns:jd="http://www.sys-bio.org/sbml">
<MesoRD:union>
<MesoRD:cylinder MesoRD:height="3.5" MesoRD:radius="0.5" MesoRD:units="um"/>
<MesoRD:translation MesoRD:units="um" MesoRD:x="0.00" MesoRD:y="-1.75" MesoRD:z="0">
<MesoRD:sphere MesoRD:radius="0.5" MesoRD:units="um"/>
</MesoRD:translation>
<MesoRD:translation MesoRD:units="um" MesoRD:x="0.00" MesoRD:y="1.75" MesoRD:z="0">
<MesoRD:sphere MesoRD:radius="0.5" MesoRD:units="um"/>
</MesoRD:translation>
</MesoRD:union>
</annotation>
</compartment>
<!-- -->
<species metaid="_303121" id="D1" name="D" compartment="cytosol" initialAmount="0" substanceUnits="item" hasOnlySubstanceUnits="true">
<annotation xmlns:MesoRD="http://www.icm.uu.se" xmlns:jd="http://www.sys-bio.org/sbml">
<MesoRD:diffusion MesoRD:compartment="cytosol" MesoRD:rate="0.0" MesoRD:units="cm2ps"/>
<MesoRD:diffusion MesoRD:compartment="membrane" MesoRD:rate="2.5e-8" MesoRD:units="cm2ps"/>
</annotation>
</species>
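Since these annotations are plain namespaced XML, any tool can read them, not only the one that wrote them. As a minimal sketch (standard-library Python; the function name diffusion_rates is invented for the example, and the species is parsed in isolation rather than inside a full SBML document), the diffusion constants can be collected like this. Note that the attributes themselves are namespace-qualified, so they must be looked up with the namespace URI.

```python
import xml.etree.ElementTree as ET

MESORD_NS = "http://www.icm.uu.se"

# Copy of the species element shown above.
SPECIES = """
<species metaid="_303121" id="D1" name="D" compartment="cytosol"
         initialAmount="0" substanceUnits="item" hasOnlySubstanceUnits="true">
  <annotation xmlns:MesoRD="http://www.icm.uu.se"
              xmlns:jd="http://www.sys-bio.org/sbml">
    <MesoRD:diffusion MesoRD:compartment="cytosol" MesoRD:rate="0.0"
                      MesoRD:units="cm2ps"/>
    <MesoRD:diffusion MesoRD:compartment="membrane" MesoRD:rate="2.5e-8"
                      MesoRD:units="cm2ps"/>
  </annotation>
</species>
"""


def diffusion_rates(species_xml):
    """Return {compartment: (rate, units)} from the MesoRD diffusion annotations."""
    species = ET.fromstring(species_xml)
    rates = {}
    for diffusion in species.iter("{%s}diffusion" % MESORD_NS):
        # Prefixed attributes are stored under "{uri}name" keys.
        compartment = diffusion.get("{%s}compartment" % MESORD_NS)
        rate = float(diffusion.get("{%s}rate" % MESORD_NS))
        units = diffusion.get("{%s}units" % MESORD_NS)
        rates[compartment] = (rate, units)
    return rates


rates = diffusion_rates(SPECIES)
print(rates)
```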

Another area where annotations have been used extensively is the encoding of graphical representations of the biochemical networks corresponding to the models. Early in the development of SBML, the software JDesigner, developed by Herbert Sauro, was a precursor in the domain. The following annotations (taken from the model BIOMD0000000328 from BioModels Database, encoding the model of Bucher et al 2011) describe a compartment “medium”, with its size, its position on the canvas and various graphical characteristics. Note that the namespace is declared in the main sbml element rather than the annotation element.


<sbml xmlns="http://www.sbml.org/sbml/level2/version4" xmlns:jd2="http://www.sys-bio.org/sbml/jd2" level="2" version="4">
<!-- [...] -->
<jd2:compartment id="medium" size="2" visible="true">
<jd2:boundingBox h="266" w="1010" x="196" y="318"/>
<jd2:membraneStyle color="FFFFA500" thickness="12"/>
<jd2:interiorStyle color="FFFFEEEE"/>
<jd2:text value="medium" visible="true">
<jd2:position rx="14" ry="48"/>
<jd2:font fontColor="FF000000" fontName="Arial" fontSize="8"/>
</jd2:text>
</jd2:compartment>

However, what really demonstrated the power of SBML annotations for extending the language was the CellDesigner annotation. Its aim is very similar to that of JDesigner: to encode the graphical representation of a model encoded in SBML (earlier versions of CellDesigner were called SBedit). The following annotations (taken from the model BIOMD0000000220 from BioModels Database, encoding the model of Albeck et al 2008) describe the representation of a complex species PARP_C3. Firstly, in the SBML model element, CellDesigner annotations mention that both proteins PARP and C3 are part of the complex PARP_C3. Then they encode the graphical representation of the complex, and finally associate this complex with the SBML species representing it.


<celldesigner:species id="s57" name="PARP">
<celldesigner:annotation>
<celldesigner:complexSpecies>PARP_C3</celldesigner:complexSpecies>
<celldesigner:speciesIdentity>
<celldesigner:class>PROTEIN</celldesigner:class>
<celldesigner:proteinReference>pr19</celldesigner:proteinReference>
</celldesigner:speciesIdentity>
</celldesigner:annotation>
</celldesigner:species>
<celldesigner:species id="s58" name="C3">
<celldesigner:annotation>
<celldesigner:complexSpecies>PARP_C3</celldesigner:complexSpecies>
<celldesigner:speciesIdentity>
<celldesigner:class>PROTEIN</celldesigner:class>
<celldesigner:proteinReference>pr11</celldesigner:proteinReference>
</celldesigner:speciesIdentity>
</celldesigner:annotation>
</celldesigner:species>
<celldesigner:complexSpeciesAlias id="csa13" species="PARP_C3">
<celldesigner:activity>inactive</celldesigner:activity>
<celldesigner:bounds h="120.0" w="100.0" x="359.0" y="1421.0"/>
<celldesigner:view state="usual"/>
<celldesigner:backupSize h="0.0" w="0.0"/>
<celldesigner:backupView state="none"/>
<celldesigner:usualView>
<celldesigner:innerPosition x="0.0" y="0.0"/>
<celldesigner:boxSize height="120.0" width="100.0"/>
<celldesigner:singleLine width="2.0"/>
<celldesigner:paint color="fff7f7f7" scheme="Color"/>
</celldesigner:usualView>
<celldesigner:briefView>
<celldesigner:innerPosition x="0.0" y="0.0"/>
<celldesigner:boxSize height="60.0" width="80.0"/>
<celldesigner:singleLine width="2.0"/>
<celldesigner:paint color="fff7f7f7" scheme="Color"/>
</celldesigner:briefView>
</celldesigner:complexSpeciesAlias>

<species metaid="metaid_0000109" id="PARP_C3" name="PARP:C3" compartment="cell" initialAmount="0" charge="0">
<annotation>
<celldesigner:positionToCompartment>inside</celldesigner:positionToCompartment>
<celldesigner:speciesIdentity>
<celldesigner:class>COMPLEX</celldesigner:class>
<celldesigner:name>PARP:C3</celldesigner:name>
</celldesigner:speciesIdentity>
</annotation>
</species>
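Third-party support for such annotations mostly amounts to walking the namespaced elements. The sketch below (standard-library Python) recovers which proteins belong to which complex from a shortened copy of the excerpt above. Two things are assumptions of this example and not taken from the excerpt: the namespace URI bound to the celldesigner prefix, and the wrapping element named excerpt, which is there only so that the fragment parses on its own. The function name complex_members is likewise invented.

```python
import xml.etree.ElementTree as ET

# Assumed URI for the "celldesigner" prefix (not shown in the excerpt above).
CD_NS = "http://www.sbml.org/2001/ns/celldesigner"

# Shortened copy of the excerpt, wrapped in a hypothetical <excerpt> element.
EXCERPT = """
<excerpt xmlns:celldesigner="http://www.sbml.org/2001/ns/celldesigner">
  <celldesigner:species id="s57" name="PARP">
    <celldesigner:annotation>
      <celldesigner:complexSpecies>PARP_C3</celldesigner:complexSpecies>
      <celldesigner:speciesIdentity>
        <celldesigner:class>PROTEIN</celldesigner:class>
        <celldesigner:proteinReference>pr19</celldesigner:proteinReference>
      </celldesigner:speciesIdentity>
    </celldesigner:annotation>
  </celldesigner:species>
  <celldesigner:species id="s58" name="C3">
    <celldesigner:annotation>
      <celldesigner:complexSpecies>PARP_C3</celldesigner:complexSpecies>
      <celldesigner:speciesIdentity>
        <celldesigner:class>PROTEIN</celldesigner:class>
        <celldesigner:proteinReference>pr11</celldesigner:proteinReference>
      </celldesigner:speciesIdentity>
    </celldesigner:annotation>
  </celldesigner:species>
</excerpt>
"""


def complex_members(xml_text):
    """Map each complex id to the names of the species it includes."""
    root = ET.fromstring(xml_text)
    prefixes = {"cd": CD_NS}
    members = {}
    for species in root.iter("{%s}species" % CD_NS):
        complex_id = species.findtext(
            "cd:annotation/cd:complexSpecies", namespaces=prefixes)
        if complex_id is not None:
            members.setdefault(complex_id, []).append(species.get("name"))
    return members


members = complex_members(EXCERPT)
print(members)
```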

CellDesigner turned out to be a resounding success, and many systems biologists used it as a user-friendly tool to draw pathways, not always with modeling in mind. A significant portion of CellDesigner users do not actually know that its native format is an extended SBML, and call it “CellDesigner format”. Because what makes the success of CellDesigner is largely encoded in proprietary annotations, third-party software started to develop support for those annotations. One may regret that the SBML layout extension (see below) was not used instead, but it was just pragmatism on the side of this software. (Of course nowadays SBGN-ML should be the preferred way of encoding graphical representations of biochemical networks in a standard XML, and we hope CellDesigner will develop support for the format soon.) Among the software that adopted the CellDesigner SBML extension one can mention the Cytoscape plugin BiNoM, and CellPublisher.

Using annotations to test SBML extensions: The layout proposal

An interesting use for SBML annotations is to develop and try out possible SBML developments. To remain in the area of graphical representation, one can mention the SBML layout extension. This extension of SBML has been under discussion since 2002 and an agreement was reached many years ago (Gauges et al 2006). Since SBML was not a modular language at the time, the layout extension was encoded in annotations. The following example shows the declaration of a layout in the annotation of the model element (a given SBML model can carry several layouts). This layout contains a compartment, at a given position and with a given size. Note that the layout extension does not deal with the actual visual representation, which is dealt with by the SBML rendering extension.


<model id="TestModel">
<annotation>
<listOfLayouts xmlns="http://projects.eml.org/bcb/sbml/level2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<layout id="Layout_1">
<dimensions width="400" height="220"/>
<listOfCompartmentGlyphs>
<compartmentGlyph id="CompartmentGlyph_1" compartment="Compartment_1">
<boundingBox id="bb1">
<position x="5" y="5"/>
<dimensions width="390" height="210"/>
</boundingBox>
</compartmentGlyph>
</listOfCompartmentGlyphs>
</layout>
</listOfLayouts>
</annotation>
<!-- [...] -->
</model>

In SBML Level 3, the list of layouts and all its content are removed from the annotation of the model element and become a bona fide SBML element, in the namespace of the SBML Level 3 layout package.
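For the annotation-based (Level 2) form, reading the layout is again a matter of following one namespace. A minimal sketch with Python's standard xml.etree.ElementTree module is given here, using a closed copy of the example and the namespace URI it declares; the function name compartment_boxes is an assumption of this illustration. A reader for the Level 3 package would look for the same structure outside the annotation and under the package's own namespace.

```python
import xml.etree.ElementTree as ET

# Namespace declared on listOfLayouts in the Level 2 annotation example.
LAYOUT_NS = "http://projects.eml.org/bcb/sbml/level2"

# Copy of the example above, with the open elements closed so that it parses.
MODEL = """
<model id="TestModel">
  <annotation>
    <listOfLayouts xmlns="http://projects.eml.org/bcb/sbml/level2"
                   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <layout id="Layout_1">
        <dimensions width="400" height="220"/>
        <listOfCompartmentGlyphs>
          <compartmentGlyph id="CompartmentGlyph_1" compartment="Compartment_1">
            <boundingBox id="bb1">
              <position x="5" y="5"/>
              <dimensions width="390" height="210"/>
            </boundingBox>
          </compartmentGlyph>
        </listOfCompartmentGlyphs>
      </layout>
    </listOfLayouts>
  </annotation>
</model>
"""


def compartment_boxes(model_xml):
    """Return {compartment id: (x, y, width, height)} for every compartment glyph."""
    root = ET.fromstring(model_xml)
    prefixes = {"layout": LAYOUT_NS}
    boxes = {}
    for glyph in root.iter("{%s}compartmentGlyph" % LAYOUT_NS):
        position = glyph.find("layout:boundingBox/layout:position", prefixes)
        size = glyph.find("layout:boundingBox/layout:dimensions", prefixes)
        # Unprefixed attributes carry no namespace, so plain names work here.
        boxes[glyph.get("compartment")] = (
            float(position.get("x")),
            float(position.get("y")),
            float(size.get("width")),
            float(size.get("height")),
        )
    return boxes


boxes = compartment_boxes(MODEL)
print(boxes)
```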

Conclusion: Anyone can extend SBML to cover anything

If you think the current SBML specification does not cover a feature crucial for your activity, do not throw SBML away and either give up on model sharing or develop your own language. Develop an extension of SBML instead. First look at the current list of packages to see if something is in the oven. If this is the case, please, PLEASE, join the community effort and try to improve the existing proposals. If nothing is available, either because it is a feature useful only for a few people or tasks, because it has so far been judged too far from the SBML mission, or because it is a feature hard to cover, feel free to develop your extension, support it in your software, and share it with your collaborators. If it is useful, it will be shared between groups, and maybe you can propose it for a future SBML development. As the hacker’s mantra says: “who codes wins”.

Why not use R more often as a backend for modeling and simulation?

I have a confession to make. I never learnt how to use MatLab. Despite having been in the business of developing and using dynamic models for more than a decade now, I have managed to avoid fiddling with the master software used for that purpose. I used it once to run scripts written by one of my students. And I toyed a bit with Octave and SciLab, which are free replacements. But I was lucky enough to develop models in the infancy of systems biology, when it was perfectly acceptable to re-invent the numerical wheel in C (or even Perl!). However, this shameful omission will hopefully remain unnoticed by all but the readers of this blog. Indeed, there is a massive haemorrhage of academic MatLab-based tools towards R. R is used widely in bioinformatics, and the continual incompatibilities between successive versions of MatLab make maintaining software based on that tool really hard. The difficulty of obtaining MatLab trial licences for courses does not help either: that was a decisive criterion for a massive recoding from MatLab to R at the EBI, in order to be able to run our in silico systems biology course properly. Anyway, I could directly learn R, and nobody would notice how clueless I was.

And then comes my second confession. Despite having spent 8+ years at the EBI, where R is used daily by battalions of bioinformaticians, and where Bioconductor was in part developed, I did not learn to use this magical tool until recently. I have more excuses here. R was developed primarily as a tool for statistics, and was not initially strong on the computational modelling side. For instance, despite the existence of two packages handling SBML in R, RSBML and SBMLR, very few of the models in BioModels Database, if any, were developed using R. But my group is now heavily involved in the Drug Disease Model Repository (DDMoRe) project of the Innovative Medicines Initiative (IMI). And several of DDMoRe’s tools will be based on R. Now comes the most embarrassing part of the confession. It turns out that my wife is an experimental biologist, doing a lot of measurements and statistics. And she followed an R course organised at the Babraham Institute (where incidentally I will move half of my group from October 2012). She gave me her course materials, and I spent a few evenings making more productive use of my time than usual (how much of the impetus was due to embarrassment will be left uninvestigated). By coincidence, I was reading at the time “Dynamic models in biology” by Ellner and Guckenheimer. While this excellent book was accompanied by exercises in MatLab, the authors later converted them to R, and produced a very useful “An introduction to R for dynamic models in biology”. That document, plus the material provided with the R package deSolve, convinced me that R was not only very powerful for everything we do in computational systems biology, it was also a lot of fun!

While I was happily rejuvenating the geek in me by writing scripts, I still wondered why there were not more GUIs to help use the simulation capabilities of R. Or taking the problem the other way around: why do people bother to implement the computing and statistical layers in end-user software? They could just concentrate on the user experience, and re-use R packages in the background. If those packages are not powerful, flexible or versatile enough, they could contribute to their development, and make the world better. Yes, simulations in R are maybe a bit slower than the equivalent methods implemented in dedicated software written in C. But the slow part of a simulation task often comes from the choice of the wrong numerical solver for the task at hand, or even from the dialogue between solver and GUI (I recently used a tool where I could *see* the curves being drawn when simulating the model of Tyson 1991 (BIOMD0000000005), while any decent simulator should provide the results instantly). Moreover, entire domains of research use MatLab (physics, engineering …). The argument of optimisation does not hold for long. First, if we want perfect optimisation we should write assembly code directly, and second, how can computational biologists believe they are better at coding simulation tools than people who have done that in physics for the last 50 years or so?

To be honest, the time spent running simulations is a very small part of the activity of a typical student or post-doc in computational systems biology. Most of the time is spent developing the model, reading and data-mining. And a significant amount of time is spent (wasted?) developing Yet Another Simulation Software. This software is most often suboptimal and undocumented, and dies when the main developer moves on. Even when the project is maintained, it is pretty hard to provide versions working on all operating systems, a problem that will grow worse with the emergence of smartphones and tablets as proper computing tools.

Several international projects have aimed at providing a reusable infrastructure for modeling and simulation in biology. Historical examples are the Systems Biology Workbench (whose side effect was the development of SBML, so in a sense that was the most influential software project of the last 15 years in biological modeling), and Bio-SPICE. A modern version is the GARUDA project. I really hope these projects decide to avoid re-inventing the wheel and use existing infrastructures. For all that involves calculations, I think it should be R. Don't you?