Using medians instead of means for single cell measures

In the absence of information regarding the structure of variability (whether intrinsic noise, technical error or biological variation), one very often assumes, consciously or not, a normal distribution, i.e. a “bell curve”. This is probably due to an intuitive application of the central limit theorem which stipulates that when independent random variables are added, their normalized sum tends toward such a normal distribution, even if the original variables themselves are not normally distributed. The reasoning then goes that any biological process is the sum of many sub-processes, each with its own variability structure, therefore its “noise” should be Gaussian.

Although that sounds almost common sense, alarm bells start ringing when we use such distributions with molecular measurements. Firstly, a normal distribution ranges from -∞ to +∞. And there is no such things as negative amounts. So, at most, the variability would follow a truncated normal distribution, starting at 0. Secondly, the normal distribution is symmetrical. However, in everyday conversation, the biologists will talk of a variability “reaching twofold”. For a molecular measurement, a two-fold increase and a two-fold decrease do not represent the same amount. So there is an asymmetric notion here. We are talking about linking the addition and removal of the same “quantum of variability” to a multiplication or division by a same number. Immediately logarithms come to mind. And log2 fold changes are indeed one of the most used method to quantify differences. Populations of molecular measurements can also be – sometimes reasonably – fitted with log-normal distributions. Of course, several other distributions have been used to fit better cellular contents of RNA and protein, including the gamma, Poisson and negative binomial distributions, as well as more complicated mix.

Let’s look at some single-cell gene expression measurements. Below, I plotted the distribution of read counts (read counts per million reads to be accurate) for four genes in 232 cells. The asymmetry is obvious, even for NDUFAB1 (the acyl carrier protein, central to lipid metabolism). This dataset was generated using a SmartSeq approach and Illumina HiSeq sequencing. It is therefore likely that many of the observed 0 are “dropouts”, possibly due to the reverse transcriptase stochastically missing the mRNAs. This problem is probably even amplified with methods such as Chromium, that are known to detect less genes per cell. Nevertheless, even if we remove all 0, we observe extremely similar distributions.


One of the important consequences of the normal distribution’s symmetry, is that mean and median of the distribution are identical. In a population, we should have the same amounts of samples presenting less and presenting more substance than the mean. In other words, a “typical” sample, representative of the population, should display the mean amount of the substance measured. It is easy to see that this is not the case at all for our single cell gene expressions. The numbers of cells expressing more than the mean of the population are 99 for ACP (not hugely far from the 116 of the median), 86 for hexokinase, 78 for histone acetyl transferase P300 and 30 for actin 2. In fact, in the latter case, the median is 0, mRNAs having been detected in only 50 of the 232 cells ! So, if we take a cell randomly in the population, most of the time it presents a count of 0 CPM of actin 2. The mean expression of 52.5 CPM is certainly not representative!

If we want to model the cell type, and provide initial concentrations for some messenger RNAs, we must use the median of the measurements, not the mean (of course, the best route of action would be to build an ensemble model, cf below). The situation would be different if we wanted to model the tissue, that is a sum of non individualised cells representative of the population.

To explain how such asymmetric distributions can arise from noise following normal distributions, we can build a small model of gene expression. mRNA is transcribed at a constant flux, with a rate constant kT. It is then degraded following a unimolecular decay with rate kdeg (chosen to be 1 on average, for convenience). Both rate constants are computed from energies, following the Arrhenius equation, k = Ae-(E/RT), where R is the gas constant, 8.314 and T is the temperature, that we set at 310 K (37 deg C). To simplify we’ll just set the scaling factor A to 1, assuming it is included in the reference energy. E is 0 for degradation, and we modulate the reference transcription energy to control the level of transcript. Both transcription and degradation energy will be affected by normally distributed noises that represent differences between cells (e.g. concentration and state of enzymes). So Ei = E + noise. Because of Arrhenius equation, the normal distributions of energy are transformed into lognormal distributions of rates. Below I plot the distributions of the noises in the cells and the resulting rates.


The equilibrium concentration of the mRNA is then kdeg/kT (we could run stochastic simulations to add temporal fluctuations, but that would not change the message). The number of molecules is obtained by multiplying by volume (1e-15 l) and Avogadro number. Each panel presents 300 cells. The distribution of the top-left looks kind of intermediate between those of hexokinase and ACP above. To get the values on the top-right panel, we simulate an overall increase of the transcription rate by twofold, using a decrease of the energy by 8.314*310*ln(2). In this specific case, the observed ratios between the two medians and between the two means are both about 2.04, close to the “truth”. So we could correctly infer a twofold increase by looking at the means. In the bottom panels, we increase the variability of the systems by doubling the standard deviation of the energy noises. Now the ratio of the median is 1.8, inferring a 80% increase while the ratio of the means is 2.53, inferring an increase of 153%!


In summary:

  1. Means of single cell molecular measurements are not a good way of getting a value representing the population;
  2. Comparing the means of single measurements in two populations does not provide an accurate estimation of the underlying changes;

Batch correcting using part of a dataset – illustration with cell-types in scRNA-seq

We all know that one of the main sources of variability in molecular measurements are batch effects. We perform the “same” experiment twice, using the same protocol and the same piece of kit. We believe we control everything and eliminate all sorts of unwanted perturbations. But at the end of day, everything is different, luminosity, air pressure, what we ate at lunch etc. Fortunately, we can (sometimes, and partially) correct these effects down the line. Now, what happens if we want to correct the batch effects for only part of a dataset? When I needed to, I had to dig quite a bit to find out how we can do so. I thought the trick could be useful to others (why we would like to do so will become clear latter).

NB: I use the example of single-cell RNA-seq in this post. However, the approach can of course be applied to all kind of data.

First, let’s look at the batch correction applied to an entire dataset. I will use scRNA-seq data, generated with the Smart-Seq protocol. The details of the sample preparation and the generation of the datasets is not relevant for the current post. So I will start directly with the count tables. The datasets haver been cleaned though. Counts have been corrected for library sizes (Count Per Million reads) and “bad” cells have been removed (cells with lot fewer reads than most, here less than 500000, and cells whose reads mapped to less genes, here 9000). I then removed all the genes that did not show at least 10 CPM in at least one cell, and genes that did not vary by at least two-fold across the entire datasets. Finally, I took the log of the counts.  Other approaches can be used, such as DESeq2’s rlog. It does not matter here.  Let’s load the counts.

# load counts
cpm1 <- read.table("OneCellType.csv", 

We can look at the resulting table. Columns are cells. The first part of the name indicate the name of the 96-well plate while the second part are the coordinates of the well. We have two batches, one composed of 1 plate, EPI1, and the other composed of 2 plates, EPI2 and EPI3. In total, we have 232 cells and 24235 genes.


As we can see, many many cells show zeros, as expected with scRNA-seq. But no row is made up entirely of zeros. Let’s visualise the dataset with Principal Component Analysis.

# transpose the count matrix; needed for PCA

# Run the PCA

# Provide the variance of components

# increase the left margin to avoid cropping the labels
par(mar=c(5.1, 5.1, 4.1, 2.1))

# Colour by plates. 

# Plot PC1 and PC2 (the first coordinates from PCA1$x)
     xlab=sprintf("PC1 %.1f %%",eigen1[1,2]),
     ylab=sprintf("PC2 %.1f %%",eigen1[2,2]))   


The resulting plot does not exhibit much structure, and the variance landscape is very flat, the first 2 components showing only about 5% of it. This is all good since all those cells are supposed to belong to the same cell-type. However,  we can see that while EPI2 and EPI3 (blue and green) 0 the two plates processed in the same batch – overlap nicely, EPI1 (in orange) is off. And this batch effect aligns perfectly with PC1. Yes, the most important source of variability (albeit small) is the batch! So we need to correct for it. To do that, we will use the function ComBat from the Bioconductor package sva.

First we create a table that link our cells with the batches.

# load the package sva

# create a table with the cells and the batches they belong to
cells<-data.frame(batch = c(rep("b1",ncol(cpm1[,grep("EPI1",colnames(cpm1))])),
                  row.names = colnames(cpm1))


Since all cells belong to the same cell-type, we have only one column, linking each cell to its batch. We can now run the batch correction itself. Since we have only one variable in the model, we do not let anything out (~1). Then we replot the PCA.

# model to use in the batch correction
modcombat = model.matrix(~1,data=cells)
bcor_cpm1 = ComBat(dat=cpm1,batch=cells$batch,
                   par.prior=TRUE, prior.plots=FALSE)

# redo the PCA

     xlab=sprintf("PC1 %.1f %%",eigenb1[1,2]),
     ylab=sprintf("PC2 %.1f %%",eigenb1[2,2]))


Success! Now, all three plates, from the first and second batch, are merged together.  Note the decrease of variance associated with PC1, from 3.1 % to 2.5 %. This was expected since the shift of EPI1 compared with EPI2 and EPI3 was along PC1 in the first place. Now, this is important. If the feature were were looking to analyse also aligned with PC1, we would have thrown the baby out with the bathwater.  Fortunately, in our case, the interesting stuff aligned with PC2. And PC2 is almost not affected by the batch correction. Of course you cannot know that in advance. It required a series of iterative analyses to become aware of it. As a rule, the more orthogonal the feature is with the batch effect, the less it will be affected by the batch correction.

Now, that was easy since all the cells belonged to the same cell-type. In fact, in the real study, that was not quite the case. The first batch contained one cell-type, while the second batch contained two cell types. So let’s load the complete dataset.

# load counts
cpm2 <- read.table("TwoCellTypes.csv", 

We have initially 326 cells and 25299 genes. In addition to the plates of EPI cells, we have now a plate of LPM cells (what EPI and LPM mean really does not matter here). We can now run a PCA again.

# transpose data for PCA

# Principal Component Analysis


     xlab=sprintf("PC1 %.1f %%",eigen2[1,2]),
     ylab=sprintf("PC2 %.1f %%",eigen2[2,2]))


We can see that again, while EPI2 and EPI3 are together, the cloud of EPI1 cell is slightly shifted upward. The plate composed of cells belonging to another cell type, plotted in black, is clearly different from the three EPI ones. Let’s try to batch correct to bring together EPI1, EPI2 and EPI3. The procedure is absolutely identical as the previous one, except we now integrate the new plate, resulting in 3 plates associated with batch b2.

# create a table with the cells and the batches they belong to
cells2<-data.frame(batch = c(rep("b1",ncol(cpm2[,grep("EPI1",colnames(cpm2))])),
                    row.names = colnames(cpm2))

modcombat = model.matrix(~1,data=cells2)
bcor_cpm2 = ComBat(dat=cpm2,batch=cells2$batch,
                   par.prior=TRUE, prior.plots=FALSE)


     xlab=sprintf("PC1 %.1f %%",eigenb2[1,2]),ylab=sprintf("PC2 %.1f %%",eigenb2[2,2]))


Fail! EPI1 from batch1 has now been batch-corrected taking into account all three plates of batch b2, which include the LPM cells. As a result, EPI1 cells end up located somewhere in between EPI2/3 and LPM1. Clearly this is wrong. What we want is putting together EPI1 and EPI2/3, ignoring LPM1. We can do that without a problem, by declaring the cell-types as a variable of interest, that we will then ignore. But first we need to do something. The initial clean-up was performed on the entire dataset. Even if gene expression varied throughout the dataset, we could still have genes that do not vary either across EPI cells or across LPM cells. That would cause newest versions of ComBat to throw errors. So let’s remove the culprit genes.

# sanity check on variance. remove genes which
# variance is 0 either in EPI or in LPM
varEPI <- apply(cpm2[,grep("EPI",colnames(cpm2))], 1, var)
varLPM <- apply(cpm2[,grep("LPM",colnames(cpm2))], 1, var)
cpm2 <- cpm2[-which(varEPI == 0 | varLPM == 0 ),]

We still have 20885 genes to play with, which is largely enough. Now we will create a table linking the cells and the batches, as before, but in addition add the cell-type they belong to.

# create a table with the cells and the batches they belong to
cells3<-data.frame(batch = c(rep("b1",ncol(cpm2[,grep("EPI1",colnames(cpm2))])),
                   celltype = c(rep("epi",ncol(cpm2[,grep("EPI",colnames(cpm2))])),
                   row.names = colnames(cpm2))



We can now instruct ComBat to correct for the batch but taking into account the cell-type in the model (~cells3$celltype).

modcombat = model.matrix(~cells3$celltype,data=cells3)
bcor_cpm3 = ComBat(dat=cpm2,batch=cells3$batch,
                   par.prior=TRUE, prior.plots=FALSE)

And we run the PCA again.


     xlab=sprintf("PC1 %.1f %%",eigenb3[1,2]),
     ylab=sprintf("PC2 %.1f %%",eigenb3[2,2]))


Hooray! Now we have EPI1, EPI2 and EPI3 together, regardless of the batch, while LPM cells stand quietly to the side. We can export those coordinates for future use (e.g. clustering).

write.csv(bcor_cpm3, file="batchCorrected.cvs",quote=FALSE)

DVDs I sell on eBay

Here is a list of the DVDs I am currently selling on eBay. The list is updated continuously so come back if you did not see something you wanted yet. All I sell can also be seen directly on my seller page

NB1: The descriptions I provided for each book are mostly not mine, but gathered from book back-covers, their website etc.

NB2: I only mention postage costs for the UK in the eBay listings. However, providing that the proper costs are covered, I am happy to send them anywhere.


front[eBay link]

A lovelorn screenwriter becomes desperate as he tries and fails to adapt ‘The Orchid Thief’ by Susan Orlean for the screen.


Director: Spike Jonze

Writers: Susan Orlean (book), Charlie Kaufman (screenplay)

Along came Polly, 2004

front[eBay link]

A buttoned up newlywed finds his too organized life falling into chaos when he falls in love with an old classmate.

Director:John Hamburg

Writer:John Hamburg

Angus, Thongs and perfect Snogging, 2008

front[eBay link]

The story centers on a 14-year-old girl who keeps a diary about the ups and downs of being a teenager, including the things she learns about kissing.

Arthur and the invisibles, from Luc Besson

front[eBay link]

Ten-year-old Arthur, in a bid to save his grandfather’s house from being demolished, goes looking for some much-fabled hidden treasure in the land of the Minimoys, tiny people living in harmony with nature.


Director:Luc Besson


front[eBay link]

When a bumbling New Yorker is dumped by his activist girlfriend, he travels to a tiny Latin American nation and becomes involved in its latest rebellion.


Director:Woody Allen

Bewitched, 2006

front[eBay link]

Thinking he can overshadow an unknown actress in the part, an egocentric actor unknowingly gets a witch cast in an upcoming television remake of the classic sitcom Bewitched (1964).


Director:Nora Ephron

The Bonfire of the Vanities, 1990

front[eBay link]

After his mistress runs over a young teen, a Wall Street hotshot sees his life unravel in the spotlight and attracting the interest of a down and out reporter.


Director: Brian De Palma (as Brian DePalma)

Writers: Michael Cristofer (screenplay), Tom Wolfe (novel)



Bridesmaids, 2011

front[eBay link]

Competition between the maid of honor and a bridesmaid, over who is the bride’s best friend, threatens to upend the life of an out-of-work pastry chef.
Director: Paul Feig


The Cave, 2005

Bloodthirsty creatures await a pack of divers who become trapped in an underwater cave network.


Director: Bruce Hunt

City of Angels, 1999

front[eBay link]

Inspired by the modern classic, Wings of Desire, City involves an angel (Cage) who is spotted by a doctor in an operating room. Franz plays Cage’s buddy who somehow knows a lot about angels.

Director: Brad Silberling

Writers:Wim Wenders (screenplay “Der Himmel über Berlin”), Peter Handke (screenplay “Der Himmel über Berlin”)

Stars:Nicolas Cage, Meg Ryan, Andre Braugher


Closer, 2004

front[eBay link]

The relationships of two couples become complicated and deceitful when the man from one couple meets the woman of the other.

Writers: Patrick Marber (play), Patrick Marber (screenplay)


Cold Mountain, 2004

In the waning days of the American Civil War, a wounded soldier embarks on a perilous journey back home to Cold Mountain, North Carolina to reunite with his sweetheart.

Writers: Charles Frazier (book), Anthony Minghella (screenplay)

Stars: Jude Law, Nicole Kidman, Renée Zellweger

The Devil’s Chair, 2007

front[eBay link]

With a pocketful of drugs, Nick West takes out his girlfriend Sammy, for a good time. When they explore an abandoned asylum, the discovery of a bizarre device, a cross between an electric chair and sadistic fetish machine, transforms drugged-out bliss into agony and despair.


Director: Adam Mason


Embrassez qui vous voudrez (Summer Things), 2004

front[eBay link]

Two couple of friends, one very rich the other almost homeless, decides to go on Holiday. Julie, a single mother, joins them too. Once at seaside, it starts a complicate love cross among them that will involve also a transsexual, a jealous brother,a Latin Lover and another nervous stressed couple. Not to mention about the daughter of one of them that is secretly in Chicago with one of his father’s employee… More, at the end of the summer all of them will join the same party…


Director:Michel Blanc

Garfield, 2004

Jon Arbuckle buys a second pet, a dog named Odie. However, Odie is then abducted and it is up to Jon’s cat, Garfield, to find and rescue the canine.


Director: Peter Hewitt (as Pete Hewitt)

Writers: Jim Davis (comic strip “Garfield”), Joel Cohen


Garfield 2, 2006

front[eBay link]

Jon and Garfield visit the United Kingdom, where a case of mistaken cat identity finds Garfield ruling over a castle. His reign is soon jeopardized by the nefarious Lord Dargis, who has designs on the estate.


Director: Tim Hill

Half Light

front[eBay link]

Rachel Carlson, a successful novelist moves to a small Scottish village to move on with her life after the death of her son. Strange things start to happen when she is haunted by ghosts and real life terror.


Director:Craig Rosenberg

He’s just not that into you, 2009

frong[eBay link]

This Baltimore-set movie of interconnecting story arcs deals with the challenges of reading or misreading human behavior.


Director:Ken Kwapis


I Could Never Be Your Woman, 2007

cover[eBay link]

A mother falls for a younger man while her daughter falls in love for the first time. Mother Nature messes with their fates.

Director:Amy Heckerling

Writer:Amy Heckerling

Stars:Michelle Pfeiffer, Paul Rudd, Saoirse Ronan

The Interpreter, 2005 (DVD 2010)

front[eBay link]

Political intrigue and deception unfold inside the United Nations, where a U.S. Secret Service agent is assigned to investigate an interpreter who overhears an assassination plot.

Writers: Martin Stellman (story), Brian Ward (story)


Kingdom of heaven, 2005

front[eBay link]

Balian of Ibelin travels to Jerusalem during the Crusades of the 12th century, and there he finds himself as the defender of the city and its people.

Ladies in lavender, 2005

front[eBay link]

Two sisters befriend a mysterious foreigner who washes up on the beach of their 1930’s Cornish seaside village.


Director: Charles Dance

Writers: William J. Locke (short story), Charles Dance


Little Children

front[eBay link]

The lives of two lovelorn spouses from separate marriages, a registered sex offender, and a disgraced ex-police officer intersect as they struggle to resist their vulnerabilities and temptations in suburban Massachusetts.


Director:Todd Field

Writers:Todd Field (screenplay), Tom Perrotta (screenplay)

Stars:Kate Winslet, Jennifer Connelly, Patrick Wilson


Lord of War, 2005

front[eBay link]

An arms dealer confronts the morality of his work as he is being chased by an INTERPOL Agent.

Director: Andrew Niccol


Michael Clayton, 2006

front[eBay link]

A law firm brings in its “fixer” to remedy the situation after a lawyer has a breakdown while representing a chemical company that he knows is guilty in a multibillion-dollar class action suit.


Director:Tony Gilroy

Writer:Tony Gilroy

Mickey Blue Eyes, 2000

front[eBay link]

An English auctioneer proposes to the daughter of a Mafia kingpin, only to realize that certain “favors” would be asked of him.

Director: Kelly Makin


Nora, 2000

front[eBay link]

In 1904, in Dublin, James Joyce chats up Nora Barnacle, a hotel maid recently come from Galway. She enchants him with her frank, direct and uninhibited manner, and before long, he’s convinced her to come with him to Trieste, where he has a job with Berlitz. Over time, Nora pulls him through phobias, tolerates his drinking, takes in his brother Stan, and bests Joyce at ‘the writin’ game’ to bring him back to Italy from Dublin where he’s gone to open a cinema. But his sexual jealousy threatens the relationship and sends her back to Galway with the children. Is there any way to tame Jim’s green-eyed monster? And, will the lad ever get his stories published?
Director: Pat Murphy

Writers: Brenda Maddox (based on the biography: written by), Pat Murphy

The Peculiar Adventures of Hector, 2007

front[eBay link]

This mini series follows the adventures of Hector and his friends on a series of amazingly imaginative adventures on their way to school each morning.

Practical Magic, 1998

Front[eBay link]

Two witch sisters, raised by their eccentric aunts in a small town, face closed-minded prejudice and a curse which threatens to prevent them ever finding lasting love.


Director: Griffin Dunne

Writers: Alice Hoffman (novel), Robin Swicord (screenplay) | 2 more credits »

Priceless, 2008

Through a set of wacky circumstances, a young gold digger mistakenly woos a mild-mannered bartender thinking he’s a wealthy suitor.


Director: Pierre Salvadori

Writers: Pierre Salvadori (scenario), Benoît Graffin (scenario) | 1 more credit »


Rachel getting married

front[eBay link]

A young woman who has been in and out of rehab for the past ten years, returns home for the weekend for her sister’s wedding.


Director:Jonathan Demme

Writer:Jenny Lumet

Rire et Chatiment (Laughter and Punishment)

front[eBay link]

Vincent Roméro is an osteopath who makes a point of being funny every minute he is awake. In fact it is his way to remain the center of attention. His antics wind up irritating his wife Camille, a geriatric doctor, to such an extent that she leaves him one day. At first Vincent is so persuaded his many qualities are hard to beat that he can’t doubt she will soon come back. But as she doesn’t, he starts questioning himself. All the more as he realizes, to his dismay, that each time he clowns the people around him start dying of laughter…literally that is!

Director:Isabelle Doval

Stars:José Garcia, Isabelle Doval, Laurent Lucas, Benoit Poolvoerde

Shaggy and Scooby-Doo get a clue Vol 1

ebay-front[eBay link]

Shaggy and Scooby-Doo get a clue Vol 2

front[eBay link]

Surf’s Up, 2007

front[eBay link]

A behind-the-scenes look at the annual Penguin World Surfing Championship, and its newest participant, up-and-comer Cody Maverick.

Writers: Don Rhymer (screenplay by), Ash Brannon (screenplay by)


Trigger Happy TV season 1, DVD 2006

front[eBay link]

Trigger Happy TV is a hidden camera/practical joke reality television series. The original British edition of the show, produced by Absolutely Productions, starred Dom Joly and ran for two series on the British television channel Channel 4 from 2000 to 2003. Joly made a name for himself as the sole star of the show, which he produced and directed with cameraman Sam Cadman.



Two brothers, 2004

front[eBay link]

Two tigers are separated as cubs and taken into captivity, only to be reunited years later as enemies by an explorer (Pearce) who inadvertently forces them to fight each other.


Director:Jean-Jacques Annaud

Writers:Alain Godard (scenario), Jean-Jacques Annaud (scenario)



The Two Jakes, 1990 (DVD 2002)

front[eBay link]

The sequel to Chinatown (1974) finds J.J. “Jake” Gittes investigating adultery and murder, and the money that comes from oil.


Director: Jack Nicholson

Writers: Robert Towne (characters), Robert Towne



Walk the line, 2005

front[eBay link]

A chronicle of country music legend Johnny Cash‘s life, from his early days on an Arkansas cotton farm to his rise to fame with Sun Records in Memphis, where he recorded alongside Elvis Presley, Jerry Lee Lewis, and Carl Perkins.

Director: James Mangold

Writers: Johnny Cash (book), Johnny Cash (book)



Wild Target, 2010

front[eBay link]

A hitman tries to retire but a beautiful thief may change his plans.


Director: Jonathan Lynn

Writers: Lucinda Coxon (screenplay), Pierre Salvadori (film “Cible émouvante”)

Books I sell on eBay

NB1: The descriptions I provided for each book are mostly not mine, but gathered from book back-covers, their website etc.

NB2: I only mention postage costs for the UK in the eBay listings. However, providing that the proper costs are covered, I am happy to send them anywhere.


Comparative Vertebrate Neuroanatomy, Butler and Hodos, 1996

Ebay-cover.jpg[eBay link]

By applying the tools of modern neuroanatomy to brain structure and function in various species, researchers have discovered that numerous cell groups and interconnections, known to be present in mammals, also exist in non-mammalian vertebrates. This book reveals how the brains of various vertebrates are astoundingly similar in some ways, while in others they are quite different. The authors examine how the form of the brain is modified and magnified to perfect and capitalize on a specific function, making any particular animal a “specialist” in its area. They also clarify the forms and functions of the nervous system that have allowed vertebrates to adapt to almost every aspect of the earth’s environment.

Computational Neuroscience: Realistic Modeling for Experimentalists, 2001


[eBay link]

Designed primarily as an introduction to realistic modeling methods, Computational Neuroscience: Realistic Modeling for Experimentalists focuses on methodological approaches, selecting appropriate methods, and identifying potential pitfalls. The authors address varying levels of complexity, from molecular interactions within single neurons to the processing of information by neural networks. He avoids theoretical mathematics and provides just enough of the basic math used by experimentalists.

Each chapter covers: the theoretical foundation, parameters needed, appropriate software descriptions, evaluation of the model, future directions expected, examples in text boxes and references.

Handbook of Basal Ganglia structure and function, Steiner and Tseng, 2010

Ebay-cover[eBay link]

The Basal Ganglia comprise a group of forebrain nuclei that are interconnected with the cerebral cortex, thalamus and brainstem. Basal ganglia circuits are involved in various functions, including motor control and learning, sensorimotor integration, reward and cognition. The importance of these nuclei for normal brain function and behavior is emphasized by the numerous and diverse disorders associated with basal ganglia dysfunction, including Parkinson’s disease, Tourette’s syndrome, Huntington’s disease, obsessive-compulsive disorder, dystonia, and psychostimulant addiction.

The Handbook of Basal Ganglia provides a comprehensive overview of the structural and functional organization of the basal ganglia, with special emphasis on the progress achieved over the last 10-15 years. Organized in six parts, the volume describes the general anatomical organization and provides a review of the evolution of the basal ganglia, followed by detailed accounts of recent advances in anatomy, cellular/molecular, and cellular/physiological mechanisms, and our understanding of the behavioral and clinical aspects of basal ganglia function and dysfunction.

Modeling Neural Development, 2003

Ebay-cover[eBay link]

This is one of the first books to study neural development using computational and mathematical modeling. Modeling provides precise and exact ways of expression, which allow us to go beyond the insights that intuitive or commonsense reasoning alone can yield. Most neural modeling focuses on information processing in the adult nervous system; Modeling Neural Development shows how models can be used to study the development of the nervous system at different levels of organization and at different phases of development, from molecule to system and from neurulation to cognition.

The book’s fourteen chapters follow loosely the chronology of neural development. Chapters 1 and 2 study the very early development of the nervous system, discussing gene networks, cell differentiation, and neural tube development. Chapters 3-5 examine neuronal morphogenesis and neurite outgrowth. Chapters 6-8 study different aspects of the self-organization of neurons into networks. Chapters 9-12 cover refinement of connectivity and the development of specific connectivity patterns. Chapters 13 and 14 focus on some of the functional implications of morphology and development.

Each chapter contains an overview of the biology of the topic in question, a review of the modeling efforts in the field, a discussion in more detail of some of the models, and some perspectives on future theoretical and experimental work.

The science of studying neural development by computational and mathematical modeling is relatively new; this book, as Dale Purves writes in the foreword, “serves as an important progress report” in the effort to understand the complexities of neural development.

Systems Engineering methods, Harold Chestnut, 1967

Ebay-internalCover[eBay link]

The first, and the most important, volume of Harold Chestnut on Systems Engineering and Analysis series. A classic.



A dog’s best friend, 1999


[eBay link]

Toby’s life is all fillet steak and walks in the park, but one night he gets lost – and everything changes!
He gets wet, cold and hungry, but also makes a new friend. And when his old owners turn up, Toby had to make the toughest decision of his life!

Children of seven and above will enjoy reading this story, but is can be read to younger children too.

24 pages.


The Booktime Book of Fantastic First Poems

Ebay-cover[eBay link]

28 pages.



Fat Puss and Friends, Harriet Castor

Ebay-cover[eBay link]

92 page.

Contains the stories:
Fat Puss
Fat Puss finds a friend
Fat Puss in summer
Fat Puss meets a stranger
Fat Puss at Christmas

Football mad, four books in one, 2008

Ebay-cover[eBay link]

Includes books:

The worst team in the world by Alan MacDonald

There’s only one Danny Ogle by Helena Pielichaty

Nice One, Sam! by John Goodwin

Mark’s Dream Team by Alan MacDonald

415 pages

Horrid Henry, 1995 (book 2008)

front[eBay link]

The original novel that initiated the very successful series.

“His fiendish plots will make you ache with laughter”

Horrid Henry’s purple hand gang joke book, 2011

front[eBay link]

“Laugh your head off with Henry and the rest of the purple hand gang in this collection of the best, most ridiculously rib-tickling jokes by Horrid Henry fans everywhere”

A Giant Slice of Horrid Henry

front[eBay link]

Contains the three stories:
Horrid Henry meets the Queen
Horrid Henry’s underpants
Horrid Henry’s stinkbomb

A Helping of Horrid Henry

front[eBay link]

Three hilarious books in one volume:
Horrid Henry’s Nits
Horrid Henry Gets Rich Quick
(a.k.a. Horrid Henry Strikes it Rich)
Horrid Henry’s Haunted House


The Hundred-Mile-an-Hour Dog Goes for Gold, Jeremy Strong

Ebay-Cover[eBay link]

“Guess what’s coming to town!
The animal games!

There’ll be show-jumping for horses AND rabbits, and discus for dogs – so of course I have to enter Streaker.

Mum says a CARROT is more obedient than my dog, but I think she can do it – Streaker can go for GOLD!”

149 pages.

Lego Star Wars character encyclopedia

Ebay-front[eBay link]

Describing more than 300 minifigures, all the main characters of the saga, in their different outfits.

A figurine is provided, although it is not Hans Solo as written on the book, but a rebel trooper. It was already the case when we bought the book new. This must have been a mistake made in factory.


Little Bear’s ABC and 123, Jane Hissey

Ebay-cover[eBay link]

57 pages.



Monster stories for under fives, Joan Stimson, Ladybird

Ebay-cover[eBay link]

43 pages.



Pigs Might Fly, Red House Younger Children Award 2006

Ebay-front[eBay link]

“Let me win, little pig, LET ME WIN!” The Big Bad Wolf is back and badder than ever! So when the Three Pigs enter the “Pie in the Sky” Air Race, he’s determined to snaffle the prize pies and have the pigs for pudding. Will the Wolf win – or can Wilbur save the day? A fast-paced, frantically-funny sequel to a well-loved tale.

32 pages.

Winner of the Red House Children’s Book Award, category Younger Children.


Puzzle Castle, Usborne young puzzles

front[eBay link]

32 pages

Puzzle Christmas, Usborne young puzzles

front[eBay link]

32 pages

Puzzle Dinosaurs, Usborne young puzzle

front[eBay link]

32 pages

My Treasury of Stories and Rhymes, hardback 2005

Ebay-front[eBay link]

384 pages of children delight. Many stories and many songs, of different lengths and levels.

Where’s my Teddy? Jez Alborough 2004

Ebay-cover[eBay link]

Eddy’s lost his teddy, Frddy.
So off he goes to the wood to find him.
But the wood is dark and horrible and little Eddy is in for a gigantic surprise!


32 pages.



Wicked World Cup 2006, Michael Coleman

Ebay-cover[eBay link]

“This international guide gives you the coolest commentary on the brilliant Brazilians, the awesome Argentinians, the invincible Italians and the fantastic French.

Plus discover the secrets of wicked wonders Pelé, Beckham, and the English heroes of 1966.”

The book has been read many times. No pages are missing, and none are ripped or damaged. The result chart tables at the end of the book have been partially filled with the results of the 2006 world cup.

Wriggle and Roar, by Julia Donaldson (“The Gruffalo”) and Nick Sharratt (“Tracy Beaker”)

ebay-cover[eBay link]

Rhymes to join in with.

A classic by Julia Donaldson, illustrated by Nick Sharratt of “Tracy Beaker” fame

40 pages.


Gardens of delight, by Erica James

front[eBay link]

The Gardens of Delight brochure promises the opportunity to visit some of the most beautiful gardens in the Lake Como area of Italy. For Lucy, the chance to go to Italy offers more than just gardens. Lake Como is where her father lives, and the last time she saw him was when she was just a teenager.

Recently married Helen and her wealthy husband have just moved into the Old Rectory. With her husband spending so much time away from home, Helen throws herself into caring for the garden. But Helen needs help – and friends – and so decides to take the plunge and join the local Garden Club.

Conrad isn’t the least bit interested in gardening. Widowed for five years, his life revolves around work and humouring Mac, his elderly uncle who lives with him, and who has expressed a desire to go on the Gardens of Delight tour. Reluctantly, Conrad agrees to accompany him. ‘Anything for a peaceful life,’ he concedes. But a peaceful life is the last thing any of them are in for…

479 pages

Hidden Talents, by Erica James

front[eBay link]

Dulcie Ballantyne knows that creative writers’ groups attract an unlikely mix of people, so when she starts up Hidden Talents, she is well prepared for the assortment of people she is bringing together.

Beth King is facing empty-nest syndrome as her only son, Nathan, leaves home for university. Jack Solomon, a local estate agent, is having trouble coming to terms with the shock of his wife leaving him for his best friend. Jaz Rafferty is an intensely private seventeen-year-old girl who writes to escape her large, boisterous family. Victor Blackmore is a know-it-all, who claims to be writing the blockbuster novel every publisher will be clamouring for.

What they all have in common is a need to escape, as well as a desire to keep their lives as private as possible. As they grow more confident in their writing skills, friendships develop, and gradually they come to realise that a little openness isn’t necessarily a bad thing.

488 pages.

Powershift, by Alvin Toffler

front[eBay link]

Alvin Toffler’s Future Shock and The Third Wave are among the most influential books of our time. Now, in Powershift, he brings to a climax the ideas set forth in his previous works to offer a stunning vision of the future that will change your life.

In Powershift, Toffler argues that while headlines focus on shifts of power at the global level, equally significant shifts are taking place in the everyday world we all inhabit–the world of supermarkets and hospitals, banks and business offices, television and telephones, politics and personal life. The very nature of power is changing under our eyes. Powershift maps the “info-wars” of tomorrow and outlines a new system of wealth creation based on individualism, innovation, and information. As old political antagonisms fade, Toffler identifies where the next, far more impo world division will arise–not between East and West or North and South, but between the “fast” and the “slow.” 612 pages

The Singularity project, by FM Busby

front[eBay link]

Busby, in the classic mode of Robert A. Heinlein (who was a friend and fan of Busby’s work), tells a rich, fast-paced and cleverly plotted tale of the day after tomorrow in The Singularity Project. Mitch Banning is a free-lance engineer in Seattle, who hires onto a secret high-tech project financed by industrialist George Detweiler that will change the world…if it works. And Mitch doesn’t believe it will, since the people creating the hush-hush demonstration of the world’s first matter transmitter include an elderly con-man, an addict-physicist, and a tough South American Indian with a knife. 349 pages


Sunset in St Tropez, by Danielle Steele

front[eBay link]

In her 55th bestselling novel, Danielle Steel explores the seasons of an extraordinary friendship, weaving the story of three couples, lifelong friends, for whom a month’s holiday in St. Tropez becomes a summer of change, revelation, secrets, surprises, and new beginnings…

The Swiss Family Robinson, by Johann Wyss, 1968 edition

front[eBay link]

American print of the Swiss classic. This is the full-length version of WHG Kingston’s translation (446 pages).


The Time Ships, by Stephen Baxter

front[eBay link]

The highly-acclaimed sequel to H G Wells’s The Time Machine, from the heir to Arthur C. Clarke.

Written to celebrate the centenary of the publication of H G Wells’s classic story THE TIME MACHINE, Stephen Baxter’s stunning sequel is an outstanding work of imaginative fiction.

The Time Traveller has abandoned his charming and helpless Eloi friend Weena to the cannibal appetites of the Morlocks, the devolved race of future humans from whom he was forced to flee. He promptly embarks on a second journey to the year AD 802,701, pledged to rescue Weena. He never arrives. The future was changed by his presence… and will be changed again. Hurling towards infinity, the Traveller must resolve the paradoxes building around him in a dazzling temporal journey of discovery. He must achieve the impossible if Weena is to be saved.

629 pages

Zero Coupon, by Paul Erdman

front[eBay link]

Willy Saxon is a financier extraordinaire, until his code of honour gets him three years in jail. Now he’s free to shake the world’s money tree in a daring hustle of global dimensions, tapping the greed of Wall Street and challenging the arrogance of European bankers until both are brought to their knees. The plan’s airtight. The prize is untold billions. And this time Willy plays for keeps.
350 pages


plotGODESeq: differential expression and Gene Ontology enrichment on one plot

I recently came across the package GOplot by Wencke Walter In particular I liked the function GOBubble. However, I found difficult to customise the plot. In particular I wanted to colour the bubbles differently, and to control the plotting area. So I took the idea and extended it. Many aspects of the plot can be configured. It is a work in progress. Not all features of GOBubble are implemented at the moment. For instance, we cannot separate the different branches of Gene Ontology, or add a table listing labelled terms. I also have a few ideas to make the plot more versatile. If you have suggestions, please tell me. The code and the example below can be found at

What we want to obtain at the end is the following plot:


The function plotGODESeq() takes two mandatory inputs: 1) a file containing Gene Ontology enrichment data,  2) a file containing differential gene expression data. Note that the function works better if the dataset is limited, in particular the number of GO terms. It is useful to analyse the effect of a perturbation, chemical or genetic, or to compare two cell types that are not too dissimilar. Comparing samples that exhibit several thousands of differentially expressed genes, resulting in thousands of enriched GO terms, will not only slow the function to a halt, it is also useless (GO enrichment should not be used in these conditions anyway. The results always show things like “neuronal transmission” enriched in neurons versus “immune process” enriched in leucocytes). A large variety of other arguments can be used to customise the plot, but none are mandatory.

To use the function, you need to source the script from where it is; In this example it is located in the session directory. (I know I should make a package of the function. On my ToDo list)



The Gene Ontology enrichment data must be a dataframe containing at least the columns: ID – the identifier of the GO term, description– the description of the term, Enrich – the ratio of observed over expected enriched genes annotated with the GO term, FDR – the False Discovery Rate (a.k.a. adjusted p-value), computed e.g. with the Benjamini-Hochberg correction, and genes – the list of observed genes annotated with the GO term. Any other column can be present. It will not be taken into account. The order of columns does not matter. Here we will load results coming from and analysis run on the server WebGestalt. Feel free to use whatever Gene Ontology enrichment tool you want, as far as the format of the input fits.

# load results from WebGestalt
goenrich_data <- read.table("GO-example.csv", 

# rename the columns to make them less weird 
# and compatible with the GOPlot package
colnames(goenrich_data) %in% c("geneset","R","OverlapGene_UserID")
] <- c("ID","Enrich","genes")

# remove commas from GO term descriptions, because they suck
goenrich_data$description <- gsub(',',"",goenrich_data$description)

The differential expression data must be a dataframe which rownames are the gene symbols, from the same namespace as the genes column of the GO enrichment data above. In addition, one column must be named log2FoldChange, containing the quantitative difference of expression between two conditions. Any other column can be present. It will not be taken into account. The order of columns does not matter.

# Load results from DESeq2
deseq_data <- read.table("DESeq-example.csv", 

Now we can create the plot.


The y-axis is the negative log of the FDR (adjusted p-value). The x-axis is the zscore, that is for a given GO term:

(nb(genes up) – nb(genes down))/sqrt(nb(genes up) + nb(genes down))

The genes associated with each GO term are taken from the GO enrichment input, while the up or down nature of each gene is taken from the differential expression input file. The area of each bubble is proportional to the enrichment (number of observed genes divided by number of expected genes). This is the proper way of doing it, rather than using the radius, although of course, the visual impact is less important.


Choosing what to plot

The console output tells us that we plotted 1431 bubbles. That is not very pretty or informative … The first thing we can note is that we have a big mess at the bottom of the plot, which corresponds to the highest values of FDR. Let’s restrict ourselves to the most significant results, by setting the argument maxFDR to 1e-8.


This is better. We now plot only 181 GO terms. Note the large number of terms aligned at the top of the plot. Those are terms with an FDR of 0. The Y axis being logarithmic, we plot them by setting their FDR to a tenth of the smallest non-0 value. GO over-representation results are often very redundant. We can use GOplot’s function reduce_overlap by setting the argument collapse to the proportion of genes that needs to be identical so that GO terms are merged in one bubble. Let’s use collapse=0.9 (GO terms are merged if 90% of the annotated genes are identical) .


Now we only plot 62 bubbles, i.e. two-third of the terms are now “hidden”. Use this procedure with caution. Note how the plot now looks distorted towards one condition. More “green” terms have been hidden than “red” terms.

The colour used by default for the bubbles is the zscore. It is kind of redundant with the x-axis. Also, the zscore only considers the number of genes up or down-regulated. It does not take into account the amplitude of the change. By setting the argument color to l2fc, we can use the average fold change of all the genes annotated with the GO term instead.


Now we can see that while the proportion of genes annotated by GO:0006333 that are down-regulated is lower than for GO:0008380, the amplitude of their average down-regulation is larger.

WARNING: The current code does not work if the color scheme chosen for the bubbles is based on variable, l2fc or zscore, that do not contain negative and positive values. Sometimes, the “collapsing” can cause this situation, if there is an initial unbalance between zscores and/or l2fc. It is a bug, I know. On the ToDo list …

Using GO identifiers is handy and terse, but since I do not know GO by heart, it makes the plot hard to interpret. We can use the full description of each term instead, by setting the argument label to description.


Customising the bubbles

The width of the labels can be modified by setting the argument wrap to the maximum number of characters (the default used here is 15). Depending of the breadth of values for FDR and zscore, the size of the bubbles can be an issue, either by overlapping too much, or on the contrary by being tiny. We can change that by the argument scale which scales the radius of the bubbles. Let’s fix it to 0.7, to decrease the size of each bubble by a third (the radius, not the area!).


There is often a big crowd of terms at the bottom and centre of the plot. This is not so clear here, with the harsh FDR threshold, but look at the first plot of the post. These terms are generally the least interesting, since they have a lower significance (higher FDR) and mild zscore. We can decide to label the bubbles only under a certain FDR with the argument maxFDRLab and/or above a certain absolute zscore with the argument minZscoreLab. Let’s fix them to 1e-12 and 2 respectively.


Finally, you are perhaps not too fond of the default color scheme. This can be changed with the arguments lowCol, midCol, highCol. Let’s set them to  “deepskyblue4”, “#DDDDDD” and “firebrick”,


Customising the plotting area

The first modifications my collaborators asked me to introduced was to centre the plot on a zscore of 0, and to add space around so they could annotate the plot. One can centre the plot by declaring centered = TRUE (the default is FALSE). Since our example is extremely skewed towards negative zscores, this would not be a good idea. However, adding some space on both side will come handy in the last step of beautification. We can do that by declaring extrawidth=3 (default is 1).


The legend position can be optimised with the arguments leghoffset and legvoffset. Setting them to {-0.5,1.5}


The complete call:

            maxFDR = 1e-8,
            collapse = 0.9,
            lowCol = "deepskyblue4",
            midCol = "#DDDDDD",
            highCol = "firebrick",
            label = "description",
            scale = 0.7,
            maxFDRLab = 1e-12,
            minZscoreLab = 2.5,
            wrap = 15)

Now we can export an SVG version and play with the labels in Inkscape. This part is unfortunately the most demanding …


SBGN-ML, SBML visual packages, CellDesigner format, what are they? When should I use them?

Visual representation of biochemical pathways has been a key tool used to understand cellular and molecular systems for a long time. Any knowledge integration project involves a jigsaw puzzle step, where different pieces have to be put together. When Feynman cheekily wrote on his blackboard just before his death “What I cannot create I do not understand”, he meant that he only fully understood a system once he derived a (mathematical) model for it, and interestingly Feynman is also famous for one of the earliest standard graphical representations of reaction networks, namely the Feynman diagrams to represent models of subatomic particle interactions. The earliest metabolic “map” I possess comes from the 3rd edition of “Outlines of Biochemistry” by Gortner published in 1949. I would be happy to hear if you have older ones.


(I let you find out all the inconsistencies, confusions and error-generating features in this map. This might be food for another text, but I believe this is a great example to support the creation of standards, best practices, and software tools!)

Until recently, those diagrams were mostly drawn by hand, initially on paper, then using drawing software. There was not so much thinking spent in consistency, visual semantics, or interoperability. This changed in the 1990s, as part as Systems Biology’s revival. The other thing that changed in the 1990s was the widespread use of computers and software tools to build and analyse models. The child of both trends was the development of standard computer-readable formats to represent biological networks.

When drawing a knowledge representation map, one can divide the decision-making process, and therefore the things we need to encode in order to share the map, in three  parts:

What – How can people identify what I represent? A biochemical map is a network made up of nodes, linked by arcs. The network may contain only one type of nodes, for instance a protein-protein interaction network or an influence network, or be a bipartite graph, like a reaction network – one type of nodes representing the pools involved in the reactions, the other representing the reactions themselves. One decision is the shape to use for each node so that it carries a visual information about the nature of what it represents. Another concerns the arcs linking the nodes, that can also contain visual clues, such as directionality, sign, type of influence etc. All this must be encoded in some way, either semantically (a code identifying the type of glyphs, from an agreed-up list of codes), or graphical (embedding an image or describing the node).

Where – Once the glyphs are chosen, one needs to place them. The relative position of the information should not always carry much information, but there are some cases where it must, e.g. members of complexes, inclusion in compartments etc. And there is no denying that the relative position of glyphs is also used to convey more subjective information. For instance a linear chain of reactions induce the idea of a flow, much better than a set of reactions going randomly up and down, right and left. Another unwritten convention it to represent membrane signal transduction on the top of the maps, with the “end-result”, often effect on gene expression, at the bottom, with the idea of a cascading flux of information. The coordinates of the the glyphs must then be shared as well.

How – Finally, the impact of a visual representation also depends on aesthetic factors. The relative size of glyphs and labels, thickness of arcs, the colours, shades and textures, all influence the facility with which viewers absorb the information contained in a map. Relying on such aspects to interpret the meaning of a map should be avoided, in particular if the map is to be shared between different media, where rendering could affect the final aspect. But wanting to keep this aspect as close as possible makes sense.

A bit of history

Different formats have been developed over the years to cover these different aspects with different accuracy and constraints. In order to understand why we have such a variety of description formats on offer, a bit of history might be useful. Being able to encode graphical representation of models in SBML was mentioned as early as 2000 (Andrew Finney. Possible Extensions to the Systems Biology Markup Language. 27 November 2000.).

In 2002, the group of Hiroaki Kitano presented a graphical editor for the Systems Biology Markup Language (SBML, Hucka et al 2003), called SBedit, and proposed extensions to SBML necessary for encoding maps (Tanimura et al. Proposal for SBEdit’s extension of SBML-Level-1. 8 July 2002). This software latter became CellDesigner (Funahashi et al 2003), a full-featured modelling developing environment, using SBML as its native format. All graphical information is encoded in CellDesigner-specific annotations, using the SBML extension system. In addition to the layout (the where), CellDesigner proposed a set of standardised glyphs to use for representing different types of molecular entities and different relationships (the what) (Kitano et al 2003). At the same time, Herbert Sauro developed an extension to SBML to encode the maps designed in the software JDesigner (Herbert Sauro. JDesigner SBMLAnnotation. 8 January 2003). Both CellDesigner and JDesigner annotations could also encode the appearance of glyphs (how).

In 2003, Gauges et al (Gauges et al. Including Layout information in SBML files. 13 May 2003) proposed to split the description of the layout (the where) and the rendering (the what and the how), and to focus on the layout part in SBML (Gauges et al 2006). Eventually, this effort led to the development of two SBML Level 3 Packages, Layout (Gauges et al 2015) and Render (Bergmann et al 2017).

Once the SBML Layout annotations were finalised, the SBML and BioPAX communities came together to standardise visual representations for biochemical pathways. This led to the Systems Biology Graphical Notation, as set of three standard graphical languages with agreed upon symbols and rules to assemble them (the what, Le Novère et al 2009). While the shape of SBGN glyphs determine their meaning, neither their placement in the map nor their graphical attributes (colour, texture, edge thickness, the how) affect the map semantics. SBGN maps are ultimately images and can be exchanged as such, either in bitmaps or vector graphics. They are also graphs and can be exchanged using graph formats, such as GraphML. However, it was felt that sharing and editing SBGN maps would be much easier if more semantics was encoded rather than graphical details. This led to the development of SBGN-ML (van Iersel et al 2012), which not only encode the SBGN part of SBGN maps, but also the layout and size of graph elements.

So we have at least three solutions to encode biochemical maps using XML standards from the COMBINE community (Hucka et al 2015): 1) SBGN-ML, 2) SBML with Layout extension (controlled Layout annotations in Level 2 and Layout package in Level 3) and 3) SBML with proprietary extensions. Regarding the latter, we will only consider CellDesigner, for two reasons. Firstly, CellDesigner is the most used graphical model designer in systems biology (at the time of writing, the articles describing the software have been cited over 1000 times). Secondly, CellDesigner’ SBML extensions are used in other software tools.  These solutions are not equivalent, they present different advantages and disadvantages, and round-tripping is in general not possible.


Curiously, despite its name, SBGN-ML does not explicitly describe the SBGN part of the maps (the what). Since the shape of nodes is a standard, it is only necessary to mention their type, and any supporting software will know which symbol to use. For instance, SBGN-ML will not specify that a protein X must be represented with a round-corner rectangle. It will only say that there is a macromolecule X at a certain position with given width and height. Any SBGN-supporting software must know that a macromolecule is represented by a round-corner rectangle. The consequence is that SBGN-ML cannot be used to encode maps using non-SBGN symbols. However, software tools can decide to use different symbols attributed to a given class of SBGN objects during the rendering of the maps. Instead of using a round-corner rectangle each time the class of a glyph is macromolecule, it could use a star. The resulting image would not be an SBGN map. But if modified, and saved back in SBGN-ML, it could be recognised by another supporting software. Such a behaviour is not to be encouraged if we want people to get used to SBGN symbols, but it provides a certain level or interoperability.

What is explicitly described in SBGN-ML instead are the parts that are not regulated by SBGN itself, but are specific to the map. That include the size of the glyphs (bounding box), the textual labels, as well as the positions of glyphs (the where). SBGN-ML currently does not encode rendering properties such as text size, colours and textures (the how). But the language provides an element extension, analogous to the SBML annotation, that allows to augment the language. One can use this element to extend each glyph, or to encode styles, and the community started to do so in an agreed-upon manner.

Note that SBGN-ML only encodes the graph. While there is a certain amount of biological semantics, linked to the identity of the glyphs, it is not a general purpose format that would encode advanced semantic of regulatory features, such as BioPAX (Demir et al. 2010),  or mathematical relationships such as SBML. However, users can distribute SBML files along SBGN-ML files, for instance in a COMBINE Archive (Bergmann et al 2014). Unfortunately, there is currently no blessed way to map an SBML element, such as a particular species, to a given SBGN-ML glyph.

SBML Level 3 + Layout and Render packages

As we mentioned before, SBML Level 3 provides two packages helping with the visual representations of networks: Layout (the where) and Render (the how). Contrarily to SBGN-ML, which is meant to describe maps in a standard graphical notation, the SBML Level 3 packages do not restrict the way one represents biochemical networks. This provides more flexibility to the user, but decreases the “stand-alone” semantics content of the representations. I.e. if non-standard symbols are used, their meaning must be defined in an external legend. It is of course possible to use only SBGN glyphs to encode maps. The visual rendering of such a file will be SBGN, but the automatic analysis of the underlying  format will be harder.

The SBML Layout package permits to encode the position of objects, points, curves and bounding boxes. Curves can have complex shapes, encoded as Béziers curves. The package allows to distinguish between different general types of nodes such as compartments, molecular species, reactions and text. However, there is little biological semantics encoded by the shapes, either regarding the nodes (e.g. nothing distinguishes a simple chemical from a protein) or the edges (one cannot distinguish an inhibition from a stimulation). In addition, the SBML Render package permits to define styles that can be applied to types of glyphs. This includes colours and gradients, geometric shapes, properties of text, lines, line-endings etc. Render can encode a wide variety of graphical properties, and pave the gap to generic graphical formats such as SVG.

If we are trying to visualise a model, one advantage of using SBML packages is that all the information is included in a single file, providing an easy mapping between the model constructs and their representation. This goes a long way to solve the issue of the biological semantics mentioned above, since it can be retrieved from the SBML Core elements, linked to the Layout elements. Let’s note that while SBML Layout+Render do not encode the nature of the objects represented by the glyphs (the what) using specific structures, this can be retrieved via the attributes sboTerm of the corresponding SBML Core elements, using the appropriate values from the Systems Biology Ontology (Courtot et al 2011).

CellDesigner notation

CellDesigner uses SBML (currently Level 2) as its native language. However, it extended it with its own proprietary annotation, keeping the SBML perfectly valid (which is also the way software tools such as JDesigner operate). Visually, the CellDesigner notation is close to SBGN Process Descriptions, having been the strongest inspiration for the community effort. CellDesigner offers an SBGN-View mode, that produce graphs closer to pure SBGN PD.

CellDesigner’s SBML extensions increase the semantics of SBML elements such as molecular species or regulatory arcs,  in a way not dissimilar to SBGN-ML. In addition, it provides a description of each glyph linked to the SBML elements, covering the ground of SBML Layout and Render. The SBML extensions being specific to CellDesigner, they do not offer the flexibility of SBML Render. However, the limited spectrum of possibility might makes the support easier.

CellDesigner notation SBML Layout+Render SBGN-ML
Encode the what
Encode the where
Encode the how
Contain the mathematical model part
Writing supported by more than 1 tool
Reading supported by more than 1 tool
Is a community standard

Examples of usages and conversions

Now let’s see the three formats in action. We start with SBGN-ML. First, we can load a model, for instance from BioModels (Chelliah et al 2015), in CellDesigner (version 4.4 at the time of writing). Here we will use the model BIOMD0000000010, an SBML version of  the MAP kinase model described in Kholodenko et al (2000).


From an SBML file that does not contain any visual representation, CellDesigner created one using its auto-layout functions. One can then export an SBGN-ML file. This SBGN-ML file can be imported for instance in Cytoscape (Shannon et al.  2003) 2.8 using the CySBGN plugin (Gonçalves et al 2013).


The position and size of nodes are conserved, but edges have different size (and the catalysis glyph is wrong). The same SBGN-ML file can be open in the online SBGN editor Newt.


An alternative to CellDesigner to produce the SBGN-ML map could be Vanted (Junker et al 2006, version 2.6.4 at the time of writing). Using the same model from BioModels, we can auto-layout the map (we used the organic layout here) and then convert the graph to SBGN using the SBGN-ED plugin (Czauderna et al 2010).


The map can then be saved as SBGN-ML, and as before opened in Newt.


The positions of the nodes are conserved. But the connection of edges is a bit different. In that case, Newt is slightly more SBGN compliant.

Now, let’s start with a vanilla SBML file. We can import  our BIOMD0000000010 model in COPASI  (Hoops et al 2006, version 4.22 at the time of writing). COPASI now offers auto-layout capabilities, with possibilities of manually editing the resulting maps.


Now, when we’ll export the model in SBML, it will contain the map encoded with the Layout and Render packages. When the model is uploaded in any software tool supporting the packages, we will retrieve the map. For instance, we can use the SBML Layout Viewer. Note that if the layout is conserved, it is not the case of the rendering.


Alternatively, we can load the model to CellDesigner, and manually generate a nice map (NB: a CellDesigner plugin that can read SBML Layout was implemented during Google Summer of Code 2014 . It is part of the JSBML project).


We can create an SBML Layout using CellDesigner layout converter. When we import the model in COPASI we can visualise the map encoded in Layout. NB: the difference of appearance here is due to a problem in CellDesigner converter, not COPASI.


The same model can be loaded in the SBML Layout Viewer.


How do I choose between the formats?

There is unfortunately no unique solution at the moment. The main question one has to ask is what do we want to do with the visual maps?

Are they meant to be a visual representation of an underlying model, the model being the important part, that needs to be exchanged? If that is the case, SBML packages or CellDesigner notation should be used.

Does the project mostly/only involves graphical representations, and those must be exchanged? CellDesigner or SBGN-ML would therefore be better.

Does the rendering of graphical elements matter? In that case, SBML packages or CellDesigner notations are currently better (but that is going to change soon).

Is standardisation important for the project, in addition to immediate interoperability? If yes, SBML packages or SBGN-ML would be the way to go.

All those questions and more have to be clearly spelled out at the beginning of a project. The answer will quickly emerge from the answers.


Thanks to Frank Bergmann, Andreas Dräger, Akira Funahashi, Sarah Keating, Herbert Sauro for help and corrections.


Bergmann FT, Adams R, Moodie S, Cooper J, Glont M, Golebiewski M, Hucka M, Laibe C, Miller AK, Nickerson DP, Olivier BG, Rodriguez N, Sauro HM, Scharm M, Soiland-Reyes S, Waltemath D, Yvon F, Le Novère N (2015) COMBINE archive and OMEX format: one file to share all information to reproduce a modeling project. BMC Syst Biol 15, 369. doi:10.1186/s12859-014-0369-z

Bergmann FT, Keating SM, Gauges R, Sahle S, Wengler K (2017) Render, Version 1 Release 1. Available from COMBINE <>

Chelliah V, Juty N, Ajmera I, Raza A, Dumousseau M, Glont M, Hucka M, Jalowicki G, Keating S, Knight-Schrijver V, Lloret-Villas A, Natarajan K, Pettit J-B, Rodriguez N, Schubert M, Wimalaratne S, Zhou Y, Hermjakob H, Le Novère N, Laibe C (2015)  BioModels: ten year anniversary. Nucleic Acids Res 43(D1), D542-D548. doi:10.1093/nar/gku1181

Courtot M, Juty N, Knüpfer C, Waltemath D, Zhukova A, Dräger A, Dumontier M, Finney A, Golebiewski M, Hastings J, Hoops S, Keating S, Kell DB, Kerrien S, Lawson J, Lister A, Lu J, Machne R, Mendes P, Pocock M, Rodriguez N, Villeger A, Wilkinson DJ, Wimalaratne S, Laibe C, Hucka M, Le Novère N. Controlled vocabularies and semantics in Systems Biology. Mol Syst Biol  7, 543. doi:

Czauderna T, Klukas C, Schreiber F (2010) Editing, validating and translating of SBGN maps. Bioinformatics 26(18), 2340-2341. doi:10.1093/bioinformatics/btq407

Demir E, Cary MP, Paley S, Fukuda K, Lemer C, Vastrik I, Wu G, D’Eustachio P, Schaefer C, Luciano J, Schacherer F, Martinez-Flores I, Hu Z, Jimenez-Jacinto V, Joshi-Tope G, Kandasamy K, Lopez-Fuentes AC, Mi H, Pichler E, Rodchenkov I, Splendiani A, Tkachev S, Zucker J, Gopinathrao G, Rajasimha H, Ramakrishnan R, Shah I, Syed M, Anwar N, Babur O, Blinov M, Brauner E, Corwin D, Donaldson S, Gibbons F, Goldberg R, Hornbeck P, Luna A, Murray-Rust P, Neumann E, Ruebenacker O, Samwald M, van Iersel M, Wimalaratne S, Allen K, Braun B, Carrillo M, Cheung KH, Dahlquist K, Finney A, Gillespie M, Glass E, Gong L, Haw R, Honig M, Hubaut O, Kane D, Krupa S, Kutmon M, Leonard J, Marks D, Merberg D, Petri V, Pico A, Ravenscroft D, Ren L, Shah N, Sunshine M, Tang R, Whaley R, Letovksy S, Buetow KH, Rzhetsky A, Schachter V, Sobral BS, Dogrusoz U, McWeeney S, Aladjem M, Birney E, Collado-Vides J, Goto S, Hucka M, Le Novère N, Maltsev N, Pandey A, Thomas P, Wingender E, Karp PD, Sander C, Bader GD  (2010) The BioPAX Community Standard for Pathway Data Sharing. Nat Biotechnol, 28, 935–942. doi:10.1038/nbt.1666

Funahashi A, Morohashi M, Kitano H, Tanimura N (2003) CellDesigner: a process diagram editor for gene-regulatory and biochemical networks. Biosilico 1 (5), 159-162

Gauges R, Rost U, Sahle S, Wegner K (2006) A model diagram layout extension for SBML. Bioinformatics 22(15), 1879-1885. doi:10.1093/bioinformatics/btl195

Gauges R, Rost U, Sahle S, Wengler K, Bergmann FT (2015) The Systems Biology Markup Language (SBML) Level 3 Package: Layout, Version 1 Core. J Integr Bioinform 12(2), 267. doi:10.2390/biecoll-jib-2015-267

Gonçalves E, van Iersel M, Saez-Rodriguez J (2013) CySBGN: A Cytoscape plug-in to integrate SBGN maps. BMC Bioinfo 14, 17. doi:10.1186/1471-2105-14-17

Hoops S, Sahle S, Gauges R, Lee C, Pahle J, Simus N, Singhal M, Xu L, Mendes P, Kummer U (2006) COPASI-a COmplex PAthway SImulator. Bioinformatics 22(24), 3067-3074. doi:10.1093/bioinformatics/btl485

Hucka M, Bolouri H, Finney A, Sauro HM, Doyle JC, Kitano H, Arkin AP, Bornstein BJ, Bray D, Cornish-Bowden A, Cuellar AA, Dronov S, Ginkel M, Gor V, Goryanin II, Hedley WJ, Hodgman TC, Hunter PJ, Juty NS, Kasberger JL, Kremling A, Kummer U, Le Novère N, Loew LM, Lucio D, Mendes P, Mjolsness ED, Nakayama Y, Nelson MR, Nielsen PF, Sakurada T, Schaff JC, Shapiro BE, Shimizu TS, Spence HD, Stelling J, Takahashi K, Tomita M, Wagner J, Wang J (2003) The Systems Biology Markup Language (SBML): A Medium for Representation and Exchange of Biochemical Network Models. Bioinformatics, 19, 524-531. doi:10.1093/bioinformatics/btg015

Hucka M, Nickerson DP, Bader G, Bergmann FT, Cooper J, Demir E, Garny A, Golebiewski M, Myers CJ, Schreiber F, Waltemath D, Le Novère N (2015) Promoting coordinated development of community-based information standards for modeling in biology: the COMBINE initiative. Frontiers Bioeng Biotechnol 3, 19. doi:10.3389/fbioe.2015.00019

Junker BH, Klukas C, Schreiber F (2006) VANTED: A system for advanced data analysis and visualization in the context of biological networks. BMC Bioinfo 7, 109. doi:10.1186/1471-2105-7-109

Kholodenko BN (2000) Negative feedback and ultrasensitivity can bring about oscillations in the mitogen-activated protein kinase cascades. Eur J Biochem.267(6), 1583-1588. doi:10.1046/j.1432-1327.2000.01197.x

Kitano H (2003) A graphical notation for biochemical networks. Biosilico 1 (5), 169-176. doi:10.1016/S1478-5382(03)02380-1

Le Novère N, Hucka M, Mi H., Moodie S, Shreiber F, Sorokin A, Demir E, Wegner K, Aladjem M, Wimalaratne S, Bergman FT, Gauges R, Ghazal P, Kawaji H, Li L, Matsuoka Y, Villéger A, Boyd SE, Calzone L, Courtot M, Dogrusoz U, Freeman T, Funahashi A, Ghosh S, Jouraku A, Kim S, Kolpakov F, Luna A, Sahle S, Schmidt E, Watterson S, Goryanin I, Kell DB, Sander C, Sauro H, Snoep JL, Kohn K, Kitano H (2009) The Systems Biology Graphical Notation. Nat Biotechnol 27, 735-741. doi:10.1038/nbt.1558

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramge D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks.  Bioinformatics 13, 2498-2504. doi:10.1101/gr.1239303

van Iersel MP, Villéger AC, Czauderna T, Boyd SE, Bergmann FT, Luna A, Demir E, Sorokin A, Dogrusoz U, Matsuoka Y, Funahashi A, Aladjem MI, Mi H, Moodie SL, Kitano H, Le Novère N, Schreiber F (2012) Software support for SBGN maps: SBGN-ML and LibSBGN. Bioinformatics 28, 2016-2021. doi:10.1093/bioinformatics/bts270

Setting things straight

If you are reading that, you are probably aware that I have been convicted and sentenced for a crime related to accessing indecent images on the internet. Following the sentence, there was quite a lot of media exposure, in particular follow-up articles in the Cambridge News and the BBC. These reports were quite misleading, misusing dates, and choosing words in order to maximise the shock effect.

I hoped that in time those reports would wane, but this was a pipe dream. I also cannot expect people to forget what they read. Although I cannot fight the free press, I can communicate on my own. This post aims at giving precisions on the what, when and how, and setting things straight.

First of all, I want to assure any reader that I am not seeking to minimise my crime, or trying to find excuses for it. I made mistakes, horrible ones, and I am paying for them. A side effect of the story is that it made me think deeply about what those crimes are, why and how they are committed.

Back to the facts. In August 2017, the police seized 14 electronic devices from my home, including computers, phones and hard-drives. In May 2018, I was summoned for an interview under caution at the Peterborough police station. The police gave me back 13 of the devices, that were found clean.

On an old Iomega hard disk, that I used as a home backup until a few years ago, they found two indecent moving images. The images were in the cache of a peer to peer software called Limewire. They were not readable, and were identified through digital forensic investigation. I was therefore charged for making images but not possession (i.e. I did not have these images and could not give them to anyone). Note that, of course, making an image is not at all the same as taking a picture. In the present case, it means creating a computer file of an image.

The police determined that the two files were opened only once, at the date of creation in May 2008, and never again. This was the charge for which I was sentenced. The police told me at the time that the offense was “at the lowest end of the scale”, and that I would probably only get a fine.

Now, during the interview under caution in May 2018, I was asked when I started using the Limewire software. I answered “in the early 2000s for downloading music”. I have since determined that I started using Limewire sometimes in 2005 (in fact, I only had broadband at home from end of 2003).

During the interview, the police also asked me about my use of chatrooms and forums. At one point, I was asked when I last saw an indecent image in those chatrooms. I answered “about a year ago”. Now, the specific chatrooms we were talking about were not illegal, or dedicated to the kind of material leading to the interview. And one does not choose what to see in forums and chatrooms, which is why people are not prosecuted for that (*).

These facts were then transformed through a Chinese whisper game. The police tape (that was never heard after the interview) led to a written report transmitted to the prosecution, which ordered a pre-sentencing report. This pre-sentencing report was sent back to the prosecution (different judges ordered the report and sentenced me). And finally, all of this was described by the judge in front of journalists who took hand-written notes.

As a consequence, the facts described above became, in the news reports, downloading child pornography between 2000 and 2016 and having two indecent moving images. Again, I am not trying to minimise my crime or finding excuses for it. But both these statements were untrue. And I believe the portrait those articles made of me instilled a climate of shock and fear that is not foreign to some reactions I am, and my family is, still suffering from.

I was sentenced in Cambridge crown court on 4th September. The judge was severe, and the sentence unexpectedly strong, taking solicitor, barrister, the police, probation services, and of course myself, by surprise. I was sentenced to 8 months custody, suspended for 2 years. During this period, I will be monitored by the probation services. I have also to complete a 30 days rehabilitation activity. This sentence will be spent after 4 years and 8 months.

In parallel to the sentence, I have to be on the sex offender registry for 10 years. In brief, this means that the police needs to know where I am at any time, plus a few other things. I also received a Sexual Harm Prevention Order, which forbids me to hide or delete my internet history and allows the police to monitor my electronic devices at any time without a warrant. However, this SHPO does not apply to electronic devices provided by employers.

Finally, I was fired from my job at the Babraham Institute and expelled from any board, panel, collaboration, community I was part of.

So, I made mistakes, and I am paying for them, dearly. I need to atone, and I will. There is no need to inflate my errors, and mislead people around me into fearing a monstrous predator.

(*) It is estimated that half a million adult UK males viewed such material in the past, which points to a societal problem that is large and complex, and is unlikely to be solved by casting out a few individuals. In other words, if you know 60 adult males, then it very likely one of them accessed this type of content. If you work in an institute employing 180 adult males, then 3 of them could have done so. And the proportion increases with the use of internet.

So, is it rare? No. Is it “normal”? No! Should the society deal with the problem? Yes. Is there redemption for the 1.7% of the population? I would hope so.

The Stop It Now campaign from the Lucy Faithfull foundation is a good source of information and help. If you are accessing this type of material, or thinking about it, you can contact them.