Batch correcting using part of a dataset – illustration with cell-types in scRNA-seq

We all know that one of the main sources of variability in molecular measurements are batch effects. We perform the “same” experiment twice, using the same protocol and the same piece of kit. We believe we control everything and eliminate all sorts of unwanted perturbations. But at the end of day, everything is different, luminosity, air pressure, what we ate at lunch etc. Fortunately, we can (sometimes, and partially) correct these effects down the line. Now, what happens if we want to correct the batch effects for only part of a dataset? When I needed to, I had to dig quite a bit to find out how we can do so. I thought the trick could be useful to others (why we would like to do so will become clear latter).

NB: I use the example of single-cell RNA-seq in this post. However, the approach can of course be applied to all kind of data.

First, let’s look at the batch correction applied to an entire dataset. I will use scRNA-seq data, generated with the Smart-Seq protocol. The details of the sample preparation and the generation of the datasets is not relevant for the current post. So I will start directly with the count tables. The datasets haver been cleaned though. Counts have been corrected for library sizes (Count Per Million reads) and “bad” cells have been removed (cells with lot fewer reads than most, here less than 500000, and cells whose reads mapped to less genes, here 9000). I then removed all the genes that did not show at least 10 CPM in at least one cell, and genes that did not vary by at least two-fold across the entire datasets. Finally, I took the log of the counts.  Other approaches can be used, such as DESeq2’s rlog. It does not matter here.  Let’s load the counts.

# load counts
cpm1 <- read.table("OneCellType.csv", 
                   sep=",",fill=T,
                   header=T,row.names=1)

We can look at the resulting table. Columns are cells. The first part of the name indicate the name of the 96-well plate while the second part are the coordinates of the well. We have two batches, one composed of 1 plate, EPI1, and the other composed of 2 plates, EPI2 and EPI3. In total, we have 232 cells and 24235 genes.

cpm1

As we can see, many many cells show zeros, as expected with scRNA-seq. But no row is made up entirely of zeros. Let’s visualise the dataset with Principal Component Analysis.

# transpose the count matrix; needed for PCA
tcpm1<-t(cpm1)

# Run the PCA
PCA1<-prcomp(tcpm1)

# Provide the variance of components
library(factoextra)
eigen1<-get_eig(PCA1)  

# increase the left margin to avoid cropping the labels
par(mar=c(5.1, 5.1, 4.1, 2.1))

# Colour by plates. 
colour<-0
colour[grep("EPI1",colnames(cpm1))]<-rgb(230/255,159/255,0/255)
colour[grep("EPI2",colnames(cpm1))]<-rgb(86/255,180/255,233/255)
colour[grep("EPI3",colnames(cpm1))]<-rgb(0/255,158/255,115/255)

# Plot PC1 and PC2 (the first coordinates from PCA1$x)
plot(PCA1$x,col=colour,pch=16,
     cex=1.5,cex.lab=1.5,cex.axis=1.5,
     xlab=sprintf("PC1 %.1f %%",eigen1[1,2]),
     ylab=sprintf("PC2 %.1f %%",eigen1[2,2]))   

1celltypenocorr

The resulting plot does not exhibit much structure, and the variance landscape is very flat, the first 2 components showing only about 5% of it. This is all good since all those cells are supposed to belong to the same cell-type. However,  we can see that while EPI2 and EPI3 (blue and green) 0 the two plates processed in the same batch – overlap nicely, EPI1 (in orange) is off. And this batch effect aligns perfectly with PC1. Yes, the most important source of variability (albeit small) is the batch! So we need to correct for it. To do that, we will use the function ComBat from the Bioconductor package sva.

First we create a table that link our cells with the batches.

# load the package sva
library(sva)

# create a table with the cells and the batches they belong to
cells<-data.frame(batch = c(rep("b1",ncol(cpm1[,grep("EPI1",colnames(cpm1))])),
                            rep("b2",ncol(cpm1[,grep("EPI2",colnames(cpm1))])),
                            rep("b2",ncol(cpm1[,grep("EPI3",colnames(cpm1))]))),
                  row.names = colnames(cpm1))

cells

Since all cells belong to the same cell-type, we have only one column, linking each cell to its batch. We can now run the batch correction itself. Since we have only one variable in the model, we do not let anything out (~1). Then we replot the PCA.

# model to use in the batch correction
modcombat = model.matrix(~1,data=cells)
bcor_cpm1 = ComBat(dat=cpm1,batch=cells$batch,
                   mod=modcombat,
                   par.prior=TRUE, prior.plots=FALSE)

# redo the PCA
tbcpm1<-t(bcor_cpm1)
PCAb1<-prcomp(tbcpm1)
eigenb1<-get_eig(PCAb1)

plot(PCAb1$x,col=colour,pch=16,
     cex=1.5,cex.lab=1.5,cex.axis=1.5,
     xlab=sprintf("PC1 %.1f %%",eigenb1[1,2]),
     ylab=sprintf("PC2 %.1f %%",eigenb1[2,2]))

1celltypecorr

Success! Now, all three plates, from the first and second batch, are merged together.  Note the decrease of variance associated with PC1, from 3.1 % to 2.5 %. This was expected since the shift of EPI1 compared with EPI2 and EPI3 was along PC1 in the first place. Now, this is important. If the feature were were looking to analyse also aligned with PC1, we would have thrown the baby out with the bathwater.  Fortunately, in our case, the interesting stuff aligned with PC2. And PC2 is almost not affected by the batch correction. Of course you cannot know that in advance. It required a series of iterative analyses to become aware of it. As a rule, the more orthogonal the feature is with the batch effect, the less it will be affected by the batch correction.

Now, that was easy since all the cells belonged to the same cell-type. In fact, in the real study, that was not quite the case. The first batch contained one cell-type, while the second batch contained two cell types. So let’s load the complete dataset.

# load counts
cpm2 <- read.table("TwoCellTypes.csv", 
                   sep=",",fill=T,
                   header=T,row.names=1)

We have initially 326 cells and 25299 genes. In addition to the plates of EPI cells, we have now a plate of LPM cells (what EPI and LPM mean really does not matter here). We can now run a PCA again.

# transpose data for PCA
tcpm2_clean<-t(cpm2)

# Principal Component Analysis
PCA2<-prcomp(tcpm2)
eigen2<-get_eig(PCA2) 

colour<-0
colour[grep("EPI1",colnames(cpm2))]<-rgb(230/255,159/255,0/255)
colour[grep("EPI2",colnames(cpm2))]<-rgb(86/255,180/255,233/255)
colour[grep("EPI3",colnames(cpm2))]<-rgb(0/255,158/255,115/255)
colour[grep("LPM",colnames(cpm2))]<-rgb(0/255,0/255,0/255)

plot(PCA2$x,col=colour,pch=16,cex=1.5,
     cex.lab=1.5,cex.axis=1.5,
     xlab=sprintf("PC1 %.1f %%",eigen2[1,2]),
     ylab=sprintf("PC2 %.1f %%",eigen2[2,2]))

2celltypesnocorr

We can see that again, while EPI2 and EPI3 are together, the cloud of EPI1 cell is slightly shifted upward. The plate composed of cells belonging to another cell type, plotted in black, is clearly different from the three EPI ones. Let’s try to batch correct to bring together EPI1, EPI2 and EPI3. The procedure is absolutely identical as the previous one, except we now integrate the new plate, resulting in 3 plates associated with batch b2.

# create a table with the cells and the batches they belong to
cells2<-data.frame(batch = c(rep("b1",ncol(cpm2[,grep("EPI1",colnames(cpm2))])),
                    rep("b2",ncol(cpm2[,grep("EPI2",colnames(cpm2))])),
                    rep("b2",ncol(cpm2[,grep("EPI3",colnames(cpm2))])),
                    rep("b2",ncol(cpm2[,grep("LPM1",colnames(cpm2))]))),
                    row.names = colnames(cpm2))

modcombat = model.matrix(~1,data=cells2)
bcor_cpm2 = ComBat(dat=cpm2,batch=cells2$batch,
                   mod=modcombat,
                   par.prior=TRUE, prior.plots=FALSE)

tbcpm2<-t(bcor_cpm2)
PCAb2<-prcomp(tbcpm2)
eigenb2<-get_eig(PCAb2)

plot(PCAb2$x,col=colour,pch=16,cex=1.5,cex.lab=1.5,cex.axis=1.5,
     xlab=sprintf("PC1 %.1f %%",eigenb2[1,2]),ylab=sprintf("PC2 %.1f %%",eigenb2[2,2]))

2celltypeswrongcorr

Fail! EPI1 from batch1 has now been batch-corrected taking into account all three plates of batch b2, which include the LPM cells. As a result, EPI1 cells end up located somewhere in between EPI2/3 and LPM1. Clearly this is wrong. What we want is putting together EPI1 and EPI2/3, ignoring LPM1. We can do that without a problem, by declaring the cell-types as a variable of interest, that we will then ignore. But first we need to do something. The initial clean-up was performed on the entire dataset. Even if gene expression varied throughout the dataset, we could still have genes that do not vary either across EPI cells or across LPM cells. That would cause newest versions of ComBat to throw errors. So let’s remove the culprit genes.

# sanity check on variance. remove genes which
# variance is 0 either in EPI or in LPM
varEPI <- apply(cpm2[,grep("EPI",colnames(cpm2))], 1, var)
varLPM <- apply(cpm2[,grep("LPM",colnames(cpm2))], 1, var)
cpm2 <- cpm2[-which(varEPI == 0 | varLPM == 0 ),]

We still have 20885 genes to play with, which is largely enough. Now we will create a table linking the cells and the batches, as before, but in addition add the cell-type they belong to.

# create a table with the cells and the batches they belong to
cells3<-data.frame(batch = c(rep("b1",ncol(cpm2[,grep("EPI1",colnames(cpm2))])),
                            rep("b2",ncol(cpm2[,grep("EPI2",colnames(cpm2))])),
                            rep("b2",ncol(cpm2[,grep("EPI3",colnames(cpm2))])),
                            rep("b2",ncol(cpm2[,grep("LPM1",colnames(cpm2))]))),
                   celltype = c(rep("epi",ncol(cpm2[,grep("EPI",colnames(cpm2))])),
                              rep("lpm",ncol(cpm2[,grep("LPM",colnames(cpm2))]))),
                   row.names = colnames(cpm2))

cells3_1

cells3_2

We can now instruct ComBat to correct for the batch but taking into account the cell-type in the model (~cells3$celltype).

modcombat = model.matrix(~cells3$celltype,data=cells3)
bcor_cpm3 = ComBat(dat=cpm2,batch=cells3$batch,
                   mod=modcombat,
                   par.prior=TRUE, prior.plots=FALSE)

And we run the PCA again.

tbcpm3<-t(bcor_cpm3)
PCAb3<-prcomp(tbcpm3)
eigenb3<-get_eig(PCAb3)

plot(PCAb3$x,col=colour,pch=16,cex=1.5,
     cex.lab=1.5,cex.axis=1.5,
     xlab=sprintf("PC1 %.1f %%",eigenb3[1,2]),
     ylab=sprintf("PC2 %.1f %%",eigenb3[2,2]))

2celltypesgoodcorr

Hooray! Now we have EPI1, EPI2 and EPI3 together, regardless of the batch, while LPM cells stand quietly to the side. We can export those coordinates for future use (e.g. clustering).

write.csv(bcor_cpm3, file="batchCorrected.cvs",quote=FALSE)

DVDs I sell on eBay

Here is a list of the DVDs I am currently selling on eBay. The list is updated continuously so come back if you did not see something you wanted yet. All I sell can also be seen directly on my seller page.

NB1: The descriptions I provided for each book are mostly not mine, but gathered from book back-covers, their website etc.

NB2: I only mention postage costs for the UK in the eBay listings. However, providing that the proper costs are covered, I am happy to send them anywhere.

Adaptation., Nicolas Cage, Meryl Streep

front[eBay link]

A lovelorn screenwriter becomes desperate as he tries and fails to adapt ‘The Orchid Thief’ by Susan Orlean for the screen.

 

Director: Spike Jonze

Writers: Susan Orlean (book), Charlie Kaufman (screenplay)

Along came Polly, 2004, Ben Stiller, Jennifer Aniston

front[eBay link]

A buttoned up newlywed finds his too organized life falling into chaos when he falls in love with an old classmate.

Director:John Hamburg

Writer:John Hamburg

Angus, Thongs and perfect Snogging

front[eBay link]

2008

Arthur and the invisibles, from Luc Besson

front[eBay link]

Ten-year-old Arthur, in a bid to save his grandfather’s house from being demolished, goes looking for some much-fabled hidden treasure in the land of the Minimoys, tiny people living in harmony with nature.

Director:Luc Besson

Bananas, Woody Allen

front[eBay link]

When a bumbling New Yorker is dumped by his activist girlfriend, he travels to a tiny Latin American nation and becomes involved in its latest rebellion.

 

Director:Woody Allen

Bewitched, 2006, Nicolas Kidman, Will Ferrell

front[eBay link]

Thinking he can overshadow an unknown actress in the part, an egocentric actor unknowingly gets a witch cast in an upcoming television remake of the classic sitcom Bewitched (1964).

 

Director:Nora Ephron

Embrassez qui vous voudrez (Summer Things), 2004, Charlotte Rampling, Jacques Dutronc

front[eBay link]

Two couple of friends, one very rich the other almost homeless, decides to go on Holiday. Julie, a single mother, joins them too. Once at seaside, it starts a complicate love cross among them that will involve also a transsexual, a jealous brother,a Latin Lover and another nervous stressed couple. Not to mention about the daughter of one of them that is secretly in Chicago with one of his father’s employee… More, at the end of the summer all of them will join the same party…

 

Director:Michel Blanc

 

Half Light, Demi Moore

front[eBay link]

Rachel Carlson, a successful novelist moves to a small Scottish village to move on with her life after the death of her son. Strange things start to happen when she is haunted by ghosts and real life terror.

 

Director:Craig Rosenberg

He’s just not that into you, 2009, Barrymore, Connelly, Affleck, Cooper

frong[eBay link]

This Baltimore-set movie of interconnecting story arcs deals with the challenges of reading or misreading human behavior.

Director:Ken Kwapis

 

I Could Never Be Your Woman

cover[eBay link]

Inspector Gadget 2

ebay-front[eBay link]

Walt Disney 2003

 

Little Children

front[eBay link]

The lives of two lovelorn spouses from separate marriages, a registered sex offender, and a disgraced ex-police officer intersect as they struggle to resist their vulnerabilities and temptations in suburban Massachusetts.

Director:Todd Field

Writers:Todd Field (screenplay), Tom Perrotta (screenplay)

Stars:Kate Winslet, Jennifer Connelly, Patrick Wilson

 

Michael Clayton, 2006, George Clooney, Tilda Swindon

front[eBay link]

A law firm brings in its “fixer” to remedy the situation after a lawyer has a breakdown while representing a chemical company that he knows is guilty in a multibillion-dollar class action suit.

 

Director:Tony Gilroy

Writer:Tony Gilroy

My Own Private Idaho, River Phoenix, Keanu Reeves, 2-disk set

front[eBay link]

Two best friends living on the streets of Portland as hustlers embark on a journey of self discovery and find their relationship stumbling along the way.

Director:Gus Van Sant

Writer:Gus Van Sant (screenplay)

The Peculiar Adventures of Hector, 2007

front[eBay link]

This mini series follows the adventures of Hector and his friends on a series of amazingly imaginative adventures on their way to school each morning.

Rachel getting married, Anne Hathaway

front[eBay link]

A young woman who has been in and out of rehab for the past ten years, returns home for the weekend for her sister’s wedding.

Director:Jonathan Demme

Writer:Jenny Lumet

Rire et Chatiment, Jose Garcia, Benoit Poolvoerde

front[eBay link]

Vincent Roméro is an osteopath who makes a point of being funny every minute he is awake. In fact it is his way to remain the center of attention. His antics wind up irritating his wife Camille, a geriatric doctor, to such an extent that she leaves him one day. At first Vincent is so persuaded his many qualities are hard to beat that he can’t doubt she will soon come back. But as she doesn’t, he starts questioning himself. All the more as he realizes, to his dismay, that each time he clowns the people around him start dying of laughter…literally that is!

Director:Isabelle Doval

 

Shaggy and Scooby-Doo get a clue Vol 1

ebay-front[eBay link]

Shaggy and Scooby-Doo get a clue Vol 2

front[eBay link]

Trigger Happy TV season 1, DVD 2006

front[eBay link]

Trigger Happy TV is a hidden camera/practical joke reality television series. The original British edition of the show, produced by Absolutely Productions, starred Dom Joly and ran for two series on the British television channel Channel 4 from 2000 to 2003. Joly made a name for himself as the sole star of the show, which he produced and directed with cameraman Sam Cadman.

 

 

Two brothers, 2004

front[eBay link]

Two tigers are separated as cubs and taken into captivity, only to be reunited years later as enemies by an explorer (Pearce) who inadvertently forces them to fight each other.

 

Director:Jean-Jacques Annaud

Writers:Alain Godard (scenario), Jean-Jacques Annaud (scenario)

 

 

Books I sell on eBay

Here is a list of the books I am currently selling on eBay. The list is updated continuously so come back if you did not see something you wanted yet. Feel free to contact me if you are looking after something you believe I could have but is not below yet. All I sell can also be seen directly on my seller page.

NB1: The descriptions I provided for each book are mostly not mine, but gathered from book back-covers, their website etc.

NB2: I only mention postage costs for the UK in the eBay listings. However, providing that the proper costs are covered, I am happy to send them anywhere.

SYSTEMS BIOLOGY AND MODELLING

Computational Neuroscience: Realistic Modeling for Experimentalists, 2001

Ebay-cover

[eBay link]

Designed primarily as an introduction to realistic modeling methods, Computational Neuroscience: Realistic Modeling for Experimentalists focuses on methodological approaches, selecting appropriate methods, and identifying potential pitfalls. The authors address varying levels of complexity, from molecular interactions within single neurons to the processing of information by neural networks. He avoids theoretical mathematics and provides just enough of the basic math used by experimentalists.

Each chapter covers: the theoretical foundation, parameters needed, appropriate software descriptions, evaluation of the model, future directions expected, examples in text boxes and references.

Systems Biology in Practice: Concepts, Implementation and Application, 2005

Ebay-Cove

[eBay link]

Presenting the main concepts, this book leads students as well as advanced researchers from different disciplines to an understanding of current ideas in the complex field of comprehensive experimental investigation of biological objects, analysis of data, development of models, simulation, and hypothesis generation.

It provides readers with guidance on how a specific complex biological question may be tackled.

The authors demonstrate how mathematical concepts can illuminate the principles underlying biology at a genetic, molecular, cellular and even organism level, and how to use mathematical tools for analysis and prediction.

A first course in the numerical analysis of differential equations, 1996

Ebay-cover[eBay link]

This book presents a rigorous account of the fundamentals of numerical analysis of both ordinary and partial differential equations. The point of departure is mathematical but the exposition strives to maintain a balance among theoretical, algorithmic and applied aspects of the subject. In detail, topics covered include numerical solution of ordinary differential equations by multistep and Runge-Kutta methods; finite difference and finite elements techniques for the Poisson equation; a variety of algorithms to solve large, sparse algebraic systems; and methods for parabolic and hyperbolic differential equations and techniques of their analysis. The book is accompanied by an appendix that presents brief back-up in a number of mathematical topics.

Modeling Neural Development, 2003

Ebay-cover[eBay link]

This is one of the first books to study neural development using computational and mathematical modeling. Modeling provides precise and exact ways of expression, which allow us to go beyond the insights that intuitive or commonsense reasoning alone can yield. Most neural modeling focuses on information processing in the adult nervous system; Modeling Neural Development shows how models can be used to study the development of the nervous system at different levels of organization and at different phases of development, from molecule to system and from neurulation to cognition.

The book’s fourteen chapters follow loosely the chronology of neural development. Chapters 1 and 2 study the very early development of the nervous system, discussing gene networks, cell differentiation, and neural tube development. Chapters 3-5 examine neuronal morphogenesis and neurite outgrowth. Chapters 6-8 study different aspects of the self-organization of neurons into networks. Chapters 9-12 cover refinement of connectivity and the development of specific connectivity patterns. Chapters 13 and 14 focus on some of the functional implications of morphology and development.

Each chapter contains an overview of the biology of the topic in question, a review of the modeling efforts in the field, a discussion in more detail of some of the models, and some perspectives on future theoretical and experimental work.

The science of studying neural development by computational and mathematical modeling is relatively new; this book, as Dale Purves writes in the foreword, “serves as an important progress report” in the effort to understand the complexities of neural development.

Foundations of Systems Biology, Kitano Ed, 2002

Ebay-COVER[eBay link]

The emerging field of systems biology involves the application of experimental, theoretical, and modeling techniques to the study of biological organisms at all levels, from the molecular, through the cellular, to the behavioral. Its aim is to understand biological processes as whole systems instead of as isolated parts. Developments in the field have been made possible by advances in molecular biology–in particular, new technologies for determining DNA sequence, gene expression profiles, protein-protein interactions, and so on. Foundations of Systems Biology provides an overview of the state of the art of the field. The book covers the central topics of systems biology: comprehensive and automated measurements, reverse engineering of genes and metabolic networks from experimental data, software issues, modeling and simulation, and system-level analysis.

This book is a piece of history, witnessing the birth of what would become a vibrant community. See in particular the chapter on SBW and SBML. The state of the book is very good, except for a repair of the cover, that became partially detached.

Continuous Systems Modeling, François Cellier, 1991

Ebay-cover[eBay link]

Continuous System Modeling describes systematically and methodically how mathematical models of processes that change continuously over teim are constructed. The end product is usually a set differential or difference equations.

Continuous System Modeling focuses on the derivation of those equation.
Application areas include electrical circuitry, mechanical systems, chemical reaction kinetics, population dynamics, and ecology, as well as social and economical systems. The methodologies presented to deal with these diverse application areas are manifold. Bond graphs and systems dynamics are just two of the modeling techniques presented. Four chapters deal with the process of modeling itself. In these chapters, the mechanisms of human modeling and understanding are investigated and modeled.
Continuous System Modeling is a highly computer-oriented text. Software tools that implement the various methodologies are introduced alongside the applications and conceptual tools. Homework problems, suggestions for term projects, and open research questions conclude every chapter to deepen the understanding of the student and increase his or her motivation.
Continuous System Modeling is the first text of its kind. The book has been written to accompany senior and graduate students enrolled in a modeling class, but it may also serve as a reference for modeling and simulation practitioners.
755 pages

Computational Modeling Of Genetic And Biochemical Networks, Bower and Bolouri Ed, 2000

A piece of history, witnessing the birth of Systems Biology
“The advent of ever more sophisticated molecular manipulation techniques has made it clear that cellular systems are far more complex and dynamic than previously thought. At the same time, experimental techniques are providing an almost overwhelming amount of new data. It is increasingly apparent that linking molecular and cellular structure to function will require the use of new computational tools. This book provides specific examples, across a wide range of molecular and cellular systems, of how modeling techniques can be used to explore functionally relevant molecular and cellular relationships. The modeling techniques covered are applicable to cell, developmental, structural, and mathematical biology; genetics; and computational neuroscience. The book, intended as a primer for both theoretical and experimental biologists, is organized in two parts: models of gene activity and models of interactions among gene products. Modeling examples are provided at several scales for each subject. Each chapter includes an overview of the biological system in question and extensive references to important work in the area.”

Computational Modeling Methods for Neuroscientists, De Schutter 2010

Ebay-frontCover[eBay link]

A guide to computational modeling methods in neuroscience, covering a range of modeling scales from molecular reactions to large neural networks. This book offers an introduction to current methods in computational modeling in neuroscience. The book describes realistic modeling methods at levels of complexity ranging from molecular interactions to large neural networks. A “how to” book rather than an analytical account, it focuses on the presentation of methodological approaches, including the selection of the appropriate method and its potential pitfalls. It is intended for experimental neuroscientists and graduate students who have little formal training in mathematical methods, but it will also be useful for scientists with theoretical backgrounds who want to start using data-driven modeling methods. The mathematics needed are kept to an introductory level; the first chapter explains the mathematical methods the reader needs to master to understand the rest of the book. The chapters are written by scientists who have successfully integrated data-driven modeling with experimental work, so all of the material is accessible to experimentalists. The chapters offer comprehensive coverage with little overlap and extensive cross-references, moving from basic building blocks to more complex applications.

Nonlinear Ordinary Differential Equations. An introduction to dynamical systems. 3rd Ed,  1999

Jordan-Cover[eBay link]

Nonlinear ordinary differential equations was first published in 1977 and has since become a standard text in the teaching of the subject. It takes a qualitative approach, and is designed for advanced undergraduate and graduate students of dynamical systems in mathematics or mathematics-related subjects.
The text of this third edition has been completely revised to bring it into line with current teaching, including an expansion of the material on bifurcations and chaos. The book is directed towards practical applications of the theory, with several hundred examples and problems covering a wide variety of applications. Prerequisites are kept to a minimum, with appendices containing the necessary mathematical theory new to this edition.
From reviews of the first edition: “The book can profitably be used in a senior undergraduate course. It is well written, well motivated and contains some recent developments of interest which have not been readily accessible before at this level. “
From reviews of the second edition: “The subject has wide applications in physical, biological, and social sciences which continuously supply new problems of practical and theoretical importance. The book does a good job in motivating the reader in such pursuits, and presents the subject in a simple but elegant style.”

Computational Cell Biology

A favourite of systems biologists, this textbook provides an introduction to dynamic modeling in molecular cell biology, taking a computational and intuitive approach. Detailed illustrations, examples, and exercises are included throughout the text. Appendices containing mathematical and computational techniques are provided as a reference tool.

Systems Engineering methods, Harold Chestnut, 1967

Ebay-internalCover[eBay link]

The first, and the most important, volume of Harold Chestnut on Systems Engineering and Analysis series. A classic.

NEUROSCIENCES

Fundamental Neuroscience, 2nd Ed, 2003

Ebay-cover

[eBay link]

One of the two “bibles” of Neuroscience, this massive compendium offers a comprehensive presentation of the discipline of neuroscience, from molecules to cognition. This book builds from an introductory section that includes fundamental neuroanatomy and covers cellular and molecular neuroscience, development, sensory systems, motor systems, regulatory systems, and behavioral and cognitive neuroscience.

The second edition of Fundamental Neuroscience is a thoroughly revised version of what is probably the most comprehensive and best textbook in neuroscience. It features over 25% new material including completely new chapters and illustrations. More concise and manageable than the previous edition, this book has been retooled to better serve its audience in the neuroscience and medical communities.  CD ROM not included.

Fundamental Neuroscience, 3rd Ed, 2008

Cover[eBay link]

Fundamental Neuroscience, 3rd Edition introduces graduate and upper-level undergraduate students to the full range of contemporary neuroscience. Addressing instructor and student feedback on the previous edition, all of the chapters are rewritten to make this book more concise and student-friendly than ever before. Each chapter is once again heavily illustrated and provides clinical boxes describing experiments, disorders, and methodological approaches and concepts.

Capturing the promise and excitement of this fast-moving field, Fundamental Neuroscience, 3rd Edition is the text that students will be able to reference throughout their neuroscience careers!

New to this edition:

  • 30% new material including new chapters on Dendritic Development and Spine Morphogenesis, Chemical Senses, Cerebellum, Eye Movements, Circadian Timing, Sleep and Dreaming, and Consciousness
  • Additional text boxes describing key experiments, disorders, methods, and concepts
  • Multiple model system coverage beyond rats, mice, and monkeys
  • Extensively expanded index for easier referencing

Computational Neuroscience: Realistic Modeling for Experimentalists, 2001

Ebay-cover

[eBay link]

Designed primarily as an introduction to realistic modeling methods, Computational Neuroscience: Realistic Modeling for Experimentalists focuses on methodological approaches, selecting appropriate methods, and identifying potential pitfalls. The authors address varying levels of complexity, from molecular interactions within single neurons to the processing of information by neural networks. He avoids theoretical mathematics and provides just enough of the basic math used by experimentalists.

Each chapter covers: the theoretical foundation, parameters needed, appropriate software descriptions, evaluation of the model, future directions expected, examples in text boxes and references.

Modeling Neural Development, 2003

Ebay-cover[eBay link]

This is one of the first books to study neural development using computational and mathematical modeling. Modeling provides precise and exact ways of expression, which allow us to go beyond the insights that intuitive or commonsense reasoning alone can yield. Most neural modeling focuses on information processing in the adult nervous system; Modeling Neural Development shows how models can be used to study the development of the nervous system at different levels of organization and at different phases of development, from molecule to system and from neurulation to cognition.

The book’s fourteen chapters follow loosely the chronology of neural development. Chapters 1 and 2 study the very early development of the nervous system, discussing gene networks, cell differentiation, and neural tube development. Chapters 3-5 examine neuronal morphogenesis and neurite outgrowth. Chapters 6-8 study different aspects of the self-organization of neurons into networks. Chapters 9-12 cover refinement of connectivity and the development of specific connectivity patterns. Chapters 13 and 14 focus on some of the functional implications of morphology and development.

Each chapter contains an overview of the biology of the topic in question, a review of the modeling efforts in the field, a discussion in more detail of some of the models, and some perspectives on future theoretical and experimental work.

The science of studying neural development by computational and mathematical modeling is relatively new; this book, as Dale Purves writes in the foreword, “serves as an important progress report” in the effort to understand the complexities of neural development.

Handbook of Basal Ganglia structure and function, Steiner and Tseng, 2010

Ebay-cover[eBay link]

The Basal Ganglia comprise a group of forebrain nuclei that are interconnected with the cerebral cortex, thalamus and brainstem. Basal ganglia circuits are involved in various functions, including motor control and learning, sensorimotor integration, reward and cognition. The importance of these nuclei for normal brain function and behavior is emphasized by the numerous and diverse disorders associated with basal ganglia dysfunction, including Parkinson’s disease, Tourette’s syndrome, Huntington’s disease, obsessive-compulsive disorder, dystonia, and psychostimulant addiction.

The Handbook of Basal Ganglia provides a comprehensive overview of the structural and functional organization of the basal ganglia, with special emphasis on the progress achieved over the last 10-15 years. Organized in six parts, the volume describes the general anatomical organization and provides a review of the evolution of the basal ganglia, followed by detailed accounts of recent advances in anatomy, cellular/molecular, and cellular/physiological mechanisms, and our understanding of the behavioral and clinical aspects of basal ganglia function and dysfunction.

Comparative Vertebrate Neuroanatomy, Butler and Hodos, 1996

Ebay-cover.jpg[eBay link]

By applying the tools of modern neuroanatomy to brain structure and function in various species, researchers have discovered that numerous cell groups and interconnections, known to be present in mammals, also exist in non-mammalian vertebrates. This book reveals how the brains of various vertebrates are astoundingly similar in some ways, while in others they are quite different. The authors examine how the form of the brain is modified and magnified to perfect and capitalize on a specific function, making any particular animal a “specialist” in its area. They also clarify the forms and functions of the nervous system that have allowed vertebrates to adapt to almost every aspect of the earth’s environment.

Computational Modeling Methods for Neuroscientists, De Schutter 2010

Ebay-frontCover[eBay link]

A guide to computational modeling methods in neuroscience, covering a range of modeling scales from molecular reactions to large neural networks. This book offers an introduction to current methods in computational modeling in neuroscience. The book describes realistic modeling methods at levels of complexity ranging from molecular interactions to large neural networks. A “how to” book rather than an analytical account, it focuses on the presentation of methodological approaches, including the selection of the appropriate method and its potential pitfalls. It is intended for experimental neuroscientists and graduate students who have little formal training in mathematical methods, but it will also be useful for scientists with theoretical backgrounds who want to start using data-driven modeling methods. The mathematics needed are kept to an introductory level; the first chapter explains the mathematical methods the reader needs to master to understand the rest of the book. The chapters are written by scientists who have successfully integrated data-driven modeling with experimental work, so all of the material is accessible to experimentalists. The chapters offer comprehensive coverage with little overlap and extensive cross-references, moving from basic building blocks to more complex applications.

Dendritic Spines, 2009

ebay-cover[eBay link]

A leading neurobiologist explores the fundamental function of dendritic spines in neural circuits by analyzing different aspects of their biology, including structure, development, motility, and plasticity. Most neurons in the brain are covered by dendritic spines, small protrusions that arise from dendrites, covering them like leaves on a tree. But a hundred and twenty years after spines were first described by Ramon y Cajal, their function is still unclear. Dozens of different functions have been proposed, from Cajal’s idea that they enhance neuronal interconnectivity to hypotheses that spines serve as plasticity machines, neuroprotective devices, or even digital logic elements. In Dendritic Spines, leading neurobiologist Rafael Yuste attempts to solve the “spine problem,” searching for the fundamental function of spines. He does this by examining many aspects of spine biology that have fascinated him over the years, including their structure, development, motility, plasticity, biophysical properties, and calcium compartmentalization. Yuste argues that we may never understand how the brain works without understanding the specific function of spines. In this book, he offers a synthesis of the information that has been gathered on spines (much of which comes from his own studies of the mammalian cortex), linking their function with the computational logic of the neuronal circuits that use them. He argues that once viewed from the circuit perspective, all the pieces of the spine puzzle fit together nicely into a single, overarching function. Yuste connects these two topics, integrating current knowledge of spines with that of key features of the circuits in which they operate. He concludes with a speculative chapter on the computational function of spines, searching for the ultimate logic of their existence in the brain and offering a proposal that is sure to stimulate discussions and drive future research.

BIOCHEMISTRY AND MOLECULAR BIOLOGY

The Physical Basis Of Biochemistry: The Foundations Of Molecular Biophysics

Ebay-cover[eBay link]

The physical basis of biochemistry is a rigorous, imaginative textbook that applies physical and chemical principles to the understanding of the biology of cells. The book features problem sets and examples, clear illustrations, and extensive appendixes, that provide additional information on related topics in mathematics, physics and chemistry. Based on years of the author’s experience teaching the subject, the physical basis of biochemistry is suitable for graduate and advanced undergraduate courses in physical biochemistry, biophysical chemistry, and physical chemistry with applications in the life sciences. Mixing practical applications with a consistent theoretical paradigm of modeling biological systems, the book will be welcomes by instructors seeking a text that combines a quantitative approach with a uniform biological perspective.

Biophysical Chemistry, James P. Allen, 2008

Ebay-cover[eBay link]

Biophysical Chemistry presents physical chemistry through the use of biological and biochemical topics, examples, and applications to biochemistry. It presents a rigorous, up to date treatment of the material without presuming unrealistic prior knowledge of math theory. Necessary calculus models are laid out in a step by step fashion for students less confident in their math abilities. The format of the text allows teachers ample flexibility in deciding which derivations to present in class. Students are guided through an in depth understanding of fundamental concepts such as a quantum mechanical description of the hydrogen atom and techniques are presented with an emphasis on learning through analysis of real data. Cases are drawn from timely research areas in biochemistry using an integrated approach to problem solving. Every chapter features important recent advances in biochemistry, an examination of current research problems, math and derivation boxes to guide students, and examples of both numerical and concept based problems.

Biophysical Chemistry, Cantor and Schimmel (the ultimate textbook)

Ebay-cover[eBay link]

This three-part series remains the definitive text on the physical properties of biological macromolecules and the physical techniques used to study them. It is appropriate for a broad spectrum of advanced undergraduate and graduate courses and serves as a comprehensive reference for researchers.
Part I: The Conformation of Biological Macromolecules 1980, paper, 365 pages, 158 illustrations 0-7167-1188-5
Part II: Techniques for the Study of Biological Structure and Function 1980, paper, 365 pages, 158 illustrations 0-7167-1190-7
Part III: The Behavior of Biological Macromolecules 1980, paper, 597 pages, 243 illustrations 0-7167-1192-3

Calcium Signaling, Cold Spring Harbour Perspective in Biology, Bootman, Berridge et al 2012

Calcium ions play a critical role in signaling in a wide variety of cells and tissues, including muscle, immune cells, neurons, the liver, and oocytes. They can enter the cytoplasm from the extracellular environment or from intracellular stores and control the activity of numerous enzymes, other signaling molecules, transcription factors, and cytoskeletal components. Written and edited by experts in the field, this collection from Cold Spring Harbor Perspectives in Biology explores the channels and pumps that transport calcium between different compartments and the regulation of calcium fluxes. The contributors discuss calcium buffers and sensors and how these produce distinct spatiotemporal calcium signals in different circumstances. Including discussions of calcium signaling during development and normal physiology, as well as perturbed signaling in diseases such as diabetes, neurodegeneration, and atherosclerosis, the volume is an indispensable reference for biochemists, cell and developmental biologists, and physiologists seeking a comprehensive treatment of calcium biology.

COMPUTING AND PROGRAMMING

Programming PHP, 2002

Ebay-cover[eBay link]

This is a comprehensive guide to PHP, a simple yet powerful language for creating dynamic web content. It is a detailed reference to the language and its applications, including such topics as form processing, sessions, databases, XML, and graphics.

510 pages.

Cascading Style Sheets, the definitive guide

Simply put, Cascading Style Sheets (CSS) is a way to separate a document’s structure from its presentation. The benefits of this can be quite profound: CSS allows a much richer document appearance than HTML; CSS saves time–you can create or change the appearance of an entire document in just one place; and its compact file size makes web pages load quickly. Eric Meyer, a member of the CSS&FP Working Group and an internationally known expert on HTML and CSS, tackles the subject with passion and delivers a comprehensive and thorough update to his groundbreaking book. All readers will benefit from both the depth and breadth of his experience and his clear and honest style. Cascading Style Sheets: The Definitive Guide is a comprehensive guide to CSS implementation. The book includes content on positioning, lists and generated content, table layout, user interface, paged media, and more. It explores in detail each individual CSS property and how it interacts with other properties, and shows how to avoid common mistakes in interpretation.If you’re ready to take the next step with your HTML coding and incorporate CSS or are already a CSS code warrior, you’ll find Cascading Style Sheets: The Definitive Guide is the book you’ve been craving.

Thinking in Java, 3rd edition, Bruce Eckel

cover[eBay link]

In Thinking in Java, Third Edition, Bruce Eckel revises his widely-praised Java introduction to fully reflect the power of Java 2 Standard Edition, Version 1.4. This edition of the book that won the 1999 Software Development Magazine Productivity Award adds thoroughly updated coverage of servlets, JSP, and EJB. As always, Eckel doesn’t just show students what to do, but why. He introduces all the basics of objects as Java uses them; then walks carefully through the fundamentals of Java programming, including program flow, initialization and cleanup, implementation hiding, reusing classes and polymorphism. Using to-the-point examples, he introduces exception handling, Java I/O, run-time type identification, and passing and returning objects. Coverage also includes: database integration, transactions, security, Swing GUI development, Jini, JavaSpaces, and much more. Eckel presents Java one simple step at a time, carefully organizing his material so students can learn new concepts even in short study periods, and thoroughly digest each new idea and technique before moving on.
The accompanying CD is missing.

CHILDREN BOOKS

Lego Star Wars character encyclopedia

Ebay-front[eBay link]

Describing more than 300 minifigures, all the main characters of the saga, in their different outfits.

A figurine is provided, although it is not Hans Solo as written on the book, but a rebel trooper. It was already the case when we bought the book new. This must have been a mistake made in factory.

8 adventures of Horrid Henry in 4 books

Ebay-allFront

[eBay link]

A banquet of Horrid Henry adventures for less than a tenner:
“Horrid Henry”
“Horrid Henry’s Purple Hand Gang Joke Book”
“A Helping of Horrid Henry” (containing “Horrid Henry’s Nits”, “Horrid Henry Gets RIch Quick”, “Horrid Henry’s Haunted House”), “A Giant Slice of Horrid Henry” (containing “Horrid Henry Meets the Queen”, “Horrid Henry’s Underpants”, “Horrid Henry’s StinkBomb”).

5 Usborne Puzzle books

Ebay-books.jpg[eBay link]

Five Usborne Puzzle Books:

Puzzle Castle
Puzzle Christmas
Puzzle Pirates
Puzzle Train
Puzzle Dinosaurs

Each comes with plenty of puzzles, including the famous “Can you find …”

Tales from the Wind in the Willows by Kenneth Grahame, illustrated by Val Biro

Ebay-outCover[eBay link]

Slightly shorter version of this great classic, with large characters, very suitable to children in primary school. The book is nicely illustrated.

Very good state, almost like new. The out-cover has a trace of price sticker on the front.

Little Bear’s ABC and 123, Jane Hissey

Ebay-cover[eBay link]

57 pages.

Fat Puss and Friends, Harriet Castor

Ebay-cover[eBay link]

92 page.

Contains the stories:
Fat Puss
Fat Puss finds a friend
Fat Puss in summer
Fat Puss meets a stranger
Fat Puss at Christmas

Shampoozel, Laurence Anholt, Arthur Robins

Ebay-cover[eBay link]

A brain-ticklingly brilliant seriously silly story.

Shampoozel’s brush with the Bad Hair Witch grows into a hair-raising adventure. Will is all come out in the wash?

64 pages.

Humpty Dumpty and other nursery rhymes, Ladybird

Ebay-cover[eBay link]

28 pages.

Monster stories for under fives, Joan Stimson, Ladybird

Ebay-cover[eBay link]

43 pages.

The Booktime Book of Fantastic First Poems

Ebay-cover[eBay link]

28 pages.

Pigs Might Fly, Red House Younger Children Award 2006

Ebay-front[eBay link]

“Let me win, little pig, LET ME WIN!” The Big Bad Wolf is back and badder than ever! So when the Three Pigs enter the “Pie in the Sky” Air Race, he’s determined to snaffle the prize pies and have the pigs for pudding. Will the Wolf win – or can Wilbur save the day? A fast-paced, frantically-funny sequel to a well-loved tale.

32 pages.

Winner of the Red House Children’s Book Award, category Younger Children.

The diabolical Mr Tiddles, 2012

MrTiddles-cover

[eBay link]

Fearsome dinosaurs, whooshing jetpacks, rockstar guitars, a horse called Alan … Juste a few of the things that Harry unexpectedly finds in his bedroom. Where are they coming from?

It couldn’t have anything to do with Harry’s furry-purry new pussycat Mr Tiddles … could it?

A madcap adventure revealing the secret lives of pets.

32 pages

A dog’s best friend, 1999

Ebay-cover

[eBay link]

Toby’s life is all fillet steak and walks in the park, but one night he gets lost – and everything changes!
He gets wet, cold and hungry, but also makes a new friend. And when his old owners turn up, Toby had to make the toughest decision of his life!

Children of seven and above will enjoy reading this story, but is can be read to younger children too.

24 pages.

Mr Big, Ed Vere

MrBig-Cover[eBay link]

A true friend can come in any shape or size.

32 pages.

Dogger, by Shirley Hughes, ed from 2009

Dogger-Cover[eBay link]

“Once there was a soft brown toy called Dogger. One of his ears pointed upwards and the other flopped over. His fur was worn in places because he was quite old. He belonged to Dave.”

This classic picture book tells the familiar and reassuring story of how a much-loved toy is lost and finally found again.

32 pages.

Wriggle and Roar, by Julia Donaldson (“The Gruffalo”) and Nick Sharratt (“Tracy Beaker”)

ebay-cover[eBay link]

Rhymes to join in with.

A classic by Julia Donaldson, illustrated by Nick Sharratt of “Tracy Beaker” fame

40 pages.

Some Dogs Do. Jez Alborough 2007

Ebay-cover[eBay link]

“All dogs walk and jump and run, but dogs don’t fly – it can’t be done … Or can it?”

40 pages.

Where’s my Teddy? Jez Alborough 2004

Ebay-cover[eBay link]

Eddy’s lost his teddy, Frddy.
So off he goes to the wood to find him.
But the wood is dark and horrible and little Eddy is in for a gigantic surprise!

32 pages.

Cheeky Charlie! Ben Redlich 2008

Ebay-Cover[eBay link]

Charlie prances around the jungle giving cheek to everyone he meets, but can he see the funny side when he catches a glimpse of his own backside?

24 pages.

The Hundred-Mile-an-Hour Dog Goes for Gold, Jeremy Strong

Ebay-Cover[eBay link]

“Guess what’s coming to town!
The animal games!

There’ll be show-jumping for horses AND rabbits, and discus for dogs – so of course I have to enter Streaker.

Mum says a CARROT is more obedient than my dog, but I think she can do it – Streaker can go for GOLD!”

149 pages.

Wicked World Cup 2006, Michael Coleman

Ebay-cover[eBay link]

“This international guide gives you the coolest commentary on the brilliant Brazilians, the awesome Argentinians, the invincible Italians and the fantastic French.

Plus discover the secrets of wicked wonders Pelé, Beckham, and the English heroes of 1966.”

The book has been read many times. No pages are missing, and none are ripped or damaged. The result chart tables at the end of the book have been partially filled with the results of the 2006 world cup.

Football mad, four books in one, 2008

Ebay-cover[eBay link]

Includes books:

The worst team in the world by Alan MacDonald

There’s only one Danny Ogle by Helena Pielichaty

Nice One, Sam! by John Goodwin

Mark’s Dream Team by Alan MacDonald

415 pages

My Treasury of Stories and Rhymes, hardback 2005

Ebay-front[eBay link]

384 pages of children delight. Many stories and many songs, of different lengths and levels.

MaxinaBoxBoy, Tudor Humphries, 1999

Ebay-cover[eBay link]

Nice short book telling the story of Max and his adventures in boxes.

Nessie needs new glasses

ebay-cover[eBay link]

Nessie needs her glasses in order to do all the things that a Loch Ness Monster likes to do!
However, when her glasses are broken, she does not have the money to pay for a new pair. Nessie sets on a tour of Scotland to see if any of her relatives can help her. The trouble is that her eyesight really is quite poor …….!

Maisy’s bedtime, Lucy Cousins 2011

ebay-cover[eBay link]

MISCELLANEA

Birds of Britain and Europe, 2012

ebay-cover[eBay link]

Perfect pocket book to go on a walk, or hide in a spot, and try to recognise birds.
250 pages, 310 bird species

 

plotGODESeq: differential expression and Gene Ontology enrichment on one plot

I recently came across the package GOplot by Wencke Walter http://wencke.github.io/. In particular I liked the function GOBubble. However, I found difficult to customise the plot. In particular I wanted to colour the bubbles differently, and to control the plotting area. So I took the idea and extended it. Many aspects of the plot can be configured. It is a work in progress. Not all features of GOBubble are implemented at the moment. For instance, we cannot separate the different branches of Gene Ontology, or add a table listing labelled terms. I also have a few ideas to make the plot more versatile. If you have suggestions, please tell me. The code and the example below can be found at https://github.com/lenov/plotGODESeq/

What we want to obtain at the end is the following plot:

final-inkscape

The function plotGODESeq() takes two mandatory inputs: 1) a file containing Gene Ontology enrichment data,  2) a file containing differential gene expression data. Note that the function works better if the dataset is limited, in particular the number of GO terms. It is useful to analyse the effect of a perturbation, chemical or genetic, or to compare two cell types that are not too dissimilar. Comparing samples that exhibit several thousands of differentially expressed genes, resulting in thousands of enriched GO terms, will not only slow the function to a halt, it is also useless (GO enrichment should not be used in these conditions anyway. The results always show things like “neuronal transmission” enriched in neurons versus “immune process” enriched in leucocytes). A large variety of other arguments can be used to customise the plot, but none are mandatory.

To use the function, you need to source the script from where it is; In this example it is located in the session directory. (I know I should make a package of the function. On my ToDo list)

source('plotGODESeq.R')

Input

The Gene Ontology enrichment data must be a dataframe containing at least the columns: ID – the identifier of the GO term, description– the description of the term, Enrich – the ratio of observed over expected enriched genes annotated with the GO term, FDR – the False Discovery Rate (a.k.a. adjusted p-value), computed e.g. with the Benjamini-Hochberg correction, and genes – the list of observed genes annotated with the GO term. Any other column can be present. It will not be taken into account. The order of columns does not matter. Here we will load results coming from and analysis run on the server WebGestalt. Feel free to use whatever Gene Ontology enrichment tool you want, as far as the format of the input fits.

# load results from WebGestalt
goenrich_data <- read.table("GO-example.csv", 
                            sep="\t",fill=T,quote="\"",header=T)

# rename the columns to make them less weird 
# and compatible with the GOPlot package
colnames(goenrich_data)[
colnames(goenrich_data) %in% c("geneset","R","OverlapGene_UserID")
] <- c("ID","Enrich","genes")

# remove commas from GO term descriptions, because they suck
goenrich_data$description <- gsub(',',"",goenrich_data$description)

The differential expression data must be a dataframe which rownames are the gene symbols, from the same namespace as the genes column of the GO enrichment data above. In addition, one column must be named log2FoldChange, containing the quantitative difference of expression between two conditions. Any other column can be present. It will not be taken into account. The order of columns does not matter.

# Load results from DESeq2
deseq_data <- read.table("DESeq-example.csv", 
                         sep=",",fill=T,header=T,row.names=1)

Now we can create the plot.

plotGODESeq(goenrich_data,deseq_data)

The y-axis is the negative log of the FDR (adjusted p-value). The x-axis is the zscore, that is for a given GO term:

(nb(genes up) – nb(genes down))/sqrt(nb(genes up) + nb(genes down))

The genes associated with each GO term are taken from the GO enrichment input, while the up or down nature of each gene is taken from the differential expression input file. The area of each bubble is proportional to the enrichment (number of observed genes divided by number of expected genes). This is the proper way of doing it, rather than using the radius, although of course, the visual impact is less important.

Raw

Choosing what to plot

The console output tells us that we plotted 1431 bubbles. That is not very pretty or informative … The first thing we can note is that we have a big mess at the bottom of the plot, which corresponds to the highest values of FDR. Let’s restrict ourselves to the most significant results, by setting the argument maxFDR to 1e-8.

maxFDR

This is better. We now plot only 181 GO terms. Note the large number of terms aligned at the top of the plot. Those are terms with an FDR of 0. The Y axis being logarithmic, we plot them by setting their FDR to a tenth of the smallest non-0 value. GO over-representation results are often very redundant. We can use GOplot’s function reduce_overlap by setting the argument collapse to the proportion of genes that needs to be identical so that GO terms are merged in one bubble. Let’s use collapse=0.9 (GO terms are merged if 90% of the annotated genes are identical) .

collapse

Now we only plot 62 bubbles, i.e. two-third of the terms are now “hidden”. Use this procedure with caution. Note how the plot now looks distorted towards one condition. More “green” terms have been hidden than “red” terms.

The colour used by default for the bubbles is the zscore. It is kind of redundant with the x-axis. Also, the zscore only considers the number of genes up or down-regulated. It does not take into account the amplitude of the change. By setting the argument color to l2fc, we can use the average fold change of all the genes annotated with the GO term instead.

l2fc

Now we can see that while the proportion of genes annotated by GO:0006333 that are down-regulated is lower than for GO:0008380, the amplitude of their average down-regulation is larger.

WARNING: The current code does not work if the color scheme chosen for the bubbles is based on variable, l2fc or zscore, that do not contain negative and positive values. Sometimes, the “collapsing” can cause this situation, if there is an initial unbalance between zscores and/or l2fc. It is a bug, I know. On the ToDo list …

Using GO identifiers is handy and terse, but since I do not know GO by heart, it makes the plot hard to interpret. We can use the full description of each term instead, by setting the argument label to description.

label

Customising the bubbles

The width of the labels can be modified by setting the argument wrap to the maximum number of characters (the default used here is 15). Depending of the breadth of values for FDR and zscore, the size of the bubbles can be an issue, either by overlapping too much, or on the contrary by being tiny. We can change that by the argument scale which scales the radius of the bubbles. Let’s fix it to 0.7, to decrease the size of each bubble by a third (the radius, not the area!).

scale

There is often a big crowd of terms at the bottom and centre of the plot. This is not so clear here, with the harsh FDR threshold, but look at the first plot of the post. These terms are generally the least interesting, since they have a lower significance (higher FDR) and mild zscore. We can decide to label the bubbles only under a certain FDR with the argument maxFDRLab and/or above a certain absolute zscore with the argument minZscoreLab. Let’s fix them to 1e-12 and 2 respectively.

labelThreshold

Finally, you are perhaps not too fond of the default color scheme. This can be changed with the arguments lowCol, midCol, highCol. Let’s set them to  “deepskyblue4”, “#DDDDDD” and “firebrick”,

bubbleColor

Customising the plotting area

The first modifications my collaborators asked me to introduced was to centre the plot on a zscore of 0, and to add space around so they could annotate the plot. One can centre the plot by declaring centered = TRUE (the default is FALSE). Since our example is extremely skewed towards negative zscores, this would not be a good idea. However, adding some space on both side will come handy in the last step of beautification. We can do that by declaring extrawidth=3 (default is 1).

extrawidth

The legend position can be optimised with the arguments leghoffset and legvoffset. Setting them to {-0.5,1.5}

legend

The complete call:

plotGODESeq(goenrich_data,
            deseq_data,
            maxFDR = 1e-8,
            collapse = 0.9,
            color="l2fc",
            lowCol = "deepskyblue4",
            midCol = "#DDDDDD",
            highCol = "firebrick",
            extrawidth=3,
            centered=FALSE,
            leghoffset=-0.5,
            legvoffset=1.5,
            label = "description",
            scale = 0.7,
            maxFDRLab = 1e-12,
            minZscoreLab = 2.5,
            wrap = 15)

Now we can export an SVG version and play with the labels in Inkscape. This part is unfortunately the most demanding …

final-inkscape

SBGN-ML, SBML visual packages, CellDesigner format, what are they? When should I use them?

Visual representation of biochemical pathways has been a key tool used to understand cellular and molecular systems for a long time. Any knowledge integration project involves a jigsaw puzzle step, where different pieces have to be put together. When Feynman cheekily wrote on his blackboard just before his death “What I cannot create I do not understand”, he meant that he only fully understood a system once he derived a (mathematical) model for it, and interestingly Feynman is also famous for one of the earliest standard graphical representations of reaction networks, namely the Feynman diagrams to represent models of subatomic particle interactions. The earliest metabolic “map” I possess comes from the 3rd edition of “Outlines of Biochemistry” by Gortner published in 1949. I would be happy to hear if you have older ones.

Goertner

(I let you find out all the inconsistencies, confusions and error-generating features in this map. This might be food for another text, but I believe this is a great example to support the creation of standards, best practices, and software tools!)

Until recently, those diagrams were mostly drawn by hand, initially on paper, then using drawing software. There was not so much thinking spent in consistency, visual semantics, or interoperability. This changed in the 1990s, as part as Systems Biology’s revival. The other thing that changed in the 1990s was the widespread use of computers and software tools to build and analyse models. The child of both trends was the development of standard computer-readable formats to represent biological networks.

When drawing a knowledge representation map, one can divide the decision-making process, and therefore the things we need to encode in order to share the map, in three  parts:

What – How can people identify what I represent? A biochemical map is a network made up of nodes, linked by arcs. The network may contain only one type of nodes, for instance a protein-protein interaction network or an influence network, or be a bipartite graph, like a reaction network – one type of nodes representing the pools involved in the reactions, the other representing the reactions themselves. One decision is the shape to use for each node so that it carries a visual information about the nature of what it represents. Another concerns the arcs linking the nodes, that can also contain visual clues, such as directionality, sign, type of influence etc. All this must be encoded in some way, either semantically (a code identifying the type of glyphs, from an agreed-up list of codes), or graphical (embedding an image or describing the node).

Where – Once the glyphs are chosen, one needs to place them. The relative position of the information should not always carry much information, but there are some cases where it must, e.g. members of complexes, inclusion in compartments etc. And there is no denying that the relative position of glyphs is also used to convey more subjective information. For instance a linear chain of reactions induce the idea of a flow, much better than a set of reactions going randomly up and down, right and left. Another unwritten convention it to represent membrane signal transduction on the top of the maps, with the “end-result”, often effect on gene expression, at the bottom, with the idea of a cascading flux of information. The coordinates of the the glyphs must then be shared as well.

How – Finally, the impact of a visual representation also depends on aesthetic factors. The relative size of glyphs and labels, thickness of arcs, the colours, shades and textures, all influence the facility with which viewers absorb the information contained in a map. Relying on such aspects to interpret the meaning of a map should be avoided, in particular if the map is to be shared between different media, where rendering could affect the final aspect. But wanting to keep this aspect as close as possible makes sense.

A bit of history

Different formats have been developed over the years to cover these different aspects with different accuracy and constraints. In order to understand why we have such a variety of description formats on offer, a bit of history might be useful. Being able to encode graphical representation of models in SBML was mentioned as early as 2000 (Andrew Finney. Possible Extensions to the Systems Biology Markup Language. 27 November 2000.).

In 2002, the group of Hiroaki Kitano presented a graphical editor for the Systems Biology Markup Language (SBML, Hucka et al 2003), called SBedit, and proposed extensions to SBML necessary for encoding maps (Tanimura et al. Proposal for SBEdit’s extension of SBML-Level-1. 8 July 2002). This software latter became CellDesigner (Funahashi et al 2003), a full-featured modelling developing environment, using SBML as its native format. All graphical information is encoded in CellDesigner-specific annotations, using the SBML extension system. In addition to the layout (the where), CellDesigner proposed a set of standardised glyphs to use for representing different types of molecular entities and different relationships (the what) (Kitano et al 2003). At the same time, Herbert Sauro developed an extension to SBML to encode the maps designed in the software JDesigner (Herbert Sauro. JDesigner SBMLAnnotation. 8 January 2003). Both CellDesigner and JDesigner annotations could also encode the appearance of glyphs (how).

In 2003, Gauges et al (Gauges et al. Including Layout information in SBML files. 13 May 2003) proposed to split the description of the layout (the where) and the rendering (the what and the how), and to focus on the layout part in SBML (Gauges et al 2006). Eventually, this effort led to the development of two SBML Level 3 Packages, Layout (Gauges et al 2015) and Render (Bergmann et al 2017).

Once the SBML Layout annotations were finalised, the SBML and BioPAX communities came together to standardise visual representations for biochemical pathways. This led to the Systems Biology Graphical Notation, as set of three standard graphical languages with agreed upon symbols and rules to assemble them (the what, Le Novère et al 2009). While the shape of SBGN glyphs determine their meaning, neither their placement in the map nor their graphical attributes (colour, texture, edge thickness, the how) affect the map semantics. SBGN maps are ultimately images and can be exchanged as such, either in bitmaps or vector graphics. They are also graphs and can be exchanged using graph formats, such as GraphML. However, it was felt that sharing and editing SBGN maps would be much easier if more semantics was encoded rather than graphical details. This led to the development of SBGN-ML (van Iersel et al 2012), which not only encode the SBGN part of SBGN maps, but also the layout and size of graph elements.

So we have at least three solutions to encode biochemical maps using XML standards from the COMBINE community (Hucka et al 2015): 1) SBGN-ML, 2) SBML with Layout extension (controlled Layout annotations in Level 2 and Layout package in Level 3) and 3) SBML with proprietary extensions. Regarding the latter, we will only consider CellDesigner, for two reasons. Firstly, CellDesigner is the most used graphical model designer in systems biology (at the time of writing, the articles describing the software have been cited over 1000 times). Secondly, CellDesigner’ SBML extensions are used in other software tools.  These solutions are not equivalent, they present different advantages and disadvantages, and round-tripping is in general not possible.

SBGN-ML

Curiously, despite its name, SBGN-ML does not explicitly describe the SBGN part of the maps (the what). Since the shape of nodes is a standard, it is only necessary to mention their type, and any supporting software will know which symbol to use. For instance, SBGN-ML will not specify that a protein X must be represented with a round-corner rectangle. It will only say that there is a macromolecule X at a certain position with given width and height. Any SBGN-supporting software must know that a macromolecule is represented by a round-corner rectangle. The consequence is that SBGN-ML cannot be used to encode maps using non-SBGN symbols. However, software tools can decide to use different symbols attributed to a given class of SBGN objects during the rendering of the maps. Instead of using a round-corner rectangle each time the class of a glyph is macromolecule, it could use a star. The resulting image would not be an SBGN map. But if modified, and saved back in SBGN-ML, it could be recognised by another supporting software. Such a behaviour is not to be encouraged if we want people to get used to SBGN symbols, but it provides a certain level or interoperability.

What is explicitly described in SBGN-ML instead are the parts that are not regulated by SBGN itself, but are specific to the map. That include the size of the glyphs (bounding box), the textual labels, as well as the positions of glyphs (the where). SBGN-ML currently does not encode rendering properties such as text size, colours and textures (the how). But the language provides an element extension, analogous to the SBML annotation, that allows to augment the language. One can use this element to extend each glyph, or to encode styles, and the community started to do so in an agreed-upon manner.

Note that SBGN-ML only encodes the graph. While there is a certain amount of biological semantics, linked to the identity of the glyphs, it is not a general purpose format that would encode advanced semantic of regulatory features, such as BioPAX (Demir et al. 2010),  or mathematical relationships such as SBML. However, users can distribute SBML files along SBGN-ML files, for instance in a COMBINE Archive (Bergmann et al 2014). Unfortunately, there is currently no blessed way to map an SBML element, such as a particular species, to a given SBGN-ML glyph.

SBML Level 3 + Layout and Render packages

As we mentioned before, SBML Level 3 provides two packages helping with the visual representations of networks: Layout (the where) and Render (the how). Contrarily to SBGN-ML, which is meant to describe maps in a standard graphical notation, the SBML Level 3 packages do not restrict the way one represents biochemical networks. This provides more flexibility to the user, but decreases the “stand-alone” semantics content of the representations. I.e. if non-standard symbols are used, their meaning must be defined in an external legend. It is of course possible to use only SBGN glyphs to encode maps. The visual rendering of such a file will be SBGN, but the automatic analysis of the underlying  format will be harder.

The SBML Layout package permits to encode the position of objects, points, curves and bounding boxes. Curves can have complex shapes, encoded as Béziers curves. The package allows to distinguish between different general types of nodes such as compartments, molecular species, reactions and text. However, there is little biological semantics encoded by the shapes, either regarding the nodes (e.g. nothing distinguishes a simple chemical from a protein) or the edges (one cannot distinguish an inhibition from a stimulation). In addition, the SBML Render package permits to define styles that can be applied to types of glyphs. This includes colours and gradients, geometric shapes, properties of text, lines, line-endings etc. Render can encode a wide variety of graphical properties, and pave the gap to generic graphical formats such as SVG.

If we are trying to visualise a model, one advantage of using SBML packages is that all the information is included in a single file, providing an easy mapping between the model constructs and their representation. This goes a long way to solve the issue of the biological semantics mentioned above, since it can be retrieved from the SBML Core elements, linked to the Layout elements. Let’s note that while SBML Layout+Render do not encode the nature of the objects represented by the glyphs (the what) using specific structures, this can be retrieved via the attributes sboTerm of the corresponding SBML Core elements, using the appropriate values from the Systems Biology Ontology (Courtot et al 2011).

CellDesigner notation

CellDesigner uses SBML (currently Level 2) as its native language. However, it extended it with its own proprietary annotation, keeping the SBML perfectly valid (which is also the way software tools such as JDesigner operate). Visually, the CellDesigner notation is close to SBGN Process Descriptions, having been the strongest inspiration for the community effort. CellDesigner offers an SBGN-View mode, that produce graphs closer to pure SBGN PD.

CellDesigner’s SBML extensions increase the semantics of SBML elements such as molecular species or regulatory arcs,  in a way not dissimilar to SBGN-ML. In addition, it provides a description of each glyph linked to the SBML elements, covering the ground of SBML Layout and Render. The SBML extensions being specific to CellDesigner, they do not offer the flexibility of SBML Render. However, the limited spectrum of possibility might makes the support easier.

CellDesigner notation SBML Layout+Render SBGN-ML
Encode the what
Encode the where
Encode the how
Contain the mathematical model part
Writing supported by more than 1 tool
Reading supported by more than 1 tool
Is a community standard

Examples of usages and conversions

Now let’s see the three formats in action. We start with SBGN-ML. First, we can load a model, for instance from BioModels (Chelliah et al 2015), in CellDesigner (version 4.4 at the time of writing). Here we will use the model BIOMD0000000010, an SBML version of  the MAP kinase model described in Kholodenko et al (2000).

image10

From an SBML file that does not contain any visual representation, CellDesigner created one using its auto-layout functions. One can then export an SBGN-ML file. This SBGN-ML file can be imported for instance in Cytoscape (Shannon et al.  2003) 2.8 using the CySBGN plugin (Gonçalves et al 2013).

image2

The position and size of nodes are conserved, but edges have different size (and the catalysis glyph is wrong). The same SBGN-ML file can be open in the online SBGN editor Newt.

image4

An alternative to CellDesigner to produce the SBGN-ML map could be Vanted (Junker et al 2006, version 2.6.4 at the time of writing). Using the same model from BioModels, we can auto-layout the map (we used the organic layout here) and then convert the graph to SBGN using the SBGN-ED plugin (Czauderna et al 2010).

image11

The map can then be saved as SBGN-ML, and as before opened in Newt.

image3

The positions of the nodes are conserved. But the connection of edges is a bit different. In that case, Newt is slightly more SBGN compliant.

Now, let’s start with a vanilla SBML file. We can import  our BIOMD0000000010 model in COPASI  (Hoops et al 2006, version 4.22 at the time of writing). COPASI now offers auto-layout capabilities, with possibilities of manually editing the resulting maps.

image5

Now, when we’ll export the model in SBML, it will contain the map encoded with the Layout and Render packages. When the model is uploaded in any software tool supporting the packages, we will retrieve the map. For instance, we can use the SBML Layout Viewer. Note that if the layout is conserved, it is not the case of the rendering.

image1

Alternatively, we can load the model to CellDesigner, and manually generate a nice map (NB: a CellDesigner plugin that can read SBML Layout was implemented during Google Summer of Code 2014 . It is part of the JSBML project).

image6

We can create an SBML Layout using CellDesigner layout converter. When we import the model in COPASI we can visualise the map encoded in Layout. NB: the difference of appearance here is due to a problem in CellDesigner converter, not COPASI.

image8

The same model can be loaded in the SBML Layout Viewer.

image7

How do I choose between the formats?

There is unfortunately no unique solution at the moment. The main question one has to ask is what do we want to do with the visual maps?

Are they meant to be a visual representation of an underlying model, the model being the important part, that needs to be exchanged? If that is the case, SBML packages or CellDesigner notation should be used.

Does the project mostly/only involves graphical representations, and those must be exchanged? CellDesigner or SBGN-ML would therefore be better.

Does the rendering of graphical elements matter? In that case, SBML packages or CellDesigner notations are currently better (but that is going to change soon).

Is standardisation important for the project, in addition to immediate interoperability? If yes, SBML packages or SBGN-ML would be the way to go.

All those questions and more have to be clearly spelled out at the beginning of a project. The answer will quickly emerge from the answers.

Acknowledgements

Thanks to Frank Bergmann, Andreas Dräger, Akira Funahashi, Sarah Keating, Herbert Sauro for help and corrections.

References

Bergmann FT, Adams R, Moodie S, Cooper J, Glont M, Golebiewski M, Hucka M, Laibe C, Miller AK, Nickerson DP, Olivier BG, Rodriguez N, Sauro HM, Scharm M, Soiland-Reyes S, Waltemath D, Yvon F, Le Novère N (2015) COMBINE archive and OMEX format: one file to share all information to reproduce a modeling project. BMC Syst Biol 15, 369. doi:10.1186/s12859-014-0369-z

Bergmann FT, Keating SM, Gauges R, Sahle S, Wengler K (2017) Render, Version 1 Release 1. Available from COMBINE <http://identifiers.org/combine.specifications/sbml.level-3.version-1.render.version-1.release-1>

Chelliah V, Juty N, Ajmera I, Raza A, Dumousseau M, Glont M, Hucka M, Jalowicki G, Keating S, Knight-Schrijver V, Lloret-Villas A, Natarajan K, Pettit J-B, Rodriguez N, Schubert M, Wimalaratne S, Zhou Y, Hermjakob H, Le Novère N, Laibe C (2015)  BioModels: ten year anniversary. Nucleic Acids Res 43(D1), D542-D548. doi:10.1093/nar/gku1181

Courtot M, Juty N, Knüpfer C, Waltemath D, Zhukova A, Dräger A, Dumontier M, Finney A, Golebiewski M, Hastings J, Hoops S, Keating S, Kell DB, Kerrien S, Lawson J, Lister A, Lu J, Machne R, Mendes P, Pocock M, Rodriguez N, Villeger A, Wilkinson DJ, Wimalaratne S, Laibe C, Hucka M, Le Novère N. Controlled vocabularies and semantics in Systems Biology. Mol Syst Biol  7, 543. doi:

Czauderna T, Klukas C, Schreiber F (2010) Editing, validating and translating of SBGN maps. Bioinformatics 26(18), 2340-2341. doi:10.1093/bioinformatics/btq407

Demir E, Cary MP, Paley S, Fukuda K, Lemer C, Vastrik I, Wu G, D’Eustachio P, Schaefer C, Luciano J, Schacherer F, Martinez-Flores I, Hu Z, Jimenez-Jacinto V, Joshi-Tope G, Kandasamy K, Lopez-Fuentes AC, Mi H, Pichler E, Rodchenkov I, Splendiani A, Tkachev S, Zucker J, Gopinathrao G, Rajasimha H, Ramakrishnan R, Shah I, Syed M, Anwar N, Babur O, Blinov M, Brauner E, Corwin D, Donaldson S, Gibbons F, Goldberg R, Hornbeck P, Luna A, Murray-Rust P, Neumann E, Ruebenacker O, Samwald M, van Iersel M, Wimalaratne S, Allen K, Braun B, Carrillo M, Cheung KH, Dahlquist K, Finney A, Gillespie M, Glass E, Gong L, Haw R, Honig M, Hubaut O, Kane D, Krupa S, Kutmon M, Leonard J, Marks D, Merberg D, Petri V, Pico A, Ravenscroft D, Ren L, Shah N, Sunshine M, Tang R, Whaley R, Letovksy S, Buetow KH, Rzhetsky A, Schachter V, Sobral BS, Dogrusoz U, McWeeney S, Aladjem M, Birney E, Collado-Vides J, Goto S, Hucka M, Le Novère N, Maltsev N, Pandey A, Thomas P, Wingender E, Karp PD, Sander C, Bader GD  (2010) The BioPAX Community Standard for Pathway Data Sharing. Nat Biotechnol, 28, 935–942. doi:10.1038/nbt.1666

Funahashi A, Morohashi M, Kitano H, Tanimura N (2003) CellDesigner: a process diagram editor for gene-regulatory and biochemical networks. Biosilico 1 (5), 159-162

Gauges R, Rost U, Sahle S, Wegner K (2006) A model diagram layout extension for SBML. Bioinformatics 22(15), 1879-1885. doi:10.1093/bioinformatics/btl195

Gauges R, Rost U, Sahle S, Wengler K, Bergmann FT (2015) The Systems Biology Markup Language (SBML) Level 3 Package: Layout, Version 1 Core. J Integr Bioinform 12(2), 267. doi:10.2390/biecoll-jib-2015-267

Gonçalves E, van Iersel M, Saez-Rodriguez J (2013) CySBGN: A Cytoscape plug-in to integrate SBGN maps. BMC Bioinfo 14, 17. doi:10.1186/1471-2105-14-17

Hoops S, Sahle S, Gauges R, Lee C, Pahle J, Simus N, Singhal M, Xu L, Mendes P, Kummer U (2006) COPASI-a COmplex PAthway SImulator. Bioinformatics 22(24), 3067-3074. doi:10.1093/bioinformatics/btl485

Hucka M, Bolouri H, Finney A, Sauro HM, Doyle JC, Kitano H, Arkin AP, Bornstein BJ, Bray D, Cornish-Bowden A, Cuellar AA, Dronov S, Ginkel M, Gor V, Goryanin II, Hedley WJ, Hodgman TC, Hunter PJ, Juty NS, Kasberger JL, Kremling A, Kummer U, Le Novère N, Loew LM, Lucio D, Mendes P, Mjolsness ED, Nakayama Y, Nelson MR, Nielsen PF, Sakurada T, Schaff JC, Shapiro BE, Shimizu TS, Spence HD, Stelling J, Takahashi K, Tomita M, Wagner J, Wang J (2003) The Systems Biology Markup Language (SBML): A Medium for Representation and Exchange of Biochemical Network Models. Bioinformatics, 19, 524-531. doi:10.1093/bioinformatics/btg015

Hucka M, Nickerson DP, Bader G, Bergmann FT, Cooper J, Demir E, Garny A, Golebiewski M, Myers CJ, Schreiber F, Waltemath D, Le Novère N (2015) Promoting coordinated development of community-based information standards for modeling in biology: the COMBINE initiative. Frontiers Bioeng Biotechnol 3, 19. doi:10.3389/fbioe.2015.00019

Junker BH, Klukas C, Schreiber F (2006) VANTED: A system for advanced data analysis and visualization in the context of biological networks. BMC Bioinfo 7, 109. doi:10.1186/1471-2105-7-109

Kholodenko BN (2000) Negative feedback and ultrasensitivity can bring about oscillations in the mitogen-activated protein kinase cascades. Eur J Biochem.267(6), 1583-1588. doi:10.1046/j.1432-1327.2000.01197.x

Kitano H (2003) A graphical notation for biochemical networks. Biosilico 1 (5), 169-176. doi:10.1016/S1478-5382(03)02380-1

Le Novère N, Hucka M, Mi H., Moodie S, Shreiber F, Sorokin A, Demir E, Wegner K, Aladjem M, Wimalaratne S, Bergman FT, Gauges R, Ghazal P, Kawaji H, Li L, Matsuoka Y, Villéger A, Boyd SE, Calzone L, Courtot M, Dogrusoz U, Freeman T, Funahashi A, Ghosh S, Jouraku A, Kim S, Kolpakov F, Luna A, Sahle S, Schmidt E, Watterson S, Goryanin I, Kell DB, Sander C, Sauro H, Snoep JL, Kohn K, Kitano H (2009) The Systems Biology Graphical Notation. Nat Biotechnol 27, 735-741. doi:10.1038/nbt.1558

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramge D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks.  Bioinformatics 13, 2498-2504. doi:10.1101/gr.1239303

van Iersel MP, Villéger AC, Czauderna T, Boyd SE, Bergmann FT, Luna A, Demir E, Sorokin A, Dogrusoz U, Matsuoka Y, Funahashi A, Aladjem MI, Mi H, Moodie SL, Kitano H, Le Novère N, Schreiber F (2012) Software support for SBGN maps: SBGN-ML and LibSBGN. Bioinformatics 28, 2016-2021. doi:10.1093/bioinformatics/bts270

Setting things straight

If you are reading that, you are probably aware that I have been convicted and sentenced for a crime related to accessing indecent images on the internet. Following the sentence, there was quite a lot of media exposure, in particular follow-up articles in the Cambridge News and the BBC. These reports were quite misleading, misusing dates, and choosing words in order to maximise the shock effect.

I hoped that in time those reports would wane, but this was a pipe dream. I also cannot expect people to forget what they read. Although I cannot fight the free press, I can communicate on my own. This post aims at giving precisions on the what, when and how, and setting things straight.

First of all, I want to assure any reader that I am not seeking to minimise my crime, or trying to find excuses for it. I made mistakes, horrible ones, and I am paying for them. A side effect of the story is that it made me think deeply about what those crimes are, why and how they are committed.

Back to the facts. In August 2017, the police seized 14 electronic devices from my home, including computers, phones and hard-drives. In May 2018, I was summoned for an interview under caution at the Peterborough police station. The police gave me back 13 of the devices, that were found clean.

On an old Iomega hard disk, that I used as a home backup until a few years ago, they found two indecent moving images. The images were in the cache of a peer to peer software called Limewire. They were not readable, and were identified through digital forensic investigation. I was therefore charged for making images but not possession (i.e. I did not have these images and could not give them to anyone). Note that, of course, making an image is not at all the same as taking a picture. In the present case, it means creating a computer file of an image.

The police determined that the two files were opened only once, at the date of creation in May 2008, and never again. This was the charge for which I was sentenced. The police told me at the time that the offense was “at the lowest end of the scale”, and that I would probably only get a fine.

Now, during the interview under caution in May 2018, I was asked when I started using the Limewire software. I answered “in the early 2000s for downloading music”. I have since determined that I started using Limewire sometimes in 2005 (in fact, I only had broadband at home from end of 2003).

During the interview, the police also asked me about my use of chatrooms and forums. At one point, I was asked when I last saw an indecent image in those chatrooms. I answered “about a year ago”. Now, the specific chatrooms we were talking about were not illegal, or dedicated to the kind of material leading to the interview. And one does not choose what to see in forums and chatrooms, which is why people are not prosecuted for that (*).

These facts were then transformed through a Chinese whisper game. The police tape (that was never heard after the interview) led to a written report transmitted to the prosecution, which ordered a pre-sentencing report. This pre-sentencing report was sent back to the prosecution (different judges ordered the report and sentenced me). And finally, all of this was described by the judge in front of journalists who took hand-written notes.

As a consequence, the facts described above became, in the news reports, downloading child pornography between 2000 and 2016 and having two indecent moving images. Again, I am not trying to minimise my crime or finding excuses for it. But both these statements were untrue. And I believe the portrait those articles made of me instilled a climate of shock and fear that is not foreign to some reactions I am, and my family is, still suffering from.

I was sentenced in Cambridge crown court on 4th September. The judge was severe, and the sentence unexpectedly strong, taking solicitor, barrister, the police, probation services, and of course myself, by surprise. I was sentenced to 8 months custody, suspended for 2 years. During this period, I will be monitored by the probation services. I have also to complete a 30 days rehabilitation activity. This sentence will be spent after 4 years and 8 months.

In parallel to the sentence, I have to be on the sex offender registry for 10 years. In brief, this means that the police needs to know where I am at any time, plus a few other things. I also received a Sexual Harm Prevention Order, which forbids me to hide or delete my internet history and allows the police to monitor my electronic devices at any time without a warrant. However, this SHPO does not apply to electronic devices provided by employers.

Finally, I was fired from my job at the Babraham Institute and expelled from any board, panel, collaboration, community I was part of.

So, I made mistakes, and I am paying for them, dearly. I need to atone, and I will. There is no need to inflate my errors, and mislead people around me into fearing a monstrous predator.

(*) It is estimated that half a million adult UK males viewed such material in the past, which points to a societal problem that is large and complex, and is unlikely to be solved by casting out a few individuals. In other words, if you know 60 adult males, then it very likely one of them accessed this type of content. If you work in an institute employing 180 adult males, then 3 of them could have done so. And the proportion increases with the use of internet.

So, is it rare? No. Is it “normal”? No! Should the society deal with the problem? Yes. Is there redemption for the 1.7% of the population? I would hope so.

The Stop It Now campaign from the Lucy Faithfull foundation is a good source of information and help. If you are accessing this type of material, or thinking about it, you can contact them.

Tips for writing senior grant applications involving systems modelling

As I get older, I am asked to sit on more senior grant panels. Larger projects generally mean more diverse research approaches. I am also offered to sit on multi-disciplinary panels, where the projects must feature different approaches, or even different fields. Finally, those panels are often made up of more “mature” scientists. As a result of all that:

1) Panel members have less expertise, if any, in some of the fields covered by the applications. Even in their own field, they mostly master only part of the landscape. They left university decades ago, and kept abreast of developments via reading scientific literature. Since there are only 24 h in a day, we tend to focus our reading largely on what directly impacts our own research. When it comes to techniques, most of senior panel members left actual experiments/programming/equations for quite a while, and although they have an “academic” knowledge of the methods, they might not know the state of the art, or master the subtleties and pitfalls of given techniques.

2) Panel members/reviewers have a shorter attention span. This is not only because of age (although let’s face it, there is some decrease in focus and stamina), but also because senior scientists have more commitments and are always running for deadlines (there is also the increasing number of projects to evaluate. In my latest panel, I was in charge of 25 projects). Finally, there might also be a bit of arrogance and “cannot be bothered” attitude.

Applicants to large grants should take these facts into consideration. One area where it is particularly important is Systems Biology. I am not going to dwell on what is Systems Biology, but a subset includes the development of mathematical models and numerical simulations to reproduce the behaviour of biological systems. This is an area where I encountered systematic strategic errors in the way grant applications are written, that  decreased their chances of being successfully funded comparatively to projects in other areas. Below are a few advices that could allow senior panels to appreciate your projects better. This is only my opinion, and I might be wrong about the impact of those mistakes. Also, some of the advices seem obvious. But nevertheless, they are not systematically followed.

What is the question?

Building models is great. Models are integrators of knowledge, and building a model that can reproduce known behaviours of a system is the best way to see if we understand it. Models can also suggest new avenues of research or treatment. But models must fulfill a purpose. This purpose is not obligatory part of the project submitted itself, but it must be mentioned there. Why are you building the model? What questions do you want to answer with the model simulations and analysis? Note that this is not specific to computational modelling. I saw rejection of projects describing very complex experimental techniques which use was not justified. Technical activity must be commensurate with expected benefits.

Have a colleague to read your proposal, and ask them afterward “can you tell me what we are going to do with this model?” (Do not ask the question before, while handing them the application. Your colleague is clever and friendly. They will finds hints here and there, add their own ideas and put something together).

Clarity of the research plan

I cannot count the number of time where after a lengthy introduction, a detailed experiment part, projects end up with “and we will build a model”, or even worse, the dreadful “we will use a systems biology approach”. WTF is a “systems biology approach”? I have called myself a systems biologist for the best part of two decades and have not a clue. It does not mean anything, or it means too many things.

Explain what you are going to do and how. Which kind of modeling approach will you use? Why? Is this the best modeling approach considering the data you have and the questions you ask? If building a model, how will you build it? What will be the variables (e.g. which molecular species will be represented)? How will you relate them? Will you use numerical or logic approaches? Will you incorporate uncertainty? Will you study steady-states or kinetics? Will you use deterministic or stochastic simulations? Which software will you use? How are you going to analyse the data? How are you going to link the model to experimental evidence? Do you have a plan for parameterisation?

Don’t overdo it. One does not describe generic molecular biology kits or culture media in senior grant applications (except if this is at the core of the project), so we do not need to describe technical details that will not affect the results and their interpretation. But give enough details to convince the reviewers or panel members who might actually know what all that is about. An experimental plan that would not precise the organism or cell line to be used has almost no chance of getting through. Same for modelling! And actually, you also need to add enough explanation to allow for non-specialists to understand what you are going to do. For instance, many people are baffled by genome-scale constraint-based modelling of metabolism, confusing flux balance analysis and metabolic flux analysis, and therefore misunderstanding the (absent) role of metabolite concentrations. They also mistake them with ODE models, concluding that they are too big to be parameterised.

Have a colleague to read your proposal, and ask them afterward “Can you tell me which modelling method I will use and why it is the best for this project?

Provide preliminary data

Almost any experimental project comes packed with preliminary data that shows 1) why is the investigated question interesting, 2) why is the workplan feasible. It is no different with modelling. Why would people believe you? Past track record on other projects is not sufficient. At best it can show that you can do modelling. Good. But this is not enough. Remember, this blog post is particularly focused on large projects, with multiple lines of investigations and requesting large amounts of money. Often these projects have been written over many months or even a year. I know a few such projects for which the production of preliminary data required another, smaller, dedicated funding. So there is no excuse not to spend a sufficient effort benchmarking the modelling approaches you will use, and getting preliminary results, hopefully exciting and justifying a scale-up.

Describe the validation steps

This is a very important part, even more important than for experimental projects. A modeller cannot say “let the data talk”. Any number of models can lead to reasonable simulation results. That does not mean the model is the correct one, or that the results mean anything. You must convince the panel that you have a plan to check that your results are valid and are not a bunch of random numbers or graphs. How will you validate the results of your simulations and analysis? How will this validation feedback in your model design? How precisely will your predictions lead to new experiments? Who will do the validation? Where? When?

Have a colleague to read your proposal, and ask them afterward “Can you draw me a workflow of the modelling part of this project, and identify the points of contact with the rest of the project?

These were just some tips I came up with. Do you disagree with them? Would you add others?