Machine Translation having a laugh

Machine translation has made astounding progress in the past few years. Engines such as Google Neural Machine Translation and, even more so, DeepL have transformed the activity to the point that, in many cases, the result not only sounds like it was produced by a native speaker, but is also better than a translation made by a casual translator. However, it still relies too much on word-for-word translation, makes wrong choices when the source language contains homonyms, and ignores basic facts of life. Note that the two engines do not make the same mistakes. Google Translate tends to be too literal, picking the most frequent dictionary definition. DeepL relies on curated translations, and therefore uses metaphors and minor meanings, and reflects popular usage. Sometimes the result is rather funny. Below I record a few translations I encountered while using these tools, either to help with first drafts or to back-translate my own translations in order to check for ambiguities.

Source: “They had watched the bright dot arc across the night sky as they lay next to each other in the back yard, their hands brushing accidently on purpose.”
Expected: “Ils avaient regardé le point brillant traverser le ciel nocturne alors qu’ils étaient couchés l’un à côté de l’autre dans l’arrière-cour, leurs mains se frôlant « accidentellement »”
Google Translate: “Ils avaient regardé le point brillant se projeter dans le ciel nocturne alors qu’ils étaient couchés l’un à côté de l’autre dans l’arrière-cour, leurs mains se frôlant accidentellement exprès.”
DeepL: “Ils avaient vu les points brillants s’étendre les uns à côté des autres dans la cour arrière, leurs mains se brossant accidentellement les dents dans le ciel nocturne.”

Source: “He crumpled to his knees and sobbed.”
Expected: “Il tomba à genoux et sanglota.”
DeepL: “Il s’est froissé jusqu’aux genoux et a sangloté ”

Source: “She broke down and sobbed.”
Expected: “Elle s’effondra en sanglotant.”
Google Translate: “Elle est tombée en panne et a sangloté.”

Source: “The sprinkles kept them cool and added to the thrill of sailing a small skiff.”
Expected: “Les éclaboussures les gardaient au frais et ajoutaient au frisson de naviguer dans un petit bateau.”
Google Translate: “Les éclaboussures les ont gardés au frais et ont ajouté au frisson de naviguer dans un petit skiff.”
DeepL: “Les pépites de chocolat les gardaient au frais et ajoutaient à l’excitation de naviguer à bord d’une petite yole.”

Source: “The bamboo wind chimes on the Patel’s porch clacked in the gentle breeze.”
Expected: “Le carillon de bambou sur le porche des Patel claquait sous la douce brise.”
DeepL: “Le vent de bambou sonne sur le porche du Patel’s claqua dans la brise légère”

Source: “She ached to comfort him”
Expected: “Elle avait très envie de le réconforter”
Google Translate: “Elle avait mal à le réconforter”
DeepL: “Elle lui faisait mal pour le réconforter”

Source: “Nose pressed to the glass, Emily steamed the window in the bedroom”
Expected: “Le nez pressé contre la vitre, Émily couvrait de vapeur la fenêtre de la chambre”
Google Translate: “Le nez pressé contre la vitre, Emily cuit à la vapeur la fenêtre de la chambre”
DeepL: “Le nez pressé sur le verre, Emily a fait cuire à la vapeur la fenêtre de la chambre à coucher.”

Source: “spun the spigot handle”
Expected: “ouvrir le robinet”
Google Translate: “filé le manche”
DeepL: “faire tourner le manche de l’embout”

Source: “John’s jaw dropped”
Expected: “John en resta bouche bée”
DeepL: “La machoire de John a lâché”

Source: “He ate his date”
Expected: “Il mangea sa datte”
Google Translate: “Il a mangé son rendez-vous”
DeepL: “Il a mangé son rencard”

Source: “He was nibbling at her tit”
Expected: “Il mordillait son sein”
DeepL: “Il grignotait son sein”
Google Translate: “il mordillait sa mésange”

Source: “She put a bow in her daughter’s hair”
Expected: “Elle mit un nœud dans les cheveux de sa fille”
Google Translate: “Elle a mis un arc dans les cheveux de sa fille”

Source: “He entered a ball”
Expected: “Il entra dans un bal”
Google Translate: “Il est entré dans un ballon”
DeepL: “Il est entré dans une balle”

Source: “Après qu’il ait couché Susan, son père installa”
Expected: “After he put Susan to bed, his father installed”
DeepL: “After he slept with Susan, his father installed”

Source: “The banging sent chills down her spine”
Expected: “Les coups lui donnèrent froid dans le dos”
Google Translate: “La défonce lui donna des frissons dans le dos”

Source: “She became aware of her breathing, of the lub-dub of her heart, of every swallow”
Expected: “Elle prit conscience de sa respiration, du ta-doum de son cœur, de chaque déglutition”
DeepL: “Elle a pris conscience de sa respiration, du lub-dub de son cœur, de chaque hirondelle”

Source: “Do you think I’m stupid? I felt for their pulses”
Expected: “Tu penses que je suis stupide ? J’ai senti leur pouls.”
Google Translate: “Penses-tu que je suis stupide? Je cherchais leurs légumineuses”

A rant about sloppy writing

This is a rant, I am sorry. I do not usually rant on this blog; in fact, I believe this is a first. Also, I am at a stage of my life where I do not like pointing fingers and giving lessons. However, I think there is a real problem here. If you read this to the end, I would even say it could sometimes be a matter of life or death.

I am known for making plenty of mistakes when writing, in particular in e-mails (but not only). So this rant is first and foremost directed at myself! Not so long ago, I had to plunge back in time to clarify a historical point. I was stunned by the quality of the e-mails I wrote 15 years ago. Long, articulate, with really good grammar and spelling. How come I now produce such bad messages? The answer is sloppiness. At the time, I was just starting as a group leader and was keen to please and convince. I would re-read my messages before pressing the enter key, restructuring what was unclear and fixing typos at the same time. And then I became sloppy.

Over the years, I have worked with many researchers who wrote like they spoke, or worse. Students and post-docs provided me with initial drafts that were so mangled I could not even start to amend them (and I am not only talking about foreign scientists). As an associate editor, I received manuscripts I was ashamed of sending out for review. But youth or origin are not even required for such behaviour. Some native-English-speaking collaborators of mine were among the worst offenders. One did not use any punctuation or capitals; you had to figure out what a message meant by reading it aloud! His stance was that form did not matter, so he could not be bothered. If the recipients could not understand the content of the message, they were probably not worth interacting with.

But it matters.

First of all, sloppy writing is never cool. One or two typos here and there are forgivable. But what about a presentation with typos on every slide? Will that not distract the audience, and decrease the impact of the message? What about manuscripts submitted without a final proofread? Is that not a lack of respect for editors, reviewers and, ultimately, readers? Moreover, if you are so careless that you do not bother with such issues in a document meant to be evaluated by others, why would you not be equally light-handed with your statistics? What about your figures (whether numbers or graphics)? Can I really trust them?

Then, misunderstandings arise much more easily when the communication is only digital, without the body language and the added semantics coming from hand gestures and facial cues. Using one word for another, omitting or misplacing a comma, forgetting an accent in French: all these things can modify the meaning of a sentence and possibly change the entire message.

An area where the consequences are dire is translation. Typically, a translator receives a text without extra information, and with no possibility of contacting the original author. While this is always problematic, there is a case where it can be a question of life and death: the translation of medical documents.

I recently had to translate the report of a patient’s MRI examination. The writing was simply atrocious. The punctuation was bad, or just absent, and the text was riddled with spelling mistakes, typos etc. Some examples of what I found there:

gauclze => gauche
cart11age => cartilage
ex.tenseur => extenseur
liinites => limites?
rernanieme.n:ts => remaniements
Sur’ meniscopthie => sur meniscopathie
fem6:r:o-:patellaire => femoro-patellaire

I do not know which keyboard was used to type this, but it is not one I have come across; or the typist had very big fingertips. If the text came from OCR software trying to decipher the famously cryptic doctor handwriting, we would get wrong, but actual, words. One possible explanation would be a human typist struggling with the doctor’s handwriting. But then we come back to sloppiness, this time in editing. In this day and age, any text processing software (even the raw text editor I am using to type this) provides a spell-checker with the now ubiquitous, annoying wiggly red lines.

There is one silver lining to this story. Such a source text is absolutely indigestible for Machine Translation tools such as Google NMT or DeepL. And so many errors make Translation-Memory-based Computer-Assisted Translation almost unusable. So the translator has to do all the work, and be extra cautious; the text has to be carefully processed word after word. In that specific case, I managed to reach a blistering 200 words per hour. This makes it hard to make a living out of such work.

But more importantly, this doctor, or this typist, by being a sloppy writer, played with a patient’s life.

Using medians instead of means for single cell measures

In the absence of information regarding the structure of variability (whether intrinsic noise, technical error or biological variation), one very often assumes, consciously or not, a normal distribution, i.e. a “bell curve”. This is probably due to an intuitive application of the central limit theorem, which states that when independent random variables are added, their normalized sum tends toward a normal distribution, even if the original variables themselves are not normally distributed. The reasoning then goes that any biological process is the sum of many sub-processes, each with its own variability structure, and therefore its “noise” should be Gaussian.
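As a quick sketch of that intuition (the choice of uniform draws is arbitrary, and this is Python rather than anything used for the figures below): a single uniform variable is flat, yet sums of many of them concentrate symmetrically around their expected value.

```python
import random
import statistics

random.seed(0)

# Each "process" is the sum of n independent uniform draws. A single
# uniform draw is flat, not bell-shaped, yet the sums concentrate
# symmetrically around their expected value n * 0.5.
n = 50
sums = [sum(random.random() for _ in range(n)) for _ in range(10000)]

mean_of_sums = statistics.mean(sums)    # theory: n/2 = 25
sd_of_sums = statistics.stdev(sums)     # theory: sqrt(n/12) ~ 2.04
print(mean_of_sums, sd_of_sums)
```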

Although that sounds almost like common sense, alarm bells start ringing when we use such distributions for molecular measurements. Firstly, a normal distribution ranges from -∞ to +∞, and there is no such thing as a negative amount. So, at most, the variability would follow a truncated normal distribution, starting at 0. Secondly, the normal distribution is symmetrical. However, in everyday conversation, biologists talk of a variability “reaching twofold”. For a molecular measurement, a twofold increase and a twofold decrease do not represent the same amount, so there is an asymmetric notion here. We are linking the addition and removal of the same “quantum of variability” to a multiplication or division by the same number. Logarithms immediately come to mind, and log2 fold changes are indeed one of the most used methods to quantify differences. Populations of molecular measurements can also be – sometimes reasonably – fitted with log-normal distributions. Of course, several other distributions have been used to better fit cellular contents of RNA and protein, including the gamma, Poisson and negative binomial distributions, as well as more complicated mixtures.
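To make the asymmetry concrete, here is a tiny numeric example (the baseline of 100 is invented): a twofold increase adds 100 units, a twofold decrease removes only 50, yet in log2 space the two changes are mirror images.

```python
import math

baseline = 100.0          # invented baseline amount
up = baseline * 2         # twofold increase
down = baseline / 2       # twofold decrease

# The linear changes are asymmetric: +100 versus -50...
linear_up = up - baseline
linear_down = baseline - down

# ...but the log2 fold changes are perfectly symmetric: +1 and -1.
lfc_up = math.log2(up / baseline)
lfc_down = math.log2(down / baseline)
print(linear_up, linear_down, lfc_up, lfc_down)
```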

Let’s look at some single-cell gene expression measurements. Below, I plotted the distribution of read counts (read counts per million reads, to be accurate) for four genes in 232 cells. The asymmetry is obvious, even for NDUFAB1 (the acyl carrier protein, central to lipid metabolism). This dataset was generated using a SmartSeq approach and Illumina HiSeq sequencing. It is therefore likely that many of the observed zeros are “dropouts”, possibly due to the reverse transcriptase stochastically missing the mRNAs. This problem is probably amplified further with methods such as Chromium, which are known to detect fewer genes per cell. Nevertheless, even if we remove all zeros, we observe extremely similar distributions.

FourGenes.png

One of the important consequences of the normal distribution’s symmetry is that the mean and the median of the distribution are identical. In a population, we should have the same number of samples presenting less and presenting more of the substance than the mean. In other words, a “typical” sample, representative of the population, should display the mean amount of the substance measured. It is easy to see that this is not at all the case for our single-cell gene expressions. The numbers of cells expressing more than the mean of the population are 99 for ACP (not hugely far from the 116 expected for the median), 86 for hexokinase, 78 for histone acetyltransferase P300 and 30 for actin 2. In fact, in the latter case, the median is 0, mRNAs having been detected in only 50 of the 232 cells! So, if we pick a cell randomly in the population, most of the time it presents a count of 0 CPM of actin 2. The mean expression of 52.5 CPM is certainly not representative!
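The mean/median divergence on zero-inflated counts is easy to reproduce with made-up numbers (these are illustrative, not the actual actin 2 data):

```python
import statistics

# Toy zero-inflated counts: 182 of 232 "cells" show 0 CPM, the
# remaining 50 show a spread of positive counts (values invented).
counts = [0] * 182 + [50, 80, 120, 200, 400] * 10

mean_cpm = statistics.mean(counts)
median_cpm = statistics.median(counts)
print(mean_cpm, median_cpm)
```

The median (0) is what a randomly picked cell most often shows; the mean (about 37 here) describes no cell in particular.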

If we want to model the cell type, and provide initial concentrations for some messenger RNAs, we must use the median of the measurements, not the mean (of course, the best course of action would be to build an ensemble model, cf. below). The situation would be different if we wanted to model the tissue, that is, a sum of non-individualised cells representative of the population.

To explain how such asymmetric distributions can arise from noise following normal distributions, we can build a small model of gene expression. mRNA is transcribed at a constant flux, with a rate constant kT. It is then degraded following a unimolecular decay with rate constant kdeg (chosen to be 1 on average, for convenience). Both rate constants are computed from energies, following the Arrhenius equation, k = A·e^(-E/RT), where R is the gas constant (8.314 J·mol⁻¹·K⁻¹) and T is the temperature, which we set at 310 K (37 °C). To simplify, we just set the scaling factor A to 1, assuming it is included in the reference energy. E is 0 for degradation, and we modulate the reference transcription energy to control the level of transcript. Both transcription and degradation energies are affected by normally distributed noise that represents differences between cells (e.g. concentration and state of enzymes). So Ei = E + noise. Because of the Arrhenius equation, the normal distributions of energies are transformed into lognormal distributions of rates. Below I plot the distributions of the noises in the cells and the resulting rates.

EnergiesRates.png

The equilibrium concentration of the mRNA is then kT/kdeg (we could run stochastic simulations to add temporal fluctuations, but that would not change the message). The number of molecules is obtained by multiplying by the volume (1e-15 l) and the Avogadro constant. Each panel presents 300 cells. The distribution on the top-left looks kind of intermediate between those of hexokinase and ACP above. To get the values on the top-right panel, we simulate an overall twofold increase of the transcription rate, using a decrease of the energy by 8.314×310×ln(2). In this specific case, the observed ratios between the two medians and between the two means are both about 2.04, close to the “truth”. So we could correctly infer a twofold increase by looking at the means. In the bottom panels, we increase the variability of the system by doubling the standard deviation of the energy noises. Now the ratio of the medians is 1.8, suggesting an 80% increase, while the ratio of the means is 2.53, suggesting an increase of 153%!

DifferentkT.png
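The simulation above can be sketched in a few lines of Python (the noise levels here are arbitrary choices, not the ones used for the figures): normal noise on the energies turns into lognormal noise on the rates, and with larger noise the heavy tail makes the mean-based ratio a much shakier estimate of the true twofold change than the median-based one.

```python
import math
import random
import statistics

random.seed(1)

RT = 8.314 * 310  # gas constant times temperature, J/mol

def steady_states(e_transcription, noise_sd, n_cells=300):
    """Steady-state mRNA level kT/kdeg for each simulated cell.
    Normally distributed noise is added to both reaction energies;
    the Arrhenius equation k = exp(-E/RT) (scaling factor folded
    into the reference energy) turns it into lognormal rate noise."""
    levels = []
    for _ in range(n_cells):
        kT = math.exp(-(e_transcription + random.gauss(0, noise_sd)) / RT)
        kdeg = math.exp(-random.gauss(0, noise_sd) / RT)
        levels.append(kT / kdeg)
    return levels

# A twofold increase in transcription = an energy decrease of RT*ln(2)
e_ref = 0.0
e_up = e_ref - RT * math.log(2)

# Moderate noise: ratios of medians and of means both sit near the true 2
low1 = steady_states(e_ref, 0.3 * RT)
low2 = steady_states(e_up, 0.3 * RT)
med_ratio_low = statistics.median(low2) / statistics.median(low1)
mean_ratio_low = statistics.mean(low2) / statistics.mean(low1)

# Doubled noise: the mean-based estimate becomes much less reliable
hi1 = steady_states(e_ref, 0.6 * RT)
hi2 = steady_states(e_up, 0.6 * RT)
med_ratio_hi = statistics.median(hi2) / statistics.median(hi1)
mean_ratio_hi = statistics.mean(hi2) / statistics.mean(hi1)

print(med_ratio_low, mean_ratio_low, med_ratio_hi, mean_ratio_hi)
```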

In summary:

  1. Means of single cell molecular measurements are not a good way of getting a value representing the population;
  2. Comparing the means of single-cell measurements in two populations does not provide an accurate estimation of the underlying changes;

Batch correcting using part of a dataset – illustration with cell-types in scRNA-seq

We all know that one of the main sources of variability in molecular measurements is the batch effect. We perform the “same” experiment twice, using the same protocol and the same piece of kit. We believe we control everything and eliminate all sorts of unwanted perturbations. But at the end of the day, everything is different: luminosity, air pressure, what we ate at lunch etc. Fortunately, we can (sometimes, and partially) correct these effects down the line. Now, what happens if we want to correct the batch effects for only part of a dataset? When I needed to, I had to dig quite a bit to find out how to do so. I thought the trick could be useful to others (why we would want to do this will become clear later).

NB: I use the example of single-cell RNA-seq in this post. However, the approach can of course be applied to all kinds of data.
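Before diving into the real workflow, the essence of a location-only batch correction fits in a few lines. This Python toy (all numbers invented) simply centres each batch on the grand mean; ComBat, used below, adjusts both location and scale with empirical Bayes shrinkage, but the intuition is the same.

```python
import statistics

# Made-up expression values for one gene, measured in two batches;
# batch "b1" carries a systematic offset of roughly +2.
batches = {
    "b1": [5.1, 5.9, 5.4, 5.6],
    "b2": [3.2, 3.8, 3.1, 3.6],
}

grand_mean = statistics.mean(v for vals in batches.values() for v in vals)

# Location-only correction: shift every batch onto the grand mean.
corrected = {
    b: [v - statistics.mean(vals) + grand_mean for v in vals]
    for b, vals in batches.items()
}

# After correction, both batch means coincide with the grand mean.
m1 = statistics.mean(corrected["b1"])
m2 = statistics.mean(corrected["b2"])
print(m1, m2)
```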

First, let’s look at a batch correction applied to an entire dataset. I will use scRNA-seq data, generated with the Smart-Seq protocol. The details of the sample preparation and of the generation of the datasets are not relevant for the current post, so I will start directly with the count tables. The datasets have been cleaned, though. Counts have been corrected for library sizes (Counts Per Million reads) and “bad” cells have been removed (cells with far fewer reads than most, here less than 500000, and cells whose reads mapped to fewer genes, here less than 9000). I then removed all the genes that did not show at least 10 CPM in at least one cell, and genes that did not vary by at least twofold across the entire dataset. Finally, I took the log of the counts. Other approaches could be used, such as DESeq2’s rlog; it does not matter here. Let’s load the counts.

# load counts
cpm1 <- read.table("OneCellType.csv", 
                   sep=",",fill=T,
                   header=T,row.names=1)

We can look at the resulting table. Columns are cells. The first part of the name indicates the 96-well plate, while the second part gives the coordinates of the well. We have two batches, one composed of 1 plate, EPI1, and the other composed of 2 plates, EPI2 and EPI3. In total, we have 232 cells and 24235 genes.

cpm1

As we can see, many, many cells show zeros, as expected with scRNA-seq. But no row is made up entirely of zeros. Let’s visualise the dataset with a Principal Component Analysis.

# transpose the count matrix; needed for PCA
tcpm1<-t(cpm1)

# Run the PCA
PCA1<-prcomp(tcpm1)

# Provide the variance of components
library(factoextra)
eigen1<-get_eig(PCA1)  

# increase the left margin to avoid cropping the labels
par(mar=c(5.1, 5.1, 4.1, 2.1))

# Colour by plate (one colour per plate)
colour<-character(ncol(cpm1))
colour[grep("EPI1",colnames(cpm1))]<-rgb(230/255,159/255,0/255)
colour[grep("EPI2",colnames(cpm1))]<-rgb(86/255,180/255,233/255)
colour[grep("EPI3",colnames(cpm1))]<-rgb(0/255,158/255,115/255)

# Plot PC1 and PC2 (the first coordinates from PCA1$x)
plot(PCA1$x,col=colour,pch=16,
     cex=1.5,cex.lab=1.5,cex.axis=1.5,
     xlab=sprintf("PC1 %.1f %%",eigen1[1,2]),
     ylab=sprintf("PC2 %.1f %%",eigen1[2,2]))   

1celltypenocorr

The resulting plot does not exhibit much structure, and the variance landscape is very flat, the first 2 components carrying only about 5% of it. This is all good, since all those cells are supposed to belong to the same cell-type. However, we can see that while EPI2 and EPI3 (blue and green) – the two plates processed in the same batch – overlap nicely, EPI1 (in orange) is off. And this batch effect aligns perfectly with PC1. Yes, the most important source of variability (albeit a small one) is the batch! So we need to correct for it. To do that, we will use the function ComBat from the Bioconductor package sva.

First, we create a table that links our cells to their batches.

# load the package sva
library(sva)

# create a table with the cells and the batches they belong to
cells<-data.frame(batch = c(rep("b1",ncol(cpm1[,grep("EPI1",colnames(cpm1))])),
                            rep("b2",ncol(cpm1[,grep("EPI2",colnames(cpm1))])),
                            rep("b2",ncol(cpm1[,grep("EPI3",colnames(cpm1))]))),
                  row.names = colnames(cpm1))

cells

Since all cells belong to the same cell-type, we have only one column, linking each cell to its batch. We can now run the batch correction itself. As there is no variable of interest to preserve, we use an intercept-only model (~1). Then we replot the PCA.

# model to use in the batch correction
modcombat = model.matrix(~1,data=cells)
bcor_cpm1 = ComBat(dat=cpm1,batch=cells$batch,
                   mod=modcombat,
                   par.prior=TRUE, prior.plots=FALSE)

# redo the PCA
tbcpm1<-t(bcor_cpm1)
PCAb1<-prcomp(tbcpm1)
eigenb1<-get_eig(PCAb1)

plot(PCAb1$x,col=colour,pch=16,
     cex=1.5,cex.lab=1.5,cex.axis=1.5,
     xlab=sprintf("PC1 %.1f %%",eigenb1[1,2]),
     ylab=sprintf("PC2 %.1f %%",eigenb1[2,2]))

1celltypecorr

Success! Now all three plates, from the first and the second batch, are merged together. Note the decrease of the variance associated with PC1, from 3.1% to 2.5%. This was expected, since the shift of EPI1 compared with EPI2 and EPI3 was along PC1 in the first place. Now, this is important: if the feature we were looking to analyse had also aligned with PC1, we would have thrown the baby out with the bathwater. Fortunately, in our case, the interesting stuff aligned with PC2, and PC2 is almost unaffected by the batch correction. Of course, you cannot know that in advance; it required a series of iterative analyses to become aware of it. As a rule, the more orthogonal the feature is to the batch effect, the less it will be affected by the batch correction.

Now, that was easy since all the cells belonged to the same cell-type. In fact, in the real study, that was not quite the case. The first batch contained one cell-type, while the second batch contained two cell types. So let’s load the complete dataset.

# load counts
cpm2 <- read.table("TwoCellTypes.csv", 
                   sep=",",fill=T,
                   header=T,row.names=1)

We initially have 326 cells and 25299 genes. In addition to the plates of EPI cells, we now have a plate of LPM cells (what EPI and LPM mean really does not matter here). We can now run a PCA again.

# transpose data for PCA
tcpm2<-t(cpm2)

# Principal Component Analysis
PCA2<-prcomp(tcpm2)
eigen2<-get_eig(PCA2) 

colour<-character(ncol(cpm2))
colour[grep("EPI1",colnames(cpm2))]<-rgb(230/255,159/255,0/255)
colour[grep("EPI2",colnames(cpm2))]<-rgb(86/255,180/255,233/255)
colour[grep("EPI3",colnames(cpm2))]<-rgb(0/255,158/255,115/255)
colour[grep("LPM",colnames(cpm2))]<-rgb(0/255,0/255,0/255)

plot(PCA2$x,col=colour,pch=16,cex=1.5,
     cex.lab=1.5,cex.axis=1.5,
     xlab=sprintf("PC1 %.1f %%",eigen2[1,2]),
     ylab=sprintf("PC2 %.1f %%",eigen2[2,2]))

2celltypesnocorr

We can see that, again, while EPI2 and EPI3 are together, the cloud of EPI1 cells is slightly shifted upward. The plate composed of cells belonging to another cell-type, plotted in black, is clearly different from the three EPI ones. Let’s try a batch correction to bring together EPI1, EPI2 and EPI3. The procedure is absolutely identical to the previous one, except that we now integrate the new plate, resulting in 3 plates associated with batch b2.

# create a table with the cells and the batches they belong to
cells2<-data.frame(batch = c(rep("b1",ncol(cpm2[,grep("EPI1",colnames(cpm2))])),
                    rep("b2",ncol(cpm2[,grep("EPI2",colnames(cpm2))])),
                    rep("b2",ncol(cpm2[,grep("EPI3",colnames(cpm2))])),
                    rep("b2",ncol(cpm2[,grep("LPM1",colnames(cpm2))]))),
                    row.names = colnames(cpm2))

modcombat = model.matrix(~1,data=cells2)
bcor_cpm2 = ComBat(dat=cpm2,batch=cells2$batch,
                   mod=modcombat,
                   par.prior=TRUE, prior.plots=FALSE)

tbcpm2<-t(bcor_cpm2)
PCAb2<-prcomp(tbcpm2)
eigenb2<-get_eig(PCAb2)

plot(PCAb2$x,col=colour,pch=16,cex=1.5,cex.lab=1.5,cex.axis=1.5,
     xlab=sprintf("PC1 %.1f %%",eigenb2[1,2]),ylab=sprintf("PC2 %.1f %%",eigenb2[2,2]))

2celltypeswrongcorr

Fail! EPI1 from batch b1 has been batch-corrected taking into account all three plates of batch b2, which include the LPM cells. As a result, EPI1 cells end up located somewhere between EPI2/3 and LPM1. Clearly this is wrong. What we want is to bring together EPI1 and EPI2/3 while ignoring LPM1. We can do that without a problem, by declaring the cell-type as a variable of interest, whose effect the correction will then leave alone. But first we need to do something. The initial clean-up was performed on the entire dataset. Even if gene expression varied throughout the dataset, we could still have genes that do not vary either across EPI cells or across LPM cells. That would cause newer versions of ComBat to throw errors. So let’s remove the culprit genes.

# sanity check on variance: remove genes whose
# variance is 0 either in EPI or in LPM cells
varEPI <- apply(cpm2[,grep("EPI",colnames(cpm2))], 1, var)
varLPM <- apply(cpm2[,grep("LPM",colnames(cpm2))], 1, var)
# safer than -which(...), which would empty the table if no gene matched
cpm2 <- cpm2[varEPI > 0 & varLPM > 0, ]

We still have 20885 genes to play with, which is plenty. Now we will create a table linking the cells and the batches, as before, but this time also recording the cell-type they belong to.

# create a table with the cells and the batches they belong to
cells3<-data.frame(batch = c(rep("b1",ncol(cpm2[,grep("EPI1",colnames(cpm2))])),
                            rep("b2",ncol(cpm2[,grep("EPI2",colnames(cpm2))])),
                            rep("b2",ncol(cpm2[,grep("EPI3",colnames(cpm2))])),
                            rep("b2",ncol(cpm2[,grep("LPM1",colnames(cpm2))]))),
                   celltype = c(rep("epi",ncol(cpm2[,grep("EPI",colnames(cpm2))])),
                              rep("lpm",ncol(cpm2[,grep("LPM",colnames(cpm2))]))),
                   row.names = colnames(cpm2))

cells3_1

cells3_2

We can now instruct ComBat to correct for the batch while taking the cell-type into account in the model (~cells3$celltype).

modcombat = model.matrix(~cells3$celltype,data=cells3)
bcor_cpm3 = ComBat(dat=cpm2,batch=cells3$batch,
                   mod=modcombat,
                   par.prior=TRUE, prior.plots=FALSE)
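Conceptually, what the mod argument buys us can be seen in a toy, location-only sketch (Python, all numbers invented; ComBat itself fits a joint model with empirical Bayes shrinkage). The batch shift is estimated only from the cell type present in both batches, then applied to every cell of the batch, so the biological difference between cell types survives the correction.

```python
import statistics

# One gene, invented values. Cell type "epi" was profiled in both
# batches; "lpm" only in batch b2, which carries a -1.0 offset.
cells = [
    # (batch, celltype, value)
    ("b1", "epi", 6.0), ("b1", "epi", 6.4), ("b1", "epi", 6.2),
    ("b2", "epi", 5.0), ("b2", "epi", 5.4), ("b2", "epi", 5.2),
    ("b2", "lpm", 8.0), ("b2", "lpm", 8.4), ("b2", "lpm", 8.2),
]

# Estimate each batch's shift from the cell type present in both
# batches (epi), so the epi/lpm biological difference is untouched.
epi_all = statistics.mean(v for _, c, v in cells if c == "epi")
shift = {b: statistics.mean(v for bb, c, v in cells
                            if bb == b and c == "epi") - epi_all
         for b in {b for b, _, _ in cells}}

# Apply the per-batch shift to every cell of that batch.
corrected = [(b, c, v - shift[b]) for b, c, v in cells]

epi_b1 = statistics.mean(v for b, c, v in corrected if b == "b1")
epi_b2 = statistics.mean(v for b, c, v in corrected if b == "b2" and c == "epi")
lpm = statistics.mean(v for b, c, v in corrected if c == "lpm")
print(epi_b1, epi_b2, lpm - epi_b2)
```

After correction, the EPI cells from both batches agree, while the gap between the two cell types is preserved.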

And we run the PCA again.

tbcpm3<-t(bcor_cpm3)
PCAb3<-prcomp(tbcpm3)
eigenb3<-get_eig(PCAb3)

plot(PCAb3$x,col=colour,pch=16,cex=1.5,
     cex.lab=1.5,cex.axis=1.5,
     xlab=sprintf("PC1 %.1f %%",eigenb3[1,2]),
     ylab=sprintf("PC2 %.1f %%",eigenb3[2,2]))

2celltypesgoodcorr

Hooray! Now we have EPI1, EPI2 and EPI3 together, regardless of the batch, while the LPM cells stand quietly to the side. We can export the corrected counts for future use (e.g. clustering).

write.csv(bcor_cpm3, file="batchCorrected.csv",quote=FALSE)

DVDs I sell on eBay

Here is a list of the DVDs I am currently selling on eBay. The list is updated continuously, so come back if you have not yet seen something you wanted. Everything I sell can also be seen directly on my seller page.

NB1: The descriptions I provide for each DVD are mostly not mine, but gathered from the cases’ back covers, websites etc.

NB2: I only mention postage costs for the UK in the eBay listings. However, provided that the proper costs are covered, I am happy to send them anywhere.

Adaptation

front[eBay link]

A lovelorn screenwriter becomes desperate as he tries and fails to adapt ‘The Orchid Thief’ by Susan Orlean for the screen.

 

Director: Spike Jonze

Writers: Susan Orlean (book), Charlie Kaufman (screenplay)

Along came Polly, 2004

front[eBay link]

A buttoned-up newlywed finds his too-organized life falling into chaos when he falls in love with an old classmate.

Director:John Hamburg

Writer:John Hamburg

Angus, Thongs and Perfect Snogging, 2008

front[eBay link]

The story centers on a 14-year-old girl who keeps a diary about the ups and downs of being a teenager, including the things she learns about kissing.

Arthur and the invisibles, from Luc Besson

front[eBay link]

Ten-year-old Arthur, in a bid to save his grandfather’s house from being demolished, goes looking for some much-fabled hidden treasure in the land of the Minimoys, tiny people living in harmony with nature.

 

Director:Luc Besson

Bananas

front[eBay link]

When a bumbling New Yorker is dumped by his activist girlfriend, he travels to a tiny Latin American nation and becomes involved in its latest rebellion.

 

Director:Woody Allen

Bewitched, 2006

front[eBay link]

Thinking he can overshadow an unknown actress in the part, an egocentric actor unknowingly gets a witch cast in an upcoming television remake of the classic sitcom Bewitched (1964).

 

Director:Nora Ephron

The Bonfire of the Vanities, 1990

front[eBay link]

After his mistress runs over a young teen, a Wall Street hotshot sees his life unravel in the spotlight, attracting the interest of a down-and-out reporter.

 

Director: Brian De Palma (as Brian DePalma)

Writers: Michael Cristofer (screenplay), Tom Wolfe (novel)

 

 

Bridesmaids, 2011

front[eBay link]

Competition between the maid of honor and a bridesmaid, over who is the bride’s best friend, threatens to upend the life of an out-of-work pastry chef.
Director: Paul Feig

 

The Cave, 2005

Bloodthirsty creatures await a pack of divers who become trapped in an underwater cave network.

 

Director: Bruce Hunt

City of Angels, 1999

front[eBay link]

Inspired by the modern classic, Wings of Desire, City involves an angel (Cage) who is spotted by a doctor in an operating room. Franz plays Cage’s buddy who somehow knows a lot about angels.

Director: Brad Silberling

Writers:Wim Wenders (screenplay “Der Himmel über Berlin”), Peter Handke (screenplay “Der Himmel über Berlin”)

Stars:Nicolas Cage, Meg Ryan, Andre Braugher

 

Closer, 2004

front[eBay link]

The relationships of two couples become complicated and deceitful when the man from one couple meets the woman of the other.

Writers: Patrick Marber (play), Patrick Marber (screenplay)

 

Cold Mountain, 2004

In the waning days of the American Civil War, a wounded soldier embarks on a perilous journey back home to Cold Mountain, North Carolina to reunite with his sweetheart.

Writers: Charles Frazier (book), Anthony Minghella (screenplay)

Stars: Jude Law, Nicole Kidman, Renée Zellweger

The Devil’s Chair, 2007

front[eBay link]

With a pocketful of drugs, Nick West takes out his girlfriend Sammy, for a good time. When they explore an abandoned asylum, the discovery of a bizarre device, a cross between an electric chair and sadistic fetish machine, transforms drugged-out bliss into agony and despair.

 

Director: Adam Mason

 

Embrassez qui vous voudrez (Summer Things), 2004

front[eBay link]

Two couples of friends, one very rich, the other almost broke, decide to go on holiday. Julie, a single mother, joins them too. Once at the seaside, a complicated criss-cross of love affairs starts among them, involving a transsexual, a jealous brother, a Latin lover and another nervous, stressed couple. Not to mention the daughter of one of them, who is secretly in Chicago with one of her father’s employees… And at the end of the summer, all of them will attend the same party…

 

Director:Michel Blanc

Garfield, 2004

Jon Arbuckle buys a second pet, a dog named Odie. However, Odie is then abducted and it is up to Jon’s cat, Garfield, to find and rescue the canine.

 

Director: Peter Hewitt (as Pete Hewitt)

Writers: Jim Davis (comic strip “Garfield”), Joel Cohen

 

Garfield 2, 2006

front[eBay link]

Jon and Garfield visit the United Kingdom, where a case of mistaken cat identity finds Garfield ruling over a castle. His reign is soon jeopardized by the nefarious Lord Dargis, who has designs on the estate.

 

Director: Tim Hill

Half Light

front[eBay link]

Rachel Carlson, a successful novelist, moves to a small Scottish village to move on with her life after the death of her son. Strange things start to happen as she is haunted by ghosts and real-life terror.

 

Director: Craig Rosenberg

He’s Just Not That Into You, 2009

front[eBay link]

This Baltimore-set movie of interconnecting story arcs deals with the challenges of reading or misreading human behavior.

 

Director: Ken Kwapis

 

I Could Never Be Your Woman, 2007

cover[eBay link]

A mother falls for a younger man while her daughter falls in love for the first time. Mother Nature messes with their fates.

Director: Amy Heckerling

Writer: Amy Heckerling

Stars: Michelle Pfeiffer, Paul Rudd, Saoirse Ronan

The Interpreter, 2005 (DVD 2010)

front[eBay link]

Political intrigue and deception unfold inside the United Nations, where a U.S. Secret Service agent is assigned to investigate an interpreter who overhears an assassination plot.

Writers: Martin Stellman (story), Brian Ward (story)

 

Kingdom of Heaven, 2005

front[eBay link]

Balian of Ibelin travels to Jerusalem during the Crusades of the 12th century, and there he finds himself as the defender of the city and its people.

Ladies in Lavender, 2005

front[eBay link]

Two sisters befriend a mysterious foreigner who washes up on the beach of their 1930’s Cornish seaside village.

 

Director: Charles Dance

Writers: William J. Locke (short story), Charles Dance

 

Little Children

front[eBay link]

The lives of two lovelorn spouses from separate marriages, a registered sex offender, and a disgraced ex-police officer intersect as they struggle to resist their vulnerabilities and temptations in suburban Massachusetts.

 

Director: Todd Field

Writers: Todd Field (screenplay), Tom Perrotta (screenplay)

Stars: Kate Winslet, Jennifer Connelly, Patrick Wilson

 

Lord of War, 2005

front[eBay link]

An arms dealer confronts the morality of his work as he is being chased by an INTERPOL Agent.

Director: Andrew Niccol

 

Michael Clayton, 2006

front[eBay link]

A law firm brings in its “fixer” to remedy the situation after a lawyer has a breakdown while representing a chemical company that he knows is guilty in a multibillion-dollar class action suit.

 

Director: Tony Gilroy

Writer: Tony Gilroy

Mickey Blue Eyes, 2000

front[eBay link]

An English auctioneer proposes to the daughter of a Mafia kingpin, only to realize that certain “favors” would be asked of him.

Director: Kelly Makin

 

The Peculiar Adventures of Hector, 2007

front[eBay link]

This mini-series follows Hector and his friends on a series of amazingly imaginative adventures on their way to school each morning.

Priceless, 2008

Through a set of wacky circumstances, a young gold digger mistakenly woos a mild-mannered bartender thinking he’s a wealthy suitor.

 

Director: Pierre Salvadori

Writers: Pierre Salvadori (scenario), Benoît Graffin (scenario)

 

Rachel Getting Married

front[eBay link]

A young woman who has been in and out of rehab for the past ten years, returns home for the weekend for her sister’s wedding.

 

Director: Jonathan Demme

Writer: Jenny Lumet

Rire et Châtiment (Laughter and Punishment)

front[eBay link]

Vincent Roméro is an osteopath who makes a point of being funny every minute he is awake. In fact it is his way to remain the center of attention. His antics wind up irritating his wife Camille, a geriatric doctor, to such an extent that she leaves him one day. At first Vincent is so persuaded that his many qualities are hard to beat that he does not doubt she will soon come back. But as she doesn’t, he starts questioning himself. All the more as he realizes, to his dismay, that each time he clowns around, the people near him start dying of laughter… literally, that is!

Director: Isabelle Doval

Stars: José Garcia, Isabelle Doval, Laurent Lucas, Benoît Poelvoorde

Shaggy and Scooby-Doo Get a Clue Vol 1

ebay-front[eBay link]

Shaggy and Scooby-Doo Get a Clue Vol 2

front[eBay link]

Surf’s Up, 2007

front[eBay link]

A behind-the-scenes look at the annual Penguin World Surfing Championship, and its newest participant, up-and-comer Cody Maverick.

Writers: Don Rhymer (screenplay by), Ash Brannon (screenplay by)

 

Trigger Happy TV season 1, DVD 2006

front[eBay link]

Trigger Happy TV is a hidden camera/practical joke reality television series. The original British edition of the show, produced by Absolutely Productions, starred Dom Joly and ran for two series on the British television channel Channel 4 from 2000 to 2003. Joly made a name for himself as the sole star of the show, which he produced and directed with cameraman Sam Cadman.

 

 

Two Brothers, 2004

front[eBay link]

Two tigers are separated as cubs and taken into captivity, only to be reunited years later as enemies by an explorer (Pearce) who inadvertently forces them to fight each other.

 

Director: Jean-Jacques Annaud

Writers: Alain Godard (scenario), Jean-Jacques Annaud (scenario)

 

Walk the Line, 2005

front[eBay link]

A chronicle of country music legend Johnny Cash’s life, from his early days on an Arkansas cotton farm to his rise to fame with Sun Records in Memphis, where he recorded alongside Elvis Presley, Jerry Lee Lewis, and Carl Perkins.

Director: James Mangold

Writers: Johnny Cash (book), Johnny Cash (book)

 

 

Wild Target, 2010

front[eBay link]

A hitman tries to retire but a beautiful thief may change his plans.

 

Director: Jonathan Lynn

Writers: Lucinda Coxon (screenplay), Pierre Salvadori (film “Cible émouvante”)

Books I sell on eBay

NB1: The descriptions I provide for each book are mostly not mine, but gathered from book back-covers, the publishers’ websites, etc.

NB2: I only mention postage costs for the UK in the eBay listings. However, provided that the proper costs are covered, I am happy to send them anywhere.

SCIENCE BOOKS

Comparative Vertebrate Neuroanatomy, Butler and Hodos, 1996

Ebay-cover.jpg[eBay link]

By applying the tools of modern neuroanatomy to brain structure and function in various species, researchers have discovered that numerous cell groups and interconnections, known to be present in mammals, also exist in non-mammalian vertebrates. This book reveals how the brains of various vertebrates are astoundingly similar in some ways, while in others they are quite different. The authors examine how the form of the brain is modified and magnified to perfect and capitalize on a specific function, making any particular animal a “specialist” in its area. They also clarify the forms and functions of the nervous system that have allowed vertebrates to adapt to almost every aspect of the earth’s environment.

Handbook of Basal Ganglia structure and function, Steiner and Tseng, 2010

Ebay-cover[eBay link]

The Basal Ganglia comprise a group of forebrain nuclei that are interconnected with the cerebral cortex, thalamus and brainstem. Basal ganglia circuits are involved in various functions, including motor control and learning, sensorimotor integration, reward and cognition. The importance of these nuclei for normal brain function and behavior is emphasized by the numerous and diverse disorders associated with basal ganglia dysfunction, including Parkinson’s disease, Tourette’s syndrome, Huntington’s disease, obsessive-compulsive disorder, dystonia, and psychostimulant addiction.

The Handbook of Basal Ganglia provides a comprehensive overview of the structural and functional organization of the basal ganglia, with special emphasis on the progress achieved over the last 10-15 years. Organized in six parts, the volume describes the general anatomical organization and provides a review of the evolution of the basal ganglia, followed by detailed accounts of recent advances in anatomy, cellular/molecular, and cellular/physiological mechanisms, and our understanding of the behavioral and clinical aspects of basal ganglia function and dysfunction.

Modeling Neural Development, 2003

Ebay-cover[eBay link]

This is one of the first books to study neural development using computational and mathematical modeling. Modeling provides precise and exact ways of expression, which allow us to go beyond the insights that intuitive or commonsense reasoning alone can yield. Most neural modeling focuses on information processing in the adult nervous system; Modeling Neural Development shows how models can be used to study the development of the nervous system at different levels of organization and at different phases of development, from molecule to system and from neurulation to cognition.

The book’s fourteen chapters follow loosely the chronology of neural development. Chapters 1 and 2 study the very early development of the nervous system, discussing gene networks, cell differentiation, and neural tube development. Chapters 3-5 examine neuronal morphogenesis and neurite outgrowth. Chapters 6-8 study different aspects of the self-organization of neurons into networks. Chapters 9-12 cover refinement of connectivity and the development of specific connectivity patterns. Chapters 13 and 14 focus on some of the functional implications of morphology and development.

Each chapter contains an overview of the biology of the topic in question, a review of the modeling efforts in the field, a discussion in more detail of some of the models, and some perspectives on future theoretical and experimental work.

The science of studying neural development by computational and mathematical modeling is relatively new; this book, as Dale Purves writes in the foreword, “serves as an important progress report” in the effort to understand the complexities of neural development.

CHILDREN’S BOOKS

A dog’s best friend, 1999

Ebay-cover

[eBay link]

Toby’s life is all fillet steak and walks in the park, but one night he gets lost – and everything changes!
He gets wet, cold and hungry, but also makes a new friend. And when his old owners turn up, Toby has to make the toughest decision of his life!

Children of seven and above will enjoy reading this story, but it can be read to younger children too.

24 pages.

 

The Booktime Book of Fantastic First Poems

Ebay-cover[eBay link]

28 pages.

 

 

Fat Puss and Friends, Harriet Castor

Ebay-cover[eBay link]

92 pages.

Contains the stories:
Fat Puss
Fat Puss finds a friend
Fat Puss in summer
Fat Puss meets a stranger
Fat Puss at Christmas

Horrid Henry, 1995 (book 2008)

front[eBay link]

The original novel that initiated the very successful series.

“His fiendish plots will make you ache with laughter”

Horrid Henry’s purple hand gang joke book, 2011

front[eBay link]

“Laugh your head off with Henry and the rest of the purple hand gang in this collection of the best, most ridiculously rib-tickling jokes by Horrid Henry fans everywhere”

A Giant Slice of Horrid Henry

front[eBay link]

Contains the three stories:
Horrid Henry meets the Queen
Horrid Henry’s underpants
Horrid Henry’s stinkbomb

A Helping of Horrid Henry

front[eBay link]

Three hilarious books in one volume:
Horrid Henry’s Nits
Horrid Henry Gets Rich Quick
(a.k.a. Horrid Henry Strikes it Rich)
Horrid Henry’s Haunted House

 

The Hundred-Mile-an-Hour Dog Goes for Gold, Jeremy Strong

Ebay-Cover[eBay link]

“Guess what’s coming to town!
The animal games!

There’ll be show-jumping for horses AND rabbits, and discus for dogs – so of course I have to enter Streaker.

Mum says a CARROT is more obedient than my dog, but I think she can do it – Streaker can go for GOLD!”

149 pages.

Lego Star Wars character encyclopedia

Ebay-front[eBay link]

Describes more than 300 minifigures: all the main characters of the saga, in their different outfits.

A figurine is provided, although it is not Han Solo as written on the book, but a rebel trooper. It was already the case when we bought the book new; this must have been a mistake made in the factory.

 

 

Monster stories for under fives, Joan Stimson, Ladybird

Ebay-cover[eBay link]

43 pages.

 

 

Pigs Might Fly, Red House Younger Children Award 2006

Ebay-front[eBay link]

“Let me win, little pig, LET ME WIN!” The Big Bad Wolf is back and badder than ever! So when the Three Pigs enter the “Pie in the Sky” Air Race, he’s determined to snaffle the prize pies and have the pigs for pudding. Will the Wolf win – or can Wilbur save the day? A fast-paced, frantically-funny sequel to a well-loved tale.

32 pages.

Winner of the Red House Children’s Book Award, category Younger Children.

 

Puzzle Castle, Usborne young puzzles

front[eBay link]

32 pages

My Treasury of Stories and Rhymes, hardback 2005

Ebay-front[eBay link]

384 pages of children’s delight: many stories and many songs, of different lengths and levels.

Where’s my Teddy? Jez Alborough 2004

Ebay-cover[eBay link]

Eddy’s lost his teddy, Freddy.
So off he goes to the wood to find him.
But the wood is dark and horrible and little Eddy is in for a gigantic surprise!

 

32 pages.

 

 

Wriggle and Roar, by Julia Donaldson (“The Gruffalo”) and Nick Sharratt (“Tracy Beaker”)

ebay-cover[eBay link]

Rhymes to join in with.

A classic by Julia Donaldson, illustrated by Nick Sharratt of “Tracy Beaker” fame

40 pages.

OTHER BOOKS

Gardens of delight, by Erica James

front[eBay link]

The Gardens of Delight brochure promises the opportunity to visit some of the most beautiful gardens in the Lake Como area of Italy. For Lucy, the chance to go to Italy offers more than just gardens. Lake Como is where her father lives, and the last time she saw him was when she was just a teenager.

Recently married Helen and her wealthy husband have just moved into the Old Rectory. With her husband spending so much time away from home, Helen throws herself into caring for the garden. But Helen needs help – and friends – and so decides to take the plunge and join the local Garden Club.

Conrad isn’t the least bit interested in gardening. Widowed for five years, his life revolves around work and humouring Mac, his elderly uncle who lives with him, and who has expressed a desire to go on the Gardens of Delight tour. Reluctantly, Conrad agrees to accompany him. ‘Anything for a peaceful life,’ he concedes. But a peaceful life is the last thing any of them are in for…

479 pages

Hidden Talents, by Erica James

front[eBay link]

Dulcie Ballantyne knows that creative writers’ groups attract an unlikely mix of people, so when she starts up Hidden Talents, she is well prepared for the assortment of people she is bringing together.

Beth King is facing empty-nest syndrome as her only son, Nathan, leaves home for university. Jack Solomon, a local estate agent, is having trouble coming to terms with the shock of his wife leaving him for his best friend. Jaz Rafferty is an intensely private seventeen-year-old girl who writes to escape her large, boisterous family. Victor Blackmore is a know-it-all, who claims to be writing the blockbuster novel every publisher will be clamouring for.

What they all have in common is a need to escape, as well as a desire to keep their lives as private as possible. As they grow more confident in their writing skills, friendships develop, and gradually they come to realise that a little openness isn’t necessarily a bad thing.

488 pages.

Powershift, by Alvin Toffler

front[eBay link]

Alvin Toffler’s Future Shock and The Third Wave are among the most influential books of our time. Now, in Powershift, he brings to a climax the ideas set forth in his previous works to offer a stunning vision of the future that will change your life.

In Powershift, Toffler argues that while headlines focus on shifts of power at the global level, equally significant shifts are taking place in the everyday world we all inhabit–the world of supermarkets and hospitals, banks and business offices, television and telephones, politics and personal life. The very nature of power is changing under our eyes. Powershift maps the “info-wars” of tomorrow and outlines a new system of wealth creation based on individualism, innovation, and information. As old political antagonisms fade, Toffler identifies where the next, far more important world division will arise–not between East and West or North and South, but between the “fast” and the “slow.” 612 pages

The Singularity project, by FM Busby

front[eBay link]

Busby, in the classic mode of Robert A. Heinlein (who was a friend and fan of Busby’s work), tells a rich, fast-paced and cleverly plotted tale of the day after tomorrow in The Singularity Project. Mitch Banning is a free-lance engineer in Seattle, who hires onto a secret high-tech project financed by industrialist George Detweiler that will change the world…if it works. And Mitch doesn’t believe it will, since the people creating the hush-hush demonstration of the world’s first matter transmitter include an elderly con-man, an addict-physicist, and a tough South American Indian with a knife. 349 pages

 

Sunset in St Tropez, by Danielle Steel

front[eBay link]

In her 55th bestselling novel, Danielle Steel explores the seasons of an extraordinary friendship, weaving the story of three couples, lifelong friends, for whom a month’s holiday in St. Tropez becomes a summer of change, revelation, secrets, surprises, and new beginnings…

The Swiss Family Robinson, by Johann Wyss, 1968 edition

front[eBay link]

American print of the Swiss classic. This is the full-length version of WHG Kingston’s translation (446 pages).

 

The Time Ships, by Stephen Baxter

front[eBay link]

The highly-acclaimed sequel to H G Wells’s The Time Machine, from the heir to Arthur C. Clarke.

Written to celebrate the centenary of the publication of H G Wells’s classic story THE TIME MACHINE, Stephen Baxter’s stunning sequel is an outstanding work of imaginative fiction.

The Time Traveller has abandoned his charming and helpless Eloi friend Weena to the cannibal appetites of the Morlocks, the devolved race of future humans from whom he was forced to flee. He promptly embarks on a second journey to the year AD 802,701, pledged to rescue Weena. He never arrives. The future was changed by his presence… and will be changed again. Hurtling towards infinity, the Traveller must resolve the paradoxes building around him in a dazzling temporal journey of discovery. He must achieve the impossible if Weena is to be saved.

629 pages

Zero Coupon, by Paul Erdman

front[eBay link]

Willy Saxon is a financier extraordinaire, until his code of honour gets him three years in jail. Now he’s free to shake the world’s money tree in a daring hustle of global dimensions, tapping the greed of Wall Street and challenging the arrogance of European bankers until both are brought to their knees. The plan’s airtight. The prize is untold billions. And this time Willy plays for keeps.
350 pages

 

plotGODESeq: differential expression and Gene Ontology enrichment on one plot

I recently came across the package GOplot by Wencke Walter (http://wencke.github.io/). In particular I liked the function GOBubble. However, I found it difficult to customise the plot: in particular, I wanted to colour the bubbles differently and to control the plotting area. So I took the idea and extended it. Many aspects of the plot can be configured. It is a work in progress, and not all features of GOBubble are implemented at the moment. For instance, we cannot separate the different branches of Gene Ontology, or add a table listing the labelled terms. I also have a few ideas to make the plot more versatile. If you have suggestions, please tell me. The code and the example below can be found at https://github.com/lenov/plotGODESeq/

What we want to obtain at the end is the following plot:

final-inkscape

The function plotGODESeq() takes two mandatory inputs: 1) a dataframe of Gene Ontology enrichment data, and 2) a dataframe of differential gene expression data. Note that the function works better if the dataset is limited, in particular the number of GO terms. It is useful to analyse the effect of a perturbation, chemical or genetic, or to compare two cell types that are not too dissimilar. Comparing samples that exhibit several thousand differentially expressed genes, resulting in thousands of enriched GO terms, will not only slow the function to a crawl, it is also useless (GO enrichment should not be used in these conditions anyway; the results always show things like “neuronal transmission” enriched in neurons versus “immune process” enriched in leucocytes). A large variety of other arguments can be used to customise the plot, but none is mandatory.

To use the function, you need to source the script from wherever it is located; in this example, it sits in the session directory. (I know I should turn the function into a package. It is on my to-do list.)

source('plotGODESeq.R')

Input

The Gene Ontology enrichment data must be a dataframe containing at least the following columns: ID – the identifier of the GO term, description – the description of the term, Enrich – the ratio of observed over expected genes annotated with the GO term, FDR – the False Discovery Rate (a.k.a. adjusted p-value), computed e.g. with the Benjamini-Hochberg correction, and genes – the list of observed genes annotated with the GO term. Any other columns can be present; they will not be taken into account, and the order of the columns does not matter. Here we will load results coming from an analysis run on the WebGestalt server. Feel free to use whatever Gene Ontology enrichment tool you want, as long as the format of the input fits.

# load results from WebGestalt
goenrich_data <- read.table("GO-example.csv", 
                            sep="\t",fill=T,quote="\"",header=T)

# rename the columns to make them less weird 
# and compatible with the GOPlot package
colnames(goenrich_data)[
colnames(goenrich_data) %in% c("geneset","R","OverlapGene_UserID")
] <- c("ID","Enrich","genes")

# remove commas from GO term descriptions, because they suck
goenrich_data$description <- gsub(',',"",goenrich_data$description)

The differential expression data must be a dataframe whose rownames are the gene symbols, from the same namespace as the genes column of the GO enrichment data above. In addition, one column must be named log2FoldChange and contain the quantitative difference of expression between the two conditions. Any other columns can be present; they will not be taken into account, and the order of the columns does not matter.

# Load results from DESeq2
deseq_data <- read.table("DESeq-example.csv", 
                         sep=",",fill=T,header=T,row.names=1)

Now we can create the plot.

plotGODESeq(goenrich_data,deseq_data)

The y-axis is the negative log of the FDR (adjusted p-value). The x-axis is the zscore, that is for a given GO term:

(nb(genes up) - nb(genes down)) / sqrt(nb(genes up) + nb(genes down))

The genes associated with each GO term are taken from the GO enrichment input, while the up- or down-regulated status of each gene is taken from the differential expression input. The area of each bubble is proportional to the enrichment (number of observed genes divided by number of expected genes). This is the proper way of doing it, rather than scaling the radius, although of course the visual impact is less striking.
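As a quick illustration, here is the zscore computed for a single hypothetical GO term (the gene names and values are made up; in the real inputs, the up/down status comes from the sign of log2FoldChange in the differential expression table):

```r
# Hypothetical log2 fold changes for the genes annotated with one GO term
l2fc <- c(geneA = 1.2, geneB = -0.8, geneC = 2.1, geneD = -1.5, geneE = 0.4)

n_up   <- sum(l2fc > 0)  # 3 genes up-regulated
n_down <- sum(l2fc < 0)  # 2 genes down-regulated

# zscore as defined above
zscore <- (n_up - n_down) / sqrt(n_up + n_down)
round(zscore, 3)  # 0.447
```

A term with as many genes up as down lands at zscore 0, in the middle of the x-axis.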

Raw

Choosing what to plot

The console output tells us that we plotted 1431 bubbles. That is not very pretty or informative… The first thing we can note is the big mess at the bottom of the plot, which corresponds to the highest values of FDR. Let’s restrict ourselves to the most significant results by setting the argument maxFDR to 1e-8.

maxFDR

This is better: we now plot only 181 GO terms. Note the large number of terms aligned at the top of the plot. Those are terms with an FDR of 0. The y-axis being logarithmic, we plot them by setting their FDR to a tenth of the smallest non-zero value. GO over-representation results are often very redundant. We can use GOplot’s function reduce_overlap by setting the argument collapse to the proportion of genes that must be identical for GO terms to be merged into one bubble. Let’s use collapse=0.9 (GO terms are merged if 90% of their annotated genes are identical).
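The trick used for zero FDRs can be sketched in a couple of lines (a sketch of the idea with made-up values, not the actual internals of plotGODESeq):

```r
# Hypothetical FDR values, two of them exactly 0
fdr <- c(0, 3e-12, 0, 5e-10, 1e-9)

# Replace the zeros by a tenth of the smallest non-zero value,
# so that -log10(FDR) stays finite and those terms plot at the top
min_nonzero <- min(fdr[fdr > 0])
fdr[fdr == 0] <- min_nonzero / 10

neg_log_fdr <- -log10(fdr)  # all finite now
```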

collapse

Now we only plot 62 bubbles, i.e. two-thirds of the terms are now “hidden”. Use this procedure with caution. Note how the plot now looks distorted towards one condition: more “green” terms have been hidden than “red” terms.

The colour used by default for the bubbles encodes the zscore, which is somewhat redundant with the x-axis. Moreover, the zscore only considers the number of genes up- or down-regulated; it does not take the amplitude of the change into account. By setting the argument color to l2fc, we can instead use the average fold change of all the genes annotated with the GO term.
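For a given term, that average is simply the mean log2FoldChange of its annotated genes. A minimal sketch with made-up values (assuming the genes column is a semicolon-separated string, as in the WebGestalt output used here; adapt the separator to your own file):

```r
# Hypothetical genes column for one GO term, and the
# log2 fold changes of all genes from the DESeq2 results
genes_str <- "geneA;geneB;geneC"
l2fc_all  <- c(geneA = -2.0, geneB = -1.0, geneC = 0.6, geneD = 3.0)

genes <- strsplit(genes_str, ";")[[1]]

# value used to colour the bubble when color = "l2fc"
mean_l2fc <- mean(l2fc_all[genes])
mean_l2fc  # -0.8
```

Unlike the zscore, this value is pulled down by a few strongly down-regulated genes even when up- and down-regulated genes are equally numerous.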

l2fc

Now we can see that while the proportion of down-regulated genes annotated by GO:0006333 is lower than for GO:0008380, the amplitude of their average down-regulation is larger.

WARNING: the current code does not work if the variable chosen to colour the bubbles, l2fc or zscore, does not contain both negative and positive values. Sometimes the “collapsing” can cause this situation, if there is an initial imbalance between zscores and/or l2fc. It is a bug, I know. On the to-do list…

Using GO identifiers is handy and terse, but since I do not know GO by heart, it makes the plot hard to interpret. We can use the full description of each term instead, by setting the argument label to description.

label

Customising the bubbles

The width of the labels can be modified by setting the argument wrap to the maximum number of characters per line (the default, used here, is 15). Depending on the range of the FDR and zscore values, the size of the bubbles can be an issue, either because they overlap too much or, on the contrary, because they are tiny. We can change that with the argument scale, which scales the radius of the bubbles. Let’s set it to 0.7, to decrease the radius of each bubble by 30% (the radius, not the area!).
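Since the area of a disc grows with the square of its radius, shrinking the radius by 30% shrinks the area to about half, which is worth keeping in mind when choosing scale:

```r
scale_factor <- 0.7
area_ratio <- scale_factor^2  # new area / old area
area_ratio  # 0.49: roughly half the visual weight for a 30% smaller radius
```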

scale

There is often a big crowd of terms at the bottom and centre of the plot. This is not so clear here, with the harsh FDR threshold, but look at the first plot of the post. These terms are generally the least interesting, since they have a lower significance (higher FDR) and a mild zscore. We can decide to label the bubbles only under a certain FDR with the argument maxFDRLab and/or above a certain absolute zscore with the argument minZscoreLab. Let’s fix them to 1e-12 and 2.5 respectively.

labelThreshold

Finally, you are perhaps not too fond of the default colour scheme. This can be changed with the arguments lowCol, midCol and highCol. Let’s set them to “deepskyblue4”, “#DDDDDD” and “firebrick”.

bubbleColor

Customising the plotting area

The first modification my collaborators asked me to introduce was to centre the plot on a zscore of 0 and to add space around the plot so they could annotate it. One can centre the plot by declaring centered = TRUE (the default is FALSE). Since our example is extremely skewed towards negative zscores, this would not be a good idea here. However, adding some space on both sides will come in handy in the last step of beautification. We can do that by declaring extrawidth=3 (the default is 1).

extrawidth

The legend position can be adjusted with the arguments leghoffset and legvoffset. Here we set them to -0.5 and 1.5 respectively.

legend

The complete call:

plotGODESeq(goenrich_data,
            deseq_data,
            maxFDR = 1e-8,
            collapse = 0.9,
            color="l2fc",
            lowCol = "deepskyblue4",
            midCol = "#DDDDDD",
            highCol = "firebrick",
            extrawidth=3,
            centered=FALSE,
            leghoffset=-0.5,
            legvoffset=1.5,
            label = "description",
            scale = 0.7,
            maxFDRLab = 1e-12,
            minZscoreLab = 2.5,
            wrap = 15)

Now we can export an SVG version and play with the labels in Inkscape. This part is unfortunately the most demanding …
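The export itself can be done by wrapping the call in R’s base svg() device (a sketch; if plotGODESeq returns a ggplot object, ggsave("plot.svg") from ggplot2 would work too):

```r
# Open an SVG device, draw, then close it; the resulting file
# can be opened in Inkscape to rearrange the labels
svg("godeseq-plot.svg", width = 10, height = 8)
plot(1:10)  # placeholder for the plotGODESeq(...) call above
dev.off()
```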

final-inkscape