Gene expression enrichment tests are sensitive enough to detect where your background data came from #TopAnat

When you invent such a cool toy as TopAnat, you play with it. And then sometimes, you’re afraid that you might have broken it. But then, maybe it’s more robust than you expected. This is such a story.

One lab member was analyzing a large RNA-seq data set, with 27 human tissues (from Fagerberg et al. 2014). Another lab member decided to have a look at the ubiquitously expressed genes from this set which have conserved protein sequences (low dN/dS). Using TopAnat, they found very significant tissue enrichments.


This was very surprising, since the genes were selected to not be tissue-specific. We suspected a bug. But the result was robust to changes in the decorrelation algorithm and, after verification, yes, TopAnat does remove redundancy in the input gene list.

A potential biological explanation was that tissue-specific genes may be more common in some tissues, which would create an apparent enrichment for ubiquitous genes in the other tissues. But then we did a simple control: a random set of the genes called expressed in any of the 27 tissues (i.e., the genes on which tissue specificity could be calculated).


Ooo-Kay. Again very significant, and similar anatomical entities. But these are random genes, right? Well, random among the genes expressed across 27 tissues. Which surely means mostly random? Ah-ha. What are these tissues?

colon, kidney, liver, pancreas, lung, prostate, brain, stomach, spleen, lymph node, appendix, small intestine, adrenal gland, duodenum, fat, endometrium, placenta, testis, gallbladder, urinary bladder, thyroid, esophagus, heart, skin, ovary, bone marrow, salivary gland

Suddenly the results above start to make sense. And indeed, if we now sample randomly from the full list of genes used for the RNA-seq mapping, not restricting to those called present at least once, we get:


Nothing significant!

We were detecting the signal of which specific tissues were sampled in this large dataset of 27 human tissues. A presumably subtle signal, but one that left enough of a trace in the data that we could detect it. This shows both the sensitivity of the TopAnat test, and the power which comes from integrating so much data of different types into Bgee and TopAnat. Our toy was not broken, it was showing off what a cool toy it really is.
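For readers who want to see the effect in numbers, here is a toy sketch of how a mismatched background alone can produce an "enrichment". It is not TopAnat's actual code or data: it uses a plain one-sided hypergeometric test, and every count in it (genome size, number of genes annotated to the organ, and so on) is a made-up assumption chosen only to illustrate the reasoning.

```python
from math import comb


def enrichment_p_value(hits, set_size, annotated, background_size):
    """One-sided hypergeometric p-value: P(X >= hits) when drawing
    `set_size` genes from a background of `background_size` genes,
    `annotated` of which are expressed in the organ of interest."""
    upper = min(annotated, set_size)
    favourable = sum(
        comb(annotated, i) * comb(background_size - annotated, set_size - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(background_size, set_size)


# Hypothetical counts, for illustration only.
ALL_GENES = 20_000           # every gene used for the RNA-seq mapping
ALL_ANNOTATED = 4_000        # of those, annotated as expressed in some organ
EXPRESSED_GENES = 15_000     # genes called expressed in at least one sampled tissue
EXPRESSED_ANNOTATED = 3_900  # of those, annotated to the same organ
SET_SIZE = 1_000             # size of the "random" gene set

# Expected number of annotated genes in a random draw from each pool.
hits_from_expressed = SET_SIZE * EXPRESSED_ANNOTATED // EXPRESSED_GENES  # 260
hits_from_all = SET_SIZE * ALL_ANNOTATED // ALL_GENES                    # 200

# Random set drawn from expressed genes, tested against all genes.
p_mismatched = enrichment_p_value(hits_from_expressed, SET_SIZE, ALL_ANNOTATED, ALL_GENES)
# Random set drawn from all genes, tested against all genes.
p_random_all = enrichment_p_value(hits_from_all, SET_SIZE, ALL_ANNOTATED, ALL_GENES)
# Random set drawn from expressed genes, tested against expressed genes.
p_matched = enrichment_p_value(hits_from_expressed, SET_SIZE, EXPRESSED_ANNOTATED, EXPRESSED_GENES)

print(f"drawn from expressed, background = all genes:       {p_mismatched:.3g}")
print(f"drawn from all genes, background = all genes:       {p_random_all:.3g}")
print(f"drawn from expressed, background = expressed genes: {p_matched:.3g}")
```

Under these assumed counts, only the first case gives a tiny p-value: the gene set is "random", but it was drawn from a pool that is already skewed towards the sampled tissues, and the test notices.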

We hope that this convinces you that all data is biased, er, that TopAnat can detect relevant signal. 🙂

Update: here is the missing test from this story: the original low dN/dS ubiquitously expressed gene list, with the background set to genes called present at least once in Fagerberg et al 2014 (removing the sampling bias):



About bgeedb

This is the blog of the database of gene expression evolution Bgee, at University of Lausanne and Swiss Institute of Bioinformatics.