When you invent such a cool toy as TopAnat, you play with it. And then sometimes, you’re afraid that you might have broken it. But then, maybe it’s more robust than you expected. This is such a story.
One lab member was analyzing a large RNA-seq data set, with 27 human tissues (from Fagerberg et al. 2014). Another lab member decided to have a look at the ubiquitously expressed genes from this set which have conserved protein sequences (low dN/dS). Using TopAnat, the person found very significant tissue enrichments (click on picture to see analysis):
This was very surprising, since the data were selected to not be tissue-specific. A bug was suspected. The result was robust to changes in decorrelation algorithm, and after verification, yes TopAnat does remove redundancy in the input gene list.
A potential biological explanation was that tissue-specific genes may be more common in some tissues, which would create an apparent enrichment for ubiquitous genes in the other tissues. But then we did a simple control: a random set of the genes called expressed in any of the 27 tissues (i.e., the genes on which tissue specificity could be calculated):
Ooo-Kay. Again very significant, and similar anatomical entities. But these are random genes, right? Well, randomly from expression over 27 tissues. Which surely means mostly random? Ah-ha. What are these tissues?
colon, kidney, liver, pancreas, lung, prostate, brain, stomach, spleen, lymph node, appendix, small intestine, adrenal gland, duodenum, fat, endometrium, placenta, testis, gallbladder, urinary bladder, thyroid, esophagus, heart, skin, ovary, bone marrow, salivary gland
Suddenly the results above start to make sense. And indeed, if we now sample randomly from the full list of genes used for the RNA-seq mapping, not restricting to those called present at least once, we get:
Nothing significant!
We were detecting the signal of which specific tissues were sampled in this large dataset of 27 human tissues. A presumably subtle signal, but which left enough trace in the data that we could detect it. This shows both the sensitivity of the TopAnat test, and the power which comes from integrating so much data of different types into Bgee and TopAnat. Our toy was not broken, it was showing off what a cool toy it really is.
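To make the background effect concrete, here is a minimal sketch with made-up numbers. It is a plain hypergeometric test on a single anatomical term, not TopAnat's implementation (there is no ontology and no decorrelation), and the gene names, set sizes, and the `hypergeom_upper_tail` helper are all hypothetical. It assumes that every gene annotated to the term is among the genes called expressed in the sampled tissues.

```python
import math
import random


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement
    from N background genes, K of which are annotated to the term."""
    total = math.comb(N, n)
    hits = sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return hits / total


# Toy universe (invented sizes, for illustration only).
rng = random.Random(0)
all_genes = [f"g{i}" for i in range(2000)]   # every gene used for mapping
expressed = all_genes[:1200]                 # called present in >= 1 sampled tissue
annotated = set(all_genes[:300])             # genes linked to one anatomical term

# Control: a random gene list drawn only from the expressed genes.
control = rng.sample(expressed, 100)
k = sum(gene in annotated for gene in control)

# Same list, two different backgrounds.
p_vs_all = hypergeom_upper_tail(k, len(control), len(annotated), len(all_genes))
p_vs_expressed = hypergeom_upper_tail(
    k, len(control), len(annotated & set(expressed)), len(expressed)
)

print(f"annotated genes in control list: {k}")
print(f"p-value, background = all genes:       {p_vs_all:.3g}")
print(f"p-value, background = expressed genes: {p_vs_expressed:.3g}")
```

In this toy setup the random list is expected to contain about 25 annotated genes, while a background of all genes predicts only about 15. The first p-value therefore tends to be small even though the list is random, and the second, which uses the matching background, does not.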
We hope that this convinces you that all data is biased, or rather, that TopAnat can detect relevant signal. 🙂
Update: here is the missing test from this story: the original low dN/dS ubiquitously expressed gene list, with the background set to genes called present at least once in Fagerberg et al. 2014 (removing the sampling bias):