In Bgee, we integrate gene expression data from RNA-seq, Affymetrix microarrays, ESTs and in situ hybridization data. It is natural to think that with RNA-seq being so powerful, we should not bother with other sources of information.
Yet we still have more than an order of magnitude more microarray data. In total, Bgee release 13 contains:
- 41 RNA-seq experiments, for 526 libraries.
- 1170 microarray experiments, for 13070 chips.
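To put a number on that gap, here is a quick calculation using only the counts listed above. The variable and function names are ours, chosen for illustration.

```python
# Counts quoted above for Bgee release 13.
rnaseq = {"experiments": 41, "samples": 526}          # samples = libraries
microarray = {"experiments": 1170, "samples": 13070}  # samples = chips


def fold_difference(larger, smaller):
    """Return, for each key, how many times bigger `larger` is than `smaller`."""
    return {key: larger[key] / smaller[key] for key in larger}


ratios = fold_difference(microarray, rnaseq)
for key, value in ratios.items():
    print(f"{key}: {value:.1f} times more for microarrays than for RNA-seq")
```

Both ratios come out well above 10, which is what "more than an order of magnitude" means here.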
In situ hybridization provides an amazing level of anatomical detail which is way beyond what other techniques can offer for now. And ESTs? Well let’s check.
In TopAnat, we can compute the enrichment of gene lists for expression in anatomical structures (organs, tissues, cell types), based either on the integration of all data or on selected data types only.
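To make "enrichment" concrete, here is a minimal sketch of a generic over-representation test for a single anatomical structure. It assumes a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test) and is not TopAnat's actual code. The function name, its parameters and the toy numbers are all hypothetical.

```python
from math import comb


def hypergeom_enrichment(background, annotated, list_size, observed):
    """One-sided over-representation test for one anatomical structure.

    background: number of genes with expression data (the universe)
    annotated:  background genes expressed in the structure
    list_size:  number of genes in our gene list
    observed:   genes of our list expressed in the structure
    Returns (fold_enrichment, p_value).
    """
    expected = list_size * annotated / background
    fold = observed / expected
    upper = min(annotated, list_size)
    # Probability of seeing `observed` or more list genes in the structure
    # if the list were drawn at random from the background.
    favourable = sum(
        comb(annotated, k) * comb(background - annotated, list_size - k)
        for k in range(observed, upper + 1)
    )
    return fold, favourable / comb(background, list_size)


# Toy numbers, invented for illustration only.
fold, p_value = hypergeom_enrichment(background=20, annotated=5,
                                     list_size=4, observed=3)
print(f"fold enrichment = {fold:.2f}, p-value = {p_value:.4f}")
```

A real analysis repeats such a test for every anatomical structure, which is why a correction for multiple testing (the FDR) is then needed.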
So let’s try. We start from the example provided in “Quickstart”: mouse genes annotated to the GO term “spermatogenesis”. We expect these genes to be very tissue-specific (see also our recent analysis in Briefings in Bioinformatics), so it should be an easy case for each data type. We will choose one data type at a time. To give each data type its chance, we will perform these analyses with the lowest stringency: “Data quality” set to “All”, no decorrelation algorithm, and no FDR≤0.2 limit for reporting results.
Here are the results for each data type in turn.
With RNA-seq we obtain relevant organs, but very few. Essentially all the signal comes from testis, which is part of the male reproductive system, gonad, etc. The signal that we do get is very significant, which is reassuring.
With microarrays, more tissues and organs are significant, including some more detailed structures. This is probably because we have many more microarray experiments than RNA-seq experiments, and some of them sample more detailed structures. Again, statistics are good, and the organs reported are relevant. Don’t throw away those microarray data quite yet! (Keep in mind, though, that Bgee only uses curated microarrays that come from healthy wild-type samples and pass quality control.) On the other hand, notice that of our 457 genes, 442 were called expressed in gonad with RNA-seq, but only 402 with microarrays: it is probable that lowly expressed genes were missed by the microarray experiments.
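The gonad numbers are easier to compare as fractions of the gene list. This small calculation uses only the three counts quoted above; the variable names are ours.

```python
total_genes = 457  # size of the "spermatogenesis" gene list
called_expressed_in_gonad = {"RNA-seq": 442, "microarray": 402}

# Fraction of the list called expressed in gonad, per data type.
fractions = {
    data_type: count / total_genes
    for data_type, count in called_expressed_in_gonad.items()
}
for data_type, fraction in fractions.items():
    print(f"{data_type}: {fraction:.1%} of the list called expressed in gonad")

# Difference in counts only: the two gene sets need not be nested.
difference = (called_expressed_in_gonad["RNA-seq"]
              - called_expressed_in_gonad["microarray"])
print(f"difference: {difference} genes")
```

So RNA-seq calls roughly 97% of the list expressed in gonad, against roughly 88% for microarrays.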
In situ hybridization gives us much more detailed structures, with very good statistical significance. Because we didn’t use any decorrelation, the results are difficult to read: a germ cell is a eukaryotic cell, and we get this information although it is not of great interest.
That’s a point in favor of using decorrelation for most analyses. For example, if we redo that analysis with “Weight”, which removes most of the signal due to the non-independence of these structures (a spermatocyte is a male germ cell, which is a cell, etc.), we obtain the following structures (FDR < 0.2):
male germ cell; male reproductive organ; ooblast; hindgut diverticulum (mouse); gonad; testis sex cord; pharyngeal arch 2; membranous layer; meiotic oocytes (mouse); ventricular zone; entire extraembryonic component; otic pit; 1st arch maxillary component.
We see here the great level of detail obtained with in situ hybridizations. On the other hand, only 13-28 genes are called present for each structure.
Finally, for ESTs we obtain something similar to RNA-seq, although with fewer genes called present. It is noteworthy that there is so much biological signal in ESTs, although this type of data is nowadays largely neglected.
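Why are the structures non-independent in the first place? Here is a toy sketch, assuming that a gene called expressed in a structure also counts as expressed in every parent structure. The mini-ontology follows the relations mentioned in the text, the gene names are hypothetical, and this is not the Weight algorithm itself, only the effect it corrects for.

```python
# Child -> parents relations, following the examples in the text.
parents = {
    "spermatocyte": ["male germ cell"],
    "male germ cell": ["germ cell"],
    "germ cell": ["eukaryotic cell"],
    "eukaryotic cell": ["cell"],
}


def ancestors(term, parents):
    """Collect every structure reachable by following parent links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct_calls, parents):
    """Copy each expression call to the structure and all its ancestors."""
    propagated = {}
    for term, genes in direct_calls.items():
        for target in {term} | ancestors(term, parents):
            propagated.setdefault(target, set()).update(genes)
    return propagated


# Hypothetical genes, observed directly in two structures only.
direct_calls = {
    "spermatocyte": {"geneA", "geneB"},
    "male germ cell": {"geneC"},
}
propagated = propagate(direct_calls, parents)
for structure in sorted(propagated):
    print(structure, sorted(propagated[structure]))
```

Every parent structure, up to “cell”, inherits the same genes, so an enrichment found in a detailed structure shows up again in each of its parents. Decorrelation is meant to remove exactly this kind of echo.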
| Data type | Anatomical structure | Genes | Fold enrichment | P-value | FDR |
|---|---|---|---|---|---|
| microarray | seminiferous tubule of testis | 351 | 1.411 | 7.91E-30 | 7.37E-27 |
| in situ | male germ cell | 17 | 7.556 | 2.88E-11 | 2.15E-08 |
| in situ | male reproductive organ | 19 | 3.506 | 3.01E-08 | 0.0000112 |
| in situ | hindgut diverticulum (mouse) | 10 | 5.495 | 0.0000122 | 0.01073 |
| in situ | testis sex cord | 21 | 2.668 | 0.0000407 | 0.01914 |
| in situ | pharyngeal arch 2 | 23 | 2.312 | 0.000237 | 0.08915 |
| in situ | meiotic oocytes (mouse) | 8 | 4.324 | 0.000433 | 0.108 |
| in situ | entire extraembryonic component | 34 | 2.457 | 0.000738 | 0.17353 |
| in situ | 1st arch maxillary component | 21 | 2.111 | 0.00102 | 0.19113 |
| EST | male reproductive system | 312 | 1.495 | 1.36E-35 | 6.70E-33 |
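Each FDR in these results is larger than the corresponding p-value because it accounts for the many structures tested at once. As an illustration, here is a sketch of the Benjamini-Hochberg correction. This is an assumption on our part: the text does not say which FDR procedure is used. The p-values below are invented, since the values above cannot be recomputed without knowing the total number of structures tested.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjustment monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Invented p-values for four hypothetical structures.
toy_pvalues = [0.01, 0.04, 0.03, 0.5]
toy_fdr = benjamini_hochberg(toy_pvalues)

# Keep only the structures passing the FDR < 0.2 threshold used in this post.
kept = [i for i, fdr in enumerate(toy_fdr) if fdr < 0.2]
print(toy_fdr, kept)
```

Removing the FDR limit, as we did for the single data type analyses, simply means skipping the last filtering step.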
Second, the analysis with the Weight algorithm, FDR < 0.2, all data integrated:
Note how the integration of data types allows us to obtain both statistical power and anatomical specificity.
- if you have good-quality older data, don’t throw it away: it contains good biology;
- when you use TopAnat, integrate all data and use decorrelation.