#TopAnat: GO-like enrichment of anatomical terms mapped to genes by expression patterns

Update: This was the first blogpost on TopAnat. There are now quite a few more, under the tag TopAnat.

We are glad to roll out the first public beta version of our new tool, TopAnat:

http://bgee.org/?page=top_anat

TopAnat uses the classical GO-enrichment approach (specifically, from TopGO), comparing terms associated to your gene list to those associated to a background list, and reports terms which are over-represented. The main differences with GO-enrichment are that

  1. the ontology used describes anatomy rather than gene function;
  2. the association between genes and ontology terms is obtained from gene expression patterns rather than annotation.
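To make the over-representation idea concrete, here is a minimal sketch of the classical test for a single anatomical term, written as an upper-tail hypergeometric test. It is only an illustration: the function name and the numbers are invented, and it does not reproduce the TopGO decorrelation algorithms described below.

```python
from math import comb


def enrichment_p_value(background_size, background_with_term, list_size, list_with_term):
    """Upper-tail hypergeometric p-value.

    Probability of seeing at least `list_with_term` genes expressed in a
    structure when drawing `list_size` genes at random from a background of
    `background_size` genes, of which `background_with_term` are expressed there.
    """
    total = comb(background_size, list_size)
    upper = min(background_with_term, list_size)
    tail = sum(
        comb(background_with_term, k)
        * comb(background_size - background_with_term, list_size - k)
        for k in range(list_with_term, upper + 1)
    )
    return tail / total


# Invented toy numbers: 1000 background genes, 100 of them expressed in "brain";
# a gene list of 50 genes, 15 of which are expressed in "brain".
p = enrichment_p_value(1000, 100, 50, 15)
print(f"p = {p:.3g}")
```

In a real analysis one such test is run per anatomical term, and the resulting p-values are then corrected for multiple testing.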

As with GO enrichment, you can use a default background of all genes in your organism with expression data in Bgee, or you can upload your own background set. Because this is based on TopGO, you can use several algorithms to decorrelate the ontology, to avoid reporting terms which are annotated to the same genes because of part_of relations, such as “prefrontal cortex” and “frontal cortex”. The options are:

  • No decorrelation
  • Elim (by default)
  • Weight (update: now by default)
  • Parent-child.

For now, genes are only associated to anatomical structures through present / absent calls, i.e. you will obtain structures with more “present” calls than expected by chance. In the future, we will be adding enrichment based on expression being higher in some tissues than others (“over-expression”). The test as it is shows that present calls of expression already contain a lot of biologically relevant information. For example, here are the top anatomical structures for genes associated to the GO term “neurological system process” in mouse (see full results here; loading can be a bit slow):

topanat

Because developmental expression patterns can be very different from adult ones, we compute enrichment twice by default, for expression patterns in development (“embryo stage”) and from newborn to adult (“post embryonic stage”). If you feel that a more detailed breakdown is needed, please ask us, although note that in some cases we will probably lack statistical power.

Because Bgee annotates only healthy wild type expression data, you can rest assured that the results are not pulled by expression in tumors, KOs, or other diseases, but are representative of healthy biological processes.

And because Bgee integrates in situ hybridization with microarrays and RNA-seq, you can obtain very detailed anatomical information in mouse, fly, zebrafish or nematode, as shown in the mouse neurological process genes above.

Finally, because Bgee annotates expression data from 17 species, you can immediately perform tests in all these species, although be warned that tests will lack power in species with less data. Example: tissue enrichment of genes on the X chromosomes of platypus (loading can be a bit slow). Note that while the FDR is not significant (lack of power), the top structures make sense for sex chromosomes.

We will be testing and improving TopAnat over the next weeks, and we already look forward to playing with it. We are confident that it will be useful, and hope that many of you will be able to use it to gain biological insight into those gene lists you get in this big data biology age.

TopAnat is based on an adaptation by Julien Roux of TopGO, by Adrian Alexa. It has been incorporated into Bgee by the Bgee team (names tagged Bgee in this list), and the graphical user interface has been developed by the SIB WebTeam. All data in Bgee are annotated to the Uberon ontology.

Posted in bgee update, ontology, topanat | 1 Comment

10 million calls of differential gene expression for 11 species, from #RNAseq and #microarray

Continuing to roll out Bgee release 13 (see first installment), we now provide pre-computed files of differential gene expression. As for present / absent calls, files are organized per species, and come in “simple” and “complete” versions.

bgee13_diffexpr

A test of differential expression is made when one experiment has several conditions, with replicates for each condition. We separate comparisons of different tissues in the same developmental stage (“anatomy”) from comparisons of the same tissue at different life history stages (“development” – this includes aging). For each gene and condition, a gene can be called “over-expressed”, if it is more expressed than in the average of conditions, “under-expressed”, if it is less expressed than in the average of conditions, or “no diff expression” otherwise. The test is a modified ANOVA.

The same gene in the same condition can be called from several experiments. These may use the same or different platforms (RNA-seq or microarray), and compare the same or different conditions (different combinations of tissue and stage). For a given technique, we resolve conflicts (e.g., the same gene called over- and under-expressed in liver in different experiments) based on a voting system weighted by p-values. Importantly, whenever there is a conflict, the summary call that we provide is noted “low quality”. High quality calls are those with no conflict and p<0.01. We provide summary calls from different experimental types separately in the “complete” file, while we provide one overall summary call per gene-condition combination in the “simple” file. In the latter case, conflicts between different techniques are summarized by an “ambiguity” call.
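As an illustration of what a p-value-weighted vote can look like, here is a small sketch for one gene-condition combination within one technique. The weighting by -log10(p), the use of the best supporting p-value for the quality threshold, and the function name are all assumptions made for this example; the exact scheme used in the Bgee pipeline is not spelled out in this post.

```python
from math import log10


def summary_call(calls):
    """Summarize calls for one gene-condition combination from one technique.

    `calls` is a list of (call, p_value) pairs, one per experiment.
    Returns (winning_call, quality).
    """
    weights = {}
    for call, p_value in calls:
        # Assumption: each experiment votes with weight -log10(p).
        weight = -log10(max(p_value, 1e-300))
        weights[call] = weights.get(call, 0.0) + weight
    winner = max(weights, key=weights.get)

    # Any disagreement between experiments is treated as a conflict here.
    conflict = len(weights) > 1
    # Assumption: the p < 0.01 threshold is checked on the best supporting p-value.
    best_p = min(p for call, p in calls if call == winner)
    quality = "high quality" if (not conflict and best_p < 0.01) else "low quality"
    return winner, quality


print(summary_call([("over-expressed", 0.001), ("under-expressed", 0.04)]))
print(summary_call([("over-expressed", 0.001), ("over-expressed", 0.03)]))
```

The first example has conflicting calls, so the winner is flagged “low quality”; the second has no conflict and a p-value below 0.01, so it is “high quality”.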

Overall, we provide 10 299 406 calls for 11 species: C. elegans, chicken, cow, fruitfly, human, macaque, mouse, platypus, rat, xenopus and zebrafish.

In the only species for which we have both abundant RNA-seq and microarray data, mouse, we find that 88% of gene-condition combinations have no conflict between calls from the two techniques. Considering differences in sampling and techniques, this is quite satisfying.

Come and try: http://bgee.org/?page=download

Posted in bgee update, microarray, RNA-Seq | Leave a comment

Rolling out Bgee 13, from 5 to 17 species with gene expression calls

We are proud to be rolling out the new and improved Bgee over the next weeks! Welcome to release 13!

In the 2 years since our last release, we have transferred all our annotations to the bilaterian ontology Uberon (see this paper for the ontology work) and made many other back-end changes which will make Bgee more powerful and better equipped to deal with the increasing diversity of species with RNA-seq data.

As a result of these changes, we can now provide expression calls for 17 species, up from 5 species in the previous release. The new species include model organisms such as C. elegans and chicken, as well as a diversity of tetrapods. For each species, we have developed a developmental ontology (ontologies on Google Code; developmental stage modeling collaboration on Github). As before, Present/Absent expression calls are a consensus of information from in situ hybridizations, microarrays, RNA-seq and ESTs, depending on the data available for each species.

blogpost_bgee13_species

Our download page; click on any species photo to access files to download. If the page doesn’t look like this, you may need to disable AdBlock for Bgee.

In addition to the increase in species and in data, a major novelty in Bgee 13 is the change in usability. Fewer users want to browse a database through a webpage, as we have provided since 2006. More users want to download a file, to analyze in R, browse in Excel, or include into their pre-existing analysis. And as our data become larger and more complex, browsing in a webpage is just counter-productive. From now on, our primary effort to make Bgee data accessible will thus be through customized downloadable files, starting with TSVs of precomputed results.

Because there are different needs for different users, each file will be provided in two versions:

  • “simple files” will contain final calls, with minimum additional information; for Present/Absent expression calls, this means one line for the expression of a given gene in a combination of organ and developmental stage, and a call of confidence; easy to plug into your favorite analysis, e.g., are my metabolic disease genes expressed in human liver?
  • “advanced files” will contain additional columns to provide detailed information for further parsing and analysis; for Present/Absent calls, we add the data type used and additional lines for extra propagated expression data (see below); ideal to start your more in-depth analysis, which you may want to restrict for example to only RNA-seq evidence, or only expression supported by two different data types.
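To give an idea of how a “simple” file could be plugged into an analysis, here is a sketch that answers the liver question above. The column names, gene identifiers and rows are invented for this example; please check the header of the file you download for the real column names.

```python
import csv
import io

# Hypothetical excerpt of a "simple" file; columns and rows are invented.
SIMPLE_TSV = (
    "Gene ID\tAnatomical entity name\tDevelopmental stage name\tExpression\tCall quality\n"
    "GENE_A\tliver\tadult\tpresent\thigh quality\n"
    "GENE_B\tliver\tadult\tabsent\thigh quality\n"
    "GENE_C\tbrain\tadult\tpresent\tlow quality\n"
    "GENE_C\tliver\tadult\tpresent\tlow quality\n"
)


def genes_expressed_in(tsv_text, organ, genes_of_interest):
    """Return the genes of interest with a 'present' call in the given organ."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row["Gene ID"]
        for row in reader
        if row["Gene ID"] in genes_of_interest
        and row["Anatomical entity name"] == organ
        and row["Expression"] == "present"
    }


my_genes = {"GENE_A", "GENE_B", "GENE_C"}
print(sorted(genes_expressed_in(SIMPLE_TSV, "liver", my_genes)))
```

With a downloaded file, the `io.StringIO` wrapper would simply be replaced by an opened file handle.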

What is the propagation of expression? When a gene is called expressed in, e.g., cerebellum, we can infer that it is expressed in brain because cerebellum is fully included in the brain. This is useful information when you need to recover all genes expressed in the brain, and thus is provided. This propagation is performed based on the relations in the Uberon ontology, and is specific to a species and a developmental stage.
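The principle of propagation can be sketched in a few lines. The part_of relations below are a tiny, simplified stand-in for Uberon, and the function names are ours for this example only; the real propagation also takes species and developmental stage into account.

```python
# Simplified part_of relations (structure -> parent structures), for illustration.
PART_OF = {
    "cerebellum": ["brain"],
    "prefrontal cortex": ["frontal cortex"],
    "frontal cortex": ["brain"],
    "brain": ["central nervous system"],
}


def ancestors(structure, part_of):
    """All structures that contain `structure`, following part_of upwards."""
    seen = set()
    stack = [structure]
    while stack:
        current = stack.pop()
        for parent in part_of.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct_calls, part_of):
    """Extend 'present' calls from each structure to all structures containing it.

    `direct_calls` maps a gene to the set of structures with a direct call.
    """
    propagated = {}
    for gene, structures in direct_calls.items():
        expanded = set(structures)
        for structure in structures:
            expanded |= ancestors(structure, part_of)
        propagated[gene] = expanded
    return propagated


calls = propagate({"gene1": {"cerebellum"}}, PART_OF)
print(sorted(calls["gene1"]))
```

A gene with a direct call in cerebellum is thus also reported in brain and in central nervous system.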

What next? We will provide files of gene over-expression (e.g., this gene has significantly higher expression in adult liver than in other organs) and under-expression (e.g., this gene has significantly lower expression in juvenile muscle than in other juvenile organs) in each organ and developmental stage, based on microarray and RNA-seq data. We will provide files of orthologous gene expression in homologous organs between species, both for pairs of model organisms (e.g., orthologs expressed in homologous tissues in human and fly) and for taxonomic groups such as Primates or mammals (e.g., orthologs expressed in homologous tissues in all mammals). And we are working on new summary pages, for those who want more than downloads, but that’s for a bit later.

In the meantime, please tell us which features, data types or species you would like to see that are still missing, in the comments on this blog, on Twitter or by email.

Posted in bgee update, new release, new species, ontology, usability | 1 Comment

New taxon Dipnotetrapodomorpha in NCBI taxonomy

For the annotation of homologies in Bgee, we are presently curating in Uberon the taxonomic level at which each homologous relation is correct. For example, muscle is homologous across Metazoa, while placenta is only homologous among Mammalia. For this, we use NCBI taxonomy IDs.

In the process, we ran into a problem: many structures are homologous between tetrapods and lungfish, but not shared by coelacanths. And there is no NCBI ID for the grouping of tetrapods and lungfish.
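The problem can be shown with a toy version of the relevant part of the taxonomy. The parent links below are a simplification based on the NCBI identifiers quoted in our email further down, and the helper functions are written for this example only.

```python
# Simplified parent links before our request (taxon -> parent taxon).
PARENT = {
    "Tetrapoda (32523)": "Sarcopterygii (8287)",
    "Dipnoi (7878)": "Sarcopterygii (8287)",
    "Coelacanthimorpha (118072)": "Sarcopterygii (8287)",
}


def lineage(taxon, parent):
    """The taxon followed by its successive parents."""
    path = [taxon]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def common_ancestor(a, b, parent):
    """Most specific taxon that contains both a and b, or None."""
    ancestors_of_a = set(lineage(a, parent))
    for taxon in lineage(b, parent):
        if taxon in ancestors_of_a:
            return taxon
    return None


def descendants(taxon, parent):
    """All taxa in the toy tree that sit below `taxon`."""
    return {t for t in parent if taxon in lineage(t, parent)[1:]}


group = common_ancestor("Tetrapoda (32523)", "Dipnoi (7878)", PARENT)
print(group)
print(sorted(descendants(group, PARENT)))
```

The most specific ID available for tetrapods plus lungfish is Sarcopterygii, which also includes the coelacanths, so it is too broad for annotating these homologies.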

lungfish

After checking the literature, we found (i) the taxon Rhipidistia in Wikipedia, which seems mostly used for fossils, and (ii) the taxon Dipnotetrapodomorpha, introduced in the latest edition of Fishes of the World (Nelson 2006) for exactly the purpose we needed.

We thus wrote the following email to NCBI:

In the NCBI taxonomy, at present, Sarcopterygii (8287) is divided into three groups which are not further clustered. But the literature now strongly supports the position of Dipnoi (7878) as sister group to Tetrapoda (32523) to the exclusion of Coelacanthimorpha (118072), e.g.:

http://www.nature.com/nature/journal/v496/n7445/full/nature12027.html

http://mbe.oxfordjournals.org/content/early/2013/05/13/molbev.mst072.full

The standard reference book for fish taxonomy has suggested the term Dipnotetrapodomorpha for this grouping of Dipnoi and Tetrapoda (Nelson 2006, p. 461):

http://books.google.ch/books?id=exTV-GLnCB4C&lpg=PA461&ots=aW0gNAETbp&dq=dipnotetrapodomorpha&pg=PA461#v=onepage&q=dipnotetrapodomorpha&f=false

This term has started to be used by the phylogenetic community, e.g.:

http://currents.plos.org/treeoflife/article/the-tree-of-life-and-a-new-classification-of-bony-fishes/ (Figure 1)

We use NCBI taxonomy to computationally annotate homology of anatomy between animals, in the context of Bgee (http://bgee.unil.ch/) and of Uberon (http://uberon.org/), and there are many anatomical structures which are homologous between lungfishes and tetrapods, to the exclusion of coelacanths. Thus an NCBI identifier for this grouping would be extremely useful.

We thus request the introduction of the term Dipnotetrapodomorpha in the NCBI taxonomy, designating the grouping of Dipnoi and Tetrapoda.

The NCBI taxonomy team was very responsive, and four days after our request the term was added:

http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1338369

So next time you look at the long string of taxonomic terms linking Homo sapiens to the root of the tree of life in NCBI, you know whom you have to blame for having yet another term in that list.

And all our thanks go out to the wonderfully responsive NCBI taxonomy team!

Posted in Uncategorized | 2 Comments

Expressed, or not expressed? That is the question!

Bgee provides information about where and when genes are expressed in different species. But what does it mean “to be expressed”? That’s a fundamental question we had to answer before we could introduce RNA-Seq data into the database.

Expression does not equal transcription. Expression is the process during which information encoded in the DNA sequence is transformed into a functional product. This means that not every DNA sequence that is transcribed is necessarily expressed. Take introns, for example. They are transcribed but the information they contain is not used to form a protein.

To define where and when a gene is expressed based on RNA-Seq data, we thus decided to use intergenic regions, potentially transcribed but mainly unexpressed fragments of the genome, as a reference. If we take the presence of at least one uniquely mapped read as a criterion for transcription, then more than 50% and more than 60% of the human and mouse intergenic regions, respectively, have been transcribed in at least one of the RNA-Seq samples we introduced so far into Bgee. In order to avoid an excess of false positive calls, we set a cutoff value on the transcription level at which we define a gene as expressed. This cutoff value is set so that any genomic feature whose transcription level is above the cutoff has a probability of less than 1 in 20 of being an intergenic region.
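One way to read this criterion is sketched below: scan candidate cutoffs from low to high and keep the first one for which fewer than 1 in 20 of the features above it are intergenic. This is a simplified interpretation for illustration, with invented numbers and a function name of our own; it is not the code of the Bgee pipeline.

```python
def expression_cutoff(genic_levels, intergenic_levels, max_intergenic_fraction=0.05):
    """Lowest cutoff such that, among features strictly above it,
    the intergenic fraction is below `max_intergenic_fraction`.

    Returns None if no cutoff satisfies the criterion.
    """
    candidates = sorted(set(genic_levels) | set(intergenic_levels))
    for cutoff in candidates:
        genic_above = sum(1 for level in genic_levels if level > cutoff)
        intergenic_above = sum(1 for level in intergenic_levels if level > cutoff)
        total_above = genic_above + intergenic_above
        if total_above and intergenic_above / total_above < max_intergenic_fraction:
            return cutoff
    return None


# Invented transcription levels: 100 genes and 9 intergenic regions.
genic = list(range(10, 110))
intergenic = [1, 2, 3, 4, 5, 6, 7, 8, 20]
cutoff = expression_cutoff(genic, intergenic)
print("cutoff:", cutoff)
```

Any gene whose transcription level is above the returned cutoff would then be called expressed in that sample.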

Using this cutoff to analyze the RNA-Seq data in Bgee results in less than 15% of mouse and less than 20% of human intergenic regions being defined as expressed in any of the analyzed tissues. On the other hand, more than 80% of mouse and more than 90% of human protein coding genes are characterized as expressed at least once. Moreover, more than 50% of the human protein coding genes and more than 40% of the mouse protein coding genes are ubiquitously expressed in the different RNA-Seq samples now in Bgee.

As of today, Bgee contains 33 RNA-Seq libraries from the experiment GSE30352 representing gene expression data from 7 human organs (frontal lobe, temporal lobe, cerebellum, heart, kidney, liver and testis) as well as from 6 mouse organs (brain, cerebellum, heart, kidney, liver and testis). But that’s just a start. We’re planning to add more species in the future. So stay tuned!

Posted in bgee update, RNA-Seq, using bgee | Leave a comment

Wait! That’s actually the same data…

Imagine you have a list of genes and you would like to know if these genes are expressed, let’s say, in human blood. But you’re not going to do the experiment yourself. Either you’re a bioinformatician who doesn’t do wet lab, or you need this expression information but it’s not so important that you can spare funding money for the experiment. Either way you’re going to use publicly available data. After checking Gene Expression Omnibus (GEO), the public repository for microarray data, you identify four different experiments (GSE13904, GSE9692, GSE8121 and GSE26440) that sampled blood from healthy children. That’s great, you think, the expression data of your genes of interest will be confirmed by four independent lines of evidence. Well, think again. It turns out 15 samples are actually identical between the four experiments, 3 samples are shared by two experiments and one experiment has 3 samples in common with yet another experiment (GSE26378). And the bad news is that it’s nowhere explicitly stated on GEO that these samples from different experiments are in reality the same data. Sample IDs are different. The names of the data files are different. If you look closely at the experiments, you will see that the contact name is the same and that each experiment compares control samples versus samples from septic shock patients, which might give you a hint that there’s something fishy going on here. Still, there’s a chance you’ll dismiss this as just one lab submitting several different experiments related to its research subject.

Now you’re probably cursing yourself for not listening to your colleague who warned you against using other people’s data. Not so fast. That doesn’t mean you shouldn’t use publicly available data, it just means that you have to be careful with it. And that highlights the need for secondary databases whose job is to take the time to check the data and provide it clean and of high quality.

At Bgee we retrieve publicly available microarray data and present it so the user can compare gene expression between several animal species, and we’ve recently realized that many samples which we considered as different and independent are in fact identical. The example above is just one among many. In our data we found 1031 samples, representing 42 experiments, to be duplicates of samples from other experiments. It seems quite a widespread practice among scientists to “reuse” data and submit it to public repositories as new experiments.

If this redundancy of samples is not explicitly stated on GEO, how did we come to notice it? Well, sometimes the redundancy is evident. The sample ID and the name of the data files are identical between experiments. For example it’s clear that sample GSM2334 belongs to experiments GSE75 and GSE760. This prompted us to check if more experiments shared samples. We computed several parameters for the data files of every sample we use in Bgee (for full details see the Bgee documentation), and it turned out that several files had identical parameters, revealing that they were in fact duplicates.
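If you want to run a quick check of this kind on your own collection of files, here is a simplified sketch. Note that it swaps in a different technique from ours: instead of the parameters described in the Bgee documentation, it compares a SHA-256 hash of each file’s content, so it will only catch files that are byte-for-byte identical. The experiment and sample names are placeholders.

```python
import hashlib
from collections import defaultdict


def find_duplicate_samples(samples):
    """Group samples whose data files have identical content.

    `samples` maps (experiment_id, sample_id) to the raw bytes of the data file.
    Returns a list of groups (sorted lists of keys) with more than one member.
    """
    by_digest = defaultdict(list)
    for key, content in samples.items():
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(key)
    return [sorted(keys) for keys in by_digest.values() if len(keys) > 1]


# Placeholder data: two "experiments" containing the same file under different IDs.
samples = {
    ("EXP_A", "S1"): b"probe1\t1.0\nprobe2\t2.0\n",
    ("EXP_B", "S9"): b"probe1\t1.0\nprobe2\t2.0\n",
    ("EXP_C", "S3"): b"probe1\t3.5\nprobe2\t0.2\n",
}
print(find_duplicate_samples(samples))
```

Files that were reprocessed or reformatted between submissions would escape such a check, which is one reason to compare parameters computed from the data rather than the raw bytes.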

And there’s more. Authors don’t always submit the same information for identical samples. In our human blood example GSE9692 gives you information on the sex of the individual the sample was taken from while the other experiments don’t provide this detail. GSE26440 supplies the exact age of the patient while the three other experiments merely tell you the patients were children. More disconcerting is when the information for identical samples is completely different. Take the samples GSM647625 and GSM648688 of experiments GSE26378 and GSE26440, respectively. Regarding the patient’s age, the former specifies 2.9 years while the latter indicates 6 years. Which is the correct information? Ok, you might say it’s just a detail. So what about samples GSM322066 and GSM336955 from experiments GSE12826 and GSE13348, respectively? The first was taken from a whole zebrafish embryo and the second from the brain of a zebrafish embryo, but the data files are identical. Is that still just a detail?

The same issues also exist in ArrayExpress, the other public repository for microarray data. And it’s not just because ArrayExpress imports some of its data from GEO. Experiments unique to ArrayExpress share common samples. For example, the mouse data of experiment E-MTAB-406 is identical to the data of experiment E-TABM-163.

We’ve removed all the redundant data from Bgee and the expression data in the latest release of the database is free of duplicates. But maybe our experience can be of use to others. You might want to add one more item to the list of details to check out before you use public data.

Posted in microarray, using bgee | Leave a comment

New release of Bgee: extra clean transcriptome data

We have released a new version of the database (Bgee Release 11).

The highlights are:

  • Affymetrix chips are now filtered before inclusion into Bgee, based on new quality measurements, on the identification of duplicated content, and on the control of chip types for incompatibility or errors. (Please note that even “low quality” data had to pass these filters for inclusion.)
  • We have developed a new human developmental ontology, in coordination with neXtProt.

For details, please see the release notes at the Bgee website. We apologize for the delay, but we’re very proud of the result!

Partial view of the human developmental ontology, with expression data

Posted in bgee update, new release | Leave a comment