Genome-wide association studies (GWAS) report many significant SNPs, which may be associated with many genes. How can we make sense of them? One way is to compute the enrichment of the resulting gene list for properties of interest.
Expression patterns, for instance, thanks to TopAnat! Let's try this on breast cancer, starting from a download of the GWAS Catalog:
grep "breast cancer" gwas_catalog_v1.0.1-downloaded_2016-01-08.tsv | cut -f18
This gives us the genes which have a breast cancer-associated SNP within the gene. It gives us Entrez IDs, and TopAnat beta only takes Ensembl IDs (I am told that the GWAS Catalog will soon have Ensembl IDs, and we are pushing to support other identifiers in TopAnat…). So let’s send this list to BioMart:
- choose Ensembl genes
- choose Homo sapiens genes
- in Filters, choose Gene, “Input external references ID list”, and use our list of Entrez IDs
- in Attributes, remove Transcript ID
- in Results, we will get the list of Ensembl gene IDs.
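Before pasting the BioMart output anywhere, it can help to strip the header line and any duplicates. Here is a minimal sketch, assuming the export is plain text with the Ensembl gene ID in the first column; the function name `ensembl_ids` and the file names in the usage comment are made up for illustration. It relies on human Ensembl gene IDs having the form `ENSG` followed by 11 digits.

```shell
# Keep only well-formed human Ensembl gene IDs, one per line, without duplicates.
# Assumption: input arrives on stdin, with the ID in the first (tab-separated) column.
ensembl_ids() {
  cut -f1 \
    | tr -d '\r' \
    | grep -E '^ENSG[0-9]{11}$' \
    | sort -u
}

# Hypothetical usage (file names are placeholders):
# ensembl_ids < mart_export.txt > ensembl_ids.txt
```

The header line and any empty lines drop out because they do not match the ID pattern.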
Similarly, we can recover the genes which have significant SNPs upstream or downstream of the gene:
grep "breast cancer" gwas_catalog_v1.0.1-downloaded_2016-01-08.tsv | cut -f16,17
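The raw `cut` output from both commands above is not quite a clean gene list: rows can be empty, the same gene can appear many times, and the two-column version needs to be flattened. A small helper along these lines could tidy it up. This is only a sketch: the name `flatten_ids` is made up, and it assumes that cells holding several IDs separate them with commas.

```shell
# Turn cut output (one or more tab-separated columns, possibly with
# comma-separated IDs inside a cell) into one unique ID per line.
flatten_ids() {
  tr '\t,' '\n\n' \
    | tr -d ' \r' \
    | grep -v '^$' \
    | sort -u
}

# Hypothetical usage with the commands from this post:
# grep "breast cancer" gwas_catalog_v1.0.1-downloaded_2016-01-08.tsv | cut -f16,17 | flatten_ids
```

The same helper works for the single-column `cut -f18` output.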
Now, let’s paste our lists of Ensembl gene IDs into TopAnat.
We chose the following options:
- background: all Bgee data, because we expect all genes to be detectable in GWAS; it is true that there might be a bias for gene length (longer genes have more SNPs by chance), which we will try to control for.
- decorrelation type: Weight, which reduces the redundancy of the results by down-weighting less specific anatomical terms (organs, systems) relative to more specific ones (cell types, tissues).
We get 140 anatomical structures with FDR < 0.2, which shows that there is some signal. It is clear that the top hits by p-value are enriched in the female reproductive system, and for epithelial tissues. Ranking them by fold-change, we obtain lacrimal gland on top, with a 3-fold enrichment relative to chance (p = 6.10E-07).
Of note, the same gene list put into the Gene Ontology enrichment tool GOrilla gives many “positive regulation” terms, as well as weaker signal for prostate and muscle-related terms. Prostate is significant in the TopAnat analysis (p = 3.96e-21), as are 5 muscle types.
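For readers wondering what "enrichment relative to chance" means here, this is the usual observed-over-expected convention of topGO-style tests, which TopAnat builds on. I am assuming TopAnat's fold-enrichment column follows it, and the symbols are mine: $N$ genes in the background, $K$ of them expressed in a given anatomical structure, $n$ genes in our list, and $k$ of those expressed in the structure.

```latex
E = n \cdot \frac{K}{N},
\qquad
\text{fold enrichment} = \frac{k}{E}
```

A 3-fold enrichment would then mean that three times as many genes from our list are expressed in the structure as we would expect from a random list of the same size.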
Let’s now look at the genes neighboring non-genic SNPs:
First, the signal is weaker, with less significant p-values and slightly fewer anatomical terms. Second, epithelial tissues again come out on top, but the reproductive system less so (although vagina is significant, with an enrichment of 1.66 and p = 2.52e-9). Third, lacrimal gland is again the top enriched term (enrichment of 2.69, p = 0.00408).
The writer of this post is not a breast cancer specialist, but a quick search tells us that:
Of all metastatic tumors to the orbit, breast carcinoma is considered to be the most prevalent primary tumor, accounting for 29% to 70% of all metastases.
Finally, as a simple control for gene length, here is the TopAnat result for the 500 longest human genes (according to Ensembl):
Lacrimal gland comes out, although weakly (enrichment 1.53, p = 0.0205), as does prostate. There are some components of the reproductive and mammary systems, such as “subdivision of uterine tube”, “epithelium of mammary gland”, “mammalian vulva” or “mammary gland”. (Hey, neat trick: you can filter terms by using the “Search” box above the table.)
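A list of longest genes like the one used for this control could be built from a BioMart export along the following lines. This is a sketch under assumptions, not necessarily how the list above was made: it supposes a tab-separated table with one header line, one row per gene, and the columns gene ID, start, and end, and it takes gene length to be the genomic span. The function name `longest_genes` is made up.

```shell
# Print the IDs of the n longest genes (default 500) from a table on stdin.
# Assumed columns: gene ID, start, end; the first line is a header and is skipped.
longest_genes() {
  n="${1:-500}"
  awk -F'\t' 'NR > 1 { print $1 "\t" ($3 - $2 + 1) }' \
    | sort -t "$(printf '\t')" -k2,2nr \
    | head -n "$n" \
    | cut -f1
}

# Hypothetical usage (file names are placeholders):
# longest_genes 500 < genes_with_coordinates.txt > longest_500.txt
```

Other definitions of length (longest transcript, total exon length) would give a somewhat different list, so the choice is worth stating alongside the result.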
Of course, more analyses would be needed to reach solid conclusions. Still, this illustrates how TopAnat can provide additional biological information on gene lists from large-scale scans.