Imagine you have a list of genes and you would like to know whether these genes are expressed in, let’s say, human blood. But you’re not going to do the experiment yourself. Perhaps you’re a bioinformatician who doesn’t do wet-lab work, or you need this expression information but it isn’t important enough to justify spending grant money on the experiment. Either way, you’re going to use publicly available data. After checking Gene Expression Omnibus (GEO), a public repository for microarray data, you identify four different experiments (GSE13904, GSE9692, GSE8121 and GSE26440) that sampled blood from healthy children. That’s great, you think: the expression of your genes of interest will be confirmed by four independent lines of evidence. Well, think again. It turns out that 15 samples are actually identical between the four experiments, 3 samples are shared by two experiments, and one experiment has 3 samples in common with yet another experiment (GSE26378). And the bad news is that nowhere on GEO is it explicitly stated that these samples from different experiments are in reality the same data. The sample IDs are different. The names of the data files are different. If you look closely at the experiments, you will see that the contact name is the same and that each experiment compares control samples with samples from septic shock patients, which might give you a hint that something fishy is going on. Still, there’s a good chance you’ll dismiss this as just one lab submitting several different experiments related to its research subject.
Now you’re probably cursing yourself for not listening to your colleague who warned you against using other people’s data. Not so fast. That doesn’t mean you shouldn’t use publicly available data; it just means that you have to be careful with it. And it highlights the need for secondary databases whose job is to take the time to check the data and provide it in a clean, high-quality form.
At Bgee we retrieve publicly available microarray data and present it so that users can compare gene expression between several animal species, and we recently realized that many samples we had considered distinct and independent are in fact identical. The example above is just one among many. In our data we found 1031 samples, representing 42 experiments, to be duplicates of samples from other experiments. It seems to be quite a widespread practice among scientists to “reuse” data and submit it to public repositories as new experiments.
If this redundancy of samples is not explicitly stated on GEO, how did we come to notice it? Well, sometimes the redundancy is evident: the sample ID and the name of the data file are identical between experiments. For example, it’s clear that sample GSM2334 belongs to both experiments GSE75 and GSE760. This prompted us to check whether more experiments shared samples. We computed several parameters for the data file of every sample we use in Bgee (for full details see the Bgee documentation), and it turned out that several files had identical parameters, revealing that they were in fact duplicates.
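To make the idea concrete, here is a minimal sketch of this kind of check. It is not the procedure Bgee uses (that one is described in the Bgee documentation): as a stand-in for the parameters mentioned above, it assumes a SHA-256 checksum of each data file, which only catches files that are byte-for-byte identical. The function names and the toy experiment and sample IDs are hypothetical.

```python
import hashlib
from collections import defaultdict


def file_fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest of the raw file content.

    Assumption: a checksum stands in for the per-file parameters
    described in the text; it only matches byte-identical files.
    """
    return hashlib.sha256(data).hexdigest()


def find_duplicate_samples(samples):
    """Group samples whose data files share the same fingerprint.

    `samples` is an iterable of (experiment_id, sample_id, file_bytes).
    Returns a list of groups; each group is a sorted list of
    (experiment_id, sample_id) pairs that point at identical data.
    """
    by_fingerprint = defaultdict(set)
    for experiment_id, sample_id, data in samples:
        by_fingerprint[file_fingerprint(data)].add((experiment_id, sample_id))
    # Keep only fingerprints seen for more than one (experiment, sample) pair.
    return [sorted(group) for group in by_fingerprint.values() if len(group) > 1]


if __name__ == "__main__":
    # Hypothetical toy data: two samples with different IDs but the same content.
    toy_samples = [
        ("EXP_A", "S1", b"probe1\t7.2\nprobe2\t3.1\n"),
        ("EXP_B", "S9", b"probe1\t7.2\nprobe2\t3.1\n"),
        ("EXP_B", "S10", b"probe1\t5.0\nprobe2\t4.4\n"),
    ]
    for group in find_duplicate_samples(toy_samples):
        print("Identical data files:", group)
```

Because the grouping key is the file content rather than the sample ID or file name, a check like this would flag duplicates even when, as in the blood example, the IDs and file names differ between experiments.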
And there’s more. Authors don’t always submit the same information for identical samples. In our human blood example, GSE9692 gives you the sex of the individual the sample was taken from, while the other experiments don’t provide this detail. GSE26440 supplies the exact age of the patient, while the three other experiments merely tell you the patients were children. More disconcerting is when the information for identical samples is completely different. Take the samples GSM647625 and GSM648688 of experiments GSE26378 and GSE26440, respectively. Regarding the patient’s age, the former specifies 2.9 years while the latter indicates 6 years. Which is the correct information? OK, you might say it’s just a detail. So what about samples GSM322066 and GSM336955 from experiments GSE12826 and GSE13348, respectively? The first was taken from a whole zebrafish embryo and the second from the brain of a zebrafish embryo, but the data files are identical. Is that still just a detail?
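Once duplicates have been grouped, spotting this kind of contradiction can be automated. The snippet below is only an illustrative sketch, not something taken from Bgee: the function name and the `age_years` and `sex` field names are hypothetical, and the annotations are assumed to have already been parsed into dictionaries. The two ages are the ones quoted above; the `sex` entry is left empty purely to show how a missing value is handled.

```python
def conflicting_fields(annotations):
    """Report fields on which supposedly identical samples disagree.

    `annotations` maps sample_id -> {field: value}. A field that is
    missing (or None) for a sample is treated as "not provided", not
    as a conflict. Returns {field: {sample_id: value}} for every field
    with more than one distinct value.
    """
    values = {}
    for sample_id, fields in annotations.items():
        for field, value in fields.items():
            if value is not None:
                values.setdefault(field, {})[sample_id] = value
    return {
        field: per_sample
        for field, per_sample in values.items()
        if len(set(per_sample.values())) > 1
    }


if __name__ == "__main__":
    # Ages as reported for the two identical samples discussed in the text.
    duplicates = {
        "GSM647625": {"age_years": 2.9, "sex": None},
        "GSM648688": {"age_years": 6},
    }
    print(conflicting_fields(duplicates))
```

A report like this cannot tell you which value is correct, but it does tell you which annotations should not be trusted without going back to the authors.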
The same issues also exist in ArrayExpress, the other public repository for microarray data. And it’s not just because ArrayExpress imports some of its data from GEO. Experiments unique to ArrayExpress share common samples. For example, the mouse data of experiment E-MTAB-406 is identical to the data of experiment E-TABM-163.
We’ve removed all the redundant data from Bgee and the expression data in the latest release of the database is free of duplicates. But maybe our experience can be of use to others. You might want to add one more item to the list of details to check out before you use public data.