Mitigating Batch Effects in Gene Expression Data Using a Novel Replicates-Based Empirical Bayes Correction Method

Shiyun Ling, The University of Texas School of Public Health


High-throughput technologies are widely used in a variety of biomedical research fields to enable rapid, intelligent, and parallel gathering of data. To achieve statistical significance, data are usually acquired from a large number of samples across different platforms, at different processing times, and under variable laboratory conditions. This introduces batch effects into the data set. Traditional methods of removing batch effects assume that the population distribution is the same in every batch, an assumption that cannot be guaranteed. When it does not hold, such methods may remove real biological factors that play a substantial role in the field of interest.

In this study, we used an empirical Bayes framework to develop a technical-replicate-based algorithm to eliminate batch effects. A replicate-based framework can guarantee the same population distribution in the training set, provided that the upstream factors before the technical replicates are added are the same. For our real datasets, we used brain tumor and breast cancer mRNA expression data.

We then used hierarchical clustering and correlation coefficients to test the algorithm's performance on simulated and real datasets.

Our results demonstrated that the algorithm mitigated the batch effects in mRNA data. It clustered more than 95% of the replicates immediately next to each other while maintaining the biological patterns. Further analysis suggested that the tissue type of the replicates may have very limited influence on the performance of the algorithm, at least in mRNA data. This implies that, in the future, scientists may be able to use cell lines as technical replicates, overcoming the limited availability of technical replicate sources.
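The replicate-based evaluation mentioned above can be illustrated with a small sketch. It is not the dissertation's procedure: instead of reading adjacency off a hierarchical clustering dendrogram, it uses a simpler proxy, namely whether each sample's most-correlated other sample (by Pearson correlation) is its own technical replicate. The function names, sample labels, and toy expression values are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def replicate_pairing_rate(profiles, replicate_of):
    """Fraction of samples whose most-correlated other sample is their replicate.

    profiles:     dict mapping sample id -> list of expression values
    replicate_of: dict mapping sample id -> id of its technical replicate
    This nearest-neighbour check is a simplified stand-in for "replicates
    cluster immediately next to each other" (an assumption of this sketch).
    """
    hits = 0
    for sample, partner in replicate_of.items():
        others = [o for o in profiles if o != sample]
        best = max(others, key=lambda o: pearson(profiles[sample], profiles[o]))
        hits += best == partner
    return hits / len(replicate_of)


# Hypothetical batch-corrected data: two tissues (A, B), each run in two batches.
toy_profiles = {
    "A_batch1": [1.0, 2.0, 3.0, 4.0],
    "A_batch2": [1.1, 2.0, 3.2, 3.9],
    "B_batch1": [4.0, 3.0, 2.0, 1.0],
    "B_batch2": [4.2, 2.9, 2.1, 1.0],
}
toy_replicates = {
    "A_batch1": "A_batch2",
    "A_batch2": "A_batch1",
    "B_batch1": "B_batch2",
    "B_batch2": "B_batch1",
}

rate = replicate_pairing_rate(toy_profiles, toy_replicates)
print(f"replicates paired with their partner: {rate:.0%}")
```

Computing the same rate before and after a correction step would give a rough indication of whether replicates were pulled closer together; checking that biological patterns are preserved would require a separate comparison.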

Recommended Citation

Ling, Shiyun, "Mitigating Batch Effects in Gene Expression Data Using a Novel Replicates-Based Empirical Bayes Correction Method" (2017). Texas Medical Center Dissertations (via ProQuest). AAI10272008.