Language
English
Publication Date
8-22-2025
Journal
Scientific Reports
DOI
10.1038/s41598-025-14395-0
PMID
40846739
PMCID
PMC12373826
PubMedCentral® Posted Date
8-22-2025
PubMedCentral® Full Text Version
Post-print
Abstract
Missing gene expression values are a common issue in RNAseq-based analyses of gene expression. However, an analysis of genetic and environmental factors contributing to data missingness in RNAseq-based assessment of gene expression has never been conducted. In this study, we sought to identify factors that contribute to RNAseq data missingness. We used RNAseq data from 66 lung adenocarcinoma tumors and corresponding adjacent normal lung tissues. We found a strong negative association between gene expression level and missingness, supporting the idea that borderline expression is a key contributor to missingness. A more detailed analysis showed that the relationship between gene expression and missingness is more complex: while the expected negative association between missingness and expression level was observed for genes with low missingness, mean expression spiked at the right end of the distribution, which included genes with very high missingness. We hypothesized that genes with a high missing rate include not only genes with borderline expression but also genes with high expression in some individuals and no expression in others (true biological missingness, TBM). The results of a comparative analysis of missingness in smokers and nonsmokers, an examination of the proportion of known tobacco smoke-sensitive genes by missing rate, and gene enrichment analysis support this hypothesis. We argue that it is beneficial to first check data for the presence of genes with true biological missingness. The presence of highly expressed genes with missingness is an indication of TBM related to inter-individual variation in gene expression level. The results of our analysis call for caution against indiscriminate imputation of missing values. When true biological missingness is present, it is advisable to identify the affected genes and analyze them separately, because including them in imputation will introduce bias: expression values will be assigned to a subset of genes that are not expressed.
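The screening step suggested in the abstract (checking for highly expressed genes that also show high missingness before imputing) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `missingness_summary`, the two cutoffs, the toy gene names, and the choice to treat `None` or zero as missing are all assumptions made for illustration only.

```python
from statistics import mean


def missingness_summary(expr, missing_rate_cutoff=0.5, expression_cutoff=10.0):
    """Summarize per-gene missingness and flag candidate TBM genes.

    expr maps a gene name to a list of per-sample expression values.
    None or 0 is treated as missing (an assumption of this sketch).
    Both cutoffs are hypothetical and would need tuning on real data.
    """
    summary = {}
    for gene, values in expr.items():
        observed = [v for v in values if v is not None and v > 0]
        missing_rate = 1 - len(observed) / len(values)
        # Mean expression over samples where the gene was detected.
        mean_observed = mean(observed) if observed else 0.0
        summary[gene] = {
            "missing_rate": missing_rate,
            "mean_observed": mean_observed,
            # High missingness together with high expression where detected
            # suggests the gene is absent in some individuals, not borderline.
            "candidate_tbm": (
                missing_rate >= missing_rate_cutoff
                and mean_observed >= expression_cutoff
            ),
        }
    return summary


# Toy data with hypothetical gene names and values.
toy = {
    "GENE_A": [50.0, 48.0, 52.0, 47.0],   # detected in every sample
    "GENE_B": [0.5, None, 0.4, None],     # borderline expression
    "GENE_C": [120.0, None, None, 95.0],  # high in some, absent in others
}
result = missingness_summary(toy)
candidates = [g for g, s in result.items() if s["candidate_tbm"]]
```

Genes flagged this way would be set aside for separate analysis rather than passed to imputation, in line with the recommendation above; genes like the toy `GENE_B`, with low expression and high missingness, fit the borderline-expression explanation instead.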
Keywords
Humans; Lung Neoplasms; Gene Expression Profiling; Adenocarcinoma of Lung; Sequence Analysis, RNA; RNA-Seq; Gene Expression Regulation, Neoplastic; Female; Male; Gene expression; RNAseq; Missing values; Environmental exposure
Published Open-Access
yes
Recommended Citation
Gorlova, Olga Y; Gorlov, Ivan P; Ripley, R Taylor; et al., "Exposure-Inducible Genes May Contribute to Missingness in RNAseq-Based Gene Expression Analyses" (2025). Faculty and Staff Publications. 4406.
https://digitalcommons.library.tmc.edu/baylor_docs/4406