Date of Award
Doctor of Philosophy (PhD)
RNA-seq is the next-generation sequencing technology for measuring gene expression. While many tools have been developed to assess differential expression, most focus on gene-level statistics, which implicitly ignore any dependence among genes. To incorporate correlation directly into testing for differential expression, genes are sorted into networks using Ingenuity Pathway Analysis (IPA), and the gene expression of each network is modeled using Generalized Estimating Equations (GEE). Since gene network data often exhibit correlation structures containing both positive and negative values, a new intermediate correlation structure is developed. This structure provides a compromise between an exchangeable structure (one parameter for all gene pairs) and an unstructured one (one parameter for each gene pair). A log-linear regression model and a Wald test are proposed for testing whether a gene network is differentially expressed. Additionally, a statistical test is given to determine whether the genes in a given network are independent or correlated. Simulations of correlated negative binomial data are used to compare different GEE correlation structures on type I error and statistical power, and to benchmark the test of gene network independence. Models that incorporate correlation into estimation control type I error, while models with an independence correlation structure do not. Positive correlations unaccounted for in the independence model lead to increases in type I error, while negative correlations lead to decreases in type I error. Power is roughly the same across models. Goodness-of-fit tests reveal that the correlated negative binomial distribution is a better fit to the actual data than the univariate negative binomial distribution.
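To make the trade-off between correlation structures concrete, the following is a minimal Python sketch rather than the dissertation's implementation. It counts the free correlation parameters under the independence, exchangeable, and unstructured working correlations, builds an exchangeable matrix, and computes a single-coefficient Wald test. The function names (`exchangeable_corr`, `n_corr_params`, `wald_test`) are hypothetical, and the one-degree-of-freedom Wald statistic is an assumption made for illustration; the test proposed in the dissertation may involve several coefficients at once.

```python
import math


def exchangeable_corr(p, rho):
    """Return a p x p working correlation matrix with one shared
    off-diagonal value rho (the exchangeable structure)."""
    return [[1.0 if i == j else rho for j in range(p)] for i in range(p)]


def n_corr_params(p, structure):
    """Number of free correlation parameters for a network of p genes."""
    if structure == "independence":
        return 0
    if structure == "exchangeable":
        return 1                      # one parameter for all gene pairs
    if structure == "unstructured":
        return p * (p - 1) // 2       # one parameter for each gene pair
    raise ValueError(f"unknown structure: {structure}")


def wald_test(beta_hat, se):
    """Two-sided Wald test of H0: beta = 0 for a single coefficient
    (illustrative assumption: one degree of freedom).

    W = (beta_hat / se)^2 is compared with a chi-square distribution
    with 1 df, whose upper tail equals erfc(sqrt(W / 2)).
    """
    w = (beta_hat / se) ** 2
    p_value = math.erfc(math.sqrt(w / 2.0))
    return w, p_value


if __name__ == "__main__":
    p = 10
    for s in ("independence", "exchangeable", "unstructured"):
        print(s, n_corr_params(p, s))
    print(wald_test(beta_hat=0.8, se=0.3))
```

For a 10-gene network the unstructured form already requires 45 correlation parameters, which is the motivation for an intermediate structure. A single exchangeable parameter also cannot represent a mix of positive and negative pairwise correlations, and it yields a valid correlation matrix only when rho is greater than -1/(p - 1).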
Data analysis consisted of two stages: (i) analyzing gene networks with GEE to find differentially expressed networks and (ii) performing a single-gene analysis on these differentially expressed networks. An RNA-seq dataset of breast cancer patients from The Cancer Genome Atlas (TCGA) was analyzed, adjusting for relevant clinical covariates. The top genes from each network were scrutinized for biological relevance using PubMed searches and other knowledge-based databases such as OMIM. These genes range from those with no citations to those with many citations, which suggests that this approach finds both novel genes and genes that have been well studied in breast cancer research.
ASCHENBRENNER, ANDREW RICHARD, "MOVING BEYOND THE SINGLE GENE: INTEGRATIVE GENE SET ANALYSIS FOR RNA-SEQ" (2019). Dissertations (Open Access). 196.