Date of Graduation

8-2016

Document Type

Dissertation (PhD)

Program Affiliation

Biomedical Sciences

Degree Name

Doctor of Philosophy (PhD)

Advisor/Committee Chair

Dean Lee

Committee Member

Gilbert Cote

Committee Member

Michelle Hildebrandt

Committee Member

Stephen Ullrich

Committee Member

Eduardo Vilar-Sanchez

Abstract

Gene content determination and variant calling in the complex KIR genomic region are useful for immune system function analysis, pathogenesis and disease risk factor elucidation, immunotherapy development, evolutionary investigations, and human migration modeling. Sequence-specific oligonucleotide and sequence-specific primer PCR methods are the de facto standards for KIR presence/absence identification, but the current platforms are unsuitable for SNP calling, impractical for KIR typing large cohorts of DNA samples, and inapplicable for typing repositories in which sequence data, but not cells or cell analytes, are available. Alternative typing methods, such as in silico sequence-based typing, can address the problems associated with amplicon-based approaches. However, common next-generation sequencing technologies that map short reads to a reference genome exhibit high rates of read mismapping in regions containing loci that are homologous, polymorphic, and variable in content. I developed a novel approach for recovering KIR genotyping capability exclusively from sequence read alignment data files, with no requirement for additional DNA or amplification reactions. This Sequence Read Remapping And Motif Counting (SeRRAMC) approach uses a pattern-matching algorithm to select mapped reads containing oligomotifs that uniquely discriminate the variants of a single KIR gene and then remaps the selected reads to the consensus coding sequence of the respective KIR. The process reduces read mismapping to near zero, thereby enabling high-confidence KIR typing and variant calling from short sequence reads. I used this approach for an analysis of KIR variation in 2535 modern humans across 26 populations from around the world, plus three Neandertals and a Denisovan. The results identified 175 unique KIR genotypes, the four most frequent of which vary significantly across modern humans, and found 36 population-associated nonsynonymous single-base polymorphisms.
On average, African and Far Eastern genomes encode fewer KIR genes but a greater number of population-correlated variants, while South Asians have higher gene content with moderate variation. Archaic humans match European and South Asian genotypes at some key polymorphic sites and African genotypes at others. I also applied the SeRRAMC method to an analysis of 5489 germline exomes from cancer patients spanning 22 disease groups. After stratifying the samples by race and ethnicity, the results showed no significant correlations between KIR gene content and cancer but did identify a transmembrane domain polymorphism in the KIR3DL3 framework gene associated with melanoma and another variant in the KIR3DP1 pseudogene associated with prostate adenocarcinoma. The analysis also revealed significant batch effects due to disparities in the lengths of sequence reads produced by different sequencing centers. SeRRAMC processing is the first approach to enable these types of immunogenetic analyses. It also offers the first solution for quantifying KIR-specific gene transcripts from RNA-seq data and for calling structural variants from de novo sequence assemblies generated by third-generation single-molecule or synthetic long-read sequencing technologies.
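
The motif-based read selection step described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names, the toy motifs, and the gene labels below are hypothetical, and the sketch covers only selecting reads that contain a gene-discriminating motif (in either orientation) and counting motif hits. The subsequent remapping of selected reads to a consensus coding sequence would require an aligner and is not shown.

```python
from collections import Counter

# Translation table for complementing DNA bases (N is left as N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def select_reads_by_motif(reads, motifs_by_gene):
    """Select reads containing a gene-discriminating motif and count hits.

    reads          -- iterable of read sequences (strings)
    motifs_by_gene -- dict mapping a gene label to a list of motif strings
                      assumed to be unique to that gene (hypothetical input)

    Returns (selected, counts): `selected` maps each gene label to the reads
    that matched at least one of its motifs on either strand; `counts` maps
    (gene, motif) pairs to the number of reads containing that motif.
    """
    selected = {gene: [] for gene in motifs_by_gene}
    counts = Counter()
    for read in reads:
        read = read.upper()
        rc = reverse_complement(read)
        for gene, motifs in motifs_by_gene.items():
            hits = [m for m in motifs if m in read or m in rc]
            if hits:
                selected[gene].append(read)
                for motif in hits:
                    counts[(gene, motif)] += 1
    return selected, counts


if __name__ == "__main__":
    # Toy motifs and reads, invented purely for illustration.
    toy_motifs = {"GENE_A": ["GATTACAGG"], "GENE_B": ["CCTTGAACT"]}
    toy_reads = [
        "TTTGATTACAGGTTT",
        reverse_complement("AACCTTGAACTAA"),
        "AAAAAAAAAAAA",
    ]
    selected, counts = select_reads_by_motif(toy_reads, toy_motifs)
    for gene, gene_reads in selected.items():
        print(gene, len(gene_reads))
```

Exact substring matching is used here for simplicity; a real pipeline would need to decide how to handle sequencing errors and motifs shared between homologous loci, which this sketch does not attempt.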

Keywords

immunogenetics, bioinformatics, computational biology, genomics, statistical genetics, population genetics, cancer immunology, innate immunity, KIR, NK cells
