Date of Graduation
Doctor of Philosophy (PhD)
Gene content determination and variant calling in the complex KIR genomic region are useful for immune system function analysis, pathogenesis and disease risk factor elucidation, immunotherapy development, evolutionary investigations, and human migration modeling. Sequence-specific oligonucleotide and sequence-specific primer PCR methods are the de facto standards for KIR presence/absence identification, but the current platforms are unsuitable for SNP calling, impractical for KIR typing of large cohorts of DNA samples, and inapplicable to repositories in which sequence data, but not cells or cell analytes, are available. Alternative typing methods, such as in silico sequence-based typing, can address the problems associated with amplicon-based approaches. However, common next-generation sequencing technologies that map short reads to a reference genome exhibit high rates of read mismapping in regions containing loci that are homologous, polymorphic, and variable in content. I developed a novel approach that recovers KIR genotypes exclusively from sequence read alignment data files, with no requirement for additional DNA or amplification reactions. This Sequence Read Remapping And Motif Counting (SeRRAMC) approach uses a pattern-matching algorithm to select mapped reads containing oligomotifs that uniquely discriminate the variants of a single KIR gene and then remaps the selected reads to the consensus coding sequence of the respective KIR. The process dramatically reduces read mismapping to near zero, thereby enabling high-confidence KIR typing and variant calling from short sequence reads. I used this approach to analyze KIR variation in 2535 modern humans across 26 populations from around the world, plus three Neandertals and a Denisovan. The results identified 175 unique KIR genotypes, the four most frequent of which vary significantly across modern human populations, and found 36 population-associated nonsynonymous single-base polymorphisms.
On average, African and Far Eastern genomes encode fewer KIR genes but a greater number of population-correlated variants, whereas South Asian genomes have higher gene content with moderate variation. Archaic humans match European and South Asian genotypes at some key polymorphic sites and African genotypes at others. I also applied the SeRRAMC method to 5489 germline exomes from cancer patients spanning 22 disease groups. After the samples were stratified by race and ethnicity, the results showed no significant correlations between KIR gene content and cancer but did identify a transmembrane domain polymorphism in the KIR3DL3 framework gene associated with melanoma and another variant in the KIR3DP1 pseudogene associated with prostate adenocarcinoma. The analysis also revealed significant batch effects due to disparities in the lengths of sequence reads produced by different sequencing centers. SeRRAMC processing is the first approach to enable these types of immunogenetic analyses. It also offers the first solution for quantifying KIR-specific gene transcripts from RNA-seq data and for calling structural variants from de novo sequence assemblies generated by third-generation single-molecule or synthetic long-read sequencing technologies.
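The motif-selection step described in the abstract can be illustrated with a minimal sketch. This is not the SeRRAMC implementation: the gene labels, motif sequences, function names, and the `min_reads` threshold are all hypothetical, reads are plain strings rather than records from an alignment file, and the remapping of selected reads to a consensus coding sequence is omitted. The sketch only shows the idea of keeping reads whose motifs point to exactly one gene and counting them.

```python
# Illustrative sketch only; every name, motif, and threshold here is hypothetical.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def select_reads(reads, motifs):
    """Group reads by the single gene whose motifs they contain.

    A read is kept only if its motif hits (on either strand) point to
    exactly one gene; reads with no hit or with hits to several genes
    are discarded as uninformative or ambiguous.
    """
    selected = {gene: [] for gene in motifs}
    for read in reads:
        read = read.upper()
        rc = reverse_complement(read)
        hits = {
            gene
            for gene, gene_motifs in motifs.items()
            if any(m in read or m in rc for m in gene_motifs)
        }
        if len(hits) == 1:
            selected[hits.pop()].append(read)
    return selected


def count_reads(selected):
    """Count the selected reads per gene."""
    return {gene: len(gene_reads) for gene, gene_reads in selected.items()}


def call_presence(counts, min_reads=2):
    """Call a gene present when its read count reaches a hypothetical threshold."""
    return {gene: n >= min_reads for gene, n in counts.items()}


# Toy data: made-up motifs and reads, not real KIR sequences.
TOY_MOTIFS = {"GENE_A": ["AAACCCGG"], "GENE_B": ["TTGACATG"]}
TOY_READS = [
    "GGAAACCCGGTT",      # GENE_A motif, forward strand
    "TTCCGGGTTTAA",      # GENE_A motif, reverse strand
    "CCTTGACATGCC",      # GENE_B motif, forward strand
    "AAACCCGGTTGACATG",  # motifs from both genes: ambiguous, discarded
    "GGGGGGGGGGGG",      # no motif: discarded
]

toy_selected = select_reads(TOY_READS, TOY_MOTIFS)
toy_counts = count_reads(toy_selected)
toy_calls = call_presence(toy_counts)
```

Discarding reads that match motifs from more than one gene is the design choice that matters in this sketch: it trades read depth for specificity, which reflects the abstract's emphasis on reducing mismapping among homologous loci.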
immunogenetics, bioinformatics, computational biology, genomics, statistical genetics, population genetics, cancer immunology, innate immunity, KIR, NK cells
Bioinformatics Commons, Computational Biology Commons, Genetics Commons, Genomics Commons, Immunity Commons, Medicine and Health Sciences Commons, Molecular Genetics Commons, Population Biology Commons