Theoretical investigation of the feasibility and accuracy of SNP discovery in extremely low-coverage NGS-based cohort studies

Naveen Ramesh, The University of Texas School of Public Health

Abstract

As sequencing technology has evolved, the cost of sequencing a human genome has fallen substantially, from $3 billion per sample to less than $1000 per sample with the advent of Illumina's HiSeq X Ten sequencing system. Although a genome can be sequenced at 30x coverage for a nominal cost of $1000, the overall cost of sequencing a large population cohort remains high. Recovering a large proportion of the SNPs with a high minor allele frequency (MAF > 0.2) in a population using an extremely low-coverage study design is therefore of great significance in genomic research, because it substantially reduces sequencing costs. In this study, we assess the power of a study design based on extremely low-coverage sequencing data, which provides several times the effective sample size at a lower sequencing cost than low-coverage study designs such as the 1000 Genomes Project. We used a downsampling algorithm to generate extremely low-coverage BAM files from the 1000 Genomes Phase 1 BAM files, and used these files to test the feasibility and accuracy of an extremely low-coverage NGS-based study design. We found that 200 samples at 1x coverage or 300 samples at 0.75x coverage are sufficient to discover 80% of the true SNPs with MAF > 0.2 at a false positive rate (FPR) < 3%, while 499 samples at 0.5x coverage are sufficient to discover 75% of the true SNPs with MAF > 0.2 at FPR < 3%. This study shows that it is feasible to use extremely low-coverage NGS-based cohorts to call more than 80% of true SNPs with MAF > 0.2 at FPR < 3%. The relative genetic variability between populations was preserved in the reduced-coverage samples in comparison with the 1000 Genomes Phase 1 samples.
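The abstract does not describe the downsampling algorithm itself. As an illustration only, the following Python sketch shows one common way to reduce coverage: each read is kept with probability equal to the ratio of target to original coverage, and the decision is derived from a hash of the read name so that both mates of a pair are kept or dropped together. The function names (`keep_fraction`, `keep_read`), the seed parameter, and the 4x starting coverage are assumptions made for this example, not details taken from the dissertation.

```python
import hashlib


def keep_fraction(original_coverage, target_coverage):
    """Fraction of reads to retain to move from original to target coverage."""
    if original_coverage <= 0:
        raise ValueError("original coverage must be positive")
    if not 0 <= target_coverage <= original_coverage:
        raise ValueError("target coverage must lie between 0 and the original")
    return target_coverage / original_coverage


def keep_read(read_name, fraction, seed=0):
    """Decide deterministically whether to keep a read.

    The read name is hashed to a pseudo-uniform value in [0, 1); mates that
    share a name therefore always receive the same decision.
    """
    digest = hashlib.sha256(f"{seed}:{read_name}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < fraction


# Hypothetical example: reduce a 4x sample to 1x (4x is an assumed input).
fraction = keep_fraction(4.0, 1.0)
read_names = [f"read{i}" for i in range(10000)]
kept = [name for name in read_names if keep_read(name, fraction)]
```

In practice the same keep-or-drop decision would be applied to the records of a BAM file rather than to a list of names; the realised coverage fluctuates slightly around the target because the selection is random.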

Subject Area

Biostatistics|Genetics|Public health|Bioinformatics

Recommended Citation

Ramesh, Naveen, "Theoretical investigation of the feasibility and accuracy of SNP discovery in extremely low-coverage NGS-based cohort studies" (2014). Texas Medical Center Dissertations (via ProQuest). AAI1567536.
https://digitalcommons.library.tmc.edu/dissertations/AAI1567536
