Author ORCID Identifier
0000-0001-7466-8589
Date of Graduation
5-2023
Document Type
Thesis (MS)
Program Affiliation
Biomedical Sciences
Degree Name
Master of Science (MS)
Advisor/Committee Chair
Zhongming Zhao
Committee Member
Yulin Dai
Committee Member
Huihui Fan
Committee Member
Jeffrey Chang
Committee Member
Balveen Kaur
Abstract
In the 1980s, researchers identified the first human oncogenic retrovirus, human T-lymphotropic virus type 1 (HTLV-1). Since then, HTLV-1 has been identified as the causative agent of several diseases, including adult T-cell leukemia/lymphoma (ATL) and HTLV-1-associated myelopathy/tropical spastic paraparesis (HAM/TSP). As part of its normal replication cycle, the viral RNA genome is reverse transcribed into DNA and integrated into the host genome. With hundreds to thousands of unique viral integration sites (VISs) distributed throughout the host genome without a clear preference, detecting HTLV-1 VISs is challenging. Experimental studies typically detect VISs with molecular biology techniques such as fluorescence in situ hybridization (FISH) or reverse transcription quantitative PCR (RT-qPCR). While these methods are accurate, they cannot be applied in a high-throughput manner. Next-generation sequencing (NGS) has generated vast amounts of data, which has led to the development of several computational methods, such as VERSE, VirusFinder, and DeepVISP, for rapid detection of VISs across an entire genome. However, no such model exists for predicting HTLV-1 VISs. In this study, we developed DeepHTLV, the first deep neural network for accurate detection of HTLV-1 insertion sites. We focused on 1) accurately predicting HTLV-1 VISs by extracting and generating superior feature representations and 2) uncovering the cis-regulatory features surrounding the insertion sites. After comparing several deep neural network architectures, we implemented DeepHTLV as a deep convolutional neural network (CNN) with a self-attention mechanism. To improve model accuracy, we trained the model using a bootstrap balanced sampling method with 10-fold cross-validation (CV). Furthermore, we demonstrated that this model is more accurate than several traditional machine learning models, with a modest improvement of 3-10% in area under the curve (AUC) values.
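As an illustration of the training scheme mentioned above, the following is a minimal sketch of balanced bootstrap sampling inside 10-fold CV. It is not taken from the thesis: the function names, the toy label vector, and the choice to downsample to the smaller class are all assumptions made for this example, and the actual DeepHTLV procedure may differ.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def balanced_bootstrap(labels, train_idx, rng):
    """Draw, with replacement, equal numbers of positive and negative
    training indices (assumption: sized to the smaller class)."""
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    n = min(len(pos), len(neg))
    return rng.choices(pos, k=n) + rng.choices(neg, k=n)

def cv_balanced_samples(labels, k=10, seed=0):
    """Yield (balanced training sample, held-out test indices) per fold."""
    rng = random.Random(seed)
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        yield balanced_bootstrap(labels, train_idx, rng), test_idx

# Hypothetical toy data: 20 VIS (positive) and 200 background (negative) regions.
labels = [1] * 20 + [0] * 200
for fold, (sample, test_idx) in enumerate(cv_balanced_samples(labels)):
    n_pos = sum(labels[i] for i in sample)
    print(fold, len(sample), n_pos, len(test_idx))
```

Resampling only from the training indices keeps the held-out fold untouched, so the class balance seen during training does not leak into evaluation.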
To study the cis-regulatory features around HTLV-1 insertion sites, we extracted informative motifs from the convolutional layer. Clustering these motifs yielded eight unique consensus sequence motifs representing potential integration sites in humans. The informative motif sequences were matched against a database of known transcription factor (TF) binding profiles, JASPAR2020, using the sequence matching tool TOMTOM. Seventy-nine TF associations were enriched in regions surrounding HTLV-1 VISs. Furthermore, literature screening of HTLV-1, ATL, and HAM/TSP validated nearly half (34) of the predicted TF interactions. This work demonstrates that DeepHTLV can accurately identify HTLV-1 VISs, elucidate surrounding features regulating these insertion sites, and make biologically meaningful predictions about cis-regulatory elements surrounding the insertion sites.
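One common way to turn a convolutional filter into a sequence motif is to collect the input windows that activate the filter strongly and count the bases at each position. The sketch below illustrates that general idea only. The abstract does not state how DeepHTLV extracts motifs, so the filter weights, the activation threshold (half of the maximum), and all function names here are hypothetical.

```python
BASES = "ACGT"

def filter_scores(seq, weights):
    """Score each window of seq against one filter, given as a list of
    per-position {base: weight} dicts (one dict per filter position)."""
    width = len(weights)
    return [sum(weights[j][seq[i + j]] for j in range(width))
            for i in range(len(seq) - width + 1)]

def filter_to_pfm(seqs, weights, frac=0.5):
    """Build a position frequency matrix from windows whose score is
    positive and at least frac times the best score over all sequences."""
    width = len(weights)
    scored = [(s, seq[i:i + width])
              for seq in seqs
              for i, s in enumerate(filter_scores(seq, weights))]
    best = max(s for s, _ in scored)
    pfm = [{b: 0 for b in BASES} for _ in range(width)]
    for s, window in scored:
        if s > 0 and s >= frac * best:
            for j, base in enumerate(window):
                pfm[j][base] += 1
    return pfm

def consensus(pfm):
    """Most frequent base at each motif position."""
    return "".join(max(BASES, key=col.get) for col in pfm)

# Hypothetical width-3 filter that rewards the pattern T, G, A.
toy_filter = [{b: (1.0 if b == want else -1.0) for b in BASES} for want in "TGA"]
toy_seqs = ["ACTGACC", "GGTGATT", "CCCCCCC"]
pfm = filter_to_pfm(toy_seqs, toy_filter)
print(consensus(pfm))
```

A frequency matrix of this kind is the sort of input that motif comparison tools such as TOMTOM accept for matching against TF binding profiles like those in JASPAR.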
Keywords
bioinformatics, deep learning, retrovirus
Included in
Bioinformatics Commons, Data Science Commons, Genomics Commons