Author ORCID Identifier


Date of Graduation


Document Type

Thesis (MS)

Program Affiliation

Biomedical Sciences

Degree Name

Master of Science (MS)

Advisor/Committee Chair

Zhongming Zhao

Committee Member

Yulin Dai

Committee Member

Huihui Fan

Committee Member

Jeffrey Chang

Committee Member

Balveen Kaur


In the 1980s, researchers discovered the first human oncogenic retrovirus, human T-lymphotropic virus type 1 (HTLV-1). Since then, HTLV-1 has been identified as the causative agent of several diseases, including adult T-cell leukemia/lymphoma (ATL) and HTLV-1-associated myelopathy/tropical spastic paraparesis (HAM/TSP). As part of its normal replication cycle, the viral RNA genome is reverse transcribed into DNA and integrated into the host genome. With several hundred to thousands of unique viral integration sites (VISs) distributed with indeterminate preference throughout the genome, detecting HTLV-1 VISs is a challenging task. Experimental studies typically detect VISs with molecular biology techniques such as fluorescence in situ hybridization (FISH) or reverse transcription quantitative PCR (RT-qPCR). While these methods are accurate, they cannot be applied in a high-throughput manner. Next-generation sequencing (NGS) has generated vast amounts of data, leading to the development of several computational methods, such as VERSE, VirusFinder, and DeepVISP, for rapid detection of VISs across an entire genome. However, no such model exists for predicting HTLV-1 VISs. In this study, we developed DeepHTLV, the first deep neural network for accurate detection of HTLV-1 insertion sites. We focused on 1) accurately predicting HTLV-1 VISs by extracting and generating superior feature representations and 2) uncovering the cis-regulatory features surrounding the insertion sites. DeepHTLV was implemented as a deep convolutional neural network (CNN) with a self-attention architecture, chosen after comparison with several other deep neural network structures. To improve model accuracy, we trained the model using a bootstrap balanced sampling method with 10-fold cross-validation (CV). Furthermore, we demonstrated that this model is more accurate than several traditional machine learning models, with a modest improvement of 3-10% in area under the curve (AUC) values.
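As an illustration of the training scheme mentioned above, the following is a minimal sketch of balanced bootstrap sampling inside k-fold cross-validation. It is not the DeepHTLV implementation: the function names (`k_fold_indices`, `balanced_bootstrap`, `balanced_cv_splits`) are hypothetical, and the choice to resample both classes down to the size of the smaller class is an assumption, since the abstract does not specify how the balancing was done.

```python
import random


def k_fold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k near-equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def balanced_bootstrap(labels, train_idx, rng):
    """Draw, with replacement, an equal number of positives and negatives.

    Assumption: each class is resampled to the size of the smaller class.
    """
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    n = min(len(pos), len(neg))
    sample = rng.choices(pos, k=n) + rng.choices(neg, k=n)
    rng.shuffle(sample)
    return sample


def balanced_cv_splits(labels, k=10, seed=0):
    """Yield (balanced_train_indices, test_indices) for each of the k folds."""
    rng = random.Random(seed)
    folds = k_fold_indices(len(labels), k, rng)
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds)
                     if f != held_out for i in fold]
        # Only the training portion is rebalanced; the held-out fold keeps
        # its original class ratio so evaluation reflects the real imbalance.
        yield balanced_bootstrap(labels, train_idx, rng), test_idx


if __name__ == "__main__":
    # Toy labels: 30 VIS-positive windows and 170 negative windows.
    toy_labels = [1] * 30 + [0] * 170
    for train, test in balanced_cv_splits(toy_labels, k=10, seed=1):
        print(len(train), len(test))
```

Rebalancing only the training folds is a deliberate choice in this sketch: it keeps the model from being dominated by the majority class during fitting while leaving the evaluation data untouched.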
To study the cis-regulatory features around HTLV-1 insertion sites, we extracted informative motifs from the convolutional layer. Clustering of these motifs yielded eight unique consensus sequence motifs that represent potential integration sites in humans. The informative motif sequences were matched against a database of known transcription factor (TF) binding profiles, JASPAR2020, using the sequence matching tool TOMTOM. Seventy-nine TF associations were enriched in regions surrounding HTLV-1 VISs. Furthermore, literature screening of HTLV-1, ATL, and HAM/TSP validated nearly half (34) of the predicted TF interactions. This work demonstrates that DeepHTLV can accurately identify HTLV-1 VISs, elucidate the features regulating these insertion sites, and make biologically meaningful predictions about the cis-regulatory elements surrounding them.
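One common way to turn a convolutional filter into a sequence motif is to collect the subsequences that activate the filter most strongly and count the bases at each position. The sketch below illustrates that general idea only; the abstract does not describe the exact extraction procedure, so the activation cutoff (`frac`), the example kernel, and all function names here are assumptions rather than the DeepHTLV method. A position frequency matrix built this way is the kind of input that a tool such as TOMTOM can compare against JASPAR profiles.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (unknown bases -> zeros)."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]


def activations(seq, kernel):
    """Slide a (width x 4) kernel over the one-hot sequence (no padding)."""
    x = one_hot(seq)
    w = len(kernel)
    return [
        sum(kernel[j][c] * x[i + j][c] for j in range(w) for c in range(4))
        for i in range(len(seq) - w + 1)
    ]


def filter_pfm(seqs, kernel, frac=0.5):
    """Count bases over windows whose activation is at least frac * max.

    Assumption: the 0.5 cutoff is an illustrative default, not a value
    taken from the thesis.
    """
    w = len(kernel)
    counts = [[0] * 4 for _ in range(w)]
    scored = [
        (a, s[i:i + w])
        for s in seqs
        for i, a in enumerate(activations(s, kernel))
    ]
    if not scored:
        return counts
    cutoff = frac * max(a for a, _ in scored)
    for a, window in scored:
        if a > 0 and a >= cutoff:
            for j, b in enumerate(window):
                if b in BASES:
                    counts[j][BASES.index(b)] += 1
    return counts


def consensus(pfm):
    """Most frequent base at each position of the frequency matrix."""
    return "".join(BASES[row.index(max(row))] for row in pfm)


if __name__ == "__main__":
    # Hypothetical 3-wide filter that responds to the trinucleotide "TGA".
    toy_kernel = [[-1, -1, -1, 1], [-1, -1, 1, -1], [1, -1, -1, -1]]
    toy_seqs = ["CCTGACC", "ATGAGG", "CCCCCC"]
    print(consensus(filter_pfm(toy_seqs, toy_kernel)))
```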


bioinformatics, deep learning, retrovirus


