Phenotype Prediction of DNA Sequence Data: A Machine- And Statistical Learning Approach
Abstract
Advances in high-throughput sequencing technologies continue to generate large amounts of sequencing data, enabling the holistic investigation of complex biological phenomena. Genomic sequence data are used for a wide range of applications such as gene annotation, expression studies, personalized treatment, and precision medicine. However, this rapid expansion in available sequence data poses a tremendous computational challenge, calling for the development of novel data processing and analytic methods, as well as computing resources that match the volume of these datasets. In this work, a machine- and statistical learning approach for classification based on k-mer representations of DNA sequence data is proposed. While targeted sequencing focuses on a specific region of interest, whole genome sequencing enables a view of a species' entire genome. Thus, the approach is tested using whole genome sequences of Mycobacterium tuberculosis isolates to (i) reduce the size of genomic sequence data, (ii) identify an optimum k-mer size and use it to build classification models, and (iii) predict the phenotype of a given bacterial isolate from its whole genome sequence data. Furthermore, the computing challenges associated with producing interpretable and explainable insights from whole genome sequence data analyses are described. Classification models were trained using 104 Mycobacterium tuberculosis isolates. Cluster analyses showed that k-mers can be used to discriminate phenotypes and that the discrimination becomes clearer as the k-mer size increases. The best performing classification model had a k-mer size of 10 (the longest k-mer considered in this study) and an accuracy, recall, precision, specificity, and Matthews correlation coefficient of 72.0%, 80.5%, 80.5%, 63.6%, and 0.4, respectively.
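For readers unfamiliar with the evaluation metrics listed above, the following is a minimal sketch of how they are conventionally defined for a binary classifier, given the counts of a confusion matrix. The function name `binary_metrics` and the example counts are hypothetical illustrations and are not taken from this study.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts.

    tp, fp, tn, fn are the numbers of true positives, false positives,
    true negatives, and false negatives. Ratios with a zero denominator
    are reported as 0.0 (an assumption made here for simplicity).
    """

    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0

    # MCC denominator: geometric mean of the four marginal totals.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "recall": ratio(tp, tp + fn),        # sensitivity
        "precision": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts, for illustration only (not the study's results).
example = binary_metrics(tp=8, fp=2, tn=6, fn=4)
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when the phenotype classes are imbalanced.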
This study provides a comprehensive approach for resampling whole genome sequencing data, objectively selecting a k-mer size, and performing classification for phenotype prediction. The analysis also highlights the importance of increasing the k-mer size to produce more biologically explainable results, and underscores the interplay between accuracy, computing resources such as processing power and memory, and the explainability of classification results. Furthermore, the analysis provides a new way to extract genetic information from genomic data and to identify phenotype relationships that are integral to explaining complex biological mechanisms.
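To illustrate what a k-mer representation of a DNA sequence looks like, the following is a minimal sketch, assuming overlapping k-mers over the alphabet A, C, G, T and skipping any window that contains another character. The function names `count_kmers` and `kmer_vector` are hypothetical and do not describe the pipeline used in this study.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def count_kmers(sequence: str, k: int) -> Counter:
    """Count overlapping k-mers in a DNA sequence.

    Windows containing characters outside A, C, G, T (for example N)
    are skipped; this is an assumption of the sketch.
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(ALPHABET):
            counts[kmer] += 1
    return counts


def kmer_vector(sequence: str, k: int) -> list:
    """Fixed-length count vector over all 4**k possible k-mers,
    ordered lexicographically, suitable as classifier input."""
    counts = count_kmers(sequence, k)
    return [counts["".join(p)] for p in product(ALPHABET, repeat=k)]


# Toy sequence, for illustration only.
toy_counts = count_kmers("ACGTAC", 2)
toy_vector = kmer_vector("ACGTAC", 2)
```

In a dense representation such as this one, the number of possible features grows as 4**k, so k = 10 already yields 1,048,576 features per isolate. This growth is one way to see the trade-off between k-mer size and memory noted above.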