Abstract
Machine learning, particularly neural networks and deep learning, has significantly advanced bioinformatics by enabling the analysis of complex biological data. Nanopore sequencing, a third-generation sequencing technology, reads DNA sequences and detects epigenetic modifications by measuring electrical signals that vary with the DNA bases and their modifications. This thesis explores the integration of machine learning models into nanopore sequencing to address challenges in DNA sequencing, modification detection, and rapid tumor diagnostics.
Basecalling, the translation of electrical signals into DNA sequences, is currently dominated by neural network models owing to advances in sequence-to-sequence translation. In Chapter 2, a standardized benchmark was developed to evaluate existing basecalling models on consistent datasets and metrics. This benchmark identifies the factors that contribute to model performance and to characteristic error patterns, and provides a robust framework for comparing future basecallers without retraining.
Chapter 3 introduces a deep learning model, esox, designed to detect 8-oxo-7,8-dihydro-2’-deoxyguanosine (8-oxo-dG), an oxidative DNA lesion implicated in mutagenesis and epigenetic regulation. Because training data for 8-oxo-dG are scarce, synthetic oligonucleotides were designed to generate ground-truth data. Using esox, genome-wide patterns of 8-oxo-dG were analyzed, characterizing its distribution, its relationship with C>A mutations, and its interactions with methylation. This approach demonstrates the potential for expanding the range of base modifications detectable with nanopore sequencing.
In Chapter 4, a machine learning model, Sturgeon, was developed to classify central nervous system (CNS) tumors in real time during surgery. Current tumor diagnostics rely on microarray-based methylation profiling, which is too slow for intraoperative decision-making. Nanopore sequencing, by contrast, produces data that can be analyzed in real time. By simulating nanopore sequencing data from microarray profiles, Sturgeon enables tumor classification in under 90 minutes, providing actionable insights during surgery.
Finally, Chapter 5 explores challenges and future directions in applying machine learning to bioinformatics and nanopore sequencing. These include the scarcity of ground-truth data, the rarity of certain DNA modifications, and the need to integrate diverse datasets generated by older and newer technologies. The thesis proposes that neural networks are well suited to creating platform-agnostic models that bridge these gaps, potentially advancing the field of bioinformatics.
Original language | English |
---|---|
Awarding Institution | |
Supervisors/Advisors | |
Award date | 13 Jan 2025 |
Place of Publication | Utrecht |
Publisher | |
Print ISBNs | 978-90-393-7765-9 |
DOIs | |
Publication status | Published - 13 Jan 2025 |
Keywords
- nanopore sequencing
- machine learning
- deep learning
- tumor diagnostics
- genomics
- epigenomics