
Unlocking patterns in limited omics data: Machine learning-based diagnostics from small and sparse cancer omics datasets

  • Alexandra Danyi

Research output: Thesis › Doctoral thesis 1 (Research UU / Graduation UU)


Abstract

Accurate clinical diagnostic tests are essential for effective cancer treatment. Machine learning (ML) models trained on omics data, such as genomic or transcriptomic data, hold promise for enhancing diagnostic practices. However, training reliable ML models is often challenged by limitations in available datasets.

Solid tissue biopsy remains the gold standard for tumor sampling, but liquid biopsy is increasingly used because it is non-invasive, safer, and easier to perform. Liquid biopsy data, however, present their own challenges. Mutation profiles from liquid biopsies are sparse, which can hinder model training: the underlying patterns are estimated less reliably and can differ from those in solid tissue biopsies. Domain shift is another challenge: datasets from different sources follow different distributions because of differences in experimental protocols or patient populations, which can degrade model performance on new, unseen data. Additionally, clinical cohorts with limited sample sizes and high feature dimensionality (e.g., gene expression data) are prone to the “curse of dimensionality” and overfitting. Class imbalance, where one or more majority classes dominate the training set, can also reduce model performance on the minority classes.

This thesis focuses on ML applications for cancer omics datasets affected by sparsity, domain shift, limited sample sizes, and class imbalance.

In Chapter 2, a deep learning model originally trained on solid tissue biopsy data is adapted to predict the “cell-of-origin” of cancer from liquid biopsy data. Synthetic datasets model the sparse mutation profiles of liquid biopsies, and data augmentation is combined with diverse feature types to improve model performance. The adapted model achieves classification accuracy on synthetic sparse data comparable to its performance on solid tissue data, demonstrating the potential of liquid biopsies for advanced cancer diagnostics.
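
One simple way to picture a synthetic sparse mutation profile is binomial thinning: each mutation in a solid tissue profile is kept independently with a small probability. The sketch below is an illustration only, not the procedure used in the thesis; the function name `thin_mutation_counts`, the `keep_fraction` parameter, and the feature names are all hypothetical.

```python
import random


def thin_mutation_counts(counts, keep_fraction, seed=0):
    """Binomially thin per-feature mutation counts.

    Each individual mutation is kept independently with probability
    `keep_fraction`, which mimics (under this simplifying assumption)
    the lower number of detected mutations in a liquid biopsy.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in [0, 1]")
    rng = random.Random(seed)  # local generator keeps results reproducible
    return {
        feature: sum(1 for _ in range(n) if rng.random() < keep_fraction)
        for feature, n in counts.items()
    }


# Hypothetical solid tissue profile: mutation counts per genomic bin.
solid_profile = {"bin_1": 40, "bin_2": 25, "bin_3": 3}
sparse_profile = thin_mutation_counts(solid_profile, keep_fraction=0.1, seed=1)
```

Applying several different seeds to the same solid tissue profile yields several sparse variants of one sample, which is one possible form of data augmentation.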

In Chapter 3, the same classification model is evaluated on a mixed primary-metastatic solid biopsy dataset. Synthetic oversampling techniques are tested to mitigate class imbalance, while subspace-centric and data-centric domain adaptation methods address domain shift. The findings offer strategies to enhance cancer type classification models under these conditions.
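
Synthetic oversampling can be sketched as interpolation between existing minority-class samples. The example below is a simplified variant written for illustration: it interpolates between random pairs of minority samples, whereas SMOTE proper interpolates toward nearest neighbors. The abstract does not state which techniques were tested, and the name `oversample_minority` and the toy data are hypothetical.

```python
import random


def oversample_minority(samples, n_new, seed=0):
    """Create `n_new` synthetic samples for a minority class.

    Each synthetic sample lies on the line segment between two randomly
    chosen distinct minority samples (a simplification of SMOTE, which
    restricts the second sample to a nearest neighbor of the first).
    """
    if len(samples) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)  # two distinct samples
        t = rng.random()               # interpolation weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Hypothetical minority class with three samples and two features.
minority = [[0.0, 1.0], [2.0, 3.0], [4.0, 1.0]]
new_samples = oversample_minority(minority, n_new=5, seed=42)
```

Because every synthetic point is a convex combination of two real samples, it stays inside the range already covered by the minority class rather than extrapolating beyond it.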

In Chapter 4, a novel ML approach predicts response to treatment with the androgen receptor signaling inhibitors (ARSI) abiraterone and enzalutamide in patients with metastatic castration-resistant prostate cancer (mCRPC). Using whole genome and transcriptome sequencing data from a small clinical cohort (n=155 and n=113, respectively), potential predictive features are identified. Models combining prior treatment information with genomic markers or transcriptomic data projected in a lower dimensional space show promising results. With further optimization and validation, this method could guide treatment decisions and identify patients most likely to benefit from new therapies.
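
Projecting high-dimensional expression data into a lower dimensional space can be illustrated with principal component analysis (PCA). The abstract does not name the projection method, so PCA is an assumption here, and the function name `first_principal_component` and the toy data are hypothetical. The sketch finds the first component by power iteration, which avoids building the full feature-by-feature covariance matrix when features far outnumber samples.

```python
import random


def first_principal_component(rows, n_iter=200, seed=0):
    """Return (direction, scores) for the first principal component.

    `rows` is a list of samples, each a list of feature values.
    `direction` is a unit vector; `scores` are the 1-D projections
    of the mean-centered samples onto that direction.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random starting vector
    for _ in range(n_iter):
        # Apply X^T X to v in two steps: first X v, then X^T (X v).
        proj = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            raise ValueError("data has no variance along the current direction")
        v = [x / norm for x in w]

    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical toy data: four samples, two perfectly correlated features.
expression = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
direction, scores = first_principal_component(expression)
```

The resulting scores could then serve as a compact input feature for a downstream classifier in place of the original columns; the sign of the direction is arbitrary, as in any PCA.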

In Chapter 5, the broader implications of this thesis are discussed, including limitations and future directions, emphasizing the need for robust, generalizable models to improve patient outcomes.

Original language: English
Awarding Institution
  • University Medical Center (UMC) Utrecht
Supervisors/Advisors
  • de Ridder, Jeroen, Supervisor
  • Jager, Myrthe, Co-supervisor
Award date: 2 Sept 2025
Place of Publication: Utrecht
Publisher
Print ISBNs: 978-94-6473-882-7
DOIs
Publication status: Published - 2 Sept 2025

Keywords

  • cancer diagnostics
  • cancer of unknown primary
  • metastatic castration-resistant prostate cancer
  • mCRPC
  • machine learning
  • bioinformatics
