VADA: A Data-Driven Simulator for Nanopore Sequencing

Jonas Niederle, Simon Koop*, Marc Pagès-Gallego, Vlado Menkovski

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Nanopore sequencing offers the ability for real-time analysis of long DNA sequences at a low cost, enabling new applications such as early detection of cancer. Due to the complex nature of nanopore measurements and the high cost of obtaining ground truth datasets, there is a need for nanopore simulators. Existing simulators rely on handcrafted rules and parameters and do not learn an internal representation that would allow for analyzing underlying biological factors of interest. Instead, we propose VADA, a purely data-driven method for simulating nanopores based on an autoregressive latent variable model. We embed subsequences of DNA and introduce a conditional prior to address the challenge of a collapsing conditioning. We experiment with an auxiliary regressor on the latent variable to encourage our model to learn an informative latent representation. We empirically demonstrate that our model achieves competitive simulation performance on experimental nanopore data. Moreover, we show our model learns an informative latent representation that is predictive of the DNA labels. We hypothesize that other biological factors of interest, beyond the DNA labels, can potentially be extracted from such a learned latent representation.

Original language: English
Title of host publication: Discovery Science - 27th International Conference, DS 2024, Proceedings
Editors: Dino Pedreschi, Anna Monreale, Riccardo Guidotti, Roberto Pellungrini, Francesca Naretto
Publisher: Springer
Pages: 198-210
Number of pages: 13
ISBN (Electronic): 978-3-031-78977-9
ISBN (Print): 978-3-031-78976-2
Publication status: Published - 2025
Event: 27th International Conference on Discovery Science, DS 2024 - Pisa, Italy
Duration: 14 Oct 2024 to 16 Oct 2024

Publication series

Name: Lecture Notes in Computer Science
Volume: 15243
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 27th International Conference on Discovery Science, DS 2024
Country/Territory: Italy
City: Pisa
Period: 14/10/24 to 16/10/24

Keywords

  • autoregressive models
  • computer simulation
  • generative AI
  • latent variable models
  • nanopore sequencing
