Ragnar Groot Koerkamp, MSc

Benchmarking is harder than you think, even taking into account this rule.

PhD Student

ETH Zürich
Department of Computer Science
Biomedical Informatics Group
Universitätsstrasse 6
8092 Zürich

I work on exact pairwise alignment algorithms. In particular I am working on an algorithm using A*. See my blog research.curiouscoding.nl for thoughts and ideas regarding this and other algorithms.

After completing my masters in mathematics and computer science in Oxford I worked at Google for 2.5 years. I went back to academia to get back to research and have the freedom to solve problems I think are interesting. My focus is on mathematics and algorithms.

Abstract Sequence-to-graph alignment is crucial for applications such as variant genotyping, read error correction, and genome assembly. We propose a novel seeding approach that relies on long inexact matches rather than short exact matches, and demonstrate that it yields a better time-accuracy trade-off in settings with up to a 25% mutation rate. We use sketches of a subset of graph nodes, which are more robust to indels, and store them in a k-nearest neighbor index to avoid the curse of dimensionality. Our approach contrasts with existing methods and highlights the important role that sketching into vector space can play in bioinformatics applications. We show that our method scales to graphs with 1 billion nodes and has quasi-logarithmic query time for queries with an edit distance of 25%. For such queries, longer sketch-based seeds yield a 4× increase in recall compared to exact seeds. Our approach can be incorporated into other aligners, providing a novel direction for sequence-to-graph alignment.

Authors Amir Joudaki, Alexandru Meterez, Harun Mustafa, Ragnar Groot Koerkamp, André Kahles1 and Gunnar Rätsch

Submitted Genome Research, RECOMB 2023

Link DOI

Abstract MOTIVATION Sequence alignment has been a core problem in computational biology for the last half-century. It is an open problem whether exact pairwise alignment is possible in linear time for related sequences (Medvedev, 2022b). METHODS We solve exact global pairwise alignment with respect to edit distance by using the A* shortest path algorithm on the edit graph. In order to efficiently align long sequences with high error rate, we extend the seed heuristic for A* (Ivanov et al., 2022) with match chaining, inexact matches, and the novel match pruning optimization. We prove the correctness of our algorithm and provide an efficient implementation in A*PA. RESULTS We evaluate A*PA on synthetic data (random sequences of length n with uniform mutations with error rate e) and on real long ONT reads of human data. On the synthetic data with e=5% and n≤10^7 bp, A*PA exhibits a near-linear empirical runtime scaling of n1.08 and achieves >250× speedup compared to the leading exact aligners EDLIB and BIWFA. Even for a high error rate of e=15%, the empirical scaling is n1.28 for n≤10^7 bp. On two real datasets, A*PA is the fastest aligner for 58% of the alignments when the reads contain only sequencing errors, and for 17% of the alignments when the reads also include biological variation. Availability github.com/RagnarGrootKoerkamp/astar-pairwise-aligner

Authors Ragnar Groot Koerkamp, Pesho Ivanov

Submitted bioRxiv

Link DOI