Ragnar Groot Koerkamp, MSc
Benchmarking is harder than you think, even taking into account this rule.
PhD Student
- ragnar.grootkoerkamp@ inf.ethz.ch
- Address
-
ETH Zürich
Department of Computer Science
Biomedical Informatics Group
Universitätsstrasse 6
8092 Zürich - Room
- CAB F39
I work on exact pairwise alignment, minimizers, and high throughput bioinformatics algorithms.
See my blog at https://curiouscoding.nl.
After completing my masters in mathematics and computer science in Oxford I worked at Google for some years. I went back to academia to get back to research and have the freedom to solve problems I think are interesting. My focus is on mathematics, exact algorithms, and speeding up existing implementations by efficiently using CPU capabilities.
Latest Publications
Abstract MOTIVATION Pairwise alignment is at the core of computational biology. Most commonly used exact methods are either based on O(ns) band doubling or O(n+s²) diagonal transition, where n is the sequence length and s the number of errors. However, as the length of sequences has grown, these exact methods are often replaced by approximate methods based on e.g. seed-and-extend and heuristics to bound the computed region. We would like to develop an exact method that matches the performance of these approximate methods. Recently, Astarix introduced the A* shortest path algorithm with the seed heuristic for exact sequence-to-graph alignment. A*PA adapted and improved this for pairwise sequence alignment and achieves near-linear runtime when divergence (error rate) is low, at the cost of being very slow when divergence is high. METHODS We introduce A*PA2, an exact global pairwise aligner with respect to edit distance. The goal of A*PA2 is to unify the near-linear runtime of A*PA on similar sequences with the efficiency of dynamic programming (DP) based methods. Like Edlib, A*PA2 uses Ukkonen’s band doubling in combination with Myers' bitpacking. A*PA2 1) uses large block sizes inspired by Block Aligner, 2) extends this with SIMD (single instruction, multiple data), 3) introduces a new profile for efficient computations, 4) introduces a new optimistic technique for traceback based on diagonal transition, 5) avoids recomputation of states where possible, and 6) applies the heuristics developed in A*PA and improves them using pre-pruning. RESULTS With the first 4 engineering optimizations, A*PA2-simple has complexity O(ns) and is 6× to 8× faster than Edlib for sequences ≥ 10 kbp. A*PA2-full also includes the heuristic and is often near-linear in practice for sequences with small divergence. The average runtime of A*PA2 is 19× faster than the exact aligners BiWFA and Edlib on >500 kbp long ONT (Oxford Nanopore Technologies) reads of a human genome having 6% divergence on average. On shorter ONT reads of 11% average divergence the speedup is 5.6× (avg. length 11 kbp) and 0.81× (avg. length 800 bp). On all tested datasets, A*PA2 is competitive with or faster than approximate methods.
Authors Ragnar Groot Koerkamp
Submitted WABI24
Abstract MOTIVATION Given a string S, a minimizer scheme is an algorithm defined by a triple (k,w,O) that samples a subset of k-mers (k-long substrings) from a string S. Specifically, it samples the smallest k-mer according to the order O from each window of w consecutive k-mers in S. Because consecutive windows can sample the same k-mer, the set of the sampled k-mers is typically much smaller than S. More generally, we consider substring sampling algorithms that respect a window guarantee: at least one k-mer must be sampled from every window of w consecutive k-mers. As a sampled k-mer is uniquely identified by its absolute position in S, we can define the density of a sampling algorithm as the fraction of distinct sampled positions. Good methods have low density which, by respecting the window guarantee, is lower bounded by 1/w. It is however difficult to design a sequence-agnostic algorithm with provably optimal density. In practice, the order O is usually implemented using a pseudo-random hash function to obtain the so-called random minimizer. This scheme is simple to implement, very fast to compute even in streaming fashion, and easy to analyze. However, its density is almost a factor of 2 away from the lower bound for large windows. METHODS In this work we introduce mod-sampling, a two-step sampling algorithm to obtain new minimizer schemes. Given a (small) parameter t, the mod-sampling algorithm finds the position p of the smallest t-mer in a window. It then samples the k-mer at position pod w. The lr-minimizer uses t = k-w and the mod-minimizer uses t = k (mod w). RESULTS These new schemes have provably lower density than random minimizers and other schemes when k is large compared to w, while being as fast to compute. Importantly, the mod-minimizer achieves optimal density when k goes to infinity. Although the mod-minimizer is not the first method to achieve optimal density for large k, its proof of optimality is simpler than previous work. We provide pseudocode for a number of other methods and compare to them. In practice, the mod-minimizer has considerably lower density than the random minimizer and other state-of-the-art methods, like closed syncmers and miniception, when k > w. We plugged the mod-minimizer into SSHash, a k-mer dictionary based on minimizers. For default parameters (w,k) = (11,21), space usage decreases by 15% when indexing the whole human genome (GRCh38), while maintaining its fast query time.
Authors Ragnar Groot Koerkamp, Giulio Ermanno Pibiri
Submitted WABI24
Abstract MOTIVATION Sequence alignment has been at the core of computational biology for half a century. Still, it is an open problem to design a practical algorithm for exact alignment of a pair of related sequences in linear-like time. RESULTS We solve exact global pairwise alignment with respect to edit distance by using the A* shortest path algorithm. In order to efficiently align long sequences with high divergence, we extend the recently proposed seed heuristic with match chaining, gap costs, and inexact matches. We additionally integrate the novel match pruning technique and diagonal transition to improve the A* search. We prove the correctness of our algorithm, implement it in the A*PA aligner, and justify our extensions intuitively and empirically. RESULTS On random sequences of divergence d=4% and length n, the empirical runtime of A*PA scales near-linearly with length (best~fit n^1.06, n<10^7 bp). A similar scaling remains up to d=12% (best~fit ~n^1.24, n<10^7 bp). For n=10^7bp and d=4%, A*PA reaches >500x speedup compared to the leading exact aligners Edlib and WFA. The performance of A*PA is highly influenced by long gaps. On long (n>500 kbp) ONT reads of a human sample it efficiently aligns sequences with d<10%, leading to 3x median speedup compared to Edlib and WFA. When the sequences come from different human samples, A*PA performs 1.7x faster than Edlib and WFA. Availability github.com/RagnarGrootKoerkamp/astar-pairwise-aligner
Authors Ragnar Groot Koerkamp, Pesho Ivanov
Submitted Bioinformatics
Abstract Sequence-to-graph alignment is crucial for applications such as variant genotyping, read error correction, and genome assembly. We propose a novel seeding approach that relies on long inexact matches rather than short exact matches, and demonstrate that it yields a better time-accuracy trade-off in settings with up to a 25% mutation rate. We use sketches of a subset of graph nodes, which are more robust to indels, and store them in a k-nearest neighbor index to avoid the curse of dimensionality. Our approach contrasts with existing methods and highlights the important role that sketching into vector space can play in bioinformatics applications. We show that our method scales to graphs with 1 billion nodes and has quasi-logarithmic query time for queries with an edit distance of 25%. For such queries, longer sketch-based seeds yield a 4× increase in recall compared to exact seeds. Our approach can be incorporated into other aligners, providing a novel direction for sequence-to-graph alignment.
Authors Amir Joudaki, Alexandru Meterez, Harun Mustafa, Ragnar Groot Koerkamp, André Kahles1 and Gunnar Rätsch
Submitted Genome Research, RECOMB 2023