Ragnar Groot Koerkamp, MSc
Benchmarking is harder than you think, even taking into account this rule.
PhD Student
 ragnar.grootkoerkamp@ inf.ethz.ch
 Address

ETH Zürich
Department of Computer Science
Biomedical Informatics Group
Universitätsstrasse 6
8092 Zürich  Room
 CAB F39
I work on exact pairwise alignment, minimizers, and high throughput bioinformatics algorithms.
See my blog at https://curiouscoding.nl.
After completing my masters in mathematics and computer science in Oxford I worked at Google for some years. I went back to academia to get back to research and have the freedom to solve problems I think are interesting. My focus is on mathematics, exact algorithms, and speeding up existing implementations by efficiently using CPU capabilities.
Latest Publications
Abstract MOTIVATION Pairwise alignment is at the core of computational biology. Most commonly used exact methods are either based on O(ns) band doubling or O(n+s²) diagonal transition, where n is the sequence length and s the number of errors. However, as the length of sequences has grown, these exact methods are often replaced by approximate methods based on e.g. seedandextend and heuristics to bound the computed region. We would like to develop an exact method that matches the performance of these approximate methods. Recently, Astarix introduced the A* shortest path algorithm with the seed heuristic for exact sequencetograph alignment. A*PA adapted and improved this for pairwise sequence alignment and achieves nearlinear runtime when divergence (error rate) is low, at the cost of being very slow when divergence is high. METHODS We introduce A*PA2, an exact global pairwise aligner with respect to edit distance. The goal of A*PA2 is to unify the nearlinear runtime of A*PA on similar sequences with the efficiency of dynamic programming (DP) based methods. Like Edlib, A*PA2 uses Ukkonen’s band doubling in combination with Myers' bitpacking. A*PA2 1) uses large block sizes inspired by Block Aligner, 2) extends this with SIMD (single instruction, multiple data), 3) introduces a new profile for efficient computations, 4) introduces a new optimistic technique for traceback based on diagonal transition, 5) avoids recomputation of states where possible, and 6) applies the heuristics developed in A*PA and improves them using prepruning. RESULTS With the first 4 engineering optimizations, A*PA2simple has complexity O(ns) and is 6× to 8× faster than Edlib for sequences ≥ 10 kbp. A*PA2full also includes the heuristic and is often nearlinear in practice for sequences with small divergence. The average runtime of A*PA2 is 19× faster than the exact aligners BiWFA and Edlib on >500 kbp long ONT (Oxford Nanopore Technologies) reads of a human genome having 6% divergence on average. On shorter ONT reads of 11% average divergence the speedup is 5.6× (avg. length 11 kbp) and 0.81× (avg. length 800 bp). On all tested datasets, A*PA2 is competitive with or faster than approximate methods.
Authors Ragnar Groot Koerkamp
Submitted WABI24
Abstract MOTIVATION Given a string S, a minimizer scheme is an algorithm defined by a triple (k,w,O) that samples a subset of kmers (klong substrings) from a string S. Specifically, it samples the smallest kmer according to the order O from each window of w consecutive kmers in S. Because consecutive windows can sample the same kmer, the set of the sampled kmers is typically much smaller than S. More generally, we consider substring sampling algorithms that respect a window guarantee: at least one kmer must be sampled from every window of w consecutive kmers. As a sampled kmer is uniquely identified by its absolute position in S, we can define the density of a sampling algorithm as the fraction of distinct sampled positions. Good methods have low density which, by respecting the window guarantee, is lower bounded by 1/w. It is however difficult to design a sequenceagnostic algorithm with provably optimal density. In practice, the order O is usually implemented using a pseudorandom hash function to obtain the socalled random minimizer. This scheme is simple to implement, very fast to compute even in streaming fashion, and easy to analyze. However, its density is almost a factor of 2 away from the lower bound for large windows. METHODS In this work we introduce modsampling, a twostep sampling algorithm to obtain new minimizer schemes. Given a (small) parameter t, the modsampling algorithm finds the position p of the smallest tmer in a window. It then samples the kmer at position pod w. The lrminimizer uses t = kw and the modminimizer uses t = k (mod w). RESULTS These new schemes have provably lower density than random minimizers and other schemes when k is large compared to w, while being as fast to compute. Importantly, the modminimizer achieves optimal density when k goes to infinity. Although the modminimizer is not the first method to achieve optimal density for large k, its proof of optimality is simpler than previous work. We provide pseudocode for a number of other methods and compare to them. In practice, the modminimizer has considerably lower density than the random minimizer and other stateoftheart methods, like closed syncmers and miniception, when k > w. We plugged the modminimizer into SSHash, a kmer dictionary based on minimizers. For default parameters (w,k) = (11,21), space usage decreases by 15% when indexing the whole human genome (GRCh38), while maintaining its fast query time.
Authors Ragnar Groot Koerkamp, Giulio Ermanno Pibiri
Submitted WABI24
Abstract MOTIVATION Sequence alignment has been at the core of computational biology for half a century. Still, it is an open problem to design a practical algorithm for exact alignment of a pair of related sequences in linearlike time. RESULTS We solve exact global pairwise alignment with respect to edit distance by using the A* shortest path algorithm. In order to efficiently align long sequences with high divergence, we extend the recently proposed seed heuristic with match chaining, gap costs, and inexact matches. We additionally integrate the novel match pruning technique and diagonal transition to improve the A* search. We prove the correctness of our algorithm, implement it in the A*PA aligner, and justify our extensions intuitively and empirically. RESULTS On random sequences of divergence d=4% and length n, the empirical runtime of A*PA scales nearlinearly with length (best~fit n^1.06, n<10^7 bp). A similar scaling remains up to d=12% (best~fit ~n^1.24, n<10^7 bp). For n=10^7bp and d=4%, A*PA reaches >500x speedup compared to the leading exact aligners Edlib and WFA. The performance of A*PA is highly influenced by long gaps. On long (n>500 kbp) ONT reads of a human sample it efficiently aligns sequences with d<10%, leading to 3x median speedup compared to Edlib and WFA. When the sequences come from different human samples, A*PA performs 1.7x faster than Edlib and WFA. Availability github.com/RagnarGrootKoerkamp/astarpairwisealigner
Authors Ragnar Groot Koerkamp, Pesho Ivanov
Submitted Bioinformatics
Abstract Sequencetograph alignment is crucial for applications such as variant genotyping, read error correction, and genome assembly. We propose a novel seeding approach that relies on long inexact matches rather than short exact matches, and demonstrate that it yields a better timeaccuracy tradeoff in settings with up to a 25% mutation rate. We use sketches of a subset of graph nodes, which are more robust to indels, and store them in a knearest neighbor index to avoid the curse of dimensionality. Our approach contrasts with existing methods and highlights the important role that sketching into vector space can play in bioinformatics applications. We show that our method scales to graphs with 1 billion nodes and has quasilogarithmic query time for queries with an edit distance of 25%. For such queries, longer sketchbased seeds yield a 4× increase in recall compared to exact seeds. Our approach can be incorporated into other aligners, providing a novel direction for sequencetograph alignment.
Authors Amir Joudaki, Alexandru Meterez, Harun Mustafa, Ragnar Groot Koerkamp, André Kahles1 and Gunnar Rätsch
Submitted Genome Research, RECOMB 2023