Gunnar Rätsch, Prof. Dr.

Interdisciplinary research is about forging collaborations across disciplinary and geographic borders.

Head

E-Mail
raetsch@get-your-addresses-elsewhere.inf.ethz.ch
Phone
+41 44 632 2036
Address
ETH Zürich
Department of Computer Science
Biomedical Informatics Group Universitätsstrasse 6
CAB F53.2
8092 Zürich
Room
CAB F53.2
twitter
@gxr

Data scientist Gunnar Rätsch develops and applies advanced data analysis and modeling techniques to data from deep molecular profiling, medical and health records, as well as images.

He earned his Ph.D. at the German National Laboratory for Information Technology under supervision of Klaus-Robert Müller and was a postdoc with Bob Williamson and Bernhard Schölkopf. He received the Max Planck Young and Independent Investigator award and was leading the group on Machine Learning in Genome Biology at the Friedrich Miescher Laboratory in Tübingen (2005-2011). In 2012, he joined Memorial Sloan Kettering Cancer Center as Associate Faculty. In May 2016, he and his lab moved to Zürich to join the Computer Science Department of ETH Zürich.


The Rätsch laboratory focuses on bridging medicine and biology with computer science. The group’s research interests are relatively broad as it covers an area from algorithmic computer science to biomedical application fields. On the one hand, this includes work on algorithms that can learn or extract insights from data, on the other hand it involves developing tools that we and others employ for the analysis of large genomic or medical data sets, often in collaboration with biologists and physicians. These tools aim to solve real-world biomedical problems. In short, the group advances the state-of-the-art in data science algorithms, turns them into commonly usable tools for specific applications, and then collaborate with biologists and physicians on life science problems. Along the way, we learn more and can go back to improve the algorithms.

Abstract Applying reinforcement learning (RL) to real-world problems is often made challenging by the inability to interact with the environment and the difficulty of designing reward functions. Offline RL addresses the first challenge by considering access to an offline dataset of environment interactions labeled by the reward function. In contrast, Preference-based RL does not assume access to the reward function and learns it from preferences, but typically requires an online interaction with the environment. We bridge the gap between these frameworks by exploring efficient methods for acquiring preference feedback in a fully offline setup. We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm, which leverages a learned environment model to elicit preference feedback on simulated rollouts. Drawing on insights from both the offline RL and the preference-based RL literature, our algorithm employs a pessimistic approach for out-of-distribution data, and an optimistic approach for acquiring informative preferences about the optimal policy. We provide theoretical guarantees regarding the sample complexity of our approach, dependent on how well the offline data covers the optimal policy. Finally, we demonstrate the empirical performance of Sim-OPRL in different environments.

Authors Alizée Pace, Bernhard Schölkopf, Gunnar Rätsch, Giorgia Ramponi

Submitted ICML 2024 MFHAIA

Link

Abstract Knowing which features of a multivariate time series to measure and when is a key task in medicine, wearables, and robotics. Better acquisition policies can reduce costs while maintaining or even improving the performance of downstream predictors. Inspired by the maximization of conditional mutual information, we propose an approach to train acquirers end-to-end using only the downstream loss. We show that our method outperforms random acquisition policy, matches a model with an unrestrained budget, but does not yet overtake a static acquisition strategy. We highlight the assumptions and outline avenues for future work.

Authors Fedor Sergeev, Paola Malsot, Gunnar Rätsch, Vincent Fortuin

Submitted SPIGM ICML Workshop

Link DOI

Abstract Motivation Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. Results We introduce a new scoring model, ‘multi-label alignment’ (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, ‘Label Change’ incorporates more informative global sample similarity into local scores. To improve connectivity, ‘Node Length Change’ dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCAs alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%–66.8% and covering 45.5%–47.4% (median) more long-read query characters than state-of-the-art aligners. MLAs runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.

Authors Harun Mustafa, Mikhail Karasikov, Nika Mansouri Ghiasi, Gunnar Rätsch, André Kahles

Submitted Bioinformatics, ISMB 2024

Link DOI

Abstract Acute kidney injury (AKI) is a syndrome that affects a large fraction of all critically ill patients, and early diagnosis to receive adequate treatment is as imperative as it is challenging to make early. Consequently, machine learning approaches have been developed to predict AKI ahead of time. However, the prevalence of AKI is often underestimated in state-of-the-art approaches, as they rely on an AKI event annotation solely based on creatinine, ignoring urine output. We construct and evaluate early warning systems for AKI in a multi-disciplinary ICU setting, using the complete KDIGO definition of AKI. We propose several variants of gradient-boosted decision tree (GBDT)-based models, including a novel time-stacking based approach. A state-of-the-art LSTM-based model previously proposed for AKI prediction is used as a comparison, which was not specifically evaluated in ICU settings yet. We find that optimal performance is achieved by using GBDT with the time-based stacking technique (AUPRC = 65.7%, compared with the LSTM-based model’s AUPRC = 62.6%), which is motivated by the high relevance of time since ICU admission for this task. Both models show mildly reduced performance in the limited training data setting, perform fairly across different subcohorts, and exhibit no issues in gender transfer. Following the official KDIGO definition substantially increases the number of annotated AKI events. In our study GBDTs outperform LSTM models for AKI prediction. Generally, we find that both model types are robust in a variety of challenging settings arising for ICU data.

Authors Xinrui Lyu, Bowen Fan, Matthias Hüser, Philip Hartout, Thomas Gumbsch, Martin Faltys, Tobias M. Merz, Gunnar Rätsch, and Karsten Borgwardt

Submitted Bioinformatics, ISMB 2024

Link DOI

Abstract The amount of biological sequencing data available in public repositories is growing exponentially, forming an invaluable biomedical research resource. Yet, making it full-text searchable and easily accessible to researchers in life and data science is an unsolved problem. In this work, we take advantage of recently developed, very efficient data structures and algorithms for representing sequence sets. We make Petabases of DNA sequences across all clades of life, including viruses, bacteria, fungi, plants, animals, and humans, fully searchable. Our indexes are freely available to the research community. This highly compressed representation of the input sequences (up to 5800×) fits on a single consumer hard drive (≈100 USD), making this valuable resource cost-effective to use and easily transportable. We present the underlying methodological framework, called MetaGraph, that allows us to scalably index very large sets of DNA or protein sequences using annotated De Bruijn graphs. We demonstrate the feasibility of indexing the full extent of existing sequencing data and present new approaches for efficient and cost-effective full-text search at an on-demand cost of $0.10 per queried Mbp. We explore several practical use cases to mine existing archives for interesting associations and demonstrate the utility of our indexes for integrative analyses.

Authors Mikhail Karasikov, Harun Mustafa, Daniel Danciu, Marc Zimmermann, Christopher Barber, Gunnar Rätsch, André Kahles

Submitted bioRxiv

Link DOI

Abstract This study advances Early Event Prediction (EEP) in healthcare through Dynamic Survival Analysis (DSA), offering a novel approach by integrating risk localization into alarm policies to enhance clinical event metrics. By adapting and evaluating DSA models against traditional EEP benchmarks, our research demonstrates their ability to match EEP models on a time-step level and significantly improve event-level metrics through a new alarm prioritization scheme (up to 11% AuPRC difference). This approach represents a significant step forward in predictive healthcare, providing a more nuanced and actionable framework for early event prediction and management.

Authors Hugo Yèche, Manuel Burger, Dinara Veshchezerova, Gunnar Rätsch

Submitted CHIL 2024

Link

Abstract Electronic Health Record (EHR) datasets from Intensive Care Units (ICU) contain a diverse set of data modalities. While prior works have successfully leveraged multiple modalities in supervised settings, we apply advanced self-supervised multi-modal contrastive learning techniques to ICU data, specifically focusing on clinical notes and time-series for clinically relevant online prediction tasks. We introduce a loss function Multi-Modal Neighborhood Contrastive Loss (MM-NCL), a soft neighborhood function, and showcase the excellent linear probe and zero-shot performance of our approach.

Authors Fabian Baldenweg, Manuel Burger, Gunnar Rätsch, Rita Kuznetsova

Submitted TS4H ICLR Workshop

Link DOI

Authors Sonali Andani, Boqi Chen, Joanna Ficek-Pascual, Simon Heinke, Ruben Casanova, Bettina Sobottka, Bernd Bodenmiller, Tumor Profiler Consortium, Viktor H Kölzer, Gunnar Rätsch

Submitted medRxiv

Abstract The rapid expansion of genomic sequence data calls for new methods to achieve robust sequence representations. Existing techniques often neglect intricate structural details, emphasizing mainly contextual information. To address this, we developed k-mer embeddings that merge contextual and structural string information by enhancing De Bruijn graphs with structural similarity connections. Subsequently, we crafted a self-supervised method based on Contrastive Learning that employs a heterogeneous Graph Convolutional Network encoder and constructs positive pairs based on node similarities. Our embeddings consistently outperform prior techniques for Edit Distance Approximation and Closest String Retrieval tasks.

Authors Kacper Kapusniak, Manuel Burger, Gunnar Rätsch, Amir Joudaki

Submitted NeurIPS 2023 Workshop: Frontiers in Graph Learning

Link DOI

Abstract Recent advances in deep learning architectures for sequence modeling have not fully transferred to tasks handling time-series from electronic health records. In particular, in problems related to the Intensive Care Unit (ICU), the state-of-the-art remains to tackle sequence classification in a tabular manner with tree-based methods. Recent findings in deep learning for tabular data are now surpassing these classical methods by better handling the severe heterogeneity of data input features. Given the similar level of feature heterogeneity exhibited by ICU time-series and motivated by these findings, we explore these novel methods' impact on clinical sequence modeling tasks. By jointly using such advances in deep learning for tabular data, our primary objective is to underscore the importance of step-wise embeddings in time-series modeling, which remain unexplored in machine learning methods for clinical data. On a variety of clinically relevant tasks from two large-scale ICU datasets, MIMIC-III and HiRID, our work provides an exhaustive analysis of state-of-the-art methods for tabular time-series as time-step embedding models, showing overall performance improvement. In particular, we evidence the importance of feature grouping in clinical time-series, with significant performance gains when considering features within predefined semantic groups in the step-wise embedding module.

Authors Rita Kuznetsova, Alizée Pace, Manuel Burger, Hugo Yèche, Gunnar Rätsch

Submitted ML4H 2023 (PMLR)

Link DOI

Abstract Clinicians are increasingly looking towards machine learning to gain insights about patient evolutions. We propose a novel approach named Multi-Modal UMLS Graph Learning (MMUGL) for learning meaningful representations of medical concepts using graph neural networks over knowledge graphs based on the unified medical language system. These representations are aggregated to represent entire patient visits and then fed into a sequence model to perform predictions at the granularity of multiple hospital visits of a patient. We improve performance by incorporating prior medical knowledge and considering multiple modalities. We compare our method to existing architectures proposed to learn representations at different granularities on the MIMIC-III dataset and show that our approach outperforms these methods. The results demonstrate the significance of multi-modal medical concept representations based on prior medical knowledge.

Authors Manuel Burger, Gunnar Rätsch, Rita Kuznetsova

Submitted ML4H 2023 (PMLR)

Link DOI

Abstract Intensive Care Units (ICU) require comprehensive patient data integration for enhanced clinical outcome predictions, crucial for assessing patient conditions. Recent deep learning advances have utilized patient time series data, and fusion models have incorporated unstructured clinical reports, improving predictive performance. However, integrating established medical knowledge into these models has not yet been explored. The medical domain's data, rich in structural relationships, can be harnessed through knowledge graphs derived from clinical ontologies like the Unified Medical Language System (UMLS) for better predictions. Our proposed methodology integrates this knowledge with ICU data, improving clinical decision modeling. It combines graph representations with vital signs and clinical reports, enhancing performance, especially when data is missing. Additionally, our model includes an interpretability component to understand how knowledge graph nodes affect predictions.

Authors Samyak Jain, Manuel Burger, Gunnar Rätsch, Rita Kuznetsova

Submitted ML4H 2023 (Findings Track)

Link DOI

Abstract In research areas with scarce data, representation learning plays a significant role. This work aims to enhance representation learning for clinical time series by deriving universal embeddings for clinical features, such as heart rate and blood pressure. We use self-supervised training paradigms for language models to learn high-quality clinical feature embeddings, achieving a finer granularity than existing time-step and patient-level representation learning. We visualize the learnt embeddings via unsupervised dimension reduction techniques and observe a high degree of consistency with prior clinical knowledge. We also evaluate the model performance on the MIMIC-III benchmark and demonstrate the effectiveness of using clinical feature embeddings. We publish our code online for replication.

Authors Yurong Hu, Manuel Burger, Gunnar Rätsch, Rita Kuznetsova

Submitted NeurIPS 2023 Workshop: Self-Supervised Learning - Theory and Practice

Link DOI

Abstract Selecting hyperparameters in deep learning greatly impacts its effectiveness but requires manual effort and expertise. Recent works show that Bayesian model selection with Laplace approximations can allow to optimize such hyperparameters just like standard neural network parameters using gradients and on the training data. However, estimating a single hyperparameter gradient requires a pass through the entire dataset, limiting the scalability of such algorithms. In this work, we overcome this issue by introducing lower bounds to the linearized Laplace approximation of the marginal likelihood. In contrast to previous estimators, these bounds are amenable to stochastic-gradient-based optimization and allow to trade off estimation accuracy against computational complexity. We derive them using the function-space form of the linearized Laplace, which can be estimated using the neural tangent kernel. Experimentally, we show that the estimators can significantly accelerate gradient-based hyperparameter optimization.

Authors Alexander Immer, Tycho FA van der Ouderaa, Mark van der Wilk, Gunnar Rätsch, Bernhard Schölkopf

Submitted ICML 2023

Link

Abstract Deep neural networks are highly effective but suffer from a lack of interpretability due to their black-box nature. Neural additive models (NAMs) solve this by separating into additive sub-networks, revealing the interactions between features and predictions. In this paper, we approach the NAM from a Bayesian perspective in order to quantify the uncertainty in the recovered interactions. Linearized Laplace approximation enables inference of these interactions directly in function space and yields a tractable estimate of the marginal likelihood, which can be used to perform implicit feature selection through an empirical Bayes procedure. Empirically, we show that Laplace-approximated NAMs (LA-NAM) are both more robust to noise and easier to interpret than their non-Bayesian counterpart for tabular regression and classification tasks.

Authors Kouroche Bouchiat, Alexander Immer, Hugo Yèche, Gunnar Rätsch, Vincent Fortuin

Submitted AABI 2023

Link

Abstract The number of published metagenome assemblies is rapidly growing due to advances in sequencing technologies. However, sequencing errors, variable coverage, repetitive genomic regions, and other factors can produce misassemblies, which are challenging to detect for taxonomically novel genomic data. Assembly errors can affect all downstream analyses of the assemblies. Accuracy for the state of the art in reference-free misassembly prediction does not exceed an AUPRC of 0.57, and it is not clear how well these models generalize to real-world data. Here, we present the Residual neural network for Misassembled Contig identification (ResMiCo), a deep learning approach for reference-free identification of misassembled contigs. To develop ResMiCo, we first generated a training dataset of unprecedented size and complexity that can be used for further benchmarking and developments in the field. Through rigorous validation, we show that ResMiCo is substantially more accurate than the state of the art, and the model is robust to novel taxonomic diversity and varying assembly methods. ResMiCo estimated 4.7% misassembled contigs per metagenome across multiple real-world datasets. We demonstrate how ResMiCo can be used to optimize metagenome assembly hyperparameters to improve accuracy, instead of optimizing solely for contiguity. The accuracy, robustness, and ease-of-use of ResMiCo make the tool suitable for general quality control of metagenome assemblies and assembly methodology optimization.

Authors Olga Mineeva, Daniel Danciu, Bernhard Schölkopf, Ruth E. Ley, Gunnar Rätsch, Nicholas D. Youngblut

Submitted PLoS Computational Biology

Link DOI

Abstract Understanding deep learning model behavior is critical to accepting machine learning-based decision support systems in the medical community. Previous research has shown that jointly using clinical notes with electronic health record (EHR) data improved predictive performance for patient monitoring in the intensive care unit (ICU). In this work, we explore the underlying reasons for these improvements. While relying on a basic attention-based model to allow for interpretability, we first confirm that performance significantly improves over state-of-the-art EHR data models when combining EHR data and clinical notes. We then provide an analysis showing improvements arise almost exclusively from a subset of notes containing broader context on patient state rather than clinician notes. We believe such findings highlight deep learning models for EHR data to be more limited by partially-descriptive data than by modeling choice, motivating a more data-centric approach in the field.

Authors Severin Husmann, Hugo Yèche, Gunnar Rätsch, Rita Kuznetsova

Submitted Workshop on Learning from Time Series for Health, 36th Conference on Neural Information Processing Systems (NeurIPS 2022)

Link

Abstract Data augmentation is commonly applied to improve performance of deep learning by enforcing the knowledge that certain transformations on the input preserve the output. Currently, the used data augmentation is chosen by human effort and costly cross-validation, which makes it cumbersome to apply to new datasets. We develop a convenient gradient-based method for selecting the data augmentation without validation data and during training of a deep neural network. Our approach relies on phrasing data augmentation as an invariance in the prior distribution and learning it using Bayesian model selection, which has been shown to work in Gaussian processes, but not yet for deep neural networks. We propose a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective, which can be optimised without human supervision or validation data. We show that our method can successfully recover invariances present in the data, and that this improves generalisation and data efficiency on image datasets.

Authors Alexander Immer, Tycho FA van der Ouderaa, Gunnar Rätsch, Vincent Fortuin, Mark van der Wilk

Submitted NeurIPS 2022

Link

Abstract The amount of data stored in genomic sequence databases is growing exponentially, far exceeding traditional indexing strategies' processing capabilities. Many recent indexing methods organize sequence data into a sequence graph to succinctly represent large genomic data sets from reference genome and sequencing read set databases. These methods typically use De Bruijn graphs as the graph model or the underlying index model, with auxiliary graph annotation data structures to associate graph nodes with various metadata. Examples of metadata can include a node's source samples (called labels), genomic coordinates, expression levels, etc. An important property of these graphs is that the set of sequences spelled by graph walks is a superset of the set of input sequences. Thus, when aligning to graphs indexing samples derived from low-coverage sequencing sets, sequence information present in many target samples can compensate for missing information resulting from a lack of sequence coverage. Aligning a query against an entire sequence graph (as in traditional sequence-to-graph alignment) using state-of-the-art algorithms can be computationally intractable for graphs constructed from thousands of samples, potentially searching through many non-biological combinations of samples before converging on the best alignment. To address this problem, we propose a novel alignment strategy called multi-label alignment (MLA) and an algorithm implementing this strategy using annotated De Bruijn graphs within the MetaGraph framework, called MetaGraph-MLA. MLA extends current sequence alignment scoring models with additional label change operations for incorporating mixtures of samples into an alignment, penalizing mixtures that are dissimilar in their sequence content. To overcome disconnects in the graph that result from a lack of sequencing coverage, we further extend our graph index to utilize a variable-order De Bruijn graph and introduce node length change as an operation. In this model, traversal between nodes that share a suffix of length < k-1 acts as a proxy for inserting nodes into the graph. MetaGraph-MLA constructs an MLA of a query by chaining single-label alignments using sparse dynamic programming. We evaluate MetaGraph-MLA on simulated data against state-of-the-art sequence-to-graph aligners. We demonstrate increases in alignment lengths for simulated viral Illumina-type (by 6.5%), PacBio CLR-type (by 6.2%), and PacBio CCS-type (by 6.7%) sequencing reads, respectively, and show that the graph walks incorporated into our MLAs originate predominantly from samples of the same strain as the reads' ground-truth samples. We envision MetaGraph-MLA as a step towards establishing sequence graph tools for sequence search against a wide variety of target sequence types.

Authors Harun Mustafa, Mikhail Karasikov, Gunnar Rätsch, André Kahles

Submitted bioRxiv

Link DOI

Abstract Methods In a single-center retrospective study of matched pairs of initial and post-therapeutic glioma cases with a recurrence period greater than one year, we performed whole exome sequencing combined with mRNA and microRNA expression profiling to identify processes that are altered in recurrent gliomas. Results Mutational analysis of recurrent gliomas revealed early branching evolution in seventy-five percent of patients. High plasticity was confirmed at the mRNA and miRNA levels. SBS1 signature was reduced and SBS11 was elevated, demonstrating the effect of alkylating agent therapy on the mutational landscape. There was no evidence for secondary genomic alterations driving therapy resistance. ALK7/ACVR1C and LTBP1 were upregulated, whereas LEFTY2 was downregulated, pointing towards enhanced Tumor Growth Factor β (TGF-β) signaling in recurrent gliomas. Consistently, altered microRNA expression profiles pointed towards enhanced Nuclear Factor Kappa B and Wnt signaling that, cooperatively with TGF-β, induces epithelial to mesenchymal transition (EMT), migration and stemness. TGF-β-induced expression of pro-apoptotic proteins and repression of anti-apoptotic proteins were uncoupled in the recurrent tumor. Conclusions Our results suggest an important role of TGF-β signaling in recurrent gliomas. This may have clinical implication, since TGF-β inhibitors have entered clinical phase studies and may potentially be used in combination therapy to interfere with chemoradiation resistance. Recurrent gliomas show high incidence of early branching evolution. High tumor plasticity is confirmed at the level of microRNA and mRNA expression profiles.

Authors Elham Kashani, Désirée Schnidrig, Ali Hashemi Gheinani, Martina Selina Ninck, Philipp Zens, Theoni Maragkou, Ulrich Baumgartner, Philippe Schucht, Gunnar Rätsch, Mark A Rubin, Sabina Berezowska, Charlotte KY Ng, Erik Vassella

Submitted Neuro-oncology

Link DOI

Abstract Background. Glioblastoma (GBM) is the most aggressive primary brain tumor and represents a particular challenge of therapeutic intervention. Methods. In a single-center retrospective study of matched pairs of initial and post-therapeutic GBM cases with a recurrence period greater than one year, we performed whole exome sequencing combined with mRNA and microRNA expression profiling to identify processes that are altered in recurrent GBM. Results. Expression and mutational profiling of recurrent GBM revealed evidence for early branching evolution in seventy-five percent of patients. SBS1 signature was reduced in the recurrent tumor and SBS11 was elevated, demonstrating the effect of alkylating agent therapy on the mutational landscape. There was no evidence for secondary genomic alterations driving therapy resistance. ALK7/ACVR1C and LTBP1 were upregulated, whereas LEFTY2 was downregulated, pointing towards enhanced Tumor Growth Factor β (TGF-β) signaling in the recurrent GBM. Consistently, altered microRNA expression profiles pointed towards enhanced Nuclear Factor Kappa B signaling that, cooperatively with TGF-β, induces epithelial to mesenchymal transition (EMT), migration and stemness. In contrast, TGF-β-induced expression of pro-apoptotic proteins and repression of anti-apoptotic proteins were uncoupled in the recurrent tumor. Conclusions. Our results suggest an important role of TGF-β signaling in recurrent GBM. This may have clinical implication, since TGF-β inhibitors have entered clinical phase studies and may potentially be used in combination therapy to interfere with chemoradiation resistance. Recurrent GBM show high incidence of early branching evolution. High tumor plasticity is confirmed at the level of microRNA and mRNA expression profiles.

Authors Elham Kashani, Désirée Schnidrig, Ali Hashemi Gheinani, Martina Selina Ninck, Philipp Zens, Theoni Maragkou, Ulrich Baumgartner, Philippe Schucht, Gunnar Rätsch, Mark Andrew Rubin, Sabina Berezowska, Charlotte KY Ng, Erik Vassella

Submitted Research Square (Preprint Platform)

Link DOI

Abstract Several recently developed single-cell DNA sequencing technologies enable whole-genome sequencing of thousands of cells. However, the ultra-low coverage of the sequenced data (<0.05× per cell) mostly limits their usage to the identification of copy number alterations in multi-megabase segments. Many tumors are not copy number-driven, and thus single-nucleotide variant (SNV)-based subclone detection may contribute to a more comprehensive view on intra-tumor heterogeneity. Due to the low coverage of the data, the identification of SNVs is only possible when superimposing the sequenced genomes of hundreds of genetically similar cells. Thus, we have developed a new approach to efficiently cluster tumor cells based on a Bayesian filtering approach of relevant loci and exploiting read overlap and phasing.

Authors Hana Rozhoňová, Daniel Danciu, Stefan Stark, Gunnar Rätsch, André Kahles, Kjong-Van Lehmann

Submitted Bioinformatics

Link DOI

Abstract Models that can predict the occurrence of events ahead of time with low false-alarm rates are critical to the acceptance of decision support systems in the medical community. This challenging task is typically treated as a simple binary classification, ignoring temporal dependencies between samples, whereas we propose to exploit this structure. We first introduce a common theoretical framework unifying dynamic survival analysis and early event prediction. Following an analysis of objectives from both fields, we propose Temporal Label Smoothing (TLS), a simpler, yet best-performing method that preserves prediction monotonicity over time. By focusing the objective on areas with a stronger predictive signal, TLS improves performance over all baselines on two large-scale benchmark tasks. Gains are particularly notable along clinically relevant measures, such as event recall at low false-alarm rates. TLS reduces the number of missed events by up to a factor of two over previously used approaches in early event prediction.

Authors Hugo Yèche, Alizée Pace, Gunnar Rätsch, Rita Kuznetsova

Submitted ICML 2023

Link DOI

Abstract Mutations in the splicing factor SF3B1 are frequently occurring in various cancers and drive tumor progression through the activation of cryptic splice sites in multiple genes. Recent studies also demonstrate a positive correlation between the expression levels of wild-type SF3B1 and tumor malignancy. Here, we demonstrate that SF3B1 is a hypoxia-inducible factor (HIF)-1 target gene that positively regulates HIF1 pathway activity. By physically interacting with HIF1α, SF3B1 facilitates binding of the HIF1 complex to hypoxia response elements (HREs) to activate target gene expression. To further validate the relevance of this mechanism for tumor progression, we show that a reduction in SF3B1 levels via monoallelic deletion of Sf3b1 impedes tumor formation and progression via impaired HIF signaling in a mouse model for pancreatic cancer. Our work uncovers an essential role of SF3B1 in HIF1 signaling, thereby providing a potential explanation for the link between high SF3B1 expression and aggressiveness of solid tumors.

Authors Patrik Simmler, Cédric Cortijo, Lisa Maria Koch, Patricia Galliker, Silvia Angori, Hella Anna Bolck, Christina Mueller, Ana Vukolic, Peter Mirtschink, Yann Christinat, Natalie R Davidson, Kjong-Van Lehmann, Giovanni Pellegrini, Chantal Pauli, Daniela Lenggenhager, Ilaria Guccini, Till Ringel, Christian Hirt, Kim Fabiano Marquart, Moritz Schaefer, Gunnar Rätsch, Matthias Peter, Holger Moch, Markus Stoffel, Gerald Schwank

Submitted Cell Reports

Link DOI

Abstract Understanding and predicting molecular responses towards external perturbations is a core question in molecular biology. Technological advancements in the recent past have enabled the generation of high-resolution single-cell data, making it possible to profile individual cells under different experimentally controlled perturbations. However, cells are typically destroyed during measurement, resulting in unpaired distributions over either perturbed or non-perturbed cells. Leveraging the theory of optimal transport and the recent advents of convex neural architectures, we learn a coupling describing the response of cell populations upon perturbation, enabling us to predict state trajectories on a single-cell level. We apply our approach, CellOT, to predict treatment responses of 21,650 cells subject to four different drug perturbations. CellOT outperforms current state-of-the-art methods both qualitatively and quantitatively, accurately capturing cellular behavior shifts across all different drugs.

Authors Charlotte Bunne, Stefan Stark, Gabriele Gut, Jacobo Sarabia del Castillo, Mitchell Levesque, Kjong Van Lehmann, Lucas Pelkmans, Andreas Krause, Gunnar Rätsch

Submitted BioRxiv

Link DOI

Abstract Alternative splicing (AS) is a regulatory process during mRNA maturation that shapes higher eukaryotes’ complex transcriptomes. High-throughput sequencing of RNA (RNA-Seq) allows for measurements of AS transcripts at an unprecedented depth and diversity. The ever-expanding catalog of known AS events provides biological insights into gene regulation, population genetics, or in the context of disease. Here, we present an overview on the usage of SplAdder, a graph-based alternative splicing toolbox, which can integrate an arbitrarily large number of RNA-Seq alignments and a given annotation file to augment the shared annotation based on RNA-Seq evidence. The shared augmented annotation graph is then used to identify, quantify, and confirm alternative splicing events based on the RNA-Seq data. Splice graphs for individual alignments can also be tested for significant quantitative differences between other samples or groups of samples.

Authors Philipp Markolin, Gunnar Rätsch, André Kahles

Submitted Variant Calling

Link

Abstract Complex multivariate time series arise in many fields, ranging from computer vision to robotics or medicine. Often we are interested in the independent underlying factors that give rise to the high-dimensional data we are observing. While many models have been introduced to learn such disentangled representations, only few attempt to explicitly exploit the structure of sequential data. We investigate the disentanglement properties of Gaussian process variational autoencoders, a class of models recently introduced that have been successful in different tasks on time series data. Our model exploits the temporal structure of the data by modeling each latent channel with a GP prior and employing a structured variational distribution that can capture dependencies in time. We demonstrate the competitiveness of our approach against state-of-the-art unsupervised and weakly-supervised disentanglement methods on a benchmark task. Moreover, we provide evidence that we can learn meaningful disentangled representations on real-world medical time series data.

Authors Simon Bing, Vincent Fortuin, Gunnar Rätsch

Submitted AABI 2022

Link

Abstract Multi-layered omics technologies can help define relationships between genetic factors, biochemical processes and phenotypes thus extending research of monogenic diseases beyond identifying their cause. We implemented a multi-layered omics approach for the inherited metabolic disorder methylmalonic aciduria. We performed whole genome sequencing, transcriptomic sequencing, and mass spectrometry-based proteotyping from matched primary fibroblast samples of 230 individuals (210 affected, 20 controls) and related the molecular data to 105 phenotypic features. Integrative analysis identified a molecular diagnosis for 84% (179/210) of affected individuals, the majority (150) of whom had pathogenic variants in methylmalonyl-CoA mutase (MMUT). Untargeted integration of all three omics layers revealed dysregulation of TCA cycle and surrounding metabolic pathways, a finding that was further supported by multi-organ metabolomics of a hemizygous Mmut mouse model. Stratification by phenotypic severity indicated downregulation of oxoglutarate dehydrogenase and upregulation of glutamate dehydrogenase in disease. This was supported by metabolomics and isotope tracing studies which showed increased glutamine-derived anaplerosis. We further identified MMUT to physically interact with both, oxoglutarate dehydrogenase and glutamate dehydrogenase providing a mechanistic link. This study emphasizes the utility of a multi-modal omics approach to investigate metabolic diseases and highlights glutamine anaplerosis as a potential therapeutic intervention point in methylmalonic aciduria.

Authors Patrick Forny, Ximena Bonilla, David Lamparter, Wenguang Shao, Tanja Plessl, Caroline Frei, Anna Bingisser, Sandra Goetze, Audrey van Drogen, Keith Harshmann, Patrick GA Pedrioli, Cedric Howald, Florian Traversi, Sarah Cherkaoui, Raphael J Morscher, Luke Simmons, Merima Forny, Ioannis Xenarios, Ruedi Aebersold, Nicola Zamboni, Gunnar Rätsch, Emmanouil Dermitzakis, Bernd Wollscheid, Matthias R Baumgartner, D Sean Froese

Submitted medRxiv

Link DOI

Abstract We propose a stochastic conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms. Existing CGM variants for this template either suffer from slow convergence rates, or require carefully increasing the batch size over the course of the algorithm’s execution, which leads to computing full gradients. In contrast, the proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques. In applications we put special emphasis on problems with a large number of separable constraints. Such problems are prevalent among semidefinite programming (SDP) formulations arising in machine learning and theoretical computer science. We provide numerical experiments on matrix completion, unsupervised clustering, and sparsest-cut SDPs.

Authors Gideon Dresdner, Maria-Luiza Vladarean, Gunnar Rätsch, Francesco Locatello, Volkan Cevher, Alp Yurtsever

Submitted Proceedings of The 25th International Conference on Artificial Intelligence and Statistics (AISTATS-22)

Link

Abstract The recent success of machine learning methods applied to time series collected from Intensive Care Units (ICU) exposes the lack of standardized machine learning benchmarks for developing and comparing such methods. While raw datasets, such as MIMIC-IV or eICU, can be freely accessed on Physionet, the choice of tasks and pre-processing is often chosen ad-hoc for each publication, limiting comparability across publications. In this work, we aim to improve this situation by providing a benchmark covering a large spectrum of ICU-related tasks. Using the HiRID dataset, we define multiple clinically relevant tasks in collaboration with clinicians. In addition, we provide a reproducible end-to-end pipeline to construct both data and labels. Finally, we provide an in-depth analysis of current state-of-the-art sequence modeling methods, highlighting some limitations of deep learning approaches for this type of data. With this benchmark, we hope to give the research community the possibility of a fair comparison of their work.

Authors Hugo Yèche, Rita Kuznetsova, Marc Zimmermann, Matthias Hüser, Xinrui Lyu, Martin Faltys, Gunnar Rätsch

Submitted NeurIPS 2021 (Datasets and Benchmarks)

Link

Abstract High-throughput sequencing data is rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in solving the experiment discovery problem and building compressed representations of annotated de Bruijn graphs where k-mer sets can be efficiently indexed and interactively queried. However, approaches for representing and retrieving other quantitative attributes such as gene expression or genome positions in a general manner have yet to be developed. In this work, we propose the concept of Counting de Bruijn graphs generalizing the notion of annotated (or colored) de Bruijn graphs. Counting de Bruijn graphs supplement each node-label relation with one or many attributes (e.g., a k-mer count or its positions in genome). To represent them, we first observe that many schemes for the representation of compressed binary matrices already support the rank operation on the columns or rows, which can be used to define an inherent indexing of any additional quantitative attributes. Based on this property, we generalize these schemes and introduce a new approach for representing non-binary sparse matrices in compressed data structures. Finally, we notice that relation attributes are often easily predictable from a node’s local neighborhood in the graph. Notable examples are genome positions shifting by 1 for neighboring nodes in the graph, or expression levels that are often shared across neighbors. We exploit this regularity of graph annotations and apply an invertible delta-like coding to achieve better compression. We show that Counting de Bruijn graphs index k-mer counts from 2,652 human RNA-Seq read sets in representations over 8-fold smaller and yet faster to query compared to state-of-the-art bioinformatics tools. Furthermore, Counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip -9 for human Illumina RNA-Seq and 57% smaller for PacBio HiFi sequencing of viral samples. A complete joint searchable index of all viral PacBio SMRT reads from NCBI’s SRA (152,884 read sets, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, they generate a lossless and fully queryable index that is 4.4-fold smaller compared to the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools employing de Bruijn graphs and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed and fully searchable graph-based sequence indexes.

Authors Mikhail Karasikov, Harun Mustafa, Gunnar Rätsch, André Kahles

Submitted RECOMB 2022

Link DOI

Abstract Marginal-likelihood based model-selection, even though promising, is rarely used in deep learning due to estimation difficulties. Instead, most approaches rely on validation data, which may not be readily available. In this work, we present a scalable marginal-likelihood estimation method to select both hyperparameters and network architectures, based on the training data alone. Some hyperparameters can be estimated online during training, simplifying the procedure. Our marginal-likelihood estimate is based on Laplace's method and Gauss-Newton approximations to the Hessian, and it outperforms cross-validation and manual-tuning on standard regression and image classification datasets, especially in terms of calibration and out-of-distribution detection. Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable (e.g., in nonstationary settings).

Authors Alexander Immer, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, Mohammad Emtiyaz Khan

Submitted ICML 2021

Link

Authors Patrik T Simmler, Tamara Mengis, Kjong-Van Lehmann, André Kahles, Tinu Thomas, Gunnar Rätsch, Markus Stoffel, Gerald Schwank

Submitted bioRxiv

Link

Abstract Intensive care units (ICU) are increasingly looking towards machine learning for methods to provide online monitoring of critically ill patients. In machine learning, online monitoring is often formulated as a supervised learning problem. Recently, contrastive learning approaches have demonstrated promising improvements over competitive supervised benchmarks. These methods rely on well-understood data augmentation techniques developed for image data which do not apply to online monitoring. In this work, we overcome this limitation by supplementing time-series data augmentation techniques with a novel contrastive learning objective which we call neighborhood contrastive learning (NCL). Our objective explicitly groups together contiguous time segments from each patient while maintaining state-specific information. Our experiments demonstrate a marked improvement over existing work applying contrastive methods to medical time-series.

Authors Hugo Yèche, Gideon Dresdner, Francesco Locatello, Matthias Hüser, Gunnar Rätsch

Submitted ICML 2021

Link

Abstract We present a global atlas of 4,728 metagenomic samples from mass-transit systems in 60 cities over 3 years, representing the first systematic, worldwide catalog of the urban microbial ecosystem. This atlas provides an annotated, geospatial profile of microbial strains, functional characteristics, antimicrobial resistance (AMR) markers, and genetic elements, including 10,928 viruses, 1,302 bacteria, 2 archaea, and 838,532 CRISPR arrays not found in reference databases. We identified 4,246 known species of urban microorganisms and a consistent set of 31 species found in 97% of samples that were distinct from human commensal organisms. Profiles of AMR genes varied widely in type and density across cities. Cities showed distinct microbial taxonomic signatures that were driven by climate and geographic differences. These results constitute a high-resolution global metagenomic atlas that enables discovery of organisms and genes, highlights potential public health and forensic applications, and provides a culture-independent view of AMR burden in cities.

Authors David Danko, Daniela Bezdan, Evan E. Afshin, Sofia Ahsanuddin, Chandrima Bhattacharya, Daniel J. Butler, Kern Rei Chng, Daisy Donnellan, Jochen Hecht, Katelyn Jackson, Katerina Kuchin, Mikhail Karasikov, Abigail Lyons, Lauren Mak, Dmitry Meleshko, Harun Mustafa, Beth Mutai, Russell Y. Neches, Amanda Ng, Olga Nikolayeva, Tatyana Nikolayeva, Eileen Png, Krista A. Ryon, Jorge L. Sanchez, Heba Shaaban, Maria A. Sierra, Dominique Thomas, Ben Young, Omar O. Abudayyeh, Josue Alicea, Malay Bhattacharyya, Ran Blekhman, Eduardo Castro-Nallar, Ana M. Cañas, Aspassia D. Chatziefthimiou, Robert W. Crawford, Francesca De Filippis, Youping Deng, Christelle Desnues, Emmanuel Dias-Neto, Marius Dybwad, Eran Elhaik, Danilo Ercolini, Alina Frolova, Dennis Gankin, Jonathan S. Gootenberg, Alexandra B. Graf, David C. Green, Iman Hajirasouliha, Jaden J.A. Hastings, Mark Hernandez, Gregorio Iraola, Soojin Jang, Andre Kahles, Frank J. Kelly, Kaymisha Knights, Nikos C. Kyrpides, Paweł P. Łabaj, Patrick K.H. Lee, Marcus H.Y. Leung, Per O. Ljungdahl, Gabriella Mason-Buck, Ken McGrath, Cem Meydan, Emmanuel F. Mongodin, Milton Ozorio Moraes, Niranjan Nagarajan, Marina Nieto-Caballero, Houtan Noushmehr, Manuela Oliveira, Stephan Ossowski, Olayinka O. Osuolale, Orhan Özcan, David Paez-Espino, Nicolás Rascovan, Hugues Richard, Gunnar Rätsch, Lynn M. Schriml, Torsten Semmler, Osman U. Sezerman, Leming Shi, Tieliu Shi, Rania Siam, Le Huu Song, Haruo Suzuki, Denise Syndercombe Court, Scott W. Tighe, Xinzhao Tong, Klas I. Udekwu, Juan A. Ugalde, Brandon Valentine, Dimitar I. Vassilev, Elena M. Vayndorf, Thirumalaisamy P. Velavan, Jun Wu, María M. Zambrano, Jifeng Zhu, Sibo Zhu, Christopher E. Mason, The International MetaSUB Consortium

Submitted Cell

Link DOI

Abstract The development of respiratory failure is common among patients in intensive care units (ICU). Large data quantities from ICU patient monitoring systems make timely and comprehensive analysis by clinicians difficult but are ideal for automatic processing by machine learning algorithms. Early prediction of respiratory system failure could alert clinicians to patients at risk of respiratory failure and allow for early patient reassessment and treatment adjustment. We propose an early warning system that predicts moderate/severe respiratory failure up to 8 hours in advance. Our system was trained on HiRID-II, a data-set containing more than 60,000 admissions to a tertiary care ICU. An alarm is typically triggered several hours before the beginning of respiratory failure. Our system outperforms a clinical baseline mimicking traditional clinical decision-making based on pulse-oximetric oxygen saturation and the fraction of inspired oxygen. To provide model introspection and diagnostics, we developed an easy-to-use web browser-based system to explore model input data and predictions visually.

Authors Matthias Hüser, Martin Faltys, Xinrui Lyu, Chris Barber, Stephanie L. Hyland, Thomas M. Merz, Gunnar Rätsch

Submitted arXiv Preprints

Link

Abstract Pancreatic adenocarcinoma (PDAC) epitomizes a deadly cancer driven by abnormal KRAS signaling. Here, we show that the eIF4A RNA helicase is required for translation of key KRAS signaling molecules and that pharmacological inhibition of eIF4A has single-agent activity against murine and human PDAC models at safe dose levels. EIF4A was uniquely required for the translation of mRNAs with long and highly structured 5′ untranslated regions, including those with multiple G-quadruplex elements. Computational analyses identified these features in mRNAs encoding KRAS and key downstream molecules. Transcriptome-scale ribosome footprinting accurately identified eIF4A-dependent mRNAs in PDAC, including critical KRAS signaling molecules such as PI3K, RALA, RAC2, MET, MYC, and YAP1. These findings contrast with a recent study that relied on an older method, polysome fractionation, and implicated redox-related genes as eIF4A clients. Together, our findings highlight the power of ribosome footprinting in conjunction with deep RNA sequencing in accurately decoding translational control mechanisms and define the therapeutic mechanism of eIF4A inhibitors in PDAC.

Authors Kamini Singh, Jianan Lin, Nicolas Lecomte, Prathibha Mohan, Askan Gokce, Viraj R Sanghvi, Man Jiang, Olivera Grbovic-Huezo, Antonija Burčul, Stefan G Stark, Paul B Romesser, Qing Chang, Jerry P Melchor, Rachel K Beyer, Mark Duggan, Yoshiyuki Fukase, Guangli Yang, Ouathek Ouerfelli, Agnes Viale, Elisa De Stanchina, Andrew W Stamford, Peter T Meinke, Gunnar Rätsch, Steven D Leach, Zhengqing Ouyang, Hans-Guido Wendel

Submitted Journal Cancer research

Link DOI

Abstract Conventional variational autoencoders fail in modeling correlations between data points due to their use of factorized priors. Amortized Gaussian process inference through GP-VAEs has led to significant improvements in this regard, but is still inhibited by the intrinsic complexity of exact GP inference. We improve the scalability of these methods through principled sparse inference approaches. We propose a new scalable GP-VAE model that outperforms existing approaches in terms of runtime and memory footprint, is easy to implement, and allows for joint end-to-end optimization of all components.

Authors Metod Jazbec, Vincent Fortuin, Michael Pearce, Stephan Mandt, Gunnar Rätsch

Submitted AISTATS 2021

Link

Abstract Generating interpretable visualizations of multivariate time series in the intensive care unit is of great practical importance. Clinicians seek to condense complex clinical observations into intuitively understandable critical illness patterns, like failures of different organ systems. They would greatly benefit from a low-dimensional representation in which the trajectories of the patients' pathology become apparent and relevant health features are highlighted. To this end, we propose to use the latent topological structure of Self-Organizing Maps (SOMs) to achieve an interpretable latent representation of ICU time series and combine it with recent advances in deep clustering. Specifically, we (a) present a novel way to fit SOMs with probabilistic cluster assignments (PSOM), (b) propose a new deep architecture for probabilistic clustering (DPSOM) using a VAE, and (c) extend our architecture to cluster and forecast clinical states in time series (T-DPSOM). We show that our model achieves superior clustering performance compared to state-of-the-art SOM-based clustering methods while maintaining the favorable visualization properties of SOMs. On the eICU data-set, we demonstrate that T-DPSOM provides interpretable visualizations of patient state trajectories and uncertainty estimation. We show that our method rediscovers well-known clinical patient characteristics, such as a dynamic variant of the Acute Physiology And Chronic Health Evaluation (APACHE) score. Moreover, we illustrate how it can disentangle individual organ dysfunctions on disjoint regions of the two-dimensional SOM map.

Authors Laura Manduchi, Matthias Hüser, Martin Faltys, Julia Vogt, Gunnar Rätsch, Vincent Fortuin

Submitted ACM-CHIL 2021

Link

Abstract Dynamic assessment of mortality risk in the intensive care unit (ICU) can be used to stratify patients, inform about treatment effectiveness or serve as part of an early-warning system. Static risk scoring systems, such as APACHE or SAPS, have recently been supplemented with data-driven approaches that track the dynamic mortality risk over time. Recent works have focused on enhancing the information delivered to clinicians even further by producing full survival distributions instead of point predictions or fixed horizon risks. In this work, we propose a non-parametric ensemble model, Weighted Resolution Survival Ensemble (WRSE), tailored to estimate such dynamic individual survival distributions. Inspired by the simplicity and robustness of ensemble methods, the proposed approach combines a set of binary classifiers spaced according to a decay function reflecting the relevance of short-term mortality predictions. Models and baselines are evaluated under weighted calibration and discrimination metrics for individual survival distributions which closely reflect the utility of a model in ICU practice. We show competitive results with state-of-the-art probabilistic models, while greatly reducing training time by factors of 2-9x.

Authors Jonathan Heitz, Joanna Ficek-Pascual, Martin Faltys, Tobias M. Merz, Gunnar Rätsch, Matthias Hüser

Submitted Proceedings of the AAAI-2021 - Spring Symposium on Survival Prediction

Link

Abstract Intra-tumor hypoxia is a common feature in many solid cancers. Although transcriptional targets of hypoxia-inducible factors (HIFs) have been well characterized, alternative splicing or processing of pre-mRNA transcripts which occurs during hypoxia and subsequent HIF stabilization is much less understood. Here, we identify many HIF-dependent alternative splicing events after whole transcriptome sequencing in pancreatic cancer cells exposed to hypoxia with and without downregulation of the aryl hydrocarbon receptor nuclear translocator (ARNT), a protein required for HIFs to form a transcriptionally active dimer. We correlate the discovered hypoxia-driven events with available sequencing data from pan-cancer TCGA patient cohorts to select a narrow set of putative biologically relevant splice events for experimental validation. We validate a small set of candidate HIF-dependent alternative splicing events in multiple human gastrointestinal cancer cell lines as well as patient-derived human pancreatic cancer organoids. Lastly, we report the discovery of a HIF-dependent mechanism to produce a hypoxia-dependent, long and coding isoform of the UDP-N-acetylglucosamine transporter SLC35A3.

Authors Philipp Markolin, Natalie Davidson, Christian K Hirt, Christophe D Chabbert, Nicola Zamboni, Gerald Schwank, Wilhelm Krek, Gunnar Rätsch

Submitted Genomics

Link DOI

Abstract Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently-used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. In this paper, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of vertices adjacent in the graph. RowDiff can be constructed in linear time relative to the number of vertices and labels in the graph, and in space proportional to the graph size. In addition, construction can be efficiently parallelized and distributed, making the technique applicable to graphs with trillions of nodes. RowDiff can be viewed as an intermediary sparsification step of the original annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrices. Experiments on 10,000 RNA-seq datasets show that RowDiff combined with Multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST, the previously known most compact annotation representation. Experiments on the sparser Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a Multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST.

Authors Daniel Danciu, Mikhail Karasikov, Harun Mustafa, André Kahles, Gunnar Rätsch

Submitted ISMB/ECCB 2021

Link DOI

Abstract The sharp increase in next-generation sequencing technologies’ capacity has created a demand for algorithms capable of quickly searching a large corpus of biological sequences. The complexity of biological variability and the magnitude of existing data sets have impeded finding algorithms with guaranteed accuracy that efficiently run in practice. Our main contribution is the Tensor Sketch method that efficiently and accurately estimates edit distances. In our experiments, Tensor Sketch had 0.88 Spearman’s rank correlation with the exact edit distance, almost doubling the 0.466 correlation of the closest competitor while running 8.8 times faster. Finally, all sketches can be updated dynamically if the input is a sequence stream, making it appealing for large-scale applications where data cannot fit into memory. Conceptually, our approach has three steps: 1) represent sequences as tensors over their sub-sequences, 2) apply tensor sketching that preserves tensor inner products, 3) implicitly compute the sketch. The sub-sequences, which are not necessarily contiguous pieces of the sequence, allow us to outperform fc-mer-based methods, such as min-hash sketching over a set of k-mers. Typically, the number of sub-sequences grows exponentially with the sub-sequence length, introducing both memory and time overheads. We directly address this problem in steps 2 and 3 of our method. While the sketching of rank-1 or super-symmetric tensors is known to admit efficient sketching, the sub-sequence tensor does not satisfy either of these properties. Hence, we propose a new sketching scheme that completely avoids the need for constructing the ambient space. Our tensor-sketching technique’s main advantages are three-fold: 1) Tensor Sketch has higher accuracy than any of the other assessed sketching methods used in practice. 2) All sketches can be computed in a streaming fashion, leading to significant time and memory savings when there is overlap between input sequences. 3) It is straightforward to extend tensor sketching to different settings leading to efficient methods for related sequence analysis tasks. We view tensor sketching as a framework to tackle a wide range of relevant bioinformatics problems, and we are confident that it can bring significant improvements for applications based on edit distance estimation.

Authors Amir Joudaki, Gunnar Rätsch, André Kahles

Submitted RECOMB 2021

Link