Gunnar Rätsch, Prof. Dr.

Interdisciplinary research is about forging collaborations across disciplinary and geographic borders.


+41 44 632 2036
ETH Zürich
Department of Computer Science
Biomedical Informatics Group
Universitätsstrasse 6
CAB F53.2
8092 Zürich

Data scientist Gunnar Rätsch develops and applies advanced data analysis and modeling techniques to data from deep molecular profiling, medical and health records, as well as images.

He earned his Ph.D. at the German National Laboratory for Information Technology under the supervision of Klaus-Robert Müller and was a postdoc with Bob Williamson and Bernhard Schölkopf. He received the Max Planck Young and Independent Investigator award and led the Machine Learning in Genome Biology group at the Friedrich Miescher Laboratory in Tübingen (2005-2011). In 2012, he joined Memorial Sloan Kettering Cancer Center as Associate Faculty. In May 2016, he and his lab moved to Zürich to join the Computer Science Department of ETH Zürich.

The Rätsch laboratory focuses on bridging medicine and biology with computer science. The group’s research interests are broad, spanning algorithmic computer science to biomedical application fields. On the one hand, this includes work on algorithms that learn or extract insights from data; on the other, it involves developing tools that the group and others employ for the analysis of large genomic or medical data sets, often in collaboration with biologists and physicians. These tools aim to solve real-world biomedical problems. In short, the group advances the state of the art in data science algorithms, turns them into commonly usable tools for specific applications, and then collaborates with biologists and physicians on life science problems. Along the way, the group learns more and can go back to improve the algorithms.

Abstract Understanding deep learning model behavior is critical to accepting machine learning-based decision support systems in the medical community. Previous research has shown that jointly using clinical notes with electronic health record (EHR) data improved predictive performance for patient monitoring in the intensive care unit (ICU). In this work, we explore the underlying reasons for these improvements. While relying on a basic attention-based model to allow for interpretability, we first confirm that performance significantly improves over state-of-the-art EHR data models when combining EHR data and clinical notes. We then provide an analysis showing that the improvements arise almost exclusively from a subset of notes containing broader context on patient state rather than clinician notes. We believe these findings indicate that deep learning models for EHR data are limited more by partially descriptive data than by modeling choices, motivating a more data-centric approach in the field.

Authors Severin Husmann, Hugo Yèche, Gunnar Rätsch, Rita Kuznetsova

Submitted Workshop on Learning from Time Series for Health, 36th Conference on Neural Information Processing Systems (NeurIPS 2022)


Abstract Data augmentation is commonly applied to improve performance of deep learning by enforcing the knowledge that certain transformations on the input preserve the output. Currently, the used data augmentation is chosen by human effort and costly cross-validation, which makes it cumbersome to apply to new datasets. We develop a convenient gradient-based method for selecting the data augmentation without validation data and during training of a deep neural network. Our approach relies on phrasing data augmentation as an invariance in the prior distribution and learning it using Bayesian model selection, which has been shown to work in Gaussian processes, but not yet for deep neural networks. We propose a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective, which can be optimised without human supervision or validation data. We show that our method can successfully recover invariances present in the data, and that this improves generalisation and data efficiency on image datasets.

Authors Alexander Immer, Tycho FA van der Ouderaa, Gunnar Rätsch, Vincent Fortuin, Mark van der Wilk

Submitted NeurIPS 2022


Abstract The amount of data stored in genomic sequence databases is growing exponentially, far exceeding traditional indexing strategies' processing capabilities. Many recent indexing methods organize sequence data into a sequence graph to succinctly represent large genomic data sets from reference genome and sequencing read set databases. These methods typically use De Bruijn graphs as the graph model or the underlying index model, with auxiliary graph annotation data structures to associate graph nodes with various metadata. Examples of metadata can include a node's source samples (called labels), genomic coordinates, expression levels, etc. An important property of these graphs is that the set of sequences spelled by graph walks is a superset of the set of input sequences. Thus, when aligning to graphs indexing samples derived from low-coverage sequencing sets, sequence information present in many target samples can compensate for missing information resulting from a lack of sequence coverage. Aligning a query against an entire sequence graph (as in traditional sequence-to-graph alignment) using state-of-the-art algorithms can be computationally intractable for graphs constructed from thousands of samples, potentially searching through many non-biological combinations of samples before converging on the best alignment. To address this problem, we propose a novel alignment strategy called multi-label alignment (MLA) and an algorithm implementing this strategy using annotated De Bruijn graphs within the MetaGraph framework, called MetaGraph-MLA. MLA extends current sequence alignment scoring models with additional label change operations for incorporating mixtures of samples into an alignment, penalizing mixtures that are dissimilar in their sequence content. 
To overcome disconnects in the graph that result from a lack of sequencing coverage, we further extend our graph index to utilize a variable-order De Bruijn graph and introduce node length change as an operation. In this model, traversal between nodes that share a suffix of length < k-1 acts as a proxy for inserting nodes into the graph. MetaGraph-MLA constructs an MLA of a query by chaining single-label alignments using sparse dynamic programming. We evaluate MetaGraph-MLA on simulated data against state-of-the-art sequence-to-graph aligners. We demonstrate increases in alignment lengths for simulated viral Illumina-type (by 6.5%), PacBio CLR-type (by 6.2%), and PacBio CCS-type (by 6.7%) sequencing reads, and show that the graph walks incorporated into our MLAs originate predominantly from samples of the same strain as the reads' ground-truth samples. We envision MetaGraph-MLA as a step towards establishing sequence graph tools for sequence search against a wide variety of target sequence types.

Authors Harun Mustafa, Mikhail Karasikov, Gunnar Rätsch, André Kahles

Submitted bioRxiv

Link DOI

Abstract Methods In a single-center retrospective study of matched pairs of initial and post-therapeutic glioma cases with a recurrence period greater than one year, we performed whole exome sequencing combined with mRNA and microRNA expression profiling to identify processes that are altered in recurrent gliomas. Results Mutational analysis of recurrent gliomas revealed early branching evolution in seventy-five percent of patients. High plasticity was confirmed at the mRNA and miRNA levels. SBS1 signature was reduced and SBS11 was elevated, demonstrating the effect of alkylating agent therapy on the mutational landscape. There was no evidence for secondary genomic alterations driving therapy resistance. ALK7/ACVR1C and LTBP1 were upregulated, whereas LEFTY2 was downregulated, pointing towards enhanced Transforming Growth Factor β (TGF-β) signaling in recurrent gliomas. Consistently, altered microRNA expression profiles pointed towards enhanced Nuclear Factor Kappa B and Wnt signaling that, cooperatively with TGF-β, induces epithelial to mesenchymal transition (EMT), migration and stemness. TGF-β-induced expression of pro-apoptotic proteins and repression of anti-apoptotic proteins were uncoupled in the recurrent tumor. Conclusions Our results suggest an important role of TGF-β signaling in recurrent gliomas. This may have clinical implications, since TGF-β inhibitors have entered clinical phase studies and may potentially be used in combination therapy to interfere with chemoradiation resistance. Recurrent gliomas show a high incidence of early branching evolution. High tumor plasticity is confirmed at the level of microRNA and mRNA expression profiles.

Authors Elham Kashani, Désirée Schnidrig, Ali Hashemi Gheinani, Martina Selina Ninck, Philipp Zens, Theoni Maragkou, Ulrich Baumgartner, Philippe Schucht, Gunnar Rätsch, Mark A Rubin, Sabina Berezowska, Charlotte KY Ng, Erik Vassella

Submitted Neuro-oncology

Link DOI

Abstract Background. Glioblastoma (GBM) is the most aggressive primary brain tumor and represents a particular challenge for therapeutic intervention. Methods. In a single-center retrospective study of matched pairs of initial and post-therapeutic GBM cases with a recurrence period greater than one year, we performed whole exome sequencing combined with mRNA and microRNA expression profiling to identify processes that are altered in recurrent GBM. Results. Expression and mutational profiling of recurrent GBM revealed evidence for early branching evolution in seventy-five percent of patients. SBS1 signature was reduced in the recurrent tumor and SBS11 was elevated, demonstrating the effect of alkylating agent therapy on the mutational landscape. There was no evidence for secondary genomic alterations driving therapy resistance. ALK7/ACVR1C and LTBP1 were upregulated, whereas LEFTY2 was downregulated, pointing towards enhanced Transforming Growth Factor β (TGF-β) signaling in the recurrent GBM. Consistently, altered microRNA expression profiles pointed towards enhanced Nuclear Factor Kappa B signaling that, cooperatively with TGF-β, induces epithelial to mesenchymal transition (EMT), migration and stemness. In contrast, TGF-β-induced expression of pro-apoptotic proteins and repression of anti-apoptotic proteins were uncoupled in the recurrent tumor. Conclusions. Our results suggest an important role of TGF-β signaling in recurrent GBM. This may have clinical implications, since TGF-β inhibitors have entered clinical phase studies and may potentially be used in combination therapy to interfere with chemoradiation resistance. Recurrent GBMs show a high incidence of early branching evolution. High tumor plasticity is confirmed at the level of microRNA and mRNA expression profiles.

Authors Elham Kashani, Désirée Schnidrig, Ali Hashemi Gheinani, Martina Selina Ninck, Philipp Zens, Theoni Maragkou, Ulrich Baumgartner, Philippe Schucht, Gunnar Rätsch, Mark Andrew Rubin, Sabina Berezowska, Charlotte KY Ng, Erik Vassella

Submitted Research Square (Preprint Platform)

Link DOI

Abstract Several recently developed single-cell DNA sequencing technologies enable whole-genome sequencing of thousands of cells. However, the ultra-low coverage of the sequenced data (<0.05× per cell) mostly limits their usage to the identification of copy number alterations in multi-megabase segments. Many tumors are not copy number-driven, and thus single-nucleotide variant (SNV)-based subclone detection may contribute to a more comprehensive view on intra-tumor heterogeneity. Due to the low coverage of the data, the identification of SNVs is only possible when superimposing the sequenced genomes of hundreds of genetically similar cells. Thus, we have developed a new approach to efficiently cluster tumor cells, based on Bayesian filtering of relevant loci and exploiting read overlap and phasing.

Authors Hana Rozhoňová, Daniel Danciu, Stefan Stark, Gunnar Rätsch, André Kahles, Kjong-Van Lehmann

Submitted Bioinformatics

Link DOI

Abstract Models that can predict adverse events ahead of time with low false-alarm rates are critical to the acceptance of decision support systems in the medical community. This challenging machine learning task remains typically treated as simple binary classification, with few bespoke methods proposed to leverage temporal dependency across samples. We propose Temporal Label Smoothing (TLS), a novel learning strategy that modulates smoothing strength as a function of proximity to the event of interest. This regularization technique reduces model confidence at the class boundary, where the signal is often noisy or uninformative, thus allowing training to focus on clinically informative data points away from this boundary region. From a theoretical perspective, we also show that our method can be framed as an extension of multi-horizon prediction, a learning heuristic proposed in other early prediction work. TLS empirically matches or outperforms considered competing methods on various early prediction benchmark tasks. In particular, our approach significantly improves performance on clinically-relevant metrics such as event recall at low false-alarm rates.
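The core idea can be illustrated in a few lines: soft targets harden with distance from the class boundary, so the loss down-weights the noisy region right at the prediction horizon. The sigmoid schedule and the `sharpness` parameter below are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def temporal_label_smoothing(time_to_event, horizon, sharpness=2.0):
    """Soft targets as a function of proximity to the event (illustrative).
    time_to_event: hours until the event (np.inf if it never occurs);
    horizon: prediction horizon in hours (t <= horizon is the positive class)."""
    t = np.asarray(time_to_event, dtype=float)
    d = (horizon - t) / horizon            # signed distance to the boundary
    return 1.0 / (1.0 + np.exp(-sharpness * d))

def smoothed_bce(p, q, eps=1e-7):
    """Binary cross-entropy of predictions p against soft targets q."""
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(q * np.log(p) + (1 - q) * np.log(1 - p))))
```

A sample one hour before the event gets a near-hard positive target, a sample at the horizon gets 0.5, and one far away gets a near-zero target.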

Authors Hugo Yèche, Alizée Pace, Gunnar Rätsch, Rita Kuznetsova

Link DOI

Abstract Mutations in the splicing factor SF3B1 occur frequently in various cancers and drive tumor progression through the activation of cryptic splice sites in multiple genes. Recent studies also demonstrate a positive correlation between the expression levels of wild-type SF3B1 and tumor malignancy. Here, we demonstrate that SF3B1 is a hypoxia-inducible factor (HIF)-1 target gene that positively regulates HIF1 pathway activity. By physically interacting with HIF1α, SF3B1 facilitates binding of the HIF1 complex to hypoxia response elements (HREs) to activate target gene expression. To further validate the relevance of this mechanism for tumor progression, we show that a reduction in SF3B1 levels via monoallelic deletion of Sf3b1 impedes tumor formation and progression via impaired HIF signaling in a mouse model for pancreatic cancer. Our work uncovers an essential role of SF3B1 in HIF1 signaling, thereby providing a potential explanation for the link between high SF3B1 expression and the aggressiveness of solid tumors.

Authors Patrik Simmler, Cédric Cortijo, Lisa Maria Koch, Patricia Galliker, Silvia Angori, Hella Anna Bolck, Christina Mueller, Ana Vukolic, Peter Mirtschink, Yann Christinat, Natalie R Davidson, Kjong-Van Lehmann, Giovanni Pellegrini, Chantal Pauli, Daniela Lenggenhager, Ilaria Guccini, Till Ringel, Christian Hirt, Kim Fabiano Marquart, Moritz Schaefer, Gunnar Rätsch, Matthias Peter, Holger Moch, Markus Stoffel, Gerald Schwank

Submitted Cell Reports

Link DOI

Abstract Understanding and predicting molecular responses towards external perturbations is a core question in molecular biology. Technological advancements in the recent past have enabled the generation of high-resolution single-cell data, making it possible to profile individual cells under different experimentally controlled perturbations. However, cells are typically destroyed during measurement, resulting in unpaired distributions over either perturbed or non-perturbed cells. Leveraging the theory of optimal transport and the recent advent of convex neural architectures, we learn a coupling describing the response of cell populations upon perturbation, enabling us to predict state trajectories on a single-cell level. We apply our approach, CellOT, to predict treatment responses of 21,650 cells subject to four different drug perturbations. CellOT outperforms current state-of-the-art methods both qualitatively and quantitatively, accurately capturing cellular behavior shifts across all different drugs.
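CellOT learns the coupling with convex neural architectures; as a simplified discrete stand-in, the sketch below computes an entropic optimal-transport coupling between two unpaired point clouds with the Sinkhorn algorithm and maps control cells to predicted perturbed states by barycentric projection. The function names, uniform marginals, and regularization value are illustrative assumptions, not the paper's method.

```python
import numpy as np

def sinkhorn_coupling(X, Y, reg=1.0, n_iter=200):
    """Entropic OT coupling between unpaired point clouds (illustrative).
    X: (n, d) control cells; Y: (m, d) perturbed cells."""
    n, m = len(X), len(Y)
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared-Euclidean cost
    K = np.exp(-C / reg)                                  # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m                 # uniform marginals
    v = np.ones(m)
    for _ in range(n_iter):                               # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]                    # coupling matrix P

def predict_response(X, Y, P):
    """Barycentric projection: predicted perturbed state of each control cell."""
    return (P @ Y) / P.sum(axis=1, keepdims=True)
```

The coupling's marginals match the two empirical distributions, so mass from every control cell is accounted for in the predicted population.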

Authors Charlotte Bunne, Stefan Stark, Gabriele Gut, Jacobo Sarabia del Castillo, Mitchell Levesque, Kjong Van Lehmann, Lucas Pelkmans, Andreas Krause, Gunnar Rätsch

Submitted bioRxiv

Link DOI

Abstract Alternative splicing (AS) is a regulatory process during mRNA maturation that shapes higher eukaryotes’ complex transcriptomes. High-throughput sequencing of RNA (RNA-Seq) allows for measurements of AS transcripts at an unprecedented depth and diversity. The ever-expanding catalog of known AS events provides biological insights into gene regulation, population genetics, and disease. Here, we present an overview of the usage of SplAdder, a graph-based alternative splicing toolbox, which can integrate an arbitrarily large number of RNA-Seq alignments and a given annotation file to augment the shared annotation based on RNA-Seq evidence. The shared augmented annotation graph is then used to identify, quantify, and confirm alternative splicing events based on the RNA-Seq data. Splice graphs for individual alignments can also be tested for significant quantitative differences between other samples or groups of samples.

Authors Philipp Markolin, Gunnar Rätsch, André Kahles

Submitted Variant Calling


Abstract The number of published metagenome assemblies is rapidly growing due to advances in sequencing technologies. However, sequencing errors, variable coverage, repetitive genomic regions, and other factors can produce misassemblies, which are challenging to detect for taxonomically novel genomic data. Assembly errors can affect all downstream analyses of the assemblies. Accuracy for the state of the art in reference-free misassembly prediction does not exceed an AUPRC of 0.57, and it is not clear how well these models generalize to real-world data. Here, we present the Residual neural network for Misassembled Contig identification (ResMiCo), a deep learning approach for reference-free identification of misassembled contigs. To develop ResMiCo, we first generated a training dataset of unprecedented size and complexity that can be used for further benchmarking and developments in the field. Through rigorous validation, we show that ResMiCo is substantially more accurate than the state of the art, and the model is robust to novel taxonomic diversity and varying assembly methods. ResMiCo estimated 4.7% misassembled contigs per metagenome across multiple real-world datasets. We demonstrate how ResMiCo can be used to optimize metagenome assembly hyperparameters to improve accuracy, instead of optimizing solely for contiguity. The accuracy, robustness, and ease-of-use of ResMiCo make the tool suitable for general quality control of metagenome assemblies and assembly methodology optimization.

Authors Olga Mineeva, Daniel Danciu, Bernhard Schölkopf, Ruth E. Ley, Gunnar Rätsch, Nicholas D. Youngblut

Submitted bioRxiv

Link DOI

Abstract Complex multivariate time series arise in many fields, ranging from computer vision to robotics or medicine. Often we are interested in the independent underlying factors that give rise to the high-dimensional data we are observing. While many models have been introduced to learn such disentangled representations, only a few attempt to explicitly exploit the structure of sequential data. We investigate the disentanglement properties of Gaussian process variational autoencoders, a recently introduced class of models that has been successful in different tasks on time series data. Our model exploits the temporal structure of the data by modeling each latent channel with a GP prior and employing a structured variational distribution that can capture dependencies in time. We demonstrate the competitiveness of our approach against state-of-the-art unsupervised and weakly-supervised disentanglement methods on a benchmark task. Moreover, we provide evidence that we can learn meaningful disentangled representations on real-world medical time series data.

Authors Simon Bing, Vincent Fortuin, Gunnar Rätsch

Submitted AABI 2022


Abstract Multi-layered omics technologies can help define relationships between genetic factors, biochemical processes and phenotypes, thus extending research of monogenic diseases beyond identifying their cause. We implemented a multi-layered omics approach for the inherited metabolic disorder methylmalonic aciduria. We performed whole genome sequencing, transcriptomic sequencing, and mass spectrometry-based proteotyping from matched primary fibroblast samples of 230 individuals (210 affected, 20 controls) and related the molecular data to 105 phenotypic features. Integrative analysis identified a molecular diagnosis for 84% (179/210) of affected individuals, the majority (150) of whom had pathogenic variants in methylmalonyl-CoA mutase (MMUT). Untargeted integration of all three omics layers revealed dysregulation of the TCA cycle and surrounding metabolic pathways, a finding that was further supported by multi-organ metabolomics of a hemizygous Mmut mouse model. Stratification by phenotypic severity indicated downregulation of oxoglutarate dehydrogenase and upregulation of glutamate dehydrogenase in disease. This was supported by metabolomics and isotope tracing studies, which showed increased glutamine-derived anaplerosis. We further showed that MMUT physically interacts with both oxoglutarate dehydrogenase and glutamate dehydrogenase, providing a mechanistic link. This study emphasizes the utility of a multi-modal omics approach to investigate metabolic diseases and highlights glutamine anaplerosis as a potential therapeutic intervention point in methylmalonic aciduria.

Authors Patrick Forny, Ximena Bonilla, David Lamparter, Wenguang Shao, Tanja Plessl, Caroline Frei, Anna Bingisser, Sandra Goetze, Audrey van Drogen, Keith Harshmann, Patrick GA Pedrioli, Cedric Howald, Florian Traversi, Sarah Cherkaoui, Raphael J Morscher, Luke Simmons, Merima Forny, Ioannis Xenarios, Ruedi Aebersold, Nicola Zamboni, Gunnar Rätsch, Emmanouil Dermitzakis, Bernd Wollscheid, Matthias R Baumgartner, D Sean Froese

Submitted medRxiv

Link DOI

Abstract We propose a stochastic conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms. Existing CGM variants for this template either suffer from slow convergence rates, or require carefully increasing the batch size over the course of the algorithm’s execution, which leads to computing full gradients. In contrast, the proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques. In applications we put special emphasis on problems with a large number of separable constraints. Such problems are prevalent among semidefinite programming (SDP) formulations arising in machine learning and theoretical computer science. We provide numerical experiments on matrix completion, unsupervised clustering, and sparsest-cut SDPs.
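To make the template concrete, here is a toy sketch of a stochastic Frank-Wolfe (conditional gradient) loop with a SAG estimator on a finite-sum least-squares objective over the probability simplex. The problem, step-size schedule, and function name are illustrative assumptions; this is not the paper's exact algorithm or its convergence-optimal variant.

```python
import numpy as np

def sag_frank_wolfe(A, b, n_iters=2000, seed=0):
    """Stochastic conditional gradient method with a SAG estimator (toy sketch).
    Minimizes f(x) = (1/n) * sum_i 0.5 * (a_i . x - b_i)^2 over the simplex,
    drawing one sample per iteration and averaging stored per-sample gradients."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.ones(d) / d                         # start at the simplex center
    grads = np.zeros((n, d))                   # per-sample gradient table
    g_avg = np.zeros(d)                        # running SAG average
    for t in range(n_iters):
        i = rng.integers(n)
        g_new = (A[i] @ x - b[i]) * A[i]       # gradient of sample i at x
        g_avg += (g_new - grads[i]) / n        # update the average in O(d)
        grads[i] = g_new
        s = np.zeros(d)
        s[np.argmin(g_avg)] = 1.0              # simplex LMO: cheapest vertex
        gamma = 2.0 / (t + 2.0)                # standard Frank-Wolfe step size
        x = (1 - gamma) * x + gamma * s
    return x
```

Each iteration touches one sample and one vertex, which is the appeal of the template: no projections and no full-gradient passes.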

Authors Gideon Dresdner, Maria-Luiza Vladarean, Gunnar Rätsch, Francesco Locatello, Volkan Cevher, Alp Yurtsever

Submitted Proceedings of The 25th International Conference on Artificial Intelligence and Statistics (AISTATS-22)


Abstract The recent success of machine learning methods applied to time series collected from Intensive Care Units (ICU) exposes the lack of standardized machine learning benchmarks for developing and comparing such methods. While raw datasets, such as MIMIC-IV or eICU, can be freely accessed on Physionet, tasks and pre-processing steps are often chosen ad hoc for each publication, limiting comparability across publications. In this work, we aim to improve this situation by providing a benchmark covering a large spectrum of ICU-related tasks. Using the HiRID dataset, we define multiple clinically relevant tasks in collaboration with clinicians. In addition, we provide a reproducible end-to-end pipeline to construct both data and labels. Finally, we provide an in-depth analysis of current state-of-the-art sequence modeling methods, highlighting some limitations of deep learning approaches for this type of data. With this benchmark, we hope to enable the research community to compare their work fairly.

Authors Hugo Yèche, Rita Kuznetsova, Marc Zimmermann, Matthias Hüser, Xinrui Lyu, Martin Faltys, Gunnar Rätsch

Submitted NeurIPS 2021 (Datasets and Benchmarks)


Abstract High-throughput sequencing data is rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in solving the experiment discovery problem and building compressed representations of annotated de Bruijn graphs where k-mer sets can be efficiently indexed and interactively queried. However, approaches for representing and retrieving other quantitative attributes such as gene expression or genome positions in a general manner have yet to be developed. In this work, we propose the concept of Counting de Bruijn graphs generalizing the notion of annotated (or colored) de Bruijn graphs. Counting de Bruijn graphs supplement each node-label relation with one or many attributes (e.g., a k-mer count or its positions in genome). To represent them, we first observe that many schemes for the representation of compressed binary matrices already support the rank operation on the columns or rows, which can be used to define an inherent indexing of any additional quantitative attributes. Based on this property, we generalize these schemes and introduce a new approach for representing non-binary sparse matrices in compressed data structures. Finally, we notice that relation attributes are often easily predictable from a node’s local neighborhood in the graph. Notable examples are genome positions shifting by 1 for neighboring nodes in the graph, or expression levels that are often shared across neighbors. We exploit this regularity of graph annotations and apply an invertible delta-like coding to achieve better compression. We show that Counting de Bruijn graphs index k-mer counts from 2,652 human RNA-Seq read sets in representations over 8-fold smaller and yet faster to query compared to state-of-the-art bioinformatics tools. 
Furthermore, Counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip -9 for human Illumina RNA-Seq and 57% smaller for PacBio HiFi sequencing of viral samples. A complete joint searchable index of all viral PacBio SMRT reads from NCBI’s SRA (152,884 read sets, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, they generate a lossless and fully queryable index that is 4.4-fold smaller compared to the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools employing de Bruijn graphs and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed and fully searchable graph-based sequence indexes.
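The core indexing idea (attributes addressed by the rank of set bits in the binary relation matrix) can be sketched in a few lines. This toy class is an illustration of the principle only, not MetaGraph's compressed implementation: a real index would use a succinct bit vector with O(1) rank support rather than a dense array.

```python
import numpy as np

class CountingRelation:
    """Toy sketch: attributes for the set bits of a binary node-by-label
    matrix live in a flat array, addressed by rank in row-major order."""

    def __init__(self, binary_matrix, attributes):
        self.M = np.asarray(binary_matrix, dtype=bool)
        flat = self.M.ravel()
        # rank[i] = number of set bits strictly before flat position i
        self.rank = np.concatenate(([0], np.cumsum(flat)[:-1]))
        assert len(attributes) == flat.sum()
        self.attr = np.asarray(attributes)

    def get(self, node, label):
        """Attribute (e.g. a k-mer count) for (node, label), or None
        if the relation bit is unset."""
        if not self.M[node, label]:
            return None
        pos = node * self.M.shape[1] + label
        return self.attr[self.rank[pos]]
```

Because rank is already supported by many compressed binary-matrix schemes, the attribute array needs no per-bit pointers, which is what makes the representation small.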

Authors Mikhail Karasikov, Harun Mustafa, Gunnar Rätsch, André Kahles

Submitted RECOMB 2022


Abstract Marginal-likelihood-based model selection, even though promising, is rarely used in deep learning due to estimation difficulties. Instead, most approaches rely on validation data, which may not be readily available. In this work, we present a scalable marginal-likelihood estimation method to select both hyperparameters and network architectures, based on the training data alone. Some hyperparameters can be estimated online during training, simplifying the procedure. Our marginal-likelihood estimate is based on Laplace's method and Gauss-Newton approximations to the Hessian, and it outperforms cross-validation and manual tuning on standard regression and image classification datasets, especially in terms of calibration and out-of-distribution detection. Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable (e.g., in nonstationary settings).
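For a linear model the Laplace/Gauss-Newton approximation to the marginal likelihood is exact, which makes the selection principle easy to illustrate without a neural network: score each prior precision on the training data alone and keep the best. The variable names and the fixed noise variance below are illustrative assumptions.

```python
import numpy as np

def log_marginal_likelihood(X, y, alpha, sigma2=0.1):
    """Exact log marginal likelihood of Bayesian linear regression with
    prior precision alpha and noise variance sigma2 (illustrative sketch)."""
    n, d = X.shape
    # Posterior precision = Gauss-Newton Hessian of the log joint
    H = X.T @ X / sigma2 + alpha * np.eye(d)
    m = np.linalg.solve(H, X.T @ y / sigma2)          # posterior mean
    fit = -0.5 * ((y - X @ m) @ (y - X @ m) / sigma2 + alpha * m @ m)
    const = -0.5 * n * np.log(2 * np.pi * sigma2)
    occam = 0.5 * d * np.log(alpha) - 0.5 * np.linalg.slogdet(H)[1]
    return const + fit + occam

# Hyperparameter selection without a validation split:
# alphas = np.logspace(-3, 3, 25)
# best = max(alphas, key=lambda a: log_marginal_likelihood(X, y, a))
```

The fit term rewards explaining the data while the Occam term penalizes both overly flexible (small alpha) and overly rigid (large alpha) priors, so the objective peaks at an intermediate value.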

Authors Alexander Immer, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, Mohammad Emtiyaz Khan

Submitted ICML 2021


Authors Patrik T Simmler, Tamara Mengis, Kjong-Van Lehmann, André Kahles, Tinu Thomas, Gunnar Rätsch, Markus Stoffel, Gerald Schwank

Submitted bioRxiv


Abstract Intensive care units (ICU) are increasingly looking towards machine learning for methods to provide online monitoring of critically ill patients. In machine learning, online monitoring is often formulated as a supervised learning problem. Recently, contrastive learning approaches have demonstrated promising improvements over competitive supervised benchmarks. These methods rely on well-understood data augmentation techniques developed for image data, which do not apply to online monitoring. In this work, we overcome this limitation by supplementing time-series data augmentation techniques with a novel contrastive learning objective which we call neighborhood contrastive learning (NCL). Our objective explicitly groups together contiguous time segments from each patient while maintaining state-specific information. Our experiments demonstrate a marked improvement over existing work applying contrastive methods to medical time-series.
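The grouping idea can be sketched as an InfoNCE-style loss in which segments from the same patient within a small time window act as positives. This is a simplification of the paper's objective, with assumed names and parameters (`w`, `temp`), not the published formulation.

```python
import numpy as np

def ncl_loss(z, patient, t, w=2, temp=0.1):
    """Toy neighborhood-contrastive objective (illustrative simplification).
    z: (n, d) embeddings; patient, t: (n,) patient ids and segment positions.
    Segments of the same patient within w steps are positives."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temp
    neigh = (patient[:, None] == patient[None, :]) & \
            (np.abs(t[:, None] - t[None, :]) <= w)
    np.fill_diagonal(neigh, False)
    loss, count = 0.0, 0
    for i in range(len(z)):
        logits = np.delete(sim[i], i)                  # drop self-similarity
        pos = np.delete(neigh[i], i)
        if not pos.any():
            continue
        log_denom = np.log(np.exp(logits).sum())
        loss += np.mean(log_denom - logits[pos])       # InfoNCE over positives
        count += 1
    return loss / count
```

Embeddings that align a patient's neighboring segments score a lower loss than embeddings that scatter them, which is the behavior the objective rewards.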

Authors Hugo Yèche, Gideon Dresdner, Francesco Locatello, Matthias Hüser, Gunnar Rätsch

Submitted ICML 2021


Abstract We present a global atlas of 4,728 metagenomic samples from mass-transit systems in 60 cities over 3 years, representing the first systematic, worldwide catalog of the urban microbial ecosystem. This atlas provides an annotated, geospatial profile of microbial strains, functional characteristics, antimicrobial resistance (AMR) markers, and genetic elements, including 10,928 viruses, 1,302 bacteria, 2 archaea, and 838,532 CRISPR arrays not found in reference databases. We identified 4,246 known species of urban microorganisms and a consistent set of 31 species found in 97% of samples that were distinct from human commensal organisms. Profiles of AMR genes varied widely in type and density across cities. Cities showed distinct microbial taxonomic signatures that were driven by climate and geographic differences. These results constitute a high-resolution global metagenomic atlas that enables discovery of organisms and genes, highlights potential public health and forensic applications, and provides a culture-independent view of AMR burden in cities.

Authors David Danko, Daniela Bezdan, Evan E. Afshin, Sofia Ahsanuddin, Chandrima Bhattacharya, Daniel J. Butler, Kern Rei Chng, Daisy Donnellan, Jochen Hecht, Katelyn Jackson, Katerina Kuchin, Mikhail Karasikov, Abigail Lyons, Lauren Mak, Dmitry Meleshko, Harun Mustafa, Beth Mutai, Russell Y. Neches, Amanda Ng, Olga Nikolayeva, Tatyana Nikolayeva, Eileen Png, Krista A. Ryon, Jorge L. Sanchez, Heba Shaaban, Maria A. Sierra, Dominique Thomas, Ben Young, Omar O. Abudayyeh, Josue Alicea, Malay Bhattacharyya, Ran Blekhman, Eduardo Castro-Nallar, Ana M. Cañas, Aspassia D. Chatziefthimiou, Robert W. Crawford, Francesca De Filippis, Youping Deng, Christelle Desnues, Emmanuel Dias-Neto, Marius Dybwad, Eran Elhaik, Danilo Ercolini, Alina Frolova, Dennis Gankin, Jonathan S. Gootenberg, Alexandra B. Graf, David C. Green, Iman Hajirasouliha, Jaden J.A. Hastings, Mark Hernandez, Gregorio Iraola, Soojin Jang, Andre Kahles, Frank J. Kelly, Kaymisha Knights, Nikos C. Kyrpides, Paweł P. Łabaj, Patrick K.H. Lee, Marcus H.Y. Leung, Per O. Ljungdahl, Gabriella Mason-Buck, Ken McGrath, Cem Meydan, Emmanuel F. Mongodin, Milton Ozorio Moraes, Niranjan Nagarajan, Marina Nieto-Caballero, Houtan Noushmehr, Manuela Oliveira, Stephan Ossowski, Olayinka O. Osuolale, Orhan Özcan, David Paez-Espino, Nicolás Rascovan, Hugues Richard, Gunnar Rätsch, Lynn M. Schriml, Torsten Semmler, Osman U. Sezerman, Leming Shi, Tieliu Shi, Rania Siam, Le Huu Song, Haruo Suzuki, Denise Syndercombe Court, Scott W. Tighe, Xinzhao Tong, Klas I. Udekwu, Juan A. Ugalde, Brandon Valentine, Dimitar I. Vassilev, Elena M. Vayndorf, Thirumalaisamy P. Velavan, Jun Wu, María M. Zambrano, Jifeng Zhu, Sibo Zhu, Christopher E. Mason, The International MetaSUB Consortium

Submitted Cell

Link DOI

Abstract The development of respiratory failure is common among patients in intensive care units (ICU). Large data quantities from ICU patient monitoring systems make timely and comprehensive analysis by clinicians difficult but are ideal for automatic processing by machine learning algorithms. Early prediction of respiratory system failure could alert clinicians to patients at risk of respiratory failure and allow for early patient reassessment and treatment adjustment. We propose an early warning system that predicts moderate/severe respiratory failure up to 8 hours in advance. Our system was trained on HiRID-II, a data-set containing more than 60,000 admissions to a tertiary care ICU. An alarm is typically triggered several hours before the beginning of respiratory failure. Our system outperforms a clinical baseline mimicking traditional clinical decision-making based on pulse-oximetric oxygen saturation and the fraction of inspired oxygen. To provide model introspection and diagnostics, we developed an easy-to-use web browser-based system to explore model input data and predictions visually.
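The clinical baseline mentioned above, based on pulse-oximetric oxygen saturation and the fraction of inspired oxygen, can be illustrated with a minimal sketch. The threshold below is illustrative only, not the criterion used in the paper:

```python
# Minimal sketch (hypothetical threshold) of an SpO2/FiO2-ratio alarm rule,
# the kind of traditional decision-making the learned system is compared to.

def sf_ratio_alarm(spo2_percent, fio2_fraction, threshold=214.0):
    """Return True if the SpO2/FiO2 ratio suggests risk of respiratory failure.

    spo2_percent:  pulse-oximetric oxygen saturation, e.g. 95 (%)
    fio2_fraction: fraction of inspired oxygen, e.g. 0.21 for room air
    threshold:     illustrative cutoff; real clinical criteria are richer
    """
    sf = spo2_percent / fio2_fraction
    return sf < threshold

# A patient breathing room air with normal saturation triggers no alarm:
assert not sf_ratio_alarm(97, 0.21)   # S/F ~ 462
# Low saturation despite high supplemental oxygen does:
assert sf_ratio_alarm(88, 0.60)       # S/F ~ 147
```

A fixed-threshold rule like this uses only the current measurement; the proposed system instead aggregates the full monitoring history to alarm hours earlier.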

Authors Matthias Hüser, Martin Faltys, Xinrui Lyu, Chris Barber, Stephanie L. Hyland, Thomas M. Merz, Gunnar Rätsch

Submitted arXiv Preprints


Abstract Pancreatic adenocarcinoma (PDAC) epitomizes a deadly cancer driven by abnormal KRAS signaling. Here, we show that the eIF4A RNA helicase is required for translation of key KRAS signaling molecules and that pharmacological inhibition of eIF4A has single-agent activity against murine and human PDAC models at safe dose levels. EIF4A was uniquely required for the translation of mRNAs with long and highly structured 5′ untranslated regions, including those with multiple G-quadruplex elements. Computational analyses identified these features in mRNAs encoding KRAS and key downstream molecules. Transcriptome-scale ribosome footprinting accurately identified eIF4A-dependent mRNAs in PDAC, including critical KRAS signaling molecules such as PI3K, RALA, RAC2, MET, MYC, and YAP1. These findings contrast with a recent study that relied on an older method, polysome fractionation, and implicated redox-related genes as eIF4A clients. Together, our findings highlight the power of ribosome footprinting in conjunction with deep RNA sequencing in accurately decoding translational control mechanisms and define the therapeutic mechanism of eIF4A inhibitors in PDAC.

Authors Kamini Singh, Jianan Lin, Nicolas Lecomte, Prathibha Mohan, Askan Gokce, Viraj R Sanghvi, Man Jiang, Olivera Grbovic-Huezo, Antonija Burčul, Stefan G Stark, Paul B Romesser, Qing Chang, Jerry P Melchor, Rachel K Beyer, Mark Duggan, Yoshiyuki Fukase, Guangli Yang, Ouathek Ouerfelli, Agnes Viale, Elisa De Stanchina, Andrew W Stamford, Peter T Meinke, Gunnar Rätsch, Steven D Leach, Zhengqing Ouyang, Hans-Guido Wendel

Submitted Journal Cancer research

Link DOI

Abstract Conventional variational autoencoders fail in modeling correlations between data points due to their use of factorized priors. Amortized Gaussian process inference through GP-VAEs has led to significant improvements in this regard, but is still inhibited by the intrinsic complexity of exact GP inference. We improve the scalability of these methods through principled sparse inference approaches. We propose a new scalable GP-VAE model that outperforms existing approaches in terms of runtime and memory footprint, is easy to implement, and allows for joint end-to-end optimization of all components.
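The core idea behind sparse GP inference, approximating the full kernel matrix through a small set of inducing points, can be sketched as follows. This is a generic Nyström-style approximation, not the specific sparse GP-VAE model proposed in the paper:

```python
import numpy as np

# Generic inducing-point (Nystrom) approximation: the n x n kernel matrix
# K_nn is approximated by K_nm K_mm^{-1} K_mn with m << n inducing inputs,
# reducing storage from O(n^2) to O(nm).

def rbf(a, b, ls=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))   # 200 input locations
z = np.linspace(0, 10, 30)             # 30 inducing inputs

Knn = rbf(x, x)                        # exact 200 x 200 kernel matrix
Knm = rbf(x, z)
Kmm = rbf(z, z) + 1e-6 * np.eye(len(z))  # jitter for numerical stability
Knn_approx = Knm @ np.linalg.solve(Kmm, Knm.T)

# With inducing points spaced well below the lengthscale, the error is small:
assert np.abs(Knn - Knn_approx).max() < 0.1
```

The scalable GP-VAE builds such sparse approximations into the amortized inference network so all components can be optimized end to end.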

Authors Metod Jazbec, Vincent Fortuin, Michael Pearce, Stephan Mandt, Gunnar Rätsch

Submitted AISTATS 2021


Abstract Generating interpretable visualizations of multivariate time series in the intensive care unit is of great practical importance. Clinicians seek to condense complex clinical observations into intuitively understandable critical illness patterns, like failures of different organ systems. They would greatly benefit from a low-dimensional representation in which the trajectories of the patients' pathology become apparent and relevant health features are highlighted. To this end, we propose to use the latent topological structure of Self-Organizing Maps (SOMs) to achieve an interpretable latent representation of ICU time series and combine it with recent advances in deep clustering. Specifically, we (a) present a novel way to fit SOMs with probabilistic cluster assignments (PSOM), (b) propose a new deep architecture for probabilistic clustering (DPSOM) using a VAE, and (c) extend our architecture to cluster and forecast clinical states in time series (T-DPSOM). We show that our model achieves superior clustering performance compared to state-of-the-art SOM-based clustering methods while maintaining the favorable visualization properties of SOMs. On the eICU data-set, we demonstrate that T-DPSOM provides interpretable visualizations of patient state trajectories and uncertainty estimation. We show that our method rediscovers well-known clinical patient characteristics, such as a dynamic variant of the Acute Physiology And Chronic Health Evaluation (APACHE) score. Moreover, we illustrate how it can disentangle individual organ dysfunctions on disjoint regions of the two-dimensional SOM map.

Authors Laura Manduchi, Matthias Hüser, Martin Faltys, Julia Vogt, Gunnar Rätsch, Vincent Fortuin

Submitted ACM-CHIL 2021


Abstract Dynamic assessment of mortality risk in the intensive care unit (ICU) can be used to stratify patients, inform about treatment effectiveness or serve as part of an early-warning system. Static risk scoring systems, such as APACHE or SAPS, have recently been supplemented with data-driven approaches that track the dynamic mortality risk over time. Recent works have focused on enhancing the information delivered to clinicians even further by producing full survival distributions instead of point predictions or fixed horizon risks. In this work, we propose a non-parametric ensemble model, Weighted Resolution Survival Ensemble (WRSE), tailored to estimate such dynamic individual survival distributions. Inspired by the simplicity and robustness of ensemble methods, the proposed approach combines a set of binary classifiers spaced according to a decay function reflecting the relevance of short-term mortality predictions. Models and baselines are evaluated under weighted calibration and discrimination metrics for individual survival distributions which closely reflect the utility of a model in ICU practice. We show competitive results with state-of-the-art probabilistic models, while greatly reducing training time by factors of 2-9x.

Authors Jonathan Heitz, Joanna Ficek, Martin Faltys, Tobias M. Merz, Gunnar Rätsch, Matthias Hüser

Submitted Proceedings of the AAAI-2021 - Spring Symposium on Survival Prediction


Abstract Intra-tumor hypoxia is a common feature in many solid cancers. Although transcriptional targets of hypoxia-inducible factors (HIFs) have been well characterized, alternative splicing or processing of pre-mRNA transcripts which occurs during hypoxia and subsequent HIF stabilization is much less understood. Here, we identify many HIF-dependent alternative splicing events after whole transcriptome sequencing in pancreatic cancer cells exposed to hypoxia with and without downregulation of the aryl hydrocarbon receptor nuclear translocator (ARNT), a protein required for HIFs to form a transcriptionally active dimer. We correlate the discovered hypoxia-driven events with available sequencing data from pan-cancer TCGA patient cohorts to select a narrow set of putative biologically relevant splice events for experimental validation. We validate a small set of candidate HIF-dependent alternative splicing events in multiple human gastrointestinal cancer cell lines as well as patient-derived human pancreatic cancer organoids. Lastly, we report the discovery of a HIF-dependent mechanism to produce a hypoxia-dependent, long and coding isoform of the UDP-N-acetylglucosamine transporter SLC35A3.

Authors Philipp Markolin, Natalie Davidson, Christian K Hirt, Christophe D Chabbert, Nicola Zamboni, Gerald Schwank, Wilhelm Krek, Gunnar Rätsch

Submitted Genomics

Link DOI

Abstract Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently-used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. In this paper, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of vertices adjacent in the graph. RowDiff can be constructed in linear time relative to the number of vertices and labels in the graph, and in space proportional to the graph size. In addition, construction can be efficiently parallelized and distributed, making the technique applicable to graphs with trillions of nodes. RowDiff can be viewed as an intermediary sparsification step of the original annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrices. Experiments on 10,000 RNA-seq datasets show that RowDiff combined with Multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST, the previously known most compact annotation representation. Experiments on the sparser Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a Multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST.
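The central idea, storing per-vertex label differences along paths to designated anchor vertices instead of full label sets, can be illustrated with a toy sketch in which plain Python sets stand in for the succinct data structures:

```python
# Toy sketch of the RowDiff idea: adjacent vertices tend to carry nearly
# identical annotations, so store only symmetric differences along a chain
# of successors, with full labels kept only at anchor vertices.

def rowdiff_compress(labels, succ):
    """labels: {node: set of labels}; succ: {node: successor or None (anchor)}."""
    diffs = {}
    for node, labs in labels.items():
        s = succ[node]
        diffs[node] = set(labs) if s is None else labs ^ labels[s]
    return diffs

def rowdiff_lookup(node, diffs, succ):
    """Reconstruct a node's labels by XOR-ing diffs down to its anchor."""
    labs = set()
    while node is not None:
        labs ^= diffs[node]
        node = succ[node]
    return labs

labels = {"A": {1, 2}, "B": {1, 2, 3}, "C": {1, 3}}
succ = {"A": "B", "B": "C", "C": None}  # C is the anchor with full labels
diffs = rowdiff_compress(labels, succ)
assert all(rowdiff_lookup(n, diffs, succ) == labels[n] for n in labels)
assert diffs["A"] == {3}  # only the single differing label is stored for A
```

The real scheme bounds path lengths to anchors and stores the diff rows in compressed binary-matrix representations such as Multi-BRWT.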

Authors Daniel Danciu, Mikhail Karasikov, Harun Mustafa, André Kahles, Gunnar Rätsch

Submitted ISMB/ECCB 2021


Abstract The sharp increase in next-generation sequencing technologies’ capacity has created a demand for algorithms capable of quickly searching a large corpus of biological sequences. The complexity of biological variability and the magnitude of existing data sets have impeded finding algorithms with guaranteed accuracy that efficiently run in practice. Our main contribution is the Tensor Sketch method that efficiently and accurately estimates edit distances. In our experiments, Tensor Sketch had 0.88 Spearman’s rank correlation with the exact edit distance, almost doubling the 0.466 correlation of the closest competitor while running 8.8 times faster. Finally, all sketches can be updated dynamically if the input is a sequence stream, making it appealing for large-scale applications where data cannot fit into memory. Conceptually, our approach has three steps: 1) represent sequences as tensors over their sub-sequences, 2) apply tensor sketching that preserves tensor inner products, 3) implicitly compute the sketch. The sub-sequences, which are not necessarily contiguous pieces of the sequence, allow us to outperform k-mer-based methods, such as min-hash sketching over a set of k-mers. Typically, the number of sub-sequences grows exponentially with the sub-sequence length, introducing both memory and time overheads. We directly address this problem in steps 2 and 3 of our method. While rank-1 and super-symmetric tensors are known to admit efficient sketching, the sub-sequence tensor satisfies neither of these properties. Hence, we propose a new sketching scheme that completely avoids the need for constructing the ambient space. Our tensor-sketching technique’s main advantages are three-fold: 1) Tensor Sketch has higher accuracy than any of the other assessed sketching methods used in practice. 2) All sketches can be computed in a streaming fashion, leading to significant time and memory savings when there is overlap between input sequences. 
3) It is straightforward to extend tensor sketching to different settings leading to efficient methods for related sequence analysis tasks. We view tensor sketching as a framework to tackle a wide range of relevant bioinformatics problems, and we are confident that it can bring significant improvements for applications based on edit distance estimation.
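For context, the k-mer min-hash baseline that Tensor Sketch is compared against can be sketched in a few lines. This is an illustrative MinHash over k-mer sets, not the tensor sketching scheme itself:

```python
import hashlib

# Classic MinHash over k-mer sets: per hash seed, keep the minimum hash
# value; the fraction of agreeing minima estimates the Jaccard similarity.

def kmers(s, k=3):
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def minhash(kmer_set, num_hashes=64):
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.blake2b(km.encode(), digest_size=8,
                                salt=seed.to_bytes(8, "big")).hexdigest(), 16)
            for km in kmer_set))
    return sig

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a, b = "ACGTACGTAC", "ACGTACGTAG"
true_j = len(kmers(a) & kmers(b)) / len(kmers(a) | kmers(b))
est_j = estimated_jaccard(minhash(kmers(a)), minhash(kmers(b)))
assert abs(true_j - est_j) < 0.3   # unbiased, but coarse with 64 hashes
```

Because contiguous k-mers are brittle under insertions and deletions, such set-based estimates correlate only loosely with edit distance, which is the gap the sub-sequence tensors above are designed to close.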

Authors Amir Joudaki, Gunnar Rätsch, André Kahles

Submitted RECOMB 2021


Abstract Motivation Deep learning techniques have yielded tremendous progress in the field of computational biology over the last decade; however, many of these techniques are opaque to the user. To provide interpretable results, methods have incorporated biological priors directly into the learning task; one such biological prior is pathway structure. While pathways represent most biological processes in the cell, the high level of correlation and hierarchical structure make it complicated to determine an appropriate computational representation. Results Here, we present pathway module Variational Autoencoder (pmVAE). Our method encodes pathway information by restricting the structure of our VAE to mirror gene-pathway memberships. Its architecture is composed of a set of subnetworks, which we refer to as pathway modules. The subnetworks learn interpretable latent representations by factorizing the latent space according to pathway gene sets. We directly address correlation between pathways by balancing a module-specific local loss and a global reconstruction loss. Furthermore, since many pathways are by nature hierarchical and therefore the product of multiple downstream signals, we model each pathway as a multidimensional vector. Due to their factorization over pathways, the representations allow for easy and interpretable analysis of multiple downstream effects, such as cell type and biological stimulus, within the contexts of each pathway. We compare pmVAE against two other state-of-the-art methods on two single-cell RNA-seq case-control data sets, demonstrating that our pathway representations are both more discriminative and consistent in detecting pathways targeted by a perturbation.
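The structural restriction described above, where each latent module is connected only to the genes of its pathway, amounts to masking a dense weight matrix by gene-pathway membership. A toy sketch with hypothetical gene and pathway names:

```python
import numpy as np

# Toy sketch: a binary genes x pathways mask zeroes out encoder weights
# between a gene and any pathway module it is not annotated to.
# Gene and pathway names are hypothetical.

genes = ["g1", "g2", "g3", "g4"]
pathways = {"glycolysis": {"g1", "g2"}, "apoptosis": {"g1", "g3", "g4"}}

mask = np.array([[g in members for members in pathways.values()]
                 for g in genes], dtype=float)    # genes x pathways

rng = np.random.default_rng(0)
W = rng.normal(size=mask.shape) * mask            # masked encoder weights

# g2 can only influence the glycolysis module, never apoptosis:
assert W[genes.index("g2"), 1] == 0.0
assert mask.sum() == 5                            # 2 + 3 memberships
```

In pmVAE each pathway column would be a multidimensional latent vector rather than a single unit, but the masking principle is the same.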

Authors Gilles Gut, Stefan G Stark, Gunnar Rätsch, Natalie R Davidson

Submitted biorxiv

Link DOI

Abstract The application and integration of molecular profiling technologies create novel opportunities for personalized medicine. Here, we introduce the Tumor Profiler Study, an observational trial combining a prospective diagnostic approach to assess the relevance of in-depth tumor profiling to support clinical decision-making with an exploratory approach to improve the biological understanding of the disease.

Authors Anja Irmisch, Ximena Bonilla, Stéphane Chevrier, Kjong-Van Lehmann, Franziska Singer, Nora C. Toussaint, Cinzia Esposito, Julien Mena, Emanuela S. Milani, Ruben Casanova, Daniel J. Stekhoven, Rebekka Wegmann, Francis Jacob, Bettina Sobottka, Sandra Goetze, Jack Kuipers, Jacobo Sarabia del Castillo, Michael Prummer, Mustafa A. Tuncel, Ulrike Menzel, Andrea Jacobs, Stefanie Engler, Sujana Sivapatham, Anja L. Frei, Gabriele Gut, Joanna Ficek, Nicola Miglino, Melike Ak, Faisal S. Al-Quaddoomi, Jonas Albinus, Ilaria Alborelli, Sonali Andani, Per-Olof Attinger, Daniel Baumhoer, Beatrice Beck-Schimmer, Lara Bernasconi, Anne Bertolini, Natalia Chicherova, Maya D'Costa, Esther Danenberg, Natalie Davidson, Monica-Andreea Drăgan, Martin Erkens, Katja Eschbach, André Fedier, Pedro Ferreira, Bruno Frey, Linda Grob, Detlef Günther, Martina Haberecker, Pirmin Haeuptle, Sylvia Herter, Rene Holtackers, Tamara Huesser, Tim M. Jaeger, Katharina Jahn, Alva R. James, Philip M. Jermann, André Kahles, Abdullah Kahraman, Werner Kuebler, Christian P. Kunze, Christian Kurzeder, Sebastian Lugert, Gerd Maass, Philipp Markolin, Julian M. Metzler, Simone Muenst, Riccardo Murri, Charlotte K.Y. Ng, Stefan Nicolet, Marta Nowak, Patrick G.A. Pedrioli, Salvatore Piscuoglio, Mathilde Ritter, Christian Rommel, María L. Rosano-González, Natascha Santacroce, Ramona Schlenker, Petra C. Schwalie, Severin Schwan, Tobias Schär, Gabriela Senti, Vipin T. Sreedharan, Stefan Stark, Tinu M. Thomas, Vinko Tosevski, Marina Tusup, Audrey Van Drogen, Marcus Vetter, Tatjana Vlajnic, Sandra Weber, Walter P. Weber, Michael Weller, Fabian Wendt, Norbert Wey, Mattheus H.E. Wildschut, Shuqing Yu, Johanna Ziegler, Marc Zimmermann, Martin Zoche, Gregor Zuend, Rudolf Aebersold, Marina Bacac, Niko Beerenwinkel, Christian Beisel, Bernd Bodenmiller, Reinhard Dummer, Viola Heinzelmann-Schwarz, Viktor H. Koelzer, Markus G. Manz, Holger Moch, Lucas Pelkmans, Berend Snijder, Alexandre P.A. 
Theocharides, Markus Tolnay, Andreas Wicki, Bernd Wollscheid, Gunnar Rätsch, Mitchell P. Levesque

Submitted Cancer Cell (Commentary)

Link DOI

Abstract Isotropic Gaussian priors are the de facto standard for modern Bayesian neural network inference. However, it is unclear whether these priors accurately reflect our true beliefs about the weight distributions or give optimal performance. To find better priors, we study summary statistics of neural network weights in networks trained using SGD. We find that convolutional neural network (CNN) weights display strong spatial correlations, while fully connected networks (FCNNs) display heavy-tailed weight distributions. Building these observations into priors leads to improved performance on a variety of image classification datasets. Surprisingly, these priors mitigate the cold posterior effect in FCNNs, but slightly increase the cold posterior effect in ResNets.

Authors Vincent Fortuin, Adrià Garriga-Alonso, Florian Wenzel, Gunnar Rätsch, Richard Turner, Mark van der Wilk, Laurence Aitchison

Submitted AABI 2021


Abstract Tumors contain multiple subpopulations of genetically distinct cancer cells. Reconstructing their evolutionary history can improve our understanding of how cancers develop and respond to treatment. Subclonal reconstruction methods cluster mutations into groups that co-occur within the same subpopulations, estimate the frequency of cells belonging to each subpopulation, and infer the ancestral relationships among the subpopulations by constructing a clone tree. However, often multiple clone trees are consistent with the data and current methods do not efficiently capture this uncertainty; nor can these methods scale to clone trees with a large number of subclonal populations. Here, we formalize the notion of a partially-defined clone tree (partial clone tree for short) that defines a subset of the pairwise ancestral relationships in a clone tree, thereby implicitly representing the set of all clone trees that have these defined pairwise relationships. Also, we introduce a special partial clone tree, the Maximally-Constrained Ancestral Reconstruction (MAR), which summarizes all clone trees fitting the input data equally well. Finally, we extend commonly used clone tree validity conditions to apply to partial clone trees and describe SubMARine, a polynomial-time algorithm producing the subMAR, which approximates the MAR and guarantees that its defined relationships are a subset of those present in the MAR. We also extend SubMARine to work with subclonal copy number aberrations and define equivalence constraints for this purpose. Further, we extend SubMARine to permit noise in the estimates of the subclonal frequencies while retaining its validity conditions and guarantees. In contrast to other clone tree reconstruction methods, SubMARine runs in time and space that scale polynomially in the number of subclones. 
We show through extensive noise-free simulation, a large lung cancer dataset and a prostate cancer dataset that the subMAR equals the MAR in all cases where only a single clone tree exists and that it is a perfect match to the MAR in most of the other cases. Notably, SubMARine runs in less than 70 seconds on a single thread with less than one Gb of memory on all datasets presented in this paper, including ones with 50 nodes in a clone tree. On the real-world data, SubMARine almost perfectly recovers the previously reported trees and identifies minor errors made in the expert-driven reconstructions of those trees. The freely-available open-source code implementing SubMARine can be downloaded at

Authors Linda K. Sundermann, Jeff Wintersinger, Gunnar Rätsch, Jens Stoye, Quaid Morris

Submitted PLOS Computational Biology

Link DOI

Abstract Variational Inference makes a trade-off between the capacity of the variational family and the tractability of finding an approximate posterior distribution. Instead, Boosting Variational Inference allows practitioners to obtain increasingly good posterior approximations by spending more compute. The main obstacle to widespread adoption of Boosting Variational Inference is the amount of resources necessary to improve over a strong Variational Inference baseline. In our work, we trace this limitation back to the global curvature of the KL-divergence. We characterize how the global curvature impacts time and memory consumption, address the problem with the notion of local curvature, and provide a novel approximate backtracking algorithm for estimating local curvature. We give new theoretical convergence rates for our algorithms and provide experimental validation on synthetic and real-world datasets.

Authors Gideon Dresdner, Saurav Shekhar, Fabian Pedregosa, Francesco Locatello, Gunnar Rätsch

Submitted International Joint Conference on Artificial Intelligence (IJCAI-21)


Abstract Motivation Recent technological advances have led to an increase in the production and availability of single-cell data. The ability to integrate a set of multi-technology measurements would allow the identification of biologically or clinically meaningful observations through the unification of the perspectives afforded by each technology. In most cases, however, profiling technologies consume the used cells and thus pairwise correspondences between datasets are lost. Due to the sheer size single-cell datasets can acquire, scalable algorithms that are able to universally match single-cell measurements carried out in one cell to its corresponding sibling in another technology are needed. Results We propose Single-Cell data Integration via Matching (SCIM), a scalable approach to recover such correspondences in two or more technologies. SCIM assumes that cells share a common (low-dimensional) underlying structure and that the underlying cell distribution is approximately constant across technologies. It constructs a technology-invariant latent space using an autoencoder framework with an adversarial objective. Multi-modal datasets are integrated by pairing cells across technologies using a bipartite matching scheme that operates on the low-dimensional latent representations. We evaluate SCIM on a simulated cellular branching process and show that the cell-to-cell matches derived by SCIM reflect the same pseudotime on the simulated dataset. Moreover, we apply our method to two real-world scenarios, a melanoma tumor sample and a human bone marrow sample, where we pair cells from a scRNA dataset to their sibling cells in a CyTOF dataset achieving 90% and 78% cell-matching accuracy for each one of the samples, respectively.
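The final pairing step can be illustrated with a toy sketch: given latent codes for cells from two technologies, find the one-to-one assignment minimizing total latent-space distance. Brute force suffices below; SCIM itself uses a scalable bipartite matching scheme:

```python
import itertools
import math

# Toy sketch of cell matching in a shared latent space: pair each cell
# measured by technology A with a distinct cell from technology B so the
# total distance between latent codes is minimal (brute-force search here).

def best_matching(za, zb):
    n = len(za)
    return min(itertools.permutations(range(n)),
               key=lambda p: sum(math.dist(za[i], zb[p[i]]) for i in range(n)))

za = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2)]      # latent codes, technology A
shuffle = [3, 0, 4, 1, 2]
# Technology B profiles "sibling" cells: same states, shuffled and perturbed
zb = [(za[j][0] + 0.05, za[j][1] + 0.05) for j in shuffle]

# The matching recovers each cell's sibling despite the perturbation:
assert list(best_matching(za, zb)) == [1, 3, 4, 0, 2]
```

The adversarially trained autoencoders provide the technology-invariant latent codes; the matching then operates purely on those low-dimensional representations.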

Authors Stefan G Stark, Joanna Ficek, Francesco Locatello, Ximena Bonilla, Stéphane Chevrier, Franziska Singer, Tumor Profiler Consortium, Gunnar Rätsch, Kjong-Van Lehmann

Submitted Bioinformatics

Link DOI

Abstract The amount of biological sequencing data available in public repositories is growing exponentially, forming an invaluable biomedical research resource. Yet, making all this sequencing data searchable and easily accessible to life science and data science researchers is an unsolved problem. We present MetaGraph, a versatile framework for the scalable analysis of extensive sequence repositories. MetaGraph efficiently indexes vast collections of sequences to enable fast search and comprehensive analysis. A wide range of underlying data structures offer different practically relevant trade-offs between the space taken by the index and its query performance. MetaGraph provides a flexible methodological framework allowing for index construction to be scaled from consumer laptops to distribution onto a cloud compute cluster for processing terabases to petabases of input data. Achieving compression ratios of up to 1,000-fold over the already compressed raw input data, MetaGraph can represent the content of large sequencing archives in the working memory of a single compute server. We demonstrate our framework’s scalability by indexing over 1.4 million whole genome sequencing (WGS) records from NCBI’s Sequence Read Archive, representing a total input of more than three petabases. Besides demonstrating the utility of MetaGraph indexes on key applications, such as experiment discovery, sequence alignment, error correction, and differential assembly, we make a wide range of indexes available as a community resource, including those over 450,000 microbial WGS records, more than 110,000 fungi WGS records, and more than 20,000 whole metagenome sequencing records. A subset of these indexes is made available online for interactive queries. All indexes created from public data comprising in total more than 1 million records are available for download or usage in the cloud. 
As an example of our indexes’ integrative analysis capabilities, we introduce the concept of differential assembly, which allows for the extraction of sequences present in a foreground set of samples but absent in a given background set. We apply this technique to differentially assemble contigs to identify pathogenic agents transfected via human kidney transplants. In a second example, we indexed more than 20,000 human RNA-Seq records from the TCGA and GTEx cohorts and use them to extract transcriptome features that are hard to characterize using a classical linear reference. We discovered over 200 trans-splicing events in GTEx and found broad evidence for tissue-specific non-A-to-I RNA-editing in GTEx and TCGA.
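The indexing idea underlying queries such as experiment discovery can be illustrated with a toy sketch: a k-mer to sample-label map answers "which samples contain this sequence?". MetaGraph does this at petabase scale with succinct de Bruijn graphs and compressed annotations; a plain dictionary stands in for both here:

```python
# Toy sketch of sequence-to-sample indexing: map every k-mer to the set of
# samples it occurs in, then report samples containing all k-mers of a query.

def index_samples(samples, k=4):
    idx = {}
    for name, seq in samples.items():
        for i in range(len(seq) - k + 1):
            idx.setdefault(seq[i:i + k], set()).add(name)
    return idx

def discover(idx, query, k=4):
    """Samples containing every k-mer of the query sequence."""
    hits = [idx.get(query[i:i + k], set()) for i in range(len(query) - k + 1)]
    return set.intersection(*hits) if hits else set()

idx = index_samples({"s1": "ACGTACGT", "s2": "ACGTTTTT", "s3": "GGGGGGGG"})
assert discover(idx, "ACGTAC") == {"s1"}
assert discover(idx, "ACGT") == {"s1", "s2"}
assert discover(idx, "CCCC") == set()
```

Differential assembly inverts this query: it walks the graph to extract sequences whose k-mers are annotated with foreground samples but no background samples.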

Authors Mikhail Karasikov, Harun Mustafa, Daniel Danciu, Marc Zimmermann, Christopher Barber, Gunnar Rätsch, André Kahles

Submitted bioRxiv


Abstract Motivation Understanding the underlying mutational processes of cancer patients has been a long-standing goal in the community and promises to provide new insights that could improve cancer diagnoses and treatments. Mutational signatures are summaries of the mutational processes, and improving the derivation of mutational signatures can yield new discoveries previously obscured by technical and biological confounders. Results from existing mutational signature extraction methods depend on the size of the available patient cohort and solely focus on the analysis of mutation count data without considering the exploitation of metadata. Results Here we present a supervised method that utilizes cancer type as metadata to extract more distinctive signatures. More specifically, we use a negative binomial non-negative matrix factorization and add a support vector machine loss. We show that mutational signatures extracted by our proposed method have a lower reconstruction error and are designed to be more predictive of cancer type than those generated by unsupervised methods. This design reduces the need for elaborate post-processing strategies to recover most of the known signatures, unlike existing unsupervised signature extraction methods. Signatures extracted by a supervised model used in conjunction with cancer-type labels are also more robust, especially when using small and potentially cancer-type limited patient cohorts. Finally, we adapted our model such that molecular features can be utilized to derive a corresponding mutational signature. We used APOBEC expression and MUTYH mutation status to demonstrate the possibilities that arise from this ability. We conclude that our method, which exploits available metadata, improves the quality of mutational signatures as well as helps derive more interpretable representations.

Authors Xinrui Lyu, Jean Garret, Gunnar Rätsch, Kjong-Van Lehmann

Submitted Bioinformatics

Link DOI

Abstract Intelligent agents should be able to learn useful representations by observing changes in their environment. We model such observations as pairs of non-i.i.d. images sharing at least one of the underlying factors of variation. First, we theoretically show that only knowing how many factors have changed, but not which ones, is sufficient to learn disentangled representations. Second, we provide practical algorithms that learn disentangled representations from pairs of images without requiring annotation of groups, individual factors, or the number of factors that have changed. Third, we perform a large-scale empirical study and show that such pairs of observations are sufficient to reliably learn disentangled representations on several benchmark data sets. Finally, we evaluate our learned representations and find that they are simultaneously useful on a diverse suite of tasks, including generalization under covariate shifts, fairness, and abstract reasoning. Overall, our results demonstrate that weak supervision enables learning of useful disentangled representations in realistic scenarios.

Authors Francesco Locatello, Ben Poole, Gunnar Rätsch, Bernhard Schölkopf, Olivier Bachem, Michael Tschannen

Submitted ICML 2020

Link DOI

Abstract We call upon the research community to standardize efforts to use daily self-reported data about COVID-19 symptoms in the response to the pandemic and to form a collaborative consortium to maximize global gain while protecting participant privacy.

Authors Eran Segal, Feng Zhang, Xihong Lin, Gary King, Ophir Shalem, Smadar Shilo, William E. Allen, Faisal Alquaddoomi, Han Altae-Tran, Simon Anders, Ran Balicer, Tal Bauman, Ximena Bonilla, Gisel Booman, Andrew T. Chan, Ori Cohen, Silvano Coletti, Natalie Davidson, Yuval Dor, David A. Drew, Olivier Elemento, Georgina Evans, Phil Ewels, Joshua Gale, Amir Gavrieli, Benjamin Geiger, Yonatan H. Grad, Casey S. Greene, Iman Hajirasouliha, Roman Jerala, Andre Kahles, Olli Kallioniemi, Ayya Keshet, Ljupco Kocarev, Gregory Landua, Tomer Meir, Aline Muller, Long H. Nguyen, Matej Oresic, Svetlana Ovchinnikova, Hedi Peterson, Jana Prodanova, Jay Rajagopal, Gunnar Rätsch, Hagai Rossman, Johan Rung, Andrea Sboner, Alexandros Sigaras, Tim Spector, Ron Steinherz, Irene Stevens, Jaak Vilo, Paul Wilmes

Submitted Nature Medicine


Abstract Kernel methods on discrete domains have shown great promise for many challenging tasks, e.g., on biological sequence data as well as on molecular structures. Scalable kernel methods like support vector machines offer good predictive performances but they often do not provide uncertainty estimates. In contrast, probabilistic kernel methods like Gaussian Processes offer uncertainty estimates in addition to good predictive performance but fall short in terms of scalability. We present the first sparse Gaussian Process approximation framework on discrete input domains. Our framework achieves good predictive performance as well as uncertainty estimates using different discrete optimization techniques. We present competitive results comparing our framework to support vector machine and full Gaussian Process baselines on synthetic data as well as on challenging real-world DNA sequence data.

Authors Vincent Fortuin, Gideon Dresdner, Heiko Strathmann, Gunnar Rätsch

Submitted IEEE Access

Link DOI

Abstract The Jaccard Similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. However, little effort has been made to develop a scalable and high-performance scheme for computing the Jaccard Similarity for today's large data sets. To address this issue, we design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the Jaccard Similarity. The key idea is to express the problem algebraically, as a sequence of matrix operations, and implement these operations with communication-avoiding distributed routines to minimize the amount of transferred data and ensure both high scalability and low latency. We then apply our algorithm to the problem of obtaining distances between whole-genome sequencing samples, a key part of modern metagenomics analysis and an ever-growing need due to the increasing availability of high-throughput DNA sequencing data. The resulting scheme is the first to enable accurate Jaccard distance derivations for massive datasets, using large-scale distributed-memory systems. We package our routines in a tool, called GenomeAtScale, that combines the proposed algorithm with tools for processing input sequences. Our evaluation on real data illustrates that one can use GenomeAtScale to effectively employ tens of thousands of processors to reach new frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more general underlying SimilarityAtScale algorithm may be used for high-performance distributed similarity computations in other data analytics application domains.
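The algebraic formulation described above can be sketched on a single machine: for a binary presence/absence matrix, all pairwise Jaccard similarities follow from one matrix product, which is the operation SimilarityAtScale distributes with communication-avoiding routines:

```python
import numpy as np

# Toy sketch of Jaccard similarity as matrix algebra: rows of A are samples,
# columns are features (e.g. k-mer presence); A @ A.T gives all pairwise
# intersection sizes at once.

A = np.array([[1, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=np.int64)

inter = A @ A.T                        # pairwise intersection sizes
sizes = A.sum(axis=1)
union = sizes[:, None] + sizes[None, :] - inter
J = inter / union                      # pairwise Jaccard similarities

assert J[0, 1] == 2 / 3                # {0,1,3} vs {0,3}
assert J[0, 2] == 1 / 4                # {0,1,3} vs {1,2}
assert np.allclose(np.diag(J), 1.0)
```

Casting the computation as matrix products is what lets the distributed algorithm reuse highly optimized, communication-avoiding linear-algebra kernels.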

Authors Maciej Besta, Raghavendra Kanakagiri, Harun Mustafa, Mikhail Karasikov, Gunnar Rätsch, Torsten Hoefler, Edgar Solomonik

Submitted IPDPS 2020
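The algebraic reformulation at the heart of SimilarityAtScale can be sketched on a toy scale: with a binary presence/absence matrix, all pairwise intersection sizes come from a single matrix product, and unions follow by inclusion-exclusion. This single-node NumPy sketch only illustrates the algebra; the paper's contribution is carrying it out with communication-avoiding distributed routines.

```python
import numpy as np

# Toy binary presence/absence matrix: rows = samples, cols = k-mers.
# (A stand-in for the massive matrices the distributed algorithm targets.)
A = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1],
], dtype=np.int64)

# All pairwise intersection sizes as one matrix product.
intersections = A @ A.T

# |X ∪ Y| = |X| + |Y| - |X ∩ Y|, computed by broadcasting the row sums.
sizes = A.sum(axis=1)
unions = sizes[:, None] + sizes[None, :] - intersections

jaccard = intersections / unions   # pairwise Jaccard similarity matrix
distance = 1.0 - jaccard           # pairwise Jaccard distance matrix
```

Because the whole computation reduces to a matrix product plus elementwise operations, it maps naturally onto distributed matrix-multiplication kernels, which is what lets the full algorithm minimize communication.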

Abstract We present an algorithm for the optimal alignment of sequences to genome graphs. It works by phrasing the edit distance minimization task as finding a shortest path on an implicit alignment graph. To find a shortest path, we instantiate the A⋆ paradigm with a novel domain-specific heuristic function that accounts for the upcoming sub-sequence in the query to be aligned, resulting in a provably optimal alignment algorithm called AStarix. Experimental evaluation of AStarix shows that it is 1–2 orders of magnitude faster than state-of-the-art optimal algorithms on the task of aligning Illumina reads to reference genome graphs. Implementations and evaluations are available at

Authors Pesho Ivanov, Benjamin Bichsel, Harun Mustafa, André Kahles, Gunnar Rätsch, Martin Vechev

Submitted RECOMB 2020

Link DOI
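The "edit distance as shortest path" phrasing can be illustrated with a minimal semi-global aligner: states are (query position, reference position), edit operations are unit-cost edges, and A* finds the cheapest path. This sketch uses a single linear reference and the trivial admissible heuristic h = 0 (i.e., plain Dijkstra); AStarix's contributions are a far stronger domain-specific heuristic and support for genome graphs, neither of which is reproduced here.

```python
import heapq

def astar_align(query, ref):
    """Semi-global edit distance of `query` against `ref` via A* search.

    State (i, j) means i query chars and j ref chars consumed; the query
    may start anywhere in the reference for free. The heuristic below is
    the trivial admissible h = 0, so this degenerates to Dijkstra.
    """
    n, m = len(query), len(ref)
    inf = float("inf")
    dist = {(0, j): 0 for j in range(m + 1)}       # free start anywhere
    pq = [(0, 0, (0, j)) for j in range(m + 1)]    # (f = g + h, g, state)
    heapq.heapify(pq)
    while pq:
        f, g, (i, j) = heapq.heappop(pq)
        if g > dist.get((i, j), inf):
            continue                                # stale queue entry
        if i == n:                                  # whole query aligned
            return g                                # first goal pop is optimal
        moves = (
            (i + 1, j + 1, 0 if j < m and query[i] == ref[j] else 1),  # (mis)match
            (i + 1, j, 1),                                             # insertion
            (i, j + 1, 1),                                             # deletion
        )
        for ni, nj, cost in moves:
            if nj > m:
                continue
            if g + cost < dist.get((ni, nj), inf):
                dist[(ni, nj)] = g + cost
                h = 0            # placeholder admissible heuristic
                heapq.heappush(pq, (g + cost + h, g + cost, (ni, nj)))
    return None
```

Replacing h = 0 with a heuristic that lower-bounds the cost of aligning the remaining query suffix is exactly what turns this generic search into a fast, still provably optimal aligner.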

Abstract Transcript alterations often result from somatic changes in cancer genomes. Various forms of RNA alterations have been described in cancer, including overexpression, altered splicing and gene fusions; however, it is difficult to attribute these to underlying genomic changes owing to heterogeneity among patients and tumour types, and the relatively small cohorts of patients for whom samples have been analysed by both transcriptome and whole-genome sequencing. Here we present, to our knowledge, the most comprehensive catalogue of cancer-associated gene alterations to date, obtained by characterizing tumour transcriptomes from 1,188 donors of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). Using matched whole-genome sequencing data, we associated several categories of RNA alterations with germline and somatic DNA alterations, and identified probable genetic mechanisms. Somatic copy-number alterations were the major drivers of variations in total gene and allele-specific expression. We identified 649 associations of somatic single-nucleotide variants with gene expression in cis, of which 68.4% involved associations with flanking non-coding regions of the gene. We found 1,900 splicing alterations associated with somatic mutations, including the formation of exons within introns in proximity to Alu elements. In addition, 82% of gene fusions were associated with structural variants, including 75 of a new class, termed ‘bridged’ fusions, in which a third genomic location bridges two genes. We observed transcriptomic alteration signatures that differ between cancer types and have associations with variations in DNA mutational signatures. This compendium of RNA alterations in the genomic context provides a rich resource for identifying genes and mechanisms that are functionally implicated in cancer.

Authors PCAWG Transcriptome Core Group, Claudia Calabrese, Natalie R Davidson, Deniz Demircioğlu, Nuno A. Fonseca, Yao He, André Kahles, Kjong-Van Lehmann, Fenglin Liu, Yuichi Shiraishi, Cameron M. Soulette, Lara Urban, Liliana Greger, Siliang Li, Dongbing Liu, Marc D. Perry, Qian Xiang, Fan Zhang, Junjun Zhang, Peter Bailey, Serap Erkek, Katherine A. Hoadley, Yong Hou, Matthew R. Huska, Helena Kilpinen, Jan O. Korbel, Maximillian G. Marin, Julia Markowski, Tannistha Nandi, Qiang Pan-Hammarström, Chandra Sekhar Pedamallu, Reiner Siebert, Stefan G. Stark, Hong Su, Patrick Tan, Sebastian M. Waszak, Christina Yung, Shida Zhu, Philip Awadalla, Chad J. Creighton, Matthew Meyerson, B. F. Francis Ouellette, Kui Wu, Huanming Yang, PCAWG Transcriptome Working Group, Alvis Brazma, Angela N. Brooks, Jonathan Göke, Gunnar Rätsch, Roland F. Schwarz, Oliver Stegle, Zemin Zhang & PCAWG Consortium

Submitted Nature, volume 578, pages 129–136 (2020)

Link DOI

Abstract The goal of the unsupervised learning of disentangled representations is to separate the independent explanatory factors of variation in the data without access to supervision. In this paper, we summarize the results of Locatello et al., 2019, and focus on their implications for practitioners. We discuss the theoretical result showing that the unsupervised learning of disentangled representations is fundamentally impossible without inductive biases and the practical challenges it entails. Finally, we comment on our experimental findings, highlighting the limitations of state-of-the-art approaches and directions for future research.

Authors Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, Olivier Bachem

Submitted AAAI 2020


Abstract High-throughput DNA sequencing data are accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. Although there has been good progress toward representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this study, we present a new compression approach, the multiary binary relation wavelet tree (Multi-BRWT), which is adaptive to different kinds of input data. We show an up to 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state-of-the-art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data, and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world data sets.

Authors Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh-No, Gunnar Rätsch, André Kahles

Submitted Journal of Computational Biology

Link DOI
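The core idea behind a binary relation wavelet tree can be sketched in a few lines: each node stores one bitvector marking which rows (k-mers) carry at least one label from its column subset, and children re-index only the surviving rows, so sparse annotation matrices shrink rapidly as the tree is descended. This toy sketch is for intuition only; it is not the paper's Multi-BRWT implementation and stores plain NumPy arrays rather than compressed bitvectors.

```python
import numpy as np

def build_brwt(matrix, cols):
    """Build a toy BRWT node over the boolean annotation columns in `cols`."""
    mask = matrix[:, cols].any(axis=1)      # rows with >=1 label in this subset
    node = {"cols": cols, "mask": mask}
    if len(cols) > 1:
        mid = len(cols) // 2
        sub = matrix[mask]                  # children see only surviving rows
        node["children"] = (build_brwt(sub, cols[:mid]),
                            build_brwt(sub, cols[mid:]))
    return node

def brwt_query(node, row, col):
    """Return True iff (row, col) is set, descending the tree via rank."""
    if not node["mask"][row]:
        return False
    if len(node["cols"]) == 1:
        return True
    rank = int(node["mask"][:row].sum())    # row's index among surviving rows
    left, right = node["children"]
    return brwt_query(left if col in left["cols"] else right, rank, col)
```

In a real implementation the bitvectors are stored compressed with rank/select support, and the multiary variant studied in the paper chooses node arities and column groupings adaptively based on the data.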

Abstract Intra-tumor hypoxia is a common feature in many solid cancers. Although transcriptional targets of hypoxia-inducible factors (HIFs) have been well characterized, alternative splicing or processing of pre-mRNA transcripts which occurs during hypoxia and subsequent HIF stabilization is much less understood. Here, we identify HIF-dependent alternative splicing events after whole transcriptome sequencing in pancreatic cancer cells exposed to hypoxia with and without downregulation of the aryl hydrocarbon receptor nuclear translocator (ARNT), a protein required for HIFs to form a transcriptionally active dimer. We correlate the discovered hypoxia-driven events with available sequencing data from pan-cancer TCGA patient cohorts to select a narrow set of putative biologically relevant splice events for experimental validation. We validate a small set of candidate HIF-dependent alternative splicing events in multiple human cancer cell lines as well as patient-derived human pancreatic cancer organoids. Lastly, we report the discovery of a HIF-dependent mechanism to produce a hypoxia-dependent, long and coding isoform of the UDP-N-acetylglucosamine transporter SLC35A3.

Authors Philipp Markolin, Natalie R Davidson, Christian K. Hirt, Christophe D. Chabbert, Nicola Zamboni, Gerald Schwank, Wilhelm Krek, Gunnar Rätsch

Submitted bioRxiv

Link DOI

Abstract Metagenomic studies have increasingly utilized sequencing technologies in order to analyze DNA fragments found in environmental samples. One important step in this analysis is the taxonomic classification of the DNA fragments. Conventional read classification methods require large databases and vast amounts of memory to run, with recent deep learning methods suffering from very large model sizes. We therefore aim to develop a more memory-efficient technique for taxonomic classification. A task of particular interest is abundance estimation in metagenomic samples. Current attempts rely on classifying single DNA reads independently from each other and are therefore agnostic to co-occurrence patterns between taxa. In this work, we also attempt to take these patterns into account. We develop a novel memory-efficient read classification technique, combining deep learning and locality-sensitive hashing. We show that this approach outperforms conventional mapping-based and other deep learning methods for single-read taxonomic classification when restricting all methods to a fixed memory footprint. Moreover, we formulate the task of abundance estimation as a Multiple Instance Learning (MIL) problem and we extend current deep learning architectures with two different types of permutation-invariant MIL pooling layers: (a) DeepSets and (b) attention-based pooling. We illustrate that our architectures can exploit the co-occurrence of species in metagenomic read sets and outperform the single-read architectures in predicting the distribution over taxa at higher taxonomic ranks.

Authors Andreas Georgiou, Vincent Fortuin, Harun Mustafa, Gunnar Rätsch

Submitted arXiv Preprints
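Attention-based MIL pooling, one of the two pooling layers mentioned above, aggregates a bag of per-read embeddings into a single bag representation via a learned, permutation-invariant weighted sum. The NumPy sketch below shows the standard formulation (a_k = softmax over w^T tanh(V h_k)); the parameter shapes are illustrative, and the paper's exact architecture and gated variants are not reproduced.

```python
import numpy as np

def attention_mil_pool(instance_embeddings, V, w):
    """Permutation-invariant attention pooling over a bag of instances.

    a_k = softmax_k( w^T tanh(V h_k) );  bag = sum_k a_k * h_k
    `instance_embeddings` is (n_instances, d), V is (hidden, d), w is (hidden,).
    """
    scores = np.tanh(instance_embeddings @ V.T) @ w   # (n_instances,)
    weights = np.exp(scores - scores.max())           # numerically stable softmax
    weights /= weights.sum()
    return weights @ instance_embeddings              # (d,) bag embedding
```

Because the softmax weights depend only on each instance individually and the sum is order-free, shuffling the reads in a bag leaves the pooled representation unchanged, which is exactly the invariance abundance estimation over unordered read sets requires.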