Gunnar Rätsch, Prof. Dr.

Interdisciplinary research is about forging collaborations across disciplinary and geographic borders.

E-Mail
raetsch@inf.ethz.ch
Phone
+41 44 632 2036
ETH Zürich
Department of Computer Science
Biomedical Informatics Group Universitätsstrasse 6
CAB F53.2
8092 Zürich
Room
CAB F53.2
@gxr

Data scientist Gunnar Rätsch develops and applies advanced data analysis and modeling techniques to data from deep molecular profiling, medical and health records, as well as images.

He earned his Ph.D. at the German National Laboratory for Information Technology under supervision of Klaus-Robert Müller and was a postdoc with Bob Williamson and Bernhard Schölkopf. He received the Max Planck Young and Independent Investigator award and was leading the group on Machine Learning in Genome Biology at the Friedrich Miescher Laboratory in Tübingen (2005-2011). In 2012, he joined Memorial Sloan Kettering Cancer Center as Associate Faculty. In May 2016, he and his lab moved to Zürich to join the Computer Science Department of ETH Zürich.

The Rätsch laboratory focuses on bridging medicine and biology with computer science. The group’s research interests are relatively broad as it covers an area from algorithmic computer science to biomedical application fields. On the one hand, this includes work on algorithms that can learn or extract insights from data, on the other hand it involves developing tools that we and others employ for the analysis of large genomic or medical data sets, often in collaboration with biologists and physicians. These tools aim to solve real-world biomedical problems. In short, the group advances the state-of-the-art in data science algorithms, turns them into commonly usable tools for specific applications, and then collaborate with biologists and physicians on life science problems. Along the way, we learn more and can go back to improve the algorithms.

Hugo Yèche, Gideon Dresdner, Francesco Locatello, Matthias Hüser, Gunnar Rätsch Neighborhood Contrastive Learning Applied to Online Patient Monitoring ICML 2021

Abstract Intensive care units (ICU) are increasingly looking towards machine learning for methods to provide online monitoring of critically ill patients. In machine learning, online monitoring is often formulated as a supervised learning problem. Recently, contrastive learning approaches have demonstrated promising improvements over competitive supervised benchmarks. These methods rely on well-understood data augmentation techniques developed for image data which do not apply to online monitoring. In this work, we overcome this limitation by supplementing time-series data augmentation techniques with a novel contrastive learning objective which we call neighborhood contrastive learning (NCL). Our objective explicitly groups together contiguous time segments from each patient while maintaining state-specific information. Our experiments demonstrate a marked improvement over existing work applying contrastive methods to medical time-series.

Authors Hugo Yèche, Gideon Dresdner, Francesco Locatello, Matthias Hüser, Gunnar Rätsch

Submitted ICML 2021

David Danko, Daniela Bezdan, Evan E. Afshin, Sofia Ahsanuddin, Chandrima Bhattacharya, Daniel J. Butler, Kern Rei Chng, Daisy Donnellan, Jochen Hecht, Katelyn Jackson, Katerina Kuchin, Mikhail Karasikov, Abigail Lyons, Lauren Mak, Dmitry Meleshko, Harun Mustafa, Beth Mutai, Russell Y. Neches, Amanda Ng, Olga Nikolayeva, Tatyana Nikolayeva, Eileen Png, Krista A. Ryon, Jorge L. Sanchez, Heba Shaaban, Maria A. Sierra, Dominique Thomas, Ben Young, Omar O. Abudayyeh, Josue Alicea, Malay Bhattacharyya, Ran Blekhman, Eduardo Castro-Nallar, Ana M. Cañas, Aspassia D. Chatziefthimiou, Robert W. Crawford, Francesca De Filippis, Youping Deng, Christelle Desnues, Emmanuel Dias-Neto, Marius Dybwad, Eran Elhaik, Danilo Ercolini, Alina Frolova, Dennis Gankin, Jonathan S. Gootenberg, Alexandra B. Graf, David C. Green, Iman Hajirasouliha, Jaden J.A. Hastings, Mark Hernandez, Gregorio Iraola, Soojin Jang, Andre Kahles, Frank J. Kelly, Kaymisha Knights, Nikos C. Kyrpides, Paweł P. Łabaj, Patrick K.H. Lee, Marcus H.Y. Leung, Per O. Ljungdahl, Gabriella Mason-Buck, Ken McGrath, Cem Meydan, Emmanuel F. Mongodin, Milton Ozorio Moraes, Niranjan Nagarajan, Marina Nieto-Caballero, Houtan Noushmehr, Manuela Oliveira, Stephan Ossowski, Olayinka O. Osuolale, Orhan Özcan, David Paez-Espino, Nicolás Rascovan, Hugues Richard, Gunnar Rätsch, Lynn M. Schriml, Torsten Semmler, Osman U. Sezerman, Leming Shi, Tieliu Shi, Rania Siam, Le Huu Song, Haruo Suzuki, Denise Syndercombe Court, Scott W. Tighe, Xinzhao Tong, Klas I. Udekwu, Juan A. Ugalde, Brandon Valentine, Dimitar I. Vassilev, Elena M. Vayndorf, Thirumalaisamy P. Velavan, Jun Wu, María M. Zambrano, Jifeng Zhu, Sibo Zhu, Christopher E. Mason, The International MetaSUB Consortium A global metagenomic map of urban microbiomes and antimicrobial resistance Cell

Abstract We present a global atlas of 4,728 metagenomic samples from mass-transit systems in 60 cities over 3 years, representing the first systematic, worldwide catalog of the urban microbial ecosystem. This atlas provides an annotated, geospatial profile of microbial strains, functional characteristics, antimicrobial resistance (AMR) markers, and genetic elements, including 10,928 viruses, 1,302 bacteria, 2 archaea, and 838,532 CRISPR arrays not found in reference databases. We identified 4,246 known species of urban microorganisms and a consistent set of 31 species found in 97% of samples that were distinct from human commensal organisms. Profiles of AMR genes varied widely in type and density across cities. Cities showed distinct microbial taxonomic signatures that were driven by climate and geographic differences. These results constitute a high-resolution global metagenomic atlas that enables discovery of organisms and genes, highlights potential public health and forensic applications, and provides a culture-independent view of AMR burden in cities.

Authors David Danko, Daniela Bezdan, Evan E. Afshin, Sofia Ahsanuddin, Chandrima Bhattacharya, Daniel J. Butler, Kern Rei Chng, Daisy Donnellan, Jochen Hecht, Katelyn Jackson, Katerina Kuchin, Mikhail Karasikov, Abigail Lyons, Lauren Mak, Dmitry Meleshko, Harun Mustafa, Beth Mutai, Russell Y. Neches, Amanda Ng, Olga Nikolayeva, Tatyana Nikolayeva, Eileen Png, Krista A. Ryon, Jorge L. Sanchez, Heba Shaaban, Maria A. Sierra, Dominique Thomas, Ben Young, Omar O. Abudayyeh, Josue Alicea, Malay Bhattacharyya, Ran Blekhman, Eduardo Castro-Nallar, Ana M. Cañas, Aspassia D. Chatziefthimiou, Robert W. Crawford, Francesca De Filippis, Youping Deng, Christelle Desnues, Emmanuel Dias-Neto, Marius Dybwad, Eran Elhaik, Danilo Ercolini, Alina Frolova, Dennis Gankin, Jonathan S. Gootenberg, Alexandra B. Graf, David C. Green, Iman Hajirasouliha, Jaden J.A. Hastings, Mark Hernandez, Gregorio Iraola, Soojin Jang, Andre Kahles, Frank J. Kelly, Kaymisha Knights, Nikos C. Kyrpides, Paweł P. Łabaj, Patrick K.H. Lee, Marcus H.Y. Leung, Per O. Ljungdahl, Gabriella Mason-Buck, Ken McGrath, Cem Meydan, Emmanuel F. Mongodin, Milton Ozorio Moraes, Niranjan Nagarajan, Marina Nieto-Caballero, Houtan Noushmehr, Manuela Oliveira, Stephan Ossowski, Olayinka O. Osuolale, Orhan Özcan, David Paez-Espino, Nicolás Rascovan, Hugues Richard, Gunnar Rätsch, Lynn M. Schriml, Torsten Semmler, Osman U. Sezerman, Leming Shi, Tieliu Shi, Rania Siam, Le Huu Song, Haruo Suzuki, Denise Syndercombe Court, Scott W. Tighe, Xinzhao Tong, Klas I. Udekwu, Juan A. Ugalde, Brandon Valentine, Dimitar I. Vassilev, Elena M. Vayndorf, Thirumalaisamy P. Velavan, Jun Wu, María M. Zambrano, Jifeng Zhu, Sibo Zhu, Christopher E. Mason, The International MetaSUB Consortium

Submitted Cell

Matthias Hüser, Martin Faltys, Xinrui Lyu, Chris Barber, Stephanie L. Hyland, Thomas M. Merz, Gunnar Rätsch Early prediction of respiratory failure in the intensive care unit arXiv Preprints

Abstract The development of respiratory failure is common among patients in intensive care units (ICU). Large data quantities from ICU patient monitoring systems make timely and comprehensive analysis by clinicians difficult but are ideal for automatic processing by machine learning algorithms. Early prediction of respiratory system failure could alert clinicians to patients at risk of respiratory failure and allow for early patient reassessment and treatment adjustment. We propose an early warning system that predicts moderate/severe respiratory failure up to 8 hours in advance. Our system was trained on HiRID-II, a data-set containing more than 60,000 admissions to a tertiary care ICU. An alarm is typically triggered several hours before the beginning of respiratory failure. Our system outperforms a clinical baseline mimicking traditional clinical decision-making based on pulse-oximetric oxygen saturation and the fraction of inspired oxygen. To provide model introspection and diagnostics, we developed an easy-to-use web browser-based system to explore model input data and predictions visually.

Authors Matthias Hüser, Martin Faltys, Xinrui Lyu, Chris Barber, Stephanie L. Hyland, Thomas M. Merz, Gunnar Rätsch

Submitted arXiv Preprints

Laura Manduchi, Matthias Hüser, Martin Faltys, Julia Vogt, Gunnar Rätsch, Vincent Fortuin T-DPSOM - An Interpretable Clustering Method for Unsupervised Learning of Patient Health States ACM-CHIL 2021

Abstract Generating interpretable visualizations of multivariate time series in the intensive care unit is of great practical importance. Clinicians seek to condense complex clinical observations into intuitively understandable critical illness patterns, like failures of different organ systems. They would greatly benefit from a low-dimensional representation in which the trajectories of the patients' pathology become apparent and relevant health features are highlighted. To this end, we propose to use the latent topological structure of Self-Organizing Maps (SOMs) to achieve an interpretable latent representation of ICU time series and combine it with recent advances in deep clustering. Specifically, we (a) present a novel way to fit SOMs with probabilistic cluster assignments (PSOM), (b) propose a new deep architecture for probabilistic clustering (DPSOM) using a VAE, and (c) extend our architecture to cluster and forecast clinical states in time series (T-DPSOM). We show that our model achieves superior clustering performance compared to state-of-the-art SOM-based clustering methods while maintaining the favorable visualization properties of SOMs. On the eICU data-set, we demonstrate that T-DPSOM provides interpretable visualizations of patient state trajectories and uncertainty estimation. We show that our method rediscovers well-known clinical patient characteristics, such as a dynamic variant of the Acute Physiology And Chronic Health Evaluation (APACHE) score. Moreover, we illustrate how it can disentangle individual organ dysfunctions on disjoint regions of the two-dimensional SOM map.

Authors Laura Manduchi, Matthias Hüser, Martin Faltys, Julia Vogt, Gunnar Rätsch, Vincent Fortuin

Submitted ACM-CHIL 2021

Jonathan Heitz, Joanna Ficek, Martin Faltys, Tobias M. Merz, Gunnar Rätsch, Matthias Hüser WRSE - a non-parametric weighted-resolution ensemble for predicting individual survival distributions in the ICU Proceedings of the AAAI-2021 - Spring Symposium on Survival Prediction

Abstract Dynamic assessment of mortality risk in the intensive care unit (ICU) can be used to stratify patients, inform about treatment effectiveness or serve as part of an early-warning system. Static risk scoring systems, such as APACHE or SAPS, have recently been supplemented with data-driven approaches that track the dynamic mortality risk over time. Recent works have focused on enhancing the information delivered to clinicians even further by producing full survival distributions instead of point predictions or fixed horizon risks. In this work, we propose a non-parametric ensemble model, Weighted Resolution Survival Ensemble (WRSE), tailored to estimate such dynamic individual survival distributions. Inspired by the simplicity and robustness of ensemble methods, the proposed approach combines a set of binary classifiers spaced according to a decay function reflecting the relevance of short-term mortality predictions. Models and baselines are evaluated under weighted calibration and discrimination metrics for individual survival distributions which closely reflect the utility of a model in ICU practice. We show competitive results with state-of-the-art probabilistic models, while greatly reducing training time by factors of 2-9x.

Authors Jonathan Heitz, Joanna Ficek, Martin Faltys, Tobias M. Merz, Gunnar Rätsch, Matthias Hüser

Submitted Proceedings of the AAAI-2021 - Spring Symposium on Survival Prediction

Daniel Danciu, Mikhail Karasikov, Harun Mustafa, André Kahles, Gunnar Rätsch Topology-based Sparsification of Graph Annotations ISMB/ECCB 2021

Abstract Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently-used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. In this paper, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of vertices adjacent in the graph. RowDiff can be constructed in linear time relative to the number of vertices and labels in the graph, and in space proportional to the graph size. In addition, construction can be efficiently parallelized and distributed, making the technique applicable to graphs with trillions of nodes. RowDiff can be viewed as an intermediary sparsification step of the original annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrices. Experiments on 10,000 RNA-seq datasets show that RowDiff combined with Multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST, the previously known most compact annotation representation. Experiments on the sparser Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a Multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST.

Authors Daniel Danciu, Mikhail Karasikov, Harun Mustafa, André Kahles, Gunnar Rätsch

Submitted ISMB/ECCB 2021

Gilles Gut, Stefan G Stark, Gunnar Rätsch, Natalie R Davidson pmVAE: Learning Interpretable Single-Cell Representations with Pathway Modules biorxiv

Abstract Motivation Deep learning techniques have yielded tremendous progress in the field of computational biology over the last decade, however many of these techniques are opaque to the user. To provide interpretable results, methods have incorporated biological priors directly into the learning task; one such biological prior is pathway structure. While pathways represent most biological processes in the cell, the high level of correlation and hierarchical structure make it complicated to determine an appropriate computational representation. Results Here, we present pathway module Variational Autoencoder (pmVAE). Our method encodes pathway information by restricting the structure of our VAE to mirror gene-pathway memberships. Its architecture is composed of a set of subnetworks, which we refer to as pathway modules. The subnetworks learn interpretable latent representations by factorizing the latent space according to pathway gene sets. We directly address correlation between pathways by balancing a module-specific local loss and a global reconstruction loss. Furthermore, since many pathways are by nature hierarchical and therefore the product of multiple downstream signals, we model each pathway as a multidimensional vector. Due to their factorization over pathways, the representations allow for easy and interpretable analysis of multiple downstream effects, such as cell type and biological stimulus, within the contexts of each pathway. We compare pmVAE against two other state-of-the-art methods on two single-cell RNA-seq case-control data sets, demonstrating that our pathway representations are both more discriminative and consistent in detecting pathways targeted by a perturbation. Availability and implementation https://github.com/ratschlab/pmvae

Authors Gilles Gut, Stefan G Stark, Gunnar Rätsch, Natalie R Davidson

Submitted biorxiv

Anja Irmisch, Ximena Bonilla, Stéphane Chevrier, Kjong-Van Lehmann, Franziska Singer, Nora C. Toussaint, Cinzia Esposito, Julien Mena, Emanuela S. Milani, Ruben Casanova, Daniel J. Stekhoven, Rebekka Wegmann, Francis Jacob, Bettina Sobottka, Sandra Goetze, Jack Kuipers, Jacobo Sarabia del Castillo, Michael Prummer, Mustafa A. Tuncel, Ulrike Menzel, Andrea Jacobs, Stefanie Engler, Sujana Sivapatham, Anja L. Frei, Gabriele Gut, Joanna Ficek, Nicola Miglino, Melike Ak, Faisal S. Al-Quaddoomi, Jonas Albinus, Ilaria Alborelli, Sonali Andani, Per-Olof Attinger, Daniel Baumhoer, Beatrice Beck-Schimmer, Lara Bernasconi, Anne Bertolini, Natalia Chicherova, Maya D'Costa, Esther Danenberg, Natalie Davidson, Monica-Andreea Drăgan, Martin Erkens, Katja Eschbach, André Fedier, Pedro Ferreira, Bruno Frey, Linda Grob, Detlef Günther, Martina Haberecker, Pirmin Haeuptle, Sylvia Herter, Rene Holtackers, Tamara Huesser, Tim M. Jaeger, Katharina Jahn, Alva R. James, Philip M. Jermann, André Kahles, Abdullah Kahraman, Werner Kuebler, Christian P. Kunze, Christian Kurzeder, Sebastian Lugert, Gerd Maass, Philipp Markolin, Julian M. Metzler, Simone Muenst, Riccardo Murri, Charlotte K.Y. Ng, Stefan Nicolet, Marta Nowak, Patrick G.A. Pedrioli, Salvatore Piscuoglio, Mathilde Ritter, Christian Rommel, María L. Rosano-González, Natascha Santacroce, Ramona Schlenker, Petra C. Schwalie, Severin Schwan, Tobias Schär, Gabriela Senti, Vipin T. Sreedharan, Stefan Stark, Tinu M. Thomas, Vinko Tosevski, Marina Tusup, Audrey Van Drogen, Marcus Vetter, Tatjana Vlajnic, Sandra Weber, Walter P. Weber, Michael Weller, Fabian Wendt, Norbert Wey, Mattheus H.E. Wildschut, Shuqing Yu, Johanna Ziegler, Marc Zimmermann, Martin Zoche, Gregor Zuend, Rudolf Aebersold, Marina Bacac, Niko Beerenwinkel, Christian Beisel, Bernd Bodenmiller, Reinhard Dummer, Viola Heinzelmann-Schwarz, Viktor H. Koelzer, Markus G. Manz, Holger Moch, Lucas Pelkmans, Berend Snijder, Alexandre P.A. Theocharides, Markus Tolnay, Andreas Wicki, Bernd Wollscheid, Gunnar Rätsch, Mitchell P. Levesque The Tumor Profiler Study: integrated, multi-omic, functional tumor profiling for clinical decision support Cancer Cell (Commentary)

Abstract The application and integration of molecular profiling technologies create novel opportunities for personalized medicine. Here, we introduce the Tumor Profiler Study, an observational trial combining a prospective diagnostic approach to assess the relevance of in-depth tumor profiling to support clinical decision-making with an exploratory approach to improve the biological understanding of the disease.

Authors Anja Irmisch, Ximena Bonilla, Stéphane Chevrier, Kjong-Van Lehmann, Franziska Singer, Nora C. Toussaint, Cinzia Esposito, Julien Mena, Emanuela S. Milani, Ruben Casanova, Daniel J. Stekhoven, Rebekka Wegmann, Francis Jacob, Bettina Sobottka, Sandra Goetze, Jack Kuipers, Jacobo Sarabia del Castillo, Michael Prummer, Mustafa A. Tuncel, Ulrike Menzel, Andrea Jacobs, Stefanie Engler, Sujana Sivapatham, Anja L. Frei, Gabriele Gut, Joanna Ficek, Nicola Miglino, Melike Ak, Faisal S. Al-Quaddoomi, Jonas Albinus, Ilaria Alborelli, Sonali Andani, Per-Olof Attinger, Daniel Baumhoer, Beatrice Beck-Schimmer, Lara Bernasconi, Anne Bertolini, Natalia Chicherova, Maya D'Costa, Esther Danenberg, Natalie Davidson, Monica-Andreea Drăgan, Martin Erkens, Katja Eschbach, André Fedier, Pedro Ferreira, Bruno Frey, Linda Grob, Detlef Günther, Martina Haberecker, Pirmin Haeuptle, Sylvia Herter, Rene Holtackers, Tamara Huesser, Tim M. Jaeger, Katharina Jahn, Alva R. James, Philip M. Jermann, André Kahles, Abdullah Kahraman, Werner Kuebler, Christian P. Kunze, Christian Kurzeder, Sebastian Lugert, Gerd Maass, Philipp Markolin, Julian M. Metzler, Simone Muenst, Riccardo Murri, Charlotte K.Y. Ng, Stefan Nicolet, Marta Nowak, Patrick G.A. Pedrioli, Salvatore Piscuoglio, Mathilde Ritter, Christian Rommel, María L. Rosano-González, Natascha Santacroce, Ramona Schlenker, Petra C. Schwalie, Severin Schwan, Tobias Schär, Gabriela Senti, Vipin T. Sreedharan, Stefan Stark, Tinu M. Thomas, Vinko Tosevski, Marina Tusup, Audrey Van Drogen, Marcus Vetter, Tatjana Vlajnic, Sandra Weber, Walter P. Weber, Michael Weller, Fabian Wendt, Norbert Wey, Mattheus H.E. Wildschut, Shuqing Yu, Johanna Ziegler, Marc Zimmermann, Martin Zoche, Gregor Zuend, Rudolf Aebersold, Marina Bacac, Niko Beerenwinkel, Christian Beisel, Bernd Bodenmiller, Reinhard Dummer, Viola Heinzelmann-Schwarz, Viktor H. Koelzer, Markus G. Manz, Holger Moch, Lucas Pelkmans, Berend Snijder, Alexandre P.A. Theocharides, Markus Tolnay, Andreas Wicki, Bernd Wollscheid, Gunnar Rätsch, Mitchell P. Levesque

Submitted Cancer Cell (Commentary)

Stefan G Stark, Joanna Ficek, Francesco Locatello, Ximena Bonilla, Stéphane Chevrier, Franziska Singer, Tumor Profiler Consortium, Gunnar Rätsch, Kjong-Van Lehmann SCIM: universal single-cell matching with unpaired feature sets Bioinformatics

Abstract Motivation Recent technological advances have led to an increase in the production and availability of single-cell data. The ability to integrate a set of multi-technology measurements would allow the identification of biologically or clinically meaningful observations through the unification of the perspectives afforded by each technology. In most cases, however, profiling technologies consume the used cells and thus pairwise correspondences between datasets are lost. Due to the sheer size single-cell datasets can acquire, scalable algorithms that are able to universally match single-cell measurements carried out in one cell to its corresponding sibling in another technology are needed. Results We propose Single-Cell data Integration via Matching (SCIM), a scalable approach to recover such correspondences in two or more technologies. SCIM assumes that cells share a common (low-dimensional) underlying structure and that the underlying cell distribution is approximately constant across technologies. It constructs a technology-invariant latent space using an autoencoder framework with an adversarial objective. Multi-modal datasets are integrated by pairing cells across technologies using a bipartite matching scheme that operates on the low-dimensional latent representations. We evaluate SCIM on a simulated cellular branching process and show that the cell-to-cell matches derived by SCIM reflect the same pseudotime on the simulated dataset. Moreover, we apply our method to two real-world scenarios, a melanoma tumor sample and a human bone marrow sample, where we pair cells from a scRNA dataset to their sibling cells in a CyTOF dataset achieving 90% and 78% cell-matching accuracy for each one of the samples, respectively.

Authors Stefan G Stark, Joanna Ficek, Francesco Locatello, Ximena Bonilla, Stéphane Chevrier, Franziska Singer, Tumor Profiler Consortium, Gunnar Rätsch, Kjong-Van Lehmann

Submitted Bioinformatics

Mikhail Karasikov, Harun Mustafa, Daniel Danciu, Marc Zimmermann, Christopher Barber, Gunnar Rätsch, André Kahles MetaGraph: Indexing and Analysing Nucleotide Archives at Petabase-scale bioRxiv

Abstract The amount of biological sequencing data available in public repositories is growing exponentially, forming an invaluable biomedical research resource. Yet, making all this sequencing data searchable and easily accessible to life science and data science researchers is an unsolved problem. We present MetaGraph, a versatile framework for the scalable analysis of extensive sequence repositories. MetaGraph efficiently indexes vast collections of sequences to enable fast search and comprehensive analysis. A wide range of underlying data structures offer different practically relevant trade-offs between the space taken by the index and its query performance. MetaGraph provides a flexible methodological framework allowing for index construction to be scaled from consumer laptops to distribution onto a cloud compute cluster for processing terabases to petabases of input data. Achieving compression ratios of up to 1,000-fold over the already compressed raw input data, MetaGraph can represent the content of large sequencing archives in the working memory of a single compute server. We demonstrate our framework’s scalability by indexing over 1.4 million whole genome sequencing (WGS) records from NCBI’s Sequence Read Archive, representing a total input of more than three petabases. Besides demonstrating the utility of MetaGraph indexes on key applications, such as experiment discovery, sequence alignment, error correction, and differential assembly, we make a wide range of indexes available as a community resource, including those over 450,000 microbial WGS records, more than 110,000 fungi WGS records, and more than 20,000 whole metagenome sequencing records. A subset of these indexes is made available online for interactive queries. All indexes created from public data comprising in total more than 1 million records are available for download or usage in the cloud. As an example of our indexes’ integrative analysis capabilities, we introduce the concept of differential assembly, which allows for the extraction of sequences present in a foreground set of samples but absent in a given background set. We apply this technique to differentially assemble contigs to identify pathogenic agents transfected via human kidney transplants. In a second example, we indexed more than 20,000 human RNA-Seq records from the TCGA and GTEx cohorts and use them to extract transcriptome features that are hard to characterize using a classical linear reference. We discovered over 200 trans-splicing events in GTEx and found broad evidence for tissue-specific non-A-to-I RNA-editing in GTEx and TCGA.

Authors Mikhail Karasikov, Harun Mustafa, Daniel Danciu, Marc Zimmermann, Christopher Barber, Gunnar Rätsch, André Kahles

Submitted bioRxiv

Metod Jazbec, Vincent Fortuin, Michael Pearce, Stephan Mandt, Gunnar Rätsch Scalable Gaussian Process Variational Autoencoders arXiv Preprints

Abstract Conventional variational autoencoders fail in modeling correlations between data points due to their use of factorized priors. Amortized Gaussian process inference through GP-VAEs has led to significant improvements in this regard, but is still inhibited by the intrinsic complexity of exact GP inference. We improve the scalability of these methods through principled sparse inference approaches. We propose a new scalable GP-VAE model that outperforms existing approaches in terms of runtime and memory footprint, is easy to implement, and allows for joint end-to-end optimization of all components.

Authors Metod Jazbec, Vincent Fortuin, Michael Pearce, Stephan Mandt, Gunnar Rätsch

Submitted arXiv Preprints

Xinrui Lyu, Jean Garret, Gunnar Rätsch, Kjong-Van Lehmann Mutational signature learning with supervised negative binomial non-negative matrix factorization Bioinformatics

Abstract Motivation Understanding the underlying mutational processes of cancer patients has been a long-standing goal in the community and promises to provide new insights that could improve cancer diagnoses and treatments. Mutational signatures are summaries of the mutational processes, and improving the derivation of mutational signatures can yield new discoveries previously obscured by technical and biological confounders. Results from existing mutational signature extraction methods depend on the size of available patient cohort and solely focus on the analysis of mutation count data without considering the exploitation of metadata. Results Here we present a supervised method that utilizes cancer type as metadata to extract more distinctive signatures. More specifically, we use a negative binomial non-negative matrix factorization and add a support vector machine loss. We show that mutational signatures extracted by our proposed method have a lower reconstruction error and are designed to be more predictive of cancer type than those generated by unsupervised methods. This design reduces the need for elaborate post-processing strategies in order to recover most of the known signatures unlike the existing unsupervised signature extraction methods. Signatures extracted by a supervised model used in conjunction with cancer-type labels are also more robust, especially when using small and potentially cancer-type limited patient cohorts. Finally, we adapted our model such that molecular features can be utilized to derive an according mutational signature. We used APOBEC expression and MUTYH mutation status to demonstrate the possibilities that arise from this ability. We conclude that our method, which exploits available metadata, improves the quality of mutational signatures as well as helps derive more interpretable representations.

Authors Xinrui Lyu, Jean Garret, Gunnar Rätsch, Kjong-Van Lehmann

Submitted Bioinformatics

Francesco Locatello, Ben Poole, Gunnar Rätsch, Bernhard Schölkopf, Olivier Bachem, Michael Tschannen Weakly-Supervised Disentanglement without Compromises ICML 2020

Abstract Intelligent agents should be able to learn useful representations by observing changes in their environment. We model such observations as pairs of non-i.i.d. images sharing at least one of the underlying factors of variation. First, we theoretically show that only knowing how many factors have changed, but not which ones, is sufficient to learn disentangled representations. Second, we provide practical algorithms that learn disentangled representations from pairs of images without requiring annotation of groups, individual factors, or the number of factors that have changed. Third, we perform a large-scale empirical study and show that such pairs of observations are sufficient to reliably learn disentangled representations on several benchmark data sets. Finally, we evaluate our learned representations and find that they are simultaneously useful on a diverse suite of tasks, including generalization under covariate shifts, fairness, and abstract reasoning. Overall, our results demonstrate that weak supervision enables learning of useful disentangled representations in realistic scenarios.

Authors Francesco Locatello, Ben Poole, Gunnar Rätsch, Bernhard Schölkopf, Olivier Bachem, Michael Tschannen

Submitted ICML 2020

Eran Segal , Feng Zhang, Xihong Lin , Gary King , Ophir Shalem , Smadar Shilo, William E. Allen, Faisal Alquaddoomi, Han Altae-Tran, Simon Anders , Ran Balicer, Tal Bauman, Ximena Bonilla , Gisel Booman , Andrew T. Chan , Ori Cohen, Silvano Coletti, Natalie Davidson, Yuval Dor, David A. Drew , Olivier Elemento, Georgina Evans, Phil Ewels , Joshua Gale, Amir Gavrieli, Benjamin Geiger, Yonatan H. Grad , Casey S. Greene, Iman Hajirasouliha, Roman Jerala , Andre Kahles, Olli Kallioniemi, Ayya Keshet, Ljupco Kocarev, Gregory Landua, Tomer Meir, Aline Muller, Long H. Nguyen, Matej Oresic , Svetlana Ovchinnikova, Hedi Peterson , Jana Prodanova, Jay Rajagopal, Gunnar Rätsch, Hagai Rossman, Johan Rung , Andrea Sboner, Alexandros Sigaras , Tim Spector , Ron Steinherz, Irene Stevens, Jaak Vilo , Paul Wilmes Building an international consortium for tracking coronavirus health status Nature Medicine

Abstract We call upon the research community to standardize efforts to use daily self-reported data about COVID-19 symptoms in the response to the pandemic and to form a collaborative consortium to maximize global gain while protecting participant privacy.

Authors Eran Segal , Feng Zhang, Xihong Lin , Gary King , Ophir Shalem , Smadar Shilo, William E. Allen, Faisal Alquaddoomi, Han Altae-Tran, Simon Anders , Ran Balicer, Tal Bauman, Ximena Bonilla , Gisel Booman , Andrew T. Chan , Ori Cohen, Silvano Coletti, Natalie Davidson, Yuval Dor, David A. Drew , Olivier Elemento, Georgina Evans, Phil Ewels , Joshua Gale, Amir Gavrieli, Benjamin Geiger, Yonatan H. Grad , Casey S. Greene, Iman Hajirasouliha, Roman Jerala , Andre Kahles, Olli Kallioniemi, Ayya Keshet, Ljupco Kocarev, Gregory Landua, Tomer Meir, Aline Muller, Long H. Nguyen, Matej Oresic , Svetlana Ovchinnikova, Hedi Peterson , Jana Prodanova, Jay Rajagopal, Gunnar Rätsch, Hagai Rossman, Johan Rung , Andrea Sboner, Alexandros Sigaras , Tim Spector , Ron Steinherz, Irene Stevens, Jaak Vilo , Paul Wilmes

Submitted Nature Medicine

Maciej Besta, Raghavendra Kanakagiri, Harun Mustafa, Mikhail Karasikov, Gunnar Rätsch, Torsten Hoefler, Edgar Solomonik Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons IPDPS 2020

Abstract Jaccard Similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. However, little efforts have been made to develop a scalable and high-performance scheme for computing the Jaccard Similarity for today's large data sets. To address this issue, we design and implement SimilarityAtScale, the first communicationefficient distributed algorithm for computing the Jaccard Similarity. The key idea is to express the problem algebraically, as a sequence of matrix operations, and implement these operations with communication-avoiding distributed routines to minimize the amount of transferred data and ensure both high scalability and low latency. We then apply our algorithm to the problem of obtaining distances between whole-genome sequencing samples, a key part of modern metagenomics analysis and an evergrowing need due to the increasing availability of high-throughput DNA sequencing data. The resulting scheme is the first to enable accurate Jaccard distance derivations for massive datasets, using large-scale distributed-memory systems. We package our routines in a tool, called GenomeAtScale, that combines the proposed algorithm with tools for processing input sequences. Our evaluation on real data illustrates that one can use GenomeAtScale to effectively employ tens of thousands of processors to reach new frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more general underlying SimilarityAtScale algorithm may be used for high-performance distributed similarity computations in other data analytics application domains.

Authors Maciej Besta, Raghavendra Kanakagiri, Harun Mustafa, Mikhail Karasikov, Gunnar Rätsch, Torsten Hoefler, Edgar Solomonik

Submitted IPDPS 2020

Pesho Ivanov, Benjamin Bichsel, Harun Mustafa, André Kahles, Gunnar Rätsch, Martin Vechev AStarix: Fast and Optimal Sequence-to-Graph Alignment RECOMB 2020

Abstract We present an algorithm for the optimal alignment of sequences to genome graphs. It works by phrasing the edit distance minimization task as finding a shortest path on an implicit alignment graph. To find a shortest path, we instantiate the A⋆ paradigm with a novel domain-specific heuristic function that accounts for the upcoming sub-sequence in the query to be aligned, resulting in a provably optimal alignment algorithm called AStarix. Experimental evaluation of AStarix shows that it is 1–2 orders of magnitude faster than state-of-the-art optimal algorithms on the task of aligning Illumina reads to reference genome graphs. Implementations and evaluations are available at https://github.com/eth-sri/astarix.

Authors Pesho Ivanov, Benjamin Bichsel, Harun Mustafa, André Kahles, Gunnar Rätsch, Martin Vechev

Submitted RECOMB 2020

PCAWG Transcriptome Core Group, Claudia Calabrese, Natalie R Davidson, Deniz Demircioğlu, Nuno A. Fonseca, Yao He, André Kahles, Kjong-Van Lehmann, Fenglin Liu, Yuichi Shiraishi, Cameron M. Soulette, Lara Urban, Liliana Greger, Siliang Li, Dongbing Liu, Marc D. Perry, Qian Xiang, Fan Zhang, Junjun Zhang, Peter Bailey, Serap Erkek, Katherine A. Hoadley, Yong Hou, Matthew R. Huska, Helena Kilpinen, Jan O. Korbel, Maximillian G. Marin, Julia Markowski, Tannistha Nandi, Qiang Pan-Hammarström, Chandra Sekhar Pedamallu, Reiner Siebert, Stefan G. Stark, Hong Su, Patrick Tan, Sebastian M. Waszak, Christina Yung, Shida Zhu, Philip Awadalla, Chad J. Creighton, Matthew Meyerson, B. F. Francis Ouellette, Kui Wu, Huanming Yang, PCAWG Transcriptome Working Group, Alvis Brazma, Angela N. Brooks, Jonathan Göke, Gunnar Rätsch, Roland F. Schwarz, Oliver Stegle, Zemin Zhang & PCAWG Consortium- Show fewer authors Nature volume 578, pages129–136(2020)Cite this article Genomic basis for RNA alterations in cancer Nature

Abstract Transcript alterations often result from somatic changes in cancer genomes. Various forms of RNA alterations have been described in cancer, including overexpression, altered splicing and gene fusions; however, it is difficult to attribute these to underlying genomic changes owing to heterogeneity among patients and tumour types, and the relatively small cohorts of patients for whom samples have been analysed by both transcriptome and whole-genome sequencing. Here we present, to our knowledge, the most comprehensive catalogue of cancer-associated gene alterations to date, obtained by characterizing tumour transcriptomes from 1,188 donors of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). Using matched whole-genome sequencing data, we associated several categories of RNA alterations with germline and somatic DNA alterations, and identified probable genetic mechanisms. Somatic copy-number alterations were the major drivers of variations in total gene and allele-specific expression. We identified 649 associations of somatic single-nucleotide variants with gene expression in cis, of which 68.4% involved associations with flanking non-coding regions of the gene. We found 1,900 splicing alterations associated with somatic mutations, including the formation of exons within introns in proximity to Alu elements. In addition, 82% of gene fusions were associated with structural variants, including 75 of a new class, termed ‘bridged’ fusions, in which a third genomic location bridges two genes. We observed transcriptomic alteration signatures that differ between cancer types and have associations with variations in DNA mutational signatures. This compendium of RNA alterations in the genomic context provides a rich resource for identifying genes and mechanisms that are functionally implicated in cancer.

Authors PCAWG Transcriptome Core Group, Claudia Calabrese, Natalie R Davidson, Deniz Demircioğlu, Nuno A. Fonseca, Yao He, André Kahles, Kjong-Van Lehmann, Fenglin Liu, Yuichi Shiraishi, Cameron M. Soulette, Lara Urban, Liliana Greger, Siliang Li, Dongbing Liu, Marc D. Perry, Qian Xiang, Fan Zhang, Junjun Zhang, Peter Bailey, Serap Erkek, Katherine A. Hoadley, Yong Hou, Matthew R. Huska, Helena Kilpinen, Jan O. Korbel, Maximillian G. Marin, Julia Markowski, Tannistha Nandi, Qiang Pan-Hammarström, Chandra Sekhar Pedamallu, Reiner Siebert, Stefan G. Stark, Hong Su, Patrick Tan, Sebastian M. Waszak, Christina Yung, Shida Zhu, Philip Awadalla, Chad J. Creighton, Matthew Meyerson, B. F. Francis Ouellette, Kui Wu, Huanming Yang, PCAWG Transcriptome Working Group, Alvis Brazma, Angela N. Brooks, Jonathan Göke, Gunnar Rätsch, Roland F. Schwarz, Oliver Stegle, Zemin Zhang & PCAWG Consortium- Show fewer authors Nature volume 578, pages129–136(2020)Cite this article

Submitted Nature

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, Olivier Bachem A Commentary on the Unsupervised Learning of Disentangled Representations AAAI 2020

Abstract The goal of the unsupervised learning of disentangled representations is to separate the independent explanatory factors of variation in the data without access to supervision. In this paper, we summarize the results of Locatello et al., 2019, and focus on their implications for practitioners. We discuss the theoretical result showing that the unsupervised learning of disentangled representations is fundamentally impossible without inductive biases and the practical challenges it entails. Finally, we comment on our experimental findings, highlighting the limitations of state-of-the-art approaches and directions for future research.

Authors Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, Olivier Bachem

Submitted AAAI 2020

Mikhail Karasikov , Harun Mustafa , Amir Joudaki , Sara Javadzadeh-no , Gunnar Rätsch , and André Kahles Sparse Binary Relation Representations for Genome Graph Annotation Journal of Computational Biology

Abstract High-throughput DNA sequencing data are accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. Although there has been good progress toward representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this study, we present a new compression approach, Multi-binary relation wavelet tree (BRWT), which is adaptive to different kinds of input data. We show an up to 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state-of-the-art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data, and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world data sets.

Authors Mikhail Karasikov , Harun Mustafa , Amir Joudaki , Sara Javadzadeh-no , Gunnar Rätsch , and André Kahles

Submitted Journal of Computational Biology

Philipp Markolin, Natalie R Davidson, Christian K. Hirt, Christophe D. Chabbert, Nicola Zamboni, Gerald Schwank, Wilhelm Krek, Gunnar Rätsch Characterisation of HIF-dependent alternative isoforms in pancreatic cancer bioaRxiv

Abstract Intra-tumor hypoxia is a common feature in many solid cancers. Although transcriptional targets of hypoxia-inducible factors (HIFs) have been well characterized, alternative splicing or processing of pre-mRNA transcripts which occurs during hypoxia and subsequent HIF stabilization is much less understood. Here, we identify HIF-dependent alternative splicing events after whole transcriptome sequencing in pancreatic cancer cells exposed to hypoxia with and without downregulation of the aryl hydrocarbon receptor nuclear translocator (ARNT), a protein required for HIFs to form a transcriptionally active dimer. We correlate the discovered hypoxia-driven events with available sequencing data from pan-cancer TCGA patient cohorts to select a narrow set of putative biologically relevant splice events for experimental validation. We validate a small set of candidate HIF-dependent alternative splicing events in multiple human cancer cell lines as well as patient-derived human pancreatic cancer organoids. Lastly, we report the discovery of a HIF-dependent mechanism to produce a hypoxia-dependent, long and coding isoform of the UDP-N-acetylglucosamine transporter SLC35A3.

Authors Philipp Markolin, Natalie R Davidson, Christian K. Hirt, Christophe D. Chabbert, Nicola Zamboni, Gerald Schwank, Wilhelm Krek, Gunnar Rätsch

Submitted bioaRxiv

Andreas Georgiou, Vincent Fortuin, Harun Mustafa, Gunnar Rätsch META^2: Memory-efficient taxonomic classification and abundance estimation for metagenomics with deep learning arXiv Preprints

Abstract Metagenomic studies have increasingly utilized sequencing technologies in order to analyze DNA fragments found in environmental samples.One important step in this analysis is the taxonomic classification of the DNA fragments. Conventional read classification methods require large databases and vast amounts of memory to run, with recent deep learning methods suffering from very large model sizes. We therefore aim to develop a more memory-efficient technique for taxonomic classification. A task of particular interest is abundance estimation in metagenomic samples. Current attempts rely on classifying single DNA reads independently from each other and are therefore agnostic to co-occurence patterns between taxa. In this work, we also attempt to take these patterns into account. We develop a novel memory-efficient read classification technique, combining deep learning and locality-sensitive hashing. We show that this approach outperforms conventional mapping-based and other deep learning methods for single-read taxonomic classification when restricting all methods to a fixed memory footprint. Moreover, we formulate the task of abundance estimation as a Multiple Instance Learning (MIL) problem and we extend current deep learning architectures with two different types of permutation-invariant MIL pooling layers: a) deepsets and b) attention-based pooling. We illustrate that our architectures can exploit the co-occurrence of species in metagenomic read sets and outperform the single-read architectures in predicting the distribution over taxa at higher taxonomic ranks.

Authors Andreas Georgiou, Vincent Fortuin, Harun Mustafa, Gunnar Rätsch

Submitted arXiv Preprints

Vincent Fortuin, Dmitry Baranchuk, Gunnar Rätsch, Stephan Mandt GP-VAE: Deep Probabilistic Time Series Imputation AISTATS 2020

Abstract Multivariate time series with missing values are common in areas such as healthcare and finance, and have grown in number and complexity over the years. This raises the question whether deep learning methodologies can outperform classical data imputation methods in this domain. However, naive applications of deep learning fall short in giving reliable confidence estimates and lack interpretability. We propose a new deep sequential latent variable model for dimensionality reduction and data imputation. Our modeling assumption is simple and interpretable: the high dimensional time series has a lower-dimensional representation which evolves smoothly in time according to a Gaussian process. The non-linear dimensionality reduction in the presence of missing data is achieved using a VAE approach with a novel structured variational approximation. We demonstrate that our approach outperforms several classical and deep learning-based data imputation methods on high-dimensional data from the domains of computer vision and healthcare, while additionally improving the smoothness of the imputations and providing interpretable uncertainty estimates.

Authors Vincent Fortuin, Dmitry Baranchuk, Gunnar Rätsch, Stephan Mandt

Submitted AISTATS 2020

Francesco Locatello, Michael Tschannen, Stefan Bauer, Gunnar Rätsch, Bernhard Schölkopf, Olivier Bachem Disentangling factors of variation using few labels ICLR 2020

Abstract Learning disentangled representations is considered a cornerstone problem in representation learning. Recently, Locatello et al.(2019) demonstrated that unsupervised disentanglement learning without inductive biases is theoretically impossible and that existing inductive biases and unsupervised methods do not allow to consistently learn disentangled representations. However, in many practical settings, one might have access to a very limited amount of supervision, for example through manual labeling of training examples. In this paper, we investigate the impact of such supervision on state-of-the-art disentanglement methods and perform a large scale study, training over 29000 models under well-defined and reproducible experimental conditions. We first observe that a very limited number of labeled examples (0.01--0.5% of the data set) is sufficient to perform model selection on state-of-the-art unsupervised models. Yet, if one has access to labels for supervised model selection, this raises the natural question of whether they should also be incorporated into the training process. As a case-study, we test the benefit of introducing (very limited) supervision into existing state-of-the-art unsupervised disentanglement methods exploiting both the values of the labels and the ordinal information that can be deduced from them. Overall, we empirically validate that with very little and potentially imprecise supervision it is possible to reliably learn disentangled representations.

Authors Francesco Locatello, Michael Tschannen, Stefan Bauer, Gunnar Rätsch, Bernhard Schölkopf, Olivier Bachem

Submitted ICLR 2020

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, Olivier Bachem Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations ICML 2019 - Best Paper Award

Abstract In recent years, the interest in \emph{unsupervised} learning of \emph{disentangled} representations has significantly increased. The key assumption is that real-world data is generated by a few explanatory factors of variation and that these factors can be recovered by unsupervised learning algorithms. A large number of unsupervised learning approaches based on \emph{auto-encoding} and quantitative evaluation metrics of disentanglement have been proposed; yet, the efficacy of the proposed approaches and utility of proposed notions of disentanglement has not been challenged in prior work. In this paper, we provide a sober look on recent progress in the field and challenge some common assumptions. We first theoretically show that the unsupervised learning of disentangled representations is fundamentally impossible without inductive biases on both the models and the data. Then, we train more than $\num{12000}$ models covering the six most prominent methods, and evaluate them across six disentanglement metrics in a reproducible large-scale experimental study on seven different data sets. On the positive side, we observe that different methods successfully enforce properties encouraged'' by the corresponding losses. On the negative side, we observe that in our study (1) good'' hyperparameters seemingly cannot be identified without access to ground-truth labels, (2) good hyperparameters neither transfer across data sets nor across disentanglement metrics, and (3) that increased disentanglement does not seem to lead to a decreased sample complexity of learning for downstream tasks. These results suggest that future work on disentanglement learning should be explicit about the role of inductive biases and (implicit) supervision, investigate concrete benefits of enforcing disentanglement of the learned representations, and consider a reproducible experimental setup covering several data sets.

Authors Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, Olivier Bachem

Submitted ICML 2019 - Best Paper Award

David Kuo, Jennifer Ding, Ian Cohn, Fan Zhang, Kevin Wei, Deepak Rao, Cristina Rozo, Upneet K Sokhi, Sara Shanaj, David J. Oliver, Adriana P. Echeverria, Edward F. DiCarlo, Michael B. Brenner, Vivian P. Bykerk, Susan M. Goodman, Soumya Raychaudhuri, Gunnar Rätsch, Lionel B. Ivashkiv, Laura T. Donlin HBEGF+ macrophages in rheumatoid arthritis induce fibroblast invasiveness Science Translational Medicine

Abstract Macrophages tailor their function according to the signals found in tissue microenvironments, assuming a wide spectrum of phenotypes. A detailed understanding of macrophage phenotypes in human tissues is limited. Using single-cell RNA sequencing, we defined distinct macrophage subsets in the joints of patients with the autoimmune disease rheumatoid arthritis (RA), which affects ~1% of the population. The subset we refer to as HBEGF+ inflammatory macrophages is enriched in RA tissues and is shaped by resident fibroblasts and the cytokine tumor necrosis factor (TNF). These macrophages promoted fibroblast invasiveness in an epidermal growth factor receptor–dependent manner, indicating that intercellular cross-talk in this inflamed setting reshapes both cell types and contributes to fibroblast-mediated joint destruction. In an ex vivo synovial tissue assay, most medications used to treat RA patients targeted HBEGF+ inflammatory macrophages; however, in some cases, medication redirected them into a state that is not expected to resolve inflammation. These data highlight how advances in our understanding of chronically inflamed human tissues and the effects of medications therein can be achieved by studies on local macrophage phenotypes and intercellular interactions.

Authors David Kuo, Jennifer Ding, Ian Cohn, Fan Zhang, Kevin Wei, Deepak Rao, Cristina Rozo, Upneet K Sokhi, Sara Shanaj, David J. Oliver, Adriana P. Echeverria, Edward F. DiCarlo, Michael B. Brenner, Vivian P. Bykerk, Susan M. Goodman, Soumya Raychaudhuri, Gunnar Rätsch, Lionel B. Ivashkiv, Laura T. Donlin

Submitted Science Translational Medicine

Melanie F. Pradier, Stephanie L. Hyland, Stefan G. Stark, Kjong-Van Lehmann, Julia E. Vogt, Fernando Perez-Cruz, Gunnar Rätsch A Bayesian Nonparametric Approach to Discover Clinico-Genetic Associations across Cancer Types biorxiv

Abstract Motivation: Personalized medicine aims at combining genetic, clinical, and environmental data to improve medical diagnosis and disease treatment, tailored to each patient. This paper presents a Bayesian nonparametric (BNP) approach to identify genetic associations with clinical/environmental features in cancer. We propose an unsupervised approach to generate data-driven hypotheses and bring potentially novel insights about cancer biology. Our model combines somatic mutation information at gene-level with features extracted from the Electronic Health Record. We propose a hierarchical approach, the hierarchical Poisson factor analysis (HPFA) model, to share information across patients having different types of cancer. To discover statistically significant associations, we combine Bayesian modeling with bootstrapping techniques and correct for multiple hypothesis testing. Results: Using our approach, we empirically demonstrate that we can recover well-known associations in cancer literature. We compare the results of H-PFA with two other classical methods in the field: case-control (CC) setups, and linear mixed models (LMMs).

Authors Melanie F. Pradier, Stephanie L. Hyland, Stefan G. Stark, Kjong-Van Lehmann, Julia E. Vogt, Fernando Perez-Cruz, Gunnar Rätsch

Submitted biorxiv

Stefan G Stark, Stephanie L Hyland, Melanie F Pradier, Kjong-Van Lehmann, Andreas Wicki, Fernando Perez Cruz, Julia E Vogt, Gunnar Rätsch Unsupervised Extraction of Phenotypes from Cancer Clinical Notes for Association Studies arxiv

Abstract The recent adoption of Electronic Health Records (EHRs) by health care providers has introduced an important source of data that provides detailed and highly specific insights into patient phenotypes over large cohorts. These datasets, in combination with machine learning and statistical approaches, generate new opportunities for research and clinical care. However, many methods require the patient representations to be in structured formats, while the information in the EHR is often locked in unstructured texts designed for human readability. In this work, we develop the methodology to automatically extract clinical features from clinical narratives from large EHR corpora without the need for prior knowledge. We consider medical terms and sentences appearing in clinical narratives as atomic information units. We propose an efficient clustering strategy suitable for the analysis of large text corpora and to utilize the clusters to represent information about the patient compactly. To demonstrate the utility of our approach, we perform an association study of clinical features with somatic mutation profiles from 4,007 cancer patients and their tumors. We apply the proposed algorithm to a dataset consisting of about 65 thousand documents with a total of about 3.2 million sentences. We identify 341 significant statistical associations between the presence of somatic mutations and clinical features. We annotated these associations according to their novelty, and report several known associations. We also propose 32 testable hypotheses where the underlying biological mechanism does not appear to be known but plausible. These results illustrate that the …

Authors Stefan G Stark, Stephanie L Hyland, Melanie F Pradier, Kjong-Van Lehmann, Andreas Wicki, Fernando Perez Cruz, Julia E Vogt, Gunnar Rätsch

Submitted arxiv

David Kuo, Jennifer Ding, Ian Cohn, Fan Zhang, Kevin Wei, Deepak Rao, Cristina Rozo, Upneet K Sokhi, Accelerating Medicines Partnership RA/SLE Network, Edward F. DiCarlo, Michael B. Brenner, Vivian P. Bykerk, VSusan M. Goodman, Soumya Raychaudhuri, Gunnar Rätsch, Lionel B. Ivashkiv, Laura T. Donlin HBEGF+ macrophages identified in rheumatoid arthritis promote joint tissue invasiveness and are reshaped differentially by medications bioRxiv

Abstract Macrophages tailor their function to the signals found in tissue microenvironments, taking on a wide spectrum of phenotypes. In human tissues, a detailed understanding of macrophage phenotypes is limited. Using single-cell RNA-sequencing, we define distinct macrophage subsets in the joints of patients with the autoimmune disease rheumatoid arthritis (RA), which affects ~1% of the population. The subset we refer to as HBEGF+ inflammatory macrophages is enriched in RA tissues and shaped by resident fibroblasts and the cytokine TNF. These macrophages promote fibroblast invasiveness in an EGF receptor dependent manner, indicating that inflammatory intercellular crosstalk reshapes both cell types and contributes to fibroblast-mediated joint destruction. In an ex vivo tissue assay, the HBEGF+ inflammatory macrophage is targeted by several anti-inflammatory RA medications, however, COX inhibition redirects it towards a different inflammatory phenotype that is also expected to perpetuate pathology. These data highlight advances in understanding the pathophysiology and drug mechanisms in chronic inflammatory disorders can be achieved by focusing on macrophage phenotypes in the context of complex interactions in human tissues.

Authors David Kuo, Jennifer Ding, Ian Cohn, Fan Zhang, Kevin Wei, Deepak Rao, Cristina Rozo, Upneet K Sokhi, Accelerating Medicines Partnership RA/SLE Network, Edward F. DiCarlo, Michael B. Brenner, Vivian P. Bykerk, VSusan M. Goodman, Soumya Raychaudhuri, Gunnar Rätsch, Lionel B. Ivashkiv, Laura T. Donlin

Submitted bioRxiv

Vincent Fortuin, Heiko Strathmann, Gunnar Rätsch Meta-Learning Mean Functions for Gaussian Processes arXiv Preprints

Abstract When fitting Bayesian machine learning models on scarce data, the main challenge is to obtain suitable prior knowledge and encode it into the model. Recent advances in meta-learning offer powerful methods for extracting such prior knowledge from data acquired in related tasks. When it comes to meta-learning in Gaussian process models, approaches in this setting have mostly focused on learning the kernel function of the prior, but not on learning its mean function. In this work, we explore meta-learning the mean function of a Gaussian process prior. We present analytical and empirical evidence that mean function learning can be useful in the meta-learning setting, discuss the risk of overfitting, and draw connections to other meta-learning approaches, such as model agnostic meta-learning and functional PCA.

Authors Vincent Fortuin, Heiko Strathmann, Gunnar Rätsch

Submitted arXiv Preprints

Melissa S. Cline , Rachel G. Liao , Michael T. Parsons , Benedict Paten , Faisal Alquaddoomi, Antonis Antoniou, Samantha Baxter, Larry Brody, Robert Cook-Deegan, Amy Coffin, Fergus J. Couch, Brian Craft, Robert Currie, Chloe C. Dlott, Lena Dolman, Johan T. den Dunnen, Stephanie O. M. Dyke, Susan M. Domchek, Douglas Easton, Zachary Fischmann, William D. Foulkes, Judy Garber, David Goldgar, Mary J. Goldman, Peter Goodhand, Steven Harrison, David Haussler, Kazuto Kato, Bartha Knoppers, Charles Markello, Robert Nussbaum, Kenneth Offit, Sharon E. Plon, Jem Rashbass, Heidi L. Rehm, Mark Robson, Wendy S. Rubinstein, Dominique Stoppa-Lyonnet, Sean Tavtigian, Adrian Thorogood, Can Zhang, Marc Zimmermann, BRCA Challenge Authors , John Burn , Stephen Chanock , Gunnar Rätsch , Amanda B. Spurdle BRCA Challenge: BRCA Exchange as a global resource for variants in BRCA1 and BRCA2 PLOS Genetics

Abstract The BRCA Challenge is a long-term data-sharing project initiated within the Global Alliance for Genomics and Health (GA4GH) to aggregate BRCA1 and BRCA2 data to support highly collaborative research activities. Its goal is to generate an informed and current understanding of the impact of genetic variation on cancer risk across the iconic cancer predisposition genes, BRCA1 and BRCA2. Initially, reported variants in BRCA1 and BRCA2 available from public databases were integrated into a single, newly created site, www.brcaexchange.org. The purpose of the BRCA Exchange is to provide the community with a reliable and easily accessible record of variants interpreted for a high-penetrance phenotype. More than 20,000 variants have been aggregated, three times the number found in the next-largest public database at the project’s outset, of which approximately 7,250 have expert classifications. The data set is based on shared information from existing clinical databases—Breast Cancer Information Core (BIC), ClinVar, and the Leiden Open Variation Database (LOVD)—as well as population databases, all linked to a single point of access. The BRCA Challenge has brought together the existing international Evidence-based Network for the Interpretation of Germline Mutant Alleles (ENIGMA) consortium expert panel, along with expert clinicians, diagnosticians, researchers, and database providers, all with a common goal of advancing our understanding of BRCA1 and BRCA2 variation. Ongoing work includes direct contact with national centers with access to BRCA1 and BRCA2 diagnostic data to encourage data sharing, development of methods suitable for extraction of genetic variation at the level of individual laboratory reports, and engagement with participant communities to enable a more comprehensive understanding of the clinical significance of genetic variation in BRCA1 and BRCA2.

Authors Melissa S. Cline , Rachel G. Liao , Michael T. Parsons , Benedict Paten , Faisal Alquaddoomi, Antonis Antoniou, Samantha Baxter, Larry Brody, Robert Cook-Deegan, Amy Coffin, Fergus J. Couch, Brian Craft, Robert Currie, Chloe C. Dlott, Lena Dolman, Johan T. den Dunnen, Stephanie O. M. Dyke, Susan M. Domchek, Douglas Easton, Zachary Fischmann, William D. Foulkes, Judy Garber, David Goldgar, Mary J. Goldman, Peter Goodhand, Steven Harrison, David Haussler, Kazuto Kato, Bartha Knoppers, Charles Markello, Robert Nussbaum, Kenneth Offit, Sharon E. Plon, Jem Rashbass, Heidi L. Rehm, Mark Robson, Wendy S. Rubinstein, Dominique Stoppa-Lyonnet, Sean Tavtigian, Adrian Thorogood, Can Zhang, Marc Zimmermann, BRCA Challenge Authors , John Burn , Stephen Chanock , Gunnar Rätsch , Amanda B. Spurdle

Submitted PLOS Genetics

Vincent Fortuin, Matthias Hüser, Francesco Locatello, Heiko Strathmann, Gunnar Rätsch SOM-VAE: Interpretable Discrete Representation Learning on Time Series ICLR 2019

Abstract High-dimensional time series are common in many domains. Since human cognition is not optimized to work well in high-dimensional spaces, these areas could benefit from interpretable low-dimensional representations. However, most representation learning algorithms for time series data are difficult to interpret. This is due to non-intuitive mappings from data features to salient properties of the representation and non-smoothness over time. To address this problem, we propose a new representation learning framework building on ideas from interpretable discrete dimensionality reduction and deep generative modeling. This framework allows us to learn discrete representations of time series, which give rise to smooth and interpretable embeddings with superior clustering performance. We introduce a new way to overcome the non-differentiability in discrete representation learning and present a gradient-based version of the traditional self-organizing map algorithm that is more performant than the original. Furthermore, to allow for a probabilistic interpretation of our method, we integrate a Markov model in the representation space. This model uncovers the temporal transition structure, improves clustering performance even further and provides additional explanatory insights as well as a natural representation of uncertainty. We evaluate our model in terms of clustering performance and interpretability on static (Fashion-)MNIST data, a time series of linearly interpolated (Fashion-)MNIST images, a chaotic Lorenz attractor system with two macro states, as well as on a challenging real world medical time series application on the eICU data set. Our learned representations compare favorably with competitor methods and facilitate downstream tasks on the real world data.

Authors Vincent Fortuin, Matthias Hüser, Francesco Locatello, Heiko Strathmann, Gunnar Rätsch

Submitted ICLR 2019

Xinrui Lyu, Matthias Hüser, Stephanie L. Hyland, George Zerveas, Gunnar Rätsch Improving Clinical Predictions through Unsupervised Time Series Representation Learning Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 - Spotlight

Abstract In this work, we investigate unsupervised representation learning on medical time series, which bears the promise of leveraging copious amounts of existing unlabeled data in order to eventually assist clinical decision making. By evaluating on the prediction of clinically relevant outcomes, we show that in a practical setting, unsupervised representation learning can offer clear performance benefits over end-to-end supervised architectures. We experiment with using sequence-to-sequence (Seq2Seq) models in two different ways, as an autoencoder and as a forecaster, and show that the best performance is achieved by a forecasting Seq2Seq model with an integrated attention mechanism, proposed here for the first time in the setting of unsupervised learning for medical time series.

Authors Xinrui Lyu, Matthias Hüser, Stephanie L. Hyland, George Zerveas, Gunnar Rätsch

Submitted Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 - Spotlight

Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh-No, Gunnar Rätsch, Andre Kahles Sparse Binary Relation Representations for Genome Graph Annotation RECOMB 2019

Abstract High-throughput DNA sequencing data is accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. While there has been good progress towards representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this work, we present a new compression approach, Multi-BRWT, which is adaptive to different kinds of input data. We show an up to 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state-of-the-art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world datasets.

Authors Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh-No, Gunnar Rätsch, Andre Kahles

Submitted RECOMB 2019

Vincent Fortuin, Gideon Dresdner, Heiko Strathmann, Gunnar Rätsch Scalable Gaussian Processes on Discrete Domains arXiv Preprints

Abstract Kernel methods on discrete domains have shown great promise for many challenging tasks, e.g., on biological sequence data as well as on molecular structures. Scalable kernel methods like support vector machines offer good predictive performances but they often do not provide uncertainty estimates. In contrast, probabilistic kernel methods like Gaussian Processes offer uncertainty estimates in addition to good predictive performance but fall short in terms of scalability. We present the first sparse Gaussian Process approximation framework on discrete input domains. Our framework achieves good predictive performance as well as uncertainty estimates using different discrete optimization techniques. We present competitive results comparing our framework to support vector machine and full Gaussian Process baselines on synthetic data as well as on challenging real-world DNA sequence data.

Authors Vincent Fortuin, Gideon Dresdner, Heiko Strathmann, Gunnar Rätsch

Submitted arXiv Preprints

Andre Kahles, Kjong-Van Lehmann, Nora C. Toussaint, Matthias Hüser, Stefan Stark, Timo Sachsenberg, Oliver Stegle, Oliver Kohlbacher, Chris Sander, Gunnar Rätsch, The Cancer Genome Atlas Research Network Comprehensive Analysis of Alternative Splicing Across Tumors from 8,705 Patients Cancer Cell

Abstract Our comprehensive analysis of alternative splicing across 32 The Cancer Genome Atlas cancer types from 8,705 patients detects alternative splicing events and tumor variants by reanalyzing RNA and whole-exome sequencing data. Tumors have up to 30% more alternative splicing events than normal samples. Association analysis of somatic variants with alternative splicing events confirmed known trans associations with variants in SF3B1 and U2AF1 and identified additional trans-acting variants (e.g., TADA1, PPP2R1A). Many tumors have thousands of alternative splicing events not detectable in normal samples; on average, we identified ≈930 exon-exon junctions (“neojunctions”) in tumors not typically found in GTEx normals. From Clinical Proteomic Tumor Analysis Consortium data available for breast and ovarian tumor samples, we confirmed ≈1.7 neojunction- and ≈0.6 single nucleotide variant-derived peptides per tumor sample that are also predicted major histocompatibility complex-I binders (“putative neoantigens”).

Authors Andre Kahles, Kjong-Van Lehmann, Nora C. Toussaint, Matthias Hüser, Stefan Stark, Timo Sachsenberg, Oliver Stegle, Oliver Kohlbacher, Chris Sander, Gunnar Rätsch, The Cancer Genome Atlas Research Network

Submitted Cancer Cell

Stephanie O. M. Dyke, Mikael Linden, […], Gunnar Rätsch, […], Paul Flicek Registered access: authorizing data access European Journal of Human Genetics

Abstract The Global Alliance for Genomics and Health (GA4GH) proposes a data access policy model—“registered access”—to increase and improve access to data requiring an agreement to basic terms and conditions, such as the use of DNA sequence and health data in research. A registered access policy would enable a range of categories of users to gain access, starting with researchers and clinical care professionals. It would also facilitate general use and reuse of data but within the bounds of consent restrictions and other ethical obligations. In piloting registered access with the Scientific Demonstration data sharing projects of GA4GH, we provide additional ethics, policy and technical guidance to facilitate the implementation of this access model in an international setting.

Authors Stephanie O. M. Dyke, Mikael Linden, […], Gunnar Rätsch, […], Paul Flicek

Submitted European Journal of Human Genetics

Stephanie Hyland, Matthias Hüser, Xinrui Lyu, Martin Faltys, Tobias Merz, Gunnar Rätsch Predicting circulatory system deterioration in intensive care unit patients Proceedings of the First Joint Workshop on AI in Health

Abstract The deterioration of organ function in ICU patients requires swift response to prevent further damage to vital systems. Focusing on the circulatory system, we build a model to predict if a patient’s state will deteriorate in the near future. We identify circulatory system dys- function using the combination of excess lactic acid in the blood and low mean arterial blood pressure or the presence of vasoactive drugs. Using an observational cohort of 45,000 patients from a Swiss ICU, we extract and process patient time series and identify periods of circulatory system dysfunction to develop an early warning system. We train a gra- dient boosting model to perform binary classification every five minutes on whether the patient will deteriorate during an increasingly large win- dow into the future, up to the duration of a shift (8 hours). The model achieves an AUROC between 0.952 and 0.919 across the prediction win- dows, and an AUPRC between 0.223 and 0.384 for events with positive prevalence between 0.014 and 0.042. We also show preliminary results from a recurrent neural network. These results show that contemporary machine learning approaches combined with careful preprocessing of raw data collected during routine care yield clinically useful predictions in near real time [Workshop Abstract]

Authors Stephanie Hyland, Matthias Hüser, Xinrui Lyu, Martin Faltys, Tobias Merz, Gunnar Rätsch

Submitted Proceedings of the First Joint Workshop on AI in Health

Harun Mustafa, Ingo Schilken, Mikhail Karasikov, Carsten Eickhoff, Gunnar Rätsch, Andre Kahles Dynamic compression schemes for graph coloring Bioinformatics

Abstract Motivation: Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data is a condensed form as an assembly graph. Such a representation contains all sequence information but does not store contextual information and metadata. Results: We present two new approaches for a compressed representation of a graph coloring: a lossless compression scheme based on a novel application of wavelet tries as well as a highly accurate lossy compression based on a set of Bloom filters. Both strategies retain a coloring even when adding to the underlying graph topology. We present construction and merge procedures for both methods and evaluate their performance on a wide range of different datasets. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph, we can reduce memory requirements by up to three orders of magnitude. Representing individual colors as independently stored modules, our approaches can be efficiently parallelized and provide strategies for dynamic use. These properties allow for an easy upscaling to the problem sizes common to the biomedical domain. Availability: We provide prototype implementations in C++, summaries of our experiments as well as links to all datasets publicly at https://github.com/ratschlab/graph_annotation.

Authors Harun Mustafa, Ingo Schilken, Mikhail Karasikov, Carsten Eickhoff, Gunnar Rätsch, Andre Kahles

Submitted Bioinformatics

Francesco Locatello, Anant Raj, Sai Praneeth Reddy, Gunnar Rätsch, Bernhard Schölkopf, Sebastian U Stich, Martin Jaggi On Matching Pursuit and Coordinate Descent ICML 2018

Abstract Two popular examples of first-order optimization methods over linear spaces are coordinate descent and matching pursuit algorithms, with their randomized variants. While the former targets the optimization by moving along coordinates, the latter considers a generalized notion of directions. Exploiting the connection between the two algorithms, we present a unified analysis of both, providing affine invariant sublinear $O(1/t)$ rates on smooth objectives and linear convergence on strongly convex objectives. As a byproduct of our affine invariant analysis of matching pursuit, our rates for steepest coordinate descent are the tightest known. Furthermore, we show the first accelerated convergence rate $O(1/t^2)$ for matching pursuit and steepest coordinate descent on convex objectives.

Authors Francesco Locatello, Anant Raj, Sai Praneeth Reddy, Gunnar Rätsch, Bernhard Schölkopf, Sebastian U Stich, Martin Jaggi

Submitted ICML 2018

Francesco Locatello, Gideon Dresdner, Rajiv Khanna, Isabel Valera, Gunnar Rätsch Boosting Black Box Variational Inference NeurIPS 2018 (spotlight)

Abstract Approximating a probability density in a tractable manner is a central task in Bayesian statistics. Variational Inference (VI) is a popular technique that achieves tractability by choosing a relatively simple variational family. Borrowing ideas from the classic boosting framework, recent approaches attempt to \emph{boost} VI by replacing the selection of a single density with a greedily constructed mixture of densities. In order to guarantee convergence, previous works impose stringent assumptions that require significant effort for practitioners. Specifically, they require a custom implementation of the greedy step (called the LMO) for every probabilistic model with respect to an unnatural variational family of truncated distributions. Our work fixes these issues with novel theoretical and algorithmic insights. On the theoretical side, we show that boosting VI satisfies a relaxed smoothness assumption which is sufficient for the convergence of the functional Frank-Wolfe (FW) algorithm. Furthermore, we rephrase the LMO problem and propose to maximize the Residual ELBO (RELBO) which replaces the standard ELBO optimization in VI. These theoretical enhancements allow for black box implementation of the boosting subroutine. Finally, we present a stopping criterion drawn from the duality gap in the classic FW analyses and exhaustive experiments to illustrate the usefulness of our theoretical and algorithmic contributions.

Authors Francesco Locatello, Gideon Dresdner, Rajiv Khanna, Isabel Valera, Gunnar Rätsch

Submitted NeurIPS 2018 (spotlight)

Francesco Locatello, Damien Vincent, Ilya Tolstikhin, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf Clustering Meets Implicit Generative Models Arxiv

Abstract Clustering is a cornerstone of unsupervised learning which can be thought as disentangling the multiple generative mechanisms underlying the data. In this paper we introduce an algorithmic framework to train mixtures of implicit generative models which we instantiate for variational autoencoders. Relying on an additional set of discriminators, we propose a competitive procedure in which the models only need to approximate the portion of the data distribution from which they can produce realistic samples. As a byproduct, each model is simpler to train, and a clustering interpretation arises naturally from the partitioning of the training points among the models. We empirically show that our approach splits the training distribution in a reasonable way and increases the quality of the generated samples.

Authors Francesco Locatello, Damien Vincent, Ilya Tolstikhin, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf

Submitted Arxiv

Francesco Locatello, Rajiv Khanna, Joydeep Ghosh, Gunnar Rätsch Boosting Variational Inference: an Optimization Perspective AISTATS 2018

Abstract Variational Inference is a popular technique to approximate a possibly intractable Bayesian posterior with a more tractable one. Recently, Boosting Variational Inference has been proposed as a new paradigm to approximate the posterior by a mixture of densities by greedily adding components to the mixture. In the present work, we study the convergence properties of this approach from a modern optimization viewpoint by establishing connections to the classic Frank-Wolfe algorithm. Our analyses yields novel theoretical insights on the Boosting of Variational Inference regarding the sufficient conditions for convergence, explicit sublinear/linear rates, and algorithmic simplifications.

Authors Francesco Locatello, Rajiv Khanna, Joydeep Ghosh, Gunnar Rätsch

Submitted AISTATS 2018

Claudia Calabrese, Natalie R Davidson, Nuno A Fonseca, Yao He, André Kahles, Kjong-Van Lehmann, Fenglin Liu, Yuichi Shiraishi, Cameron M Soulette, Lara Urban, Deniz Demircioğlu, Liliana Greger, Siliang Li, Dongbing Liu, Marc D Perry, Linda Xiang, Fan Zhang, Junjun Zhang, Peter Bailey, Serap Erkek, Katherine A Hoadley, Yong Hou, Helena Kilpinen, Jan O Korbel, Maximillian G Marin, Julia Markowski, Tannistha Nandi, Qiang Pan-Hammarström, Chandra S Pedamallu, Reiner Siebert, Stefan G Stark, Hong Su, Patrick Tan, Sebastian M Waszak, Christina Yung, Shida Zhu, Philip Awadalla, Chad J Creighton, Matthew Meyerson, B Francis F Ouellette, Kui Wu, Huanming Yang, Alvis Brazma, Angela N Brooks, Jonathan Göke, Gunnar Rätsch, Roland F Schwarz, Oliver Stegle, Zemin Zhang Genomic basis for RNA alterations revealed by whole-genome analyses of 27 cancer types bioRxiv

Abstract We present the most comprehensive catalogue of cancer-associated gene alterations through characterization of tumor transcriptomes from 1,188 donors of the Pan-Cancer Analysis of Whole Genomes project. Using matched whole-genome sequencing data, we attributed RNA alterations to germline and somatic DNA alterations, revealing likely genetic mechanisms. We identified 444 associations of gene expression with somatic non-coding single-nucleotide variants. We found 1,872 splicing alterations associated with somatic mutation in intronic regions, including novel exonization events associated with Alu elements. Somatic copy number alterations were the major driver of total gene and allele-specific expression (ASE) variation. Additionally, 82% of gene fusions had structural variant support, including 75 of a novel class called "bridged" fusions, in which a third genomic location bridged two different genes. Globally, we observe transcriptomic alteration signatures that differ between cancer types and have associations with DNA mutational signatures. Given this unique dataset of RNA alterations, we also identified 1,012 genes significantly altered through both DNA and RNA mechanisms. Our study represents an extensive catalog of RNA alterations and reveals new insights into the heterogeneous molecular mechanisms of cancer gene alterations.

Authors Claudia Calabrese, Natalie R Davidson, Nuno A Fonseca, Yao He, André Kahles, Kjong-Van Lehmann, Fenglin Liu, Yuichi Shiraishi, Cameron M Soulette, Lara Urban, Deniz Demircioğlu, Liliana Greger, Siliang Li, Dongbing Liu, Marc D Perry, Linda Xiang, Fan Zhang, Junjun Zhang, Peter Bailey, Serap Erkek, Katherine A Hoadley, Yong Hou, Helena Kilpinen, Jan O Korbel, Maximillian G Marin, Julia Markowski, Tannistha Nandi, Qiang Pan-Hammarström, Chandra S Pedamallu, Reiner Siebert, Stefan G Stark, Hong Su, Patrick Tan, Sebastian M Waszak, Christina Yung, Shida Zhu, Philip Awadalla, Chad J Creighton, Matthew Meyerson, B Francis F Ouellette, Kui Wu, Huanming Yang, Alvis Brazma, Angela N Brooks, Jonathan Göke, Gunnar Rätsch, Roland F Schwarz, Oliver Stegle, Zemin Zhang

Submitted bioRxiv

Ingo Schilken, Harun Mustafa, Gunnar Rätsch, Carsten Eickhoff, Andre Kahles Efficient graph-color compression with neighborhood-informed Bloom filters bioRxiv

Abstract Technological advancements in high throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains inaccessible to the research com- munity through a lack efficient data representation and indexing solutions. One of the available techniques to represent read data on a more abstract level is its transformation into an assem- bly graph. Although the sequence information is now accessible, any contextual annotation and metadata is lost. We present a new approach for a compressed representation of a graph coloring based on a set of Bloom filters. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph to decide on false positives, we can reduce the memory requirements for a given set of colors per edge by three orders of magnitude. As insertion and query on a Bloom filter are constant time operations, the complexity to compress and decompress an edge color is linear in the number of color bits. Representing individual colors as independent filters, our approach is fully dynamic and can be easily parallelized. These properties allow for an easy upscaling to the problem sizes common in the biomedical domain. A prototype implementation of our method is available in Java.

Authors Ingo Schilken, Harun Mustafa, Gunnar Rätsch, Carsten Eickhoff, Andre Kahles

Submitted bioRxiv