Welcome to the Biomedical Informatics Lab of Prof. Dr. Gunnar Rätsch

The research in our group lies at the interface between methods research in Machine Learning, Genomics and Medical Informatics and relevant applications in biology and medicine.

We develop new analysis techniques that are capable of dealing with large amounts of medical and genomic data. These techniques aim to provide accurate predictions about the phenomenon at hand and to explain their prognoses comprehensibly, thereby assisting in gaining new biomedical insights.

Current research includes a) Machine Learning related to time-series analysis and iterative optimization algorithms, b) methods for transcriptome analyses to study transcriptome alterations in cancer, c) developing clinical decision support systems, in particular, for time series data from intensive care units, d) new graph genome algorithms to store and analyze very large sets of genomic sequences, and e) developing methods and resources for international sharing of genomic and clinical data, for instance, about variants in BRCA1/2.

Abstract High-dimensional time series are common in many domains. Since human cognition is not optimized to work well in high-dimensional spaces, these areas could benefit from interpretable low-dimensional representations. However, most representation learning algorithms for time series data are difficult to interpret. This is due to non-intuitive mappings from data features to salient properties of the representation and non-smoothness over time. To address this problem, we propose a new representation learning framework building on ideas from interpretable discrete dimensionality reduction and deep generative modeling. This framework allows us to learn discrete representations of time series, which give rise to smooth and interpretable embeddings with superior clustering performance. We introduce a new way to overcome the non-differentiability in discrete representation learning and present a gradient-based version of the traditional self-organizing map algorithm that is more performant than the original. Furthermore, to allow for a probabilistic interpretation of our method, we integrate a Markov model in the representation space. This model uncovers the temporal transition structure, improves clustering performance even further and provides additional explanatory insights as well as a natural representation of uncertainty. We evaluate our model in terms of clustering performance and interpretability on static (Fashion-)MNIST data, a time series of linearly interpolated (Fashion-)MNIST images, a chaotic Lorenz attractor system with two macro states, as well as on a challenging real world medical time series application on the eICU data set. Our learned representations compare favorably with competitor methods and facilitate downstream tasks on the real world data.

Authors Vincent Fortuin, Matthias Hüser, Francesco Locatello, Heiko Strathmann, Gunnar Rätsch

Submitted ICLR 2019
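
For readers unfamiliar with the traditional self-organizing map (SOM) algorithm that the abstract above refers to, the following is a minimal sketch of one classic SOM update step. It is not the gradient-based variant proposed in the paper. The function name `som_step`, the dictionary-based grid, and the default values of `lr` and `sigma` are assumptions chosen purely for illustration.

```python
import math


def som_step(weights, x, lr=0.5, sigma=1.0):
    """Apply one classic SOM update and return (new_weights, bmu).

    weights: dict mapping grid coordinates (i, j) to prototype vectors (lists).
    x: one input vector (list of floats).
    lr and sigma are illustrative defaults, not values from the paper.
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Best-matching unit: the grid node whose prototype is closest to x.
    bmu = min(weights, key=lambda node: sq_dist(weights[node], x))

    new_weights = {}
    for node, w in weights.items():
        # Gaussian neighbourhood: nodes near the BMU on the grid move more.
        h = math.exp(-sq_dist(node, bmu) / (2 * sigma ** 2))
        # Pull the prototype toward x, scaled by learning rate and neighbourhood.
        new_weights[node] = [wi + lr * h * (xi - wi) for wi, xi in zip(w, x)]
    return new_weights, bmu
```

Because neighbouring grid nodes are pulled toward the same inputs, nearby nodes end up with similar prototypes. This neighbourhood structure is what makes the resulting discrete representation interpretable as a map.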

Abstract In recent years, the interest in unsupervised learning of disentangled representations has significantly increased. The key assumption is that real-world data is generated by a few explanatory factors of variation and that these factors can be recovered by unsupervised learning algorithms. A large number of unsupervised learning approaches based on auto-encoding and quantitative evaluation metrics of disentanglement have been proposed; yet, the efficacy of the proposed approaches and utility of proposed notions of disentanglement has not been challenged in prior work. In this paper, we provide a sober look at recent progress in the field and challenge some common assumptions. We first theoretically show that the unsupervised learning of disentangled representations is fundamentally impossible without inductive biases on both the models and the data. Then, we train more than 12,000 models covering the six most prominent methods, and evaluate them across six disentanglement metrics in a reproducible large-scale experimental study on seven different data sets. On the positive side, we observe that different methods successfully enforce properties “encouraged” by the corresponding losses. On the negative side, we observe that in our study (1) “good” hyperparameters seemingly cannot be identified without access to ground-truth labels, (2) good hyperparameters neither transfer across data sets nor across disentanglement metrics, and (3) increased disentanglement does not seem to lead to a decreased sample complexity of learning for downstream tasks. These results suggest that future work on disentanglement learning should be explicit about the role of inductive biases and (implicit) supervision, investigate concrete benefits of enforcing disentanglement of the learned representations, and consider a reproducible experimental setup covering several data sets.

Authors Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, Olivier Bachem

Abstract Kernel methods on discrete domains have shown great promise for many challenging tasks, e.g., on biological sequence data as well as on molecular structures. Scalable kernel methods like support vector machines offer good predictive performances but they often do not provide uncertainty estimates. In contrast, probabilistic kernel methods like Gaussian Processes offer uncertainty estimates in addition to good predictive performance but fall short in terms of scalability. We present the first sparse Gaussian Process approximation framework on discrete input domains. Our framework achieves good predictive performance as well as uncertainty estimates using different discrete optimization techniques. We present competitive results comparing our framework to support vector machine and full Gaussian Process baselines on synthetic data as well as on challenging real-world DNA sequence data.

Authors Vincent Fortuin, Gideon Dresdner, Heiko Strathmann, Gunnar Rätsch

Submitted arXiv Preprints

Abstract Translation initiation is orchestrated by the cap binding and 43S pre-initiation complexes (PIC). Eukaryotic initiation factor 1A (EIF1A) is essential for recruitment of the ternary complex and for assembling the 43S PIC. Recurrent EIF1AX mutations in papillary thyroid cancers are mutually exclusive with other drivers, including RAS. EIF1AX is enriched in advanced thyroid cancers, where it displays a striking co-occurrence with RAS, which cooperates to induce tumorigenesis in mice and isogenic cell lines. The C-terminal EIF1AX-A113splice mutation is the most prevalent in advanced thyroid cancer. EIF1AX-A113spl variants stabilize the PIC and induce ATF4, a sensor of cellular stress, which is co-opted to suppress EIF2α phosphorylation, enabling a general increase in protein synthesis. RAS stabilizes c-MYC, an effect augmented by EIF1AX-A113spl. ATF4 and c-MYC induce expression of amino acid transporters and enhance sensitivity of mTOR to amino acid supply. These mutually reinforcing events generate therapeutic vulnerabilities to MEK, BRD4 and mTOR kinase inhibitors.

Authors Gnana P. Krishnamoorthy, Natalie R Davidson, Steven D Leach, Zhen Zhao, Scott W. Lowe, Gina Lee, Iñigo Landa, James Nagarajah, Mahesh Saqcena, Kamini Singh, Hans-Guido Wendel, Snjezana Dogan, Prasanna P. Tamarapu, John Blenis, Ronald Ghossein, Jeffrey A. Knauf, Gunnar Rätsch and James A. Fagin

Submitted Cancer Discovery

Abstract Our comprehensive analysis of alternative splicing across 32 The Cancer Genome Atlas cancer types from 8,705 patients detects alternative splicing events and tumor variants by reanalyzing RNA and whole-exome sequencing data. Tumors have up to 30% more alternative splicing events than normal samples. Association analysis of somatic variants with alternative splicing events confirmed known trans associations with variants in SF3B1 and U2AF1 and identified additional trans-acting variants (e.g., TADA1, PPP2R1A). Many tumors have thousands of alternative splicing events not detectable in normal samples; on average, we identified ≈930 exon-exon junctions (“neojunctions”) in tumors not typically found in GTEx normals. From Clinical Proteomic Tumor Analysis Consortium data available for breast and ovarian tumor samples, we confirmed ≈1.7 neojunction- and ≈0.6 single nucleotide variant-derived peptides per tumor sample that are also predicted major histocompatibility complex-I binders (“putative neoantigens”).

Authors Andre Kahles, Kjong-Van Lehmann, Nora C. Toussaint, Matthias Hüser, Stefan Stark, Timo Sachsenberg, Oliver Stegle, Oliver Kohlbacher, Chris Sander, Gunnar Rätsch, The Cancer Genome Atlas Research Network

Submitted Cancer Cell
