- starks@inf.ethz.ch
- +41 44 632 65 24
Department of Computer Science
Biomedical Informatics Group
- CAB F52.1
I am generally interested in developing and applying machine learning methods to single-cell data in order to better understand the behavior of cellular populations.
I am particularly interested in learning to align cellular populations across both perturbational effects and multimodal profiles. I have applied these efforts towards understanding and optimizing cancer treatments, particularly within the context of the Tumor Profiler consortium. Prior to starting my PhD, I finished a Master's degree in Computer Science, also at ETH Zürich, where I worked on matching cancerous somatic mutations to information extracted from clinical text notes, and contributed analysis to the cancer consortia ICGC and TCGA.
Abstract Several recently developed single-cell DNA sequencing technologies enable whole-genome sequencing of thousands of cells. However, the ultra-low coverage of the sequenced data (<0.05× per cell) mostly limits their usage to the identification of copy number alterations in multi-megabase segments. Many tumors are not copy number-driven, and thus single-nucleotide variant (SNV)-based subclone detection may contribute to a more comprehensive view on intra-tumor heterogeneity. Due to the low coverage of the data, the identification of SNVs is only possible when superimposing the sequenced genomes of hundreds of genetically similar cells. Thus, we have developed a new approach to efficiently cluster tumor cells based on a Bayesian filtering approach of relevant loci and exploiting read overlap and phasing.
Authors Hana Rozhoňová, Daniel Danciu, Stefan Stark, Gunnar Rätsch, André Kahles, Kjong-Van Lehmann
Abstract Understanding and predicting molecular responses towards external perturbations is a core question in molecular biology. Technological advancements in the recent past have enabled the generation of high-resolution single-cell data, making it possible to profile individual cells under different experimentally controlled perturbations. However, cells are typically destroyed during measurement, resulting in unpaired distributions over either perturbed or non-perturbed cells. Leveraging the theory of optimal transport and the recent advents of convex neural architectures, we learn a coupling describing the response of cell populations upon perturbation, enabling us to predict state trajectories on a single-cell level. We apply our approach, CellOT, to predict treatment responses of 21,650 cells subject to four different drug perturbations. CellOT outperforms current state-of-the-art methods both qualitatively and quantitatively, accurately capturing cellular behavior shifts across all different drugs.
Authors Charlotte Bunne, Stefan Stark, Gabriele Gut, Jacobo Sarabia del Castillo, Mitchell Levesque, Kjong Van Lehmann, Lucas Pelkmans, Andreas Krause, Gunnar Rätsch
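As a rough illustration of what a coupling between unpaired control and perturbed populations means, the sketch below uses the simplest possible case rather than the neural approach the abstract describes. It assumes a single measured feature per cell and equally sized populations, where the optimal transport coupling for a convex cost is obtained by matching ranks. The function name `ot_coupling_1d` and the toy values are hypothetical.

```python
def ot_coupling_1d(control, treated):
    """Pair each control cell with a treated cell by matching ranks.

    Assumption: one feature per cell and equal population sizes, in which
    case rank matching is the optimal transport coupling for convex costs.
    This is NOT the convex neural network approach used by CellOT.
    """
    if len(control) != len(treated):
        raise ValueError("this sketch assumes equally sized populations")
    # Indices of each population ordered by feature value.
    src = sorted(range(len(control)), key=lambda i: control[i])
    dst = sorted(range(len(treated)), key=lambda j: treated[j])
    # The k-th smallest control cell is coupled to the k-th smallest treated cell.
    return {i: j for i, j in zip(src, dst)}


# Hypothetical toy measurements, not data from the paper.
control = [0.2, 1.5, 0.9]
treated = [2.1, 1.0, 3.0]

coupling = ot_coupling_1d(control, treated)
# Predicted perturbed state for each control cell under this coupling.
predicted = [treated[coupling[i]] for i in range(len(control))]
```

In higher dimensions no such sorting shortcut exists, which is one reason learned transport maps are used instead.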
Abstract Pancreatic adenocarcinoma (PDAC) epitomizes a deadly cancer driven by abnormal KRAS signaling. Here, we show that the eIF4A RNA helicase is required for translation of key KRAS signaling molecules and that pharmacological inhibition of eIF4A has single-agent activity against murine and human PDAC models at safe dose levels. EIF4A was uniquely required for the translation of mRNAs with long and highly structured 5′ untranslated regions, including those with multiple G-quadruplex elements. Computational analyses identified these features in mRNAs encoding KRAS and key downstream molecules. Transcriptome-scale ribosome footprinting accurately identified eIF4A-dependent mRNAs in PDAC, including critical KRAS signaling molecules such as PI3K, RALA, RAC2, MET, MYC, and YAP1. These findings contrast with a recent study that relied on an older method, polysome fractionation, and implicated redox-related genes as eIF4A clients. Together, our findings highlight the power of ribosome footprinting in conjunction with deep RNA sequencing in accurately decoding translational control mechanisms and define the therapeutic mechanism of eIF4A inhibitors in PDAC.
Authors Kamini Singh, Jianan Lin, Nicolas Lecomte, Prathibha Mohan, Askan Gokce, Viraj R Sanghvi, Man Jiang, Olivera Grbovic-Huezo, Antonija Burčul, Stefan G Stark, Paul B Romesser, Qing Chang, Jerry P Melchor, Rachel K Beyer, Mark Duggan, Yoshiyuki Fukase, Guangli Yang, Ouathek Ouerfelli, Agnes Viale, Elisa De Stanchina, Andrew W Stamford, Peter T Meinke, Gunnar Rätsch, Steven D Leach, Zhengqing Ouyang, Hans-Guido Wendel
Submitted Journal Cancer Research
Abstract Motivation Deep learning techniques have yielded tremendous progress in the field of computational biology over the last decade, however many of these techniques are opaque to the user. To provide interpretable results, methods have incorporated biological priors directly into the learning task; one such biological prior is pathway structure. While pathways represent most biological processes in the cell, the high level of correlation and hierarchical structure make it complicated to determine an appropriate computational representation. Results Here, we present pathway module Variational Autoencoder (pmVAE). Our method encodes pathway information by restricting the structure of our VAE to mirror gene-pathway memberships. Its architecture is composed of a set of subnetworks, which we refer to as pathway modules. The subnetworks learn interpretable latent representations by factorizing the latent space according to pathway gene sets. We directly address correlation between pathways by balancing a module-specific local loss and a global reconstruction loss. Furthermore, since many pathways are by nature hierarchical and therefore the product of multiple downstream signals, we model each pathway as a multidimensional vector. Due to their factorization over pathways, the representations allow for easy and interpretable analysis of multiple downstream effects, such as cell type and biological stimulus, within the contexts of each pathway. We compare pmVAE against two other state-of-the-art methods on two single-cell RNA-seq case-control data sets, demonstrating that our pathway representations are both more discriminative and consistent in detecting pathways targeted by a perturbation. Availability and implementation https://github.com/ratschlab/pmvae
Authors Gilles Gut, Stefan G Stark, Gunnar Rätsch, Natalie R Davidson
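The idea of restricting a network so that it mirrors gene-pathway memberships can be sketched with a binary mask. The example below is a minimal illustration under assumed names (`build_pathway_mask`, `masked_module_inputs`, and the placeholder genes and pathways are all hypothetical) and is not taken from the pmVAE code base: each pathway module only receives the expression values of its member genes.

```python
def build_pathway_mask(genes, pathways):
    """Return mask[m][g] = 1 if gene g belongs to pathway module m, else 0."""
    return [
        [1 if gene in members else 0 for gene in genes]
        for members in pathways.values()
    ]


def masked_module_inputs(expression, mask):
    """Zero out, for every module, the genes outside its pathway."""
    return [[x * keep for x, keep in zip(expression, row)] for row in mask]


# Placeholder gene and pathway names, for illustration only.
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
pathways = {
    "pathway_1": {"gene_a", "gene_b"},
    "pathway_2": {"gene_b", "gene_d"},
}

mask = build_pathway_mask(genes, pathways)
inputs = masked_module_inputs([3.0, 1.0, 4.0, 2.0], mask)
```

A gene shared by several pathways, such as `gene_b` here, feeds more than one module, which is one source of the correlation between pathways that the abstract mentions.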
Abstract The application and integration of molecular profiling technologies create novel opportunities for personalized medicine. Here, we introduce the Tumor Profiler Study, an observational trial combining a prospective diagnostic approach to assess the relevance of in-depth tumor profiling to support clinical decision-making with an exploratory approach to improve the biological understanding of the disease.
Authors Anja Irmisch, Ximena Bonilla, Stéphane Chevrier, Kjong-Van Lehmann, Franziska Singer, Nora C. Toussaint, Cinzia Esposito, Julien Mena, Emanuela S. Milani, Ruben Casanova, Daniel J. Stekhoven, Rebekka Wegmann, Francis Jacob, Bettina Sobottka, Sandra Goetze, Jack Kuipers, Jacobo Sarabia del Castillo, Michael Prummer, Mustafa A. Tuncel, Ulrike Menzel, Andrea Jacobs, Stefanie Engler, Sujana Sivapatham, Anja L. Frei, Gabriele Gut, Joanna Ficek-Pascual, Nicola Miglino, Melike Ak, Faisal S. Al-Quaddoomi, Jonas Albinus, Ilaria Alborelli, Sonali Andani, Per-Olof Attinger, Daniel Baumhoer, Beatrice Beck-Schimmer, Lara Bernasconi, Anne Bertolini, Natalia Chicherova, Maya D'Costa, Esther Danenberg, Natalie Davidson, Monica-Andreea Drăgan, Martin Erkens, Katja Eschbach, André Fedier, Pedro Ferreira, Bruno Frey, Linda Grob, Detlef Günther, Martina Haberecker, Pirmin Haeuptle, Sylvia Herter, Rene Holtackers, Tamara Huesser, Tim M. Jaeger, Katharina Jahn, Alva R. James, Philip M. Jermann, André Kahles, Abdullah Kahraman, Werner Kuebler, Christian P. Kunze, Christian Kurzeder, Sebastian Lugert, Gerd Maass, Philipp Markolin, Julian M. Metzler, Simone Muenst, Riccardo Murri, Charlotte K.Y. Ng, Stefan Nicolet, Marta Nowak, Patrick G.A. Pedrioli, Salvatore Piscuoglio, Mathilde Ritter, Christian Rommel, María L. Rosano-González, Natascha Santacroce, Ramona Schlenker, Petra C. Schwalie, Severin Schwan, Tobias Schär, Gabriela Senti, Vipin T. Sreedharan, Stefan Stark, Tinu M. Thomas, Vinko Tosevski, Marina Tusup, Audrey Van Drogen, Marcus Vetter, Tatjana Vlajnic, Sandra Weber, Walter P. Weber, Michael Weller, Fabian Wendt, Norbert Wey, Mattheus H.E. Wildschut, Shuqing Yu, Johanna Ziegler, Marc Zimmermann, Martin Zoche, Gregor Zuend, Rudolf Aebersold, Marina Bacac, Niko Beerenwinkel, Christian Beisel, Bernd Bodenmiller, Reinhard Dummer, Viola Heinzelmann-Schwarz, Viktor H. Koelzer, Markus G. Manz, Holger Moch, Lucas Pelkmans, Berend Snijder, Alexandre P.A. Theocharides, Markus Tolnay, Andreas Wicki, Bernd Wollscheid, Gunnar Rätsch, Mitchell P. Levesque
Submitted Cancer Cell (Commentary)
Abstract Motivation Recent technological advances have led to an increase in the production and availability of single-cell data. The ability to integrate a set of multi-technology measurements would allow the identification of biologically or clinically meaningful observations through the unification of the perspectives afforded by each technology. In most cases, however, profiling technologies consume the used cells and thus pairwise correspondences between datasets are lost. Due to the sheer size single-cell datasets can acquire, scalable algorithms that are able to universally match single-cell measurements carried out in one cell to its corresponding sibling in another technology are needed. Results We propose Single-Cell data Integration via Matching (SCIM), a scalable approach to recover such correspondences in two or more technologies. SCIM assumes that cells share a common (low-dimensional) underlying structure and that the underlying cell distribution is approximately constant across technologies. It constructs a technology-invariant latent space using an autoencoder framework with an adversarial objective. Multi-modal datasets are integrated by pairing cells across technologies using a bipartite matching scheme that operates on the low-dimensional latent representations. We evaluate SCIM on a simulated cellular branching process and show that the cell-to-cell matches derived by SCIM reflect the same pseudotime on the simulated dataset. Moreover, we apply our method to two real-world scenarios, a melanoma tumor sample and a human bone marrow sample, where we pair cells from a scRNA dataset to their sibling cells in a CyTOF dataset achieving 90% and 78% cell-matching accuracy for each one of the samples, respectively.
Authors Stefan G Stark, Joanna Ficek-Pascual, Francesco Locatello, Ximena Bonilla, Stéphane Chevrier, Franziska Singer, Tumor Profiler Consortium, Gunnar Rätsch, Kjong-Van Lehmann
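The pairing step described in the abstract, matching cells across technologies in a shared latent space, can be sketched as a minimum-cost bipartite assignment. The example below is a toy illustration, not the SCIM implementation: it assumes equally sized cell sets, uses Euclidean distance as the cost, and finds the exact optimum by brute force over permutations, which is only feasible for a handful of cells. The function name `match_cells` and the latent coordinates are hypothetical.

```python
from itertools import permutations
from math import dist


def match_cells(latent_a, latent_b):
    """Return the one-to-one pairing with minimal total Euclidean distance.

    Brute force over all permutations: exact, but only usable on toy inputs.
    A scalable matching scheme would be needed for real datasets.
    """
    if len(latent_a) != len(latent_b):
        raise ValueError("this sketch assumes equally sized cell sets")
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(latent_b))):
        # perm[i] is the cell in b assigned to cell i in a.
        cost = sum(dist(latent_a[i], latent_b[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(enumerate(best_perm)), best_cost


# Hypothetical 2-D latent codes for three cells per technology.
rna_latent = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
cytof_latent = [(5.1, 4.9), (0.1, -0.1), (0.9, 1.2)]

pairs, total_cost = match_cells(rna_latent, cytof_latent)
```

Each entry of `pairs` is `(index in rna_latent, index in cytof_latent)`. The matching is only meaningful if the latent space is technology-invariant, which is what the adversarial objective in the abstract is meant to encourage.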
Abstract The recent adoption of Electronic Health Records (EHRs) by health care providers has introduced an important source of data that provides detailed and highly specific insights into patient phenotypes over large cohorts. These datasets, in combination with machine learning and statistical approaches, generate new opportunities for research and clinical care. However, many methods require the patient representations to be in structured formats, while the information in the EHR is often locked in unstructured texts designed for human readability. In this work, we develop the methodology to automatically extract clinical features from clinical narratives from large EHR corpora without the need for prior knowledge. We consider medical terms and sentences appearing in clinical narratives as atomic information units. We propose an efficient clustering strategy suitable for the analysis of large text corpora and to utilize the clusters to represent information about the patient compactly. To demonstrate the utility of our approach, we perform an association study of clinical features with somatic mutation profiles from 4,007 cancer patients and their tumors. We apply the proposed algorithm to a dataset consisting of about 65 thousand documents with a total of about 3.2 million sentences. We identify 341 significant statistical associations between the presence of somatic mutations and clinical features. We annotated these associations according to their novelty, and report several known associations. We also propose 32 testable hypotheses where the underlying biological mechanism does not appear to be known but plausible. These results illustrate that the …
Authors Stefan G Stark, Stephanie L Hyland, Melanie F Pradier, Kjong-Van Lehmann, Andreas Wicki, Fernando Perez Cruz, Julia E Vogt, Gunnar Rätsch
Abstract Our comprehensive analysis of alternative splicing across 32 The Cancer Genome Atlas cancer types from 8,705 patients detects alternative splicing events and tumor variants by reanalyzing RNA and whole-exome sequencing data. Tumors have up to 30% more alternative splicing events than normal samples. Association analysis of somatic variants with alternative splicing events confirmed known trans associations with variants in SF3B1 and U2AF1 and identified additional trans-acting variants (e.g., TADA1, PPP2R1A). Many tumors have thousands of alternative splicing events not detectable in normal samples; on average, we identified ≈930 exon-exon junctions (“neojunctions”) in tumors not typically found in GTEx normals. From Clinical Proteomic Tumor Analysis Consortium data available for breast and ovarian tumor samples, we confirmed ≈1.7 neojunction- and ≈0.6 single nucleotide variant-derived peptides per tumor sample that are also predicted major histocompatibility complex-I binders (“putative neoantigens”).
Authors Andre Kahles, Kjong-Van Lehmann, Nora C. Toussaint, Matthias Hüser, Stefan Stark, Timo Sachsenberg, Oliver Stegle, Oliver Kohlbacher, Chris Sander, Gunnar Rätsch, The Cancer Genome Atlas Research Network
Submitted Cancer Cell
Abstract We present the most comprehensive catalogue of cancer-associated gene alterations through characterization of tumor transcriptomes from 1,188 donors of the Pan-Cancer Analysis of Whole Genomes project. Using matched whole-genome sequencing data, we attributed RNA alterations to germline and somatic DNA alterations, revealing likely genetic mechanisms. We identified 444 associations of gene expression with somatic non-coding single-nucleotide variants. We found 1,872 splicing alterations associated with somatic mutation in intronic regions, including novel exonization events associated with Alu elements. Somatic copy number alterations were the major driver of total gene and allele-specific expression (ASE) variation. Additionally, 82% of gene fusions had structural variant support, including 75 of a novel class called "bridged" fusions, in which a third genomic location bridged two different genes. Globally, we observe transcriptomic alteration signatures that differ between cancer types and have associations with DNA mutational signatures. Given this unique dataset of RNA alterations, we also identified 1,012 genes significantly altered through both DNA and RNA mechanisms. Our study represents an extensive catalog of RNA alterations and reveals new insights into the heterogeneous molecular mechanisms of cancer gene alterations.
Authors Claudia Calabrese, Natalie R Davidson, Nuno A Fonseca, Yao He, André Kahles, Kjong-Van Lehmann, Fenglin Liu, Yuichi Shiraishi, Cameron M Soulette, Lara Urban, Deniz Demircioğlu, Liliana Greger, Siliang Li, Dongbing Liu, Marc D Perry, Linda Xiang, Fan Zhang, Junjun Zhang, Peter Bailey, Serap Erkek, Katherine A Hoadley, Yong Hou, Helena Kilpinen, Jan O Korbel, Maximillian G Marin, Julia Markowski, Tannistha Nandi, Qiang Pan-Hammarström, Chandra S Pedamallu, Reiner Siebert, Stefan G Stark, Hong Su, Patrick Tan, Sebastian M Waszak, Christina Yung, Shida Zhu, Philip Awadalla, Chad J Creighton, Matthew Meyerson, B Francis F Ouellette, Kui Wu, Huanming Yang, Alvis Brazma, Angela N Brooks, Jonathan Göke, Gunnar Rätsch, Roland F Schwarz, Oliver Stegle, Zemin Zhang