Andre Kahles, Dr. rer. nat.

Even in science there is some room for optimism.

Senior Scientist

E-Mail
andre.kahles@get-your-addresses-elsewhere.inf.ethz.ch
Phone
+41 44 632 9067
Address
ETH Zürich
Department of Computer Science
Biomedical Informatics Group
Universitätsstrasse 6
8092 Zürich
Room
CAB F 52.2

My scientific background is in computer science, specifically bioinformatics, where I am most interested in algorithms and data structures that make efficient computation on large, population scale sequencing data sets possible.

I completed my undergraduate training at Friedrich Schiller University in Jena, Germany, and finished my Diplom thesis as a joint work with the Stockholm Bioinformatics Centre in Sweden. In 2009 I joined the Friedrich Miescher Laboratory of the Max Planck Society in Tübingen, Germany, to take up me graduate training. Mostly working on algorithms for genome and transcriptome analysis in model organisms during my time in Tübingen, I moved to the analysis of human transcriptomes when looking into large scale cancer sequencing projects in my second part of the PhD at the Memorial Sloan Kettering Cancer Center in New York City, USA. After graduating in 2014, I stayed two more years in New York, working under a fellowship of the Lucille Castori Center for Microbes, Inflammation and Cancer on efficient data structures for the representation of large collections of mixed sequences, such as whole metagenome sequencing samples.

Current Research

Since 2016 I am a member of the Biomedical Informatics group at ETH, where my main focus is research on graph representations of large sequence sets and the analysis of complex sequencing data. This includes a wide range of applications, such as cancer genomics, transcriptomics and metagagenomics. A particular area of interest is thereby the connection of data science, bioinformatics and personalized health. I am an active member to the Swiss Personalized Health Network, contributing to two SPHN driver projects and advising on policies as a member of the Data Life Cycle Management working group. In a similar context stands my work as a Co-Lead of the ICGC Argo Consortium technical working group for RNA-Seq analysis. Lastly, represening my vivid interest in metagenomics, I am an active contributor to the MetaSUB consortium, organizing local sample collection in Zurich, including surface and air microbiomes. 

Teaching

Benefiting from the excellent conditions at ETH, I am very much enjoying teaching several courses on Bioinformatics and related topics. Over the past years I have offered / contributed to the following courses:

  • Algorithms and Data Structures for Population Scale Genomics (261-5112-00L) - offered yearly in the autumn semester
  • Introduction to Bioinformatics (551-1299-00L) - offered yearly in the autumn semester
  • Computational Biomedicine (261-5100-00L) - offered yearly in the autumn semester
  • Computational Challenges in Medical Genomics (261-5113-00L) - offered yearly in the spring semester
  • Digital Medicine II (252-0868-00L) - offered yearly in the spring semester

Please contact me directly, in case you have questions regarding any of the courses. 

Abstract Raw nanopore signal analysis is a common approach in genomics to provide fast and resource-efficient analysis without translating the signals to bases (i.e., without basecalling). However, existing solutions cannot interpret raw signals directly if a reference genome is unknown due to a lack of accurate mechanisms to handle increased noise in pairwise raw signal comparison. Our goal is to enable the direct analysis of raw signals without a reference genome. To this end, we propose Rawsamble, the first mechanism that can 1) identify regions of similarity between all raw signal pairs, known as all-vs-all overlapping, using a hash-based search mechanism and 2) use these to construct genomes from scratch, called de novo assembly. Our extensive evaluations across multiple genomes of varying sizes show that Rawsamble provides a significant speedup (on average by 16.36x and up to 41.59x) and reduces peak memory usage (on average by 11.73x and up to by 41.99x) compared to a conventional genome assembly pipeline using the state-of-the-art tools for basecalling (Dorado's fastest mode) and overlapping (minimap2) on a CPU. We find that 36.57% of overlapping pairs generated by Rawsamble are identical to those generated by minimap2. Using the overlaps from Rawsamble, we construct the first de novo assemblies directly from raw signals without basecalling. We show that we can construct contiguous assembly segments (unitigs) up to 2.7 million bases in length (half the genome length of E. coli). We identify previously unexplored directions that can be enabled by finding overlaps and constructing de novo assemblies. Rawsamble is available at this https URL. We also provide the scripts to fully reproduce our results on our GitHub page.

Authors Can Firtina, Maximilian Mordig, Harun Mustafa, Sayan Goswami, Nika Mansouri Ghiasi, Stefano Mercogliano, Furkan Eris, Joël Lindegger, Andre Kahles, Onur Mutlu

Submitted arXiv

Link DOI

Abstract With the constant increase of large-scale genomic data projects, automated and high-throughput quality assessment becomes a crucial component of any analysis. Whereas small projects often have a more homogeneous design and a manageable structure allowing for a manual per-sample analysis of quality, large-scale studies tend to be much more heterogeneous and complex. Many quality metrics have been developed to assess the quality of an individual sample on the raw read level. Degradation effects are typically assessed based on the RNA integrity (RIN) score, or on postalignment data. In this study, we show that single commonly used quality criteria such as the RIN score alone are not sufficient to ensure RNA sample quality. We developed a new approach and provide an efficient tool that estimates RNA sample degradation by computing the 5′/3′ bias based on all genes in an alignment-free manner. That enables degradation assessment right after data generation and not during the analysis procedure allowing for early intervention in the sample handling process. Our analysis shows that this strategy is fast, robust to annotation and differences in library size, and provides complementary quality information to RIN scores enabling the accurate identification of degraded samples.

Authors Kjong-van Lehmann, Andre Kahles, Magdalena Murr, Gunnar Raetsch

Submitted Journal of Computational Biology

Link

Abstract Natural microbial communities are phylogenetically and metabolically diverse. In addition to underexplored organismal groups1, this diversity encompasses a rich discovery potential for ecologically and biotechnologically relevant enzymes and biochemical compounds. However, studying this diversity to identify genomic pathways for the synthesis of such compounds4 and assigning them to their respective hosts remains challenging. The biosynthetic potential of microorganisms in the open ocean remains largely uncharted owing to limitations in the analysis of genome-resolved data at the global scale. Here we investigated the diversity and novelty of biosynthetic gene clusters in the ocean by integrating around 10,000 microbial genomes from cultivated and single cells with more than 25,000 newly reconstructed draft genomes from more than 1,000 seawater samples. These efforts revealed approximately 40,000 putative mostly new biosynthetic gene clusters, several of which were found in previously unsuspected phylogenetic groups. Among these groups, we identified a lineage rich in biosynthetic gene clusters (‘Candidatus Eudoremicrobiaceae’) that belongs to an uncultivated bacterial phylum and includes some of the most biosynthetically diverse microorganisms in this environment. From these, we characterized the phospeptin and pythonamide pathways, revealing cases of unusual bioactive compound structure and enzymology, respectively. Together, this research demonstrates how microbiomics-driven strategies can enable the investigation of previously undescribed enzymes and natural products in underexplored microbial groups and environments.

Authors Lucas Paoli, Hans-Joachim Ruscheweyh, Clarissa C. Forneris, Florian Hubrich, Satria Kautsar, Agneya Bhushan, Alessandro Lotti, Quentin Clayssen, Guillem Salazar, Alessio Milanese, Charlotte I. Carlström, Chrysa Papadopoulou, Daniel Gehrig, Mikhail Karasikov, Harun Mustafa, Martin Larralde, Laura M. Carroll, Pablo Sánchez, Ahmed A. Zayed, Dylan R. Cronin, Silvia G. Acinas, Peer Bork, Chris Bowler, Tom O. Delmont, Josep M. Gasol, Alvar D. Gossert, Andre Kahles, Matthew B. Sullivan, Patrick Wincker, Georg Zeller, Serina L. Robinson, Jörn Piel, and Shinichi Sunagawa

Submitted Nature

Link DOI

Abstract We present a global atlas of 4,728 metagenomic samples from mass-transit systems in 60 cities over 3 years, representing the first systematic, worldwide catalog of the urban microbial ecosystem. This atlas provides an annotated, geospatial profile of microbial strains, functional characteristics, antimicrobial resistance (AMR) markers, and genetic elements, including 10,928 viruses, 1,302 bacteria, 2 archaea, and 838,532 CRISPR arrays not found in reference databases. We identified 4,246 known species of urban microorganisms and a consistent set of 31 species found in 97% of samples that were distinct from human commensal organisms. Profiles of AMR genes varied widely in type and density across cities. Cities showed distinct microbial taxonomic signatures that were driven by climate and geographic differences. These results constitute a high-resolution global metagenomic atlas that enables discovery of organisms and genes, highlights potential public health and forensic applications, and provides a culture-independent view of AMR burden in cities.

Authors David Danko, Daniela Bezdan, Evan E. Afshin, Sofia Ahsanuddin, Chandrima Bhattacharya, Daniel J. Butler, Kern Rei Chng, Daisy Donnellan, Jochen Hecht, Katelyn Jackson, Katerina Kuchin, Mikhail Karasikov, Abigail Lyons, Lauren Mak, Dmitry Meleshko, Harun Mustafa, Beth Mutai, Russell Y. Neches, Amanda Ng, Olga Nikolayeva, Tatyana Nikolayeva, Eileen Png, Krista A. Ryon, Jorge L. Sanchez, Heba Shaaban, Maria A. Sierra, Dominique Thomas, Ben Young, Omar O. Abudayyeh, Josue Alicea, Malay Bhattacharyya, Ran Blekhman, Eduardo Castro-Nallar, Ana M. Cañas, Aspassia D. Chatziefthimiou, Robert W. Crawford, Francesca De Filippis, Youping Deng, Christelle Desnues, Emmanuel Dias-Neto, Marius Dybwad, Eran Elhaik, Danilo Ercolini, Alina Frolova, Dennis Gankin, Jonathan S. Gootenberg, Alexandra B. Graf, David C. Green, Iman Hajirasouliha, Jaden J.A. Hastings, Mark Hernandez, Gregorio Iraola, Soojin Jang, Andre Kahles, Frank J. Kelly, Kaymisha Knights, Nikos C. Kyrpides, Paweł P. Łabaj, Patrick K.H. Lee, Marcus H.Y. Leung, Per O. Ljungdahl, Gabriella Mason-Buck, Ken McGrath, Cem Meydan, Emmanuel F. Mongodin, Milton Ozorio Moraes, Niranjan Nagarajan, Marina Nieto-Caballero, Houtan Noushmehr, Manuela Oliveira, Stephan Ossowski, Olayinka O. Osuolale, Orhan Özcan, David Paez-Espino, Nicolás Rascovan, Hugues Richard, Gunnar Rätsch, Lynn M. Schriml, Torsten Semmler, Osman U. Sezerman, Leming Shi, Tieliu Shi, Rania Siam, Le Huu Song, Haruo Suzuki, Denise Syndercombe Court, Scott W. Tighe, Xinzhao Tong, Klas I. Udekwu, Juan A. Ugalde, Brandon Valentine, Dimitar I. Vassilev, Elena M. Vayndorf, Thirumalaisamy P. Velavan, Jun Wu, María M. Zambrano, Jifeng Zhu, Sibo Zhu, Christopher E. Mason, The International MetaSUB Consortium

Submitted Cell

Link DOI

Abstract We call upon the research community to standardize efforts to use daily self-reported data about COVID-19 symptoms in the response to the pandemic and to form a collaborative consortium to maximize global gain while protecting participant privacy.

Authors Eran Segal , Feng Zhang, Xihong Lin , Gary King , Ophir Shalem , Smadar Shilo, William E. Allen, Faisal Alquaddoomi, Han Altae-Tran, Simon Anders , Ran Balicer, Tal Bauman, Ximena Bonilla , Gisel Booman , Andrew T. Chan , Ori Cohen, Silvano Coletti, Natalie Davidson, Yuval Dor, David A. Drew , Olivier Elemento, Georgina Evans, Phil Ewels , Joshua Gale, Amir Gavrieli, Benjamin Geiger, Yonatan H. Grad , Casey S. Greene, Iman Hajirasouliha, Roman Jerala , Andre Kahles, Olli Kallioniemi, Ayya Keshet, Ljupco Kocarev, Gregory Landua, Tomer Meir, Aline Muller, Long H. Nguyen, Matej Oresic , Svetlana Ovchinnikova, Hedi Peterson , Jana Prodanova, Jay Rajagopal, Gunnar Rätsch, Hagai Rossman, Johan Rung , Andrea Sboner, Alexandros Sigaras , Tim Spector , Ron Steinherz, Irene Stevens, Jaak Vilo , Paul Wilmes

Submitted Nature Medicine

DOI

Abstract The discovery of drivers of cancer has traditionally focused on protein-coding genes1,2,3,4. Here we present analyses of driver point mutations and structural variants in non-coding regions across 2,658 genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium5 of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). For point mutations, we developed a statistically rigorous strategy for combining significance levels from multiple methods of driver discovery that overcomes the limitations of individual methods. For structural variants, we present two methods of driver discovery, and identify regions that are significantly affected by recurrent breakpoints and recurrent somatic juxtapositions. Our analyses confirm previously reported drivers6,7, raise doubts about others and identify novel candidates, including point mutations in the 5′ region of TP53, in the 3′ untranslated regions of NFKBIZ and TOB1, focal deletions in BRD4 and rearrangements in the loci of AKR1C genes. We show that although point mutations and structural variants that drive cancer are less frequent in non-coding genes and regulatory sequences than in protein-coding genes, additional examples of these drivers will be found as more cancer genomes become available.

Authors Esther Rheinbay, Morten Muhlig Nielsen, Federico Abascal, Jeremiah A. Wala, Ofer Shapira, Grace Tiao, Henrik Hornshøj, Julian M. Hess, Randi Istrup Juul, Ziao Lin, Lars Feuerbach, Radhakrishnan Sabarinathan, Tobias Madsen, Jaegil Kim, Loris Mularoni, Shimin Shuai, Andrés Lanzós, Carl Herrmann, Yosef E. Maruvka, Ciyue Shen, Samirkumar B. Amin, Pratiti Bandopadhayay, Johanna Bertl, Keith A. Boroevich, John Busanovich, Joana Carlevaro-Fita, Dimple Chakravarty, Calvin Wing Yiu Chan, David Craft, Priyanka Dhingra, Klev Diamanti, Nuno A. Fonseca, Abel Gonzalez-Perez, Qianyun Guo, Mark P. Hamilton, Nicholas J. Haradhvala, Chen Hong, Keren Isaev, Todd A. Johnson, Malene Juul, Andre Kahles, Abdullah Kahraman, Youngwook Kim, Jan Komorowski, Kiran Kumar, Sushant Kumar, Donghoon Lee, Kjong-Van Lehmann, Yilong Li, Eric Minwei Liu, Lucas Lochovsky, Keunchil Park, Oriol Pich, Nicola D. Roberts, Gordon Saksena, Steven E. Schumacher, Nikos Sidiropoulos, Lina Sieverling, Nasa Sinnott-Armstrong, Chip Stewart, David Tamborero, Jose M. C. Tubio, Husen M. Umer, Liis Uusküla-Reimand, Claes Wadelius, Lina Wadi, Xiaotong Yao, Cheng-Zhong Zhang, Jing Zhang, James E. Haber, Asger Hobolth, Marcin Imielinski, Manolis Kellis, Michael S. Lawrence, Christian von Mering, Hidewaki Nakagawa, Benjamin J. Raphael, Mark A. Rubin, Chris Sander, Lincoln D. Stein, Joshua M. Stuart, Tatsuhiko Tsunoda, David A. Wheeler, Rory Johnson, Jüri Reimand, Mark Gerstein, Ekta Khurana, Peter J. Campbell, Núria López-Bigas, PCAWG Drivers and Functional Interpretation Working Group, PCAWG Structural Variation Working Group, Joachim Weischenfeldt, Rameen Beroukhim, Iñigo Martincorena, Jakob Skou Pedersen, Gad Getz & PCAWG Consortium

Submitted Nature

Link DOI

Abstract High-throughput DNA sequencing data is accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. While there has been good progress towards representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this work, we present a new compression approach, Multi-BRWT, which is adaptive to different kinds of input data. We show an up to 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state-of-the-art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world datasets.

Authors Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh-No, Gunnar Rätsch, Andre Kahles

Submitted RECOMB 2019

Link DOI

Abstract Our comprehensive analysis of alternative splicing across 32 The Cancer Genome Atlas cancer types from 8,705 patients detects alternative splicing events and tumor variants by reanalyzing RNA and whole-exome sequencing data. Tumors have up to 30% more alternative splicing events than normal samples. Association analysis of somatic variants with alternative splicing events confirmed known trans associations with variants in SF3B1 and U2AF1 and identified additional trans-acting variants (e.g., TADA1, PPP2R1A). Many tumors have thousands of alternative splicing events not detectable in normal samples; on average, we identified ≈930 exon-exon junctions (“neojunctions”) in tumors not typically found in GTEx normals. From Clinical Proteomic Tumor Analysis Consortium data available for breast and ovarian tumor samples, we confirmed ≈1.7 neojunction- and ≈0.6 single nucleotide variant-derived peptides per tumor sample that are also predicted major histocompatibility complex-I binders (“putative neoantigens”).

Authors Andre Kahles, Kjong-Van Lehmann, Nora C. Toussaint, Matthias Hüser, Stefan Stark, Timo Sachsenberg, Oliver Stegle, Oliver Kohlbacher, Chris Sander, Gunnar Rätsch, The Cancer Genome Atlas Research Network

Submitted Cancer Cell

Link DOI

Abstract Motivation: Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data is a condensed form as an assembly graph. Such a representation contains all sequence information but does not store contextual information and metadata. Results: We present two new approaches for a compressed representation of a graph coloring: a lossless compression scheme based on a novel application of wavelet tries as well as a highly accurate lossy compression based on a set of Bloom filters. Both strategies retain a coloring even when adding to the underlying graph topology. We present construction and merge procedures for both methods and evaluate their performance on a wide range of different datasets. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph, we can reduce memory requirements by up to three orders of magnitude. Representing individual colors as independently stored modules, our approaches can be efficiently parallelized and provide strategies for dynamic use. These properties allow for an easy upscaling to the problem sizes common to the biomedical domain. Availability: We provide prototype implementations in C++, summaries of our experiments as well as links to all datasets publicly at https://github.com/ratschlab/graph_annotation.

Authors Harun Mustafa, Ingo Schilken, Mikhail Karasikov, Carsten Eickhoff, Gunnar Rätsch, Andre Kahles

Submitted Bioinformatics

Link DOI

Abstract Technological advancements in high throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains inaccessible to the research com- munity through a lack efficient data representation and indexing solutions. One of the available techniques to represent read data on a more abstract level is its transformation into an assem- bly graph. Although the sequence information is now accessible, any contextual annotation and metadata is lost. We present a new approach for a compressed representation of a graph coloring based on a set of Bloom filters. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph to decide on false positives, we can reduce the memory requirements for a given set of colors per edge by three orders of magnitude. As insertion and query on a Bloom filter are constant time operations, the complexity to compress and decompress an edge color is linear in the number of color bits. Representing individual colors as independent filters, our approach is fully dynamic and can be easily parallelized. These properties allow for an easy upscaling to the problem sizes common in the biomedical domain. A prototype implementation of our method is available in Java.

Authors Ingo Schilken, Harun Mustafa, Gunnar Rätsch, Carsten Eickhoff, Andre Kahles

Submitted bioRxiv

Link DOI

Abstract Much of the DNA and RNA sequencing data available is in the form of high-throughput sequencing (HTS) reads and is currently unindexed by established sequence search databases. Recent succinct data structures for indexing both reference sequences and HTS data, along with associated metadata, have been based on either hashing or graph models, but many of these structures are static in nature, and thus, not well-suited as backends for dynamic databases. We propose a parallel construction method for and novel application of the wavelet trie as a dynamic data structure for compressing and indexing graph metadata. By developing an algorithm for merging wavelet tries, we are able to construct large tries in parallel by merging smaller tries constructed concurrently from batches of data. When compared against general compression algorithms and those developed specifically for graph colors (VARI and Rainbowfish), our method achieves compression ratios superior to gzip and VARI, converging to compression ratios of 6.5% to 2% on data sets constructed from over 600 virus genomes. While marginally worse than compression by bzip2 or Rainbowfish, this structure allows for both fast extension and query. We also found that additionally encoding graph topology metadata improved compression ratios, particularly on data sets consisting of several mutually-exclusive reference genomes. It was also observed that the compression ratio of wavelet tries grew sublinearly with the density of the annotation matrices. This work is a significant step towards implementing a dynamic data structure for indexing large annotated sequence data sets that supports fast query and update operations. At the time of writing, no established standard tool has filled this niche.

Authors Harun Mustafa, Andre Kahles, Mikhail Karasikov, Gunnar Raetsch

Submitted bioRxiv

Link DOI

Abstract Cancer is characterised by somatic genetic variation, but the effect of the majority of non-coding somatic variants and the interface with the germline genome are still unknown. We analysed the whole genome and RNA-seq data from 1,188 human cancer patients as provided by the Pan-cancer Analysis of Whole Genomes (PCAWG) project to map cis expression quantitative trait loci of somatic and germline variation and to uncover the causes of allele-specific expression patterns in human cancers. The availability of the first large-scale dataset with both whole genome and gene expression data enabled us to uncover the effects of the non-coding variation on cancer. In addition to confirming known regulatory effects, we identified novel associations between somatic variation and expression dysregulation, in particular in distal regulatory elements. Finally, we uncovered links between somatic mutational signatures and gene expression changes, including TERT and LMO2, and we explained the inherited risk factors in APOBEC-related mutational processes. This work represents the first large-scale assessment of the effects of both germline and somatic genetic variation on gene expression in cancer and creates a valuable resource cataloguing these effects.

Authors Claudia Calabrese, Kjong-Van Lehmann, Lara Urban, Fenglin Liu, Serap Erkek, Nuno Fonseca, Andre Kahles, Leena Helena Kilpinen-Barrett, Julia Markowski, PCAWG-3, Sebastian Waszak, Jan Korbel, Zemin Zhang, Alvis Brazma, Gunnar Raetsch, Roland Schwarz, Oliver Stegle

Submitted bioRxiv

Link DOI

Abstract Most human protein-coding genes are regulated by multiple, distinct promoters, suggesting that the choice of promoter is as important as its level of transcriptional activity. While the role of promoters as driver elements in cancer has been recognized, the contribution of alternative promoters to regulation of the cancer transcriptome remains largely unexplored. Here we show that active promoters can be identified using RNA-Seq data, enabling the analysis of promoter activity in more than 1,000 cancer samples with matched whole genome sequencing data. We find that alternative promoters are a major contributor to tissue-specific regulation of isoform expression and that alternative promoters are frequently deregulated in cancer, affecting known cancer-genes and novel candidates. Noncoding passenger mutations are enriched at promoters of genes with lower regulatory complexity, whereas noncoding driver mutations occur at genes with multiple promoters, often affecting the promoter that shows the highest level of activity. Together our study demonstrates that the landscape of active promoters shapes the cancer transcriptome, opening many opportunities to further explore the interplay of regulatory mechanism and noncoding somatic mutations with transcriptional aberrations in cancer.

Authors Deniz Demircioğlu, Martin Kindermans, Tannistha Nandi, Engin Cukuroglu, Claudia Calabrese, Nuno A. Fonseca, Andre Kahles, Kjong Lehmann, Oliver Stegle, PCAWG-3, PCAWG-Network, Alvis Brazma, Angela Brooks, Gunnar Rätsch, Patrick Tan, Jonathan Göke

Submitted bioRxiv

Link DOI

Authors Natalie R. Davidson, ; PanCancer Analysis of Whole Genomes 3 (PCAWG-3) for ICGC, Alvis Brazma, Angela N. Brooks, Claudia Calabrese, Nuno A. Fonseca, Jonathan Goke, Yao He, Xueda Hu, Andre Kahles, Kjong-Van Lehmann, Fenglin Liu, Gunnar Rätsch, Siliang Li, Roland F. Schwarz, Mingyu Yang, Zemin Zhang, Fan Zhang and Liangtao Zheng

Submitted Proceedings of the American Association for Cancer Research Annual Meeting 2017

Link DOI

Abstract Understanding the occurrence and regulation of alternative splicing (AS) is a key task towards explaining the regulatory processes that shape the complex transcriptomes of higher eukaryotes. With the advent of high-throughput sequencing of RNA (RNA-Seq), the diversity of AS transcripts could be measured at an unprecedented depth. Although the catalog of known AS events has grown ever since, novel transcripts are commonly observed when working with less well annotated organisms, in the context of disease, or within large populations. Whereas an identification of complete transcripts is technically challenging and computationally expensive, focusing on single splicing events as a proxy for transcriptome characteristics is fruitful and sufficient for a wide range of analyses.

Authors Andre Kahles, Cheng Soon Ong, Yi Zhong, Gunnar Rätsch

Submitted Bioinformatics (Oxford, England)

Link Pubmed DOI

Abstract Mapping high-throughput sequencing data to a reference genome is an essential step for most analysis pipelines aiming at the computational analysis of genome and transcriptome sequencing data. Breaking ties between equally well mapping locations poses a severe problem not only during the alignment phase but also has significant impact on the results of downstream analyses. We present the multi-mapper resolution (MMR) tool that infers optimal mapping locations from the coverage density of other mapped reads.

Authors Andre Kahles, Jonas Behr, Gunnar Rätsch

Submitted Bioinformatics (Oxford, England)

Link Pubmed DOI

Abstract Epigenome modulation potentially provides a mechanism for organisms to adapt, within and between generations. However, neither the extent to which this occurs, nor the mechanisms involved are known. Here we investigate DNA methylation variation in Swedish Arabidopsis thaliana accessions grown at two different temperatures. Environmental effects were limited to transposons, where CHH methylation was found to increase with temperature. Genome-wide association studies (GWAS) revealed that the extensive CHH methylation variation was strongly associated with genetic variants in both cis and trans, including a major trans-association close to the DNA methyltransferase CMT2. Unlike CHH methylation, CpG gene body methylation (GBM) was not affected by growth temperature, but was instead correlated with the latitude of origin. Accessions from colder regions had higher levels of GBM for a significant fraction of the genome, and this was associated with increased transcription for the genes affected. GWAS revealed that this effect was largely due to trans-acting loci, many of which showed evidence of local adaptation.

Authors Manu J Dubin, Pei Zhang, Dazhe Meng, Marie Stanislas Remigereau, Edward J Osborne, Francesco Paolo Casale, Philipp Drewe, Andre Kahles, Geraldine Jean, Bjarni Vilhjalmsson, Joanna Jagoda, Selen Irez, Viktor Voronin, Qiang Song, Quan Long, Gunnar Rätsch, Oliver Stegle, Richard M Clark, Magnus Nordborg

Submitted eLife

Link Pubmed DOI

Abstract We present a genome-wide analysis of splicing patterns of 282 kidney renal clear cell carcinoma patients in which we integrate data from whole-exome sequencing of tumor and normal samples, RNA-seq and copy number variation. We proposed a scoring mechanism to compare splicing patterns in tumor samples to normal samples in order to rank and detect tumor-specific isoforms that have a potential for new biomarkers. We identified a subset of genes that show introns only observable in tumor but not in normal samples, ENCODE and GEUVADIS samples. In order to improve our understanding of the underlying genetic mechanisms of splicing variation we performed a large-scale association analysis to find links between somatic or germline variants with alternative splicing events. We identified 915 cis- and trans-splicing quantitative trait loci (sQTL) associated with changes in splicing patterns. Some of these sQTL have previously been associated with being susceptibility loci for cancer and other diseases. Our analysis also allowed us to identify the function of several COSMIC variants showing significant association with changes in alternative splicing. This demonstrates the potential significance of variants affecting alternative splicing events and yields insights into the mechanisms related to an array of disease phenotypes.

Authors Kjong-Van Lehmann, Andre Kahles, Cyriac Kandoth, William Lee, Nikolaus Schultz, Oliver Stegle, Gunnar Rätsch

Submitted Biocomputing

Link Pubmed

Abstract We present Oqtans, an open-source workbench for quantitative transcriptome analysis, that is integrated in Galaxy. Its distinguishing features include customizable computational workflows and a modular pipeline architecture that facilitates comparative assessment of tool and data quality. Oqtans integrates an assortment of machine learning-powered tools into Galaxy, which show superior or equal performance to state-of-the-art tools. Implemented tools comprise a complete transcriptome analysis workflow: short-read alignment, transcript identification/quantification and differential expression analysis. Oqtans and Galaxy facilitate persistent storage, data exchange and documentation of intermediate results and analysis workflows. We illustrate how Oqtans aids the interpretation of data from different experiments in easy to understand use cases. Users can easily create their own workflows and extend Oqtans by integrating specific tools. Oqtans is available as (i) a cloud machine image with a demo instance at cloud.oqtans.org, (ii) a public Galaxy instance at galaxy.cbio.mskcc.org, (iii) a git repository containing all installed software (oqtans.org/git); most of which is also available from (iv) the Galaxy Toolshed and (v) a share string to use along with Galaxy CloudMan.

Authors Vipin T Sreedharan, Sebastian J Schultheiss, Geraldine Jean, Andre Kahles, Regina Bohnert, Philipp Drewe, Pramod Mudrakarta, Nico Görnitz, Georg Zeller, Gunnar Rätsch

Submitted Bioinformatics (Oxford, England)

Link Pubmed DOI

Abstract High-throughput RNA sequencing is an increasingly accessible method for studying gene structure and activity on a genome-wide scale. A critical step in RNA-seq data analysis is the alignment of partial transcript reads to a reference genome sequence. To assess the performance of current mapping software, we invited developers of RNA-seq aligners to process four large human and mouse RNA-seq data sets. In total, we compared 26 mapping protocols based on 11 programs and pipelines and found major performance differences between methods on numerous benchmarks, including alignment yield, basewise accuracy, mismatch and gap placement, exon junction discovery and suitability of alignments for transcript reconstruction. We observed concordant results on real and simulated RNA-seq data, confirming the relevance of the metrics employed. Future developments in RNA-seq alignment methods would benefit from improved placement of multimapped reads, balanced utilization of existing gene annotation and a reduced false discovery rate for splice junctions.

Authors Par G Engstrom, Tamara Steijger, Botond Sipos, Gregory R Grant, Andre Kahles, Gunnar Rätsch, Nick Goldman, Tim J Hubbard, Jennifer Harrow, Roderic Guigo, Paul Bertone

Submitted Nature methods

Link Pubmed DOI

Abstract The nonsense-mediated decay (NMD) surveillance pathway can recognize erroneous transcripts and physiological mRNAs, such as precursor mRNA alternative splicing (AS) variants. Currently, information on the global extent of coupled AS and NMD remains scarce and even absent for any plant species. To address this, we conducted transcriptome-wide splicing studies using Arabidopsis thaliana mutants in the NMD factor homologs UP FRAMESHIFT1 (UPF1) and UPF3 as well as wild-type samples treated with the translation inhibitor cycloheximide. Our analyses revealed that at least 17.4% of all multi-exon, protein-coding genes produce splicing variants that are targeted by NMD. Moreover, we provide evidence that UPF1 and UPF3 act in a translation-independent mRNA decay pathway. Importantly, 92.3% of the NMD-responsive mRNAs exhibit classical NMD-eliciting features, supporting their authenticity as direct targets. Genes generating NMD-sensitive AS variants function in diverse biological processes, including signaling and protein modification, for which NaCl stress-modulated AS-NMD was found. Besides mRNAs, numerous noncoding RNAs and transcripts derived from intergenic regions were shown to be NMD responsive. In summary, we provide evidence for a major function of AS-coupled NMD in shaping the Arabidopsis transcriptome, having fundamental implications in gene regulation and quality control of transcript processing.

Authors Gabriele Drechsel, Andre Kahles, Anil K Kesarwani, Eva Stauffer, Jonas Behr, Philipp Drewe, Gunnar Rätsch, Andreas Wachter

Submitted The Plant cell

Link Pubmed DOI

Abstract High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and reconstruction of RNA transcripts. However, the extensive dynamic range of gene expression, technical limitations and biases, as well as the observed complexity of the transcriptional landscape, pose profound computational challenges for transcriptome reconstruction.

Authors Jonas Behr, Andre Kahles, Yi Zhong, Vipin T Sreedharan, Philipp Drewe, Gunnar Rätsch

Submitted Bioinformatics (Oxford, England)

Link Pubmed DOI

Abstract Deep transcriptome sequencing (RNA-Seq) has become a vital tool for studying the state of cells in the context of varying environments, genotypes and other factors. RNA-Seq profiling data enable identification of novel isoforms, quantification of known isoforms and detection of changes in transcriptional or RNA-processing activity. Existing approaches to detect differential isoform abundance between samples either require a complete isoform annotation or fall short in providing statistically robust and calibrated significance estimates. Here, we propose a suite of statistical tests to address these open needs: a parametric test that uses known isoform annotations to detect changes in relative isoform abundance and a non-parametric test that detects differential read coverages and can be applied when isoform annotations are not available. Both methods account for the discrete nature of read counts and the inherent biological variability. We demonstrate that these tests compare favorably to previous methods, both in terms of accuracy and statistical calibrations. We use these techniques to analyze RNA-Seq libraries from Arabidopsis thaliana and Drosophila melanogaster. The identified differential RNA processing events were consistent with RT-qPCR measurements and previous studies. The proposed toolkit is available from http://bioweb.me/rdiff and enables in-depth analyses of transcriptomes, with or without available isoform annotation.

Authors Philipp Drewe, Oliver Stegle, Lisa Hartmann, Andre Kahles, Regina Bohnert, Andreas Wachter, Karsten Borgwardt, Gunnar Rätsch

Submitted Nucleic acids research

Link Pubmed DOI

Abstract Deep sequencing of transcriptomes allows quantitative and qualitative analysis of many RNA species in a sample, with parallel comparison of expression levels, splicing variants, natural antisense transcripts, RNA editing and transcriptional start and stop sites the ideal goal. By computational modeling, we show how libraries of multiple insert sizes combined with strand-specific, paired-end (SS-PE) sequencing can increase the information gained on alternative splicing, especially in higher eukaryotes. Despite the benefits of gaining SS-PE data with paired ends of varying distance, the standard Illumina protocol allows only non-strand-specific, paired-end sequencing with a single insert size. Here, we modify the Illumina RNA ligation protocol to allow SS-PE sequencing by using a custom pre-adenylated 3' adaptor. We generate parallel libraries with differing insert sizes to aid deconvolution of alternative splicing events and to characterize the extent and distribution of natural antisense transcription in C. elegans. Despite stringent requirements for detection of alternative splicing, our data increases the number of intron retention and exon skipping events annotated in the Wormbase genome annotations by 127% and 121%, respectively. We show that parallel libraries with a range of insert sizes increase transcriptomic information gained by sequencing and that by current established benchmarks our protocol gives competitive results with respect to library quality.

Authors Lisa M Smith, Lisa Hartmann, Philipp Drewe, Regina Bohnert, Andre Kahles, Christa Lanz, Gunnar Rätsch

Submitted RNA biology

Link Pubmed DOI