Publications

Abstract Cancer is characterised by somatic genetic variation, but the effect of the majority of non-coding somatic variants and the interface with the germline genome are still unknown. We analysed the whole genome and RNA-seq data from 1,188 human cancer patients as provided by the Pan-cancer Analysis of Whole Genomes (PCAWG) project to map cis expression quantitative trait loci of somatic and germline variation and to uncover the causes of allele-specific expression patterns in human cancers. The availability of the first large-scale dataset with both whole genome and gene expression data enabled us to uncover the effects of the non-coding variation on cancer. In addition to confirming known regulatory effects, we identified novel associations between somatic variation and expression dysregulation, in particular in distal regulatory elements. Finally, we uncovered links between somatic mutational signatures and gene expression changes, including TERT and LMO2, and we explained the inherited risk factors in APOBEC-related mutational processes. This work represents the first large-scale assessment of the effects of both germline and somatic genetic variation on gene expression in cancer and creates a valuable resource cataloguing these effects.

Authors Claudia Calabrese, Kjong-Van Lehmann, Lara Urban, Fenglin Liu, Serap Erkek, Nuno Fonseca, Andre Kahles, Leena Helena Kilpinen-Barrett, Julia Markowski, PCAWG-3, Sebastian Waszak, Jan Korbel, Zemin Zhang, Alvis Brazma, Gunnar Raetsch, Roland Schwarz, Oliver Stegle

Submitted bioRxiv

Link DOI

Abstract Most human protein-coding genes are regulated by multiple, distinct promoters, suggesting that the choice of promoter is as important as its level of transcriptional activity. While the role of promoters as driver elements in cancer has been recognized, the contribution of alternative promoters to regulation of the cancer transcriptome remains largely unexplored. Here we show that active promoters can be identified using RNA-Seq data, enabling the analysis of promoter activity in more than 1,000 cancer samples with matched whole genome sequencing data. We find that alternative promoters are a major contributor to tissue-specific regulation of isoform expression and that alternative promoters are frequently deregulated in cancer, affecting known cancer-genes and novel candidates. Noncoding passenger mutations are enriched at promoters of genes with lower regulatory complexity, whereas noncoding driver mutations occur at genes with multiple promoters, often affecting the promoter that shows the highest level of activity. Together our study demonstrates that the landscape of active promoters shapes the cancer transcriptome, opening many opportunities to further explore the interplay of regulatory mechanism and noncoding somatic mutations with transcriptional aberrations in cancer.

Authors Deniz Demircioğlu, Martin Kindermans, Tannistha Nandi, Engin Cukuroglu, Claudia Calabrese, Nuno A. Fonseca, Andre Kahles, Kjong Lehmann, Oliver Stegle, PCAWG-3, PCAWG-Network, Alvis Brazma, Angela Brooks, Gunnar Rätsch, Patrick Tan, Jonathan Göke

Submitted bioRxiv

Link DOI

Abstract Variational Inference is a popular technique to approximate a possibly intractable Bayesian posterior with a more tractable one. Recently, Boosting Variational Inference has been proposed as a new paradigm to approximate the posterior by a mixture of densities by greedily adding components to the mixture. In the present work, we study the convergence properties of this approach from a modern optimization viewpoint by establishing connections to the classic Frank-Wolfe algorithm. Our analyses yield novel theoretical insights on the Boosting of Variational Inference regarding the sufficient conditions for convergence, explicit sublinear/linear rates, and algorithmic simplifications.

Authors Francesco Locatello, Rajiv Khanna, Joydeep Ghosh, Gunnar Rätsch

Submitted submitted

Link DOI

Abstract During rheumatoid arthritis (RA), Tumor Necrosis Factor (TNF) activates fibroblast-like synoviocytes (FLS) inducing in a temporal order a constellation of genes, which perpetuate synovial inflammation. Although the molecular mechanisms regulating TNF-induced transcription are well characterized, little is known about the impact of mRNA stability on gene expression and the impact of TNF on decay rates of mRNA transcripts in FLS. To address these issues we performed RNA sequencing and genome-wide analysis of the mRNA stabilome in RA FLS. We found that TNF induces a biphasic gene expression program: initially, the inducible transcriptome consists primarily of unstable transcripts but progressively switches and becomes dominated by very stable transcripts. This temporal switch is due to: a) TNF-induced prolonged stabilization of previously unstable transcripts that enables progressive transcript accumulation over days and b) sustained expression and late induction of very stable transcripts. TNF-induced mRNA stabilization in RA FLS occurs during the late phase of TNF response, is MAPK-dependent, and involves several genes with pathogenic potential such as IL6, CXCL1, CXCL3, CXCL8/IL8, CCL2, and PTGS2. These results provide the first insights into genome-wide regulation of mRNA stability in RA FLS and highlight the potential contribution of dynamic regulation of the mRNA stabilome by TNF to chronic synovitis.

Authors Loupasakis K, Kuo D, Sokhi UK, Sohn C, Syracuse B, Giannopoulou EG, Park SH, Kang H, Rätsch G, Ivashkiv LB, Kalliolias GD

Submitted PLoS One

Link DOI

Abstract Generative Adversarial Networks (GANs) have shown remarkable success as a framework for training models to produce realistic-looking data. In this work, we propose a Recurrent GAN (RGAN) and Recurrent Conditional GAN (RCGAN) to produce realistic real-valued multi-dimensional time series, with an emphasis on their application to medical data. RGANs make use of recurrent neural networks in the generator and the discriminator. In the case of RCGANs, both of these RNNs are conditioned on auxiliary information. We demonstrate our models in a set of toy datasets, where we show visually and quantitatively (using sample likelihood and maximum mean discrepancy) that they can successfully generate realistic time-series. We also describe novel evaluation methods for GANs, where we generate a synthetic labelled training dataset, and evaluate on a real test set the performance of a model trained on the synthetic data, and vice-versa. We illustrate with these metrics that RCGANs can generate time-series data useful for supervised training, with only minor degradation in performance on real test data. This is demonstrated on digit classification from 'serialised' MNIST and by training an early warning system on a medical dataset of 17,000 patients from an intensive care unit. We further discuss and analyse the privacy concerns that may arise when using RCGANs to generate realistic synthetic medical time series data.

Authors Stephanie L Hyland, Cristobal Esteban, Gunnar Rätsch

Submitted arXiv

Link
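The abstract above names maximum mean discrepancy (MMD) as one of its quantitative measures of how closely generated samples match real ones. As a rough illustration only, the sketch below computes a biased estimate of squared MMD with a Gaussian kernel. The kernel choice, the `bandwidth` value, the function names, and the toy data are all assumptions made here, not details taken from the paper; each time series is assumed to be flattened into a plain list of floats.

```python
import math

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between sample sets xs and ys."""
    def mean_kernel(a, b):
        # Average kernel value over all pairs (u, v) with u in a and v in b.
        return sum(rbf_kernel(u, v, bandwidth) for u in a for v in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Hypothetical toy samples: "close" resembles "real", "far" does not.
real = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3]]
close = [[0.1, 0.1], [0.2, 0.1], [0.0, 0.2]]
far = [[3.0, 3.1], [3.2, 2.9], [2.8, 3.0]]
print(mmd_squared(real, close), mmd_squared(real, far))
```

A smaller value suggests the two sample sets are harder to tell apart under the chosen kernel; the estimate is sensitive to the bandwidth, so in practice that parameter would need to be tuned.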

Abstract Greedy optimization methods such as Matching Pursuit (MP) and Frank-Wolfe (FW) algorithms regained popularity in recent years due to their simplicity, effectiveness and theoretical guarantees. MP and FW address optimization over the linear span and the convex hull of a set of atoms, respectively. In this paper, we consider the intermediate case of optimization over the convex cone, parametrized as the conic hull of a generic atom set, leading to the first principled definitions of non-negative MP algorithms for which we give explicit convergence rates and demonstrate excellent empirical performance. In particular, we derive sublinear (O(1/t)) convergence on general smooth and convex objectives, and linear convergence (O(e^{-t})) on strongly convex objectives, in both cases for general sets of atoms. Furthermore, we establish a clear correspondence of our algorithms to known algorithms from the MP and FW literature. Our novel algorithms and analyses target general atom sets and general objective functions, and hence are directly applicable to a large variety of learning settings.

Authors Francesco Locatello, Michael Tschannen, Gunnar Rätsch, Martin Jaggi

Submitted NIPS 2017

Link DOI
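To make the idea of optimizing over a conic hull more concrete, here is a deliberately simplified sketch of a greedy, matching-pursuit-style update that only ever takes non-negative steps along atoms. It is not the algorithm from the paper: the least-squares objective, the function names, and the stopping rule are assumptions chosen for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def nonneg_greedy_pursuit(atoms, target, steps=100, tol=1e-12):
    """Greedily approximate `target` by a non-negative combination of `atoms`.

    Minimises 0.5 * ||x - target||^2 using only updates x += gamma * atom
    with gamma > 0, so the iterate never leaves the conic hull of the atoms.
    """
    x = [0.0] * len(target)
    for _ in range(steps):
        residual = [t - xi for t, xi in zip(target, x)]  # negative gradient
        best = max(atoms, key=lambda a: dot(residual, a))  # most aligned atom
        gain = dot(residual, best)
        if gain <= tol:  # no atom is a descent direction: stop
            break
        gamma = gain / dot(best, best)  # exact line search for this quadratic
        x = [xi + gamma * bi for xi, bi in zip(x, best)]
    return x

# Hypothetical example: the cone is the non-negative quadrant of the plane.
atoms = [[1.0, 0.0], [0.0, 1.0]]
print(nonneg_greedy_pursuit(atoms, [1.0, -2.0]))
```

With orthogonal atoms, as in this example, the loop reaches the projection of the target onto the cone. For non-orthogonal atoms this purely additive variant can stall short of the optimum, because it has no way to shrink a weight it has already added.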

Abstract To understand the population genetics of structural variants and their effects on phenotypes, we developed an approach to mapping structural variants that segregate in a population sequenced at low coverage. We avoid calling structural variants directly. Instead, the evidence for a potential structural variant at a locus is indicated by variation in the counts of short-reads that map anomalously to that locus. These structural variant traits are treated as quantitative traits and mapped genetically, analogously to a gene expression study. Association between a structural variant trait at one locus, and genotypes at a distant locus indicate the origin and target of a transposition. Using ultra-low-coverage (0.3×) population sequence data from 488 recombinant inbred Arabidopsis thaliana genomes, we identified 6502 segregating structural variants. Remarkably, 25% of these were transpositions. While many structural variants cannot be delineated precisely, we validated 83% of 44 predicted transposition breakpoints by polymerase chain reaction. We show that specific structural variants may be causative for quantitative trait loci for germination and resistance to infection by the fungus Albugo laibachii, isolate Nc14. Further we show that the phenotypic heritability attributable to read-mapping anomalies differs from, and, in the case of time to germination and bolting, exceeds that due to standard genetic variation. Genes within structural variants are also more likely to be silenced or dysregulated. This approach complements the prevalent strategy of structural variant discovery in fewer individuals sequenced at high coverage. It is generally applicable to large populations sequenced at low-coverage, and is particularly suited to mapping transpositions.

Authors Martha Imprialou, André Kahles, Joshua G. Steffen, Edward J. Osborne, Xiangchao Gan, Janne Lempe, Amarjit Bhomra, Eric Belfield, Anne Visscher, Robert Greenhalgh, Nicholas P Harberd, Richard Goram, Jotun Hein, Alexandre Robert-Seilaniantz, Jonathan Jones, Oliver Stegle, Paula Kover, Miltos Tsiantis, Magnus Nordborg, Gunnar Rätsch, Richard M. Clark and Richard Mott

Submitted Genetics

Link DOI

Authors Natalie R. Davidson, PanCancer Analysis of Whole Genomes 3 (PCAWG-3) for ICGC, Alvis Brazma, Angela N. Brooks, Claudia Calabrese, Nuno A. Fonseca, Jonathan Goke, Yao He, Xueda Hu, Andre Kahles, Kjong-Van Lehmann, Fenglin Liu, Gunnar Rätsch, Siliang Li, Roland F. Schwarz, Mingyu Yang, Zemin Zhang, Fan Zhang and Liangtao Zheng

Submitted Proceedings of the American Association for Cancer Research Annual Meeting 2017

Link DOI

Abstract We present SplashRNA, a sequential classifier to predict potent microRNA-based short hairpin RNAs (shRNAs). Trained on published and novel data sets, SplashRNA outperforms previous algorithms and reliably predicts the most efficient shRNAs for a given gene. Combined with an optimized miR-E backbone, >90% of high-scoring SplashRNA predictions trigger >85% protein knockdown when expressed from a single genomic integration. SplashRNA can significantly improve the accuracy of loss-of-function genetics studies and facilitates the generation of compact shRNA libraries.

Authors Pelossof R, Fairchild L, Huang CH, Widmer C, Sreedharan VT, Sinha N, Lai DY, Guan Y, Premsrirut PK, Tschaharganeh DF, Hoffmann T, Thapar V, Xiang Q, Garippa RJ, Rätsch G, Zuber J, Lowe SW, Leslie CS, Fellmann C

Submitted Nature Biotechnology

Link DOI

Abstract Two of the most fundamental prototypes of greedy optimization are the matching pursuit and Frank-Wolfe algorithms. In this paper, we take a unified view on both classes of methods, leading to the first explicit convergence rates of matching pursuit methods in an optimization sense, for general sets of atoms. We derive sublinear (1/t) convergence for both classes on general smooth objectives, and linear convergence on strongly convex objectives, as well as a clear correspondence of algorithm variants. Our presented algorithms and rates are affine invariant, and do not need any incoherence or sparsity assumptions.

Authors Francesco Locatello, Rajiv Khanna, Michael Tschannen, Martin Jaggi

Submitted arXiv

Link
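For readers unfamiliar with the Frank-Wolfe side of this comparison, the following is a minimal sketch of the classic algorithm over the convex hull of a finite atom set. The step size 2/(t+2) is the standard textbook choice, which gives the O(1/t) rate on smooth convex objectives. The function names, the quadratic objective, and the simplex example are assumptions for illustration and are not taken from the paper.

```python
def frank_wolfe(atoms, gradient, x0, steps=2000):
    """Minimise a smooth convex function over the convex hull of `atoms`.

    `gradient` maps a point to the gradient of the objective at that point;
    `x0` must itself lie in the convex hull (for example, be one of the atoms).
    """
    x = list(x0)
    for t in range(steps):
        g = gradient(x)
        # Linear minimisation oracle: the atom with the smallest inner product with g.
        s = min(atoms, key=lambda a: sum(gi * ai for gi, ai in zip(g, a)))
        gamma = 2.0 / (t + 2.0)  # standard open-loop step size
        # A convex combination keeps the iterate inside the hull.
        x = [(1.0 - gamma) * xi + gamma * si for xi, si in zip(x, s)]
    return x

# Hypothetical example: find the point of the probability simplex closest to `target`.
target = [0.2, 0.3, 0.5]
vertices = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
grad = lambda x: [xi - ti for xi, ti in zip(x, target)]
x = frank_wolfe(vertices, grad, vertices[0])
print(x)
```

Every iterate is a convex combination of atoms, so no projection step is needed. Matching pursuit differs in that it moves along atoms within their linear span rather than averaging towards them.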

Abstract MOTIVATION: Deep sequencing based ribosome footprint profiling can provide novel insights into the regulatory mechanisms of protein translation. However, the observed ribosome profile is fundamentally confounded by transcriptional activity. In order to decipher principles of translation regulation, tools that can reliably detect changes in translation efficiency in case-control studies are needed. RESULTS: We present a statistical framework and an analysis tool, RiboDiff, to detect genes with changes in translation efficiency across experimental treatments. RiboDiff uses generalized linear models to estimate the over-dispersion of RNA-Seq and ribosome profiling measurements separately, and performs a statistical test for differential translation efficiency using both mRNA abundance and ribosome occupancy. AVAILABILITY AND IMPLEMENTATION: RiboDiff webpage: http://bioweb.me/ribodiff. Source code, including scripts for preprocessing the FASTQ data, is available at http://github.com/ratschlab/ribodiff. CONTACTS: zhongy@cbio.mskcc.org or raetsch@inf.ethz.ch. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Authors Zhong Y, Karaletsos T, Drewe P, Sreedharan VT, Kuo D, Singh K, Wendel HG, Rätsch G.

Submitted Bioinformatics

Link DOI
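The quantity at the centre of this abstract, translation efficiency, is commonly summarised as ribosome occupancy divided by mRNA abundance. The sketch below computes a naive log2 change in that ratio between two conditions, only to show why footprint counts must be read against transcript levels. It is not RiboDiff's generalized linear model or its statistical test; the function name, the argument names, and the pseudocount are assumptions made for this example.

```python
import math

def log2_te_change(ribo_ctrl, rna_ctrl, ribo_trt, rna_trt, pseudocount=1.0):
    """Naive log2 fold change in translation efficiency (treatment vs control).

    Translation efficiency is taken here as footprint count / mRNA count;
    the pseudocount (an assumption) avoids division by zero for unexpressed genes.
    """
    te_ctrl = (ribo_ctrl + pseudocount) / (rna_ctrl + pseudocount)
    te_trt = (ribo_trt + pseudocount) / (rna_trt + pseudocount)
    return math.log2(te_trt / te_ctrl)

# Footprints double while mRNA is unchanged: translation efficiency roughly doubles.
print(log2_te_change(100, 100, 200, 100))
# Footprints and mRNA both double: the footprint change is explained by transcription.
print(log2_te_change(50, 100, 100, 200))
```

A point estimate like this ignores count noise and over-dispersion, which is the gap a model-based test is meant to fill.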

Abstract Plants use light as source of energy and information to detect diurnal rhythms and seasonal changes. Sensing changing light conditions is critical to adjust plant metabolism and to initiate developmental transitions. Here, we analyzed transcriptome-wide alterations in gene expression and alternative splicing (AS) of etiolated seedlings undergoing photomorphogenesis upon exposure to blue, red, or white light. Our analysis revealed massive transcriptome reprogramming as reflected by differential expression of ∼20% of all genes and changes in several hundred AS events. For more than 60% of all regulated AS events, light promoted the production of a presumably protein-coding variant at the expense of an mRNA with nonsense-mediated decay-triggering features. Accordingly, AS of the putative splicing factor REDUCED RED-LIGHT RESPONSES IN CRY1CRY2 BACKGROUND1, previously identified as a red light signaling component, was shifted to the functional variant under light. Downstream analyses of candidate AS events pointed at a role of photoreceptor signaling only in monochromatic but not in white light. Furthermore, we demonstrated similar AS changes upon light exposure and exogenous sugar supply, with a critical involvement of kinase signaling. We propose that AS is an integration point of signaling pathways that sense and transmit information regarding the energy availability in plants.

Authors Hartmann L, Drewe-Boß P, Wießner T, Wagner G, Geue S, Lee HC, Obermüller DM, Kahles A, Behr J, Sinz FH, Rätsch G, Wachter A

Submitted Plant Cell

Link DOI

Abstract Mapping high-throughput sequencing data to a reference genome is an essential step for most analysis pipelines aiming at the computational analysis of genome and transcriptome sequencing data. Breaking ties between equally well mapping locations poses a severe problem not only during the alignment phase but also has significant impact on the results of downstream analyses. We present the multi-mapper resolution (MMR) tool that infers optimal mapping locations from the coverage density of other mapped reads.

Authors Andre Kahles, Jonas Behr, Gunnar Rätsch

Submitted Bioinformatics (Oxford, England)

Link Pubmed DOI

Authors Stephanie L Hyland, Theofanis Karaletsos, Gunnar Rätsch

Submitted NIPS Workshop on Machine Learning for Healthcare, 2015

Link

Authors M Tauber, T Darrell, Marius Kloft, M Pontil, Gunnar Rätsch, E Rodner, C Lengauer, M Bolten, R D Falgout, O Schenk

Link

Authors Julia Vogt, Marius Kloft, Stefan Stark, S S Raman, S Prabhakaran, V Roth, Gunnar Rätsch

Submitted Machine Learning

Link DOI

Abstract We report a mechanism of translational control that is determined by a requirement for eIF4A RNA helicase activity and underlies the anticancer effects of Silvestrol and related compounds. Briefly, activation of cap-dependent translation contributes to T-cell leukemia (T-ALL) development and maintenance. Accordingly, inhibition of translation initiation factor eIF4A with Silvestrol produces powerful therapeutic effects against T-ALL in vivo. We used transcriptome-scale ribosome footprinting on Silvestrol-treated T-ALL cells to identify Silvestrol-sensitive transcripts and the hallmark features of eIF4A-dependent translation. These include a long 5' UTR and a 12-mer sequence motif that encodes a guanine quartet (CGG)4. RNA folding algorithms as well as experimental evidence pinpoint the (CGG)4 motif as a common site of RNA G-quadruplex structures within the 5' UTR. In T-ALL these structures mark approximately eighty highly Silvestrol-sensitive transcripts that include key oncogenes and transcription factors and contribute to the drug's anti-leukemic action. Hence, the eIF4A-dependent translation of G-quadruplex containing transcripts emerges as a gene-specific and therapeutically targetable mechanism of translational control.

Authors Kamini Singh, Andrew L Wolfe, Yi Zhong, Gunnar Rätsch, Hans Guido Wendel

Link DOI

Authors JE Vogt

Submitted IEEE/ACM Transactions on Computational Biology and Bioinformatics

Link

Abstract Identifying discriminative motifs underlying the functionality and evolution of organisms is a major challenge in computational biology. Machine learning approaches such as support vector machines (SVMs) achieve state-of-the-art performance in genomic discrimination tasks, but--due to their black-box character--the motifs underlying their decision functions are largely unknown. As a remedy, positional oligomer importance matrices (POIMs) allow us to visualize the significance of position-specific subsequences. Although they are a major step towards the explanation of trained SVM models, they suffer from the fact that their size grows exponentially in the length of the motif, which renders their manual inspection feasible only for comparably small motif sizes, typically k ≤ 5. In this work, we extend the work on positional oligomer importance matrices by presenting a new machine-learning methodology, entitled motifPOIM, to extract the truly relevant motifs--regardless of their length and complexity--underlying the predictions of a trained SVM model. Our framework thereby considers the motifs as free parameters in a probabilistic model, a task which can be phrased as a non-convex optimization problem. The exponential dependence of the POIM size on the oligomer length poses a major numerical challenge, which we address by an efficient optimization framework that allows us to find possibly overlapping motifs consisting of up to hundreds of nucleotides. We demonstrate the efficacy of our approach on a synthetic data set as well as a real-world human splice site data set.

Authors Marina M C Vidovic, Nico Görnitz, Klaus Robert Müller, Gunnar Rätsch, Marius Kloft

Submitted PLoS One

Link Pubmed DOI

Abstract Interferon-γ (IFN-γ) primes macrophages for enhanced microbial killing and inflammatory activation by Toll-like receptors (TLRs), but little is known about the regulation of cell metabolism or mRNA translation during this priming. We found that IFN-γ regulated the metabolism and mRNA translation of human macrophages by targeting the kinases mTORC1 and MNK, both of which converge on the selective regulator of translation initiation eIF4E. Physiological downregulation of mTORC1 by IFN-γ was associated with autophagy and translational suppression of repressors of inflammation such as HES1. Genome-wide ribosome profiling in TLR2-stimulated macrophages showed that IFN-γ selectively modulated the macrophage translatome to promote inflammation, further reprogram metabolic pathways and modulate protein synthesis. These results show that IFN-γ-mediated metabolic reprogramming and translational regulation are key components of classical inflammatory macrophage activation.

Authors Xiaodi Su, Yingpu Yu, Yi Zhong, Eugenia G Giannopoulou, Xiaoyu Hu, Hui Liu, Justin R Cross, Gunnar Rätsch, Charles M Rice, Lionel B Ivashkiv

Submitted Nature Immunology

Link Pubmed DOI

Abstract Epigenome modulation potentially provides a mechanism for organisms to adapt, within and between generations. However, neither the extent to which this occurs, nor the mechanisms involved are known. Here we investigate DNA methylation variation in Swedish Arabidopsis thaliana accessions grown at two different temperatures. Environmental effects were limited to transposons, where CHH methylation was found to increase with temperature. Genome-wide association studies (GWAS) revealed that the extensive CHH methylation variation was strongly associated with genetic variants in both cis and trans, including a major trans-association close to the DNA methyltransferase CMT2. Unlike CHH methylation, CpG gene body methylation (GBM) was not affected by growth temperature, but was instead correlated with the latitude of origin. Accessions from colder regions had higher levels of GBM for a significant fraction of the genome, and this was associated with increased transcription for the genes affected. GWAS revealed that this effect was largely due to trans-acting loci, many of which showed evidence of local adaptation.

Authors Manu J Dubin, Pei Zhang, Dazhe Meng, Marie Stanislas Remigereau, Edward J Osborne, Francesco Paolo Casale, Philipp Drewe, Andre Kahles, Geraldine Jean, Bjarni Vilhjalmsson, Joanna Jagoda, Selen Irez, Viktor Voronin, Qiang Song, Quan Long, Gunnar Rätsch, Oliver Stegle, Richard M Clark, Magnus Nordborg

Submitted eLife

Link Pubmed DOI

Abstract We present a genome-wide analysis of splicing patterns of 282 kidney renal clear cell carcinoma patients in which we integrate data from whole-exome sequencing of tumor and normal samples, RNA-seq and copy number variation. We proposed a scoring mechanism to compare splicing patterns in tumor samples to normal samples in order to rank and detect tumor-specific isoforms that have a potential for new biomarkers. We identified a subset of genes that show introns only observable in tumor but not in normal samples, ENCODE and GEUVADIS samples. In order to improve our understanding of the underlying genetic mechanisms of splicing variation we performed a large-scale association analysis to find links between somatic or germline variants with alternative splicing events. We identified 915 cis- and trans-splicing quantitative trait loci (sQTL) associated with changes in splicing patterns. Some of these sQTL have previously been associated with being susceptibility loci for cancer and other diseases. Our analysis also allowed us to identify the function of several COSMIC variants showing significant association with changes in alternative splicing. This demonstrates the potential significance of variants affecting alternative splicing events and yields insights into the mechanisms related to an array of disease phenotypes.

Authors Kjong Van Lehmann, Andre Kahles, Cyriac Kandoth, William Lee, Nikolaus Schultz, Oliver Stegle, Gunnar Rätsch

Submitted Pacific Symposium on Biocomputing

Link Pubmed

Authors Xinghua Lou, Marius Kloft, Gunnar Rätsch, F A Hamprecht

Link

Authors Nico Görnitz, AK Porbadnigk, Alexander Binder, C Sannelli, Mikio L Braun, Klaus Robert Müller, Marius Kloft

Link

Abstract Analysis of microscopy images can provide insight into many biological processes. One particularly challenging problem is cellular nuclear segmentation in highly anisotropic and noisy 3D image data. Manually localizing and segmenting each and every cellular nucleus is very time-consuming, which remains a bottleneck in large-scale biological experiments. In this work, we present a tool for automated segmentation of cellular nuclei from 3D fluorescent microscopic data. Our tool is based on state-of-the-art image processing and machine learning techniques and provides a user-friendly graphical user interface. We show that our tool is as accurate as manual annotation and greatly reduces the time for the registration.

Authors Christian K Widmer, Stephanie Heinrich, Philipp Drewe, Xinghua Lou, Shefali Umrania, Gunnar Rätsch

Submitted Signal, Image and Video Processing

Link Pubmed DOI

Abstract Alternative splicing is an essential mechanism for increasing transcriptome and proteome diversity in eukaryotes. Particularly in multicellular eukaryotes, this mechanism is involved in the regulation of developmental and physiological processes like growth, differentiation and signal transduction.

Authors Arash Kianianmomeni, Cheng Soon Ong, Gunnar Rätsch, Armin Hallmann

Submitted BMC Genomics

Link Pubmed DOI

Abstract Intraspecific genetic incompatibilities prevent the assembly of specific alleles into single genotypes and influence genome- and species-wide patterns of sequence variation. A common incompatibility in plants is hybrid necrosis, characterized by autoimmune responses due to epistatic interactions between natural genetic variants. By systematically testing thousands of F1 hybrids of Arabidopsis thaliana strains, we identified a small number of incompatibility hot spots in the genome, often in regions densely populated by nucleotide-binding domain and leucine-rich repeat (NLR) immune receptor genes. In several cases, these immune receptor loci interact with each other, suggestive of conflict within the immune system. A particularly dangerous locus is a highly variable cluster of NLR genes, DM2, which causes multiple independent incompatibilities with genes that encode a range of biochemical functions, including NLRs. Our findings suggest that deleterious interactions of immune receptors limit the combinations of favorable disease resistance alleles accessible to plant genomes.

Authors Eunyoung Chae, Kirsten Bomblies, Sang Tae Kim, Darya Karelina, Maricris Zaidem, Stephan Ossowski, Carmen Martin Pizarro, Roosa A E Laitinen, Beth A Rowan, Hezi Tenenboim, Sarah Lechner, Monika Demar, Anette Habring Müller, Christa Lanz, Gunnar Rätsch, Detlef Weigel

Submitted Cell

Link Pubmed DOI

Abstract Transcription factor (TF) DNA sequence preferences direct their regulatory activity, but are currently known for only ∼1% of eukaryotic TFs. Broadly sampling DNA-binding domain (DBD) types from multiple eukaryotic clades, we determined DNA sequence preferences for >1,000 TFs encompassing 54 different DBD classes from 131 diverse eukaryotes. We find that closely related DBDs almost always have very similar DNA sequence preferences, enabling inference of motifs for ∼34% of the ∼170,000 known or predicted eukaryotic TFs. Sequences matching both measured and inferred motifs are enriched in chromatin immunoprecipitation sequencing (ChIP-seq) peaks and upstream of transcription start sites in diverse eukaryotic lineages. SNPs defining expression quantitative trait loci in Arabidopsis promoters are also enriched for predicted TF binding sites. Importantly, our motif "library" can be used to identify specific TFs whose binding may be altered by human disease risk alleles. These data present a powerful resource for mapping transcriptional networks across eukaryotes.

Authors Matthew T Weirauch, Ally Yang, Mihai Albu, Atina G Cote, Alejandro Montenegro Montero, Philipp Drewe, Hamed S Najafabadi, Samuel A Lambert, Ishminder Mann, Kate Cook, Hong Zheng, Alejandra Goity, Harm van Bakel, Jean Claude Lozano, Mary Galli, Mathew G Lewsey, Eryong Huang, Tuhin Mukherjee, Xiaoting Chen, John S Reece Hoyes, Sridhar Govindarajan, Gad Shaulsky, Albertha J M Walhout, Francois Yves Bouget, Gunnar Rätsch, Luis F Larrondo, Joseph R Ecker, Timothy R Hughes

Submitted Cell

Link Pubmed DOI

Abstract The translational control of oncoprotein expression is implicated in many cancers. Here we report an eIF4A RNA helicase-dependent mechanism of translational control that contributes to oncogenesis and underlies the anticancer effects of silvestrol and related compounds. For example, eIF4A promotes T-cell acute lymphoblastic leukaemia development in vivo and is required for leukaemia maintenance. Accordingly, inhibition of eIF4A with silvestrol has powerful therapeutic effects against murine and human leukaemic cells in vitro and in vivo. We use transcriptome-scale ribosome footprinting to identify the hallmarks of eIF4A-dependent transcripts. These include 5' untranslated region (UTR) sequences such as the 12-nucleotide guanine quartet (CGG)4 motif that can form RNA G-quadruplex structures. Notably, among the most eIF4A-dependent and silvestrol-sensitive transcripts are a number of oncogenes, superenhancer-associated transcription factors, and epigenetic regulators. Hence, the 5' UTRs of select cancer genes harbour a targetable requirement for the eIF4A RNA helicase.

Authors Andrew L Wolfe, Kamini Singh, Yi Zhong, Philipp Drewe, Vinagolu K Rajasekhar, Viraj R Sanghvi, Konstantinos J Mavrakis, Man Jiang, Justine E Roderick, Joni Van der Meulen, Jonathan H Schatz, Christina M Rodrigo, Chunying Zhao, Pieter Rondou, Elisa de Stanchina, Julie Teruya Feldstein, Michelle A Kelliher, Frank Speleman, John A Porco, Jerry Pelletier, Gunnar Rätsch, Hans Guido Wendel

Submitted Nature

Link Pubmed DOI

Abstract We present Oqtans, an open-source workbench for quantitative transcriptome analysis, that is integrated in Galaxy. Its distinguishing features include customizable computational workflows and a modular pipeline architecture that facilitates comparative assessment of tool and data quality. Oqtans integrates an assortment of machine learning-powered tools into Galaxy, which show superior or equal performance to state-of-the-art tools. Implemented tools comprise a complete transcriptome analysis workflow: short-read alignment, transcript identification/quantification and differential expression analysis. Oqtans and Galaxy facilitate persistent storage, data exchange and documentation of intermediate results and analysis workflows. We illustrate how Oqtans aids the interpretation of data from different experiments in easy to understand use cases. Users can easily create their own workflows and extend Oqtans by integrating specific tools. Oqtans is available as (i) a cloud machine image with a demo instance at cloud.oqtans.org, (ii) a public Galaxy instance at galaxy.cbio.mskcc.org, (iii) a git repository containing all installed software (oqtans.org/git); most of which is also available from (iv) the Galaxy Toolshed and (v) a share string to use along with Galaxy CloudMan.

Authors Vipin T Sreedharan, Sebastian J Schultheiss, Geraldine Jean, Andre Kahles, Regina Bohnert, Philipp Drewe, Pramod Mudrakarta, Nico Görnitz, Georg Zeller, Gunnar Rätsch

Submitted Bioinformatics (Oxford, England)

Link Pubmed DOI

Abstract Recent genomic analyses of pathologically defined tumor types identify “within-a-tissue” disease subtypes. However, the extent to which genomic signatures are shared across tissues is still unclear. We performed an integrative analysis using five genome-wide platforms and one proteomic platform on 3,527 specimens from 12 cancer types, revealing a unified classification into 11 major subtypes. Five subtypes were nearly identical to their tissue-of-origin counterparts, but several distinct cancer types were found to converge into common subtypes. Lung squamous, head and neck, and a subset of bladder cancers coalesced into one subtype typified by TP53 alterations, TP63 amplifications, and high expression of immune and proliferation pathway genes. Of note, bladder cancers split into three pan-cancer subtypes. The multiplatform classification, while correlated with tissue-of-origin, provides independent information for predicting clinical outcomes. All data sets are available for data-mining from a unified resource to support further biological discoveries and insights into novel therapeutic strategies.

Authors K A Hoadley, C Yau, D M Wolf, A D Cherniack, D Tamborero, S Ng, M D M Leiserson, B Niu, M D McLellan, V Uzunangelov, J Zhang, Cyriac Kandoth, R Akbani, H Shen, L Omberg, A Chu, A A Margolin, LJ Van't Veer, N Lopez Bigas, P W Laird, B J Raphael, L Ding, A G Robertson, L A Byers, G B Mills, J N Weinstein, C Van Waes, Z Chen, E A Collisson, Cancer Genome Atlas Research Network

Submitted Cell

Link DOI

Authors Nico Görnitz, Marius Kloft, K Rieck, U Brefeld

Submitted Journal of Artificial Intelligence Research

Link

Authors A Bauer, Nico Görnitz, F Biegler, Klaus Robert Müller, Marius Kloft

Submitted IEEE Transactions on Neural Networks and Learning Systems

Link

Abstract Insulin initiates diverse hepatic metabolic responses, including gluconeogenic suppression and induction of glycogen synthesis and lipogenesis. The liver possesses a rich sinusoidal capillary network with a higher degree of hypoxia and lower gluconeogenesis in the perivenous zone as compared to the rest of the organ. Here, we show that diverse vascular endothelial growth factor (VEGF) inhibitors improved glucose tolerance in nondiabetic C57BL/6 and diabetic db/db mice, potentiating hepatic insulin signaling with lower gluconeogenic gene expression, higher glycogen storage and suppressed hepatic glucose production. VEGF inhibition induced hepatic hypoxia through sinusoidal vascular regression and sensitized liver insulin signaling through hypoxia-inducible factor-2α (Hif-2α, encoded by Epas1) stabilization. Notably, liver-specific constitutive activation of HIF-2α, but not HIF-1α, was sufficient to augment hepatic insulin signaling through direct and indirect induction of insulin receptor substrate-2 (Irs2), an essential insulin receptor adaptor protein. Further, liver Irs2 was both necessary and sufficient to mediate Hif-2α and Vegf inhibition effects on glucose tolerance and hepatic insulin signaling. These results demonstrate an unsuspected intersection between Hif-2α-mediated hypoxic signaling and hepatic insulin action through Irs2 induction, which can be co-opted by Vegf inhibitors to modulate glucose metabolism. These studies also indicate distinct roles in hepatic metabolism for Hif-1α, which promotes glycolysis, and Hif-2α, which suppresses gluconeogenesis, and suggest new treatment approaches for type 2 diabetes mellitus.

Authors K Wei, SM Piecewicz, LM McGinnis, CM Taniguchi, SJ Wiegand, K Anderson, CW M Chan, KX Mulligan, David Kuo, J Yuan, M Vallon, LC Morton, E Lefai, MC Simon, JJ Maher, G Mithieux, F Rajas, JP Annes, OP McGuinness, G Thurston, AJ Giaccia, CJ Kuo

Submitted Nat Med

Link DOI

Abstract The intestinal microbiota is a microbial ecosystem of crucial importance to human health. Understanding how the microbiota confers resistance against enteric pathogens and how antibiotics disrupt that resistance is key to the prevention and cure of intestinal infections. We present a novel method to infer microbial community ecology directly from time-resolved metagenomics. This method extends generalized Lotka-Volterra dynamics to account for external perturbations. Data from recent experiments on antibiotic-mediated Clostridium difficile infection is analyzed to quantify microbial interactions, commensal-pathogen interactions, and the effect of the antibiotic on the community. Stability analysis reveals that the microbiota is intrinsically stable, explaining how antibiotic perturbations and C. difficile inoculation can produce catastrophic shifts that persist even after removal of the perturbations. Importantly, the analysis suggests a subnetwork of bacterial groups implicated in protection against C. difficile. Due to its generality, our method can be applied to any high-resolution ecological time-series data to infer community structure and response to external stimuli.

Authors Richard R Stein, Vanni Bucci, Nora C Toussaint, Charlie G Buffie, Gunnar Rätsch, Eric G Pamer, Chris Sander, Joao B Xavier

Submitted PLoS computational biology

Link Pubmed DOI
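
The generalized Lotka-Volterra dynamics with an external perturbation mentioned in the abstract above can be illustrated with a small simulation. This is a minimal sketch, not the authors' inference method: it only integrates the forward model dx_i/dt = x_i (mu_i + sum_j M_ij x_j + eps_i u(t)) with a simple Euler scheme. The function names, parameter names (mu, M, eps, u) and all numeric values are assumptions chosen for illustration.

```python
def glv_step(x, mu, M, eps, u, dt):
    """One forward-Euler step of generalized Lotka-Volterra dynamics
    with an external perturbation of strength u (hypothetical helper)."""
    n = len(x)
    new_x = []
    for i in range(n):
        # Net per-capita growth: intrinsic rate + interactions + perturbation.
        interaction = sum(M[i][j] * x[j] for j in range(n))
        growth = mu[i] + interaction + eps[i] * u
        # Abundances cannot become negative.
        new_x.append(max(x[i] + dt * x[i] * growth, 0.0))
    return new_x


def simulate(x0, mu, M, eps, u_of_t, dt=0.01, steps=1000):
    """Integrate the model from x0 and return the list of visited states."""
    x = list(x0)
    trajectory = [x]
    for k in range(steps):
        x = glv_step(x, mu, M, eps, u_of_t(k * dt), dt)
        trajectory.append(x)
    return trajectory


# Illustrative single-species example (all values are made up):
# without perturbation the species settles at its carrying capacity,
# with a constant perturbation it is driven towards extinction.
unperturbed = simulate([0.1], [1.0], [[-1.0]], [-2.0], lambda t: 0.0, steps=2000)
perturbed = simulate([0.1], [1.0], [[-1.0]], [-2.0], lambda t: 1.0, steps=2000)
```

In the paper's setting the parameters would be inferred from time-resolved metagenomic data; here they are simply fixed by hand to show how the perturbation term enters the dynamics.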

Abstract High-throughput RNA sequencing is an increasingly accessible method for studying gene structure and activity on a genome-wide scale. A critical step in RNA-seq data analysis is the alignment of partial transcript reads to a reference genome sequence. To assess the performance of current mapping software, we invited developers of RNA-seq aligners to process four large human and mouse RNA-seq data sets. In total, we compared 26 mapping protocols based on 11 programs and pipelines and found major performance differences between methods on numerous benchmarks, including alignment yield, basewise accuracy, mismatch and gap placement, exon junction discovery and suitability of alignments for transcript reconstruction. We observed concordant results on real and simulated RNA-seq data, confirming the relevance of the metrics employed. Future developments in RNA-seq alignment methods would benefit from improved placement of multimapped reads, balanced utilization of existing gene annotation and a reduced false discovery rate for splice junctions.

Authors Par G Engstrom, Tamara Steijger, Botond Sipos, Gregory R Grant, Andre Kahles, Gunnar Rätsch, Nick Goldman, Tim J Hubbard, Jennifer Harrow, Roderic Guigo, Paul Bertone

Submitted Nature methods

Link Pubmed DOI

Abstract The nonsense-mediated decay (NMD) surveillance pathway can recognize erroneous transcripts and physiological mRNAs, such as precursor mRNA alternative splicing (AS) variants. Currently, information on the global extent of coupled AS and NMD remains scarce and even absent for any plant species. To address this, we conducted transcriptome-wide splicing studies using Arabidopsis thaliana mutants in the NMD factor homologs UP FRAMESHIFT1 (UPF1) and UPF3 as well as wild-type samples treated with the translation inhibitor cycloheximide. Our analyses revealed that at least 17.4% of all multi-exon, protein-coding genes produce splicing variants that are targeted by NMD. Moreover, we provide evidence that UPF1 and UPF3 act in a translation-independent mRNA decay pathway. Importantly, 92.3% of the NMD-responsive mRNAs exhibit classical NMD-eliciting features, supporting their authenticity as direct targets. Genes generating NMD-sensitive AS variants function in diverse biological processes, including signaling and protein modification, for which NaCl stress-modulated AS-NMD was found. Besides mRNAs, numerous noncoding RNAs and transcripts derived from intergenic regions were shown to be NMD responsive. In summary, we provide evidence for a major function of AS-coupled NMD in shaping the Arabidopsis transcriptome, having fundamental implications in gene regulation and quality control of transcript processing.

Authors Gabriele Drechsel, Andre Kahles, Anil K Kesarwani, Eva Stauffer, Jonas Behr, Philipp Drewe, Gunnar Rätsch, Andreas Wachter

Submitted The Plant cell

Link Pubmed DOI

Abstract High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and reconstruction of RNA transcripts. However, the extensive dynamic range of gene expression, technical limitations and biases, as well as the observed complexity of the transcriptional landscape, pose profound computational challenges for transcriptome reconstruction.

Authors Jonas Behr, Andre Kahles, Yi Zhong, Vipin T Sreedharan, Philipp Drewe, Gunnar Rätsch

Submitted Bioinformatics (Oxford, England)

Link Pubmed DOI

Abstract Deep transcriptome sequencing (RNA-Seq) has become a vital tool for studying the state of cells in the context of varying environments, genotypes and other factors. RNA-Seq profiling data enable identification of novel isoforms, quantification of known isoforms and detection of changes in transcriptional or RNA-processing activity. Existing approaches to detect differential isoform abundance between samples either require a complete isoform annotation or fall short in providing statistically robust and calibrated significance estimates. Here, we propose a suite of statistical tests to address these open needs: a parametric test that uses known isoform annotations to detect changes in relative isoform abundance and a non-parametric test that detects differential read coverages and can be applied when isoform annotations are not available. Both methods account for the discrete nature of read counts and the inherent biological variability. We demonstrate that these tests compare favorably to previous methods, both in terms of accuracy and statistical calibrations. We use these techniques to analyze RNA-Seq libraries from Arabidopsis thaliana and Drosophila melanogaster. The identified differential RNA processing events were consistent with RT-qPCR measurements and previous studies. The proposed toolkit is available from http://bioweb.me/rdiff and enables in-depth analyses of transcriptomes, with or without available isoform annotation.

Authors Philipp Drewe, Oliver Stegle, Lisa Hartmann, Andre Kahles, Regina Bohnert, Andreas Wachter, Karsten Borgwardt, Gunnar Rätsch

Submitted Nucleic acids research

Link Pubmed DOI

Abstract Using a variety of techniques including Topic Modeling, PCA and Bi-clustering, we explore electronic patient records in the form of unstructured clinical notes and genetic mutation test results. Our ultimate goal is to gain insight into a unique body of clinical data, specifically regarding the topics discussed within the note content and relationships between patient clinical notes and their underlying genetics.

Authors K R Chan, Xinghua Lou, Theo Karaletsos, C Crosbie, S Gardos, D Artz, Gunnar Rätsch

Submitted ICDM Workshop on Biological Data Mining and its Applications in Healthcare

Link DOI

Abstract The Cancer Genome Atlas (TCGA) Research Network has profiled and analyzed large numbers of human tumors to discover molecular aberrations at the DNA, RNA, protein and epigenetic levels. The resulting rich data provide a major opportunity to develop an integrated picture of commonalities, differences and emergent themes across tumor lineages. The Pan-Cancer initiative compares the first 12 tumor types profiled by TCGA. Analysis of the molecular aberrations and their functional roles across tumor types will teach us how to extend therapies effective in one cancer type to others with a similar genomic profile.

Authors Cancer Genome Atlas Research Network, J N Weinstein, E A Collisson, G B Mills, K R M Shaw, B A Ozenberger, K Ellrott, I Shmulevich, Chris Sander, J M Stuart

Submitted Nature Genetics

Link DOI

Authors Tamara Steijger, J F Abril, Par G Engstrom, F Kokocinski, Tim J Hubbard, Roderic Guigo, Jennifer Harrow, Paul Bertone, RGASP Consortium

Submitted Nature Methods

Link DOI

Abstract CD45 encodes a trans-membrane protein-tyrosine phosphatase expressed in diverse cells of the immune system. By combinatorial use of three variable exons 4-6, isoforms are generated that differ in their extracellular domain, thereby modulating phosphatase activity and immune response. Alternative splicing of these CD45 exons involves two heterogeneous ribonucleoproteins, hnRNP L and its cell-type specific paralog hnRNP L-like (LL). To address the complex combinatorial splicing of exons 4-6, we investigated hnRNP L/LL protein expression in human B-cells in relation to CD45 splicing patterns, applying RNA-Seq. In addition, mutational and RNA-binding analyses were carried out in HeLa cells. We conclude that hnRNP LL functions as the major CD45 splicing repressor, with two CA elements in exon 6 as its primary target. In exon 4, one element is targeted by both hnRNP L and LL. In contrast, exon 5 was never repressed on its own and only co-regulated with exons 4 and 6. Stable L/LL interaction requires CD45 RNA, specifically exons 4 and 6. We propose a novel model of combinatorial alternative splicing: HnRNP L and LL cooperate on the CD45 pre-mRNA, bridging exons 4 and 6 and looping out exon 5, thereby achieving full repression of the three variable exons.

Authors Marco Preussner, Silke Schreiner, Lee Hsueh Hung, Martina Porstner, Hans Martin Jack, Vladimir Benes, Gunnar Rätsch, Albrecht Bindereif

Submitted Nucleic Acids Res

Link DOI

Abstract Deep sequencing of transcriptomes allows quantitative and qualitative analysis of many RNA species in a sample; the ideal goal is parallel comparison of expression levels, splicing variants, natural antisense transcripts, RNA editing and transcriptional start and stop sites. By computational modeling, we show how libraries of multiple insert sizes combined with strand-specific, paired-end (SS-PE) sequencing can increase the information gained on alternative splicing, especially in higher eukaryotes. Despite the benefits of gaining SS-PE data with paired ends of varying distance, the standard Illumina protocol allows only non-strand-specific, paired-end sequencing with a single insert size. Here, we modify the Illumina RNA ligation protocol to allow SS-PE sequencing by using a custom pre-adenylated 3' adaptor. We generate parallel libraries with differing insert sizes to aid deconvolution of alternative splicing events and to characterize the extent and distribution of natural antisense transcription in C. elegans. Despite stringent requirements for detection of alternative splicing, our data increase the number of intron retention and exon skipping events annotated in the Wormbase genome annotations by 127% and 121%, respectively. We show that parallel libraries with a range of insert sizes increase transcriptomic information gained by sequencing and that by current established benchmarks our protocol gives competitive results with respect to library quality.

Authors Lisa M Smith, Lisa Hartmann, Philipp Drewe, Regina Bohnert, Andre Kahles, Christa Lanz, Gunnar Rätsch

Submitted RNA biology

Link Pubmed DOI

Abstract Alternative splicing (AS) generates transcript variants by variable exon/intron definition and massively expands transcriptome diversity. Changes in AS patterns have been found to be linked to manifold biological processes, yet fundamental aspects, such as the regulation of AS and its functional implications, largely remain to be addressed. In this work, widespread AS regulation by Arabidopsis thaliana Polypyrimidine tract binding protein homologs (PTBs) was revealed. In total, 452 AS events derived from 307 distinct genes were found to be responsive to the levels of the splicing factors PTB1 and PTB2, which predominantly triggered splicing of regulated introns, inclusion of cassette exons, and usage of upstream 5' splice sites. By contrast, no major AS regulatory function of the distantly related PTB3 was found. Dependent on their position within the mRNA, PTB-regulated events can both modify the untranslated regions and give rise to alternative protein products. We find that PTB-mediated AS events are connected to diverse biological processes, and the functional implications of selected instances were further elucidated. Specifically, PTB misexpression changes AS of PHYTOCHROME INTERACTING FACTOR6, coinciding with altered rates of abscisic acid-dependent seed germination. Furthermore, AS patterns as well as the expression of key flowering regulators were massively changed in a PTB1/2 level-dependent manner.

Authors Christina Ruhl, Eva Stauffer, Andre Kahles, Gabriele Wagner, Gabriele Drechsel, Gunnar Rätsch, Andreas Wachter

Submitted The Plant cell

Link Pubmed DOI

Abstract Cohesin is a protein complex that forms a ring around sister chromatids thus holding them together. The ring is composed of three proteins: Smc1, Smc3 and Scc1. The roles of three additional proteins that associate with the ring, Scc3, Pds5 and Wpl1, are not well understood. It has been proposed that these three factors form a complex that stabilizes the ring and prevents it from opening. This activity promotes sister chromatid cohesion but at the same time poses an obstacle for the initial entrapment of sister DNAs. This hindrance to cohesion establishment is overcome during DNA replication via acetylation of the Smc3 subunit by the Eco1 acetyltransferase. However, the full mechanistic consequences of Smc3 acetylation remain unknown. In the current work, we test the requirement of Scc3 and Pds5 for the stable association of cohesin with DNA. We investigated the consequences of Scc3 and Pds5 depletion in vivo using degron tagging in budding yeast. The previously described DHFR-based N-terminal degron as well as a novel Eco1-derived C-terminal degron were employed in our study. Scc3 and Pds5 associate with cohesin complexes independently of each other and require the Scc1 "core" subunit for their association with chromosomes. Contrary to previous data for Scc1 downregulation, depletion of either Scc3 or Pds5 had a strong effect on sister chromatid cohesion but not on cohesin binding to DNA. Quantity, stability and genome-wide distribution of cohesin complexes remained mostly unchanged after the depletion of Scc3 and Pds5. Our findings are inconsistent with a previously proposed model that Scc3 and Pds5 are cohesin maintenance factors required for cohesin ring stability or for maintaining its association with DNA. We propose that Scc3 and Pds5 specifically function during cohesion establishment in S phase.

Authors Irina Kulemzina, Martin R Schumacher, Vikash Verma, Jochen Reiter, Janina Metzler, Antonio Virgilio Failla, Christa Lanz, Vipin T Sreedharan, Gunnar Rätsch, Dmitri Ivanov

Submitted PLoS genetics

Link Pubmed DOI

Authors Nico Görnitz, Georg Zeller, Jonas Behr, Andre Kahles, Pramod Mudrakarta, Soren Sonnenburg, Gunnar Rätsch

Link

Abstract We have conducted a study on the long-term availability of bioinformatics Web services: an observation of 927 Web services published in the annual Nucleic Acids Research Web Server Issues between 2003 and 2009. We found that 72% of Web sites are still available at the published addresses, and only 9% of services are completely unavailable. Older addresses often redirect to new pages. We checked the functionality of all available services: for 33%, we could not test functionality because there was no example data or a related problem; 13% were truly no longer working as expected; we could positively confirm functionality only for 45% of all services. Additionally, we conducted a survey among 872 Web Server Issue corresponding authors; 274 replied. 78% of all respondents indicate their services have been developed solely by students and researchers without a permanent position. Consequently, these services are in danger of falling into disrepair after the original developers move to another institution, and indeed, for 24% of services, there is no plan for maintenance, according to the respondents. We introduce a Web service quality scoring system that correlates with the number of citations: services with a high score are cited 1.8 times more often than low-scoring services. We have identified key characteristics that are predictive of a service's survival, providing reviewers, editors, and Web service developers with the means to assess or improve Web services. A Web service conforming to these criteria receives more citations and provides more reliable service for its users. The most effective way of ensuring continued access to a service is a persistent Web address, offered either by the publishing journal, or created on the authors' own initiative, for example at http://bioweb.me. The community would benefit the most from a policy requiring any source code needed to reproduce results to be deposited in a public repository.

Authors Sebastian J Schultheiss, Marc Christian Munch, Gergana D Andreeva, Gunnar Rätsch

Submitted PloS one

Link Pubmed DOI

Abstract Genetic differences between Arabidopsis thaliana accessions underlie the plant's extensive phenotypic variation, and until now these have been interpreted largely in the context of the annotated reference accession Col-0. Here we report the sequencing, assembly and annotation of the genomes of 18 natural A. thaliana accessions, and their transcriptomes. When assessed on the basis of the reference annotation, one-third of protein-coding genes are predicted to be disrupted in at least one accession. However, re-annotation of each genome revealed that alternative gene models often restore coding potential. Gene expression in seedlings differed for nearly half of expressed genes and was frequently associated with cis variants within 5 kilobases, as were intron retention alternative splicing events. Sequence and expression variation is most pronounced in genes that respond to the biotic environment. Our data further promote evolutionary and functional studies in A. thaliana, especially the MAGIC genetic reference population descended from these accessions.

Authors Xiangchao Gan, Oliver Stegle, Jonas Behr, Joshua G Steffen, Philipp Drewe, Katie L Hildebrand, Rune Lyngsoe, Sebastian J Schultheiss, Edward J Osborne, Vipin T Sreedharan, Andre Kahles, Regina Bohnert, Geraldine Jean, Paul Derwent, Paul Kersey, Eric J Belfield, Nicholas P Harberd, Eric Kemen, Christopher Toomajian, Paula X Kover, Richard M Clark, Gunnar Rätsch, Richard Mott

Submitted Nature

Link Pubmed DOI

Abstract Precise 5' splice-site recognition is essential for both constitutive and regulated pre-mRNA splicing. The U1 small nuclear ribonucleoprotein particle (snRNP)-specific protein U1C is involved in this first step of spliceosome assembly and important for stabilizing early splicing complexes. We used an embryonically lethal U1C mutant zebrafish, hi1371, to investigate the potential genomewide role of U1C for splicing regulation. U1C mutant embryos contain overall stable, but U1C-deficient U1 snRNPs. Surprisingly, genomewide RNA-Seq analysis of mutant versus wild-type embryos revealed a large set of specific target genes that changed their alternative splicing patterns in the absence of U1C. Injection of ZfU1C cRNA into mutant embryos and in vivo splicing experiments in HeLa cells after siRNA-mediated U1C knockdown confirmed the U1C dependency and specificity, as well as the functional conservation of the effects observed. In addition, sequence motif analysis of the U1C-dependent 5' splice sites uncovered an association with downstream intronic U-rich elements. In sum, our findings provide evidence for a new role of a general snRNP protein, U1C, as a mediator of alternative splicing regulation.

Authors Tanja Dorothe Rosel, Lee Hsueh Hung, Jan Medenbach, Katrin Donde, Stefan Starke, Vladimir Benes, Gunnar Rätsch, Albrecht Bindereif

Submitted The EMBO journal

Link Pubmed DOI

Abstract Alternative splicing (AS) is a process which generates several distinct mRNA isoforms from the same gene by splicing different portions out of the precursor transcript. Due to the (patho-)physiological importance of AS, a complete inventory of AS is of great interest. While this is in reach for human and mammalian model organisms, our knowledge of AS in plants has remained more incomplete. Experimental approaches for monitoring AS are either based on transcript sequencing or rely on hybridization to DNA microarrays. Among the microarray platforms facilitating the discovery of AS events, tiling arrays are well-suited for identifying intron retention, the most prevalent type of AS in plants. However, analyzing tiling array data is challenging, because of high noise levels and limited probe coverage.

Authors Johannes Eichner, Georg Zeller, Sascha Laubinger, Gunnar Rätsch

Submitted BMC bioinformatics

Link Pubmed DOI

Abstract The C. elegans genome has been completely sequenced, and the developmental anatomy of this model organism is described at single-cell resolution. Here we utilize strategies that exploit this precisely defined architecture to link gene expression to cell type. We obtained RNAs from specific cells and from each developmental stage using tissue-specific promoters to mark cells for isolation by FACS or for mRNA extraction by the mRNA-tagging method. We then generated gene expression profiles of more than 30 different cells and developmental stages using tiling arrays. Machine-learning-based analysis detected transcripts corresponding to established gene models and revealed novel transcriptionally active regions (TARs) in noncoding domains that comprise at least 10% of the total C. elegans genome. Our results show that about 75% of transcripts with detectable expression are differentially expressed among developmental stages and across cell types. Examination of known tissue- and cell-specific transcripts validates these data sets and suggests that newly identified TARs may exercise cell-specific functions. Additionally, we used self-organizing maps to define groups of coregulated transcripts and applied regulatory element analysis to identify known transcription factor- and miRNA-binding sites, as well as novel motifs that likely function to control subsets of these genes. By using cell-specific, whole-genome profiling strategies, we have detected a large number of novel transcripts and produced high-resolution gene expression maps that provide a basis for establishing the roles of individual genes in cellular differentiation.

Authors William C Spencer, Georg Zeller, Joseph D Watson, Stefan R Henz, Kathie L Watkins, Rebecca D McWhirter, Sarah Petersen, Vipin T Sreedharan, Christian K Widmer, Jeanyoung Jo, Valerie Reinke, Lisa Petrella, Susan Strome, Stephen E Von Stetina, Menachem Katz, Shai Shaham, Gunnar Rätsch, David M Miller

Submitted Genome research

Link Pubmed DOI

Abstract CO₂ is both a critical regulator of animal physiology and an important sensory cue for many animals for host detection, food location, and mate finding. The free-living soil nematode Caenorhabditis elegans shows CO₂ avoidance behavior, which requires a pair of ciliated sensory neurons, the BAG neurons. Using in vivo calcium imaging, we show that CO₂ specifically activates the BAG neurons and that the CO₂-sensing function of BAG neurons requires TAX-2/TAX-4 cyclic nucleotide-gated ion channels and the receptor-type guanylate cyclase GCY-9. Our results delineate a molecular pathway for CO₂ sensing and suggest that activation of a receptor-type guanylate cyclase is an evolutionarily conserved mechanism by which animals detect environmental CO₂.

Authors Elissa A Hallem, William C Spencer, Rebecca D McWhirter, Georg Zeller, Stefan R Henz, Gunnar Rätsch, David M Miller, H Robert Horvitz, Paul W Sternberg, Niels Ringstad

Submitted Proceedings of the National Academy of Sciences of the United States of America

Link Pubmed DOI

Authors Soren Sonnenburg, Gunnar Rätsch, Sebastian Henschel, Christian K Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, Vojt Franc

Submitted J. Mach. Learn. Res.

Link

Authors Jonas Behr, Regina Bohnert, Georg Zeller, Gabriele Schweikert, Lisa Hartmann, Gunnar Rätsch

Link DOI

Abstract The classic phytohormones cytokinin and auxin play essential roles in the maintenance of stem-cell systems embedded in shoot and root meristems, and exhibit complex functional interactions. Here we show that the activity of both hormones directly converges on the promoters of two A-type ARABIDOPSIS RESPONSE REGULATOR (ARR) genes, ARR7 and ARR15, which are negative regulators of cytokinin signalling and have important meristematic functions. Whereas ARR7 and ARR15 expression in the shoot apical meristem (SAM) is induced by cytokinin, auxin has a negative effect, which is, at least in part, mediated by the AUXIN RESPONSE FACTOR5/MONOPTEROS (MP) transcription factor. Our results provide a mechanistic framework for hormonal control of the apical stem-cell niche and demonstrate how root and shoot stem-cell systems differ in their response to phytohormones.

Authors Z Zhao, SU Andersen, K Ljung, K Dolezal, A Miotk, Sebastian J Schultheiss, Jan U Lohmann

Submitted Nature

Link DOI

Abstract The challenge of identifying cis-regulatory modules (CRMs) is an important milestone for the ultimate goal of understanding transcriptional regulation in eukaryotic cells. It has been approached, among others, by motif-finding algorithms that identify overrepresented motifs in regulatory sequences. These methods succeed in finding single, well-conserved motifs, but fail to identify combinations of degenerate binding sites, like the ones often found in CRMs. We have developed a method that combines the abilities of existing motif finding with the discriminative power of a machine learning technique to model the regulation of genes (Schultheiss et al. (2009) Bioinformatics 25, 2126-2133). Our software is called KIRMES, which stands for kernel-based identification of regulatory modules in eukaryotic sequences. Starting from a set of genes thought to be co-regulated, KIRMES can identify the key CRMs responsible for this behavior and can be used to determine for any other gene not included on that list if it is also regulated by the same mechanism. Such gene sets can be derived from microarrays, chromatin immunoprecipitation experiments combined with next-generation sequencing or promoter/whole genome microarrays. The use of an established machine learning method makes the approach fast to use and robust with respect to noise. Easily understood visualizations make the returned results interpretable and provide a starting point for further analysis. Even for complex regulatory relationships, KIRMES can be a helpful tool in directing the design of biological experiments.

Authors Sebastian J Schultheiss

Submitted Methods Mol Biol

Link DOI

Abstract Despite the independent evolution of multicellularity in plants and animals, the basic organization of their stem cell niches is remarkably similar. Here, we report the genome-wide regulatory potential of WUSCHEL, the key transcription factor for stem cell maintenance in the shoot apical meristem of the reference plant Arabidopsis thaliana. WUSCHEL acts by directly binding to at least two distinct DNA motifs in more than 100 target promoters and preferentially affects the expression of genes with roles in hormone signaling, metabolism, and development. Striking examples are the direct transcriptional repression of CLAVATA1, which is part of a negative feedback regulation of WUSCHEL, and the immediate regulation of transcriptional repressors of the TOPLESS family, which are involved in auxin signaling. Our results shed light on the complex transcriptional programs required for the maintenance of a dynamic and essential stem cell niche.

Authors Wolfgang Busch, A Miotk, FD Ariel, Z Zhao, J Forner, G Daum, T Suzaki, C Schuster, Sebastian J Schultheiss, A Leibfried, S Haubeiss, N Ha, R L Chan, Jan U Lohmann

Submitted Dev Cell

Link DOI

Abstract String kernels are commonly used for the classification of biological sequences, nucleotide as well as amino acid sequences. Although string kernels are already very powerful, when it comes to amino acids they have a major shortcoming. They ignore an important piece of information when comparing amino acids: the physico-chemical properties such as size, hydrophobicity, or charge. This information is very valuable, especially when training data is less abundant. There have been only very few approaches so far that aim at combining these two ideas.

Authors Nora C Toussaint, Christian K Widmer, Oliver Kohlbacher, Gunnar Rätsch

Submitted BMC bioinformatics

Link Pubmed DOI
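
As a rough illustration of the string kernels mentioned in the abstract above, the following sketch computes a plain k-mer spectrum kernel between two sequences. The function names and the choice of the spectrum kernel are assumptions made for illustration only. This is not the method proposed in the paper, which additionally takes physico-chemical amino acid properties into account.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count all overlapping substrings of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s, t, k=2):
    """Inner product of the k-mer count vectors of s and t."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(n * ct[kmer] for kmer, n in cs.items())
```

In such a kernel two k-mers either match exactly or do not match at all, so spectrum_kernel("LK", "IK") is 0 although leucine and isoleucine are both hydrophobic. This is the kind of information loss the abstract refers to.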

Abstract The lack of sufficient training data is the limiting factor for many Machine Learning applications in Computational Biology. If data is available for several different but related problem domains, Multitask Learning algorithms can be used to learn a model based on all available information. In Bioinformatics, many problems can be cast into the Multitask Learning scenario by incorporating data from several organisms. However, combining information from several tasks requires careful consideration of the degree of similarity between tasks. Our proposed method simultaneously learns or refines the similarity between tasks along with the Multitask Learning classifier. This is done by formulating the Multitask Learning problem as Multiple Kernel Learning, using the recently published q-Norm MKL algorithm.

Authors Christian K Widmer, Nora C Toussaint, Yasemin Altun, Gunnar Rätsch

Submitted BMC bioinformatics

Link Pubmed DOI

Abstract In Arabidopsis thaliana, four different dicer-like (DCL) proteins have distinct but partially overlapping functions in the biogenesis of microRNAs (miRNAs) and siRNAs from longer, noncoding precursor RNAs. To analyze the impact of different components of the small RNA biogenesis machinery on the transcriptome, we subjected dcl and other mutants impaired in small RNA biogenesis to whole-genome tiling array analysis. We compared both protein-coding genes and noncoding transcripts, including most pri-miRNAs, in two tissues and several stress conditions. Our analysis revealed a surprising number of common targets in dcl1 and dcl2 dcl3 dcl4 triple mutants. Furthermore, our results suggest that the DCL1 is not only involved in miRNA action but also contributes to silencing of a subset of transposons, apparently through an effect on DNA methylation.

Authors Sascha Laubinger, Georg Zeller, Stefan R Henz, Sabine Buechel, Timo Sachsenberg, Jia Wei Wang, Gunnar Rätsch, Detlef Weigel

Submitted Proceedings of the National Academy of Sciences of the United States of America

Link Pubmed DOI

Abstract We provide a novel web service, called rQuant.web, allowing convenient access to tools for quantitative analysis of RNA sequencing data. The underlying quantitation technique rQuant is based on quadratic programming and estimates different biases induced by library preparation, sequencing and read mapping. It can tackle multiple transcripts per gene locus and is therefore particularly well suited to quantify alternative transcripts. rQuant.web is available as a tool in a Galaxy installation at http://galaxy.fml.mpg.de. Using rQuant.web is free of charge, it is open to all users, and there is no login requirement.

Authors Regina Bohnert, Gunnar Rätsch

Submitted Nucleic acids research

Link Pubmed DOI

Abstract We systematically generated large-scale data sets to improve genome annotation for the nematode Caenorhabditis elegans, a key model organism. These data sets include transcriptome profiling across a developmental time course, genome-wide identification of transcription factor-binding sites, and maps of chromatin organization. From this, we created more complete and accurate gene models, including alternative splice forms and candidate noncoding RNAs. We constructed hierarchical networks of transcription factor-binding and microRNA interactions and discovered chromosomal locations bound by an unusually large number of transcription factors. Different patterns of chromatin composition and histone modification were revealed between chromosome arms and centers, with similarly prominent differences between autosomes and the X chromosome. Integrating data types, we built statistical models relating chromatin, transcription factor binding, and gene expression. Overall, our analyses ascribed putative functions to most of the conserved genome.

Authors Mark B Gerstein, Zhi John Lu, Eric L Van Nostrand, Chao Cheng, Bradley I Arshinoff, Tao Liu, Kevin Y Yip, Rebecca Robilotto, Andreas Rechtsteiner, Kohta Ikegami, Pedro Alves, Aurelien Chateigner, Marc Perry, Mitzi Morris, Raymond K Auerbach, Xin Feng, Jing Leng, Anne Vielle, Wei Niu, Kahn Rhrissorrakrai, Ashish Agarwal, Roger P Alexander, Galt Barber, Cathleen M Brdlik, Jennifer Brennan, Jeremy Jean Brouillet, Adrian Carr, Ming Sin Cheung, Hiram Clawson, Sergio Contrino, Luke O Dannenberg, Abby F Dernburg, Arshad Desai, Lindsay Dick, Andrea C Dose, Jiang Du, Thea Egelhofer, Sevinc Ercan, Ghia Euskirchen, Brent Ewing, Elise A Feingold, Reto Gassmann, Peter J Good, Phil Green, Francois Gullier, Michelle Gutwein, Mark S Guyer, Lukas Habegger, Ting Han, Jorja G Henikoff, Stefan R Henz, Angie Hinrichs, Heather Holster, Tony Hyman, A Leo Iniguez, Judith Janette, Morten Jensen, Masaomi Kato, W James Kent, Ellen Kephart, Vishal Khivansara, Ekta Khurana, John K Kim, Paulina Kolasinska Zwierz, Eric C Lai, Isabel Latorre, Amber Leahey, Suzanna Lewis, Paul Lloyd, Lucas Lochovsky, Rebecca F Lowdon, Yaniv Lubling, Rachel Lyne, Michael MacCoss, Sebastian D Mackowiak, Marco Mangone, Sheldon McKay, Desirea Mecenas, Gennifer Merrihew, David M Miller, Andrew Muroyama, John I Murray, Siew Loon Ooi, Hoang Pham, Taryn Phippen, Elicia A Preston, Nikolaus Rajewsky, Gunnar Rätsch, Heidi Rosenbaum, Joel Rozowsky, Kim Rutherford, Peter Ruzanov, Mihail Sarov, Rajkumar Sasidharan, Andrea Sboner, Paul Scheid, Eran Segal, Hyunjin Shin, Chong Shou, Frank J Slack, Cindie Slightam, Richard Smith, William C Spencer, E O Stinson, Scott Taing, Teruaki Takasaki, Dionne Vafeados, Ksenia Voronina, Guilin Wang, Nicole L Washington, Christina M Whittle, Beijing Wu, Koon Kiu Yan, Georg Zeller, Zheng Zha, Mei Zhong, Xingliang Zhou, Julie Ahringer, Susan Strome, Kristin C Gunsalus, Gos Micklem, X Shirley Liu, Valerie Reinke, Sang Tae Kim, LaDeana W Hillier, Steven Henikoff, Fabio Piano, Michael Snyder, Lincoln Stein, Jason D Lieb, Robert H Waterston

Submitted Science (New York, N.Y.)

Link Pubmed DOI

Abstract Next-generation sequencing technologies have revolutionized genome and transcriptome sequencing. RNA-Seq experiments are able to generate huge amounts of transcriptome sequence reads at a fraction of the cost of Sanger sequencing. Reads produced by these technologies are relatively short and error prone. To utilize such reads for transcriptome reconstruction and gene-structure identification, one needs to be able to accurately align the sequence reads over intron boundaries. In this unit, we describe PALMapper, a fast and easy-to-use tool that is designed to accurately compute both unspliced and spliced alignments for millions of RNA-Seq reads. It combines the efficient read mapper GenomeMapper with the spliced aligner QPALMA, which exploits read-quality information and predictions of splice sites to improve the alignment accuracy. The PALMapper package is available as a command-line tool running on Unix or Mac OS X systems or through a Web interface based on Galaxy tools.

Authors Geraldine Jean, Andre Kahles, Vipin T Sreedharan, Fabio de Bona, Gunnar Rätsch

Submitted Current protocols in bioinformatics / editorial board, Andreas D. Baxevanis ... [et al.]

Link Pubmed DOI

Abstract In Arabidopsis thaliana, gene expression level polymorphisms (ELPs) between natural accessions that exhibit simple, single locus inheritance are promising quantitative trait locus (QTL) candidates to explain phenotypic variability. It is assumed that such ELPs overwhelmingly represent regulatory element polymorphisms. However, comprehensive genome-wide analyses linking expression level, regulatory sequence and gene structure variation are missing, preventing definite verification of this assumption. Here, we analyzed ELPs observed between the Eil-0 and Lc-0 accessions. Compared with non-variable controls, 5' regulatory sequence variation in the corresponding genes is indeed increased. However, approximately 42% of all the ELP genes also carry major transcription unit deletions in one parent as revealed by genome tiling arrays, representing a >4-fold enrichment over controls. Within the subset of ELPs with simple inheritance, this proportion is even higher and deletions are generally more severe. Similar results were obtained from analyses of the Bay-0 and Sha accessions, using alternative technical approaches. Collectively, our results suggest that drastic structural changes are a major cause for ELPs with simple inheritance, corroborating experimentally observed indel preponderance in cloned Arabidopsis QTL.

Authors S Plantegenet, J Weber, DR Goldstein, Georg Zeller, C Nussbaumer, J Thomas, Detlef Weigel, K Harshman, CS Hardtke

Submitted Mol Syst Biol

Link DOI

Abstract Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability. Unfortunately, l1-norm MKL is hardly observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures, we generalize MKL to arbitrary lp-norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary p > 1. Empirically, we demonstrate that the interleaved optimization strategies are much faster compared to the traditionally used wrapper approaches. Finally, we apply lp-norm MKL to real-world problems from computational biology, showing that non-sparse MKL achieves accuracies that go beyond the state-of-the-art.

Authors Marius Kloft, U Brefeld, Soren Sonnenburg, P Laskow, Klaus Robert Müller, Alexander Zien

Link
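
The idea of a linear kernel combination under an lp-norm constraint, as described in the abstract above, can be sketched as follows. The helper names lp_normalize and combine_kernels are hypothetical, and the weights are simply given here. In the method itself the weights are learned together with the classifier, which this sketch does not attempt.

```python
def lp_normalize(weights, p):
    """Rescale non-negative kernel weights so that their lp-norm equals 1."""
    norm = sum(w ** p for w in weights) ** (1.0 / p)
    return [w / norm for w in weights]

def combine_kernels(kernels, weights):
    """Weighted sum of equally sized kernel matrices given as lists of lists."""
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels))
             for j in range(n)]
            for i in range(n)]

# Example: two 2x2 kernel matrices mixed with weights of unit l2-norm.
identity = [[1.0, 0.0], [0.0, 1.0]]
ones = [[1.0, 1.0], [1.0, 1.0]]
theta = lp_normalize([3.0, 4.0], p=2)
mixed = combine_kernels([identity, ones], theta)
```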

Abstract Modern systems biology aims at understanding how the different molecular components of a biological cell interact. Often, cellular functions are performed by complexes consisting of many different proteins. The composition of these complexes may change according to the cellular environment, and one protein may be involved in several different processes. The automatic discovery of functional complexes from protein interaction data is challenging. While previous approaches use approximations to extract dense modules, our approach exactly solves the problem of dense module enumeration. Furthermore, constraints from additional information sources such as gene expression and phenotype data can be integrated, so we can systematically mine for dense modules with interesting profiles.

Authors E Georgii, S Dietmann, T Uno, P Pagel, Koji Tsuda

Submitted Bioinformatics

Link DOI

Abstract SUMMARY: The DICS database is a dynamic web repository of computationally predicted functional modules from the human protein-protein interaction network. It provides references to the CORUM, DrugBank, KEGG and Reactome pathway databases. DICS can be accessed for retrieving sets of overlapping modules and protein complexes that are significantly enriched in a gene list, thereby providing valuable information about the functional context. AVAILABILITY: Supplementary information on datasets and methods is available on the web server http://mips.gsf.de/proj/dics.

Authors S Dietmann, E Georgii, A Antonov, Koji Tsuda, HW Mewes

Submitted Bioinformatics

Link DOI

Abstract The Affymetrix ATH1 array provides a robust standard tool for transcriptome analysis, but unfortunately does not represent all of the transcribed genes in Arabidopsis thaliana. Recently, Affymetrix has introduced its Arabidopsis Tiling 1.0R array, which offers whole-genome coverage of the sequenced Col-0 reference strain. Here, we present an approach to exploit this platform for quantitative mRNA expression analysis, and compare the results with those obtained using ATH1 arrays. We also propose a method for selecting unique tiling probes for each annotated gene or transcript in the most current genome annotation, TAIR7, generating Chip Definition Files for the Tiling 1.0R array. As a test case, we compared the transcriptome of wild-type plants with that of transgenic plants overproducing the heterodimeric E2Fa-DPa transcription factor. We show that with the appropriate data pre-processing, the estimated changes per gene for those with significantly different expression levels is very similar for the two array types. With the tiling arrays we could identify 368 new E2F-regulated genes, with a large fraction including an E2F motif in the promoter. The latter groups increase the number of excellent candidates for new, direct E2F targets by almost twofold, from 181 to 334.

Authors Naira Naouar, Klaas Vandepoele, Tim Lammens, Tineke Casneuf, Georg Zeller, Paul van Hummelen, Detlef Weigel, Gunnar Rätsch, Dirk Inze, Martin Kuiper, Lieven De Veylder, Marnik Vuylsteke

Submitted The Plant journal : for cell and molecular biology

Link Pubmed DOI

Abstract Rice, the primary source of dietary calories for half of humanity, is the first crop plant for which a high-quality reference genome sequence from a single variety was produced. We used resequencing microarrays to interrogate 100 Mb of the unique fraction of the reference genome for 20 diverse varieties and landraces that capture the impressive genotypic and phenotypic diversity of domesticated rice. Here, we report the distribution of 160,000 nonredundant SNPs. Introgression patterns of shared SNPs revealed the breeding history and relationships among the 20 varieties; some introgressed regions are associated with agronomic traits that mark major milestones in rice improvement. These comprehensive SNP data provide a foundation for deep exploration of rice diversity and gene-trait relationships and their use for future rice improvement.

Authors Kenneth L McNally, Kevin L Childs, Regina Bohnert, Rebecca M Davidson, Keyan Zhao, Victor J Ulat, Georg Zeller, Richard M Clark, Douglas R Hoen, Thomas E Bureau, Renee Stokowski, Dennis G Ballinger, Kelly A Frazer, David R Cox, Badri Padhukasahasram, Carlos D Bustamante, Detlef Weigel, David J Mackill, Richard M Bruskiewich, Gunnar Rätsch, C Robin Buell, Hei Leung, Jan E Leach

Submitted Proceedings of the National Academy of Sciences of the United States of America

Link Pubmed DOI

Abstract We present a highly accurate gene-prediction system for eukaryotic genomes, called mGene. It combines in an unprecedented manner the flexibility of generalized hidden Markov models (gHMMs) with the predictive power of modern machine learning methods, such as Support Vector Machines (SVMs). Its excellent performance was proved in an objective competition based on the genome of the nematode Caenorhabditis elegans. Considering the average of sensitivity and specificity, the developmental version of mGene exhibited the best prediction performance on nucleotide, exon, and transcript level for ab initio and multiple-genome gene-prediction tasks. The fully developed version shows superior performance in 10 out of 12 evaluation criteria compared with the other participating gene finders, including Fgenesh++ and Augustus. An in-depth analysis of mGene's genome-wide predictions revealed that approximately 2200 predicted genes were not contained in the current genome annotation. Testing a subset of 57 of these genes by RT-PCR and sequencing, we confirmed expression for 24 (42%) of them. mGene missed 300 annotated genes, out of which 205 were unconfirmed. RT-PCR testing of 24 of these genes resulted in a success rate of merely 8%. These findings suggest that even the gene catalog of a well-studied organism such as C. elegans can be substantially improved by mGene's predictions. We also provide gene predictions for the four nematodes C. briggsae, C. brenneri, C. japonica, and C. remanei. Comparing the resulting proteomes among these organisms and to the known protein universe, we identified many species-specific gene inventions. In a quality assessment of several available annotations for these genomes, we find that mGene's predictions are most accurate.

Authors Gabriele Schweikert, Alexander Zien, Georg Zeller, Jonas Behr, Christoph Dieterich, Cheng Soon Ong, Petra Philips, Fabio de Bona, Lisa Hartmann, Anja Bohlen, Nina Kruger, Soren Sonnenburg, Gunnar Rätsch

Submitted Genome research

Link Pubmed DOI

Abstract We describe mGene.web, a web service for the genome-wide prediction of protein coding genes from eukaryotic DNA sequences. It offers pre-trained models for the recognition of gene structures including untranslated regions in an increasing number of organisms. With mGene.web, users have the additional possibility to train the system with their own data for other organisms at the push of a button, a functionality that will greatly accelerate the annotation of newly sequenced genomes. The system is built in a highly modular way, such that individual components of the framework, like the promoter prediction tool or the splice site predictor, can be used autonomously. The underlying gene finding system mGene is based on discriminative machine learning techniques and its high accuracy has been demonstrated in an international competition on nematode genomes. mGene.web is available at http://www.mgene.org/web. It is free of charge and can be used for eukaryotic genomes of small to moderate size (several hundred Mbp).

Authors Gabriele Schweikert, Jonas Behr, Alexander Zien, Georg Zeller, Cheng Soon Ong, Soren Sonnenburg, Gunnar Rätsch

Submitted Nucleic acids research

Link Pubmed DOI

Abstract We shed light on the discrimination between patterns belonging to two different classes by casting this decoding problem into a generalized prototype framework. The discrimination process is then separated into two stages: a projection stage that reduces the dimensionality of the data by projecting it on a line and a threshold stage where the distributions of the projected patterns of both classes are separated. For this, we extend the popular mean-of-class prototype classification using algorithms from machine learning that satisfy a set of invariance properties. We report a simple yet general approach to express different types of linear classification algorithms in an identical and easy-to-visualize formal framework using generalized prototypes where these prototypes are used to express the normal vector and offset of the hyperplane. We investigate non-margin classifiers such as the classical prototype classifier, the Fisher classifier, and the relevance vector machine. We then study hard and soft margin classifiers such as the support vector machine and a boosted version of the prototype classifier. Subsequently, we relate mean-of-class prototype classification to other classification algorithms by showing that the prototype classifier is a limit of any soft margin classifier and that boosting a prototype classifier yields the support vector machine. While giving novel insights into classification per se by presenting a common and unified formalism, our generalized prototype framework also provides an efficient visualization and a principled comparison of machine learning classification.

Authors Arnulf B A Graf, Olivier Bousquet, Gunnar Rätsch, Bernhard Schölkopf

Submitted Neural computation

Link Pubmed DOI
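
The two stages named in the abstract above, projection onto a line followed by a threshold, can be illustrated with the classical mean-of-class prototype classifier. This is a minimal sketch with hypothetical function names. It assumes the hyperplane is placed halfway between the two class means and does not cover the generalized prototypes studied in the paper.

```python
def mean(vectors):
    """Component-wise mean of a list of equally long vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def fit_prototype_classifier(pos, neg):
    """Return normal vector w and offset b of the hyperplane w.x + b = 0."""
    mu_pos, mu_neg = mean(pos), mean(neg)
    # The normal vector points from the negative to the positive prototype.
    w = [a - c for a, c in zip(mu_pos, mu_neg)]
    # Assumption: the hyperplane passes through the midpoint of the prototypes.
    mid = [(a + c) / 2 for a, c in zip(mu_pos, mu_neg)]
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, b

def predict(w, b, x):
    """Projection stage (dot product with w), then threshold stage."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w, b = fit_prototype_classifier([[2, 2], [4, 4]], [[0, 0], [-2, -2]])
```

With these toy points the class means are (3, 3) and (-1, -1), so w is (4, 4) and b is -8.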

Abstract Understanding transcriptional regulation is one of the main challenges in computational biology. An important problem is the identification of transcription factor (TF) binding sites in promoter regions of potential TF target genes. It is typically approached by position weight matrix-based motif identification algorithms using Gibbs sampling, or heuristics to extend seed oligos. Such algorithms succeed in identifying single, relatively well-conserved binding sites, but tend to fail when it comes to the identification of combinations of several degenerate binding sites, as those often found in cis-regulatory modules.

Authors Sebastian J Schultheiss, Wolfgang Busch, Jan U Lohmann, Oliver Kohlbacher, Gunnar Rätsch

Submitted Bioinformatics (Oxford, England)

Link Pubmed DOI

Abstract The responses of plants to abiotic stresses are accompanied by massive changes in transcriptome composition. To provide a comprehensive view of stress-induced changes in the Arabidopsis thaliana transcriptome, we have used whole-genome tiling arrays to analyze the effects of salt, osmotic, cold and heat stress as well as application of the hormone abscisic acid (ABA), an important mediator of stress responses. Among annotated genes in the reference strain Columbia we have found many stress-responsive genes, including several transcription factor genes as well as pseudogenes and transposons that have been missed in previous analyses with standard expression arrays. In addition, we report hundreds of newly identified, stress-induced transcribed regions. These often overlap with known, annotated genes. The results are accessible through the Arabidopsis thaliana Tiling Array Express (At-TAX) homepage, which provides convenient tools for displaying expression values of annotated genes, as well as visualization of unannotated transcribed regions along each chromosome.

Authors Georg Zeller, Stefan R Henz, Christian K Widmer, Timo Sachsenberg, Gunnar Rätsch, Detlef Weigel, Sascha Laubinger

Submitted The Plant journal : for cell and molecular biology

Link Pubmed DOI

Abstract Novel high-throughput sequencing technologies open exciting new approaches to transcriptome profiling. Sequencing transcript populations of interest, e.g. from different tissues or variable stress conditions, with RNA sequencing (RNA-Seq) [1] generates millions of short reads. Accurately aligned to a reference genome, they provide digital counts and thus facilitate transcript quantification. As the observed read counts only provide the summation of all expressed sequences at one locus, the inference of the underlying transcript abundances is crucial for further quantitative analyses.

Authors Regina Bohnert, Jonas Behr, Gunnar Rätsch

Submitted BMC Bioinformatics

Link DOI

Abstract Next generation sequencing technologies open exciting new possibilities for genome and transcriptome sequencing. While reads produced by these technologies are relatively short and error prone compared to the Sanger method, their throughput is several orders of magnitude higher. To utilize such reads for transcriptome sequencing and gene structure identification, one needs to be able to accurately align the sequence reads over intron boundaries. This represents a significant challenge given their short length and inherent high error rate. RESULTS: We present a novel approach, called QPALMA, for computing accurate spliced alignments which takes advantage of the read's quality information as well as computational splice site predictions. Our method uses a training set of spliced reads with quality information and known alignments. It uses a large margin approach similar to support vector machines to estimate its parameters to maximize alignment accuracy. In computational experiments, we illustrate that the quality information as well as the splice site predictions help to improve the alignment quality. Finally, to facilitate mapping of massive amounts of sequencing data typically generated by the new technologies, we have combined our method with a fast mapping pipeline based on enhanced suffix arrays. Our algorithms were optimized and tested using reads produced with the Illumina Genome Analyzer for the model plant Arabidopsis thaliana. AVAILABILITY: Datasets for training and evaluation, additional results and a stand-alone alignment tool implemented in C++ and Python are available at http://www.fml.mpg.de/raetsch/projects/qpalma.

Authors Fabio de Bona, Stephan Ossowski, Korbinian Schneeberger, Gunnar Rätsch

Submitted Bioinformatics

Link Pubmed DOI

Abstract For the analysis of transcriptional tiling arrays we have developed two methods based on state-of-the-art machine learning algorithms. First, we present a novel transcript normalization technique to alleviate the effect of oligonucleotide probe sequences on hybridization intensity. It is specifically designed to decrease the variability observed for individual probes complementary to the same transcript. Applying this normalization technique to Arabidopsis tiling arrays, we are able to reduce sequence biases and also significantly improve separation in signal intensity between exonic and intronic/intergenic probes. Our second contribution is a method for transcript mapping. It extends an algorithm proposed for yeast tiling arrays to the more challenging task of spliced transcript identification. When evaluated on raw versus normalized intensities our method achieves highest prediction accuracy when segmentation is performed on transcript-normalized tiling array data.

Authors Georg Zeller, Stefan R Henz, Sascha Laubinger, Detlef Weigel, Gunnar Rätsch

Submitted Pacific Symposium on Biocomputing

Link Pubmed

Authors Asa Ben Hur, Cheng Soon Ong, Soren Sonnenburg, Bernhard Schölkopf, Gunnar Rätsch

Submitted PLoS computational biology

Link Pubmed DOI

Abstract Next generation sequencing technologies open exciting new possibilities for genome and transcriptome sequencing. While reads produced by these technologies are relatively short and error prone compared to the Sanger method, their throughput is several orders of magnitude higher. To utilize such reads for transcriptome sequencing and gene structure identification, one needs to be able to accurately align the sequence reads over intron boundaries. This represents a significant challenge given their short length and inherent high error rate.

Authors Fabio de Bona, Stephan Ossowski, Korbinian Schneeberger, Gunnar Rätsch

Submitted Bioinformatics (Oxford, England)

Link Pubmed DOI

Abstract Gene expression maps for model organisms, including Arabidopsis thaliana, have typically been created using gene-centric expression arrays. Here, we describe a comprehensive expression atlas, Arabidopsis thaliana Tiling Array Express (At-TAX), which is based on whole-genome tiling arrays. We demonstrate that tiling arrays are accurate tools for gene expression analysis and identified more than 1,000 unannotated transcribed regions. Visualizations of gene expression estimates, transcribed regions, and tiling probe measurements are accessible online at the At-TAX homepage.

Authors Sascha Laubinger, Georg Zeller, Stefan R Henz, Timo Sachsenberg, Christian K Widmer, Naira Naouar, Marnik Vuylsteke, Bernhard Schölkopf, Gunnar Rätsch, Detlef Weigel

Submitted Genome biology

Link Pubmed DOI

Abstract At the heart of many important bioinformatics problems, such as gene finding and function prediction, is the classification of biological sequences. Frequently the most accurate classifiers are obtained by training support vector machines (SVMs) with complex sequence kernels. However, a cumbersome shortcoming of SVMs is that their learned decision rules are very hard to understand for humans and cannot easily be related to biological facts.

Authors Soren Sonnenburg, Alexander Zien, Petra Philips, Gunnar Rätsch

Submitted Bioinformatics (Oxford, England)

Link Pubmed DOI

Abstract The processing of Arabidopsis thaliana microRNAs (miRNAs) from longer primary transcripts (pri-miRNAs) requires the activity of several proteins, including DICER-LIKE1 (DCL1), the double-stranded RNA-binding protein HYPONASTIC LEAVES1 (HYL1), and the zinc finger protein SERRATE (SE). It has been noted before that the morphological appearance of weak se mutants is reminiscent of plants with mutations in ABH1/CBP80 and CBP20, which encode the two subunits of the nuclear cap-binding complex. We report that, like SE, the cap-binding complex is necessary for proper processing of pri-miRNAs. Inactivation of either ABH1/CBP80 or CBP20 results in decreased levels of mature miRNAs accompanied by apparent stabilization of pri-miRNAs. Whole-genome tiling array analyses reveal that se, abh1/cbp80, and cbp20 mutants also share similar splicing defects, leading to the accumulation of many partially spliced transcripts. This is unlikely to be an indirect consequence of improper miRNA processing or other mRNA turnover pathways, because introns retained in se, abh1/cbp80, and cbp20 mutants are not affected by mutations in other genes required for miRNA processing or for nonsense-mediated mRNA decay. Taken together, our results uncover dual roles in splicing and miRNA processing that distinguish SE and the cap-binding complex from specialized miRNA processing factors such as DCL1 and HYL1.

Authors Sascha Laubinger, Timo Sachsenberg, Georg Zeller, Wolfgang Busch, Jan U Lohmann, Gunnar Rätsch, Detlef Weigel

Submitted Proceedings of the National Academy of Sciences of the United States of America

Link Pubmed DOI

Abstract Whole-genome oligonucleotide resequencing arrays have allowed the comprehensive discovery of single nucleotide polymorphisms (SNPs) in eukaryotic genomes of moderate to large size. With this technology, the detection rate for isolated SNPs is typically high. However, it is greatly reduced when other polymorphisms are located near a SNP as multiple mismatches inhibit hybridization to arrayed oligonucleotides. Contiguous tracts of suppressed hybridization therefore typify polymorphic regions (PRs) such as clusters of SNPs or deletions. We developed a machine learning method, designated margin-based prediction of polymorphic regions (mPPR), to predict PRs from resequencing array data. Conceptually similar to hidden Markov models, the method is trained with discriminative learning techniques related to support vector machines, and accurately identifies even very short polymorphic tracts (<10 bp). We applied this method to resequencing array data previously generated for the euchromatic genomes of 20 strains (accessions) of the best-characterized plant, Arabidopsis thaliana. Nonredundantly, 27% of the genome was included within the boundaries of PRs predicted at high specificity ( approximately 97%). The resulting data set provides a fine-scale view of polymorphic sequences in A. thaliana; patterns of polymorphism not apparent in SNP data were readily detected, especially for noncoding regions. Our predictions provide a valuable resource for evolutionary genetic and functional studies in A. thaliana, and our method is applicable to similar data sets in other species. More broadly, our computational approach can be applied to other segmentation tasks related to the analysis of genomic variation.

Authors Georg Zeller, Richard M Clark, Korbinian Schneeberger, Anja Bohlen, Detlef Weigel, Gunnar Rätsch

Submitted Genome research

Link Pubmed DOI

Abstract The major breakthrough at the turn of the millennium was the completion of genome sequences for individuals from many species, including human, worm and rice. More recently, it has also been important to describe sequence variation within one species, providing the first step towards the linkage of genetic variation to traits. Today, rice is the most important source for human caloric intake, making up 20% of the calorie supply and feeding millions of people daily. The more detailed understanding and findings on the molecular assembly of phenotypic rice varieties will therefore be essential for future improvement in rice cultivation and breeding. In order to reveal patterns of sequence variation in Oryza sativa (rice), the non-repetitive portion of the genomes of 20 diverse rice cultivars was resequenced, in collaboration with Perlegen Sciences, Inc., using a high-density oligonucleotide microarray technology.

Authors Regina Bohnert, Georg Zeller, Richard M Clark, Kevin L Childs, Victor J Ulat, Renee Stokowski, Dennis G Ballinger, Kelly A Frazer, David R Cox, Richard M Bruskiewich, C Robin Buell, Jan E Leach, Hei Leung, Kenneth L McNally, Detlef Weigel, Gunnar Rätsch

Submitted BMC Bioinformatics

Link DOI

Authors Soren Sonnenburg, Mikio L Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou, Geoffrey Holmes, Yann LeCun, Klaus Robert Müller, Fernando Pereira, Carl E Rasmussen, Gunnar Rätsch

Link

Abstract The cDNA array technology is a powerful tool to analyze a high number of genes in parallel. We investigated whether large-scale gene expression analysis allows clustering and identification of cellular phenotypes of chondrocytes in different in vivo and in vitro conditions. In 100% of cases, clustering analysis distinguished between in vivo and in vitro samples, suggesting fundamental differences in chondrocytes in situ and in vitro regardless of the culture conditions or disease status. It also allowed us to differentiate between healthy and osteoarthritic cartilage. The clustering also revealed the relative importance of the investigated culturing conditions (stimulation agent, stimulation time, bead/monolayer). We augmented the cluster analysis with a statistical search for genes showing differential expression. The identified genes provided hints to the molecular basis of the differences between the sample classes. Our approach shows the power of modern bioinformatic algorithms for understanding and classifying chondrocytic phenotypes in vivo and in vitro. Although it does not generate new experimental data per se, it provides valuable information regarding the biology of chondrocytes and may provide tools for diagnosing and staging the osteoarthritic disease process.

Authors Alexander Zien, PM Gebhard, K Fundel, T Aigner

Submitted Clin Orthop Relat Res

Link DOI

Abstract The support vector machine (SVM) has been spotlighted in the machine learning community because of its theoretical soundness and practical performance. When applied to a large data set, however, it requires a large memory and a long time for training. To cope with the practical difficulty, we propose a pattern selection algorithm based on neighborhood properties. The idea is to select only the patterns that are likely to be located near the decision boundary. Those patterns are expected to be more informative than the randomly selected patterns. The experimental results provide promising evidence that it is possible to successfully employ the proposed algorithm ahead of SVM training.

Authors Hyunjin Shin, S Cho

Submitted Neural Computation

Link

Abstract The genomes of individuals from the same species vary in sequence as a result of different evolutionary processes. To examine the patterns of, and the forces shaping, sequence variation in Arabidopsis thaliana, we performed high-density array resequencing of 20 diverse strains (accessions). More than 1 million nonredundant single-nucleotide polymorphisms (SNPs) were identified at moderate false discovery rates (FDRs), and approximately 4% of the genome was identified as being highly dissimilar or deleted relative to the reference genome sequence. Patterns of polymorphism are highly nonrandom among gene families, with genes mediating interaction with the biotic environment having exceptional polymorphism levels. At the chromosomal scale, regional variation in polymorphism was readily apparent. A scan for recent selective sweeps revealed several candidate regions, including a notable example in which almost all variation was removed in a 500-kilobase window. Analyzing the polymorphisms we describe in larger sets of accessions will enable a detailed understanding of forces shaping population-wide sequence variation in A. thaliana.

Authors Richard M Clark, Gabriele Schweikert, Christopher Toomajian, Stephan Ossowski, Georg Zeller, Paul Shinn, Norman Warthmann, Tina T Hu, Glenn Fu, David A Hinds, Huaming Chen, Kelly A Frazer, Daniel H Huson, Bernhard Schölkopf, Magnus Nordborg, Gunnar Rätsch, Joseph R Ecker, Detlef Weigel

Submitted Science (New York, N.Y.)

Link Pubmed DOI

Abstract Despite many years of research on how to properly align sequences in the presence of sequencing errors, alternative splicing and micro-exons, the correct alignment of mRNA sequences to genomic DNA is still a challenging task.

Authors Uta Schulze, Bettina Hepp, Cheng Soon Ong, Gunnar Rätsch

Submitted Bioinformatics (Oxford, England)

Link Pubmed DOI

Abstract For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%-13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.

Authors Gunnar Rätsch, Soren Sonnenburg, Jagan Srinivasan, Hanh Witte, Klaus Robert Müller, Ralf J Sommer, Bernhard Schölkopf

Submitted PLoS computational biology

Link Pubmed DOI

Abstract Since prilocaine is being increasingly used for day case surgery as a short acting local anaesthetic for spinal anaesthesia and because of its low risk for transient neurological symptoms, we compared it to bupivacaine.

Authors Gunnar Rätsch, H Niebergall, L Hauenstein, A Reber

Submitted Der Anaesthesist

Link Pubmed DOI

Abstract For splice site recognition, one has to solve two classification problems: discriminating true from decoy splice sites for both acceptor and donor sites. Gene finding systems typically rely on Markov Chains to solve these tasks.

Authors Soren Sonnenburg, Gabriele Schweikert, Petra Philips, Jonas Behr, Gunnar Rätsch

Submitted BMC bioinformatics

Link Pubmed DOI

Authors Soren Sonnenburg, Gunnar Rätsch, Bernhard Schölkopf

Submitted Journal of Machine Learning Research

Link

Abstract Despite many research efforts in recent decades, the major pathogenetic mechanisms of osteoarthritis (OA), including gene alterations occurring during OA cartilage degeneration, are poorly understood, and there is no disease-modifying treatment approach. The present study was therefore initiated in order to identify differentially expressed disease-related genes and potential therapeutic targets.

Authors T Aigner, K Fundel, J Saas, PM Gebhard, J Haag, T Weiss, Alexander Zien, F Obermayr, R Zimmer, E Bartnik

Submitted Arthritis Rheum

Link DOI

Abstract We develop new methods for finding transcription start sites (TSS) of RNA Polymerase II binding genes in genomic DNA sequences. Employing Support Vector Machines with advanced sequence kernels, we achieve drastically higher prediction accuracies than state-of-the-art methods.

Authors Soren Sonnenburg, Alexander Zien, Gunnar Rätsch

Submitted Bioinformatics (Oxford, England)

Link Pubmed DOI

Abstract Support Vector Machines (SVMs)--using a variety of string kernels--have been successfully applied to biological sequence classification problems. While SVMs achieve high classification accuracy they lack interpretability. In many applications, it does not suffice that an algorithm just detects a biological signal in the sequence, but it should also provide means to interpret its solution in order to gain biological insight.

Authors Gunnar Rätsch, Soren Sonnenburg, Christin Schafer

Submitted BMC bioinformatics

Link Pubmed DOI

Authors Gunnar Rätsch, Manfred K Warmuth

Submitted Journal of Machine Learning Research

Link

Abstract We tackle the problem of finding regularities in microarray data. Various data mining tools, such as clustering, classification, Bayesian networks and association rules, have been applied so far to gain insight into gene-expression data. Association rule mining techniques used so far work on discretizations of the data and cannot account for cumulative effects. In this paper, we investigate the use of quantitative association rules that can operate directly on numeric data and represent cumulative effects of variables. Technically speaking, this type of quantitative association rules based on half-spaces can find non-axis-parallel regularities.

Authors E Georgii, L Richter, U Rückert, S Kramer

Submitted Bioinformatics

Link DOI

Abstract Computational approaches to protein function prediction infer protein function by finding proteins with similar sequence, structure, surface clefts, chemical properties, amino acid motifs, interaction partners or phylogenetic profiles. We present a new approach that combines sequential, structural and chemical information into one graph model of proteins. We predict functional class membership of enzymes and non-enzymes using graph kernels and support vector machine classification on these protein graphs.

Authors Karsten Borgwardt, Cheng Soon Ong, S Schönauer, SVN Vishwanathan, A J Smola, HP Kriegel

Submitted Bioinformatics

Link DOI

Abstract One way of image denoising is to project a noisy image to the subspace of admissible images derived, for instance, by PCA. However, a major drawback of this method is that all pixels are updated by the projection, even when only a few pixels are corrupted by noise or occlusion. We propose a new method to identify the noisy pixels by l1-norm penalization and to update the identified pixels only. The identification and updating of noisy pixels are formulated as one linear program which can be efficiently solved. In particular, one can apply the upsilon trick to directly specify the fraction of pixels to be reconstructed. Moreover, we extend the linear program to be able to exploit prior knowledge that occlusions often appear in contiguous blocks (e.g., sunglasses on faces). The basic idea is to penalize boundary points and interior points of the occluded area differently. We are also able to show the upsilon property for this extended LP leading to a method which is easy to use. Experimental results demonstrate the power of our approach.

Authors Koji Tsuda, Gunnar Rätsch

Submitted IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

Link Pubmed

Abstract Eukaryotic pre-mRNAs are spliced to form mature mRNA. Pre-mRNA alternative splicing greatly increases the complexity of gene expression. Estimates show that more than half of the human genes and at least one-third of the genes of less complex organisms, such as nematodes or flies, are alternatively spliced. In this work, we consider one major form of alternative splicing, namely the exclusion of exons from the transcript. It has been shown that alternatively spliced exons have certain properties that distinguish them from constitutively spliced exons. Although most recent computational studies on alternative splicing apply only to exons which are conserved among two species, our method only uses information that is available to the splicing machinery, i.e. the DNA sequence itself. We employ advanced machine learning techniques in order to answer the following two questions: (1) Is a certain exon alternatively spliced? (2) How can we identify yet unidentified exons within known introns?

Authors Gunnar Rätsch, Soren Sonnenburg, Bernhard Schölkopf

Submitted Bioinformatics (Oxford, England)

Link Pubmed DOI

Abstract In this article we report about a successful application of modern machine learning technology, namely Support Vector Machines, to the problem of assessing the 'drug-likeness' of a chemical from a given set of descriptors of the substance. We were able to drastically improve the recent result by Byvatov et al. (2003) on this task and achieved an error rate of about 7% on unseen compounds using Support Vector Machines. We see a very high potential of such machine learning techniques for a variety of computational chemistry problems that occur in the drug discovery and drug design process.

Authors Klaus Robert Müller, Gunnar Rätsch, Soren Sonnenburg, Sebastian Mika, Michael Grimm, Nikolaus Heinrich

Submitted Journal of chemical information and modeling

Link Pubmed DOI

Authors S Knabe, Sebastian Mika, Klaus Robert Müller, Gunnar Rätsch, W Schruff

Submitted Die Wirtschaftsprüfung

Link

Abstract We investigate the following data mining problem from computer-aided drug design: From a large collection of compounds, find those that bind to a target molecule in as few iterations of biochemical testing as possible. In each iteration a comparatively small batch of compounds is screened for binding activity toward this target. We employed the so-called "active learning paradigm" from Machine Learning for selecting the successive batches. Our main selection strategy is based on the maximum margin hyperplane generated by "Support Vector Machines". This hyperplane separates the current set of active from the inactive compounds and has the largest possible distance from any labeled compound. We perform a thorough comparative study of various other selection strategies on data sets provided by DuPont Pharmaceuticals and show that the strategies based on the maximum margin hyperplane clearly outperform the simpler ones.

Authors Manfred K Warmuth, Jun Liao, Gunnar Rätsch, Michael Mathieson, Santosh Putta, Christian Lemmen

Submitted Journal of chemical information and computer sciences

Link Pubmed DOI

Authors Gunnar Rätsch, Sebastian Mika, Manfred K Warmuth

Link

Authors Gunnar Rätsch, Manfred K Warmuth

Link

Authors Soren Sonnenburg, Gunnar Rätsch, A Jagoda, Klaus Robert Müller

Link

Authors Manfred K Warmuth, Gunnar Rätsch, Michael Mathieson, Jun Liao, Christian Lemmen

Link

Authors Gunnar Rätsch, Sebastian Mika, Bernhard Schölkopf, Klaus Robert Müller

Submitted IEEE Transactions on Pattern Analysis and Machine Intelligence

Link

Abstract Recently, Jaakkola and Haussler (1999) proposed a method for constructing kernel functions from probabilistic models. Their so-called Fisher kernel has been combined with discriminative classifiers such as support vector machines and applied successfully in, for example, DNA and protein analysis. Whereas the Fisher kernel is calculated from the marginal log-likelihood, we propose the TOP kernel derived from tangent vectors of posterior log-odds. Furthermore, we develop a theoretical framework on feature extractors from probabilistic models and use it for analyzing the TOP kernel. In experiments, our new discriminative TOP kernel compares favorably to the Fisher kernel.

Authors Koji Tsuda, Motoaki Kawanabe, Gunnar Rätsch, Soren Sonnenburg, Klaus Robert Müller

Submitted Neural computation

Link Pubmed DOI

Authors T Onoda, Gunnar Rätsch, Klaus Robert Müller

Submitted Journal of the Japanese Society for AI

Link

Abstract This paper provides an introduction to support vector machines, kernel Fisher discriminant analysis, and kernel principal component analysis, as examples for successful kernel-based learning methods. We first give a short background about Vapnik-Chervonenkis theory and kernel feature spaces and then proceed to kernel based learning in supervised and unsupervised scenarios including practical and algorithmic considerations. We illustrate the usefulness of kernel algorithms by discussing applications such as optical character recognition and DNA analysis.

Authors Klaus Robert Müller, Sebastian Mika, Gunnar Rätsch, Koji Tsuda, Bernhard Schölkopf

Submitted IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council

Link Pubmed DOI

Abstract Recently, ensemble methods like AdaBoost have been applied successfully in many problems, while seemingly defying the problems of overfitting. AdaBoost rarely overfits in the low noise regime; however, we show that it clearly does so for higher noise levels. Central to the understanding of this fact is the margin distribution. AdaBoost can be viewed as a constraint gradient descent in an error function with respect to the margin. We find that AdaBoost asymptotically achieves a hard margin distribution, i.e. the algorithm concentrates its resources on a few hard-to-learn patterns that are interestingly very similar to Support Vectors. A hard margin is clearly a sub-optimal strategy in the noisy case, and regularization, in our case a mistrust in the data, must be introduced in the algorithm to alleviate the distortions that single difficult patterns (e.g. outliers) can cause to the margin distribution. We propose several regularization methods and generalizations of the original AdaBoost algorithm to achieve a soft margin. In particular we suggest (1) regularized AdaBoostREG where the gradient descent is done directly with respect to the soft margin and (2) regularized linear and quadratic programming (LP/QP-) AdaBoost, where the soft margin is attained by introducing slack variables. Extensive simulations demonstrate that the proposed regularized AdaBoost-type algorithms are useful and yield competitive results for noisy data.

Authors Gunnar Rätsch, T Onoda, Klaus Robert Müller

Submitted Machine Learning

Link DOI

Authors G Rätsch, Bernhard Schölkopf, A J Smola, Sebastian Mika, T Onoda, K R Müller

Link

Authors Gunnar Rätsch, Bernhard Schölkopf, A J Smola, Klaus Robert Müller, T Onoda, Sebastian Mika

Link

Authors Gunnar Rätsch, Manfred K Warmuth, Sebastian Mika, T Onoda, S Lemm, Klaus Robert Müller

Link

Authors Gunnar Rätsch, B Schölkopf, A J Smola, Sebastian Mika, T Onoda, Klaus Robert Müller

Link

Authors Gunnar Rätsch, Bernhard Schölkopf, Sebastian Mika, Klaus Robert Müller

Link

Abstract In order to extract protein sequences from nucleotide sequences, it is an important step to recognize points at which regions start that code for proteins. These points are called translation initiation sites (TIS).

Authors Alexander Zien, Gunnar Rätsch, Sebastian Mika, Bernhard Schölkopf, T Lengauer, Klaus Robert Müller

Submitted Bioinformatics (Oxford, England)

Link Pubmed

Authors Sebastian Mika, Bernhard Schölkopf, A J Smola, Klaus Robert Müller, M Scholz, Gunnar Rätsch

Link

Authors Gunnar Rätsch, T Onoda, Klaus Robert Müller

Link

Abstract In order to extract protein sequences from nucleotide sequences, it is an important step to recognize points from which regions encoding proteins start, the so-called translation initiation sites (TIS). This can be modeled as a classification problem. We demonstrate the power of support vector machines (SVMs) for this task, and show how to successfully incorporate biological prior knowledge by engineering an appropriate kernel function.

Authors Alexander Zien, Gunnar Rätsch, Sebastian Mika, Bernhard Schölkopf, Christian Lemmen, A J Smola, T Lengauer, Klaus Robert Müller

Authors Sebastian Mika, Gunnar Rätsch, Bernhard Schölkopf, Klaus Robert Müller

Submitted Neural networks for signal processing IX

Link

Abstract This paper collects some ideas targeted at advancing our understanding of the feature spaces associated with support vector (SV) kernel functions. We first discuss the geometry of feature space. In particular, we review what is known about the shape of the image of input space under the feature space map, and how this influences the capacity of SV methods. Following this, we describe how the metric governing the intrinsic geometry of the mapped surface can be computed in terms of the kernel, using the example of the class of inhomogeneous polynomial kernels, which are often used in SV pattern recognition. We then discuss the connection between feature space and input space by dealing with the question of how one can, given some vector in feature space, find a preimage (exact or approximate) in input space. We describe algorithms to tackle this issue, and show their utility in two applications of kernel methods. First, we use it to reduce the computational complexity of SV decision functions; second, we combine it with the Kernel PCA algorithm, thereby constructing a nonlinear statistical denoising technique which is shown to perform well on real-world data.

Authors Bernhard Schölkopf, Sebastian Mika, C C Burges, P Knirsch, Klaus Robert Müller, Gunnar Rätsch, A J Smola

Submitted IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council

Link Pubmed DOI

Authors Klaus Robert Müller, A J Smola, Gunnar Rätsch, Bernhard Schölkopf, J Kohlmorgen, V Vapnik

Link