Publications
2025
Abstract MOTIVATION Pairwise alignment is at the core of computational biology. Most commonly used exact methods are either based on O(ns) band doubling or O(n+s²) diagonal transition, where n is the sequence length and s the number of errors. However, as the length of sequences has grown, these exact methods are often replaced by approximate methods based on e.g. seed-and-extend and heuristics to bound the computed region. We would like to develop an exact method that matches the performance of these approximate methods. Recently, Astarix introduced the A* shortest path algorithm with the seed heuristic for exact sequence-to-graph alignment. A*PA adapted and improved this for pairwise sequence alignment and achieves near-linear runtime when divergence (error rate) is low, at the cost of being very slow when divergence is high. METHODS We introduce A*PA2, an exact global pairwise aligner with respect to edit distance. The goal of A*PA2 is to unify the near-linear runtime of A*PA on similar sequences with the efficiency of dynamic programming (DP) based methods. Like Edlib, A*PA2 uses Ukkonen’s band doubling in combination with Myers' bitpacking. A*PA2 1) uses large block sizes inspired by Block Aligner, 2) extends this with SIMD (single instruction, multiple data), 3) introduces a new profile for efficient computations, 4) introduces a new optimistic technique for traceback based on diagonal transition, 5) avoids recomputation of states where possible, and 6) applies the heuristics developed in A*PA and improves them using pre-pruning. RESULTS With the first 4 engineering optimizations, A*PA2-simple has complexity O(ns) and is 6× to 8× faster than Edlib for sequences ≥ 10 kbp. A*PA2-full also includes the heuristic and is often near-linear in practice for sequences with small divergence. 
The average runtime of A*PA2 is 19× faster than the exact aligners BiWFA and Edlib on >500 kbp long ONT (Oxford Nanopore Technologies) reads of a human genome having 6% divergence on average. On shorter ONT reads of 11% average divergence the speedup is 5.6× (avg. length 11 kbp) and 0.81× (avg. length 800 bp). On all tested datasets, A*PA2 is competitive with or faster than approximate methods.
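The band doubling strategy that A*PA2 builds on can be illustrated with a minimal sketch: compute edit distance restricted to a band of width s around the main diagonal, and double s until the answer fits inside the band. This is a plain-Python illustration of Ukkonen's classic idea only; it includes none of A*PA2's bitpacking, SIMD, blocks, or heuristics, and the function names are ours.

```python
def banded_edit_distance(a: str, b: str, s: int):
    # Standard DP restricted to cells within distance s of the main diagonal.
    # Returns the edit distance if it is <= s, else None.
    n, m = len(a), len(b)
    if abs(n - m) > s:
        return None
    INF = float("inf")
    prev = [j if j <= s else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        lo, hi = max(0, i - s), min(m, i + s)
        cur = [INF] * (m + 1)
        if lo == 0:
            cur[0] = i
        for j in range(max(1, lo), hi + 1):
            cur[j] = min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (a[i - 1] != b[j - 1]),  # (mis)match
            )
        prev = cur
    return prev[m] if prev[m] <= s else None

def band_doubling_distance(a: str, b: str) -> int:
    # Try s = 1, 2, 4, ...; total work is dominated by the last band, O(ns).
    s = 1
    while True:
        d = banded_edit_distance(a, b, s)
        if d is not None:
            return d
        s *= 2
```

The doubling schedule is what keeps the method exact: a distance returned from within a band of width s is provably correct whenever it is at most s.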
Authors Ragnar Groot Koerkamp
Submitted WABI24
2024
Abstract Notable progress has been made in generalist medical large language models across various healthcare areas. However, large-scale modeling of in-hospital time series data - such as vital signs, lab results, and treatments in critical care - remains underexplored. Existing datasets are relatively small, but combining them can enhance patient diversity and improve model robustness. To effectively utilize these combined datasets for large-scale modeling, it is essential to address the distribution shifts caused by varying treatment policies, necessitating the harmonization of treatment variables across the different datasets. This work aims to establish a foundation for training large-scale multi-variate time series models on critical care data and to provide a benchmark for machine learning models in transfer learning across hospitals to study and address distribution shift challenges. We introduce a harmonized dataset for sequence modeling and transfer learning research, representing the first large-scale collection to include core treatment variables. Future plans involve expanding this dataset to support further advancements in transfer learning and the development of scalable, generalizable models for critical healthcare applications.
Authors Manuel Burger, Fedor Sergeev, Malte Londschien, Daphné Chopard, Hugo Yèche, Eike Gerdes, Polina Leshetkina, Alexander Morgenroth, Zeynep Babür, Jasmina Bogojeska, Martin Faltys, Rita Kuznetsova, Gunnar Rätsch
Submitted AIM-FM Workshop at NeurIPS 2024
Abstract Raw nanopore signal analysis is a common approach in genomics to provide fast and resource-efficient analysis without translating the signals to bases (i.e., without basecalling). However, existing solutions cannot interpret raw signals directly if a reference genome is unknown due to a lack of accurate mechanisms to handle increased noise in pairwise raw signal comparison. Our goal is to enable the direct analysis of raw signals without a reference genome. To this end, we propose Rawsamble, the first mechanism that can 1) identify regions of similarity between all raw signal pairs, known as all-vs-all overlapping, using a hash-based search mechanism and 2) use these to construct genomes from scratch, called de novo assembly. Our extensive evaluations across multiple genomes of varying sizes show that Rawsamble provides a significant speedup (on average by 16.36x and up to 41.59x) and reduces peak memory usage (on average by 11.73x and up to 41.99x) compared to a conventional genome assembly pipeline using the state-of-the-art tools for basecalling (Dorado's fastest mode) and overlapping (minimap2) on a CPU. We find that 36.57% of overlapping pairs generated by Rawsamble are identical to those generated by minimap2. Using the overlaps from Rawsamble, we construct the first de novo assemblies directly from raw signals without basecalling. We show that we can construct contiguous assembly segments (unitigs) up to 2.7 million bases in length (half the genome length of E. coli). We identify previously unexplored directions that can be enabled by finding overlaps and constructing de novo assemblies. Rawsamble is available at this https URL. We also provide the scripts to fully reproduce our results on our GitHub page.
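The hash-based all-vs-all overlapping idea can be illustrated in a few lines: index each read's hashed sketch values (e.g. quantized signal k-mers) in an inverted index, then report read pairs that share enough values as overlap candidates. This is a generic sketch of the idea under our own names and thresholds, not Rawsamble's actual mechanism.

```python
from collections import defaultdict
from itertools import combinations

def overlap_candidates(features: dict, min_shared: int = 2) -> set:
    """`features` maps each read id to a set of hashed sketch values.
    Read pairs sharing at least `min_shared` values are overlap candidates."""
    index = defaultdict(set)          # inverted index: hash value -> read ids
    for rid, feats in features.items():
        for f in feats:
            index[f].add(rid)
    shared = defaultdict(int)         # count shared values per read pair
    for rid_set in index.values():
        for pair in combinations(sorted(rid_set), 2):
            shared[pair] += 1
    return {pair for pair, c in shared.items() if c >= min_shared}
```

Because candidates are generated only from buckets of the inverted index, reads with no sketch values in common are never compared, which is what avoids the quadratic all-vs-all cost.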
Authors Can Firtina, Maximilian Mordig, Harun Mustafa, Sayan Goswami, Nika Mansouri Ghiasi, Stefano Mercogliano, Furkan Eris, Joël Lindegger, Andre Kahles, Onur Mutlu
Submitted arXiv
Abstract Question Can established cardiovascular risk tools be adapted for local populations without sacrificing interpretability? Findings This cohort study including 95 326 individuals applied a machine learning recalibration method that uses minimal variables to the American Heart Association’s Predicting Risk of Cardiovascular Disease Events (AHA-PREVENT) equations for a New England population. This approach strengthened the AHA-PREVENT risk equations, improving calibration while maintaining similar risk discrimination. Meaning The results indicate that the interpretable machine learning-based recalibration method used in this study can be implemented to tailor risk stratification in local health systems.
Authors Aniket N Zinzuwadia, Olga Mineeva, Chunying Li, Zareen Farukhi, Franco Giulianini, Brian Cade, Lin Chen, Elizabeth Karlson, Nina Paynter, Samia Mora, Olga Demler
Submitted JAMA cardiology
Abstract MOTIVATION Given a string S, a minimizer scheme is an algorithm defined by a triple (k,w,O) that samples a subset of k-mers (k-long substrings) from S. Specifically, it samples the smallest k-mer according to the order O from each window of w consecutive k-mers in S. Because consecutive windows can sample the same k-mer, the set of the sampled k-mers is typically much smaller than S. More generally, we consider substring sampling algorithms that respect a window guarantee: at least one k-mer must be sampled from every window of w consecutive k-mers. As a sampled k-mer is uniquely identified by its absolute position in S, we can define the density of a sampling algorithm as the fraction of distinct sampled positions. Good methods have low density which, by respecting the window guarantee, is lower bounded by 1/w. It is however difficult to design a sequence-agnostic algorithm with provably optimal density. In practice, the order O is usually implemented using a pseudo-random hash function to obtain the so-called random minimizer. This scheme is simple to implement, very fast to compute even in streaming fashion, and easy to analyze. However, its density is almost a factor of 2 away from the lower bound for large windows. METHODS In this work we introduce mod-sampling, a two-step sampling algorithm to obtain new minimizer schemes. Given a (small) parameter t, the mod-sampling algorithm finds the position p of the smallest t-mer in a window. It then samples the k-mer at position p mod w. The lr-minimizer uses t = k-w and the mod-minimizer uses t ≡ k (mod w). RESULTS These new schemes have provably lower density than random minimizers and other schemes when k is large compared to w, while being as fast to compute. Importantly, the mod-minimizer achieves optimal density when k goes to infinity. Although the mod-minimizer is not the first method to achieve optimal density for large k, its proof of optimality is simpler than previous work.
We provide pseudocode for a number of other methods and compare to them. In practice, the mod-minimizer has considerably lower density than the random minimizer and other state-of-the-art methods, like closed syncmers and miniception, when k > w. We plugged the mod-minimizer into SSHash, a k-mer dictionary based on minimizers. For default parameters (w,k) = (11,21), space usage decreases by 15% when indexing the whole human genome (GRCh38), while maintaining its fast query time.
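The two-step mod-sampling rule described above is short enough to sketch directly: within each window, find the position p of the smallest t-mer under a pseudo-random order, then sample the k-mer starting at offset p mod w. The hash function below is an arbitrary stand-in for the order O; with t = k this reduces to the random minimizer.

```python
import hashlib

def _order(s: str) -> int:
    # Stand-in pseudo-random order O on t-mers.
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

def mod_sampling(seq: str, k: int, w: int, t: int) -> set:
    """Return the set of sampled k-mer start positions.
    A window of w consecutive k-mers spans w + k - 1 characters and
    contains w + k - t t-mers."""
    positions = set()
    for i in range(len(seq) - (w + k - 1) + 1):
        window = seq[i : i + w + k - 1]
        # Position (within the window) of the smallest t-mer under the order.
        p = min(range(w + k - t), key=lambda j: _order(window[j : j + t]))
        positions.add(i + (p % w))  # sample the k-mer at offset p mod w
    return positions
```

Since p mod w always lies in [0, w), every window contributes a sampled k-mer inside itself, so the window guarantee holds by construction; the density is the number of distinct sampled positions divided by the number of k-mer positions.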
Authors Ragnar Groot Koerkamp, Giulio Ermanno Pibiri
Submitted WABI24
Abstract Applying reinforcement learning (RL) to real-world problems is often made challenging by the inability to interact with the environment and the difficulty of designing reward functions. Offline RL addresses the first challenge by considering access to an offline dataset of environment interactions labeled by the reward function. In contrast, Preference-based RL does not assume access to the reward function and learns it from preferences, but typically requires an online interaction with the environment. We bridge the gap between these frameworks by exploring efficient methods for acquiring preference feedback in a fully offline setup. We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm, which leverages a learned environment model to elicit preference feedback on simulated rollouts. Drawing on insights from both the offline RL and the preference-based RL literature, our algorithm employs a pessimistic approach for out-of-distribution data, and an optimistic approach for acquiring informative preferences about the optimal policy. We provide theoretical guarantees regarding the sample complexity of our approach, dependent on how well the offline data covers the optimal policy. Finally, we demonstrate the empirical performance of Sim-OPRL in different environments.
Authors Alizée Pace, Bernhard Schölkopf, Gunnar Rätsch, Giorgia Ramponi
Submitted ICML 2024 MFHAIA
Abstract Knowing which features of a multivariate time series to measure and when is a key task in medicine, wearables, and robotics. Better acquisition policies can reduce costs while maintaining or even improving the performance of downstream predictors. Inspired by the maximization of conditional mutual information, we propose an approach to train acquirers end-to-end using only the downstream loss. We show that our method outperforms a random acquisition policy and matches a model with an unrestrained budget, but does not yet overtake a static acquisition strategy. We highlight the assumptions and outline avenues for future work.
Authors Fedor Sergeev, Paola Malsot, Gunnar Rätsch, Vincent Fortuin
Submitted SPIGM ICML Workshop
Abstract Motivation Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. Results We introduce a new scoring model, ‘multi-label alignment’ (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, ‘Label Change’ incorporates more informative global sample similarity into local scores. To improve connectivity, ‘Node Length Change’ dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%–66.8% and covering 45.5%–47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.
Authors Harun Mustafa, Mikhail Karasikov, Nika Mansouri Ghiasi, Gunnar Rätsch, André Kahles
Submitted Bioinformatics, ISMB 2024
Abstract Acute kidney injury (AKI) is a syndrome that affects a large fraction of all critically ill patients, and early diagnosis, which allows patients to receive adequate treatment, is as imperative as it is challenging. Consequently, machine learning approaches have been developed to predict AKI ahead of time. However, the prevalence of AKI is often underestimated in state-of-the-art approaches, as they rely on an AKI event annotation solely based on creatinine, ignoring urine output. We construct and evaluate early warning systems for AKI in a multi-disciplinary ICU setting, using the complete KDIGO definition of AKI. We propose several variants of gradient-boosted decision tree (GBDT)-based models, including a novel time-stacking based approach. A state-of-the-art LSTM-based model previously proposed for AKI prediction is used as a comparison, which had not yet been specifically evaluated in ICU settings. We find that optimal performance is achieved by using GBDT with the time-based stacking technique (AUPRC = 65.7%, compared with the LSTM-based model’s AUPRC = 62.6%), which is motivated by the high relevance of time since ICU admission for this task. Both models show mildly reduced performance in the limited training data setting, perform fairly across different subcohorts, and exhibit no issues in gender transfer. Following the official KDIGO definition substantially increases the number of annotated AKI events. In our study, GBDTs outperform LSTM models for AKI prediction. Generally, we find that both model types are robust in a variety of challenging settings arising for ICU data.
Authors Xinrui Lyu, Bowen Fan, Matthias Hüser, Philip Hartout, Thomas Gumbsch, Martin Faltys, Tobias M. Merz, Gunnar Rätsch, and Karsten Borgwardt
Submitted Bioinformatics, ISMB 2024
Abstract Metagenomics has led to significant advances in many fields. Metagenomic analysis commonly involves the key tasks of determining the species present in a sample and their relative abundances. These tasks require searching large metagenomic databases. Metagenomic analysis suffers from significant data movement overhead due to moving large amounts of low-reuse data from the storage system. In-storage processing can be a fundamental solution for reducing this overhead. However, designing an in-storage processing system for metagenomics is challenging because existing approaches to metagenomic analysis cannot be directly implemented in storage effectively due to the hardware limitations of modern SSDs. We propose MegIS, the first in-storage processing system designed to significantly reduce the data movement overhead of the end-to-end metagenomic analysis pipeline. MegIS is enabled by our lightweight design that effectively leverages and orchestrates processing inside and outside the storage system. We address in-storage processing challenges for metagenomics via specialized and efficient 1) task partitioning, 2) data/computation flow coordination, 3) storage technology-aware algorithmic optimizations, 4) data mapping, and 5) lightweight in-storage accelerators. MegIS's design is flexible, capable of supporting different types of metagenomic input datasets, and can be integrated into various metagenomic analysis pipelines. Our evaluation shows that MegIS outperforms the state-of-the-art performance- and accuracy-optimized software metagenomic tools by 2.7×-37.2× and 6.9×-100.2×, respectively, while matching the accuracy of the accuracy-optimized tool. MegIS achieves 1.5×-5.1× speedup compared to the state-of-the-art metagenomic hardware-accelerated (using processing-in-memory) tool, while achieving significantly higher accuracy.
Authors Nika Mansouri Ghiasi, Mohammad Sadrosadati, Harun Mustafa, Arvid Gollwitzer, Can Firtina, Julien Eudine, Haiyu Mao, Joël Lindegger, Meryem Banu Cavlak, Mohammed Alser, Jisung Park, Onur Mutlu
Submitted ISCA 2024
Abstract The amount of biological sequencing data available in public repositories is growing exponentially, forming an invaluable biomedical research resource. Yet, making it full-text searchable and easily accessible to researchers in life and data science is an unsolved problem. In this work, we take advantage of recently developed, very efficient data structures and algorithms for representing sequence sets. We make petabases of DNA sequences across all clades of life, including viruses, bacteria, fungi, plants, animals, and humans, fully searchable. Our indexes are freely available to the research community. This highly compressed representation of the input sequences (up to 5800×) fits on a single consumer hard drive (≈100 USD), making this valuable resource cost-effective to use and easily transportable. We present the underlying methodological framework, called MetaGraph, that allows us to scalably index very large sets of DNA or protein sequences using annotated De Bruijn graphs. We demonstrate the feasibility of indexing the full extent of existing sequencing data and present new approaches for efficient and cost-effective full-text search at an on-demand cost of $0.10 per queried Mbp. We explore several practical use cases to mine existing archives for interesting associations and demonstrate the utility of our indexes for integrative analyses.
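The core idea of an annotated sequence index can be conveyed with a toy example: map every k-mer to the set of sample labels it occurs in, and answer a query by the fraction of its k-mers found per sample. This is only a dictionary-based illustration of the concept; MetaGraph's actual annotated De Bruijn graph representations are vastly more compressed.

```python
from collections import defaultdict

def build_index(samples: dict, k: int):
    """Toy annotated k-mer index: k-mer -> set of sample labels."""
    index = defaultdict(set)
    for label, seq in samples.items():
        for i in range(len(seq) - k + 1):
            index[seq[i : i + k]].add(label)
    return index

def query(index, seq: str, k: int) -> dict:
    """Fraction of the query's k-mers found in each sample."""
    kmers = [seq[i : i + k] for i in range(len(seq) - k + 1)]
    hits = defaultdict(int)
    for km in kmers:
        for label in index.get(km, ()):
            hits[label] += 1
    return {label: c / len(kmers) for label, c in hits.items()}
```

A full-text search over sequence archives then amounts to ranking samples by this k-mer hit fraction, with the index replacing any access to the raw reads.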
Authors Mikhail Karasikov, Harun Mustafa, Daniel Danciu, Marc Zimmermann, Christopher Barber, Gunnar Rätsch, André Kahles
Submitted bioRxiv
Abstract Fracture prediction is essential in managing patients with osteoporosis and is an integral component of many fracture prevention guidelines. We aimed to identify the most relevant clinical fracture risk factors in contemporary populations by training and validating short- and long-term fracture risk prediction models in 2 cohorts. We used traditional and machine learning survival models to predict risks of vertebral, hip, and any fractures on the basis of clinical risk factors, T-scores, and treatment history among participants in a nationwide Swiss Osteoporosis Registry (N = 5944 postmenopausal women, median follow-up of 4.1 yr between January 2015 and October 2022; a total of 1190 fractures during follow-up). The independent validation cohort comprised 5474 postmenopausal women from the UK Biobank with 290 incident fractures during follow-up. Uno’s C-index and the time-dependent area under the receiver operating characteristics curve were calculated to evaluate the performance of different machine learning models (Random survival forest and eXtreme Gradient Boosting). In the independent validation set, the C-index was 0.74 [0.58, 0.86] for vertebral fractures, 0.83 [0.7, 0.94] for hip fractures, and 0.63 [0.58, 0.69] for any fractures at year 2, and these values further increased for longer estimations of up to 7 yr. In comparison, the 10-yr fracture probability calculated with FRAX Switzerland was 0.60 [0.55, 0.64] for major osteoporotic fractures and 0.62 [0.49, 0.74] for hip fractures. The most important variables identified with Shapley additive explanations values were age, T-scores, and prior fractures, while number of falls was an important predictor of hip fractures. Performances of both traditional and machine learning models showed similar C-indices. 
We conclude that fracture risk prediction can be improved by including the lumbar spine T-score, trabecular bone score, number of falls, and recent fractures, and that treatment information has a significant impact on fracture prediction.
Authors Oliver Lehmann, Olga Mineeva, Dinara Veshchezerova, HansJörg Häuselmann, Laura Guyer, Stephan Reichenbach, Thomas Lehmann, Olga Demler, Judith Everts-Graber, The Swiss Osteoporosis Registry Study Group
Submitted Journal of Bone and Mineral Research
Abstract Spatial transcriptomics enables in-depth molecular characterization of samples on a morphology and RNA level while preserving spatial location. Integrating the resulting multi-modal data is an unsolved problem, and developing new solutions in precision medicine depends on improved methodologies. Here, we introduce AESTETIK, a convolutional deep learning model that jointly integrates spatial, transcriptomics, and morphology information to learn accurate spot representations. AESTETIK yielded substantially improved cluster assignments on widely adopted technology platforms (e.g., 10x Genomics™, NanoString™) across multiple datasets. We achieved performance enhancement on structured tissues (e.g., brain) with a 21% increase in median ARI over previous state-of-the-art methods. Notably, AESTETIK also demonstrated superior performance on cancer tissues with heterogeneous cell populations, showing a two-fold increase in breast cancer, 79% in melanoma, and 21% in liver cancer. We expect that these advances will enable a multi-modal understanding of key biological processes.
Authors Kalin Nonchev, Sonali Andani, Joanna Ficek-Pascual, Marta Nowak, Bettina Sobottka, Tumor Profiler Consortium, Viktor Hendrik Koelzer, and Gunnar Rätsch
Submitted MedRxiv
Abstract Machine learning applications hold promise to aid clinicians in a wide range of clinical tasks, from diagnosis to prognosis, treatment, and patient monitoring. These potential applications are accompanied by a surge of ethical concerns surrounding the use of Machine Learning (ML) models in healthcare, especially regarding fairness and non-discrimination. While there is an increasing number of regulatory policies to ensure the ethical and safe integration of such systems, the translation from policies to practices remains an open challenge. Algorithmic frameworks, aiming to bridge this gap, should be tailored to the application to enable the translation from fundamental human-right principles into accurate statistical analysis, capturing the inherent complexity and risks associated with the system. In this work, we propose a set of fairness checks especially adapted to ML early-warning systems in the medical context, comprising, on top of standard fairness metrics, an analysis of clinical outcomes and a screening of potential sources of bias in the pipeline. Our analysis is further fortified by the inclusion of event-based and prevalence-corrected metrics, as well as statistical tests to measure biases. Additionally, we emphasize the importance of considering subgroups beyond the conventional demographic attributes. Finally, to facilitate operationalization, we present an open-source tool, FAMEWS, to generate comprehensive fairness reports. These reports address the diverse needs and interests of the stakeholders involved in integrating ML into medical practice. The use of FAMEWS has the potential to reveal critical insights that might otherwise remain obscured. This can lead to improved model design, which in turn may translate into enhanced health outcomes.
Authors Marine Hoche, Olga Mineeva, Manuel Burger, Alessandro Blasimme, Gunnar Ratsch
Submitted Proceedings of Machine Learning Research
Abstract Electronic Health Record (EHR) datasets from Intensive Care Units (ICU) contain a diverse set of data modalities. While prior works have successfully leveraged multiple modalities in supervised settings, we apply advanced self-supervised multi-modal contrastive learning techniques to ICU data, specifically focusing on clinical notes and time-series for clinically relevant online prediction tasks. We introduce the Multi-Modal Neighborhood Contrastive Loss (MM-NCL), built on a soft neighborhood function, and showcase the excellent linear probe and zero-shot performance of our approach.
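The neighborhood idea behind such a loss can be sketched generically: rather than a single positive per anchor, the cross-modal contrastive targets are softened by temporal proximity. The sketch below is an illustrative Gaussian-neighborhood InfoNCE variant under names and parameters of our choosing, not the exact MM-NCL formulation from the paper.

```python
import numpy as np

def neighborhood_contrastive_loss(z_ts, z_txt, times, sigma=1.0, tau=0.1):
    """Contrastive loss between time-series and text embeddings where targets
    are softened by a Gaussian temporal neighborhood (illustrative sketch)."""
    # L2-normalize both modalities, then take temperature-scaled similarities.
    z_ts = z_ts / np.linalg.norm(z_ts, axis=1, keepdims=True)
    z_txt = z_txt / np.linalg.norm(z_txt, axis=1, keepdims=True)
    logits = z_ts @ z_txt.T / tau
    # Soft targets: weight decays with the time gap between the two items.
    dt = np.abs(times[:, None] - times[None, :])
    targets = np.exp(-(dt ** 2) / (2 * sigma ** 2))
    targets /= targets.sum(1, keepdims=True)
    # Cross-entropy between soft targets and row-wise softmax of the logits.
    log_probs = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    return float(-(targets * log_probs).sum(1).mean())
```

With sigma → 0 the targets collapse to the usual one-hot positives, recovering standard cross-modal InfoNCE.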
Authors Fabian Baldenweg, Manuel Burger, Gunnar Rätsch, Rita Kuznetsova
Submitted TS4H ICLR Workshop
Abstract The success of reinforcement learning from human feedback (RLHF) in language model alignment is strongly dependent on the quality of the underlying reward model. In this paper, we present a novel approach to improve reward model quality by generating synthetic preference data, thereby augmenting the training dataset with on-policy, high-quality preference pairs. Motivated by the promising results of Best-of-N sampling strategies in language model training, we extend their application to reward model training. This results in a self-training strategy to generate preference pairs by selecting the best and worst candidates in a pool of responses to a given query. Empirically, we find that this approach improves the performance of any reward model, with an effect comparable to the addition of a similar quantity of human preference data. This work opens up new avenues of research for improving RLHF for language model alignment, by offering synthetic preference generation as a solution to reward modeling challenges.
Authors Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, Aliaksei Severyn
Submitted ICLR 2024 DPFM
Abstract MOTIVATION Sequence alignment has been at the core of computational biology for half a century. Still, it is an open problem to design a practical algorithm for exact alignment of a pair of related sequences in linear-like time. METHODS We solve exact global pairwise alignment with respect to edit distance by using the A* shortest path algorithm. In order to efficiently align long sequences with high divergence, we extend the recently proposed seed heuristic with match chaining, gap costs, and inexact matches. We additionally integrate the novel match pruning technique and diagonal transition to improve the A* search. We prove the correctness of our algorithm, implement it in the A*PA aligner, and justify our extensions intuitively and empirically. RESULTS On random sequences of divergence d=4% and length n, the empirical runtime of A*PA scales near-linearly with length (best fit n^1.06, n < 10^7 bp). A similar scaling remains up to d=12% (best fit n^1.24, n < 10^7 bp). For n = 10^7 bp and d=4%, A*PA reaches >500x speedup compared to the leading exact aligners Edlib and WFA. The performance of A*PA is highly influenced by long gaps. On long (n>500 kbp) ONT reads of a human sample it efficiently aligns sequences with d<10%, leading to 3x median speedup compared to Edlib and WFA. When the sequences come from different human samples, A*PA performs 1.7x faster than Edlib and WFA. Availability github.com/RagnarGrootKoerkamp/astar-pairwise-aligner
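A* over the edit graph can be sketched in a few lines: states are positions (i, j), edit operations are unit-cost edges, and an admissible heuristic prunes the search. The sketch below uses only the simple gap-cost heuristic h(i, j) = |(n - i) - (m - j)|; the seed heuristic, match pruning, and diagonal transition that make A*PA fast are deliberately omitted.

```python
import heapq

def astar_edit_distance(a: str, b: str) -> int:
    """Exact edit distance via A* on the edit graph of states (i, j),
    using the admissible gap-cost heuristic (illustrative sketch only)."""
    n, m = len(a), len(b)
    h = lambda i, j: abs((n - i) - (m - j))  # remaining length difference
    best = {(0, 0): 0}
    queue = [(h(0, 0), 0, 0)]  # (f = g + h, i, j)
    while queue:
        f, i, j = heapq.heappop(queue)
        g = best[(i, j)]
        if f > g + h(i, j):
            continue  # stale queue entry
        if (i, j) == (n, m):
            return g
        steps = []
        if i < n and j < m:
            steps.append((i + 1, j + 1, int(a[i] != b[j])))  # (mis)match
        if i < n:
            steps.append((i + 1, j, 1))  # deletion
        if j < m:
            steps.append((i, j + 1, 1))  # insertion
        for ni, nj, c in steps:
            ng = g + c
            if ng < best.get((ni, nj), float("inf")):
                best[(ni, nj)] = ng
                heapq.heappush(queue, (ng + h(ni, nj), ni, nj))
    raise ValueError("end state unreachable")
```

Because the gap-cost heuristic is consistent, the first time the end state is popped its g-value is the exact edit distance; stronger heuristics like the seed heuristic shrink the explored region further without sacrificing exactness.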
Authors Ragnar Groot Koerkamp, Pesho Ivanov
Submitted Bioinformatics
Abstract A prominent challenge of offline reinforcement learning (RL) is the issue of hidden confounding: unobserved variables may influence both the actions taken by the agent and the observed outcomes. Hidden confounding can compromise the validity of any causal conclusion drawn from data and presents a major obstacle to effective offline RL. In the present paper, we tackle the problem of hidden confounding in the nonidentifiable setting. We propose a definition of uncertainty due to hidden confounding bias, termed delphic uncertainty, which uses variation over world models compatible with the observations, and differentiate it from the well-known epistemic and aleatoric uncertainties. We derive a practical method for estimating the three types of uncertainties, and construct a pessimistic offline RL algorithm to account for them. Our method does not assume identifiability of the unobserved confounders, and attempts to reduce the amount of confounding bias. We demonstrate through extensive experiments and ablations the efficacy of our approach on a sepsis management benchmark, as well as on electronic health records. Our results suggest that nonidentifiable hidden confounding bias can be mitigated to improve offline RL solutions in practice.
Authors Alizée Pace, Hugo Yèche, Bernhard Schölkopf, Gunnar Ratsch, Guy Tennenholtz
Submitted ICLR 2024
2023
Abstract In this paper, we explore the structure of the penultimate Gram matrix in deep neural networks, which contains the pairwise inner products of outputs corresponding to a batch of inputs. In several architectures it has been observed that this Gram matrix becomes degenerate with depth at initialization, which dramatically slows training. Normalization layers, such as batch or layer normalization, play a pivotal role in preventing the rank collapse issue. Despite promising advances, the existing theoretical results (i) do not extend to layer normalization, which is widely used in transformers, (ii) cannot characterize the bias of normalization quantitatively at finite depth. To bridge this gap, we provide a proof that layer normalization, in conjunction with activation layers, biases the Gram matrix of a multilayer perceptron towards isometry at an exponential rate with depth at initialization. We quantify this rate using the Hermite expansion of the activation function, highlighting the importance of higher order (≥2) Hermite coefficients in the bias towards isometry.
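The isometry bias can be observed numerically: pass a batch of nearly identical inputs through a random ReLU MLP with layer normalization and watch the off-diagonal entries of the (normalized) Gram matrix shrink with depth. This is a numerical illustration under our own parameter choices, not the paper's formal statement or proof.

```python
import numpy as np

def gram_isometry_demo(depth=50, width=512, batch=4, seed=0):
    """Return the max off-diagonal correlation of the batch Gram matrix
    after each layer of a random ReLU + layer-norm MLP at initialization."""
    rng = np.random.default_rng(seed)
    # Nearly identical inputs: pairwise correlations start close to 1.
    x = rng.standard_normal((1, width)) + 0.1 * rng.standard_normal((batch, width))
    off_diag = []
    for _ in range(depth):
        w = rng.standard_normal((width, width)) / np.sqrt(width)
        x = np.maximum(x @ w, 0.0)                                    # ReLU
        x = (x - x.mean(1, keepdims=True)) / x.std(1, keepdims=True)  # layer norm
        gram = x @ x.T / width
        d = np.sqrt(np.diag(gram))
        corr = gram / np.outer(d, d)  # normalize Gram to correlations
        off_diag.append(np.abs(corr - np.eye(batch)).max())
    return off_diag
```

Despite starting from correlations near 1, the off-diagonal mass decays toward the finite-width noise floor, consistent with the exponential bias towards isometry quantified in the paper.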
Authors Amir Joudaki, Hadi Daneshmand, Francis Bach
Submitted NeurIPS 2023 (poster)
Abstract The rapid expansion of genomic sequence data calls for new methods to achieve robust sequence representations. Existing techniques often neglect intricate structural details, emphasizing mainly contextual information. To address this, we developed k-mer embeddings that merge contextual and structural string information by enhancing De Bruijn graphs with structural similarity connections. Subsequently, we crafted a self-supervised method based on Contrastive Learning that employs a heterogeneous Graph Convolutional Network encoder and constructs positive pairs based on node similarities. Our embeddings consistently outperform prior techniques for Edit Distance Approximation and Closest String Retrieval tasks.
Authors Kacper Kapusniak, Manuel Burger, Gunnar Rätsch, Amir Joudaki
Submitted NeurIPS 2023 Workshop: Frontiers in Graph Learning
Abstract Recent advances in deep learning architectures for sequence modeling have not fully transferred to tasks handling time-series from electronic health records. In particular, in problems related to the Intensive Care Unit (ICU), the state-of-the-art remains to tackle sequence classification in a tabular manner with tree-based methods. Recent findings in deep learning for tabular data are now surpassing these classical methods by better handling the severe heterogeneity of data input features. Given the similar level of feature heterogeneity exhibited by ICU time-series and motivated by these findings, we explore these novel methods' impact on clinical sequence modeling tasks. By jointly using such advances in deep learning for tabular data, our primary objective is to underscore the importance of step-wise embeddings in time-series modeling, which remain unexplored in machine learning methods for clinical data. On a variety of clinically relevant tasks from two large-scale ICU datasets, MIMIC-III and HiRID, our work provides an exhaustive analysis of state-of-the-art methods for tabular time-series as time-step embedding models, showing overall performance improvement. In particular, we evidence the importance of feature grouping in clinical time-series, with significant performance gains when considering features within predefined semantic groups in the step-wise embedding module.
Authors Rita Kuznetsova, Alizée Pace, Manuel Burger, Hugo Yèche, Gunnar Rätsch
Submitted ML4H 2023 (PMLR)
Abstract Clinicians are increasingly looking towards machine learning to gain insights about patient evolutions. We propose a novel approach named Multi-Modal UMLS Graph Learning (MMUGL) for learning meaningful representations of medical concepts using graph neural networks over knowledge graphs based on the unified medical language system. These representations are aggregated to represent entire patient visits and then fed into a sequence model to perform predictions at the granularity of multiple hospital visits of a patient. We improve performance by incorporating prior medical knowledge and considering multiple modalities. We compare our method to existing architectures proposed to learn representations at different granularities on the MIMIC-III dataset and show that our approach outperforms these methods. The results demonstrate the significance of multi-modal medical concept representations based on prior medical knowledge.
Authors Manuel Burger, Gunnar Rätsch, Rita Kuznetsova
Submitted ML4H 2023 (PMLR)
Abstract Intensive Care Units (ICU) require comprehensive patient data integration for enhanced clinical outcome predictions, crucial for assessing patient conditions. Recent deep learning advances have utilized patient time series data, and fusion models have incorporated unstructured clinical reports, improving predictive performance. However, integrating established medical knowledge into these models has not yet been explored. The medical domain's data, rich in structural relationships, can be harnessed through knowledge graphs derived from clinical ontologies like the Unified Medical Language System (UMLS) for better predictions. Our proposed methodology integrates this knowledge with ICU data, improving clinical decision modeling. It combines graph representations with vital signs and clinical reports, enhancing performance, especially when data is missing. Additionally, our model includes an interpretability component to understand how knowledge graph nodes affect predictions.
Authors Samyak Jain, Manuel Burger, Gunnar Rätsch, Rita Kuznetsova
Submitted ML4H 2023 (Findings Track)
Abstract Flexibly quantifying both irreducible aleatoric and model-dependent epistemic uncertainties plays an important role for complex regression problems. While deep neural networks in principle can provide this flexibility and learn heteroscedastic aleatoric uncertainties through non-linear functions, recent works highlight that maximizing the log likelihood objective parameterized by mean and variance can lead to compromised mean fits since the gradients are scaled by the predictive variance, and propose adjustments in line with this premise. We instead propose to use the natural parametrization of the Gaussian, which has been shown to be more stable for heteroscedastic regression based on non-linear feature maps and Gaussian processes. Further, we emphasize the significance of principled regularization of the network parameters and prediction. We therefore propose an efficient Laplace approximation for heteroscedastic neural networks that allows automatic regularization through empirical Bayes and provides epistemic uncertainties, both of which improve generalization. We showcase on a range of regression problems—including a new heteroscedastic image regression benchmark—that our methods are scalable, improve over previous approaches for heteroscedastic regression, and provide epistemic uncertainty without requiring hyperparameter tuning.
Authors Alexander Immer, Emanuele Palumbo, Alexander Marx, Julia E Vogt
Submitted NeurIPS 2023
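The natural parametrization referenced above can be made concrete: for a Gaussian, η₁ = μ/σ² and η₂ = −1/(2σ²), so the log-likelihood is linear in (y, y²) up to a log-partition term. The sketch below (an illustration, not the authors' code) checks that both parametrizations give the same negative log-likelihood.

```python
import math

def nll_mean_var(y, mu, var):
    """Gaussian NLL in the standard (mu, sigma^2) parametrization."""
    return 0.5 * math.log(2 * math.pi * var) + (y - mu) ** 2 / (2 * var)

def nll_natural(y, eta1, eta2):
    """Same NLL via natural parameters eta1 = mu/var, eta2 = -1/(2*var)."""
    log_partition = (-eta1 ** 2 / (4 * eta2)
                     + 0.5 * math.log(2 * math.pi)
                     - 0.5 * math.log(-2 * eta2))
    return -(eta1 * y + eta2 * y ** 2) + log_partition

mu, var, y = 1.5, 0.5, 2.0
eta1, eta2 = mu / var, -1.0 / (2 * var)
print(nll_mean_var(y, mu, var), nll_natural(y, eta1, eta2))  # identical values
```

In the paper's setting, the network would predict (η₁, η₂) directly rather than (μ, σ²); the constraint η₂ < 0 keeps the variance positive.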
Abstract Convolutions encode equivariance symmetries into neural networks leading to better generalisation performance. However, symmetries provide fixed hard constraints on the functions a network can represent, need to be specified in advance, and cannot be adapted. Our goal is to allow flexible symmetry constraints that can automatically be learned from data using gradients. Learning symmetry and associated weight connectivity structures from scratch is difficult for two reasons. First, it requires efficient and flexible parameterisations of layer-wise equivariances. Second, symmetries act as constraints and are therefore not encouraged by training losses measuring data fit. To overcome these challenges, we improve parameterisations of soft equivariance and learn the amount of equivariance in layers by optimising the marginal likelihood, estimated using differentiable Laplace approximations. The objective balances data fit and model complexity enabling layer-wise symmetry discovery in deep networks. We demonstrate the ability to automatically learn layer-wise equivariances on image classification tasks, achieving equivalent or improved performance over baselines with hard-coded symmetry.
Authors Tycho FA van der Ouderaa, Alexander Immer, Mark van der Wilk
Submitted NeurIPS 2023
Abstract The core components of many modern neural network architectures, such as transformers, convolutional, or graph neural networks, can be expressed as linear layers with weight-sharing. Kronecker-Factored Approximate Curvature (K-FAC), a second-order optimisation method, has shown promise to speed up neural network training and thereby reduce computational costs. However, there is currently no framework to apply it to generic architectures, specifically ones with linear weight-sharing layers. In this work, we identify two different settings of linear weight-sharing layers which motivate two flavours of K-FAC -- K-FAC-expand and K-FAC-reduce. We show that they are exact for deep linear networks with weight-sharing in their respective setting. Notably, K-FAC-reduce is generally faster than K-FAC-expand, which we leverage to speed up automatic hyperparameter selection via optimising the marginal likelihood for a Wide ResNet. Finally, we observe little difference between these two K-FAC variations when using them to train both a graph neural network and a vision transformer. However, both variations are able to reach a fixed validation metric target in a fraction of the number of steps of a first-order reference run, which translates into a comparable improvement in wall-clock time. This highlights the potential of applying K-FAC to modern neural network architectures.
Authors Runa Eschenhagen, Alexander Immer, Richard E Turner, Frank Schneider, Philipp Hennig
Submitted NeurIPS 2023
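The expand/reduce distinction above can be shown schematically: expand treats every (example, shared-position) pair as an independent case when forming the Kronecker factors, whereas reduce collapses the sharing dimension per example first. The NumPy sketch below illustrates only the two pooling choices; the paper's exact scaling conventions may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
N, R, d_in, d_out = 8, 4, 3, 2           # batch, sharing dim (e.g. tokens), widths
a = rng.normal(size=(N, R, d_in))        # layer inputs
g = rng.normal(size=(N, R, d_out))       # backpropagated output gradients

# K-FAC-expand: treat every (example, shared-position) pair as its own case.
A_exp = np.einsum('nri,nrj->ij', a, a) / (N * R)
G_exp = np.einsum('nri,nrj->ij', g, g) / (N * R)

# K-FAC-reduce: collapse the sharing dimension per example before forming factors.
a_red, g_red = a.sum(axis=1), g.sum(axis=1)
A_red = a_red.T @ a_red / N
G_red = g_red.T @ g_red / N

# Either way, the curvature block is approximated by a Kronecker product A (x) G,
# so only factors of size d_in^2 and d_out^2 are stored instead of (d_in*d_out)^2.
print(A_exp.shape, G_exp.shape, A_red.shape, G_red.shape)
```

Since reduce contracts over R before the outer products, it does less work per factor, which is consistent with the abstract's observation that K-FAC-reduce is generally faster.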
Abstract In research areas with scarce data, representation learning plays a significant role. This work aims to enhance representation learning for clinical time series by deriving universal embeddings for clinical features, such as heart rate and blood pressure. We use self-supervised training paradigms for language models to learn high-quality clinical feature embeddings, achieving a finer granularity than existing time-step and patient-level representation learning. We visualize the learnt embeddings via unsupervised dimension reduction techniques and observe a high degree of consistency with prior clinical knowledge. We also evaluate the model performance on the MIMIC-III benchmark and demonstrate the effectiveness of using clinical feature embeddings. We publish our code online for replication.
Authors Yurong Hu, Manuel Burger, Gunnar Rätsch, Rita Kuznetsova
Submitted NeurIPS 2023 Workshop: Self-Supervised Learning - Theory and Practice
Abstract The splicing factor SF3B1 is recurrently mutated in various tumors, including pancreatic ductal adenocarcinoma (PDAC). The impact of the hotspot mutation SF3B1K700E on the PDAC pathogenesis, however, remains elusive. Here, we demonstrate that Sf3b1K700E alone is insufficient to induce malignant transformation of the murine pancreas, but that it increases aggressiveness of PDAC if it co-occurs with mutated KRAS and p53. We further show that Sf3b1K700E already plays a role during early stages of pancreatic tumor progression and reduces the expression of TGF-β1-responsive epithelial–mesenchymal transition (EMT) genes. Moreover, we found that SF3B1K700E confers resistance to TGF-β1-induced cell death in pancreatic organoids and cell lines, partly mediated through aberrant splicing of Map3k7. Overall, our findings demonstrate that SF3B1K700E acts as an oncogenic driver in PDAC, and suggest that it promotes the progression of early stage tumors by impeding the cellular response to tumor suppressive effects of TGF-β.
Authors Simmler, Patrik and Ioannidi, Eleonora I and Mengis, Tamara and Marquart, Kim Fabiano and Asawa, Simran and Van-Lehmann, Kjong and Kahles, Andre and Thomas, Tinu and Schwerdel, Cornelia and Aceto, Nicola and others
Submitted Elife
Abstract Selecting hyperparameters in deep learning greatly impacts its effectiveness but requires manual effort and expertise. Recent works show that Bayesian model selection with Laplace approximations can allow to optimize such hyperparameters just like standard neural network parameters using gradients and on the training data. However, estimating a single hyperparameter gradient requires a pass through the entire dataset, limiting the scalability of such algorithms. In this work, we overcome this issue by introducing lower bounds to the linearized Laplace approximation of the marginal likelihood. In contrast to previous estimators, these bounds are amenable to stochastic-gradient-based optimization and allow to trade off estimation accuracy against computational complexity. We derive them using the function-space form of the linearized Laplace, which can be estimated using the neural tangent kernel. Experimentally, we show that the estimators can significantly accelerate gradient-based hyperparameter optimization.
Authors Alexander Immer, Tycho FA van der Ouderaa, Mark van der Wilk, Gunnar Rätsch, Bernhard Schölkopf
Submitted ICML 2023
Abstract We study the class of location-scale or heteroscedastic noise models (LSNMs), in which the effect Y can be written as a function of the cause X and a noise source N independent of X, which may be scaled by a positive function g over the cause, i.e., Y=f(X)+g(X)N. Despite the generality of the model class, we show the causal direction is identifiable up to some pathological cases. To empirically validate these theoretical findings, we propose two estimators for LSNMs: an estimator based on (non-linear) feature maps, and one based on neural networks. Both model the conditional distribution of Y given X as a Gaussian parameterized by its natural parameters. When the feature maps are correctly specified, we prove that our estimator is jointly concave, and a consistent estimator for the cause-effect identification task. Although the neural network does not inherit those guarantees, it can fit functions of arbitrary complexity, and reaches state-of-the-art performance across benchmarks.
Authors Alexander Immer, Christoph Schultheiss, Julia E Vogt, Bernhard Schölkopf, Peter Bühlmann, Alexander Marx
Submitted ICML 2023
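The model class above can be sampled directly from its definition Y = f(X) + g(X)·N. The sketch below draws data from an LSNM with illustrative choices of f and g; the particular functions are assumptions for the example, not taken from the paper.

```python
import math
import random

def sample_lsnm(n, f, g, seed=0):
    """Draw (X, Y) pairs from the location-scale noise model Y = f(X) + g(X) * N,
    with N standard Gaussian and independent of X."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-2, 2)
        noise = rng.gauss(0, 1)
        data.append((x, f(x) + g(x) * noise))
    return data

# Illustrative f and g (hypothetical): nonlinear mean, strictly positive
# input-dependent noise scale, as the model class requires.
pairs = sample_lsnm(5, f=lambda x: math.sin(x), g=lambda x: 0.1 + x ** 2)
for x, y in pairs:
    print(f"x={x:+.2f}  y={y:+.2f}")
```

A cause-effect estimator of the kind described would then fit such data in both directions (X→Y and Y→X) and pick the direction with the better-behaved residual noise.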
Abstract Graph contrastive learning has shown great promise when labeled data is scarce, but large unlabeled datasets are available. However, it often does not take uncertainty estimation into account. We show that a variational Bayesian neural network approach can be used to improve not only the uncertainty estimates but also the downstream performance on semi-supervised node-classification tasks. Moreover, we propose a new measure of uncertainty for contrastive learning, that is based on the disagreement in likelihood due to different positive samples.
Authors Alexander Möllers, Alexander Immer, Elvin Isufi, Vincent Fortuin
Submitted AABI 2023
Abstract The linearized-Laplace approximation (LLA) has been shown to be effective and efficient in constructing Bayesian neural networks. It is theoretically compelling since it can be seen as a Gaussian process posterior with the mean function given by the neural network's maximum-a-posteriori predictive function and the covariance function induced by the empirical neural tangent kernel. However, while its efficacy has been studied in large-scale tasks like image classification, it has not been studied in sequential decision-making problems like Bayesian optimization where Gaussian processes -- with simple mean functions and kernels such as the radial basis function -- are the de-facto surrogate models. In this work, we study the usefulness of the LLA in Bayesian optimization and highlight its strong performance and flexibility. However, we also present some pitfalls that might arise and a potential problem with the LLA when the search space is unbounded.
Authors Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Vincent Fortuin
Submitted AABI 2023
Abstract Deep neural networks are highly effective but suffer from a lack of interpretability due to their black-box nature. Neural additive models (NAMs) solve this by separating the model into additive sub-networks, revealing the interactions between features and predictions. In this paper, we approach the NAM from a Bayesian perspective in order to quantify the uncertainty in the recovered interactions. Linearized Laplace approximation enables inference of these interactions directly in function space and yields a tractable estimate of the marginal likelihood, which can be used to perform implicit feature selection through an empirical Bayes procedure. Empirically, we show that Laplace-approximated NAMs (LA-NAM) are both more robust to noise and easier to interpret than their non-Bayesian counterpart for tabular regression and classification tasks.
Authors Kouroche Bouchiat, Alexander Immer, Hugo Yèche, Gunnar Rätsch, Vincent Fortuin
Submitted AABI 2023
Abstract The number of published metagenome assemblies is rapidly growing due to advances in sequencing technologies. However, sequencing errors, variable coverage, repetitive genomic regions, and other factors can produce misassemblies, which are challenging to detect for taxonomically novel genomic data. Assembly errors can affect all downstream analyses of the assemblies. Accuracy for the state of the art in reference-free misassembly prediction does not exceed an AUPRC of 0.57, and it is not clear how well these models generalize to real-world data. Here, we present the Residual neural network for Misassembled Contig identification (ResMiCo), a deep learning approach for reference-free identification of misassembled contigs. To develop ResMiCo, we first generated a training dataset of unprecedented size and complexity that can be used for further benchmarking and developments in the field. Through rigorous validation, we show that ResMiCo is substantially more accurate than the state of the art, and the model is robust to novel taxonomic diversity and varying assembly methods. ResMiCo estimated 4.7% misassembled contigs per metagenome across multiple real-world datasets. We demonstrate how ResMiCo can be used to optimize metagenome assembly hyperparameters to improve accuracy, instead of optimizing solely for contiguity. The accuracy, robustness, and ease-of-use of ResMiCo make the tool suitable for general quality control of metagenome assemblies and assembly methodology optimization.
Authors Olga Mineeva, Daniel Danciu, Bernhard Schölkopf, Ruth E. Ley, Gunnar Rätsch, Nicholas D. Youngblut
Submitted PLoS Computational Biology
Abstract Mean-field theory is widely used in theoretical studies of neural networks. In this paper, we analyze the role of depth in the concentration of mean-field predictions for Gram matrices of hidden representations in deep multilayer perceptrons (MLPs) with batch normalization (BN) at initialization. It is postulated that the mean-field predictions suffer from layer-wise errors that amplify with depth. We demonstrate that BN avoids this error amplification with depth. When the chain of hidden representations is rapidly mixing, we establish a concentration bound for a mean-field model of Gram matrices. To our knowledge, this is the first concentration bound that does not become vacuous with depth for standard MLPs with a finite width.
Authors Amir Joudaki, Hadi Daneshmand, Francis Bach
Submitted ICML 2023 (poster)
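The setting above is easy to simulate: propagate a small batch through a randomly initialized MLP with BN and inspect the Gram matrix of hidden representations across depth. The sketch below is purely illustrative (not the paper's code), and the linear → ReLU → BN ordering is one simple choice.

```python
import numpy as np

rng = np.random.default_rng(1)

def batch_norm(h):
    """Per-feature standardization over the batch (no learned scale/shift)."""
    return (h - h.mean(0)) / (h.std(0) + 1e-8)

def gram(h):
    """Batch Gram matrix of hidden representations: G[i, j] = <h_i, h_j> / width."""
    return h @ h.T / h.shape[1]

n, width, depth = 4, 512, 20
h = rng.normal(size=(n, width))
for _ in range(depth):                       # MLP at random initialization
    w = rng.normal(size=(width, width)) / np.sqrt(width)
    h = batch_norm(np.maximum(h @ w, 0.0))   # linear -> ReLU -> BN
print(np.round(gram(h), 2))                  # diagonal stays of order one
```

Without the `batch_norm` call, the diagonal of the Gram matrix typically drifts with depth at finite width; with it, the representations stay on a fixed scale, which is the stabilizing effect the abstract analyzes.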
Abstract Engineered microbes show potential for diagnosing and treating diseases. In this issue of Cell Host & Microbe, Zou et al. develop an “intelligent” bacterial strain that detects and monitors an inflammation biomarker in the gut and responds by releasing an immunomodulator, thereby combining diagnosis and therapy for intestinal inflammation.
Authors Tanmay Tanna, Randall J. Platt
Submitted Cell Host and Microbe
2022
Abstract Understanding deep learning model behavior is critical to accepting machine learning-based decision support systems in the medical community. Previous research has shown that jointly using clinical notes with electronic health record (EHR) data improved predictive performance for patient monitoring in the intensive care unit (ICU). In this work, we explore the underlying reasons for these improvements. While relying on a basic attention-based model to allow for interpretability, we first confirm that performance significantly improves over state-of-the-art EHR data models when combining EHR data and clinical notes. We then provide an analysis showing improvements arise almost exclusively from a subset of notes containing broader context on patient state rather than clinician notes. We believe such findings highlight deep learning models for EHR data to be more limited by partially-descriptive data than by modeling choice, motivating a more data-centric approach in the field.
Authors Severin Husmann, Hugo Yèche, Gunnar Rätsch, Rita Kuznetsova
Submitted Workshop on Learning from Time Series for Health, 36th Conference on Neural Information Processing Systems (NeurIPS 2022)
Abstract Data augmentation is commonly applied to improve performance of deep learning by enforcing the knowledge that certain transformations on the input preserve the output. Currently, the used data augmentation is chosen by human effort and costly cross-validation, which makes it cumbersome to apply to new datasets. We develop a convenient gradient-based method for selecting the data augmentation without validation data and during training of a deep neural network. Our approach relies on phrasing data augmentation as an invariance in the prior distribution and learning it using Bayesian model selection, which has been shown to work in Gaussian processes, but not yet for deep neural networks. We propose a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective, which can be optimised without human supervision or validation data. We show that our method can successfully recover invariances present in the data, and that this improves generalisation and data efficiency on image datasets.
Authors Alexander Immer, Tycho FA van der Ouderaa, Gunnar Rätsch, Vincent Fortuin, Mark van der Wilk
Submitted NeurIPS 2022
Abstract The amount of data stored in genomic sequence databases is growing exponentially, far exceeding traditional indexing strategies' processing capabilities. Many recent indexing methods organize sequence data into a sequence graph to succinctly represent large genomic data sets from reference genome and sequencing read set databases. These methods typically use De Bruijn graphs as the graph model or the underlying index model, with auxiliary graph annotation data structures to associate graph nodes with various metadata. Examples of metadata can include a node's source samples (called labels), genomic coordinates, expression levels, etc. An important property of these graphs is that the set of sequences spelled by graph walks is a superset of the set of input sequences. Thus, when aligning to graphs indexing samples derived from low-coverage sequencing sets, sequence information present in many target samples can compensate for missing information resulting from a lack of sequence coverage. Aligning a query against an entire sequence graph (as in traditional sequence-to-graph alignment) using state-of-the-art algorithms can be computationally intractable for graphs constructed from thousands of samples, potentially searching through many non-biological combinations of samples before converging on the best alignment. To address this problem, we propose a novel alignment strategy called multi-label alignment (MLA) and an algorithm implementing this strategy using annotated De Bruijn graphs within the MetaGraph framework, called MetaGraph-MLA. MLA extends current sequence alignment scoring models with additional label change operations for incorporating mixtures of samples into an alignment, penalizing mixtures that are dissimilar in their sequence content. 
To overcome disconnects in the graph that result from a lack of sequencing coverage, we further extend our graph index to utilize a variable-order De Bruijn graph and introduce node length change as an operation. In this model, traversal between nodes that share a suffix of length < k-1 acts as a proxy for inserting nodes into the graph. MetaGraph-MLA constructs an MLA of a query by chaining single-label alignments using sparse dynamic programming. We evaluate MetaGraph-MLA on simulated data against state-of-the-art sequence-to-graph aligners. We demonstrate increases in alignment lengths for simulated viral Illumina-type (by 6.5%), PacBio CLR-type (by 6.2%), and PacBio CCS-type (by 6.7%) sequencing reads, respectively, and show that the graph walks incorporated into our MLAs originate predominantly from samples of the same strain as the reads' ground-truth samples. We envision MetaGraph-MLA as a step towards establishing sequence graph tools for sequence search against a wide variety of target sequence types.
Authors Harun Mustafa, Mikhail Karasikov, Gunnar Rätsch, André Kahles
Submitted bioRxiv
Authors Sarah Fremond, Sonali Andani, Jurriaan Barkey Wolf, Jouke Dijkstra, Sinéad Melsbach, Jan J. Jobsen, Mariel Brinkhuis, Suzan Roothaan, Ina Jurgenliemk-Schulz, Ludy CHW. Lutgens, Remi A. Nout, Elzbieta M. van der Steen-Banasik, Stephanie M. de Boer, Melanie E. Powell, Naveena Singh, Linda R. Mileshkin, Helen J. Mackay, Alexandra Leary, Hans W. Nijman, Vincent T.H.B.M. Smit, Carien L. Creutzberg, Nanda Horeweg, Viktor H Koelzer, Tjalling Bosse
Submitted The Lancet Digital Health (accepted)
Abstract Methods In a single-center retrospective study of matched pairs of initial and post-therapeutic glioma cases with a recurrence period greater than one year, we performed whole exome sequencing combined with mRNA and microRNA expression profiling to identify processes that are altered in recurrent gliomas. Results Mutational analysis of recurrent gliomas revealed early branching evolution in seventy-five percent of patients. High plasticity was confirmed at the mRNA and miRNA levels. SBS1 signature was reduced and SBS11 was elevated, demonstrating the effect of alkylating agent therapy on the mutational landscape. There was no evidence for secondary genomic alterations driving therapy resistance. ALK7/ACVR1C and LTBP1 were upregulated, whereas LEFTY2 was downregulated, pointing towards enhanced Tumor Growth Factor β (TGF-β) signaling in recurrent gliomas. Consistently, altered microRNA expression profiles pointed towards enhanced Nuclear Factor Kappa B and Wnt signaling that, cooperatively with TGF-β, induces epithelial to mesenchymal transition (EMT), migration and stemness. TGF-β-induced expression of pro-apoptotic proteins and repression of anti-apoptotic proteins were uncoupled in the recurrent tumor. Conclusions Our results suggest an important role of TGF-β signaling in recurrent gliomas. This may have clinical implications, since TGF-β inhibitors have entered clinical phase studies and may potentially be used in combination therapy to interfere with chemoradiation resistance. Recurrent gliomas show high incidence of early branching evolution. High tumor plasticity is confirmed at the level of microRNA and mRNA expression profiles.
Authors Elham Kashani, Désirée Schnidrig, Ali Hashemi Gheinani, Martina Selina Ninck, Philipp Zens, Theoni Maragkou, Ulrich Baumgartner, Philippe Schucht, Gunnar Rätsch, Mark A Rubin, Sabina Berezowska, Charlotte KY Ng, Erik Vassella
Submitted Neuro-oncology
Abstract Background. Glioblastoma (GBM) is the most aggressive primary brain tumor and represents a particular challenge of therapeutic intervention. Methods. In a single-center retrospective study of matched pairs of initial and post-therapeutic GBM cases with a recurrence period greater than one year, we performed whole exome sequencing combined with mRNA and microRNA expression profiling to identify processes that are altered in recurrent GBM. Results. Expression and mutational profiling of recurrent GBM revealed evidence for early branching evolution in seventy-five percent of patients. SBS1 signature was reduced in the recurrent tumor and SBS11 was elevated, demonstrating the effect of alkylating agent therapy on the mutational landscape. There was no evidence for secondary genomic alterations driving therapy resistance. ALK7/ACVR1C and LTBP1 were upregulated, whereas LEFTY2 was downregulated, pointing towards enhanced Tumor Growth Factor β (TGF-β) signaling in the recurrent GBM. Consistently, altered microRNA expression profiles pointed towards enhanced Nuclear Factor Kappa B signaling that, cooperatively with TGF-β, induces epithelial to mesenchymal transition (EMT), migration and stemness. In contrast, TGF-β-induced expression of pro-apoptotic proteins and repression of anti-apoptotic proteins were uncoupled in the recurrent tumor. Conclusions. Our results suggest an important role of TGF-β signaling in recurrent GBM. This may have clinical implications, since TGF-β inhibitors have entered clinical phase studies and may potentially be used in combination therapy to interfere with chemoradiation resistance. Recurrent GBM shows a high incidence of early branching evolution. High tumor plasticity is confirmed at the level of microRNA and mRNA expression profiles.
Authors Elham Kashani, Désirée Schnidrig, Ali Hashemi Gheinani, Martina Selina Ninck, Philipp Zens, Theoni Maragkou, Ulrich Baumgartner, Philippe Schucht, Gunnar Rätsch, Mark Andrew Rubin, Sabina Berezowska, Charlotte KY Ng, Erik Vassella
Submitted Research Square (Preprint Platform)
Abstract Several recently developed single-cell DNA sequencing technologies enable whole-genome sequencing of thousands of cells. However, the ultra-low coverage of the sequenced data (<0.05× per cell) mostly limits their usage to the identification of copy number alterations in multi-megabase segments. Many tumors are not copy number-driven, and thus single-nucleotide variant (SNV)-based subclone detection may contribute to a more comprehensive view on intra-tumor heterogeneity. Due to the low coverage of the data, the identification of SNVs is only possible when superimposing the sequenced genomes of hundreds of genetically similar cells. Thus, we have developed a new approach to efficiently cluster tumor cells based on Bayesian filtering of relevant loci, exploiting read overlap and phasing.
Authors Hana Rozhoňová, Daniel Danciu, Stefan Stark, Gunnar Rätsch, André Kahles, Kjong-Van Lehmann
Submitted Bioinformatics
Abstract The activation of memory T cells is a very rapid and concerted cellular response that requires coordination between cellular processes in different compartments and on different time scales. In this study, we use ribosome profiling and deep RNA sequencing to define the acute mRNA translation changes in CD8 memory T cells following initial activation events. We find that initial translation enables subsequent events of human and mouse T cell activation and expansion. Briefly, early events in the activation of Ag-experienced CD8 T cells are insensitive to transcriptional blockade with actinomycin D, and instead depend on the translation of pre-existing mRNAs and are blocked by cycloheximide. Ribosome profiling identifies ∼92 mRNAs that are recruited into ribosomes following CD8 T cell stimulation. These mRNAs typically have structured GC and pyrimidine-rich 5′ untranslated regions and they encode key regulators of T cell activation and proliferation such as Notch1, Ifngr1, Il2rb, and serine metabolism enzymes Psat1 and Shmt2 (serine hydroxymethyltransferase 2), as well as translation factors eEF1a1 (eukaryotic elongation factor α1) and eEF2 (eukaryotic elongation factor 2). The increased production of receptors of IL-2 and IFN-γ precedes the activation of gene expression and augments cellular signals and T cell activation. Taken together, we identify an early RNA translation program that acts in a feed-forward manner to enable the rapid and dramatic process of CD8 memory T cell expansion and activation.
Authors Darin Salloum, Kamini Singh, Natalie R Davidson, Linlin Cao, David Kuo, Viraj R Sanghvi, Man Jiang, Maria Tello Lafoz, Agnes Viale, Gunnar Ratsch, Hans-Guido Wendel
Submitted The Journal of Immunology
Abstract Mutations in the splicing factor SF3B1 occur frequently in various cancers and drive tumor progression through the activation of cryptic splice sites in multiple genes. Recent studies also demonstrate a positive correlation between the expression levels of wild-type SF3B1 and tumor malignancy. Here, we demonstrate that SF3B1 is a hypoxia-inducible factor (HIF)-1 target gene that positively regulates HIF1 pathway activity. By physically interacting with HIF1α, SF3B1 facilitates binding of the HIF1 complex to hypoxia response elements (HREs) to activate target gene expression. To further validate the relevance of this mechanism for tumor progression, we show that a reduction in SF3B1 levels via monoallelic deletion of Sf3b1 impedes tumor formation and progression via impaired HIF signaling in a mouse model for pancreatic cancer. Our work uncovers an essential role of SF3B1 in HIF1 signaling, thereby providing a potential explanation for the link between high SF3B1 expression and aggressiveness of solid tumors.
Authors Patrik Simmler, Cédric Cortijo, Lisa Maria Koch, Patricia Galliker, Silvia Angori, Hella Anna Bolck, Christina Mueller, Ana Vukolic, Peter Mirtschink, Yann Christinat, Natalie R Davidson, Kjong-Van Lehmann, Giovanni Pellegrini, Chantal Pauli, Daniela Lenggenhager, Ilaria Guccini, Till Ringel, Christian Hirt, Kim Fabiano Marquart, Moritz Schaefer, Gunnar Rätsch, Matthias Peter, Holger Moch, Markus Stoffel, Gerald Schwank
Submitted Cell Reports
RNA Instant Quality Check: Alignment-Free RNA-Degradation Detection
Abstract With the constant increase of large-scale genomic data projects, automated and high-throughput quality assessment becomes a crucial component of any analysis. Whereas small projects often have a more homogeneous design and a manageable structure allowing for a manual per-sample analysis of quality, large-scale studies tend to be much more heterogeneous and complex. Many quality metrics have been developed to assess the quality of an individual sample on the raw read level. Degradation effects are typically assessed based on the RNA integrity (RIN) score, or on post-alignment data. In this study, we show that single commonly used quality criteria such as the RIN score alone are not sufficient to ensure RNA sample quality. We developed a new approach and provide an efficient tool that estimates RNA sample degradation by computing the 5′/3′ bias based on all genes in an alignment-free manner. This enables degradation assessment right after data generation rather than during the analysis procedure, allowing for early intervention in the sample-handling process. Our analysis shows that this strategy is fast, robust to annotation and differences in library size, and provides complementary quality information to RIN scores, enabling the accurate identification of degraded samples.
Authors Kjong-van Lehmann, Andre Kahles, Magdalena Murr, Gunnar Raetsch
Submitted Journal of Computational Biology
Abstract Alternative splicing (AS) is a regulatory process during mRNA maturation that shapes higher eukaryotes’ complex transcriptomes. High-throughput sequencing of RNA (RNA-Seq) allows for measurements of AS transcripts at an unprecedented depth and diversity. The ever-expanding catalog of known AS events provides biological insights into gene regulation, population genetics, and disease. Here, we present an overview of the usage of SplAdder, a graph-based alternative splicing toolbox, which can integrate an arbitrarily large number of RNA-Seq alignments and a given annotation file to augment the shared annotation based on RNA-Seq evidence. The shared augmented annotation graph is then used to identify, quantify, and confirm alternative splicing events based on the RNA-Seq data. Splice graphs for individual alignments can also be tested for significant quantitative differences between samples or groups of samples.
Authors Philipp Markolin, Gunnar Rätsch, André Kahles
Submitted Variant Calling
Abstract Natural microbial communities are phylogenetically and metabolically diverse. In addition to underexplored organismal groups, this diversity encompasses a rich discovery potential for ecologically and biotechnologically relevant enzymes and biochemical compounds. However, studying this diversity to identify genomic pathways for the synthesis of such compounds and assigning them to their respective hosts remains challenging. The biosynthetic potential of microorganisms in the open ocean remains largely uncharted owing to limitations in the analysis of genome-resolved data at the global scale. Here we investigated the diversity and novelty of biosynthetic gene clusters in the ocean by integrating around 10,000 microbial genomes from cultivated and single cells with more than 25,000 newly reconstructed draft genomes from more than 1,000 seawater samples. These efforts revealed approximately 40,000 putative, mostly new, biosynthetic gene clusters, several of which were found in previously unsuspected phylogenetic groups. Among these groups, we identified a lineage rich in biosynthetic gene clusters (‘Candidatus Eudoremicrobiaceae’) that belongs to an uncultivated bacterial phylum and includes some of the most biosynthetically diverse microorganisms in this environment. From these, we characterized the phospeptin and pythonamide pathways, revealing cases of unusual bioactive compound structure and enzymology, respectively. Together, this research demonstrates how microbiomics-driven strategies can enable the investigation of previously undescribed enzymes and natural products in underexplored microbial groups and environments.
Authors Lucas Paoli, Hans-Joachim Ruscheweyh, Clarissa C. Forneris, Florian Hubrich, Satria Kautsar, Agneya Bhushan, Alessandro Lotti, Quentin Clayssen, Guillem Salazar, Alessio Milanese, Charlotte I. Carlström, Chrysa Papadopoulou, Daniel Gehrig, Mikhail Karasikov, Harun Mustafa, Martin Larralde, Laura M. Carroll, Pablo Sánchez, Ahmed A. Zayed, Dylan R. Cronin, Silvia G. Acinas, Peer Bork, Chris Bowler, Tom O. Delmont, Josep M. Gasol, Alvar D. Gossert, Andre Kahles, Matthew B. Sullivan, Patrick Wincker, Georg Zeller, Serina L. Robinson, Jörn Piel, and Shinichi Sunagawa
Submitted Nature
Abstract Decision-making algorithms, in practice, are often trained on data that exhibits a variety of biases. Decision-makers often aim to make decisions based on some ground-truth target that is assumed or expected to be unbiased, i.e., equally distributed across socially salient groups. In many practical settings, the ground-truth cannot be directly observed, and instead, we have to rely on a biased proxy measure of the ground-truth, i.e., biased labels, in the data. In addition, data is often selectively labeled, i.e., even the biased labels are only observed for a small fraction of the data that received a positive decision. To overcome label and selection biases, recent work proposes to learn stochastic, exploring decision policies via i) online training of new policies at each time-step and ii) enforcing fairness as a constraint on performance. However, the existing approach uses only labeled data, disregarding a large amount of unlabeled data, and thereby suffers from high instability and variance in the learned decision policies at different times. In this paper, we propose a novel method based on a variational autoencoder for practical fair decision-making. Our method learns an unbiased data representation leveraging both labeled and unlabeled data and uses the representations to learn a policy in an online process. Using synthetic data, we empirically validate that our method converges to the optimal (fair) policy according to the ground-truth with low variance. In real-world experiments, we further show that our training approach not only offers a more stable learning process but also yields policies with higher fairness as well as utility than previous approaches.
Authors Miriam Rateike, Ayan Majumdar, Olga Mineeva, Krishna P. Gummadi, Isabel Valera
Submitted FAccT '22
Abstract Transcriptional recording by CRISPR spacer acquisition from RNA endows engineered Escherichia coli with synthetic memory, which through Record-seq reveals transcriptome-scale records. Microbial sentinels that traverse the gastrointestinal tract capture a wide range of genes and pathways that describe interactions with the host, including quantitative shifts in the molecular environment that result from alterations in the host diet, induced inflammation, and microbiome complexity. We demonstrate multiplexed recording using barcoded CRISPR arrays, enabling the reconstruction of transcriptional histories of isogenic bacterial strains in vivo. Record-seq therefore provides a scalable, noninvasive platform for interrogating intestinal and microbial physiology throughout the length of the intestine without manipulations to host physiology and can determine how single microbial genetic differences alter the way in which the microbe adapts to the host intestinal environment.
Authors Florian Schmidt, Jakob Zimmermann, Tanmay Tanna, Rick Farouni, Tyrell Conway, Andrew J Macpherson, and Randall J Platt
Submitted Science
Abstract Pre-trained contextual representations have led to dramatic performance improvements on a range of downstream tasks. This has motivated researchers to quantify and understand the linguistic information encoded in them. In general, this is done by probing, which consists of training a supervised model to predict a linguistic property from said representations. Unfortunately, this definition of probing has been subject to extensive criticism, and can lead to paradoxical or counter-intuitive results. In this work, we present a novel framework for probing where the goal is to evaluate the inductive bias of representations for a particular task, and provide a practical avenue to do this using Bayesian inference. We apply our framework to a series of token-, arc-, and sentence-level tasks. Our results suggest that our framework solves problems of previous approaches and that fastText can offer a better inductive bias than BERT in certain situations.
Authors Alexander Immer, Lucas Torroba Hennigen, Vincent Fortuin, Ryan Cotterell
Submitted ACL 2022
Abstract Building models of human decision-making from observed behaviour is critical to better understand, diagnose and support real-world policies such as clinical care. As established policy learning approaches remain focused on imitation performance, they fall short of explaining the demonstrated decision-making process. Policy Extraction through decision Trees (POETREE) is a novel framework for interpretable policy learning, compatible with fully-offline and partially-observable clinical decision environments -- and builds probabilistic tree policies determining physician actions based on patients' observations and medical history. Fully-differentiable tree architectures are grown incrementally during optimization to adapt their complexity to the modelling task, and learn a representation of patient history through recurrence, resulting in decision tree policies that adapt over time with patient information. This policy learning method outperforms the state-of-the-art on real and synthetic medical datasets, both in terms of understanding, quantifying and evaluating observed behaviour as well as in accurately replicating it -- with potential to improve future decision support systems.
Authors Alizée Pace, Alex Chan, Mihaela van der Schaar
Submitted ICLR 2022 (Spotlight)
Abstract Read mapping is a fundamental step in many genomics applications. It is used to identify potential matches and differences between fragments (called reads) of a sequenced genome and an already known genome (called a reference genome). Read mapping is costly because it needs to perform approximate string matching (ASM) on large amounts of data. To address the computational challenges in genome analysis, many prior works propose various approaches such as accurate filters that select the reads within a dataset of genomic reads (called a read set) that must undergo expensive computation, efficient heuristics, and hardware acceleration. While effective at reducing the amount of expensive computation, all such approaches still require the costly movement of a large amount of data from storage to the rest of the system, which can significantly lower the end-to-end performance of read mapping in conventional and emerging genomics systems. We propose GenStore, the first in-storage processing system designed for genome sequence analysis that greatly reduces both data movement and computational overheads of genome sequence analysis by exploiting low-cost and accurate in-storage filters. GenStore leverages hardware/software co-design to address the challenges of in-storage processing, supporting reads with 1) different properties such as read lengths and error rates, which highly depend on the sequencing technology, and 2) different degrees of genetic variation compared to the reference genome, which highly depends on the genomes that are being compared. Through rigorous analysis of read mapping processes of reads with different properties and degrees of genetic variation, we meticulously design low-cost hardware accelerators and data/computation flows inside a NAND flash-based solid-state drive (SSD). 
Our evaluation using a wide range of real genomic datasets shows that GenStore, when implemented in three modern NAND flash-based SSDs, significantly improves the read mapping performance of state-of-the-art software (hardware) baselines by 2.07-6.05× (1.52-3.32×) for read sets with high similarity to the reference genome and 1.45-33.63× (2.70-19.2×) for read sets with low similarity to the reference genome.
Authors Nika Mansouri Ghiasi, Jisung Park, Harun Mustafa, Jeremie Kim, Ataberk Olgun, Arvid Gollwitzer, Damla Senol Cali, Can Firtina, Haiyu Mao, Nour Almadhoun Alserr, Rachata Ausavarungnirun, Nandita Vijaykumar, Mohammed Alser, Onur Mutlu
Submitted ASPLOS 2022
Abstract In recent years, the transformer has established itself as a workhorse in many applications ranging from natural language processing to reinforcement learning. Similarly, Bayesian deep learning has become the gold-standard for uncertainty estimation in safety-critical applications, where robustness and calibration are crucial. Surprisingly, no successful attempts to improve transformer models in terms of predictive uncertainty using Bayesian inference exist. In this work, we study this curiously underpopulated area of Bayesian transformers. We find that weight-space inference in transformers does not work well, regardless of the approximate posterior. We also find that the prior is at least partially at fault, but that it is very hard to find well-specified weight priors for these models. We hypothesize that these problems stem from the complexity of obtaining a meaningful mapping from weight-space to function-space distributions in the transformer. Therefore, moving closer to function-space, we propose a novel method based on the implicit reparameterization of the Dirichlet distribution to apply variational inference directly to the attention weights. We find that this proposed method performs competitively with our baselines.
Authors Tristan Cinquin, Alexander Immer, Max Horn, Vincent Fortuin
Submitted AABI 2022
Abstract Particle-based approximate Bayesian inference approaches such as Stein Variational Gradient Descent (SVGD) combine the flexibility and convergence guarantees of sampling methods with the computational benefits of variational inference. In practice, SVGD relies on the choice of an appropriate kernel function, which impacts its ability to model the target distribution -- a challenging problem with only heuristic solutions. We propose Neural Variational Gradient Descent (NVGD), which is based on parameterizing the witness function of the Stein discrepancy by a deep neural network whose parameters are learned in parallel to the inference, mitigating the necessity to make any kernel choices whatsoever. We empirically evaluate our method on popular synthetic inference problems, real-world Bayesian linear regression, and Bayesian neural network inference.
Authors Lauro Langosco di Langosco, Vincent Fortuin, Heiko Strathmann
Submitted AABI 2022
Abstract Quantum machine learning promises great speedups over classical algorithms, but it often requires repeated computations to achieve a desired level of accuracy for its point estimates. Bayesian learning focuses more on sampling from posterior distributions than on point estimation, thus it might be more forgiving in the face of additional quantum noise. We propose a quantum algorithm for Bayesian neural network inference, drawing on recent advances in quantum deep learning, and simulate its empirical performance on several tasks. We find that already for small numbers of qubits, our algorithm approximates the true posterior well, while it does not require any repeated computations and thus fully realizes the quantum speedups.
Authors Noah Berner, Vincent Fortuin, Jonas Landman
Submitted AABI 2022
Abstract Complex multivariate time series arise in many fields, ranging from computer vision to robotics or medicine. Often we are interested in the independent underlying factors that give rise to the high-dimensional data we are observing. While many models have been introduced to learn such disentangled representations, only few attempt to explicitly exploit the structure of sequential data. We investigate the disentanglement properties of Gaussian process variational autoencoders, a class of models recently introduced that have been successful in different tasks on time series data. Our model exploits the temporal structure of the data by modeling each latent channel with a GP prior and employing a structured variational distribution that can capture dependencies in time. We demonstrate the competitiveness of our approach against state-of-the-art unsupervised and weakly-supervised disentanglement methods on a benchmark task. Moreover, we provide evidence that we can learn meaningful disentangled representations on real-world medical time series data.
Authors Simon Bing, Vincent Fortuin, Gunnar Rätsch
Submitted AABI 2022
Authors Patrick Forny, Ximena Bonilla, David Lamparter, Wenguang Shao, Tanja Plessl, Caroline Frei, Anna Bingisser, Sandra Goetze, Audrey Van Drogen, Keith Harshman, and others
Submitted Molecular Genetics and Metabolism
Abstract Multi-layered omics technologies can help define relationships between genetic factors, biochemical processes and phenotypes, thus extending research of monogenic diseases beyond identifying their cause. We implemented a multi-layered omics approach for the inherited metabolic disorder methylmalonic aciduria. We performed whole genome sequencing, transcriptomic sequencing, and mass spectrometry-based proteotyping from matched primary fibroblast samples of 230 individuals (210 affected, 20 controls) and related the molecular data to 105 phenotypic features. Integrative analysis identified a molecular diagnosis for 84% (179/210) of affected individuals, the majority (150) of whom had pathogenic variants in methylmalonyl-CoA mutase (MMUT). Untargeted integration of all three omics layers revealed dysregulation of the TCA cycle and surrounding metabolic pathways, a finding that was further supported by multi-organ metabolomics of a hemizygous Mmut mouse model. Stratification by phenotypic severity indicated downregulation of oxoglutarate dehydrogenase and upregulation of glutamate dehydrogenase in disease. This was supported by metabolomics and isotope tracing studies which showed increased glutamine-derived anaplerosis. We further identified MMUT to physically interact with both oxoglutarate dehydrogenase and glutamate dehydrogenase, providing a mechanistic link. This study emphasizes the utility of a multi-modal omics approach to investigate metabolic diseases and highlights glutamine anaplerosis as a potential therapeutic intervention point in methylmalonic aciduria.
Authors Patrick Forny, Ximena Bonilla, David Lamparter, Wenguang Shao, Tanja Plessl, Caroline Frei, Anna Bingisser, Sandra Goetze, Audrey van Drogen, Keith Harshmann, Patrick GA Pedrioli, Cedric Howald, Florian Traversi, Sarah Cherkaoui, Raphael J Morscher, Luke Simmons, Merima Forny, Ioannis Xenarios, Ruedi Aebersold, Nicola Zamboni, Gunnar Rätsch, Emmanouil Dermitzakis, Bernd Wollscheid, Matthias R Baumgartner, D Sean Froese
Submitted medRxiv
Abstract We propose a stochastic conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms. Existing CGM variants for this template either suffer from slow convergence rates, or require carefully increasing the batch size over the course of the algorithm’s execution, which leads to computing full gradients. In contrast, the proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques. In applications we put special emphasis on problems with a large number of separable constraints. Such problems are prevalent among semidefinite programming (SDP) formulations arising in machine learning and theoretical computer science. We provide numerical experiments on matrix completion, unsupervised clustering, and sparsest-cut SDPs.
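The key ingredient here — a conditional gradient (Frank-Wolfe) step driven by a stochastic average gradient (SAG) estimator that touches only one sample per iteration — can be sketched on a toy problem. The following is our own illustrative toy (a least-squares finite sum over an ℓ1 ball), not the authors' code; the function name and problem choice are assumptions:

```python
import numpy as np

def sag_frank_wolfe(A, b, radius=1.0, iters=3000, seed=0):
    """Stochastic Frank-Wolfe sketch for min (1/2n)||Ax - b||^2 over an
    l1 ball, with a SAG gradient estimator: one sampled per-example
    gradient per iteration, plus a table of the last gradient seen for
    each sample whose running average drives the linear oracle."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    grad_table = np.zeros((n, d))   # last seen per-sample gradients
    grad_avg = np.zeros(d)          # running average of the table
    for t in range(1, iters + 1):
        i = rng.integers(n)
        g_i = A[i] * (A[i] @ x - b[i])          # grad of 0.5*(a_i.x - b_i)^2
        grad_avg += (g_i - grad_table[i]) / n   # O(d) average update
        grad_table[i] = g_i
        # Linear minimization oracle over the l1 ball: a signed vertex
        j = np.argmax(np.abs(grad_avg))
        v = np.zeros(d)
        v[j] = -radius * np.sign(grad_avg[j])
        gamma = 2.0 / (t + 2)                   # classic FW step size
        x = (1 - gamma) * x + gamma * v
    return x
```

Note the iterate stays inside the constraint set by construction (a convex combination of feasible points), which is the appeal of CGM over projected methods.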
Authors Gideon Dresdner, Maria-Luiza Vladarean, Gunnar Rätsch, Francesco Locatello, Volkan Cevher, Alp Yurtsever
Submitted Proceedings of The 25th International Conference on Artificial Intelligence and Statistics (AISTATS-22)
2021
Abstract Deep ensembles have recently gained popularity in the deep learning community for their conceptual simplicity and efficiency. However, maintaining functional diversity between ensemble members that are independently trained with gradient descent is challenging. This can lead to pathologies when adding more ensemble members, such as a saturation of the ensemble performance, which converges to the performance of a single model. Moreover, this not only affects the quality of its predictions, but even more so the uncertainty estimates of the ensemble, and thus its performance on out-of-distribution data. We hypothesize that this limitation can be overcome by discouraging different ensemble members from collapsing to the same function. To this end, we introduce a kernelized repulsive term in the update rule of the deep ensembles. We show that this simple modification not only enforces and maintains diversity among the members but, even more importantly, transforms the maximum a posteriori inference into proper Bayesian inference. Namely, we show that the training dynamics of our proposed repulsive ensembles follow a Wasserstein gradient flow of the KL divergence with the true posterior. We study repulsive terms in weight and function space and empirically compare their performance to standard ensembles and Bayesian baselines on synthetic and real-world prediction tasks.
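One plausible form of such a kernelized repulsive update (our own sketch with an RBF kernel and a simple normalization, not necessarily the paper's exact rule) is: each member follows its log-posterior gradient plus a term pushing it away from nearby members.

```python
import numpy as np

def repulsive_ensemble_step(theta, grad_log_post, lengthscale=1.0, lr=0.05):
    """One toy update of a kernelized repulsive ensemble.
    theta: (m, d) array of m ensemble members' parameters.
    Members ascend the log posterior while an RBF-kernel term repels
    them from close-by members, preventing collapse to a single mode."""
    diff = theta[:, None, :] - theta[None, :, :]          # (m, m, d)
    sq = (diff ** 2).sum(-1)
    K = np.exp(-sq / (2 * lengthscale ** 2))              # RBF kernel matrix
    # Repulsion: kernel-weighted push away from neighbors, normalized
    rep = (K[:, :, None] * diff).sum(1) / lengthscale ** 2
    rep /= K.sum(1, keepdims=True)
    return theta + lr * (grad_log_post(theta) + rep)
```

Without the `rep` term, all members would drift to the same MAP point; with it, the ensemble retains spread that can mimic posterior samples.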
Authors Francesco D'Angelo, Vincent Fortuin
Submitted NeurIPS 2021 (spotlight)
Abstract Bayesian formulations of deep learning have been shown to have compelling theoretical properties and offer practical functional benefits, such as improved predictive uncertainty quantification and model selection. The Laplace approximation (LA) is a classic, and arguably the simplest family of approximations for the intractable posteriors of deep neural networks. Yet, despite its simplicity, the LA is not as popular as alternatives like variational Bayes or deep ensembles. This may be due to assumptions that the LA is expensive due to the involved Hessian computation, that it is difficult to implement, or that it yields inferior results. In this work we show that these are misconceptions: we (i) review the range of variants of the LA including versions with minimal cost overhead; (ii) introduce "laplace", an easy-to-use software library for PyTorch offering user-friendly access to all major flavors of the LA; and (iii) demonstrate through extensive experiments that the LA is competitive with more popular alternatives in terms of performance, while excelling in terms of computational cost. We hope that this work will serve as a catalyst to a wider adoption of the LA in practical deep learning, including in domains where Bayesian approaches are not typically considered at the moment.
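The LA recipe itself fits in a few lines: find the MAP, take the Hessian of the negative log posterior there, and use the resulting Gaussian for predictions. The following toy logistic-regression version is our own sketch (plain NumPy, not the `laplace` library's API):

```python
import numpy as np

def laplace_logistic(X, y, prior_prec=1.0, iters=200, lr=0.1):
    """Laplace approximation for Bayesian logistic regression (sketch).
    Gradient ascent to the MAP weights, then a Gaussian posterior
    N(w_map, H^{-1}) with H the negative log-posterior Hessian there."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (y - p) - prior_prec * w     # grad of log posterior
        w += lr * grad / n
    p = 1.0 / (1.0 + np.exp(-X @ w))
    H = X.T @ (X * (p * (1 - p))[:, None]) + prior_prec * np.eye(d)
    return w, np.linalg.inv(H)                    # posterior mean, covariance

def predictive_prob(x, w_map, cov, n_samples=1000, seed=0):
    """MC predictive: average the sigmoid over posterior weight samples."""
    rng = np.random.default_rng(seed)
    ws = rng.multivariate_normal(w_map, cov, size=n_samples)
    return np.mean(1.0 / (1.0 + np.exp(-(ws @ x))))
```

Averaging over weight samples (rather than plugging in `w_map`) is what pulls confident predictions back toward 0.5 away from the data, i.e., the uncertainty-quantification benefit the abstract refers to.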
Authors Erik Daxberger, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, Philipp Hennig
Submitted NeurIPS 2021
Abstract High-throughput sequencing data is rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in solving the experiment discovery problem and building compressed representations of annotated de Bruijn graphs where k-mer sets can be efficiently indexed and interactively queried. However, approaches for representing and retrieving other quantitative attributes such as gene expression or genome positions in a general manner have yet to be developed. In this work, we propose the concept of Counting de Bruijn graphs generalizing the notion of annotated (or colored) de Bruijn graphs. Counting de Bruijn graphs supplement each node-label relation with one or many attributes (e.g., a k-mer count or its positions in genome). To represent them, we first observe that many schemes for the representation of compressed binary matrices already support the rank operation on the columns or rows, which can be used to define an inherent indexing of any additional quantitative attributes. Based on this property, we generalize these schemes and introduce a new approach for representing non-binary sparse matrices in compressed data structures. Finally, we notice that relation attributes are often easily predictable from a node’s local neighborhood in the graph. Notable examples are genome positions shifting by 1 for neighboring nodes in the graph, or expression levels that are often shared across neighbors. We exploit this regularity of graph annotations and apply an invertible delta-like coding to achieve better compression. We show that Counting de Bruijn graphs index k-mer counts from 2,652 human RNA-Seq read sets in representations over 8-fold smaller and yet faster to query compared to state-of-the-art bioinformatics tools. 
Furthermore, Counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip -9 for human Illumina RNA-Seq and 57% smaller for PacBio HiFi sequencing of viral samples. A complete joint searchable index of all viral PacBio SMRT reads from NCBI’s SRA (152,884 read sets, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, they generate a lossless and fully queryable index that is 4.4-fold smaller compared to the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools employing de Bruijn graphs and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed and fully searchable graph-based sequence indexes.
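The rank-based indexing idea — a set bit's rank within its column directly addresses that relation's attributes, so no explicit (row, column) → attribute map is stored — can be illustrated with a small dense toy. This is our sketch only; real implementations use succinct bitvectors with O(1) rank, not NumPy arrays, and the class and method names are hypothetical:

```python
import numpy as np

class CountingMatrix:
    """Toy rank-based attribute indexing for a sparse binary matrix.
    Each set bit (node, label) stores its attribute (e.g., a k-mer
    count) in a per-column array; rank over the column maps the bit
    to its slot in that array."""
    def __init__(self, num_rows, num_cols):
        self.bits = np.zeros((num_rows, num_cols), dtype=bool)
        self.attrs = [[] for _ in range(num_cols)]

    def set(self, row, col, attr):
        # This toy requires insertion in increasing row order per column;
        # succinct bitvector constructions have no such restriction.
        self.bits[row, col] = True
        self.attrs[col].append(attr)

    def rank(self, row, col):
        """Number of set bits in column `col` up to and including `row`."""
        return int(self.bits[: row + 1, col].sum())

    def get(self, row, col):
        if not self.bits[row, col]:
            return None
        return self.attrs[col][self.rank(row, col) - 1]
```

Because many binary-matrix compression schemes already support column rank, the same trick layers quantitative annotations on top of them essentially for free.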
Authors Mikhail Karasikov, Harun Mustafa, Gunnar Rätsch, André Kahles
Submitted RECOMB 2022
Abstract Uncertainty estimation in deep learning has recently emerged as a crucial area of interest to advance reliability and robustness in safety-critical applications. While there have been many proposed methods that either focus on distance-aware model uncertainties for out-of-distribution detection or on input-dependent label uncertainties for in-distribution calibration, both of these types of uncertainty are often necessary. In this work, we propose the HetSNGP method for jointly modeling the model and data uncertainty. We show that our proposed model affords a favorable combination between these two complementary types of uncertainty and thus outperforms the baseline methods on some challenging out-of-distribution datasets, including CIFAR-100C, Imagenet-C, and Imagenet-A. Moreover, we propose HetSNGP Ensemble, an ensembled version of our method which adds an additional type of uncertainty and also outperforms other ensemble baselines.
Authors Vincent Fortuin, Mark Collier, Florian Wenzel, James Allingham, Jeremiah Liu, Dustin Tran, Balaji Lakshminarayanan, Jesse Berent, Rodolphe Jenatton, Effrosyni Kokiopoulou
Submitted arXiv Preprints
Abstract Machine learning models based on the aggregated outputs of submodels, either at the activation or prediction levels, lead to strong performance. We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixture of experts (sparse MoEs). First, we show that these two approaches have complementary features whose combination is beneficial. Then, we present partitioned batch ensembles, an efficient ensemble of sparse MoEs that takes the best of both classes of models. Extensive experiments on fine-tuned vision transformers demonstrate the accuracy, log-likelihood, few-shot learning, robustness, and uncertainty calibration improvements of our approach over several challenging baselines. Partitioned batch ensembles not only scale to models with up to 2.7B parameters, but also provide larger performance gains for larger models.
Authors James Urquhart Allingham, Florian Wenzel, Zelda E Mariet, Basil Mustafa, Joan Puigcerver, Neil Houlsby, Ghassen Jerfel, Vincent Fortuin, Balaji Lakshminarayanan, Jasper Snoek, Dustin Tran, Carlos Riquelme Ruiz, Rodolphe Jenatton
Submitted arXiv Preprints
Abstract Marginal-likelihood based model-selection, even though promising, is rarely used in deep learning due to estimation difficulties. Instead, most approaches rely on validation data, which may not be readily available. In this work, we present a scalable marginal-likelihood estimation method to select both hyperparameters and network architectures, based on the training data alone. Some hyperparameters can be estimated online during training, simplifying the procedure. Our marginal-likelihood estimate is based on Laplace's method and Gauss-Newton approximations to the Hessian, and it outperforms cross-validation and manual-tuning on standard regression and image classification datasets, especially in terms of calibration and out-of-distribution detection. Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable (e.g., in nonstationary settings).
Authors Alexander Immer, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, Mohammad Emtiyaz Khan
Submitted ICML 2021
Abstract We propose a novel Bayesian neural network architecture that can learn invariances from data alone by inferring a posterior distribution over different weight-sharing schemes. We show that our model outperforms other non-invariant architectures, when trained on datasets that contain specific invariances. The same holds true when no data augmentation is performed.
Authors Nikolaos Mourdoukoutas, Marco Federici, Georges Pantalos, Mark van der Wilk, Vincent Fortuin
Submitted arXiv Preprints
Abstract Meta-learning can successfully acquire useful inductive biases from data, especially when a large number of meta-tasks are available. Yet, its generalization properties to unseen tasks are poorly understood. Particularly if the number of meta-tasks is small, this raises concerns for potential overfitting. We provide a theoretical analysis using the PAC-Bayesian framework and derive novel generalization bounds for meta-learning with unbounded loss functions and Bayesian base learners. Using these bounds, we develop a class of PAC-optimal meta-learning algorithms with performance guarantees and a principled meta-regularization. When instantiating our PAC-optimal hyper-posterior (PACOH) with Gaussian processes as base learners, the resulting approach consistently outperforms several popular meta-learning methods, both in terms of predictive accuracy and the quality of its uncertainty estimates.
Authors Jonas Rothfuss, Vincent Fortuin, Martin Josifoski, Andreas Krause
Submitted ICML 2021
Abstract We introduce a new algorithm for finding stable matchings in multi-sided matching markets. Our setting is motivated by a PhD market of students, advisors, and co-advisors, and can be generalized to supply chain networks viewed as n-sided markets. In the three-sided PhD market, students primarily care about advisors and then about co-advisors (consistent preferences), while advisors and co-advisors have preferences over students only (hence they are cooperative). A student must be matched to one advisor and one co-advisor, or not at all. In contrast to previous work, advisor-student and student-co-advisor pairs may not be mutually acceptable (e.g., a student may not want to work with an advisor or co-advisor and vice versa). We show that three-sided stable matchings always exist, and present an algorithm that, in time quadratic in the market size (up to log factors), finds a three-sided stable matching using any two-sided stable matching algorithm as matching engine. We illustrate the challenges that arise when not all advisor-co-advisor pairs are compatible. We then generalize our algorithm to n-sided markets with quotas and show how they can model supply chain networks. Finally, we show that our algorithm outperforms the baseline of [Danilov, 2003] on a synthetic dataset, both in producing a stable matching and in the number of matches.
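The "two-sided matching engine" the algorithm invokes as a subroutine can be any stable matching algorithm; the classic choice is Gale-Shapley deferred acceptance, sketched here (our generic implementation, not the paper's three-sided algorithm itself; complete preference lists on the receiving side are assumed):

```python
def gale_shapley(prop_prefs, recv_prefs):
    """Two-sided deferred acceptance (Gale-Shapley).
    prop_prefs[p]: proposer p's ranked list of receivers;
    recv_prefs[r]: receiver r's ranked list of proposers (complete).
    Returns a stable matching as a dict proposer -> receiver."""
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in recv_prefs.items()}
    next_choice = {p: 0 for p in prop_prefs}
    engaged = {}                       # receiver -> proposer
    free = list(prop_prefs)
    while free:
        p = free.pop()
        if next_choice[p] >= len(prop_prefs[p]):
            continue                   # p exhausted their list: unmatched
        r = prop_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if r not in engaged:
            engaged[r] = p             # r tentatively accepts p
        elif rank[r][p] < rank[r][engaged[r]]:
            free.append(engaged[r])    # r prefers p; dump current partner
            engaged[r] = p
        else:
            free.append(p)             # r rejects p; p proposes again later
    return {p: r for r, p in engaged.items()}
```

Each proposer proposes at most once per receiver, so the engine runs in time linear in the total length of the preference lists.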
Authors Maximilian Mordig, Riccardo Della Vecchia, Nicolò Cesa-Bianchi, Bernhard Schölkopf
Submitted arXiv
Abstract Clustering high-dimensional data, such as images or biological measurements, is a long-standing problem and has been studied extensively. Recently, Deep Clustering gained popularity due to its flexibility in fitting the specific peculiarities of complex data. Here we introduce the Mixture-of-Experts Similarity Variational Autoencoder (MoE-Sim-VAE), a novel generative clustering model. The model can learn multi-modal distributions of high-dimensional data and use these to generate realistic data with high efficacy and efficiency. MoE-Sim-VAE is based on a Variational Autoencoder (VAE), where the decoder consists of a Mixture-of-Experts (MoE) architecture. This specific architecture allows for various modes of the data to be automatically learned by means of the experts. Additionally, we encourage the lower dimensional latent representation of our model to follow a Gaussian mixture distribution and to accurately represent the similarities between the data points. We assess the performance of our model on the MNIST benchmark data set and a challenging real-world task of defining cell subpopulations from mass cytometry (CyTOF) measurements on hundreds of different datasets. MoE-Sim-VAE exhibits superior clustering performance on all these tasks in comparison to the baselines as well as competitor methods and we show that the MoE architecture in the decoder reduces the computational cost of sampling specific data modes with high fidelity.
Authors Andreas Kopf, Vincent Fortuin, Vignesh Ram Somnath, Manfred Claassen
Submitted PLOS Computational Biology
Authors Patrik T Simmler, Tamara Mengis, Kjong-Van Lehmann, André Kahles, Tinu Thomas, Gunnar Rätsch, Markus Stoffel, Gerald Schwank
Submitted bioRxiv
Abstract Bayesian neural networks that incorporate data augmentation implicitly use a "randomly perturbed log-likelihood [which] does not have a clean interpretation as a valid likelihood function" (Izmailov et al. 2021). Here, we provide several approaches to developing principled Bayesian neural networks incorporating data augmentation. We introduce a "finite orbit" setting which allows likelihoods to be computed exactly, and give tight multi-sample bounds in the more usual "full orbit" setting. These models cast light on the origin of the cold posterior effect. In particular, we find that the cold posterior effect persists even in these principled models incorporating data augmentation. This suggests that the cold posterior effect cannot be dismissed as an artifact of data augmentation using incorrect likelihoods.
Authors Seth Nabarro, Stoil Ganev, Adrià Garriga-Alonso, Vincent Fortuin, Mark van der Wilk, Laurence Aitchison
Submitted arXiv Preprints
Abstract Intensive care units (ICU) are increasingly looking towards machine learning for methods to provide online monitoring of critically ill patients. In machine learning, online monitoring is often formulated as a supervised learning problem. Recently, contrastive learning approaches have demonstrated promising improvements over competitive supervised benchmarks. These methods rely on well-understood data augmentation techniques developed for image data which do not apply to online monitoring. In this work, we overcome this limitation by supplementing time-series data augmentation techniques with a novel contrastive learning objective which we call neighborhood contrastive learning (NCL). Our objective explicitly groups together contiguous time segments from each patient while maintaining state-specific information. Our experiments demonstrate a marked improvement over existing work applying contrastive methods to medical time-series.
Authors Hugo Yèche, Gideon Dresdner, Francesco Locatello, Matthias Hüser, Gunnar Rätsch
Submitted ICML 2021
Abstract Ensembles of deep neural networks have achieved great success recently, but they do not offer a proper Bayesian justification. Moreover, while they allow for averaging of predictions over several hypotheses, they do not provide any guarantees for their diversity, leading to redundant solutions in function space. In contrast, particle-based inference methods, such as Stein variational gradient descent (SVGD), offer a Bayesian framework, but rely on the choice of a kernel to measure the similarity between ensemble members. In this work, we study different SVGD methods operating in the weight space, function space, and in a hybrid setting. We compare the SVGD approaches to other ensembling-based methods in terms of their theoretical properties and assess their empirical performance on synthetic and real-world tasks. We find that SVGD using functional and hybrid kernels can overcome the limitations of deep ensembles. It improves on functional diversity and uncertainty estimation and approaches the true Bayesian posterior more closely. Moreover, we show that using stochastic SVGD updates, as opposed to the standard deterministic ones, can further improve the performance.
Authors Francesco D'Angelo, Vincent Fortuin, Florian Wenzel
Submitted arXiv Preprints
Abstract We present a global atlas of 4,728 metagenomic samples from mass-transit systems in 60 cities over 3 years, representing the first systematic, worldwide catalog of the urban microbial ecosystem. This atlas provides an annotated, geospatial profile of microbial strains, functional characteristics, antimicrobial resistance (AMR) markers, and genetic elements, including 10,928 viruses, 1,302 bacteria, 2 archaea, and 838,532 CRISPR arrays not found in reference databases. We identified 4,246 known species of urban microorganisms and a consistent set of 31 species found in 97% of samples that were distinct from human commensal organisms. Profiles of AMR genes varied widely in type and density across cities. Cities showed distinct microbial taxonomic signatures that were driven by climate and geographic differences. These results constitute a high-resolution global metagenomic atlas that enables discovery of organisms and genes, highlights potential public health and forensic applications, and provides a culture-independent view of AMR burden in cities.
Authors David Danko, Daniela Bezdan, Evan E. Afshin, Sofia Ahsanuddin, Chandrima Bhattacharya, Daniel J. Butler, Kern Rei Chng, Daisy Donnellan, Jochen Hecht, Katelyn Jackson, Katerina Kuchin, Mikhail Karasikov, Abigail Lyons, Lauren Mak, Dmitry Meleshko, Harun Mustafa, Beth Mutai, Russell Y. Neches, Amanda Ng, Olga Nikolayeva, Tatyana Nikolayeva, Eileen Png, Krista A. Ryon, Jorge L. Sanchez, Heba Shaaban, Maria A. Sierra, Dominique Thomas, Ben Young, Omar O. Abudayyeh, Josue Alicea, Malay Bhattacharyya, Ran Blekhman, Eduardo Castro-Nallar, Ana M. Cañas, Aspassia D. Chatziefthimiou, Robert W. Crawford, Francesca De Filippis, Youping Deng, Christelle Desnues, Emmanuel Dias-Neto, Marius Dybwad, Eran Elhaik, Danilo Ercolini, Alina Frolova, Dennis Gankin, Jonathan S. Gootenberg, Alexandra B. Graf, David C. Green, Iman Hajirasouliha, Jaden J.A. Hastings, Mark Hernandez, Gregorio Iraola, Soojin Jang, Andre Kahles, Frank J. Kelly, Kaymisha Knights, Nikos C. Kyrpides, Paweł P. Łabaj, Patrick K.H. Lee, Marcus H.Y. Leung, Per O. Ljungdahl, Gabriella Mason-Buck, Ken McGrath, Cem Meydan, Emmanuel F. Mongodin, Milton Ozorio Moraes, Niranjan Nagarajan, Marina Nieto-Caballero, Houtan Noushmehr, Manuela Oliveira, Stephan Ossowski, Olayinka O. Osuolale, Orhan Özcan, David Paez-Espino, Nicolás Rascovan, Hugues Richard, Gunnar Rätsch, Lynn M. Schriml, Torsten Semmler, Osman U. Sezerman, Leming Shi, Tieliu Shi, Rania Siam, Le Huu Song, Haruo Suzuki, Denise Syndercombe Court, Scott W. Tighe, Xinzhao Tong, Klas I. Udekwu, Juan A. Ugalde, Brandon Valentine, Dimitar I. Vassilev, Elena M. Vayndorf, Thirumalaisamy P. Velavan, Jun Wu, María M. Zambrano, Jifeng Zhu, Sibo Zhu, Christopher E. Mason, The International MetaSUB Consortium
Submitted Cell
Abstract Bayesian neural networks have shown great promise in many applications where calibrated uncertainty estimates are crucial and can often also lead to a higher predictive performance. However, it remains challenging to choose a good prior distribution over their weights. While isotropic Gaussian priors are often chosen in practice due to their simplicity, they do not reflect our true prior beliefs well and can lead to suboptimal performance. Our new library, BNNpriors, enables state-of-the-art Markov Chain Monte Carlo inference on Bayesian neural networks with a wide range of predefined priors, including heavy-tailed ones, hierarchical ones, and mixture priors. Moreover, it follows a modular approach that eases the design and implementation of new custom priors. It has facilitated foundational discoveries on the nature of the cold posterior effect in Bayesian neural networks and will hopefully catalyze future research as well as practical applications in this area.
Authors Vincent Fortuin, Adrià Garriga-Alonso, Mark van der Wilk, Laurence Aitchison
Submitted Software Impacts
Abstract The development of respiratory failure is common among patients in intensive care units (ICU). Large data quantities from ICU patient monitoring systems make timely and comprehensive analysis by clinicians difficult but are ideal for automatic processing by machine learning algorithms. Early prediction of respiratory system failure could alert clinicians to patients at risk of respiratory failure and allow for early patient reassessment and treatment adjustment. We propose an early warning system that predicts moderate/severe respiratory failure up to 8 hours in advance. Our system was trained on HiRID-II, a data-set containing more than 60,000 admissions to a tertiary care ICU. An alarm is typically triggered several hours before the beginning of respiratory failure. Our system outperforms a clinical baseline mimicking traditional clinical decision-making based on pulse-oximetric oxygen saturation and the fraction of inspired oxygen. To provide model introspection and diagnostics, we developed an easy-to-use web browser-based system to explore model input data and predictions visually.
Authors Matthias Hüser, Martin Faltys, Xinrui Lyu, Chris Barber, Stephanie L. Hyland, Thomas M. Merz, Gunnar Rätsch
Submitted arXiv Preprints
Abstract With a mortality rate of 5.4 million lives worldwide every year and a healthcare cost of more than 16 billion dollars in the USA alone, sepsis is one of the leading causes of hospital mortality and an increasing concern in the ageing western world. Recently, medical and technological advances have helped re-define the illness criteria of this disease, which is otherwise poorly understood by the medical society. Together with the rise of widely accessible Electronic Health Records, the advances in data mining and complex nonlinear algorithms are a promising avenue for the early detection of sepsis. This work contributes to the research effort in the field of automated sepsis detection with an open-access labelling of the medical MIMIC-III data set. Moreover, we propose MGP-AttTCN: a joint multitask Gaussian Process and attention-based deep learning model to early predict the occurrence of sepsis in an interpretable manner. We show that our model outperforms the current state-of-the-art and present evidence that different labelling heuristics lead to discrepancies in task difficulty.
Authors Margherita Rosnati, Vincent Fortuin
Submitted PLOS One
Abstract While the choice of prior is one of the most critical parts of the Bayesian inference workflow, recent Bayesian deep learning models have often fallen back on vague priors, such as standard Gaussians. In this review, we highlight the importance of prior choices for Bayesian deep learning and present an overview of different priors that have been proposed for (deep) Gaussian processes, variational autoencoders, and Bayesian neural networks. We also outline different methods of learning priors for these models from data. We hope to motivate practitioners in Bayesian deep learning to think more carefully about the prior specification for their models and to provide them with some inspiration in this regard.
Authors Vincent Fortuin
Submitted arXiv Preprints
Abstract Pancreatic adenocarcinoma (PDAC) epitomizes a deadly cancer driven by abnormal KRAS signaling. Here, we show that the eIF4A RNA helicase is required for translation of key KRAS signaling molecules and that pharmacological inhibition of eIF4A has single-agent activity against murine and human PDAC models at safe dose levels. EIF4A was uniquely required for the translation of mRNAs with long and highly structured 5′ untranslated regions, including those with multiple G-quadruplex elements. Computational analyses identified these features in mRNAs encoding KRAS and key downstream molecules. Transcriptome-scale ribosome footprinting accurately identified eIF4A-dependent mRNAs in PDAC, including critical KRAS signaling molecules such as PI3K, RALA, RAC2, MET, MYC, and YAP1. These findings contrast with a recent study that relied on an older method, polysome fractionation, and implicated redox-related genes as eIF4A clients. Together, our findings highlight the power of ribosome footprinting in conjunction with deep RNA sequencing in accurately decoding translational control mechanisms and define the therapeutic mechanism of eIF4A inhibitors in PDAC.
Authors Kamini Singh, Jianan Lin, Nicolas Lecomte, Prathibha Mohan, Askan Gokce, Viraj R Sanghvi, Man Jiang, Olivera Grbovic-Huezo, Antonija Burčul, Stefan G Stark, Paul B Romesser, Qing Chang, Jerry P Melchor, Rachel K Beyer, Mark Duggan, Yoshiyuki Fukase, Guangli Yang, Ouathek Ouerfelli, Agnes Viale, Elisa De Stanchina, Andrew W Stamford, Peter T Meinke, Gunnar Rätsch, Steven D Leach, Zhengqing Ouyang, Hans-Guido Wendel
Submitted Cancer Research
Abstract The generalized Gauss-Newton (GGN) approximation is often used to make practical Bayesian deep learning approaches scalable by replacing a second order derivative with a product of first order derivatives. In this paper we argue that the GGN approximation should be understood as a local linearization of the underlying Bayesian neural network (BNN), which turns the BNN into a generalized linear model (GLM). Because we use this linearized model for posterior inference, we should also predict using this modified model instead of the original one. We refer to this modified predictive as "GLM predictive" and show that it effectively resolves common underfitting problems of the Laplace approximation. It extends previous results in this vein to general likelihoods and has an equivalent Gaussian process formulation, which enables alternative inference schemes for BNNs in function space. We demonstrate the effectiveness of our approach on several standard classification datasets as well as on out-of-distribution detection.
Authors Alexander Immer, Maciej Korzepa, Matthias Bauer
Submitted AISTATS 2021
Abstract Conventional variational autoencoders fail in modeling correlations between data points due to their use of factorized priors. Amortized Gaussian process inference through GP-VAEs has led to significant improvements in this regard, but is still inhibited by the intrinsic complexity of exact GP inference. We improve the scalability of these methods through principled sparse inference approaches. We propose a new scalable GP-VAE model that outperforms existing approaches in terms of runtime and memory footprint, is easy to implement, and allows for joint end-to-end optimization of all components.
Authors Metod Jazbec, Vincent Fortuin, Michael Pearce, Stephan Mandt, Gunnar Rätsch
Submitted AISTATS 2021
Abstract Generating interpretable visualizations of multivariate time series in the intensive care unit is of great practical importance. Clinicians seek to condense complex clinical observations into intuitively understandable critical illness patterns, like failures of different organ systems. They would greatly benefit from a low-dimensional representation in which the trajectories of the patients' pathology become apparent and relevant health features are highlighted. To this end, we propose to use the latent topological structure of Self-Organizing Maps (SOMs) to achieve an interpretable latent representation of ICU time series and combine it with recent advances in deep clustering. Specifically, we (a) present a novel way to fit SOMs with probabilistic cluster assignments (PSOM), (b) propose a new deep architecture for probabilistic clustering (DPSOM) using a VAE, and (c) extend our architecture to cluster and forecast clinical states in time series (T-DPSOM). We show that our model achieves superior clustering performance compared to state-of-the-art SOM-based clustering methods while maintaining the favorable visualization properties of SOMs. On the eICU data-set, we demonstrate that T-DPSOM provides interpretable visualizations of patient state trajectories and uncertainty estimation. We show that our method rediscovers well-known clinical patient characteristics, such as a dynamic variant of the Acute Physiology And Chronic Health Evaluation (APACHE) score. Moreover, we illustrate how it can disentangle individual organ dysfunctions on disjoint regions of the two-dimensional SOM map.
Authors Laura Manduchi, Matthias Hüser, Martin Faltys, Julia Vogt, Gunnar Rätsch, Vincent Fortuin
Submitted ACM-CHIL 2021
Abstract Dynamic assessment of mortality risk in the intensive care unit (ICU) can be used to stratify patients, inform about treatment effectiveness or serve as part of an early-warning system. Static risk scoring systems, such as APACHE or SAPS, have recently been supplemented with data-driven approaches that track the dynamic mortality risk over time. Recent works have focused on enhancing the information delivered to clinicians even further by producing full survival distributions instead of point predictions or fixed horizon risks. In this work, we propose a non-parametric ensemble model, Weighted Resolution Survival Ensemble (WRSE), tailored to estimate such dynamic individual survival distributions. Inspired by the simplicity and robustness of ensemble methods, the proposed approach combines a set of binary classifiers spaced according to a decay function reflecting the relevance of short-term mortality predictions. Models and baselines are evaluated under weighted calibration and discrimination metrics for individual survival distributions which closely reflect the utility of a model in ICU practice. We show competitive results with state-of-the-art probabilistic models, while greatly reducing training time by factors of 2-9x.
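The general construction can be sketched as follows; the concrete decay function, classifiers, and weighting used by WRSE are not reproduced here, so both the exponential spacing and the hazard-chaining below are our illustrative assumptions.

```python
import numpy as np

# Illustrative only: one way to space binary-classifier horizons with a decay
# function (denser at short times, where predictions matter most) and to chain
# per-interval event probabilities into a monotone individual survival curve.

def decayed_horizons(t_max, n, rate=3.0):
    """Return n horizons in (0, t_max], spaced more densely at short times."""
    u = np.linspace(0.0, 1.0, n + 1)[1:]
    return t_max * (np.exp(rate * u) - 1.0) / (np.exp(rate) - 1.0)

def survival_curve(interval_event_probs):
    """Chain per-interval event probabilities p_k into S_k = prod_{i<=k} (1 - p_i)."""
    return np.cumprod(1.0 - np.asarray(interval_event_probs))

horizons = decayed_horizons(48.0, 8)        # e.g. eight horizons within 48 hours
surv = survival_curve([0.1, 0.2, 0.3])      # -> [0.9, 0.72, 0.504]
```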
Authors Jonathan Heitz, Joanna Ficek-Pascual, Martin Faltys, Tobias M. Merz, Gunnar Rätsch, Matthias Hüser
Submitted Proceedings of the AAAI-2021 - Spring Symposium on Survival Prediction
Abstract Intra-tumor hypoxia is a common feature in many solid cancers. Although transcriptional targets of hypoxia-inducible factors (HIFs) have been well characterized, alternative splicing or processing of pre-mRNA transcripts which occurs during hypoxia and subsequent HIF stabilization is much less understood. Here, we identify many HIF-dependent alternative splicing events after whole transcriptome sequencing in pancreatic cancer cells exposed to hypoxia with and without downregulation of the aryl hydrocarbon receptor nuclear translocator (ARNT), a protein required for HIFs to form a transcriptionally active dimer. We correlate the discovered hypoxia-driven events with available sequencing data from pan-cancer TCGA patient cohorts to select a narrow set of putative biologically relevant splice events for experimental validation. We validate a small set of candidate HIF-dependent alternative splicing events in multiple human gastrointestinal cancer cell lines as well as patient-derived human pancreatic cancer organoids. Lastly, we report the discovery of a HIF-dependent mechanism to produce a hypoxia-dependent, long and coding isoform of the UDP-N-acetylglucosamine transporter SLC35A3.
Authors Philipp Markolin, Natalie Davidson, Christian K Hirt, Christophe D Chabbert, Nicola Zamboni, Gerald Schwank, Wilhelm Krek, Gunnar Rätsch
Submitted Genomics
Abstract Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently-used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. In this paper, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of vertices adjacent in the graph. RowDiff can be constructed in linear time relative to the number of vertices and labels in the graph, and in space proportional to the graph size. In addition, construction can be efficiently parallelized and distributed, making the technique applicable to graphs with trillions of nodes. RowDiff can be viewed as an intermediary sparsification step of the original annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrices. Experiments on 10,000 RNA-seq datasets show that RowDiff combined with Multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST, the previously known most compact annotation representation. Experiments on the sparser Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a Multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST.
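The core idea, delta-coding each vertex's annotation row against a designated neighbor, can be sketched on plain label sets; real RowDiff operates on succinct bit-vector annotations with careful anchor selection, so everything below is a simplified illustration, not the library's API.

```python
# Toy sketch of the RowDiff idea: store only each node's symmetric difference with
# a designated outgoing neighbor, keeping full rows at a few anchor nodes.

def rowdiff_compress(labels, succ, anchors):
    """labels: {node: frozenset of labels}; succ: {node: designated successor};
    anchors: nodes whose rows are stored verbatim."""
    diffs = {}
    for v, row in labels.items():
        if v in anchors:
            diffs[v] = ("anchor", row)
        else:
            diffs[v] = ("diff", row ^ labels[succ[v]])  # symmetric difference
    return diffs

def rowdiff_lookup(diffs, succ, v):
    """Reconstruct a row by XOR-ing diffs along successor pointers to an anchor."""
    kind, payload = diffs[v]
    if kind == "anchor":
        return payload
    return payload ^ rowdiff_lookup(diffs, succ, succ[v])

# Adjacent nodes have similar rows, so the stored diffs are small.
labels = {1: frozenset({"A", "B"}), 2: frozenset({"A", "B", "C"}), 3: frozenset({"A"})}
succ = {1: 2, 2: 3}
diffs = rowdiff_compress(labels, succ, anchors={3})
```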
Authors Daniel Danciu, Mikhail Karasikov, Harun Mustafa, André Kahles, Gunnar Rätsch
Submitted ISMB/ECCB 2021
Abstract The sharp increase in next-generation sequencing technologies’ capacity has created a demand for algorithms capable of quickly searching a large corpus of biological sequences. The complexity of biological variability and the magnitude of existing data sets have impeded finding algorithms with guaranteed accuracy that efficiently run in practice. Our main contribution is the Tensor Sketch method that efficiently and accurately estimates edit distances. In our experiments, Tensor Sketch had 0.88 Spearman’s rank correlation with the exact edit distance, almost doubling the 0.466 correlation of the closest competitor while running 8.8 times faster. Finally, all sketches can be updated dynamically if the input is a sequence stream, making it appealing for large-scale applications where data cannot fit into memory. Conceptually, our approach has three steps: 1) represent sequences as tensors over their sub-sequences, 2) apply tensor sketching that preserves tensor inner products, 3) implicitly compute the sketch. The sub-sequences, which are not necessarily contiguous pieces of the sequence, allow us to outperform k-mer-based methods, such as min-hash sketching over a set of k-mers. Typically, the number of sub-sequences grows exponentially with the sub-sequence length, introducing both memory and time overheads. We directly address this problem in steps 2 and 3 of our method. While rank-1 and super-symmetric tensors are known to admit efficient sketching, the sub-sequence tensor satisfies neither property. Hence, we propose a new sketching scheme that completely avoids the need for constructing the ambient space. Our tensor-sketching technique’s main advantages are three-fold: 1) Tensor Sketch has higher accuracy than any of the other assessed sketching methods used in practice. 2) All sketches can be computed in a streaming fashion, leading to significant time and memory savings when there is overlap between input sequences.
3) It is straightforward to extend tensor sketching to different settings leading to efficient methods for related sequence analysis tasks. We view tensor sketching as a framework to tackle a wide range of relevant bioinformatics problems, and we are confident that it can bring significant improvements for applications based on edit distance estimation.
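Step 1, the ambient sub-sequence tensor, can be computed exactly for tiny alphabets and sub-sequence lengths; its exponential size is precisely what steps 2 and 3 are designed to avoid materializing. The sketch below is our own illustration of that ambient representation, not the paper's code.

```python
import numpy as np

def subseq_tensor(seq, alphabet="ACGT", t=3):
    """Exact counts of all length-t (not necessarily contiguous) sub-sequences of
    seq. The tensor has |alphabet|**t entries, so this is only feasible for tiny t."""
    idx = {c: i for i, c in enumerate(alphabet)}
    # dp[k] accumulates counts of length-k sub-sequences seen so far
    dp = [np.zeros((len(alphabet),) * k) for k in range(t + 1)]
    dp[0] = np.ones(())  # the empty sub-sequence occurs exactly once
    for c in seq:
        i = idx[c]
        for k in range(t, 0, -1):  # descend so each character extends each length once
            dp[k][..., i] += dp[k - 1]
    return dp[t]

# "ACA" over {A, C} with t = 2: the pairs AA, AC, and CA each occur once.
T = subseq_tensor("ACA", alphabet="AC", t=2)
```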
Authors Amir Joudaki, Gunnar Rätsch, André Kahles
Submitted RECOMB 2021
Abstract Motivation Deep learning techniques have yielded tremendous progress in the field of computational biology over the last decade; however, many of these techniques are opaque to the user. To provide interpretable results, methods have incorporated biological priors directly into the learning task; one such biological prior is pathway structure. While pathways represent most biological processes in the cell, the high level of correlation and hierarchical structure make it complicated to determine an appropriate computational representation. Results Here, we present pathway module Variational Autoencoder (pmVAE). Our method encodes pathway information by restricting the structure of our VAE to mirror gene-pathway memberships. Its architecture is composed of a set of subnetworks, which we refer to as pathway modules. The subnetworks learn interpretable latent representations by factorizing the latent space according to pathway gene sets. We directly address correlation between pathways by balancing a module-specific local loss and a global reconstruction loss. Furthermore, since many pathways are by nature hierarchical and therefore the product of multiple downstream signals, we model each pathway as a multidimensional vector. Due to their factorization over pathways, the representations allow for easy and interpretable analysis of multiple downstream effects, such as cell type and biological stimulus, within the contexts of each pathway. We compare pmVAE against two other state-of-the-art methods on two single-cell RNA-seq case-control data sets, demonstrating that our pathway representations are both more discriminative and consistent in detecting pathways targeted by a perturbation. Availability and implementation https://github.com/ratschlab/pmvae
Authors Gilles Gut, Stefan G Stark, Gunnar Rätsch, Natalie R Davidson
Submitted bioRxiv
Abstract The application and integration of molecular profiling technologies create novel opportunities for personalized medicine. Here, we introduce the Tumor Profiler Study, an observational trial combining a prospective diagnostic approach to assess the relevance of in-depth tumor profiling to support clinical decision-making with an exploratory approach to improve the biological understanding of the disease.
Authors Anja Irmisch, Ximena Bonilla, Stéphane Chevrier, Kjong-Van Lehmann, Franziska Singer, Nora C. Toussaint, Cinzia Esposito, Julien Mena, Emanuela S. Milani, Ruben Casanova, Daniel J. Stekhoven, Rebekka Wegmann, Francis Jacob, Bettina Sobottka, Sandra Goetze, Jack Kuipers, Jacobo Sarabia del Castillo, Michael Prummer, Mustafa A. Tuncel, Ulrike Menzel, Andrea Jacobs, Stefanie Engler, Sujana Sivapatham, Anja L. Frei, Gabriele Gut, Joanna Ficek-Pascual, Nicola Miglino, Melike Ak, Faisal S. Al-Quaddoomi, Jonas Albinus, Ilaria Alborelli, Sonali Andani, Per-Olof Attinger, Daniel Baumhoer, Beatrice Beck-Schimmer, Lara Bernasconi, Anne Bertolini, Natalia Chicherova, Maya D'Costa, Esther Danenberg, Natalie Davidson, Monica-Andreea Drăgan, Martin Erkens, Katja Eschbach, André Fedier, Pedro Ferreira, Bruno Frey, Linda Grob, Detlef Günther, Martina Haberecker, Pirmin Haeuptle, Sylvia Herter, Rene Holtackers, Tamara Huesser, Tim M. Jaeger, Katharina Jahn, Alva R. James, Philip M. Jermann, André Kahles, Abdullah Kahraman, Werner Kuebler, Christian P. Kunze, Christian Kurzeder, Sebastian Lugert, Gerd Maass, Philipp Markolin, Julian M. Metzler, Simone Muenst, Riccardo Murri, Charlotte K.Y. Ng, Stefan Nicolet, Marta Nowak, Patrick G.A. Pedrioli, Salvatore Piscuoglio, Mathilde Ritter, Christian Rommel, María L. Rosano-González, Natascha Santacroce, Ramona Schlenker, Petra C. Schwalie, Severin Schwan, Tobias Schär, Gabriela Senti, Vipin T. Sreedharan, Stefan Stark, Tinu M. Thomas, Vinko Tosevski, Marina Tusup, Audrey Van Drogen, Marcus Vetter, Tatjana Vlajnic, Sandra Weber, Walter P. Weber, Michael Weller, Fabian Wendt, Norbert Wey, Mattheus H.E. Wildschut, Shuqing Yu, Johanna Ziegler, Marc Zimmermann, Martin Zoche, Gregor Zuend, Rudolf Aebersold, Marina Bacac, Niko Beerenwinkel, Christian Beisel, Bernd Bodenmiller, Reinhard Dummer, Viola Heinzelmann-Schwarz, Viktor H. Koelzer, Markus G. Manz, Holger Moch, Lucas Pelkmans, Berend Snijder, Alexandre P.A. 
Theocharides, Markus Tolnay, Andreas Wicki, Bernd Wollscheid, Gunnar Rätsch, Mitchell P. Levesque
Submitted Cancer Cell (Commentary)
Abstract Isotropic Gaussian priors are the de facto standard for modern Bayesian neural network inference. However, it is unclear whether these priors accurately reflect our true beliefs about the weight distributions or give optimal performance. To find better priors, we study summary statistics of neural network weights in networks trained using SGD. We find that convolutional neural network (CNN) weights display strong spatial correlations, while fully connected networks (FCNNs) display heavy-tailed weight distributions. Building these observations into priors leads to improved performance on a variety of image classification datasets. Surprisingly, these priors mitigate the cold posterior effect in FCNNs, but slightly increase the cold posterior effect in ResNets.
Authors Vincent Fortuin, Adrià Garriga-Alonso, Florian Wenzel, Gunnar Rätsch, Richard Turner, Mark van der Wilk, Laurence Aitchison
Submitted AABI 2021
Abstract Stochastic gradient Markov Chain Monte Carlo algorithms are popular samplers for approximate inference, but they are generally biased. We show that many recent versions of these methods (e.g. Chen et al. (2014)) cannot be corrected using Metropolis-Hastings rejection sampling, because their acceptance probability is always zero. We can fix this by employing a sampler with realizable backwards trajectories, such as Gradient-Guided Monte Carlo (Horowitz, 1991), which generalizes stochastic gradient Langevin dynamics (Welling and Teh, 2011) and Hamiltonian Monte Carlo. We show that this sampler can be used with stochastic gradients, yielding nonzero acceptance probabilities, which can be computed even across multiple steps.
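For reference, stochastic gradient Langevin dynamics, the Welling and Teh (2011) special case that Gradient-Guided Monte Carlo generalizes, is a one-line update. The sketch below targets a 1-D standard normal (grad log p(x) = -x) and is purely illustrative, not the paper's sampler.

```python
import numpy as np

def sgld(grad_log_p, x0, eps, n_steps, rng):
    """Unadjusted Langevin updates, x <- x + (eps/2) * grad log p(x) + N(0, eps),
    i.e. without any Metropolis-Hastings correction."""
    x, out = x0, np.empty(n_steps)
    for t in range(n_steps):
        x = x + 0.5 * eps * grad_log_p(x) + rng.normal(scale=np.sqrt(eps))
        out[t] = x
    return out

rng = np.random.default_rng(0)
draws = sgld(lambda x: -x, 0.0, 0.05, 50_000, rng)  # target: standard normal
```

The lack of a correction step is exactly the bias the abstract discusses: for small eps the stationary distribution is close to, but not exactly, the target.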
Authors Adrià Garriga-Alonso, Vincent Fortuin
Submitted AABI 2021
Abstract Particle-based optimization algorithms have recently been developed as sampling methods that iteratively update a set of particles to approximate a target distribution. In particular, Stein variational gradient descent has gained attention in the approximate inference literature for its flexibility and accuracy. We empirically explore the ability of this method to sample from multi-modal distributions and focus on two important issues: (i) the inability of the particles to escape from local modes and (ii) the inefficacy in reproducing the density of the different regions. We propose an annealing schedule to solve these issues and show, through various experiments, how this simple solution leads to significant improvements in mode coverage, without invalidating any theoretical properties of the original algorithm.
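A minimal sketch of SVGD with an annealing schedule on a 1-D two-mode Gaussian mixture. The RBF kernel with median heuristic, the linear schedule, and all step sizes are our illustrative choices, not necessarily those of the paper.

```python
import numpy as np

def grad_log_p(x, mu=(-2.0, 2.0), sigma=0.5):
    """Gradient of the log-density of an equal-weight two-component Gaussian mixture."""
    comps = np.stack([np.exp(-0.5 * ((x - m) / sigma) ** 2) for m in mu])
    w = comps / comps.sum(axis=0)  # component responsibilities
    return sum(w[k] * (mu[k] - x) / sigma ** 2 for k in range(len(mu)))

def annealed_svgd(x, n_iter=500, lr=0.05, t_anneal=300):
    n = len(x)
    for t in range(n_iter):
        gamma = min(1.0, (t + 1) / t_anneal)   # anneal the driving (attractive) term
        diff = x[:, None] - x[None, :]         # x_i - x_j
        h = np.median(diff ** 2) / np.log(n + 1) + 1e-8
        k = np.exp(-diff ** 2 / h)             # RBF kernel with median heuristic
        grad_k = -2.0 * diff / h * k           # d k(x_i, x_j) / d x_i (repulsion)
        # update for particle j: mean_i [gamma * k_ij * grad log p(x_i) + d k_ij / d x_i]
        phi = (gamma * k * grad_log_p(x)[:, None] + grad_k).mean(axis=0)
        x = x + lr * phi
    return x

rng = np.random.default_rng(0)
particles = annealed_svgd(rng.normal(size=50))
```

Early on, small gamma lets the repulsive kernel term spread the particles before the log-density gradient pulls them into the modes, which is the intuition behind improved mode coverage.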
Authors Francesco D'Angelo, Vincent Fortuin
Submitted AABI 2021
Abstract Tumors contain multiple subpopulations of genetically distinct cancer cells. Reconstructing their evolutionary history can improve our understanding of how cancers develop and respond to treatment. Subclonal reconstruction methods cluster mutations into groups that co-occur within the same subpopulations, estimate the frequency of cells belonging to each subpopulation, and infer the ancestral relationships among the subpopulations by constructing a clone tree. However, often multiple clone trees are consistent with the data and current methods do not efficiently capture this uncertainty; nor can these methods scale to clone trees with a large number of subclonal populations. Here, we formalize the notion of a partially-defined clone tree (partial clone tree for short) that defines a subset of the pairwise ancestral relationships in a clone tree, thereby implicitly representing the set of all clone trees that have these defined pairwise relationships. Also, we introduce a special partial clone tree, the Maximally-Constrained Ancestral Reconstruction (MAR), which summarizes all clone trees fitting the input data equally well. Finally, we extend commonly used clone tree validity conditions to apply to partial clone trees and describe SubMARine, a polynomial-time algorithm producing the subMAR, which approximates the MAR and guarantees that its defined relationships are a subset of those present in the MAR. We also extend SubMARine to work with subclonal copy number aberrations and define equivalence constraints for this purpose. Further, we extend SubMARine to permit noise in the estimates of the subclonal frequencies while retaining its validity conditions and guarantees. In contrast to other clone tree reconstruction methods, SubMARine runs in time and space that scale polynomially in the number of subclones. 
We show, through extensive noise-free simulations, a large lung cancer dataset, and a prostate cancer dataset, that the subMAR equals the MAR in all cases where only a single clone tree exists and that it is a perfect match to the MAR in most of the other cases. Notably, SubMARine runs in less than 70 seconds on a single thread with less than one GB of memory on all datasets presented in this paper, including ones with 50 nodes in a clone tree. On the real-world data, SubMARine almost perfectly recovers the previously reported trees and identifies minor errors made in the expert-driven reconstructions of those trees. The freely-available open-source code implementing SubMARine can be downloaded at https://github.com/morrislab/submarine.
Authors Linda K. Sundermann, Jeff Wintersinger, Gunnar Rätsch, Jens Stoye, Quaid Morris
Submitted PLOS Computational Biology
Abstract Variational autoencoders often assume isotropic Gaussian priors and mean-field posteriors, hence do not exploit structure in scenarios where we may expect similarity or consistency across latent variables. Gaussian process variational autoencoders alleviate this problem through the use of a latent Gaussian process, but lead to a cubic inference time complexity. We propose a more scalable extension of these models by leveraging the independence of the auxiliary features, which is present in many datasets. Our model factorizes the latent kernel across these features in different dimensions, leading to a significant speed-up (in theory and practice), while empirically performing comparably to existing non-scalable approaches. Moreover, our approach allows for additional modeling of global latent information and for more general extrapolation to unseen input combinations.
Authors Metod Jazbec, Michael Pearce, Vincent Fortuin
Submitted AABI 2021
Abstract Variational Inference makes a trade-off between the capacity of the variational family and the tractability of finding an approximate posterior distribution. Instead, Boosting Variational Inference allows practitioners to obtain increasingly good posterior approximations by spending more compute. The main obstacle to widespread adoption of Boosting Variational Inference is the amount of resources necessary to improve over a strong Variational Inference baseline. In our work, we trace this limitation back to the global curvature of the KL-divergence. We characterize how the global curvature impacts time and memory consumption, address the problem with the notion of local curvature, and provide a novel approximate backtracking algorithm for estimating local curvature. We give new theoretical convergence rates for our algorithms and provide experimental validation on synthetic and real-world datasets.
Authors Gideon Dresdner, Saurav Shekhar, Fabian Pedregosa, Francesco Locatello, Gunnar Rätsch
Submitted International Joint Conference on Artificial Intelligence (IJCAI-21)
2020
Abstract Motivation Recent technological advances have led to an increase in the production and availability of single-cell data. The ability to integrate a set of multi-technology measurements would allow the identification of biologically or clinically meaningful observations through the unification of the perspectives afforded by each technology. In most cases, however, profiling technologies consume the used cells and thus pairwise correspondences between datasets are lost. Given the sheer size that single-cell datasets can reach, scalable algorithms are needed that can match a measurement carried out on a cell in one technology to its corresponding sibling in another technology. Results We propose Single-Cell data Integration via Matching (SCIM), a scalable approach to recover such correspondences in two or more technologies. SCIM assumes that cells share a common (low-dimensional) underlying structure and that the underlying cell distribution is approximately constant across technologies. It constructs a technology-invariant latent space using an autoencoder framework with an adversarial objective. Multi-modal datasets are integrated by pairing cells across technologies using a bipartite matching scheme that operates on the low-dimensional latent representations. We evaluate SCIM on a simulated cellular branching process and show that the cell-to-cell matches derived by SCIM reflect the same pseudotime on the simulated dataset. Moreover, we apply our method to two real-world scenarios, a melanoma tumor sample and a human bone marrow sample, where we pair cells from a scRNA dataset to their sibling cells in a CyTOF dataset achieving 90% and 78% cell-matching accuracy for each one of the samples, respectively.
Authors Stefan G Stark, Joanna Ficek-Pascual, Francesco Locatello, Ximena Bonilla, Stéphane Chevrier, Franziska Singer, Tumor Profiler Consortium, Gunnar Rätsch, Kjong-Van Lehmann
Submitted Bioinformatics
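A minimal sketch of the matching step, assuming Euclidean costs in the shared latent space and using SciPy's Hungarian solver as a stand-in for SCIM's bipartite matching scheme (the embeddings below are synthetic placeholders, not outputs of the actual autoencoder):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

# Hypothetical latent embeddings from a shared, technology-invariant space:
# 8 cells measured with technology A, and their (shuffled, noisy) siblings
# measured with technology B.
z_a = rng.standard_normal((8, 2))
z_b = z_a[::-1] + 0.01 * rng.standard_normal((8, 2))

# Pairwise Euclidean distances in the latent space form the matching cost.
cost = np.linalg.norm(z_a[:, None, :] - z_b[None, :, :], axis=-1)

# Minimum-cost bipartite matching (Hungarian algorithm): each cell in A
# is paired with exactly one cell in B.
rows, cols = linear_sum_assignment(cost)
```

Because the matching operates only on low-dimensional latent codes, its cost is independent of the raw feature dimensionality of each technology.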
Abstract The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) curated consensus somatic mutation calls using whole exome sequencing (WES) and whole genome sequencing (WGS), respectively. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2,658 cancers across 38 tumour types, we compare WES and WGS side-by-side from 746 TCGA samples, finding that ~80% of mutations overlap in covered exonic regions. We estimate that low variant allele fraction (VAF < 15%) and clonal heterogeneity contribute up to 68% of private WGS mutations and 71% of private WES mutations. We observe that ~30% of private WGS mutations trace to mutations identified by a single variant caller in WES consensus efforts. WGS captures both ~50% more variation in exonic regions and un-observed mutations in loci with variable GC-content. Together, our analysis highlights technological divergences between two reproducible somatic variant detection efforts.
Authors Matthew H. Bailey, William U. Meyerson, Lewis Jonathan Dursi, Liang-Bo Wang, Guanlan Dong, Wen-Wei Liang, Amila Weerasinghe, Shantao Li, Yize Li, Sean Kelso, MC3 Working Group, PCAWG novel somatic mutation calling methods working group, Gordon Saksena, Kyle Ellrott, Michael C. Wendl, David A. Wheeler, Gad Getz, Jared T. Simpson, Mark B. Gerstein, Li Ding & PCAWG Consortium
Submitted Nature Communications
Abstract The amount of biological sequencing data available in public repositories is growing exponentially, forming an invaluable biomedical research resource. Yet, making all this sequencing data searchable and easily accessible to life science and data science researchers is an unsolved problem. We present MetaGraph, a versatile framework for the scalable analysis of extensive sequence repositories. MetaGraph efficiently indexes vast collections of sequences to enable fast search and comprehensive analysis. A wide range of underlying data structures offer different practically relevant trade-offs between the space taken by the index and its query performance. MetaGraph provides a flexible methodological framework allowing for index construction to be scaled from consumer laptops to distribution onto a cloud compute cluster for processing terabases to petabases of input data. Achieving compression ratios of up to 1,000-fold over the already compressed raw input data, MetaGraph can represent the content of large sequencing archives in the working memory of a single compute server. We demonstrate our framework’s scalability by indexing over 1.4 million whole genome sequencing (WGS) records from NCBI’s Sequence Read Archive, representing a total input of more than three petabases. Besides demonstrating the utility of MetaGraph indexes on key applications, such as experiment discovery, sequence alignment, error correction, and differential assembly, we make a wide range of indexes available as a community resource, including those over 450,000 microbial WGS records, more than 110,000 fungi WGS records, and more than 20,000 whole metagenome sequencing records. A subset of these indexes is made available online for interactive queries. All indexes created from public data comprising in total more than 1 million records are available for download or usage in the cloud. 
As an example of our indexes’ integrative analysis capabilities, we introduce the concept of differential assembly, which allows for the extraction of sequences present in a foreground set of samples but absent in a given background set. We apply this technique to differentially assemble contigs to identify pathogenic agents transmitted via human kidney transplants. In a second example, we indexed more than 20,000 human RNA-Seq records from the TCGA and GTEx cohorts and used them to extract transcriptome features that are hard to characterize using a classical linear reference. We discovered over 200 trans-splicing events in GTEx and found broad evidence for tissue-specific non-A-to-I RNA-editing in GTEx and TCGA.
Authors Mikhail Karasikov, Harun Mustafa, Daniel Danciu, Marc Zimmermann, Christopher Barber, Gunnar Rätsch, André Kahles
Submitted bioRxiv
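As a toy illustration of the kind of query such an index answers (MetaGraph itself uses succinct de Bruijn graph representations and compressed annotations, not a Python dictionary), a k-mer-to-label index for experiment discovery might look like:

```python
from collections import defaultdict

def index_samples(samples, k=4):
    """Map every k-mer to the set of sample labels containing it."""
    idx = defaultdict(set)
    for label, seq in samples.items():
        for i in range(len(seq) - k + 1):
            idx[seq[i:i + k]].add(label)
    return idx

def query(idx, seq, k=4):
    """Experiment discovery: count shared k-mers per indexed sample."""
    hits = defaultdict(int)
    for i in range(len(seq) - k + 1):
        for label in idx.get(seq[i:i + k], ()):
            hits[label] += 1
    return dict(hits)

idx = index_samples({"s1": "ACGTACGT", "s2": "TTTTACGT"})
query(idx, "ACGTAC")  # -> {'s1': 3, 's2': 1}
```

The engineering challenge the paper addresses is doing exactly this at petabase scale, where the naive hash map above is replaced by compressed graph and annotation data structures.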
Abstract Large, multi-dimensional spatio-temporal datasets are omnipresent in modern science and engineering. Gaussian process deep generative models (GP-DGMs), which employ GP priors over the latent variables of DGMs, are an effective framework for handling such data. Existing approaches for performing inference in GP-DGMs do not support sparse GP approximations based on inducing points, which are essential for the computational efficiency of GPs, nor do they handle missing data -- a natural occurrence in many spatio-temporal datasets -- in a principled manner. We address these shortcomings with the development of the sparse Gaussian process variational autoencoder (SGP-VAE), characterised by the use of partial inference networks for parameterising sparse GP approximations. Leveraging the benefits of amortised variational inference, the SGP-VAE enables inference in multi-output sparse GPs on previously unobserved data with no additional training. The SGP-VAE is evaluated in a variety of experiments where it outperforms alternative approaches including multi-output GPs and structured VAEs.
Authors Matthew Ashman, Jonathan So, Will Tebbutt, Vincent Fortuin, Michael Pearce, Richard E. Turner
Submitted arXiv Preprints
Abstract Learning object-centric representations of complex scenes is a promising step towards enabling efficient abstract reasoning from low-level perceptual features. Yet, most deep learning approaches learn distributed representations that do not capture the compositional properties of natural scenes. In this paper, we present the Slot Attention module, an architectural component that interfaces with perceptual representations such as the output of a convolutional neural network and produces a set of task-dependent abstract representations which we call slots. These slots are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention. We empirically demonstrate that Slot Attention can extract object-centric representations that enable generalization to unseen compositions when trained on unsupervised object discovery and supervised property prediction tasks.
Authors Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, Thomas Kipf
Submitted NeurIPS 2020 (spotlight)
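A stripped-down numpy sketch of the core iteration, with the softmax taken over slots so that input features compete to be explained by a slot; the published module additionally uses learned query/key/value projections, a GRU update, MLPs and layer normalization:

```python
import numpy as np

def slot_attention_step(slots, inputs, eps=1e-8):
    """One simplified round of Slot Attention (no GRU/MLP/LayerNorm).

    slots:  (K, D) current slot representations
    inputs: (N, D) perceptual features (e.g. CNN output)
    """
    D = slots.shape[1]
    # Dot-product attention logits, scaled as in standard attention.
    logits = inputs @ slots.T / np.sqrt(D)              # (N, K)
    # Softmax over *slots*: this competition induces binding, since each
    # input feature must distribute its attention across the K slots.
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn = attn / attn.sum(axis=1, keepdims=True)       # (N, K)
    # Weighted mean of the inputs assigned to each slot.
    w = attn / (attn.sum(axis=0, keepdims=True) + eps)  # normalize per slot
    return w.T @ inputs                                 # (K, D)

rng = np.random.default_rng(0)
inputs = rng.standard_normal((16, 4))   # 16 perceptual features, dim 4
slots = rng.standard_normal((3, 4))     # 3 exchangeable slots
for _ in range(3):                      # iterative refinement rounds
    slots = slot_attention_step(slots, inputs)
```

Because the slots are initialized randomly and updated by the same shared function, they are exchangeable: any slot can come to represent any object in the input.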
Abstract Advances in synthetic biology and microbiology have enabled the creation of engineered bacteria which can sense and report on intracellular and extracellular signals. When deployed in vivo these whole-cell bacterial biosensors can act as sentinels to monitor biomolecules of interest in human health and disease settings. This is particularly interesting in the context of the gut microbiota, which interacts extensively with the human host throughout time and transit of the gut and can be accessed from feces without requiring invasive collection. Leveraging rational engineering approaches for genetic circuits as well as an expanding catalog of disease-associated biomarkers, bacterial biosensors can act as non-invasive and easy-to-monitor reporters of the gut. Here, we summarize recent engineering approaches applied in vivo in animal models and then highlight promising technologies for designing the next generation of bacterial biosensors.
Authors Tanmay Tanna, Raghavendra Ramachanderan, Randall J Platt
Submitted Current Opinion in Microbiology
Abstract Motivation Understanding the underlying mutational processes of cancer patients has been a long-standing goal in the community and promises to provide new insights that could improve cancer diagnoses and treatments. Mutational signatures are summaries of the mutational processes, and improving the derivation of mutational signatures can yield new discoveries previously obscured by technical and biological confounders. Results from existing mutational signature extraction methods depend on the size of the available patient cohort and solely focus on the analysis of mutation count data without considering the exploitation of metadata. Results Here we present a supervised method that utilizes cancer type as metadata to extract more distinctive signatures. More specifically, we use a negative binomial non-negative matrix factorization and add a support vector machine loss. We show that mutational signatures extracted by our proposed method have a lower reconstruction error and are designed to be more predictive of cancer type than those generated by unsupervised methods. This design reduces the need for elaborate post-processing strategies in order to recover most of the known signatures, unlike the existing unsupervised signature extraction methods. Signatures extracted by a supervised model used in conjunction with cancer-type labels are also more robust, especially when using small and potentially cancer-type limited patient cohorts. Finally, we adapted our model such that molecular features can be utilized to derive a corresponding mutational signature. We used APOBEC expression and MUTYH mutation status to demonstrate the possibilities that arise from this ability. We conclude that our method, which exploits available metadata, improves the quality of mutational signatures as well as helps derive more interpretable representations.
Authors Xinrui Lyu, Jean Garret, Gunnar Rätsch, Kjong-Van Lehmann
Submitted Bioinformatics
Abstract We propose a novel Stochastic Frank-Wolfe (a.k.a. conditional gradient) algorithm for constrained smooth finite-sum minimization with a generalized linear prediction/structure. This class of problems includes empirical risk minimization with sparse, low-rank, or other structured constraints. The proposed method is simple to implement, does not require step-size tuning, and has a constant per-iteration cost that is independent of the dataset size. Furthermore, as a byproduct of the method we obtain a stochastic estimator of the Frank-Wolfe gap that can be used as a stopping criterion. Depending on the setting, the proposed method matches or improves on the best computational guarantees for Stochastic Frank-Wolfe algorithms. Benchmarks on several datasets highlight different regimes in which the proposed method exhibits a faster empirical convergence than related methods. Finally, we provide an implementation of all considered methods in an open-source package.
Authors Geoffrey Négiar, Gideon Dresdner, Alicia Tsai, Laurent El Ghaoui, Francesco Locatello, Robert M. Freund, Fabian Pedregosa
Submitted ICML 2020
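To make the ingredients concrete, here is a deterministic Frank-Wolfe sketch on a simplex-constrained least-squares toy problem, showing the linear minimization oracle and the Frank-Wolfe gap as a stopping criterion; the paper's method replaces the full gradient below with a stochastic estimator whose per-iteration cost is independent of the dataset size:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
x_true = np.zeros(10)
x_true[[2, 7]] = [0.6, 0.4]           # sparse solution on the simplex
b = A @ x_true

x = np.ones(10) / 10                  # start at the simplex barycentre
for t in range(200):
    grad = A.T @ (A @ x - b)          # full gradient (stochastic in the paper)
    # Linear minimization oracle over the simplex: always a vertex,
    # i.e. a 1-sparse vector -- this is what keeps iterates structured.
    s = np.zeros(10)
    s[np.argmin(grad)] = 1.0
    gap = grad @ (x - s)              # Frank-Wolfe gap: zero at the optimum,
    if gap < 1e-8:                    # so it doubles as a stopping criterion
        break
    x = x + 2.0 / (t + 2) * (s - x)   # classic step size, no tuning needed

# Iterates stay feasible by construction: each update is a convex
# combination of the current iterate and a simplex vertex.
```

The projection-free update is the defining feature of Frank-Wolfe methods: feasibility is maintained without ever projecting onto the constraint set.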
Abstract Intelligent agents should be able to learn useful representations by observing changes in their environment. We model such observations as pairs of non-i.i.d. images sharing at least one of the underlying factors of variation. First, we theoretically show that only knowing how many factors have changed, but not which ones, is sufficient to learn disentangled representations. Second, we provide practical algorithms that learn disentangled representations from pairs of images without requiring annotation of groups, individual factors, or the number of factors that have changed. Third, we perform a large-scale empirical study and show that such pairs of observations are sufficient to reliably learn disentangled representations on several benchmark data sets. Finally, we evaluate our learned representations and find that they are simultaneously useful on a diverse suite of tasks, including generalization under covariate shifts, fairness, and abstract reasoning. Overall, our results demonstrate that weak supervision enables learning of useful disentangled representations in realistic scenarios.
Authors Francesco Locatello, Ben Poole, Gunnar Rätsch, Bernhard Schölkopf, Olivier Bachem, Michael Tschannen
Submitted ICML 2020
Abstract Although disinfection is key to infection control, the colonization patterns and resistomes of hospital-environment microbes remain underexplored. We report the first extensive genomic characterization of microbiomes, pathogens and antibiotic resistance cassettes in a tertiary-care hospital, from repeated sampling (up to 1.5 years apart) of 179 sites associated with 45 beds. Deep shotgun metagenomics unveiled distinct ecological niches of microbes and antibiotic resistance genes characterized by biofilm-forming and human-microbiome-influenced environments with corresponding patterns of spatiotemporal divergence. Quasi-metagenomics with nanopore sequencing provided thousands of high-contiguity genomes, phage and plasmid sequences (>60% novel), enabling characterization of resistome and mobilome diversity and dynamic architectures in hospital environments. Phylogenetics identified multidrug-resistant strains as being widely distributed and stably colonizing across sites. Comparisons with clinical isolates indicated that such microbes can persist in hospitals for extended periods (>8 years), to opportunistically infect patients. These findings highlight the importance of characterizing antibiotic resistance reservoirs in hospitals and establish the feasibility of systematic surveys to target resources for preventing infections.
Authors Kern Rei Chng, Chenhao Li, Denis Bertrand, Amanda Hui Qi Ng, Junmei Samantha Kwah, Hwee Meng Low, Chengxuan Tong, Maanasa Natrajan, Michael Hongjie Zhang, Licheng Xu, Karrie Kwan Ki Ko, Eliza Xin Pei Ho, Tamar V Av-Shalom, Jeanette Woon Pei Teo, Chiea Chuen Khor, MetaSUB Consortium; Swaine L Chen, Christopher E Mason, Oon Tek Ng, Kalisvar Marimuthu, Brenda Ang, Niranjan Nagarajan
Submitted Nature Medicine
Abstract We call upon the research community to standardize efforts to use daily self-reported data about COVID-19 symptoms in the response to the pandemic and to form a collaborative consortium to maximize global gain while protecting participant privacy.
Authors Eran Segal, Feng Zhang, Xihong Lin, Gary King, Ophir Shalem, Smadar Shilo, William E. Allen, Faisal Alquaddoomi, Han Altae-Tran, Simon Anders, Ran Balicer, Tal Bauman, Ximena Bonilla, Gisel Booman, Andrew T. Chan, Ori Cohen, Silvano Coletti, Natalie Davidson, Yuval Dor, David A. Drew, Olivier Elemento, Georgina Evans, Phil Ewels, Joshua Gale, Amir Gavrieli, Benjamin Geiger, Yonatan H. Grad, Casey S. Greene, Iman Hajirasouliha, Roman Jerala, Andre Kahles, Olli Kallioniemi, Ayya Keshet, Ljupco Kocarev, Gregory Landua, Tomer Meir, Aline Muller, Long H. Nguyen, Matej Oresic, Svetlana Ovchinnikova, Hedi Peterson, Jana Prodanova, Jay Rajagopal, Gunnar Rätsch, Hagai Rossman, Johan Rung, Andrea Sboner, Alexandros Sigaras, Tim Spector, Ron Steinherz, Irene Stevens, Jaak Vilo, Paul Wilmes
Submitted Nature Medicine
Abstract Kernel methods on discrete domains have shown great promise for many challenging tasks, e.g., on biological sequence data as well as on molecular structures. Scalable kernel methods like support vector machines offer good predictive performances but they often do not provide uncertainty estimates. In contrast, probabilistic kernel methods like Gaussian Processes offer uncertainty estimates in addition to good predictive performance but fall short in terms of scalability. We present the first sparse Gaussian Process approximation framework on discrete input domains. Our framework achieves good predictive performance as well as uncertainty estimates using different discrete optimization techniques. We present competitive results comparing our framework to support vector machine and full Gaussian Process baselines on synthetic data as well as on challenging real-world DNA sequence data.
Authors Vincent Fortuin, Gideon Dresdner, Heiko Strathmann, Gunnar Rätsch
Submitted IEEE Access
Abstract The Jaccard Similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. However, little effort has been made to develop a scalable and high-performance scheme for computing the Jaccard Similarity for today's large data sets. To address this issue, we design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the Jaccard Similarity. The key idea is to express the problem algebraically, as a sequence of matrix operations, and implement these operations with communication-avoiding distributed routines to minimize the amount of transferred data and ensure both high scalability and low latency. We then apply our algorithm to the problem of obtaining distances between whole-genome sequencing samples, a key part of modern metagenomics analysis and an ever-growing need due to the increasing availability of high-throughput DNA sequencing data. The resulting scheme is the first to enable accurate Jaccard distance derivations for massive datasets, using large-scale distributed-memory systems. We package our routines in a tool, called GenomeAtScale, that combines the proposed algorithm with tools for processing input sequences. Our evaluation on real data illustrates that one can use GenomeAtScale to effectively employ tens of thousands of processors to reach new frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more general underlying SimilarityAtScale algorithm may be used for high-performance distributed similarity computations in other data analytics application domains.
Authors Maciej Besta, Raghavendra Kanakagiri, Harun Mustafa, Mikhail Karasikov, Gunnar Rätsch, Torsten Hoefler, Edgar Solomonik
Submitted IPDPS 2020
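The algebraic reformulation at the heart of SimilarityAtScale can be illustrated in a few lines of numpy: all pairwise intersection sizes come from a single matrix product, from which unions and Jaccard similarities follow (the paper implements these operations with communication-avoiding distributed routines on sparse matrices):

```python
import numpy as np

# Binary presence/absence matrix: rows = samples, columns = set elements
# (e.g. k-mers of whole-genome sequencing samples).
A = np.array([[1, 1, 0, 1, 0],
              [1, 0, 0, 1, 1],
              [0, 1, 1, 0, 0]], dtype=np.int64)

inter = A @ A.T                        # |S_i ∩ S_j| for all pairs at once
sizes = np.diag(inter)                 # |S_i| on the diagonal
union = sizes[:, None] + sizes[None, :] - inter
J = inter / union                      # pairwise Jaccard similarity matrix

# e.g. samples 0 and 1 share 2 of 4 distinct elements: J[0, 1] == 0.5
```

Expressed this way, the dominant cost is one (sparse) matrix-matrix product, which is exactly the primitive that communication-avoiding distributed linear algebra optimizes.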
Abstract We present an algorithm for the optimal alignment of sequences to genome graphs. It works by phrasing the edit distance minimization task as finding a shortest path on an implicit alignment graph. To find a shortest path, we instantiate the A⋆ paradigm with a novel domain-specific heuristic function that accounts for the upcoming sub-sequence in the query to be aligned, resulting in a provably optimal alignment algorithm called AStarix. Experimental evaluation of AStarix shows that it is 1–2 orders of magnitude faster than state-of-the-art optimal algorithms on the task of aligning Illumina reads to reference genome graphs. Implementations and evaluations are available at https://github.com/eth-sri/astarix.
Authors Pesho Ivanov, Benjamin Bichsel, Harun Mustafa, André Kahles, Gunnar Rätsch, Martin Vechev
Submitted RECOMB 2020
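As a hedged illustration of the A* framing only (not AStarix's algorithm: it aligns to genome graphs and uses a much stronger domain-specific heuristic based on the upcoming query sub-sequence), a minimal A* for plain edit distance with a trivially admissible heuristic looks like:

```python
import heapq

def edit_distance_astar(query, ref):
    """A* shortest path on the implicit alignment grid of two strings.

    State (i, j): i characters of `query` and j of `ref` consumed.
    Heuristic: the difference of remaining lengths is a lower bound on
    the edits still needed, so it is admissible and the result is exact.
    """
    n, m = len(query), len(ref)
    h = lambda i, j: abs((n - i) - (m - j))
    dist = {(0, 0): 0}
    pq = [(h(0, 0), 0, 0)]             # (f = g + h, i, j)
    while pq:
        f, i, j = heapq.heappop(pq)
        g = f - h(i, j)
        if g > dist.get((i, j), float("inf")):
            continue                   # stale queue entry
        if (i, j) == (n, m):
            return g
        # Edges: match/mismatch (diagonal), deletion, insertion.
        diag_cost = 0 if i < n and j < m and query[i] == ref[j] else 1
        for di, dj, c in ((1, 1, diag_cost), (1, 0, 1), (0, 1, 1)):
            ni, nj = i + di, j + dj
            if ni <= n and nj <= m and g + c < dist.get((ni, nj), float("inf")):
                dist[(ni, nj)] = g + c
                heapq.heappush(pq, (g + c + h(ni, nj), ni, nj))

edit_distance_astar("kitten", "sitting")  # classic example: distance 3
```

The better the heuristic lower-bounds the remaining alignment cost, the fewer states A* expands; this is precisely where AStarix's domain-specific heuristic yields its speedup.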
Abstract Intensive-care clinicians are presented with large quantities of measurements from multiple monitoring systems. The limited ability of humans to process complex information hinders early recognition of patient deterioration, and high numbers of monitoring alarms lead to alarm fatigue. We used machine learning to develop an early-warning system that integrates measurements from multiple organ systems using a high-resolution database with 240 patient-years of data. It predicts 90% of circulatory-failure events in the test set, with 82% identified more than 2 h in advance, resulting in an area under the receiver operating characteristic curve of 0.94 and an area under the precision-recall curve of 0.63. On average, the system raises 0.05 alarms per patient and hour. The model was externally validated in an independent patient cohort. Our model provides early identification of patients at risk for circulatory failure with a much lower false-alarm rate than conventional threshold-based systems.
Authors Stephanie L. Hyland, Martin Faltys, Matthias Hüser, Xinrui Lyu, Thomas Gumbsch, Cristóbal Esteban, Christian Bock, Max Horn, Michael Moor, Bastian Rieck, Marc Zimmermann, Dean Bodenham, Karsten Borgwardt, Gunnar Rätsch & Tobias M. Merz
Submitted Nature Medicine
Abstract Motivation Methodological advances in metagenome assembly are rapidly increasing the number of published metagenome assemblies. However, identifying misassemblies is challenging due to a lack of closely related reference genomes that can act as pseudo ground truth. Existing reference-free methods are no longer maintained, can make strong assumptions that may not hold across a diversity of research projects, and have not been validated on large-scale metagenome assemblies. Results We present DeepMAsED, a deep learning approach for identifying misassembled contigs without the need for reference genomes. Moreover, we provide an in silico pipeline for generating large-scale, realistic metagenome assemblies for comprehensive model training and testing. DeepMAsED accuracy substantially exceeds the state-of-the-art when applied to large and complex metagenome assemblies. Our model estimates a 1% contig misassembly rate in two recent large-scale metagenome assembly publications. Conclusions DeepMAsED accurately identifies misassemblies in metagenome-assembled contigs from a broad diversity of bacteria and archaea without the need for reference genomes or strong modeling assumptions. Running DeepMAsED is straightforward, as is model re-training with our dataset generation pipeline. Therefore, DeepMAsED is a flexible misassembly classifier that can be applied to a wide range of metagenome assembly projects.
Authors Olga Mineeva, Mateo Rojas-Carulla, Ruth E Ley, Bernhard Schölkopf, Nicholas D Youngblut
Submitted Bioinformatics (Oxford, England)
Abstract Transcript alterations often result from somatic changes in cancer genomes. Various forms of RNA alterations have been described in cancer, including overexpression, altered splicing and gene fusions; however, it is difficult to attribute these to underlying genomic changes owing to heterogeneity among patients and tumour types, and the relatively small cohorts of patients for whom samples have been analysed by both transcriptome and whole-genome sequencing. Here we present, to our knowledge, the most comprehensive catalogue of cancer-associated gene alterations to date, obtained by characterizing tumour transcriptomes from 1,188 donors of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). Using matched whole-genome sequencing data, we associated several categories of RNA alterations with germline and somatic DNA alterations, and identified probable genetic mechanisms. Somatic copy-number alterations were the major drivers of variations in total gene and allele-specific expression. We identified 649 associations of somatic single-nucleotide variants with gene expression in cis, of which 68.4% involved associations with flanking non-coding regions of the gene. We found 1,900 splicing alterations associated with somatic mutations, including the formation of exons within introns in proximity to Alu elements. In addition, 82% of gene fusions were associated with structural variants, including 75 of a new class, termed ‘bridged’ fusions, in which a third genomic location bridges two genes. We observed transcriptomic alteration signatures that differ between cancer types and have associations with variations in DNA mutational signatures. This compendium of RNA alterations in the genomic context provides a rich resource for identifying genes and mechanisms that are functionally implicated in cancer.
Authors PCAWG Transcriptome Core Group, Claudia Calabrese, Natalie R Davidson, Deniz Demircioğlu, Nuno A. Fonseca, Yao He, André Kahles, Kjong-Van Lehmann, Fenglin Liu, Yuichi Shiraishi, Cameron M. Soulette, Lara Urban, Liliana Greger, Siliang Li, Dongbing Liu, Marc D. Perry, Qian Xiang, Fan Zhang, Junjun Zhang, Peter Bailey, Serap Erkek, Katherine A. Hoadley, Yong Hou, Matthew R. Huska, Helena Kilpinen, Jan O. Korbel, Maximillian G. Marin, Julia Markowski, Tannistha Nandi, Qiang Pan-Hammarström, Chandra Sekhar Pedamallu, Reiner Siebert, Stefan G. Stark, Hong Su, Patrick Tan, Sebastian M. Waszak, Christina Yung, Shida Zhu, Philip Awadalla, Chad J. Creighton, Matthew Meyerson, B. F. Francis Ouellette, Kui Wu, Huanming Yang, PCAWG Transcriptome Working Group, Alvis Brazma, Angela N. Brooks, Jonathan Göke, Gunnar Rätsch, Roland F. Schwarz, Oliver Stegle, Zemin Zhang & PCAWG Consortium
Submitted Nature 578, 129–136 (2020)
Abstract The discovery of drivers of cancer has traditionally focused on protein-coding genes1,2,3,4. Here we present analyses of driver point mutations and structural variants in non-coding regions across 2,658 genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium5 of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). For point mutations, we developed a statistically rigorous strategy for combining significance levels from multiple methods of driver discovery that overcomes the limitations of individual methods. For structural variants, we present two methods of driver discovery, and identify regions that are significantly affected by recurrent breakpoints and recurrent somatic juxtapositions. Our analyses confirm previously reported drivers6,7, raise doubts about others and identify novel candidates, including point mutations in the 5′ region of TP53, in the 3′ untranslated regions of NFKBIZ and TOB1, focal deletions in BRD4 and rearrangements in the loci of AKR1C genes. We show that although point mutations and structural variants that drive cancer are less frequent in non-coding genes and regulatory sequences than in protein-coding genes, additional examples of these drivers will be found as more cancer genomes become available.
Authors Esther Rheinbay, Morten Muhlig Nielsen, Federico Abascal, Jeremiah A. Wala, Ofer Shapira, Grace Tiao, Henrik Hornshøj, Julian M. Hess, Randi Istrup Juul, Ziao Lin, Lars Feuerbach, Radhakrishnan Sabarinathan, Tobias Madsen, Jaegil Kim, Loris Mularoni, Shimin Shuai, Andrés Lanzós, Carl Herrmann, Yosef E. Maruvka, Ciyue Shen, Samirkumar B. Amin, Pratiti Bandopadhayay, Johanna Bertl, Keith A. Boroevich, John Busanovich, Joana Carlevaro-Fita, Dimple Chakravarty, Calvin Wing Yiu Chan, David Craft, Priyanka Dhingra, Klev Diamanti, Nuno A. Fonseca, Abel Gonzalez-Perez, Qianyun Guo, Mark P. Hamilton, Nicholas J. Haradhvala, Chen Hong, Keren Isaev, Todd A. Johnson, Malene Juul, Andre Kahles, Abdullah Kahraman, Youngwook Kim, Jan Komorowski, Kiran Kumar, Sushant Kumar, Donghoon Lee, Kjong-Van Lehmann, Yilong Li, Eric Minwei Liu, Lucas Lochovsky, Keunchil Park, Oriol Pich, Nicola D. Roberts, Gordon Saksena, Steven E. Schumacher, Nikos Sidiropoulos, Lina Sieverling, Nasa Sinnott-Armstrong, Chip Stewart, David Tamborero, Jose M. C. Tubio, Husen M. Umer, Liis Uusküla-Reimand, Claes Wadelius, Lina Wadi, Xiaotong Yao, Cheng-Zhong Zhang, Jing Zhang, James E. Haber, Asger Hobolth, Marcin Imielinski, Manolis Kellis, Michael S. Lawrence, Christian von Mering, Hidewaki Nakagawa, Benjamin J. Raphael, Mark A. Rubin, Chris Sander, Lincoln D. Stein, Joshua M. Stuart, Tatsuhiko Tsunoda, David A. Wheeler, Rory Johnson, Jüri Reimand, Mark Gerstein, Ekta Khurana, Peter J. Campbell, Núria López-Bigas, PCAWG Drivers and Functional Interpretation Working Group, PCAWG Structural Variation Working Group, Joachim Weischenfeldt, Rameen Beroukhim, Iñigo Martincorena, Jakob Skou Pedersen, Gad Getz & PCAWG Consortium
Submitted Nature
Abstract Objective: Acute intracranial hypertension is an important risk factor of secondary brain damage after traumatic brain injury. Hypertensive episodes are often diagnosed reactively, leading to late detection and lost time for intervention planning. A pro-active approach that predicts critical events several hours ahead of time could assist in directing attention to patients at risk. Approach: We developed a prediction framework that forecasts onsets of acute intracranial hypertension in the next 8 hours. It jointly uses cerebral auto-regulation indices, spectral energies and morphological pulse metrics to describe the neurological state of the patient. One-minute base windows were compressed by computing signal metrics, and then stored in a multi-scale history, from which physiological features were derived. Main results: Our model predicted events up to 8 hours in advance with alarm recall rates of 90% at a precision of 30% in the MIMIC-III waveform database, improving upon two baselines from the literature. We found that features derived from high-frequency waveforms substantially improved the prediction performance over simple statistical summaries of low-frequency time series, and each of the three feature classes contributed to the performance gain. The inclusion of long-term history up to 8 hours was especially important. Significance: Our results highlight the importance of information contained in high-frequency waveforms in the neurological intensive care unit. They could motivate future studies on pre-hypertensive patterns and the design of new alarm algorithms for critical events in the injured brain.
Authors Matthias Hüser, Adrian Kündig, Walter Karlen, Valeria De Luca, Martin Jaggi
Submitted Physiological Measurement
Abstract The goal of the unsupervised learning of disentangled representations is to separate the independent explanatory factors of variation in the data without access to supervision. In this paper, we summarize the results of Locatello et al., 2019, and focus on their implications for practitioners. We discuss the theoretical result showing that the unsupervised learning of disentangled representations is fundamentally impossible without inductive biases and the practical challenges it entails. Finally, we comment on our experimental findings, highlighting the limitations of state-of-the-art approaches and directions for future research.
Authors Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, Olivier Bachem
Submitted AAAI 2020
Abstract It is difficult to elucidate the transcriptional history of a cell using current experimental approaches, as they are destructive in nature and therefore describe only a moment in time. To overcome these limitations, we recently established Record-seq, a technology that enables transcriptional recording by CRISPR spacer acquisition from RNA. The recorded transcriptomes are recovered by SENECA, a method that selectively amplifies expanded CRISPR arrays, followed by deep sequencing. The resulting CRISPR spacers are aligned to the host genome, thereby enabling transcript quantification and associated analyses. Here, we describe the experimental procedures of the Record-seq workflow as well as subsequent data analysis. Beginning with the experimental design, Record-seq data can be obtained and analyzed within 1–2 weeks.
Authors Tanmay Tanna, Florian Schmidt, Mariia Y. Cherepkova, Michal Okoniewski, Randall J. Platt
Submitted Nature Protocols
2019
Abstract High-throughput DNA sequencing data are accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. Although there has been good progress toward representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this study, we present a new compression approach, Multi-BRWT, an extension of the binary relation wavelet tree (BRWT) that is adaptive to different kinds of input data. We show an up to 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state-of-the-art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data, and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world data sets.
Authors Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh-No, Gunnar Rätsch, André Kahles
Submitted Journal of Computational Biology
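The column-grouping principle behind the BRWT can be illustrated with a toy sketch (this is only the idea, not the paper's implementation): a parent node stores the OR of a group of label columns, and each child column is then stored only on the rows where the parent is set, so sparse and correlated columns take less space than a flat bitmap.

```python
def brwt_node_sizes(columns):
    """Toy sketch of one BRWT node: the parent stores the OR ('aggregate')
    of a group of binary label columns; each child column is restricted to
    the rows where the parent is set. Returns (bits stored, naive bits)."""
    n = len(columns[0])
    parent = [any(bits) for bits in zip(*columns)]       # OR over the group
    # parent bitmap + one restricted bitmap per child column
    stored = len(parent) + sum(parent) * len(columns)
    naive = n * len(columns)                             # flat column storage
    return stored, naive

# Two sparse, correlated label columns over 8 graph nodes
cols = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
]
stored, naive = brwt_node_sizes(cols)
assert stored < naive   # 12 bits stored vs. 16 bits naively
```

The gain grows with sparsity and with correlation between the grouped columns, which is why adaptively choosing which columns to group (as Multi-BRWT does) matters.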
Abstract Intra-tumor hypoxia is a common feature in many solid cancers. Although transcriptional targets of hypoxia-inducible factors (HIFs) have been well characterized, alternative splicing or processing of pre-mRNA transcripts which occurs during hypoxia and subsequent HIF stabilization is much less understood. Here, we identify HIF-dependent alternative splicing events after whole transcriptome sequencing in pancreatic cancer cells exposed to hypoxia with and without downregulation of the aryl hydrocarbon receptor nuclear translocator (ARNT), a protein required for HIFs to form a transcriptionally active dimer. We correlate the discovered hypoxia-driven events with available sequencing data from pan-cancer TCGA patient cohorts to select a narrow set of putative biologically relevant splice events for experimental validation. We validate a small set of candidate HIF-dependent alternative splicing events in multiple human cancer cell lines as well as patient-derived human pancreatic cancer organoids. Lastly, we report the discovery of a HIF-dependent mechanism to produce a hypoxia-dependent, long and coding isoform of the UDP-N-acetylglucosamine transporter SLC35A3.
Authors Philipp Markolin, Natalie R Davidson, Christian K. Hirt, Christophe D. Chabbert, Nicola Zamboni, Gerald Schwank, Wilhelm Krek, Gunnar Rätsch
Submitted bioRxiv
Abstract Obtaining high-quality uncertainty estimates is essential for many applications of deep neural networks. In this paper, we theoretically justify a scheme for estimating uncertainties, based on sampling from a prior distribution. Crucially, the uncertainty estimates are shown to be conservative in the sense that they never underestimate a posterior uncertainty obtained by a hypothetical Bayesian algorithm. We also show concentration, implying that the uncertainty estimates converge to zero as we get more data. Uncertainty estimates obtained from random priors can be adapted to any deep network architecture and trained using standard supervised learning pipelines. We provide experimental evaluation of random priors on calibration and out-of-distribution detection on typical computer vision tasks, demonstrating that they outperform deep ensembles in practice.
Authors Kamil Ciosek, Vincent Fortuin, Ryota Tomioka, Katja Hofmann, Richard Turner
Submitted ICLR 2020
Abstract Metagenomic studies have increasingly utilized sequencing technologies in order to analyze DNA fragments found in environmental samples. One important step in this analysis is the taxonomic classification of the DNA fragments. Conventional read classification methods require large databases and vast amounts of memory to run, with recent deep learning methods suffering from very large model sizes. We therefore aim to develop a more memory-efficient technique for taxonomic classification. A task of particular interest is abundance estimation in metagenomic samples. Current attempts rely on classifying single DNA reads independently from each other and are therefore agnostic to co-occurrence patterns between taxa. In this work, we also attempt to take these patterns into account. We develop a novel memory-efficient read classification technique, combining deep learning and locality-sensitive hashing. We show that this approach outperforms conventional mapping-based and other deep learning methods for single-read taxonomic classification when restricting all methods to a fixed memory footprint. Moreover, we formulate the task of abundance estimation as a Multiple Instance Learning (MIL) problem and we extend current deep learning architectures with two different types of permutation-invariant MIL pooling layers: a) deepsets and b) attention-based pooling. We illustrate that our architectures can exploit the co-occurrence of species in metagenomic read sets and outperform the single-read architectures in predicting the distribution over taxa at higher taxonomic ranks.
Authors Andreas Georgiou, Vincent Fortuin, Harun Mustafa, Gunnar Rätsch
Submitted arXiv
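The attention-based MIL pooling mentioned in the abstract can be sketched in a few lines of numpy. This follows the standard gated-free attention pooling formulation (a softmax over per-instance scores), with toy parameters `V` and `w` standing in for learned weights; it is not the authors' actual architecture.

```python
import numpy as np

def attention_mil_pool(H, V, w):
    """Permutation-invariant attention pooling over a bag of instance
    embeddings H (n_instances x d). V (k x d) and w (k,) are toy stand-ins
    for learned attention parameters."""
    scores = w @ np.tanh(V @ H.T)        # one scalar score per instance
    a = np.exp(scores - scores.max())
    a = a / a.sum()                      # softmax attention weights
    return a @ H                         # weighted sum -> bag embedding

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))              # bag of 5 reads, 4-dim embeddings
V = rng.normal(size=(3, 4))
w = rng.normal(size=3)

bag = attention_mil_pool(H, V, w)
assert np.allclose(bag, attention_mil_pool(H[::-1], V, w))  # read order is irrelevant
```

Permutation invariance is exactly what makes such a layer suitable for read *sets*: the bag embedding depends only on which reads are present, not on their order.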
Abstract Most human protein-coding genes are regulated by multiple, distinct promoters, suggesting that the choice of promoter is as important as its level of transcriptional activity. However, while a global change in transcription is recognized as a defining feature of cancer, the contribution of alternative promoters still remains largely unexplored. Here, we infer active promoters using RNA-seq data from 18,468 cancer and normal samples, demonstrating that alternative promoters are a major contributor to context-specific regulation of transcription. We find that promoters are deregulated across tissues, cancer types, and patients, affecting known cancer genes and novel candidates. For genes with independently regulated promoters, we demonstrate that promoter activity provides a more accurate predictor of patient survival than gene expression. Our study suggests that a dynamic landscape of active promoters shapes the cancer transcriptome, opening new diagnostic avenues and opportunities to further explore the interplay of regulatory mechanisms with transcriptional aberrations in cancer.
Authors Demircioğlu D, Cukuroglu E, Kindermans M, Nandi T, Calabrese C, Fonseca NA, Kahles A, Lehmann KV, Stegle O, Brazma A, Brooks AN, Rätsch G, Tan P, Göke J.
Submitted Cell
Abstract Multivariate time series with missing values are common in areas such as healthcare and finance, and have grown in number and complexity over the years. This raises the question whether deep learning methodologies can outperform classical data imputation methods in this domain. However, naive applications of deep learning fall short in giving reliable confidence estimates and lack interpretability. We propose a new deep sequential latent variable model for dimensionality reduction and data imputation. Our modeling assumption is simple and interpretable: the high dimensional time series has a lower-dimensional representation which evolves smoothly in time according to a Gaussian process. The non-linear dimensionality reduction in the presence of missing data is achieved using a VAE approach with a novel structured variational approximation. We demonstrate that our approach outperforms several classical and deep learning-based data imputation methods on high-dimensional data from the domains of computer vision and healthcare, while additionally improving the smoothness of the imputations and providing interpretable uncertainty estimates.
Authors Vincent Fortuin, Dmitry Baranchuk, Gunnar Rätsch, Stephan Mandt
Submitted AISTATS 2020
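The core modeling assumption above (a low-dimensional latent trajectory that evolves smoothly under a Gaussian process prior) can be sketched directly; the squared-exponential kernel and length scale below are illustrative choices, not the paper's configuration.

```python
import numpy as np

def rbf_kernel(t, length_scale=1.0):
    """Squared-exponential kernel over time points t -- the smoothness
    assumption a GP prior places on the latent trajectory."""
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

# Toy version of the prior: latent series z(t) drawn from a GP evolve
# smoothly in time; a decoder (omitted) would map them to observations.
rng = np.random.default_rng(1)
t = np.linspace(0, 10, 50)
K = rbf_kernel(t, length_scale=2.0) + 1e-6 * np.eye(len(t))  # jitter for stability
z = rng.multivariate_normal(np.zeros(len(t)), K, size=2)     # 2 latent dimensions

# Smoothness shows up as small step-to-step changes relative to overall spread
assert np.abs(np.diff(z, axis=1)).mean() < z.std()
```

Imputation then amounts to inferring this smooth latent trajectory from the observed entries and decoding it, which is what gives the model both smooth imputations and calibrated uncertainty.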
Abstract The oncogenic c-MYC (MYC) transcription factor has broad effects on gene expression and cell behavior. We show that MYC alters the efficiency and quality of mRNA translation into functional proteins. Specifically, MYC drives the translation of most protein components of the electron transport chain in lymphoma cells, and many of these effects are independent from proliferation. Specific interactions of MYC-sensitive RNA-binding proteins (e.g., SRSF1/RBM42) with 5'UTR sequence motifs mediate many of these changes. Moreover, we observe a striking shift in translation initiation site usage. For example, in low-MYC conditions, lymphoma cells initiate translation of the CD19 mRNA from a site in exon 5. This results in the truncation of all extracellular CD19 domains and facilitates escape from CD19-directed CAR-T cell therapy. Together, our findings reveal MYC effects on the translation of key metabolic enzymes and immune receptors in lymphoma cells.
Authors Singh K, Lin J, Zhong Y, Burčul A, Mohan P, Jiang M, Sun L, Yong-Gonzalez V, Viale A, Cross JR, Hendrickson RC, Rätsch G, Ouyang Z, Wendel HG.
Submitted J Exp Med.
Abstract We consider the problem of recovering a common latent source with independent components from multiple views. This applies to settings in which a variable is measured with multiple experimental modalities, and where the goal is to synthesize the disparate measurements into a single unified representation. We consider the case that the observed views are a nonlinear mixing of component-wise corruptions of the sources. When the views are considered separately, this reduces to nonlinear Independent Component Analysis (ICA) for which it is provably impossible to undo the mixing. We present novel identifiability proofs that this is possible when the multiple views are considered jointly, showing that the mixing can theoretically be undone using function approximators such as deep neural networks. In contrast to known identifiability results for nonlinear ICA, we prove that independent latent sources with arbitrary mixing can be recovered as long as multiple, sufficiently different noisy views are available.
Authors Luigi Gresele, Paul K Rubenstein, Arash Mehrjou, Francesco Locatello, Bernhard Schölkopf
Submitted UAI 2019
Abstract Learning disentangled representations is considered a cornerstone problem in representation learning. Recently, Locatello et al. (2019) demonstrated that unsupervised disentanglement learning without inductive biases is theoretically impossible and that existing inductive biases and unsupervised methods do not allow to consistently learn disentangled representations. However, in many practical settings, one might have access to a very limited amount of supervision, for example through manual labeling of training examples. In this paper, we investigate the impact of such supervision on state-of-the-art disentanglement methods and perform a large-scale study, training over 29,000 models under well-defined and reproducible experimental conditions. We first observe that a very limited number of labeled examples (0.01–0.5% of the data set) is sufficient to perform model selection on state-of-the-art unsupervised models. Yet, if one has access to labels for supervised model selection, this raises the natural question of whether they should also be incorporated into the training process. As a case-study, we test the benefit of introducing (very limited) supervision into existing state-of-the-art unsupervised disentanglement methods exploiting both the values of the labels and the ordinal information that can be deduced from them. Overall, we empirically validate that with very little and potentially imprecise supervision it is possible to reliably learn disentangled representations.
Authors Francesco Locatello, Michael Tschannen, Stefan Bauer, Gunnar Rätsch, Bernhard Schölkopf, Olivier Bachem
Submitted ICLR 2020
Abstract In recent years, the interest in unsupervised learning of disentangled representations has significantly increased. The key assumption is that real-world data is generated by a few explanatory factors of variation and that these factors can be recovered by unsupervised learning algorithms. A large number of unsupervised learning approaches based on auto-encoding and quantitative evaluation metrics of disentanglement have been proposed; yet, the efficacy of the proposed approaches and utility of proposed notions of disentanglement has not been challenged in prior work. In this paper, we provide a sober look on recent progress in the field and challenge some common assumptions. We first theoretically show that the unsupervised learning of disentangled representations is fundamentally impossible without inductive biases on both the models and the data. Then, we train more than 12,000 models covering the six most prominent methods, and evaluate them across six disentanglement metrics in a reproducible large-scale experimental study on seven different data sets. On the positive side, we observe that different methods successfully enforce properties "encouraged" by the corresponding losses. On the negative side, we observe that in our study (1) "good" hyperparameters seemingly cannot be identified without access to ground-truth labels, (2) good hyperparameters neither transfer across data sets nor across disentanglement metrics, and (3) that increased disentanglement does not seem to lead to a decreased sample complexity of learning for downstream tasks. These results suggest that future work on disentanglement learning should be explicit about the role of inductive biases and (implicit) supervision, investigate concrete benefits of enforcing disentanglement of the learned representations, and consider a reproducible experimental setup covering several data sets.
Authors Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, Olivier Bachem
Submitted ICML 2019 - Best Paper Award
Abstract Macrophages tailor their function according to the signals found in tissue microenvironments, assuming a wide spectrum of phenotypes. A detailed understanding of macrophage phenotypes in human tissues is limited. Using single-cell RNA sequencing, we defined distinct macrophage subsets in the joints of patients with the autoimmune disease rheumatoid arthritis (RA), which affects ~1% of the population. The subset we refer to as HBEGF+ inflammatory macrophages is enriched in RA tissues and is shaped by resident fibroblasts and the cytokine tumor necrosis factor (TNF). These macrophages promoted fibroblast invasiveness in an epidermal growth factor receptor–dependent manner, indicating that intercellular cross-talk in this inflamed setting reshapes both cell types and contributes to fibroblast-mediated joint destruction. In an ex vivo synovial tissue assay, most medications used to treat RA patients targeted HBEGF+ inflammatory macrophages; however, in some cases, medication redirected them into a state that is not expected to resolve inflammation. These data highlight how advances in our understanding of chronically inflamed human tissues and the effects of medications therein can be achieved by studies on local macrophage phenotypes and intercellular interactions.
Authors David Kuo, Jennifer Ding, Ian Cohn, Fan Zhang, Kevin Wei, Deepak Rao, Cristina Rozo, Upneet K Sokhi, Sara Shanaj, David J. Oliver, Adriana P. Echeverria, Edward F. DiCarlo, Michael B. Brenner, Vivian P. Bykerk, Susan M. Goodman, Soumya Raychaudhuri, Gunnar Rätsch, Lionel B. Ivashkiv, Laura T. Donlin
Submitted Science Translational Medicine
Abstract The recent adoption of Electronic Health Records (EHRs) by health care providers has introduced an important source of data that provides detailed and highly specific insights into patient phenotypes over large cohorts. These datasets, in combination with machine learning and statistical approaches, generate new opportunities for research and clinical care. However, many methods require the patient representations to be in structured formats, while the information in the EHR is often locked in unstructured texts designed for human readability. In this work, we develop the methodology to automatically extract clinical features from clinical narratives from large EHR corpora without the need for prior knowledge. We consider medical terms and sentences appearing in clinical narratives as atomic information units. We propose an efficient clustering strategy suitable for the analysis of large text corpora and to utilize the clusters to represent information about the patient compactly. To demonstrate the utility of our approach, we perform an association study of clinical features with somatic mutation profiles from 4,007 cancer patients and their tumors. We apply the proposed algorithm to a dataset consisting of about 65 thousand documents with a total of about 3.2 million sentences. We identify 341 significant statistical associations between the presence of somatic mutations and clinical features. We annotated these associations according to their novelty, and report several known associations. We also propose 32 testable hypotheses where the underlying biological mechanism does not appear to be known but plausible. These results illustrate that the …
Authors Stefan G Stark, Stephanie L Hyland, Melanie F Pradier, Kjong-Van Lehmann, Andreas Wicki, Fernando Perez Cruz, Julia E Vogt, Gunnar Rätsch
Submitted arXiv
Abstract Macrophages tailor their function to the signals found in tissue microenvironments, taking on a wide spectrum of phenotypes. In human tissues, a detailed understanding of macrophage phenotypes is limited. Using single-cell RNA-sequencing, we define distinct macrophage subsets in the joints of patients with the autoimmune disease rheumatoid arthritis (RA), which affects ~1% of the population. The subset we refer to as HBEGF+ inflammatory macrophages is enriched in RA tissues and shaped by resident fibroblasts and the cytokine TNF. These macrophages promote fibroblast invasiveness in an EGF receptor-dependent manner, indicating that inflammatory intercellular crosstalk reshapes both cell types and contributes to fibroblast-mediated joint destruction. In an ex vivo tissue assay, the HBEGF+ inflammatory macrophage is targeted by several anti-inflammatory RA medications; however, COX inhibition redirects it towards a different inflammatory phenotype that is also expected to perpetuate pathology. These data highlight that advances in understanding the pathophysiology and drug mechanisms in chronic inflammatory disorders can be achieved by focusing on macrophage phenotypes in the context of complex interactions in human tissues.
Authors David Kuo, Jennifer Ding, Ian Cohn, Fan Zhang, Kevin Wei, Deepak Rao, Cristina Rozo, Upneet K Sokhi, Accelerating Medicines Partnership RA/SLE Network, Edward F. DiCarlo, Michael B. Brenner, Vivian P. Bykerk, Susan M. Goodman, Soumya Raychaudhuri, Gunnar Rätsch, Lionel B. Ivashkiv, Laura T. Donlin
Submitted bioRxiv
Abstract When fitting Bayesian machine learning models on scarce data, the main challenge is to obtain suitable prior knowledge and encode it into the model. Recent advances in meta-learning offer powerful methods for extracting such prior knowledge from data acquired in related tasks. In Gaussian process models, meta-learning approaches have mostly focused on learning the kernel function of the prior, but not on learning its mean function. In this work, we explore meta-learning the mean function of a Gaussian process prior. We present analytical and empirical evidence that mean function learning can be useful in the meta-learning setting, discuss the risk of overfitting, and draw connections to other meta-learning approaches, such as model-agnostic meta-learning and functional PCA.
Authors Vincent Fortuin, Heiko Strathmann, Gunnar Rätsch
Submitted arXiv Preprints
2018
Abstract The BRCA Challenge is a long-term data-sharing project initiated within the Global Alliance for Genomics and Health (GA4GH) to aggregate BRCA1 and BRCA2 data to support highly collaborative research activities. Its goal is to generate an informed and current understanding of the impact of genetic variation on cancer risk across the iconic cancer predisposition genes, BRCA1 and BRCA2. Initially, reported variants in BRCA1 and BRCA2 available from public databases were integrated into a single, newly created site, www.brcaexchange.org. The purpose of the BRCA Exchange is to provide the community with a reliable and easily accessible record of variants interpreted for a high-penetrance phenotype. More than 20,000 variants have been aggregated, three times the number found in the next-largest public database at the project’s outset, of which approximately 7,250 have expert classifications. The data set is based on shared information from existing clinical databases—Breast Cancer Information Core (BIC), ClinVar, and the Leiden Open Variation Database (LOVD)—as well as population databases, all linked to a single point of access. The BRCA Challenge has brought together the existing international Evidence-based Network for the Interpretation of Germline Mutant Alleles (ENIGMA) consortium expert panel, along with expert clinicians, diagnosticians, researchers, and database providers, all with a common goal of advancing our understanding of BRCA1 and BRCA2 variation. Ongoing work includes direct contact with national centers with access to BRCA1 and BRCA2 diagnostic data to encourage data sharing, development of methods suitable for extraction of genetic variation at the level of individual laboratory reports, and engagement with participant communities to enable a more comprehensive understanding of the clinical significance of genetic variation in BRCA1 and BRCA2.
Authors Melissa S. Cline, Rachel G. Liao, Michael T. Parsons, Benedict Paten, Faisal Alquaddoomi, Antonis Antoniou, Samantha Baxter, Larry Brody, Robert Cook-Deegan, Amy Coffin, Fergus J. Couch, Brian Craft, Robert Currie, Chloe C. Dlott, Lena Dolman, Johan T. den Dunnen, Stephanie O. M. Dyke, Susan M. Domchek, Douglas Easton, Zachary Fischmann, William D. Foulkes, Judy Garber, David Goldgar, Mary J. Goldman, Peter Goodhand, Steven Harrison, David Haussler, Kazuto Kato, Bartha Knoppers, Charles Markello, Robert Nussbaum, Kenneth Offit, Sharon E. Plon, Jem Rashbass, Heidi L. Rehm, Mark Robson, Wendy S. Rubinstein, Dominique Stoppa-Lyonnet, Sean Tavtigian, Adrian Thorogood, Can Zhang, Marc Zimmermann, BRCA Challenge Authors, John Burn, Stephen Chanock, Gunnar Rätsch, Amanda B. Spurdle
Submitted PLOS Genetics
Abstract High-dimensional time series are common in many domains. Since human cognition is not optimized to work well in high-dimensional spaces, these areas could benefit from interpretable low-dimensional representations. However, most representation learning algorithms for time series data are difficult to interpret. This is due to non-intuitive mappings from data features to salient properties of the representation and non-smoothness over time. To address this problem, we propose a new representation learning framework building on ideas from interpretable discrete dimensionality reduction and deep generative modeling. This framework allows us to learn discrete representations of time series, which give rise to smooth and interpretable embeddings with superior clustering performance. We introduce a new way to overcome the non-differentiability in discrete representation learning and present a gradient-based version of the traditional self-organizing map algorithm that is more performant than the original. Furthermore, to allow for a probabilistic interpretation of our method, we integrate a Markov model in the representation space. This model uncovers the temporal transition structure, improves clustering performance even further and provides additional explanatory insights as well as a natural representation of uncertainty. We evaluate our model in terms of clustering performance and interpretability on static (Fashion-)MNIST data, a time series of linearly interpolated (Fashion-)MNIST images, a chaotic Lorenz attractor system with two macro states, as well as on a challenging real world medical time series application on the eICU data set. Our learned representations compare favorably with competitor methods and facilitate downstream tasks on the real world data.
Authors Vincent Fortuin, Matthias Hüser, Francesco Locatello, Heiko Strathmann, Gunnar Rätsch
Submitted ICLR 2019
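For context, the traditional self-organizing map update that the paper builds on (and makes differentiable) can be sketched as follows; the 1-D grid, learning rate, and neighbourhood width are toy choices, and this is the classic algorithm, not the paper's gradient-based variant.

```python
import numpy as np

def som_update(codebook, x, lr=0.5, sigma=1.0):
    """One classic self-organizing map step on a 1-D grid of code vectors:
    find the best-matching unit (BMU) and pull it and its grid neighbours
    towards the input x, with strength decaying over grid distance."""
    dists = np.linalg.norm(codebook - x, axis=1)
    bmu = int(dists.argmin())
    grid = np.arange(len(codebook))
    h = np.exp(-0.5 * ((grid - bmu) / sigma) ** 2)   # neighbourhood weights
    return codebook + lr * h[:, None] * (x - codebook)

codebook = np.zeros((4, 2))                          # 4 grid cells, 2-dim codes
x = np.array([1.0, 1.0])
new = som_update(codebook, x)
# the BMU moved towards x; its grid neighbours moved less
assert np.linalg.norm(new[0] - x) < np.linalg.norm(codebook[0] - x)
```

The non-differentiability the paper addresses lies in the `argmin` above: the BMU selection has no gradient, which is why a straight-through-style estimator is needed to train such a map end-to-end inside a deep model.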
Abstract Neural Processes (NPs) are a class of neural latent variable models that combine desirable properties of Gaussian Processes (GPs) and neural networks. Like GPs, NPs define distributions over functions and are able to estimate the uncertainty in their predictions. Like neural networks, NPs are computationally efficient during training and prediction time. In this paper, we establish an explicit theoretical connection between NPs and GPs. In particular, we show that, under certain conditions, NPs are mathematically equivalent to GPs with deep kernels. This result further elucidates the relationship between GPs and NPs and makes previously derived theoretical insights about GPs applicable to NPs. Furthermore, it suggests a novel approach to learning expressive GP covariance functions applicable across different prediction tasks by training a deep kernel GP on a set of datasets.
Authors Tim G. J. Rudner, Vincent Fortuin, Yee Whye Teh, Yarin Gal
Submitted Bayesian Deep Learning workshop @NeurIPS 2018
Abstract In this work, we investigate unsupervised representation learning on medical time series, which bears the promise of leveraging copious amounts of existing unlabeled data in order to eventually assist clinical decision making. By evaluating on the prediction of clinically relevant outcomes, we show that in a practical setting, unsupervised representation learning can offer clear performance benefits over end-to-end supervised architectures. We experiment with using sequence-to-sequence (Seq2Seq) models in two different ways, as an autoencoder and as a forecaster, and show that the best performance is achieved by a forecasting Seq2Seq model with an integrated attention mechanism, proposed here for the first time in the setting of unsupervised learning for medical time series.
Authors Xinrui Lyu, Matthias Hüser, Stephanie L. Hyland, George Zerveas, Gunnar Rätsch
Submitted Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 - Spotlight
Abstract High-throughput DNA sequencing data is accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. While there has been good progress towards representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this work, we present a new compression approach, Multi-BRWT, which is adaptive to different kinds of input data. We show an up to 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state-of-the-art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world datasets.
Authors Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh-No, Gunnar Rätsch, Andre Kahles
Submitted RECOMB 2019
Abstract Translation initiation is orchestrated by the cap binding and 43S pre-initiation complexes (PIC). Eukaryotic initiation factor 1A (EIF1A) is essential for recruitment of the ternary complex and for assembling the 43S PIC. Recurrent EIF1AX mutations in papillary thyroid cancers are mutually exclusive with other drivers, including RAS. EIF1AX is enriched in advanced thyroid cancers, where it displays a striking co-occurrence with RAS, which cooperates to induce tumorigenesis in mice and isogenic cell lines. The C-terminal EIF1AX-A113splice mutation is the most prevalent in advanced thyroid cancer. EIF1AX-A113spl variants stabilize the PIC and induce ATF4, a sensor of cellular stress, which is co-opted to suppress EIF2α phosphorylation, enabling a general increase in protein synthesis. RAS stabilizes c-MYC, an effect augmented by EIF1AX-A113spl. ATF4 and c-MYC induce expression of amino acid transporters and enhance sensitivity of mTOR to amino acid supply. These mutually reinforcing events generate therapeutic vulnerabilities to MEK, BRD4 and mTOR kinase inhibitors.
Authors Gnana P. Krishnamoorthy, Natalie R Davidson, Steven D Leach, Zhen Zhao, Scott W. Lowe, Gina Lee, Iñigo Landa, James Nagarajah, Mahesh Saqcena, Kamini Singh, Hans-Guido Wendel, Snjezana Dogan, Prasanna P. Tamarapu, John Blenis, Ronald Ghossein, Jeffrey A. Knauf, Gunnar Rätsch and James A. Fagin
Submitted Cancer Discovery
Abstract Our comprehensive analysis of alternative splicing across 32 The Cancer Genome Atlas cancer types from 8,705 patients detects alternative splicing events and tumor variants by reanalyzing RNA and whole-exome sequencing data. Tumors have up to 30% more alternative splicing events than normal samples. Association analysis of somatic variants with alternative splicing events confirmed known trans associations with variants in SF3B1 and U2AF1 and identified additional trans-acting variants (e.g., TADA1, PPP2R1A). Many tumors have thousands of alternative splicing events not detectable in normal samples; on average, we identified ≈930 exon-exon junctions (“neojunctions”) in tumors not typically found in GTEx normals. From Clinical Proteomic Tumor Analysis Consortium data available for breast and ovarian tumor samples, we confirmed ≈1.7 neojunction- and ≈0.6 single nucleotide variant-derived peptides per tumor sample that are also predicted major histocompatibility complex-I binders (“putative neoantigens”).
Authors Andre Kahles, Kjong-Van Lehmann, Nora C. Toussaint, Matthias Hüser, Stefan Stark, Timo Sachsenberg, Oliver Stegle, Oliver Kohlbacher, Chris Sander, Gunnar Rätsch, The Cancer Genome Atlas Research Network
Submitted Cancer Cell
Abstract The Global Alliance for Genomics and Health (GA4GH) proposes a data access policy model—“registered access”—to increase and improve access to data requiring an agreement to basic terms and conditions, such as the use of DNA sequence and health data in research. A registered access policy would enable a range of categories of users to gain access, starting with researchers and clinical care professionals. It would also facilitate general use and reuse of data but within the bounds of consent restrictions and other ethical obligations. In piloting registered access with the Scientific Demonstration data sharing projects of GA4GH, we provide additional ethics, policy and technical guidance to facilitate the implementation of this access model in an international setting.
Authors Stephanie O. M. Dyke, Mikael Linden, […], Gunnar Rätsch, […], Paul Flicek
Submitted European Journal of Human Genetics
Abstract The deterioration of organ function in ICU patients requires swift response to prevent further damage to vital systems. Focusing on the circulatory system, we build a model to predict if a patient’s state will deteriorate in the near future. We identify circulatory system dysfunction using the combination of excess lactic acid in the blood and low mean arterial blood pressure or the presence of vasoactive drugs. Using an observational cohort of 45,000 patients from a Swiss ICU, we extract and process patient time series and identify periods of circulatory system dysfunction to develop an early warning system. We train a gradient boosting model to perform binary classification every five minutes on whether the patient will deteriorate during an increasingly large window into the future, up to the duration of a shift (8 hours). The model achieves an AUROC between 0.952 and 0.919 across the prediction windows, and an AUPRC between 0.223 and 0.384 for events with positive prevalence between 0.014 and 0.042. We also show preliminary results from a recurrent neural network. These results show that contemporary machine learning approaches combined with careful preprocessing of raw data collected during routine care yield clinically useful predictions in near real time. [Workshop Abstract]
Authors Stephanie Hyland, Matthias Hüser, Xinrui Lyu, Martin Faltys, Tobias Merz, Gunnar Rätsch
Submitted Proceedings of the First Joint Workshop on AI in Health
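The prediction task described above can be made concrete with a small, hypothetical sketch of the label construction (not the published pipeline): each 5-minute timestep is labeled positive if a circulatory-dysfunction event occurs anywhere within a chosen future window. The function name and interface are illustrative assumptions.

```python
# Hypothetical sketch: turn a per-timestep event indicator (one sample
# every 5 minutes) into binary labels "does the patient deteriorate
# within the next `horizon` steps?", as in the early-warning task above.
def future_event_labels(events, horizon):
    """events: list of 0/1 dysfunction flags, one per timestep.
    horizon: number of future steps to look ahead.
    Returns one label per timestep: 1 if any event occurs in
    the half-open window (t, t + horizon], else 0."""
    labels = []
    for t in range(len(events)):
        window = events[t + 1 : t + 1 + horizon]
        labels.append(1 if any(window) else 0)
    return labels
```

At 5-minute resolution, the longest window in the abstract (an 8-hour shift) would correspond to `horizon = 96` steps.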
Abstract Motivation: Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data is a condensed form as an assembly graph. Such a representation contains all sequence information but does not store contextual information and metadata. Results: We present two new approaches for a compressed representation of a graph coloring: a lossless compression scheme based on a novel application of wavelet tries as well as a highly accurate lossy compression based on a set of Bloom filters. Both strategies retain a coloring even when adding to the underlying graph topology. We present construction and merge procedures for both methods and evaluate their performance on a wide range of different datasets. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph, we can reduce memory requirements by up to three orders of magnitude. Representing individual colors as independently stored modules, our approaches can be efficiently parallelized and provide strategies for dynamic use. These properties allow for an easy upscaling to the problem sizes common to the biomedical domain. Availability: We provide prototype implementations in C++, summaries of our experiments as well as links to all datasets publicly at https://github.com/ratschlab/graph_annotation.
Authors Harun Mustafa, Ingo Schilken, Mikhail Karasikov, Carsten Eickhoff, Gunnar Rätsch, Andre Kahles
Submitted Bioinformatics
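The Bloom-filter-based lossy coloring described above can be illustrated with a minimal sketch (an assumption-laden toy, not the paper's implementation): one small Bloom filter per color records which k-mers carry that color, trading exactness for memory since queries may return false positives.

```python
import hashlib

# Illustrative toy, not the paper's code: one Bloom filter per color.
# Membership queries may yield false positives but never false negatives.
class ColorBloom:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, kmer):
        # Derive `num_hashes` bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{kmer}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, kmer):
        for p in self._positions(kmer):
            self.bits[p] = True

    def __contains__(self, kmer):
        return all(self.bits[p] for p in self._positions(kmer))

# One filter per color; an edge's color set is recovered by querying all filters.
colors = {"sample_A": ColorBloom(), "sample_B": ColorBloom()}
colors["sample_A"].add("ACGT")
```

Because insertion and query touch a fixed number of bit positions, both are constant-time per color, which is what makes the per-color-module parallelization mentioned in the abstract straightforward.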
Abstract Two popular examples of first-order optimization methods over linear spaces are coordinate descent and matching pursuit algorithms, with their randomized variants. While the former targets the optimization by moving along coordinates, the latter considers a generalized notion of directions. Exploiting the connection between the two algorithms, we present a unified analysis of both, providing affine invariant sublinear $O(1/t)$ rates on smooth objectives and linear convergence on strongly convex objectives. As a byproduct of our affine invariant analysis of matching pursuit, our rates for steepest coordinate descent are the tightest known. Furthermore, we show the first accelerated convergence rate $O(1/t^2)$ for matching pursuit and steepest coordinate descent on convex objectives.
Authors Francesco Locatello, Anant Raj, Sai Praneeth Reddy, Gunnar Rätsch, Bernhard Schölkopf, Sebastian U Stich, Martin Jaggi
Submitted ICML 2018
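The generalized-direction view described above can be grounded with a minimal, plain (not affine-invariant) matching pursuit sketch for least squares, where each step greedily follows the atom most correlated with the residual; the function and its interface are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

# Minimal matching pursuit sketch for f(x) = 0.5 * ||y - A x||^2:
# repeatedly pick the atom (column of A) most correlated with the
# residual and take an exact line-search step along it.
def matching_pursuit(A, y, steps=50):
    """A: (d, n) matrix whose columns are the atoms; y: target vector."""
    x = np.zeros(A.shape[1])
    residual = y.astype(float).copy()
    for _ in range(steps):
        scores = A.T @ residual                 # correlation with each atom
        i = int(np.argmax(np.abs(scores)))      # steepest direction
        step = scores[i] / (A[:, i] @ A[:, i])  # exact minimizer along atom i
        x[i] += step
        residual -= step * A[:, i]
    return x, residual
```

With the standard basis as the atom set, each step updates a single coordinate, which is exactly the coordinate-descent special case the unified analysis exploits.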
Abstract We propose a conditional gradient framework for a composite convex minimization template with broad applications. Our approach combines the notions of smoothing and homotopy under the CGM framework, and provably achieves the optimal $\mathcal{O}(1/\sqrt{k})$ convergence rate. We demonstrate that the same rate holds if the linear subproblems are solved approximately with additive or multiplicative error. Specific applications of the framework include the non-smooth minimization, semidefinite programming, and minimization with linear inclusion constraints over a compact domain. We provide numerical evidence to demonstrate the benefits of the new framework.
Authors Alp Yurtsever, Olivier Fercoq, Francesco Locatello, Volkan Cevher
Submitted ICML 2018
Abstract Approximating a probability density in a tractable manner is a central task in Bayesian statistics. Variational Inference (VI) is a popular technique that achieves tractability by choosing a relatively simple variational family. Borrowing ideas from the classic boosting framework, recent approaches attempt to \emph{boost} VI by replacing the selection of a single density with a greedily constructed mixture of densities. In order to guarantee convergence, previous works impose stringent assumptions that require significant effort for practitioners. Specifically, they require a custom implementation of the greedy step (called the LMO) for every probabilistic model with respect to an unnatural variational family of truncated distributions. Our work fixes these issues with novel theoretical and algorithmic insights. On the theoretical side, we show that boosting VI satisfies a relaxed smoothness assumption which is sufficient for the convergence of the functional Frank-Wolfe (FW) algorithm. Furthermore, we rephrase the LMO problem and propose to maximize the Residual ELBO (RELBO) which replaces the standard ELBO optimization in VI. These theoretical enhancements allow for black box implementation of the boosting subroutine. Finally, we present a stopping criterion drawn from the duality gap in the classic FW analyses and exhaustive experiments to illustrate the usefulness of our theoretical and algorithmic contributions.
Authors Francesco Locatello, Gideon Dresdner, Rajiv Khanna, Isabel Valera, Gunnar Rätsch
Submitted NeurIPS 2018 (spotlight)
Abstract Clustering is a cornerstone of unsupervised learning which can be thought as disentangling the multiple generative mechanisms underlying the data. In this paper we introduce an algorithmic framework to train mixtures of implicit generative models which we instantiate for variational autoencoders. Relying on an additional set of discriminators, we propose a competitive procedure in which the models only need to approximate the portion of the data distribution from which they can produce realistic samples. As a byproduct, each model is simpler to train, and a clustering interpretation arises naturally from the partitioning of the training points among the models. We empirically show that our approach splits the training distribution in a reasonable way and increases the quality of the generated samples.
Authors Francesco Locatello, Damien Vincent, Ilya Tolstikhin, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf
Submitted Arxiv
Abstract Variational Inference is a popular technique to approximate a possibly intractable Bayesian posterior with a more tractable one. Recently, Boosting Variational Inference has been proposed as a new paradigm to approximate the posterior by a mixture of densities by greedily adding components to the mixture. In the present work, we study the convergence properties of this approach from a modern optimization viewpoint by establishing connections to the classic Frank-Wolfe algorithm. Our analysis yields novel theoretical insights into Boosting Variational Inference regarding the sufficient conditions for convergence, explicit sublinear/linear rates, and algorithmic simplifications.
Authors Francesco Locatello, Rajiv Khanna, Joydeep Ghosh, Gunnar Rätsch
Submitted AISTATS 2018
Abstract We present the most comprehensive catalogue of cancer-associated gene alterations through characterization of tumor transcriptomes from 1,188 donors of the Pan-Cancer Analysis of Whole Genomes project. Using matched whole-genome sequencing data, we attributed RNA alterations to germline and somatic DNA alterations, revealing likely genetic mechanisms. We identified 444 associations of gene expression with somatic non-coding single-nucleotide variants. We found 1,872 splicing alterations associated with somatic mutation in intronic regions, including novel exonization events associated with Alu elements. Somatic copy number alterations were the major driver of total gene and allele-specific expression (ASE) variation. Additionally, 82% of gene fusions had structural variant support, including 75 of a novel class called "bridged" fusions, in which a third genomic location bridged two different genes. Globally, we observe transcriptomic alteration signatures that differ between cancer types and have associations with DNA mutational signatures. Given this unique dataset of RNA alterations, we also identified 1,012 genes significantly altered through both DNA and RNA mechanisms. Our study represents an extensive catalog of RNA alterations and reveals new insights into the heterogeneous molecular mechanisms of cancer gene alterations.
Authors Claudia Calabrese, Natalie R Davidson, Nuno A Fonseca, Yao He, André Kahles, Kjong-Van Lehmann, Fenglin Liu, Yuichi Shiraishi, Cameron M Soulette, Lara Urban, Deniz Demircioğlu, Liliana Greger, Siliang Li, Dongbing Liu, Marc D Perry, Linda Xiang, Fan Zhang, Junjun Zhang, Peter Bailey, Serap Erkek, Katherine A Hoadley, Yong Hou, Helena Kilpinen, Jan O Korbel, Maximillian G Marin, Julia Markowski, Tannistha Nandi, Qiang Pan-Hammarström, Chandra S Pedamallu, Reiner Siebert, Stefan G Stark, Hong Su, Patrick Tan, Sebastian M Waszak, Christina Yung, Shida Zhu, Philip Awadalla, Chad J Creighton, Matthew Meyerson, B Francis F Ouellette, Kui Wu, Huanming Yang, Alvis Brazma, Angela N Brooks, Jonathan Göke, Gunnar Rätsch, Roland F Schwarz, Oliver Stegle, Zemin Zhang
Submitted bioRxiv
2017
Abstract Technological advancements in high throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains inaccessible to the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data on a more abstract level is its transformation into an assembly graph. Although the sequence information is now accessible, any contextual annotation and metadata is lost. We present a new approach for a compressed representation of a graph coloring based on a set of Bloom filters. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph to decide on false positives, we can reduce the memory requirements for a given set of colors per edge by three orders of magnitude. As insertion and query on a Bloom filter are constant time operations, the complexity to compress and decompress an edge color is linear in the number of color bits. Representing individual colors as independent filters, our approach is fully dynamic and can be easily parallelized. These properties allow for an easy upscaling to the problem sizes common in the biomedical domain. A prototype implementation of our method is available in Java.
Authors Ingo Schilken, Harun Mustafa, Gunnar Rätsch, Carsten Eickhoff, Andre Kahles
Submitted bioRxiv
Abstract Much of the DNA and RNA sequencing data available is in the form of high-throughput sequencing (HTS) reads and is currently unindexed by established sequence search databases. Recent succinct data structures for indexing both reference sequences and HTS data, along with associated metadata, have been based on either hashing or graph models, but many of these structures are static in nature, and thus, not well-suited as backends for dynamic databases. We propose a parallel construction method for, and a novel application of, the wavelet trie as a dynamic data structure for compressing and indexing graph metadata. By developing an algorithm for merging wavelet tries, we are able to construct large tries in parallel by merging smaller tries constructed concurrently from batches of data. When compared against general compression algorithms and those developed specifically for graph colors (VARI and Rainbowfish), our method achieves compression ratios superior to gzip and VARI, converging to compression ratios of 6.5% to 2% on data sets constructed from over 600 virus genomes. While marginally worse than compression by bzip2 or Rainbowfish, this structure allows for both fast extension and query. We also found that additionally encoding graph topology metadata improved compression ratios, particularly on data sets consisting of several mutually-exclusive reference genomes. It was also observed that the compression ratio of wavelet tries grew sublinearly with the density of the annotation matrices. This work is a significant step towards implementing a dynamic data structure for indexing large annotated sequence data sets that supports fast query and update operations. At the time of writing, no established standard tool has filled this niche.
Authors Harun Mustafa, Andre Kahles, Mikhail Karasikov, Gunnar Raetsch
Submitted bioRxiv
Abstract Cancer is characterised by somatic genetic variation, but the effect of the majority of non-coding somatic variants and the interface with the germline genome are still unknown. We analysed the whole genome and RNA-seq data from 1,188 human cancer patients as provided by the Pan-cancer Analysis of Whole Genomes (PCAWG) project to map cis expression quantitative trait loci of somatic and germline variation and to uncover the causes of allele-specific expression patterns in human cancers. The availability of the first large-scale dataset with both whole genome and gene expression data enabled us to uncover the effects of the non-coding variation on cancer. In addition to confirming known regulatory effects, we identified novel associations between somatic variation and expression dysregulation, in particular in distal regulatory elements. Finally, we uncovered links between somatic mutational signatures and gene expression changes, including TERT and LMO2, and we explained the inherited risk factors in APOBEC-related mutational processes. This work represents the first large-scale assessment of the effects of both germline and somatic genetic variation on gene expression in cancer and creates a valuable resource cataloguing these effects.
Authors Claudia Calabrese, Kjong-Van Lehmann, Lara Urban, Fenglin Liu, Serap Erkek, Nuno Fonseca, Andre Kahles, Leena Helena Kilpinen-Barrett, Julia Markowski, PCAWG-3, Sebastian Waszak, Jan Korbel, Zemin Zhang, Alvis Brazma, Gunnar Raetsch, Roland Schwarz, Oliver Stegle
Submitted bioRxiv
Abstract Most human protein-coding genes are regulated by multiple, distinct promoters, suggesting that the choice of promoter is as important as its level of transcriptional activity. While the role of promoters as driver elements in cancer has been recognized, the contribution of alternative promoters to regulation of the cancer transcriptome remains largely unexplored. Here we show that active promoters can be identified using RNA-Seq data, enabling the analysis of promoter activity in more than 1,000 cancer samples with matched whole genome sequencing data. We find that alternative promoters are a major contributor to tissue-specific regulation of isoform expression and that alternative promoters are frequently deregulated in cancer, affecting known cancer-genes and novel candidates. Noncoding passenger mutations are enriched at promoters of genes with lower regulatory complexity, whereas noncoding driver mutations occur at genes with multiple promoters, often affecting the promoter that shows the highest level of activity. Together our study demonstrates that the landscape of active promoters shapes the cancer transcriptome, opening many opportunities to further explore the interplay of regulatory mechanism and noncoding somatic mutations with transcriptional aberrations in cancer.
Authors Deniz Demircioğlu, Martin Kindermans, Tannistha Nandi, Engin Cukuroglu, Claudia Calabrese, Nuno A. Fonseca, Andre Kahles, Kjong Lehmann, Oliver Stegle, PCAWG-3, PCAWG-Network, Alvis Brazma, Angela Brooks, Gunnar Rätsch, Patrick Tan, Jonathan Göke
Submitted bioRxiv
Abstract During rheumatoid arthritis (RA), Tumor Necrosis Factor (TNF) activates fibroblast-like synoviocytes (FLS) inducing in a temporal order a constellation of genes, which perpetuate synovial inflammation. Although the molecular mechanisms regulating TNF-induced transcription are well characterized, little is known about the impact of mRNA stability on gene expression and the impact of TNF on decay rates of mRNA transcripts in FLS. To address these issues we performed RNA sequencing and genome-wide analysis of the mRNA stabilome in RA FLS. We found that TNF induces a biphasic gene expression program: initially, the inducible transcriptome consists primarily of unstable transcripts but progressively switches and becomes dominated by very stable transcripts. This temporal switch is due to: a) TNF-induced prolonged stabilization of previously unstable transcripts that enables progressive transcript accumulation over days and b) sustained expression and late induction of very stable transcripts. TNF-induced mRNA stabilization in RA FLS occurs during the late phase of TNF response, is MAPK-dependent, and involves several genes with pathogenic potential such as IL6, CXCL1, CXCL3, CXCL8/IL8, CCL2, and PTGS2. These results provide the first insights into genome-wide regulation of mRNA stability in RA FLS and highlight the potential contribution of dynamic regulation of the mRNA stabilome by TNF to chronic synovitis.
Authors Loupasakis K, Kuo D, Sokhi UK, Sohn C, Syracuse B, Giannopoulou EG, Park SH, Kang H, Rätsch G, Ivashkiv LB, Kalliolias GD
Submitted PLoS One
Abstract Generative Adversarial Networks (GANs) have shown remarkable success as a framework for training models to produce realistic-looking data. In this work, we propose a Recurrent GAN (RGAN) and Recurrent Conditional GAN (RCGAN) to produce realistic real-valued multi-dimensional time series, with an emphasis on their application to medical data. RGANs make use of recurrent neural networks in the generator and the discriminator. In the case of RCGANs, both of these RNNs are conditioned on auxiliary information. We demonstrate our models in a set of toy datasets, where we show visually and quantitatively (using sample likelihood and maximum mean discrepancy) that they can successfully generate realistic time-series. We also describe novel evaluation methods for GANs, where we generate a synthetic labelled training dataset, and evaluate on a real test set the performance of a model trained on the synthetic data, and vice-versa. We illustrate with these metrics that RCGANs can generate time-series data useful for supervised training, with only minor degradation in performance on real test data. This is demonstrated on digit classification from 'serialised' MNIST and by training an early warning system on a medical dataset of 17,000 patients from an intensive care unit. We further discuss and analyse the privacy concerns that may arise when using RCGANs to generate realistic synthetic medical time series data.
Authors Stephanie L Hyland, Cristobal Esteban, Gunnar Rätsch
Submitted arXiv
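One of the sample-quality metrics named above, maximum mean discrepancy (MMD), is compact enough to sketch directly; the following is a generic biased MMD² estimator with an RBF kernel, not the paper's evaluation code, and the bandwidth choice is an assumption.

```python
import numpy as np

# Biased squared MMD with an RBF kernel: compares two sample sets
# (rows = flattened time series). Zero iff the empirical kernel mean
# embeddings coincide; larger values indicate more dissimilar samples.
def mmd2_rbf(X, Y, sigma=1.0):
    def kernel(P, Q):
        # Pairwise squared Euclidean distances via broadcasting.
        d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2.0 * kernel(X, Y).mean()
```

In a GAN evaluation loop one would compare generated samples against a held-out real batch; a well-trained generator drives this statistic toward zero.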
Abstract Greedy optimization methods such as Matching Pursuit (MP) and Frank-Wolfe (FW) algorithms regained popularity in recent years due to their simplicity, effectiveness and theoretical guarantees. MP and FW address optimization over the linear span and the convex hull of a set of atoms, respectively. In this paper, we consider the intermediate case of optimization over the convex cone, parametrized as the conic hull of a generic atom set, leading to the first principled definitions of non-negative MP algorithms for which we give explicit convergence rates and demonstrate excellent empirical performance. In particular, we derive sublinear (O(1/t)) convergence on general smooth and convex objectives, and linear convergence (O(e^(-t))) on strongly convex objectives, in both cases for general sets of atoms. Furthermore, we establish a clear correspondence of our algorithms to known algorithms from the MP and FW literature. Our novel algorithms and analyses target general atom sets and general objective functions, and hence are directly applicable to a large variety of learning settings.
Authors Francesco Locatello, Michael Tschannen, Gunnar Rätsch, Martin Jaggi
Submitted NIPS 2017
Abstract Two of the most fundamental prototypes of greedy optimization are the matching pursuit and Frank-Wolfe algorithms. In this paper, we take a unified view on both classes of methods, leading to the first explicit convergence rates of matching pursuit methods in an optimization sense, for general sets of atoms. We derive sublinear (O(1/t)) convergence for both classes on general smooth objectives, and linear convergence on strongly convex objectives, as well as a clear correspondence of algorithm variants. Our presented algorithms and rates are affine invariant, and do not need any incoherence or sparsity assumptions.
Authors Francesco Locatello, Rajiv Khanna, Michael Tschannen, Martin Jaggi
Submitted AISTATS 2017
Abstract To understand the population genetics of structural variants and their effects on phenotypes, we developed an approach to mapping structural variants that segregate in a population sequenced at low coverage. We avoid calling structural variants directly. Instead, the evidence for a potential structural variant at a locus is indicated by variation in the counts of short-reads that map anomalously to that locus. These structural variant traits are treated as quantitative traits and mapped genetically, analogously to a gene expression study. Association between a structural variant trait at one locus, and genotypes at a distant locus indicate the origin and target of a transposition. Using ultra-low-coverage (0.3×) population sequence data from 488 recombinant inbred Arabidopsis thaliana genomes, we identified 6502 segregating structural variants. Remarkably, 25% of these were transpositions. While many structural variants cannot be delineated precisely, we validated 83% of 44 predicted transposition breakpoints by polymerase chain reaction. We show that specific structural variants may be causative for quantitative trait loci for germination and resistance to infection by the fungus Albugo laibachii, isolate Nc14. Further we show that the phenotypic heritability attributable to read-mapping anomalies differs from, and, in the case of time to germination and bolting, exceeds that due to standard genetic variation. Genes within structural variants are also more likely to be silenced or dysregulated. This approach complements the prevalent strategy of structural variant discovery in fewer individuals sequenced at high coverage. It is generally applicable to large populations sequenced at low-coverage, and is particularly suited to mapping transpositions.
Authors Martha Imprialou, André Kahles, Joshua G. Steffen, Edward J. Osborne, Xiangchao Gan, Janne Lempe, Amarjit Bhomra, Eric Belfield, Anne Visscher, Robert Greenhalgh, Nicholas P Harberd, Richard Goram, Jotun Hein, Alexandre Robert-Seilaniantz, Jonathan Jones, Oliver Stegle, Paula Kover, Miltos Tsiantis, Magnus Nordborg, Gunnar Rätsch, Richard M. Clark and Richard Mott
Submitted Genetics
Authors Natalie R. Davidson; PanCancer Analysis of Whole Genomes 3 (PCAWG-3) for ICGC, Alvis Brazma, Angela N. Brooks, Claudia Calabrese, Nuno A. Fonseca, Jonathan Goke, Yao He, Xueda Hu, Andre Kahles, Kjong-Van Lehmann, Fenglin Liu, Gunnar Rätsch, Siliang Li, Roland F. Schwarz, Mingyu Yang, Zemin Zhang, Fan Zhang and Liangtao Zheng
Submitted Proceedings of the American Association for Cancer Research Annual Meeting 2017
Abstract We present SplashRNA, a sequential classifier to predict potent microRNA-based short hairpin RNAs (shRNAs). Trained on published and novel data sets, SplashRNA outperforms previous algorithms and reliably predicts the most efficient shRNAs for a given gene. Combined with an optimized miR-E backbone, >90% of high-scoring SplashRNA predictions trigger >85% protein knockdown when expressed from a single genomic integration. SplashRNA can significantly improve the accuracy of loss-of-function genetics studies and facilitates the generation of compact shRNA libraries.
Authors Pelossof R, Fairchild L, Huang CH, Widmer C, Sreedharan VT, Sinha N, Lai DY, Guan Y, Premsrirut PK, Tschaharganeh DF, Hoffmann T, Thapar V, Xiang Q, Garippa RJ, Rätsch G, Zuber J, Lowe SW, Leslie CS, Fellmann C
Submitted Nature Biotechnology
Abstract MOTIVATION: Deep sequencing based ribosome footprint profiling can provide novel insights into the regulatory mechanisms of protein translation. However, the observed ribosome profile is fundamentally confounded by transcriptional activity. In order to decipher principles of translation regulation, tools that can reliably detect changes in translation efficiency in case-control studies are needed. RESULTS: We present a statistical framework and an analysis tool, RiboDiff, to detect genes with changes in translation efficiency across experimental treatments. RiboDiff uses generalized linear models to estimate the over-dispersion of RNA-Seq and ribosome profiling measurements separately, and performs a statistical test for differential translation efficiency using both mRNA abundance and ribosome occupancy. AVAILABILITY AND IMPLEMENTATION: RiboDiff webpage: http://bioweb.me/ribodiff. Source code including scripts for preprocessing the FASTQ data are available at http://github.com/ratschlab/ribodiff. CONTACTS: zhongy@cbio.mskcc.org or raetsch@inf.ethz.ch. Supplementary information: Supplementary data are available at Bioinformatics online.
Authors Zhong Y, Karaletsos T, Drewe P, Sreedharan VT, Kuo D, Singh K, Wendel HG, Rätsch G.
Submitted Bioinformatics
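The quantity RiboDiff tests can be illustrated with a toy calculation (an illustration only, not the tool's GLM-based statistical model): translation efficiency is the ratio of ribosome-footprint counts to mRNA counts, and its change between conditions is naturally expressed on a log2 scale.

```python
import math

# Toy illustration of translation efficiency (TE), the quantity whose
# differential change RiboDiff tests: TE = ribosome footprints / mRNA
# abundance, compared between control and treatment as a log2 ratio.
def log2_te_change(ribo_ctrl, rna_ctrl, ribo_trt, rna_trt):
    te_ctrl = ribo_ctrl / rna_ctrl
    te_trt = ribo_trt / rna_trt
    return math.log2(te_trt / te_ctrl)
```

Normalizing by mRNA abundance is what removes the transcriptional confounding mentioned in the abstract: a gene whose footprint counts quadruple at constant mRNA level shows a TE shift, while one whose footprints and mRNA rise together does not.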
2016
Abstract Plants use light as source of energy and information to detect diurnal rhythms and seasonal changes. Sensing changing light conditions is critical to adjust plant metabolism and to initiate developmental transitions. Here, we analyzed transcriptome-wide alterations in gene expression and alternative splicing (AS) of etiolated seedlings undergoing photomorphogenesis upon exposure to blue, red, or white light. Our analysis revealed massive transcriptome reprogramming as reflected by differential expression of ∼20% of all genes and changes in several hundred AS events. For more than 60% of all regulated AS events, light promoted the production of a presumably protein-coding variant at the expense of an mRNA with nonsense-mediated decay-triggering features. Accordingly, AS of the putative splicing factor REDUCED RED-LIGHT RESPONSES IN CRY1CRY2 BACKGROUND1, previously identified as a red light signaling component, was shifted to the functional variant under light. Downstream analyses of candidate AS events pointed at a role of photoreceptor signaling only in monochromatic but not in white light. Furthermore, we demonstrated similar AS changes upon light exposure and exogenous sugar supply, with a critical involvement of kinase signaling. We propose that AS is an integration point of signaling pathways that sense and transmit information regarding the energy availability in plants.
Authors Hartmann L, Drewe-Boß P, Wießner T, Wagner G, Geue S, Lee HC, Obermüller DM, Kahles A, Behr J, Sinz FH, Rätsch G, Wachter A
Submitted Plant Cell
Abstract Mapping high-throughput sequencing data to a reference genome is an essential step for most analysis pipelines aiming at the computational analysis of genome and transcriptome sequencing data. Breaking ties between equally well mapping locations poses a severe problem not only during the alignment phase but also has significant impact on the results of downstream analyses. We present the multi-mapper resolution (MMR) tool that infers optimal mapping locations from the coverage density of other mapped reads.
Authors Andre Kahles, Jonas Behr, Gunnar Rätsch
Submitted Bioinformatics (Oxford, England)
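The coverage-guided idea behind MMR can be sketched in a few lines (a hypothetical simplification, not the tool's iterative algorithm): a read with several equally good mapping locations is assigned to the locus with the highest coverage from reads already placed, e.g. uniquely mapped ones.

```python
# Hypothetical sketch of coverage-based tie-breaking for multi-mapped
# reads: among equally well-scoring candidate loci, pick the one with
# the highest existing coverage.
def resolve_multimapper(candidate_loci, coverage):
    """candidate_loci: positions a read maps to equally well.
    coverage: dict mapping position -> count of already-placed reads.
    Returns the chosen position (ties broken by iteration order)."""
    return max(candidate_loci, key=lambda pos: coverage.get(pos, 0))
```

The actual tool refines such assignments iteratively, since each placement changes the coverage density that informs subsequent decisions.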
2015
Authors Stephanie L Hyland, Theofanis Karaletsos, Gunnar Rätsch
Submitted NIPS Workshop on Machine Learning for Healthcare, 2015
Authors M Tauber, T Darrell, Marius Kloft, M Pontil, Gunnar Rätsch, E Rodner, C Lengauer, M Bolten, R D Falgout, O Schenk
Authors Yi Zhong, Philipp Drewe, Andrew L Wolfe, Kamini Singh, Hans Guido Wendel, Gunnar Rätsch
Authors Marina M C Vidovic, Nico Görnitz, Klaus Robert Müller, Gunnar Rätsch, Marius Kloft
Abstract We report a mechanism of translational control that is determined by a requirement for eIF4A RNA helicase activity and underlies the anticancer effects of Silvestrol and related compounds. Briefly, activation of cap-dependent translation contributes to T-cell leukemia (T-ALL) development and maintenance. Accordingly, inhibition of translation initiation factor eIF4A with Silvestrol produces powerful therapeutic effects against T-ALL in vivo. We used transcriptome-scale ribosome footprinting on Silvestrol-treated T-ALL cells to identify Silvestrol-sensitive transcripts and the hallmark features of eIF4A-dependent translation. These include a long 5′ UTR and a 12-mer sequence motif that encodes a guanine quartet (CGG)4. RNA folding algorithms as well as experimental evidence pinpoint the (CGG)4 motif as a common site of RNA G-quadruplex structures within the 5′ UTR. In T-ALL these structures mark approximately eighty highly Silvestrol-sensitive transcripts that include key oncogenes and transcription factors and contribute to the drug's anti-leukemic action. Hence, the eIF4A-dependent translation of G-quadruplex containing transcripts emerges as a gene-specific and therapeutically targetable mechanism of translational control.
Authors Kamini Singh, Andrew L Wolfe, Yi Zhong, Gunnar Rätsch, Hans Guido Wendel
Authors Kjong Van Lehmann, Andre Kahles, Cyriac Kandoth, William Lee, Nikolaus Schultz, Oliver Stegle, Gunnar Rätsch
Authors S Brunak, F M de la Vega, A A Margolin, Gunnar Rätsch, J M Stuart
Authors JE Vogt
Submitted IEEE/ACM Transactions on Computational Biology and Bioinformatics
Abstract Identifying discriminative motifs underlying the functionality and evolution of organisms is a major challenge in computational biology. Machine learning approaches such as support vector machines (SVMs) achieve state-of-the-art performance in genomic discrimination tasks, but, due to their black-box character, the motifs underlying their decision functions are largely unknown. As a remedy, positional oligomer importance matrices (POIMs) allow us to visualize the significance of position-specific subsequences. Although being a major step towards the explanation of trained SVM models, they suffer from the fact that their size grows exponentially in the length of the motif, which renders their manual inspection feasible only for comparatively small motif sizes, typically k ≤ 5. In this work, we extend the work on positional oligomer importance matrices by presenting a new machine-learning methodology, entitled motifPOIM, to extract the truly relevant motifs, regardless of their length and complexity, underlying the predictions of a trained SVM model. Our framework thereby considers the motifs as free parameters in a probabilistic model, a task which can be phrased as a non-convex optimization problem. The exponential dependence of the POIM size on the oligomer length poses a major numerical challenge, which we address by an efficient optimization framework that allows us to find possibly overlapping motifs consisting of up to hundreds of nucleotides. We demonstrate the efficacy of our approach on a synthetic data set as well as a real-world human splice site data set.
Authors Marina M C Vidovic, Nico Görnitz, Klaus Robert Müller, Gunnar Rätsch, Marius Kloft
Submitted PloS one
Abstract Interferon-γ (IFN-γ) primes macrophages for enhanced microbial killing and inflammatory activation by Toll-like receptors (TLRs), but little is known about the regulation of cell metabolism or mRNA translation during this priming. We found that IFN-γ regulated the metabolism and mRNA translation of human macrophages by targeting the kinases mTORC1 and MNK, both of which converge on the selective regulator of translation initiation eIF4E. Physiological downregulation of mTORC1 by IFN-γ was associated with autophagy and translational suppression of repressors of inflammation such as HES1. Genome-wide ribosome profiling in TLR2-stimulated macrophages showed that IFN-γ selectively modulated the macrophage translatome to promote inflammation, further reprogram metabolic pathways and modulate protein synthesis. These results show that IFN-γ-mediated metabolic reprogramming and translational regulation are key components of classical inflammatory macrophage activation.
Authors Xiaodi Su, Yingpu Yu, Yi Zhong, Eugenia G Giannopoulou, Xiaoyu Hu, Hui Liu, Justin R Cross, Gunnar Rätsch, Charles M Rice, Lionel B Ivashkiv
Submitted Nature immunology
Abstract Epigenome modulation potentially provides a mechanism for organisms to adapt, within and between generations. However, neither the extent to which this occurs, nor the mechanisms involved are known. Here we investigate DNA methylation variation in Swedish Arabidopsis thaliana accessions grown at two different temperatures. Environmental effects were limited to transposons, where CHH methylation was found to increase with temperature. Genome-wide association studies (GWAS) revealed that the extensive CHH methylation variation was strongly associated with genetic variants in both cis and trans, including a major trans-association close to the DNA methyltransferase CMT2. Unlike CHH methylation, CpG gene body methylation (GBM) was not affected by growth temperature, but was instead correlated with the latitude of origin. Accessions from colder regions had higher levels of GBM for a significant fraction of the genome, and this was associated with increased transcription for the genes affected. GWAS revealed that this effect was largely due to trans-acting loci, many of which showed evidence of local adaptation.
Authors Manu J Dubin, Pei Zhang, Dazhe Meng, Marie Stanislas Remigereau, Edward J Osborne, Francesco Paolo Casale, Philipp Drewe, Andre Kahles, Geraldine Jean, Bjarni Vilhjalmsson, Joanna Jagoda, Selen Irez, Viktor Voronin, Qiang Song, Quan Long, Gunnar Rätsch, Oliver Stegle, Richard M Clark, Magnus Nordborg
Submitted eLife
Abstract We present a genome-wide analysis of splicing patterns of 282 kidney renal clear cell carcinoma patients in which we integrate data from whole-exome sequencing of tumor and normal samples, RNA-seq and copy number variation. We proposed a scoring mechanism to compare splicing patterns in tumor samples to normal samples in order to rank and detect tumor-specific isoforms that have a potential for new biomarkers. We identified a subset of genes that show introns only observable in tumor but not in normal samples, ENCODE and GEUVADIS samples. In order to improve our understanding of the underlying genetic mechanisms of splicing variation we performed a large-scale association analysis to find links between somatic or germline variants with alternative splicing events. We identified 915 cis- and trans-splicing quantitative trait loci (sQTL) associated with changes in splicing patterns. Some of these sQTL have previously been associated with being susceptibility loci for cancer and other diseases. Our analysis also allowed us to identify the function of several COSMIC variants showing significant association with changes in alternative splicing. This demonstrates the potential significance of variants affecting alternative splicing events and yields insights into the mechanisms related to an array of disease phenotypes.
Authors Kjong-Van Lehmann, Andre Kahles, Cyriac Kandoth, William Lee, Nikolaus Schultz, Oliver Stegle, Gunnar Rätsch
Submitted Biocomputing
2014
Authors Xinghua Lou, Marius Kloft, Gunnar Rätsch, F A Hamprecht
Authors Jonas Behr, Gabriele Schweikert, Gunnar Rätsch
Authors S Brunak, F M de la Vega, Gunnar Rätsch, J M Stuart
Authors AK Porbadnigk, Nico Görnitz, Alexander Binder, Marius Kloft, C Sannelli, Mikio L Braun, Klaus Robert Müller
Authors Nico Görnitz, AK Porbadnigk, Alexander Binder, C Sannelli, Mikio L Braun, Klaus Robert Müller, Marius Kloft
Abstract Analysis of microscopy images can provide insight into many biological processes. One particularly challenging problem is cellular nuclear segmentation in highly anisotropic and noisy 3D image data. Manually localizing and segmenting each and every cellular nucleus is very time-consuming, which remains a bottleneck in large-scale biological experiments. In this work, we present a tool for automated segmentation of cellular nuclei from 3D fluorescent microscopic data. Our tool is based on state-of-the-art image processing and machine learning techniques and provides a user-friendly graphical user interface. We show that our tool is as accurate as manual annotation and greatly reduces the time required for this task.
Authors Christian K Widmer, Stephanie Heinrich, Philipp Drewe, Xinghua Lou, Shefali Umrania, Gunnar Rätsch
Submitted Signal, image and video processing
Abstract Alternative splicing is an essential mechanism for increasing transcriptome and proteome diversity in eukaryotes. Particularly in multicellular eukaryotes, this mechanism is involved in the regulation of developmental and physiological processes like growth, differentiation and signal transduction.
Authors Arash Kianianmomeni, Cheng Soon Ong, Gunnar Rätsch, Armin Hallmann
Submitted BMC genomics
Abstract Intraspecific genetic incompatibilities prevent the assembly of specific alleles into single genotypes and influence genome- and species-wide patterns of sequence variation. A common incompatibility in plants is hybrid necrosis, characterized by autoimmune responses due to epistatic interactions between natural genetic variants. By systematically testing thousands of F1 hybrids of Arabidopsis thaliana strains, we identified a small number of incompatibility hot spots in the genome, often in regions densely populated by nucleotide-binding domain and leucine-rich repeat (NLR) immune receptor genes. In several cases, these immune receptor loci interact with each other, suggestive of conflict within the immune system. A particularly dangerous locus is a highly variable cluster of NLR genes, DM2, which causes multiple independent incompatibilities with genes that encode a range of biochemical functions, including NLRs. Our findings suggest that deleterious interactions of immune receptors limit the combinations of favorable disease resistance alleles accessible to plant genomes.
Authors Eunyoung Chae, Kirsten Bomblies, Sang Tae Kim, Darya Karelina, Maricris Zaidem, Stephan Ossowski, Carmen Martin Pizarro, Roosa A E Laitinen, Beth A Rowan, Hezi Tenenboim, Sarah Lechner, Monika Demar, Anette Habring Müller, Christa Lanz, Gunnar Rätsch, Detlef Weigel
Submitted Cell
Abstract Transcription factor (TF) DNA sequence preferences direct their regulatory activity, but are currently known for only ∼1% of eukaryotic TFs. Broadly sampling DNA-binding domain (DBD) types from multiple eukaryotic clades, we determined DNA sequence preferences for >1,000 TFs encompassing 54 different DBD classes from 131 diverse eukaryotes. We find that closely related DBDs almost always have very similar DNA sequence preferences, enabling inference of motifs for ∼34% of the ∼170,000 known or predicted eukaryotic TFs. Sequences matching both measured and inferred motifs are enriched in chromatin immunoprecipitation sequencing (ChIP-seq) peaks and upstream of transcription start sites in diverse eukaryotic lineages. SNPs defining expression quantitative trait loci in Arabidopsis promoters are also enriched for predicted TF binding sites. Importantly, our motif "library" can be used to identify specific TFs whose binding may be altered by human disease risk alleles. These data present a powerful resource for mapping transcriptional networks across eukaryotes.
Authors Matthew T Weirauch, Ally Yang, Mihai Albu, Atina G Cote, Alejandro Montenegro Montero, Philipp Drewe, Hamed S Najafabadi, Samuel A Lambert, Ishminder Mann, Kate Cook, Hong Zheng, Alejandra Goity, Harm van Bakel, Jean Claude Lozano, Mary Galli, Mathew G Lewsey, Eryong Huang, Tuhin Mukherjee, Xiaoting Chen, John S Reece Hoyes, Sridhar Govindarajan, Gad Shaulsky, Albertha J M Walhout, Francois Yves Bouget, Gunnar Rätsch, Luis F Larrondo, Joseph R Ecker, Timothy R Hughes
Submitted Cell
Abstract The translational control of oncoprotein expression is implicated in many cancers. Here we report an eIF4A RNA helicase-dependent mechanism of translational control that contributes to oncogenesis and underlies the anticancer effects of silvestrol and related compounds. For example, eIF4A promotes T-cell acute lymphoblastic leukaemia development in vivo and is required for leukaemia maintenance. Accordingly, inhibition of eIF4A with silvestrol has powerful therapeutic effects against murine and human leukaemic cells in vitro and in vivo. We use transcriptome-scale ribosome footprinting to identify the hallmarks of eIF4A-dependent transcripts. These include 5' untranslated region (UTR) sequences such as the 12-nucleotide guanine quartet (CGG)4 motif that can form RNA G-quadruplex structures. Notably, among the most eIF4A-dependent and silvestrol-sensitive transcripts are a number of oncogenes, superenhancer-associated transcription factors, and epigenetic regulators. Hence, the 5' UTRs of select cancer genes harbour a targetable requirement for the eIF4A RNA helicase.
Authors Andrew L Wolfe, Kamini Singh, Yi Zhong, Philipp Drewe, Vinagolu K Rajasekhar, Viraj R Sanghvi, Konstantinos J Mavrakis, Man Jiang, Justine E Roderick, Joni Van der Meulen, Jonathan H Schatz, Christina M Rodrigo, Chunying Zhao, Pieter Rondou, Elisa de Stanchina, Julie Teruya Feldstein, Michelle A Kelliher, Frank Speleman, John A Porco, Jerry Pelletier, Gunnar Rätsch, Hans Guido Wendel
Submitted Nature
Abstract We present Oqtans, an open-source workbench for quantitative transcriptome analysis that is integrated into Galaxy. Its distinguishing features include customizable computational workflows and a modular pipeline architecture that facilitates comparative assessment of tool and data quality. Oqtans integrates an assortment of machine learning-powered tools into Galaxy, which show superior or equal performance to state-of-the-art tools. Implemented tools comprise a complete transcriptome analysis workflow: short-read alignment, transcript identification/quantification and differential expression analysis. Oqtans and Galaxy facilitate persistent storage, data exchange and documentation of intermediate results and analysis workflows. We illustrate how Oqtans aids the interpretation of data from different experiments in easy-to-understand use cases. Users can easily create their own workflows and extend Oqtans by integrating specific tools. Oqtans is available as (i) a cloud machine image with a demo instance at cloud.oqtans.org, (ii) a public Galaxy instance at galaxy.cbio.mskcc.org, (iii) a git repository containing all installed software (oqtans.org/git); most of which is also available from (iv) the Galaxy Toolshed and (v) a share string to use along with Galaxy CloudMan.
Authors Vipin T Sreedharan, Sebastian J Schultheiss, Geraldine Jean, Andre Kahles, Regina Bohnert, Philipp Drewe, Pramod Mudrakarta, Nico Görnitz, Georg Zeller, Gunnar Rätsch
Submitted Bioinformatics (Oxford, England)
Abstract Recent genomic analyses of pathologically defined tumor types identify “within-a-tissue” disease subtypes. However, the extent to which genomic signatures are shared across tissues is still unclear. We performed an integrative analysis using five genome-wide platforms and one proteomic platform on 3,527 specimens from 12 cancer types, revealing a unified classification into 11 major subtypes. Five subtypes were nearly identical to their tissue-of-origin counterparts, but several distinct cancer types were found to converge into common subtypes. Lung squamous, head and neck, and a subset of bladder cancers coalesced into one subtype typified by TP53 alterations, TP63 amplifications, and high expression of immune and proliferation pathway genes. Of note, bladder cancers split into three pan-cancer subtypes. The multiplatform classification, while correlated with tissue-of-origin, provides independent information for predicting clinical outcomes. All data sets are available for data-mining from a unified resource to support further biological discoveries and insights into novel therapeutic strategies.
Authors K A Hoadley, C Yau, D M Wolf, A D Cherniack, D Tamborero, S Ng, M D M Leiserson, B Niu, M D McLellan, V Uzunangelov, J Zhang, Cyriac Kandoth, R Akbani, H Shen, L Omberg, A Chu, A A Margolin, LJ Van't Veer, N Lopez Bigas, P W Laird, B J Raphael, L Ding, A G Robertson, L A Byers, G B Mills, J N Weinstein, C Van Waes, Z Chen, E A Collisson, Cancer Genome Atlas Research Network
Submitted Cell
2013
Authors Christian K Widmer, Marius Kloft, Gunnar Rätsch
Authors AK Porbadnigk, Nico Görnitz, Marius Kloft, Klaus Robert Müller
Submitted Journal of Computing Science and Engineering
Authors Nico Görnitz, Marius Kloft, K Rieck, U Brefeld
Submitted Journal of Artificial Intelligence Research
Authors A Bauer, Nico Görnitz, F Biegler, Klaus Robert Müller, Marius Kloft
Submitted IEEE Transactions on Neural Networks and Learning Systems
Authors C Cortes, Marius Kloft, M Mohri
Abstract Insulin initiates diverse hepatic metabolic responses, including gluconeogenic suppression and induction of glycogen synthesis and lipogenesis. The liver possesses a rich sinusoidal capillary network with a higher degree of hypoxia and lower gluconeogenesis in the perivenous zone as compared to the rest of the organ. Here, we show that diverse vascular endothelial growth factor (VEGF) inhibitors improved glucose tolerance in nondiabetic C57BL/6 and diabetic db/db mice, potentiating hepatic insulin signaling with lower gluconeogenic gene expression, higher glycogen storage and suppressed hepatic glucose production. VEGF inhibition induced hepatic hypoxia through sinusoidal vascular regression and sensitized liver insulin signaling through hypoxia-inducible factor-2α (Hif-2α, encoded by Epas1) stabilization. Notably, liver-specific constitutive activation of HIF-2α, but not HIF-1α, was sufficient to augment hepatic insulin signaling through direct and indirect induction of insulin receptor substrate-2 (Irs2), an essential insulin receptor adaptor protein. Further, liver Irs2 was both necessary and sufficient to mediate Hif-2α and Vegf inhibition effects on glucose tolerance and hepatic insulin signaling. These results demonstrate an unsuspected intersection between Hif-2α-mediated hypoxic signaling and hepatic insulin action through Irs2 induction, which can be co-opted by Vegf inhibitors to modulate glucose metabolism. These studies also indicate distinct roles in hepatic metabolism for Hif-1α, which promotes glycolysis, and Hif-2α, which suppresses gluconeogenesis, and suggest new treatment approaches for type 2 diabetes mellitus.
Authors K Wei, SM Piecewicz, LM McGinnis, CM Taniguchi, SJ Wiegand, K Anderson, CW M Chan, KX Mulligan, David Kuo, J Yuan, M Vallon, LC Morton, E Lefai, MC Simon, JJ Maher, G Mithieux, F Rajas, JP Annes, OP McGuinness, G Thurston, AJ Giaccia, CJ Kuo
Submitted Nat Med
Abstract The intestinal microbiota is a microbial ecosystem of crucial importance to human health. Understanding how the microbiota confers resistance against enteric pathogens and how antibiotics disrupt that resistance is key to the prevention and cure of intestinal infections. We present a novel method to infer microbial community ecology directly from time-resolved metagenomics. This method extends generalized Lotka-Volterra dynamics to account for external perturbations. Data from recent experiments on antibiotic-mediated Clostridium difficile infection is analyzed to quantify microbial interactions, commensal-pathogen interactions, and the effect of the antibiotic on the community. Stability analysis reveals that the microbiota is intrinsically stable, explaining how antibiotic perturbations and C. difficile inoculation can produce catastrophic shifts that persist even after removal of the perturbations. Importantly, the analysis suggests a subnetwork of bacterial groups implicated in protection against C. difficile. Due to its generality, our method can be applied to any high-resolution ecological time-series data to infer community structure and response to external stimuli.
Authors Richard R Stein, Vanni Bucci, Nora C Toussaint, Charlie G Buffie, Gunnar Rätsch, Eric G Pamer, Chris Sander, Joao B Xavier
Submitted PLoS computational biology
Abstract High-throughput RNA sequencing is an increasingly accessible method for studying gene structure and activity on a genome-wide scale. A critical step in RNA-seq data analysis is the alignment of partial transcript reads to a reference genome sequence. To assess the performance of current mapping software, we invited developers of RNA-seq aligners to process four large human and mouse RNA-seq data sets. In total, we compared 26 mapping protocols based on 11 programs and pipelines and found major performance differences between methods on numerous benchmarks, including alignment yield, basewise accuracy, mismatch and gap placement, exon junction discovery and suitability of alignments for transcript reconstruction. We observed concordant results on real and simulated RNA-seq data, confirming the relevance of the metrics employed. Future developments in RNA-seq alignment methods would benefit from improved placement of multimapped reads, balanced utilization of existing gene annotation and a reduced false discovery rate for splice junctions.
Authors Par G Engstrom, Tamara Steijger, Botond Sipos, Gregory R Grant, Andre Kahles, Gunnar Rätsch, Nick Goldman, Tim J Hubbard, Jennifer Harrow, Roderic Guigo, Paul Bertone
Submitted Nature methods
Abstract The nonsense-mediated decay (NMD) surveillance pathway can recognize erroneous transcripts and physiological mRNAs, such as precursor mRNA alternative splicing (AS) variants. Currently, information on the global extent of coupled AS and NMD remains scarce and even absent for any plant species. To address this, we conducted transcriptome-wide splicing studies using Arabidopsis thaliana mutants in the NMD factor homologs UP FRAMESHIFT1 (UPF1) and UPF3 as well as wild-type samples treated with the translation inhibitor cycloheximide. Our analyses revealed that at least 17.4% of all multi-exon, protein-coding genes produce splicing variants that are targeted by NMD. Moreover, we provide evidence that UPF1 and UPF3 act in a translation-independent mRNA decay pathway. Importantly, 92.3% of the NMD-responsive mRNAs exhibit classical NMD-eliciting features, supporting their authenticity as direct targets. Genes generating NMD-sensitive AS variants function in diverse biological processes, including signaling and protein modification, for which NaCl stress-modulated AS-NMD was found. Besides mRNAs, numerous noncoding RNAs and transcripts derived from intergenic regions were shown to be NMD responsive. In summary, we provide evidence for a major function of AS-coupled NMD in shaping the Arabidopsis transcriptome, having fundamental implications in gene regulation and quality control of transcript processing.
Authors Gabriele Drechsel, Andre Kahles, Anil K Kesarwani, Eva Stauffer, Jonas Behr, Philipp Drewe, Gunnar Rätsch, Andreas Wachter
Submitted The Plant cell
Abstract High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and reconstruction of RNA transcripts. However, the extensive dynamic range of gene expression, technical limitations and biases, as well as the observed complexity of the transcriptional landscape, pose profound computational challenges for transcriptome reconstruction.
Authors Jonas Behr, Andre Kahles, Yi Zhong, Vipin T Sreedharan, Philipp Drewe, Gunnar Rätsch
Submitted Bioinformatics (Oxford, England)
Abstract Deep transcriptome sequencing (RNA-Seq) has become a vital tool for studying the state of cells in the context of varying environments, genotypes and other factors. RNA-Seq profiling data enable identification of novel isoforms, quantification of known isoforms and detection of changes in transcriptional or RNA-processing activity. Existing approaches to detect differential isoform abundance between samples either require a complete isoform annotation or fall short in providing statistically robust and calibrated significance estimates. Here, we propose a suite of statistical tests to address these open needs: a parametric test that uses known isoform annotations to detect changes in relative isoform abundance and a non-parametric test that detects differential read coverages and can be applied when isoform annotations are not available. Both methods account for the discrete nature of read counts and the inherent biological variability. We demonstrate that these tests compare favorably to previous methods, both in terms of accuracy and statistical calibrations. We use these techniques to analyze RNA-Seq libraries from Arabidopsis thaliana and Drosophila melanogaster. The identified differential RNA processing events were consistent with RT-qPCR measurements and previous studies. The proposed toolkit is available from http://bioweb.me/rdiff and enables in-depth analyses of transcriptomes, with or without available isoform annotation.
Authors Philipp Drewe, Oliver Stegle, Lisa Hartmann, Andre Kahles, Regina Bohnert, Andreas Wachter, Karsten Borgwardt, Gunnar Rätsch
Submitted Nucleic acids research
Abstract Using a variety of techniques including Topic Modeling, PCA and Bi-clustering, we explore electronic patient records in the form of unstructured clinical notes and genetic mutation test results. Our ultimate goal is to gain insight into a unique body of clinical data, specifically regarding the topics discussed within the note content and relationships between patient clinical notes and their underlying genetics.
Authors K R Chan, Xinghua Lou, Theo Karaletsos, C Crosbie, S Gardos, D Artz, Gunnar Rätsch
Submitted ICDM Workshop on Biological Data Mining and its Applications in Healthcare
Abstract The Cancer Genome Atlas (TCGA) Research Network has profiled and analyzed large numbers of human tumors to discover molecular aberrations at the DNA, RNA, protein and epigenetic levels. The resulting rich data provide a major opportunity to develop an integrated picture of commonalities, differences and emergent themes across tumor lineages. The Pan-Cancer initiative compares the first 12 tumor types profiled by TCGA. Analysis of the molecular aberrations and their functional roles across tumor types will teach us how to extend therapies effective in one cancer type to others with a similar genomic profile.
Authors Cancer Genome Atlas Research Network, J N Weinstein, E A Collisson, G B Mills, K R M Shaw, B A Ozenberger, K Ellrott, I Shmulevich, Chris Sander, J M Stuart
Submitted Nature Genetics
2012
Abstract CD45 encodes a trans-membrane protein-tyrosine phosphatase expressed in diverse cells of the immune system. By combinatorial use of three variable exons 4-6, isoforms are generated that differ in their extracellular domain, thereby modulating phosphatase activity and immune response. Alternative splicing of these CD45 exons involves two heterogeneous ribonucleoproteins, hnRNP L and its cell-type specific paralog hnRNP L-like (LL). To address the complex combinatorial splicing of exons 4-6, we investigated hnRNP L/LL protein expression in human B-cells in relation to CD45 splicing patterns, applying RNA-Seq. In addition, mutational and RNA-binding analyses were carried out in HeLa cells. We conclude that hnRNP LL functions as the major CD45 splicing repressor, with two CA elements in exon 6 as its primary target. In exon 4, one element is targeted by both hnRNP L and LL. In contrast, exon 5 was never repressed on its own and only co-regulated with exons 4 and 6. Stable L/LL interaction requires CD45 RNA, specifically exons 4 and 6. We propose a novel model of combinatorial alternative splicing: HnRNP L and LL cooperate on the CD45 pre-mRNA, bridging exons 4 and 6 and looping out exon 5, thereby achieving full repression of the three variable exons.
Authors Marco Preussner, Silke Schreiner, Lee Hsueh Hung, Martina Porstner, Hans Martin Jack, Vladimir Benes, Gunnar Rätsch, Albrecht Bindereif
Submitted Nucleic Acids Res
Authors Christian K Widmer, Gunnar Rätsch
Authors Christian K Widmer, Marius Kloft, Nico Görnitz, Gunnar Rätsch
Abstract Deep sequencing of transcriptomes allows quantitative and qualitative analysis of many RNA species in a sample, with parallel comparison of expression levels, splicing variants, natural antisense transcripts, RNA editing and transcriptional start and stop sites being the ideal goal. By computational modeling, we show how libraries of multiple insert sizes combined with strand-specific, paired-end (SS-PE) sequencing can increase the information gained on alternative splicing, especially in higher eukaryotes. Despite the benefits of gaining SS-PE data with paired ends of varying distance, the standard Illumina protocol allows only non-strand-specific, paired-end sequencing with a single insert size. Here, we modify the Illumina RNA ligation protocol to allow SS-PE sequencing by using a custom pre-adenylated 3' adaptor. We generate parallel libraries with differing insert sizes to aid deconvolution of alternative splicing events and to characterize the extent and distribution of natural antisense transcription in C. elegans. Despite stringent requirements for detection of alternative splicing, our data increases the number of intron retention and exon skipping events annotated in the Wormbase genome annotations by 127% and 121%, respectively. We show that parallel libraries with a range of insert sizes increase transcriptomic information gained by sequencing and that by current established benchmarks our protocol gives competitive results with respect to library quality.
Authors Lisa M Smith, Lisa Hartmann, Philipp Drewe, Regina Bohnert, Andre Kahles, Christa Lanz, Gunnar Rätsch
Submitted RNA biology
Abstract Alternative splicing (AS) generates transcript variants by variable exon/intron definition and massively expands transcriptome diversity. Changes in AS patterns have been found to be linked to manifold biological processes, yet fundamental aspects, such as the regulation of AS and its functional implications, largely remain to be addressed. In this work, widespread AS regulation by Arabidopsis thaliana Polypyrimidine tract binding protein homologs (PTBs) was revealed. In total, 452 AS events derived from 307 distinct genes were found to be responsive to the levels of the splicing factors PTB1 and PTB2, which predominantly triggered splicing of regulated introns, inclusion of cassette exons, and usage of upstream 5' splice sites. By contrast, no major AS regulatory function of the distantly related PTB3 was found. Dependent on their position within the mRNA, PTB-regulated events can both modify the untranslated regions and give rise to alternative protein products. We find that PTB-mediated AS events are connected to diverse biological processes, and the functional implications of selected instances were further elucidated. Specifically, PTB misexpression changes AS of PHYTOCHROME INTERACTING FACTOR6, coinciding with altered rates of abscisic acid-dependent seed germination. Furthermore, AS patterns as well as the expression of key flowering regulators were massively changed in a PTB1/2 level-dependent manner.
Authors Christina Ruhl, Eva Stauffer, Andre Kahles, Gabriele Wagner, Gabriele Drechsel, Gunnar Rätsch, Andreas Wachter
Submitted The Plant cell
Abstract Cohesin is a protein complex that forms a ring around sister chromatids thus holding them together. The ring is composed of three proteins: Smc1, Smc3 and Scc1. The roles of three additional proteins that associate with the ring, Scc3, Pds5 and Wpl1, are not well understood. It has been proposed that these three factors form a complex that stabilizes the ring and prevents it from opening. This activity promotes sister chromatid cohesion but at the same time poses an obstacle for the initial entrapment of sister DNAs. This hindrance to cohesion establishment is overcome during DNA replication via acetylation of the Smc3 subunit by the Eco1 acetyltransferase. However, the full mechanistic consequences of Smc3 acetylation remain unknown. In the current work, we test the requirement of Scc3 and Pds5 for the stable association of cohesin with DNA. We investigated the consequences of Scc3 and Pds5 depletion in vivo using degron tagging in budding yeast. The previously described DHFR-based N-terminal degron as well as a novel Eco1-derived C-terminal degron were employed in our study. Scc3 and Pds5 associate with cohesin complexes independently of each other and require the Scc1 "core" subunit for their association with chromosomes. Contrary to previous data for Scc1 downregulation, depletion of either Scc3 or Pds5 had a strong effect on sister chromatid cohesion but not on cohesin binding to DNA. Quantity, stability and genome-wide distribution of cohesin complexes remained mostly unchanged after the depletion of Scc3 and Pds5. Our findings are inconsistent with a previously proposed model that Scc3 and Pds5 are cohesin maintenance factors required for cohesin ring stability or for maintaining its association with DNA. We propose that Scc3 and Pds5 specifically function during cohesion establishment in S phase.
Authors Irina Kulemzina, Martin R Schumacher, Vikash Verma, Jochen Reiter, Janina Metzler, Antonio Virgilio Failla, Christa Lanz, Vipin T Sreedharan, Gunnar Rätsch, Dmitri Ivanov
Submitted PLoS genetics
2011
Authors Nico Görnitz, Georg Zeller, Jonas Behr, Andre Kahles, Pramod Mudrakarta, Soren Sonnenburg, Gunnar Rätsch
Authors Nico Görnitz, Christian K Widmer, Georg Zeller, Andre Kahles, Soren Sonnenburg, Gunnar Rätsch
Authors Sebastian J Schultheiss, Geraldine Jean, Jonas Behr, Philipp Drewe, Nico Görnitz, Andre Kahles, Pramod Mudrakarta, V T Sreedharan, Georg Zeller, Gunnar Rätsch
Abstract We have conducted a study on the long-term availability of bioinformatics Web services: an observation of 927 Web services published in the annual Nucleic Acids Research Web Server Issues between 2003 and 2009. We found that 72% of Web sites are still available at the published addresses; only 9% of services are completely unavailable. Older addresses often redirect to new pages. We checked the functionality of all available services: for 33%, we could not test functionality because there was no example data or a related problem; 13% were truly no longer working as expected; we could positively confirm functionality only for 45% of all services. Additionally, we conducted a survey among 872 Web Server Issue corresponding authors; 274 replied. 78% of all respondents indicate their services have been developed solely by students and researchers without a permanent position. Consequently, these services are in danger of falling into disrepair after the original developers move to another institution, and indeed, for 24% of services, there is no plan for maintenance, according to the respondents. We introduce a Web service quality scoring system that correlates with the number of citations: services with a high score are cited 1.8 times more often than low-scoring services. We have identified key characteristics that are predictive of a service's survival, providing reviewers, editors, and Web service developers with the means to assess or improve Web services. A Web service conforming to these criteria receives more citations and provides more reliable service for its users. The most effective way of ensuring continued access to a service is a persistent Web address, offered either by the publishing journal, or created on the authors' own initiative, for example at http://bioweb.me. The community would benefit the most from a policy requiring any source code needed to reproduce results to be deposited in a public repository.
Authors Sebastian J Schultheiss, Marc Christian Munch, Gergana D Andreeva, Gunnar Rätsch
Submitted PloS one
Abstract Genetic differences between Arabidopsis thaliana accessions underlie the plant's extensive phenotypic variation, and until now these have been interpreted largely in the context of the annotated reference accession Col-0. Here we report the sequencing, assembly and annotation of the genomes of 18 natural A. thaliana accessions, and their transcriptomes. When assessed on the basis of the reference annotation, one-third of protein-coding genes are predicted to be disrupted in at least one accession. However, re-annotation of each genome revealed that alternative gene models often restore coding potential. Gene expression in seedlings differed for nearly half of expressed genes and was frequently associated with cis variants within 5 kilobases, as were intron retention alternative splicing events. Sequence and expression variation is most pronounced in genes that respond to the biotic environment. Our data further promote evolutionary and functional studies in A. thaliana, especially the MAGIC genetic reference population descended from these accessions.
Authors Xiangchao Gan, Oliver Stegle, Jonas Behr, Joshua G Steffen, Philipp Drewe, Katie L Hildebrand, Rune Lyngsoe, Sebastian J Schultheiss, Edward J Osborne, Vipin T Sreedharan, Andre Kahles, Regina Bohnert, Geraldine Jean, Paul Derwent, Paul Kersey, Eric J Belfield, Nicholas P Harberd, Eric Kemen, Christopher Toomajian, Paula X Kover, Richard M Clark, Gunnar Rätsch, Richard Mott
Submitted Nature
Abstract Precise 5' splice-site recognition is essential for both constitutive and regulated pre-mRNA splicing. The U1 small nuclear ribonucleoprotein particle (snRNP)-specific protein U1C is involved in this first step of spliceosome assembly and important for stabilizing early splicing complexes. We used an embryonically lethal U1C mutant zebrafish, hi1371, to investigate the potential genomewide role of U1C for splicing regulation. U1C mutant embryos contain overall stable, but U1C-deficient U1 snRNPs. Surprisingly, genomewide RNA-Seq analysis of mutant versus wild-type embryos revealed a large set of specific target genes that changed their alternative splicing patterns in the absence of U1C. Injection of ZfU1C cRNA into mutant embryos and in vivo splicing experiments in HeLa cells after siRNA-mediated U1C knockdown confirmed the U1C dependency and specificity, as well as the functional conservation of the effects observed. In addition, sequence motif analysis of the U1C-dependent 5' splice sites uncovered an association with downstream intronic U-rich elements. In sum, our findings provide evidence for a new role of a general snRNP protein, U1C, as a mediator of alternative splicing regulation.
Authors Tanja Dorothe Rosel, Lee Hsueh Hung, Jan Medenbach, Katrin Donde, Stefan Starke, Vladimir Benes, Gunnar Rätsch, Albrecht Bindereif
Submitted The EMBO journal
Abstract Alternative splicing (AS) is a process which generates several distinct mRNA isoforms from the same gene by splicing different portions out of the precursor transcript. Due to the (patho-)physiological importance of AS, a complete inventory of AS is of great interest. While this is in reach for human and mammalian model organisms, our knowledge of AS in plants has remained more incomplete. Experimental approaches for monitoring AS are either based on transcript sequencing or rely on hybridization to DNA microarrays. Among the microarray platforms facilitating the discovery of AS events, tiling arrays are well-suited for identifying intron retention, the most prevalent type of AS in plants. However, analyzing tiling array data is challenging, because of high noise levels and limited probe coverage.
Authors Johannes Eichner, Georg Zeller, Sascha Laubinger, Gunnar Rätsch
Submitted BMC bioinformatics
Abstract The C. elegans genome has been completely sequenced, and the developmental anatomy of this model organism is described at single-cell resolution. Here we utilize strategies that exploit this precisely defined architecture to link gene expression to cell type. We obtained RNAs from specific cells and from each developmental stage using tissue-specific promoters to mark cells for isolation by FACS or for mRNA extraction by the mRNA-tagging method. We then generated gene expression profiles of more than 30 different cells and developmental stages using tiling arrays. Machine-learning-based analysis detected transcripts corresponding to established gene models and revealed novel transcriptionally active regions (TARs) in noncoding domains that comprise at least 10% of the total C. elegans genome. Our results show that about 75% of transcripts with detectable expression are differentially expressed among developmental stages and across cell types. Examination of known tissue- and cell-specific transcripts validates these data sets and suggests that newly identified TARs may exercise cell-specific functions. Additionally, we used self-organizing maps to define groups of coregulated transcripts and applied regulatory element analysis to identify known transcription factor- and miRNA-binding sites, as well as novel motifs that likely function to control subsets of these genes. By using cell-specific, whole-genome profiling strategies, we have detected a large number of novel transcripts and produced high-resolution gene expression maps that provide a basis for establishing the roles of individual genes in cellular differentiation.
Authors William C Spencer, Georg Zeller, Joseph D Watson, Stefan R Henz, Kathie L Watkins, Rebecca D McWhirter, Sarah Petersen, Vipin T Sreedharan, Christian K Widmer, Jeanyoung Jo, Valerie Reinke, Lisa Petrella, Susan Strome, Stephen E Von Stetina, Menachem Katz, Shai Shaham, Gunnar Rätsch, David M Miller
Submitted Genome research
Abstract CO2 is both a critical regulator of animal physiology and an important sensory cue for many animals for host detection, food location, and mate finding. The free-living soil nematode Caenorhabditis elegans shows CO2 avoidance behavior, which requires a pair of ciliated sensory neurons, the BAG neurons. Using in vivo calcium imaging, we show that CO2 specifically activates the BAG neurons and that the CO2-sensing function of BAG neurons requires TAX-2/TAX-4 cyclic nucleotide-gated ion channels and the receptor-type guanylate cyclase GCY-9. Our results delineate a molecular pathway for CO2 sensing and suggest that activation of a receptor-type guanylate cyclase is an evolutionarily conserved mechanism by which animals detect environmental CO2.
Authors Elissa A Hallem, William C Spencer, Rebecca D McWhirter, Georg Zeller, Stefan R Henz, Gunnar Rätsch, David M Miller, H Robert Horvitz, Paul W Sternberg, Niels Ringstad
Submitted Proceedings of the National Academy of Sciences of the United States of America
2010
Authors Soren Sonnenburg, Gunnar Rätsch, Sebastian Henschel, Christian K Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, Vojtech Franc
Submitted J. Mach. Learn. Res.
Authors Christian K Widmer, J Leiva, Yasemin Altun, Gunnar Rätsch
Authors Christian K Widmer, Nora C Toussaint, Yasemin Altun, Oliver Kohlbacher, Gunnar Rätsch
Authors Gunnar Rätsch, Regina Bohnert
Authors Oliver Stegle, Philipp Drewe, Regina Bohnert, Karsten Borgwardt, Gunnar Rätsch
Abstract The classic phytohormones cytokinin and auxin play essential roles in the maintenance of stem-cell systems embedded in shoot and root meristems, and exhibit complex functional interactions. Here we show that the activity of both hormones directly converges on the promoters of two A-type ARABIDOPSIS RESPONSE REGULATOR (ARR) genes, ARR7 and ARR15, which are negative regulators of cytokinin signalling and have important meristematic functions. Whereas ARR7 and ARR15 expression in the shoot apical meristem (SAM) is induced by cytokinin, auxin has a negative effect, which is, at least in part, mediated by the AUXIN RESPONSE FACTOR5/MONOPTEROS (MP) transcription factor. Our results provide a mechanistic framework for hormonal control of the apical stem-cell niche and demonstrate how root and shoot stem-cell systems differ in their response to phytohormones.
Authors Z Zhao, SU Andersen, K Ljung, K Dolezal, A Miotk, Sebastian J Schultheiss, Jan U Lohmann
Submitted Nature
Authors Soren Sonnenburg, Vojtech Franc
Abstract The challenge of identifying cis-regulatory modules (CRMs) is an important milestone for the ultimate goal of understanding transcriptional regulation in eukaryotic cells. It has been approached, among others, by motif-finding algorithms that identify overrepresented motifs in regulatory sequences. These methods succeed in finding single, well-conserved motifs, but fail to identify combinations of degenerate binding sites, like the ones often found in CRMs. We have developed a method that combines the abilities of existing motif finding with the discriminative power of a machine learning technique to model the regulation of genes (Schultheiss et al. (2009) Bioinformatics 25, 2126-2133). Our software is called KIRMES, which stands for kernel-based identification of regulatory modules in eukaryotic sequences. Starting from a set of genes thought to be co-regulated, KIRMES can identify the key CRMs responsible for this behavior and can be used to determine for any other gene not included on that list if it is also regulated by the same mechanism. Such gene sets can be derived from microarrays, chromatin immunoprecipitation experiments combined with next-generation sequencing or promoter/whole genome microarrays. The use of an established machine learning method makes the approach fast to use and robust with respect to noise. By providing easily understood visualizations for the results returned, they become interpretable and serve as a starting point for further analysis. Even for complex regulatory relationships, KIRMES can be a helpful tool in directing the design of biological experiments.
Authors Sebastian J Schultheiss
Submitted Methods Mol Biol
Authors Fabio De Bona, S Riezler, K Hall, M Ciaramita, A Herdagdelen, M Holmqvist
Abstract Despite the independent evolution of multicellularity in plants and animals, the basic organization of their stem cell niches is remarkably similar. Here, we report the genome-wide regulatory potential of WUSCHEL, the key transcription factor for stem cell maintenance in the shoot apical meristem of the reference plant Arabidopsis thaliana. WUSCHEL acts by directly binding to at least two distinct DNA motifs in more than 100 target promoters and preferentially affects the expression of genes with roles in hormone signaling, metabolism, and development. Striking examples are the direct transcriptional repression of CLAVATA1, which is part of a negative feedback regulation of WUSCHEL, and the immediate regulation of transcriptional repressors of the TOPLESS family, which are involved in auxin signaling. Our results shed light on the complex transcriptional programs required for the maintenance of a dynamic and essential stem cell niche.
Authors Wolfgang Busch, A Miotk, FD Ariel, Z Zhao, J Forner, G Daum, T Suzaki, C Schuster, Sebastian J Schultheiss, A Leibfried, S Haubeiss, N Ha, R L Chan, Jan U Lohmann
Submitted Dev Cell
Abstract String kernels are commonly used for the classification of biological sequences, nucleotide as well as amino acid sequences. Although string kernels are already very powerful, when it comes to amino acids they have a major shortcoming. They ignore an important piece of information when comparing amino acids: the physico-chemical properties such as size, hydrophobicity, or charge. This information is very valuable, especially when training data is less abundant. There have been only very few approaches so far that aim at combining these two ideas.
Authors Nora C Toussaint, Christian K Widmer, Oliver Kohlbacher, Gunnar Rätsch
Submitted BMC bioinformatics
Abstract The lack of sufficient training data is the limiting factor for many Machine Learning applications in Computational Biology. If data is available for several different but related problem domains, Multitask Learning algorithms can be used to learn a model based on all available information. In Bioinformatics, many problems can be cast into the Multitask Learning scenario by incorporating data from several organisms. However, combining information from several tasks requires careful consideration of the degree of similarity between tasks. Our proposed method simultaneously learns or refines the similarity between tasks along with the Multitask Learning classifier. This is done by formulating the Multitask Learning problem as Multiple Kernel Learning, using the recently published q-Norm MKL algorithm.
Authors Christian K Widmer, Nora C Toussaint, Yasemin Altun, Gunnar Rätsch
Submitted BMC bioinformatics
Abstract In Arabidopsis thaliana, four different dicer-like (DCL) proteins have distinct but partially overlapping functions in the biogenesis of microRNAs (miRNAs) and siRNAs from longer, noncoding precursor RNAs. To analyze the impact of different components of the small RNA biogenesis machinery on the transcriptome, we subjected dcl and other mutants impaired in small RNA biogenesis to whole-genome tiling array analysis. We compared both protein-coding genes and noncoding transcripts, including most pri-miRNAs, in two tissues and several stress conditions. Our analysis revealed a surprising number of common targets in dcl1 and dcl2 dcl3 dcl4 triple mutants. Furthermore, our results suggest that DCL1 is not only involved in miRNA action but also contributes to silencing of a subset of transposons, apparently through an effect on DNA methylation.
Authors Sascha Laubinger, Georg Zeller, Stefan R Henz, Sabine Buechel, Timo Sachsenberg, Jia Wei Wang, Gunnar Rätsch, Detlef Weigel
Submitted Proceedings of the National Academy of Sciences of the United States of America
Abstract We provide a novel web service, called rQuant.web, allowing convenient access to tools for quantitative analysis of RNA sequencing data. The underlying quantitation technique rQuant is based on quadratic programming and estimates different biases induced by library preparation, sequencing and read mapping. It can tackle multiple transcripts per gene locus and is therefore particularly well suited to quantify alternative transcripts. rQuant.web is available as a tool in a Galaxy installation at http://galaxy.fml.mpg.de. Using rQuant.web is free of charge, it is open to all users, and there is no login requirement.
Authors Regina Bohnert, Gunnar Rätsch
Submitted Nucleic acids research
Abstract We systematically generated large-scale data sets to improve genome annotation for the nematode Caenorhabditis elegans, a key model organism. These data sets include transcriptome profiling across a developmental time course, genome-wide identification of transcription factor-binding sites, and maps of chromatin organization. From this, we created more complete and accurate gene models, including alternative splice forms and candidate noncoding RNAs. We constructed hierarchical networks of transcription factor-binding and microRNA interactions and discovered chromosomal locations bound by an unusually large number of transcription factors. Different patterns of chromatin composition and histone modification were revealed between chromosome arms and centers, with similarly prominent differences between autosomes and the X chromosome. Integrating data types, we built statistical models relating chromatin, transcription factor binding, and gene expression. Overall, our analyses ascribed putative functions to most of the conserved genome.
Authors Mark B Gerstein, Zhi John Lu, Eric L Van Nostrand, Chao Cheng, Bradley I Arshinoff, Tao Liu, Kevin Y Yip, Rebecca Robilotto, Andreas Rechtsteiner, Kohta Ikegami, Pedro Alves, Aurelien Chateigner, Marc Perry, Mitzi Morris, Raymond K Auerbach, Xin Feng, Jing Leng, Anne Vielle, Wei Niu, Kahn Rhrissorrakrai, Ashish Agarwal, Roger P Alexander, Galt Barber, Cathleen M Brdlik, Jennifer Brennan, Jeremy Jean Brouillet, Adrian Carr, Ming Sin Cheung, Hiram Clawson, Sergio Contrino, Luke O Dannenberg, Abby F Dernburg, Arshad Desai, Lindsay Dick, Andrea C Dose, Jiang Du, Thea Egelhofer, Sevinc Ercan, Ghia Euskirchen, Brent Ewing, Elise A Feingold, Reto Gassmann, Peter J Good, Phil Green, Francois Gullier, Michelle Gutwein, Mark S Guyer, Lukas Habegger, Ting Han, Jorja G Henikoff, Stefan R Henz, Angie Hinrichs, Heather Holster, Tony Hyman, A Leo Iniguez, Judith Janette, Morten Jensen, Masaomi Kato, W James Kent, Ellen Kephart, Vishal Khivansara, Ekta Khurana, John K Kim, Paulina Kolasinska Zwierz, Eric C Lai, Isabel Latorre, Amber Leahey, Suzanna Lewis, Paul Lloyd, Lucas Lochovsky, Rebecca F Lowdon, Yaniv Lubling, Rachel Lyne, Michael MacCoss, Sebastian D Mackowiak, Marco Mangone, Sheldon McKay, Desirea Mecenas, Gennifer Merrihew, David M Miller, Andrew Muroyama, John I Murray, Siew Loon Ooi, Hoang Pham, Taryn Phippen, Elicia A Preston, Nikolaus Rajewsky, Gunnar Rätsch, Heidi Rosenbaum, Joel Rozowsky, Kim Rutherford, Peter Ruzanov, Mihail Sarov, Rajkumar Sasidharan, Andrea Sboner, Paul Scheid, Eran Segal, Hyunjin Shin, Chong Shou, Frank J Slack, Cindie Slightam, Richard Smith, William C Spencer, E O Stinson, Scott Taing, Teruaki Takasaki, Dionne Vafeados, Ksenia Voronina, Guilin Wang, Nicole L Washington, Christina M Whittle, Beijing Wu, Koon Kiu Yan, Georg Zeller, Zheng Zha, Mei Zhong, Xingliang Zhou, Julie Ahringer, Susan Strome, Kristin C Gunsalus, Gos Micklem, X Shirley Liu, Valerie Reinke, Sang Tae Kim, LaDeana W Hillier, Steven Henikoff, Fabio Piano, Michael Snyder, Lincoln Stein, Jason D Lieb, Robert H Waterston
Submitted Science (New York, N.Y.)
Abstract Next-generation sequencing technologies have revolutionized genome and transcriptome sequencing. RNA-Seq experiments are able to generate huge amounts of transcriptome sequence reads at a fraction of the cost of Sanger sequencing. Reads produced by these technologies are relatively short and error prone. To utilize such reads for transcriptome reconstruction and gene-structure identification, one needs to be able to accurately align the sequence reads over intron boundaries. In this unit, we describe PALMapper, a fast and easy-to-use tool that is designed to accurately compute both unspliced and spliced alignments for millions of RNA-Seq reads. It combines the efficient read mapper GenomeMapper with the spliced aligner QPALMA, which exploits read-quality information and predictions of splice sites to improve the alignment accuracy. The PALMapper package is available as a command-line tool running on Unix or Mac OS X systems or through a Web interface based on Galaxy tools.
Authors Geraldine Jean, Andre Kahles, Vipin T Sreedharan, Fabio de Bona, Gunnar Rätsch
Submitted Current protocols in bioinformatics
2009
Authors Sebastian J Schultheiss, Wolfgang Busch, Jan U Lohmann, Oliver Kohlbacher, Gunnar Rätsch
Authors Gabriele Schweikert, Christian K Widmer, Bernhard Schölkopf, Gunnar Rätsch
Authors Manfred K Warmuth, K Glocer, Gunnar Rätsch
Authors Alexander Zien, N Kramer, Soren Sonnenburg, Gunnar Rätsch
Abstract In Arabidopsis thaliana, gene expression level polymorphisms (ELPs) between natural accessions that exhibit simple, single locus inheritance are promising quantitative trait locus (QTL) candidates to explain phenotypic variability. It is assumed that such ELPs overwhelmingly represent regulatory element polymorphisms. However, comprehensive genome-wide analyses linking expression level, regulatory sequence and gene structure variation are missing, preventing definite verification of this assumption. Here, we analyzed ELPs observed between the Eil-0 and Lc-0 accessions. Compared with non-variable controls, 5' regulatory sequence variation in the corresponding genes is indeed increased. However, approximately 42% of all the ELP genes also carry major transcription unit deletions in one parent as revealed by genome tiling arrays, representing a >4-fold enrichment over controls. Within the subset of ELPs with simple inheritance, this proportion is even higher and deletions are generally more severe. Similar results were obtained from analyses of the Bay-0 and Sha accessions, using alternative technical approaches. Collectively, our results suggest that drastic structural changes are a major cause for ELPs with simple inheritance, corroborating experimentally observed indel preponderance in cloned Arabidopsis QTL.
Authors S Plantegenet, J Weber, DR Goldstein, Georg Zeller, C Nussbaumer, J Thomas, Detlef Weigel, K Harshman, CS Hardtke
Submitted Mol Syst Biol
Abstract Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability. Unfortunately, l1-norm MKL is hardly observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures, we generalize MKL to arbitrary lp-norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary p > 1. Empirically, we demonstrate that the interleaved optimization strategies are much faster compared to the traditionally used wrapper approaches. Finally, we apply lp-norm MKL to real-world problems from computational biology, showing that non-sparse MKL achieves accuracies that go beyond the state-of-the-art.
Authors Marius Kloft, U Brefeld, Soren Sonnenburg, P Laskow, Klaus Robert Müller, Alexander Zien
Authors E Georgii, Koji Tsuda, Bernhard Schölkopf
Abstract Modern systems biology aims at understanding how the different molecular components of a biological cell interact. Often, cellular functions are performed by complexes consisting of many different proteins. The composition of these complexes may change according to the cellular environment, and one protein may be involved in several different processes. The automatic discovery of functional complexes from protein interaction data is challenging. While previous approaches use approximations to extract dense modules, our approach exactly solves the problem of dense module enumeration. Furthermore, constraints from additional information sources such as gene expression and phenotype data can be integrated, so we can systematically mine for dense modules with interesting profiles.
Authors E Georgii, S Dietmann, T Uno, P Pagel, Koji Tsuda
Submitted Bioinformatics
Abstract The DICS database is a dynamic web repository of computationally predicted functional modules from the human protein-protein interaction network. It provides references to the CORUM, DrugBank, KEGG and Reactome pathway databases. DICS can be accessed for retrieving sets of overlapping modules and protein complexes that are significantly enriched in a gene list, thereby providing valuable information about the functional context. Availability: Supplementary information on datasets and methods is available on the web server http://mips.gsf.de/proj/dics.
Authors S Dietmann, E Georgii, A Antonov, Koji Tsuda, HW Mewes
Submitted Bioinformatics
Abstract The Affymetrix ATH1 array provides a robust standard tool for transcriptome analysis, but unfortunately does not represent all of the transcribed genes in Arabidopsis thaliana. Recently, Affymetrix has introduced its Arabidopsis Tiling 1.0R array, which offers whole-genome coverage of the sequenced Col-0 reference strain. Here, we present an approach to exploit this platform for quantitative mRNA expression analysis, and compare the results with those obtained using ATH1 arrays. We also propose a method for selecting unique tiling probes for each annotated gene or transcript in the most current genome annotation, TAIR7, generating Chip Definition Files for the Tiling 1.0R array. As a test case, we compared the transcriptome of wild-type plants with that of transgenic plants overproducing the heterodimeric E2Fa-DPa transcription factor. We show that with the appropriate data pre-processing, the estimated changes per gene for those with significantly different expression levels is very similar for the two array types. With the tiling arrays we could identify 368 new E2F-regulated genes, with a large fraction including an E2F motif in the promoter. The latter groups increase the number of excellent candidates for new, direct E2F targets by almost twofold, from 181 to 334.
Authors Naira Naouar, Klaas Vandepoele, Tim Lammens, Tineke Casneuf, Georg Zeller, Paul van Hummelen, Detlef Weigel, Gunnar Rätsch, Dirk Inze, Martin Kuiper, Lieven De Veylder, Marnik Vuylsteke
Submitted The Plant journal : for cell and molecular biology
Abstract Rice, the primary source of dietary calories for half of humanity, is the first crop plant for which a high-quality reference genome sequence from a single variety was produced. We used resequencing microarrays to interrogate 100 Mb of the unique fraction of the reference genome for 20 diverse varieties and landraces that capture the impressive genotypic and phenotypic diversity of domesticated rice. Here, we report the distribution of 160,000 nonredundant SNPs. Introgression patterns of shared SNPs revealed the breeding history and relationships among the 20 varieties; some introgressed regions are associated with agronomic traits that mark major milestones in rice improvement. These comprehensive SNP data provide a foundation for deep exploration of rice diversity and gene-trait relationships and their use for future rice improvement.
Authors Kenneth L McNally, Kevin L Childs, Regina Bohnert, Rebecca M Davidson, Keyan Zhao, Victor J Ulat, Georg Zeller, Richard M Clark, Douglas R Hoen, Thomas E Bureau, Renee Stokowski, Dennis G Ballinger, Kelly A Frazer, David R Cox, Badri Padhukasahasram, Carlos D Bustamante, Detlef Weigel, David J Mackill, Richard M Bruskiewich, Gunnar Rätsch, C Robin Buell, Hei Leung, Jan E Leach
Submitted Proceedings of the National Academy of Sciences of the United States of America
Abstract We present a highly accurate gene-prediction system for eukaryotic genomes, called mGene. It combines in an unprecedented manner the flexibility of generalized hidden Markov models (gHMMs) with the predictive power of modern machine learning methods, such as Support Vector Machines (SVMs). Its excellent performance was proved in an objective competition based on the genome of the nematode Caenorhabditis elegans. Considering the average of sensitivity and specificity, the developmental version of mGene exhibited the best prediction performance on nucleotide, exon, and transcript level for ab initio and multiple-genome gene-prediction tasks. The fully developed version shows superior performance in 10 out of 12 evaluation criteria compared with the other participating gene finders, including Fgenesh++ and Augustus. An in-depth analysis of mGene's genome-wide predictions revealed that approximately 2200 predicted genes were not contained in the current genome annotation. Testing a subset of 57 of these genes by RT-PCR and sequencing, we confirmed expression for 24 (42%) of them. mGene missed 300 annotated genes, out of which 205 were unconfirmed. RT-PCR testing of 24 of these genes resulted in a success rate of merely 8%. These findings suggest that even the gene catalog of a well-studied organism such as C. elegans can be substantially improved by mGene's predictions. We also provide gene predictions for the four nematodes C. briggsae, C. brenneri, C. japonica, and C. remanei. Comparing the resulting proteomes among these organisms and to the known protein universe, we identified many species-specific gene inventions. In a quality assessment of several available annotations for these genomes, we find that mGene's predictions are most accurate.
Authors Gabriele Schweikert, Alexander Zien, Georg Zeller, Jonas Behr, Christoph Dieterich, Cheng Soon Ong, Petra Philips, Fabio de Bona, Lisa Hartmann, Anja Bohlen, Nina Kruger, Soren Sonnenburg, Gunnar Rätsch
Submitted Genome research
Abstract We describe mGene.web, a web service for the genome-wide prediction of protein coding genes from eukaryotic DNA sequences. It offers pre-trained models for the recognition of gene structures including untranslated regions in an increasing number of organisms. With mGene.web, users have the additional possibility to train the system with their own data for other organisms on the push of a button, a functionality that will greatly accelerate the annotation of newly sequenced genomes. The system is built in a highly modular way, such that individual components of the framework, like the promoter prediction tool or the splice site predictor, can be used autonomously. The underlying gene finding system mGene is based on discriminative machine learning techniques and its high accuracy has been demonstrated in an international competition on nematode genomes. mGene.web is available at http://www.mgene.org/web, it is free of charge and can be used for eukaryotic genomes of small to moderate size (several hundred Mbp).
Authors Gabriele Schweikert, Jonas Behr, Alexander Zien, Georg Zeller, Cheng Soon Ong, Soren Sonnenburg, Gunnar Rätsch
Submitted Nucleic acids research
Abstract We shed light on the discrimination between patterns belonging to two different classes by casting this decoding problem into a generalized prototype framework. The discrimination process is then separated into two stages: a projection stage that reduces the dimensionality of the data by projecting it on a line and a threshold stage where the distributions of the projected patterns of both classes are separated. For this, we extend the popular mean-of-class prototype classification using algorithms from machine learning that satisfy a set of invariance properties. We report a simple yet general approach to express different types of linear classification algorithms in an identical and easy-to-visualize formal framework using generalized prototypes where these prototypes are used to express the normal vector and offset of the hyperplane. We investigate non-margin classifiers such as the classical prototype classifier, the Fisher classifier, and the relevance vector machine. We then study hard and soft margin classifiers such as the support vector machine and a boosted version of the prototype classifier. Subsequently, we relate mean-of-class prototype classification to other classification algorithms by showing that the prototype classifier is a limit of any soft margin classifier and that boosting a prototype classifier yields the support vector machine. While giving novel insights into classification per se by presenting a common and unified formalism, our generalized prototype framework also provides an efficient visualization and a principled comparison of machine learning classification.
Authors Arnulf B A Graf, Olivier Bousquet, Gunnar Rätsch, Bernhard Schölkopf
Submitted Neural computation
Abstract Understanding transcriptional regulation is one of the main challenges in computational biology. An important problem is the identification of transcription factor (TF) binding sites in promoter regions of potential TF target genes. It is typically approached by position weight matrix-based motif identification algorithms using Gibbs sampling, or heuristics to extend seed oligos. Such algorithms succeed in identifying single, relatively well-conserved binding sites, but tend to fail when it comes to the identification of combinations of several degenerate binding sites, as those often found in cis-regulatory modules.
Authors Sebastian J Schultheiss, Wolfgang Busch, Jan U Lohmann, Oliver Kohlbacher, Gunnar Rätsch
Submitted Bioinformatics (Oxford, England)
Abstract The responses of plants to abiotic stresses are accompanied by massive changes in transcriptome composition. To provide a comprehensive view of stress-induced changes in the Arabidopsis thaliana transcriptome, we have used whole-genome tiling arrays to analyze the effects of salt, osmotic, cold and heat stress as well as application of the hormone abscisic acid (ABA), an important mediator of stress responses. Among annotated genes in the reference strain Columbia we have found many stress-responsive genes, including several transcription factor genes as well as pseudogenes and transposons that have been missed in previous analyses with standard expression arrays. In addition, we report hundreds of newly identified, stress-induced transcribed regions. These often overlap with known, annotated genes. The results are accessible through the Arabidopsis thaliana Tiling Array Express (At-TAX) homepage, which provides convenient tools for displaying expression values of annotated genes, as well as visualization of unannotated transcribed regions along each chromosome.
Authors Georg Zeller, Stefan R Henz, Christian K Widmer, Timo Sachsenberg, Gunnar Rätsch, Detlef Weigel, Sascha Laubinger
Submitted The Plant journal : for cell and molecular biology
Abstract Novel high-throughput sequencing technologies open exciting new approaches to transcriptome profiling. Sequencing transcript populations of interest, e.g. from different tissues or variable stress conditions, with RNA sequencing (RNA-Seq) [1] generates millions of short reads. Accurately aligned to a reference genome, they provide digital counts and thus facilitate transcript quantification. As the observed read counts only provide the summation of all expressed sequences at one locus, the inference of the underlying transcript abundances is crucial for further quantitative analyses.
Authors Regina Bohnert, Jonas Behr, Gunnar Rätsch
Submitted BMC Bioinformatics
2008
Abstract Next generation sequencing technologies open exciting new possibilities for genome and transcriptome sequencing. While reads produced by these technologies are relatively short and error prone compared to the Sanger method, their throughput is several orders of magnitude higher. To utilize such reads for transcriptome sequencing and gene structure identification, one needs to be able to accurately align the sequence reads over intron boundaries. This represents a significant challenge given their short length and inherent high error rate. RESULTS: We present a novel approach, called QPALMA, for computing accurate spliced alignments which takes advantage of the read's quality information as well as computational splice site predictions. Our method uses a training set of spliced reads with quality information and known alignments. It uses a large margin approach similar to support vector machines to estimate its parameters to maximize alignment accuracy. In computational experiments, we illustrate that the quality information as well as the splice site predictions help to improve the alignment quality. Finally, to facilitate mapping of massive amounts of sequencing data typically generated by the new technologies, we have combined our method with a fast mapping pipeline based on enhanced suffix arrays. Our algorithms were optimized and tested using reads produced with the Illumina Genome Analyzer for the model plant Arabidopsis thaliana. AVAILABILITY: Datasets for training and evaluation, additional results and a stand-alone alignment tool implemented in C++ and Python are available at http://www.fml.mpg.de/raetsch/projects/qpalma.
Authors Fabio de Bona, Stephan Ossowski, Korbinian Schneeberger, Gunnar Rätsch
Submitted Bioinformatics
Authors Sebastian J Schultheiss, Wolfgang Busch, Jan U Lohmann, Oliver Kohlbacher, Gunnar Rätsch
Authors Soren Sonnenburg, Gunnar Rätsch, Christin Schafer
Authors Cheng Soon Ong, Alexander Zien
Abstract For the analysis of transcriptional tiling arrays we have developed two methods based on state-of-the-art machine learning algorithms. First, we present a novel transcript normalization technique to alleviate the effect of oligonucleotide probe sequences on hybridization intensity. It is specifically designed to decrease the variability observed for individual probes complementary to the same transcript. Applying this normalization technique to Arabidopsis tiling arrays, we are able to reduce sequence biases and also significantly improve separation in signal intensity between exonic and intronic/intergenic probes. Our second contribution is a method for transcript mapping. It extends an algorithm proposed for yeast tiling arrays to the more challenging task of spliced transcript identification. When evaluated on raw versus normalized intensities our method achieves highest prediction accuracy when segmentation is performed on transcript-normalized tiling array data.
Authors Georg Zeller, Stefan R Henz, Sascha Laubinger, Detlef Weigel, Gunnar Rätsch
Submitted Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
Abstract Gene expression maps for model organisms, including Arabidopsis thaliana, have typically been created using gene-centric expression arrays. Here, we describe a comprehensive expression atlas, Arabidopsis thaliana Tiling Array Express (At-TAX), which is based on whole-genome tiling arrays. We demonstrate that tiling arrays are accurate tools for gene expression analysis and identified more than 1,000 unannotated transcribed regions. Visualizations of gene expression estimates, transcribed regions, and tiling probe measurements are accessible online at the At-TAX homepage.
Authors Sascha Laubinger, Georg Zeller, Stefan R Henz, Timo Sachsenberg, Christian K Widmer, Naira Naouar, Marnik Vuylsteke, Bernhard Schölkopf, Gunnar Rätsch, Detlef Weigel
Submitted Genome biology
Abstract At the heart of many important bioinformatics problems, such as gene finding and function prediction, is the classification of biological sequences. Frequently the most accurate classifiers are obtained by training support vector machines (SVMs) with complex sequence kernels. However, a cumbersome shortcoming of SVMs is that their learned decision rules are very hard to understand for humans and cannot easily be related to biological facts.
Authors Soren Sonnenburg, Alexander Zien, Petra Philips, Gunnar Rätsch
Submitted Bioinformatics (Oxford, England)
Abstract The processing of Arabidopsis thaliana microRNAs (miRNAs) from longer primary transcripts (pri-miRNAs) requires the activity of several proteins, including DICER-LIKE1 (DCL1), the double-stranded RNA-binding protein HYPONASTIC LEAVES1 (HYL1), and the zinc finger protein SERRATE (SE). It has been noted before that the morphological appearance of weak se mutants is reminiscent of plants with mutations in ABH1/CBP80 and CBP20, which encode the two subunits of the nuclear cap-binding complex. We report that, like SE, the cap-binding complex is necessary for proper processing of pri-miRNAs. Inactivation of either ABH1/CBP80 or CBP20 results in decreased levels of mature miRNAs accompanied by apparent stabilization of pri-miRNAs. Whole-genome tiling array analyses reveal that se, abh1/cbp80, and cbp20 mutants also share similar splicing defects, leading to the accumulation of many partially spliced transcripts. This is unlikely to be an indirect consequence of improper miRNA processing or other mRNA turnover pathways, because introns retained in se, abh1/cbp80, and cbp20 mutants are not affected by mutations in other genes required for miRNA processing or for nonsense-mediated mRNA decay. Taken together, our results uncover dual roles in splicing and miRNA processing that distinguish SE and the cap-binding complex from specialized miRNA processing factors such as DCL1 and HYL1.
Authors Sascha Laubinger, Timo Sachsenberg, Georg Zeller, Wolfgang Busch, Jan U Lohmann, Gunnar Rätsch, Detlef Weigel
Submitted Proceedings of the National Academy of Sciences of the United States of America
Detecting polymorphic regions in Arabidopsis thaliana with resequencing microarrays.
Abstract Whole-genome oligonucleotide resequencing arrays have allowed the comprehensive discovery of single nucleotide polymorphisms (SNPs) in eukaryotic genomes of moderate to large size. With this technology, the detection rate for isolated SNPs is typically high. However, it is greatly reduced when other polymorphisms are located near a SNP as multiple mismatches inhibit hybridization to arrayed oligonucleotides. Contiguous tracts of suppressed hybridization therefore typify polymorphic regions (PRs) such as clusters of SNPs or deletions. We developed a machine learning method, designated margin-based prediction of polymorphic regions (mPPR), to predict PRs from resequencing array data. Conceptually similar to hidden Markov models, the method is trained with discriminative learning techniques related to support vector machines, and accurately identifies even very short polymorphic tracts (<10 bp). We applied this method to resequencing array data previously generated for the euchromatic genomes of 20 strains (accessions) of the best-characterized plant, Arabidopsis thaliana. Nonredundantly, 27% of the genome was included within the boundaries of PRs predicted at high specificity (approximately 97%). The resulting data set provides a fine-scale view of polymorphic sequences in A. thaliana; patterns of polymorphism not apparent in SNP data were readily detected, especially for noncoding regions. Our predictions provide a valuable resource for evolutionary genetic and functional studies in A. thaliana, and our method is applicable to similar data sets in other species. More broadly, our computational approach can be applied to other segmentation tasks related to the analysis of genomic variation.
Authors Georg Zeller, Richard M Clark, Korbinian Schneeberger, Anja Bohlen, Detlef Weigel, Gunnar Rätsch
Submitted Genome research
Abstract The major breakthrough at the turn of the millennium was the completion of genome sequences for individuals from many species, including human, worm and rice. More recently, it has also been important to describe sequence variation within one species, providing the first step towards the linkage of genetic variation to traits. Today, rice is the most important source for human caloric intake, making up 20% of the calorie supply and feeding millions of people daily. A more detailed understanding of the molecular makeup of phenotypic rice varieties will therefore be essential for future improvement in rice cultivation and breeding. In order to reveal patterns of sequence variation in Oryza sativa (rice), the non-repetitive portion of the genomes of 20 diverse rice cultivars was resequenced, in collaboration with Perlegen Sciences, Inc., using a high-density oligonucleotide microarray technology.
Authors Regina Bohnert, Georg Zeller, Richard M Clark, Kevin L Childs, Victor J Ulat, Renee Stokowski, Dennis G Ballinger, Kelly A Frazer, David R Cox, Richard M Bruskiewich, C Robin Buell, Jan E Leach, Hei Leung, Kenneth L McNally, Detlef Weigel, Gunnar Rätsch
Submitted BMC Bioinformatics
2007
Authors Soren Sonnenburg, Mikio L Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou, Geoffrey Holmes, Yann LeCun, Klaus Robert Müller, Fernando Pereira, Carl E Rasmussen, Gunnar Rätsch
Authors Gunnar Rätsch, Soren Sonnenburg
Authors Soren Sonnenburg, Gunnar Rätsch, K Rieck
Abstract The cDNA array technology is a powerful tool to analyze a high number of genes in parallel. We investigated whether large-scale gene expression analysis allows clustering and identification of cellular phenotypes of chondrocytes in different in vivo and in vitro conditions. In 100% of cases, clustering analysis distinguished between in vivo and in vitro samples, suggesting fundamental differences in chondrocytes in situ and in vitro regardless of the culture conditions or disease status. It also allowed us to differentiate between healthy and osteoarthritic cartilage. The clustering also revealed the relative importance of the investigated culturing conditions (stimulation agent, stimulation time, bead/monolayer). We augmented the cluster analysis with a statistical search for genes showing differential expression. The identified genes provided hints to the molecular basis of the differences between the sample classes. Our approach shows the power of modern bioinformatic algorithms for understanding and classifying chondrocytic phenotypes in vivo and in vitro. Although it does not generate new experimental data per se, it provides valuable information regarding the biology of chondrocytes and may provide tools for diagnosing and staging the osteoarthritic disease process.
Authors Alexander Zien, PM Gebhard, K Fundel, T Aigner
Submitted Clin Orthop Relat Res
Authors Alexander Zien, U Brefeld, T Scheffer
Authors Alexander Zien, Cheng Soon Ong
Abstract The support vector machine (SVM) has been spotlighted in the machine learning community because of its theoretical soundness and practical performance. When applied to a large data set, however, it requires a large memory and a long time for training. To cope with the practical difficulty, we propose a pattern selection algorithm based on neighborhood properties. The idea is to select only the patterns that are likely to be located near the decision boundary. Those patterns are expected to be more informative than the randomly selected patterns. The experimental results provide promising evidence that it is possible to successfully employ the proposed algorithm ahead of SVM training.
Authors Hyunjin Shin, S Cho
Submitted Neural Computation
Authors E Georgii, S Dietmann, T Uno, P Pagel, Koji Tsuda
Abstract The genomes of individuals from the same species vary in sequence as a result of different evolutionary processes. To examine the patterns of, and the forces shaping, sequence variation in Arabidopsis thaliana, we performed high-density array resequencing of 20 diverse strains (accessions). More than 1 million nonredundant single-nucleotide polymorphisms (SNPs) were identified at moderate false discovery rates (FDRs), and approximately 4% of the genome was identified as being highly dissimilar or deleted relative to the reference genome sequence. Patterns of polymorphism are highly nonrandom among gene families, with genes mediating interaction with the biotic environment having exceptional polymorphism levels. At the chromosomal scale, regional variation in polymorphism was readily apparent. A scan for recent selective sweeps revealed several candidate regions, including a notable example in which almost all variation was removed in a 500-kilobase window. Analyzing the polymorphisms we describe in larger sets of accessions will enable a detailed understanding of forces shaping population-wide sequence variation in A. thaliana.
Authors Richard M Clark, Gabriele Schweikert, Christopher Toomajian, Stephan Ossowski, Georg Zeller, Paul Shinn, Norman Warthmann, Tina T Hu, Glenn Fu, David A Hinds, Huaming Chen, Kelly A Frazer, Daniel H Huson, Bernhard Schölkopf, Magnus Nordborg, Gunnar Rätsch, Joseph R Ecker, Detlef Weigel
Submitted Science (New York, N.Y.)
Abstract Despite many years of research on how to properly align sequences in the presence of sequencing errors, alternative splicing and micro-exons, the correct alignment of mRNA sequences to genomic DNA is still a challenging task.
Authors Uta Schulze, Bettina Hepp, Cheng Soon Ong, Gunnar Rätsch
Submitted Bioinformatics (Oxford, England)
Abstract For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%-13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.
Authors Gunnar Rätsch, Soren Sonnenburg, Jagan Srinivasan, Hanh Witte, Klaus Robert Müller, Ralf J Sommer, Bernhard Schölkopf
Submitted PLoS computational biology
Abstract Since prilocaine is being increasingly used for day case surgery as a short acting local anaesthetic for spinal anaesthesia and because of its low risk for transient neurological symptoms, we compared it to bupivacaine.
Authors Gunnar Rätsch, H Niebergall, L Hauenstein, A Reber
Submitted Der Anaesthesist
Abstract For splice site recognition, one has to solve two classification problems: discriminating true from decoy splice sites for both acceptor and donor sites. Gene finding systems typically rely on Markov Chains to solve these tasks.
Authors Soren Sonnenburg, Gabriele Schweikert, Petra Philips, Jonas Behr, Gunnar Rätsch
Submitted BMC bioinformatics
2006
Authors Gunnar Rätsch, Bettina Hepp, Uta Schulze, Cheng Soon Ong
Authors Hyunjin Shin, N J Hill, Gunnar Rätsch
Authors Manfred K Warmuth, Jun Liao, Gunnar Rätsch
Authors Soren Sonnenburg, Gunnar Rätsch, Bernhard Schölkopf
Submitted Journal of Machine Learning Research
Authors Alexander Zien, Cheng Soon Ong, Gunnar Rätsch
Authors O Chapelle, M Chi, Alexander Zien
Abstract Despite many research efforts in recent decades, the major pathogenetic mechanisms of osteoarthritis (OA), including gene alterations occurring during OA cartilage degeneration, are poorly understood, and there is no disease-modifying treatment approach. The present study was therefore initiated in order to identify differentially expressed disease-related genes and potential therapeutic targets.
Authors T Aigner, K Fundel, J Saas, PM Gebhard, J Haag, T Weiss, Alexander Zien, F Obermayr, R Zimmer, E Bartnik
Submitted Arthritis Rheum
Abstract We develop new methods for finding transcription start sites (TSS) of RNA Polymerase II binding genes in genomic DNA sequences. Employing Support Vector Machines with advanced sequence kernels, we achieve drastically higher prediction accuracies than state-of-the-art methods.
Authors Soren Sonnenburg, Alexander Zien, Gunnar Rätsch
Submitted Bioinformatics (Oxford, England)
Abstract Support Vector Machines (SVMs)--using a variety of string kernels--have been successfully applied to biological sequence classification problems. While SVMs achieve high classification accuracy they lack interpretability. In many applications, it does not suffice that an algorithm just detects a biological signal in the sequence, but it should also provide means to interpret its solution in order to gain biological insight.
Authors Gunnar Rätsch, Soren Sonnenburg, Christin Schafer
Submitted BMC bioinformatics
2005
Authors Soren Sonnenburg, Gunnar Rätsch, Christin Schafer
Authors Soren Sonnenburg, Gunnar Rätsch, Bernhard Schölkopf
Authors Koji Tsuda, Gunnar Rätsch, Manfred K Warmuth
Abstract We tackle the problem of finding regularities in microarray data. Various data mining tools, such as clustering, classification, Bayesian networks and association rules, have been applied so far to gain insight into gene-expression data. Association rule mining techniques used so far work on discretizations of the data and cannot account for cumulative effects. In this paper, we investigate the use of quantitative association rules that can operate directly on numeric data and represent cumulative effects of variables. Technically speaking, this type of quantitative association rules based on half-spaces can find non-axis-parallel regularities.
Authors E Georgii, L Richter, U Rückert, S Kramer
Submitted Bioinformatics
Abstract Computational approaches to protein function prediction infer protein function by finding proteins with similar sequence, structure, surface clefts, chemical properties, amino acid motifs, interaction partners or phylogenetic profiles. We present a new approach that combines sequential, structural and chemical information into one graph model of proteins. We predict functional class membership of enzymes and non-enzymes using graph kernels and support vector machine classification on these protein graphs.
Authors Karsten Borgwardt, Cheng Soon Ong, S Schönauer, SVN Vishwanathan, A J Smola, HP Kriegel
Submitted Bioinformatics
Abstract One way of image denoising is to project a noisy image to the subspace of admissible images derived, for instance, by PCA. However, a major drawback of this method is that all pixels are updated by the projection, even when only a few pixels are corrupted by noise or occlusion. We propose a new method to identify the noisy pixels by l1-norm penalization and to update the identified pixels only. The identification and updating of noisy pixels are formulated as one linear program which can be efficiently solved. In particular, one can apply the ν-trick to directly specify the fraction of pixels to be reconstructed. Moreover, we extend the linear program to be able to exploit prior knowledge that occlusions often appear in contiguous blocks (e.g., sunglasses on faces). The basic idea is to penalize boundary points and interior points of the occluded area differently. We are also able to show the ν-property for this extended LP leading to a method which is easy to use. Experimental results demonstrate the power of our approach.
Authors Koji Tsuda, Gunnar Rätsch
Submitted IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Abstract Eukaryotic pre-mRNAs are spliced to form mature mRNA. Pre-mRNA alternative splicing greatly increases the complexity of gene expression. Estimates show that more than half of the human genes and at least one-third of the genes of less complex organisms, such as nematodes or flies, are alternatively spliced. In this work, we consider one major form of alternative splicing, namely the exclusion of exons from the transcript. It has been shown that alternatively spliced exons have certain properties that distinguish them from constitutively spliced exons. Although most recent computational studies on alternative splicing apply only to exons which are conserved among two species, our method only uses information that is available to the splicing machinery, i.e. the DNA sequence itself. We employ advanced machine learning techniques in order to answer the following two questions: (1) Is a certain exon alternatively spliced? (2) How can we identify yet unidentified exons within known introns?
Authors Gunnar Rätsch, Soren Sonnenburg, Bernhard Schölkopf
Submitted Bioinformatics (Oxford, England)
Abstract In this article we report about a successful application of modern machine learning technology, namely Support Vector Machines, to the problem of assessing the 'drug-likeness' of a chemical from a given set of descriptors of the substance. We were able to drastically improve the recent result by Byvatov et al. (2003) on this task and achieved an error rate of about 7% on unseen compounds using Support Vector Machines. We see a very high potential of such machine learning techniques for a variety of computational chemistry problems that occur in the drug discovery and drug design process.
Authors Klaus Robert Müller, Gunnar Rätsch, Soren Sonnenburg, Sebastian Mika, Michael Grimm, Nikolaus Heinrich
Submitted Journal of chemical information and modeling
2004 and earlier
Authors Gunnar Rätsch, Soren Sonnenburg
Authors S Knabe, Sebastian Mika, Klaus Robert Müller, Gunnar Rätsch, W Schruff
Submitted Die Wirtschaftsprüfung
Authors Gunnar Rätsch
Authors Gunnar Rätsch
Authors Gunnar Rätsch, A J Smola, Sebastian Mika
Authors R Meir, Gunnar Rätsch
Authors Sebastian Mika, Gunnar Rätsch, J Weston, Bernhard Schölkopf, A J Smola, Klaus Robert Müller
Submitted IEEE Transactions on Pattern Analysis and Machine Intelligence
Abstract We investigate the following data mining problem from computer-aided drug design: From a large collection of compounds, find those that bind to a target molecule in as few iterations of biochemical testing as possible. In each iteration a comparatively small batch of compounds is screened for binding activity toward this target. We employed the so-called "active learning paradigm" from Machine Learning for selecting the successive batches. Our main selection strategy is based on the maximum margin hyperplane-generated by "Support Vector Machines". This hyperplane separates the current set of active from the inactive compounds and has the largest possible distance from any labeled compound. We perform a thorough comparative study of various other selection strategies on data sets provided by DuPont Pharmaceuticals and show that the strategies based on the maximum margin hyperplane clearly outperform the simpler ones.
Authors Manfred K Warmuth, Jun Liao, Gunnar Rätsch, Michael Mathieson, Santosh Putta, Christian Lemmen
Submitted Journal of chemical information and computer sciences
Authors Gunnar Rätsch, Sebastian Mika, Manfred K Warmuth
Authors Gunnar Rätsch, Manfred K Warmuth
Authors Soren Sonnenburg, Gunnar Rätsch, A Jagoda, Klaus Robert Müller
Authors Koji Tsuda, Motoaki Kawanabe, Gunnar Rätsch, Soren Sonnenburg, Klaus Robert Müller
Authors Gunnar Rätsch, Manfred K Warmuth
Authors Manfred K Warmuth, Gunnar Rätsch, Michael Mathieson, Jun Liao, Christian Lemmen
Authors Gunnar Rätsch, Sebastian Mika, Bernhard Schölkopf, Klaus Robert Müller
Submitted IEEE Transactions on Pattern Analysis and Machine Intelligence
Abstract Recently, Jaakkola and Haussler (1999) proposed a method for constructing kernel functions from probabilistic models. Their so-called Fisher kernel has been combined with discriminative classifiers such as support vector machines and applied successfully in, for example, DNA and protein analysis. Whereas the Fisher kernel is calculated from the marginal log-likelihood, we propose the TOP kernel, derived from tangent vectors of posterior log-odds. Furthermore, we develop a theoretical framework on feature extractors from probabilistic models and use it for analyzing the TOP kernel. In experiments, our new discriminative TOP kernel compares favorably to the Fisher kernel.
Authors Koji Tsuda, Motoaki Kawanabe, Gunnar Rätsch, Soren Sonnenburg, Klaus Robert Müller
Submitted Neural computation
Authors Sebastian Mika, Gunnar Rätsch, Klaus Robert Müller
Authors Koji Tsuda, Gunnar Rätsch, Sebastian Mika, Klaus Robert Müller
Authors T Onoda, Gunnar Rätsch, Klaus Robert Müller
Submitted Journal of the Japanese Society for AI
Authors Gunnar Rätsch
Abstract This paper provides an introduction to support vector machines, kernel Fisher discriminant analysis, and kernel principal component analysis, as examples for successful kernel-based learning methods. We first give a short background about Vapnik-Chervonenkis theory and kernel feature spaces and then proceed to kernel based learning in supervised and unsupervised scenarios including practical and algorithmic considerations. We illustrate the usefulness of kernel algorithms by discussing applications such as optical character recognition and DNA analysis.
Authors Klaus Robert Müller, Sebastian Mika, Gunnar Rätsch, Koji Tsuda, Bernhard Schölkopf
Submitted IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council
Abstract Recently, ensemble methods like AdaBoost have been applied successfully in many problems, while seemingly defying the problems of overfitting. AdaBoost rarely overfits in the low noise regime, however, we show that it clearly does so for higher noise levels. Central to the understanding of this fact is the margin distribution. AdaBoost can be viewed as a constrained gradient descent in an error function with respect to the margin. We find that AdaBoost asymptotically achieves a hard margin distribution, i.e. the algorithm concentrates its resources on a few hard-to-learn patterns that are interestingly very similar to Support Vectors. A hard margin is clearly a sub-optimal strategy in the noisy case, and regularization, in our case a mistrust in the data, must be introduced in the algorithm to alleviate the distortions that single difficult patterns (e.g. outliers) can cause to the margin distribution. We propose several regularization methods and generalizations of the original AdaBoost algorithm to achieve a soft margin. In particular we suggest (1) regularized AdaBoostREG where the gradient descent is done directly with respect to the soft margin and (2) regularized linear and quadratic programming (LP/QP-) AdaBoost, where the soft margin is attained by introducing slack variables. Extensive simulations demonstrate that the proposed regularized AdaBoost-type algorithms are useful and yield competitive results for noisy data.
Authors Gunnar Rätsch, T Onoda, Klaus Robert Müller
Submitted Machine Learning
Authors J Kohlmorgen, S Lemm, Gunnar Rätsch, Klaus Robert Müller
Authors Sebastian Mika, Gunnar Rätsch, J Weston, Bernhard Schölkopf, A J Smola, Klaus Robert Müller
Authors T Onoda, Gunnar Rätsch, Klaus Robert Müller
Authors Gunnar Rätsch, Bernhard Schölkopf, A J Smola, Sebastian Mika, T Onoda, Klaus Robert Müller
Authors Gunnar Rätsch, Bernhard Schölkopf, A J Smola, Klaus Robert Müller, T Onoda, Sebastian Mika
Authors Gunnar Rätsch, Manfred K Warmuth, Sebastian Mika, T Onoda, S Lemm, Klaus Robert Müller
Authors Gunnar Rätsch, Bernhard Schölkopf, A J Smola, Sebastian Mika, T Onoda, Klaus Robert Müller
Authors T Onoda, Gunnar Rätsch
Authors Gunnar Rätsch, Bernhard Schölkopf, Sebastian Mika, Klaus Robert Müller
Abstract In order to extract protein sequences from nucleotide sequences, it is an important step to recognize points at which regions start that code for proteins. These points are called translation initiation sites (TIS).
Authors Alexander Zien, Gunnar Rätsch, Sebastian Mika, Bernhard Schölkopf, T Lengauer, Klaus Robert Müller
Submitted Bioinformatics (Oxford, England)
Authors Sebastian Mika, Bernhard Schölkopf, A J Smola, Klaus Robert Müller, M Scholz, Gunnar Rätsch
Authors Gunnar Rätsch, T Onoda, Klaus Robert Müller
Authors A J Smola, Bernhard Schölkopf, Gunnar Rätsch
Abstract In order to extract protein sequences from nucleotide sequences, it is an important step to recognize points from which regions encoding proteins start, the so-called translation initiation sites (TIS). This can be modeled as a classification problem. We demonstrate the power of support vector machines (SVMs) for this task, and show how to successfully incorporate biological prior knowledge by engineering an appropriate kernel function.
Authors Alexander Zien, Gunnar Rätsch, Sebastian Mika, Bernhard Schölkopf, Christian Lemmen, A J Smola, T Lengauer, Klaus Robert Müller
Authors Klaus Robert Müller, A J Smola, Gunnar Rätsch, Bernhard Schölkopf, J Kohlmorgen, V Vapnik
Authors Sebastian Mika, Gunnar Rätsch, Bernhard Schölkopf, Klaus Robert Müller
Submitted Neural networks for signal processing IX
Authors W Schubert, A Koutzevlov, E Horn, Gunnar Rätsch, A Tschapek
Abstract This paper collects some ideas targeted at advancing our understanding of the feature spaces associated with support vector (SV) kernel functions. We first discuss the geometry of feature space. In particular, we review what is known about the shape of the image of input space under the feature space map, and how this influences the capacity of SV methods. Following this, we describe how the metric governing the intrinsic geometry of the mapped surface can be computed in terms of the kernel, using the example of the class of inhomogeneous polynomial kernels, which are often used in SV pattern recognition. We then discuss the connection between feature space and input space by dealing with the question of how one can, given some vector in feature space, find a preimage (exact or approximate) in input space. We describe algorithms to tackle this issue, and show their utility in two applications of kernel methods. First, we use it to reduce the computational complexity of SV decision functions; second, we combine it with the Kernel PCA algorithm, thereby constructing a nonlinear statistical denoising technique which is shown to perform well on real-world data.
Authors Bernhard Schölkopf, Sebastian Mika, C C Burges, P Knirsch, Klaus Robert Müller, Gunnar Rätsch, A J Smola
Submitted IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council
Authors T Onoda, Gunnar Rätsch, Klaus Robert Müller
Authors Gunnar Rätsch, T Onoda, Klaus Robert Müller
Authors Bernhard Schölkopf, Sebastian Mika, A J Smola, Gunnar Rätsch, Klaus Robert Müller
Authors Gunnar Rätsch
Authors Klaus Robert Müller, A J Smola, Gunnar Rätsch, Bernhard Schölkopf, J Kohlmorgen, V Vapnik