Vincent Fortuin, MSc CBB ETH UZH
"The scientist is not a person who gives the right answers, he's one who asks the right questions." - Claude Lévi-Strauss
- fortuin@inf.ethz.ch
- +41 44 633 66 87
Department of Computer Science
Biomedical Informatics Group
CAB F 39
I am interested in the interface between deep learning and probabilistic modeling. I am particularly keen to develop models that are more interpretable and data-efficient, since these are two major requirements in the field of health care.
I did my undergraduate studies in Molecular Life Sciences at the University of Hamburg, where I worked on phylogeny inference for quickly mutating virus strains with Andrew Torda. I then went to ETH Zürich to study Computational Biology and Bioinformatics, in a joint program with the University of Zürich, with a focus on systems biology and machine learning. My master's thesis was about the application of deep learning to gene regulatory network inference, under the supervision of Manfred Claassen. During my studies I also spent some time in Jacob Hanna's group at the Weizmann Institute of Science, working on multi-omics data analysis in stem cell research. Before joining the Biomedical Informatics group as a PhD student, I worked on deep learning applications in natural language understanding at Disney Research.
Abstract Generating visualizations and interpretations from high-dimensional data is a common problem in many fields. Two key approaches for tackling this problem are clustering and representation learning. On the one hand, there are highly performant deep clustering models; on the other, there are interpretable representation learning techniques, often relying on latent topological structures such as self-organizing maps. However, current methods do not yet successfully combine these two approaches. We present a new deep architecture for probabilistic clustering, VarPSOM, and its extension to time series data, VarTPSOM. We show that they achieve superior clustering performance compared to current deep clustering methods on static MNIST/Fashion-MNIST data as well as medical time series, while inducing an interpretable representation. Moreover, on the medical time series, VarTPSOM successfully predicts future trajectories in the original data space.
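The architecture itself is defined in the paper; purely as an illustration of the two ingredients such a model combines (probabilistic cluster assignments and an interpretable SOM-like latent topology), here is a minimal NumPy sketch. The Student's-t soft assignment and the grid-smoothness penalty are generic building blocks; the exact loss form, function names, and grid size below are ours, not the paper's.

```python
import numpy as np

def soft_assignments(z, centroids, alpha=1.0):
    # Student's-t soft assignment: q[i, j] is the probability that
    # embedding i belongs to centroid j (heavier tails than a Gaussian).
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def som_smoothness_loss(q, grid_shape):
    # Penalize assignment differences between neighboring nodes of a
    # 2-D grid, encouraging a topologically ordered (SOM-like) map.
    rows, cols = grid_shape
    Q = q.reshape(-1, rows, cols)
    return (((Q[:, 1:, :] - Q[:, :-1, :]) ** 2).mean()
            + ((Q[:, :, 1:] - Q[:, :, :-1]) ** 2).mean())

rng = np.random.default_rng(0)
z = rng.normal(size=(32, 2))         # latent embeddings from some encoder
centroids = rng.normal(size=(9, 2))  # centroids arranged on a 3x3 grid
q = soft_assignments(z, centroids)
loss = som_smoothness_loss(q, (3, 3))
```

In the full model, both terms would be minimized jointly with a variational reconstruction objective.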
Authors Laura Manduchi, Matthias Hüser, Gunnar Rätsch, Vincent Fortuin
Submitted arXiv Preprints
Abstract Metagenomic studies increasingly use sequencing technologies to analyze DNA fragments found in environmental samples. Such studies can provide useful insights into the interactions between hosts and microbes, infectious disease proliferation, and novel species discovery. One important step in this analysis is the taxonomic classification of those DNA fragments. Of particular interest is the determination of the distribution of the taxa of microbes in metagenomic samples. Recent attempts using deep learning focus on architectures that classify single DNA reads independently from each other. In this work, we attempt to solve the task of directly predicting the distribution over the taxa of whole metagenomic read sets. We formulate this task as a Multiple Instance Learning (MIL) problem. We extend architectures used in single-read taxonomic classification with two different types of permutation-invariant MIL pooling layers: a) deepsets and b) attention-based pooling. We illustrate that our architecture can exploit the co-occurrence of species in metagenomic read sets and outperforms the single-read architectures in predicting the distribution over the taxa at higher taxonomic ranks.
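Both pooling layers named above are permutation-invariant by construction, which is what makes them suitable for unordered read sets. A minimal NumPy sketch (parameters are random stand-ins, names ours) of the two variants, checking the invariance:

```python
import numpy as np

def deepsets_pool(h):
    # Deep-sets pooling: sum the per-read embeddings; the sum is
    # invariant to the order of reads in the set.
    return h.sum(axis=0)

def attention_pool(h, w, V):
    # Attention-based pooling: a learned softmax-weighted average.
    # Each weight depends only on its own read, so permuting the reads
    # permutes the weights identically and the result is unchanged.
    scores = np.tanh(h @ V) @ w          # one scalar score per read
    a = np.exp(scores - scores.max())
    a /= a.sum()
    return a @ h

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))   # 5 reads, 4-dimensional embeddings
V = rng.normal(size=(4, 3))   # attention parameters (random stand-ins)
w = rng.normal(size=3)
perm = rng.permutation(5)     # reordering the reads leaves both outputs fixed
```

In the full architecture, a per-read encoder would produce `h` and a downstream head would map the pooled vector to a distribution over taxa.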
Authors Andreas Georgiou, Vincent Fortuin, Harun Mustafa, Gunnar Rätsch
Submitted arXiv Preprints
Abstract Multivariate time series with missing values are common in many areas, for instance in healthcare and finance. To face this problem, modern data imputation approaches should (a) be tailored to sequential data, (b) deal with high dimensional and complex data distributions, and (c) be based on the probabilistic modeling paradigm for interpretability and confidence assessment. However, many current approaches fall short in at least one of these aspects. Drawing on advances in deep learning and scalable probabilistic modeling, we propose a new deep sequential variational autoencoder approach for dimensionality reduction and data imputation. Temporal dependencies are modeled with a Gaussian process prior and a Cauchy kernel to reflect multi-scale dynamics in the latent space. We furthermore use a structured variational inference distribution that improves the scalability of the approach. We demonstrate that our model exhibits superior imputation performance on benchmark tasks and challenging real-world medical data.
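The Cauchy kernel mentioned above is a rational-quadratic-type kernel with heavier tails than the squared exponential, so correlations decay slowly across time scales. As a small illustration (parameter names ours), one common form of the kernel and a draw from the resulting GP prior over a latent trajectory:

```python
import numpy as np

def cauchy_kernel(t1, t2, sigma2=1.0, length=2.0):
    # Cauchy kernel k(t, t') = sigma^2 / (1 + (t - t')^2 / l^2): its
    # heavy tails retain long-range temporal correlations, reflecting
    # multi-scale dynamics in the latent space.
    d2 = (t1[:, None] - t2[None, :]) ** 2
    return sigma2 / (1.0 + d2 / length ** 2)

t = np.arange(10, dtype=float)
K = cauchy_kernel(t, t)
# One latent trajectory drawn from the GP prior GP(0, K):
rng = np.random.default_rng(0)
sample = rng.multivariate_normal(np.zeros(10), K + 1e-8 * np.eye(10))
```

In the model, such a prior is placed over each latent dimension across time, and imputation proceeds by decoding the inferred latent trajectory.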
Authors Vincent Fortuin, Gunnar Rätsch, Stephan Mandt
Submitted arXiv Preprints
Abstract Fitting machine learning models in the low-data limit is challenging. The main difficulty is obtaining suitable prior knowledge and encoding it into the model, for instance in the form of a Gaussian process prior. Recent advances in meta-learning offer powerful methods for extracting such prior knowledge from data acquired in related tasks. Approaches to meta-learning in Gaussian process models have mostly focused on learning the kernel function of the prior, but not on learning its mean function. In this work, we propose to parameterize the mean function of a Gaussian process with a deep neural network and train it with a meta-learning procedure. We present analytical and empirical evidence that mean function learning can be superior to kernel learning alone, particularly if data is scarce.
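The role of a learned mean function is easy to see from the standard GP posterior with a non-zero prior mean. In the sketch below (NumPy, names ours), `np.sin` stands in for the meta-learned deep mean network of the paper; when the mean function already explains the training data, the posterior mean reduces to it exactly, even far from the observations:

```python
import numpy as np

def rbf(a, b, ls=1.0):
    # Squared-exponential kernel on scalar inputs.
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * ls ** 2))

def gp_posterior_mean(x_tr, y_tr, x_te, mean_fn, noise=1e-2):
    # GP posterior mean with a non-zero prior mean m(x):
    #   mu(x*) = m(x*) + K_*^T (K + noise I)^{-1} (y - m(X))
    # The GP only has to model the residuals y - m(X), which is why a
    # good learned mean helps most when data is scarce.
    K = rbf(x_tr, x_tr) + noise * np.eye(len(x_tr))
    resid = y_tr - mean_fn(x_tr)
    return mean_fn(x_te) + rbf(x_tr, x_te).T @ np.linalg.solve(K, resid)

mean_fn = np.sin                 # stand-in for a meta-learned mean network
x_tr = np.array([0.0, 1.0, 2.0])
y_tr = np.sin(x_tr)              # data fully explained by the prior mean
mu = gp_posterior_mean(x_tr, y_tr, np.array([0.5, 3.0]), mean_fn)
# Zero residuals: the posterior mean falls back to the mean function.
```

With a zero mean, the same posterior would revert to zero away from the three training points; the informative mean carries the prediction instead.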
Authors Vincent Fortuin, Gunnar Rätsch
Submitted arXiv Preprints
Abstract High-dimensional time series are common in many domains. Since human cognition is not optimized to work well in high-dimensional spaces, these areas could benefit from interpretable low-dimensional representations. However, most representation learning algorithms for time series data are difficult to interpret. This is due to non-intuitive mappings from data features to salient properties of the representation and non-smoothness over time. To address this problem, we propose a new representation learning framework building on ideas from interpretable discrete dimensionality reduction and deep generative modeling. This framework allows us to learn discrete representations of time series, which give rise to smooth and interpretable embeddings with superior clustering performance. We introduce a new way to overcome the non-differentiability in discrete representation learning and present a gradient-based version of the traditional self-organizing map algorithm that is more performant than the original. Furthermore, to allow for a probabilistic interpretation of our method, we integrate a Markov model in the representation space. This model uncovers the temporal transition structure, improves clustering performance even further and provides additional explanatory insights as well as a natural representation of uncertainty. We evaluate our model in terms of clustering performance and interpretability on static (Fashion-)MNIST data, a time series of linearly interpolated (Fashion-)MNIST images, a chaotic Lorenz attractor system with two macro states, as well as on a challenging real-world medical time series application on the eICU data set. Our learned representations compare favorably with competitor methods and facilitate downstream tasks on the real-world data.
Authors Vincent Fortuin, Matthias Hüser, Francesco Locatello, Heiko Strathmann, Gunnar Rätsch
Submitted ICLR 2019
Abstract Kernel methods on discrete domains have shown great promise for many challenging tasks, e.g., on biological sequence data as well as on molecular structures. Scalable kernel methods like support vector machines offer good predictive performances but they often do not provide uncertainty estimates. In contrast, probabilistic kernel methods like Gaussian processes offer uncertainty estimates in addition to good predictive performance but fall short in terms of scalability. We present the first sparse Gaussian process approximation framework on discrete input domains. Our framework achieves good predictive performance as well as uncertainty estimates using different discrete optimization techniques. We present competitive results comparing our framework to support vector machine and full Gaussian process baselines on synthetic data as well as on challenging real-world DNA sequence data.
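As an illustration of the general idea only (not the paper's specific approximation or optimization scheme), the sketch below combines a toy 2-mer spectrum kernel on DNA strings with a subset-of-regressors-style predictive mean, so that all kernel evaluations are routed through a small set of inducing sequences; all names and data are ours:

```python
import numpy as np
from itertools import product

KMERS = ["".join(p) for p in product("ACGT", repeat=2)]

def spectrum_features(seq):
    # 2-mer spectrum features; the inner product of two such vectors is
    # the spectrum string kernel. (str.count is non-overlapping, which
    # is close enough for this toy example.)
    return np.array([seq.count(km) for km in KMERS], dtype=float)

def sparse_gp_mean(X, y, Z, x_star, noise=0.1):
    # Subset-of-regressors-style predictive mean: every kernel
    # evaluation goes through the inducing sequences Z,
    #   mu(x*) = k_*Z (noise * K_ZZ + K_ZX K_XZ)^{-1} K_ZX y
    FX = np.stack([spectrum_features(s) for s in X])
    FZ = np.stack([spectrum_features(s) for s in Z])
    Kzz = FZ @ FZ.T + 1e-6 * np.eye(len(Z))   # jitter for stability
    Kzx = FZ @ FX.T
    A = noise * Kzz + Kzx @ Kzx.T
    return (FZ @ spectrum_features(x_star)) @ np.linalg.solve(A, Kzx @ y)

X = ["ACGTAC", "TTTTGG", "ACGACG", "GGGTTT"]
y = np.array([1.0, -1.0, 1.0, -1.0])   # toy labels: AC-rich vs T/G-rich
Z = ["ACGTAC", "TTGGTT"]               # small inducing set of sequences
mu_pos = sparse_gp_mean(X, y, Z, "ACGTAA")
mu_neg = sparse_gp_mean(X, y, Z, "TTTTTT")
```

The hard part the paper addresses, choosing good inducing points over a discrete, combinatorial input space, is precisely what the fixed `Z` above glosses over.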
Authors Vincent Fortuin, Gideon Dresdner, Heiko Strathmann, Gunnar Rätsch
Submitted arXiv Preprints
Abstract The reconstruction of gene regulatory networks from time-resolved gene expression measurements is a key challenge in systems biology with applications in health and disease. While the most popular network inference methods are based on unsupervised learning approaches, supervised learning methods have proven their potential for superior reconstruction performance. However, obtaining the appropriate volume of informative training data constitutes a key limitation for the success of such methods. Here, we introduce a supervised learning approach to detect gene-gene regulation based on exclusively synthetic training data, termed surrogate learning, and show its performance for synthetic and experimental time-series. We systematically investigate different simulation configurations of biologically representative time-series of transcripts and augmentation of the data with a measurement model. We compare the resulting synthetic datasets to experimental data, and evaluate classifiers trained on them for detection of gene-gene regulation from experimental time-series. For classifiers, we consider hybrid convolutional recurrent neural networks, random forests and logistic regression, and evaluate the reconstruction performance of different simulation settings, data pre-processing choices and classifiers. When training and test time-courses are generated from the same distribution, we find that the largest tested neural network architecture achieves the best performance of 0.448 +/- 0.047 (mean +/- std) in maximally achievable F1 score over all datasets, outperforming random forests by 32.4 % +/- 14 % (mean +/- std). Reconstruction performance is sensitive to discrepancies between synthetic training and test data, highlighting the importance of matching training and test data domains.
For an experimental gene expression dataset from E. coli, we find that training data generated with a measurement model and multi-gene perturbations, but without data standardization, is best suited for training classifiers for network reconstruction from the experimental test data. We further demonstrate superiority to multiple unsupervised, state-of-the-art methods for networks comprising 20 genes of the experimental data from E. coli (average AUPR best supervised = 0.22 vs best unsupervised = 0.07). We expect the proposed surrogate learning approach to be broadly applicable. It alleviates the requirement for large, difficult-to-attain volumes of experimental training data and instead relies on easily accessible synthetic data. Successful application to new experimental conditions and other data types is only limited by the automatable and scalable process of designing simulations which generate suitable synthetic data.
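The surrogate-learning idea itself, training on simulated time-series whose regulatory labels are known by construction and then applying the classifier to unseen series, can be sketched in a few lines. The toy simulator, the single lagged-correlation feature, and the threshold classifier below are stand-ins for the paper's biologically representative simulations and CNN/RNN or random forest models:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_pair(regulated, n=50, noise=0.3):
    # Toy simulator: if a regulates b, b follows a with a one-step lag;
    # otherwise the two expression traces are independent.
    a = rng.normal(size=n)
    if regulated:
        b = np.roll(a, 1) + noise * rng.normal(size=n)
    else:
        b = rng.normal(size=n)
    return a, b

def lag_corr(a, b):
    # Single feature: correlation between a(t) and b(t+1).
    return np.corrcoef(a[:-1], b[1:])[0, 1]

# Surrogate training set: regulatory labels come for free from the simulator.
feats, labels = [], []
for _ in range(200):
    reg = bool(rng.random() < 0.5)
    a, b = simulate_pair(reg)
    feats.append(lag_corr(a, b))
    labels.append(reg)
feats, labels = np.array(feats), np.array(labels)

# Threshold classifier fit on synthetic data, applied to fresh series:
threshold = 0.5 * (feats[labels].mean() + feats[~labels].mean())
test_pairs = [simulate_pair(True) for _ in range(20)]
acc = np.mean([lag_corr(a, b) > threshold for a, b in test_pairs])
```

The domain-matching caveat from the abstract shows up even here: if the test series came from a simulator with different dynamics than the training one, the learned threshold would no longer be well calibrated.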
Authors Stefan Ganscha, Vincent Fortuin, Max Horn, Eirini Arvaniti, Manfred Claassen
Submitted bioRxiv Preprints
Abstract We present a novel approach to modeling stories using recurrent neural networks. Different story features are extracted using natural language processing techniques and used to encode the stories as sequences. These sequences can be learned by deep neural networks, in order to predict the next story events. The predictions can be used as an inspiration for writers who experience a writer's block. We further assist writers in their creative process by generating visualizations of the character interactions in the story. We show that suggestions from our model are rated as highly as the real scenes from a set of films and that our visualizations can help people in gaining deeper story understanding.
Authors Vincent Fortuin, Romann M. Weber, Sasha Schriber, Diana Wotruba, Markus Gross
Submitted The Thirtieth AAAI Conference on Innovative Applications of Artificial Intelligence