Alexander Immer, MSc

"You can’t connect the dots looking forward; you can only connect them looking backwards." - Steve Jobs

PhD Student


I am interested in probabilistic inference for flexible models like neural networks, and in how it can help improve biomedical applications.

I received my BSc in IT-Systems Engineering from Hasso Plattner Institute in Potsdam, where I first came into contact with data science. During my MSc studies at EPFL, I became interested in approximate Bayesian inference, which I pursued further during my time at RIKEN AIP in Tokyo. Since July 2020, I have been a PhD student within the Max Planck ETH Center for Learning Systems, supervised by Gunnar Rätsch and Bernhard Schölkopf. My goal is to design machine learning algorithms that can incorporate prior knowledge, quantify uncertainty, and automatically select the most likely model given the data. Beyond that, such algorithms need to be practical and interpretable to be relevant to biomedical applications.

Please consult my website for details on current and previous projects.

Abstract In recent years, the transformer has established itself as a workhorse in many applications, ranging from natural language processing to reinforcement learning. Similarly, Bayesian deep learning has become the gold standard for uncertainty estimation in safety-critical applications, where robustness and calibration are crucial. Surprisingly, there have been no successful attempts to improve transformer models in terms of predictive uncertainty using Bayesian inference. In this work, we study this curiously underpopulated area of Bayesian transformers. We find that weight-space inference in transformers does not work well, regardless of the approximate posterior. We also find that the prior is at least partially at fault, but that it is very hard to find well-specified weight priors for these models. We hypothesize that these problems stem from the complexity of obtaining a meaningful mapping from weight-space to function-space distributions in the transformer. Therefore, moving closer to function space, we propose a novel method based on the implicit reparameterization of the Dirichlet distribution to apply variational inference directly to the attention weights. We find that this proposed method performs competitively with our baselines.

Authors Tristan Cinquin, Alexander Immer, Max Horn, Vincent Fortuin

Submitted AABI 2022
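The core idea of the paper, treating attention weights as random variables on the simplex rather than a deterministic softmax output, can be illustrated with a minimal NumPy sketch. This is a toy stand-in, not the paper's method: it only does Monte Carlo sampling of Dirichlet-distributed attention weights (the `stochastic_attention` function and its parameters are hypothetical), whereas the paper fits a variational posterior using implicit reparameterization gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_attention(Q, K, V, n_samples=32, temperature=1.0):
    """Toy stochastic attention: draw each query's attention weights from a
    Dirichlet whose concentration is derived from the attention logits,
    instead of taking a deterministic softmax, then average predictions."""
    logits = Q @ K.T / np.sqrt(Q.shape[-1])       # (n_q, n_k) attention scores
    conc = np.exp(logits / temperature) + 1e-3    # positive Dirichlet concentrations
    outs = []
    for _ in range(n_samples):
        # one Dirichlet sample per query row -> a stochastic attention matrix
        A = np.stack([rng.dirichlet(c) for c in conc])
        outs.append(A @ V)
    return np.mean(outs, axis=0)                  # predictive mean over samples

# toy example: 2 queries attending over 3 keys/values of dimension 4
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = stochastic_attention(Q, K, V)
print(out.shape)
```

Each sampled row of `A` sums to one, so every sample is a valid attention distribution; averaging over samples yields an uncertainty-aware prediction.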


Abstract Pre-trained contextual representations have led to dramatic performance improvements on a range of downstream tasks. This has motivated researchers to quantify and understand the linguistic information encoded in them. In general, this is done by probing, which consists of training a supervised model to predict a linguistic property from said representations. Unfortunately, this definition of probing has been subject to extensive criticism, and can lead to paradoxical or counter-intuitive results. In this work, we present a novel framework for probing where the goal is to evaluate the inductive bias of representations for a particular task, and provide a practical avenue to do this using Bayesian inference. We apply our framework to a series of token-, arc-, and sentence-level tasks. Our results suggest that our framework solves problems of previous approaches and that fastText can offer a better inductive bias than BERT in certain situations.

Authors Alexander Immer, Lucas Torroba Hennigen, Vincent Fortuin, Ryan Cotterell

Submitted arXiv Preprints
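The notion of evaluating a representation's inductive bias via Bayesian inference can be sketched with a toy example (not the paper's estimator): under exact Bayesian linear regression, the representation whose features make the probed property more probable a priori attains a higher log marginal likelihood (evidence). The function name and toy data below are illustrative assumptions.

```python
import numpy as np

def log_evidence(X, y, alpha=1.0, sigma2=0.1):
    """Exact log marginal likelihood of Bayesian linear regression:
    y ~ N(X w, sigma2 I) with prior w ~ N(0, alpha^{-1} I)."""
    n = len(y)
    C = sigma2 * np.eye(n) + (X @ X.T) / alpha    # marginal covariance of y
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

rng = np.random.default_rng(0)
n, d = 50, 5
X_good = rng.normal(size=(n, d))                  # features that generate y
w_true = rng.normal(size=d)
y = X_good @ w_true + 0.1 * rng.normal(size=n)
X_bad = rng.normal(size=(n, d))                   # unrelated features

# the representation with the better inductive bias has higher evidence
assert log_evidence(X_good, y) > log_evidence(X_bad, y)
```

Because the evidence integrates over all probe weights, this comparison needs no held-out validation set, which is the intuition behind using Bayesian inference for probing.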


Abstract Marginal-likelihood-based model selection, though promising, is rarely used in deep learning due to estimation difficulties. Instead, most approaches rely on validation data, which may not be readily available. In this work, we present a scalable marginal-likelihood estimation method to select both hyperparameters and network architectures, based on the training data alone. Some hyperparameters can be estimated online during training, simplifying the procedure. Our marginal-likelihood estimate is based on Laplace's method and Gauss-Newton approximations to the Hessian, and it outperforms cross-validation and manual tuning on standard regression and image classification datasets, especially in terms of calibration and out-of-distribution detection. Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable (e.g., in nonstationary settings).

Authors Alexander Immer, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, Mohammad Emtiyaz Khan

Submitted ICML 2021
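The kind of estimate described in the abstract can be sketched on a toy non-conjugate model (this is an illustrative simplification with hypothetical names, not the paper's code): the log marginal likelihood of logistic regression is approximated with Laplace's method, using the generalized Gauss-Newton matrix (here equal to the Fisher) in place of the exact Hessian, and then compared across prior precisions using training data alone.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_log_marglik(X, y, alpha, steps=500, lr=0.1):
    """Laplace approximation to the log marginal likelihood of logistic
    regression with prior w ~ N(0, alpha^{-1} I), using the generalized
    Gauss-Newton (GGN) matrix in place of the exact Hessian."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):                        # gradient ascent to the MAP
        p = sigmoid(X @ w)
        grad = X.T @ (y - p) - alpha * w
        w += lr * grad / n
    p = sigmoid(X @ w)
    # log joint at the MAP: log-likelihood + Gaussian log prior
    log_joint = (y * np.log(p) + (1 - y) * np.log(1 - p)).sum() \
        - 0.5 * alpha * w @ w + 0.5 * d * np.log(alpha / (2 * np.pi))
    # GGN of the negative log joint: Fisher of the likelihood + prior precision
    ggn = X.T @ (X * (p * (1 - p))[:, None]) + alpha * np.eye(d)
    _, logdet = np.linalg.slogdet(ggn)
    return log_joint + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) + 0.3 * rng.normal(size=100) > 0).astype(float)

# select the prior precision by approximate marginal likelihood, no validation set
scores = {a: laplace_log_marglik(X, y, a) for a in (0.01, 1.0, 100.0)}
best = max(scores, key=scores.get)
print(best)
```

In the paper this idea is scaled to neural networks, where further Gauss-Newton structure (e.g., Kronecker factorization) keeps the log-determinant tractable; the sketch above only shows the scalar-hyperparameter case.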