Machine Learning Research

Machine learning is the area of computer science concerned with developing computational methods that use data to make accurate predictions about complex phenomena.

For instance, given a set of known “sites” in the genome, machine learning can be used to predict the location of other such sites. Machine learning algorithms detect and exploit statistical regularities hidden in known observations, often without imposing strong assumptions on the model of the underlying problem. Such data-driven approaches excel in situations where a detailed understanding of the underlying biological mechanisms is lacking but data is abundant, which is frequently the case, e.g., in genome biology. An overview of the group’s contributions to the development, theoretical analysis, and application of machine learning methods is given below.

 

Multi-task Learning and Domain Adaptation

Classical machine learning theory rests on the assumption that the observed data are realizations of independent and identically distributed random variables. This assumption is often violated in biological applications, for example when observations are interdependent (temporal dependencies, linkage, or population structure) or when their joint distribution is non-stationary, which is the setting addressed by domain adaptation and multi-task learning. The group has contributed substantially to the development and theoretical analysis of domain adaptation and multi-task learning algorithms, and it has pioneered their application to genomic sequence analysis.

 

Structured Output Learning

The group has a long track record of developing novel inference algorithms for predicting structured outputs such as gene structures, RNA secondary structures, and image annotations. These techniques are used in various projects related to gene structure prediction, the analysis of transcriptional activity, and the identification of genomic polymorphisms (see Computational Genome Annotation and Computational Transcriptomics). Their application has led to the most accurate genome annotation methods that currently exist.

 

Probabilistic Models

Recent efforts to model clinical data, pathology images, and complex biological datasets in an unsupervised fashion have motivated the use and development of probabilistic generative models. The main focus is on learning interpretable latent variable models and on scaling inference to clinical-scale datasets in order to integrate and explain multifactorial systems across modalities and time. The models in development and use in the group include graphical image models, non-parametric clustering models, dynamical Markov systems, and Boltzmann machines. The group is also interested in efficient sampling algorithms for evaluating partition functions in order to perform quantified model checking.

 

Large-scale Machine Learning for Sequence Analysis

The group has developed very efficient string data structures that, in combination with innovative SVM optimization techniques, allow us to solve sequence analysis tasks with up to 50 million training examples and other tasks with 7 billion test examples. These developments have enabled large-scale sequence analyses, paving the way for establishing machine learning in the field of genomic sequence analysis. All algorithms are implemented within the widely adopted SHOGUN Machine Learning Toolbox, an open-source project developed by the group with the help of an active and growing community.

 

Kernel-based Machine Learning

For more than 15 years, the group has contributed a number of important advances to kernel-based machine learning, one of the most prominent areas of machine learning today. These include position-specific string kernels for sequence data and learning algorithms that fuse the information contained in multiple data abstractions or heterogeneous views into an effective combined representation (multiple kernel learning).
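To make the idea of a position-specific string kernel concrete, the following minimal sketch implements one common formulation, the weighted degree kernel, which counts substrings of length 1 to `degree` that match at the same position in two equal-length sequences. It is an illustration only and not the group's SHOGUN implementation; the function names `wd_kernel` and `combined_kernel` are hypothetical. The second function shows a weighted sum of kernels with fixed weights, whereas multiple kernel learning would learn these weights from data.

```python
def wd_kernel(x, y, degree=3):
    """Weighted degree kernel between two equal-length sequences.

    Counts substrings of length d = 1..degree that match at the SAME
    position in x and y, weighting length d by
    beta_d = 2 * (degree - d + 1) / (degree * (degree + 1)).
    """
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    length = len(x)
    total = 0.0
    for d in range(1, degree + 1):
        beta = 2.0 * (degree - d + 1) / (degree * (degree + 1))
        # Position-specific: only compare substrings starting at the same index.
        matches = sum(1 for i in range(length - d + 1) if x[i:i + d] == y[i:i + d])
        total += beta * matches
    return total


def combined_kernel(x, y, kernels, weights):
    """Weighted sum of several kernels (weights are fixed here, not learned)."""
    return sum(w * k(x, y) for k, w in zip(kernels, weights))


if __name__ == "__main__":
    # Identical sequences match at every position.
    print(wd_kernel("ACGT", "ACGT", degree=2))
    # Shared substrings at shifted positions do not count.
    print(wd_kernel("ACGT", "CGTA", degree=3))
```

Because the comparison is tied to position, the kernel suits signals that occur at a fixed offset from a reference point, such as motifs around a candidate site in a window of fixed length.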

 

Boosting

Boosting is a machine learning meta-algorithm that combines multiple hypotheses generated by a possibly simple learning algorithm in order to form a sophisticated combined hypothesis. The group has made substantial contributions to this area, notably on the connection between boosting and support vector machines, which helped shape the field.
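As an illustration of how simple hypotheses are combined, the sketch below implements the standard discrete AdaBoost procedure with one-dimensional threshold classifiers ("decision stumps") as the simple learner. It is a textbook version for exposition, not the soft-margin or SVM-related variants developed by the group; the function names and the toy data are assumptions made for this example.

```python
import math


def stump_predict(row, feature, threshold, polarity):
    """Predict +1/-1 by thresholding a single feature."""
    return polarity if row[feature] <= threshold else -polarity


def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, feature, threshold, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best


def adaboost(X, y, rounds=3):
    """Return a list of (alpha, feature, threshold, polarity) weighted stumps."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, feature, threshold, polarity = train_stump(X, y, w)
        err = min(max(err, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, feature, threshold, polarity))
        # Increase the weight of misclassified examples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, feature, threshold, polarity))
             for row, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble


def predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(row, f, t, p) for alpha, f, t, p in ensemble)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # Toy data: the negative class is an interval, which no single stump can separate.
    X = [[0], [1], [2], [3], [4], [5]]
    y = [1, 1, -1, -1, 1, 1]
    model = adaboost(X, y, rounds=3)
    print([predict(model, row) for row in X])
```

On this toy problem a single stump necessarily misclassifies some points, while the weighted vote of three stumps reproduces all training labels.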

Selected Publications:

  1. C. Widmer, M. Kloft, N. Görnitz, G. Rätsch. Efficient Training of Graph-Regularized Multitask SVMs. Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), pages 633-647, 2012.
  2. Schweikert, G, Widmer, C, Schölkopf, B, and Rätsch, G (2008). An Empirical Analysis of Domain Adaptation Algorithms. In: Proc. NIPS 2008. Advances in Neural Information Processing Systems.
  3. Lou, X., Kloft, M., Rätsch, G, Hamprecht, FA. Structured Learning from Cheap Data. In Advanced Structured Prediction. The MIT Press. In press.
  4. Sonnenburg, S, Rätsch, G, and Rieck, K (2007). Large Scale Learning with String Kernels. In: Large-Scale Kernel Machines, ed. by Léon Bottou, Olivier Chapelle, Dennis DeCoste and Jason Weston. MIT Press, Cambridge, MA, chap. 4, pp. 73-104.
  5. S. Sonnenburg, G. Rätsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. De Bona, A. Binder, C. Gehl, V. Franc: The SHOGUN Machine Learning Toolbox. Journal of Machine Learning Research 11: 1799-1802 (2010)
  6. Sören Sonnenburg, Alexander Zien, Gunnar Rätsch: ARTS: accurate recognition of transcription starts in human. ISMB (Supplement of Bioinformatics) 2006: 472-480
  7. Rätsch, G and Sonnenburg, S (2007). Large Scale Hidden Semi-Markov SVMs. In: Advances in Neural Information Processing Systems (NIPS’06), ed. by B. Schölkopf and J. Platt and T. Hoffman, vol. 19, pp. 1161-1168, Cambridge, MA, MIT Press.
  8. G Schweikert, J Behr, A Zien, G Zeller, C Ong, S Sonnenburg, G Rätsch: mGene.web: a web service for accurate computational gene finding. Nucleic Acids Research 37(Web-Server-Issue): 312-316 (2009)
  9. J Behr, A Kahles, Y Zhong, VT Sreedharan, P Drewe, G Rätsch: MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples. Bioinformatics 29(20): 2529-2538 (2013)
  10. A Zien, G Rätsch, S Mika, B Schölkopf, T Lengauer, KR Müller: Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics 16(9): 799-807 (2000)
  11. Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, Bernhard Schölkopf: Large Scale Multiple Kernel Learning. Journal of Machine Learning Research 7: 1531-1565 (2006)
  12. G Rätsch, S Mika, B Schölkopf, KR Müller: Constructing Boosting Algorithms from SVMs: An Application to One-Class Classification. IEEE Trans. Pattern Anal. Mach. Intell. 24(9): 1184-1199 (2002)
  13. G Rätsch, T Onoda, KR Müller: Soft Margins for AdaBoost. Machine Learning 42(3): 287-320 (2001)