POIM
At the heart of many important bioinformatics problems, such as gene finding and function prediction, is the classification of biological sequences. Frequently the most accurate classifiers are obtained by training support vector machines (SVMs) with complex sequence kernels. However, a major shortcoming of SVMs is that their learned decision rules are very hard for humans to understand and cannot easily be related to biological facts.
To make SVM-based sequence classifiers more accessible and profitable, we introduce the concept of Positional Oligomer Importance Matrices (POIMs) and propose an efficient algorithm for their computation. In contrast to the raw SVM feature weighting, POIMs take into account the correlation structure of k-mer features that is induced by overlaps of related k-mers. POIMs can be seen as a powerful generalization of sequence logos: they make it possible to capture and visualize sequence patterns that are relevant to the biological phenomena under investigation.
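To illustrate how overlaps give importance to k-mers whose raw weight is zero, here is a minimal brute-force sketch. It assumes that the importance of a k-mer z at position j is the expected score of sequences carrying z at j minus the overall expected score, Q(z, j) = E[s(X) | X[j..j+k-1] = z] - E[s(X)], with X drawn uniformly at random. The weight table, the function names, and the tiny sequence length are hypothetical choices for illustration. This is not the efficient algorithm mentioned above and it does not use the Shogun toolbox.

```python
from itertools import product

ALPHABET = "ACGT"

def score(seq, weights, k):
    """Linear positional k-mer score: sum of weights[(kmer, position)]."""
    return sum(weights.get((seq[i:i + k], i), 0.0)
               for i in range(len(seq) - k + 1))

def poim_brute_force(weights, k, length):
    """Q(z, j) = E[s(X) | X[j:j+k] = z] - E[s(X)], uniform background.

    Enumerates all 4**length sequences, so it only works for toy sizes.
    """
    seqs = ["".join(p) for p in product(ALPHABET, repeat=length)]
    scores = {s: score(s, weights, k) for s in seqs}
    mean = sum(scores.values()) / len(seqs)
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    poim = {}
    for j in range(length - k + 1):
        for z in kmers:
            # Scores of all sequences that carry k-mer z at position j.
            matching = [scores[s] for s in seqs if s[j:j + k] == z]
            poim[(z, j)] = sum(matching) / len(matching) - mean
    return poim

# Hypothetical toy model: only the 2-mer "GT" at position 1 has a weight.
weights = {("GT", 1): 1.0}
poim = poim_brute_force(weights, k=2, length=4)

for key in [("GT", 1), ("AG", 0), ("TA", 2), ("AA", 0)]:
    print(key, poim[key])
```

In this toy model, "AG" at position 0 and "TA" at position 2 receive positive importance although their raw weights are zero, because they overlap with and are compatible with "GT" at position 1. "AA" at position 0 receives a slightly negative value because it rules that k-mer out.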
We are currently working on methods to visualize and condense the information extracted via POIMs in order to find discriminative motifs for different DNA signals, such as splice sites, promoters, and polyadenylation sites.
Downloadable Material
The paper can be downloaded as a local copy or from OUP Bioinformatics. The items below are supplemental material for the manuscript "Positional Oligomer Importance Matrices -- Understanding Support Vector Machine Based Signal Detectors" by Sören Sonnenburg, Alexander Zien, Petra Philips and Gunnar Rätsch.
- Supplementary material download
- Technical report "Computing POIMs"
- Shogun toolbox link
Online Tool
At our Galaxy site you can try POIMs on your own sequences. First go to http://galaxy.ratschlab.org and select "SVM Toolbox->Positional Oligomer Matrices" from the list of tools on the left. Then (i) upload two sets of sequences, (ii) train an SVM, and (iii) have the POIMs displayed!