Accurate Splice Site Detection (Supplementary Material)
Abstract
For splice site recognition, one has to solve two classification problems: discriminating true from decoy splice sites for both acceptor and donor sites. Gene finding systems typically rely on Markov chains to solve these tasks. In this work we consider Support Vector Machines for splice site recognition. We employ the so-called weighted degree kernel, which turns out to be well suited for this task, as we illustrate in several experiments that compare its prediction accuracy with that of recently proposed systems. We apply our method to the genome-wide recognition of splice sites in Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, and Homo sapiens. Our performance estimates indicate that splice sites can be recognized very accurately in these genomes and that our method outperforms many other methods, including Markov chains, GeneSplicer and SpliceMachine. We provide genome-wide predictions of splice sites and a stand-alone prediction tool ready for incorporation into a gene finder.
The paper is available here.
The data splits, additional information on model selection, the whole-genome predictions, and the stand-alone prediction tool are available on request. For questions, please contact raetsch@cbio.mskcc.org.
Positional Oligomer Importance Matrices (POIMs)
Discriminating k-mers
Stand-alone Splice Site Predictor
- We have developed a stand-alone splice site predictor.
  - Download version 0.3 from here (version 0.2)
  - Alternatively, use the web server (see below)
- Prerequisites:
  - Shogun toolbox (http://www.shogun-toolbox.org/), version 0.7.3
  - Python, version 2.4 or later
  - Organism-specific files
- Organism-specific files:
  - H. sapiens (human)
  - D. melanogaster (fly)
  - C. elegans (worm)
  - A. thaliana (cress)
  - D. rerio (zebrafish)
Web-Server
We offer a web server for predicting splice sites with pre-trained SVMs and for training your own splice site sensors. It is implemented with the Galaxy framework. To use it, follow these steps:
- go to our Galaxy service, http://galaxy.raetschlab.org/
- use "Upload file" or "Get data" (in the tool bar on the left) to load a data set in FASTA format
- on the left, open "mGene Tools"
- use "GenomeTool" to preprocess your sequences
- optionally, use "SignalTrain" to train a splice sensor for a new species
- use "SignalPredict" to predict acceptor or donor splice sites
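The upload step above expects sequences in FASTA format, and splice site sensors are usually evaluated at candidate positions given by the consensus dinucleotides (AG for acceptors, GT for donors). As a rough illustration of that kind of input, the following Python sketch parses FASTA text and lists candidate positions. The function names `read_fasta` and `candidate_sites` are hypothetical and are not part of the mGene tools or the Galaxy server; how the server selects candidates internally is not described here, so treat this only as a way to inspect your own sequences before uploading.

```python
from typing import Dict, Iterator, List


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {record_id: sequence} mapping."""
    records: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record id is the first whitespace-delimited token of the header.
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = []
        elif current is not None:
            records[current].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def candidate_sites(seq: str, consensus: str) -> Iterator[int]:
    """Yield the 0-based start of every (possibly overlapping) occurrence
    of the consensus dinucleotide, e.g. "AG" (acceptor) or "GT" (donor)."""
    start = seq.find(consensus)
    while start != -1:
        yield start
        start = seq.find(consensus, start + 1)


# Illustrative input only; not taken from any of the data sets.
example = ">seq1 toy example\nACGTAGGT\nAAGCT\n"
records = read_fasta(example)
print(list(candidate_sites(records["seq1"], "AG")))  # [4, 9]
print(list(candidate_sites(records["seq1"], "GT")))  # [2, 6]
```

Each reported position is the index of the first base of the dinucleotide. A real pipeline would additionally cut a fixed window around every candidate before passing it to a classifier.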
Final Model Parameters
H. sapiens

window | C | order | shift | ppseudo | npseudo | type | method
---|---|---|---|---|---|---|---
199+[-60,80] | 3 | 22 | 0 | | | acceptor | WD-SVM
199+[-60,80] | 3 | 22 | 0.3 | | | acceptor | WDS-SVM
199+[-25,25] | | 3 | | 10 | 1000 | acceptor | MCs
199+[-80,60] | 3 | 22 | 0 | | | donor | WD-SVM
199+[-80,60] | 3 | 22 | 0.3 | | | donor | WDS-SVM
199+[-17,18] | | 3 | | 0.01 | 1000 | donor | MCs

A. thaliana

window | C | order | shift | ppseudo | npseudo | type | method
---|---|---|---|---|---|---|---
199+[-60,80] | 3 | 22 | 0 | | | acceptor | WD-SVM
199+[-60,80] | 3 | 22 | 0.5 | | | acceptor | WDS-SVM
199+[-80,80] | | 4 | | 10 | 1 | acceptor | MCs
199+[-80,60] | 3 | 26 | 0 | | | donor | WD-SVM
199+[-80,60] | 3 | 22 | 0.5 | | | donor | WDS-SVM
199+[-80,80] | | 4 | | 10 | 10 | donor | MCs

C. elegans

window | C | order | shift | ppseudo | npseudo | type | method
---|---|---|---|---|---|---|---
199+[-60,80] | 3 | 22 | 0 | | | acceptor | WD-SVM
199+[-60,80] | 3 | 22 | 0.3 | | | acceptor | WDS-SVM
199+[-25,25] | | 3 | | 10 | 1000 | acceptor | MCs
199+[-80,60] | 3 | 22 | 0 | | | donor | WD-SVM
199+[-80,60] | 3 | 22 | 0.3 | | | donor | WDS-SVM
199+[-17,18] | | 3 | | 0.01 | 1000 | donor | MCs

D. rerio

window | C | order | shift | ppseudo | npseudo | type | method
---|---|---|---|---|---|---|---
199+[-60,80] | 3 | 22 | 0 | | | acceptor | WD-SVM
199+[-60,80] | 3 | 22 | 0.3 | | | acceptor | WDS-SVM
199+[-60,60] | | 3 | | 0 | 1000 | acceptor | MCs
199+[-80,60] | 3 | 22 | 0 | | | donor | WD-SVM
199+[-80,60] | 3 | 22 | 0.3 | | | donor | WDS-SVM
199+[-60,60] | | 3 | | 0 | 1000 | donor | MCs

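To make the role of the `order` parameter more concrete, here is a minimal Python sketch of the standard weighted degree (WD) kernel for two equal-length sequences. It assumes that `order` in the WD-SVM rows is the maximal k-mer length d of the kernel, and it uses the usual weighting beta_k = 2(d - k + 1) / (d(d + 1)). The function name `wd_kernel` is hypothetical. The sketch does not cover the shifted variant (WDS), SVM training, or the optimized implementation in the Shogun toolbox.

```python
def wd_kernel(x: str, y: str, order: int) -> float:
    """Weighted degree kernel between two sequences of equal length.

    For every k-mer length k = 1..order, count the positions at which
    the k-mers of x and y starting there are identical, and weight the
    count by beta_k = 2 * (order - k + 1) / (order * (order + 1)).
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    if order < 1:
        raise ValueError("order must be at least 1")

    length = len(x)
    total = 0.0
    for k in range(1, order + 1):
        beta = 2.0 * (order - k + 1) / (order * (order + 1))
        # Position-wise comparison: only k-mers at the same offset can match.
        matches = sum(
            1 for pos in range(length - k + 1)
            if x[pos:pos + k] == y[pos:pos + k]
        )
        total += beta * matches
    return total


# Toy comparison (illustrative sequences, not from the data sets):
# a sequence is more similar to itself than to a one-base variant.
same = wd_kernel("ACGTACGT", "ACGTACGT", order=3)
variant = wd_kernel("ACGTACGT", "ACGAACGT", order=3)
print(same > variant)  # True
```

This direct version costs on the order of sequence length times `order` string comparisons per kernel evaluation, which is fine for experimentation but far too slow for genome-wide prediction.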
Genome-wide Data Sets for Worm, Fly, Cress, Fish, and Human
- All donor and acceptor data sets for all organisms, as well as genome-wide predictions in custom track format, are available for download from the public FTP server.
Genome-wide Predictions (Custom Tracks)
- The genome-wide predictions are available for download here.