Accurate Splice Site Detection in C. elegans

Download paper: pdf gz

Gunnar Rätsch (homepage) (contact me in case of trouble with this page)
Sören Sonnenburg (homepage)

Appeared in: Kernel Methods in Computational Biology, B. Schölkopf, K. Tsuda and J.-P. Vert (editors), MIT Press (link)

This page contains additional material for the above-mentioned paper. We tried to document exactly:

- which data sets were used,
- what the model selection results were, and
- how the Weighted Degree Kernel is implemented.

Section 1 provides the list of virtual genes from which the acceptor and donor sites were derived; the derived splice site data can be found in Section 2. Section 3 gives the model selection results for splice site recognition. Section 4 provides the data used to evaluate complete splice forms, and the corresponding model selection results can be found in Section 5. The Weighted Degree Kernel implementation is in Section 6.

Training, Validation and Test sets of "virtual genes"
These genes were used to generate the splice data set and to perform the comparison with genscan. Each gene occupies three lines in the files: the gene string on one line, followed by two lines of positions:

gene_start intron_end+1 intron_end+1
intron_start+1 intron_start+1 gene_end+2

That is, gene_start lies on atg, intron_start on gt, intron_end on agx and gene_end on tagxx. The data looks like this:

tccgaatatcaatgtga...
571 738 1287 2018
683 939 1449 2144
tccgaatatcaatgtg...
571 695 868
648 818 1031
...
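As a rough illustration, the layout described above (one line with the gene string, then two lines of positions) could be read with a small Python helper. This is only a sketch under stated assumptions: the names `VirtualGene` and `parse_virtual_genes` are hypothetical, the toy sequence is a placeholder rather than real data, and the positions are kept exactly as given without assuming 0-based or 1-based indexing.

```python
from typing import Iterable, List, NamedTuple


class VirtualGene(NamedTuple):
    sequence: str
    first_row: List[int]   # gene_start followed by the intron_end+1 values
    second_row: List[int]  # intron_start+1 values followed by gene_end+2


def parse_virtual_genes(lines: Iterable[str]) -> List[VirtualGene]:
    """Group non-empty lines into (sequence, positions, positions) triples."""
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) % 3 != 0:
        raise ValueError("expected three lines per gene")
    genes = []
    for i in range(0, len(rows), 3):
        first = [int(tok) for tok in rows[i + 1].split()]
        second = [int(tok) for tok in rows[i + 2].split()]
        if len(first) != len(second):
            raise ValueError("both position lines must have the same length")
        genes.append(VirtualGene(rows[i], first, second))
    return genes


# Toy input: the sequence is a shortened placeholder, not real data.
example = ["acgtacgt", "571 738 1287 2018", "683 939 1449 2144"]
genes = parse_virtual_genes(example)
print(genes[0].first_row, genes[0].second_row)
```

For the compressed files, the lines could be obtained with `gzip.open(path, "rt")` from the Python standard library and passed to the helper directly.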

Training, Validation and Test data for Acceptor and Donor splice sites

The data looks like this

-1 TTCTGAAGAAGACGATGACGAAGACGAAGGAGAAGCCGTTGCAGAACTTGTCACAAAGTG
-1 CCAACCTAATCGTTATACATATGTATTTACAGTCGCAAATGACAATTGAACAAATAAATG
  ....
+1 AATGTTTCAATTATAAAAATTGTTAATTACAGGGGGACACCTGTATCAGTGTGACATTTC
  ....

Here -1 means "no splice site" and +1 means "splice site". The sequence follows the label after a single space.
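A minimal sketch of how such labeled lines could be parsed in Python is given below. The function name `parse_labeled_sequences` is hypothetical; the two sample lines are copied from the excerpt above.

```python
from typing import Iterable, List, Tuple


def parse_labeled_sequences(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (label, sequence) pairs, with label equal to -1 or +1."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        label, sequence = line.split(maxsplit=1)
        if label not in ("+1", "-1"):
            raise ValueError(f"unexpected label: {label!r}")
        examples.append((int(label), sequence))
    return examples


sample = [
    "-1 TTCTGAAGAAGACGATGACGAAGACGAAGGAGAAGCCGTTGCAGAACTTGTCACAAAGTG",
    "+1 AATGTTTCAATTATAAAAATTGTTAATTACAGGGGGACACCTGTATCAGTGTGACATTTC",
]
data = parse_labeled_sequences(sample)
n_positive = sum(1 for label, _ in data if label == 1)
print(len(data), n_positive)
```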

Model Selection Results for Splice Site Prediction

(Models were selected for the largest validation ROC.) All result files named *.{tst|dat} start with a line giving the validation or test error, followed by the classifier outputs, one per line:

validation error = 0.014181

-12.143139
-10.286769
...
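The result-file layout shown above could be read along the following lines. This is a sketch, not the authors' tooling: `parse_result_file` is a hypothetical name, and the header is assumed to always have the form "<name> error = <value>" as in the excerpt.

```python
from typing import Iterable, List, Tuple


def parse_result_file(lines: Iterable[str]) -> Tuple[str, float, List[float]]:
    """Split a result file into (header name, error value, classifier outputs)."""
    rows = [line.strip() for line in lines if line.strip()]
    if not rows or "=" not in rows[0]:
        raise ValueError("first line should look like '<name> error = <value>'")
    name, _, value = rows[0].partition("=")
    outputs = [float(row) for row in rows[1:]]
    return name.strip(), float(value), outputs


# The first lines of the excerpt above, without the elision marker.
sample = ["validation error = 0.014181", "", "-12.143139", "-10.286769"]
name, error, outputs = parse_result_file(sample)
print(name, error, len(outputs))
```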

Trained SVMs are saved in the following format:

b=-3.577909
alphas=[
       2 -1.000000
      13 +0.373805
      57 +1.000000
      68 -0.332549
      85 -1.000000
      ...
]

Here b is the bias term and alphas contains pairs of index and value, where index is the index of a support vector (one with a nonzero coefficient) and value is the product of the Lagrange multiplier and the label of that support vector.
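To make the format concrete, the sketch below parses such a file and evaluates the usual SVM decision function f(x) = b + sum_i value_i * k(x_i, x). Several things here are assumptions rather than facts from this page: the names `parse_svm` and `decision_value` are hypothetical, the `index_base` parameter exists only because the page does not say whether indices start at 0 or 1, and the toy kernel is a stand-in for whichever kernel the SVM was trained with.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def parse_svm(lines: Iterable[str]) -> Tuple[float, List[Tuple[int, float]]]:
    """Read the bias b and the (index, value) pairs from a saved SVM."""
    b = None
    alphas: List[Tuple[int, float]] = []
    inside = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("b="):
            b = float(line[2:])
        elif line.startswith("alphas=["):
            inside = True
        elif line == "]":
            inside = False
        elif inside and line:
            index, value = line.split()
            alphas.append((int(index), float(value)))
    if b is None:
        raise ValueError("missing bias line 'b=...'")
    return b, alphas


def decision_value(b: float, alphas: Sequence[Tuple[int, float]],
                   train: Sequence[str], x: str,
                   kernel: Callable[[str, str], float],
                   index_base: int = 0) -> float:
    """f(x) = b + sum_i value_i * k(x_i, x); index_base is an assumption."""
    return b + sum(v * kernel(train[i - index_base], x) for i, v in alphas)


sample = ["b=-3.577909", "alphas=[", "       2 -1.000000",
          "      13 +0.373805", "]"]
b, alphas = parse_svm(sample)

# Toy kernel for illustration only: number of matching positions.
def toy_kernel(s: str, t: str) -> float:
    return float(sum(c == d for c, d in zip(s, t)))

toy_train = ["AAAA", "CCCC", "AACC"]
f = decision_value(0.25, [(0, 1.0), (2, -0.5)], toy_train, "AACA", toy_kernel)
print(b, alphas, f)
```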

Positional Weight Matrixes

site      pseudo_p  pseudo_n  order  RSE    Err
acceptor  1         1         2      98.88  1.54
donor     10        1e-4      2      98.23  1.85

Download result files:
- gfpie_corrected2_acc_order2_ps=[1.00e+00,1.00e+00]_val.asc.gz (acceptor validation outputs)
- gfpie_corrected2_acc_order2_ps=[1.00e+00,1.00e+00]_tst.asc.gz (acceptor test outputs)
- gfpie_corrected2_don_order2_ps=[1.00e+01,1.00e-04]_val.asc.gz (donor validation outputs)
- gfpie_corrected2_don_order2_ps=[1.00e+01,1.00e-04]_tst.asc.gz (donor test outputs)
Weighted Degree Kernel

site      C  degree  RSE    Err
acceptor  1  4       99.06  1.42
donor     1  3       98.47  1.78

Download result files:
- gfwd3_acc_order4_C1.00_val.asc.gz (acceptor validation outputs)
- gfwd3_acc_order4_C1.00_tst.asc.gz (acceptor test outputs)
- gfwd3_don_order3_C1.00_val.asc.gz (donor validation outputs)
- gfwd3_don_order3_C1.00_tst.asc.gz (donor test outputs)
- gfwd2_acc_C1.00_order4.asc.gz (acceptor SVM)
- gfwd2_don_C1.00_order3.asc.gz (donor SVM)
Locality Improved Kernel

site      C     degree  width  RSE    Err
acceptor  0.75  4       15     99.08  1.44
donor     1     3       10     98.48  1.80

Download result files:
- gfslik_acc_C0.75_width15_degree4_val.asc.gz (acceptor validation output)
- gfslik_acc_C0.75_width15_degree4_tst.asc.gz (acceptor test output)
- gfslik_don_C1.00_width10_degree3_val.asc.gz (donor validation output)
- gfslik_don_C1.00_width10_degree3_tst.asc.gz (donor test output)
- gfsvm_slik_acc_C0.75_width15_d14_d21.asc.gz (acceptor svm)
- gfsvm_slik_don_C1.00_width10_d13_d21.asc.gz (donor svm)
TOP-Linear Kernel

site      C    degree  RSE    Err
acceptor  0.5  3       98.88  1.52
donor     0.5  2       98.35  1.82

Download result files:
- gfhist_acc_order3_ps=[1.00e+03,1.00e+04]_C0.50_val.asc.gz (acceptor validation outputs)
- gfhist_acc_order3_ps=[1.00e+03,1.00e+04]_C0.50_tst.asc.gz (acceptor test outputs)
- gfhist_don_order2_ps=[1.00e+01,1.00e-04]_C0.50_val.asc.gz (donor validation outputs)
- gfhist_don_order2_ps=[1.00e+01,1.00e-04]_C0.50_tst.asc.gz (donor test outputs)
- gfhist_acc_C0.50_order3_ps=[1.00e+03,1.00e+04].asc.gz (acceptor SVM)
- gfhist_don_C0.50_order2_ps=[1.00e+01,1.00e-04].asc.gz (donor SVM)
SVM-Pairwise with 500 reference examples
(trained on 20k examples; evaluated on the first 10k test examples only)

site      C   gapcost  RSE    Err
acceptor  5   0.5      98.01  1.93
donor     50  0.5      97.60  2.03

Download result files:
- gfalign_acc_nR=500_gapCost=0.50_C5.00_val.asc.gz (acceptor validation outputs)
- gfalign_acc_nR=500_gapCost=0.50_C5.00_tst.asc.gz (acceptor test outputs)
- gfalign_don_nR=500_gapCost=0.50_C50.00_val.asc.gz (donor validation outputs)
- gfalign_don_nR=500_gapCost=0.50_C50.00_tst.asc.gz (donor test outputs)
- gfalign_acc_C1.00_gapCost=0.40_nR=500.asc.gz (acceptor SVM; wrong file)
- gfalign_don_C50.00_gapCost=0.50_nR=500.asc.gz (donor SVM)
Polynomial Kernel

site      C  degree  RSE    Err
acceptor  2  6       98.94  1.80
donor     2  5       98.31  2.08

Download result files:
- gfpoly_acc_order6_C2.00_val.asc.gz (acceptor validation outputs)
- gfpoly_acc_order6_C2.00_tst.asc.gz (acceptor test outputs)
- gfpoly_don_order5_C2.00_val.asc.gz (donor validation outputs)
- gfpoly_don_order5_C2.00_tst.asc.gz (donor test outputs)
- gfpoly_acc_C2.00_order6.asc.gz (acceptor SVM)
- gfpoly_don_C2.00_order5.asc.gz (donor SVM)

Sequences used for Evaluation of Splice Form Prediction

Alignment-Reference-Examples

(250 of each class)
The files contain plain sequences, one per line:

AATGTTTCAATTATAAAAATTGTTAATTACAGGGGGACACCTGTATCAGTGTGACATTTC
TTTTGTGGACAAGTTAGAGCAAACGATTATAGATGCAGCGACAGAGGGATTTGGAATCAA
TGAGGTAAAAATTTAAACTGTGAAAATTTCAGCGTATCTTCGAAATCTAGTGGAAAGCGC

Download

SVM-pairwise-acc-ref.asc.gz (acceptor)
SVM-pairwise-don-ref.asc.gz (donor)

These examples were used in testing; their format is the same as in Section 1. The genscan_exon_no_pred_test.asc.gz file contains two columns with as many rows as there are test genes. A one in the first column denotes that genscan correctly found the gene start and end (zero otherwise). The number in the second column is the predicted number of exons, e.g.

0   3
1   3
...

Download test data sets:
- test_genes.asc.gz (all sequences, including the start and end of exons)
- genscan_exon_no_pred_test.asc.gz (the number of exons genscan predicted, and a bit vector indicating whether genscan found the right start and end)
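The two-column genscan summary described above could be read as in the following sketch. The names `parse_genscan_summary` and `fraction_with_correct_boundaries` are hypothetical, and the input is the two-row excerpt shown earlier rather than the full file.

```python
from typing import Iterable, List, Tuple


def parse_genscan_summary(lines: Iterable[str]) -> List[Tuple[bool, int]]:
    """Return (start_and_end_found, predicted_exon_count) for each test gene."""
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 2 or fields[0] not in ("0", "1"):
            raise ValueError(f"unexpected row: {line!r}")
        rows.append((fields[0] == "1", int(fields[1])))
    return rows


def fraction_with_correct_boundaries(rows: List[Tuple[bool, int]]) -> float:
    """Share of genes for which genscan found the gene start and end."""
    return sum(1 for found, _ in rows if found) / len(rows)


sample = ["0   3", "1   3"]
rows = parse_genscan_summary(sample)
print(rows, fraction_with_correct_boundaries(rows))
```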

We used the following constants:

min_exon_len    8
min_intron_len  35
max_pos         1000

Model-Selection for alpha, sigma_a, sigma_b:
- Positional Weight Matrixes

  sigmoid_a  0.45
  sigmoid_b  -0.9
  alpha      -3.75

  used model parameters (may differ from above)

  site      order  pseudo_p  pseudo_n
  acceptor  3      1         1e-6
  donor     3      10        100

- Weighted Degree Kernel

  sigmoid_a  0.75
  sigmoid_b  -0.9375
  alpha      1.7

  used model parameters (may differ from above)

  site      C  degree
  acceptor  2  3
  donor     1  3

- Locality Improved Kernel

  sigmoid_a  0.75
  sigmoid_b  -0.75
  alpha      1.0

  used model parameters (may differ from above!)

  site      degree  width  C
  acceptor  4       15     2
  donor     3       10     5
Implementation of the WD Kernel

Download Implementation wd_kernel.cpp
Please note that the Shogun toolbox contains an easy-to-use version of this kernel.
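For orientation, here is a short Python sketch of the Weighted Degree kernel as it is commonly defined in the literature: k(x, y) = sum over d = 1..D of beta_d times the number of positions at which the length-d substrings of x and y agree, with beta_d = 2(D - d + 1) / (D(D + 1)). This is not a transcription of wd_kernel.cpp; whether that file uses exactly these weights or applies an additional normalization is an assumption, and the function name `wd_kernel` is chosen here for illustration only.

```python
def wd_kernel(x: str, y: str, degree: int) -> float:
    """Weighted Degree kernel between two equal-length sequences (sketch)."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    if degree < 1:
        raise ValueError("degree must be at least 1")
    norm = degree * (degree + 1)
    total = 0.0
    for d in range(1, degree + 1):
        # Assumed weighting: longer matching substrings get smaller weights.
        beta = 2.0 * (degree - d + 1) / norm
        # Count positions where the length-d substrings are identical.
        matches = sum(
            1 for pos in range(len(x) - d + 1)
            if x[pos:pos + d] == y[pos:pos + d]
        )
        total += beta * matches
    return total


k_same = wd_kernel("ACGT", "ACGT", 2)
k_diff = wd_kernel("ACGT", "ACCT", 2)
print(k_same, k_diff)
```

This direct version costs O(D L) substring comparisons for sequences of length L, which is fine for illustration; an optimized implementation would typically reuse the lengths of matching runs instead of comparing each substring separately.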
