Accurate Splice Site Detection in C. elegans

Download paper: pdf gz
  • Gunnar Rätsch (homepage) (contact me in case of trouble with this page)
  • Sören Sonnenburg (homepage)
Appeared in Kernel Methods in Computational Biology, B. Schölkopf, K. Tsuda and J.-P. Vert (editors), MIT Press (link)
This page contains additional material for the above-mentioned paper. We tried to
  1. document exactly which data sets were used,
  2. document what the model selection results were, and
  3. provide an implementation of the Weighted Degree Kernel.

In Section 1 we provide the virtual gene list from which the acceptor and donor sites were derived; the derived splice site data can be found in Section 2. Model selection results for splice site recognition are provided in Section 3. Section 4 provides the data used to evaluate complete splice forms; the corresponding model selection results can be found in Section 5. The Weighted Degree Kernel implementation is found in Section 6.

  1. Training, Validation and Test sets of "virtual genes"

    These genes were used to generate the splice data set and to perform the comparison with genscan. In each file, a gene string on a single line is followed by two lines of positions:
    gene_start     intron_end+1   intron_end+1
    intron_start+1 intron_start+1 gene_end+2
    
    i.e. gene_start is on atg, intron_start on gt, intron_end on agx and gene_end on tagxx. The data looks like this:
    tccgaatatcaatgtga...
    571 738 1287 2018 
    683 939 1449 2144 
    tccgaatatcaatgtg...
    571 695 868 
    648 818 1031
    ...
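
As a minimal sketch, assuming the three-line record layout shown above (one sequence line followed by two lines of integers), the files could be read in Python as follows. The names VirtualGene and parse_virtual_genes are hypothetical and not part of the original material; the sketch makes no assumption about whether positions are 0-based or 1-based and simply returns the numbers as given.

```python
from typing import List, NamedTuple


class VirtualGene(NamedTuple):
    sequence: str
    first_row: List[int]   # gene_start followed by the intron_end+1 entries
    second_row: List[int]  # intron_start+1 entries followed by gene_end+2


def parse_virtual_genes(text: str) -> List[VirtualGene]:
    """Parse records of three lines: sequence, first row, second row."""
    # Blank lines are skipped, so records may be separated by empty lines.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("expected records of exactly three lines each")
    genes = []
    for i in range(0, len(lines), 3):
        first = [int(tok) for tok in lines[i + 1].split()]
        second = [int(tok) for tok in lines[i + 2].split()]
        if len(first) != len(second):
            raise ValueError("both position rows must have the same length")
        genes.append(VirtualGene(lines[i], first, second))
    return genes
```

Each returned record keeps the two position rows separate, so pairing first_row[k] with second_row[k] gives the k-th column of the record.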
    
    Download:
  2. Training, Validation and Test data for Acceptor and Donor splice sites

    The data looks like this
    -1 TTCTGAAGAAGACGATGACGAAGACGAAGGAGAAGCCGTTGCAGAACTTGTCACAAAGTG
    -1 CCAACCTAATCGTTATACATATGTATTTACAGTCGCAAATGACAATTGAACAAATAAATG
      ....
    +1 AATGTTTCAATTATAAAAATTGTTAATTACAGGGGGACACCTGTATCAGTGTGACATTTC
      ....
    
    Here the label -1 means no splice site and +1 means splice site. The sequence follows the label after a space. Download:
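
The label/sequence lines of Section 2 can be read with a few lines of Python. This is only a sketch under the assumption that every non-empty line holds a label (-1 or +1), whitespace, and then the sequence; the function name parse_labeled_sequences is hypothetical.

```python
from typing import List, Tuple


def parse_labeled_sequences(text: str) -> List[Tuple[int, str]]:
    """Return (label, sequence) pairs; label is -1 (no site) or +1 (site)."""
    examples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split once on whitespace: label first, sequence second.
        label_str, sequence = line.split(None, 1)
        label = int(label_str)  # int() accepts both "-1" and "+1"
        if label not in (-1, 1):
            raise ValueError(f"unexpected label: {label_str!r}")
        examples.append((label, sequence.strip()))
    return examples
```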
  3. Model Selection Results for Splice Site Prediction

    (Selected for largest validation ROC.) All result files named *.{tst|dat} contain a line stating the validation or test error, followed by the classifier outputs.
    validation error = 0.014181
    
    -12.143139
    -10.286769
    ...
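
A result file of this shape could be loaded as sketched below. The sketch assumes the first non-empty line has the form "<name> = <value>" and every following non-empty line is one classifier output; parse_result_file is a hypothetical name.

```python
from typing import List, Tuple


def parse_result_file(text: str) -> Tuple[str, float, List[float]]:
    """Return (error name, error value, classifier outputs)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or "=" not in lines[0]:
        raise ValueError("first line must look like '<name> = <value>'")
    # Header line, e.g. "validation error = 0.014181".
    name, _, value = lines[0].partition("=")
    outputs = [float(ln) for ln in lines[1:]]
    return name.strip(), float(value), outputs
```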
    
    Trained SVMs are saved in the following format:
    b=-3.577909
    alphas=[
           2 -1.000000
          13 +0.373805
          57 +1.000000
          68 -0.332549
          85 -1.000000
          ...
    ]
    
    Here b is the bias term and alphas contains pairs of index and value, where index is the index of a support vector with a nonzero coefficient and value is the product of the Lagrange multiplier and the label of that support vector. Results:
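
To illustrate how the saved bias and coefficients combine, here is a hedged Python sketch that parses the format above and evaluates the usual SVM output f(x) = sum_i value_i * k(x_i, x) + b. The names parse_svm and decision_value are hypothetical. Whether the stored indices are 0-based or 1-based is not stated on this page, so the sketch assumes they index directly into the training collection supplied by the caller.

```python
from typing import Callable, Dict, Sequence, Tuple


def parse_svm(text: str) -> Tuple[float, Dict[int, float]]:
    """Read 'b=...' and the 'alphas=[ ... ]' block into (bias, {index: value})."""
    bias = None
    alphas: Dict[int, float] = {}
    in_alphas = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("b="):
            bias = float(line[2:])
        elif line.startswith("alphas=["):
            in_alphas = True
        elif line == "]":
            in_alphas = False
        elif in_alphas:
            index, value = line.split()
            alphas[int(index)] = float(value)  # value = multiplier * label
    if bias is None:
        raise ValueError("no 'b=' line found")
    return bias, alphas


def decision_value(x: str, train: Sequence[str], bias: float,
                   alphas: Dict[int, float],
                   kernel: Callable[[str, str], float]) -> float:
    """Evaluate sum_i value_i * k(train[i], x) + bias.

    Assumption: each stored index addresses train[index] as-is.
    """
    return sum(v * kernel(train[i], x) for i, v in alphas.items()) + bias
```

Any kernel with the signature kernel(a, b) -> float can be plugged in, for example a Weighted Degree Kernel as described in Section 6.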
  4. Sequences used for Evaluation of Splice Form Prediction

    • Alignment-Reference-Examples

      (250 of each class):
      Files contain plain sequences:
      AATGTTTCAATTATAAAAATTGTTAATTACAGGGGGACACCTGTATCAGTGTGACATTTC
      TTTTGTGGACAAGTTAGAGCAAACGATTATAGATGCAGCGACAGAGGGATTTGGAATCAA
      TGAGGTAAAAATTTAAACTGTGAAAATTTCAGCGTATCTTCGAAATCTAGTGGAAAGCGC
      

      Download

      These examples were used in testing; the format is as in Section 1. The genscan_exon_no_pred_test.asc.gz file contains two columns and as many rows as there are test genes. A one in the first column denotes that genscan correctly found the gene start and end (zero otherwise). The number in the second column is the predicted number of exons, e.g.
      0   3
      1   3
      ...
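
A small sketch for reading this two-column summary, assuming whitespace-separated integers with one row per test gene. The name parse_genscan_summary is hypothetical.

```python
from typing import List, Tuple


def parse_genscan_summary(text: str) -> List[Tuple[bool, int]]:
    """Return (start_and_end_found, predicted_exon_count) for each gene."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if len(parts) != 2:
            raise ValueError(f"expected two columns, got: {line!r}")
        found, n_exons = int(parts[0]), int(parts[1])
        rows.append((found == 1, n_exons))
    return rows
```

From the parsed rows, the fraction of genes with correctly found start and end is sum(f for f, _ in rows) / len(rows).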
      
    • Download test data sets
      test_genes.asc.gz (all sequences including start and end of exons)
      genscan_exon_no_pred_test.asc.gz (the number of exons genscan predicted and a bit vector indicating whether genscan found the right start and end)

      We used the following constants:

      min_exon_len 8
      min_intron_len 35
      max_pos 1000
  5. Model selection for alpha, sigmoid_a, sigmoid_b:

    • Positional Weight Matrices

      sigmoid_a 0.45
      sigmoid_b -0.9
      alpha -3.75
      Model parameters used (may differ from above):
                order  pseudo_p  pseudo_n
      acceptor  3      1         1e-6
      donor     3      10        100
    • Weighted Degree Kernel

      sigmoid_a 0.75
      sigmoid_b -0.9375
      alpha 1.7
      Model parameters used (may differ from above):
                C  degree
      acceptor  2  3
      donor     1  3
    • Locality Improved Kernel

      sigmoid_a 0.75
      sigmoid_b -0.75
      alpha 1.0
      Model parameters used (may differ from above):
                degree  width  C
      acceptor  4       15     2
      donor     3       10     5
  6. Implementation of the WD Kernel

    Download Implementation wd_kernel.cpp
    Please note that the Shogun toolbox contains an easy-to-use version of this kernel.
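
For orientation, the following Python sketch computes the Weighted Degree Kernel in its commonly stated form: for two sequences of equal length and maximal degree D, it sums over d = 1..D the number of positions where the length-d substrings coincide, weighted by beta_d = 2(D - d + 1) / (D(D + 1)). It is a naive illustration and not a transcription of wd_kernel.cpp; the function name wd_kernel and the absence of any further normalisation are assumptions of this sketch.

```python
def wd_kernel(x: str, y: str, degree: int) -> float:
    """Naive Weighted Degree Kernel between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    if degree < 1:
        raise ValueError("degree must be at least 1")
    length = len(x)
    total = 0.0
    for d in range(1, degree + 1):
        # Weight for substrings of length d; the weights sum to 1 over d.
        beta = 2.0 * (degree - d + 1) / (degree * (degree + 1))
        # Count start positions where the length-d substrings match exactly.
        matches = sum(1 for pos in range(length - d + 1)
                      if x[pos:pos + d] == y[pos:pos + d])
        total += beta * matches
    return total
```

For example, wd_kernel(a, b, 3) corresponds to the degree 3 listed for the Weighted Degree Kernel in Section 5. The cost of this version grows with the product of sequence length and degree, since every substring is compared directly.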