RASE: Recognition of Alternatively Spliced Exons

Supplementary Material for the ISMB 2005 paper

RASE: Recognition of Alternatively Spliced Exons in C. elegans

The abstract and paper are available from the Bioinformatics site.

This page contains additional material for the paper mentioned above. We document:

  1. how the different datasets were generated (and make them available for download),
  2. what results were achieved in:
    • the cross-validation procedure for the "Could this exon be excluded in the gene product?" predictor,
    • the wetlab validation on this task, and
    • model selection for the "Is there an alternative exon within this given intron, and where?" predictor,
  3. how the methods perform on all available ESTs of C. elegans, and
  4. which features were most valuable in SVM training.

In Section 1 we provide the datasets used for learning splice sites, the dataset of alternatively spliced exons, and the splits used for its training. Extensive model selection results are shown in Section 2. More details about the wetlab experiments, including sequencing primers and gel images, are given in Section 3. Section 4 follows with an interpretation of the trained SVM using multiple kernel learning; there the most significant k-mers as well as kernel weights and penalty functions are presented. The last section lists the alternative exons predicted by our methods.

Data Generation and Datasets

  • Splice database

    We collected all known C. elegans ESTs from Wormbase (release WS118; 236,868 sequences), dbEST (as of February 22, 2004; 231,096 sequences) and UniGene (as of October 15, 2003; 91,480 sequences). Using blat we aligned them against the genomic DNA (release WS118). The alignment was used to confirm exons and introns. We refined the alignment by correcting typical sequencing errors, for instance by removing minor insertions and deletions. If an intron did not exhibit the consensus GT/AG or GC/AG at the 5' and 3' ends, we tried to achieve this by shifting the boundaries by up to 2 nucleotides (nt). If this still did not lead to the consensus, we split the sequence into two parts and considered each subsequence separately. For each sequence we determined the longest open reading frame (ORF) and only used the part of the sequence within the ORF. In the next step we merged alignments if they did not disagree and shared at least one complete exon. This led to a set of 135,239 unique EST-based sequences.

    We repeated the above procedure with all known cDNAs from Wormbase (release WS118; 4,848 sequences) and UniGene (as of October 15, 2003; 1,231 sequences), which led to 4,979 unique sequences. We removed all EST matches fully contained in the cDNA matches, leaving 109,693 EST-based sequences.

    Finally we obtained the following splice dataset:
    Download splice.tar.gz

  • Alternatively Spliced Exons database

    We collected all known C. elegans ESTs and cDNAs from Wormbase (release WS135), dbEST (as of December 17, 2004) and UniGene (as of December 17, 2004). We merged the databases and removed duplicate EST sequences (in either orientation). Using blat we aligned them against the genomic DNA (release WS135). We only considered sequences with at least 90% sequence identity (over the full length of the sequence). We refined the alignment by correcting typical sequencing errors and by handling polycistronic sequences (see supplementary website for more details). The alignment was used to confirm exons and introns. Finally we merged the alignments if they did not disagree and shared at least one complete exon or intron. For each determined exon and intron we counted how often it was confirmed by (unique) ESTs.

    In the following step we identified pairs of sequences in our set that share the same 3' and 5' boundaries of the upstream and downstream exon, respectively, where one sequence contains an internal exon and the other does not (i.e. the pair shows evidence of alternative exon usage with the same flanking exon boundaries). This way we identified 487 exons for which ESTs show evidence for alternative splicing. As negative examples we only considered exon triples that did not show evidence for alternative splicing and whose internal exon and flanking introns were each confirmed at least twice by an EST sequence. We were able to extract 2,531 exon triples with the internal exon likely to be constitutively spliced. This database of 3,018 examples in total is used for training, model selection and evaluation of our methods.

    Finally we obtained the following alternatively spliced exon dataset:
    Download altsplicedexons.tar.gz

  • Dataset splits

    The dataset splits used in the cross-validation procedure are provided here:
    Download altsplicedexonsplits.tar.gz
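
The pairing step described for the alternatively spliced exons database can be illustrated with a minimal sketch. It assumes each merged alignment is represented as a sorted list of (start, end) exon coordinates; this representation and the function name `find_skipped_exons` are hypothetical and not taken from the original pipeline. An internal exon is reported when one sequence contains it between two flanking exons and another sequence joins the same flanking boundaries directly.

```python
from collections import defaultdict

def find_skipped_exons(transcripts):
    """Return internal exons that are included in one transcript and
    skipped in another that uses the same flanking exon boundaries.

    transcripts: list of exon lists; each exon is a (start, end) tuple
    and each list is sorted by genomic position (an assumed format).
    """
    # (end of upstream exon, start of downstream exon) -> internal exons seen
    included = defaultdict(set)
    # boundary pairs that are joined directly by a single intron
    joined = set()
    for exons in transcripts:
        for left, right in zip(exons, exons[1:]):
            joined.add((left[1], right[0]))
        for left, mid, right in zip(exons, exons[1:], exons[2:]):
            included[(left[1], right[0])].add(mid)
    return {exon
            for key, mids in included.items() if key in joined
            for exon in mids}

# Toy example: the second transcript skips the exon (300, 350).
toy = [
    [(100, 200), (300, 350), (500, 600)],
    [(100, 200), (500, 600)],
    [(100, 200), (300, 350), (500, 600), (700, 800)],
]
print(find_skipped_exons(toy))
```

The sketch ignores EST confirmation counts, which the procedure above uses to select the negative (constitutively spliced) examples.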

Model Selection Results

  • Alternative Exons (Skipped in one Spliceform)
    The following model parameters were tuned in cross-validation: SVM-C (0.5, 1, 2), sigma (1/L, 0.1/L), kappa (0, 0.05, 0.07, 0.1, 0.14, 0.19, 0.26, 0.37, 0.51, 0.72, 1) and d (10, 15, 33). The optimal parameters for each train-test split were chosen to maximize the fp 1% validation score. Parameters as well as validation and test scores are given in the following table. Here fp 0.5% (1%) stands for the percentage of true positives achieved at a level of 0.5% (1%) false positives. AUC denotes the area under the ROC curve. Note that a different parameter set may be optimal when model selection is based on fp 0.5% or AUC. However, the respective test score is given for model selection based on the corresponding validation score.
                                      validation score              test score
            C     sigma   d   kappa   fp 0.5%  fp 1%    AUC        fp 0.5%  fp 1%    AUC
    split1  1.00   1.00   33  0.37    40.90%   47.65%   90.84%     45.37%   51.85%   89.90%
    split2  1.00  10.00   15  0.19    39.72%   45.54%   89.91%     48.98%   60.20%   92.63%
    split3  0.50  10.00   10  0.72    43.25%   47.15%   90.06%     32.26%   49.46%   92.88%
    split4  2.00  10.00   10  0.26    45.52%   52.81%   91.45%     40.86%   40.86%   88.32%
    split5  2.00   1.00   15  1.00    46.37%   52.75%   91.24%     33.68%   40.00%   88.70%
  • Splice site detection
  • Potential Alternative Exons hidden in Introns
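
The scores reported in the model selection table can be illustrated with a short sketch. It assumes that "fp 1%" denotes the true positive rate reached while the false positive rate stays at or below 1%, and that AUC is computed with the trapezoidal rule; how the original evaluation handled ties or interpolation is not stated, so the helper names and details below are assumptions.

```python
def roc_points(scores, labels):
    """ROC points (fpr, tpr) obtained by lowering the decision threshold.

    labels: 1 for alternatively spliced (positive), 0 for negative.
    Examples with equal scores are processed together.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points

def tpr_at_fpr(points, max_fpr):
    """Largest true positive rate among points with fpr <= max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy SVM outputs, not data from the paper.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]
pts = roc_points(toy_scores, toy_labels)
print(tpr_at_fpr(pts, 0.01), auc(pts))
```

Under these assumptions, model selection for one split would pick the parameter combination with the highest `tpr_at_fpr(pts, 0.01)` on the validation part.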

Wetlab: Experimental Details

  • Materials and Methods

    We considered 21,508 exon triples (only single EST confirmed) for alternative splicing. For 18 randomly selected cases from the 1% top ranked predictions, we performed a confirmation experiment. In 11 experiments we obtained at least two PCR products of appropriate size, while in 5 cases we obtained only one PCR product (see figure below). In two cases the PCR failed and did not lead to a measurable product. For the negative control we correctly obtained only one product and for two of the three positive controls we obtained two products (PCR failed for the third). For 11 test cases and the two positive controls we sequenced the different PCR products and obtained 6 significant sequencing results (including one for a positive control). Out of the 5 significant test cases three exhibited alternative exon usage (verified by aligning the sequences against the genome). Unfortunately, the sequenced products for the remaining positive control did not show evidence for alternative splicing although the exon is known to be alternatively spliced. This indicates that the biological testing setup is not yet optimal, and that further scrutiny might well reveal that more of the candidates predicted by our algorithm do indeed show alternative splicing.
  • Gel Electrophoresis Plot

    The figures below show the gel electrophoresis plot obtained in the wetlab experiment. The control sequences A, B, C, D (A is the negative control, B-D are positive controls) and the first 6 evaluated sequences are shown in the upper plot; the remaining 10 are displayed in the lower plot.

    [Figure: gel electrophoresis plot, upper panel]
    [Figure: gel electrophoresis plot, lower panel]

  • The following table shows the sequence products for the three correctly predicted alternatively spliced exons. For each item the number of bands as well as their sequence are shown.

    ID products product lengths sequenced products
    1 3 110,130,130

    ATATGTGCACTGACCACATGGCCTTACACTGGCAACCACGAACACACTCC
    ACATCTGNGCGATGTNGTCGAGCTATGCGAGGANGTTTCCCTAGTGGATT
    GCGGACA

    CCTCCAGTCCGGTCGAGCAGGAATCGATCACAAAGTAGAGCTTCTTGTCC
    GCAATCCACATAGGGAAATGTCCTCGCATATGCTCGACGTACATCGCGCA
    TCATGTGGAGTGGGNCGTGGTGGCCA

    TCCAGTTCCGTCCGAGCATGAATCGATCACAAAGATGTGGAGTGTGTCGT
    GGTTGCCAGTGTAAGCCATGTGGTCAGTGCACATATTGTCAGGATTCACC
    ACAGTTTGGAGGTCCTGGTGTTAAGAAAAA

    view at wormbase
    3 2 170,132

    GATGCGCACATTGCGACGCGTCAAGCGTGCGCCAACCAGAAATAATCGAC
    CCGAACCGGCTAGTTTGTGGGTCGCAATGGAACTGGTAAATGCGATGATA
    TCCGCATGATCGCGCATCTCATTTTTGCGGAATGGAAGAAGAAGTGTCCG
    CACCTCCGATTCCGCCGCCAGATG

    ATGCGCACATTGCGACGCGCTCAAGCGTGCGCCAACCAGAATGTCCGCAC
    CTCCGTATTCCGCCGCCAGATGAAGGAAAATGCATCATTTCGAAGGCATC
    GGGCCGTGAGATTTGCTACCCATCGTACAGTCA

    view at wormbase
    5 3-4 180,210,140

    GCACAATTCTCCAGCTGATTTGACTGAGGATCAACGGAATGCATATCTTC
    TTCAACTCGAAATTGAGGACGCCACACGGAAACTGCGTCTCGCAGATTTT
    GGAGTCGCCGAGGGAAGAGAACGATCTCCATCTCCTGAACCAGTTTACGA
    CGCAAATGGTAAGCGGTTGAACACTCGTGAAGTGCG

    CACAATTCTACCAGCTGATTTGACTGAGGATCAACGGAATGCATATCTTC
    TTCAACTCGAAATTGAGGACGCCACACGGAAACTACGTCTCGCAGATTTT
    GGAGTCGCCGAGGGAAGAGAACGATCTCCATCTCCTGAACCAGTTTACGA
    CGCAAATGGTAAGCGGTTGAACACTCGTGAAGTGCGGAAACGGCAGGAAT
    TGGAACAGTTGAG

    CACAATTCTACCAGCTGATTTGACTGAGGATCAACGGAATGCATATCTTC
    ATCTCCATCTCCTGAACCAGTTTACGACGCAAATGGTAAGCGGTTGAACA
    CTCGTGAAGTGCGGAAACGGCAGGAATTGGAACAGTTGAGA

    view at wormbase
  • The following table lists the primers used to tag the exon triples in the wetlab experiment. The ID is the sequence ID as in the figure above. The table also gives the length of the gene product including the exon, its length excluding the exon, and the length of the exon itself.

    ID  rank       length w/ exon  length w/o exon  length of exon  left primer              right primer             gene id
    A   neg probe  499             167              332             CCAGCTACAGAAAGTGAGGGA    TTATGATCAAGACCAAACCACG   ZK121.2
    B   pos probe  239             179              60              ACGGATGTCATCGTCTAGTCA    GAGTGTTGATTTGCTTCTGCC    ZK899.8a
    C   pos probe  255             177              78              CTCATGTTAGCTCGCACAGAA    GCCGTTCTGATTGGATATTGA    C18B2.5a
    D   pos probe  564             176              408             GGACGCCGAGGAAGGATA       TGCTCCAGTTGTTTGAGCAC     H19M22.2a
    1   6          239             176              63              ATCGTCACAAAACAGTCAGGC    TTCTTAACACCAGGACCTCCA    Y75B8A.6
    2   8          302             219              83              AGAAGCTTGCCAAGGAAGTTG    CGTCTTGCGTATCCACTGAA     T22B7.1a
    3   30         272             173              99              CGAAGTAGAAAGCCTTGCACA    TGACTGTACGATGGGTAGCAA    K07E12.1
    4   32         234             174              60              ATGAGTTTGACAGCTCCGCTA    CGTACAGTGCAATGACAAAGTG   Y16B4A.1
    5   52         252             179              73              GTCGTTGGTCTACAACAAAAAGT  CTCAACTGTTCCAATTCCTGC    Y116A8C.32
    6   54         226             142              84              CAACAATAAACCTTGAAGAACGA  TTCTGCGCCTCTCTCATACTC    F35G2.5
    7   74         261             153              108             ACATAATTTCCGTCCAAAACG    TTTGGAGCCGAAGAAAGTGTA    ZC416.6
    8   83         305             178              127             AGACAATGACGAGTACGACCG    AAATACGCATCATAGACATTTCG  F12F6.3
    9   99         202             146              56              ATGGCGACAATCAACAGCC      ATTCTATTCGCTTTTCCGACG    B0563.4
    10  107        291             159              132             GACGATTTGGCCTTGGATT      TCGACGAGAGGTGTCTAGAAGA   C24A1.2
    11  108        232             151              81              CACCATGACGAGAAGGTTGAT    TTGTAGCAGACACCTCTCCAAA   T06G6.6
    12  109        235             158              77              CCATCGCCTTGGAGCAAC       GCTACTTGACACTGGCTGAGG    W10D9.4
    13  121        268             139              129             CCGATGAAGTGGAGCTGAAT     TCTGCTTGCCATTGACTCTTT    C08B11.3
    14  122        291             155              136             AAACGGAATCCCGAGCTG       CGTCTTCCACAATTTGTTTGAT   Y51A2D.7a
    15  130        467             179              288             CATCAAGGACAAGAAGTGCAAG   GCAGTGATGATTTTCATGGGT    R06B9.4
    16  134        197             144              53              TATGTCGGCTCCCAGAAATC     TCAAGTGTCCCGTTCAAATTC    C06G1.%
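
As a rough illustration of how the primer table relates to the expected PCR products, the sketch below derives the two expected product lengths (exon included and exon skipped) from the table columns. It also checks that the reverse complement of a right primer appears at the end of a sequenced product. The three rows and the product tail are copied from the tables above; the helper names are hypothetical, and the sketch makes no claim about how the bands were actually analysed.

```python
# Rows copied from the primer table:
# id -> (length w/ exon, length w/o exon, length of exon, right primer)
PRIMER_ROWS = {
    "1": (239, 176, 63, "TTCTTAACACCAGGACCTCCA"),
    "3": (272, 173, 99, "TGACTGTACGATGGGTAGCAA"),
    "5": (252, 179, 73, "CTCAACTGTTCCAATTCCTGC"),
}

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T, N."""
    return seq.translate(COMPLEMENT)[::-1]

def expected_products(with_exon, exon_len):
    """Expected product lengths if the internal exon is included or skipped."""
    return {"included": with_exon, "skipped": with_exon - exon_len}

def is_consistent(with_exon, without_exon, exon_len):
    """True if the three length columns of a row agree with each other."""
    return with_exon - exon_len == without_exon

# Last line of the second sequenced product listed for ID 3.
PRODUCT_TAIL_ID3 = "GGGCCGTGAGATTTGCTACCCATCGTACAGTCA"

for seq_id, (w, wo, ex, right) in PRIMER_ROWS.items():
    print(seq_id, expected_products(w, ex), is_consistent(w, wo, ex))

print(PRODUCT_TAIL_ID3.endswith(reverse_complement(PRIMER_ROWS["3"][3])))
```

Two bands near the "included" and "skipped" lengths are what one would expect on the gel for an alternatively spliced exon, subject to the resolution of the gel.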

Explaining the Learning Results

  • The following figure displays the kernel weights obtained by multiple kernel learning. The importance of the k-mers starting at a certain position in the sequence is shown.

  • The tables below show the twelve most significant k-mers (including e-value and counts) for the following parts of the sequence: intron -70 to -40, intron -30 to 0, exon 30 to 70, exon -90 to -30, intron 0 to 70. K-mers were extracted within each of these regions separately; the regions were chosen because they obtained a high kernel weight (see the MKL plot above).
    intron -70 to -40
    6-mer   e-value  #
    ctaacc  1.2e-17  12
    cccccc  3.8e-11  10
    taaccc  9.8e-10  9
    cacttt  6.2e-09  21
    atcccc  1.6e-07  6
    ctttcc  2.4e-07  12
    ctctat  3.6e-07  9
    tcccct  4.8e-07  7
    actaac  5.3e-07  11
    tctatc  8.4e-07  12
    ttctct  1.1e-06  15
    gtctat  4.2e-06  8

    intron -30 to 0
    6-mer   e-value  #
    cattct  1.3e-09  17
    ctctct  1.9e-09  11
    gcatgt  4.4e-09  7
    gttgtc  4.4e-09  7
    tctcta  2.2e-08  15
    ctctat  1.1e-07  13
    cctatc  1.6e-07  6
    tatcgc  4.8e-07  7
    cactct  5.0e-07  8
    tctaac  5.3e-07  11
    tctatc  5.3e-07  11
    tgtgta  5.3e-07  11

    exon 30 to 70
    6-mer   e-value  #
    agtgag  4.2e-11  18
    tttttt  2.7e-09  21
    atatat  1.3e-08  17
    tatata  3.6e-07  13
    ataggt  4.8e-07  7
    taggtt  5.0e-07  8
    ggtaaa  8.4e-07  12
    caccac  2.2e-06  21
    gtgagt  6.0e-06  14
    aggttt  6.6e-06  12
    taagtt  6.6e-06  12
    tagtat  8.8e-06  6

    exon -90 to -30
    6-mer   e-value  #
    tttaaa  1.8e-12  34
    aatttt  2.2e-10  61
    atttta  2.9e-09  39
    cagcag  1.2e-08  30
    taattt  8.3e-08  30
    ttcccc  2.1e-07  10
    tttttt  5.2e-07  55
    atatat  7.8e-07  18
    atttaa  1.3e-06  21
    taaaaa  1.5e-06  31
    gctagc  5.1e-06  5
    aggcgg  5.9e-06  11

    intron 0 to 70
    6-mer   e-value  #
    tgtgtg  5.9e-31  49
    ttgtgt  1.7e-24  60
    gtgtgt  3.6e-16  34
    gttgtg  4.4e-15  31
    tgttgt  3.3e-14  37
    tgcatg  1.3e-13  22
    gcatgt  7.1e-12  22
    tccttt  1.1e-11  31
    tgtgtt  2.9e-11  42
    gtgttt  2.2e-10  40
    ttttgt  7.2e-10  91
    tgtggt  1.9e-09  18
  • length penalties (ExonSkiper & ExonInIntron)
  • all Penalties for ExonInIntron
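
The count column (#) of the k-mer tables above can be approximated with a simple counting sketch. It assumes each region is given as a slice of a sequence string, and it counts every occurrence of a 6-mer within that slice; whether the original counts refer to occurrences or to sequences containing the k-mer is not stated, and the e-value computation is not reproduced here. The function name and the toy sequences are hypothetical.

```python
from collections import Counter

def count_kmers(sequences, start, end, k=6):
    """Count all k-mers inside the window [start, end) of each sequence.

    Windows containing characters other than a, c, g, t (for example n)
    are skipped for the affected k-mers.
    """
    counts = Counter()
    for seq in sequences:
        window = seq[start:end].lower()
        for i in range(len(window) - k + 1):
            kmer = window[i:i + k]
            if set(kmer) <= set("acgt"):
                counts[kmer] += 1
    return counts

# Toy sequences, not taken from the dataset.
toy_seqs = ["tgtgtgtg", "atgtgtga"]
toy_counts = count_kmers(toy_seqs, 0, 8)
print(toy_counts.most_common(1))
```

Comparing such counts between the alternatively and constitutively spliced examples would be one way to arrive at a significance ranking like the one shown in the tables.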

EST Scan for alt spliced and skipped exons

$Id: index.html,v 1.18 2005/05/20 12:52:10 cvs24 Exp $