README file for training.tar.gz
Uta Schulze, 18-Jul-2006

The following files are in FASTA format:

  training_dna.fa                 (4604 DNA sequences)
  training_query_noise=0.00.fa    (4604 EST/cDNA sequences)
  training_query_noise=0.01.fa    (4604 EST/cDNA sequences)
  training_query_noise=0.10.fa    (4604 EST/cDNA sequences)
  training_query_noise=0.20.fa    (4604 EST/cDNA sequences)
  training_query_noise=0.50.fa    (4604 EST/cDNA sequences)

The description lines in each FASTA file contain an identifier. In
addition, the headers in the files training_dna.fa and
training_query_noise=0.00.fa contain the true exon boundaries created
by an alignment between a DNA sequence and its corresponding EST/cDNA
(query) sequence. They are written in the following way:

  "exon1_start..exon1_end,exon2_start..exon2_end,exon3_start..exon3_end"

For example, a DNA sequence is:

>train_dna1, 1..59,229..243,561..693
atggacgatgagggggagtttgtgttgtatctccgttcactgaccggttaatcagtgagg
tgctactcgtttttttctttgtttgagaattaatgttgatcaacaatacttaggtgaaca
tttgaattaatggcattttacccaaagtttcgaactcgaatatcctattttactggggtt
ttgcactattttccatttattgtaaattcattttatttattttttcagaaatgatatctg
caggtatgcgacttttcagcaaaattgattgtgtatattctagaaggtctctgagtagac
atacttgaaagtatacactccaggagtaactctcctcttattcaagacaaattcaaagca
atgttcatttcctacaatgttcaaatacaaacatgcgaacactattaaattataaaatct
gaagaaaaaaacgttttttttttgacaattaccaaatttatgaaaaagtaactctattag
aaatcattcaaaaaatcacatcggatataagtgtaatttgatttttttttaaactatatg
ccatattacaatattttcaggccctcgacatcgaaaatcccgaaaacgagaatcaactgg
aagtagtggaagttcagaagaagaaaaaagaccacgtactgcattcactggagatcaatt
ggaccggctcaaaactgaattccgggaaagcag

and the corresponding EST sequence is:

>train_noise=0.00_query1, 1..59,60..74,75..207
atggacgatgagggggagtttgtgttgtatctccgttcactgaccggttaatcagtgaga
aatgatatctgcaggccctcgacatcgaaaatcccgaaaacgagaatcaactggaagtag
tggaagttcagaagaagaaaaaagaccacgtactgcattcactggagatcaattggaccg
gctcaaaactgaattccgggaaagcag
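
The header layout described above can be read with a few lines of Python. The following is only a minimal sketch and not part of the archive: the function names (parse_header, exon_lengths, splice_exons) are hypothetical, and the code assumes that coordinates are 1-based and inclusive, which is consistent with the example (the last EST exon ends at 207, the length of the EST sequence). Headers in the noisy query files carry no exon boundaries, so parse_header would raise ValueError on them.

```python
import re

# Assumed header layout, taken from the example in this README:
#   >identifier, start..end,start..end,...
HEADER_RE = re.compile(
    r"^>(?P<id>[^,\s]+)\s*,\s*"
    r"(?P<exons>\d+\.\.\d+(?:\s*,\s*\d+\.\.\d+)*)\s*$"
)


def parse_header(line):
    """Return (identifier, [(start, end), ...]) for one FASTA header line."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("header has no exon boundaries: %r" % line)
    exons = []
    for part in match.group("exons").split(","):
        start, end = part.strip().split("..")
        exons.append((int(start), int(end)))
    return match.group("id"), exons


def exon_lengths(exons):
    """Length of each exon, assuming 1-based inclusive coordinates."""
    return [end - start + 1 for start, end in exons]


def splice_exons(sequence, exons):
    """Concatenate the exon substrings of a sequence (1-based inclusive)."""
    return "".join(sequence[start - 1:end] for start, end in exons)


dna_id, dna_exons = parse_header(">train_dna1, 1..59,229..243,561..693")
est_id, est_exons = parse_header(
    ">train_noise=0.00_query1, 1..59,60..74,75..207"
)

# Each exon should have the same length on the DNA and on the EST side.
print(dna_id, exon_lengths(dna_exons))
print(est_id, exon_lengths(est_exons))
```

For the example pair, both headers give the exon lengths 59, 15 and 133. Applying splice_exons to a DNA sequence with its own boundaries should give a string of the same length as the noise-free query; the two strings need not be identical wherever the alignment contains mismatches.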