ARTS: Accurate Recognition of Transcription Starts in Human
The paper has been published in Bioinformatics and can be downloaded here.
ARTS for Human stand-alone tool
The ARTS stand-alone tool is now available as well.
It requires a recent version of shogun and Python 2.4 or later. Note that this version does not use the linear kernels (they did not improve accuracy anyway).
Human Genome Browser Custom Tracks at a 1/50 resolution for hg17 and 1/500 for hg16
Note: While ARTS' predictions are point-wise, the resolution has been reduced to 1/50 and 1/500, respectively, to reduce traffic. Also note that these scores are real-valued, i.e. no artificial cut-off value has been set. This has the advantage that one may choose the cut-off threshold based on one's own cost function and that the relative promoter activity is visible. Finally, note that non-ACGT bases have been randomly substituted; long stretches of Ns in particular may therefore badly distort the results on chromosome parts that are not yet reliably annotated.
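Because the scores are real-valued, a cut-off can be chosen after the fact. The following is a minimal sketch of that idea and not part of the ARTS tool: the function name `pick_threshold` and the parameters `cost_fp` and `cost_fn` are hypothetical, and it assumes a small labeled sample of scores is at hand.

```python
def pick_threshold(scores, labels, cost_fp=1.0, cost_fn=1.0):
    """Return (threshold, cost) minimising cost_fp * FP + cost_fn * FN.

    A position is called positive when its score is >= threshold.
    labels are +1 (true TSS) or -1 (decoy).
    """
    # Candidate thresholds: every observed score, plus +inf ("predict nothing").
    candidates = sorted(set(scores)) + [float("inf")]
    best_thr, best_cost = None, float("inf")
    for thr in candidates:
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == -1)
        fn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:  # strict "<" keeps the lowest threshold on ties
            best_thr, best_cost = thr, cost
    return best_thr, best_cost
```

Raising `cost_fp` relative to `cost_fn` pushes the chosen threshold upwards, which trades missed TSS for fewer false alarms.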
Datasets
These are the datasets we used for training and testing. They come annotated with the NM identifier (see RefSeq), chromosome name, strand, and TSS locus or start and stop, respectively. The training set is based on dbtss version 4 and human genome version hg16, whereas the test set is based on dbtss version 5 and hg17, i.e. we trained on all NMs of dbtssv4 in hg16 and evaluated only on genes in dbtssv5 in hg17 which were not already in dbtssv4 (dbtssv5 genes were excluded when they overlapped dbtssv4 by more than 30% or when their NM was contained in dbtssv4).
Dataset for Training and Validation (Model Selection)
The datasets used for training and validation (model selection), hg16_training.txt.gz and hg16_validation.txt.gz, are supplied in the form of FASTA files. Each sequence consists of a window of size [-1200,+1200] (2400 nucleotides) around a given genomic locus. The locus is encoded in the identifier: each ID consists of the chromosome ("chr1",..., "chr22", "chrX", "chrY"), the strand (either "+" or "-"), and the position at nucleotide resolution (variable number of digits), all appended without any spaces in between. This ID is followed by the sequence label ("+1" or "-1" for true TSS and decoy, respectively) and by the NM that it was derived from, both separated by a space. The next line contains the corresponding sequence.
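The identifier layout described above can be parsed along the following lines. This is only a sketch in present-day Python: the leading ">" and the example header `>chr1+1234567 +1 NM_000000` are assumptions made for illustration, and `parse_header` and `iter_records` are hypothetical helper names.

```python
import re

# Assumed header layout: >chr1+1234567 +1 NM_000000
# (chromosome, strand and position concatenated; then label; then NM).
HEADER_RE = re.compile(
    r"^>?(chr(?:[0-9]+|X|Y))([+-])([0-9]+)\s+([+-]1)\s+(\S+)\s*$"
)

def parse_header(line):
    """Split one identifier line into its five fields."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError("unrecognised header: %r" % line)
    chrom, strand, pos, label, nm = m.groups()
    return {"chrom": chrom, "strand": strand, "pos": int(pos),
            "label": int(label), "nm": nm}

def iter_records(lines):
    """Yield (header_dict, sequence) from alternating header/sequence lines."""
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if header is None:
            header = parse_header(line)
        else:
            yield header, line
            header = None
```

To read one of the files directly, the lines could come from `gzip.open("hg16_training.txt.gz", "rt")`; each yielded sequence should then be 2400 nucleotides long.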
Dataset ranges used in Training and Validation
In addition to the sequence datasets we used, we provide the ranges from which we sampled the sequences. The file arts_train_validation_ranges_hg16.txt.gz contains the ranges from which positive and negative examples for the ARTS TSS finder were generated for training and validation. Being based on hg16, it contains five columns: the NM identifier, the TSS start as annotated by dbtss version 4, the blat match start, stop, chromosome and strand of the mRNA (as obtained from ncbi nucleotide for the NM identifier). Note that start (stop) should coincide with the TSS position on the positive strand (the reverse strand). We extracted windows around true and false TSS locations for training (see paper for details).
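The strand rule in the note above can be written down as a tiny helper. This is an illustrative sketch only; `tss_from_match` is a hypothetical name, and how the columns are split from a line of the file is deliberately left out because the exact column layout is not assumed here.

```python
def tss_from_match(start, stop, strand):
    """Return the TSS coordinate implied by the blat match of an mRNA.

    On the positive strand the TSS is expected at the match start,
    on the reverse strand at the match stop.
    """
    if strand == "+":
        return start
    if strand == "-":
        return stop
    raise ValueError("strand must be '+' or '-'")
```

Comparing this value with the TSS start annotated by dbtss gives a quick consistency check for each entry.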
Dataset ranges used in Testing
The file arts_test_ranges_hg17.txt.gz contains the ranges where the test performance was evaluated. To reproduce the result, perform a genome-wide prediction and chunk the outputs as described in the paper. This dataset is used to generate the true labels. Please see the paper for more details. The file format is the same as above; however, it is based on hg17 and contains five columns: the NM identifier, the TSS start as annotated by dbtss version 4, the blat match start, stop, chromosome and strand of the mRNA (as obtained from ncbi nucleotide for the NM identifier). Note that start (stop) should coincide with the TSS position on the positive strand (the reverse strand).
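The exact chunking procedure is defined in the paper and is not reproduced on this page. The sketch below only illustrates one plausible reading, in which every block of `chunksize` consecutive positions is represented by its maximum score. Taking the maximum is an assumption, as is the helper name `chunk_scores`.

```python
def chunk_scores(scores, chunksize):
    """Collapse point-wise scores to one value per chunk.

    Assumption: a chunk is represented by its maximum score; the paper's
    procedure may differ. The last chunk may be shorter than chunksize.
    """
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    return [max(scores[i:i + chunksize])
            for i in range(0, len(scores), chunksize)]
```

With this reading, larger chunk sizes yield fewer and coarser predictions, which matches the range of chunk sizes (50 to 5000) reported in the results.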
Results
These are the results in tabular form and also graphically, using the Receiver Operating Characteristic (ROC) curve as well as the Precision Recall Curve (PRC).
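For readers who want to compute comparable summary numbers on their own predictions, here is a small pure-Python sketch. The area under the ROC is computed as the fraction of correctly ordered positive/negative pairs. The area under the PRC is approximated by average precision, which is a common estimator but not necessarily the one used for the tables on this page. Both function names are hypothetical.

```python
def area_under_roc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Mean of the precision values measured at each true positive.

    Examples are ranked by decreasing score; tied scores keep input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(1 for _, y in ranked if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    tp, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            total += tp / rank
    return total / n_pos
```

The pairwise ROC computation is quadratic in the number of examples, so it is meant for small samples rather than genome-wide output.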
Area under the Curves in Tabular Form
dbtssv5-dbtssv4, chunksize 50

TSF | area under ROC | area under PRC |
---|---|---|
Eponine | 88.48% | 11.79% |
McPromoter | 92.55% | 6.32% |
FirstEF | 71.29% | 6.54% |
ARTS | 92.77% | 26.18% |

dbtssv5-dbtssv4, chunksize 100

TSF | area under ROC | area under PRC |
---|---|---|
Eponine | 89.98% | 17.85% |
McPromoter | 92.60% | 8.98% |
FirstEF | 79.46% | 13.11% |
ARTS | 92.66% | 33.35% |

dbtssv5-dbtssv4, chunksize 200

TSF | area under ROC | area under PRC |
---|---|---|
Eponine | 91.52% | 26.69% |
McPromoter | 93.44% | 14.17% |
FirstEF | 85.61% | 23.17% |
ARTS | 93.17% | 43.75% |

dbtssv5-dbtssv4, chunksize 500

TSF | area under ROC | area under PRC |
---|---|---|
Eponine | 91.51% | 40.80% |
McPromoter | 93.59% | 24.23% |
FirstEF | 90.25% | 40.89% |
ARTS | 93.44% | 57.19% |

dbtssv5-dbtssv4, chunksize 1000

TSF | area under ROC | area under PRC |
---|---|---|
Eponine | 92.07% | 52.75% |
McPromoter | 93.80% | 35.43% |
FirstEF | 92.86% | 56.00% |
ARTS | 93.85% | 67.77% |

dbtssv5-dbtssv4, chunksize 2000

TSF | area under ROC | area under PRC |
---|---|---|
Eponine | 92.14% | 61.10% |
McPromoter | 93.05% | 44.65% |
FirstEF | 93.19% | 64.09% |
ARTS | 93.90% | 73.73% |

dbtssv5-dbtssv4, chunksize 5000

TSF | area under ROC | area under PRC |
---|---|---|
Eponine | 91.68% | 67.88% |
McPromoter | 91.32% | 52.21% |
FirstEF | 92.81% | 70.22% |
ARTS | 93.62% | 77.68% |

Area under the Curves Evaluation
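- The next two figures show the area under the ROC (PRC) curve for chunksizes 50, 100, 200, 500, 1000, 2000 and 5000. ARTS achieves the highest area under the PRC at every chunksize and the highest area under the ROC at all chunksizes except 200 and 500, where McPromoter is slightly ahead.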
ROC Curves
- We now show the ROC Curves for chunksizes 50-5000:
- Please note that the 'bumps' in the upper right corner of the FirstEF/Eponine plots for small window sizes are artefacts, caused by these methods not giving predictions for every position. However, the interesting area is on the left (lower false positive rate).
PRC Curves
- We now show the PRC Curves for chunksizes 50-5000: