ARTS: Accurate Recognition of Transcription Starts in Human

The paper has been published in Bioinformatics and can be downloaded here.

ARTS for Human stand-alone tool

The ARTS stand-alone tool is now available as well.

It requires a recent shogun version and Python 2.4 or later. Note that this version does not use the linear kernels (they did not improve accuracy anyway).

Human Genome Browser Custom Tracks at a 1/50 resolution for hg17 and 1/500 for hg16

Note: While ARTS' predictions are point-wise, the resolution has been decreased to 1/50 and 1/500 respectively to reduce traffic. Also note that these scores are real-valued, i.e. no artificial cut-off value has been set. This has the advantage that one may choose the cut-off threshold based on one's own cost function and that the relative promoter activity is visible. Finally, note that non-ACGT bases have been randomly substituted. Long runs of N in particular may therefore completely distort the results on chromosome parts that are not yet reliably annotated.
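Since no cut-off is fixed, a threshold can be derived from a labelled set of scores and a cost function. The following is a minimal sketch, not part of ARTS: the function name `pick_threshold` and the linear cost (a fixed price per false positive and per false negative) are assumptions made for illustration.

```python
def pick_threshold(scores, labels, cost_fp=1.0, cost_fn=1.0):
    """Return (cut-off, cost) minimising cost_fp * FP + cost_fn * FN.

    A position counts as a predicted TSS when its score is >= the cut-off.
    Labels are +1 (true TSS) or -1 (decoy). The cost model is hypothetical.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n_pos = sum(1 for _, y in pairs if y == 1)

    # Start with a cut-off above every score: nothing is predicted,
    # so every true TSS is a false negative.
    best_thr = float("inf")
    best_cost = cost_fn * n_pos

    tp = fp = 0
    i = 0
    while i < len(pairs):
        thr = pairs[i][0]
        # Lower the cut-off to thr; all positions sharing this score flip at once.
        while i < len(pairs) and pairs[i][0] == thr:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        cost = cost_fp * fp + cost_fn * (n_pos - tp)
        if cost < best_cost:
            best_thr, best_cost = thr, cost
    return best_thr, best_cost
```

Raising `cost_fn` relative to `cost_fp` pushes the chosen cut-off down (more positions are called), and vice versa.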

Predictions on the positive strand
  hg16 hg17
Chromosome 1 cust_hg16_1+ cust_hg17_1+
Chromosome 2 cust_hg16_2+ cust_hg17_2+
Chromosome 3 cust_hg16_3+ cust_hg17_3+
Chromosome 4 cust_hg16_4+ cust_hg17_4+
Chromosome 5 cust_hg16_5+ cust_hg17_5+
Chromosome 6 cust_hg16_6+ cust_hg17_6+
Chromosome 7 cust_hg16_7+ cust_hg17_7+
Chromosome 8 cust_hg16_8+ cust_hg17_8+
Chromosome 9 cust_hg16_9+ cust_hg17_9+
Chromosome 10 cust_hg16_10+ cust_hg17_10+
Chromosome 11 cust_hg16_11+ cust_hg17_11+
Chromosome 12 cust_hg16_12+ cust_hg17_12+
Chromosome 13 cust_hg16_13+ cust_hg17_13+
Chromosome 14 cust_hg16_14+ cust_hg17_14+
Chromosome 15 cust_hg16_15+ cust_hg17_15+
Chromosome 16 cust_hg16_16+ cust_hg17_16+
Chromosome 17 cust_hg16_17+ cust_hg17_17+
Chromosome 18 cust_hg16_18+ cust_hg17_18+
Chromosome 19 cust_hg16_19+ cust_hg17_19+
Chromosome 20 cust_hg16_20+ cust_hg17_20+
Chromosome 21 cust_hg16_21+ cust_hg17_21+
Chromosome 22 cust_hg16_22+ cust_hg17_22+
Chromosome X cust_hg16_X+ cust_hg17_X+
Chromosome Y cust_hg16_Y+ cust_hg17_Y+
Predictions on the negative strand
  hg16 hg17
Chromosome 1 cust_hg16_1- cust_hg17_1-
Chromosome 2 cust_hg16_2- cust_hg17_2-
Chromosome 3 cust_hg16_3- cust_hg17_3-
Chromosome 4 cust_hg16_4- cust_hg17_4-
Chromosome 5 cust_hg16_5- cust_hg17_5-
Chromosome 6 cust_hg16_6- cust_hg17_6-
Chromosome 7 cust_hg16_7- cust_hg17_7-
Chromosome 8 cust_hg16_8- cust_hg17_8-
Chromosome 9 cust_hg16_9- cust_hg17_9-
Chromosome 10 cust_hg16_10- cust_hg17_10-
Chromosome 11 cust_hg16_11- cust_hg17_11-
Chromosome 12 cust_hg16_12- cust_hg17_12-
Chromosome 13 cust_hg16_13- cust_hg17_13-
Chromosome 14 cust_hg16_14- cust_hg17_14-
Chromosome 15 cust_hg16_15- cust_hg17_15-
Chromosome 16 cust_hg16_16- cust_hg17_16-
Chromosome 17 cust_hg16_17- cust_hg17_17-
Chromosome 18 cust_hg16_18- cust_hg17_18-
Chromosome 19 cust_hg16_19- cust_hg17_19-
Chromosome 20 cust_hg16_20- cust_hg17_20-
Chromosome 21 cust_hg16_21- cust_hg17_21-
Chromosome 22 cust_hg16_22- cust_hg17_22-
Chromosome X cust_hg16_X- cust_hg17_X-
Chromosome Y cust_hg16_Y- cust_hg17_Y-


These are the datasets we used for training/testing. They come annotated with the NM (see RefSeq), chromosome name, strand, and TSS locus or start and stop respectively. The trainset is based on dbtss version 4 and human genome version hg16, whereas the testset is based on dbtss version 5 and hg17. That is, we trained on all NMs of dbtssv4 in hg16 and evaluated only on genes in dbtssv5 in hg17 which were not already in dbtssv4 (dbtssv5 genes were excluded when they had an overlap of >30% to dbtssv4 or their NM was contained in dbtssv4).
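The exclusion rule for the testset can be sketched as follows. This is an illustration under several assumptions that the text does not settle: the overlap is measured as the fraction of the dbtssv5 gene covered by a dbtssv4 gene, only genes on the same chromosome and strand are compared, and both sets have already been mapped to the same coordinate system. The function names and the tuple layout are hypothetical.

```python
def overlap_fraction(a_start, a_stop, b_start, b_stop):
    """Fraction of the half-open interval [a_start, a_stop) covered by [b_start, b_stop)."""
    length = a_stop - a_start
    if length <= 0:
        return 0.0
    shared = min(a_stop, b_stop) - max(a_start, b_start)
    return max(shared, 0) / length


def keep_for_test(gene, old_genes, old_nms, max_overlap=0.30):
    """Decide whether a dbtssv5 gene stays in the testset.

    gene and each entry of old_genes are (nm, chrom, strand, start, stop) tuples;
    old_nms is the set of NM identifiers present in dbtssv4.
    """
    nm, chrom, strand, start, stop = gene
    if nm in old_nms:
        return False  # NM already contained in dbtssv4
    for _, o_chrom, o_strand, o_start, o_stop in old_genes:
        if (o_chrom, o_strand) != (chrom, strand):
            continue
        if overlap_fraction(start, stop, o_start, o_stop) > max_overlap:
            return False  # overlaps a dbtssv4 gene by more than 30%
    return True
```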

Dataset for Training and Validation (Model Selection)

The datasets used for training and validation (model selection), hg16_training.txt.gz and hg16_validation.txt.gz, are supplied in the form of FASTA files. Each sequence consists of a window of size [-1200,+1200] (2400 nucleotides) around a given genomic locus. The locus is encoded in the identifier: each ID consists of the chromosome ("chr1",..., "chr22", "chrX", "chrY"), the strand (either "+" or "-"), and the position at nucleotide resolution (variable number of digits), all appended without any spaces in between. This ID is followed by the sequence label ("+1" or "-1" for true TSS and decoy, respectively) and by the NM that it was derived from, both separated by a space. The next line contains the corresponding sequence.
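A minimal sketch of a reader for this format is given below. It assumes that each header line may start with the usual FASTA ">" and that every sequence occupies exactly one line, as described above. The names `Record`, `parse_header` and `read_records` are illustrative, and the NM identifiers in the usage note are made up.

```python
import re
from collections import namedtuple

Record = namedtuple("Record", "chrom strand pos label nm seq")

# Chromosome, strand and position are concatenated without separators.
_ID = re.compile(r"^(chr(?:[0-9]+|X|Y))([+-])([0-9]+)$")


def parse_header(line):
    """Split a header into (chrom, strand, pos, label, nm)."""
    ident, label, nm = line.strip().lstrip(">").split()
    match = _ID.match(ident)
    if match is None:
        raise ValueError("unexpected identifier: %r" % ident)
    chrom, strand, pos = match.groups()
    return chrom, strand, int(pos), int(label), nm


def read_records(lines):
    """Yield one Record per header/sequence line pair."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        yield Record(*parse_header(header), seq=seq)
```

To read the compressed file directly, the lines can be taken from `gzip.open("hg16_training.txt.gz", "rt")`. For a header such as `>chr1+12345 +1 NM_000001` the reader yields chromosome "chr1", strand "+", position 12345 and label 1.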

Dataset ranges used in Training and Validation

In addition to the sequence datasets, we provide the ranges from which we sampled the sequences. The file arts_train_validation_ranges_hg16.txt.gz contains the ranges from which positive and negative examples for the ARTS TSS finder were generated for training and validation. Being based on hg16, it contains five columns: the NM identifier, the TSS start as annotated by dbtss version 4, the blat match start, stop, chromosome and strand of the mRNA (as obtained from ncbi nucleotide for the NM identifier). Note that start (stop) should coincide with the TSS position on the positive strand (the reverse strand). We extracted windows around true and false TSS locations for training (see paper for details).
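The relation between blat match and TSS, and the window around it, can be sketched as below. The helper names are hypothetical, and the half-open coordinate convention for the 2400-nucleotide window is an assumption; the exact offsets and the sampling of decoys are described in the paper, not here.

```python
def tss_from_match(start, stop, strand):
    """Expected TSS position of an mRNA match: start on '+', stop on '-'."""
    if strand == "+":
        return start
    if strand == "-":
        return stop
    raise ValueError("strand must be '+' or '-'")


def window_around(tss, flank=1200):
    """Half-open interval [tss - flank, tss + flank) covering 2 * flank bases.

    Assumed convention only; for the '-' strand the extracted sequence
    would additionally have to be reverse-complemented.
    """
    return tss - flank, tss + flank
```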

Dataset ranges used in Testing

The file arts_test_ranges_hg17.txt.gz contains the ranges where the test performance was evaluated. To reproduce the result, perform a genome-wide prediction and chunk the outputs as described in the paper. This dataset is used to generate the true labels. Please see the paper for more details. The file format is the same as above. However, it is based on hg17 and contains five columns: the NM identifier, the TSS start as annotated by dbtss version 4, the blat match start, stop, chromosome and strand of the mRNA (as obtained from ncbi nucleotide for the NM identifier). Note that start (stop) should coincide with the TSS position on the positive strand (the reverse strand).


These are the results in tabular form and also graphically, using the Receiver Operating Characteristic (ROC) curve as well as the Precision Recall Curve (PRC).

Area under the Curves in Tabular Form

dbtssv5-dbtssv4 chunksize 50
TSF area under ROC area under PRC
Eponine 88.48% 11.79%
McPromoter 92.55% 06.32%
FirstEF 71.29% 06.54%
ARTS 92.77% 26.18%
dbtssv5-dbtssv4 chunksize 100
TSF area under ROC area under PRC
Eponine 89.98% 17.85%
McPromoter 92.60% 08.98%
FirstEF 79.46% 13.11%
ARTS 92.66% 33.35%
dbtssv5-dbtssv4 chunksize 200
TSF area under ROC area under PRC
Eponine 91.52% 26.69%
McPromoter 93.44% 14.17%
FirstEF 85.61% 23.17%
ARTS 93.17% 43.75%
dbtssv5-dbtssv4 chunksize 500
TSF area under ROC area under PRC
Eponine 91.51% 40.80%
McPromoter 93.59% 24.23%
FirstEF 90.25% 40.89%
ARTS 93.44% 57.19%
dbtssv5-dbtssv4 chunksize 1000
TSF area under ROC area under PRC
Eponine 92.07% 52.75%
McPromoter 93.80% 35.43%
FirstEF 92.86% 56.00%
ARTS 93.85% 67.77%
dbtssv5-dbtssv4 chunksize 2000
TSF area under ROC area under PRC
Eponine 92.14% 61.10%
McPromoter 93.05% 44.65%
FirstEF 93.19% 64.09%
ARTS 93.90% 73.73%
dbtssv5-dbtssv4 chunksize 5000
TSF area under ROC area under PRC
Eponine 91.68% 67.88%
McPromoter 91.32% 52.21%
FirstEF 92.81% 70.22%
ARTS 93.62% 77.68%
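For readers who want to compute comparable numbers on their own predictions, the following sketch derives both areas from per-chunk scores and +1/-1 labels. It is not the evaluation code behind the tables above: the ROC area is integrated with the trapezoidal rule, the PRC area is approximated step-wise (average precision), and the function name `roc_prc_areas` is an assumption. A different interpolation of the PRC gives slightly different values.

```python
def roc_prc_areas(scores, labels):
    """Return (area under ROC, area under PRC) for scores with +1/-1 labels."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    au_roc = au_prc = 0.0
    i = 0
    while i < len(pairs):
        thr = pairs[i][0]
        # Lower the threshold to thr; tied scores are handled as one step.
        while i < len(pairs) and pairs[i][0] == thr:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr = tp / n_pos            # recall
        fpr = fp / n_neg
        precision = tp / (tp + fp)
        au_roc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0  # trapezoid
        au_prc += (tpr - prev_tpr) * precision               # step-wise
        prev_tpr, prev_fpr = tpr, fpr
    return au_roc, au_prc
```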

Area under the Curves Evaluation

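  • The next two figures show the area under the ROC (PRC) curve for chunksizes 50, 100, 200, 500, 1000, 2000 and 5000. ARTS achieves the highest area under the PRC at every chunksize, and the highest area under the ROC at every chunksize except 200 and 500, where McPromoter is marginally ahead.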

ROC Curves

  • We now show the ROC Curves for chunksizes 50-5000:
  • Please note that the 'bumps' in the upper right corner of the FirstEF/Eponine plots for small window sizes are artefacts, caused by these methods not giving predictions for every position. However, the interesting area is on the left (low false positive rate).

PRC Curves

  • We now show the PRC Curves for chunksizes 50-5000: