mTiM: margin-based transcript mapping from RNA-seq
Recent advances in high-throughput cDNA sequencing (RNA-seq) technology have made it a powerful tool for transcriptome studies. A pivotal step in the analysis of RNA-seq data is the accurate reconstruction of expressed transcripts. Computational approaches to this problem can be broadly categorized into three classes. The first comprises methods that attempt to resolve transcript structures by assembling RNA-seq reads alone. The second comprises methods that rely on alignments of RNA-seq reads to a reference genome and reconstruct transcripts from these alignments. The third comprises gene finding-related methods that predict protein-coding transcripts from signals encoded in the genome sequence and use the read alignments as additional information to improve the predictions.
Here we propose a method combining the benefits of the latter two approaches. Our machine learning-based transcript reconstruction method, which we call mTiM (margin-based transcript mapping), exploits features derived from spliced and unspliced RNA-seq read alignments and from computational splice site predictions to infer the exon-intron structure of the corresponding transcripts.
The inference technique used to train mTiM on RNA-seq data aligned in regions of well-annotated transcript structures is based on Hidden Markov Support Vector Machines (HMSVMs). This machine learning technique is related to Hidden Markov Models, which are employed in many gene finding systems, but HMSVMs are trained using a discriminative, large-margin approach, here with a novel bundle method for efficient parameter optimization. Parameter learning in general and the discriminative training algorithm in particular have been shown to confer high noise tolerance in related applications.
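To make the large-margin idea concrete, the following is a minimal sketch, not mTiM's actual model or optimizer. It assumes a hypothetical three-state label set, two hypothetical per-position features (read coverage and the number of spliced reads spanning a position), hand-picked weights, and a Hamming loss between label sequences. It shows Viterbi decoding of the best-scoring labeling and the margin-rescaled structured hinge loss that a discriminative trainer would minimize; the bundle method used for optimization is not shown.

```python
import numpy as np

# Hypothetical three-state label set; the real state model is assumed to be richer.
STATES = ["intergenic", "exon", "intron"]

# Assumed transition structure: intergenic <-> exon <-> intron, plus self-loops.
ALLOWED = np.array([[1, 1, 0],
                    [1, 1, 1],
                    [0, 1, 1]], dtype=bool)


def viterbi(emission, transition):
    """Return the highest-scoring state path and its score.

    emission:   (T, S) array of per-position state scores
    transition: (S, S) array of scores for moving from state i to state j
    """
    T, S = emission.shape
    trans = np.where(ALLOWED, transition, -np.inf)  # forbid disallowed moves
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = emission[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + trans        # rows: previous, cols: current
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + emission[t]
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):                   # follow back-pointers
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return path, float(score[-1].max())


def path_score(path, emission, transition):
    """Score of one given labeling under the same linear model."""
    total = emission[0, path[0]]
    for t in range(1, len(path)):
        total += transition[path[t - 1], path[t]] + emission[t, path[t]]
    return float(total)


def structured_hinge(true_path, emission, transition):
    """Margin-rescaled hinge loss with a Hamming loss between labelings."""
    T, S = emission.shape
    loss_aug = np.ones((T, S))
    loss_aug[np.arange(T), true_path] = 0.0         # no penalty for the true label
    _, aug_score = viterbi(emission + loss_aug, transition)
    return max(0.0, aug_score - path_score(true_path, emission, transition))


# Toy locus: two covered exons separated by an intron supported by spliced reads.
coverage = np.array([0, 3, 3, 0, 0, 3, 3, 0], dtype=float)
spliced = np.array([0, 0, 0, 2, 2, 0, 0, 0], dtype=float)
features = np.stack([coverage, spliced], axis=1)    # shape (T, 2)

# Hand-picked illustrative weights (rows: features, columns: STATES).
W = np.array([[-1.0, 1.0, -1.0],
              [-1.0, -1.0, 1.0]])
bias = np.array([0.5, 0.0, 0.0])                    # mild prior for intergenic

emission = features @ W + bias
transition = np.zeros((3, 3))

path, score = viterbi(emission, transition)
labels = [STATES[s] for s in path]
hinge = structured_hinge(path, emission, transition)
print(labels, score, hinge)
```

In this sketch the hinge loss is nonzero even though the decoded labeling is correct, because the two flanking intergenic positions beat the alternatives by less than the required margin; a large-margin trainer would adjust the weights to widen that gap.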
In contrast to most gene finding systems, mTiM is strongly evidence-based and models only very few genic sequence motifs (only splice sites), whereas most gene finders are more strongly sequence-based with a much more complex model of genic sequence characteristics. Most importantly, mTiM does not require an open reading frame (it does not model coding sequence at all) and is thus able to predict noncoding transcripts as well. Unlike purely alignment-based methods such as Cufflinks or Scripture, it can fill gaps in the read coverage, an advantage for predicting complete transcripts, in particular for weakly expressed genes. For instance, since read coverage is only one of several features used to detect transcript boundaries, other features (such as splice site predictions) can help to distinguish introns that lack strong alignment support from intergenic regions.
We applied mTiM to strand-specific, paired-end Illumina RNA-seq data from C. elegans (2 x 76 bp). Reads had been aligned to the genome independently with different methods so that we could assess the influence of alignment quality on subsequent transcript reconstruction. We used TopHat as well as PALMapper in combination with postprocessing routines to filter out uncertain alignments. We evaluated the accuracy of mTiM's transcript predictions using annotated genes as a benchmark and compared it to that of transcripts reconstructed by Cufflinks.
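The text does not specify which accuracy measures were used. As one plausible illustration, the sketch below computes transcript-level sensitivity and precision by matching intron chains between predicted and annotated transcripts, a common convention in transcript reconstruction benchmarks. The function names, the tuple layout (chromosome, strand, exon list), and the treatment of single-exon transcripts are assumptions made for this example.

```python
def transcript_key(chrom, strand, exons):
    """Build a comparison key from a transcript's exon list.

    exons is a list of (start, end) tuples. Multi-exon transcripts are keyed
    by their intron chain, so differing transcript ends do not prevent a match.
    Single-exon transcripts are keyed by the exon itself (an assumption).
    """
    exons = sorted(exons)
    introns = tuple((left_end, right_start)
                    for (_, left_end), (right_start, _) in zip(exons, exons[1:]))
    if introns:
        return (chrom, strand, "introns", introns)
    return (chrom, strand, "single", exons[0])


def transcript_accuracy(predicted, annotated):
    """Return sensitivity, precision and F-score at the transcript level."""
    pred = {transcript_key(*t) for t in predicted}
    anno = {transcript_key(*t) for t in annotated}
    hits = len(pred & anno)
    sensitivity = hits / len(anno) if anno else 0.0
    precision = hits / len(pred) if pred else 0.0
    total = sensitivity + precision
    f_score = 2 * sensitivity * precision / total if total else 0.0
    return {"sensitivity": sensitivity, "precision": precision, "f_score": f_score}


# Toy data with made-up coordinates; not taken from the C. elegans experiments.
annotated = [
    ("I", "+", [(100, 200), (300, 400)]),
    ("I", "-", [(500, 600)]),
]
predicted = [
    ("I", "+", [(90, 200), (300, 400)]),      # same intron, shifted 5' end
    ("I", "+", [(1000, 1100), (1200, 1300)]), # no annotated counterpart
]

result = transcript_accuracy(predicted, annotated)
print(result)
```

Running the same function on the output of two reconstruction tools against one annotation gives directly comparable numbers, which is the kind of comparison described above.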
From our experiments we conclude that mTiM's transcript reconstruction accuracy is virtually as good as that of a state-of-the-art method on carefully curated RNA-seq alignments. Moreover, it is considerably more tolerant of alignment errors present in the results of widely used RNA-seq alignment tools, resulting in improved transcript predictions for most RNA-seq alignments analyzed. Its robustness can be attributed to the fact that mTiM explicitly models noise in its input features and is thus able to evaluate alignment quality and, to some extent, correct alignment errors if there is strong evidence from other features. These advantages will make mTiM an ideal tool for leveraging the growing wealth of RNA-seq data to accurately (re-)annotate genomes.