CodingQuarry: Highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts

Bibliographic Details
Main Authors: Testa, Alison, Hane, James, Ellwood, Simon, Oliver, Richard
Format: Journal Article
Published: Biomed Central Ltd 2015
Subjects: Gene prediction; Generalised hidden Markov model; Fungi; Gene annotation
Online Access: http://hdl.handle.net/20.500.11937/38071
author Testa, Alison
Hane, James
Ellwood, Simon
Oliver, Richard
building Curtin Institutional Repository
collection Online Access
description Background: The impact of gene annotation quality on functional and comparative genomics makes gene prediction an important process, particularly in non-model species, including many fungi. Sets of homologous protein sequences are rarely complete with respect to the fungal species of interest and are often small or unreliable, especially when closely related species have not been sequenced or annotated in detail. In these cases, protein homology-based evidence fails to correctly annotate many genes, or significantly improve ab initio predictions. Generalised hidden Markov models (GHMM) have proven to be invaluable tools in gene annotation and, recently, RNA-seq has emerged as a cost-effective means to significantly improve the quality of automated gene annotation. As these methods do not require sets of homologous proteins, improving gene prediction from these resources is of benefit to fungal researchers. While many pipelines now incorporate RNA-seq data in training GHMMs, there has been relatively little investigation into additionally combining RNA-seq data at the point of prediction, and room for improvement in this area motivates this study. Results: CodingQuarry is a highly accurate, self-training GHMM fungal gene predictor designed to work with assembled, aligned RNA-seq transcripts. RNA-seq data informs annotations both during gene-model training and in prediction. Our approach capitalises on the high quality of fungal transcript assemblies by incorporating predictions made directly from transcript sequences. Correct predictions are made despite transcript assembly problems, including those caused by overlap between the transcripts of adjacent gene loci. Stringent benchmarking against high-confidence annotation subsets showed CodingQuarry predicted 91.3% of Schizosaccharomyces pombe genes and 90.4% of Saccharomyces cerevisiae genes perfectly. These results are 4-5% better than those of AUGUSTUS, the next best performing RNA-seq driven gene predictor tested. Comparisons against whole genome Sc. pombe and S. cerevisiae annotations further substantiate a 4-5% improvement in the number of correctly predicted genes. Conclusions: We demonstrate the success of a novel method of incorporating RNA-seq data into GHMM fungal gene prediction. This shows that a high quality annotation can be achieved without relying on protein homology or a training set of genes. CodingQuarry is freely available (https://sourceforge.net/projects/codingquarry/), and suitable for incorporation into genome annotation pipelines.
first_indexed 2025-11-14T08:52:50Z
format Journal Article
id curtin-20.500.11937-38071
institution Curtin University Malaysia
institution_category Local University
last_indexed 2025-11-14T08:52:50Z
publishDate 2015
publisher Biomed Central Ltd
recordtype eprints
repository_type Digital Repository
spelling curtin-20.500.11937-38071 2017-09-13T14:14:26Z 10.1186/s12864-015-1344-4 fulltext
title CodingQuarry: Highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts
topic Gene prediction
Generalised hidden Markov model
Fungi
Gene annotation
url http://hdl.handle.net/20.500.11937/38071