Benchmarking Fungal Annotation Pipeline

The complex intron-exon structure of eukaryotic genes makes them challenging to predict. The quality of gene prediction in eukaryotic genomes can be improved by combining different gene prediction approaches (ab initio, homology-based, EST-based, synteny-based, or combinations of these) with experimental data (transcriptomics, proteomics, etc.). In the course of annotating fungal genomes, we compared different gene predictors and annotation pipelines to assess and refine our annotation strategies for future genomes. The results of two such tests are presented here:

1. Annotation of Heterobasidion annosum genome (Dec 2008)

  • Results: Several gene predictors and annotation pipelines were used to annotate the genome of the fungus H. annosum v1.0, and the accuracy of gene prediction was compared based on homology and EST support. The combination of tools used in the JGI annotation pipeline predicted the largest gene set, with the highest counts in most support categories.

    | Metric                                                         | EuGene [1] | GeneMark [2] | FgenesH [3] | JGI Pipe [4,5] |
    |----------------------------------------------------------------|------------|--------------|-------------|----------------|
    | Number of predicted gene models                                | 11,547     | 9,609        | 8,409       | 12,270         |
    | with partial EST support                                       | 5,544      | 3,829        | 4,567       | 5,248          |
    | with full-length EST support                                   | 2,538      | 1,182        | 2,896       | 3,073          |
    | with homology support                                          | 6,758      | 6,043        | 5,750       | 7,214          |
    | with strong homology support (>80% aa identity, >80% coverage) | 112        | 109          | 174         | 187            |
    | with homology and EST support                                  | 2,894      | 2,172        | 2,720       | 2,953          |
    | Average EST coverage per gene                                  | 77.7%      | 68.2%        | 80.8%       | 79.1%          |
    | Supported splice sites                                         | 41,581     | 40,808       | 45,498      | 47,671         |
    | Average homology coverage per gene                             | 64%        | 60%          | 68%         | 69%            |

    EuGene models were built and provided by a collaborator. All models were used in the JGI pipeline. EST support was computed from 40,807 ESTs and 10,126 EST cluster consensus sequences mapped by BLAT; protein homology was computed by BLAST against NCBI NR.
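The "strong homology support" row applies two thresholds to each gene model. A minimal sketch of such a filter follows, assuming that coverage means the percentage of the predicted protein spanned by the alignment (the text does not say whether coverage is measured on the query or on the database hit). The `Hit` record, its field names, and the example values are hypothetical and are not taken from the JGI pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One protein alignment for a gene model (hypothetical record layout)."""
    gene_id: str
    percent_identity: float  # amino-acid identity over the aligned region, 0-100
    query_start: int         # 1-based, inclusive
    query_end: int           # 1-based, inclusive
    query_length: int        # length of the predicted protein


def query_coverage(hit: Hit) -> float:
    """Percent of the predicted protein spanned by the alignment (assumed definition)."""
    aligned = hit.query_end - hit.query_start + 1
    return 100.0 * aligned / hit.query_length


def is_strong(hit: Hit, min_identity: float = 80.0, min_coverage: float = 80.0) -> bool:
    # Strict inequalities mirror the ">80%" wording of the table row.
    return hit.percent_identity > min_identity and query_coverage(hit) > min_coverage


def count_supported(hits):
    """Return (genes with any hit, genes with at least one strong hit)."""
    with_any_hit = {h.gene_id for h in hits}
    with_strong_hit = {h.gene_id for h in hits if is_strong(h)}
    return len(with_any_hit), len(with_strong_hit)


# Made-up alignments for three gene models.
example_hits = [
    Hit("g1", 92.5, 1, 300, 310),   # high identity, covers most of the protein
    Hit("g1", 60.0, 1, 310, 310),   # full coverage but low identity
    Hit("g2", 85.0, 50, 200, 400),  # high identity but covers under half
    Hit("g3", 45.0, 10, 120, 150),  # weak on both criteria
]
print(count_supported(example_hits))
```

With these made-up alignments, all three gene models have a hit but only `g1` passes both thresholds, so the call returns `(3, 1)`.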

2. Comparison of MAKER and JGI Annotation pipeline (Oct 2011)

  • Results: The publicly available annotation pipeline MAKER [6] was compared with the JGI annotation pipeline [4,5]. For the basidiomycete Dichomitus squalens, the JGI pipeline predicted more genes with better support using several lines of evidence.

    | Metric                          | MAKER [6] | JGI Annotation pipeline [4,5] |
    |---------------------------------|-----------|-------------------------------|
    | Number of predicted gene models | 9,940     | 12,290                        |
    | with Swissprot hits             | 6,521     | 7,356                         |
    | with non-repeat PFAM domains    | 5,365     | 6,010                         |
    | with EST support                | 9,252     | 10,796                        |
    | with >90% EST support           | 7,729     | 9,178                         |
    | Number of unique PFAM domains   | 2,207     | 2,245                         |
    | Average EST coverage per gene   | 93.0%     | 93.3%                         |
    | Splice sites supported by ESTs  | 99,627    | 102,200                       |

    Inputs: Assembly v1.0 of D. squalens, 359,410 protein seeds from NCBI NR, and 16,501 EST cluster consensus sequences mapped by BLAT to the assembly. MAKER used the following gene predictors: Exonerate, FgenesH (same parameters as in the JGI pipeline), and Augustus. All genes were blasted against the same Swissprot set of 530,264 protein sequences (downloaded Jul 5, 2011), EST sequences, and the PFAM database (Pfam_v21).
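Because the two pipelines report different numbers of gene models, the counts in the comparison can also be expressed as a fraction of each pipeline's models. The sketch below does only that arithmetic on the published counts; the dictionary keys and the function name are illustrative choices, not part of either pipeline.

```python
# Counts copied from the MAKER vs. JGI comparison for D. squalens.
counts = {
    "MAKER": {
        "models": 9940,
        "swissprot_hits": 6521,
        "pfam_domains": 5365,
        "est_support": 9252,
        "est_support_gt90": 7729,
    },
    "JGI": {
        "models": 12290,
        "swissprot_hits": 7356,
        "pfam_domains": 6010,
        "est_support": 10796,
        "est_support_gt90": 9178,
    },
}


def support_fractions(row):
    """Divide each support count by the number of predicted gene models."""
    total = row["models"]
    return {key: value / total for key, value in row.items() if key != "models"}


for pipeline, row in counts.items():
    fractions = support_fractions(row)
    summary = ", ".join(f"{key}={value:.1%}" for key, value in fractions.items())
    print(f"{pipeline}: {summary}")
```

Absolute counts and per-model fractions answer different questions (how many supported genes were found versus what share of the predictions is supported), so both views can be useful when reading the comparison.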

  • References:
    1. Schiex T, Moisan A, Rouzé P. (2001) EuGène: an eukaryotic gene finder that combines several sources of evidence. In Computational Biology, selected papers from JOBIM 2000, LNCS 2066, Springer Verlag, pp. 118–133.
    2. Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M. (2008) Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res. 18(12):1979-90.
    3. Solovyev V, Kosarev P, Seledsov I, Vorobyev D. (2006) Automatic annotation of eukaryotic genes, pseudogenes and promoters. Genome Biol. 7 Suppl 1:S10.1-12.
    4. Grigoriev IV, Martinez DA, Salamov AA. (2006) Fungal genomic annotation. In Applied Mycology and Biotechnology (Eds. Arora DK, Berka RM, Singh GB), Elsevier Press, Vol 6 (Bioinformatics), 123-142.
    5. http://genome.jgi.doe.gov/programs/fungi/FungalGenomeAnnotationSOP.pdf
    6. Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sánchez Alvarado A, Yandell M. (2008) MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18(1):188-96.