Predicted protein-coding genes in IMG and IMG/M form the basis for most analyses and derivative data. However, the quality of these gene models is unknown and is expected to vary greatly with sequence quality and with the software used for annotation. Preliminary analysis indicates that about 10% (~1 million) of the predicted protein-coding genes in IMG 3.2 are erroneous: they are false-positive genes, unidentified pseudogene fragments or genes with translational exceptions, or genes with incorrectly predicted start sites. To improve the consistency of annotation and the quality of predicted genes, JGI has recently launched a project to re-annotate all public microbial genomes.
A pilot re-annotation project includes 100 finished genomes. This set comprises phylogenetically diverse genomes sequenced and annotated by JGI and JCVI prior to 2005. The genomes were permuted and re-annotated using the IMG-ER pipeline, which identifies structural RNAs, some repeat elements and protein-coding genes. The re-annotated genomes were then run through GenePRIMP, the JGI gene quality assessment pipeline; its results were used to correct gene models automatically, including insertion of missed genes, extension of "short" genes and identification of putative pseudogenes. Protein-coding genes are currently being mapped between the three annotation versions (original annotation, ab initio IMG-ER annotation and GenePRIMP-corrected annotation), and the results are being prepared for manual review.
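
As an illustration of what mapping protein-coding genes between two annotation versions can involve, the following minimal Python sketch pairs genes that share a contig, strand and 3' end, then separates identical models from those that differ only in their predicted start site. Matching on the 3' end is an assumption made for this example, not a description of the method used in the pilot project; the `Gene` record, its fields and the `map_genes` function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; coordinates are 1-based and inclusive."""
    gene_id: str
    contig: str
    start: int   # leftmost coordinate on the contig
    end: int     # rightmost coordinate on the contig
    strand: str  # '+' or '-'


def three_prime_key(gene):
    """Contig, strand and 3' end: the stop side of the gene."""
    pos = gene.end if gene.strand == "+" else gene.start
    return (gene.contig, gene.strand, pos)


def five_prime(gene):
    """5' end: the predicted start site."""
    return gene.start if gene.strand == "+" else gene.end


def map_genes(old, new):
    """Pair genes from two annotation versions by shared 3' end.

    Assumes each 3' end occurs at most once per annotation version.
    """
    new_by_key = {three_prime_key(g): g for g in new}
    matched = set()
    result = {"identical": [], "start_changed": [],
              "only_old": [], "only_new": []}
    for gene in old:
        key = three_prime_key(gene)
        partner = new_by_key.get(key)
        if partner is None:
            # No gene in the new version ends at this position.
            result["only_old"].append(gene.gene_id)
            continue
        matched.add(key)
        pair = (gene.gene_id, partner.gene_id)
        if five_prime(gene) == five_prime(partner):
            result["identical"].append(pair)
        else:
            result["start_changed"].append(pair)
    result["only_new"] = [g.gene_id for g in new
                          if three_prime_key(g) not in matched]
    return result
```

Under these assumptions, genes listed in `only_new` would be candidates for missed genes and those in `start_changed` candidates for start-site corrections, which are the kinds of differences a manual review would need to examine. Comparing three versions could be done by applying the same function to each pair of versions.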