Comparative Analysis Methods and Tools
Genome annotation and analysis require the development and validation of new algorithms and tools. Directions of this development include methods to analyze eukaryotic genome organization (tandem and segmental duplication; gene-based synteny, including across multiple related genomes), gene structure (intron conservation or loss across genomes), gene gain and loss (detecting possible errors in automated clustering results used for gene family analysis, building whole-genome phylogenetic trees from clustering results, and Pfam domain analysis to detect expanded and lost families), genome evolution, gene expression, genome variation, metabolic pathways, and regulatory elements. A further direction is testing new gene predictors (including those that use RNA-Seq data and synteny-based approaches), pipelines (e.g., MAKER), repeat-finding software, and non-coding RNA-finding software against validated gene sets for accuracy and speed. This project aims at (1) developing algorithms and prototypes of new genome analysis methods for publication and (2) testing new gene prediction and genome analysis tools for possible integration into the production annotation process.
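As an illustration of what testing a gene predictor against a validated gene set for accuracy can look like, the sketch below computes exon-level sensitivity and precision by exact coordinate match. This is a common way to score predictors, not necessarily the measure used in this project; the function name `exon_accuracy`, the exon tuple layout, and the coordinates are all hypothetical.

```python
from typing import Set, Tuple

# Hypothetical exon record: (sequence id, start, end, strand).
Exon = Tuple[str, int, int, str]


def exon_accuracy(predicted: Set[Exon], reference: Set[Exon]) -> Tuple[float, float]:
    """Return (sensitivity, precision) counting only exact exon matches."""
    true_positives = len(predicted & reference)
    sensitivity = true_positives / len(reference) if reference else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, precision


# Made-up coordinates: three of four reference exons are predicted exactly,
# and one predicted exon has a shifted splice site.
reference = {
    ("scaffold_1", 100, 219, "+"),
    ("scaffold_1", 300, 384, "+"),
    ("scaffold_1", 450, 649, "+"),
    ("scaffold_1", 700, 820, "+"),
}
predicted = {
    ("scaffold_1", 100, 219, "+"),
    ("scaffold_1", 300, 384, "+"),
    ("scaffold_1", 450, 649, "+"),
    ("scaffold_1", 705, 820, "+"),
}

sensitivity, precision = exon_accuracy(predicted, reference)
print(f"exon sensitivity={sensitivity:.2f} precision={precision:.2f}")
```

Exact matching is strict by design: a single misplaced splice site counts as both a missed reference exon and a false prediction.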
Comparative Gene Modeling
Comparative gene modeling aims to improve the initial gene predictions for a set of closely related organisms and to correct missing or incorrectly predicted genes (incorrect splice sites, chimeras, gene fragments, etc.). The idea is that in closely related genomes most orthologs share the same conserved gene structure. The algorithm maps all gene models predicted in all genomes onto each individual genome and, for each locus, selects from the potentially many competing models the one that most closely resembles the homologous genes from the other genomes. This procedure may be iterated several times until no further change in the gene models is observed.
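The selection-and-iteration loop described above can be sketched for a single orthologous locus as follows. This is a toy illustration under strong assumptions, not the production algorithm: a gene model is reduced to a tuple of exon lengths, mapping models between genomes is replaced by simply pooling every genome's initial model as a candidate everywhere, and `similarity`, `pool_candidates`, and `refine_locus` are hypothetical names.

```python
from typing import Dict, List, Tuple

# Toy stand-in for a gene model: exon lengths in transcript order.
Structure = Tuple[int, ...]


def similarity(a: Structure, b: Structure) -> float:
    """Toy score: fraction of exons with identical length, compared by position."""
    if not a or not b:
        return 0.0
    shared = sum(1 for x, y in zip(a, b) if x == y)
    return shared / max(len(a), len(b))


def pool_candidates(initial: Dict[str, Structure]) -> Dict[str, List[Structure]]:
    """Assumption: every distinct initial model can be mapped onto every genome."""
    pooled: List[Structure] = []
    for structure in initial.values():
        if structure not in pooled:
            pooled.append(structure)
    return {genome: list(pooled) for genome in initial}


def refine_locus(initial: Dict[str, Structure], max_rounds: int = 10) -> Dict[str, Structure]:
    """Re-select each genome's model until the selection stops changing."""
    candidates = pool_candidates(initial)
    chosen = dict(initial)
    for _ in range(max_rounds):
        updated = {}
        for genome, models in candidates.items():
            # Score each candidate against the models currently chosen elsewhere.
            others = [s for g, s in chosen.items() if g != genome]
            updated[genome] = max(
                models, key=lambda m: sum(similarity(m, o) for o in others)
            )
        if updated == chosen:  # no change in gene models: stop iterating
            break
        chosen = updated
    return chosen


# Made-up locus: genomes A and B have the full three-exon model,
# C has a fragment, and D has a chimera with an extra exon.
initial = {
    "A": (120, 85, 200),
    "B": (120, 85, 200),
    "C": (120, 85),
    "D": (120, 85, 200, 310),
}
refined = refine_locus(initial)
print(refined)
```

In this example the fragment and the chimera are both replaced by the three-exon structure supported by the other genomes. A realistic implementation would score candidates by alignment to homologous proteins, since exon lengths rarely match exactly across species, and `max_rounds` is only a guard against selections that oscillate instead of converging.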
For the basidiomycete Dichomitus squalens, reannotation using comparative modeling is compared below with the initial JGI production annotation:
| Metric | JGI annotation pipeline | Comparative modeling |
|---|---|---|
| Number of predicted gene models | 12,290 | 12,802 |
| with Swissprot hits | 7,356 | 7,900 |
| with non-repeat PFAM domains | 6,010 | 6,353 |
| with EST support | 10,796 | 11,105 |
| with >90% EST support | 9,178 | 9,444 |
| Number of unique PFAM domains | 2,245 | 2,322 |
| Average EST coverage per gene | 93.3% | 93.3% |
| Splice sites supported by ESTs | 102,200 | 104,246 |
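
To make the size of each difference easier to read, the short script below computes the absolute and relative change for every count in the table. The numbers are copied from the table; the dictionary layout and variable names are only illustrative, and the average EST coverage row is left out because it is already a percentage and is identical in both annotations.

```python
# Counts from the table: (JGI annotation pipeline, comparative modeling).
counts = {
    "Number of predicted gene models": (12290, 12802),
    "with Swissprot hits": (7356, 7900),
    "with non-repeat PFAM domains": (6010, 6353),
    "with EST support": (10796, 11105),
    "with >90% EST support": (9178, 9444),
    "Number of unique PFAM domains": (2245, 2322),
    "Splice sites supported by ESTs": (102200, 104246),
}

changes = {}
for name, (pipeline, comparative) in counts.items():
    delta = comparative - pipeline
    percent = 100.0 * delta / pipeline  # change relative to the pipeline value
    changes[name] = (delta, percent)
    print(f"{name}: {delta:+d} ({percent:+.1f}%)")
```

Every count increases under comparative modeling; for example, the number of predicted gene models rises by 512.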