This initial assembly of 61 million reads was produced by a collaboration between Steve Rounsley at the University of Arizona and 454 Life Sciences using the Newbler assembly software. An incremental approach was taken: single-end reads were assembled first, and paired-end reads (Sanger and 454) were then added.
Comparison with a cassava unigene set (from a different genotype) suggests that 95% of known cassava genes are represented in the assembly. This result supports the claim that Cassava1.1 is largely complete with respect to "gene space." Beyond the gene space, repeat masking shows that over 100Mb of repetitive sequence was also assembled. We will pursue further improvements to the assembly as better assembly algorithms become available.
Which germplasm was sequenced?
Both the Sanger and 454 sequence data were generated from a partially inbred line, AM560-2, which was generated at CIAT (International Center for Tropical Agriculture) in Cali, Colombia.
This version (Cassava1) of the assembly consists of 11,243 scaffolds spanning 416Mb. Half of the assembled sequence is in the largest 514 scaffolds, each 180kb or larger.
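The "half of the assembled sequence" figure above is what is usually reported as N50 (the scaffold length) and L50 (the scaffold count). As a minimal sketch of how such numbers can be derived from a list of scaffold lengths, assuming the common definition and using a made-up toy list (the function name `n50_l50` and the data are illustrative, not part of the Cassava1 pipeline):

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold lengths.

    N50 is the length of the scaffold at which the running total,
    taken from largest to smallest, first reaches half of the
    assembled sequence. L50 is the number of scaffolds needed.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count


# Toy lengths for illustration only; these are not cassava scaffolds.
toy_lengths = [100, 80, 60, 40, 20]
n50, l50 = n50_l50(toy_lengths)
print(n50, l50)  # prints: 80 2
```

Applied to the real scaffold lengths, the same calculation would be expected to give an L50 of 514 and an N50 of roughly 180kb, matching the figures quoted above.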
Although cassava has an estimated genome size of ~760Mb, this initial assembly spans only 416Mb. We believe that the 416Mb represents nearly all of the genic regions of the genome, and that the missing portion is repetitive sequence that could not be assembled. This is supported by two pieces of evidence:
1. A large fraction of reads (both Sanger and 454) were not used by the assembly software, and these reads were primarily repetitive in nature.
2. Transcripts assembled from publicly available cassava ESTs (P. Rabinowicz, unpublished) were mapped to the genome assembly. We were able to map 95% of the transcripts, indicating near-complete coverage of protein-coding genes in the assembly.
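For a sense of scale, the arithmetic behind the assembled and unassembled portions can be sketched as follows. The two constants come from the text above; the genome size is itself an estimate, so the results are approximate, and the function name is illustrative.

```python
GENOME_SIZE_MB = 760    # estimated cassava genome size (approximate)
ASSEMBLY_SPAN_MB = 416  # span of this initial assembly


def assembled_fraction(span_mb, genome_mb):
    """Fraction of the estimated genome covered by the assembly."""
    return span_mb / genome_mb


fraction = assembled_fraction(ASSEMBLY_SPAN_MB, GENOME_SIZE_MB)
missing_mb = GENOME_SIZE_MB - ASSEMBLY_SPAN_MB
print(f"assembled: {fraction:.1%}, unassembled: {missing_mb} Mb")
# prints: assembled: 54.7%, unassembled: 344 Mb
```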
The current gene set (Cassava1.1) integrates 1.5 million ESTs with homology-based and ab initio gene predictions. 47,164 protein-coding loci have been predicted, of which 24,388 have ESTs covering more than 25% of their length.
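The EST-support criterion above (ESTs covering more than 25% of a locus) can be illustrated with a small sketch. The record layout `(locus_id, locus_length, est_covered_bases)`, the function name, and the toy loci are all assumptions made for illustration; they do not reflect the actual annotation pipeline or its file formats.

```python
def est_supported(loci, min_fraction=0.25):
    """Return IDs of loci whose EST-covered fraction exceeds min_fraction.

    loci: iterable of (locus_id, locus_length, est_covered_bases).
    """
    return [
        locus_id
        for locus_id, length, covered in loci
        if length > 0 and covered / length > min_fraction
    ]


# Hypothetical loci for illustration only.
toy_loci = [
    ("locusA", 1000, 600),  # 60% covered: supported
    ("locusB", 2000, 500),  # exactly 25%: not "more than" 25%
    ("locusC", 1500, 0),    # no EST coverage
]
print(est_supported(toy_loci))  # prints: ['locusA']

# Share of predicted loci meeting the criterion, from the counts above.
supported_share = 24388 / 47164
print(f"{supported_share:.1%}")  # prints: 51.7%
```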
- US Department of Energy Joint Genome Institute (JGI)
This work was performed under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program, and by the University of California, Lawrence Berkeley National Laboratory under contract No. DE-AC02-05CH11231, Lawrence Livermore National Laboratory under Contract No. DE-AC52-07NA27344, and Los Alamos National Laboratory under contract No. DE-AC02-06NA25396.