Glycine max v1.0
Please note that the latest data is available at Phytozome.

Overview
The soybean (Glycine max) genome project was initiated through the DOE-JGI Community Sequencing Program (CSP) by a consortium led by Gary Stacey, Randy Shoemaker, Scott Jackson, Jeremy Schmutz, and Dan Rokhsar. Large-scale shotgun sequencing of soybean began in mid-2006 and was completed in early 2008. A total of ~13 million attempted Sanger shotgun reads were produced and deposited in the NCBI Trace Archive, in accordance with our commitment to early access and the Fort Lauderdale genome data release policy. The present assembly (Glyma1) is the first chromosome-scale assembly of the soybean genome. The current gene set (Glyma1.0) integrates ~1.6 million ESTs with homology-based and ab initio gene predictions. Protein-coding genes have been given identifiers following the convention adopted by the Arabidopsis community. The identifiers take the form Glyma%%g####, where %% is the chromosome number and #### is a numerical index that increases along each chromosome. We expect that these identifiers will be preserved in future releases.
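The identifier convention described above can be sketched in a few lines of Python. This is only an illustration, not part of the release: the helper names parse_glyma_id and sort_by_position are hypothetical, the chromosome field is assumed to be exactly two digits (matching the %% placeholder), and the index is accepted at any width because the text does not state how many digits it has. The identifiers in the usage note are made-up strings in the stated format, not real gene names.

```python
import re

# Assumed pattern: "Glyma", a two-digit chromosome number, "g", then a numeric index.
# The index width is not specified in the text, so any run of digits is accepted.
GLYMA_ID = re.compile(r"^Glyma(\d{2})g(\d+)$")


def parse_glyma_id(identifier):
    """Return (chromosome, index) as integers, or None if the string does not match."""
    match = GLYMA_ID.match(identifier)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def sort_by_position(identifiers):
    """Order identifiers by chromosome, then by index along the chromosome.

    Strings that do not match the assumed pattern are dropped.
    """
    parsed = [(parse_glyma_id(name), name) for name in identifiers]
    return [name for key, name in sorted(p for p in parsed if p[0] is not None)]
```

For example, parse_glyma_id("Glyma01g0200") yields the chromosome number and index as a pair of integers. Because the index increases along each chromosome, sorting on that pair should reproduce the order of genes along the assembly, under the assumptions stated above.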

Assembly
The Glyma1 release was produced by Jeremy Schmutz at the JGI-Stanford Human Genome Center using the Arachne2 assembler in a mode tuned to the highly repetitive soybean genome. The resulting sequence scaffolds were then integrated with soybean genetic and physical maps in collaboration with Steve Cannon and his group at the University of Minnesota.
Comparison with the soybean EST set suggests that more than 98% of known soybean protein-coding genes are represented in the assembly; many of those that are missing are turning out to be contaminants in the EST libraries. This result supports the claim that Glyma1 is largely complete with respect to "gene space." Vast tracts of repetitive sequence are also assembled.
The vast majority of Glycine max ESTs align to the genome at nearly 100% identity, suggesting that Glyma1 is highly accurate in genic regions. We are currently evaluating the base-pair-level accuracy in repetitive regions by comparing the assembly with BAC clones produced for the project. Discrepancies between the shotgun assembly and the independently obtained genetic and physical maps have been manually reviewed and corrected, so there should be no errors in the large-scale structure of the genome.
What about polyploidy? The soybean genome experienced a tetraploidization event an estimated 10-15 million years ago. Homologous regions have diverged sufficiently, however, that they can be assembled separately from one another in the shotgun assembly. Thus both homologs are typically represented in the Glyma1 sequence.

Statistics
Genome Size
Approximately 975 Mb is captured in 20 chromosomes, with a small additional amount of mostly repetitive sequence in unmapped scaffolds.
Loci
66,153 protein-coding loci have been predicted. Each gene was assigned a letter code indicating its level of support.

Collaborators

  • US Department of Energy Joint Genome Institute (JGI)
  • Gary Stacey, Natl. Ctr. for Soybean Biotechnology, Univ. of Missouri
  • Randy Shoemaker, USDA Agricultural Research Service
  • Scott Jackson, Purdue University
  • Bill Beavis, Natl. Center for Genome Resources
  • Daniel Rokhsar, DOE-JGI

Funding

This work was performed under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program, and by the University of California, Lawrence Berkeley National Laboratory under Contract No. DE-AC02-05CH11231, Lawrence Livermore National Laboratory under Contract No. DE-AC52-07NA27344, and Los Alamos National Laboratory under Contract No. DE-AC02-06NA25396.