The 10Gen Data Set
To fulfill the promise of personal whole genome sequencing, it will be critical to compare individual genomes to the reference genome and to one another. Without a data standard, ambiguities and misunderstandings hamper comparative analyses and software development. The 10Gen data set provides the first 10 publicly available individual human genomes in a standardized format, the Genome Variation Format (GVF). These genomes represent a diverse assortment of ethnicities and were produced using a variety of sequencing platforms. Our hope is that the 10Gen data set will be used as a benchmark for personal genomics software development.
The 10Gen data set is based on the single nucleotide variants (SNVs) called by the original group that published each genome. The 10 genomes are listed below, each with a linked reference to the original publication describing how its variants were discovered.
|Genome||Ethnicity||Platform||Reference|
|NA19240||African||SOLiD/HapMap||De la Vega, et al. 2009|
|NA18507||African||Illumina||Bentley, et al. 2008|
|NA18507||African||SOLiD||McKernan, et al. 2009|
|Chinese||Asian||Illumina||Wang, et al. 2008|
|Korean||Asian||Illumina||Ahn, et al. 2009|
|Venter||Caucasian||Sanger||Levy, et al. 2007|
|Watson||Caucasian||Roche 454||Wheeler, et al. 2008|
|NA07022||Caucasian||Complete Genomics||Drmanac, et al. 2009|
|NA12878||Caucasian||SOLiD||De la Vega, et al. 2009|
|Quake||Caucasian||Helicos||Pushkarev, et al. 2009|
The 10Gen data set is available for download from the Sequence Ontology (SO) website, from a public Amazon Simple Storage Service (S3) bucket, and as an Amazon Elastic Block Store (EBS) snapshot that can be mounted as a drive on an Amazon Elastic Compute Cloud (EC2) instance.
Download data from Amazon S3
- All 10 Genomes - 10Gen_1.04_SNV.tar.gz
- NA19240 - 10Gen_NA19240_SOLiD_SNV.gvf.gz
- NA18507 Illumina - 10Gen_NA18507_ILMN_SNV.gvf.gz
- NA18507 SOLiD - 10Gen_NA18507_SOLiD_SNV.gvf.gz
- Chinese - 10Gen_Chinese_SNV.gvf.gz
- Korean - 10Gen_Korean_SNV.gvf.gz
- Venter - 10Gen_Venter_SNV.gvf.gz
- Watson - 10Gen_Watson_CSHL_SNV.gvf.gz
- NA07022 - 10Gen_NA07022_SNV.gvf.gz
- NA12878 - 10Gen_NA12878_SNV.gvf.gz
- Quake - 10Gen_Quake_SNV.gvf.gz
- README - README
- CHANGELOG - CHANGELOG
- MD5 Checksums - md5sum.txt
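After downloading, the files can be checked against the published checksums. The following is a minimal sketch in Python, not part of the original distribution. It assumes that md5sum.txt uses the usual output layout of the md5sum utility (a hex digest, whitespace, then a file name); the function names are hypothetical and chosen for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_lines(lines):
    """Parse 'HEX  filename' lines (assumed md5sum layout) into a dict."""
    expected = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        checksum, name = line.split(None, 1)
        # md5sum marks binary-mode entries with a leading '*'.
        expected[name.lstrip("*")] = checksum.lower()
    return expected


def verify_downloads(checksum_file, directory="."):
    """Map each listed file name to 'ok', 'mismatch', or 'missing'."""
    with open(checksum_file) as handle:
        expected = parse_checksum_lines(handle)
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == checksum:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

With the downloads and md5sum.txt in the current directory, calling verify_downloads("md5sum.txt") returns a dictionary such as one entry per listed file; any file reported as "mismatch" should be downloaded again.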
Mounting data to an Amazon EC2 Machine using an Amazon EBS snapshot
You will need an Amazon Web Services (AWS) account to use the resources within Amazon's Elastic Compute Cloud (EC2). Once you have set up an account, log in to the AWS Management Console, launch a new instance running a Linux operating system, create an EBS volume from the EBS snapshot snap-4d20c920 (10Gen_v1.04_GVF-Linux), and attach that volume to your instance. Use an SSH client to connect to your new instance. On the command line in your SSH session, create a directory to serve as a mount point (for example with mkdir) and then mount the attached volume on it (with mount), making the appropriate changes to the device name and directory for your situation.
For information about how to use Amazon EC2, EBS and S3 please refer to the documentation at Amazon's AWS site.
To read more about the 10Gen data set and the GVF format, or to cite either, please see the paper A standard variation file format for human genome sequences.
We thank all the groups for generating these data sets and depositing them into the public domain. Public access to whole genome sequences is critical for developing the tools and resources (such as this data set) necessary for the emerging field of personal genomics.
The data provided are for research purposes only. The data were obtained from the original publishing groups and may contain errors from one or more of the following sources, or from other unspecified sources: sequencing, read mapping and assembly, variant calling, and format conversion.
The 10Gen data set is limited to SNVs from ten of the first human genomes sequenced because this was the one data type available for all of these genomes. Other types of variants, data from technologies other than sequencing, and data from other organisms are, of course, equally valuable. We have created an additional repository, 10Gen_Plus, to hold a limited number of example GVF files for these other types of data.
- What is this data?
- The 10Gen data set contains single nucleotide variants (SNVs) from 10 recently sequenced human genomes whose sequence data have been made publicly available.
- Who are the people that have been sequenced?
- Some individuals have made both their sequence and their identity publicly available. These include Seong-Jin Kim, Craig Venter, James Watson, and Stephen Quake. The other genomes in the data set are from anonymous individuals participating in related research projects.
- Where did this data come from?
- All of the data in this data set were obtained from the original groups that published the work. Links to those publications are provided in the table above.
- What format is this data in?
- The data are in GVF format, which is an extension of GFF3. More details on the GVF format can be found in the paper A standard variation file format for human genome sequences, or on the GVF format page on this site.
- Are there any tools for analyzing or viewing this data?
- GVF files are valid GFF3, so any tool that analyzes or displays GFF3 can be used. These include, but are not limited to, BioPerl, CGL, Apollo, GBrowse and other GMOD tools.
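Because GVF is valid GFF3, each feature line has nine tab-separated columns, with the ninth holding tag=value attributes separated by semicolons. The following is a minimal sketch of a reader in Python for those who prefer not to install a toolkit; it is not the authors' software. The function names are hypothetical, the example line is made up for illustration and is not taken from the data set, and the attribute tags shown (Variant_seq, Reference_seq) follow the GVF specification as we understand it, so check them against the actual files.

```python
import gzip
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_attributes(field):
    """Split the ninth GFF3 column into a dict of tag -> list of values."""
    attributes = {}
    for pair in field.strip().split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # GFF3 separates multiple values with commas and percent-encodes
        # reserved characters inside each value.
        attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attributes


def parse_gvf_line(line):
    """Parse one feature line; return None for pragmas, comments, blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["attributes"] = parse_attributes(record["attributes"])
    return record


def read_gvf(path):
    """Yield parsed records from a plain or gzip-compressed GVF file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            record = parse_gvf_line(line)
            if record is not None:
                yield record


# Made-up example line for illustration only; not taken from the data set.
EXAMPLE = ("chr1\texample_source\tSNV\t1000\t1000\t.\t+\t.\t"
           "ID=example_1;Variant_seq=A,G;Reference_seq=G")
```

Calling parse_gvf_line(EXAMPLE) returns a dictionary keyed by column name, and read_gvf("10Gen_Venter_SNV.gvf.gz") would iterate over the records of a downloaded file without decompressing it first. For anything beyond quick inspection, an established GFF3 library such as BioPerl is likely the better choice.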