The 10Gen Data Set

Version 1.04


Overview

To fulfill the promise of personal whole genome sequencing it will be critical to compare individual genomes to the reference genome and to one another. Without a data standard, ambiguities and misunderstandings hamper comparative analyses and software development. The 10Gen data set represents the first 10 publicly available individual human genomes in a standardized GVF format. These genomes represent a diverse assortment of ethnicities, and were produced using a variety of sequencing platforms. Our hope is that the 10Gen data set will be used as a benchmark for personal genomics software development.

The Data Set

The 10Gen data set is based on the single nucleotide variants (SNVs) called by the original group that published each genome. Each of the 10 genomes is listed below, along with a reference to the original publication describing how the variants were discovered.

Individual  Ethnicity  Platform      Reference
NA19240     African    SOLiD/HapMap  De la Vega, et al. 2009; HapMap
NA18507     African    Illumina      Bentley, et al. 2008
NA18507     African    SOLiD         McKernan, et al. 2009
Chinese     Asian      Illumina      Wang, et al. 2008
Korean      Asian      Illumina      Ahn, et al. 2009
Venter      Caucasian  Sanger        Levy, et al. 2007
Watson      Caucasian  Roche 454     Wheeler, et al. 2008
NA07022     Caucasian  CGenomics     Drmanac, et al. 2009
NA12878     Caucasian  SOLiD         De la Vega, et al. 2009
Quake       Caucasian  Helicos       Pushkarev, et al. 2009

Data Access

The 10Gen data set is available for download from the Sequence Ontology (SO) website, from a public Amazon Simple Storage Service (S3) bucket, and as an Amazon Elastic Block Store (EBS) snapshot that can be mounted as a drive on an Amazon EC2 instance.


Download data from Amazon S3

Mounting data to an Amazon EC2 Machine using an Amazon EBS snapshot

You will need an Amazon AWS account to use the resources within Amazon's Elastic Compute Cloud (EC2). Once you have set up an account, log on to the AWS Management Console, launch a new instance (with a Linux operating system), create an EBS volume based on the EBS snapshot snap-4d20c920 (10Gen_v1.04_GVF-Linux), and attach that volume to your instance. Use an SSH client to connect to your new instance. On the command line in your SSH console, run the following commands (making the appropriate changes for your situation) to create a directory to serve as a mount point and then mount the new drive.

    # Change /dev/sdf below to the device that you attached the EBS volume to.
    sudo mkdir /mnt/10Gen
    sudo mount /dev/sdf /mnt/10Gen
    # The data should now be available at /mnt/10Gen.
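Once the volume is mounted, a quick check that the GVF files are visible might look like the sketch below. The function name list_gvf is hypothetical. The sketch assumes that the files carry a .gvf extension (as the file names in the CHANGELOG suggest) and are stored uncompressed somewhere under the mount point; the actual layout of the snapshot is not described here.

```shell
# Hypothetical helper: print the size in bytes and the path of every *.gvf
# file found under the given directory (searched recursively).
# Assumption: the files are uncompressed and end in .gvf.
list_gvf() {
    find "$1" -type f -name '*.gvf' | sort | while IFS= read -r f; do
        printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done
}

# Example call; adjust the path if you chose a different mount point.
if [ -d /mnt/10Gen ]; then
    list_gvf /mnt/10Gen
fi
```

If nothing is printed, the mount may have failed, or the files may be stored under a different extension than assumed here.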

For information about how to use Amazon EC2, EBS and S3 please refer to the documentation at Amazon's AWS site.


Reference

To read more about the 10Gen data set and the GVF format, or to cite either, please see:

A standard variation file format for human genome sequences. Reese MG, Moore B, Batchelor C, Salas F, Yandell M, Eilbeck K. In Review

Acknowledgments

We thank all the groups for generating and depositing these data sets into the public domain. Public access to whole genome sequence is critical for developing the tools and resources (such as this data set) necessary for the emerging field of personal genomics.


Disclaimer

The data provided are for research purposes only. The data were obtained from the original publishing groups and may contain errors from one or more of the following sources, or from other unspecified sources: sequencing, read mapping and assembly, variant calling, and format conversion.


10Gen_Plus

The 10Gen data set is limited to SNVs from ten of the first human genomes sequenced because this was the one data type available for all of these genomes. Other types of variants, data from technologies other than sequencing, and data from other organisms are, of course, equally valuable. We have created an additional repository, 10Gen_Plus, to hold limited examples of these types of GVF.


FAQ

What is this data?
The 10Gen data set represents Single Nucleotide Variants (SNVs) from 10 recently sequenced humans whose sequence data has been made publicly available.

Who are the people that have been sequenced?
Some of the sequenced individuals have made both their sequence and their identity publicly available. These include Seong-Jin Kim, Craig Venter, James Watson, and Stephen Quake. Other genomes in the data set are from anonymous individuals participating in related research projects.

Where did this data come from?
All of the data in this data set were obtained from the original groups that published each genome. References to those publications are provided in the table above.

What format is this data in?
The data are in GVF format, which is an extension of GFF3. More details on the GVF format can be found by reading "A standard variation file format for human genome sequences" or by visiting the GVF format page on this site.

Are there any tools for analyzing or viewing this data?
The GVF format is valid GFF3, so any tools that analyze or view GFF3 can be used. These include but are not limited to: BioPerl, CGL, GBrowse and other GMOD tools and Apollo.
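Because GVF files are valid GFF3, standard Unix text tools are also enough for a first look at a file. The sketch below relies only on general GFF3 conventions: pragma lines begin with ##, feature lines have nine tab-separated columns, and the ninth column holds tag=value pairs separated by semicolons. The function names are hypothetical, and the files are assumed to be uncompressed.

```shell
# Print the pragma (##) lines at the top of a GVF file.
gvf_pragmas() {
    grep '^##' "$1"
}

# Count feature lines per sequence ID (column 1), skipping comments/pragmas.
gvf_count_by_seqid() {
    awk -F'\t' '!/^#/ && NF == 9 { n[$1]++ }
                END { for (s in n) print s "\t" n[s] }' "$1" | sort
}

# Print seqid, start, and the value of one attribute tag (for example
# Reference_seq) for every feature line that carries that tag.
gvf_attr() {
    awk -F'\t' -v tag="$2" '
        /^#/ || NF != 9 { next }
        {
            n = split($9, pairs, ";")
            for (i = 1; i <= n; i++) {
                eq = index(pairs[i], "=")
                if (eq > 0 && substr(pairs[i], 1, eq - 1) == tag)
                    print $1 "\t" $4 "\t" substr(pairs[i], eq + 1)
            }
        }
    ' "$1"
}

# Example calls (the path is an assumption about where the file was placed):
#   gvf_pragmas /mnt/10Gen/10Gen_Venter_SNV.gvf
#   gvf_count_by_seqid /mnt/10Gen/10Gen_Venter_SNV.gvf
#   gvf_attr /mnt/10Gen/10Gen_Venter_SNV.gvf Reference_seq
```

This is only a quick inspection aid. For real analyses, the GFF3-aware tools listed above handle details such as escaped characters that this sketch ignores.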

CHANGELOG

Version 1.04 Wed Dec 8 14:31:49 MST 2010
--------------------------------------------------------------------------------
- The Display_name tag was added to the Individual-id pragma for all individuals.
- Added PMID values to the Dbxref tag for the Individual-id pragma for the NA_* genomes.
- 10Gen_NA18507_ILM_SNV.gvf was renamed to 10Gen_NA18507_ILMN_SNV.gvf.
- 10Gen_Watson_CSHL_SNV.gvf was replaced with 10Gen_Watson_SNV.gvf. The previous file obtained from CSHL only had 2,060,590 SNVs. The current file was created from Watson SNVs in dbSNP and novel SNVs obtained from David Wheeler and has 3,261,428 SNVs.
- Incremented all ##gvf-version numbers to 1.04.
- Replaced mixture of tabs and spaces in ##sequence-region pragma with only spaces.
- Made IDs unique in 10Gen_Chinese_SNV.gvf.
- Fixed misspelled source-method pragma name (was source_method) in all files.
- Fixed misspelled phenotype-description pragma name (was phenotype-descriptions) in 10Gen_Chinese_SNV.gvf.

Version 1.03 Thu Oct 28 10:30:51 MDT 2010
--------------------------------------------------------------------------------
- 10Gen_NA18507_SOLiD_SNV.gvf was accidentally truncated in release 1.02 and has been fixed in this release.
- 10Gen_Korean_SNV.gvf, 10Gen_NA07022_SNV.gvf, 10Gen_NA12878_SNV.gvf had their ##gvf-version numbers incremented to 1.02.

Version 1.02 Wed Jul 21 16:20:19 MDT 2010
--------------------------------------------------------------------------------
- Changed the name of 10Gen_NA18507_Sanger_SNV.gvf to 10Gen_NA18507_ILM_SNV.gvf. The Sanger name was misleading: it suggested that the genome had been sequenced by Sanger sequencing, when in fact it was sequenced by Illumina and the variants in this file were called by the Wellcome Trust Sanger Institute.
- 10Gen_Venter_SNV.gvf has changed to correct for cases where the Reference_seq tag had the wrong value.
- All files were regenerated to provide a consistent ordering of attribute tags.
- All files were updated to reflect addition of or changes to the following pragmas: file-version, file-date 2010-07-09 and phenotype-description.
- The 10Gen_Plus data set was started with HapMap Genotyping data for NA19240.

Version 1.01 Mon Mar 22 12:12:11 MDT 2010
--------------------------------------------------------------------------------
- Fixed coordinates from 0-based to 1-based in 10Gen_Venter_SNV.gvf.
- Fixed a few typographical errors in the pragmas of 10Gen_Korean_SNV.gvf and 10Gen_NA12878_SNV.gvf.
- Removed a bad link from the data-source pragma in 10Gen_NA18507_SOLiD_SNV.gvf.
- Updated the README to add the URL for the GVF spec.

Version 1.00 Tues Feb 9 3:35:17 MDT 2010
--------------------------------------------------------------------------------
- Initial data upload. All 10 genomes converted to GVF 1.0 compliant formats.