Nature. 2012 Nov 1;491(7422):56-65.
doi: 10.1038/nature11632.

An integrated map of genetic variation from 1,092 human genomes

1000 Genomes Project Consortium et al. Nature. 2012.

Abstract

By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

Figures

Figure 1. Power and accuracy
a, Power to detect SNPs as a function of variant count (and proportion) across the entire set of samples, estimated by comparison to independent SNP array data in the exome (green) and whole genome (blue). b, Genotype accuracy compared to the same SNP array data as a function of variant frequency, summarised by the r² between true and inferred genotype (coded as 0, 1 and 2) within the exome (green), whole genome after haplotype integration (blue) and whole genome without haplotype integration (red).
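The accuracy measure in panel b can be illustrated with a minimal sketch, assuming genotypes are already coded as 0, 1 or 2 copies of the non-reference allele. The function name `genotype_r2` and the toy vectors are illustrative assumptions, not taken from the paper or from any project software.

```python
def genotype_r2(true_genotypes, inferred_genotypes):
    """Squared Pearson correlation between two genotype vectors coded 0, 1, 2."""
    n = len(true_genotypes)
    if n == 0 or n != len(inferred_genotypes):
        raise ValueError("genotype vectors must be non-empty and of equal length")

    mean_true = sum(true_genotypes) / n
    mean_inferred = sum(inferred_genotypes) / n

    # Sums of cross-products and squared deviations (the 1/n factors cancel in r^2).
    covariance = sum(
        (t - mean_true) * (i - mean_inferred)
        for t, i in zip(true_genotypes, inferred_genotypes)
    )
    var_true = sum((t - mean_true) ** 2 for t in true_genotypes)
    var_inferred = sum((i - mean_inferred) ** 2 for i in inferred_genotypes)

    if var_true == 0 or var_inferred == 0:
        # r^2 is undefined when either vector has no variation.
        return float("nan")
    return covariance ** 2 / (var_true * var_inferred)


# Toy data (hypothetical): array genotypes versus inferred genotypes for 4 samples.
print(genotype_r2([0, 0, 2, 2], [0, 1, 1, 2]))  # 0.5
```

In practice such a statistic would be computed over all sites in a frequency bin rather than over a single site; the binning is omitted here.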
Figure 2. The distribution of rare and common variants
a, Summary of inferred haplotypes across a 100 kb region of chromosome 2 spanning the genes ALMS1 and NAT8, variation in which has been associated with kidney disease. Each row represents an estimated haplotype, with the population of origin indicated on the right. Reference alleles are indicated by the light blue background. Variants (non-reference alleles) above 0.5% frequency are indicated by pink (typed on the high-density SNP array), white (previously known) and dark blue (not previously known). Low-frequency variants (<0.5%) are indicated by blue crosses. Indels are indicated by green triangles and novel variants by dashes below. A large, low-frequency deletion (black line) spanning NAT8 is present in some populations. Multiple structural haplotypes mediated by segmental duplications are present at this locus, including copy number gains, which were not genotyped for this study. Within each population, haplotypes are ordered by total variant count across the region. b, The fraction of variants identified across the project that are found in only one population (white line), are restricted to a single ancestry-based group (defined as in part a, solid colour), are found in all groups (solid black line) and are found in all populations (dotted black line). c, The density of the expected number of variants per kb carried by a genome drawn from each population, as a function of variant frequency (see Supplementary Information). Colours as for part a. Under a model of constant population size, the expected density is constant across the frequency spectrum.
Figure 3. Allele sharing within and between populations
a, Sharing of f2 variants, those found exactly twice across the entire sample, within and between populations. Each row represents the distribution across populations for the origin of samples sharing an f2 variant with the target population (indicated by the left-hand side). The grey bar represents the average number of f2 variants carried by a randomly chosen genome in each population. b, Median length of haplotype identity (excluding cryptically related samples and singleton variants and allowing for up to two genotype errors) between two chromosomes that share variants of a given frequency in each population. Estimates are from 200 randomly sampled regions of 1 Mb each and up to 15 pairs of individuals for each variant. c, The average proportion of variants that are novel (compared to the pilot phase of the project) among those found in regions inferred to have different ancestries within ASW, PUR, CLM and MXL. Error bars represent 95% bootstrap confidence intervals.
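The f2 definition in panel a (variants found exactly twice across the entire sample) can be sketched as a simple tally of which population pairs share each such variant. This is only an illustration under stated assumptions: the input layout, the function name `f2_sharing`, the sample identifiers and the toy data are all hypothetical, and skipping variants carried twice by one individual is a choice made here, not a rule taken from the paper.

```python
from collections import Counter


def f2_sharing(allele_copies):
    """Count population pairs that share f2 variants.

    allele_copies maps a variant id to a list with one (sample_id, population)
    entry per observed copy of the non-reference allele.
    """
    sharing = Counter()
    for copies in allele_copies.values():
        if len(copies) != 2:
            continue  # not an f2 variant: seen once, or more than twice
        (sample_a, pop_a), (sample_b, pop_b) = copies
        if sample_a == sample_b:
            continue  # assumption: a single homozygous carrier is not "sharing"
        # Sort so that (YRI, LWK) and (LWK, YRI) are counted as the same pair.
        sharing[tuple(sorted((pop_a, pop_b)))] += 1
    return sharing


# Toy data (hypothetical variant and sample ids; population codes as in the captions).
toy = {
    "v1": [("s1", "YRI"), ("s2", "YRI")],               # f2, within YRI
    "v2": [("s1", "YRI"), ("s3", "LWK")],               # f2, between YRI and LWK
    "v3": [("s1", "YRI")],                              # singleton, ignored
    "v4": [("s2", "YRI"), ("s3", "LWK"), ("s4", "LWK")],  # three copies, ignored
    "v5": [("s4", "LWK"), ("s4", "LWK")],               # one homozygous carrier, ignored
}
print(f2_sharing(toy))
```

Normalising each row of such a tally by the target population's total would give a distribution of the kind the panel describes.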
Figure 4. Purifying selection within and between populations
a, The relationship between evolutionary conservation (measured by GERP score) and rare variant proportion (fraction of all variants with derived allele frequency < 0.5%) for variants occurring in different functional elements and with different coding consequences. Crosses indicate the average GERP score at variant sites (x-axis) and proportion of rare variants (y-axis) in each category. b, Levels of evolutionary conservation (mean GERP score, top) and genetic diversity (per nucleotide pairwise differences, bottom) for sequences matching the CTCF-binding motif within CTCF-binding peaks as experimentally identified by ChIP-Seq in the ENCODE project (blue) and in a matched set of motifs outside peaks (red). The logo plot shows the distribution of identified motifs within peaks. Error bars represent ± 2 s.e.m.
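The rare variant proportion plotted in panel a (the fraction of variants with derived allele frequency below 0.5%) reduces to a per-category count. The sketch below assumes derived allele frequencies are already grouped by annotation category; the function name, category labels and frequencies are hypothetical and chosen only for illustration.

```python
def rare_variant_proportions(daf_by_category, threshold=0.005):
    """Fraction of variants per category with derived allele frequency < threshold."""
    proportions = {}
    for category, frequencies in daf_by_category.items():
        if not frequencies:
            continue  # no variants observed in this category, so no proportion
        rare = sum(1 for f in frequencies if f < threshold)
        proportions[category] = rare / len(frequencies)
    return proportions


# Toy derived allele frequencies (hypothetical values and category names).
toy_dafs = {
    "synonymous": [0.001, 0.2, 0.05, 0.3],
    "nonsense": [0.001, 0.002, 0.004, 0.1],
}
print(rare_variant_proportions(toy_dafs))
```

Pairing each category's proportion with its mean conservation score would give the two coordinates of one cross in the panel.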
Figure 5. Implications of Phase 1 1000 Genomes data for GWAS
a, Accuracy of imputation of genome-wide SNPs, exome SNPs and indels (using sites on the Illumina 1M array) into 10 individuals of African ancestry (3 LWK, 4 Maasai from Kenya [MKK], 2 YRI) sequenced to high coverage by an independent technology. Only indels in regions of high sequence complexity with frequency >1% are analysed. Deletion imputation accuracy is estimated by comparison to array data (note that this is for a different set of individuals, though with a similar ancestry, and is included on the same plot for clarity). Accuracy is measured by the squared Pearson correlation coefficient between imputed and true dosage across all sites in a frequency range estimated from the 1000 Genomes data. Lines represent whole-genome SNPs (solid), exome SNPs (long dashes), short indels (dotted) and large deletions (short dashes). b, The average number of variants in linkage disequilibrium (r² > 0.5 among EUR) with focal SNPs identified in GWAS, as a function of distance from the index SNP. Lines indicate the number of HapMap, Pilot and Phase 1 variants.
Figure 6

Similar articles

  • An integrated map of structural variation in 2,504 human genomes.
    Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, Zhang Y, Ye K, Jun G, Fritz MH, Konkel MK, Malhotra A, Stütz AM, Shi X, Casale FP, Chen J, Hormozdiari F, Dayama G, Chen K, Malig M, Chaisson MJP, Walter K, Meiers S, Kashin S, Garrison E, Auton A, Lam HYK, Mu XJ, Alkan C, Antaki D, Bae T, Cerveira E, Chines P, Chong Z, Clarke L, Dal E, Ding L, Emery S, Fan X, Gujral M, Kahveci F, Kidd JM, Kong Y, Lameijer EW, McCarthy S, Flicek P, Gibbs RA, Marth G, Mason CE, Menelaou A, Muzny DM, Nelson BJ, Noor A, Parrish NF, Pendleton M, Quitadamo A, Raeder B, Schadt EE, Romanovitch M, Schlattl A, Sebra R, Shabalin AA, Untergasser A, Walker JA, Wang M, Yu F, Zhang C, Zhang J, Zheng-Bradley X, Zhou W, Zichner T, Sebat J, Batzer MA, McCarroll SA; 1000 Genomes Project Consortium; Mills RE, Gerstein MB, Bashir A, Stegle O, Devine SE, Lee C, Eichler EE, Korbel JO. Nature. 2015 Oct 1;526(7571):75-81. doi: 10.1038/nature15394. PMID: 26432246. Free PMC article.
  • A global reference for human genetic variation.
    1000 Genomes Project Consortium; Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. Nature. 2015 Oct 1;526(7571):68-74. doi: 10.1038/nature15393. PMID: 26432245. Free PMC article.
  • Molecular genetic studies of complex phenotypes.
    Marian AJ. Transl Res. 2012 Feb;159(2):64-79. doi: 10.1016/j.trsl.2011.08.001. Epub 2011 Aug 31. PMID: 22243791. Free PMC article. Review.
  • A map of human genome variation from population-scale sequencing.
    1000 Genomes Project Consortium; Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. Nature. 2010 Oct 28;467(7319):1061-73. doi: 10.1038/nature09534. PMID: 20981092. Free PMC article.
  • Small insertions and deletions (INDELs) in human genomes.
    Mullaney JM, Mills RE, Pittard WS, Devine SE. Hum Mol Genet. 2010 Oct 15;19(R2):R131-6. doi: 10.1093/hmg/ddq400. Epub 2010 Sep 21. PMID: 20858594. Free PMC article. Review.
