An integrated map of genetic variation from 1,092 human genomes

1000 Genomes Project Consortium; Goncalo R Abecasis, Adam Auton, Lisa D Brooks, Mark A DePristo, Richard M Durbin, Robert E Handsaker, Hyun Min Kang, Gabor T Marth, Gil A McVean

Nature. 2012 Nov 1;491(7422):56-65.

PMID: 23128226
PMCID: PMC3498066
DOI: 10.1038/nature11632

Abstract

By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

Figures

**Figure 1. Power and accuracy**
a, Power to detect SNPs as a function of variant count (and proportion) across the entire set of samples, estimated by comparison to independent SNP array data in the exome (green) and whole genome (blue). b, Genotype accuracy compared to the same SNP array data as a function of variant frequency summarised by the r² between true and inferred genotype (coded as 0, 1 and 2) within the exome (green), whole genome after haplotype integration (blue) and whole genome without haplotype integration (red).
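The accuracy measure in panel b can be illustrated with a minimal sketch, assuming genotypes are already coded as 0, 1 and 2 copies of the non-reference allele and aligned by sample. The function name `genotype_r2` and the example vectors are hypothetical and are not taken from the project's pipeline.

```python
import math
from statistics import mean


def genotype_r2(true_gt, inferred_gt):
    """Squared Pearson correlation between two genotype vectors coded 0, 1, 2.

    Returns NaN when either vector has no variation (r^2 is undefined).
    """
    if len(true_gt) != len(inferred_gt):
        raise ValueError("genotype vectors must have the same length")
    mean_true = mean(true_gt)
    mean_inf = mean(inferred_gt)
    # Sums of cross-products and squared deviations (unnormalised covariance
    # and variances; the normalising constants cancel in the ratio).
    cov = sum((t - mean_true) * (i - mean_inf)
              for t, i in zip(true_gt, inferred_gt))
    var_true = sum((t - mean_true) ** 2 for t in true_gt)
    var_inf = sum((i - mean_inf) ** 2 for i in inferred_gt)
    if var_true == 0 or var_inf == 0:
        return math.nan
    return (cov * cov) / (var_true * var_inf)


# Hypothetical example: array genotypes versus called genotypes at one site.
array_genotypes = [0, 0, 1, 2]
called_genotypes = [0, 1, 1, 2]
r2 = genotype_r2(array_genotypes, called_genotypes)
```

In practice this would be computed per site and then averaged within frequency bins, which is one plausible reading of "as a function of variant frequency" in the caption.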

**Figure 2. The distribution of rare and common variants**
a, Summary of inferred haplotypes across a 100 kb region of chromosome 2 spanning the genes *ALMS1* and *NAT8*, variation in which has been associated with kidney disease. Each row represents an estimated haplotype, with the population of origin indicated on the right. Reference alleles are indicated by the light blue background. Variants (non-reference alleles) above 0.5% frequency are indicated by pink (typed on the high density SNP array), white (previously known) and dark blue (not previously known). Low frequency variants (<0.5%) are indicated by blue crosses. Indels are indicated by green triangles and novel variants by dashes below. A large, low-frequency deletion (black line) spanning *NAT8* is present in some populations. Multiple structural haplotypes mediated by segmental duplications are present at this locus, including copy number gains, which were not genotyped for this study. Within each population haplotypes are ordered by total variant count across the region. b, The fraction of variants identified across the project that are found in only one population (white line), are restricted to a single ancestry-based group (defined as in part a, solid colour), are found in all groups (solid black line) and are found in all populations (dotted black line). c, The density of the expected number of variants per kb carried by a genome drawn from each population, as a function of variant frequency (see Supplementary Information). Colours as for part a. Under a model of constant population size, the expected density is constant across the frequency spectrum.
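The final statement in panel c can be motivated by a short derivation. This is a sketch under the standard neutral model with constant population size and infinitely many sites; the symbols $n$ (number of sampled chromosomes), $\theta$ (scaled mutation rate per kb) and $\xi_i$ (number of sites per kb where the derived allele appears $i$ times in the sample) are introduced here for illustration and do not come from the caption.

$$
\mathbb{E}[\xi_i] = \frac{\theta}{i}, \qquad i = 1, \dots, n-1 .
$$

A chromosome drawn at random from the sample carries the derived allele at such a site with probability $i/n$, so the expected number of variants of sample count $i$ carried per chromosome is

$$
\frac{i}{n} \cdot \mathbb{E}[\xi_i] = \frac{i}{n} \cdot \frac{\theta}{i} = \frac{\theta}{n},
$$

which does not depend on $i$. Rare variants are more numerous, but each is carried by proportionally fewer chromosomes, and the two effects cancel. Departures from a flat curve therefore suggest departures from the model's assumptions, such as changes in population size.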

**Figure 3. Allele sharing within and between populations**
a, Sharing of f₂ variants, those found exactly twice across the entire sample, within and between populations. Each row represents the distribution across populations for the origin of samples sharing an f₂ variant with the target population (indicated by the left-hand side). The grey bar represents the average number of f₂ variants carried by a randomly-chosen genome in each population. b, Median length of haplotype identity (excluding cryptically-related samples and singleton variants and allowing for up to two genotype errors) between two chromosomes that share variants of a given frequency in each population. Estimates are from 200 randomly-sampled regions of 1 Mb each and up to 15 pairs of individuals for each variant. c, The average proportion of variants that are novel (compared to the pilot phase of the project) among those found in regions inferred to have different ancestries within ASW, PUR, CLM and MXL. Error bars represent 95% bootstrap confidence intervals.
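The f₂ tabulation in panel a can be sketched as follows, assuming a small in-memory genotype matrix (one row per variant, one column per sample, entries giving the non-reference allele count) and one population label per sample. The function name `f2_sharing` is hypothetical, and skipping variants carried twice by a single homozygous individual is a simplifying assumption made here, not something stated in the caption.

```python
from collections import Counter


def f2_sharing(genotypes, populations):
    """Count f2 variants by the pair of populations of their two carriers.

    genotypes: list of rows, one per variant; each row holds the
        non-reference allele count (0, 1 or 2) for every sample.
    populations: population label for each sample, in column order.
    Returns a Counter keyed by a sorted (population, population) tuple.
    """
    counts = Counter()
    for row in genotypes:
        if sum(row) != 2:
            continue  # not an f2 variant: allele not seen exactly twice
        carriers = [idx for idx, g in enumerate(row) if g > 0]
        if len(carriers) != 2:
            continue  # assumption: ignore a single homozygous carrier
        pair = tuple(sorted(populations[idx] for idx in carriers))
        counts[pair] += 1
    return counts


# Hypothetical toy data: four samples, five variants.
sample_pops = ["LWK", "LWK", "YRI", "ASW"]
toy_genotypes = [
    [1, 1, 0, 0],  # f2 shared within LWK
    [0, 0, 1, 1],  # f2 shared between YRI and ASW
    [2, 0, 0, 0],  # two copies in one sample, skipped
    [1, 1, 1, 0],  # three copies, not f2
    [0, 1, 1, 0],  # f2 shared between LWK and YRI
]
sharing = f2_sharing(toy_genotypes, sample_pops)
```

Normalising each row of the resulting table by the number of f₂ variants involving the target population would give a distribution of the kind plotted in the figure.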

**Figure 4. Purifying selection within and between populations**
a, The relationship between evolutionary conservation (measured by GERP score) and rare variant proportion (fraction of all variants with derived allele frequency < 0.5%) for variants occurring in different functional elements and with different coding consequences. Crosses indicate the average GERP score at variant sites (x-axis) and proportion of rare variants (y-axis) in each category. b, Levels of evolutionary conservation (mean GERP score, top) and genetic diversity (per nucleotide pairwise differences, bottom) for sequences matching the CTCF-binding motif within CTCF-binding peaks as experimentally identified by ChIP-Seq in the ENCODE project (blue) and in a matched set of motifs outside peaks (red). The logo plot shows the distribution of identified motifs within peaks. Error bars represent ± 2 s.e.m.
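The y-axis quantity in panel a, the proportion of variants with derived allele frequency below 0.5%, can be computed per annotation category along these lines. This is a minimal sketch: the function name, the input layout of `(category, derived_allele_frequency)` pairs and the category labels in the example are illustrative assumptions.

```python
from collections import defaultdict

RARE_DAF_THRESHOLD = 0.005  # 0.5%, the cut-off quoted in the caption


def rare_variant_proportion(variants, threshold=RARE_DAF_THRESHOLD):
    """Fraction of variants per category with derived allele frequency
    strictly below the threshold.

    variants: iterable of (category, derived_allele_frequency) pairs.
    Returns a dict mapping category to the rare-variant proportion.
    """
    totals = defaultdict(int)
    rare = defaultdict(int)
    for category, daf in variants:
        totals[category] += 1
        if daf < threshold:
            rare[category] += 1
    return {category: rare[category] / totals[category] for category in totals}


# Hypothetical toy input; category names are for illustration only.
toy_variants = [
    ("synonymous", 0.001),
    ("synonymous", 0.2),
    ("nonsense", 0.0005),
    ("nonsense", 0.003),
    ("nonsense", 0.004),
    ("nonsense", 0.1),
]
proportions = rare_variant_proportion(toy_variants)
```

A higher proportion in one category than another is what the caption treats as a signature of stronger purifying selection, since selection keeps deleterious alleles at low frequency.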

**Figure 5. Implications of Phase 1 1000 Genomes data for GWAS**
a, Accuracy of imputation of genome-wide SNPs, exome SNPs and indels (using sites on the Illumina 1M array) into 10 individuals of African ancestry (3 LWK, 4 Maasai from Kenya - MKK, 2 YRI) sequenced to high coverage by an independent technology. Only indels in regions of high sequence complexity with frequency >1% are analysed. Deletion imputation accuracy is estimated by comparison to array data (note that this is for a different set of individuals, although of similar ancestry, and is included on the same plot for clarity). Accuracy is measured by the squared Pearson correlation coefficient between imputed and true dosage across all sites in a frequency range estimated from the 1000 Genomes data. Lines represent whole genome SNPs (solid), exome SNPs (long dashes), short indels (dotted) and large deletions (short dashes). b, The average number of variants in linkage disequilibrium (r²>0.5 among EUR) to focal SNPs identified in GWAS as a function of distance from the index SNP. Lines indicate the number of HapMap, Pilot and Phase 1 variants.

Comment in

A new era of human population genetics.
Platt A, Novembre J. Genome Biol. 2012 Dec 26;13(12):182. doi: 10.1186/gb-2012-13-12-182. PMID: 23268745.

Cited by

Biobank-wide association scan identifies risk factors for late-onset Alzheimer's disease and endophenotypes.
Yan D, Hu B, Darst BF, Mukherjee S, Kunkle BW, Deming Y, Dumitrescu L, Wang Y, Naj A, Kuzma A, Zhao Y, Kang H, Johnson SC, Carlos C, Hohman TJ, Crane PK, Engelman CD; Alzheimer’s Disease Genetics Consortium (ADGC); Lu Q. Elife. 2024 May 24;12:RP91360. doi: 10.7554/eLife.91360. PMID: 38787369.

Predicting functional UTR variants by integrating region-specific features.
Li G, Wu J, Wang X. Brief Bioinform. 2024 May 23;25(4):bbae248. doi: 10.1093/bib/bbae248. PMID: 38783704.

Genetically predicted fatty liver disease and risk of psychiatric disorders: A mendelian randomization study.
Xu WM, Zhang HF, Feng YH, Li SJ, Xie BY. World J Clin Cases. 2024 May 16;12(14):2359-2369. doi: 10.12998/wjcc.v12.i14.2359. PMID: 38765736.

Interactions between circulating inflammatory factors and autism spectrum disorder: a bidirectional Mendelian randomization study in European population.
Long J, Dang H, Su W, Moneruzzaman M, Zhang H. Front Immunol. 2024 Apr 29;15:1370276. doi: 10.3389/fimmu.2024.1370276. eCollection 2024. PMID: 38742104.

A combined observational and Mendelian randomization investigation reveals NMR-measured analytes to be risk factors of major cardiovascular diseases.
Zheng R, Lind L. Sci Rep. 2024 May 9;14(1):10645. doi: 10.1038/s41598-024-61440-5. PMID: 38724583.
