Nature. 2016 Aug 18;536(7616):285-91.
doi: 10.1038/nature19057.

Analysis of protein-coding genetic variation in 60,706 humans

Monkol Lek, Konrad J Karczewski, Eric V Minikel, Kaitlin E Samocha, Eric Banks, Timothy Fennell, Anne H O'Donnell-Luria, James S Ware, Andrew J Hill, Beryl B Cummings, Taru Tukiainen, Daniel P Birnbaum, Jack A Kosmicki, Laramie E Duncan, Karol Estrada, Fengmei Zhao, James Zou, Emma Pierce-Hoffman, Joanne Berghout, David N Cooper, Nicole Deflaux, Mark DePristo, Ron Do, Jason Flannick, Menachem Fromer, Laura Gauthier, Jackie Goldstein, Namrata Gupta, Daniel Howrigan, Adam Kiezun, Mitja I Kurki, Ami Levy Moonshine, Pradeep Natarajan, Lorena Orozco, Gina M Peloso, Ryan Poplin, Manuel A Rivas, Valentin Ruano-Rubio, Samuel A Rose, Douglas M Ruderfer, Khalid Shakir, Peter D Stenson, Christine Stevens, Brett P Thomas, Grace Tiao, Maria T Tusie-Luna, Ben Weisburd, Hong-Hee Won, Dongmei Yu, David M Altshuler, Diego Ardissino, Michael Boehnke, John Danesh, Stacey Donnelly, Roberto Elosua, Jose C Florez, Stacey B Gabriel, Gad Getz, Stephen J Glatt, Christina M Hultman, Sekar Kathiresan, Markku Laakso, Steven McCarroll, Mark I McCarthy, Dermot McGovern, Ruth McPherson, Benjamin M Neale, Aarno Palotie, Shaun M Purcell, Danish Saleheen, Jeremiah M Scharf, Pamela Sklar, Patrick F Sullivan, Jaakko Tuomilehto, Ming T Tsuang, Hugh C Watkins, James G Wilson, Mark J Daly, Daniel G MacArthur; Exome Aggregation Consortium
Monkol Lek et al. Nature. 2016.

Abstract

Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We have used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; identifying 3,230 genes with near-complete depletion of predicted protein-truncating variants, with 72% of these genes having no currently established human disease phenotype. Finally, we demonstrate that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human 'knockout' variants in protein-coding genes.

Figures

Extended Data Figure 1. The impact of recurrence across different mutation and functional classes
a) Transition-to-transversion (TiTv) ratio of synonymous variants at downsampled intervals of ExAC. The TiTv ratio is relatively stable at the sample sizes of previous studies (<5000) but changes drastically at larger sample sizes. b) For synonymous doubleton variants, mutability of each trinucleotide context is correlated with the mean Euclidean distance between individuals that share the doubleton. Transversion (red) and non-CpG transition (green) doubletons are more likely to be shared by individuals that are close in PCA space (i.e. of more similar ancestry) than CpG transitions (blue). c) The proportion singleton among various functional categories. The functional category stop lost has a higher singleton rate than nonsense. Error bars represent standard error of the mean. d) Among synonymous variants, mutability of each trinucleotide context is correlated with proportion singleton, suggesting that CpG transitions (blue) are more likely to have multiple independent origins driving their allele frequency up. e) The proportion singleton metric from c) broken down by transversions, non-CpG transitions, and CpG variants. Notably, there is wide variation in singleton rates among mutational contexts within functional classes, and there are no stop-lost CpG transitions. Error bars represent standard error of the mean.
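The two summary statistics named in this caption, the TiTv ratio and the proportion singleton, can be illustrated with a minimal Python sketch. The input shapes (a list of `(ref, alt)` base pairs and a list of allele counts) and the function names are assumptions chosen for illustration; they are not the ExAC pipeline.

```python
# Illustrative sketch only: input formats and function names are hypothetical.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)

def titv_ratio(snvs):
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    snvs = [(r, a) for r, a in snvs if r != a]  # ignore non-variant pairs
    ti = sum(1 for r, a in snvs if is_transition(r, a))
    tv = len(snvs) - ti
    return ti / tv if tv else float("nan")

def proportion_singleton(allele_counts):
    """Fraction of variants seen exactly once (allele count of 1)."""
    counts = list(allele_counts)
    if not counts:
        return float("nan")
    return sum(1 for ac in counts if ac == 1) / len(counts)

example_snvs = [("A", "G"), ("C", "T"), ("A", "C")]
example_counts = [1, 1, 2, 5]
```

Here `titv_ratio(example_snvs)` counts two transitions and one transversion, and `proportion_singleton(example_counts)` counts two singletons among four variants.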
Extended Data Figure 2. Multi-nucleotide variants discovered in the ExAC data set
a) Number of MNPs by their impact on variant interpretation. b) Distribution of the number of MNPs per sample where phasing changes interpretation, separated by allele frequency. Common > 1%, Rare < 1%. MNPs composed of a rare and a common allele are considered rare, as the rarer allele defines the frequency of the MNP.
Extended Data Figure 3. Relationships between depth and observed vs expected variants as well as correlations between observed and expected variant counts for synonymous, missense, and protein-truncating variants
a) The relationship between the median depth of exons (bins of 2) and the sum of all observed synonymous variants in those exons divided by the sum of all expected synonymous variants. The curve was used to determine the appropriate depth adjustment for expected variant counts. For the rest of the panels, the correlation between the depth-adjusted expected variant counts and the observed counts is depicted for synonymous (b), missense (c), and protein-truncating (d) variants. The black line indicates a perfect correlation (slope = 1). Axes have been trimmed to remove TTN.
Extended Data Figure 4. Number of protein-truncating variants in constrained genes per individual by allele frequency bin
Equivalent to Figure 5b limited to constrained (pLI ≥ 0.9) genes.
Extended Data Figure 5. Principal component analysis (PCA) and key metrics used to filter samples
a) Principal component analysis using a set of 5,400 common exome SNPs. Individuals are colored by their distance from each of the population cluster centers using the first 4 principal components. b) The key filtering metrics: number of variants, TiTv, alternate heterozygous/homozygous (HetHom) ratio and insertion/deletion (InsDel) ratio. Populations are shown in their respective colors: Latino (red), African (purple), European (blue), South Asian (yellow) and East Asian (green).
Figure 1. Patterns of genetic variation in 60,706 humans
a) The size and diversity of public reference exome datasets. ExAC exceeds previous datasets in size for all studied populations. b) Principal component analysis (PCA) dividing ExAC individuals into five continental populations. PC2 and PC3 are shown; additional PCs are in Extended Data Figure 5a. c) The allele frequency spectrum of ExAC highlights that the majority of genetic variants are rare and novel. d) The proportion of possible variation observed by mutational context and functional class. Over half of all possible CpG transitions are observed. Error bars represent standard error of the mean. e-f) The number (e) and frequency distribution (proportion singleton; f) of indels, by size. Compared to in-frame indels, frameshift variants are less common (have a higher proportion of singletons, a proxy for predicted deleteriousness on gene product). Error bars indicate 95% confidence intervals.
Figure 2. Mutational recurrence at large sample sizes
a) Proportion of validated de novo variants from two external datasets that are independently found in ExAC, separated by functional class and mutational context. Error bars represent standard error of the mean. Colors are consistent in a-d. b) Number of unique variants observed, by mutational context, as a function of number of individuals (down-sampled from ExAC). CpG transitions, the most likely mutational event, begin reaching saturation at ~20,000 individuals. c) The site frequency spectrum is shown for each mutational context. d) For doubletons (variants with an allele count of 2), mutation rate is positively correlated with the likelihood of being found in two individuals of different continental populations. e) The mutability-adjusted proportion of singletons (MAPS) is shown across functional classes. Error bars represent standard error of the mean of the proportion of singletons.
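The down-sampling in panel b can be pictured with a short Python sketch: draw nested random subsets of individuals and count the distinct variants seen in each. The data layout (a dict from sample ID to a set of variant identifiers) and the function name are hypothetical, and the sketch makes no claim about how the authors performed their down-sampling.

```python
import random

def unique_variants_by_sample_size(genotypes, sizes, seed=0):
    """Count distinct variants in nested random subsets of individuals.

    genotypes: dict mapping a sample ID to the set of variant IDs it carries
               (hypothetical layout, for illustration only).
    sizes:     iterable of subset sizes to evaluate.
    """
    rng = random.Random(seed)      # fixed seed keeps the sketch reproducible
    order = sorted(genotypes)      # deterministic starting order
    rng.shuffle(order)
    results = {}
    for n in sizes:
        seen = set()
        for sample in order[:n]:   # the first n samples form a nested subset
            seen |= genotypes[sample]
        results[n] = len(seen)
    return results

toy_genotypes = {
    "s1": {"v1", "v2"},
    "s2": {"v2", "v3"},
    "s3": {"v1"},
    "s4": {"v4"},
}
toy_curve = unique_variants_by_sample_size(toy_genotypes, [1, 2, 3, 4])
```

Because the subsets are nested, the counts can only grow or stay flat as more individuals are added. A variant class that is approaching saturation would show a curve that flattens as sample size grows.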
Figure 3. Quantifying intolerance to functional variation in genes and gene sets
a) Histograms of constraint Z scores for 18,225 genes. This measure of departure of number of variants from expectation is normally distributed for synonymous variants, but right-shifted (higher constraint) for missense and protein-truncating variants (PTVs), indicating that more genes are intolerant to these classes of variation. b) The proportion of genes that are very likely intolerant of loss-of-function variation (pLI ≥ 0.9) is highest for ClinGen haploinsufficient genes, and stratifies by the severity and age of onset of the haploinsufficient phenotype. Genes essential in cell culture and dominant disease genes are likewise enriched for intolerant genes, while recessive disease genes and olfactory receptors have fewer intolerant genes. Black error bars indicate 95% confidence intervals (CI). c) Synonymous Z scores show no correlation with the number of tissues in which a gene is expressed, but the most missense- and PTV-constrained genes tend to be expressed in more tissues. Thick black bars indicate the first to third quartiles, with the white circle marking the median. d) Highly missense- and PTV-constrained genes are less likely to have eQTLs discovered in GTEx than the average gene. Shaded regions around the lines indicate 95% CI. e) Highly missense- and PTV-constrained genes are more likely to be adjacent to GWAS signals than the average gene. Shaded regions around the lines indicate 95% CI. f) MAPS (Figure 2e) is shown for each functional category, broken down by constraint score bins as shown. Missense and PTV constraint score bins provide information about natural selection at least partially orthogonal to MAPS, PolyPhen, and CADD scores, indicating that this metric should be useful in identifying variants associated with deleterious phenotypes. Shaded regions around the lines indicate 95% CI. For panels a,c-f: synonymous shown in gray, missense in orange, and protein-truncating in maroon.
Figure 4. Filtering for Mendelian variant discovery
a) Predicted missense and protein-truncating variants in 500 randomly chosen ExAC individuals were filtered based on allele frequency information from ESP, or from the remaining ExAC individuals. At a 0.1% allele frequency (AF) filter, ExAC provides greater power to remove candidate variants, leaving an average of 154 variants for analysis, compared to 1090 after filtering against ESP. Popmax AF also provides greater power than global AF, particularly when populations are unequally sampled. b) Estimates of allele frequency in Europeans based on ESP are more precise at higher allele frequencies. Sampling variance and ascertainment bias make AF estimates unreliable, posing problems for Mendelian variant filtration. 69% of ESP European singletons are not seen a second time in ExAC (tall bar at left), illustrating the dangers of filtering on very low allele counts. c) Allele frequency spectrum of disease-causing variants in the Human Gene Mutation Database (HGMD) and/or pathogenic or likely pathogenic variants in ClinVar for well characterized autosomal dominant and autosomal recessive disease genes. Most are not found in ExAC; however, many of the reportedly pathogenic variants found in ExAC are at too high a frequency to be consistent with disease prevalence and penetrance. d) Literature review of variants with >1% global allele frequency or >1% Latin American or South Asian population allele frequency confirmed there is insufficient evidence for pathogenicity for the majority of these variants. Variants were reclassified by ACMG guidelines.
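The difference between global and popmax allele-frequency filtering in panel a can be sketched in Python. The record layout (per-population allele count and allele number pairs under a `"pops"` key), the population labels, and the function names are assumptions made for this example; only the 0.1% threshold comes from the caption.

```python
def global_af(pop_counts):
    """Allele frequency pooled across all populations."""
    total_ac = sum(ac for ac, _ in pop_counts.values())
    total_an = sum(an for _, an in pop_counts.values())
    return total_ac / total_an if total_an else 0.0

def popmax_af(pop_counts):
    """Highest allele frequency observed in any single population."""
    return max((ac / an for ac, an in pop_counts.values() if an > 0),
               default=0.0)

def filter_candidates(variants, threshold=0.001, use_popmax=True):
    """Keep variants whose chosen AF is below the threshold (0.1% here)."""
    af = popmax_af if use_popmax else global_af
    return [v for v in variants if af(v["pops"]) < threshold]

# Hypothetical variant: absent from a large population, 5% in a small one.
toy_variants = [
    {"id": "var_a", "pops": {"POP_LARGE": (0, 99000), "POP_SMALL": (50, 1000)}},
    {"id": "var_b", "pops": {"POP_LARGE": (1, 99000), "POP_SMALL": (0, 1000)}},
]
kept_global = [v["id"] for v in filter_candidates(toy_variants, use_popmax=False)]
kept_popmax = [v["id"] for v in filter_candidates(toy_variants, use_popmax=True)]
```

In this toy case `var_a` has a pooled frequency of 0.05% and survives a global filter, but its 5% frequency in the smaller population removes it under popmax. This is the kind of unequal-sampling situation the caption refers to.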
Figure 5. Protein-truncating variation in ExAC
a) The average ExAC individual has 85 heterozygous and 35 homozygous protein-truncating variants (PTVs), of which 18 and 0.19 are rare (<0.1% popmax AF), respectively. Error bars represent standard deviation. b) Breakdown of PTVs per individual (a) by popmax AF bin. Across all populations, most PTVs found in a given individual are common (>5% popmax AF). c-d) Number of genes with at least one PTV (c) or homozygous PTV (d) as a function of number of individuals, downsampled from ExAC. The South Asian population is broken down by consanguinity (inbreeding coefficient, F). At 60,000 individuals for ExAC, the plots in c) and d) extend to 15,750 genes with at least one PTV and 1,550 genes with at least one homozygous PTV.
