N Engl J Med. 2009 Apr 23;360(17):1759-68.
doi: 10.1056/NEJMra0808700. Epub 2009 Apr 15.

Genomewide association studies and human disease

John Hardy et al. N Engl J Med.
No abstract available

Conflict of interest statement

No potential conflict of interest relevant to this article was reported.

Figures

Figure 1 (facing page). Stages of a Genomewide Association Study
Although genomewide association studies are increasingly popular, they present formidable logistical and technical challenges. The primary challenge lies in selecting a disease or a trait suitable for analysis. A successful analysis is more likely when the phenotype of interest can be sensitively and specifically diagnosed or measured. For such studies, extremely large sample series are required, involving thousands of case subjects and control subjects. This process usually mandates collaboration among groups that were previously competitors, which in itself presents a formidable challenge to success.
In the first stage, single-nucleotide polymorphisms (SNPs) across the genome are genotyped, almost exclusively on chip-based products generated by one of two companies, Illumina or Affymetrix. The genotyping content of these products differs, but recent advances allow the imputation of ungenotyped SNPs from those that have been genotyped, which facilitates collaboration and comparison among groups that have used different techniques.
Second, after the generation of SNP data, the data are subjected to quality control and cleaning procedures, such as ensuring that the genotyped sex (based on X and Y genotypes) matches the reported sex for individual samples, measuring how well the samples are matched as a group, and identifying individual outliers (all based on general patterns of genetic variability). This step allows the removal of samples from ethnically distant subjects and adjustment for any systematic differences between or within cohorts.
Third, each SNP that survives quality control and cleaning is then tested for association with a disease or trait. Shown is a Manhattan plot, which is typically used in genomewide association studies and plots the negative log of the P value against chromosomal position. Because of the number of statistical tests that are performed, there is a high false positive rate. Therefore, depending on the study design, genomewide statistical significance is set at P values of approximately 1.0×10⁻⁸ or less at this stage of the analysis. The models of risk that are most typically tested are dominant, recessive, genotypic, allelic, and additive (with the additive model, which assumes that the presence of one risk allele confers an intermediate risk between having no allele and having two alleles, most frequently tested).
Fourth, SNPs or loci are selected for replication in an independent sample set, ideally of the same or larger size than the sample analyzed in the genomewide association. The selection of loci may be based on statistical significance alone or a combination of statistical significance and biologic plausibility; the number of SNPs that are selected for testing may be as few as 10 or as many as 20,000, depending on the initial study design and resources available.
Fifth, replication experiments lead to any combination of three results: selected loci show clear and unequivocal association with disease, show no association signal whatsoever, or show an association with disease that is not of sufficient magnitude to pass a predetermined statistical threshold.
Sixth, additional genotyping is performed in independent replication cohorts to determine whether an association with a disease is genuine or not.
Seventh, data mining at unequivocally associated loci reveals transcripts in and around this locus, in addition to the mapping of all known genetic variation within the region. Further fine mapping of the locus is performed by a combination of deep-resequencing methods to discover new variants and genotyping of untyped variants to determine which are most significantly associated with disease. Further analysis of the region is performed to determine the most critical variants, the pathologically relevant gene, and the likely biologic effect.
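The third stage described above (a per-SNP association test, a Manhattan-plot height, and a genomewide threshold) can be illustrated with a minimal sketch. It assumes the allelic model with a 1-degree-of-freedom chi-square test, which is only one of the models the caption lists; the function names and the genotype counts are hypothetical and are not taken from the article.

```python
import math

# Threshold quoted in the caption; a given study may choose a different cutoff.
GENOMEWIDE_THRESHOLD = 1.0e-8


def allelic_test(case_genotypes, control_genotypes):
    """Allelic chi-square test (1 df) for a single SNP.

    Each argument is a tuple of genotype counts:
    (no risk alleles, one risk allele, two risk alleles).
    Returns (chi_square, p_value).
    """
    def allele_counts(genotypes):
        none, one, two = genotypes
        risk = one + 2 * two      # heterozygotes carry one risk allele
        other = one + 2 * none
        return risk, other

    a, b = allele_counts(case_genotypes)     # cases: risk, other
    c, d = allele_counts(control_genotypes)  # controls: risk, other
    total = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        # Degenerate table (e.g. monomorphic SNP): no evidence either way.
        return 0.0, 1.0
    chi_square = total * (a * d - b * c) ** 2 / margins
    # For 1 df, P(chi2 > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


def manhattan_height(p_value):
    """Value plotted on the y axis of a Manhattan plot: -log10(P)."""
    return -math.log10(p_value)


def is_genomewide_significant(p_value, threshold=GENOMEWIDE_THRESHOLD):
    return p_value <= threshold


if __name__ == "__main__":
    # Hypothetical genotype counts for one SNP, for illustration only.
    cases = (200, 500, 300)
    controls = (400, 450, 150)
    chi_square, p_value = allelic_test(cases, controls)
    print(chi_square, p_value, manhattan_height(p_value),
          is_genomewide_significant(p_value))
```

In a real study this test would be repeated for every SNP that passes quality control, which is why the threshold has to be so stringent; an additive-model analysis would typically use regression on allele dosage instead of a 2×2 allele table.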
Figure 2 (facing page). Genetic Control of Gene Expression in Various Tissues
After the identification of a genetic locus associated with disease, the next step is to determine whether the associated variant alters the expression of transcripts within the region. In Panel A, a disease-associated region of the genome contains four genes. Although it is clear that genetic variability at a locus may affect distal genes (and even those on other chromosomes), in most instances the most proximal genes are investigated. In this case, expression of all four genes would be assessed, but one is shown for clarity. The single-nucleotide polymorphisms (SNPs) across this region form haplotypes that confer risk of disease (red) or protection against it (blue). In Panel B, three splice forms are known to exist for the gene of interest: forms 1, 2, and 3, which differ according to their inclusion or removal of exons 3 and 5. In Panel C, genotyping of the risk variants in human tissues and expression analysis of the three splice forms in the same tissues allow a test of association between the genotype or haplotype and expression level. In this example, the risk haplotype is associated with decreased levels of form 1 of the gene in the brain, has no measurable effect in the heart, and is associated with increased expression of form 3 in the liver. If no genotype-expression association were observed in the other three genes in the region, it would be reasonable to suggest that this is the pathogenically relevant transcript; if the disease of interest is neurologic, it would be reasonable to hypothesize that the risk is mediated by reduced levels of form 1 messenger RNA.
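The genotype-expression test described for Panel C can be sketched as a regression of transcript level on the number of risk alleles each sample carries, run separately per tissue and splice form. The sketch below assumes a simple least-squares slope as the measure of association; the function name and all expression values are invented toy numbers chosen to mimic the pattern in the caption, not data from the article.

```python
def dosage_slope(dosages, expression):
    """Least-squares slope of expression level on risk-allele dosage (0, 1, 2).

    A negative slope means expression falls as risk alleles are added;
    a slope near zero means no detectable genotype effect.
    """
    n = len(dosages)
    if n != len(expression) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("all samples carry the same genotype")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    return sxy / sxx


# Hypothetical toy data: six samples, two per genotype class.
DOSAGES = [0, 0, 1, 1, 2, 2]
TOY_EXPRESSION = {
    ("brain", "form 1"): [10.0, 9.8, 8.1, 7.9, 6.0, 6.2],
    ("heart", "form 1"): [5.0, 5.2, 5.2, 5.0, 5.0, 5.2],
    ("liver", "form 3"): [2.0, 2.2, 3.1, 2.9, 4.0, 4.2],
}

if __name__ == "__main__":
    for (tissue, form), levels in TOY_EXPRESSION.items():
        print(tissue, form, dosage_slope(DOSAGES, levels))
```

A real analysis would also attach a significance test to each slope and adjust for covariates and for the number of transcripts and tissues examined; the slope alone only shows the direction and size of the effect.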

