Science. 2022 Apr;376(6588):44-53.
doi: 10.1126/science.abj6987. Epub 2022 Mar 31.

The complete sequence of a human genome

Sergey Nurk #  1 Sergey Koren #  1 Arang Rhie #  1 Mikko Rautiainen #  1 Andrey V Bzikadze  2 Alla Mikheenko  3 Mitchell R Vollger  4 Nicolas Altemose  5 Lev Uralsky  6   7 Ariel Gershman  8 Sergey Aganezov  9 Savannah J Hoyt  10 Mark Diekhans  11 Glennis A Logsdon  4 Michael Alonge  9 Stylianos E Antonarakis  12 Matthew Borchers  13 Gerard G Bouffard  14 Shelise Y Brooks  14 Gina V Caldas  15 Nae-Chyun Chen  9 Haoyu Cheng  16   17 Chen-Shan Chin  18 William Chow  19 Leonardo G de Lima  13 Philip C Dishuck  4 Richard Durbin  19   20 Tatiana Dvorkina  3 Ian T Fiddes  21 Giulio Formenti  22   23 Robert S Fulton  24 Arkarachai Fungtammasan  18 Erik Garrison  11   25 Patrick G S Grady  10 Tina A Graves-Lindsay  26 Ira M Hall  27 Nancy F Hansen  28 Gabrielle A Hartley  10 Marina Haukness  11 Kerstin Howe  19 Michael W Hunkapiller  29 Chirag Jain  1   30 Miten Jain  11 Erich D Jarvis  22   23 Peter Kerpedjiev  31 Melanie Kirsche  9 Mikhail Kolmogorov  32 Jonas Korlach  29 Milinn Kremitzki  26 Heng Li  16   17 Valerie V Maduro  33 Tobias Marschall  34 Ann M McCartney  1 Jennifer McDaniel  35 Danny E Miller  4   36 James C Mullikin  14   28 Eugene W Myers  37 Nathan D Olson  35 Benedict Paten  11 Paul Peluso  29 Pavel A Pevzner  32 David Porubsky  4 Tamara Potapova  13 Evgeny I Rogaev  6   7   38   39 Jeffrey A Rosenfeld  40 Steven L Salzberg  9   41 Valerie A Schneider  42 Fritz J Sedlazeck  43 Kishwar Shafin  11 Colin J Shew  44 Alaina Shumate  41 Ying Sims  19 Arian F A Smit  45 Daniela C Soto  44 Ivan Sović  29   46 Jessica M Storer  45 Aaron Streets  5   47 Beth A Sullivan  48 Françoise Thibaud-Nissen  42 James Torrance  19 Justin Wagner  35 Brian P Walenz  1 Aaron Wenger  29 Jonathan M D Wood  19 Chunlin Xiao  42 Stephanie M Yan  49 Alice C Young  14 Samantha Zarate  9 Urvashi Surti  50 Rajiv C McCoy  49 Megan Y Dennis  44 Ivan A Alexandrov  3   7   51 Jennifer L Gerton  13   52 Rachel J O'Neill  10 Winston Timp  8   41 Justin M Zook  35 Michael C Schatz  9   49 Evan E Eichler  4   53 Karen H Miga  11   54 Adam M Phillippy  1

Sergey Nurk et al. Science. 2022 Apr.

Abstract

Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion-base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies.

Figures

Fig. 1. Summary of the complete T2T-CHM13 human genome assembly.
(A) Ideogram of T2T-CHM13v1.1 assembly features. Bottom to top: gaps/issues in GRCh38 fixed by CHM13 overlaid with the density of genes exclusive to CHM13 in red; segmental duplications (SDs) (42) and centromeric satellites (CenSat) (30); and CHM13 ancestry predictions (EUR, European; SAS, South Asian; EAS, East Asian; AMR, Admixed American). (B) Additional (non-syntenic) bases in the CHM13 assembly relative to GRCh38 per chromosome, with the acrocentrics highlighted in black, and (C) by sequence type (note that the CenSat and SD annotations overlap). (D) Total non-gap bases in UCSC reference genome releases dating back to September 2000 (hg4) and ending with T2T-CHM13 in 2021.
Fig. 2. High-resolution assembly string graph of the CHM13 genome.
(A) Bandage (60) visualization, where nodes represent unambiguously assembled sequences scaled by length, and edges correspond to the overlaps between node sequences. Each chromosome is both colored and numbered on the short (p) arm. Long (q) arms are labeled where unclear. The five acrocentric chromosomes (bottom right) are connected due to similarity between their short arms, and the rDNA arrays form five dense tangles due to their high copy number. The graph is partially fragmented due to HiFi coverage dropout surrounding GA-rich sequence (black triangles). Centromeric satellites (30) are the source of most ambiguity in the graph (gray highlights). (B) The ONT-assisted graph traversal for the 2p11 locus is given by numerical order. Based on low depth-of-coverage, the unlabeled light gray node represents an artifact or heterozygous variant and was not used. (C) The multi-megabase tandem HSat3 duplication (9qh+) at 9q12 requires two traversals of the large loop structure (the size of the loop is exaggerated because graph edges are of constant size). Nodes used by the first traversal are in dark purple and the second traversal in light purple. Nodes used by both traversals typically have twice the sequencing coverage. (D) Enlargement of the distal short arms of the acrocentrics, showing the colored graph walks and edges between highly similar sequences in the distal junctions (DJs) adjacent to the rDNA arrays.
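The remark in panel C that nodes used by both traversals typically have twice the sequencing coverage can be illustrated with a toy sketch. The node names, the two walks, and the 30x single-copy coverage below are hypothetical values chosen for illustration; they are not taken from the CHM13 graph, and this is not the consortium's tooling.

```python
from collections import Counter


def node_multiplicity(walks):
    """Count how many times each node is visited across all walks."""
    counts = Counter()
    for walk in walks:
        counts.update(walk)
    return counts


def expected_coverage(multiplicity, single_copy_coverage):
    """Expected read depth per node if depth scales with copy number."""
    return {node: n * single_copy_coverage for node, n in multiplicity.items()}


# Hypothetical loop: the second pass reuses nodes "b" and "c".
first_traversal = ["a", "b", "c", "d"]
second_traversal = ["b", "c", "e"]

multiplicity = node_multiplicity([first_traversal, second_traversal])
coverage = expected_coverage(multiplicity, single_copy_coverage=30.0)

for node in sorted(coverage):
    print(node, multiplicity[node], coverage[node])
```

Under this simple model, a node whose observed depth is close to twice the single-copy depth is a candidate for being traversed twice, which is the kind of evidence the caption describes.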
Fig. 3. Sequencing coverage and assembly validation.
(A) Uniform whole-genome coverage of mapped HiFi and ONT reads is shown with primary alignments in light shades and marker-assisted alignments overlaid in dark shades. Large HSat arrays (30) are noted by triangles, with inset regions marked by arrowheads and the location of the rDNA arrays marked with asterisks. Regions with low unique marker frequency (light green) correspond to drops in unique marker density, but are recovered by the lower-confidence primary alignments. Annotated assembly issues are compared for T2T-CHM13 and GRCh38. (B–D) Enlargements corresponding to regions of the genome featured in Fig. 2. Uniform coverage changes within certain satellites are reproducible and likely caused by sequencing bias. Identified heterozygous variants and assembly issues are marked below and typically correspond with low coverage of the primary allele (black) and elevated coverage of the secondary allele (red). The percentage of microsatellite repeats in each 128 bp window is shown at the bottom.
Fig. 4. Short arms of the acrocentric chromosomes.
Each short arm is shown along with annotated genes, percent of methylated CpGs (29), and a color-coded satellite repeat annotation (30). The rDNA arrays are represented by a directional arrow and copy number due to their high self-similarity, which prohibits ONT mapping. Percent identity heatmaps versus the other four arms were computed in 10 kbp windows and smoothed over 100 kbp intervals. Each position shows the maximum identity of that window to any window in the other chromosome. The distal short arms include conserved satellite structure and inverted repeats (thin arrows), while the proximal short arms show a diversity of structures. The proximal short arms of Chromosomes 13, 14, and 21 share a segmentally duplicated core, including small alpha satellite HOR arrays and a central, highly methylated, SST1 array (thin arrows with teal block). Yellow triangles indicate hypomethylated centromeric dip regions (CDRs), marking the sites of kinetochore assembly (29).
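The per-window maximum and the smoothing step described in this caption can be sketched as follows. The caption does not say which smoothing method was used, so the moving average here is an assumption, as are the function names and the toy identity values. With 10 kbp windows, a 100 kbp interval corresponds to a span of 10 windows; the toy example uses a span of 3 to stay small.

```python
def max_identity_per_window(identity):
    """identity[i][j] is the percent identity of window i on one arm
    versus window j on another arm; keep the best match per window i."""
    return [max(row) for row in identity]


def smooth(values, span):
    """Moving average over `span` consecutive windows (assumed method),
    truncated at the ends of the array."""
    smoothed = []
    for i in range(len(values)):
        start = i - span // 2
        end = start + span
        start = max(start, 0)
        end = min(end, len(values))
        chunk = values[start:end]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Toy matrix: 4 windows on one arm versus 2 windows on another arm.
identity = [
    [90.0, 95.0],
    [80.0, 85.0],
    [99.0, 70.0],
    [60.0, 75.0],
]

per_window = max_identity_per_window(identity)
smoothed = smooth(per_window, span=3)
print(per_window)
print(smoothed)
```

Taking the maximum before smoothing means each window reports its best match anywhere on the other arm, so the track reflects shared sequence regardless of where it sits on the other chromosome.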
Fig. 5. Resolved FRG1 paralogs.
(A) Protein-coding gene FRG1 and its 23 paralogs in CHM13. Only 9 are found in GRCh38. Genes are drawn larger than their actual size and the “FRG1” prefix is omitted for brevity. All paralogs are found near satellite arrays. Most copies exhibit evidence of expression, including CpG islands present at the 5′ start site with varying degrees of methylation. (B) Reference (gray) and variant (colored) allele coverage is shown for four human HiFi samples mapped to the paralog FRG1DP. When mapped to GRCh38, the region shows excessive HiFi coverage and variants, indicating that reads from the missing paralogs are mis-mapped to FRG1DP (variants with >80% coverage shown). When mapped to CHM13, HiFi reads show the expected coverage and a typical heterozygous variation pattern for the three non-CHM13 samples (variants >20% coverage shown). These non-reference alleles are also found in other populations from 1KGP ILMN data. (C) Mapped HiFi read coverage for other FRG1 paralogs, with an extended context shown for Chromosome 20. Coverage of HiFi reads that mapped to FRG1DP in GRCh38 is highlighted (dark gray), showing the paralogous copies they originate from (FRG1BP4–10, FRG1GP, FRG1GP2, and FRG1KP4). Background coverage is variable for some paralogs, suggesting copy number polymorphism in the population. (D) Methylation and expression profiles suggest transcription of FRG1DP in CHM13. In the copy number display (bottom), each length k sequence (k-mer) of the CHM13 assembly is painted with a color representing the copy number of that k-mer sequence in an SGDP sample. The CHM13 and GRCh38 tracks show the copy number of these same k-mers in the respective assemblies. CHM13 copy number resembles all samples from the SGDP, whereas GRCh38 underrepresents the true copy number.
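The k-mer copy number display in panel D can be approximated with a minimal sketch: count every k-mer in a sample, then look up the count of each k-mer along the assembly. The sequences, the value of k, and the function names below are toy assumptions. A real analysis would work from sequencing reads, normalize counts by read depth, and usually treat a k-mer and its reverse complement as the same k-mer; none of that is shown here.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count occurrences of every length-k substring in the sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def paint_copy_number(assembly, sample_counts, k):
    """For each k-mer position along the assembly, report how many
    times that k-mer occurs in the sample (0 if absent)."""
    return [
        sample_counts.get(assembly[i:i + k], 0)
        for i in range(len(assembly) - k + 1)
    ]


# Toy data, not real genomic sequence.
assembly = "ACGTACGA"
sample = ["ACGTT", "ACGA"]

sample_counts = count_kmers(sample, k=3)
track = paint_copy_number(assembly, sample_counts, k=3)
print(track)
```

Mapping each value in the resulting track to a color gives the kind of painted display the caption describes, and running the same lookup against the k-mer counts of an assembly yields the comparison tracks.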

Similar articles

  • The complete sequence of a human Y chromosome.
    Rhie A, Nurk S, Cechova M, Hoyt SJ, Taylor DJ, Altemose N, Hook PW, Koren S, Rautiainen M, Alexandrov IA, Allen J, Asri M, Bzikadze AV, Chen NC, Chin CS, Diekhans M, Flicek P, Formenti G, Fungtammasan A, Garcia Giron C, Garrison E, Gershman A, Gerton JL, Grady PGS, Guarracino A, Haggerty L, Halabian R, Hansen NF, Harris R, Hartley GA, Harvey WT, Haukness M, Heinz J, Hourlier T, Hubley RM, Hunt SE, Hwang S, Jain M, Kesharwani RK, Lewis AP, Li H, Logsdon GA, Lucas JK, Makalowski W, Markovic C, Martin FJ, Mc Cartney AM, McCoy RC, McDaniel J, McNulty BM, Medvedev P, Mikheenko A, Munson KM, Murphy TD, Olsen HE, Olson ND, Paulin LF, Porubsky D, Potapova T, Ryabov F, Salzberg SL, Sauria MEG, Sedlazeck FJ, Shafin K, Shepelev VA, Shumate A, Storer JM, Surapaneni L, Taravella Oill AM, Thibaud-Nissen F, Timp W, Tomaszkiewicz M, Vollger MR, Walenz BP, Watwood AC, Weissensteiner MH, Wenger AM, Wilson MA, Zarate S, Zhu Y, Zook JM, Eichler EE, O'Neill RJ, Schatz MC, Miga KH, Makova KD, Phillippy AM. Nature. 2023 Sep;621(7978):344-354. doi: 10.1038/s41586-023-06457-y. Epub 2023 Aug 23. PMID: 37612512. Free PMC article.
  • The p-Arms of Human Acrocentric Chromosomes Play by a Different Set of Rules.
    McStay B. Annu Rev Genomics Hum Genet. 2023 Aug 25;24:63-83. doi: 10.1146/annurev-genom-101122-081642. Epub 2023 Feb 28. PMID: 36854315. Review.
  • Short arms of human acrocentric chromosomes and the completion of the human genome sequence.
    Antonarakis SE. Genome Res. 2022 Apr;32(4):599-607. doi: 10.1101/gr.275350.121. Epub 2022 Mar 31. PMID: 35361624. Free PMC article.
  • Completing the human genome: the progress and challenge of satellite DNA assembly.
    Miga KH. Chromosome Res. 2015 Sep;23(3):421-6. doi: 10.1007/s10577-015-9488-2. PMID: 26363799. Review.
  • Finishing the euchromatic sequence of the human genome.
    International Human Genome Sequencing Consortium. Nature. 2004 Oct 21;431(7011):931-45. doi: 10.1038/nature03001. PMID: 15496913.

References

    1. Schneider VA et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849–864 (2017).
    2. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 409, 860–921 (2001).
    3. Venter JC et al. The sequence of the human genome. Science. 291, 1304–1351 (2001).
    4. Myers EW et al. A whole-genome assembly of Drosophila. Science. 287, 2196–2204 (2000).
    5. Eichler EE, Clark RA, She X. An assessment of the sequence gaps: unfinished business in a finished human genome. Nat. Rev. Genet. 5, 345–354 (2004).