Nucleic Acids Res. 2020 Jan 8; 48(D1): D84–D86.
Published online 2019 Oct 29. doi: 10.1093/nar/gkz956
PMCID: PMC7145611
PMID: 31665464

GenBank

Abstract

GenBank® (www.ncbi.nlm.nih.gov/genbank/) is a comprehensive, public database that contains over 6.25 trillion base pairs from over 1.6 billion nucleotide sequences for 450 000 formally described species. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. Recent updates include a new version of Genome Workbench that supports GenBank submissions, new submission wizards for viral genomes, enhancements to BankIt and improved handling of taxonomy for sequences from pathogens.

INTRODUCTION

GenBank (1) is a comprehensive public database of nucleotide sequences and supporting bibliographic and biological annotations built and distributed by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), located on the campus of the US National Institutes of Health (NIH) in Bethesda, MD, USA. After summarizing the growth of GenBank in the past year, this paper will briefly review recent updates and developments.

GROWTH OF THE DATABASE

The size and growth of the various divisions of GenBank are shown in Table 1 and Figure 1. Notable increases in the past year include the submission of 57 synthetic chromosomal constructs in January 2019 to the SYN division (ftp.ncbi.nlm.nih.gov/genbank/release.notes/gb230.release.notes) and the submission of about 60 chromosome-scale eukaryotic sequences to the VRT division as part of Release 231 (ftp.ncbi.nlm.nih.gov/genbank/release.notes/gb231.release.notes). NCBI provides GenBank sequence records in both the traditional flat file format and in a structured ASN.1 format by anonymous FTP at ftp.ncbi.nlm.nih.gov/genbank. For Release 233 there are 2467 files requiring 1057 GB of uncompressed disk storage. In addition, daily GenBank incremental update files containing new and updated records since the most recent release are available in flat file format at ftp.ncbi.nlm.nih.gov/genbank/daily-nc/.

Table 1.

Growth of GenBank Divisions (nucleotide base-pairs)

Division  Description                 Release 233 (8/2019)    Annual Increase (%) (a)
SYN       Synthetic                   7 701 613 755 (b)       545.96%
VRT       Other vertebrates           46 205 911 214 (b)      342.51%
PLN       Plants                      59 248 524 178          157.29%
UNA       Unannotated                 548 041                 84.71%
WGS       Whole genome shotgun data   5 585 922 333 160       74.30%
TLS       Targeted locus studies      10 531 800 829          73.28%
INV       Invertebrates               12 578 394 104          46.31%
PHG       Phages                      637 015 044             37.58%
BCT       Bacteria                    72 495 994 966          35.40%
TSA       Transcriptome shotgun data  294 727 165 179         30.69%
VRL       Viruses                     4 782 719 535           17.40%
PAT       Patent sequences            24 715 727 030          12.24%
ENV       Environmental samples       6 139 560 312           5.51%
PRI       Primates                    8 491 950 612           2.78%
HTC       High-throughput cDNA        728 868 423             1.03%
MAM       Other mammals               6 258 926 080           0.71%
EST       Expressed sequence tags     43 280 039 563          0.68%
ROD       Rodents                     4 554 525 905           0.43%
HTG       High-throughput genomic     27 774 725 922          0.01%
STS       Sequence tagged sites       640 918 572             0.01%
GSS       Genome survey sequences     26 339 260 641          0.00%
TOTAL     All GenBank sequences       6 233 224 722 236       69.52%

(a) Measured relative to Release 227 (8/2018).

(b) See the text for descriptions of these large increases.
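
As a rough check on how the percentages in Table 1 relate to the sizes, the annual increase is (current - previous) / previous x 100, so an earlier size can be back-calculated from a Release 233 size and its percentage. The Python sketch below does this for the TOTAL row. The function name is illustrative, and the result is only an estimate because the published percentage is rounded.

```python
def previous_size(current_bp: int, annual_increase_pct: float) -> float:
    """Back-calculate the earlier size from the current size and percent increase.

    increase_pct = (current - previous) / previous * 100
    =>  previous = current / (1 + increase_pct / 100)
    """
    return current_bp / (1.0 + annual_increase_pct / 100.0)


# TOTAL row of Table 1: Release 233 size and annual increase.
total_233 = 6_233_224_722_236
estimate_227 = previous_size(total_233, 69.52)
print(f"Estimated Release 227 total: {estimate_227:.3e} bp")
```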

Figure 1.

Growth of the five GenBank divisions that received the most sequence data in 2019, along with the growth of GenBank as a whole.

RECENT DEVELOPMENTS

Genome submissions

Advice for submitters

We would like to call attention to two cases where submitters can add value to their data. We encourage submitters of genomic sequences, including whole genome shotgun (WGS) sequences, to provide contextual metadata to support further use and analysis of the data. For example, where possible submitters should provide geographical data (e.g. country, latitude and longitude of the sampling location) along with other data such as the isolate name or number plus museum/collection identifiers as applicable. We also urge submitters to use evidence tags to document the evidence supporting their annotations. These tags take the form ‘/experimental = text’ and ‘/inference = TYPE:text’, where TYPE is a standard inference type and text consists of structured text, as explained at www.ncbi.nlm.nih.gov/genbank/evidence/. In cases where submitters have used existing public sequencing reads to improve the quality of their assemblies prior to submission, we encourage submitters to cite the accession numbers of these reads within their submission. Regarding prokaryotic genomes, while annotations are not required, we encourage submitters to request that the genome be annotated by the NCBI Prokaryotic Genome Annotation Pipeline (www.ncbi.nlm.nih.gov/genome/annotation_prok/) before being released.
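
To illustrate the ‘TYPE:text’ structure of an inference tag, the following Python sketch splits a value at its first colon and checks the type against a small set. The set shown is only an illustrative subset assumed for this example, and the function and class names are hypothetical; the authoritative list of inference types and the full syntax are given in the evidence documentation cited above.

```python
from typing import NamedTuple

# Illustrative subset only (an assumption for this sketch); consult the
# evidence documentation for the authoritative list of inference types.
EXAMPLE_TYPES = {"similar to sequence", "ab initio prediction", "profile"}


class Inference(NamedTuple):
    inference_type: str
    text: str


def parse_inference(value: str) -> Inference:
    """Split 'TYPE:text' at the first colon and check TYPE against the subset."""
    inference_type, sep, text = value.partition(":")
    inference_type = inference_type.strip()
    text = text.strip()
    if not sep or not text:
        raise ValueError(f"expected 'TYPE:text', got {value!r}")
    if inference_type not in EXAMPLE_TYPES:
        raise ValueError(f"unrecognized inference type: {inference_type!r}")
    return Inference(inference_type, text)


# Hypothetical value: only the first colon separates TYPE from the text.
parsed = parse_inference("ab initio prediction:Genscan:2.0")
print(parsed.inference_type, "|", parsed.text)
```

Because the split happens at the first colon only, any further colons stay inside the structured text. The sketch does not handle optional prefixes or suffixes that the full syntax may allow.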

NCBI strongly encourages submitters to register large-scale sequencing projects in the BioProject database (www.ncbi.nlm.nih.gov/bioproject) and to update their BioProject records after relevant publications are available. Doing so provides reliable linkages between sequencing projects and the data they produce, and may also allow links to the BioSample database (2) that provides additional information about the biological materials used in the study.

Taxonomy assignments

For submissions of bacterial genomes, GenBank performs an average nucleotide identity (ANI) analysis (3) to investigate whether the asserted organism name may be incorrect. GenBank also regularly scans existing data using this ANI analysis to identify other possibly erroneous taxonomic assignments. In cases where the taxonomic assignment is a generic species (e.g. Genus sp.), GenBank staff will update the organism name with the calculated binomial name. If the record was submitted with an incorrect binomial name (e.g. Genus species), GenBank staff will consult with the original submitters before updating the name.
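
The two outcomes described above can be summarized as a small decision rule. The Python sketch below is only a schematic of that rule, not GenBank's actual tooling: the function name, the return labels and the pattern used to recognize a ‘Genus sp.’ placeholder are all assumptions made for illustration, and the ANI calculation itself is taken as given.

```python
import re

# Assumption: a placeholder name looks like "Genus sp.", optionally followed
# by a strain or isolate label.
_PLACEHOLDER = re.compile(r"^[A-Z][a-z]+ sp\.(\s.*)?$")


def curation_action(asserted_name: str, ani_name: str) -> str:
    """Return the follow-up suggested by the text when ANI disagrees with the name."""
    if asserted_name == ani_name:
        return "keep"
    if _PLACEHOLDER.match(asserted_name):
        # Generic species: staff update the name with the calculated binomial.
        return "update"
    # Asserted binomial conflicts with ANI: consult the submitter first.
    return "consult submitter"


print(curation_action("Bacillus sp.", "Bacillus subtilis"))
print(curation_action("Bacillus cereus", "Bacillus subtilis"))
```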

Using Genome Workbench for preparing submissions

Genome Workbench 3.0 offers a new interface for creating a genome submission for GenBank. This new interface supports both prokaryotic and eukaryotic genomes and allows submitters to enter information directly into dialog boxes and then generate a finished submission file. Using this tool, submitters can also edit sequence features and validate the data against GenBank submission standards. More information is available at www.ncbi.nlm.nih.gov/tools/gbench/releasenotes/.

Submission enhancements

Influenza and Norovirus submissions

The Submission Portal for GenBank (submit.ncbi.nlm.nih.gov/subs/genbank/) provides two new wizards to streamline submissions of Influenza and Norovirus genomes. These wizards accelerate the submission process and provide automatic feature annotation and validation functions. The wizards accept FASTA formatted sequences and require the following source information: isolate, serotype/genotype, collection date, host, and country of collection. In addition, all of the sequences in a submission must be derived from one virus subtype. Looking forward, we plan to continue releasing similar tools for additional marker genes. More information is available at submit.ncbi.nlm.nih.gov/about/genbank/.
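
Submitters may find it useful to check their source information before starting a wizard. The following Python sketch checks a list of records for the required fields listed above and for a single virus subtype. The dictionary keys, including `subtype`, are hypothetical names chosen for this example and are not the wizard's own field names.

```python
from typing import Dict, List

# Hypothetical keys standing in for the required source information.
REQUIRED_FIELDS = ("isolate", "serotype_or_genotype", "collection_date",
                   "host", "country")


def check_submission(records: List[Dict[str, str]]) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for i, rec in enumerate(records, start=1):
        for field in REQUIRED_FIELDS:
            if not rec.get(field, "").strip():
                problems.append(f"record {i}: missing {field}")
    # All sequences in one submission must come from a single virus subtype.
    subtypes = {rec.get("subtype", "").strip() for rec in records}
    if len(subtypes) > 1:
        problems.append("submission mixes virus subtypes: "
                        + ", ".join(sorted(subtypes)))
    return problems


# Hypothetical example records.
example = [
    {"isolate": "A/test/1/2019", "serotype_or_genotype": "H3N2",
     "collection_date": "2019-01-15", "host": "Homo sapiens",
     "country": "USA", "subtype": "H3N2"},
    {"isolate": "A/test/2/2019", "serotype_or_genotype": "H1N1",
     "collection_date": "2019-02-01", "host": "",
     "country": "USA", "subtype": "H1N1"},
]
for problem in check_submission(example):
    print(problem)
```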

Feature propagation

Submitters often need to deposit a large set of related sequences that typically share a common set of feature annotations. In such cases it is convenient to provide annotations for one sequence in the set and then automatically propagate these annotations onto the remaining members of the set. BankIt now supports this feature propagation function, greatly easing the handling of large sequence sets.
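
The text does not describe how BankIt maps features between sequences. As one plausible illustration of the general idea, the Python sketch below transfers a feature interval from an annotated sequence to a second sequence through a pairwise alignment that is assumed to exist already. The function name and the example alignment are hypothetical.

```python
from typing import Optional, Tuple


def propagate_interval(aligned_ref: str, aligned_tgt: str,
                       start: int, end: int) -> Optional[Tuple[int, int]]:
    """Map a 1-based inclusive interval on the reference onto the target.

    Both inputs are rows of the same pairwise alignment, with '-' for gaps.
    Returns None if no target residue aligns to the feature.
    """
    if len(aligned_ref) != len(aligned_tgt):
        raise ValueError("aligned sequences must have equal length")
    ref_pos = tgt_pos = 0
    mapped = []
    for r, t in zip(aligned_ref, aligned_tgt):
        if r != "-":
            ref_pos += 1
        if t != "-":
            tgt_pos += 1
        # Record target positions aligned to reference residues in the feature.
        if r != "-" and t != "-" and start <= ref_pos <= end:
            mapped.append(tgt_pos)
    if not mapped:
        return None
    return mapped[0], mapped[-1]


# Feature at reference positions 3..6, propagated to a target with one
# deletion before the feature and one insertion inside it.
print(propagate_interval("ATGC-CGTA", "A-GCTCGTA", 3, 6))
```

A production tool would also need to handle partial features, strand and frame, which this sketch ignores.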

Taxonomy handling in BLAST

Recent enhancements to the BLAST+ command-line programs leverage the new version 5 BLAST databases (ftp.ncbi.nlm.nih.gov/blast/db/v5/blastdbv5.pdf) to provide important new functions. In particular, the version 5 databases (BLASTDBv5) index proteins by their accession.version identifiers and also include taxonomic information. This allows users to restrict protein BLAST searches by taxonomy and also to retrieve sequences from these databases using taxonomic limits. Users can also efficiently limit searches by lists of accession.version identifiers.
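
As a hedged illustration of these functions, the Python sketch below only assembles BLAST+ command lines and does not run them. It assumes a BLAST+ release that supports version 5 databases and the `-taxids` and `-seqidlist` options; the database name, file names and taxonomy identifier are placeholders, and the helper function names are hypothetical. The installed version's help output should be checked before relying on any flag.

```python
from typing import List, Optional


def blastp_command(query: str, db: str,
                   taxids: Optional[List[int]] = None,
                   seqidlist: Optional[str] = None) -> List[str]:
    """Build a blastp command limited by taxonomy or by an accession list."""
    if taxids and seqidlist:
        # Treated here as mutually exclusive ways of limiting a search.
        raise ValueError("use either taxids or seqidlist, not both")
    cmd = ["blastp", "-query", query, "-db", db]
    if taxids:
        cmd += ["-taxids", ",".join(str(t) for t in taxids)]
    if seqidlist:
        cmd += ["-seqidlist", seqidlist]
    return cmd


def blastdbcmd_by_taxid(db: str, taxids: List[int]) -> List[str]:
    """Build a blastdbcmd command that retrieves FASTA records by taxonomy."""
    return ["blastdbcmd", "-db", db,
            "-taxids", ",".join(str(t) for t in taxids),
            "-outfmt", "%f"]


# Placeholder inputs: 9606 is the NCBI taxonomy identifier for human.
print(" ".join(blastp_command("query.fa", "nr", taxids=[9606])))
print(" ".join(blastdbcmd_by_taxid("nr", [9606])))
```

The resulting lists can be passed to `subprocess.run` on a system where BLAST+ and the database are installed.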

Pathogen sequences

Genome sequences from the NCBI Pathogen Detection Project (4,5) are now being deposited in GenBank. Given the large amount of data being submitted by these surveillance efforts, the Assembly resource (www.ncbi.nlm.nih.gov/assembly/) now provides an easy way to exclude such sequences from searches of the database. The new filter, available on the left side bar, is ‘exclude derived from surveillance project’ and is checked by default for all Assembly searches.

Expanded sequence identifier formats

As announced in the notes for GenBank release 226, in 2018 the INSDC expanded the ranges of accession number formats to accommodate the rapid growth of sequence databases. We want to emphasize that none of these new accessions will replace any existing accession, and all existing sequences will continue to be retrievable using their current accession.version identifiers. In some cases existing (and exhausted) prefixes have been reactivated with extended numerical suffixes (e.g. JG00000001–JG99999999 versus the exhausted JG000001–JG999999). To be clear, JG000001 (existing accession) and JG00000001 (new accession) will refer to two distinct sequences.
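
Software that validates accessions by length may need adjusting for the reactivated prefixes. The Python sketch below distinguishes only the two formats used in the example above (two letters followed by six or by eight digits); other accession formats exist and are reported as "other" here. The function name and labels are illustrative.

```python
import re

# Only the two formats from the example in the text are covered.
_TWO_PLUS_SIX = re.compile(r"^[A-Z]{2}\d{6}$")
_TWO_PLUS_EIGHT = re.compile(r"^[A-Z]{2}\d{8}$")


def accession_format(identifier: str) -> str:
    """Classify an accession or accession.version identifier by its format."""
    accession = identifier.split(".", 1)[0]  # drop any ".version" suffix
    if _TWO_PLUS_SIX.match(accession):
        return "existing (2 letters + 6 digits)"
    if _TWO_PLUS_EIGHT.match(accession):
        return "expanded (2 letters + 8 digits)"
    return "other"


for acc in ("JG000001", "JG00000001"):
    print(acc, "->", accession_format(acc))
```

The two identifiers must be compared as exact strings: stripping leading zeros or parsing the numeric part as an integer would wrongly merge JG000001 and JG00000001.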

MAILING ADDRESS

GenBank, National Center for Biotechnology Information, Building 45, Room 6AN12D-37, 45 Center Drive, Bethesda, MD 20892, USA.

ELECTRONIC ADDRESSES

www.ncbi.nlm.nih.gov - NCBI Home Page.

gb-sub@ncbi.nlm.nih.gov - Submission of sequence data to GenBank.

update@ncbi.nlm.nih.gov - Revisions to, or notification of release of, ‘confidential’ GenBank entries.

info@ncbi.nlm.nih.gov - General information about NCBI resources.

CITING GENBANK

If you use the GenBank database in your published research, we ask that this article be cited.

FUNDING

Funding for open access charge: Intramural Research Program of the National Library of Medicine, National Institutes of Health.

Conflict of interest statement. None declared.

REFERENCES

1. Sayers E.W., Cavanaugh M., Clark K., Ostell J., Pruitt K.D., Karsch-Mizrachi I. GenBank. Nucleic Acids Res. 2019; 47:D94–D99.
2. Barrett T., Clark K., Gevorgyan R., Gorelenkov V., Gribov E., Karsch-Mizrachi I., Kimelman M., Pruitt K.D., Resenchuk S., Tatusova T. et al. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res. 2012; 40:D57–D63.
3. Ciufo S., Kannan S., Sharma S., Badretdin A., Clark K., Turner S., Brover S., Schoch C.L., Kimchi A., DiCuccio M. Using average nucleotide identity to improve taxonomic assignments in prokaryotic genomes at the NCBI. Int. J. Syst. Evol. Microbiol. 2018; 68:2386–2392.
4. Timme R.E., Rand H., Shumway M., Trees E.K., Simmons M., Agarwala R., Davis S., Tillman G.E., Defibaugh-Chavez S., Carleton H.A. et al. Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance. PeerJ. 2017; 5:e3893.
5. NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2017; 45:D12–D17.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
