Tag: NCBI Prokaryotic Genome Annotation Pipeline (PGAP)

New in RAPT: Better taxonomic assignment and GO annotation

We are excited to announce two improvements to the Read assembly and Annotation Pipeline Tool (RAPT), which allows you to assemble genomic reads for bacterial or archaeal isolates and annotate their genes at the click of a button.

Improved taxonomic assignment

Now RAPT verifies the scientific name you provide with the reads, and corrects it as needed with the Average Nucleotide Identity (ANI) tool, which compares your genome to type strain assemblies in GenBank to place it in the taxonomic tree. So, even if you only have a rough idea of the species you have sequenced, input datasets tailored to your genome will be used for the annotation and you will get the best possible gene set from RAPT.
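To make the idea concrete, here is a minimal sketch of how an ANI-based name check could work. It is not RAPT's actual implementation: the `AniHit` record, the `assign_species` function, and the 95% identity and 80% coverage cutoffs are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class AniHit:
    """One comparison of the query genome against a type-strain assembly."""
    species: str     # species name of the type-strain assembly
    ani: float       # average nucleotide identity, in percent
    coverage: float  # percent of the query genome that aligned


def assign_species(submitted_name, hits, min_ani=95.0, min_coverage=80.0):
    """Return (name_to_use, status) for a submitted species name.

    The cutoffs are placeholders chosen for illustration; the real
    pipeline uses its own, per-species thresholds.
    """
    confident = [h for h in hits
                 if h.ani >= min_ani and h.coverage >= min_coverage]
    if not confident:
        # No type strain matches well enough: keep what the user provided.
        return submitted_name, "inconclusive"
    best = max(confident, key=lambda h: h.ani)
    if best.species == submitted_name:
        return submitted_name, "confirmed"
    # A different species matches with high confidence: override the name.
    return best.species, "corrected"
```

Calling `assign_species` with the name you submitted and a list of hits returns the name that would be used for annotation, together with a status telling you whether the name was confirmed, corrected, or left alone.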

NCBI hidden Markov models (HMM) release 8.0 now available!

Release 8.0 of the NCBI Hidden Markov models (HMM), used by the Prokaryotic Genome Annotation Pipeline (PGAP), is now available for download. You can search this collection against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
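If you want to script such a search, the sketch below shows one possible way to drive HMMER from Python. It assumes `hmmsearch` is installed and uses its `--noali`, `-E` and `--tblout` options; the function names, file names and the E-value cutoff are our own placeholders, not part of the release.

```python
def hmmsearch_command(hmm_file, protein_fasta, tblout, evalue=1e-5):
    """Build an hmmsearch command line (file names are placeholders)."""
    return ["hmmsearch", "--noali", "-E", str(evalue),
            "--tblout", tblout, hmm_file, protein_fasta]


def parse_tblout(lines):
    """Parse the per-target table written by hmmsearch --tblout.

    For hmmsearch, the target is the protein and the query is the HMM.
    The first 18 columns are whitespace-separated; the rest of the line
    is the free-text description of the target.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split(maxsplit=18)
        hits.append({
            "protein": fields[0],
            "model": fields[2],
            "model_accession": fields[3],
            "evalue": float(fields[4]),  # full-sequence E-value
            "score": float(fields[5]),   # full-sequence bit score
        })
    return hits


def best_hit_per_protein(hits):
    """Keep the highest-scoring model for each protein."""
    best = {}
    for hit in hits:
        current = best.get(hit["protein"])
        if current is None or hit["score"] > current["score"]:
            best[hit["protein"]] = hit
    return best
```

You could pass the list from `hmmsearch_command` to `subprocess.run`, then feed the lines of the resulting table to `parse_tblout`. Picking the single best-scoring model per protein is a simplification; PGAP applies its own cutoffs and precedence rules when naming proteins.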

The 8.0 release contains 15,358 models, including 160 that are new since 7.0. In addition, we have added better names, EC numbers, Gene Ontology (GO) terms, gene symbols or publications to over 550 existing HMMs. You can search and view the details for these in the Protein Family Model collection, which also includes conserved domain architectures and BlastRules, and find all RefSeq proteins they name.

GO terms associated with HMMs are now propagated to coding sequences and proteins annotated with PGAP. In case you missed it, see our previous blog post on this topic.

New version of PGAP available now!

We are happy to announce the release of a new version of the stand-alone Prokaryotic Genome Annotation Pipeline (PGAP).

This version of PGAP offers a more streamlined experience to users who are uncertain about the taxonomic classification of the genomes they wish to annotate. Adding one flag to the command (--auto-correct-tax) overrides the species name provided as input if the taxonomy verification process predicts a different organism with high confidence.
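As an illustration, the snippet below assembles a PGAP command line with the new flag. Only `--auto-correct-tax` comes from this announcement; the other options (`-r`, `-o`, `-g`, `-s`), the `pgap_command` helper and the file names are assumptions based on typical usage, so please check `pgap.py --help` for the version you have installed.

```python
import shlex


def pgap_command(fasta, organism, outdir, auto_correct_tax=True):
    """Assemble a pgap.py command line as a list of arguments.

    Every option except --auto-correct-tax is an assumption to verify
    against your PGAP version.
    """
    cmd = ["./pgap.py", "-r", "-o", outdir, "-g", fasta, "-s", organism]
    if auto_correct_tax:
        # Let the taxonomy check override the submitted species name
        # when it predicts a different organism with high confidence.
        cmd.append("--auto-correct-tax")
    return cmd


# Print a shell-ready command; the input names are placeholders.
print(shlex.join(pgap_command("genome.fasta", "Escherichia coli", "results")))
```

Building the command as a list keeps the organism name, which contains a space, as a single argument when you hand it to `subprocess.run`.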

Bacterial and archaeal genomes with GO terms in RefSeq!

RefSeq prokaryotic genomes and proteins are now annotated with Gene Ontology (GO) terms. Over the years we have received many requests to add GO terms to the annotations we provide. We heard you!

We are embarking on this adventure and starting to assign terms from the Biological Process, Molecular Function and Cellular Component ontologies to the genomes and proteins we annotate with the Prokaryotic Genome Annotation Pipeline (PGAP). Because of the hierarchical nature of the Gene Ontologies, these annotations will facilitate the comparison of gene content across genomes at variable levels of specificity and eventually allow GO term enrichment analysis. GO terms are now associated with coding sequence (CDS) features on newly submitted genomes (see Figure 1). They will progressively appear on genomes that are already in RefSeq as these get reannotated (about once a year). We expect all RefSeq genomes to have some GO terms by the spring of 2023.
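Here is a small sketch of why the hierarchy matters when comparing genomes. The ontology below is a toy with made-up identifiers (not real GO terms), and the function names are ours; the point is only that two genomes with different specific terms can still be compared through a shared, more general parent term.

```python
# Toy ontology mapping each term to its parents.
# These identifiers are invented for illustration; they are not GO IDs.
PARENTS = {
    "TOY:process": [],
    "TOY:transport": ["TOY:process"],
    "TOY:ion_transport": ["TOY:transport"],
    "TOY:sugar_transport": ["TOY:transport"],
}


def with_ancestors(terms, parents):
    """Return the given terms plus every ancestor reachable from them."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, []))
    return seen


def shared_terms(genome_a_terms, genome_b_terms, parents):
    """Terms common to both genomes once ancestors are included."""
    return (with_ancestors(genome_a_terms, parents)
            & with_ancestors(genome_b_terms, parents))
```

A genome annotated only with `TOY:ion_transport` and another annotated only with `TOY:sugar_transport` have no term in common at the most specific level, but `shared_terms` reports that both carry out transport.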

New PGAP release: Structural and functional annotation improvements

A new version of the Prokaryotic Genome Annotation Pipeline (PGAP) is available on GitHub. With this release, you can expect:

  • Incremental improvements in structural annotation, driven by increased weight of GeneMarkS2+ ab initio models at loci with only weak evidence, such as low identity and low coverage protein alignments or partial HMM signatures.
  • Better structural annotation and more specific functional annotation as a result of the incorporation of PFAM 34 and extensive curation of HMMs, BlastRules and Conserved Domain architectures by NCBI experts.
  • Fewer overly stringent calls by the taxonomy verification module for several species, including the human pathogens Listeria monocytogenes, Campylobacter lari, and Vibrio vulnificus. This is a result of manual review and adjustment of the minimum percent identity thresholds used by the Average Nucleotide Identity tool.
  • Multiple bug fixes. Notably, users of Azure Debian 10 machines can now run PGAP successfully, as we have incorporated GeneMarkS2+ compiled under Linux kernel 3 into the PGAP image.

Please try this release and send us your feedback!

New models added to the NCBI Hidden Markov models (HMM) collection with release 7.0

Release 7.0 of the NCBI Hidden Markov models (HMM), used by the Prokaryotic Genome Annotation Pipeline (PGAP), is now available for download. You can search this collection against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.

Figure 1. Recently added HMM-based Protein Family Model for the histidine-histamine antiporter family (NF040512), with GO terms (framed in red).

Search the NCBI Hidden Markov models collection against your favorite prokaryotic proteins

The NCBI Hidden Markov models (HMM) 6.0 release, available on our FTP site, has 15,247 models supported at NCBI. We created 80 new HMMs and consolidated the collection by removing 2,151 HMMs that were nearly identical to another model. Release 6.0 also incorporates 12,656 PFAM models from release 34 that apply to prokaryotic proteins. You can use the HMMER sequence analysis package to search the collection against your favorite prokaryotic proteins to identify their function. We have also added more specific names, EC numbers, gene symbols or publications to over 500 HMMs.

Gene Ontology (GO) term attributes are now available for 20% of HMMs (see Figure 1 below). We added most of these based on existing mappings, but our experts are working on creating more associations. In the fall, we’ll start propagating GO terms from HMMs to annotated genomes and proteins!

Figure 1. Example Protein Family Model, TIGR03697.1 for the global nitrogen regulator NtcA protein family, with newly shown GO terms (framed in red).

RefSeq release 207 is available!

RefSeq release 207 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of July 12, 2021, and contains 285,425,070 records, including 209,035,492 proteins, 39,039,901 RNAs, and sequences from 112,462 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

Announcing the re-annotation of RefSeq genome assemblies for E. coli and four other species!

We have re-annotated all RefSeq genomes for Escherichia coli, Mycobacterium tuberculosis, Bacillus subtilis, Acinetobacter pittii, and Campylobacter jejuni using the most recent release of PGAP. You will find that more genes now have gene symbols (e.g. recA). Your feedback indicated that the lack of symbols was an impediment to comparative analysis, so we hope that this improvement will help.

The number of re-annotated genomes is 25,619 for E. coli, 470 for B. subtilis, 6,828 for M. tuberculosis, 316 for A. pittii, and 1,829 for C. jejuni. On average, the increase in gene symbols is 30% in E. coli, 110% in B. subtilis, 57% in M. tuberculosis, 94% in A. pittii and 62% in C. jejuni (see Figure 1). After re-annotation, on average, 73% of PGAP-annotated E. coli genes and 79% of B. subtilis genes have symbols (35% for M. tuberculosis, 40% for A. pittii and 46% for C. jejuni). We assigned symbols to the annotated genes by calculating the orthologs between the genome of interest and the reference assembly for the species, and transferring the symbols from the reference genes to their orthologs in the annotated genomes.
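The transfer step can be pictured with the short sketch below. It assumes the ortholog calculation has already produced a mapping from genes in the annotated genome to genes in the reference assembly; the function names and locus tags are hypothetical and do not reflect PGAP's internal data structures.

```python
def transfer_symbols(orthologs, reference_symbols):
    """Copy gene symbols from reference genes to their orthologs.

    orthologs: maps a locus tag in the annotated genome to the locus tag
        of its ortholog in the reference assembly.
    reference_symbols: maps a reference locus tag to its gene symbol,
        or to None when the reference gene has no symbol.
    """
    transferred = {}
    for locus, ref_locus in orthologs.items():
        symbol = reference_symbols.get(ref_locus)
        if symbol:  # skip reference genes without a symbol
            transferred[locus] = symbol
    return transferred


def percent_with_symbols(transferred, total_genes):
    """Percentage of annotated genes that received a symbol."""
    return 100.0 * len(transferred) / total_genes
```

For example, if a gene in your genome is orthologous to the reference recA gene, `transfer_symbols` gives it the symbol recA, while genes whose reference ortholog has no symbol are left unnamed.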

Figure 1: Average and standard deviation of the number of genes annotated with symbols per genome, in the previous (blue) and the current annotation (orange). 

New version of PGAP available now!

We are happy to announce that a new version of PGAP is available. This version will annotate 20 to 25% more genes with symbols (e.g. recA) on the assembled genomes of key species, compared to previous versions.

You will observe an increase in symbols when you annotate the genomes of Escherichia coli, Campylobacter jejuni and a few other species. As several users have requested, this feature will facilitate the comparison of gene content across multiple genomes. It is made possible by a new PGAP workflow that identifies orthologs between the genome being annotated and the reference genome of the same species: Escherichia coli str. K-12 substr. MG1655, Bacillus subtilis subsp. subtilis str. 168, Campylobacter jejuni subsp. jejuni NCTC 11168, Mycobacterium tuberculosis H37Rv, or Acinetobacter pittii PHEA-2. Symbols of reference genes with defined function are propagated to their orthologs in the genome annotated with PGAP.
