README
ftp://ftp.ncbi.nlm.nih.gov/hmm/README.txt

What is in this directory?
==========================
This site provides Hidden Markov Models (HMMs) of protein family profiles
maintained at NCBI. These HMMs are used by the NCBI Prokaryotic Genome
Annotation Pipeline (PGAP) as hints to the structural annotation process,
and/or for naming of annotated proteins. See more information in:
  https://github.com/ncbi/pgap
  https://www.ncbi.nlm.nih.gov/genome/annotation_prok/
  https://www.ncbi.nlm.nih.gov/genome/annotation_prok/evidence/

A subset of these HMMs is also used for the detection of AMR genes by
AMRFinder. See more information in:
  https://github.com/ncbi/amr/wiki/AMRFinder-database

For information on creating HMMs, searching against HMMs, or file formats, see
  http://eddylab.org/software/hmmer/Userguide.pdf

How to cite?
============
When using this collection of HMMs, please cite:
Li W, O'Neill KR et al., Nucleic Acids Res. 2021 Jan 8; 49(D1):D1020-D1028.
PMID: 33270901

Sources of HMMs
===============
* NCBI: many HMMs were derived from protein clusters (PRKs, Nucleic Acids
  Res. 2009 Jan;37(Database issue):D216-23, PMID: 18940865), and represent
  either the full cluster or a subgroup of the proteins in the original
  cluster. In addition, some HMMs were created from scratch by expert
  curators based on publications describing protein function, including the
  anti-microbial resistance HMMs used by AMRFinder.

* TIGR/JCVI: Some HMMs were originally built at The Institute for Genomic
  Research (TIGR), the precursor to the J. Craig Venter Institute (JCVI).
  This collection, the TIGRFAMs, is now maintained at NCBI. TIGRFAM release
  15.0 is available in the TIGRFAMs directory, and contains the HMMs as
  provided by JCVI in April 2018. Since then, the cut-offs, seed proteins
  and attributes of the TIGRFAMs may have been modified by NCBI curators.
  The HMMs in the numbered release directories include the changes made at
  NCBI since the hand-off of TIGRFAM release 15.0.

Note about the PFAMs: The PFAM collection is not provided here. It is
maintained by the European Bioinformatics Institute and can be obtained from
  ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/
However, attributes of the PFAMs are provided in the tab-delimited file
hmm_PGAP.tsv. NCBI curators have associated a product name and other
attributes with some PFAM models, so they can be used for the functional
annotation of proteins predicted by PGAP. The defining characteristics of the
PFAM models (threshold for inclusion and seed alignments) are not modified
by NCBI.

Hierarchical structure and family type
======================================
The HMMs used by PGAP are part of a hierarchical system of evidence based on
specificity. Each model is assigned a family type that expresses how specific
it is. When a protein has hits to multiple pieces of evidence, the name and
attributes of the most specific evidence are propagated to the protein.
See more details here:
  https://www.ncbi.nlm.nih.gov/genome/annotation_prok/evidence/#family-types

File content
============

Numbered releases (e.g. 13.0):
------------------------------
Set of HMM models included in the release.
For a description of the formats, see:
  http://eddylab.org/software/hmmer/Userguide.pdf

* RELEASE_NOTES.txt: notes describing the specific release
* for-interpro.tsv: list of NCBI HMM models that are or will be incorporated
  into InterPro (file added with release 13)
* hmm_PGAP.HMM: directory of HMM model files in profile flatfile format.
  NOTE: starting with release 12.0, the DESC field contains the name for the
  HMM, rather than the name to assign proteins named by the family (for
  this, see product_name in hmm_PGAP.tsv).
* hmm_PGAP.HMM.tar.gz: tarred directory of HMM model files in profile
  flatfile format.
* hmm_PGAP.LIB: concatenated HMM models in flatfile format
* hmm_PGAP.SEED: directory of HMM seed alignments
* hmm_PGAP.SEED.tar.gz: tarred directory of HMM seed alignments
* hmm_PGAP.tsv: tab-delimited file containing the attributes of the HMMs
  included in the collection used by PGAP.
  Note that PFAMs are included in this file. The defining characteristics of
  the PFAM models (threshold for inclusion and seed alignments) are those
  defined by EBI-EMBL. However, product name and other attributes may have
  been modified or added by NCBI curators.
  NOTES:
  ** GO terms associated with some HMMs were added to the file (col 14)
     starting with release 6 (August 2021).
  ** Names of the HMMs (col 22) and a comment describing the family (col 23)
     were added with release 12 (April 2023).

  col 1   ncbi_accession -- TIGR#.# or NF#.#. NCBI accession for the HMM.
  col 2   source_identifier -- (TIGR#, PF#, PRK#). For HMMs derived from PRK
          clusters, name of the cluster. For TIGRFAMs and PFAMs, identifier
          in an external resource (JCVI and EBI-EMBL respectively)
  col 3   label -- NAME in the profile files. A single word containing no
          spaces [see definition in HMM manual]
  col 4   sequence_cutoff -- Minimum hmmsearch full-sequence score for the
          protein to be considered a hit to this HMM
  col 5   domain_cutoff -- Minimum hmmsearch domain score for the protein to
          be considered a hit to this HMM
  col 6   hmm_length -- The number of amino acid positions in the HMM seed
          alignment
  col 7   family_type -- See definitions in
          https://www.ncbi.nlm.nih.gov/genome/annotation_prok/evidence/#family-types
  col 8   for_structural_annotation -- (Y or N). Used/not used as evidence in
          the PGAP structural annotation process
  col 9   for_naming -- (Y or N). Used/not used as evidence in the PGAP
          functional annotation process
  col 10  for_AMRFinder -- (Y or N).
          Used/not used by AMRFinder
  col 11  product_name -- Function or description of proteins in the family
  col 12  gene_symbol -- Gene name that applies to all genes/proteins in the
          family
  col 13  ec_number -- Enzyme Commission number(s) that apply to all proteins
          in the family
  col 14  go_terms -- Gene Ontology term(s) that apply to all proteins in the
          family
  col 15  pmid -- PubMed identifier of publication(s) describing the protein
          family or one of its founding members
  col 16  taxonomic_range -- Identifier for the highest taxonomic level
          represented in the family, excluding outliers
  col 17  taxonomic_range_name -- Name for the highest taxonomic level
          represented in the family, excluding outliers
  col 18  taxonomic_rank_name -- Rank for the highest taxonomic level
          represented in the family, excluding outliers
  col 19  n_refseq_protein_hits -- Count of RefSeq WP_ proteins that are a
          match for this model
  col 20  source -- Original source for the HMMs. One of: NCBI PRK, NCBIFAM,
          NCBI AMR, JCVI or EBI-EMBL
  col 21  name_orig -- For TIGRFAM: name assigned by TIGR/JCVI. For PFAM:
          domain name/name assigned by EBI-EMBL.
          For others: empty
  col 22  hmm_name -- Function or short description for the protein family
  col 23  comment -- Long description and other information about the
          protein family

ARCHIVES:
---------
HMM releases 1.1 and older

NCBIfam-AMRFinder:
------------------
Subset of models used by AMRFinder
* Numbered release or dated release
  ** NCBIfam-AMRFinder.HMM.tar.gz -- see description for hmm_PGAP.HMM.tar.gz
  ** NCBIfam-AMRFinder.LIB -- see description for hmm_PGAP.LIB
  ** NCBIfam-AMRFinder.SEED.tar.gz -- see description for hmm_PGAP.SEED.tar.gz
  ** NCBIfam-AMRFinder.changelog.txt -- changes since last release
  ** NCBIfam-AMRFinder.tsv -- see description for hmm_PGAP.tsv
* latest: latest NCBIfam-AMRFinder release

README.txt
----------
This README file

TIGRFAMs:
---------
Release 15.0 of the TIGRFAMs as handed off to NCBI in April 2018

current:
--------
Link pointing to the latest release of the HMM collection
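
Example: reading hmm_PGAP.tsv
-----------------------------
The column layout of hmm_PGAP.tsv described above can be read by position.
The Python sketch below is only an illustration, not part of PGAP or
AMRFinder. It assumes that a header line, if present, starts with '#', and
that rows from older releases may simply have fewer columns. The function
name read_hmm_table and the two sample rows (accessions, labels, product
names) are invented for the example.

```python
import csv
import io

# Column names in the order documented in this README (col 1 .. col 23).
COLUMNS = [
    "ncbi_accession", "source_identifier", "label", "sequence_cutoff",
    "domain_cutoff", "hmm_length", "family_type",
    "for_structural_annotation", "for_naming", "for_AMRFinder",
    "product_name", "gene_symbol", "ec_number", "go_terms", "pmid",
    "taxonomic_range", "taxonomic_range_name", "taxonomic_rank_name",
    "n_refseq_protein_hits", "source", "name_orig", "hmm_name", "comment",
]


def read_hmm_table(handle):
    """Yield one dict per model row, keyed by the documented column names."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        # Assumption: blank lines and header lines starting with '#' are skipped.
        if not fields or fields[0].startswith("#"):
            continue
        # Assumption: older releases have fewer columns; pad so every key exists.
        fields = fields + [""] * (len(COLUMNS) - len(fields))
        yield dict(zip(COLUMNS, fields))


# Invented sample rows (hypothetical accessions and names), first 11 columns only.
SAMPLE = (
    "#ncbi_accession\tsource_identifier\tlabel\n"
    "NF000001.1\tPRK00001\texample_a\t100.0\t100.0\t250\tequivalog"
    "\tY\tY\tN\thypothetical example protein A\n"
    "NF000002.1\t\texample_b\t50.0\t40.0\t120\tdomain"
    "\tN\tY\tY\thypothetical example protein B\n"
)

rows = list(read_hmm_table(io.StringIO(SAMPLE)))
amr = [r["ncbi_accession"] for r in rows if r["for_AMRFinder"] == "Y"]
print(amr)  # ['NF000002.1']
```

To read a downloaded file instead of the sample, pass an open text handle,
e.g. open("hmm_PGAP.tsv", encoding="utf-8", newline="").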

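
Example: applying the cut-offs
------------------------------
Columns 4 and 5 of hmm_PGAP.tsv give the minimum hmmsearch full-sequence and
domain scores for a protein to count as a hit. The sketch below shows one way
to apply them. It assumes that both thresholds must be met and that the
domain score used is the best-scoring domain for the protein. This README
does not spell out how PGAP combines the two values, so treat the rule as an
assumption. The function name passes_cutoffs and all scores are invented.

```python
def passes_cutoffs(sequence_score, best_domain_score,
                   sequence_cutoff, domain_cutoff):
    """Return True when both scores reach the model's cut-offs.

    Cut-offs are accepted as strings or numbers, since values read from
    hmm_PGAP.tsv arrive as text.
    """
    return (sequence_score >= float(sequence_cutoff)
            and best_domain_score >= float(domain_cutoff))


# Invented cut-offs standing in for one row of hmm_PGAP.tsv (cols 4 and 5).
model = {"sequence_cutoff": "100.0", "domain_cutoff": "80.0"}

print(passes_cutoffs(120.5, 95.0,
                     model["sequence_cutoff"], model["domain_cutoff"]))  # True
print(passes_cutoffs(120.5, 60.0,
                     model["sequence_cutoff"], model["domain_cutoff"]))  # False
```

A score exactly equal to a cut-off passes, because the cut-offs are described
as minimum scores.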