Automated Lineage Definitions Now Available in NCBI Virus SARS-CoV-2 Variants Overview


NCBI Virus SARS-CoV-2 Variants Overview recently moved from a manual to an automated process for selecting the mutations required to define a lineage (e.g., Omicron, BA.2, JN.1). With this update, the SARS-CoV-2 Variants Overview covers all SARS-CoV-2 lineages and is no longer limited to lineages with CDC status. The SARS-CoV-2 Variants Overview website reports results from analyzing both GenBank and unassembled Sequence Read Archive (SRA) sequence data. It allows you to view geographic and frequency trends of records assigned to Pango lineages and to search for sequence records using lineage-defining or other mutations (example shown in Figure 1).

Screenshot of the SARS-CoV-2 Variants Overview page

Figure 1: On the “Lineage Frequency and Location” tab, select a Pango lineage such as JN.1 to access details including the lineage-defining mutations, the change in frequency of the lineage in GenBank and SRA records, and the geographic locations where the samples were collected. 

Automated lineage definition 

Pangolin in UShER mode is run daily to assign lineages to all GenBank sequences. The new automated lineage definition pipeline identifies the mutations that are characteristic of, and unique to, each Pango lineage. Specifically, the pipeline creates a set of mutations defining a single lineage according to the following rules:

  • The lineage must have at least 10 GenBank records assigned to it by Pangolin  
  • Lineage-defining mutations from the set must occur in at least 80% of records assigned to that lineage by Pangolin 
  • Mutations must be specific to a single lineage, i.e., if more than 20% of the records containing a candidate lineage-defining mutation were assigned to a different lineage by Pangolin, the candidate mutation is ineligible 
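As a rough illustration only, the three rules above can be sketched in Python. This is not the NCBI pipeline's code: the function name `define_lineages`, the constant names, and the input shape (pairs of a Pangolin lineage label and a set of mutation strings) are assumptions made for the example, and lineages are treated as flat labels with no parent/child relationships.

```python
from collections import Counter, defaultdict

# Thresholds taken from the rules above; the constant names are hypothetical.
MIN_RECORDS = 10        # minimum GenBank records assigned to a lineage
MIN_PREVALENCE = 0.80   # minimum share of the lineage's records carrying the mutation
MAX_OFF_LINEAGE = 0.20  # maximum share of the mutation's records in other lineages


def define_lineages(records):
    """Return {lineage: set of lineage-defining mutations}.

    `records` is an iterable of (lineage, mutations) pairs, where `lineage`
    is the label assigned by Pangolin and `mutations` is an iterable of
    mutation strings observed in that record (assumed input format).
    """
    lineage_sizes = Counter()            # records per lineage
    mutation_totals = Counter()          # records per mutation, all lineages
    per_lineage = defaultdict(Counter)   # records per mutation within a lineage

    for lineage, mutations in records:
        lineage_sizes[lineage] += 1
        for mutation in set(mutations):  # count each mutation once per record
            mutation_totals[mutation] += 1
            per_lineage[lineage][mutation] += 1

    definitions = {}
    for lineage, size in lineage_sizes.items():
        if size < MIN_RECORDS:
            continue  # rule 1: too few records to define this lineage
        defining = set()
        for mutation, in_lineage in per_lineage[lineage].items():
            total = mutation_totals[mutation]
            prevalence = in_lineage / size              # rule 2
            off_lineage = (total - in_lineage) / total  # rule 3
            if prevalence >= MIN_PREVALENCE and off_lineage <= MAX_OFF_LINEAGE:
                defining.add(mutation)
        definitions[lineage] = defining
    return definitions
```

In this sketch, a lineage with fewer than 10 records gets no entry at all, and a mutation shared evenly between two lineages is excluded from both because half of its records fall outside either one.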

The pipeline subsequently uses these predefined sets of mutations to assign lineages to SRA and GenBank sequence records. A sequence record must contain 100% of the lineage-defining mutations for it to be assigned to that lineage. After a sequence is released to the public, it is usually classified into a lineage and available through the NCBI Virus SARS-CoV-2 Variants Overview webpage within a couple of days. Mutation sets are recalculated weekly, and all sequence records are reclassified periodically. More details on the lineage definition pipeline can be found in the NCBI Virus help documentation.
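The 100% requirement amounts to a subset test: every mutation in a lineage's defining set must be present in the record. A minimal Python sketch follows, assuming mutations are plain strings and that definitions are held in a dictionary. The function name `matching_lineages` and the data layout are hypothetical. The announcement does not say how a record that satisfies more than one definition is resolved, so the sketch simply returns every match.

```python
def matching_lineages(record_mutations, definitions):
    """Return the sorted lineages whose defining mutations all occur in the record.

    `definitions` maps a lineage name to its set of lineage-defining mutations
    (assumed layout). Lineages with an empty defining set are skipped, because
    an empty set would trivially match every record.
    """
    observed = set(record_mutations)
    return sorted(
        lineage
        for lineage, defining in definitions.items()
        if defining and defining <= observed  # subset test: 100% must be present
    )


# Illustrative placeholder data, not real lineage definitions.
example_definitions = {
    "lineage_1": {"mut_a", "mut_b"},
    "lineage_2": {"mut_c"},
}
complete = matching_lineages({"mut_a", "mut_b", "mut_x"}, example_definitions)
partial = matching_lineages({"mut_a", "mut_x"}, example_definitions)
```

Here `complete` contains `lineage_1`, since extra mutations in the record do not prevent a match, while `partial` is empty because `mut_b` is missing.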

Stay up to date 

Follow us on social media @NCBI and join our mailing list to keep up to date with NCBI Virus and other NCBI news. 

Questions? 

Please reach out to us with questions or feedback. 

 
