Tag: Genome assemblies

Important Update! Changes to ASSEMBLY_REPORTS and GENOME_REPORTS on FTP

Do you currently access genome assembly data through the FTP site? We are consolidating the information provided in the ASSEMBLY_REPORTS and GENOME_REPORTS directories on the genomes FTP site to simplify access and ensure that you have the most accurate, up-to-date, and consistently reported data.

The assembly_summary files in the ASSEMBLY_REPORTS directory are gaining information in newly added columns 24-38, including statistics about the assembly (size, GC content, genome size, and number of sequences) as well as details about the provided annotation (number of genes, annotation name and date). See example below (Table 1). Check out the README for more details about the contents of the summary files.
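As a rough illustration of working with a summary file like this, here is a minimal Python sketch that reads a tab-separated file whose header is the last line beginning with "#". The layout, the column names (assembly_accession, organism_name, genome_size, gc_percent), the accessions, and the values in the sample are all placeholders chosen for the example; the README remains the authority on the real columns.

```python
import io


def load_assembly_summary(handle):
    """Return (header, rows) from a tab-separated summary file.

    Assumption: comment lines start with '#', and the last comment line
    before the data holds the tab-separated column names.
    """
    header = None
    rows = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            # Keep overwriting so the last comment line wins as the header.
            header = line.lstrip("#").strip().split("\t")
            continue
        if header is None:
            raise ValueError("no header line found before the data rows")
        rows.append(dict(zip(header, line.split("\t"))))
    return header, rows


# Made-up sample data; not taken from a real assembly_summary file.
SAMPLE = (
    "# See the README for a description of the columns\n"
    "#assembly_accession\torganism_name\tgenome_size\tgc_percent\n"
    "GCF_000000001.1\tExample organism A\t1000000\t41.5\n"
    "GCF_000000002.1\tExample organism B\t2500000\t38.0\n"
)

header, rows = load_assembly_summary(io.StringIO(SAMPLE))
large = [r["assembly_accession"] for r in rows if int(r["genome_size"]) > 2_000_000]
print(large)
```

With a real file, and assuming the column numbers in the text are 1-based, `header[23:38]` would list the names of columns 24-38.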

Now Available! More Mammalian Cross-Species Alignments in the Comparative Genome Viewer (CGV)

In response to your feedback, we’ve made more whole genome cross-species alignments available in NCBI’s Comparative Genome Viewer (CGV). You can use these alignments to explore genome rearrangements between species. You can also zoom in to analyze regions of conserved gene synteny.

There are over 20 new cross-species alignments available, including human-mouse, mouse-rat, human-chimp, human-cattle, dog-cat, and others! These cross-species alignments provide additional opportunities to explore evolutionary relationships at the genomic and gene levels. We will add more cross-species alignments in the coming months.

The latest cross-species alignments added to CGV include imports from the UCSC Genomics Institute, as well as those generated at NCBI.

Check out two examples of cross-species whole-genome alignments in CGV below (Figure 1).

Figure 1. Whole genome alignments between (A) mouse and human (GRCm39 vs. GRCh38.p14) and (B) cat and dog (F.catus_Fca126_mat1.0 vs. ROS_Cfam_1.0). Colored bands connect aligned regions; green indicates same orientation, blue indicates opposite orientation.

When you zoom in on an alignment (Figure 2), you can compare gene annotation on the two assemblies and see the extent of conservation of synteny. You can also see which genes are missing from one or the other assembly, indicating changes in sequence or differences in annotation.

Foreign Contamination Screen (FCS) tool for GenBank submissions

We are excited to introduce a Foreign Contamination Screen (FCS) tool that you can now run yourself, with enhanced contaminant detection sensitivity to improve your genome assemblies and facilitate high-quality data submissions to GenBank. If you submit genome assembly data to GenBank, the FCS tool is for you!

What is the FCS tool?

FCS, a quality assurance process used to make data suitable for submission, consists of two parts: FCS-adaptor and FCS-GX. FCS-adaptor searches for short sequences that are used as part of the lab preparation process and sometimes wind up in the final assembly by mistake. FCS-GX searches for sequences from a wide range of organisms including bacteria, fungi, protists, viruses, and others to identify sequences that don’t look like they are from the intended organism. In each case, you receive a report of the coordinates and identities of potential contaminants to be reviewed and removed (see Figure 1 for a sample report of the FCS-GX summary output). Both tools are designed to screen both eukaryote and prokaryote genomes.
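To make the "review and remove" step concrete, here is a small Python sketch that deletes reported spans from a single sequence. It is a toy illustration only, not part of FCS: the function name, the 1-based inclusive coordinate convention, and the example spans are assumptions made for this example, and the actual report format and recommended cleanup steps are described in the FCS documentation.

```python
def remove_spans(sequence, spans):
    """Return `sequence` with the given spans deleted.

    Assumption: spans are (start, end) pairs, 1-based and inclusive.
    Overlapping or adjacent spans are handled by tracking a cursor.
    """
    kept = []
    cursor = 0  # 0-based index of the first base not yet processed
    for start, end in sorted(spans):
        if start < 1 or end > len(sequence) or start > end:
            raise ValueError(f"span out of range: {(start, end)}")
        begin = start - 1  # convert to a 0-based slice start
        if begin > cursor:
            kept.append(sequence[cursor:begin])
        cursor = max(cursor, end)
    kept.append(sequence[cursor:])
    return "".join(kept)


# Hypothetical example: bases 5-8 were flagged as contamination.
print(remove_spans("AAAACCCCGGGGTTTT", [(5, 8)]))
```

Sorting the spans first means they can be supplied in any order, and overlapping spans do not remove more sequence than intended.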

Figure 1. FCS-GX report showing the summary of contamination identified in a tomato genome. The output indicates there are 83 sequences, adding up to 381 kb total length, to be removed from a mix of insect, fungal, and bacterial sources.

How do I use FCS?

FCS is available from GitHub. Simply download the two programs (FCS-adaptor and FCS-GX), and follow a few steps as outlined in the Quickstart. Both tools are also easy and inexpensive to run on commercial clouds such as Amazon Web Services (AWS) or Google Cloud Platform (GCP), and can screen genomes in a fraction of the time of other approaches. 

Why is FCS important?

High-quality data is necessary for reaching accurate conclusions in research. By rapidly detecting contaminants from foreign organisms in assembled genomes, FCS helps ensure that high-value data is submitted and available for reuse. We’ve already used FCS-GX to remove over one hundred megabases of contaminants and thousands of erroneous genes and proteins from previously submitted eukaryote genomes to make the data more useful for all.

We want to hear from you!

We will update the FCS tool based on your feedback, so try it out and let us know what you think. Please contact us with comments and suggestions.

FCS is part of the NIH Comparative Genomics Resource (CGR), an NLM project to establish an ecosystem to facilitate reliable comparative genomics analyses for all eukaryotic organisms.

Join our mailing list to keep up to date with FCS and other CGR news.

Introducing NLM’s new NCBI Datasets genome page!

As part of an ongoing effort to modernize and improve your experience, NLM’s NCBI Datasets is introducing all-new genome pages. These pages make it easier for you to browse and download genome sequence and metadata, and navigate to tools such as the Genome Data Viewer (GDV) and BLAST.

To get started, search NCBI Datasets by assembly accession (e.g., GCF_016699485.2), assembly name (e.g., bGalGal1.mat.broiler.GRCg7b), WGS accession (e.g., JAENSK01), or species name + genome (e.g., chicken genome), and click on the title in the box. See the top red arrow in Figure 1 below where we search for ‘chicken genome’.

Figure 1: Finding the chicken reference assembly. A search for ‘chicken genome’ returns a box that provides a quick link to the new genome page (middle red arrow). From there, the download button (bottom red arrow) allows you to select the files you need (see ‘Download Package’ window on the left) along with a detailed metadata report that includes all the metadata on the web page.

Come see NCBI at the ASM Microbe Conference 2022

The American Society for Microbiology (ASM) Microbe conference is back and is scheduled to take place in person, June 9-13, in Washington, D.C.

NCBI staff member Dr. Michael Feldgarden will be recognized by ASM with an award for his research. Other NCBI staff will present posters on NCBI resources and will also be available at our booth (#1128) to address your questions. Drop by to see what’s new and provide your feedback. We hope to see you there! Check out NCBI’s schedule of activities:

Average Nucleotide Identity (ANI) for assembly validation

Validating genome assemblies submitted to GenBank using an ANI-based workflow

Average Nucleotide Identity (ANI) analysis is a useful tool to verify taxonomic identities in prokaryotic genomes. As part of the NCBI bacterial genome submission process, GenBank performs ANI analyses to compare submitted prokaryotic genome assemblies against reference data generated from type strains. You can learn more about the relevant workflow and about type strain curation in our publications (PMC6978984 and PMC4383940).

We use genomes obtained from type strains (type assemblies) in computational comparisons, for example using ANI to reclassify or modify existing taxonomy with reasonable confidence. The taxonomy check status for all 1.3 million bacterial genome assemblies is summarized in the ANI_report_prokaryotes.txt file available from the ASSEMBLY_REPORTS FTP directory. The README file describes the contents of the report in detail. You can run ANI on your genome on its own or in the context of annotation. Find more information here.
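A report like this lends itself to a quick tally of how many assemblies fall into each status. The Python sketch below assumes a tab-separated file whose first line is a header prefixed with "# ", and a status column named taxonomy-check-status; the column names, accessions, and status values in the sample are assumptions for illustration, so check the README for the actual layout.

```python
import csv
import io
from collections import Counter


def count_status(handle, column="taxonomy-check-status"):
    """Count the values in one column of a tab-separated report.

    Assumption: the first non-blank line is the header, optionally
    prefixed with '#'; the default column name is a guess, not confirmed.
    """
    lines = [ln for ln in handle if ln.strip()]
    if not lines:
        return Counter()
    # Drop the leading '#' and spaces so DictReader sees clean column names.
    lines[0] = lines[0].lstrip("# ")
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row[column] for row in reader)


# Made-up sample rows; accessions and statuses are placeholders.
SAMPLE_REPORT = (
    "# genbank-accession\torganism-name\ttaxonomy-check-status\n"
    "GCA_000000001.1\tExample species A\tOK\n"
    "GCA_000000002.1\tExample species B\tFailed\n"
    "GCA_000000003.1\tExample species C\tOK\n"
)

print(count_status(io.StringIO(SAMPLE_REPORT)))
```

For the real file, you would pass an open file handle instead of the in-memory sample.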

RefSeq release 208 is available!

RefSeq release 208 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of September 7, 2021, and contains 288,903,207 records, including 210,703,648 proteins, 40,213,945 RNAs, and sequences from 113,002 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

Sept 22 Webinar: Using NCBI Datasets command-line tools to access data and metadata for genomes

Join us on September 22, 2021 at 12 PM Eastern Time to learn how to use the datasets command-line tools (datasets and dataformat) to access, filter, download, and format data and metadata for genomes. Through examples from eukaryotes and the SARS-CoV-2 coronavirus, you will see how to use metadata to filter for genome sequences with desired properties, such as genomes with high contig N50 values.

  • Date and time: Wed, September 22, 2021 12:00 PM – 12:45 PM EDT
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI webinars playlist on the NLM YouTube channel. You can learn about future webinars on the Webinars and Courses page.

Announcing the RefSeq annotation of sheep ARS-UI_Ramb_v2.0!

The new reference assembly for sheep is now annotated! Assembly ARS-UI_Ramb_v2.0 is made of 142 scaffolds, a drop from 2,640 in the 2017 assembly Oar_rambouillet_v1.0. With a contig N50 of 43 Mb, ARS-UI_Ramb_v2.0 is 15 times more contiguous than the first assembly of the Rambouillet breed.
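For readers less familiar with the metric, contig N50 is conventionally defined as the length of the shortest contig among the longest contigs that together cover at least half of the total assembly length. The Python sketch below implements that standard definition; the function name and the example lengths are illustrative and are not taken from either sheep assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Walk the lengths from longest to shortest and return the length at
    which the running total first reaches half of the overall total.
    """
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Illustrative lengths only: total is 20, and 6 + 5 = 11 covers half.
print(n50([2, 3, 4, 5, 6]))  # prints 5
```

A higher N50 means that more of the assembly sits in long contigs, which is why it is used as a measure of contiguity.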

Annotation Release 104 (AR 104) of ARS-UI_Ramb_v2.0 reflects these improvements. Nearly 200 more coding genes have a 1:1 ortholog in the human genome than in the annotation of Oar_rambouillet_v1.0 (AR 103). The number of coding models annotated as partial is down 35%, from 165 to 107, and the number of coding models labeled low quality due to suspected indels or base substitutions in the underlying genomic sequence decreased by 51% (1,646 to 796). Based on BUSCO analysis, 99.1% of the models (cetartiodactyla_odb10) are complete in AR 104 versus 98.8% in AR 103. Details of this annotation, including statistics on the annotation products, the input data used in the pipeline and intermediate alignment results, can be found here.

Aug 18 Webinar: Finding Data for your Research Organism: Plants and RNA-Seq data

Join us on August 18, 2021 at 12 PM Eastern Time for the second webinar on finding data for your non-model research organism. In this webinar, you will learn how to use NCBI’s web resources to get data for a plant species, the black cottonwood. You will see how to find, access, and analyze gene and sequence data from Datasets and other NCBI web resources, as well as sample metadata and gene expression RNA-Seq data from SRA and the SRA Run Selector. You will also see an example of a typical workflow, set up in a Jupyter notebook, that uses the NCBI next-gen aligner Magic-BLAST to get relative gene expression levels across samples.

  • Date and time: Wed, August 18, 2021 12:00 PM – 12:45 PM EDT
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI webinars playlist on the NLM YouTube channel. You can learn about future webinars on the Webinars and Courses page.