Cleaner BLAST Databases for More Accurate Results

Removing contaminated sequences using NCBI quality assurance tools 

Do you use BLAST to identify a sequence or the evolutionary scope of a gene? Both tasks are harder when contaminated and misclassified sequences are in the BLAST databases and show up in your search results. To address this problem, we now use the NCBI quality assurance tools listed below to systematically remove these misleading sequences from the default nucleotide (nt) and protein (nr) BLAST databases.

This process has removed approximately 2.23% of sequences from nr and 0.01% from nt. Lists of nucleotide and protein sequences identified as contaminant or misclassified are available from our FTP site.  
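If you have older BLAST results saved in tabular format, you may want to check them against the removed-sequence lists. The sketch below shows one way that could look. It assumes each list has one accession per line (the actual file layout on the FTP site is not described here, so adjust the parsing as needed) and that your hits are in BLAST tabular output (`-outfmt 6`), where the second column is the subject sequence ID. The function names and the accessions in the usage note are hypothetical placeholders.

```python
def load_removed_accessions(lines):
    """Collect accessions from text lines, skipping blanks and '#' comments.

    Assumption: one accession per line, as the first whitespace-delimited token.
    """
    removed = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            removed.add(line.split()[0])
    return removed


def filter_hits(tabular_lines, removed):
    """Split BLAST tabular (outfmt 6) lines into kept and flagged hits.

    A hit is flagged when its subject ID (column 2) is in `removed`.
    Lines with fewer than two tab-separated fields are ignored.
    """
    kept, flagged = [], []
    for line in tabular_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        if fields[1] in removed:
            flagged.append(line)
        else:
            kept.append(line)
    return kept, flagged
```

To use it, pass open file handles (or any iterable of lines), for example `filter_hits(open("hits.tsv"), load_removed_accessions(open("removed.txt")))`, where both file names are placeholders. Matching is exact, so if your subject IDs carry prefixes or different version suffixes than the list, normalize them first.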

Stay up to date

BLAST is part of the NIH Comparative Genomics Resource (CGR). CGR facilitates reliable comparative genomics analyses for all eukaryotic organisms through an NCBI Toolkit and community collaboration.   

Follow us on social media @NCBI and join our mailing list to keep up to date with BLAST and other CGR news.

Questions?

We want to hear from you! Try it out and let us know what you think. We are making ongoing improvements based on your feedback. If you have questions or would like to provide feedback, please reach out to us at info@ncbi.nlm.nih.gov.   

2 thoughts on “Cleaner BLAST Databases for More Accurate Results”

  1. It would be helpful if the 16S rRNA/ITS databases had the genome derived complete sequences rather than PCR product derived partial sequences.
