Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements
- PMID: 11452024
- PMCID: PMC55814
- DOI: 10.1093/nar/29.14.2994
Abstract
PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 ± 0.005 to 0.895 ± 0.003. This does not include the benefits from four modifications we included in the 'baseline' version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
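The abstract does not spell out how the normalized ROC score is computed. As a rough illustration only, the sketch below assumes the commonly used truncated form, ROC_n: walk down the ranked hit list until the n-th false positive, and for each false positive count the true positives ranked above it; dividing the sum by n times the total number of true positives gives a value between 0 and 1. The function name `roc_n`, its parameters, and the padding rule for lists with fewer than n false positives are hypothetical choices made for this example, not details taken from the paper.

```python
from typing import Sequence


def roc_n(ranked_labels: Sequence[bool], total_true: int, n: int) -> float:
    """Truncated, normalized ROC score (assumed ROC_n definition).

    ranked_labels: hits ordered best-first; True marks a true positive.
    total_true:    number of true positives that exist for the query.
    n:             number of false positives at which to truncate.
    """
    if total_true <= 0 or n <= 0:
        raise ValueError("total_true and n must be positive")

    true_seen = 0   # true positives encountered so far
    false_seen = 0  # false positives encountered so far
    area = 0        # sum over false positives of true positives ranked above

    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break

    # Assumption: if the list holds fewer than n false positives, the
    # missing ones are treated as ranking below every retrieved hit.
    area += (n - false_seen) * true_seen

    return area / (n * total_true)


if __name__ == "__main__":
    # Two true positives, a false positive, one more true positive, a false positive.
    print(roc_n([True, True, False, True, False], total_true=3, n=2))
```

A perfect ranking (all true positives ahead of every false positive) scores 1, and a ranking with n false positives ahead of any true positive scores 0, matching the 0 (worst) to 1 (best) range described above.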
Figures
![Figure 1](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/55814/bin/gke43501.gif)
![Figure 2](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/55814/bin/gke43502.gif)
![Figure 3](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/55814/bin/gke43503.gif)
![Figure 4](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/55814/bin/gke43504.gif)