Review
Nucleic Acids Res. 2001 Jul 15;29(14):2994-3005.
doi: 10.1093/nar/29.14.2994.

Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements

A A Schäffer et al. Nucleic Acids Res. 2001.

Abstract

PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 ± 0.005 to 0.895 ± 0.003. This does not include the benefits from four modifications we included in the 'baseline' version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
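
The normalized ROC measure mentioned above can be illustrated with a short sketch. It assumes the commonly used ROC_n definition: for each of the first n false positives in a ranked hit list, count the true positives ranked ahead of it, sum these counts, and divide by n times the total number of true positives. The function name `roc_n`, its parameters, and the padding rule for short hit lists are illustrative assumptions. The abstract does not specify how results are pooled across the 103 queries, so this is not a reproduction of the authors' evaluation.

```python
from typing import Sequence


def roc_n(ranked_labels: Sequence[bool], total_true: int, n: int = 100) -> float:
    """Normalized ROC_n score for one ranked hit list (best hit first).

    ranked_labels: True for a true positive, False for a false positive.
    total_true:    number of true positives that exist for the query,
                   whether or not they were retrieved.
    n:             number of false positives at which the curve is cut off.
    """
    if total_true <= 0 or n <= 0:
        raise ValueError("total_true and n must be positive")

    true_seen = 0   # true positives ranked so far
    false_seen = 0  # false positives ranked so far
    area = 0        # sum of true-positive counts at each false positive

    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break

    # Assumption: if the list ends before n false positives are seen, the
    # missing false positives are treated as ranking after every retrieved hit.
    area += (n - false_seen) * true_seen

    return area / (n * total_true)


if __name__ == "__main__":
    hits = [True, True, False, True, False]
    print(roc_n(hits, total_true=3, n=2))
```

A score of 1 means every true positive outranks the first false positive; a score of 0 means no true positive is found before the nth false positive.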


Figures

Figure 1
Sensitivity curves of the baseline program at three settings of the threshold parameter h for including matching sequences in the PSI-BLAST multiple alignment. The versions compared are FWh10⁻³, FWh10⁻⁶ and FWh10⁻⁹. The sensitivity curve for FWh10⁻³ crosses the others because at this setting for h, three queries yield substantially corrupted results, while many other queries show improved search accuracy. The ROC100 scores for FWh10⁻⁹, FWh10⁻⁶ and FWh10⁻³ are 0.713 ± 0.005, 0.758 ± 0.005 and 0.721 ± 0.020, respectively.
Figure 2
Sensitivity curves comparing the effects of adding filtering of the database and composition-based statistics. The versions compared are FWh10⁻⁶, WSh0.0002 and FWSh0.002.
Figure 3
Sensitivity curves showing the benefit of the ‘dispersed’ method for columns with gaps in the multiple alignment. The versions compared are FWSh0.002 and FWSDh0.005.
Figure 4
Sensitivity curves showing the benefits of restricted score rescaling and of tuning the pseudocount parameter and the purging percentage. The versions compared are FWSDh0.005 and FWSDMb9p94h0.005.
