BMC Bioinformatics. 2005 Oct 13;6:248.
doi: 10.1186/1471-2105-6-248.

Better prediction of protein contact number using a support vector regression analysis of amino acid sequence

Zheng Yuan. BMC Bioinformatics. 2005.

Abstract

Background: Protein tertiary structure can be partly characterized by the contact number of each amino acid, which measures how residues are spatially arranged. The contact number of a residue in a folded protein is a measure of its exposure to the local environment, and is defined as the number of Cβ atoms in other residues that lie within a sphere centered on the Cβ atom of the residue of interest. Contact number is partly conserved between protein folds and is therefore useful for protein fold and structure prediction. In turn, each residue's contact number can be partially predicted from the primary amino acid sequence, which assists tertiary fold analysis from sequence data. In this study, we provide a more accurate method for predicting contact number from protein primary sequence.
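The definition of contact number given above can be sketched in a few lines of Python. This is only an illustration, not the author's implementation: the function name `contact_numbers`, the toy coordinates, and the choice to count atoms at exactly the cutoff distance as contacts are assumptions, and the abstract does not say whether sequence neighbors are excluded, so none are excluded here.

```python
import math

def contact_numbers(cb_coords, radius=12.0):
    """For each residue, count the other residues whose C-beta atom lies
    within `radius` angstroms of this residue's C-beta atom."""
    counts = []
    for i, a in enumerate(cb_coords):
        n = 0
        for j, b in enumerate(cb_coords):
            # Skip the residue itself; "<=" at the boundary is an assumption.
            if i != j and math.dist(a, b) <= radius:
                n += 1
        counts.append(n)
    return counts

# Hypothetical toy input: four C-beta atoms on a line, 5 angstroms apart.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
print(contact_numbers(coords, radius=8.0))  # prints [1, 2, 2, 1]
```

With a larger cutoff each sphere captures more neighbors, which is why the distribution of contact numbers depends on the chosen radius.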

Results: We predict contact number from protein sequence using a novel support vector regression algorithm. Using protein local sequences with multiple sequence alignments (PSI-BLAST profiles), we obtain a correlation coefficient of 0.70 between predicted and observed contact numbers, which outperforms previously achieved accuracies. Including additional information about sequence weight and amino acid composition further improves prediction accuracy significantly, raising the correlation coefficient to 0.73. If residues are classified as either "contacted" or "non-contacted", the prediction accuracies are all greater than 77%, regardless of the choice of classification threshold.
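The evaluation measures mentioned here (correlation coefficient, root mean square error, and two-class accuracy at a contact number threshold) can be illustrated with a minimal sketch. The function names, the toy numbers, and the convention that a residue counts as "contacted" when its contact number is at or above the threshold are assumptions made for illustration, not details taken from the paper.

```python
import math

def pearson(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    mo = sum(observed) / n
    mp = sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    so = math.sqrt(sum((o - mo) ** 2 for o in observed))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    return cov / (so * sp)

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def two_class_accuracy(observed, predicted, threshold):
    """Fraction of residues placed in the same class ("contacted" if the
    value is >= threshold, an assumed convention) by both sequences."""
    hits = sum((o >= threshold) == (p >= threshold)
               for o, p in zip(observed, predicted))
    return hits / len(observed)

# Hypothetical toy data, not results from the paper.
observed = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
print(round(pearson(observed, predicted), 3))
print(round(rmse(observed, predicted), 3))
print(two_class_accuracy(observed, predicted, threshold=32))
```

Sweeping `threshold` over a range of contact numbers would give an accuracy curve of the kind described in the two-class analysis.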

Conclusion: The successful application of support vector regression to the prediction of protein contact number reported here, together with previous applications of this approach to the prediction of protein accessible surface area and B-factor profile, suggests that a support vector regression approach may be very useful for determining the structure-function relation between primary protein sequence and higher order consecutive protein structural and functional properties.

Figures

Figure 1
Contact number distributions according to different definitions. The radius cutoffs are 8 Å, 10 Å, 12 Å and 14 Å, represented by dotted, dashed, solid and dot-dashed lines, respectively. Panel A shows the discrete contact number and panel B the consecutive contact number.
Figure 2
The accessible surface area as a function of contact number. Discrete contact numbers are used with a radius cutoff of 12 Å. Error bars represent the standard deviations.
Figure 3
The predicted and observed contact numbers for proteins GP130 (PDB: 1bj8) and Human chorionic gonadotropin (PDB: 1dz7, chain A). Discrete contact numbers are used with a radius cutoff of 12 Å. Observed and predicted contact numbers are represented by solid and dashed lines, respectively. A) GP130 is predicted with a correlation coefficient of 0.75 and a root mean square error of 6.07; B) Human chorionic gonadotropin is predicted with a correlation coefficient of 0.58 and a root mean square error of 9.73.
Figure 4
Distributions of correlation coefficients and root mean square errors given different input information. The discrete definition is used with a radius cutoff of 12 Å. The four inputs "LS", "LS+W", "LS+AA" and "LS+W+AA" are represented by dotted, dashed, dot-dashed and solid lines, respectively. Panel A shows correlation coefficients and panel B shows root mean square errors.
Figure 5
The mean absolute errors for residues of different contact numbers. The four inputs "LS", "LS+W", "LS+AA" and "LS+W+AA" are represented by dotted, dashed, dot-dashed and solid lines, respectively.
Figure 6
Prediction accuracies when predictions are formulated as two-class problems using different contact number thresholds. The four inputs "LS", "LS+W", "LS+AA" and "LS+W+AA" are represented by dotted, dashed, dot-dashed and solid lines, respectively.
