Better prediction of protein contact number using a support vector regression analysis of amino acid sequence

BMC Bioinformatics. 2005 Oct 13;6:248. doi: 10.1186/1471-2105-6-248.

Zheng Yuan¹

PMID: 16221309
PMCID: PMC1277819

Zheng Yuan. BMC Bioinformatics. 2005.

Affiliation

¹ Institute for Molecular Bioscience, ARC Centre in Bioinformatics, The University of Queensland, St. Lucia, 4072, Australia. z.yuan@imb.uq.edu.au

Abstract

Background: Protein tertiary structure can be partly characterized by the contact number of each amino acid, which measures how residues are arranged in space. The contact number of a residue in a folded protein is a measure of its exposure to the local environment, and is defined as the number of Cβ atoms of other residues that lie within a sphere centred on the Cβ atom of the residue of interest. Contact number is partly conserved between protein folds and is therefore useful for protein fold and structure prediction. In turn, each residue's contact number can be partially predicted from the primary amino acid sequence, which assists tertiary fold analysis from sequence data. In this study, we provide a more accurate method for predicting contact number from protein primary sequence.
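
The definition above can be illustrated with a minimal Python sketch. It is not the author's code: the function name `contact_numbers`, the inclusive distance test, and the toy coordinates are assumptions made for illustration, and the 12 Å default is simply one of the radius cutoffs listed in the figure captions. How residues without a Cβ atom (such as glycine) are handled is not stated in the abstract and is left to the caller.

```python
import math

def contact_numbers(cb_coords, radius=12.0):
    """For each residue, count the C-beta atoms of OTHER residues
    that lie within `radius` angstroms of its own C-beta atom."""
    counts = []
    for i, a in enumerate(cb_coords):
        n = 0
        for j, b in enumerate(cb_coords):
            # Skip the residue itself; the boundary is treated as inside (assumption).
            if i != j and math.dist(a, b) <= radius:
                n += 1
        counts.append(n)
    return counts

# Hypothetical toy coordinates in angstroms, not taken from any PDB entry.
toy = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (11.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
print(contact_numbers(toy, radius=12.0))  # [2, 2, 2, 0]
print(contact_numbers(toy, radius=8.0))   # [1, 2, 1, 0]
```

The quadratic loop is adequate for a single chain; a spatial index would be the usual choice for large structures.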

Results: We predict contact number from protein sequence using a novel support vector regression algorithm. Using local protein sequences with multiple sequence alignments (PSI-BLAST profiles), we obtain a correlation coefficient of 0.70 between predicted and observed contact numbers, which exceeds previously reported accuracies. Including additional information about sequence weight and amino acid composition improves prediction accuracy significantly, with the correlation coefficient reaching 0.73. If residues are classified as either "contacted" or "non-contacted", the prediction accuracies are all greater than 77%, regardless of the classification threshold chosen.
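
One common way to turn a per-residue profile into "local sequence" inputs for a regressor is a sliding window, sketched below in Python. This is an assumption-laden illustration, not the method as published: the abstract does not give the window size, the padding scheme, or the exact encoding, so `window_features`, `half_width`, and the zero padding at the chain termini are all hypothetical. The resulting vectors would then be passed to a support vector regression implementation, which is omitted here.

```python
def window_features(profile, half_width=2):
    """Build one feature vector per residue by concatenating the profile
    rows in a window of 2 * half_width + 1 positions centred on it.

    `profile` is a list of equal-length rows (for a PSI-BLAST profile,
    one row of substitution scores per residue).
    """
    if not profile:
        return []
    n_cols = len(profile[0])
    pad = [0.0] * n_cols  # zero padding beyond the termini (assumption)
    features = []
    for i in range(len(profile)):
        vec = []
        for k in range(i - half_width, i + half_width + 1):
            row = profile[k] if 0 <= k < len(profile) else pad
            vec.extend(row)
        features.append(vec)
    return features

# Hypothetical 3-residue profile with 2 columns instead of the usual 20.
toy_profile = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
feats = window_features(toy_profile, half_width=1)
print(feats[0])  # [0.0, 0.0, 1.0, 0.0, 0.0, 2.0]
```

Extra inputs such as sequence weight or amino acid composition could be appended to each vector, although the abstract does not specify how they were encoded.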

Conclusion: The successful application of support vector regression to the prediction of protein contact number reported here, together with previous applications of this approach to the prediction of protein accessible surface area and B-factor profile, suggests that a support vector regression approach may be very useful for determining the structure-function relation between primary protein sequence and higher order consecutive protein structural and functional properties.

Figures

**Figure 1**
**Contact number distributions according to different definitions**. The radius cutoffs are 8 Å, 10 Å, 12 Å and 14 Å, represented by dotted, dashed, solid and dash-dotted lines, respectively. Panel A shows the discrete contact number and panel B the consecutive contact number.

**Figure 2**
**The accessible surface area as a function of contact number**. Discrete contact numbers are used with a radius cutoff of 12 Å. Error bars represent the standard deviations.

**Figure 3**
**The predicted and observed contact numbers for proteins GP130 (PDB: 1bj8) and Human chorionic gonadotropin (PDB: 1dz7, chain A)**. Discrete contact numbers are used with a radius cutoff of 12 Å. Observed and predicted contact numbers are represented by solid and dashed lines, respectively. A) GP130 is predicted with a correlation coefficient of 0.75 and a root mean square error of 6.07; B) Human chorionic gonadotropin is predicted with a correlation coefficient of 0.58 and a root mean square error of 9.73.

**Figure 4**
**Distributions of correlation coefficients and root mean square errors given different input information**. The discrete definition is used with a radius cutoff of 12 Å. The four inputs "LS", "LS+W", "LS+AA" and "LS+W+AA" are represented by dotted, dashed, dash-dotted and solid lines, respectively. Panel A shows correlation coefficients and panel B root mean square errors.

**Figure 5**
**The mean absolute errors for residues of different contact numbers**. The four inputs "LS", "LS+W", "LS+AA" and "LS+W+AA" are represented by dotted, dashed, dash-dotted and solid lines, respectively.

**Figure 6**
**Prediction accuracies when predictions are formulated as two-class problems using different contact number thresholds**. The four inputs "LS", "LS+W", "LS+AA" and "LS+W+AA" are represented by dotted, dashed, dash-dotted and solid lines, respectively.
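
The quantities reported in these figures (correlation coefficient, root mean square error, and two-class accuracy at a contact number threshold) can be computed as in the following Python sketch. The function names and the toy numbers are hypothetical, and treating a residue as "contacted" when its contact number is at or above the threshold is an assumption; the captions do not state which side of the threshold is inclusive.

```python
import math

def pearson(pred, obs):
    """Pearson correlation coefficient between predicted and observed values."""
    n = len(obs)
    mp, mo = sum(pred) / n, sum(obs) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    return cov / (sp * so)

def rmse(pred, obs):
    """Root mean square error between predicted and observed values."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def two_class_accuracy(pred, obs, threshold):
    """Fraction of residues placed on the same side of `threshold`
    ("contacted" if value >= threshold, an assumption) in both lists."""
    hits = sum((p >= threshold) == (o >= threshold) for p, o in zip(pred, obs))
    return hits / len(obs)

# Hypothetical contact numbers for four residues, for illustration only.
observed = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

print(round(pearson(predicted, observed), 3))
print(round(rmse(predicted, observed), 3))
print(two_class_accuracy(predicted, observed, threshold=19))  # 0.75
```

Sweeping `threshold` over a range of contact numbers gives an accuracy curve of the kind described for Figure 6.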

Cited by

Prediction of protein-protein interaction sites in intrinsically disordered proteins.
Chen R, Li X, Yang Y, Song X, Wang C, Qiao D. Front Mol Biosci. 2022 Sep 30;9:985022. doi: 10.3389/fmolb.2022.985022. PMID: 36250006. Review.
Deep learning methods in protein structure prediction.
Torrisi M, Pollastri G, Le Q. Comput Struct Biotechnol J. 2020 Jan 22;18:1301-1310. doi: 10.1016/j.csbj.2019.12.011. PMID: 32612753. Review.
Predicting protein inter-residue contacts using composite likelihood maximization and deep learning.
Zhang H, Zhang Q, Ju F, Zhu J, Gao Y, Xie Z, Deng M, Sun S, Zheng WM, Bu D. BMC Bioinformatics. 2019 Oct 29;20(1):537. doi: 10.1186/s12859-019-3051-7. PMID: 31664895.
A sparse autoencoder-based deep neural network for protein solvent accessibility and contact number prediction.
Deng L, Fan C, Zeng Z. BMC Bioinformatics. 2017 Dec 28;18(Suppl 16):569. doi: 10.1186/s12859-017-1971-7. PMID: 29297299.
3DCONS-DB: A Database of Position-Specific Scoring Matrices in Protein Structures.
Sanchez-Garcia R, Sorzano COS, Carazo JM, Segura J. Molecules. 2017 Dec 15;22(12):2230. doi: 10.3390/molecules22122230. PMID: 29244774.
