Comparative Study
Nucleic Acids Res. 2003 Mar 15;31(6):1780-9.
doi: 10.1093/nar/gkg254.

ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes

Feng-Biao Guo et al. Nucleic Acids Res.

Abstract

A new system, ZCURVE 1.0, for finding protein-coding genes in bacterial and archaeal genomes is proposed. The algorithm, which is based on the Z curve representation of DNA sequences, emphasizes the global statistical features of protein-coding genes by taking into account the frequencies of bases at the three codon positions. Because ZCURVE 1.0 uses only 33 parameters to characterize coding sequences, it gives better consideration to both typical and atypical cases, whereas Markov-model-based methods, e.g. Glimmer 2.02, train thousands of parameters, which may result in less adaptability. To compare the performance of the new system with that of Glimmer 2.02, both systems were run on 18 genomes not annotated by the Glimmer system. The two systems were also compared on the prediction of some genes of known function. The average accuracy of the two systems is well matched; however, ZCURVE 1.0 has more accurate gene start prediction, a lower additional prediction rate and higher accuracy in predicting horizontally transferred genes. Joint application of both systems is shown to greatly improve gene-finding results. For a typical genome, e.g. Escherichia coli, ZCURVE 1.0 takes approximately 2 min on a Pentium III 866 PC without any human intervention. ZCURVE 1.0 is freely available at: http://tubic.tju.edu.cn/Zcurve_B/.
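
The abstract says the method rests on base frequencies at the three codon positions but does not list the 33 parameters. As a minimal sketch, assuming the standard Z curve transform (x = purine minus pyrimidine, y = amino minus keto, z = weak minus strong hydrogen bonding) applied separately to each codon position, the Python function below yields nine position-specific values. The function name `zcurve_codon_features` is hypothetical, and the sketch is not the ZCURVE 1.0 implementation; it covers only this nine-value subset and makes no claim about the remaining parameters.

```python
from collections import Counter


def zcurve_codon_features(seq):
    """Return nine values: (x, y, z) for each of the three codon positions.

    Assumption: the standard Z curve components are computed from the
    base frequencies observed at each codon position of an in-frame ORF.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    features = []
    for pos in range(3):
        bases = seq[pos:usable:3]      # every base at this codon position
        counts = Counter(bases)
        n = len(bases) or 1            # avoid division by zero on empty input
        a, c, g, t = (counts[b] / n for b in "ACGT")
        features.extend([
            (a + g) - (c + t),  # x: purines minus pyrimidines
            (a + c) - (g + t),  # y: amino minus keto bases
            (a + t) - (g + c),  # z: weak minus strong hydrogen bonding
        ])
    return features


if __name__ == "__main__":
    print(zcurve_codon_features("ATGGCC"))
```

Each value lies between -1 and 1, so an ORF of any length maps to a vector of fixed size, which is what makes a small parameter count possible.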

Figures

Figure 1
Distribution of points of GC3 versus GC1 for 405 experimentally verified genes of known function (23), and for the 1206 and 3144 genes additionally predicted by ZCURVE 1.0 and Glimmer 2.02, respectively, in the genome of P. aeruginosa. Here GC3 and GC1 denote the GC content at the third and first codon positions, respectively. Note that almost all points for the experimentally verified genes lie in the region GC3 > GC1, whereas the points for the 1206 genes additionally predicted by ZCURVE 1.0 lie mainly in the region GC3 > GC1 and those for the 3144 genes additionally predicted by Glimmer 2.02 lie mainly in the region GC3 < GC1. This indicates that most of the 3144 genes additionally predicted by Glimmer 2.02 are very unlikely to code for proteins, implying that Glimmer 2.02 has a high false positive prediction rate for this genome.
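
The GC1 and GC3 quantities plotted in Figure 1 can be computed directly from an in-frame coding sequence. The Python sketch below is illustrative only: the function name `gc_at_codon_position` and the example sequence are hypothetical, and the treatment of trailing partial codons (they are ignored) is an assumption rather than something stated in the caption.

```python
def gc_at_codon_position(seq, position):
    """Return the fraction of G or C bases at codon position 1, 2 or 3.

    Assumption: `seq` starts in frame; a trailing partial codon is ignored.
    """
    if position not in (1, 2, 3):
        raise ValueError("position must be 1, 2 or 3")
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    bases = seq[position - 1:usable:3]  # bases at the chosen codon position
    if not bases:
        raise ValueError("sequence contains no complete codon")
    return sum(base in "GC" for base in bases) / len(bases)


if __name__ == "__main__":
    orf = "ATGGCCAAG"  # hypothetical example: codons ATG, GCC, AAG
    gc1 = gc_at_codon_position(orf, 1)
    gc3 = gc_at_codon_position(orf, 3)
    # A candidate gene falls in the GC3 > GC1 region when this is True.
    print(gc1, gc3, gc3 > gc1)
```

Applying the comparison to every predicted ORF gives a quick check of which side of the GC3 = GC1 line each prediction falls on.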
Figure 2
Relation between the overlapping ratio of long ORFs, defined in equation 7, and the G+C content. The mean overlapping ratio averaged over the 18 bacterial or archaeal genomes studied here is 52.69, whereas the mean over the 14 genomes with relatively lower G+C content is only 1.77. Fitting the points with an exponential curve shows a turning point at about G+C = 56%, beyond which the value of p increases markedly.

References

    1. Borodovsky,M. and McIninch,J. (1993) GenMark: parallel gene recognition for both DNA strands. Comput. Chem., 17, 123–133.
    2. Besemer,J., Lomsadze,A. and Borodovsky,M. (2001) GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res., 29, 2607–2618.
    3. Salzberg,S.L., Delcher,A.L., Kasif,S. and White,O. (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Res., 26, 544–548.
    4. Delcher,A.L., Harmon,D., Kasif,S., White,O. and Salzberg,S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res., 27, 4636–4641.
    5. Frishman,D., Mironov,A., Mewes,H.W. and Gelfand,M. (1998) Combining diverse evidence for gene recognition in completely sequenced bacterial genomes [published erratum appears in Nucleic Acids Res., 26, 3870]. Nucleic Acids Res., 26, 2941–2947.
