GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions

doi:10.1093/nar/29.12.2607

Nucleic Acids Res. 2001 Jun 15;29(12):2607-18.

J Besemer¹, A Lomsadze, M Borodovsky

PMID: 11410670
PMCID: PMC55746
DOI: 10.1093/nar/29.12.2607

J Besemer et al. Nucleic Acids Res. 2001.

Affiliation

¹ School of Biology and School of Mathematics, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA.

Abstract

Improving the accuracy of prediction of gene starts is one of a few remaining open problems in computer prediction of prokaryotic genes. Its difficulty is caused by the absence of relatively strong sequence patterns identifying true translation initiation sites. In the current paper we show that the accuracy of gene start prediction can be improved by combining models of protein-coding and non-coding regions and models of regulatory sites near gene start within an iterative Hidden Markov model based algorithm. The new gene prediction method, called GeneMarkS, utilizes a non-supervised training procedure and can be used for a newly sequenced prokaryotic genome with no prior knowledge of any protein or rRNA genes. The GeneMarkS implementation uses an improved version of the gene finding program GeneMark.hmm, heuristic Markov models of coding and non-coding regions and the Gibbs sampling multiple alignment program. GeneMarkS predicted precisely 83.2% of the translation starts of GenBank annotated Bacillus subtilis genes and 94.4% of translation starts in an experimentally validated set of Escherichia coli genes. We have also observed that GeneMarkS detects prokaryotic genes, in terms of identifying open reading frames containing real genes, with an accuracy matching the level of the best currently used gene detection methods. Accurate translation start prediction, in addition to the refinement of protein sequence N-terminal data, provides the benefit of precise positioning of the sequence region situated upstream to a gene start. Therefore, sequence motifs related to transcription and translation regulatory sites can be revealed and analyzed with higher precision. These motifs were shown to possess a significant variability, the functional and evolutionary connections of which are discussed.

Figures

**Figure 1**
Step-by-step diagram of the GeneMarkS procedure.

**Figure 2**
(A) In the process of GeneMarkS training there is no division of the coding sequence into two clusters. However, in applying the GeneMark.hmm 2.0 program, the model of coding region derived by GeneMarkS can be used as the Typical model along with a heuristic model used as the Atypical model (see Table 3). For simplicity, only the direct strand is shown. (B) In this simplified diagram of hidden state transitions in GeneMark.hmm 2.0, the state ‘gene’ represents a sequence composed of an RBS plus a spacer plus the protein-coding sequence (CDS). Gene overlaps encompass all possible types of superpositions: overlap of genes on the same strand (as observed in operons), overlap of genes on opposite strands, overlap of coding region with RBS, and so on.

**Figure 3**
(A) Sequence logo representing the RBS positional frequency pattern detected by GeneMarkS in the analysis of *B.subtilis* genomic data. The total height of the four letters in each position indicates the position specific information content, while the height of each letter is proportional to the nucleotide frequency (42). (B) Graph of probability distribution of spacer length, the sequence between the RBS sequence and the gene start.
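The logo construction described in this caption can be sketched in a few lines. This is a minimal illustration, assuming the standard definition of information content for a four-letter alphabet (2 bits minus the Shannon entropy of the column) and no small-sample correction; the function names and the example frequencies are hypothetical and are not taken from the paper.

```python
import math


def information_content(column, alphabet_bits=2.0):
    """Information content (bits) of one motif position.

    `column` maps each nucleotide to its frequency at that position.
    Assumes no small-sample correction is applied.
    """
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return alphabet_bits - entropy


def letter_heights(column):
    """Height of each letter: its frequency times the column's information content."""
    ic = information_content(column)
    return {nt: p * ic for nt, p in column.items()}


# Hypothetical column in which A and G are equally frequent.
example = {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0}
print(information_content(example))
print(letter_heights(example))
```

A fully conserved position reaches the maximum of 2 bits, while a position with uniform frequencies has zero height and so disappears from the logo.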

**Figure 4**
Venn diagram showing group relationships between the GenBank annotation and sets of genes detected by GeneMark.hmm 2.0 and Glimmer 2.02 for the *B.subtilis* genome (A) and the *E.coli* genome (B).

**Figure 5**
Distributions of log-odds scores of RBS sites, as detected by GeneMarkS, in sets of overlapping and non-overlapping genes of (A) *B.subtilis*, (B) *E.coli* and (C) *M.jannaschii*. As can be seen, the overlapping genes, which are likely to be located inside operons, frequently have strong RBS sites. Still, most strong sites of ribosome binding precede the non-overlapping genes (stand alone genes and genes leading operons). This tendency is much more apparent in the case of the archaeal genome of *M.jannaschii* than in the *E.coli* and *B.subtilis* genomes.
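For readers unfamiliar with log-odds scoring of a candidate site, the following is a minimal sketch, not the GeneMarkS implementation. It assumes a position-specific frequency matrix with no zero entries (for example after pseudocounts), a zero-order background model, and base-2 logarithms; the matrix values, the background and the function names are all hypothetical.

```python
import math


def log_odds_score(site, pwm, background):
    """Sum over positions of log2(P(nt | motif position) / P(nt | background))."""
    if len(site) != len(pwm):
        raise ValueError("site length must match motif width")
    return sum(math.log2(pwm[i][nt] / background[nt]) for i, nt in enumerate(site))


def best_site(upstream, pwm, background):
    """Return (score, offset) of the highest-scoring window in an upstream sequence.

    Ties are resolved toward the larger offset, because tuples are compared
    element by element.
    """
    width = len(pwm)
    if len(upstream) < width:
        raise ValueError("upstream sequence is shorter than the motif")
    scored = [
        (log_odds_score(upstream[i:i + width], pwm, background), i)
        for i in range(len(upstream) - width + 1)
    ]
    return max(scored)


# Hypothetical two-position motif that favours "GG"; uniform background.
TOY_PWM = [
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

score, offset = best_site("ATGGCA", TOY_PWM, UNIFORM)
print(score, offset)
```

A positive score means the window is more probable under the motif model than under the background, which is the sense in which a site is called strong.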

**Figure 6**
Sequence logo representing the upstream sequence motif detected by GeneMarkS for *A.fulgidus*. This consensus sequence is more indicative of a eukaryotic-like promoter element than of an RBS signal of the kind often found in prokaryotes. Sites that match this pattern are ubiquitous in *A.fulgidus*, although further analysis of a subset of upstream sequences reveals a second motif (see Fig. 7) complementary to the 3′ terminal section of the *A.fulgidus* 16S rRNA.

**Figure 7**
Sequence logo representing the RBS motif observed in a subset of upstream sequences of the *A.fulgidus* genome. This subset consisted of 50 nt long upstream sequences overlapping the 3′ end of the preceding gene. The consensus of this motif is complementary to a section of the *A.fulgidus* 16S rRNA.

**Figure 8**
Distributions of spacer length for two species with strong RBS patterns, *B.subtilis* and *E.coli* (solid and dashed lines, respectively), and one species with a strong eukaryotic promoter-like pattern, *A.fulgidus* (dotted line). The promoter-like pattern of *A.fulgidus* is localized much further upstream of the start codon than the RBS patterns of *B.subtilis* and *E.coli*.

**Figure 9**
(A) Distribution of spacer lengths observed in the *B.subtilis* genome for two different types of possible RBS hexamers: AGGAGG and AGGTGA. Multiple alignment allows these hexamers to be superimposed. In actual upstream sequences, these hexamers tend to occupy different locations relative to the start codon. This preference may be involved in the precise positioning of the ribosome at the translation initiation site when the 16S rRNA binds to mRNA. The more frequent hexamer was observed on average at a greater distance from the gene start than the rare hexamer. (B) Distribution of spacer lengths observed in the *M.thermoautotrophicum* genome for two different types of RBS hexamers: GGAGGT and GGTGAT. Properties of these hexamers are similar to those of the two hexamers observed in the *B.subtilis* genome (A), except that the more frequent hexamer is now found on average at a closer distance to the gene start than the rare hexamer.
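A spacer length distribution of this kind can be tabulated with a short script. The sketch below assumes each upstream sequence ends immediately before the start codon, searches for an exact match to the hexamer, and uses the occurrence closest to the gene start; these choices, the function names and the example sequences are illustrative assumptions rather than the procedure used in the paper.

```python
from collections import Counter


def spacer_length(upstream, hexamer):
    """Number of nucleotides between the end of the hexamer and the start codon.

    `upstream` is assumed to end immediately before the start codon.
    Returns None when the hexamer does not occur.
    """
    pos = upstream.rfind(hexamer)  # occurrence closest to the gene start
    if pos == -1:
        return None
    return len(upstream) - (pos + len(hexamer))


def spacer_histogram(upstreams, hexamer):
    """Count how often each spacer length is observed for a given hexamer."""
    lengths = (spacer_length(seq, hexamer) for seq in upstreams)
    return Counter(length for length in lengths if length is not None)


# Hypothetical upstream regions, not real genomic data.
toy_upstreams = ["TTAGGAGGAAATTTT", "AGGAGGCCCCCCC", "ACGTACGT"]
print(spacer_histogram(toy_upstreams, "AGGAGG"))
```

Running the same function once per hexamer (for example AGGAGG and AGGTGA) gives two histograms whose means can be compared, as the caption does.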

Cited by

Tracking the footsteps of Burkholderia mallei: determination of the molecular differences and potential resistance genes.
Dülger D, Ekici S, Demirci M, Yiğin A, Babacan O. Turk J Med Sci. 2023 Dec 21;54(1):16-25. doi: 10.55730/1300-0144.5761. eCollection 2024. PMID: 38812620. Free PMC article.
Characterization of Pseudomonas aeruginosa bacteriophages and control hemorrhagic pneumonia on a mice model.
Zhang Y, Wang R, Hu Q, Lv N, Zhang L, Yang Z, Zhou Y, Wang X. Front Microbiol. 2024 May 14;15:1396774. doi: 10.3389/fmicb.2024.1396774. eCollection 2024. PMID: 38808279. Free PMC article.
Phylogenomic Characterization of Ranavirus Isolated from Wild Smallmouth Bass (Micropterus dolomieu).
Quail H, Viadanna PHO, Vann JA, Hsu HM, Pohly A, Smith W, Hansen S, Nietlisbach N, Godard D, Waltzek TB, Subramaniam K. Viruses. 2024 Apr 30;16(5):715. doi: 10.3390/v16050715. PMID: 38793597. Free PMC article.
First Report of Endemic Frog Virus 3 (FV3)-like Ranaviruses in the Korean Clawed Salamander (Onychodactylus koreanus) in Asia.
Kim J, Sung HW, Jung TS, Park J, Park D. Viruses. 2024 Apr 25;16(5):675. doi: 10.3390/v16050675. PMID: 38793557. Free PMC article.
Introduction of Cellulolytic Bacterium Bacillus velezensis Z2.6 and Its Cellulase Production Optimization.
Cai Z, Wang Y, You Y, Yang N, Lu S, Xue J, Xing X, Sha S, Zhao L. Microorganisms. 2024 May 13;12(5):979. doi: 10.3390/microorganisms12050979. PMID: 38792808. Free PMC article.

References

1. Fickett,J.W. (1982) Recognition of protein coding regions in DNA sequences. Nucleic Acids Res., 10, 5303–5318.
2. Gribskov,M., Devereux,J. and Burgess,R.R. (1984) The codon preference plot: graphic analysis of protein coding sequences and prediction of gene expression. Nucleic Acids Res., 12, 539–549.
3. Staden,R. (1984) Measurements of the effects that coding for a protein has on a DNA sequence and their use for finding genes. Nucleic Acids Res., 12, 551–567.
4. Borodovsky,M.Y., Sprizhitskii,Y.A., Golovanov,E.I. and Aleksandrov,A.A. (1986) Statistical patterns in primary structures of functional regions in the E. coli genome: III. Computer recognition of coding regions. Mol. Biol., 20, 1145–1150.
5. Borodovsky,M.Y. and McIninch,J.D. (1993) GeneMark: parallel gene recognition for both DNA strands. Comput. Chem., 17, 123–133.
