Nucleic Acids Res. 2001 Jun 15;29(12):2607-18.
doi: 10.1093/nar/29.12.2607.

GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions

J Besemer et al. Nucleic Acids Res.

Abstract

Improving the accuracy of prediction of gene starts is one of a few remaining open problems in computer prediction of prokaryotic genes. Its difficulty is caused by the absence of relatively strong sequence patterns identifying true translation initiation sites. In the current paper we show that the accuracy of gene start prediction can be improved by combining models of protein-coding and non-coding regions and models of regulatory sites near gene start within an iterative Hidden Markov model based algorithm. The new gene prediction method, called GeneMarkS, utilizes a non-supervised training procedure and can be used for a newly sequenced prokaryotic genome with no prior knowledge of any protein or rRNA genes. The GeneMarkS implementation uses an improved version of the gene finding program GeneMark.hmm, heuristic Markov models of coding and non-coding regions and the Gibbs sampling multiple alignment program. GeneMarkS predicted precisely 83.2% of the translation starts of GenBank annotated Bacillus subtilis genes and 94.4% of translation starts in an experimentally validated set of Escherichia coli genes. We have also observed that GeneMarkS detects prokaryotic genes, in terms of identifying open reading frames containing real genes, with an accuracy matching the level of the best currently used gene detection methods. Accurate translation start prediction, in addition to the refinement of protein sequence N-terminal data, provides the benefit of precise positioning of the sequence region situated upstream to a gene start. Therefore, sequence motifs related to transcription and translation regulatory sites can be revealed and analyzed with higher precision. These motifs were shown to possess a significant variability, the functional and evolutionary connections of which are discussed.

Figures

Figure 1
Step-by-step diagram of the GeneMarkS procedure.
Figure 2
(A) In the process of GeneMarkS training there is no division of the coding sequence into two clusters. However, in applying the GeneMark.hmm 2.0 program, the model of coding region derived by GeneMarkS can be used as the Typical model along with a heuristic model used as the Atypical model (see Table 3). For simplicity, only the direct strand is shown. (B) In this simplified diagram of hidden state transitions in GeneMark.hmm 2.0, the state ‘gene’ represents a sequence composed of an RBS plus a spacer plus the protein-coding sequence (CDS). Gene overlaps encompass all possible types of superpositions: overlap of genes on the same strand (as observed in operons), overlap of genes on opposite strands, overlap of coding region with RBS, and so on.
Figure 3
(A) Sequence logo representing the RBS positional frequency pattern detected by GeneMarkS in the analysis of B.subtilis genomic data. The total height of the four letters in each position indicates the position specific information content, while the height of each letter is proportional to the nucleotide frequency (42). (B) Graph of probability distribution of spacer length, the sequence between the RBS sequence and the gene start.
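The logo construction described in this caption can be sketched in a few lines. This is a minimal illustration, not the code used in the paper: it assumes a uniform background (so the maximum information per position is 2 bits), omits any small-sample correction, and uses a toy set of aligned sites. The function names `column_frequencies`, `information_content` and `logo_heights` are hypothetical.

```python
import math
from collections import Counter

def column_frequencies(sites, pos):
    """Nucleotide frequencies at one position of a set of aligned sites."""
    counts = Counter(site[pos] for site in sites)
    total = sum(counts.values())
    return {nt: counts.get(nt, 0) / total for nt in "ACGT"}

def information_content(freqs):
    """Information content in bits, assuming a uniform background (max 2 bits)."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return 2.0 - entropy

def logo_heights(sites):
    """Per-position letter heights: frequency times the column's information content."""
    heights = []
    for pos in range(len(sites[0])):
        freqs = column_frequencies(sites, pos)
        ic = information_content(freqs)
        heights.append({nt: p * ic for nt, p in freqs.items()})
    return heights

# Toy aligned sites (illustrative only, not data from the paper).
toy_sites = ["AGGAGG", "AGGAGG", "AGGTGA", "AGGAGG"]
toy_heights = logo_heights(toy_sites)
for pos, column in enumerate(toy_heights):
    print(pos, {nt: round(h, 3) for nt, h in column.items() if h > 0})
```

The stack height at each position is the sum of the letter heights, so a fully conserved position reaches 2 bits and a position with equal frequencies of all four nucleotides collapses to zero.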
Figure 4
Venn diagram showing group relationships between the GenBank annotation and sets of genes detected by GeneMark.hmm 2.0 and Glimmer 2.02 for the B.subtilis genome (A) and the E.coli genome (B).
Figure 5
Distributions of log-odds scores of RBS sites, as detected by GeneMarkS, in sets of overlapping and non-overlapping genes of (A) B.subtilis, (B) E.coli and (C) M.jannaschii. As can be seen, the overlapping genes, which are likely to be located inside operons, frequently have strong RBS sites. Still, most strong sites of ribosome binding precede the non-overlapping genes (stand alone genes and genes leading operons). This tendency is much more apparent in the case of the archaeal genome of M.jannaschii than in the E.coli and B.subtilis genomes.
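For readers unfamiliar with log-odds scoring of a candidate site, the following sketch shows one common formulation: the sum over positions of the log ratio between a position-specific probability and a background probability. It is an assumption-laden illustration rather than the scoring used by GeneMarkS: the uniform background, the pseudocount of 1, the toy sites and the names `build_pwm` and `log_odds` are all hypothetical choices.

```python
import math

def build_pwm(sites, pseudocount=1.0):
    """Position-specific nucleotide probabilities from aligned sites (with pseudocounts)."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {nt: pseudocount for nt in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({nt: c / total for nt, c in counts.items()})
    return pwm

def log_odds(site, pwm, background):
    """Log2 ratio of the site's probability under the motif model versus the background."""
    return sum(math.log2(pwm[i][nt] / background[nt]) for i, nt in enumerate(site))

# Toy data: aligned sites and a uniform background (both assumptions for illustration).
toy_sites = ["AGGAGG", "AGGAGG", "AGGTGA", "AGGAGG"]
uniform_background = {nt: 0.25 for nt in "ACGT"}
toy_pwm = build_pwm(toy_sites)

strong_score = log_odds("AGGAGG", toy_pwm, uniform_background)
weak_score = log_odds("TTTTTT", toy_pwm, uniform_background)
print(strong_score, weak_score)
```

A positive score means the candidate looks more like the motif than like background sequence; histograms of such scores for two gene sets are what panels (A) to (C) compare.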
Figure 6
Sequence logo representing the upstream sequence motif detected by GeneMarkS for A.fulgidus. This consensus sequence is more indicative of a eukaryotic-like promoter element than of an RBS signal of the kind often found in prokaryotes. Sites that match this pattern are ubiquitous in A.fulgidus, although further analysis of a subset of upstream sequences reveals a second motif (see Fig. 7) complementary to the 3′ terminal section of the A.fulgidus 16S rRNA.
Figure 7
Sequence logo representing the RBS motif observed in a subset of upstream sequences of the A.fulgidus genome. This subset consisted of 50 nt long upstream sequences overlapping the 3′ end of the preceding gene. The consensus of this motif is complementary to a section of the A.fulgidus 16S rRNA.
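Checking whether a motif consensus is complementary to the 3′ end of a 16S rRNA amounts to looking for the reverse complement of the motif in the rRNA tail. The sketch below does this with exact Watson-Crick pairing only (no G-U wobble) on DNA-alphabet strings. The tail string is an illustrative sequence containing a CCTCC core and is not A.fulgidus data; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA-alphabet sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def longest_complementary_stretch(motif, rrna_tail):
    """Longest contiguous part of the motif that can pair with the rRNA tail."""
    target = reverse_complement(motif)
    best = ""
    for i in range(len(target)):
        for j in range(i + 1, len(target) + 1):
            sub = target[i:j]
            if len(sub) > len(best) and sub in rrna_tail:
                best = sub
    # Map the matched stretch back onto the motif strand.
    return reverse_complement(best)

# Illustrative 3' tail in DNA alphabet (assumption, not taken from the paper).
example_tail = "GATCACCTCCTTA"
paired = longest_complementary_stretch("GGAGGT", example_tail)
unpaired = longest_complementary_stretch("TTTTTT", example_tail)
print(paired, unpaired)
```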
Figure 8
Distributions of spacer length for two species with strong RBS patterns, B.subtilis and E.coli (solid and dashed lines, respectively), and one species with a strong eukaryotic promoter-like pattern, A.fulgidus (dotted line). The promoter-like pattern of A.fulgidus is localized much further upstream of the start codon than the RBS patterns of B.subtilis and E.coli.
Figure 9
(A) Distribution of spacer lengths observed in the B.subtilis genome for two different types of possible RBS hexamers: AGGAGG and AGGTGA. Multiple alignment allows these hexamers to be superimposed. In actual upstream sequences, these hexamers tend to occupy different locations relative to the start codon. This preference may be involved in the precise positioning of the ribosome at the translation initiation site when the 16S rRNA binds to mRNA. The more frequent hexamer was observed on average farther from the gene start than the rare hexamer. (B) Distribution of spacer lengths observed in the M.thermoautotrophicum genome for two different types of RBS hexamers: GGAGGT and GGTGAT. Properties of these hexamers are similar to the two hexamers observed in the B.subtilis genome (A), except that the more frequent hexamer is now found on average closer to the gene start than the rare hexamer.
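A spacer-length distribution of this kind can be tallied as sketched below. This is a simplified stand-in: the paper locates sites through Gibbs sampling multiple alignment, whereas this sketch uses exact string matching and takes the occurrence nearest the start codon. It assumes each upstream sequence ends immediately before the start codon; the toy sequences and the names `spacer_length` and `spacer_distributions` are hypothetical.

```python
from collections import Counter, defaultdict

def spacer_length(upstream, hexamer):
    """Nucleotides between the end of the hexamer and the start codon.

    Assumes `upstream` ends immediately before the start codon and uses the
    occurrence nearest the gene start. Returns None if the hexamer is absent.
    """
    idx = upstream.rfind(hexamer)
    if idx == -1:
        return None
    return len(upstream) - (idx + len(hexamer))

def spacer_distributions(upstreams, hexamers):
    """Histogram of spacer lengths for each hexamer over a set of upstream regions."""
    dists = defaultdict(Counter)
    for upstream in upstreams:
        for hexamer in hexamers:
            spacer = spacer_length(upstream, hexamer)
            if spacer is not None:
                dists[hexamer][spacer] += 1
    return dists

# Toy upstream regions (illustrative only, not genomic data).
toy_upstreams = ["CCAGGAGGAAAAAAAA", "CCCCAGGTGAAAAAA", "TTAGGAGGAAAAAAAA"]
toy_dists = spacer_distributions(toy_upstreams, ["AGGAGG", "AGGTGA"])
print({hexamer: dict(counts) for hexamer, counts in toy_dists.items()})
```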

References

    1. Fickett,J.W. (1981) Recognition of protein coding regions in DNA sequences. Nucleic Acids Res., 10, 5303–5318.
    2. Gribskov,M., Devereux,J. and Burgess,R.R. (1984) The codon preference plot: graphic analysis of protein coding sequences and prediction of gene expression. Nucleic Acids Res., 12, 539–549.
    3. Staden,R. (1984) Measurements of the effects that coding for a protein has on a DNA sequence and their use for finding genes. Nucleic Acids Res., 12, 551–567.
    4. Borodovsky,M.Y., Sprizhitskii,Y.A., Golovanov,E.I. and Aleksandrov,A.A. (1986) Statistical patterns in primary structures of functional regions in the E. coli genome: III. Computer recognition of coding regions. Mol. Biol., 20, 1145–1150.
    5. Borodovsky,M.Y. and McIninch,J.D. (1993) GeneMark: parallel gene recognition for both DNA strands. Comput. Chem., 17, 123–153.
