Genome Res. 2018 Jul;28(7):1079-1089.
doi: 10.1101/gr.230615.117. Epub 2018 May 17.

Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes

Alexandre Lomsadze et al. Genome Res. 2018 Jul.

Abstract

In a conventional view of the prokaryotic genome organization, promoters precede operons and ribosome binding sites (RBSs) with Shine-Dalgarno consensus precede genes. However, recent experimental research suggesting a more diverse view motivated us to develop an algorithm with improved gene-finding accuracy. We describe GeneMarkS-2, an ab initio algorithm that uses a model derived by self-training for finding species-specific (native) genes, along with an array of precomputed "heuristic" models designed to identify harder-to-detect genes (likely horizontally transferred). Importantly, we designed GeneMarkS-2 to identify several types of distinct sequence patterns (signals) involved in gene expression control, among them the patterns characteristic for leaderless transcription as well as noncanonical RBS patterns. To assess the accuracy of GeneMarkS-2, we used genes validated by COG (Clusters of Orthologous Groups) annotation, proteomics experiments, and N-terminal protein sequencing. We observed that GeneMarkS-2 performed better on average in all accuracy measures when compared with the current state-of-the-art gene prediction tools. Furthermore, the screening of ∼5000 representative prokaryotic genomes made by GeneMarkS-2 predicted frequent leaderless transcription in both archaea and bacteria. We also observed that the RBS sites in some species with leadered transcription did not necessarily exhibit the Shine-Dalgarno consensus. The modeling of different types of sequence motifs regulating gene expression prompted a division of prokaryotic genomes into five categories with distinct sequence patterns around the gene starts.
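
The idea of combining a sequence motif with its position relative to the gene start can be illustrated with a minimal sketch. This is not the GeneMarkS-2 implementation: the position weight matrix, the spacer length distribution, the uniform background, and the function names below are all hypothetical values and names chosen for illustration. The sketch scores every window upstream of a candidate start codon by the motif log-odds plus the log-probability of the resulting spacer length, and reports the best-scoring placement.

```python
import math

BASES = "ACGT"

# Hypothetical probabilities for an "A" column and a "G" column of a motif.
# Illustrative values only, not parameters taken from GeneMarkS-2.
COL_A = {"A": 0.70, "C": 0.05, "G": 0.20, "T": 0.05}
COL_G = {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05}

# Hypothetical 6-nt RBS-like motif resembling AGGAGG.
MOTIF_PWM = [COL_A, COL_G, COL_G, COL_A, COL_G, COL_G]

# Assumed uniform background nucleotide frequencies.
BACKGROUND = {b: 0.25 for b in BASES}

# Hypothetical spacer length distribution: number of nucleotides between
# the 3' end of the motif and the start codon.
SPACER_PROB = {5: 0.10, 6: 0.20, 7: 0.30, 8: 0.20, 9: 0.10, 10: 0.10}
SPACER_FLOOR = 1e-3  # small probability for spacer lengths not listed above


def motif_log_odds(site):
    """Log-odds of a site under the motif model versus the background."""
    return sum(math.log(col[b] / BACKGROUND[b]) for col, b in zip(MOTIF_PWM, site))


def best_upstream_signal(upstream):
    """Find the best motif placement in the sequence 5' of a start codon.

    `upstream` is read 5'->3'; its last character is adjacent to the start
    codon. Returns (score, position, spacer) or None if no window fits.
    """
    width = len(MOTIF_PWM)
    best = None
    for pos in range(len(upstream) - width + 1):
        site = upstream[pos:pos + width]
        if any(b not in BASES for b in site):
            continue  # skip windows containing ambiguous characters
        spacer = len(upstream) - (pos + width)
        score = motif_log_odds(site) + math.log(SPACER_PROB.get(spacer, SPACER_FLOOR))
        if best is None or score > best[0]:
            best = (score, pos, spacer)
    return best


if __name__ == "__main__":
    # Motif placed 7 nt upstream of the start codon.
    print(best_upstream_signal("TTTTTAGGAGGTTTTTTT"))
```

In this toy setting, a strongly positive best score suggests an RBS-like signal at a plausible distance, while a negative best score means no window resembles the motif. A model of leaderless transcription would need a different motif and a different distance distribution, which the sketch does not attempt.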

Figures

Figure 1.
Principal state diagram of the generalized hidden Markov model (GHMM) of prokaryotic genomic sequence. States shown in the top panel were used to model a gene in the direct strand. Genes in the reverse strand were modeled by the identical set of states (with directions of transition reversed). The states modeling genes in direct and reverse strands were connected through the intergenic region state as well as the states of genes overlapping in opposite strands.
Figure 2.
Principal workflow of the unsupervised training.
Figure 3.
Motif logos and spacer length distributions for genomes of M. tuberculosis (Group C) and H. salinarum (Group D). In M. tuberculosis, the ‘mixed’ motif found by GeneMarkS has no preferred localization in the upstream regions of the first genes in operons (A). In contrast, the motif found by GeneMarkS-2 is clearly localized at a distance of −10 from gene starts, the distance typical for the bacterial TATA box and leaderless transcription (B). In upstream regions of internal genes in operons, GeneMarkS-2 built the RBS model and the spacer length distribution (C). For H. salinarum, comparison of the GeneMarkS-2 outcomes (E,F) with those of GeneMarkS (D) shows similar improvements.
Figure 4.
Color-coded scheme of the distribution of groups A–D and X among ∼5000 representative genomes. The diagram shows the top three levels of the taxonomy trees of both archaea and bacteria.
Figure 5.
Distributions of the percentage of predicted atypical genes in archaeal and bacterial genomes.
Figure 6.
Distributions of the percentage of leaderless transcripts among all transcripts in bacterial Group C and archaeal Group D.
Figure 7.
The motif logo and the spacer length distribution of Bacteroides ovatus, a Group B genome.
