Microbiol Spectr. 2023 Sep 26;11(5):e0170423.
doi: 10.1128/spectrum.01704-23. Online ahead of print.

A large-scale phylogeny-guided analysis of pseudogenes in Pseudomonas aeruginosa bacterium

Nimrod Cohen et al. Microbiol Spectr.

Abstract

Pseudogenes, once considered "junk DNA" based on the incorrect assumption that the absence of full coding potential means a complete lack of functionality, have recently become a subject of significant interest in the scientific community. Concurrently, it is widely assumed that bacterial genomes are compact and have a high density of coding genes with little room for non-coding genes, including pseudogenes. A key aspect of genome annotation is the correct identification of genes and the distinction between coding genes and pseudogenes, as it directly impacts functional and comparative genomics studies. In this study, we analyzed the genomic data of 4,699 strains of the bacterium Pseudomonas aeruginosa (P. aeruginosa) because they exhibit high variability in the number of annotated pseudogenes. In particular, we looked for correlations between the number of pseudogenes and other genomic and meta-features of the strains. We identified clusters of orthologous genes and pseudogenes and compared cluster size distributions and length homogeneity within clusters. We then mapped and examined orthology relationships between genes and pseudogenes. Additionally, we generated a phylogenetic tree of the strains and found that phylogenetically related strains are more homogeneous in the number of pseudogenes and share a significant number of pseudogenes. Finally, we delved into clusters of orthologous genes and pseudogenes and quantified their phylogenetic neighborhood, classifying pseudogenes into evolutionarily preserved pseudogenes, mis-annotated pseudogenes, or pseudogenes formed by failed horizontal transfer events. This in-depth study provides important insights that can be incorporated into pseudogene annotation pipelines in the future.

Importance: Accurate annotation of genes and pseudogenes is vital for comparative genomics analysis. Recent studies have shown that bacterial pseudogenes have an important role in regulatory processes and can provide insight into the evolutionary history of homologous genes or the genome as a whole. Due to pseudogenes' nature as non-functional genes, there is no commonly accepted definition of a pseudogene, which poses difficulties in verifying the annotation through experimental methods and resolving discrepancies among different annotation techniques. Our study presents an in-depth analysis of annotated genes and pseudogenes and offers insights that can be incorporated into improved pseudogene annotation pipelines in the future.

Keywords: Pseudomonas aeruginosa; bacteria; comparative genomics; phylogenetics; pseudogenes.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig 1
Association between the number of pseudogenes and (a) assembly level and (b) isolation type of the strains. For each group, the number of observations (n) is shown above the median line. The box and whiskers span the interquartile range (IQR) and 1.5 × IQR, respectively, and diamonds represent outliers. One strain and 1,990 strains are missing information about assembly level and isolation type, respectively, and thus were dropped from the plots.
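The box-and-whisker convention described in this caption (whiskers at 1.5 × IQR, points beyond them drawn as outliers) can be illustrated with a minimal sketch. The counts below are hypothetical, and the quartile interpolation method is an assumption; plotting libraries may compute quartiles slightly differently.

```python
import statistics

def tukey_outliers(values):
    """Return (q1, q3, outliers) using the 1.5 x IQR whisker rule."""
    # Assumption: "inclusive" quartile interpolation; other methods exist.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Values outside the whisker limits are reported as outliers.
    outliers = [v for v in values if v < low or v > high]
    return q1, q3, outliers

# Hypothetical pseudogene counts for one group of strains (not study data).
counts = [20, 25, 30, 35, 40, 45, 50, 400]
q1, q3, outliers = tukey_outliers(counts)
```

Here the strain with 400 annotated pseudogenes falls above the upper whisker limit and would be drawn as a diamond.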
Fig 2
Distributions of cluster size and sequence length variability within gene and pseudogene clusters. (a and b) Distributions of cluster size, measured as the number of strains represented in the cluster, for (a) gene and (b) pseudogene clusters. In both graphs, the leftmost peak represents "singletons" (clusters that consist of genes/pseudogenes that appear in only one strain) and rare accessory genes/pseudogenes. In (a), the rightmost peak represents core genes that are present in a majority of the strains. (c and d) Distribution of the coefficient of variation (CV) of sequence length within (c) gene and (d) pseudogene clusters. For a better view, the graph in (c) was truncated on the y-axis (max = 16,322), and both (c) and (d) were truncated on the x-axis (max = 1.9).
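As a rough illustration of the length-homogeneity measure in panels (c) and (d), the CV of a cluster is the standard deviation of its sequence lengths divided by their mean. The sketch below uses hypothetical cluster names and lengths; the use of the population standard deviation and the treatment of singletons as CV = 0 are assumptions, not details taken from the study.

```python
import statistics

def length_cv(lengths):
    """Coefficient of variation: standard deviation divided by mean."""
    if len(lengths) < 2:
        return 0.0  # assumption: a singleton has no measurable variability
    # Assumption: population standard deviation (pstdev), not sample stdev.
    return statistics.pstdev(lengths) / statistics.mean(lengths)

# Hypothetical clusters mapped to member sequence lengths in base pairs.
clusters = {
    "cluster_a": [900, 900, 903, 897],  # near-identical lengths
    "cluster_b": [300, 900, 1500],      # truncated and extended members
}
cvs = {name: length_cv(lengths) for name, lengths in clusters.items()}
```

A cluster of near-identical lengths yields a CV close to 0, while a cluster mixing truncated and full-length members yields a much larger value.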
Fig 3
(a) Distribution of the number of sequences in gene/pseudogene representative clusters. The plot is in log scale. Clusters are labeled by their content: gene-only, pseudogene-only, or mixed. (b) Further classification of pseudogene-only and mixed clusters, based on the size of the original clusters. (c) Distribution of pseudogenes by cluster classification. Each strain’s pseudogenes are classified into one of the classes shown in (b). The proportions of each class across all strains are plotted. The box and whiskers span IQR and 1.5 × IQR, respectively, and diamonds represent outliers. Distribution parameters matching the box plots are found in Table S4.
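The content-based labeling in panel (a) can be sketched as follows, assuming each cluster member carries an annotation type of either "gene" or "pseudogene". The function name and input format are hypothetical, and the size-based subclassification in panel (b) is not reproduced here because its thresholds are not given in the caption.

```python
def classify_cluster(member_types):
    """Label a cluster by the annotation types of its members."""
    kinds = set(member_types)
    if kinds == {"gene"}:
        return "gene-only"
    if kinds == {"pseudogene"}:
        return "pseudogene-only"
    if kinds == {"gene", "pseudogene"}:
        return "mixed"
    # Empty clusters or unknown annotation types are rejected.
    raise ValueError(f"unexpected member types: {sorted(kinds)}")

# Hypothetical clusters, each a list of member annotation types.
labels = [
    classify_cluster(["gene", "gene"]),
    classify_cluster(["pseudogene"]),
    classify_cluster(["gene", "pseudogene", "gene"]),
]
```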
Fig 4
Distribution of CV of the number of pseudogenes or genes for different groupings of strains. (a–d) CV of the number of pseudogenes. (a) Grouping by phylogenetic tree versus random grouping. (b) Grouping by MLST. (c) Grouping by BioProject. (d) Comparison of CV distribution between different groupings (P values were computed with the Mann-Whitney U rank test; only significant P values, <0.05, are shown). The box and whiskers span IQR and 1.5 × IQR, respectively, and diamonds represent outliers. (e) CV of the number of genes; grouping by phylogenetic tree. (f) CV of the number of genes versus CV of the number of pseudogenes; grouping by phylogenetic tree. Group size in (a) and (d–f) is 9. A similar analysis for group size 15 can be found in Fig. S3.
Fig 5
Strain pairwise analysis of shared pseudogenes. Heatmaps show (a) absolute count and (b) Jaccard index of the number of shared pseudogenes between pairs of strains. The mean and maximum values are as follows: mean = 33, maximum = 933 and mean = 0.11, maximum = 1 in (a) and (b), respectively. (c) Phylogenetic tree of P. aeruginosa strains: the green and orange layers on top of the tree indicate the number of genes and pseudogenes, respectively, for the strains at the tips of the tree. The branches are colored based on MLST as follows: each sequence type (ST) was transformed into a random hex number, which was translated into a color. STs that were assigned to fewer than five different strains were filtered out. Altogether, there are 135 colors in the tree. Black is assigned to strains that have no ST information or belong to STs that were filtered out. The order of the strains in rows (top to bottom) and columns (left to right) of both heatmaps is identical and corresponds to the order of the tips (indicated with an arrow) of the phylogenetic tree shown in (c).
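The two pairwise measures in panels (a) and (b) can be sketched as below. This assumes that each strain is represented by the set of pseudogene cluster identifiers it contains, so that two strains "share" a pseudogene when both have a member in the same cluster; that representation, and all strain and cluster names, are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical strains mapped to the pseudogene clusters they contain.
strain_clusters = {
    "strain_1": {"pc1", "pc2", "pc3"},
    "strain_2": {"pc2", "pc3", "pc4"},
    "strain_3": {"pc5"},
}

# For every strain pair, record (shared count, Jaccard index).
shared = {}
for s, t in combinations(sorted(strain_clusters), 2):
    a, b = strain_clusters[s], strain_clusters[t]
    shared[(s, t)] = (len(a & b), jaccard(a, b))
```

The absolute count grows with the number of pseudogenes per strain, whereas the Jaccard index normalizes by the union, which is why the two heatmaps can highlight different strain pairs.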
Fig 6
Distribution of gene and pseudogene neighborhood labels within clusters, visualized as heatmaps. These heatmaps represent 2,653 gene-pseudogene mixed clusters in the rows and three possible labels in the columns for (a) genes or (b) pseudogenes. The color of each cell indicates the proportion of the label (column) within the cluster (row). The left bars in each heatmap represent cluster size, i.e., the number of strains containing genes/pseudogenes in the cluster. The clusters were sorted by the proportions of different labels from high to low in the following priority: with same type, with opposite type, and alone.
