Performance and Application of 16S rRNA Gene Cycle Sequencing for Routine Identification of Bacteria in the Clinical Microbiology Laboratory

Deirdre L. Church; Lorenzo Cerutti; Antoine Gürtler; Thomas Griener; Adrian Zelazny; Stefan Emler

doi:10.1128/CMR.00053-19

Clin Microbiol Rev. 2020 Oct; 33(4): e00053-19.

Published online 2020 Sep 9. doi: 10.1128/CMR.00053-19

PMCID: PMC7484979

PMID: 32907806

SUMMARY

This review provides a state-of-the-art description of the performance of Sanger cycle sequencing of the 16S rRNA gene for routine identification of bacteria in the clinical microbiology laboratory. A detailed description of the technology and current methodology is outlined with a major focus on proper data analyses and interpretation of sequences. The remainder of the article is focused on a comprehensive evaluation of the application of this method for identification of bacterial pathogens based on analyses of 16S multialignment sequences. In particular, the existing limitations of similarity within 16S for genus- and species-level differentiation of clinically relevant pathogens and the lack of sequence data currently available in public databases are highlighted. The multiyear experience of a large regional clinical microbiology service with direct 16S broad-range PCR followed by cycle sequencing for direct detection of pathogens in appropriate clinical samples is described. The ability of proteomics (matrix-assisted laser desorption ionization-time of flight) versus 16S sequencing for bacterial identification and genotyping is compared. Finally, the potential for whole-genome analysis by next-generation sequencing (NGS) to replace 16S sequencing for routine diagnostic use is presented for several applications, including the barriers that must be overcome to fully implement newer genomic methods in clinical microbiology. A future challenge for large clinical, reference, and research laboratories, as well as for industry, will be the translation of vast amounts of accrued NGS microbial data into convenient algorithm testing schemes for various applications (e.g., microbial identification, genotyping, and metagenomics and microbiome analyses) so that clinically relevant information can be reported to physicians in a format that is understood and actionable. 
These challenges will not be faced by clinical microbiologists alone but by every scientist involved in a domain where natural diversity of genes and gene sequences plays a critical role in disease, health, pathogenicity, epidemiology, and other aspects of life-forms. Overcoming these challenges will require global multidisciplinary efforts across fields that do not normally interact with the clinical arena to make vast amounts of sequencing data clinically interpretable and actionable at the bedside.

KEYWORDS: 16S rRNA, bacteria, cycle sequencing, identification

INTRODUCTION

Nucleic acid sequencing of the bacterial 16S rRNA gene (here designated 16S) has been used for several decades to identify clinical and environmental isolates and to assign phylogenetic relationships. Carl Woese and George Fox pioneered comparisons of 16S sequence data prior to the development of DNA sequencing methods to perform complex phylogenetic studies, initially using it to classify methanogenic bacteria and to describe the Archaebacterium Halobacterium volcanii (1–3). Subsequent accumulation of large amounts of small-subunit rRNA gene sequence data (16S of bacteria and 18S rRNA of eukaryotes) allowed other phylogenetic reconstruction studies, which established the three fundamental domains (Archaea, Bacteria, and Eucarya) within the universal tree of life (4). 16S remains the most widely used stable target for bacterial identification and genetic evolutionary studies, because other highly conserved genes have not been as thoroughly studied. However, various 16S regions and/or a longer gene sequence (i.e., up to ∼1,060 bp) are required for definitive identification of many bacterial genera and/or species as outlined here and in the recently published revision of CLSI MM-18-A2 (5).

Definitive identification of human bacterial pathogens using targeted partial 16S cycle sequencing has been used in clinical microbiology laboratories for the past ∼30 years (6–15). The advent of commercial capillary gel genetic analyzers and the availability of public database repositories containing a large amount of 16S sequence data made this feasible. GenBank (NCBI) currently contains >29,000,000 entries for 16S sequences of various lengths and quality derived from a diverse range of bacteria recovered from various clinical/environmental sources (https://www.ncbi.nlm.nih.gov/genbank); many of these entries contain just 16S sequences, but increasingly partial or entire genomes containing complete 16S sequences are being deposited. Several excellent reviews of the impact and utility of this method on clinical microbiology practices were published more than a decade ago and outlined the labor intensity, expense, and technological constraints of Sanger sequencing methods available at that time (6, 11, 16). Substantial advances have subsequently occurred in the efficiency of cycle sequencing and the standardized interpretation of 16S sequence data for assigning a definitive bacterial genus- and/or species-level identification. Whereas diagnostic laboratories may have previously referred a clinical isolate to an academic core facility for partial targeted 16S sequencing analysis, advances in the efficiency of PCR technology, sequencing instrumentation, and 16S sequence interpretation made its performance in-house possible (17). Therefore, genomic identification using variable regions within the 16S gene for species-specific differentiation was widely used in the pre-matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI-TOF MS) era for precise identification of a wide range of clinically relevant bacterial pathogens where phenotypic methods could not provide an adequate level of discrimination or gave discrepant results (18–24). 
Many larger facilities also routinely use 16S universal PCR or broad-range PCR and cycle sequencing to identify amplified bacterial DNA directly from clinical isolates or samples (25–27).

MALDI-TOF MS has recently supplanted routine phenotypic tests to a large degree as the routine method used for pathogen identification and has revolutionized the ability of clinical microbiology laboratories to rapidly identify a much wider diversity of microorganisms (23, 28, 29). A combination of phenotypic and genotypic tests is commonly performed to arrive at specific bacterial identification, including a Gram stain, rapid biochemical tests, MALDI-TOF MS, and, where necessary, genetic analysis of 16S or other gene targets. Ready access to isolate biorepositories storing clinical strains characterized by phenotypic tests and 16S sequencing is essential for the ongoing expansion of existing MALDI-TOF MS databases to include more unusual pathogens (18, 19, 21–24, 29–32). Genetic analyses of 16S sequences either through Sanger cycle or next-generation sequencing methods will continue to be an important diagnostic technology alongside current proteomic methods, which will be addressed (11, 25, 33).

This review describes the state-of-the-art approach for performing fast 16S cycle sequencing (Sanger) for routine identification of bacterial pathogens in the clinical microbiology laboratory, building upon a comprehensive previous Clinical Microbiology Reviews article by Clarridge (16). Because of the widespread use of 16S for isolate identification and its accelerating use for metagenomics and microbiome studies, we provide a detailed analysis of the identified limitations of using this target for these various applications. Gaps remain in the currently available sequence databases that limit analysis and interpretation of 16S sequence data. GenBank (NCBI) holds the most sequences but has no curation in place to ensure correct sequence and annotation content. Other, more curated databases are often not representative of microbial diversity, because they focus mainly on type and reference strains. Longitudinal clinical experience is described from a large integrated regional clinical microbiology service that routinely used both 16S sequencing and MALDI-TOF MS for the identification of bacterial pathogens. A detailed analysis of the current utility of using the 16S target for bacterial identification is presented based on analysis of sequence alignments done to revise CLSI MM-18 A2 (5), and the ability of 16S to discriminate clinically relevant genus/species is outlined in Tables 3 to 12. Aside from primer selection, the main factors limiting 16S discriminatory ability for clinical/environmental bacterial isolates are the current lack of available sequence data and a high degree of 16S homology between several related genera and/or species. The current and future use of proteomics and next-generation sequencing is discussed as a replacement for targeted 16S rRNA gene capillary cycle sequencing.

THE 16S rRNA GENE AND PRIMER SELECTION

The prokaryotic rRNA genes specifically include 5S, 16S, and 23S and intergenic regions (34–36). The 16S rRNA gene is ∼1,500 nucleotides long (∼1.5 kb, although this is an average and some organisms can have 16S sequences that are shorter or longer) and encodes the RNA component of the 30S small subunit of prokaryotic ribosomes, which binds to the Shine-Dalgarno sequence at its 3′ end (36–40). 16S rRNA has several functions, including a structural role as well as being crucial to protein synthesis. Along with 23S, it provides a scaffold to assist with the binding of the 50S and 30S ribosomal subunits, as well as defining the ribosomal protein positions (34–36). The 3′ end of 16S RNA also binds to the S1 and S21 proteins, known to be involved in initiation of protein synthesis by RNA-protein cross-linking (41). All prokaryotes have at least one copy of 16S, making it ubiquitous, and as it is highly conserved and evolves slowly, it is the most widely used single target for phylogenetic studies of bacteria and archaea (3, 4, 42). Multiple copies of 16S can exist within a single bacterium, and some copies may differ (43, 44). Genomic sequencing studies also show that many bacterial species have intragenic heterogeneity (i.e., harbor multiple 16S gene copies and polymorphisms between these copies) that allows interspecies subtyping via partial or full sequencing of 16S (45, 46). Interspecies discrimination based on intragenic heterogeneity has been demonstrated for a variety of human pathogens, including Neisseria (47, 48), Haemophilus (49), Salmonella (50), and Listeria (51) species. Horizontal transfer of 16S also occurs, albeit infrequently and only at the intragenus or intraspecies level (52). Although this is much more restricted than bacterial horizontal transfer of operational genes (i.e., enzyme-encoding genes), some investigators question whether 16S should be the only target used for identification or phylogenetic purposes. 
However, 16S has been extensively studied and applied to establish a species description, species taxonomy, and phylogenetic relationships. Therefore, 16S is the molecular target of choice for genus- or species-level identification in the clinical laboratory because of its ubiquitous nature amongst bacteria and archaea (∼10% to 15%) and the abundance of sequence data compared to that for other targets (5).

The 16S gene contains mosaics of sequence that range from highly conserved to variable and hypervariable regions, as illustrated previously by Baker and colleagues in their schematic of the Escherichia coli gene (Fig. 1) (39). Within certain stretches, 16S provides genus- and/or species-specific signatures that enable accurate identification depending on the targeted gene regions for a particular bacterium/microorganism group(s). Universal 16S primers can be designed to target the conserved regions of 16S, of which some motifs are shared across the entire kingdom (“eubacterial primers”). In laboratory practice, these primers most often target the first ∼500 bp of the small ribosomal subunit gene, because analysis of the V1-V3 regions is considered sufficient to allow accurate identification of most specific genera/species; hence, most 16S sequences currently deposited in public databases correspond to this part of the gene. Inaccurate biological conclusions, however, are derived from experiments using suboptimal 16S primer design because of nonamplification and/or nondetection of some critical genera and/or species (i.e., some species or groups are missed entirely or proportionally misrepresented within the population), and there will be a significant loss of taxonomic classification with shorter 16S sequences (53). It is important to remember that forward (F; defined as an oligonucleotide sequence that is complementary to the antisense strand of double-stranded DNA) and reverse (R; defined as an oligonucleotide sequence that is complementary to the sense strand of double-stranded DNA) primer sets targeting the 16S V1-V3 regions were historically designed for environmental microbiome community analysis (54–56) and not for clinical isolates (Fig. 1). More rigorous identification of human pathogens requires the use of other 16S specific F/R primer pairs, as previously reported (5, 57). 
Although previous analyses show it is difficult to design primers to universally detect all prokaryotic 16S rRNA gene sequences (39, 58, 59), this has been achieved (SmartGene patent number EP1863922B1). Optimal primer design can also mitigate the specificity issues of using standard 16S primers for broad-range PCR/sequencing or microbiome analyses by decreasing amplification of potential contaminants and cross-reactivity with common human host DNA sequences (60, 61).
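
The F and R primer definitions given above can be made concrete with a small sketch. It is illustrative only: the sequence is a made-up toy string, not a real 16S sequence or a published primer, and the function names `reverse_complement` and `primer_pair` are hypothetical. With the sense strand written 5′ to 3′, the forward primer has the same sequence as the start of the target region (and is therefore complementary to the antisense strand), whereas the reverse primer is the reverse complement of the end of the target region (and is therefore complementary to the sense strand).

```python
# Toy illustration of forward/reverse primer orientation.
# Assumption: sequences contain only A, C, G, T and are written 5'->3'.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def primer_pair(sense: str, start: int, end: int, primer_len: int):
    """Return (forward, reverse) primers flanking sense[start:end].

    forward: identical to the 5' end of the target on the sense strand,
             so it anneals to the antisense strand.
    reverse: reverse complement of the 3' end of the target,
             so it anneals to the sense strand.
    """
    target = sense.upper()[start:end]
    if primer_len <= 0 or 2 * primer_len > len(target):
        raise ValueError("primers must fit inside the target region")
    forward = target[:primer_len]
    reverse = reverse_complement(target[-primer_len:])
    return forward, reverse


if __name__ == "__main__":
    toy_sense = "ATGCCGTAAGCTTGA"  # hypothetical sequence, not 16S
    fwd, rev = primer_pair(toy_sense, 0, len(toy_sense), 4)
    print("forward:", fwd, "reverse:", rev)
```

Real primer design also has to account for melting temperature, degeneracy, and coverage across taxa, none of which this sketch attempts.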

FIG 1

E. coli 16S rRNA gene and locations of conserved and variable regions.

Partial targeted sequencing of the 16S V1-V3 region (i.e., first ∼500 bp), however, may not provide enough coverage of variable regions to allow unambiguous species-level identification of a number of important human bacterial pathogens, as is outlined in detail by the recently revised Clinical and Laboratory Standards Institute (CLSI) guideline MM-18-A2 (5). Initial interpretive criteria published by CLSI (MM-18) for the identification of a human bacterial pathogen by partial 16S DNA target sequencing recommended the use of specific bacterial primer pairs (i.e., 4F, 27F, 534R, and 801R) that target the first ∼500 bp of the gene (i.e., V1-V3 region) (62). The recently published CLSI MM-18-A2 update also indicates that a longer section of 16S sequence across several gene regions (V1-V6; ∼1,060 bp, covering the V5 and V6 variable regions), or even the entire gene (1,540 bp), may need to be analyzed within many genera to achieve a species-level identification (5). 16S multisequence alignments within all major bacterial pathogen genera/species were analyzed across the various gene regions to make the recommendations outlined by CLSI MM-18-A2 for bacterial identification using a shorter (V1-V3) or longer sequence (5). This work highlighted that an international standard should be developed for the use of 16S primers for various clinical applications, particularly for the accurate identification of human pathogens or human microbiome analyses, and that this standard should include the precise limitations of particular published 16S primer pairs for this purpose.

Use of the 16S target alone may not be sufficient to reliably identify many human clinical pathogens for several reasons, such as (i) high genetic similarity within specific microorganisms or groups, (ii) the presence of variable copy numbers of 16S rRNA genes with sequence variation in their genomes (45, 46, 63), and (iii) the lack of 16S sequence information currently available in published data repositories. This is not surprising, since the vast majority of microorganism/microorganism groups have yet to be identified or classified (i.e., only an estimated 1% of all microbes have been discovered) (64, 65). In any case, 16S sequencing will yield a result allowing an approximate organism classification against those present in the database.

Several other genetic targets have been used for research purposes for improved identification of microorganism/microorganism groups and most commonly include the conserved genes in the ribosomal region (rpoA, rpoB, rpoC, and rpoD) (66–68), the spacer region 16S-23S (69), DNA metabolic enzymes (gyrA and gyrB) (70–72), DNA repair genes (recA and recN) (73, 74), the elongation factor Tu gene (tuf) (75–78), superoxide dismutase (sodA) (75, 79), and the chaperonin family of proteins (cpn60) (75, 80). Other, more rarely reported, genetic targets, such as dnaJ, may also be efficient microbial identification targets (81–84). These alternative gene targets, like 16S, have functionally conserved regions with flanking regions of variability, making them ideal for potentially closer separation of related species. Primer selection for alternate targets will often not be universal or “eubacterial” and must be carefully designed to amplify and sequence the intended microorganism/microorganism group. As such, alternate target analysis aside from the rpoB gene and the spacer region 16S-23S has been used mainly for research studies to distinguish specific genera/species (85–87). Because of the rather limited amount of data available for most alternate targets with a number of species not adequately covered (compared to 16S), one would have to make the necessary efforts to build, populate, and validate an in-house database before clinical use. Clinical laboratories should not rely on alternate target analysis alone for reporting identification on clinical isolates unless they have developed a comprehensive gene target database and subsequently done extensive preclinical validation of this bioinformatics tool.

PRINCIPLES OF SANGER AND PYROSEQUENCING METHODS

Partial or complete target sequencing of 16S is commonly performed in the clinical laboratory using either chain termination (Sanger) or pyrosequencing chemical analysis. Figure 2 gives a schematic outline of a chain termination and a pyrosequencing reaction (88, 89).

FIG 2

Schematic outline of a chain termination and pyrosequencing reaction.

Chain Termination Sequencing

During a Sanger procedure, PCR is initially performed using short oligonucleotide 16S primers to synthesize complementary amplicons to the template (90, 91). Secondary cycle sequencing of the amplicon involves a thermostable DNA polymerase, a primer designed to anneal to the template nucleic acid, and small amounts of the required double-stranded DNA template. Four chain-terminating dideoxynucleoside triphosphates (ddNTPs; ddATP, ddTTP, ddGTP, and ddCTP) labeled with individual fluorescent markers of different spectra are also added to the reaction mix at a lower concentration than the deoxyribonucleotide triphosphates (dNTPs; dATP, dTTP, dGTP, and dCTP). When DNA polymerase incorporates a ddNTP during DNA synthesis, sequence elongation terminates, because these bases lack the 3′-hydroxyl group, normally provided by a dNTP base, that is needed to polymerize to the next nucleotide. Each incorporated ddNTP is in a chain-terminated fragment at the same position as the dNTP base in the DNA template.

The cycle sequencing reaction successively builds up DNA strands of different lengths that have a different fluorescently labeled ddNTP (A, T, C, or G) (i.e., dye terminators) at the 3′ end due to many chain termination events (88). BigDye (Applied Biosystems Inc., Thermo Fisher Scientific, Foster City, CA) terminators use single energy transfer molecules, which include an energy donor and acceptor (i.e., dichlororhodamine or rhodamine) dye connected by a highly efficient energy transfer linker (91). These dyes have significantly less overlap at their maximum excitation wavelength than conventional rhodamine dyes, so that sequencing products are produced with a cleaner fluorescent signal and improved base-calling accuracy, particularly at longer read lengths (91).

The single-stranded DNA (ssDNA) fragment mixture generated by the fluorescence cycle sequencing reaction is loaded by electrokinetics into a polyacrylamide (PA; acrylamide monomers [CH₂=CH-CO-NH₂] cross-linked with N,N′-methylenebisacrylamide or bis unit [CH₂=CH-CO-NH-CH₂-NH-CO-CH=CH₂]) gel capillary housed in an automated genetic analyzer, where the fragments are separated by electrophoresis and sequentially read by a fluorometric detector to generate an electropherogram trace of the derived DNA sequence (92, 93). Sequences between 100 and ∼1,300 nucleotides long can be resolved into a series of bands on a PA gel even when ssDNA fragments differ by only one nucleotide (94). Automated capillary sequencing genetic analyzers typically house four or more thin-column capillaries (0.1-mm diameter, 50 to 80 cm long) filled with PA; the optimal PA concentration in the 6% to 7% gel matrix is an acrylamide/bis ratio of 19:1 for resolution of ssDNA fragments between 100 and 750 nucleotides long (95). Longer sequence reads may be obtained by altering the pore size of the gel by using a different PA concentration and PA/bis ratio (92).
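
The logic of chain termination followed by size separation can be sketched in a few lines of code. This is a deliberately idealized model, not a description of any instrument's base-calling software: it assumes exactly one terminated fragment per position, perfect single-nucleotide resolution, and no noise, and the function names `terminated_fragments` and `read_electropherogram` are hypothetical. Each fragment is represented only by its length and the dye-labeled ddNTP at its 3′ end; ordering the fragments by length and reading the terminal base of each recovers the synthesized sequence.

```python
# Idealized Sanger sketch: one terminated fragment per template position.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def terminated_fragments(template: str):
    """Return (length, terminal_base) for every chain-terminated fragment.

    `template` lists the template bases in the order the polymerase
    copies them, starting immediately after the sequencing primer.
    """
    new_strand = [COMPLEMENT[base] for base in template.upper()]
    return [(length, new_strand[length - 1])
            for length in range(1, len(new_strand) + 1)]


def read_electropherogram(fragments):
    """Order fragments by size (shortest first) and read each terminal dye."""
    return "".join(base for _, base in sorted(fragments))


if __name__ == "__main__":
    frags = terminated_fragments("TACG")
    # The fragment order in the tube is arbitrary; electrophoresis sorts by size.
    print(read_electropherogram(list(reversed(frags))))
```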

Pyrosequencing

Pyrosequencing determines the order of nucleotides in template DNA by synthesis and detection of released pyrophosphate (PPi) upon nucleotide base (A, T, C, or G) incorporation (89). This method is mainly used for fast and accurate short reads of DNA templates that do not contain repetitive homopolymer regions (if known) (95). The ssDNA template is hybridized to a sequencing primer and incubated with several enzymes. DNA polymerase synthesizes the complementary sequence, while ATP sulfurylase converts PPi to ATP in the presence of adenosine 5′ phosphosulfate. ATP then acts as a substrate for the luciferase-mediated conversion of luciferin to oxyluciferin that generates visible light proportional to the amount of ATP produced. The light produced by this enzymatic reaction is detected by a charge-coupled device camera and analyzed in a pyrogram. The intensity of light measured by a pyrosequencer, such as the PyroMark (Qiagen, Germantown, MD), determines if there are multiple bases in a row in the sequence (i.e., single G versus GGG). The previous nucleotide is degraded by apyrase before the next one is added for synthesis. No light is emitted if the added base is not complementary to the first unpaired base of the template, and dNTPs continue to be added and incorporated until the entire strand is synthesized. Pyrosequencing can only sequence lengths of DNA that are ∼300 to 500 nucleotides long, which is much shorter than that obtained by the Sanger method (89).
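
The dispensation logic described above can be illustrated with a minimal simulation. It is a sketch under simplifying assumptions: the light signal is taken to be exactly equal to the number of bases incorporated (real signals are only approximately proportional, which is one reason homopolymer runs are problematic), nucleotides are dispensed in a fixed cyclic order, and the names `simulate_pyrogram` and `decode_pyrogram` are hypothetical rather than part of any vendor software.

```python
# Idealized pyrosequencing sketch: signal == number of bases incorporated.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def simulate_pyrogram(template: str, dispensation: str = "ACGT"):
    """Return a list of (dispensed_nucleotide, signal) pairs.

    `template` lists the unpaired template bases in the order the
    polymerase reaches them. A signal of 0 means the dispensed
    nucleotide was not complementary to the next template base.
    """
    template = template.upper()
    if set(dispensation) != set("ACGT"):
        raise ValueError("dispensation order must contain A, C, G and T")
    if any(base not in COMPLEMENT for base in template):
        raise ValueError("template may contain only A, C, G and T")

    pos = 0
    signals = []
    while pos < len(template):
        for nucleotide in dispensation:
            run = 0
            # Incorporate as long as the next template base is complementary;
            # a homopolymer stretch is extended within a single dispensation.
            while pos < len(template) and COMPLEMENT[template[pos]] == nucleotide:
                run += 1
                pos += 1
            signals.append((nucleotide, run))
            if pos == len(template):
                break
    return signals


def decode_pyrogram(signals):
    """Rebuild the synthesized strand from (nucleotide, signal) pairs."""
    return "".join(nucleotide * run for nucleotide, run in signals)


if __name__ == "__main__":
    pyrogram = simulate_pyrogram("TCCCA")
    print(pyrogram)
    print(decode_pyrogram(pyrogram))
```

In this toy case the three consecutive C bases in the template give a single G dispensation with a signal of 3, which mirrors the "single G versus GGG" distinction made by light intensity.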

OVERVIEW OF PCR/CYCLE SEQUENCING USING AN AUTOMATED GENETIC ANALYZER

This section briefly outlines the principles and procedures for performing sequential 16S PCR/cycle sequencing in the clinical microbiology laboratory. The sequential steps performed in a PCR/cycle sequencing procedure using a fast protocol in the clinical laboratory have previously been reported and are summarized in Fig. 3 (17, 96, 97). However, the individual steps are similar throughout the procedure if standard PCR protocols are used. Comprehensive detailed procedures for each step performed in a cycle sequencing analysis have been published elsewhere (92, 95). In-house sequencing procedures should be performed in a facility that has strict separation between pre- and post-PCR environments either by physically separating these work areas or using self-contained hoods with dedicated equipment in each area (96, 97). The reader is referred to previously published reports for a detailed discussion of the front-end handling and storage of isolates for molecular analyses as well as nucleic acid sequencing (95, 98). Freshly collected clinical isolates should be obtained that are free from preservatives that can interfere with PCR/cycle sequencing reactions.

FIG 3

Summary of steps of a fast 16S PCR and cycle sequencing procedure.

DNA Extraction/Purification and the Use of Controls

Accurate high-quality sequencing data are highly reliant on efficient extraction and purification of nucleic acid, so the template is free of contaminants. The use of “DNA-free” reagents is highly recommended to minimize detection of contaminating bacterial DNA from commercial products (60, 95). Several methods are available for preparing microbial DNA for sequence analysis, and different protocols need to be verified for different pathogen types as well as clinical isolates (i.e., Gram-negative versus Gram-positive, etc.). Some bacteria, such as Mycobacterium spp., are more difficult to lyse, and special extraction protocols are required for sequencing these genera (99). Clinical laboratories must verify the isolate extraction method used with manufacturer cycle sequencing protocols as part of the overall method validation process. Although other manual DNA extraction methods, such as proteinase K lysis, bead beating, or direct boiling, may be used (100–103), clinical laboratories currently rely on commercial manual or automated extraction methods for in-house cycle sequencing depending on the number of isolates to be tested and the type of downstream nucleic acid to be sequenced (17). The DNAzol reagent (contains guanidine isothiocyanate) protocol (Thermo Fisher Scientific) is a cost-effective method for recovery of bacterial genomic DNA from a wide variety of liquid and solid isolates. A commercial manual spin-column filtration method efficiently extracts a small number of isolates (i.e., ≤24 isolates), whereby the isolate is processed through several manual steps (101, 102). For example, high-purity yields of DNA can be routinely obtained from bacteria and human blood or tissues using the PureLink genomic DNA kit (Invitrogen, Thermo Fisher Scientific) or the DNeasy blood and tissue kit (Qiagen), but there are many other commercial suppliers of similar kits on the market (produced by EdgeBio, Qiagen/MoBio, etc.). 
A simple DNA extraction for cycle sequencing of bacteria, fungi, and food types can also be achieved using the PrepMan Ultra sample preparation reagent (Applied Biosystems, Thermo Fisher Scientific). Commercial automated extraction instruments allow more efficient isolation of nucleic acid from a large number of isolates (i.e., >24 per run), and many of these platforms deploy magnetic bead particle technology to separate and purify nucleic acids (104). Specialized procedures are also required for efficient extraction of nucleic acid from formaldehyde-fixed, paraffin-embedded tissues, molds, or difficult clinical isolates, such as stools, that contain lots of particulate matter that may inhibit PCR (105–109).

The NanoDrop (Thermo Fisher Scientific) instrument provides an efficient, reliable means to check DNA purity using spectrophotometric absorbance (optical density) measurements at 260 and 280 nm; the A₂₆₀/A₂₈₀ ratio should be 1.8 to 2.0. Lower ratios indicate protein contamination, but contamination with other nucleic acid will not be detected (108). Other methods may be used but are more laborious, including total phosphorus content, dye intercalations, and limiting dilution (92, 95). A more exact measurement of DNA concentration is required for NGS applications, and this is currently achieved in clinical operations by using either an Agilent Bioanalyzer or a Qubit fluorometer (Life Technologies, Thermo Fisher Scientific), whereby intercalated dyes provide a precise quantitation of the amount of DNA present in the isolate down to the picogram level (109, 110). However, DNA purity still must be measured using the above A₂₆₀/A₂₈₀ ratio prior to proceeding with downstream NGS procedures.
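
The purity criterion above amounts to a simple ratio check, sketched here. The 1.8 to 2.0 acceptance window and the interpretation of low ratios come from the text; the function names, the returned labels, and the handling of ratios above 2.0 (which the text does not interpret) are assumptions made for illustration.

```python
# Minimal A260/A280 purity check using the 1.8-2.0 window quoted in the text.
def purity_ratio(a260: float, a280: float) -> float:
    """Return the A260/A280 absorbance ratio."""
    if a280 <= 0:
        raise ValueError("A280 must be positive")
    return a260 / a280


def classify_purity(a260: float, a280: float,
                    low: float = 1.8, high: float = 2.0) -> str:
    """Label a reading relative to the acceptance window.

    Labels are illustrative; ratios above `high` are only flagged,
    because the source text does not say how to interpret them.
    """
    ratio = purity_ratio(a260, a280)
    if ratio < low:
        return "low: possible protein contamination"
    if ratio > high:
        return "above expected range"
    return "acceptable"


if __name__ == "__main__":
    for a260, a280 in [(1.0, 0.53), (1.0, 0.70)]:
        print(round(purity_ratio(a260, a280), 2), classify_purity(a260, a280))
```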

Optimization of the quantity of DNA template added to a cycle sequencing reaction is done according to the PCR product size. Sequencing PCR products between ∼100 and 200 bp requires 1 to 3 ng of template, but larger PCR products between ∼1,000 and 2,000 bp will require 10 to 40 ng (92, 98). A Beer-Lambert Law calculator can be used to determine the DNA concentration in the PCR product by multiplying the UV absorbance of the isolate at 260 nm by either 33 or 50 μg/ml for ssDNA versus dsDNA templates, respectively (111). Too much DNA template added to a sequencing reaction rapidly depletes the reagents and the dye label in the reaction mixture, whereas too little DNA results in a poor electropherogram trace (i.e., reduced peak height and strength), which makes data analysis difficult or the trace uninterpretable.
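
The concentration calculation and template ranges just described can be written out as a short worked sketch. The conversion factors (33 and 50 μg/ml per absorbance unit) and the two template ranges are taken from the text; the function names, the optional dilution factor, the assumed 1-cm path length, and the decision to return no recommendation for product sizes the text does not cover are all assumptions made for illustration.

```python
# Worked sketch of the A260-based concentration estimate and template amounts.
# Factors from the text: A260 of 1.0 ~ 33 ug/ml ssDNA or 50 ug/ml dsDNA
# (a 1-cm path length is assumed here).
FACTOR_UG_PER_ML = {"ssDNA": 33.0, "dsDNA": 50.0}


def dna_concentration_ug_per_ml(a260: float, template: str = "dsDNA",
                                dilution_factor: float = 1.0) -> float:
    """Estimate DNA concentration from absorbance at 260 nm."""
    if a260 < 0 or dilution_factor <= 0:
        raise ValueError("absorbance must be >= 0 and dilution factor > 0")
    return a260 * FACTOR_UG_PER_ML[template] * dilution_factor


def template_range_ng(product_bp: int):
    """Return the (min, max) ng of template quoted in the text, or None.

    Only the two size ranges given in the text are covered; intermediate
    sizes return None rather than an interpolated guess.
    """
    if 100 <= product_bp <= 200:
        return (1, 3)
    if 1000 <= product_bp <= 2000:
        return (10, 40)
    return None


def volume_ul_for_mass(target_ng: float, conc_ug_per_ml: float) -> float:
    """Volume (ul) containing target_ng of DNA; 1 ug/ml equals 1 ng/ul."""
    if conc_ug_per_ml <= 0:
        raise ValueError("concentration must be positive")
    return target_ng / conc_ug_per_ml


if __name__ == "__main__":
    conc = dna_concentration_ug_per_ml(0.2, "dsDNA")  # hypothetical reading
    low_ng, high_ng = template_range_ng(1500)
    print(conc, volume_ul_for_mass(low_ng, conc), volume_ul_for_mass(high_ng, conc))
```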

Fast PCR/Cycle Sequencing Processes

Clinical laboratories routinely perform in-house PCR/cycle sequencing protocols for several applications. Although standard PCR can be used to perform these procedures, use of a fast protocol allows shortened PCR cycling times, which reduces the time to reporting and increases overall testing throughput. Potential contamination can be minimized by using DNA-free reagents and enzymes from commercial products, and purified primers avoid misannealing problems (60, 112). Fast PCR decreases the overall procedural cycle time by using “fast” primers (i.e., higher thermodynamic melting temperature [T_m] from 64°C to 77°C) and DNA polymerases that are thermodynamically stable at higher temperature with a higher extension rate, typically 2 to 4 kb/min. Fast-ramping thermal cyclers are used that perform the reaction at higher temperature and speed with greater thermal uniformity, because the temperature differential between PCR cycle steps is reduced (96). Some commercial thermal cyclers allow both fast-ramping and standard protocols (i.e., Veriti thermal cycler [Applied Biosystems] and C1000 Touch thermal cycler with dual 48/48 fast reaction module [Bio-Rad]). Fast PCR protocol changes also include combining the annealing and extension steps and eliminating the final extension step for short (<250-bp) amplicons (96, 97). A fast PCR/cycle sequencing protocol decreases the PCR time to only 30 to 40 min. The total procedure takes ∼4 to 5 h, less than half the time required to complete a conventional PCR/cycle sequencing run, so it can be completed within a day provided data interpretation is completed using a commercial system that interprets the electropherogram data as it is generated (see “Sequence Data Analysis and Interpretation,” below).
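
As a rough worked example of why a higher extension rate shortens each cycle, the time a polymerase needs to copy an amplicon can be estimated from the 2 to 4 kb/min range quoted above. This is back-of-the-envelope arithmetic only: it ignores ramping, annealing, and denaturation times, and the function names are hypothetical.

```python
# Rough per-cycle extension time from the 2-4 kb/min rates quoted in the text.
def extension_time_seconds(amplicon_bp: float, rate_kb_per_min: float) -> float:
    """Seconds needed to extend across an amplicon at a given rate."""
    if amplicon_bp <= 0 or rate_kb_per_min <= 0:
        raise ValueError("amplicon length and rate must be positive")
    return amplicon_bp / (rate_kb_per_min * 1000.0) * 60.0


def extension_time_range(amplicon_bp: float,
                         slow_kb_per_min: float = 2.0,
                         fast_kb_per_min: float = 4.0):
    """Return (shortest, longest) extension time in seconds."""
    return (extension_time_seconds(amplicon_bp, fast_kb_per_min),
            extension_time_seconds(amplicon_bp, slow_kb_per_min))


if __name__ == "__main__":
    # A ~500-bp V1-V3 amplicon needs roughly 7.5 to 15 s of extension per cycle.
    print(extension_time_range(500))
```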

Enzymatic purification of PCR amplicons prior to cycle sequencing using exonuclease 1/shrimp alkaline phosphatase (ExoSAP-IT) treatment is preferred, as it is a simple, reliable method that effectively cleans up a large number of isolates with minimal manual manipulation (113). ExoSAP-IT (P/N 78200; USB Corporation, Thermo Fisher Scientific) cleanup reagent is active in commonly used buffers, so it may be added directly to the PCR product: Exo1 degrades single-stranded DNA, such as unused primers, and recombinant SAP dephosphorylates unused dNTPs. Enzymatic cleanup includes an initial treatment (15 min at 37°C) followed by heat incubation (80°C for 15 min) that inactivates the enzymes. Other more laborious methods that may be used for this step include serial dilution, ethanol precipitation, column ultrafiltration, and gel purification (95, 114).

Clinical laboratories should perform cycle sequencing using commercial protocols and reagents, so that reliable results are obtained with minimal assay optimization (i.e., commercial primers, DNA enzymes, and fast PCR/cycle sequencing mixtures are already optimized to work together for a range of DNA templates). For example, the BigDye direct sequencing kit (Applied Biosystems, ThermoFisher) includes a set of universal primers (M13 forward and reverse) for 16S sequencing and eliminates the need to perform another post-PCR purification step prior to cycle sequencing. These primers can also be used for the subsequent cycle sequencing step. By performing PCR to cycle sequencing steps in a single tube, the BigDye kit not only decreases manual manipulation but also makes obtaining sequence data much faster (91). However, difficult DNA template sequences (i.e., homopolymer G-C-rich regions) may require the use of specific commercial master mixes and reagents that have been optimized to work with a DNA polymerase with high processivity and fidelity to ensure sequencing efficiency and accuracy (115).

Troubleshooting Sequencing Problems

DNA extraction and amplification controls are necessary to ensure that subsequent sequencing reactions are performed according to regulatory and accreditation requirements (95). Positive and negative extraction, amplification, and sequencing controls are important for monitoring assay integrity as well as troubleshooting. The same solution used as a starting matrix for isolates in a sequencing run (e.g., distilled water free of either reagents or template DNA) can be used as the DNA extraction negative control, which is run through the entire procedure to detect possible environmental or reagent contamination. This negative control should produce no more than baseline traces in the electropherogram without sequence data. If amplicon is produced, sources of contamination are found most often in the DNA extraction reagents and/or the isolate handling protocol. The DNA extraction positive control should be a unique organism that is not a human pathogen but is representative of the expected bacterial isolate(s) and produces a unique, easily distinguished sequence pattern. To further reduce the risk of amplicon contamination, the positive-control strains should be rotated and not consist of pathogens expected to be contained in the samples being tested. Uracil DNA N-glycosylase should also be included in the PCR mix, with deoxynucleotide triphosphate mixes supplemented with 2′-deoxyuridine 5′-triphosphate (dUTP), to prevent carryover contamination by dUTP-containing amplicons (95, 98). Long periods of cold storage of purified PCR amplicons should also be avoided prior to template cycle sequencing as another measure to prevent degradation and contamination.

Amplification reactions should also include a negative and positive reaction tube. The negative control monitors the integrity of the amplification reagents, and the positive control should be like the one being used to control the extraction. Most commercial PCR kits already contain a positive internal control (i.e., genomic DNA extracted from a microorganism whose sequence is known), and this template can also serve as a control for the cycle sequencing reactions. HPLC-grade water should be used as the negative control for the sequencing procedure, and it should not produce sequencing data. Alternatively, a known sterile isolate matrix may be used in broad-range PCR/cycle sequencing assays to mimic the specific clinical isolate material being analyzed (98). Possible contamination is indicated by obtaining sequencing data from the negative control, and the microorganism’s identity may indicate the source of contamination (i.e., introduced during the procedure or within reagents).

Troubleshooting of poor-quality sequencing data involves investigating various causes, as outlined in previously published guidelines (92, 95). The automated sequencer trace most often shows no recognizable signal or signal loss after the start of base calling, unexpected gaps or termination, mixed signal with multiple overlapping peaks, or misshaped peaks or background noise resulting in missed or incorrect base calls (92, 93). Common reasons for low-quality data include (i) poor-quality DNA template (i.e., inefficient DNA isolation or low concentration), (ii) inadequate cleanup of the template, (iii) poor primer design (i.e., disparate T_m of F/R primers or primer T_m too low for fast PCR) or impurity (e.g., primer fragments) resulting in poor annealing, (iv) multiple annealing sites or failed PCR amplification, (v) inadequate cleanup of the PCR products, (vi) PCR and/or cycle sequencing reactions not optimized for DNA template/primers, (vii) wrong software mobility file used to interpret the dyes used, and (viii) overall low signal strength so that the software cannot interpret the raw data (92, 94, 95). Gaps or sequence termination may also occur when a low-fidelity DNA polymerase is used that cannot read through difficult template regions (i.e., homopolymer that is highly GC-rich) (116). Use of an automated sequencing quality control analysis program can assist in troubleshooting DNA sequencing problems that limit sequence read length (e.g., QualTrace II [Nucleics]). Consultation with the manufacturer of the reagent kits and automated genetic analyzer often assists with identification of the problem(s) if a solution is not immediately evident.

Sequence Data Analysis and Interpretation

Quality checks of sequence data are important, because base-calling algorithms that determine the nucleic acid from the signal peak may produce wrong or incomplete base calls, particularly if the amount of input DNA is low and/or the template harbors insertions or deletions or conformational complexity (115, 117). Clinical laboratories that use an external sequencing service (i.e., university core facility) must ensure that the referral laboratory meets the appropriate regulatory and accreditation requirements for diagnostic testing, which includes employment of highly trained, knowledgeable personnel capable of communicating about encountered technical and organizational issues (94). The referral laboratory should routinely provide the individual electropherogram results as a DNA sequence chromatogram file (e.g., *.scf or *.ab1) for each sequenced isolate, and these files should include associated quality score metrics, such as phred scores. This is essential for ensuring the quality and accuracy of the clinical isolate’s identification and antibiotic susceptibility profile.

Sequencing should be initiated far enough upstream to ensure that the region of interest lies within the clear range to be analyzed. This range should cover as many variable regions as possible to optimize species differentiation (while being aware that the positions of variable regions differ between different bacterial families). Most capillary sequencing instruments produce good-quality sequences of ∼600 to 800 bp in length, but some well-tuned instruments may even exceed ∼1,000 bp in length for cultured isolates (92, 93). A sequence commonly has a few bases of poor quality at the beginning and end of the trace with a high-quality region in the middle of variable length; a consistent procedure should be used to review and trim poor-quality sequence data from the 5′ and 3′ ends before proceeding with further analyses (95). Any manual edit that changes a base call to a nucleotide other than the one initially reported by the base caller must be documented.
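A consistent trimming procedure of the kind described above can be sketched in a few lines of Python. The sketch converts a phred score Q to its base-call error probability (10^(-Q/10)) and trims a read to the span between the first and last windows in which every base meets a quality threshold. The threshold (Q20), the window size (10 bases), and the function names are assumptions chosen for illustration and are not values taken from the cited guidelines; each laboratory would validate its own criteria.

```python
def phred_to_error_prob(q):
    """Convert a phred quality score Q to an error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


def trim_clear_range(quals, min_q=20, window=10):
    """Return (start, end) of the clear range as a half-open interval.

    The range starts at the first window of `window` consecutive bases that
    all have quality >= min_q and ends at the last such window.
    Returns (0, 0) when no window qualifies. min_q and window are
    illustrative defaults, not recommended values.
    """
    n = len(quals)
    start = None
    for i in range(n - window + 1):
        if min(quals[i:i + window]) >= min_q:
            start = i
            break
    if start is None:
        return (0, 0)
    end = start + window
    for j in range(n, window - 1, -1):
        if min(quals[j - window:j]) >= min_q:
            end = j
            break
    return (start, end)


# Hypothetical read: 8 poor bases, 30 good bases, 7 poor bases.
quals = [5] * 8 + [40] * 30 + [6] * 7
start, end = trim_clear_range(quals)
print(f"clear range: bases {start + 1} to {end} of {len(quals)}")
```

Applying the same function to every read, and recording the parameters used, keeps the trimming step reproducible between runs and operators.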

Quality checks of sequence data should start with aligning and assembling all sequence fragments from an isolate to generate a contig and a consensus sequence; this can be achieved either by using a generic assembler software or by using application-specific software, which will perform the alignment against an automatically or manually selected reference sequence; reference-driven alignments are generally easier to interpret and may be more precise (Fig. 4) (5). Initial review of raw sequence data should primarily focus on alignment accuracy; all contig fragments should align within the boundaries of the expected target gene sequence. After trimming the 5′ and 3′ ends of the fragments to contain only valid calls, the resulting consensus sequence should span the expected read length. Bidirectional sequencing and alignment with a known reference sequence is essential for interpreting mixed or unclear base calls, and adequate coverage by other sequencing fragments can be very helpful in that regard (5). Once trimming has been performed, one should systematically verify the contig for read accuracy; it is recommended that users align the chromatogram fragments against a reference sequence that is close to the species expected in the isolate to obtain meaningful events to check while saving time (5).

FIG 4

Multisequence alignment compared to a reference sequence ensures proper data interpretation.

The following events should be verified and edited where necessary. (i) Mismatches with the reference sequence may be misread bases (to be edited) or real mismatches, which may reflect a different species or intraspecies diversity. The reverse-complementary strand may help to differentiate between artifacts and real mismatches. (ii) Insertions and deletions can occur as artifacts when base-calling software shifts a chromatogram peak or double reads multiple chromatogram peaks for the same nucleotide (e.g., homopolymer stretches that cause polymerase stutter). Given the highly conserved nature of the 16S rRNA gene, insertions and deletions are less frequently encountered in the context of intraspecies or intraoperon diversity. (iii) An ambiguous base call is assigned by the instrument software whenever it cannot accurately determine a base at a particular position; either the International Union of Biochemistry (IUB) code for a mixture or an “N” (nucleotide) is inserted (118). Base pair differences will be detected by the sequencer’s base caller software if it has been configured to detect these ambiguities as IUB codes (recommended setting). Ambiguous base calls occur for many reasons, including (a) high noise levels (usually most prominent at the ends of a chromatogram or in the case of technical problems), (b) the presence of multiple isolates in the sample, (c) shifts downstream of insertions/deletions, or (d) 16S intraoperon diversity in a species (119). To distinguish technical from biological ambiguities, one should verify if chromatogram peaks for the nucleotide concerned are all overlaid with the same noise, which indicates problems with signal detection in the sequencing reaction. (Note that chromatogram peak height generally does not reflect quantitative relations of nucleotides well [and, thus, of subpopulations] due to signal normalization by the base-calling software.)
In cases where a chromatogram peak cannot be assigned to a single nucleotide, one should apply the appropriate IUB code instead of assigning a less meaningful N. (Note that some base-calling software systems propose parametrizations that allow automated assignment of IUB codes.)
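To make the preference for IUB codes over N concrete, the following Python sketch maps the set of nucleotides observed at a position to the corresponding standard IUB/IUPAC ambiguity code. The function name and the idea of passing the observed bases as a small collection are illustrative assumptions; base-calling software applies its own signal-based criteria to decide which bases are present.

```python
# Standard IUB/IUPAC nucleotide ambiguity codes, keyed by the set of bases.
IUPAC_BY_BASES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_code(bases):
    """Return the ambiguity code for the nucleotides observed at one position."""
    key = frozenset(b.upper() for b in bases)
    if key not in IUPAC_BY_BASES:
        raise ValueError(f"not a set of A/C/G/T nucleotides: {sorted(key)}")
    return IUPAC_BY_BASES[key]


# A position with overlapping A and G peaks is recorded as R rather than N,
# which preserves the information that C and T were not observed.
print(iupac_code("AG"))
```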

Given the potential impact of sequence edits on the resulting species identification, it is important to make these edits traceable, ideally automatically via the editing software used. Edited consensus 16S sequences are usually searched against one or several reference databases, using rapid search algorithms such as the basic local alignment search tool, or BLAST (https://blast.ncbi.nlm.nih.gov/), which screens large data sets to obtain a ranked list of the closest matching sequences as well as pairwise alignments of the isolate sequence with selected reference sequences (120, 121). One needs to be aware that matches are ordered by match score in BLAST; the best-matching sequence for a species identification does not always show on top. Please be aware that optimal similarity in BLAST searches against a reference database relies on keeping the number of ambiguous bases, and especially of undetermined positions (N), to a minimum. Microorganisms may also be misidentified using the BLAST algorithm for sequence interpretation, so the match list should be reviewed carefully for the important parameters outlined in Table 1 (120, 121); a matching reference sequence should be retained as possible identification, provided that the following criteria are fulfilled.

TABLE 1

Important parameters reviewed after a BLAST search for 16S sequences^a

Parameter	Definition
Match accuracy	Best matching reference sequences should show the lowest no. of mismatches (not necessarily reflected by default sorting by BLAST score)
Match length	Matching reference sequences should cover an isolate sequence entirely; shorter matching sequences by BLAST should be verified by alignment with the isolate sequence for missing mismatches on the edges
Match consistency	An isolate sequence that matches a no. of reference sequences within the same species and genus annotation increases the confidence for such species and genus identification
Match differentiation	To be able to estimate the degree of an isolate’s differentiation to the next closest species, sequences derived from closely related but different species should appear on the list of matching reference sequences, to be evaluated in this context

^aSee reference 5.

Match accuracy.

Match accuracy is the degree of similarity between the isolate sequence and the matching reference sequence. The higher the similarity, the better the match. However, one needs to be aware of the following issues, as outlined in CLSI MM18-A2 (5): ambiguity codes (IUB, IUPAC) are interpreted by the BLAST search algorithm as full, instead of partial, mismatches, so a sequence containing a number of ambiguous bases will rank lower on the list despite the presence of partially matching bases. In addition, BLAST scores rank matching reference sequences according to the sequence length above the mismatch number. Thus, a retrieved reference sequence may appear in a higher-ranked position according to the BLAST score based on its overall match length, even though it has more mismatches than another lower-ranked sequence based on its shorter match length. In the past, a similarity threshold of >98.5% has been recommended as a rule of thumb for assigning a 16S sample sequence to a certain species (16, 122). While this simple cutoff makes interpretation easier in diagnostic laboratories, it can easily be misleading for several reasons that have been previously reported (16, 122).
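The effect of ambiguity codes on mismatch counts can be illustrated with a short Python sketch. It counts mismatches between two already aligned, equal-length sequences in two ways: strictly, where any ambiguity code differs from a plain base (the behavior the text attributes to BLAST), and in an ambiguity-aware mode, where a position matches if the two codes share at least one nucleotide. The function name and the toy sequences are hypothetical, and the sketch does not perform alignment.

```python
# Nucleotides represented by each standard IUB/IUPAC code.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(query, reference, ambiguity_aware=True):
    """Count mismatching positions between two aligned, equal-length sequences."""
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = 0
    for q, r in zip(query.upper(), reference.upper()):
        if ambiguity_aware:
            # Partial match: the codes share at least one possible nucleotide.
            if not set(IUPAC_SETS[q]) & set(IUPAC_SETS[r]):
                mismatches += 1
        elif q != r:
            mismatches += 1
    return mismatches


# Toy example: R (A or G) in the isolate sequence against A in the reference.
query, reference = "ACGRTT", "ACGATT"
print(count_mismatches(query, reference, ambiguity_aware=False))  # strict count
print(count_mismatches(query, reference, ambiguity_aware=True))   # partial match
```

Comparing the two counts for the top references shows how much of an apparent drop in rank is due to ambiguous positions alone.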

Match and database coverage.

The amount of 16S sequence coverage, as defined by the adequate representation of genus-relevant variable and conserved regions, affects the similarity of a sample to a reference sequence. If conserved stretches are predominant in the sample sequence, the match similarity to the best-matching sequence will exceed the cutoff without yielding an unambiguous identification. Therefore, a similarity score would require a minimum coverage of variable regions for the genus or genera involved. Genus diversity also plays a role, because some genera are highly diverse whereas others are not (e.g., Mycobacterium). Slow-growing mycobacteria, such as M. genavense, would not separate from other atypical mycobacteria using 16S sequencing with a cutoff of 98.5% (123). In the case of a standard sequencing of the first 500 bp, M. genavense and M. triplex are genetically different by only 4 mismatches or <1% of the V1-V3 sequence, but they can be clearly differentiated on this basis as outlined below. Species diversity also plays a role because highly diverse species, such as Fusobacterium nucleatum, which includes at least 5 subspecies (https://lpsn.dsmz.de/), exhibit an intraspecies diversity of 10 to 12 mismatches (>2%) within the first 500 bp between variants. If the reference database used does not explicitly cover the relevant subspecies and variants as outlined below, the sample sequence will not be matched with a high enough score for species identification. Achievable match similarity also depends on the adequate coverage of species, subspecies, and variants in the reference database used. If the respective variant is not present in the database, a sample sequence may match only references below the cutoff, yielding an inconclusive result. Missing reference sequences can also lead to an unambiguous match above the cutoff with a reference sequence present, whereas the correct result should have been ambiguous and not definitive.
(Note that BLAST match lists should always be reviewed for other possible matches, and in a multialignment with the sample sequence, the relevance of mismatches will become transparent with regard to the variable regions relevant to the genus involved [see “Match differentiation,” below]).
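The arithmetic behind these two examples can be worked through in a short Python sketch, which shows why a fixed similarity cutoff can mislead in both directions. The mismatch counts and the 500-bp length are those given in the text; the function names are illustrative assumptions.

```python
SPECIES_CUTOFF = 98.5  # percent; the historical rule of thumb quoted above


def percent_similarity(mismatches, aligned_length):
    """Percent identity over an alignment of the given length."""
    return 100.0 * (aligned_length - mismatches) / aligned_length


def exceeds_cutoff(mismatches, aligned_length, cutoff=SPECIES_CUTOFF):
    """True if the similarity is above the species-level cutoff."""
    return percent_similarity(mismatches, aligned_length) > cutoff


examples = [
    ("M. genavense vs M. triplex (different species)", 4),
    ("F. nucleatum variants (same species)", 10),
    ("F. nucleatum variants (same species)", 12),
]
for label, mismatches in examples:
    similarity = percent_similarity(mismatches, 500)
    verdict = "above" if exceeds_cutoff(mismatches, 500) else "not above"
    print(f"{label}: {mismatches} mismatches, {similarity:.1f}%, {verdict} cutoff")
```

Two different species fall above the cutoff while variants of a single species fall below it, which is why the position of the mismatches, and not only their number, has to be reviewed.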

Match length.

Matching reference sequences should span the longest alignment possible or ideally align with the full isolate sequence. At an equal number of mismatches, longer matches should be given preference for identification, thereby ensuring better reliability. (Note that BLAST tends to truncate sequence matches when the alignment becomes uncertain at the 5′ and 3′ ends due to mismatches or insertions/deletions.) This can lead to matching references ranking high on the list despite the presence of mismatches on the edges, which then are not accounted for or displayed in the pairwise alignments. Therefore, one should always verify the match length of an isolate sequence with a reference before calling an identification. If there is doubt about whether mismatches are present at the edges of a reference, one should perform a pairwise or multiple alignment that includes the isolate sequence.
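The match length check lends itself to a simple automated screen. The Python sketch below assumes that each hit has already been parsed into the start and end coordinates of the alignment on the isolate (query) sequence and a mismatch count; the `Hit` class, its field names, and the example values are hypothetical. It reports how many isolate bases lie outside each alignment and re-sorts hits by mismatches first and aligned length second, as recommended above.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    ref_id: str
    q_start: int      # first aligned base of the isolate sequence (1-based)
    q_end: int        # last aligned base of the isolate sequence (inclusive)
    mismatches: int

    @property
    def aligned_length(self):
        return self.q_end - self.q_start + 1


def uncovered_edges(hit, query_length):
    """Number of isolate bases left unaligned at the 5' and 3' ends."""
    return hit.q_start - 1, query_length - hit.q_end


def needs_edge_check(hit, query_length):
    """True if the alignment was truncated and the edges must be realigned."""
    left, right = uncovered_edges(hit, query_length)
    return left > 0 or right > 0


def rank_hits(hits):
    """Fewest mismatches first; at equal mismatches, longer matches first."""
    return sorted(hits, key=lambda h: (h.mismatches, -h.aligned_length))


# Hypothetical hits for a 500-bp isolate sequence.
hits = [Hit("ref_A", 1, 500, 2), Hit("ref_B", 21, 480, 2), Hit("ref_C", 1, 500, 0)]
for hit in rank_hits(hits):
    flag = "verify edges" if needs_edge_check(hit, 500) else "full length"
    print(hit.ref_id, hit.mismatches, hit.aligned_length, flag)
```

Hits flagged for edge verification are the ones to carry forward into a pairwise or multiple alignment with the full isolate sequence.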

Match consistency.

The list of matching reference sequences (e.g., species and genus names) should be reviewed for naming consistency within the species and within the genus. (Note that ongoing taxonomic name changes have created confusion for clinical laboratorians and clinicians for clinically relevant microorganisms.) Thus, the best-matching reference sequences on the ranking list ideally should be consistently annotated with the same species name (provided that the reference database contains multiple entries of this species) or with the same genus (provided that the reference database contains only one or a few entries per species); mismatches can reflect the natural intraspecies variability. If the best-matching references contain other species at an equal number of mismatches, a species call for identification may not be possible on this basis; in such cases, a genus identification call could be made. If the best-matching reference sequences come from different genera at equal mismatches and scores, only an identification on the family level should be envisioned (e.g., see E. coli and Shigella spp. to be interpreted as “Enterobacterales”). Match consistency can be indicative of problems with inconsistent coverage of a species or genus in a database, or of a poor choice of the sequenced region, to be variable enough for differentiation between certain species and genera.
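The decision logic of this paragraph can be summarized as a small Python sketch. It takes the annotated genus, species, and mismatch count of each reference on the match list, keeps the references that tie for the lowest number of mismatches, and returns the most specific level at which their annotations agree. The function name, the tuple layout, and the mismatch counts in the example are assumptions for illustration; the sketch ranks by mismatches only and ignores BLAST scores and annotation quality, which a reviewer would also weigh.

```python
def reporting_level(hits):
    """Return 'species', 'genus', or 'family' for a list of
    (genus, species, mismatches) tuples taken from the match list."""
    if not hits:
        raise ValueError("match list is empty")
    best = min(mismatches for _, _, mismatches in hits)
    top = {(genus, species) for genus, species, m in hits if m == best}
    genera = {genus for genus, _ in top}
    if len(genera) > 1:
        return "family"   # best matches span several genera
    if len(top) > 1:
        return "genus"    # one genus, but several species tie
    return "species"


# Example from the text: E. coli and Shigella spp. tie as best matches
# (the mismatch counts here are made up for illustration).
match_list = [
    ("Escherichia", "coli", 0),
    ("Shigella", "flexneri", 0),
    ("Salmonella", "enterica", 12),
]
print(reporting_level(match_list))
```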

Match differentiation.

Match differentiation is the ability to call a species identification while making sure that no other species would match as well. The number of matches and mismatches between the isolate sequence and the best-matching references of the closest species are considered. The match list should show representative sequences of more than one species to enable differentiation from the next closest one. Match differentiation refers to intraspecies variability (thus, tolerable mismatches) and interspecies mismatches (thereby allowing species differentiation); therefore, it is important that the match list also includes the next closest species. A multiple alignment of the isolate sequence with the best-matching reference sequences of the closest species (two or three) is often helpful in assessing the differentiation concerning the position of mismatches; mismatches of an isolate sequence occurring in variable regions, where species of this genus usually differ, are indicative of a nonmatch to a species and should be documented. In these cases, one may report “close to” the closest matching species.

Reference databases.

Clinical laboratories should use a reference database for microbial identification that contains representative sequences of good quality for all species and genera that a user expects to detect. Thus, an isolate sequence is likely to match a relevant reference sequence, either of the species searched or of another closely related species. To achieve best possible match accuracy (Table 1), a reference database should include naturally occurring variants of all species and subspecies so that more than one good-quality reference sequence is available for each species (124). Sequences submitted for species that are rare or not previously described can be useful if such a case is detected in the laboratory; such a match may give hints to observations made by other investigators. However, such sequences may also confound the match results with regard to established species; thus, one should be able to blind them out. In any case, sequences from the public domain should be represented with key characteristics (i.e., the original annotation, referring to author, submission, source, etc., and original repository where the sequence comes from). 
After having performed a BLAST search, a multialignment (e.g., by CLUSTAL or by an equivalent method) of the best-matching species and their variants can help to accurately detect and assess the following problems: (i) mismatches on the edges of a sequence, which were not considered by the BLAST algorithm due to alignment break-off; (ii) match and mismatch consistency between the isolate and the best-matching sequences but still diverse reference sequences (note that the alignment in these cases can show if mismatches are located in hypervariable regions for this genus [indicating a different species] or in regions where mismatches are balanced and, thus, not significant [indicative of a species variant]); (iii) reference sequence matches that are not complete enough to show whether essential information is missing for species-level identification (multialignments are invaluable in showing where there are hidden mismatches at the edges of an alignment and in defining areas of insertion or deletion that may affect alignment accuracy, and they may also indicate that the mismatches occur within the regions where interspecies variability is observed [5]); and (iv) reduced variability between closely related species (i.e., near-complete sequence similarity across the entire 16S rRNA target gene) (5).

A multialignment is also necessary to subsequently construct a dendrogram, which is a useful graphical display for understanding the phylogenetic relationships between the query and reference sequences (125). Methods commonly used for generating dendrograms are the neighbor-joining (NJ) method, the unweighted pair group method with arithmetic averages (UPGMA), and the weighted pair group method with arithmetic averages (126). If bacterial isolates are closely related, these phylogenetic methods have equivalent performance, but if isolate sequences are not closely related, then the choice of phylogenetic method may affect dendrogram relationships, as previously illustrated (16, 126). To build significant dendrograms, one should use sequences of maximum length and maximum overlap; in the case of similar sequences within genera, mismatches within the 16S gene sequence within the first ∼500 bp or the last ∼1,000 bp, depending on the length of sequence analyzed and the alignment tool, can also affect the comparison of sequences (i.e., percentage dissimilarity) and, thus, the dendrogram (16, 127). In addition, naturally occurring insertions or deletions are likely not reflected by dendrogram matrices, which only account for positions covered by all sequences. Taxonomists must consider these potential pitfalls in their analyses and assignment of exact relationships between the higher bacterial taxa (128). Generation of a dendrogram may better show relatedness between isolates than either percent dissimilarity or concise sequence alignment comparison (6). Although strains may seem similar to each other based on their percent dissimilarity (i.e., ≤1%), based on the positions of the mismatches within 16S, a dendrogram may show this not to be the case (127). 
Rooting a phylogenetic tree using a somewhat distantly related sequence of a different genus can help to build more stable clusters of very similar sequences; bootstrapping will indicate the robustness of a branch but is generally low for highly similar sequences from genes, such as the 16S (129). Dendrogram analyses may be helpful when analyzing an unknown sequence, particularly an isolate’s relationship to other closely and distantly related major genera. A phylogenetic analysis of the unknown isolate can indicate where a species groups, even when there is not a closely related sequence to compare within available databases (16, 126, 130).
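How a dendrogram is derived from a multialignment can be shown with a minimal Python sketch of UPGMA, one of the methods named above. It computes pairwise proportions of differing positions using only alignment columns that are covered by all sequences (so gapped columns are ignored, as noted above for dendrogram matrices) and then repeatedly merges the two closest clusters. The function names and the toy alignment are hypothetical, branch lengths and bootstrapping are omitted, and the sketch is not a substitute for established phylogenetics software.

```python
def shared_columns(alignment):
    """Indices of alignment columns without a gap in any sequence."""
    seqs = list(alignment.values())
    return [i for i in range(len(seqs[0])) if all(s[i] != "-" for s in seqs)]


def p_distances(alignment):
    """Pairwise proportion of differing positions over the shared columns."""
    cols = shared_columns(alignment)
    if not cols:
        raise ValueError("no alignment column is covered by all sequences")
    names = list(alignment)
    dist = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            diffs = sum(alignment[a][c] != alignment[b][c] for c in cols)
            dist[frozenset((a, b))] = diffs / len(cols)
    return dist


def upgma(names, dist):
    """Cluster by UPGMA and return the tree topology as nested tuples."""
    dist = dict(dist)
    clusters = [(name, 1) for name in names]  # (subtree, number of leaves)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist[frozenset((clusters[i][0], clusters[j][0]))]
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (tree_i, n_i), (tree_j, n_j) = clusters[i], clusters[j]
        merged = (tree_i, tree_j)
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        for tree_k, _ in rest:
            # Size-weighted average of the distances to the merged clusters.
            dist[frozenset((merged, tree_k))] = (
                n_i * dist[frozenset((tree_i, tree_k))]
                + n_j * dist[frozenset((tree_j, tree_k))]
            ) / (n_i + n_j)
        clusters = rest + [(merged, n_i + n_j)]
    return clusters[0][0]


# Toy alignment (hypothetical sequences; "-" marks a gap).
toy = {"A": "ACGTACGTAC", "B": "ACGTACGTAA", "C": "TTGTACCTAC", "D": "TTGTACCT-C"}
tree = upgma(list(toy), p_distances(toy))
print(tree)
```

Because the gapped column is dropped for every pair, sequences C and D appear identical in the distance matrix even though one of them carries a deletion, which illustrates the pitfall described above.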

The final sequence is analyzed by comparing it to similar sequences available in a public and/or commercial database. Accurate identification of clinical isolates using 16S is highly dependent upon access to accurate databases that contain a sufficient number of high-quality sequences for a particular genus/species that have the correct taxonomic nomenclature assigned (16). DNA sequence databases commonly used for diagnostic bacterial identification are outlined in Table 2. Reference databases are powerful tools for sequence analysis, but their strengths and limitations should be specifically outlined by the clinical laboratory in their standard operating procedure. Some resource databases are freely available on the Internet, but many are unverified and depend on ongoing funding from public or private sources to maintain their content. Clinical laboratories must ensure that the database(s) used is clinically relevant and meets the rigor required for diagnostic coverage, quality, and maintenance, as outlined above and in CLSI MM18-A2 (5). The most current database version should be used, and the derived sequence interpretation also should be cross-checked using one or more of these sources. Due to the rapid changes occurring in taxonomy and nomenclature of many clinically relevant bacterial pathogens (131), clinical laboratories should only use databases that are kept current by regular updates.

TABLE 2

Overview of some publicly and commercially available reference databases for bacterial identification using 16S rRNA gene sequence interpretation^a

Database	DNA target(s)	No. of sequences	Curation	Alignment of clustered sequences	Link	Comment
NCBI nt (Genbank NCBI)	All	≈21,000,000	Limited	No	https://blast.ncbi.nlm.nih.gov/Blast.cgi (select appropriate dataset in the menu in order to restrict and accelerate the search); for downloads, ftp://ftp.ncbi.nlm.nih.gov/blast/db/	Hosts all published sequences; excellent coverage; frequent updates; many redundant entries; frequent erroneous entries; use for unusual or new species
Greengenes (consortium comprised of Second Genome Inc., University of Colorado, and University of Queensland)	16S rRNA genes	≈1,200,000	Yes; manual sequences >12,000 bp; taxonomy curation	Yes: sequence clusters at various similarity percentages	Searches, http://greengenes.lbl.gov/Download/Tutorial/Tutorial_19Dec05.pdf; downloads, https://greengenes.secondgenome.com/	Includes several tools from chromatogram analysis to alignments; latest version from 2013; unclear updates, some taxonomy information may be outdated
RDP (Michigan State University)	16S rRNA	≈3,200,000	Yes; manual sequences >12,000 bp; taxonomy curation	Yes; Aligned	Searches, https://rdp.cme.msu.edu/seqmatch/seqmatch_intro.jsp and https://rdp.cme.msu.edu/index.jsp	Manually and not regularly updated; last update was May 2015; various tools available to analyze user data further
SILVA (Max Planck Institute for Marine Microbiology)		≈5,000,000, includes small ribosomal subunit for eukaryotes	Yes; manual sequence quality; taxonomy curation	Yes; multiple cluster sets available	Search, http://www.arb-silva.de/aligner/; downloads, http://www.arb-silva.de/download/arb-files/	Continually updated; tools available to analyze user data; genes other than 16S rRNA
Molzym SepsiTest		≈7,043	Manual	No		CE-IVD database works with kit but also with sequences generated otherwise
SmartGene IDNS Bacteria Module 3.9.x	16S rRNA and rpoB	≈800,000 16S, 358,000 centroid annotated	Yes; quality filters for sequence quality, centroid annotation for annotation qualification	Centroid annotation for most representative sequence per species		CE-IVD; proprietary centroid annotation; quality filtered, continually updated; tools available to analyze user data; genes other than 16S rRNA
MicroSEQ 3.1	16S rRNA	2,300	Sequences of collection and type strains	No		Compatible with the MicroSEQ sequencing kit of ThermoFisher; mainly for 500-bp sequencing

^aEntries with gray shading represent publicly available databases, and those without gray shading represent commercially available databases. See reference 5.

Manually copying and pasting an isolate’s sequence into a website’s search window to perform an interpretation search may also produce errors. Users should verify that the entire sequence being interrogated is accurately copied from the 5′ to the 3′ end (or that the software takes care of resolving this), and that no older sequence is accidentally pasted from the computer’s cache. If BLAST is being used as the search algorithm (120), users should record and understand its settings or use a standardized parametrization that has proven adequate for targets such as 16S. Analysis software that contains preparameterized BLAST search tools, easy-to-use multiple alignment tools, and other functionalities, along with valid reference sequences, can avoid these pitfalls and streamline interpretation. All isolate sequence results should record the interpretation database(s) used along with its version so that results remain traceable during troubleshooting.
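One way to guard against copy-and-paste errors and to keep searches traceable is sketched below in Python. A checksum of the whitespace-normalized sequence allows the submitted text to be compared with the edited consensus sequence, and a small record stores the database, its version, and the search settings alongside the result. The record fields, the example identifiers, and the parameter names are hypothetical and do not correspond to any particular laboratory information system or BLAST interface.

```python
import hashlib
from dataclasses import dataclass


def normalize(sequence):
    """Remove whitespace and line breaks and convert to upper case."""
    return "".join(sequence.split()).upper()


def sequence_checksum(sequence):
    """SHA-256 checksum of the normalized sequence."""
    return hashlib.sha256(normalize(sequence).encode("ascii")).hexdigest()


def same_sequence(consensus, submitted):
    """True if the submitted text is exactly the consensus sequence."""
    return sequence_checksum(consensus) == sequence_checksum(submitted)


@dataclass(frozen=True)
class SearchRecord:
    isolate_id: str
    sequence_length: int
    sequence_sha256: str
    database: str
    database_version: str
    search_settings: dict


def make_record(isolate_id, sequence, database, version, settings):
    """Build a traceability record for one database search."""
    seq = normalize(sequence)
    return SearchRecord(isolate_id, len(seq), sequence_checksum(seq),
                        database, version, dict(settings))


# Hypothetical example: the pasted text is missing the last base.
consensus = "ACGTACGTAC"
pasted = "ACGTACGTA"
print(same_sequence(consensus, pasted))
record = make_record("isolate-001", consensus, "example-db", "2020-01",
                     {"algorithm": "blastn"})
print(record.sequence_length, record.database, record.database_version)
```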

Sequence databases also vary widely in terms of the target gene data available. One can distinguish curated and noncurated databases, and within the curated ones, those where manual curation is performed and those where the curation is achieved via algorithm-based methods. All these databases have their advantages and disadvantages, but for diagnostic purposes, they should use the most current nomenclature and taxonomic organization and contain only curated sequences that are quality assured for accuracy, completeness, and annotations. Most of the bacterial sequence data deposited in public databases, such as GenBank (the world’s largest noncurated repository; https://www.ncbi.nlm.nih.gov/genbank), correspond to the 5′ region of 16S, but linked gene name/sequence can be uploaded, so the database is largely unverified. GenBank also contains both pathogenic and nonpathogenic human, animal, and environmental data that can generate some unusual matches against an isolate’s 16S sequence in BLAST (124). Furthermore, the presence in GenBank of many redundant entries (i.e., identical sequences with the same species annotation) can even mask relevant matches to sequences of other species. The same criteria outlined above should be applied when using a commercial database in the clinical laboratory for isolate sequence interpretation. Applied Biosystems (Life Technologies, Thermo Fisher Scientific) has software for bacterial 16S rRNA gene and fungal D1/D2 regions of the 26S rRNA gene (MicroSEQ ID Analysis); this software package and its manually curated reference database, however, are accessible only to users of the respective kits and often favor partial sequencing. Molzym (Bremen, Germany) also provides its own manually curated reference databases in the context of sepsis diagnostics.
SmartGene IDNS is a commercially available software package that supports automated or semiautomated sequence analysis from raw data to the report for all current sequencing platforms; it also comes with its own reference database (21, 130, 132,–135). The SmartGene IDNS (Zug, Switzerland) provides comprehensive curated databases for bacterial and fungal sequences using automated algorithm-based methods, which houses nonredundant and representative sequences for each species, full-length sequences, or sequences from collection strains, etc., within different containers.
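To illustrate how redundant entries can mask matches to other species, the following Python sketch collapses hits that share both the species annotation and the identical sequence before the top of the hit list is inspected. It is a simplified illustration and not the curation method of any database named above; the `Hit` fields and the `collapse_redundant` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Hit:
    accession: str   # hypothetical identifier, not a real GenBank accession
    species: str     # species annotation attached to the database entry
    identity: float  # percent identity reported by the search
    sequence: str    # matched reference sequence


def collapse_redundant(hits: List[Hit]) -> List[Tuple[Hit, int]]:
    """Keep one representative per (species, sequence) pair, best hit first.

    Returns (representative, number_of_entries_collapsed_into_it) pairs.
    """
    groups = {}
    for hit in sorted(hits, key=lambda h: h.identity, reverse=True):
        key = (hit.species, hit.sequence.upper())
        if key in groups:
            representative, count = groups[key]
            groups[key] = (representative, count + 1)
        else:
            groups[key] = (hit, 1)
    return list(groups.values())


# Placeholder data: three redundant entries for one species crowd out another.
example_hits = [
    Hit("X1", "Species A", 99.8, "ACGTAC"),
    Hit("X2", "Species A", 99.8, "ACGTAC"),
    Hit("X3", "Species A", 99.8, "acgtac"),
    Hit("X4", "Species B", 99.6, "ACGTAT"),
]
```

In this placeholder example, a list truncated to the two best raw hits would show only "Species A", whereas the collapsed list shows one representative of each species together with the number of entries it stands for.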

Overall laboratory resource availability, technologist expertise, and the required operational efficiency should be considered when selecting reference sequence interpretation database(s). Databases developed for 16S sequence analyses of specific genera or groups of microorganisms should also be used where verification studies show improved quality and accuracy of results. Turenne et al. compared the identification of 79 mycobacterial type strain sequences by analyses using either GenBank, the Ribosomal Database Project (RDP-II), or the 16S database of RIDOM (136). The RIDOM database contained an identical matching sequence for each submitted type strain, but only about a quarter of them could be accurately identified using BLAST on either GenBank or RDP-II (the open-access 16S RIDOM database has since been closed). Sequence-based identification within Nocardia spp. may also be problematic due to the high degree of intra- and interspecies genomic variability within this genus (137). Helal et al. compared clustering and classification algorithms within GenBank to identify 364 known and yet-to-be-identified Nocardia 16S sequences (138). These investigators found that the identification of centroids of 16S rRNA gene sequence clusters using novel distance matrix clustering enabled the identification of the most representative sequences for individual Nocardia species and allowed the quantitation of inter- and intraspecies variability.
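The general idea of choosing the most representative sequence of a cluster from a distance matrix can be sketched as follows. This is not the algorithm of Helal et al. (138): it assumes the sequences are already aligned to equal length, uses the simple fraction of differing columns as the distance, and picks the medoid (the member with the smallest summed distance to all others). All function names are hypothetical.

```python
from typing import Mapping


def p_distance(a: str, b: str) -> float:
    """Fraction of differing columns between two aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    return sum(x != y for x, y in zip(a.upper(), b.upper())) / len(a)


def medoid(cluster: Mapping[str, str]) -> str:
    """Name of the sequence with the smallest summed distance to all members."""
    return min(
        cluster,
        key=lambda name: sum(p_distance(cluster[name], s) for s in cluster.values()),
    )


def max_pairwise_distance(cluster: Mapping[str, str]) -> float:
    """Largest distance between any two members: a crude measure of variability."""
    seqs = list(cluster.values())
    return max(
        (p_distance(a, b) for i, a in enumerate(seqs) for b in seqs[i + 1:]),
        default=0.0,
    )


# Placeholder cluster of three short aligned sequences (not real 16S data).
example_cluster = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACGAACGA"}
```

Applied within one species, the maximum pairwise distance gives a rough figure for intraspecies variability; applied across the representatives of different species, the same distance gives a rough figure for interspecies variability.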

GenBank/NCBI makes available a type strain match filter via its type material annotation (139). Using this resource, one can select matches from type strains, of which there are currently only ∼20,300. However, one should be aware of missing species and variants and linked 16S sequences that are sometimes partial or even fragmented, leading to coverage issues with BLAST (see “Match and database coverage,” above).

IDENTIFICATION OF CLINICALLY RELEVANT BACTERIAL PATHOGENS USING 16S rRNA GENE SEQUENCING

This section provides readers with a comprehensive assessment of the use of this method for identification of bacteria that cause human disease within various taxonomic groups, based on current knowledge. Because NGS studies are rapidly changing our understanding of the classification and taxonomy of important groups of pathogens, it is important to consult online databases to verify that one is accessing the most up-to-date information for specific microorganisms and groups. Some recommended sites include the International Journal of Systematic and Evolutionary Microbiology (IJSEM), the Taxonomy Database of the National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/taxonomy), the List of Prokaryotic Names with Standing in Nomenclature (LPSN) (https://lpsn.dsmz.de/), and the Deutsche Sammlung von Mikroorganismen und Zellkulturen (DSMZ) (https://www.dsmz.de).

Overview of Pathogen Identification

Clinical microbiology laboratories must be able to rapidly and accurately identify a diverse range of bacterial isolates in order to diagnose the etiology of infection and provide guidance about appropriate antibiotic treatment. Partial or complete sequencing of 16S has proven to be an invaluable tool for reliably identifying unusual or rarely encountered bacteria that cause infections, particularly in the pre-MALDI-TOF MS era (16). In cases where conventional phenotypic methods had failed, 16S rRNA gene sequencing provided a genus-level identification for >90% of bacterial isolates and a species-level identification for 65% to 83%, depending on the group of bacteria and the criteria used for species definition (11, 16). With the emergence of 16S rRNA gene sequencing as an identification tool in the last 20 years, the usefulness of commercial databases has also undergone limited clinical evaluation. The MicroSeq 500 16S rDNA-based identification system can reliably identify >80% of clinically relevant bacterial isolates with atypical phenotypic profiles and 89.2% of unusual aerobic Gram-negative bacilli (8, 10, 135, 140, 141). It has also proven useful for the identification of some slow-growing bacteria, such as Mycobacterium species, notwithstanding the limitations of using this target (i.e., 16S sequences cannot differentiate species within the M. tuberculosis complex, the M. avium-intracellulare complex, or the M. chelonae/M. abscessus complex) (136, 142, 143). Simmons et al. compared the identification by conventional methods of a diverse group of bacterial clinical isolates with gene sequences interrogated by the SmartGene and MicroSeq databases (135). Of 300 isolates, SmartGene identified 295 (98%) to the genus level and 262 (87%) to the species level, with 5 (2%) being inconclusive. MicroSeq identified 271 (90%) to the genus level and 223 (74%) to the species level, with 29 (10%) being inconclusive. SmartGene and MicroSeq agreed on the genus for 233 (78%) isolates and the species for 212 (71%) isolates. Conventional methods identified 291 (97%) isolates to the genus level and 208 (69%) to the species level, with 9 (3%) being inconclusive. SmartGene, MicroSeq, and conventional identifications agreed for 193 (64%) of the results.

Utilization of 16S PCR/sequencing to identify clinically relevant bacteria that previously would have been mis- or unidentified from clinical specimens has also provided insight into the epidemiological and pathogenic potential of rare or unusual bacteria in human infections. Woo and colleagues summarized the novel bacterial species discovered from human specimens in just 7 years, from 2001 to 2007 (11); a total of 215 novel species, 29 belonging to novel genera, were reported. In addition, 100 novel species (15 belonging to novel genera) were found in 4 or more patients, and the largest numbers were of the genera Mycobacterium and Nocardia. Then and now, oral cavity/dental-related specimens and the gastrointestinal tract were the most important reservoirs for discovery of novel species (11). This agrees with the huge diversity of microbiota identified at these important body sites by the Human Microbiome Project (144, 145). Since their discovery, Streptococcus sinensis, Laribacter hongkongensis, Clostridium hathewayi, and Borrelia spielmanii have been more fully characterized, including their epidemiology and routes of transmission (11). Prospective local experience with 16S sequencing can also help define regional epidemiology of novel opportunistic pathogens. Performance of 16S sequencing on a large number of clinically relevant pathogens over the past decade in our laboratory revealed the epidemiology of invasive infections, such as bacteremia, due to several unusual bacteria, including Eggerthella lenta (146) and Peptoniphilus (147) and Actinomyces (21) species.

Current Limitations of the 16S rRNA Gene Target for Pathogen Identification

Our group recently collaborated on updating the Clinical and Laboratory Standards Institute (CLSI MM-18-A2) document entitled Interpretive Criteria for Identification of Bacteria and Fungi by DNA Target Sequencing; Approved Guideline (5). This important clinical laboratory guideline provides interpretive criteria for identification of a wide range of clinically relevant bacteria and fungi to the genus and species levels using partial or complete 16S sequencing. To revise this document, we performed comprehensive multialignments to analyze relevant 16S sequences for most clinically relevant pathogens and closely related environmental species. Although more 16S sequence data are available for human pathogens within public/private databases than for other gene targets, one must recognize that few to no sequences (defined as ≤5 individual 16S sequences/species currently deposited in GenBank [NCBI]) have been published for a wide variety of the pathogenic organism/microorganism groups outlined here. The statistics about genus homology, shown in Tables 3 to 12, were generated by taking the best representative sequences of optimal length for each species from GenBank/NCBI (where available), grouping them by genus, and then aligning them using MAFFT V7 (148) to cover at least positions ∼50 to 1,200 of the 16S gene; sequences of species not covering these positions were excluded. Each alignment was analyzed by column/position: a column where all the species-sequences have the same nucleotide was counted as an identical position, and a column where at least one species-sequence had a gap or a different nucleotide was counted as a divergent position. The counting started at the first common position of all sequences and stopped at the last common position to avoid recording diversity where sequences were shorter. The percentages give an idea about the homology of a genus; in general, genera with few species tend to display a higher degree of homology.
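The column-by-column counting described above can be sketched in Python as follows. This is a minimal illustration, not the authors' actual script: it assumes the alignment is supplied as equal-length strings with "-" as the gap character (for example, as read from a MAFFT output file), and the function name is hypothetical.

```python
from typing import List, Tuple


def column_identity(aligned: List[str], gap: str = "-") -> Tuple[int, int, float]:
    """Count identical and divergent columns over the region shared by all rows.

    Returns (identical_positions, divergent_positions, percent_identity).
    """
    if not aligned or len({len(s) for s in aligned}) != 1:
        raise ValueError("expected a non-empty alignment of equal-length rows")

    # First common position: the latest start among the rows (leading gaps).
    start = max(len(s) - len(s.lstrip(gap)) for s in aligned)
    # Last common position (exclusive): the earliest end among the rows.
    end = min(len(s.rstrip(gap)) for s in aligned)

    identical = divergent = 0
    for col in range(start, end):
        residues = {s[col].upper() for s in aligned}
        # Identical only if every row carries the same nucleotide and no gap.
        if len(residues) == 1 and gap not in residues:
            identical += 1
        else:
            divergent += 1

    total = identical + divergent
    if total == 0:
        raise ValueError("the sequences share no common region")
    return identical, divergent, 100.0 * identical / total


# Placeholder alignment of three short rows (not real 16S data).
example_alignment = [
    "--ACGTACGT--",
    "ACACGTTCGTAA",
    "-AACG-ACGTA-",
]
```

For the placeholder alignment, the shared region spans eight columns, of which six are identical and two are divergent (one internal gap and one substitution), giving 75% identity. The % identity column in the tables follows the same arithmetic: for Staphylococcus in Table 3, 1,237 identical positions out of 1,387 tested positions gives 89.19%.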

TABLE 3

Summary of ability of 16S rRNA gene target to identify clinically relevant Staphylococcaceae, Micrococcaceae, and Dermacoccaceae

Genus	No. of sequences in the genus MSA^a	Total no. of tested positions	No. of identical positions	No. of divergent positions	% Identity
Staphylococcus	46	1,387	1,237	150	89.19
Micrococcus	9	1,386	1,314	72	94.81
Citricoccus	3	1,457	1,435	22	98.49
Kytococcus	2	1,445	1,419	26	98.20
Dermacoccus	4	1,471	1,438	33	97.76
Kocuria	20	1,429	1,212	217	84.81
Rothia	8	1,394	1,273	121	91.32
Luteipulveratus	0	0
Auritidibacter	0	0
