Removing noise from pyrosequenced amplicons

doi:10.1186/1471-2105-12-38

BMC Bioinformatics. 2011 Jan 28;12:38.

Christopher Quince¹, Anders Lanzen, Russell J Davenport, Peter J Turnbaugh

PMID: 21276213
PMCID: PMC3045300
DOI: 10.1186/1471-2105-12-38

Christopher Quince et al. BMC Bioinformatics. 2011.

Affiliation

¹ Department of Civil Engineering, University of Glasgow, Rankine Building, Oakfield Avenue, Glasgow G12 8LT, UK. christopher.quince@glasgow.ac.uk

Abstract

Background: In many environmental genomics applications a homologous region of DNA from a diverse sample is first amplified by PCR and then sequenced. The next-generation sequencing technology 454 pyrosequencing has allowed much larger read numbers from PCR amplicons than ever before. This has revolutionised the study of microbial diversity, as it is now possible to sequence a substantial fraction of the 16S rRNA genes in a community. However, there is a growing realisation that, because of the large read numbers and the lack of consensus sequences, it is vital to distinguish noise from true sequence diversity in these data. Failing to do so leads to inflated estimates of the number of types or operational taxonomic units (OTUs) present. Three sources of error are important: sequencing error, PCR single base substitutions and PCR chimeras. We present AmpliconNoise, a development of the PyroNoise algorithm that is capable of separately removing 454 sequencing errors and PCR single base errors. We also introduce a novel chimera removal program, Perseus, that exploits the sequence abundances associated with pyrosequencing data. We use data sets where samples of known diversity have been amplified and sequenced to quantify the effect of each of the sources of error on OTU inflation and to validate these algorithms.

Results: AmpliconNoise outperforms alternative algorithms, substantially reducing per-base error rates for both the GS FLX and the latest Titanium protocol. All three sources of error lead to inflation of diversity estimates. In particular, chimera formation has a hitherto unrealised importance, which varies according to amplification protocol. We show that AmpliconNoise allows accurate estimates of OTU number. Just as importantly, AmpliconNoise generates the right OTUs even at low sequence differences. We demonstrate that Perseus has very high sensitivity and is able to find 99% of chimeras, which is critical when these are present at high frequencies.

Conclusions: AmpliconNoise followed by Perseus is a very effective pipeline for the removal of noise. In addition, the principles behind the algorithms, namely the inference of true sequences using Expectation-Maximization (EM) and the treatment of chimera detection as a classification or 'supervised learning' problem, will be equally applicable to new sequencing technologies as they appear.
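To make the EM principle concrete, here is a toy sketch of a mixture model over a fixed set of candidate "true" sequences. It is not the AmpliconNoise implementation: the likelihood form exp(-d / sigma), the function name `em_mixture_weights`, the parameter `sigma` and the distance values are all assumptions chosen for illustration.

```python
import math

def em_mixture_weights(dist, sigma, n_iter=50):
    """Toy EM: dist[i][j] is the distance from read i to candidate sequence j.

    Assumes a likelihood proportional to exp(-d / sigma) and keeps the
    candidate sequences fixed; only the mixture weights are re-estimated.
    """
    n_reads, n_cand = len(dist), len(dist[0])
    weights = [1.0 / n_cand] * n_cand
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each candidate for each read.
        resp = []
        for row in dist:
            scores = [w * math.exp(-d / sigma) for w, d in zip(weights, row)]
            total = sum(scores)
            resp.append([s / total for s in scores])
        # M-step: new weight is the mean responsibility over all reads.
        weights = [sum(r[j] for r in resp) / n_reads for j in range(n_cand)]
    return weights, resp

# Hypothetical distances: three reads near candidate 0, one read near candidate 1.
dist = [[0.00, 0.10], [0.01, 0.10], [0.02, 0.09], [0.10, 0.00]]
weights, resp = em_mixture_weights(dist, sigma=0.01)
# Assign each read to the candidate with the highest responsibility.
assignments = [max(range(len(r)), key=r.__getitem__) for r in resp]
print(weights, assignments)
```

In this toy setting the weights play the role of inferred relative abundances, and noisy reads are absorbed into the candidate they most plausibly derive from.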

Figures

**Figure 1**
**Flowgram signal intensity distributions**. Probability distributions of observed signal intensities at different homopolymer lengths for the 'Even' V2 Mock Communities. The homopolymer length is shown above the mode of the distribution.

**Figure 2**
**OTU numbers in the V5 'Artificial Community' as a function of percent sequence difference - logarithmic**. Numbers of OTUs formed at cut-offs of increasing percent sequence difference after complete linkage clustering of the 'Artificial Community' V5 data set (Table 1). Distances were calculated following pair-wise alignment with the Needleman-Wunsch algorithm. Results are shown following filtering (red line), pyrosequencing noise removal by the first PyroNoise stage of AmpliconNoise (green line), further removal of PCR point mutations by the second SeqNoise stage (blue line) and following removal of chimeric sequences (magenta line). For comparison, the number of OTUs obtained by clustering the known reference sequences is shown in black. The y-axis is logarithmically scaled.
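The OTU-counting procedure described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the unit-cost alignment scoring, the normalisation by the longer sequence, the function names and the example sequences are all assumptions.

```python
from itertools import combinations

def nw_distance(a, b):
    """Fraction of differences under a unit-cost global alignment (assumed scoring)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # gap in b
                           cur[j - 1] + 1,              # gap in a
                           prev[j - 1] + (ca != cb)))   # match or mismatch
        prev = cur
    return prev[-1] / max(len(a), len(b))

def count_otus(seqs, cutoff):
    """Number of complete-linkage clusters formed at the given distance cut-off."""
    d = {frozenset(p): nw_distance(seqs[p[0]], seqs[p[1]])
         for p in combinations(range(len(seqs)), 2)}
    clusters = [[i] for i in range(len(seqs))]
    while len(clusters) > 1:
        # Complete linkage: cluster distance is the largest member-to-member distance.
        def linkage(pair):
            x, y = pair
            return max(d[frozenset((i, j))] for i in clusters[x] for j in clusters[y])
        x, y = min(combinations(range(len(clusters)), 2), key=linkage)
        if linkage((x, y)) > cutoff:
            break  # no remaining pair of clusters is close enough to merge
        clusters[x] += clusters.pop(y)
    return len(clusters)

# Hypothetical sequences for illustration only.
seqs = ["ACGTACGTAC", "ACGTACGTAT", "ACGTACGAAT", "TTGTCCGGAC"]
for cutoff in (0.0, 0.15, 0.25, 1.0):
    print(cutoff, count_otus(seqs, cutoff))
```

Plotting the returned counts against the cut-off gives a curve of the kind shown in the figure. The naive merge loop is fine for a handful of sequences but would be far too slow for real data sets.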

**Figure 3**
**OTU numbers in the V5 'Artificial Community' as a function of percent sequence difference - linear**. Numbers of OTUs formed at cut-offs of increasing percent sequence difference after complete linkage clustering of the 'Artificial Community' V5 data set (Table 1). Distances were calculated following pair-wise alignment with the Needleman-Wunsch algorithm. Results are shown for the filtered sequences after pyrosequencing and PCR noise removal by AmpliconNoise (magenta line), for single-linkage preclustering (SLP) at 1% (purple) and SLP at 2% (cyan), for the DeNoiser algorithm (orange), and for the original one-stage PyroNoise algorithm (dark green line). In all cases chimeric sequences were removed. For comparison, the number of OTUs obtained by clustering the known reference sequences is shown in black. The y-axis is scaled linearly.

**Figure 4**
**OTU construction accuracy for the V5 'Artificial Community' as a function of percent sequence difference for the different noise removal algorithms**. Results are given for the improved two-stage 'AmpliconNoise' (A), the original 'PyroNoise' (B), single-linkage preclustering at 2% (C), and the DeNoiser algorithm (D). Reads classified as chimeric by comparison with the references were removed. The solid black portion gives the number of OTUs comprising both reference sequences and denoised pyrosequenced reads. These are good OTUs. The grey area gives OTUs formed only from reference sequences. These correspond to true OTUs that are missed. The diagonally shaded area gives those OTUs containing only pyrosequenced reads, which are hence noise.
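The three OTU categories in this caption can be expressed as a short classification rule. The sketch below assumes each OTU is available as a set of sequence IDs and that reference IDs are known; the function name and the IDs are hypothetical.

```python
def classify_otus(otus, reference_ids):
    """Count good, missed and noise OTUs; each OTU is a set of sequence IDs."""
    counts = {"good": 0, "missed": 0, "noise": 0}
    for otu in otus:
        has_ref = any(s in reference_ids for s in otu)
        has_read = any(s not in reference_ids for s in otu)
        if has_ref and has_read:
            counts["good"] += 1      # reference sequences plus denoised reads
        elif has_ref:
            counts["missed"] += 1    # reference sequences only
        elif has_read:
            counts["noise"] += 1     # pyrosequenced reads only
    return counts

# Hypothetical IDs for illustration.
reference_ids = {"ref1", "ref2", "ref3"}
otus = [{"ref1", "readA", "readB"}, {"ref2"}, {"readC"}, {"ref3", "readD"}]
print(classify_otus(otus, reference_ids))
```

Repeating this count at each clustering cut-off would give the three stacked areas of each panel.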

**Figure 5**
**OTU construction accuracy for the Titanium data set as a function of percent sequence difference for the different noise removal algorithms**. Results are given for AmpliconNoise with *σ_s* = 0.01 (A) and *σ_s* = 0.04 (B), single-linkage preclustering at 2% (C), and the DeNoiser algorithm (D). Reads classified as chimeric by comparison with the references were removed. The solid black portion gives the number of OTUs comprising both reference sequences and denoised pyrosequenced reads. These are good OTUs. The grey area gives OTUs formed only from reference sequences. These correspond to true OTUs that are missed. The diagonally shaded area gives those OTUs containing only pyrosequenced reads, which are hence noise.

**Figure 6**
**Training logistic regression on denoised V5 'Divergent' data**. Good sequences are shown as black dots, chimeras as red dots and reference sequences as magenta dots. We used the denoised V5 'Divergent' data set, classified as either good or chimeric by comparison with the references, and the reference sequences, all good, to train a one-dimensional logistic regression on the 'chimera index' I using the R software package [30]. An intercept, α = −183.25, and coefficient, β = 10.56, were obtained despite the fact that the algorithm did not converge (see text), and the corresponding P50 classification value, 17.35, is shown (blue line).
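The P50 value quoted in this caption follows directly from the fitted coefficients, since the logistic probability equals 0.5 where α + βI = 0. The sketch below assumes the model gives the probability that a sequence is chimeric as a function of the chimera index I; the function name is hypothetical.

```python
import math

ALPHA = -183.25  # intercept reported in the caption
BETA = 10.56     # coefficient reported in the caption

def chimera_probability(index, alpha=ALPHA, beta=BETA):
    """Logistic model: P(chimera | I) = 1 / (1 + exp(-(alpha + beta * I)))."""
    z = alpha + beta * index
    # Branch on the sign of z so that exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

p50 = -ALPHA / BETA  # index at which the predicted probability is 0.5
print(round(p50, 2))  # 17.35
```

With a slope this steep the predicted probability switches from nearly 0 to nearly 1 within a fraction of a unit of I around the P50 line.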

**Figure 7**
**Validation of logistic regression on denoised V5 'Artificial Community' data**. Applying the classification rule (blue line) from Figure 6 to the 'Artificial Community' denoised data sets correctly predicts all chimeras except two, which fall below the P50 line. Good sequences are shown as black dots and chimeras as red dots.

**Figure 8**
**Training logistic regression on denoised V2 'Even' data**. Good sequences are shown as black dots, chimeras as red dots and reference sequences as magenta dots. We used the three denoised V2 'Even' data sets, classified as either good or chimeric by comparison with the references, and the reference sequences, all good, to train a one-dimensional logistic regression on the 'chimera index' I using the R software package. An intercept, α = −2.83542, and coefficient, β = 0.55889, were obtained (both highly significantly different from zero). The corresponding P25, P50 and P75 decision lines are shown (blue lines). The fit reduced the null deviance from 1371.81 on 5416 degrees of freedom to a residual deviance of 416.36 on 5415 degrees of freedom (AIC 420.36).
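The decision lines and the AIC in this caption can be checked from the reported numbers. The sketch below inverts the logistic function to find the chimera index at a given probability, and uses the fact that for a binary-response logistic fit the AIC is the residual deviance plus twice the number of fitted parameters. The function name is hypothetical.

```python
import math

ALPHA, BETA = -2.83542, 0.55889  # values reported in the caption

def decision_line(p, alpha=ALPHA, beta=BETA):
    """Chimera index I at which the fitted probability equals p."""
    return (math.log(p / (1.0 - p)) - alpha) / beta

for p in (0.25, 0.50, 0.75):
    print(f"P{int(p * 100)}: I = {decision_line(p):.2f}")

# AIC = residual deviance + 2 * (number of fitted parameters: alpha and beta)
residual_deviance, n_params = 416.36, 2
print(round(residual_deviance + 2 * n_params, 2))  # 420.36
```

The much shallower slope than in the V5 fit spreads the P25 and P75 lines several index units apart.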

Cited by

Conversion of boreal forests to agricultural systems: soil microbial responses along a land-conversion chronosequence.
Benalcazar P, Seuradge B, Diochon AC, Kolka RK, Phillips LA. Environ Microbiome. 2024 May 11;19(1):32. doi: 10.1186/s40793-024-00576-3. PMID: 38734653.
Regulation and analysis of Simiao Yong'an Decoction fermentation by Bacillus subtilis on the diversity of intestinal microbiota in Sprague-Dawley rats.
Yang Z, Chen K, Liu Y, Wang X, Wang S, Hao B. Vet World. 2024 Mar;17(3):712-719. doi: 10.14202/vetworld.2024.712-719. Epub 2024 Mar 25. PMID: 38680148.
Arbuscular Mycorrhizal Fungi and Rhizobium Improve Nutrient Uptake and Microbial Diversity Relative to Dryland Site-Specific Soil Conditions.
Calderon RB, Dangi SR. Microorganisms. 2024 Mar 27;12(4):667. doi: 10.3390/microorganisms12040667. PMID: 38674611.
Gut dysbiosis was inevitable, but tolerance was not: temporal responses of the murine microbiota that maintain its capacity for butyrate production correlate with sustained antinociception to chronic voluntary morphine.
Sall I, Foxall R, Felth L, Maret S, Rosa Z, Gaur A, Calawa J, Pavlik N, Whistler JL, Whistler CA. bioRxiv [Preprint]. 2024 Apr 17:2024.04.15.589671. doi: 10.1101/2024.04.15.589671. PMID: 38659831.
Nitrogen-fixing bacterial communities differ between perennial agroecosystem crops.
Sorochkina K, Martens-Habbena W, Reardon CL, Inglett PW, Strauss SL. FEMS Microbiol Ecol. 2024 May 14;100(6):fiae064. doi: 10.1093/femsec/fiae064. PMID: 38637314.

References

1. Margulies M, Egholm M, Altman W, Attiya S, Bader J, Bemben L, Berka J, Braverman M, Chen Y, Chen Z, Dewell S, Du L, Fierro J, Gomes X, Godwin B, He W, Helgesen S, Ho C, Irzyk G, Jando S, Alenquer M, Jarvie T, Jirage K, Kim J, Knight J, Lanza J, Leamon J, Lefkowitz S, Lei M, Li J, Lohman K, Lu H, Makhijani V, McDade K, McKenna M, Myers E, Nickerson E, Nobile J, Plant R, Puc B, Ronan M, Roth G, Sarkis G, Simons J, Simpson J, Srinivasan M, Tartaro K, Tomasz A, Vogt K, Volkmer G, Wang S, Wang Y, Weiner M, Yu P, Begley R, Rothberg J. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–380.
2. Wang GP, Sherrill-Mix SA, Chang KM, Quince C, Bushman FD. Hepatitis C Virus Transmission Bottlenecks Analyzed by Deep Sequencing. J Virol. 2010;84(12):6218–6228. doi: 10.1128/JVI.02271-09.
3. Huber JA, Mark Welch D, Morrison HG, Huse SM, Neal PR, Butterfield DA, Sogin ML. Microbial population structures in the deep marine biosphere. Science. 2007;318:97–100. doi: 10.1126/science.1146689.
4. Huse SM, Huber JA, Morrison HG, Sogin ML, Mark Welch D. Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol. 2007;8(7). doi: 10.1186/gb-2007-8-7-r143.
5. Quince C, Lanzen A, Curtis TP, Davenport RJ, Hall N, Head IM, Read LF, Sloan WT. Accurate determination of microbial diversity from 454 pyrosequencing data. Nat Methods. 2009;6:639–641. doi: 10.1038/nmeth.1361.

Grants and funding

P50 GM068763/GM/NIGMS NIH HHS/United States


