BMC Bioinformatics. 2010 Jun 23;11:341.
doi: 10.1186/1471-2105-11-341.

TagCleaner: Identification and removal of tag sequences from genomic and metagenomic datasets

Robert Schmieder et al. BMC Bioinformatics. 2010.

Abstract

Background: Sequencing metagenomes that were pre-amplified with primer-based methods requires the removal of the additional tag sequences from the datasets. The sequenced reads can contain deletions or insertions due to sequencing limitations, and the primer sequence may contain ambiguous bases. Furthermore, the tag sequence may be unavailable or incorrectly reported. Because of the potential for downstream inaccuracies introduced by unwanted sequence contaminations, it is important to use reliable tools for pre-processing sequence data.

Results: TagCleaner is a web application that automatically identifies and removes known or unknown tag sequences, allowing for insertions and deletions in the reads. It can also filter the trimmed reads for duplicates, short reads, and reads with a high fraction of ambiguous bases. An additional step screens for fragment-to-fragment concatenations and splits the resulting artificial concatenated sequences, which can increase the quality of the dataset. Users may modify the filter parameters according to their own preferences.
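The read filters described in the Results can be illustrated with a minimal sketch. It is not TagCleaner's implementation: the function name `filter_reads`, the parameters `min_length` and `max_n_fraction`, and their default values are assumptions chosen for illustration, and only exact duplicates are detected.

```python
def filter_reads(reads, min_length=50, max_n_fraction=0.05):
    """Drop short reads, reads rich in ambiguous bases (N), and exact duplicates.

    All names and default thresholds here are hypothetical, not TagCleaner's.
    """
    seen = set()
    kept = []
    for read in reads:
        seq = read.upper()
        # Short-read filter (also guards against empty sequences).
        if not seq or len(seq) < min_length:
            continue
        # Ambiguity filter: fraction of N characters in the read.
        if seq.count("N") / len(seq) > max_n_fraction:
            continue
        # Duplicate filter: keep only the first copy of each sequence.
        if seq in seen:
            continue
        seen.add(seq)
        kept.append(seq)
    return kept


if __name__ == "__main__":
    example = ["ACGTACGT", "acgtacgt", "ACG", "ANNNACGT"]
    print(filter_reads(example, min_length=5, max_n_fraction=0.25))
```

In this example only the first read survives: the second is a duplicate after case normalization, the third is too short, and the fourth has too many ambiguous bases.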

Conclusions: TagCleaner is a publicly available web application that is able to automatically detect and efficiently remove tag sequences from metagenomic datasets. It is easily configurable and provides a user-friendly interface. The interactive web interface provides export functionality for subsequent data processing and is available at http://edwards.sdsu.edu/tagcleaner.

Figures

Figure 1
Simplified model showing how fragment-to-fragment concatenations can be generated. DNA polymerase can create overhangs during PCR amplification. An overhang is a stretch of unpaired nucleotides at the end of a DNA molecule (e.g. a single adenosine as a 3'-overhang). The unpaired nucleotides are removed to generate blunt-ended DNA molecules with both strands terminating in a base pair. This step can produce fragment-to-fragment concatenations because blunt ends are compatible with each other. The 454 adaptors are added to the amplified fragments by blunt-end ligation before sequencing. The resulting sequence data can contain artificial concatenated sequences.
Figure 2
Example data for the first 50 positions of a metagenomic dataset containing tag sequences. Example data showing nucleotide frequencies (top), predicted tag sequence (middle), and frequency range and median (bottom) for the first 50 positions in a metagenomic dataset before tag trimming. We can see a clear separation between the non-random nucleotide positions of the tag (A), the quasi-random nucleotides of the tag (B) and the metagenomic sequence (C).
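The per-position nucleotide frequencies plotted in Figure 2 can be computed with a few lines of code. The sketch below is only an illustration under stated assumptions: the function names `position_frequencies` and `guess_tag` and the `min_frequency` cutoff are hypothetical, and the simple cutoff rule recovers only the non-random part of a tag (region A), not the quasi-random positions (region B).

```python
from collections import Counter


def position_frequencies(reads, n_positions=50):
    """Return one {base: frequency} dict per position for the first n_positions."""
    freqs = []
    for i in range(n_positions):
        # Count the bases observed at position i across all reads long enough.
        column = Counter(r[i].upper() for r in reads if len(r) > i)
        total = sum(column.values())
        freqs.append({b: c / total for b, c in column.items()} if total else {})
    return freqs


def guess_tag(freqs, min_frequency=0.9):
    """Extend the tag while one base dominates a position (hypothetical rule)."""
    tag = []
    for column in freqs:
        if not column:
            break
        base, freq = max(column.items(), key=lambda kv: kv[1])
        if freq < min_frequency:
            break
        tag.append(base)
    return "".join(tag)


if __name__ == "__main__":
    reads = ["TGTGAC", "TGTGCA", "TGTGGT", "TGTGTG"]
    print(guess_tag(position_frequencies(reads, n_positions=6)))
```

Here the first four positions are identical in every read, while position five is evenly split across the four bases, so the guessed tag stops after four bases.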
Figure 3
Simplified example showing the calculation of shift values for 5-mers at the 3'-end. The 5-mers at the 3'-end of all sequences are extracted (A) and sorted by decreasing frequency (B). The first 5-mer in the list (highest frequency) is then aligned to the second 5-mer (C) to calculate the minimum number of shift operations to align the two 5-mers without gaps (D). The shift direction is based on the 5-mer with the higher frequency. Shifts to the left have negative values assigned, whereas shifts to the right have positive values assigned. If the number of shift operations is less than or equal to a given threshold (default: 2), then the two 5-mers are joined into one k-mer. In the next step, the third 5-mer is aligned with either the first 5-mer or the joined k-mer and the same operations are performed. These steps are repeated for the remaining 5-mers. The values of shift operations for the 3'-end are then adjusted (E) by the negative of the maximum number of shift operations (- max{-1, 1}).
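The gap-free alignment of two 5-mers described in this caption can be sketched as follows. This is an illustration, not the published code: the function name `min_shift`, the sign convention (negative when the second k-mer sits to the left of the reference, positive when it sits to the right), and the tie-breaking order between left and right shifts are assumptions.

```python
def min_shift(reference, other, max_shift=2):
    """Smallest shift that aligns `other` to `reference` without gaps.

    `reference` is the higher-frequency k-mer. A shift s places other[j] at
    reference position j + s, so s < 0 means `other` lies to the left.
    Returns None if no shift with |s| <= max_shift makes the overlap identical.
    Tie-breaking (left before right) is an arbitrary choice for this sketch.
    """
    k = len(reference)
    if len(other) != k:
        raise ValueError("k-mers must have equal length")
    if reference == other:
        return 0
    for magnitude in range(1, min(max_shift, k - 1) + 1):
        # Shift to the left: reference prefix must equal the suffix of other.
        if reference[: k - magnitude] == other[magnitude:]:
            return -magnitude
        # Shift to the right: reference suffix must equal the prefix of other.
        if reference[magnitude:] == other[: k - magnitude]:
            return magnitude
    return None


if __name__ == "__main__":
    print(min_shift("ACGTA", "CGTAC"))
    print(min_shift("ACGTA", "AACGT"))
    print(min_shift("ACGTA", "TTTTT"))
```

Two k-mers for which `min_shift` returns a value within the threshold would be candidates for joining into one longer k-mer; the joining step itself and the final adjustment of the 3'-end shift values are not shown.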
Figure 4
TagCleaner web interface. Screenshots of the TagCleaner web interface at different steps of the data processing. The user can either input a data ID to access already processed data (A) or input a new sequence file and the tag sequences, if available (B). If the tag sequence is not available, the tag sequence is estimated using a nucleotide frequency-based approach. The estimated tag sequence is shown below the nucleotide frequency plot and the frequency range and median plot (C). Based on the provided frequency information, the user can change the estimated tag sequence using the functionality of the graphical interface (D). After detecting the tag sequence in the dataset, the results are shown including the input information (E), graphical representation of the number of mismatches (F), filter parameters (G), download options (H) and options to manage filter parameters (I).
Figure 5
Results for exact and approximate tag sequence matching for the datasets from Nakamura et al. [18]. TagCleaner detected the same tag sequences (5'-TGT GTT GGG TGT GTT TGG NNN NNN NNN N and NNN NNN NNN NCC AAA CAC ACC CAA CAC A-3') in the sequences from nasal (F1 - F3) and fecal samples (N1 - N5). The fraction of sequences that contained tag sequences with no mismatches and 1-3 mismatches is shown for the 5'-end, 3'-end and the concatenated tag sequences.
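Matching a tag that contains ambiguous positions (N) while tolerating a few mismatches can be sketched as below. This is a simplified illustration: it counts substitutions only, whereas TagCleaner also allows insertions and deletions, and the names `count_mismatches`, `trim_5prime_tag`, and `max_mismatches` are hypothetical. Sequences are assumed to be uppercase.

```python
def count_mismatches(tag, read_prefix):
    """Count positions where the read differs from the tag; N in the tag matches any base."""
    return sum(1 for t, r in zip(tag, read_prefix) if t != "N" and t != r)


def trim_5prime_tag(read, tag, max_mismatches=3):
    """Remove the tag from the 5'-end if it matches with few enough mismatches.

    Substitution-only comparison (no insertions or deletions), as a sketch.
    """
    if len(read) < len(tag):
        return read
    if count_mismatches(tag, read[: len(tag)]) <= max_mismatches:
        return read[len(tag):]
    return read


if __name__ == "__main__":
    # 5'-tag from Figure 5, written without spaces.
    tag = "TGTGTTGGGTGTGTTTGGNNNNNNNNNN"
    read = "TGTGTTGGGTGTGTTTGG" + "ACGTACGTAC" + "GGGCCC"
    print(trim_5prime_tag(read, tag))
```

The example read carries the fixed 18-base part of the tag followed by ten arbitrary bases, so the whole 28-base tag is removed and only the trailing fragment remains.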

References

    1. Djikeng A, Kuzmickas R, Anderson NG, Spiro DJ. Metagenomic analysis of RNA viruses in a fresh water lake. PLoS One. 2009;4(9) doi: 10.1371/journal.pone.0007264. [PMID: 19787045 PMCID: 2746286]
    2. Schloss P, Handelsman J. Metagenomics for studying unculturable microorganisms: cutting the Gordian knot. Genome Biology. 2005;6(8):229. doi: 10.1186/gb-2005-6-8-229.
    3. Tringe SG, Rubin EM. Metagenomics: DNA sequencing of environmental samples. Nature Reviews Genetics. 2005;6(11):805–814. doi: 10.1038/nrg1709.
    4. Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, Brulc JM, Furlan M, Desnues C, Haynes M, Li L, McDaniel L, Moran MA, Nelson KE, Nilsson C, Olson R, Paul J, Brito BR, Ruan Y, Swan BK, Stevens R, Valentine DL, Thurber RV, Wegley L, White BA, Rohwer F. Functional metagenomic profiling of nine biomes. Nature. 2008;452(7187):629–632. doi: 10.1038/nature06810.
    5. Thurber RV, Haynes M, Breitbart M, Wegley L, Rohwer F. Laboratory procedures to generate viral metagenomes. Nature Protocols. 2009;4(4):470–483. doi: 10.1038/nprot.2009.10.
