Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2024 May 8;25(1):455.
doi: 10.1186/s12864-024-10344-9.

Disregarding multimappers leads to biases in the functional assessment of NGS data

Affiliations

Disregarding multimappers leads to biases in the functional assessment of NGS data

Michelle Almeida da Paz et al. BMC Genomics. .

Abstract

Background: Standard ChIP-seq and RNA-seq processing pipelines typically disregard sequencing reads whose origin is ambiguous ("multimappers"). This usual practice has potentially important consequences for the functional interpretation of the data: genomic elements belonging to clusters composed of highly similar members are left unexplored.

Results: In particular, disregarding multimappers leads to the underrepresentation in epigenetic studies of recently active transposable elements, such as AluYa5, L1HS and SVAs. Furthermore, this common strategy also has implications for transcriptomic analysis: members of repetitive gene families, such the ones including major histocompatibility complex (MHC) class I and II genes, are under-quantified.

Conclusion: Revealing inherent biases that permeate routine tasks such as functional enrichment analysis, our results underscore the urgency of broadly adopting multimapper-aware bioinformatic pipelines -currently restricted to specific contexts or communities- to ensure the reliability of genomic and transcriptomic studies.

Keywords: ChIP-seq; Functional analysis; Multimappers; Next-generation sequencing (NGS); RNA-seq.

PubMed Disclaimer

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Fig. 1
Discarding multimappers leads to epigenetic mischaracterization of young TEs. A Percentage of uni- and multimapper reads mapping to portions of the human genome annotated as TEs and not annotated as TEs (non-TE) for dataset 1 containing two ChIP-seq libraries generated by the ENCODE consortium using single-end 50 bp (“SE50”) and 100 bp (“SE100”) reads. Mapping was performed with two different mapping tools: with BBMap and Bwa mem. TEs are the major source of ChIP-seq multimappers in the human genome. B Length distribution of TE individual copies. Only TEs shorter than 500 bp are shown. Note that ~ 74% (856 out of 1,160) of the TEs in the human genome are longer than 500 bp, spanning up to 153,104 bp. The bin width is 10 bp. TEs were classified as DNA, LTR (long terminal repeat), SINE (short interspersed nuclear element), LINE (long interspersed nuclear element) and Others (e.g., rolling-circle (RC), unknown classification). Standard NGS reads are too short to fully cover most TE copies, explaining why TEs often give rise to multimappers. C Read coverage (bar plot on the left y-axis; see Methods) per clade for the SE100 ChIP-seq library (dataset 1) for uni- and multimappers. Reads were mapped using Bwa mem. TEs found in multiple clades (e.g., L1HS and L1P1) were assigned to the “younger” clade (Homo and Hominoidea, respectively). Only 30 out of 1,160 different TEs in the human genome (~ 3%) have not been annotated to any clade and were not represented. The number of TE copies for each clade (line plot on the right y-axis) shows that the majority of TEs are Primates and Eutherian-specific. Evolutionary young TEs are prone to be underrepresented when excluding multimappers from ChIP-seq analysis
Fig. 2
Fig. 2
Discarding multimappers leads to functional mischaracterization of repetitive gene families. A Percentage of uni- and multimapper fragments mapping to human genome for a RNA-seq library of dataset 3 generated by the ENCODE consortium using pair-end 100 bp (“PE100”) and thereof simulated libraries with read pairs of length 25, 50 and 75 bp (“PE25”, “PE50” and “PE75”, respectively). Within the read lengths assessed, the difference in the proportion of multimappers was modest (10–21%). B Scatter plot showing gene expression values computed with HTSeq-count using default parameters (“–nonunique none”; x-axis) and by our “multimapper-aware” strategy (y-axis) for PE100. Each dot represents a protein-coding gene and is coloured differently depending on whether it is considered (approximately) equally-quantified or under-quantified by HTSeq-count (see Methods). The dashed line indicates identical gene expression values. About 6% (777 out of 13,437) expressed genes are under-quantified when discarding multimappers. C Gene ontology (GO) enrichment analysis of the 50, 100, and 200 protein-coding genes with the highest expression values in PE100 as computed by HTSeq-count (“H50”, “H100”, and “H200”, respectively) or our “multimapper-aware” strategy (“C50”, “C100”, and “C200”, respectively). GO enrichment analysis was performed for the “molecular function” category and using the “org.Hs.eg.db” annotation for the human genome. The q-value threshold was set to 0.01. Dot size represents the ratio between the number of genes in the given GO term (y-axis) and the number of genes annotated in each category (shown in brackets, below the label of each gene set on the x-axis). Dot colour indicates the P-value adjusted by Benjamini-Hochberg (BH, “p.adj.”). Neglecting multimappers leads to the underrepresentation of genes associated with specific GO terms (indicated in bold and with grey shading)

Similar articles

References

    1. Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods. 2007;4:651–657. doi: 10.1038/nmeth1068. - DOI - PubMed
    1. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10:57–63. doi: 10.1038/nrg2484. - DOI - PMC - PubMed
    1. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, et al. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008;320:1344–1349. doi: 10.1126/science.1158441. - DOI - PMC - PubMed
    1. Transcription Factor ChIP-seq Data Standards and Processing Pipeline. https://www.encodeproject.org/chip-seq/transcription_factor/. Accessed 1 Feb 2024.
    1. Sequencing Read Length. https://www.illumina.com/science/technology/next-generation-sequencing/p.... Accessed 1 Feb 2024.

MeSH terms

Substances

-