Brief Bioinform. 2024 Mar 27;25(3):bbae221.
doi: 10.1093/bib/bbae221.

Guidelines for reproducible analysis of adaptive immune receptor repertoire sequencing data

Ayelet Peres et al. Brief Bioinform. 2024.

Abstract

Enhancing the reproducibility and comprehension of adaptive immune receptor repertoire sequencing (AIRR-seq) data analysis is critical for scientific progress. This study presents guidelines for reproducible AIRR-seq data analysis, and a collection of ready-to-use pipelines with comprehensive documentation. To this end, ten common pipelines were implemented using ViaFoundry, a user-friendly interface for pipeline management and automation. This is accompanied by versioned containers, documentation and archiving capabilities. The automation of pre-processing analysis steps and the ability to modify pipeline parameters according to specific research needs are emphasized. AIRR-seq data analysis is highly sensitive to varying parameters and setups; using the guidelines presented here, the ability to reproduce previously published results is demonstrated. This work promotes transparency, reproducibility, and collaboration in AIRR-seq data analysis, serving as a model for handling and documenting bioinformatics pipelines in other research domains.

Keywords: AIRR-seq; FAIR; annotation; pipelines; preprocessing; reproducibility.

Figures

Figure 1
Steps for reproducible AIRR-seq analysis pipelines.
(A) Tweak and run existing pipelines. In step one, an existing pipeline is selected using its Digital Object Identifier (DOI). In step two, the pipeline's specification and run environment files are downloaded. In step three, the run parameters (e.g., process parameters, primer files) are adjusted. In step four, AIRR-seq data is obtained from public databases (e.g., ENA, NCBI) or from local storage. In step five, the execution framework is selected: a cloud service (e.g., AWS, Azure, Google), the ViaFoundry execution framework server, or a locally run automation server platform (e.g., Jenkins). In step six, the analysis is run in the selected framework. In step seven, the updated pipeline files are downloaded; in steps eight and nine, they are documented and archived for future use.
(B) Create and archive pipelines. In steps one to six, the ViaFoundry framework is used to create the analysis pipeline and to set the parameters and run environment. In step seven, the pipeline specification and run environment are obtained. In steps eight and nine, the files are documented in a Git repository and archived in Zenodo.
(C) Create a pipeline with ViaFoundry. The first step is creating processes using the dedicated GUI. The second step is combining different processes into a module. The third step is assembling, from a set of modules, the full pipeline for analyzing AIRR sequences.
This figure was created with BioRender.com.
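The documenting and archiving steps in panel (A) can be illustrated with a minimal Python sketch that records, for one run, the pipeline identifier, the adjusted parameters, and a checksum of each input file. This is not part of ViaFoundry or of the published pipelines: the function names, the manifest layout, the parameter name, and the placeholder DOI are all assumptions made for illustration.

```python
import hashlib
import json


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a file's content."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(pipeline_doi: str, parameters: dict, input_files: dict) -> dict:
    """Collect what is needed to re-run an analysis (hypothetical layout).

    input_files maps a file name to its raw bytes; only the checksum is
    stored, so the manifest stays small enough to commit to a Git repository.
    """
    return {
        "pipeline_doi": pipeline_doi,
        "parameters": dict(parameters),
        "inputs": {name: sha256_of_bytes(content)
                   for name, content in input_files.items()},
    }


def manifest_to_json(manifest: dict) -> str:
    """Serialize with sorted keys so identical runs give identical text."""
    return json.dumps(manifest, indent=2, sort_keys=True)


# Example run record; the DOI below is a placeholder, not a real archive entry.
example = build_manifest(
    pipeline_doi="10.5281/zenodo.0000000",
    parameters={"maskprimers_max_error": 0.2},
    input_files={"primers.fasta": b">P1\nACGTACGTAC\n"},
)
print(manifest_to_json(example))
```

Sorting the keys makes the serialized manifest deterministic, so two runs with the same parameters and inputs produce byte-identical records that are easy to compare in version control.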
Figure 2
A case study of reproducing AIRR-seq analysis results.
(A) The influence of a single pipeline parameter on the number of passed reads. Each facet is an independent repertoire, the x-axis corresponds to the different error rate thresholds used in the MaskPrimers process, and the y-axis is the number of reads that passed the process at each threshold. Yellow bars correspond to the original threshold used to analyze the repertoires, and blue bars correspond to the alternative thresholds.
(B) The influence of the initial IGHV germline reference set on mutation load. The x-axis corresponds to the different IGHV germline reference sets, and the y-axis corresponds to the calculated mutation load.
(C) Mean IGHV gene usage. The x-axis corresponds to the different IGHV genes, and the y-axis corresponds to the mean usage frequency across all control individuals. Green boxes represent the original publication results, and red boxes represent the results obtained by pipeline PP1 listed in Table 1.
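The effect shown in panel (A) can be reproduced in miniature with a toy Python sketch that counts how many reads pass a primer-matching step at several error rate thresholds. It is not the MaskPrimers implementation: it assumes the primer sits at the very start of each read and scores it by the fraction of mismatched positions, and the reads, primer, and function names are invented for illustration.

```python
def mismatch_rate(read_prefix: str, primer: str) -> float:
    """Fraction of primer positions that differ from the read prefix."""
    mismatches = sum(1 for a, b in zip(read_prefix, primer) if a != b)
    return mismatches / len(primer)


def count_passed(reads: list, primer: str, max_error: float) -> int:
    """Count reads whose leading bases match the primer within max_error."""
    n = len(primer)
    passed = 0
    for read in reads:
        if len(read) < n:
            continue  # too short to contain the whole primer
        if mismatch_rate(read[:n], primer) <= max_error:
            passed += 1
    return passed


def sweep(reads: list, primer: str, thresholds: list) -> dict:
    """Map each error rate threshold to the number of passing reads."""
    return {t: count_passed(reads, primer, t) for t in thresholds}


# Invented toy data: a 10-nt primer and four reads.
primer = "ACGTACGTAC"
reads = [
    "ACGTACGTACGGG",  # 0 mismatches
    "ACGTACGTAAGGG",  # 1 mismatch  (rate 0.1)
    "ACGTTTGTAAGGG",  # 3 mismatches (rate 0.3)
    "TTTT",           # shorter than the primer, never passes
]
for threshold, n_passed in sweep(reads, primer, [0.0, 0.1, 0.2, 0.3]).items():
    print(threshold, n_passed)
```

Even in this toy setting the passed-read count changes with the threshold, which is why the threshold needs to be recorded alongside the results.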
