DnaSP v5: a software for comprehensive analysis of DNA polymorphism data

Librado, P.; Rozas, J.

doi:10.1093/bioinformatics/btp187

Abstract

Motivation: DnaSP is a software package for a comprehensive analysis of DNA polymorphism data. Version 5 implements a number of new features and analytical methods allowing extensive DNA polymorphism analyses on large datasets. Among other features, the newly implemented methods allow for: (i) analyses on multiple data files; (ii) haplotype phasing; (iii) analyses on insertion/deletion polymorphism data; (iv) visualizing sliding window results integrated with available genome annotations in the UCSC browser.

Availability: Freely available to academic users from: http://www.ub.edu/dnasp

Contact: jrozas@ub.edu

1 INTRODUCTION

The analysis of DNA polymorphisms is a powerful approach to understand the evolutionary process and to establish the functional significance of particular genomic regions (Begun et al., 2007; Nielsen, 2005; Rosenberg and Nordborg, 2002). In this context, estimating the impact of natural selection (both positive and negative) is of major interest. Furthermore, DNA polymorphisms are relevant as a tool for a broad range of life science disciplines. Consequently, many high-throughput sequencing, genotyping and polymorphism detection systems have been developed and are currently publicly available (Shendure and Ji, 2008). These new technologies are generating massive amounts of data that need to be processed, analyzed and transformed effectively into knowledge.

These technological advances have largely stimulated the development of both analytical methods and computer applications. Population genetic methods, and particularly those based on coalescent theory (Hudson, 1990; Wakeley, 2009), are used at an increasing rate, but need to be adapted to the particularities of the data (massive amounts of data, missing data, genotypes, insertion/deletion (indels) polymorphisms, etc.). Furthermore, new computer applications and algorithms need to be developed for processing massive datasets (Excoffier and Heckel, 2006), and more specifically computer visualization tools for the representation of DNA variation patterns. DnaSP (DNA Sequence Polymorphism) is a software package that allows for extensive DNA polymorphism analyses using a friendly graphical user interface (GUI) (Rozas et al., 2003). Version 5 extends the capabilities of the software, allowing comprehensive DNA polymorphism analyses on multiple data files and on large datasets. Altogether, the present version of DnaSP has the appropriate features for exhaustive exploratory analyses using high-throughput DNA polymorphism data.

2 FEATURES

DnaSP v5 incorporates major improvements. The new version allows for the handling and analysis of multiple data files in batch, and implements new algorithms and methods. Among other things (see below), it includes a new module to identify conserved DNA regions; this feature might be useful for phylogenetic footprinting-based analyses (Vingron et al., 2009). DnaSP provides a convenient GUI facilitating all data management and analytical tasks; the results can be visualized graphically as well as in a text report. DnaSP accepts multiple DNA sequence alignment file formats (Rozas et al., 2003), including NEXUS (Maddison et al., 1997), and HapMap3 files with phased haplotypes (The International HapMap Consortium, 2003). The software allows exhaustive DNA polymorphism analyses, including those based on coalescent theory (Rozas et al., 2003; Wakeley, 2009).

2.1 Haplotype reconstruction

Haplotype reconstruction aims at resolving haplotype phase given genotypic information. DnaSP implements statistical methods to infer haplotype phase, and adequately prepares the phased data for subsequent analyses. The input data (unphased genotype data) are required in FASTA format using IUPAC nucleotide ambiguity codes to represent heterozygous sites. DnaSP reconstructs the phase by applying various algorithms (PHASE v2.1, fastPHASE v1.1 and HAPAR) differing in the underlying population genetic assumptions. PHASE (Stephens and Donnelly, 2003; Stephens et al., 2001) assumes Hardy–Weinberg equilibrium and uses a coalescent-based Bayesian method to infer haplotypes. fastPHASE (Scheet and Stephens, 2006) implements a modification of the PHASE algorithm taking into account the patterns of linkage disequilibrium and its gradual decline with physical distance. This algorithm is faster and allows for the handling of larger datasets than PHASE, while being slightly less accurate. HAPAR (Wang and Xu, 2003) infers haplotype phase by maximum parsimony, i.e. it attempts to find the minimum number of haplotypes explaining the genotype sample.
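
To illustrate the input encoding described above, the following Python sketch expands an unphased genotype written with IUPAC ambiguity codes into every haplotype pair compatible with it. It is not the PHASE, fastPHASE or HAPAR algorithm; it only enumerates the candidate space from which such methods choose. The function name and the restriction to two-allele codes are assumptions made for this illustration.

```python
from itertools import product

# IUPAC codes for two-allele (heterozygous) sites; other characters are kept as-is.
IUPAC_HET = {
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}

def candidate_haplotype_pairs(genotype):
    """Enumerate all haplotype pairs compatible with an unphased genotype string."""
    het_sites = [i for i, c in enumerate(genotype) if c in IUPAC_HET]
    if not het_sites:
        return [(genotype, genotype)]  # fully homozygous: phase is trivial
    pairs = []
    # Fix the orientation of the first heterozygous site so that
    # mirrored pairs (h1, h2) and (h2, h1) are not counted twice.
    for flips in product((0, 1), repeat=len(het_sites) - 1):
        orientation = (0,) + flips
        h1, h2 = list(genotype), list(genotype)
        for site, o in zip(het_sites, orientation):
            a, b = IUPAC_HET[genotype[site]]
            h1[site], h2[site] = (a, b) if o == 0 else (b, a)
        pairs.append(("".join(h1), "".join(h2)))
    return pairs

print(candidate_haplotype_pairs("ARCY"))
```

A genotype with k heterozygous sites has 2^(k-1) compatible haplotype pairs, which is why statistical or parsimony criteria are needed to choose among them as k grows.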

2.2 Deletion/insertion polymorphisms

Deletion/insertion polymorphisms (DIPs) analysis can provide insights into the evolutionary forces acting on DNA. This information, however, has rarely been used. One obstacle has been the difficulty of defining clearly homologous states (Young and Healy, 2003). DnaSP incorporates an algorithm for treating indels related to the ‘simple indel coding’ method of Simmons and Ochoterena (2000). Specifically, only indels with the same 5′ and 3′ termini are considered homologous (resulting from a single event), and indels of different lengths (even in the same position of the alignment) are treated as different events. DnaSP, nevertheless, uses a slightly different method for coding completely overlapping gaps, and allows the user to choose the level of overlap to be coded. Subsequently, DnaSP estimates a number of DIP summary statistics, such as the average indel length, indel diversity, as well as Tajima's D (Tajima, 1989) based on indel information. Additionally, it exports the recoded data in a NEXUS format file.
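
The homology rule above can be sketched in Python as follows. This is a minimal illustration of generic simple indel coding, assuming gaps are written as '-' in an aligned set of equal-length sequences; it does not reproduce DnaSP's own treatment of completely overlapping gaps, and the function names are hypothetical.

```python
import re

def gap_runs(seq):
    """Return (start, end) half-open intervals for every run of '-' in a sequence."""
    return [m.span() for m in re.finditer(r"-+", seq)]

def simple_indel_coding(alignment):
    """Code each distinct gap (same 5' and 3' termini) as one presence/absence character."""
    runs = [set(gap_runs(seq)) for seq in alignment]
    events = sorted(set().union(*runs))  # one character per distinct gap
    matrix = []
    for seq_runs in runs:
        row = []
        for start, end in events:
            if (start, end) in seq_runs:
                row.append("1")  # this sequence carries exactly this indel
            elif any(s <= start and end <= e for s, e in seq_runs):
                row.append("?")  # a longer gap spans the event: state not scorable
            else:
                row.append("0")  # indel absent
        matrix.append("".join(row))
    return events, matrix

alignment = ["ACGTACGT", "AC--ACGT", "AC--ACGT", "A----CGT", "ACGTAC-T"]
events, matrix = simple_indel_coding(alignment)
print(events)
print(matrix)
```

In this toy alignment the gaps at columns 2-3 and 1-4 overlap but have different termini, so they are coded as two separate events rather than as one.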

2.3 Analysis of multiple data files

DnaSP can automatically read and analyze multiple data files sequentially (in batch mode). These data files may contain a varying number of sequences (from within one species, or from one species as well as one outgroup), or represent diverse genomic regions. The program estimates the most common DNA polymorphism and divergence summary statistics (such as the nucleotide and haplotype diversity, the population mutation parameter, the number of nucleotide substitutions per site, etc.), and neutrality tests (such as Tajima's, Fu and Li's and Fu's tests).
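
As a point of reference for the statistics listed above, the sketch below computes the number of segregating sites, nucleotide diversity, Watterson's estimator of the population mutation parameter and Tajima's D (Tajima, 1989) for one alignment. It is a plain-Python illustration, not DnaSP's code, and it assumes at least four aligned sequences of equal length without gaps or missing data; the function name and the returned dictionary keys are hypothetical.

```python
from itertools import combinations
from math import sqrt

def polymorphism_summary(seqs):
    """Return S, per-site pi, per-site Watterson's theta and Tajima's D.

    Assumes at least four equal-length sequences without gaps or missing data.
    """
    n, length = len(seqs), len(seqs[0])
    # Number of segregating (polymorphic) sites.
    S = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    # Average number of pairwise nucleotide differences, k.
    pairs = list(combinations(seqs, 2))
    k = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    # Constants of the variance term in Tajima (1989).
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * S + e2 * S * (S - 1)
    # D is undefined when there is no variation.
    D = (k - S / a1) / sqrt(variance) if variance > 0 else None
    return {"S": S, "pi": k / length, "theta_w": S / a1 / length, "D": D}

result = polymorphism_summary(["TAAA", "ATAA", "AATA", "AAAT"])
print(result)
```

In this toy sample every polymorphism is a singleton, so the average pairwise difference falls below its expectation from the number of segregating sites and D is negative. A batch analysis amounts to applying such a function to each data file in turn.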

2.4 Sliding window results visualization

The sliding window technique is a useful tool for exploratory DNA polymorphism data analysis (Hutter et al., 2006; Rozas et al., 2003; Vilella et al., 2005). The current version of DnaSP permits visualizing results of the sliding window (for example, nucleotide diversity or Tajima's D values along the DNA sequence) integrating available genome annotations in the UCSC browser (Kent et al., 2002). This feature can greatly facilitate the interpretation of the results; for instance, it is possible to identify the relevant genome annotations (genes, intergenic regions, conserved regions, etc.) that are adjacent to regions with atypical patterns of nucleotide variation.
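
The idea can be sketched in Python as follows: nucleotide diversity is computed in successive windows and written as a UCSC bedGraph custom track. The paper does not state which track format DnaSP exports, so the use of bedGraph here is an assumption, as are the function names, the chromosome name and the genomic offset; gaps and missing data are not handled.

```python
from itertools import combinations

def window_pi(seqs, start, end):
    """Average pairwise differences per site in alignment columns [start, end)."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(
        sum(a != b for a, b in zip(x[start:end], y[start:end])) for x, y in pairs
    )
    return diffs / len(pairs) / (end - start)

def sliding_windows(seqs, width, step):
    """Yield (start, end, pi) for each window along the alignment."""
    length = len(seqs[0])
    for start in range(0, length - width + 1, step):
        yield start, start + width, window_pi(seqs, start, start + width)

def to_bedgraph(windows, chrom, offset, name="pi"):
    """Format windows as bedGraph text (0-based, half-open coordinates)."""
    lines = [f'track type=bedGraph name="{name}"']
    for start, end, value in windows:
        lines.append(f"{chrom}\t{offset + start}\t{offset + end}\t{value:.5f}")
    return "\n".join(lines)

seqs = ["AAAATTTT", "AAAATTTA", "AAAATTAA"]
# step == width keeps the intervals non-overlapping, as bedGraph data expect.
track = to_bedgraph(sliding_windows(seqs, width=4, step=4), chrom="chr1", offset=1000)
print(track)
```

The offset maps alignment columns onto genome coordinates, which is what allows the values to be displayed alongside the annotation tracks of the browser.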

3 IMPLEMENTATION

DnaSP version 5 has been developed in Microsoft Visual Basic v6.0, C and C++, and it runs under Microsoft Windows operating systems (2000/XP/Vista). With the use of Windows emulators, DnaSP can also run on Apple Macintosh platforms, Linux and Unix-based operating systems. The software has been tested on all three platforms.

ACKNOWLEDGEMENTS

We acknowledge Sergios-Orestis Kolokotronis for helpful comments on the manuscript. Special thanks to the numerous users who tested the software with their data, and particularly to all members of the Molecular Evolutionary Genetics group at the Departament de Genètica, Universitat de Barcelona.

Funding: Spanish Dirección General de Investigación Científica y Técnica (grants BFU2004-02253 and BFU2007-62927); the Catalonian Comissió Interdepartamental de Recerca i Innovació Tecnològica (grant 2005SGR00166).

Conflict of Interest: none declared.

REFERENCES

Begun DJ, et al. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans, PLoS Biol., 2007, vol. 6 pg. e310

Excoffier L, Heckel G. Computer programs for population genetics data analysis: a survival guide, Nat. Rev. Genet., 2006, vol. 7 (pg. 745-758)

Hudson RR. Gene genealogies and the coalescent process, Oxf. Surv. Evol. Biol., 1990, vol. 7 (pg. 1-44)

Hutter S, et al. Genome-wide DNA polymorphism analyses using VariScan, BMC Bioinformatics, 2006, vol. 7 pg. 409

Kent WJ, et al. The Human Genome Browser at UCSC, Genome Res., 2002, vol. 12 (pg. 996-1006)

Maddison WP, et al. NEXUS: an extendible file format for systematic information, Syst. Biol., 1997, vol. 46 (pg. 590-621)

Nielsen R. Molecular signatures of natural selection, Annu. Rev. Genet., 2005, vol. 39 (pg. 197-218)

Rosenberg NA, Nordborg M. Genealogical trees, coalescent theory, and the analysis of genetic polymorphisms, Nat. Rev. Genet., 2002, vol. 3 (pg. 380-390)

Rozas J, et al. DnaSP, DNA polymorphism analyses by the coalescent and other methods, Bioinformatics, 2003, vol. 19 (pg. 2496-2497)

Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase, Am. J. Hum. Genet., 2006, vol. 78 (pg. 629-644)

Shendure J, Ji H. Next-generation DNA sequencing, Nat. Biotechnol., 2008, vol. 26 (pg. 1135-1145)

Simmons MP, Ochoterena H. Gaps as characters in sequence-based phylogenetic analyses, Syst. Biol., 2000, vol. 49 (pg. 369-381)

Stephens M, Donnelly P. A comparison of Bayesian methods for haplotype reconstruction from population genotype data, Am. J. Hum. Genet., 2003, vol. 73 (pg. 1162-1169)

Stephens M, et al. A new statistical method for haplotype reconstruction from population data, Am. J. Hum. Genet., 2001, vol. 68 (pg. 978-989)

Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism, Genetics, 1989, vol. 123 (pg. 585-595)

The International HapMap Consortium. The International HapMap Project, Nature, 2003, vol. 426 (pg. 789-796)

Vilella AJ, et al. VariScan: analysis of evolutionary patterns from large-scale DNA sequence polymorphism data, Bioinformatics, 2005, vol. 21 (pg. 2791-2793)

Vingron M, et al. Integrating sequence, evolution and functional genomics in regulatory genomics, Genome Biol., 2009, vol. 10 pg. 202

Wang L, Xu Y. Haplotype inference by maximum parsimony, Bioinformatics, 2003, vol. 19 (pg. 1773-1780)

Wakeley J. Coalescent Theory. An Introduction, 2009, Greenwood Village, Roberts and Company Publishers

Young ND, Healy J. GapCoder automates the use of indel characters in phylogenetic analysis, BMC Bioinformatics, 2003, vol. 4 pg. 6

Author notes

Associate Editor: Martin Bishop
