Genome Res. 1999 Sep;9(9):868-77.
doi: 10.1101/gr.9.9.868.

CAP3: A DNA sequence assembly program

X Huang et al. Genome Res. 1999 Sep.

Abstract

We describe the third generation of the CAP sequence assembly program. The CAP3 program includes a number of improvements and new features. It can clip low-quality regions at the 5' and 3' ends of reads. It uses base quality values when computing overlaps between reads, constructing multiple sequence alignments of reads, and generating consensus sequences. It also uses forward-reverse constraints to correct assembly errors and to link contigs. Results of CAP3 on four BAC data sets are presented, and its performance is compared with that of PHRAP on a number of BAC data sets. PHRAP often produces longer contigs than CAP3, whereas CAP3 often produces fewer errors in consensus sequences than PHRAP. On low-pass data with forward-reverse constraints, scaffolds are easier to construct with CAP3 than with PHRAP.

Figures

Figure 1. Major steps of the assembly algorithm.
Figure 2. Computation of the 5′ and 3′ clipping positions of read f. Read f has high local similarities to reads g and h. A pair of broken lines shows the start and end positions of a similarity. A thick line indicates the high-quality region of a read.
Figure 3. Computation of a global alignment of clean reads f and g with the maximum score over a band. The rectangle represents the dynamic programming matrix, with the rows corresponding to the bases of read f and the columns to the bases of read g. The band is indicated by a shaded area, and the start position of an optimal local alignment between raw reads f and g is indicated by a dot.
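
The banded computation in Figure 3 can be illustrated with a small dynamic program. This is a minimal sketch, not CAP3's implementation: it assumes fixed match, mismatch, and linear gap scores (the values below are illustrative only), ignores base quality values, and takes the band as a range of diagonals `lo <= j - i <= hi`, where `i` indexes read f (rows) and `j` indexes read g (columns). The function name and parameters are hypothetical.

```python
NEG_INF = float("-inf")


def banded_global_score(f, g, lo, hi, match=2, mismatch=-5, gap=-6):
    """Best global alignment score of f and g using only cells (i, j)
    whose diagonal j - i lies in [lo, hi].

    Scores are illustrative assumptions; gaps use a linear penalty.
    """
    n, m = len(f), len(g)
    # A global alignment starts at (0, 0) and ends at (n, m), so both
    # corner diagonals (0 and m - n) must lie inside the band.
    if lo > min(0, m - n) or hi < max(0, m - n):
        raise ValueError("band must contain both corners of the matrix")

    # Row 0: only columns inside the band are reachable.
    # For clarity each row is a full-length list; a production version
    # would store only the cells inside the band.
    prev = [NEG_INF] * (m + 1)
    for j in range(0, min(m, hi) + 1):
        prev[j] = j * gap

    for i in range(1, n + 1):
        curr = [NEG_INF] * (m + 1)
        for j in range(max(0, i + lo), min(m, i + hi) + 1):
            if j == 0:
                curr[j] = i * gap  # f[:i] aligned entirely to gaps
                continue
            sub = match if f[i - 1] == g[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + sub,   # align f[i-1] with g[j-1]
                prev[j] + gap,       # gap in g
                curr[j - 1] + gap,   # gap in f
            )
        prev = curr
    return prev[m]


if __name__ == "__main__":
    print(banded_global_score("ACGT", "AGT", lo=-1, hi=1))
```

Cells outside the band keep a score of negative infinity, so no path can pass through them. Narrowing the band limits the work per row to the band width, at the cost of missing alignments that stray outside it.
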
Figure 4. An unsatisfied constraint involving reads h and r with a distance range of x to y bp. The orientations of reads h and r are indicated by arrows. (A) The constraint is satisfiable by an unused overlap from reads f to g, with x ≤ d + e ≤ y. (B) The constraint serves as a link between two contigs, with d + e ≤ y.
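
The two cases in Figure 4 reduce to inequality checks on d + e. The sketch below assumes d and e are the two distances marked in the figure (the caption does not define them further) and that the caller already knows whether an unused overlap from f to g exists. The function name, the `has_unused_overlap` flag, and the returned labels are hypothetical.

```python
def classify_constraint(d, e, x, y, has_unused_overlap):
    """Classify an unsatisfied forward-reverse constraint.

    d, e: distances as drawn in Figure 4 (meaning assumed from the figure).
    x, y: allowed distance range for the read pair, in bp.
    has_unused_overlap: True if an unused overlap from f to g exists
        (hypothetical input; how it is found is not covered here).
    """
    if x > y:
        raise ValueError("distance range requires x <= y")
    total = d + e
    if has_unused_overlap and x <= total <= y:
        return "satisfiable"   # case (A): x <= d + e <= y
    if not has_unused_overlap and total <= y:
        return "link"          # case (B): d + e <= y
    return "unresolved"        # neither condition in the caption holds


if __name__ == "__main__":
    print(classify_constraint(d=400, e=900, x=1000, y=3000,
                              has_unused_overlap=True))
```

Case (B) has no lower bound because the gap between the two contigs is unknown, so only the upper limit y can be checked.
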
Figure 5. An example for calculation of scores of a match, a mismatch, a deletion, and an insertion. The quality values of bases are shown next to the bases. In each case, the average quality value of the column and the score are presented, where positive integer m is the match score factor, negative integer n is the mismatch score factor, and positive integer g is the gap extension penalty factor.
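
One plausible reading of Figure 5 is that each score is the relevant factor multiplied by the average quality value of the column. The caption does not state the formula, so the sketch below is an assumption rather than the paper's definition. It also treats deletions and insertions alike as a gap column, and the function name and the `kind` labels are hypothetical.

```python
def column_score(kind, qualities, m, n, g):
    """Quality-weighted score of one alignment column.

    Assumed rule (not confirmed by the caption):
        score = factor * average quality of the column
    with factor m for a match, n for a mismatch, and -g for a gap.
    """
    if m <= 0 or n >= 0 or g <= 0:
        raise ValueError("expected m > 0, n < 0, g > 0")
    if not qualities:
        raise ValueError("column needs at least one quality value")
    avg = sum(qualities) / len(qualities)
    if kind == "match":
        return m * avg
    if kind == "mismatch":
        return n * avg
    if kind == "gap":          # deletion or insertion
        return -g * avg
    raise ValueError(f"unknown column kind: {kind!r}")


if __name__ == "__main__":
    # Illustrative factors and quality values, not taken from the figure.
    for kind in ("match", "mismatch", "gap"):
        print(kind, column_score(kind, [40, 20], m=2, n=-5, g=6))
```

Under this assumed rule, disagreements between high-quality bases are penalized more heavily than disagreements involving low-quality bases, which matches the abstract's statement that quality values enter the alignment computation.
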
