Skip to content

Latest commit

 

History

History
790 lines (679 loc) · 63.1 KB

formats.md

File metadata and controls

790 lines (679 loc) · 63.1 KB

VADR output file formats


Format of generic VADR output files created by all VADR scripts

All VADR scripts (e.g. v-build.pl and v-annotate.pl) create a common set of three output files. These files are named <outdir>.vadr.<suffix> where <suffix> is either log, cmd or filelist and <outdir> is the command line argument that specifies the name of the output directory to create. These files are the three types of generic files that are supported by the Sequip ofile Perl module.

suffix description
.log log of steps taken by the VADR script (this is identical to the standard output of the program)
.cmd list of the commands run using Perl's system command internally by the VADR script
.filelist list of output files created by the VADR script

Each format is explained in more detail below.


Explanation of .log-suffixed output files

The .log files include the same text that is printed to standard output. The documentation on v-annotate.pl and v-build.pl usage go over this output in more detail.


Explanation of .cmd-suffixed output files

The .cmd files simply list all the commands run by the Perl system function internally by the VADR script, each separated by a newline. The final three lines are special. The third-to-last line lists the date and time just before the script completed execution. The second-to-last line lists system info (output by the uname -a unix command). The final line is either [ok] or [fail], depending on if the script ended successfully (zero return status) or did not (non-zero return status), respectively. If this line is [fail] it will be followed by an error message. An example .cmd output file for the example command v-build.pl -f --group Norovirus --subgroup GI NC_039897 NC_039897 is:

rm -rf NC_039897
mkdir NC_039897
/home/nawrocki/vadr-install-dir/infernal/binaries/esl-reformat --informat afa  stockholm NC_039897/NC_039897.vadr.fa > NC_039897/NC_039897.vadr.stk
/home/nawrocki/vadr-install-dir/infernal/binaries/esl-translate  -M -l 3 --watson NC_039897/NC_039897.vadr.cds.fa > NC_039897/NC_039897.vadr.cds.esl-translate.1.fa
rm NC_039897/NC_039897.vadr.cds.esl-translate.1.fa
rm NC_039897/NC_039897.vadr.cds.esl-translate.2.fa
rm NC_039897/NC_039897.vadr.cds.esl-translate.2.fa.ssi
/home/nawrocki/vadr-install-dir/ncbi-blast/bin/makeblastdb -in NC_039897/NC_039897.vadr.protein.fa -dbtype prot > /dev/null
/home/nawrocki/vadr-install-dir/infernal/binaries/esl-sfetch NC_039897/NC_039897.vadr.protein.fa NC_039897.1/5..5404:+ | /home/nawrocki/vadr-install-dir/hmmer/binaries/hmmbuild -n NC_039897/5..5404:+ --informat afa NC_039897/NC_039897.vadr.1.hmm - > NC_039897/NC_039897.vadr.1.hmmbuild
/home/nawrocki/vadr-install-dir/infernal/binaries/esl-sfetch NC_039897/NC_039897.vadr.protein.fa NC_039897.1/5388..7025:+ | /home/nawrocki/vadr-install-dir/hmmer/binaries/hmmbuild -n NC_039897/5388..7025:+ --informat afa NC_039897/NC_039897.vadr.2.hmm - > NC_039897/NC_039897.vadr.2.hmmbuild
/home/nawrocki/vadr-install-dir/infernal/binaries/esl-sfetch NC_039897/NC_039897.vadr.protein.fa NC_039897.1/7025..7672:+ | /home/nawrocki/vadr-install-dir/hmmer/binaries/hmmbuild -n NC_039897/7025..7672:+ --informat afa NC_039897/NC_039897.vadr.3.hmm - > NC_039897/NC_039897.vadr.3.hmmbuild
cat NC_039897/NC_039897.vadr.1.hmm NC_039897/NC_039897.vadr.2.hmm NC_039897/NC_039897.vadr.3.hmm > NC_039897/NC_039897.vadr.protein.hmm
rm NC_039897/NC_039897.vadr.1.hmm
rm NC_039897/NC_039897.vadr.2.hmm
rm NC_039897/NC_039897.vadr.3.hmm
cat NC_039897/NC_039897.vadr.1.hmmbuild NC_039897/NC_039897.vadr.2.hmmbuild NC_039897/NC_039897.vadr.3.hmmbuild > NC_039897/NC_039897.vadr.protein.hmmbuild
rm NC_039897/NC_039897.vadr.1.hmmbuild
rm NC_039897/NC_039897.vadr.2.hmmbuild
rm NC_039897/NC_039897.vadr.3.hmmbuild
/home/nawrocki/vadr-install-dir/hmmer/binaries/hmmpress NC_039897/NC_039897.vadr.protein.hmm > NC_039897/NC_039897.vadr.hmmpress
/home/nawrocki/vadr-install-dir/infernal/binaries/cmbuild -n NC_039897 --verbose --occfile NC_039897/NC_039897.vadr.cmbuild.occ --cp9occfile NC_039897/NC_039897.vadr.cmbuild.cp9occ --fp7occfile NC_039897/NC_039897.vadr.cmbuild.fp7occ  --noss NC_039897/NC_039897.vadr.cm NC_039897/NC_039897.vadr.stk > NC_039897/NC_039897.vadr.cmbuild
/home/nawrocki/vadr-install-dir/infernal/binaries/cmpress NC_039897/NC_039897.vadr.cm > NC_039897/NC_039897.vadr.cmpress
/home/nawrocki/vadr-install-dir/infernal/binaries/cmfetch NC_039897/NC_039897.vadr.cm NC_039897 | /home/nawrocki/vadr-install-dir/infernal/binaries/cmemit -c - > NC_039897/NC_039897.vadr.nt-cseq.fa
/home/nawrocki/vadr-install-dir/ncbi-blast/bin/makeblastdb -in NC_039897/NC_039897.vadr.nt.fa -dbtype nucl > /dev/null
rm NC_039897/NC_039897.vadr.nt-cseq.fa
# Wed May  6 18:24:56 EDT 2020
# Darwin Erics-MBP.lan 19.0.0 Darwin Kernel Version 19.0.0: Wed Oct 23 18:29:05 PDT 2019; root:xnu-6153.41.3~44/RELEASE_X86_64 x86_64
[ok]

Explanation of .filelist-suffixed output files

The .filelist files list the output files created by the VADR script. This list will typically include at least those files printed in the file output file list section[log-outputfilelist] of the .log file, and sometimes more. Each line includes a brief description of each file. An example .filelist output file for the example command v-build.pl -f --group Norovirus --subgroup GI NC_039897 NC_039897 is:

# fasta file for NC_039897 saved in:                                               NC_039897.vadr.fa
# feature table format file for NC_039897 saved in:                                NC_039897.vadr.tbl
# feature table format file for YP_009538340.1 saved in:                           NC_039897.vadr.YP_009538340.1.tbl
# feature table format file for YP_009538341.1 saved in:                           NC_039897.vadr.YP_009538341.1.tbl
# feature table format file for YP_009538342.1 saved in:                           NC_039897.vadr.YP_009538342.1.tbl
# Stockholm alignment file for NC_039897 saved in:                                 NC_039897.vadr.stk
# fasta sequence file for CDS from NC_039897 saved in:                             NC_039897.vadr.cds.fa
# fasta sequence file for translated CDS from NC_039897 saved in:                  NC_039897.vadr.protein.fa
# BLAST db .phr file for NC_039897 saved in:                                       NC_039897.vadr.protein.fa.phr
# BLAST db .pin file for NC_039897 saved in:                                       NC_039897.vadr.protein.fa.pin
# BLAST db .psq file for NC_039897 saved in:                                       NC_039897.vadr.protein.fa.psq
# BLAST db .pdb file for NC_039897 saved in:                                       NC_039897.vadr.protein.fa.pdb
# BLAST db .pot file for NC_039897 saved in:                                       NC_039897.vadr.protein.fa.pot
# BLAST db .ptf file for NC_039897 saved in:                                       NC_039897.vadr.protein.fa.ptf
# BLAST db .pto file for NC_039897 saved in:                                       NC_039897.vadr.protein.fa.pto
# esl-sfetch index file for Bio::Easel::SqFile=HASH(0x7fe64f03cca0) saved in:      NC_039897.vadr.protein.fa.ssi
# HMMER model db file for NC_039897 saved in:                                      NC_039897.vadr.protein.hmm
# hmmbuild build output (concatenated) saved in:                                   NC_039897.vadr.protein.hmmbuild
# binary HMM and p7 HMM filter file saved in:                                      NC_039897.vadr.protein.hmm.h3m
# SSI index for binary HMM file saved in:                                          NC_039897.vadr.protein.hmm.h3i
# optimized p7 HMM filters (MSV part) saved in:                                    NC_039897.vadr.protein.hmm.h3f
# optimized p7 HMM filters (remainder) saved in:                                   NC_039897.vadr.protein.hmm.h3p
# hmmpress output file saved in:                                                   NC_039897.vadr.hmmpress
# CM file saved in:                                                                NC_039897.vadr.cm
# cmbuild output file saved in:                                                    NC_039897.vadr.cmbuild
# binary CM and p7 HMM filter file saved in:                                       NC_039897.vadr.cm.i1m
# SSI index for binary CM file saved in:                                           NC_039897.vadr.cm.i1i
# optimized p7 HMM filters (MSV part) saved in:                                    NC_039897.vadr.cm.i1f
# optimized p7 HMM filters (remainder) saved in:                                   NC_039897.vadr.cm.i1p
# cmpress output file saved in:                                                    NC_039897.vadr.cmpress
# fasta sequence file with cmemit consensus sequence for NC_039897 saved in:       NC_039897.vadr.nt.fa
# BLAST db .nhr file for NC_039897 saved in:                                       NC_039897.vadr.nt.fa.nhr
# BLAST db .nin file for NC_039897 saved in:                                       NC_039897.vadr.nt.fa.nin
# BLAST db .nsq file for NC_039897 saved in:                                       NC_039897.vadr.nt.fa.nsq
# BLAST db .ndb file for NC_039897 saved in:                                       NC_039897.vadr.nt.fa.ndb
# BLAST db .not file for NC_039897 saved in:                                       NC_039897.vadr.nt.fa.not
# BLAST db .ntf file for NC_039897 saved in:                                       NC_039897.vadr.nt.fa.ntf
# BLAST db .nto file for NC_039897 saved in:                                       NC_039897.vadr.nt.fa.nto
# VADR 'model info' format file for NC_039897 saved in:                            NC_039897.vadr.minfo

Format of v-build.pl output files

v-build.pl creates many output files. These files are named <outdir>.vadr.<suffix> where <outdir> is the second command line argument given to v-build.pl. The following table lists many of the output files with a brief description and in some cases further references on the file type/format. The .minfo file format is documented further below. Example files were all created with the v-build.pl -f --group Norovirus --subgroup GI NC_039897 NC_039897 command.

file suffix description ....example_file.... reference
.minfo VADR model info file NC_039897.vadr.minfo description of format in this document
.tbl 5 column tab-delimited feature table NC_039897.vadr.tbl https://www.ncbi.nlm.nih.gov/genbank/feature_table/
.stk Stockholm alignment format NC_039897.vadr.stk https://en.wikipedia.org/wiki/Stockholm_format, http://eddylab.org/infernal/Userguide.pdf (section 9: "File and output formats")
.vadr.fa FASTA format sequence file for single sequence model was built from NC_039897.vadr.fa https://en.wikipedia.org/wiki/FASTA_format
.cds.fa FASTA format sequence file for CDS features extracted from .vadr.fa file, translated to get .protein.fa files NC_039897.vadr.cds.fa https://en.wikipedia.org/wiki/FASTA_format
.protein.fa FASTA format sequence file for protein translations of .cds.fa file NC_039897.vadr.protein.fa https://en.wikipedia.org/wiki/FASTA_format
.protein.fa.phr, .protein.fa.pin, .protein.fa.psq, .protein.fa.pdb, .protein.fa.pot, .protein.fa.ptf, .protein.fa.pto BLAST database index files, created by makeblastdb - binary files, not meant to be human-readable
.nt.fa FASTA format sequence file of the consensus sequence output from the CM with cmemit NC_039897.vadr.nt.fa https://en.wikipedia.org/wiki/FASTA_format
.nt.fa.nhr, .nt.fa.nin, .nt.fa.nsq, .nt.fa.ndb, .nt.fa.not, .nt.fa.ntf, .nt.fa.nto BLAST database index files, created by makeblastdb - binary files, not meant to be human-readable
.cm Infernal 1.1x covariance model file - http://eddylab.org/infernal/Userguide.pdf (section 9: "File and output formats")
.cm.i1m, .cm.i1i, .cm.i1f, .cm.i1p Infernal 1.1x covariance model index files, created by cmpress - binary files, not meant to be human-readable
.cmbuild Infernal cmbuild output file - no further documentation
.cmpress Infernal cmpress output file - no further documentation
.hmm HMMER 3.x HMM file - http://eddylab.org/software/hmmer/Userguide.pdf ("HMMER profile HMM files" section)
.hmm.h3m, .hmm.h3i, .hmm.h3f, .hmm.h3p HMMER 3.x HMM index files, created by hmmpress - binary files, not meant to be human-readable
.hmmbuild HMMER hmmbuild output file - no further documentation
.hmmpress HMMER hmmpress output file - no further documentation

For examples of file types not included above, see files in the vadr/testfiles/models directory.


Explanation of VADR model info .minfo-suffixed output files

VADR .minfo model info files are created by v-build.pl and read by v-annotate.pl. They can also be created manually. An example model info file created by the command: v-build.pl -f --group Norovirus --subgroup GI NC_039897 NC_039897 with VADR 1.0 is:

MODEL NC_039897 blastdb:"NC_039897.vadr.protein.fa" cmfile:"NC_039897.vadr.cm" group:"Norovirus" length:"7745" subgroup:"GI"
FEATURE NC_039897 type:"gene" coords:"5..5404:+" parent_idx_str:"GBNULL" gene:"ORF1"
FEATURE NC_039897 type:"CDS" coords:"5..5404:+" parent_idx_str:"GBNULL" gene:"ORF1" product:"nonstructural polyprotein"
FEATURE NC_039897 type:"gene" coords:"5388..7025:+" parent_idx_str:"GBNULL" gene:"ORF2"
FEATURE NC_039897 type:"CDS" coords:"5388..7025:+" parent_idx_str:"GBNULL" gene:"ORF2" product:"VP1"
FEATURE NC_039897 type:"gene" coords:"7025..7672:+" parent_idx_str:"GBNULL" gene:"ORF3"
FEATURE NC_039897 type:"CDS" coords:"7025..7672:+" parent_idx_str:"GBNULL" gene:"ORF3" product:"VP2"
FEATURE NC_039897 type:"mat_peptide" coords:"5..1219:+" parent_idx_str:"1" product:"p48"
FEATURE NC_039897 type:"mat_peptide" coords:"1220..2308:+" parent_idx_str:"1" product:"NTPase"
FEATURE NC_039897 type:"mat_peptide" coords:"2309..2908:+" parent_idx_str:"1" product:"p22"
FEATURE NC_039897 type:"mat_peptide" coords:"2909..3328:+" parent_idx_str:"1" product:"VPg"
FEATURE NC_039897 type:"mat_peptide" coords:"3329..3871:+" parent_idx_str:"1" product:"Pro"
FEATURE NC_039897 type:"mat_peptide" coords:"3872..5401:+" parent_idx_str:"1" product:"RdRp"

Model info files have two types of lines:

  1. Model lines begin with MODEL.
  2. Feature lines begin with FEATURE.

(A third type of line is allowed: comment lines prefixed with # are allowed, and ignored.)

MODEL or FEATURE is always followed by one or more whitespace characters and then the model name <modelname> which cannot include whitespace. FEATURE lines for model <modelname> must occur after the MODEL line for <modelname>

After <modelname>, both model and feature lines contain 0 or more <key>:<value> pairs meeting the following criteria:

  • <key> must not include any whitespace or : characters
  • <value> must start and end with " but include no other " characters,
  • <value> may include whitespace characters
  • <key>:<value> pairs must be separated by one or more whitespace characters.
  • <modelname> and the first <key>:<value> pair must be separated by one or more whitespace characters.

To create multiple qualifier values for the same qualifier (e.g. multiple 'note' qualifier values), separate each qualifier value by the string :GBSEP: in the <value> field. For example:

FEATURE NC_039897 type:"mat_peptide" coords:"3872..5401:+" parent_idx_str:"1" product:"RdRp" note:"this is note 1:GBSEP:this is note 2"

Common MODEL line <key>:<value> pairs:

<key> <value> required? relevance
length reference/consensus length of the covariance model (CM) for this model (CLEN lines in CM file) yes required internally
blastdb file name root of the BLAST DB (not including the directory path) only if model has >=1 CDS feature important for protein-validation stage of v-annotate.pl
group group for this model (e.g. Norovirus) only if subgroup <key> is also present for v-annotate.pl, useful for enforcing expected group and also included in output
subgroup subgroup for this model (e.g. GI) no for v-annotate.pl, useful for enforcing expected subgroup and also included in output
exceptions (e.g. dupregin_exc) varies no defines alert exception for a given model reference position range, see more info here

Common FEATURE line <key>:<value> pairs:

<key> <value> required? relevance
type feature type, e.g. CDS yes some alerts are type-specific and some types are handled differently than others; e.g. coding potential of CDS and mat_peptide features is verified
coords coordinate string that defines model positions and strand for this feature in this format yes used to map/annotate features on sequences via alignment to model
parent_idx_str comma-delimited string that lists parent feature indices (in range [0..<nftr-1>]) for this feature, nftr is the total number of features for this model no some alerts are propagated from parent features to children
product product name for this feature no used as name of feature in .tbl output files, if present
gene gene name for this feature no used as name of feature in .tbl output files, if present and product not present
misc_not_failure usually 1 no if the corresponding feature has specific types of fatal alerts, still allow sequence to pass, just make feature a misc_feature in output .tbl file, see here for details
is_deletable usually 1 no if the corresponding feature is completely deleted, non-fatal deletina alert is reported instead of fatal deletins
canon_splice_sites usually 1 no if 1 v-annotate.pl will verify GT/AG splice sites, only relevant for CDS features
alternative_ftr_set name of feature set no v-annotate.pl will choose 1 feature from each feature set to annotate, see example in RSV model here
alternative_ftr_set_subn name of feature set followed by period and integer <d> no v-annotate.pl will only annotate this feature if it chooses the corresponding feature number <d> in the stated feature set, see example in RSV model here
exceptions (e.g. fst_exc) varies no defines alert exception for a given model reference position range, see more info here

VADR model library .minfo files are just individual model .minfo files concatenated together

v-annotate.pl will use as many models as exist in the input .minfo file and input .cm files. The default VADR v1.0 set of models is 197 Caliciviridae and Flaviviridae viral genome RefSeq models. This .minfo and .cm files for this library we created by concatenating the individual .minfo and .cm files output from the corresponding 197 v-build.pl commands for each RefSeq. Additionally, all BLAST database files must be in the same directory in order to use a VADR library. Use the v-annotate.pl -m, -i and -b options to specify paths to alternative .minfo files .cm files and BLAST database directories.


Format of v-annotate.pl output files

v-annotate.pl creates many output files. These files are named <outdir>.vadr.<suffix> where <outdir> is the second command line argument given to v-annotate.pl. The following two tables list many of the output files with a brief description and in some cases further references on the file type/format.

..........suffix.......... ...............description............... ...................example_file................... reference
.pass.list list of sequences that pass, one line per sequence va-noro.9.vadr.pass.list no further documentation
.pass.tbl 5 column tab-delimited feature table of sequences that pass va-noro.9.vadr.pass.tbl https://www.ncbi.nlm.nih.gov/genbank/feature_table/
.fail.list list of sequences that fail, one line per sequence va-noro.9.vadr.fail.list no further documentation
.fail.tbl 5 column tab-delimited feature table of sequences that fail, with information on fatal alerts va-noro.9.vadr.fail.tbl https://www.ncbi.nlm.nih.gov/genbank/feature_table/
.alt.list tab-delimited file of all fatal alerts listed in .fail.tbl va-noro.9.vadr.alt.list description of format in this document
.<m>.<f>.<i>.fa FASTA format sequence file with predicted sequences for feature type <f> number <i> annotated using model <m> from the .minfo file va-noro.9.vadr.NC_039477.CDS.2.fa https://en.wikipedia.org/wiki/FASTA_format, sequence naming conventions described here
.seqstat output of esl-seqstat -a run on input sequence file, with lengths of all sequences va-noro.9.vadr.seqstat no further documentation

There are also ten types of v-annotate.pl tabular output files with fields separated by one or more spaces, that are designed to be easily parseable with simple unix tools or scripts. These files are listed in the table below

suffix description ..........example_file.......... reference
.alc per-alert code information (counts) va-noro.9.vadr.alc description of format in this document
.alt per-alert instance information va-noro.9.vadr.alt description of format in this document
.ftr per-feature information va-noro.9.vadr.ftr description of format in this document
.mdl per-model information va-noro.9.vadr.mdl description of format in this document
.sgm per-segment information va-noro.9.vadr.sgm description of format in this document
.sqa per-sequence annotation information va-noro.9.vadr.sqa description of format in this document
.sqc per-sequence classification information va-noro.9.vadr.sqc description of format in this document
.sda per-sequence seed alignment information (only created if -s used) va-noro-s.9.vadr.sda description of format in this document
.rpn per-sequence N replacement information (only created if -r used) va-noro-r.9.vadr.rpn description of format in this document

All nine types of tabular output files share the following characteristics:

  1. fields are separated by whitespace (with the possible exception of the final field)
  2. comment lines begin with #
  3. data lines begin with a non-whitespace character other than #
  4. all lines are either comment lines or data lines

Each of these nine tabular formats are explained in more detail below. All example files linked to below, except where otherwise stated, were created by the v-annotate.pl example command v-annotate.pl $VADRSCRIPTSDIR/documentation/annotate-files/noro.9.fa va-noro.9.


Explanation of .alc-suffixed output files

.alc data lines have 8 or more fields, the names of which appear in the first two comment lines in each file. There is one data line for each alert code that occurs at least once in the input sequence file that v-annotate.pl processed. Example file.

idx field description
1 idx index of alert code
2 alert code 8 character VADR alert code
3 causes failure yes if this code is fatal and causes the associated input sequence to FAIL, no if this code is non-fatal
4 short description short description of the alert that often maps to error message from NCBI's submission system, multiple alert codes can have the same short description
5 per type feature if this alert pertains to a specific feature in a sequence, sequence if it does not
6 num cases number of instances of this alert in the output (number of rows for this alert in .alt file), can be more than 1 per sequence
7 num seqs number of input sequences with at least one instance of this alert
8 to end long description longer description of the alert, specific to each alert type; this field, unlike all others, contains whitespace

Explanation of .alt-suffixed output files

.alt data lines have 14 or more fields, the names of which appear in the first two comment lines in each file. There is one data line for each alert instance that occurs for each input sequence file that v-annotate.pl processed. Example file.

For more information on the seq coords and mdl coords fields, which have different meanings for different alerts, see here.

For examples using a toy model of different types of alerts, see here.

idx field description
1 idx index of alert instance in format <d1>.<d2>.<d3>, where <d1> is the index of the sequence this alert instance pertains to in the input sequence file, <d2> is the index of the feature this alert instance pertains to (range 1..<n>, where <n> is the number of features in this sequence with at least 1 alert instance) and <d3> is the index of the alert instance for this sequence/feature pair
2 seq name sequence name
2 model name of the best-matching reference model used to annotate this sequence, coordinates in mdl coords pertain to this model
4 ftr type type of the feature this alert instance pertains to (e.g. CDS)
5 ftr name name of the feature this alert instance pertains to
6 ftr idx index (in input model info file) this alert instance pertains to
7 alert code 8 character VADR alert code
8 fail yes if this alert code is fatal (automatically causes the sequence to fail), no if not
9 alert description short description of the alert code that often maps to error message from NCBI's submission system, multiple alert codes can have the same short description
10 seq coords coordinates in the input sequence relevant to the alert, precise meaning differs per alert, more details are here
11 seq len total length of all positions described by coordinates in seq coords
12 mdl coords coordinates in the reference model relevant to the alert, precise meaning differs per alert, more details are here
13 mdl len total length of all positions described by coordinates in mdl coords
14 to end alert detail detailed description of the alert instance, possibly with sequence position information; this field, unlike all others, contains whitespace

Explanation of .ftr-suffixed output files

.ftr data lines have 26 fields, the names of which appear in the first two comment lines in each file. There is one data line for each feature that is annotated for each input sequence file that v-annotate.pl processed. The set of possible features for each input sequence depend on its best-matching model, and can be found in the model info file. Example file.

idx field description
1 idx index of feature in format <d1>.<d2>, where <d1> is the index of the sequence in which this feature is annotated in the input sequence file, <d2> is the index of the feature (range 1..<n>, where <n> is the number of features annotated for this sequence)
2 seq name sequence name in which this feature is annotated
3 seq len length of the sequence with name seq name
4 p/f PASS if this sequence passes, FAIL if it fails (has >= 1 fatal alerts)
5 model name of the best-matching model for this sequence
6 ftr type type of the feature (e.g. CDS)
7 ftr name name of the feature
8 ftr len length of the annotated feature in nucleotides in input sequence
9 ftr idx index (in input model info file) of this feature
10 par idx index (in input model info file) of parent of this feature, -1 if none
11 str strand on which the feature is annotated: + for positive/forward/Watson strand, - for negative/reverse/Crick strand
12 n_from nucleotide start position for this feature in input sequence
13 n_to nucleotide end position for this feature in input sequence, for CDS features this is typically the final position of a stop codon if CDS is not 3' truncated
14 n_instp nucleotide position of stop codon not at n_to, or - if none, will be 5' of n_to if early stop (cdsstopn alert), or 3' of n_to if first stop is 3' of n_to (mutendex or ambgnt3c alert), or ? if no in-frame stop exists 3' of n_from; will always be - if trunc is not no;
15 trc indicates whether the feature is truncated or not, where one or both ends of the feature are missing due to a premature end to the sequence; possible values are no for not truncated; 5' for truncated on the 5' end; 3' for truncated on the 3' end; and 5'&3' for truncated on both the 5' and 3' ends;
16 5'N number of consecutive N ambiguous nucleotide characters at 5' end, starting at n_from, 0 for none
17 3'N number of consecutive N ambiguous nucleotide characters at 3' end, ending at n_to, 0 for none
18 p_from if a CDS feature, the nucleotide start position for this feature based on the blastx protein-validation step, this will always be the first position of a codon in the blastx-predicted translated region
19 p_to if a CDS feature, nucleotide stop position for this feature based on the blastx protein-validation step, this will always be the final position of a codon in the blastx-predicted translated region, typically the final position of the codon immediately upstream (prior) of the stop codon if CDS is not 3' truncated
20 p_instp nucleotide position of stop codon 5' of p_to if an in-frame stop exists before p_to
21 p_sc raw score of best blastx alignment
22 nsa number of segments annotated for this feature
23 nsn number of segments not annotated for this feature
24 seq coords sequence coordinates of feature, see format of coordinate strings
25 mdl coords model coordinates of feature, see format of coordinate strings
26 ftr alerts alerts that pertain to this feature, listed in format SHORT_DESCRIPTION(alertcode), separated by commas if more than one, - if none

Explanation of .mdl-suffixed output files

.mdl data lines have 7 fields, the names of which appear in the first two comment lines in each file. There is one data line for each model that is the best-matching model for at least one sequence in the input file, plus 2 additional lines, a line with *all* in the model field reports the summed counts over all models, and a line with *none* in the model field reports the summed counts for all sequences that did not match any models. This information is also included in the .log output file. Example file.

idx field description
1 idx index of model
2 model name of model
3 group group of model, defined in model info file, or - if none
4 subgroup subgroup of model, defined in model info file, or -' if none
5 num seqs number of sequences for which this model was the best-matching model
6 num pass number of sequences from num seqs that passed with 0 fatal alerts
7 num fail number of sequences from num seqs that failed with >= 1 fatal alerts

Explanation of .sgm-suffixed output files

.sgm data lines have 21 fields, the names of which appear in the first two comment lines in each file. There is one data line for each segment of a feature that is annotated for each input sequence file that v-annotate.pl processed. Each feature is composed of 1 or more segments, as defined by the coords field in the model info file. Example file.

idx field description
1 idx index of segment in format <d1>.<d2>.<d3> where <d1> is the index of the sequence in which this segment is annotated in the input sequence file, <d2> is the index of the feature (range 1..<n1>, where <n1> is the number of features annotated for this sequence) and <d3> is the index of the segment annotated within that feature (range 1..<n2> where <n2> is the number of segments annotated for this feature)
2 seq name sequence name in which this feature is annotated
3 seq len length of the sequence with name seq name
4 p/f PASS if this sequence passes, FAIL if it fails (has >= 1 fatal alerts)
5 model name of the best-matching model for this sequence
6 ftr type type of the feature (e.g. CDS)
7 ftr name name of the feature
8 ftr idx index (in input model info file) of this feature
9 num sgm number of segments annotated for this sequence/feature pair
10 sgm idx index (in feature) of this segment
11 seq from nucleotide start position for this segment in input sequence, will be <= seq to if strand (str) -
12 seq to nucleotide end position for this segment in input sequence, will be >= seq from if strand (str) -
13 mdl from model start position for this segment, will be <= mdl to if strand (str) -
14 mdl to model end position for this segment, will be >= mdl from if strand (str) -
15 sgm len length, in nucleotides, for this annotated segment in the input sequence
16 str strand (+ or -) for this segment in the input sequence
17 trc indicates whether the segment is truncated or not, where one or both ends of the segment are missing due to a premature end to the sequence; possible values are no for not truncated; 5' for truncated on the 5' end; 3' for truncated on the 3' end; and 5'&3' for truncated on both the 5' and 3' ends;
18 5' pp posterior probability of the aligned nucleotide at the 5' boundary of the segment, or - if 5' boundary aligns to a gap (possibly due to a 5' truncation) }
19 3' pp posterior probability of the aligned nucleotide at the 3' boundary of the segment, or - if 3' boundary aligns to a gap (possibly due to a 3' truncation) }
20 5' gap yes if the 5' boundary of the segment is a gap (possibly due to a 5' truncation), else no }
21 3' gap yes if the 3' boundary of the segment is a gap (possibly due to a 3' truncation), else no }

Explanation of .sqa-suffixed output files

.sqa data lines have 14 fields, the names of which appear in the first two comment lines in each file. There is one data line for each sequence in the input sequence file file that v-annotate.pl processed. .sqa files include annotation information for each sequence. .sqc files include classification information for each sequence. Example file.

idx field description
1 seq idx index of sequence in the input file
2 seq name sequence name
3 seq len length of the sequence with name seq name
4 p/f PASS if this sequence passes, FAIL if it fails (has >= 1 fatal alerts)
5 ant yes if this sequence was annotated, no if not, due to a per-sequence alert that prevents annotation
6 best model name of the best-matching model for this sequence
7 grp group of model best model, defined in model info file, or - if none
8 subgp subgroup of model best model, defined in model info file, or -' if none
9 nfa number of features annotated for this sequence
10 nfn number of features in model best model that are not annotated for this sequence
11 nf5 number of annotated features that are 5' truncated
12 nf3 number of annotated features that are 3' truncated
13 nfalt number of per-feature alerts reported for this sequence (does not count per-sequence alerts)
14 seq alerts per-sequence alerts that pertain to this sequence, listed in format SHORT_DESCRIPTION(alertcode), separated by commas if more than one distinct alert, and only listed once per alert type (even if multiple instances of same alert type), - if none

Explanation of .sqc-suffixed output files

.sqc data lines have 21 fields, the names of which appear in the first two comment lines in each file. There is one data line for each sequence in the input sequence file file that v-annotate.pl processed. .sqc files include classification information for each sequence. .sqa files include annotation information for each sequence. For more information on bit scores and bias see the Infernal User's Guide (http://eddylab.org/infernal/Userguide.pdf) Example file.

idx field description
1 seq idx index of sequence in the input file
2 seq name sequence name
3 seq len length of the sequence with name seq name
4 p/f PASS if this sequence passes, FAIL if it fails (has >= 1 fatal alerts)
5 ant yes if this sequence was annotated, no if not, due to a per-sequence alert that prevents annotation
6 model1 name of the best-matching model for this sequence, this is the model with the top-scoring hit for this sequence in the classification stage
7 grp1 group of model model1, defined in model info file, or - if none
8 subgrp1 subgroup of model model1, defined in model info file, or - if none
9 score summed bit score for all hits on strand str to model model1 for this sequence in the classification stage
10 sc/nt bit score per nucleotide; score divided by total length (in sequence positions) of all hits to model model1 on strand str in the classification stage
11 seq cov fraction of sequence positions (seq len) covered by any hit to model model1 on strand str in the coverage determination stage
12 mdl cov fraction of model positions (model length - the number of reference positions in model1) covered by any hit to model model1 on strand str in the coverage determination stage
13 bias summed bit score due to biased composition (deviation from expected nucleotide frequencies) of all hits on strand str to model model1 for this sequence in the coverage determination stage
14 num hits number of hits on strand str to model model1 for this sequence in the coverage determination stage
15 str strand with the top-scoring hit to model1 for this sequence in the classification stage
16 model2 name of the second best-matching model for this sequence, this is the model with the top-scoring hit for this sequence across all hits that are not to model1 nor to any model in the same subgroup as model1 in the classification stage (if model1 does not have a subgroup, all other models are considered not in its subgroup)
17 grp2 group of model model2, defined in model info file, or - if none
18 subgrp2 subgroup of model model2, defined in model info file, or -' if none
19 score diff bit score difference between summed bit score for all hits to model1 on strand str and summed bit score for all hits to model2 on strand with top-scoring hit to model2 in the classification stage
20 diff/nt bit score difference per nucleotide; sc/nt minus sc2/nt where sc2/nt is summed bit score for all hits to model2 on strand with top-scoring hit to model2 in the classification stage
21 seq alerts per-sequence alerts that pertain to this sequence, listed in format SHORT_DESCRIPTION(alertcode), separated by commas if more than one, - if none

Explanation of .sda-suffixed output files

.sda files are only output if the v-annotate.pl -s option is used. .sda data lines have 16 fields, the names of which appear in the first two comment lines in each file. There is one data line for each sequence in the input sequence file file that v-annotate.pl processed. With -s, the alignment from the best-scoring blastn HSP hit is fixed (with some caveats to avoid large gaps and gaps that include start and stop codons) and used as a seed, and only the 5' and 3' regions before and after the seed region are aligned with cmalign or glsearch as described more here. .sda files include information about the seed, 5' and 3' regions. Note that the sequence length fractions in seed fraction, 5'unaln fraction, and 3'unaln fraction will not add up to 1.0 due to overlap between these regions, which is typically 100nt, but can be adjusted with the --s_overhang option to v-annotate.pl. Example file created with the command v-annotate.pl -s $VADRSCRIPTSDIR/documentation/annotate-files/noro.9.fa va-noro-s.9.

idx field description
1 seq idx index of sequence in the input file
2 seq name sequence name
3 seq len length of the sequence with name seq name
4 model name of the best-matching model for this sequence, this is the model with the top-scoring hit for this sequence in the classification stage
5 p/f PASS if this sequence passes, FAIL if it fails (has >= 1 fatal alerts)
6 seed seq sequence coordinates of seed region from blastn, in vadr coords format
7 seed mdl model coordinates of seed region from blastn, in vadr coords format
8 seed fraction fraction of seq len in seed region in seed seq
9 5'unaln seq sequence coordinates of 5' region not covered by seed seq plus some overlap (typically 100nt) subsequently aligned with cmalign or glsearch, in vadr coords format
10 5'unaln mdl model start/stop coordinates for cmalign alignment of 5' region 5'unaln seq, in vadr coords format
11 5'unaln fraction fraction of seq len in 5' region in 5'unaln seq
12 3'unaln seq sequence coordinates of 3' region not covered by seed seq plus some overlap (typically 100nt) subsequently aligned with cmalign or glsearch, in vadr coords format
13 3'unaln mdl model start/stop coordinates for cmalign or glsearch alignment of 3' region 3'unaln seq, in vadr coords format
14 3'unaln fraction fraction of seq len in 3' region in 3'unaln seq
15 program program seed was derived from, either blastn or minimap2, only minimap2 if --minimap2 option used and minimap2 seed length greater than or equal to blastn seed
16 alt-seed fraction fraction of seq len in seed region of alternative seed not used (so blastn derived seed if program is minimap2), else - if program is blastn

Explanation of .rpn-suffixed output files

.rpn files are only output if the v-annotate.pl -r option is used. .rpn data lines have 16 fields, the names of which appear in the first two comment lines in each file. There is one data line for each sequence in the input sequence file file that v-annotate.pl processed. With -r, sequences are preprocessed with blastn and missing regions between blastn hits are identified and examined for Ns. The Ns in some of these regions are replaced with the expected nucleotides from the model as explained more here. .rpn files include information about these missing regions, referred to as gaps in the .rpn column headers and below. Example file created with the command v-annotate.pl -r $VADRSCRIPTSDIR/documentation/annotate-files/noro.9.r.fa va-noro-r.9.

idx field description
1 seq idx index of sequence in the input file
2 seq name sequence name
3 seq len length of the sequence with name seq name
4 model name of the best-matching model for this sequence, this is the model with the top-scoring hit for this sequence in the classification stage
5 p/f PASS if this sequence passes, FAIL if it fails (has >= 1 fatal alerts)
6 num_Ns tot total number of Ns in the sequence
7 num_Ns rp number of Ns in the sequence replaced with expected nucleotides from the model consensus
8 fract_Ns rp fraction of Ns replaced: num_Ns rp/num_Ns tot
9 nregs tot number of regions between preprocessing stage blastn hits, including region at 5' and 3' ends
10 nregs int number of internal region between preprocessing stage blastn hits, nreg tot minus number of regions at 5' and/or 3' end
11 nregs rp number of regions in which one or more Ns were replaced
12 nregs rp-full number of regions in which entire region was Ns and all Ns were replaced
13 nregs rp-part number of regions in which entire region was not Ns, but all Ns were replaced
14 nnt rp-full number of Ns replaced in the nreg rp-full reg
15 nnt rp-part number of Ns replaced in the nreg rp-part reg
16 detail_on_regions [S:seq,M:mdl,D:lendiff,N:#Ns, E:#non_N_match_expected, F:flush_direction,R:region_replaced?]; string with details on each region; S: sequence positions of region; M: model positions of region; D:sequence length - model length; N: number of Ns in region; E: number of non-Ns that match expected / total non-Ns or ?/? if replacement not attempted or D is 0 and region is entirely Ns; F: if D is 0 or E is ?/? then -, else 5' if shifting sequence region left gave higher E or 3' if shifting right gave higher E; R: Y if region was replaced, N if not;

Explanation of .dcr-suffixed output files

.dcr data lines have 17 fields, the names of which appear in the first two comment lines in each file. There is one data line for each alignment doctoring that was performed. An alignment doctoring occurs only in rare cases. There are two types of alignment doctorings: insert type and delete types.

An insert type alignment doctoring occurs when the following criteria are met:

  1. the initial alignment returned by cmalign or glsearch includes a single nucleotide insertion after the first position of a start codon or a before the final position of a stop codon

  2. at least one adjacent nucleotide in the input sequence exists 5' of insert (start codon case) or 3' of insert (stop codon case)

  3. the adjacent nucleotide is aligned to a reference position (is not an insertion)

  4. making the adjacent nucleotide an insertion instead of the inserted position in the start/stop codon with the will result in a valid start or stop codon aligned to the reference start or stop codon

A delete type alignment doctoring occurs when the following criteria are met:

  1. the initial alignment returned by cmalign or glsearch includes a gap at the first position of a start codon of a CDS in the reference model, or at the final position of a stop codon of a CDS

  2. at least one adjacent nucleotide in the input sequence exists 5' of gap (start codon case) or 3' of gap (stop codon case)

  3. the adjacent nucleotide is aligned to a reference position (is not an insertion)

  4. swapping the gapped position in the start/stop codon with the adjacent nucleotide will result in a valid start or stop codon aligned to the reference start or stop codon

For every situation where criteria 1 to 3 above are met for either insert or delete types, a line of information will be output to the .dcr file. If criteria 4 is also met, then field 17 will be yes, otherwise it will be no.

For any doctoring that causes an existing valid start or stop codon in a nearby CDS to become invalid (even more rare), a second doctoring takes place to undo the first, and an additional line for this undoctoring will appear in the .dcr file.

The relevant code is in the parse_stk_and_add_alignment_alerts() and doctoring_check_new_codon_validity() subroutines in v-annotate.pl. The latter subroutine includes simple examples in the comments of its header section.

Example file

idx field description
1 idx index of doctoring instance in format <d1>.<d2>, where <d1> is the index of the sequence this doctoring instance pertains to in the input sequence file, <d2> is the index of the doctoring instance for this sequence
2 seq name sequence name
3 mdl name name of the best-matching model for this sequence, this is the model with the top-scoring hit for this sequence in the classification stage, and is the model used to align the sequence
4 ftr type type of the feature (will always be CDS)
5 ftr name name of the CDS feature
6 ftr idx index (in input model info file) this doctoring instance pertains to
7 dcr type 'delete' or 'insert' indicating type of doctoring
8 model pos reference model position, either first position of start codon, or final position of stop codon
9 indel apos alignment position of the insertion (insert type) or deletion (delete type)
10 orig seq-uapos unaligned sequence position of the first start or final stop position before swap
11 new seq-uapos unaligned sequence position of the first start or final stop position after swap (if performed)
12 codon type start if start codon, stop if stop codon
13 codon coords unaligned sequence coordinates of start or stop codon after potential swap, in vadr coords format
14 orig codon start or stop codon before potential doctoring (swap)
15 new codon start or stop codon after potential doctoring (swap)
16 dcr iter doctoring iteration, 1 if first time the gap and nucleotide may be swapped, 2 if second (swapping back because first swap invalidated previously valid start/stop codon), cannot exceed 2
17 did swap yes if doctoring (swap) took place because it created a valid start or stop codon, no if doctoring (swap) did not occur because it would not have created a valid start or stop codon

Explanation of .alt.list-suffixed output files

.alt.list files begin with a comment line that names the fields, followed by 0 or more lines with 8 tab-delimited fields. Example file.

For more information on the seq coords and mdl coords fields, which have different meanings for different alerts, see here.

For examples using a toy model of different types of alerts, see here.

idx field description
1 sequence name of sequence this alert pertains to
2 model name of the best-matching reference model used to annotate this sequence, coordinates in mdl coords pertain to this model
3 feature-type type of feature the alert/error pertains to, or - if this alert is a per-sequence alert and not a per-feature alert
4 feature-name name of the feature this alert/error pertains to, of *sequence* if this alert is a per-sequence alert and not a per-feature alert
5 error short description of the alert/error
6 seq coords coordinates in the input sequence relevant to the alert, precise meaning differs per alert, more details are here
7 mdl coords coordinates in the reference model relevant to the alert, precise meaning differs per alert, more details are here
8 error-description longer description of the alert/error, specific to each alert/error type; this field, unlike all others, contains whitespace

Additional files created by v-annotate.pl when the --keep option is used

When run with the --keep option, v-annotate.pl will create additional files, some of these may change based on command-line options, in particular -s and -r: There are additional options that begin with --out_ which specify that a subset of these files be output. For example the --out_stk option specifies that stockholm alignment files be output.

suffix description reference
.cm.namelist file with list of names of all models in model library no further documentation
.in.fa copy of input fasta file (if --origfa is used, this will not exist) https://en.wikipedia.org/wiki/FASTA_format
.fa.ssi Easel sequence index files binary file, not meant to be human-readable
.cls.*.tblout tabular output from cmscan (or blastn converted to cmscan format) from classification stage http://eddylab.org/infernal/Userguide.pdf (section 9: "File and output formats")
.cls.*.stdout standard output (usually from cmscan) in classification stage no further documentation
.cdt.<model_name>.tblout tabular output from coverage determination stage for model <model_name> http://eddylab.org/infernal/Userguide.pdf (section 9: "File and output formats")
.cdt.<model_name>.stdout standard output (usually from cmsearch) from coverage determination stage for model <model_name> http://eddylab.org/infernal/Userguide.pdf (section 9: "File and output formats")
.<model_name>.fa fasta file of sequences classified to <model_name>, used as input to cmsearch in coverage determination stage https://en.wikipedia.org/wiki/FASTA_format
.<model_name>.a.fa fasta file of sequences classified to <model_name>, used as input to cmalign in alignment stage https://en.wikipedia.org/wiki/FASTA_format
.<model_name>.<ftr_type>.<ftr_idx>.fa fasta file of predicted feature subsequences for feature type <ftr_type> number <ftr_idx> for sequences classified to <model_name>, for CDS, used as input to blastx in protein validation stage https://en.wikipedia.org/wiki/FASTA_format
.<model_name>.align.*.stk Stockholm alignment file output from cmalign with 1 or more sequences classified to <model_name> https://en.wikipedia.org/wiki/Stockholm_format, http://eddylab.org/infernal/Userguide.pdf (section 9: "File and output formats")
.<model_name>.align.*.ifile cmalign insert output file, created with --ifile option for 1 or more sequences classified to <model_name> description of fields at top of file, no further documentation
.<model_name>.align.*.stdout cmalign standard output for 1 or more sequences classified to <model_name> no further documentation
.<model_name>.pv.blastx.fa query fasta file used for blastx for sequences classified to <model_name>, with full input sequences and predicted CDS subsequences https://en.wikipedia.org/wiki/FASTA_format, sequence naming conventions described here
.<model_name>.blastx.out blastx output for for sequences classified to <model_name> https://www.ncbi.nlm.nih.gov/books/NBK279684/
.<model_name>.blastx.summary.txt summary of blastx output used internally by v-annotate.pl no further documentation

Explanation of VADR coords coordinate strings

VADR using its own format for specifying coordinates for features and for naming subsequences in some output fasta files.

VADR coordinate strings are made up of one or more tokens with format <d1>..<d2>:<s>, where <d1> is the start position, <d2> is the end position, and <s> is the strand, either + or -, or rarely ? if unknown/uncertain. Tokens are separated by a ,. Each token defines what is referred to as a single segment in VADR code and output.

Here are some examples:

VADR coords string #segments meaning corresponding GenBank format location string
1..200:+ 1 positions 1 to 200 on positive strand 1..200
200..1:- 1 positions 200 to 1 on negative strand complement(1..200)
1..200:+,300..400:+ 2 positions 1 to 200 on positive strand (segment #1) followed by positions 300 to 400 on positive strand (segment #2) join(1..200,300..400)
400..300:-,200..1:- 2 positions 400 to 300 on negative strand (segment #1) followed by positions 200 to 1 on negative strand (segment #2) complement(join(1..200,300..400))
1..200:+,400..300:- 2 positions 1 to 200 on positive strand (segment #1) followed by positions 400 to 300 on negative strand (segment #2) join(1..200,complement(300..400))

These coords strings appear in .ftr output files and as the <value> in <key>:<value> pairs in v-build.pl output model info (.minfo) files for FEATURE lines.

Explanation of sequence naming in output VADR fasta files

FASTA format sequence files output by VADR use a specific naming convention for naming sequences.

Specifically, when naming a new subsequence, VADR scripts will append a / character followed by a VADR coordinates string to sequence names to indicate the positions (and strand) of the original sequence the new subsequence derives from. v-annotate.pl will create names in a similar manner, but sometimes will add an additional string that defines the feature being annotated. Here are some examples:

Original sequence name subsequence name subseq start subseq end subseq strand notes
NC_039897.1 NC_039897.1/7025..7672:+ 7025 7672 + Typical of v-build.pl .cds.fa output files
JN975492.1 JN975492.1/mat_peptide.2/1001-2092:+ 1001 2092 + Typical of v-annotate.pl .<model_name>.mat_peptide.<d>.fa output files, this is the predicted sequence of the second mature peptide from model <model-name> in JN975492.1

Questions, comments or feature requests? Send a mail to eric.nawrocki@nih.gov.

-