AlignSets
usage: AlignSets.py [-h] [--version] ...
Performs multiple alignment of input sequences by group
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
subcommands:
Alignment method
muscle Align sequence sets using MUSCLE
offset Align sequence sets using predefined 5' offset
table Create a 5' offset table by primer multiple alignment
output files:
align-pass
multiple aligned reads.
align-fail
raw reads failing multiple alignment.
offsets-forward
5' offset table for input into offset subcommand.
offsets-reverse
3' offset table for input into offset subcommand.
output annotation fields:
None
muscle
usage: AlignSets.py muscle [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed] [--log LOG_FILE]
[--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [--outdir OUT_DIR]
[--outname OUT_NAME] [--bf BARCODE_FIELD] [--div]
[--exec MUSCLE_EXEC]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--log LOG_FILE Specify to write verbose logging to a file. May not be
specified with multiple input files. (default: None)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--nproc NPROC The number of simultaneous computational processes to
execute (CPU cores to utilize). (default: 4)
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
--bf BARCODE_FIELD The annotation field containing barcode labels for
sequence grouping (default: BARCODE)
--div Specify to calculate nucleotide diversity of each set
(average pairwise error rate) (default: False)
--exec MUSCLE_EXEC The location of the MUSCLE executable (default:
/usr/local/bin/muscle)
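The --delim and --bf arguments above describe how annotations are laid out in a sequence header and which field defines a group. The following is a minimal sketch of that layout, not the tool's own code; the function names parse_header and group_by_field and the sample headers are hypothetical, and skipping records that lack the field is an assumption.

```python
from collections import defaultdict

def parse_header(header, delims=("|", "=", ",")):
    """Split a header into its ID and a dict mapping field names to value lists."""
    block_delim, field_delim, value_delim = delims
    blocks = header.split(block_delim)
    annotations = {}
    for block in blocks[1:]:
        # Each annotation block has the form NAME=VALUE[,VALUE...]
        name, _, value = block.partition(field_delim)
        annotations[name] = value.split(value_delim)
    return blocks[0], annotations

def group_by_field(headers, field="BARCODE", delims=("|", "=", ",")):
    """Group sequence IDs by the value of one annotation field."""
    groups = defaultdict(list)
    for header in headers:
        seq_id, annotations = parse_header(header, delims)
        # Assumption: records without the grouping field are skipped
        if field in annotations:
            groups[delims[2].join(annotations[field])].append(seq_id)
    return dict(groups)

# Hypothetical headers for illustration only
headers = [
    "SEQ1|BARCODE=AAGT|PRIMER=IGHM",
    "SEQ2|BARCODE=AAGT|PRIMER=IGHM,IGHG",
    "SEQ3|BARCODE=CCTA|PRIMER=IGHG",
]
groups = group_by_field(headers)
```

Each resulting group corresponds to one set of sequences that would be aligned together.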
offset
usage: AlignSets.py offset [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed] [--log LOG_FILE]
[--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [--outdir OUT_DIR]
[--outname OUT_NAME] [--bf BARCODE_FIELD] [--div]
[-d OFFSET_TABLE] [--pf PRIMER_FIELD]
[--mode {pad,cut}]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--log LOG_FILE Specify to write verbose logging to a file. May not be
specified with multiple input files. (default: None)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--nproc NPROC The number of simultaneous computational processes to
execute (CPU cores to utilize). (default: 4)
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
--bf BARCODE_FIELD The annotation field containing barcode labels for
sequence grouping (default: BARCODE)
--div Specify to calculate nucleotide diversity of each set
(average pairwise error rate) (default: False)
-d OFFSET_TABLE The tab delimited file of offset tags and values
(default: None)
--pf PRIMER_FIELD The primer field to use for offset assignment
(default: PRIMER)
--mode {pad,cut} Specifies whether to align sequences by padding with
gaps or by cutting the 5' sequence to a common start
position (default: pad)
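The difference between the pad and cut modes can be illustrated with a short sketch. This is an interpretation of the help text rather than the tool's implementation: it assumes the offset table holds one primer label and one integer 5' offset per tab-delimited row, that pad prepends that many gap characters, and that cut trims every read to the start position of the largest offset. The names read_offset_table and apply_offset and the sample data are hypothetical.

```python
import csv
import io

def read_offset_table(handle):
    """Read a tab-delimited table of primer label and integer offset."""
    return {row[0]: int(row[1]) for row in csv.reader(handle, delimiter="\t") if row}

def apply_offset(seq, offset, max_offset, mode="pad"):
    """Shift one sequence to a common coordinate system."""
    if mode == "pad":
        # Assumption: pad prepends one gap character per offset position
        return "-" * offset + seq
    if mode == "cut":
        # Assumption: cut trims the 5' end so all reads start at max_offset
        return seq[max_offset - offset:]
    raise ValueError("mode must be 'pad' or 'cut'")

# Hypothetical offset table and reads tagged with their primer label
table = read_offset_table(io.StringIO("P1\t0\nP2\t2\n"))
max_offset = max(table.values())
reads = [("P1", "ACGTACGT"), ("P2", "GTACGT")]
padded = [apply_offset(seq, table[p], max_offset, "pad") for p, seq in reads]
trimmed = [apply_offset(seq, table[p], max_offset, "cut") for p, seq in reads]
```

Under these assumptions, padding keeps all bases and yields sequences of equal length, while cutting discards the 5' bases that not every read covers.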
table
usage: AlignSets.py table [-h] [--failed]
[--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME] -p
PRIMER_FILE [--reverse] [--exec MUSCLE_EXEC]
optional arguments:
-h, --help show this help message and exit
--failed If specified create files containing records that fail
processing. (default: False)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-p PRIMER_FILE A FASTA or REGEX file containing primer sequences
(default: None)
--reverse If specified create a 3' offset table instead
(default: False)
--exec MUSCLE_EXEC The location of the MUSCLE executable (default:
/usr/local/bin/muscle)
AssemblePairs
usage: AssemblePairs.py [-h] [--version] ...
Assembles paired-end reads into a single sequence
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
subcommands:
Assembly method
align Assemble pairs by aligning ends
join Assemble pairs by concatenating ends
reference
Assemble pairs by aligning reads against a reference database
output files:
assemble-pass
successfully assembled reads.
assemble-fail
raw reads failing paired-end assembly.
output annotation fields:
annotation fields specified by the --1f or --2f arguments.
align
usage: AssemblePairs.py align [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...] -2
SEQ_FILES_2 [SEQ_FILES_2 ...] [--fasta]
[--failed] [--log LOG_FILE]
[--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [--outdir OUT_DIR]
[--outname OUT_NAME]
[--coord {illumina,solexa,sra,454,presto}]
[--rc {head,tail,both}]
[--1f HEAD_FIELDS [HEAD_FIELDS ...]]
[--2f TAIL_FIELDS [TAIL_FIELDS ...]]
[--alpha ALPHA] [--maxerror MAX_ERROR]
[--minlen MIN_LEN] [--maxlen MAX_LEN]
[--scanrev]
optional arguments:
-h, --help show this help message and exit
-1 SEQ_FILES_1 [SEQ_FILES_1 ...]
An ordered list of FASTA/FASTQ files containing
head/primary sequences. (default: None)
-2 SEQ_FILES_2 [SEQ_FILES_2 ...]
An ordered list of FASTA/FASTQ files containing
tail/secondary sequences. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--log LOG_FILE Specify to write verbose logging to a file. May not be
specified with multiple input files. (default: None)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--nproc NPROC The number of simultaneous computational processes to
execute (CPU cores to utilize). (default: 4)
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
--coord {illumina,solexa,sra,454,presto}
The format of the sequence identifier which defines
shared coordinate information across paired ends
(default: presto)
--rc {head,tail,both}
Specify to reverse complement sequences before
stitching (default: None)
--1f HEAD_FIELDS [HEAD_FIELDS ...]
Specify annotation fields to copy from head records
into assembled record (default: None)
--2f TAIL_FIELDS [TAIL_FIELDS ...]
Specify annotation fields to copy from tail records
into assembled record (default: None)
--alpha ALPHA Significance threshold for sequence assembly (default:
1e-05)
--maxerror MAX_ERROR Maximum allowable error rate (default: 0.3)
--minlen MIN_LEN Minimum sequence length to scan for overlap (default:
8)
--maxlen MAX_LEN Maximum sequence length to scan for overlap (default:
1000)
--scanrev If specified, scan past the end of the tail sequence
to allow the head sequence to overhang the end of the
tail sequence. (default: False)
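The roles of --minlen, --maxlen and --maxerror can be sketched as a scan over candidate overlap lengths. This is a simplified illustration, not the tool's algorithm: it omits the significance test controlled by --alpha and the --scanrev behavior, assumes the tail read is already in the same orientation as the head read, and uses the hypothetical function names best_overlap and assemble.

```python
def best_overlap(head, tail, min_len=8, max_len=1000, max_error=0.3):
    """Return (overlap length, error rate) for the lowest-error overlap, or None."""
    best = None
    upper = min(max_len, len(head), len(tail))
    for n in range(min_len, upper + 1):
        # Compare the last n bases of head with the first n bases of tail
        mismatches = sum(a != b for a, b in zip(head[-n:], tail[:n]))
        error = mismatches / n
        if error <= max_error and (best is None or error < best[1]):
            best = (n, error)
    return best

def assemble(head, tail, **kwargs):
    """Stitch head and tail at the best overlap; None means assembly failed."""
    hit = best_overlap(head, tail, **kwargs)
    if hit is None:
        return None
    overlap_len, _ = hit
    return head + tail[overlap_len:]

# Hypothetical reads sharing an 11-base overlap
head = "AAACCCGGGTTTACGTACGT"
tail = "TTTACGTACGTGGGAAA"
assembled = assemble(head, tail)
```

A pair for which no candidate overlap meets the error threshold would be reported as failing assembly in this sketch.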
reference
usage: AssemblePairs.py reference [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...] -2
SEQ_FILES_2 [SEQ_FILES_2 ...] [--fasta]
[--failed] [--log LOG_FILE]
[--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [--outdir OUT_DIR]
[--outname OUT_NAME]
[--coord {illumina,solexa,sra,454,presto}]
[--rc {head,tail,both}]
[--1f HEAD_FIELDS [HEAD_FIELDS ...]]
[--2f TAIL_FIELDS [TAIL_FIELDS ...]] -r
REF_FILE [--minident MIN_IDENT]
[--evalue EVALUE] [--maxhits MAX_HITS]
[--fill] [--exec USEARCH_EXEC]
optional arguments:
-h, --help show this help message and exit
-1 SEQ_FILES_1 [SEQ_FILES_1 ...]
An ordered list of FASTA/FASTQ files containing
head/primary sequences. (default: None)
-2 SEQ_FILES_2 [SEQ_FILES_2 ...]
An ordered list of FASTA/FASTQ files containing
tail/secondary sequences. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--log LOG_FILE Specify to write verbose logging to a file. May not be
specified with multiple input files. (default: None)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--nproc NPROC The number of simultaneous computational processes to
execute (CPU cores to utilize). (default: 4)
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
--coord {illumina,solexa,sra,454,presto}
The format of the sequence identifier which defines
shared coordinate information across paired ends
(default: presto)
--rc {head,tail,both}
Specify to reverse complement sequences before
stitching (default: None)
--1f HEAD_FIELDS [HEAD_FIELDS ...]
Specify annotation fields to copy from head records
into assembled record (default: None)
--2f TAIL_FIELDS [TAIL_FIELDS ...]
Specify annotation fields to copy from tail records
into assembled record (default: None)
-r REF_FILE A FASTA file containing the reference sequence
database. (default: None)
--minident MIN_IDENT Minimum identity of the assembled sequence required to
call a valid assembly (between 0 and 1). (default:
0.5)
--evalue EVALUE Minimum E-value for the ublast reference alignment for
both the head and tail sequence. (default: 1e-05)
--maxhits MAX_HITS Maximum number of hits from ublast to check for
matching head and tail sequence reference alignments.
(default: 100)
--fill Specify to change the behavior of inserted
characters when the head and tail sequences do not
overlap. If specified, the V region reference sequence
is inserted in the non-overlapping region instead of a
sequence of Ns. Warning: this option can produce
chimeric sequences. (default: False)
--exec USEARCH_EXEC The path to the usearch executable file. (default:
/usr/local/bin/usearch)
join
usage: AssemblePairs.py join [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...] -2
SEQ_FILES_2 [SEQ_FILES_2 ...] [--fasta]
[--failed] [--log LOG_FILE]
[--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [--outdir OUT_DIR]
[--outname OUT_NAME]
[--coord {illumina,solexa,sra,454,presto}]
[--rc {head,tail,both}]
[--1f HEAD_FIELDS [HEAD_FIELDS ...]]
[--2f TAIL_FIELDS [TAIL_FIELDS ...]] [--gap GAP]
optional arguments:
-h, --help show this help message and exit
-1 SEQ_FILES_1 [SEQ_FILES_1 ...]
An ordered list of FASTA/FASTQ files containing
head/primary sequences. (default: None)
-2 SEQ_FILES_2 [SEQ_FILES_2 ...]
An ordered list of FASTA/FASTQ files containing
tail/secondary sequences. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--log LOG_FILE Specify to write verbose logging to a file. May not be
specified with multiple input files. (default: None)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--nproc NPROC The number of simultaneous computational processes to
execute (CPU cores to utilize). (default: 4)
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
--coord {illumina,solexa,sra,454,presto}
The format of the sequence identifier which defines
shared coordinate information across paired ends
(default: presto)
--rc {head,tail,both}
Specify to reverse complement sequences before
stitching (default: None)
--1f HEAD_FIELDS [HEAD_FIELDS ...]
Specify annotation fields to copy from head records
into assembled record (default: None)
--2f TAIL_FIELDS [TAIL_FIELDS ...]
Specify annotation fields to copy from tail records
into assembled record (default: None)
--gap GAP Number of N characters to place between ends (default:
0)
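The effect of --gap and --rc on a joined pair can be shown in a few lines. This is a minimal sketch under the assumption that joining is plain concatenation with N characters in between; the names reverse_complement and join_pair and the sample reads are hypothetical.

```python
# Translation table covering unambiguous bases and N, in both cases
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def join_pair(head, tail, gap=0, rc=None):
    """Concatenate head and tail with `gap` N characters between them."""
    if rc in ("head", "both"):
        head = reverse_complement(head)
    if rc in ("tail", "both"):
        tail = reverse_complement(tail)
    return head + "N" * gap + tail

# Hypothetical reads for illustration
joined = join_pair("ACGT", "AAAC", gap=3, rc="tail")
```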
BuildConsensus
usage: BuildConsensus.py [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed] [--log LOG_FILE]
[--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [--outdir OUT_DIR]
[--outname OUT_NAME] [--version] [-n MIN_COUNT]
[--bf BARCODE_FIELD] [-q MIN_QUAL] [--freq MIN_FREQ]
[--maxgap MAX_GAP] [--pf PRIMER_FIELD]
[--prcons PRIMER_FREQ]
[--cf COPY_FIELDS [COPY_FIELDS ...]]
[--act {min,max,sum,set,majority} [{min,max,sum,set,majority} ...]]
[--dep]
[--maxdiv MAX_DIVERSITY | --maxerror MAX_ERROR]
Builds a consensus sequence for each set of input sequences
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--log LOG_FILE Specify to write verbose logging to a file. May not be
specified with multiple input files. (default: None)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--nproc NPROC The number of simultaneous computational processes to
execute (CPU cores to utilize). (default: 4)
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
--version show program's version number and exit
-n MIN_COUNT The minimum number of sequences needed to define a
valid consensus (default: 1)
--bf BARCODE_FIELD Position of description barcode field to group
sequences by (default: BARCODE)
-q MIN_QUAL Consensus quality score cut-off under which an
ambiguous character is assigned; does not apply when
quality scores are unavailable (default: 0)
--freq MIN_FREQ Fraction of character occurrences under which an
ambiguous character is assigned. (default: 0.6)
--maxgap MAX_GAP If specified, this defines a cut-off for the frequency
of allowed gap values for each position. Positions
exceeding the threshold are deleted from the
consensus. If not defined, positions are always
retained. (default: None)
--pf PRIMER_FIELD Specifies the field name of the primer annotations
(default: None)
--prcons PRIMER_FREQ Specify to define a minimum primer frequency required
to assign a consensus primer, and filter out sequences
with minority primers from the consensus building step
(default: None)
--cf COPY_FIELDS [COPY_FIELDS ...]
Specifies a set of additional annotation fields to
copy into the consensus sequence annotations.
(default: None)
--act {min,max,sum,set,majority} [{min,max,sum,set,majority} ...]
List of actions to take for each copy field which
defines how each annotation will be combined into a
single value. The actions "min", "max", "sum" perform
the corresponding mathematical operation on numeric
annotations. The action "set" combines annotations
into a comma delimited list of unique values and adds
an annotation named _COUNT specifying the count
of each item in the set. The action "majority" assigns
the most frequent annotation to the consensus
annotation and adds an annotation named _FREQ
specifying the frequency of the majority value.
(default: None)
--dep Specify to calculate consensus quality with a non-
independence assumption (default: False)
--maxdiv MAX_DIVERSITY
Specify to calculate the nucleotide diversity of each
read group (average pairwise error rate) and remove
groups exceeding the given diversity threshold.
Diversity is calculated for all positions within the
read group, ignoring any character filtering imposed
by the -q, --freq and --maxgap arguments. Mutually
exclusive with --maxerror. (default: None)
--maxerror MAX_ERROR Specify to calculate the error rate of each read group
(rate of mismatches from consensus) and remove groups
exceeding the given error threshold. The error rate is
calculated against the final consensus sequence, which
may include masked positions due to the -q and --freq
arguments and may have deleted positions due to the
--maxgap argument. Mutually exclusive with --maxdiv.
(default: None)
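The two filtering metrics described for --maxdiv and --maxerror can be sketched as follows. This is one plausible reading of "average pairwise error rate" and "rate of mismatches from consensus", not the tool's code: it treats every character, including N and gap characters, as an ordinary symbol, which the real implementation may not do. The names pairwise_diversity and error_rate are hypothetical.

```python
from itertools import combinations

def pairwise_diversity(seqs):
    """Average mismatch rate over all pairs of sequences in a read group."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    rates = [
        sum(a != b for a, b in zip(x, y)) / min(len(x), len(y))
        for x, y in pairs
    ]
    return sum(rates) / len(rates)

def error_rate(seqs, consensus):
    """Fraction of compared positions that differ from the consensus."""
    mismatches = sum(a != b for s in seqs for a, b in zip(s, consensus))
    compared = sum(min(len(s), len(consensus)) for s in seqs)
    return mismatches / compared

# Hypothetical read group with a single mismatch in the third read
group = ["ACGT", "ACGT", "ACGA"]
diversity = pairwise_diversity(group)
error = error_rate(group, "ACGT")
```

A group would be removed when the chosen metric exceeds the threshold given on the command line.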
output files:
consensus-pass
consensus reads.
consensus-fail
raw reads failing consensus filtering criteria.
output annotation fields:
PRIMER
a comma delimited list of unique primer annotations found within the
barcode read group.
PRCOUNT
a comma delimited list of the corresponding counts of unique primer
annotations.
PRCONS
the majority primer within the barcode read group.
PRFREQ
the frequency of the majority primer.
CONSCOUNT
the count of reads within the barcode read group which contributed to
the consensus sequence. This is the total size of the read group,
minus sequences excluded due to user-defined filtering criteria.
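How the --freq and --maxgap thresholds shape a consensus can be sketched for the simple case without quality scores. This is an illustrative simplification, not the tool's algorithm: it assumes the reads are already aligned to equal length, assigns N when the most common character falls below the frequency threshold, and drops columns whose gap frequency exceeds the gap threshold. The name frequency_consensus and the sample reads are hypothetical.

```python
from collections import Counter

def frequency_consensus(seqs, min_freq=0.6, max_gap=None):
    """Build a per-column majority consensus from equal-length aligned reads."""
    consensus = []
    for column in zip(*seqs):
        counts = Counter(column)
        gap_freq = counts.get("-", 0) / len(column)
        # Columns with too many gaps are deleted when max_gap is set
        if max_gap is not None and gap_freq > max_gap:
            continue
        char, count = counts.most_common(1)[0]
        # Below the frequency threshold the position becomes ambiguous
        consensus.append(char if count / len(column) >= min_freq else "N")
    return "".join(consensus)

# Hypothetical aligned read group
reads = ["ACGTA-A", "ACGTC-A", "ACCTG-A", "AGGTTGA"]
with_gaps = frequency_consensus(reads)
without_gaps = frequency_consensus(reads, max_gap=0.5)
```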
ClusterSets
usage: ClusterSets.py [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta] [--failed]
[--log LOG_FILE] [--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [--outdir OUT_DIR] [--outname OUT_NAME]
[--version] [-f BARCODE_FIELD] [-k CLUSTER_FIELD]
[--id IDENT] [--start SEQ_START] [--end SEQ_END]
[--exec USEARCH_EXEC]
Cluster sequences by group
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--log LOG_FILE Specify to write verbose logging to a file. May not be
specified with multiple input files. (default: None)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--nproc NPROC The number of simultaneous computational processes to
execute (CPU cores to utilize). (default: 4)
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
--version show program's version number and exit
-f BARCODE_FIELD The annotation field containing annotations, such as
UID barcode, for sequence grouping. (default: BARCODE)
-k CLUSTER_FIELD The name of the output annotation field to add with
the cluster information for each sequence. (default:
CLUSTER)
--id IDENT The sequence identity threshold for the usearch
algorithm. (default: 0.9)
--start SEQ_START The start of the region to be used for clustering.
Together with --end, this parameter can be used to
specify a subsequence of each read to use in the
clustering algorithm. (default: None)
--end SEQ_END The end of the region to be used for clustering.
(default: None)
--exec USEARCH_EXEC The location of the USEARCH executable. (default:
/usr/local/bin/usearch)
output files:
cluster-pass
clustered reads.
cluster-fail
raw reads failing clustering.
output annotation fields:
CLUSTER
a numeric cluster identifier defining the within-group cluster.
CollapseSeq
usage: CollapseSeq.py [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta] [--failed]
[--log LOG_FILE] [--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME] [--version]
[-n MAX_MISSING] [--uf UNIQ_FIELDS [UNIQ_FIELDS ...]]
[--cf COPY_FIELDS [COPY_FIELDS ...]]
[--act {min,max,sum,set} [{min,max,sum,set} ...]]
[--inner] [--keepmiss]
[--maxf MAX_FIELD | --minf MIN_FIELD]
Removes duplicate sequences from FASTA/FASTQ files
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--log LOG_FILE Specify to write verbose logging to a file. May not be
specified with multiple input files. (default: None)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
--version show program's version number and exit
-n MAX_MISSING Maximum number of missing nucleotides to consider for
collapsing sequences. A sequence will be considered
undetermined if it contains too many missing
nucleotides. (default: 0)
--uf UNIQ_FIELDS [UNIQ_FIELDS ...]
Specifies a set of annotation fields that must match
for sequences to be considered duplicates (default:
None)
--cf COPY_FIELDS [COPY_FIELDS ...]
Specifies a set of annotation fields to copy into the
unique sequence output. (default: None)
--act {min,max,sum,set} [{min,max,sum,set} ...]
List of actions to take for each copy field which
defines how each annotation will be combined into a
single value. The actions "min", "max", "sum" perform
the corresponding mathematical operation on numeric
annotations. The action "set" collapses annotations
into a comma delimited list of unique values.
(default: None)
--inner If specified, exclude consecutive missing characters
at either end of the sequence. (default: False)
--keepmiss If specified, sequences with more missing characters
than the threshold set by the -n parameter will be
written to the unique sequence output file with a
DUPCOUNT=1 annotation. If not specified, such
sequences will be written to a separate file.
(default: False)
--maxf MAX_FIELD Specify the field whose maximum value determines the
retained sequence; mutually exclusive with --minf.
(default: None)
--minf MIN_FIELD Specify the field whose minimum value determines the
retained sequence; mutually exclusive with --maxf.
(default: None)
output files:
collapse-unique
unique sequences. Contains one representative from each set of
duplicate sequences. The retained representative is determined by
user defined criteria.
collapse-duplicate
raw reads which are duplicates of the sequences retained in the
collapse-unique file.
collapse-undetermined
raw reads which were excluded from consideration due to having too
many N characters in the sequence.
output annotation fields:
DUPCOUNT
total number of sequences within the set of duplicates for each
retained unique sequence; that is, the copy number of each unique
sequence within the data file.
annotation fields specified by the --cf parameter.
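The DUPCOUNT annotation and the role of --maxf can be sketched with exact-match collapsing. This is a deliberately reduced illustration, not the tool's algorithm: it ignores missing characters (-n, --inner, --keepmiss), the --uf and --cf fields, and treats records as (ID, sequence, annotations) tuples. The function name collapse and the sample records are hypothetical.

```python
def collapse(records, max_field=None):
    """Collapse identical sequences and annotate each with its DUPCOUNT.

    records is a list of (seq_id, sequence, annotations) tuples. When
    max_field is given, the duplicate with the largest numeric value in
    that field is kept as the representative.
    """
    unique = {}
    for seq_id, seq, ann in records:
        if seq not in unique:
            unique[seq] = {"id": seq_id, "ann": dict(ann), "count": 1}
            continue
        entry = unique[seq]
        entry["count"] += 1
        if max_field is not None and float(ann[max_field]) > float(entry["ann"][max_field]):
            entry["id"], entry["ann"] = seq_id, dict(ann)
    return [
        (entry["id"], seq, {**entry["ann"], "DUPCOUNT": entry["count"]})
        for seq, entry in unique.items()
    ]

# Hypothetical records for illustration
records = [
    ("S1", "ACGT", {"CONSCOUNT": "2"}),
    ("S2", "ACGT", {"CONSCOUNT": "5"}),
    ("S3", "TTGA", {"CONSCOUNT": "1"}),
]
collapsed = collapse(records, max_field="CONSCOUNT")
```

Without max_field, this sketch keeps the first record seen for each unique sequence.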
ConvertHeaders
usage: ConvertHeaders.py [-h] [--version] ...
Converts sequence headers to the pRESTO format
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
subcommands:
Conversion method
generic Converts sequence headers without a known annotation system.
454 Converts Roche 454 sequence headers.
genbank Converts NCBI GenBank and RefSeq sequence headers.
illumina Converts Illumina sequence headers.
imgt Converts sequence headers output by IMGT/GENE-DB.
sra Converts NCBI SRA sequence headers.
output files:
convert-pass
reads passing header conversion.
convert-fail
raw reads failing header conversion.
output annotation fields:
the annotation fields added are specific to the header format of the
input file.
generic
usage: ConvertHeaders.py generic [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed]
[--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
454
usage: ConvertHeaders.py 454 [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed]
[--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
genbank
usage: ConvertHeaders.py genbank [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed]
[--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
illumina
usage: ConvertHeaders.py illumina [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed]
[--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
imgt
usage: ConvertHeaders.py imgt [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed]
[--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME]
[--simple]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
--simple If specified, only the allele name, and no other
annotations, will appear in the converted sequence
header. (default: False)
sra
usage: ConvertHeaders.py sra [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed]
[--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
EstimateError
usage: EstimateError.py [-h] -s SEQ_FILES [SEQ_FILES ...] [--log LOG_FILE]
[--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [--outdir OUT_DIR]
[--outname OUT_NAME] [--version] [-f SET_FIELD]
[-n MIN_COUNT] [--mode {freq,qual}] [-q MIN_QUAL]
[--freq MIN_FREQ] [--maxdiv MAX_DIVERSITY]
Calculates annotation set error rates
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--log LOG_FILE Specify to write verbose logging to a file. May not be
specified with multiple input files. (default: None)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--nproc NPROC The number of simultaneous computational processes to
execute (CPU cores to utilize). (default: 4)
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
--version show program's version number and exit
-f SET_FIELD The name of the annotation field to group sequences by
(default: BARCODE)
-n MIN_COUNT The minimum number of sequences needed to consider a
set (default: 10)
--mode {freq,qual} Specifies which method to use to determine the
consensus sequence. The "freq" method will determine
the consensus by nucleotide frequency at each position
and assign the most common value. The "qual" method
will weight values by their quality scores to
determine the consensus nucleotide at each position.
(default: freq)
-q MIN_QUAL Consensus quality score cut-off under which an
ambiguous character is assigned. (default: 20)
--freq MIN_FREQ Fraction of character occurrences under which an
ambiguous character is assigned. (default: 0.6)
--maxdiv MAX_DIVERSITY
Specify to calculate the nucleotide diversity of each
read group (average pairwise error rate) and exclude
groups which exceed the given diversity threshold.
(default: None)
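The "freq" consensus mode and the --freq threshold can be illustrated with a minimal sketch. This is not pRESTO's implementation: the function name `frequency_consensus`, the use of N as the ambiguous character, and the assumption that all reads in a set already have equal length are illustrative choices.

```python
from collections import Counter

def frequency_consensus(reads, min_freq=0.6, ambiguous="N"):
    """Build a consensus from the most common character at each position.

    Assumes every read has the same length. A position whose most common
    character occurs in a fraction of reads below min_freq is assigned
    the ambiguous character (assumed here to be N).
    """
    consensus = []
    for column in zip(*reads):
        char, count = Counter(column).most_common(1)[0]
        consensus.append(char if count / len(column) >= min_freq else ambiguous)
    return "".join(consensus)

reads = ["ACGT", "ACGA", "ACCA", "ATGC"]
print(frequency_consensus(reads))  # prints ACGN
```

The last position is ambiguous because its most common character (A) appears in only half of the reads, which is below the 0.6 default.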
output files:
error-position
estimated error by read position.
error-quality
estimated error by the quality score assigned within the input file.
error-nucleotide
estimated error by nucleotide.
error-set
estimated error by barcode read group size.
output fields:
POSITION
read position with base zero indexing.
Q
Phred quality score.
OBSERVED
observed nucleotide value.
REFERENCE
consensus nucleotide for the barcode read group.
SET_COUNT
barcode read group size.
REPORTED_Q
mean Phred quality score reported within the input file for the given
position, quality score, nucleotide or read group.
MISMATCHES
count of observed mismatches from consensus for the given position,
quality score, nucleotide or read group.
OBSERVATIONS
total count of observed values for each position, quality score,
nucleotide or read group size.
ERROR
estimated error rate.
EMPIRICAL_Q
estimated error rate converted to a Phred quality score.
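The relationship between ERROR and EMPIRICAL_Q can be sketched with the standard Phred definition, Q = -10 * log10(p). The sketch assumes that ERROR is MISMATCHES divided by OBSERVATIONS, which the field descriptions suggest but do not state; the function names are hypothetical.

```python
import math

def estimate_error(mismatches, observations):
    """Error rate as mismatches over total observations (an assumption)."""
    return mismatches / observations

def error_to_phred(error_rate):
    """Convert an error probability to a Phred score: Q = -10 * log10(p)."""
    if error_rate <= 0:
        raise ValueError("error rate must be positive to give a finite score")
    return -10 * math.log10(error_rate)

error = estimate_error(mismatches=5, observations=5000)
print(error, error_to_phred(error))
```

An error rate of 1 in 1000 corresponds to a Phred score of about 30.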
FilterSeq
usage: FilterSeq.py [-h] [--version] ...
Filters sequences in FASTA/FASTQ files
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
subcommands:
Filtering operation
length Sequence length filtering mode
missing Missing nucleotide filtering mode
repeats Consecutive nucleotide repeating filtering mode
quality Quality filtering mode
maskqual Character masking mode
trimqual Sequence trimming mode
output files:
<operation>-pass
reads passing filtering operation and modified accordingly, where
<operation> is the name of the filtering operation that was run.
<operation>-fail
raw reads failing filtering criteria, where <operation> is the name of
the filtering operation.
output annotation fields:
None
length
usage: FilterSeq.py length [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed] [--log LOG_FILE] [--nproc NPROC]
[--outdir OUT_DIR] [--outname OUT_NAME]
[-n MIN_LENGTH] [--inner]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--log LOG_FILE Specify to write verbose logging to a file. May not be
specified with multiple input files. (default: None)
--nproc NPROC The number of simultaneous computational processes to
execute (CPU cores to utilize). (default: 4)
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-n MIN_LENGTH Minimum sequence length to retain. (default: 250)
--inner If specified exclude consecutive missing characters at
either end of the sequence. (default: False)
missing
usage: FilterSeq.py missing [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed] [--log LOG_FILE] [--nproc NPROC]
[--outdir OUT_DIR] [--outname OUT_NAME]
[-n MAX_MISSING] [--inner]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--log LOG_FILE Specify to write verbose logging to a file. May not be
specified with multiple input files. (default: None)
--nproc NPROC The number of simultaneous computational processes to
execute (CPU cores to utilize). (default: 4)
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-n MAX_MISSING Threshold for fraction of gap or N nucleotides.
(default: 10)
--inner If specified exclude consecutive missing characters at
either end of the sequence. (default: False)
repeats
usage: FilterSeq.py repeats [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed] [--log LOG_FILE] [--nproc NPROC]
[--outdir OUT_DIR] [--outname OUT_NAME]
[-n MAX_REPEAT] [--missing] [--inner]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--log LOG_FILE Specify to write verbose logging to a file. May not be
specified with multiple input files. (default: None)
--nproc NPROC The number of simultaneous computational processes to
execute (CPU cores to utilize). (default: 4)
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-n MAX_REPEAT Threshold for fraction of repeating nucleotides.
(default: 15)
--missing If specified count consecutive gap and N characters
in addition to {A,C,G,T}. (default: False)
--inner If specified exclude consecutive missing characters at
either end of the sequence. (default: False)
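A minimal sketch of a consecutive-repeat filter follows. It assumes MAX_REPEAT is a maximum run length, which the default of 15 suggests but the help text does not confirm, and that the threshold is inclusive. The function names and the set of missing characters ("-", "." and "N") are hypothetical, and the --inner option is not modeled.

```python
from itertools import groupby

def longest_repeat(sequence, count_missing=False, missing_chars="-.N"):
    """Return the length of the longest run of a single repeated character.

    Runs of missing characters are ignored unless count_missing is True,
    mirroring the described effect of the --missing flag.
    """
    longest = 0
    for char, run in groupby(sequence.upper()):
        if char in missing_chars and not count_missing:
            continue
        longest = max(longest, sum(1 for _ in run))
    return longest

def passes_repeat_filter(sequence, max_repeat=15, count_missing=False):
    """Keep a sequence when its longest run does not exceed max_repeat."""
    return longest_repeat(sequence, count_missing) <= max_repeat

print(longest_repeat("ACGGGGTNNNNNN"))                      # prints 4
print(longest_repeat("ACGGGGTNNNNNN", count_missing=True))  # prints 6
```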
quality
usage: FilterSeq.py quality [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed] [--log LOG_FILE] [--nproc NPROC]
[--outdir OUT_DIR] [--outname OUT_NAME]
[-q MIN_QUAL] [--inner]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--log LOG_FILE Specify to write verbose logging to a file. May not be
specified with multiple input files. (default: None)
--nproc NPROC The number of simultaneous computational processes to
execute (CPU cores to utilize). (default: 4)
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-q MIN_QUAL Quality score threshold. (default: 20)
--inner If specified exclude consecutive missing characters at
either end of the sequence. (default: False)
maskqual
usage: FilterSeq.py maskqual [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed] [--log LOG_FILE] [--nproc NPROC]
[--outdir OUT_DIR] [--outname OUT_NAME]
[-q MIN_QUAL]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--log LOG_FILE Specify to write verbose logging to a file. May not be
specified with multiple input files. (default: None)
--nproc NPROC The number of simultaneous computational processes to
execute (CPU cores to utilize). (default: 4)
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-q MIN_QUAL Quality score threshold. (default: 20)
trimqual
usage: FilterSeq.py trimqual [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed] [--log LOG_FILE] [--nproc NPROC]
[--outdir OUT_DIR] [--outname OUT_NAME]
[-q MIN_QUAL] [--win WINDOW] [--reverse]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--log LOG_FILE Specify to write verbose logging to a file. May not be
specified with multiple input files. (default: None)
--nproc NPROC The number of simultaneous computational processes to
execute (CPU cores to utilize). (default: 4)
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-q MIN_QUAL Quality score threshold. (default: 20)
--win WINDOW Nucleotide window size for moving average calculation.
(default: 10)
--reverse Specify to trim the head of the sequence rather than
the tail. (default: False)
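The trimqual options can be read as a moving-average scan over the quality scores. The sketch below is one plausible interpretation, not pRESTO's code: it slides a window one base at a time and cuts the read at the start of the first window whose mean quality falls below the threshold. The function name, the step size of one, and the handling of reads shorter than the window (left untrimmed) are assumptions.

```python
def trim_by_quality(sequence, qualities, min_qual=20, window=10, reverse=False):
    """Trim a read where the windowed mean quality first drops below min_qual.

    The tail is trimmed by default; reverse=True trims the head instead.
    Returns the trimmed sequence and its matching quality scores.
    """
    if reverse:
        # Scan from the other end by flipping the read, then flip the result back.
        seq, qual = trim_by_quality(sequence[::-1], qualities[::-1], min_qual, window)
        return seq[::-1], qual[::-1]

    end = len(sequence)
    for start in range(len(sequence) - window + 1):
        chunk = qualities[start:start + window]
        if sum(chunk) / window < min_qual:
            end = start
            break
    return sequence[:end], qualities[:end]

seq, qual = trim_by_quality("ACGTACGTAC", [30] * 6 + [5] * 4, window=4)
print(seq, qual)
```

With a window of 4, the first window averaging below 20 starts at index 4, so only the first four bases are kept.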
MaskPrimers
usage: MaskPrimers.py [-h] [--version] ...
Removes primers and annotates sequences with primer and barcode identifiers
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
subcommands:
Alignment method
align Find primer matches using pairwise local alignment
score Find primer matches by scoring primers at a fixed position
output files:
mask-pass
processed reads with successful primer matches.
mask-fail
raw reads failing primer identification.
output annotation fields:
SEQORIENT
the orientation of the output sequence. Either F (input) or RC
(reverse complement of input).
PRIMER
name of the best primer match.
BARCODE
the sequence preceding the primer match. Only output when the
--barcode flag is specified.
align
usage: MaskPrimers.py align [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed] [--log LOG_FILE]
[--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [--outdir OUT_DIR]
[--outname OUT_NAME] -p PRIMER_FILE
[--mode {cut,mask,trim,tag}]
[--maxerror MAX_ERROR] [--revpr] [--barcode]
[--maxlen MAX_LEN] [--skiprc]
[--gap GAP_PENALTY GAP_PENALTY]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--log LOG_FILE Specify to write verbose logging to a file. May not be
specified with multiple input files. (default: None)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--nproc NPROC The number of simultaneous computational processes to
execute (CPU cores to utilize). (default: 4)
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-p PRIMER_FILE A FASTA or REGEX file containing primer sequences.
(default: None)
--mode {cut,mask,trim,tag}
Specifies the action to take with the primer sequence.
The "cut" mode will remove both the primer region and
the preceding sequence. The "mask" mode will replace
the primer region with Ns and remove the preceding
sequence. The "trim" mode will remove the region
preceding the primer, but leave the primer region
intact. The "tag" mode will leave the input sequence
unmodified. (default: mask)
--maxerror MAX_ERROR Maximum allowable error rate. (default: 0.2)
--revpr Specify to match the tail-end of the sequence against
the reverse complement of the primers. (default:
False)
--barcode Specify to encode sequences with barcode sequences
(unique molecular identifiers) found preceding the
primer region. (default: False)
--maxlen MAX_LEN Maximum sequence length to scan for primers. (default:
50)
--skiprc Specify to prevent checking of sample reverse
complement sequences. (default: False)
--gap GAP_PENALTY GAP_PENALTY
A list of two positive values defining the gap open
and gap extension penalties for aligning the primers.
Note: the error rate is calculated as the percentage
of mismatches from the primer sequence with gap
penalties reducing the match count accordingly; this
may lead to error rates that differ from strict
mismatch percentage when gaps are present in the
alignment. (default: (1, 1))
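The four --mode actions can be sketched for a forward primer match. The function `apply_primer_mode` and its `start`/`end` coordinates (the primer occupies `sequence[start:end]`) are hypothetical names for illustration. The --revpr case, where the primer sits at the tail end, is not covered.

```python
def apply_primer_mode(sequence, start, end, mode="mask"):
    """Apply a --mode action to a primer match at sequence[start:end]."""
    if mode == "cut":
        # Remove the primer region and the preceding sequence.
        return sequence[end:]
    if mode == "mask":
        # Replace the primer region with Ns and remove the preceding sequence.
        return "N" * (end - start) + sequence[end:]
    if mode == "trim":
        # Remove the region preceding the primer, keep the primer intact.
        return sequence[start:]
    if mode == "tag":
        # Leave the input sequence unmodified.
        return sequence
    raise ValueError(f"unknown mode: {mode}")

read = "TTACGTGGGG"  # primer ACGT matched at positions 2..6
for mode in ("cut", "mask", "trim", "tag"):
    print(mode, apply_primer_mode(read, 2, 6, mode))
```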
score
usage: MaskPrimers.py score [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed] [--log LOG_FILE]
[--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [--outdir OUT_DIR]
[--outname OUT_NAME] -p PRIMER_FILE
[--mode {cut,mask,trim,tag}]
[--maxerror MAX_ERROR] [--revpr] [--barcode]
[--start START]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--log LOG_FILE Specify to write verbose logging to a file. May not be
specified with multiple input files. (default: None)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--nproc NPROC The number of simultaneous computational processes to
execute (CPU cores to utilize). (default: 4)
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-p PRIMER_FILE A FASTA or REGEX file containing primer sequences.
(default: None)
--mode {cut,mask,trim,tag}
Specifies the action to take with the primer sequence.
The "cut" mode will remove both the primer region and
the preceding sequence. The "mask" mode will replace
the primer region with Ns and remove the preceding
sequence. The "trim" mode will remove the region
preceding the primer, but leave the primer region
intact. The "tag" mode will leave the input sequence
unmodified. (default: mask)
--maxerror MAX_ERROR Maximum allowable error rate. (default: 0.2)
--revpr Specify to match the tail-end of the sequence against
the reverse complement of the primers. (default:
False)
--barcode Specify to encode sequences with barcode sequences
(unique molecular identifiers) found preceding the
primer region. (default: False)
--start START The starting position of the primer (default: 0)
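Scoring primers at a fixed position can be sketched as a per-base comparison starting at --start, with the error rate taken as mismatches divided by primer length. This is an illustrative simplification: the function names are hypothetical, ambiguous nucleotide codes in primers are treated as ordinary characters, and the exact scoring used by the tool is not given in this reference.

```python
def score_primer(sequence, primer, start=0):
    """Return the mismatch rate of a primer against the read at a fixed start."""
    window = sequence[start:start + len(primer)]
    if len(window) < len(primer):
        return 1.0  # read too short to contain the primer at this position
    mismatches = sum(1 for a, b in zip(window, primer) if a != b)
    return mismatches / len(primer)

def best_primer(sequence, primers, start=0, max_error=0.2):
    """Pick the primer with the lowest mismatch rate, or None if above max_error."""
    name, error = min(
        ((n, score_primer(sequence, p, start)) for n, p in primers.items()),
        key=lambda item: item[1],
    )
    return (name, error) if error <= max_error else None

primers = {"P1": "ACGTA", "P2": "TTGCA"}  # hypothetical primer set
print(best_primer("GGACGTTCCC", primers, start=2))
```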
PairSeq
usage: PairSeq.py [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...] -2 SEQ_FILES_2
[SEQ_FILES_2 ...] [--fasta] [--failed]
[--delim DELIMITER DELIMITER DELIMITER] [--outdir OUT_DIR]
[--outname OUT_NAME] [--version]
[--1f FIELDS_1 [FIELDS_1 ...]]
[--2f FIELDS_2 [FIELDS_2 ...]]
[--coord {illumina,solexa,sra,454,presto}]
Sorts and matches sequence records with matching coordinates across files
optional arguments:
-h, --help show this help message and exit
-1 SEQ_FILES_1 [SEQ_FILES_1 ...]
An ordered list of FASTA/FASTQ files containing
head/primary sequences. (default: None)
-2 SEQ_FILES_2 [SEQ_FILES_2 ...]
An ordered list of FASTA/FASTQ files containing
tail/secondary sequences. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
--version show program's version number and exit
--1f FIELDS_1 [FIELDS_1 ...]
The annotation fields to copy from file 1 records into
file 2 records. If a copied annotation already exists
in a file 2 record, then the annotations copied from
file 1 will be added to the front of the existing
annotation. (default: None)
--2f FIELDS_2 [FIELDS_2 ...]
The annotation fields to copy from file 2 records into
file 1 records. If a copied annotation already exists
in a file 1 record, then the annotations copied from
file 2 will be added to the end of the existing
annotation. (default: None)
--coord {illumina,solexa,sra,454,presto}
The format of the sequence identifier which defines
shared coordinate information across mate pairs.
(default: presto)
output files:
pair-pass
successfully paired reads with modified annotations.
pair-fail
raw reads that could not be assigned to a mate-pair.
output annotation fields:
annotation fields specified by the --1f or --2f arguments.
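The copy behavior of --1f and --2f can be sketched with annotations held as dictionaries. The function `copy_fields` is hypothetical, and joining the copied and existing values with the value delimiter "," is an assumption based on the default --delim setting.

```python
def copy_fields(source, target, fields, prepend, value_delim=","):
    """Copy selected annotations from source into a copy of target.

    When a field already exists in target, the copied value is joined to
    the existing one: in front when prepend is True (as described for
    --1f), at the end otherwise (as described for --2f).
    """
    merged = dict(target)
    for field in fields:
        if field not in source:
            continue
        if field in merged:
            pair = [source[field], merged[field]] if prepend else [merged[field], source[field]]
            merged[field] = value_delim.join(pair)
        else:
            merged[field] = source[field]
    return merged

read_1 = {"ID": "SEQ1", "BARCODE": "AAGT", "PRIMER": "V1"}
read_2 = {"ID": "SEQ1", "PRIMER": "C1"}

# Like --1f BARCODE PRIMER: file 1 values go in front of file 2 values.
print(copy_fields(read_1, read_2, ["BARCODE", "PRIMER"], prepend=True))
# Like --2f PRIMER: file 2 values go at the end of file 1 values.
print(copy_fields(read_2, read_1, ["PRIMER"], prepend=False))
```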
ParseHeaders
usage: ParseHeaders.py [-h] [--version] ...
Parses pRESTO annotations in FASTA/FASTQ sequence headers
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
subcommands:
Annotation operation
add Adds field/value pairs to header annotations
collapse Collapses header annotations with multiple entries
copy Copies header annotation fields
delete Deletes fields from header annotations
expand Expands annotation fields with multiple values
rename Renames header annotation fields
table Writes sequence headers to a table
output files:
reheader-pass
reads passing annotation operation and modified accordingly.
reheader-fail
raw reads failing annotation operation.
headers
tab delimited table of the selected annotations.
output annotation fields:
annotation fields specified by the -f argument.
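The three default delimiters ('|', '=', ',') imply a header layout that can be sketched as follows. The sketch assumes the first block is the sequence identifier, stored under the hidden field name "ID" mentioned for the table subcommand; the function names and the example header are hypothetical.

```python
def parse_header(header, delimiters=("|", "=", ",")):
    """Split a header into a dict of annotations, with the identifier under "ID"."""
    block_delim, field_delim, _value_delim = delimiters
    blocks = header.split(block_delim)
    annotations = {"ID": blocks[0]}
    for block in blocks[1:]:
        field, _, value = block.partition(field_delim)
        annotations[field] = value
    return annotations

def format_header(annotations, delimiters=("|", "=", ",")):
    """Rebuild a header string from a dict produced by parse_header."""
    block_delim, field_delim, _value_delim = delimiters
    blocks = [annotations["ID"]]
    blocks += [f"{k}{field_delim}{v}" for k, v in annotations.items() if k != "ID"]
    return block_delim.join(blocks)

header = "SEQ1|BARCODE=AAGT|PRIMER=V1,V2"
parsed = parse_header(header)
print(parsed)
print(format_header(parsed) == header)  # prints True
```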
add
usage: ParseHeaders.py add [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed] [--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME] -f FIELDS
[FIELDS ...] -u VALUES [VALUES ...]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-f FIELDS [FIELDS ...]
List of fields to add. (default: None)
-u VALUES [VALUES ...]
List of values to add for each field. (default: None)
collapse
usage: ParseHeaders.py collapse [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed]
[--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME] -f
FIELDS [FIELDS ...] --act
{min,max,sum,first,last,set,cat}
[{min,max,sum,first,last,set,cat} ...]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-f FIELDS [FIELDS ...]
List of fields to collapse. (default: None)
--act {min,max,sum,first,last,set,cat} [{min,max,sum,first,last,set,cat} ...]
List of actions to take for each field defining how
each annotation will be combined into a single value.
The actions "min", "max", "sum" perform the
corresponding mathematical operation on numeric
annotations. The actions "first" and "last" choose the
value from the corresponding position in the
annotation. The action "set" collapses annotations
into a comma delimited list of unique values. The
action "cat" concatenates the values together into a
single string. (default: None)
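The collapse actions can be sketched on a single multi-valued annotation. The function `collapse_values` is hypothetical. Two details are assumptions rather than documented behavior: "set" preserves the order of first appearance, and numeric results are written without a trailing ".0" when they are whole numbers.

```python
def _format_number(number):
    """Render whole numbers without a decimal point."""
    return str(int(number)) if float(number).is_integer() else str(number)

def collapse_values(value, action, value_delim=","):
    """Collapse a delimited annotation value into a single value."""
    values = value.split(value_delim)
    if action in ("min", "max", "sum"):
        numbers = [float(v) for v in values]
        return _format_number({"min": min, "max": max, "sum": sum}[action](numbers))
    if action == "first":
        return values[0]
    if action == "last":
        return values[-1]
    if action == "set":
        # Unique values, kept in order of first appearance (an assumption).
        return value_delim.join(dict.fromkeys(values))
    if action == "cat":
        return "".join(values)
    raise ValueError(f"unknown action: {action}")

for action in ("min", "max", "sum", "first", "last"):
    print(action, collapse_values("2,5,3", action))
print("set", collapse_values("A,B,A", "set"))
print("cat", collapse_values("A,B,A", "cat"))
```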
copy
usage: ParseHeaders.py copy [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed] [--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME] -f FIELDS
[FIELDS ...] -k NAMES [NAMES ...]
[--act {min,max,sum,first,last,set,cat} [{min,max,sum,first,last,set,cat} ...]]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-f FIELDS [FIELDS ...]
List of fields to copy. (default: None)
-k NAMES [NAMES ...] List of names for each copied field. If the new field
is already present, the copied field will be merged
into the existing field. (default: None)
--act {min,max,sum,first,last,set,cat} [{min,max,sum,first,last,set,cat} ...]
List of collapse actions to take on each new field
following the copy operation defining how each
annotation will be combined into a single value. The
actions "min", "max", "sum" perform the corresponding
mathematical operation on numeric annotations. The
actions "first" and "last" choose the value from the
corresponding position in the annotation. The action
"set" collapses annotations into a comma delimited
list of unique values. The action "cat" concatenates
the values together into a single string. (default:
None)
delete
usage: ParseHeaders.py delete [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed]
[--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME] -f
FIELDS [FIELDS ...]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-f FIELDS [FIELDS ...]
List of fields to delete. (default: None)
expand
usage: ParseHeaders.py expand [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed]
[--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME] -f
FIELDS [FIELDS ...] [--sep SEPARATOR]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-f FIELDS [FIELDS ...]
List of fields to expand. (default: None)
--sep SEPARATOR The character separating each value in the fields.
(default: ,)
rename
usage: ParseHeaders.py rename [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--failed]
[--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME] -f
FIELDS [FIELDS ...] -k NAMES [NAMES ...]
[--act {min,max,sum,first,last,set,cat} [{min,max,sum,first,last,set,cat} ...]]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-f FIELDS [FIELDS ...]
List of fields to rename. (default: None)
-k NAMES [NAMES ...] List of new names for each field. If the new field is
already present, the renamed field will be merged into
the existing field and the old field will be deleted.
(default: None)
--act {min,max,sum,first,last,set,cat} [{min,max,sum,first,last,set,cat} ...]
List of collapse actions to take on each new field
following the rename operation defining how each
annotation will be combined into a single value. The
actions "min", "max", "sum" perform the corresponding
mathematical operation on numeric annotations. The
actions "first" and "last" choose the value from the
corresponding position in the annotation. The action
"set" collapses annotations into a comma delimited
list of unique values. The action "cat" concatenates
the values together into a single string. (default:
None)
table
usage: ParseHeaders.py table [-h] -s SEQ_FILES [SEQ_FILES ...] [--failed]
[--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME] -f FIELDS
[FIELDS ...]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--failed If specified create files containing records that fail
processing. (default: False)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-f FIELDS [FIELDS ...]
List of fields to collect. The sequence identifier may
be specified using the hidden field name "ID".
(default: None)
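Collecting header fields into a tab-delimited table can be sketched as below. The function `headers_to_table` is hypothetical, the first header block is assumed to be the sequence identifier (exposed as "ID"), and leaving a cell empty when a field is absent is an assumption about how missing values are handled.

```python
import csv
import io

def headers_to_table(headers, fields, delimiters=("|", "=", ",")):
    """Return a tab-delimited table with one row per header, one column per field."""
    block_delim, field_delim, _value_delim = delimiters
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    for header in headers:
        blocks = header.split(block_delim)
        record = {"ID": blocks[0]}
        record.update(b.split(field_delim, 1) for b in blocks[1:] if field_delim in b)
        # Fields missing from a header are written as empty cells.
        writer.writerow([record.get(f, "") for f in fields])
    return out.getvalue()

headers = ["SEQ1|BARCODE=AAGT|PRIMER=V1", "SEQ2|BARCODE=CCTA"]
print(headers_to_table(headers, ["ID", "BARCODE", "PRIMER"]))
```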
ParseLog
usage: ParseLog.py [-h] [--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME] [--version] -l
RECORD_FILES [RECORD_FILES ...] -f FIELDS [FIELDS ...]
Parses records in the console log of pRESTO modules
optional arguments:
-h, --help show this help message and exit
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Specify to change the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
--version show program's version number and exit
-l RECORD_FILES [RECORD_FILES ...]
List of log files to parse. (default: None)
-f FIELDS [FIELDS ...]
List of fields to collect. The sequence identifier may
be specified using the hidden field name "ID".
(default: None)
output files:
table
tab delimited table of the selected annotations.
output annotation fields:
annotation fields specified by the -f argument.
SplitSeq
usage: SplitSeq.py [-h] [--version] ...
Sorts, samples and splits FASTA/FASTQ sequence files
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
subcommands:
Sequence file operation
count Splits sequence files by number of records
group Splits sequence files by annotation
sample Randomly samples from unpaired sequence files
samplepair
Randomly samples from paired-end sequence files
sort Sorts sequence files by annotation
output files:
part<part>
reads partitioned by count, where <part> is the partition number.
<field>-<value>
reads partitioned by annotation <field> and <value>.
under-<number>
reads partitioned by numeric threshold where the annotation value is
strictly less than the threshold <number>.
atleast-<number>
reads partitioned by numeric threshold where the annotation value is
greater than or equal to the threshold <number>.
sorted
reads sorted by annotation value.
sorted-part<part>
reads sorted by annotation value and partitioned by count, where
<part> is the partition number.
sample<i>-n<count>
randomly sampled reads where <i> is a number specifying the sampling
instance and <count> is the number of sampled reads.
output annotation fields:
None
count
usage: SplitSeq.py count [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--outdir OUT_DIR] [--outname OUT_NAME] -n MAX_COUNT
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--outdir OUT_DIR Changes the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-n MAX_COUNT Maximum number of sequences in each new file (default:
None)
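The partitioning that -n MAX_COUNT describes can be pictured with a short sketch. This is only an illustration under stated assumptions: the function name is hypothetical, records are held in a list rather than streamed from a file, and numbering parts from 1 is a guess rather than documented behavior.

```python
def split_by_count(records, max_count):
    """Yield (part_number, chunk) pairs with at most max_count records per chunk."""
    if max_count < 1:
        raise ValueError("max_count must be a positive integer")
    for start in range(0, len(records), max_count):
        # Part numbers start at 1 in this sketch (an assumption).
        yield start // max_count + 1, records[start:start + max_count]


# Seven placeholder records split into files of at most three records each.
parts = list(split_by_count(list(range(7)), 3))
for number, chunk in parts:
    print(number, chunk)
```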
group
usage: SplitSeq.py group [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME] -f FIELD
[--num THRESHOLD]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Changes the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-f FIELD Annotation field to split sequence files by (default:
None)
--num THRESHOLD Specify to define the split field as numeric and group
sequences by whether the value is strictly less than,
or at least, THRESHOLD (default: None)
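The two grouping modes can be sketched as follows: without a threshold, records are grouped by the distinct values of the field; with a threshold, they fall into an "under" group (strictly less than) or an "atleast" group (greater than or equal to), matching the output file descriptions for SplitSeq. The function name, the record dictionaries, the COUNT and PRIMER fields, and the exact label strings are hypothetical and are not taken from pRESTO itself.

```python
from collections import defaultdict


def group_records(records, field, threshold=None):
    """Group record dicts by a field value, or by a numeric threshold if given."""
    groups = defaultdict(list)
    for rec in records:
        value = rec[field]
        if threshold is None:
            label = f"{field}-{value}"          # one group per distinct value
        elif float(value) < threshold:
            label = f"under-{threshold}"        # strictly less than the threshold
        else:
            label = f"atleast-{threshold}"      # greater than or equal to it
        groups[label].append(rec)
    return dict(groups)


# Made-up records; annotation values are kept as strings, as in a header.
records = [
    {"ID": "SEQ1", "PRIMER": "IGHM", "COUNT": "1"},
    {"ID": "SEQ2", "PRIMER": "IGHG", "COUNT": "2"},
    {"ID": "SEQ3", "PRIMER": "IGHM", "COUNT": "5"},
]
by_value = group_records(records, "PRIMER")
by_threshold = group_records(records, "COUNT", threshold=2)
print(sorted(by_value), sorted(by_threshold))
```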
sample
usage: SplitSeq.py sample [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME] -n MAX_COUNT
[MAX_COUNT ...] [-f FIELD] [-u VALUES [VALUES ...]]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Changes the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-n MAX_COUNT [MAX_COUNT ...]
Maximum number of sequences to sample from each file
(default: None)
-f FIELD The annotation field for sampling criteria (default:
None)
-u VALUES [VALUES ...]
A list of annotation values that sequences must
contain one of; requires the -f argument (default:
None)
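A minimal sketch of sampling with the -f/-u restriction is shown below. It covers only the case where both a field and a list of values are given; what happens when -f is used without -u is not spelled out in this help text, so the sketch leaves it alone. The function name, the seed parameter, and the example records are hypothetical.

```python
import random


def sample_records(records, max_count, field=None, values=None, seed=None):
    """Randomly sample up to max_count records, optionally restricted by field values."""
    if values is not None and field is None:
        raise ValueError("values requires a field, as -u requires -f")
    pool = records
    if values is not None:
        allowed = set(values)
        # Keep only records whose field holds one of the allowed values.
        pool = [rec for rec in records if rec.get(field) in allowed]
    rng = random.Random(seed)  # seed is an illustrative addition for repeatability
    return rng.sample(pool, min(max_count, len(pool)))


# Made-up records for illustration.
records = [
    {"ID": "SEQ1", "PRIMER": "IGHM"},
    {"ID": "SEQ2", "PRIMER": "IGHG"},
    {"ID": "SEQ3", "PRIMER": "IGHM"},
    {"ID": "SEQ4", "PRIMER": "IGHA"},
    {"ID": "SEQ5", "PRIMER": "IGHM"},
]
sampled = sample_records(records, 2, field="PRIMER", values=["IGHM"], seed=1)
print([rec["ID"] for rec in sampled])
```

Because MAX_COUNT is a maximum, the sketch returns fewer records when the filtered pool is smaller than the requested count.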
samplepair
usage: SplitSeq.py samplepair [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...] -2
SEQ_FILES_2 [SEQ_FILES_2 ...] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME] -n
MAX_COUNT [MAX_COUNT ...] [-f FIELD]
[-u VALUES [VALUES ...]]
[--coord {illumina,solexa,sra,454,presto}]
optional arguments:
-h, --help show this help message and exit
-1 SEQ_FILES_1 [SEQ_FILES_1 ...]
An ordered list of FASTA/FASTQ files containing
head/primary sequences. (default: None)
-2 SEQ_FILES_2 [SEQ_FILES_2 ...]
An ordered list of FASTA/FASTQ files containing
tail/secondary sequences. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Changes the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-n MAX_COUNT [MAX_COUNT ...]
A list of the number of sequences to sample from each
file (default: None)
-f FIELD The annotation field for sampling criteria (default:
None)
-u VALUES [VALUES ...]
A list of annotation values that both paired sequences
must contain one of; requires the -f argument
(default: None)
--coord {illumina,solexa,sra,454,presto}
The format of the sequence identifier which defines
shared coordinate information across paired ends
(default: presto)
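Paired sampling has to draw the same reads from both files, which is why a shared coordinate is needed. The sketch below assumes, purely for illustration, that the coordinate is the header text before the first annotation delimiter; the actual rules for each --coord format are not described in this help text. The function name, seed parameter, and headers are hypothetical.

```python
import random


def sample_pairs(heads, tails, max_count, seed=None, delim="|"):
    """Sample up to max_count (head, tail) header pairs that share a coordinate."""
    def coord(header):
        # Assumed coordinate: everything before the first delimiter.
        return header.split(delim)[0]

    tail_index = {coord(h): h for h in tails}
    shared = [h for h in heads if coord(h) in tail_index]
    rng = random.Random(seed)
    chosen = rng.sample(shared, min(max_count, len(shared)))
    return [(h, tail_index[coord(h)]) for h in chosen]


# Made-up headers; only B and C occur in both files.
heads = ["A|READ=1", "B|READ=1", "C|READ=1"]
tails = ["B|READ=2", "C|READ=2", "D|READ=2"]
pairs = sample_pairs(heads, tails, 5, seed=0)
print(pairs)
```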
sort
usage: SplitSeq.py sort [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER]
[--outdir OUT_DIR] [--outname OUT_NAME] -f FIELD
[-n MAX_COUNT] [--num]
optional arguments:
-h, --help show this help message and exit
-s SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
--fasta Specify to force output as FASTA rather than FASTQ.
(default: None)
--delim DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=',
','))
--outdir OUT_DIR Changes the output directory to the
location specified. The input file directory is used
if this is not specified. (default: None)
--outname OUT_NAME Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
-f FIELD The annotation field to sort sequences by (default:
None)
-n MAX_COUNT Maximum number of sequences in each new file (default:
None)
--num Specify to define the sort field as numeric rather
than textual (default: False)
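The difference between textual and numeric sorting, and the optional partitioning by -n MAX_COUNT, can be sketched as below. Ascending order, the function name, the COUNT field, and the in-memory record dictionaries are assumptions made for the example, not documented pRESTO behavior.

```python
def sort_records(records, field, numeric=False, max_count=None):
    """Sort record dicts by a field, then optionally split into chunks of max_count."""
    if numeric:
        key = lambda rec: float(rec[field])  # compare as numbers
    else:
        key = lambda rec: rec[field]         # compare as text
    ordered = sorted(records, key=key)       # ascending order is an assumption
    if max_count is None:
        return [ordered]
    return [ordered[i:i + max_count] for i in range(0, len(ordered), max_count)]


# Made-up records; annotation values are strings, as they would be in a header.
records = [{"ID": "S1", "COUNT": "10"}, {"ID": "S2", "COUNT": "9"}, {"ID": "S3", "COUNT": "100"}]
textual = [rec["COUNT"] for rec in sort_records(records, "COUNT")[0]]
numeric = [rec["COUNT"] for rec in sort_records(records, "COUNT", numeric=True)[0]]
chunks = sort_records(records, "COUNT", numeric=True, max_count=2)
print(textual, numeric, [len(chunk) for chunk in chunks])
```

The textual ordering places "100" before "9" because strings are compared character by character, which is the usual reason to request a numeric sort for count-like fields.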