Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 57 changed files with 3,251 additions and 4 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1 @@ | ||
2.2.1 | ||
2.3.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
--- | ||
title: AnnotateBamWithUmis | ||
--- | ||
|
||
# AnnotateBamWithUmis | ||
|
||
## Overview | ||
**Group:** SAM/BAM | ||
|
||
Annotates existing BAM files with UMIs (Unique Molecular Indices, aka Molecular IDs, | ||
Molecular barcodes) from separate FASTQ files. Takes an existing BAM file and either | ||
one FASTQ file with UMI reads or multiple FASTQs if there are multiple UMIs per template, | ||
matches the reads between the files based on read names, and produces an output BAM file | ||
where each record is annotated with an optional tag (specified by `attribute`) that | ||
contains the read sequence of the UMI. Trailing read numbers (`/1` or `/2`) are | ||
removed from FASTQ read names, as is any text after whitespace, before matching. | ||
If multiple UMI segments are specified (see `--read-structure`) across one or more FASTQs, | ||
they are delimited in the same order as FASTQs are specified on the command line. | ||
The delimiter is controlled by the `--delimiter` option. | ||
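The read-name clean-up described above can be sketched as follows. This is a minimal illustration, not the tool's implementation: the helper name `normalize_fastq_name` is hypothetical, and the order of the two steps (whitespace first, then read number) is an assumption.

```python
import re

# Matches a trailing "/1" or "/2" read number.
_READ_NUMBER_SUFFIX = re.compile(r"/[12]$")


def normalize_fastq_name(name: str) -> str:
    """Hypothetical helper: reduce a FASTQ read name to the form matched against the BAM."""
    # Drop any text after the first whitespace (e.g. a trailing comment field).
    parts = name.split(None, 1)
    base = parts[0] if parts else ""
    # Drop a trailing /1 or /2 read number, if present.
    return _READ_NUMBER_SUFFIX.sub("", base)
```

With this sketch, `q1/1`, `q1/2 extra` and `q1 1:N:0:ACGT` all reduce to `q1`, so they would match a BAM record named `q1`.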
|
||
The `--read-structure` option may be used to specify which bases in the FASTQ contain UMI | ||
bases. Otherwise it is assumed the FASTQ contains only UMI bases. | ||
|
||
The `--sorted` option may be used to indicate that the FASTQ has the same reads and is | ||
sorted in the same order as the BAM file. | ||
|
||
At the end of execution, reports how many records were processed and how many were | ||
missing UMIs. If any read from the BAM file did not have a matching UMI read in the | ||
FASTQ file, the program will exit with a non-zero exit status. The `--fail-fast` option | ||
may be specified to cause the program to terminate the first time it finds a record | ||
without a matching UMI. | ||
|
||
In order to avoid sorting the input files, the entire UMI fastq file(s) is read into | ||
memory. As a result the program needs to be run with memory proportional to the size of | ||
the (uncompressed) fastq(s). Use the `--sorted` option to traverse the UMI fastq and BAM | ||
files assuming they are in the same order. More precisely, the UMI fastq file will be | ||
traversed first, reading in the next set of BAM reads with same read name as the | ||
UMI's read name. Those BAM reads will be annotated. If no BAM reads exist for the UMI, | ||
no logging or error will be reported. | ||
|
||
## Arguments | ||
|
||
|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| | ||
|----|----|----|-----------|---------|---------------|----------------| | ||
|input|i|PathToBam|The input SAM or BAM file.|Required|1|| | ||
|fastq|f|PathToFastq|Input FASTQ(s) with UMI reads.|Required|Unlimited|| | ||
|output|o|PathToBam|Output BAM file to write.|Required|1|| | ||
|attribute|t|String|The BAM attribute to store UMI bases in.|Optional|1|RX| | ||
|qual-attribute|q|String|The BAM attribute to store UMI qualities in.|Optional|1|| | ||
|read-structure|r|ReadStructure|The read structure for the FASTQ, otherwise all bases will be used.|Required|Unlimited|+M| | ||
|sorted|s|Boolean|Whether the FASTQ file is sorted in the same order as the BAM.|Optional|1|false| | ||
|fail-fast||Boolean|If set, fail on the first missing UMI.|Optional|1|false| | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
--- | ||
title: AssessPhasing | ||
--- | ||
|
||
# AssessPhasing | ||
|
||
## Overview | ||
**Group:** VCF/BCF | ||
|
||
Assess the accuracy of phasing for a set of variants. | ||
|
||
All phased genotypes should be annotated with the `PS` (phase set) `FORMAT` tag, which by convention is the | ||
position of the first variant in the phase set (see the VCF specification). Furthermore, the alleles of a phased | ||
genotype should use the `|` separator instead of the `/` separator, where the latter indicates the genotype is | ||
unphased. | ||
|
||
The input VCFs are assumed to be single sample: the genotype from the first sample is used. | ||
|
||
Only bi-allelic heterozygous SNPs are considered. | ||
|
||
The input known phased variants can be subset using the known interval list, for example to keep only variants | ||
from high-confidence regions. | ||
|
||
If the intervals argument is supplied, only the set of chromosomes specified will be analyzed. Note that the full | ||
chromosome will be analyzed and start/stop positions will be ignored. | ||
|
||
## Arguments | ||
|
||
|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| | ||
|----|----|----|-----------|---------|---------------|----------------| | ||
|called-vcf|c|PathToVcf|The VCF with called phased variants.|Required|1|| | ||
|truth-vcf|t|PathToVcf|The VCF with known phased variants.|Required|1|| | ||
|output|o|PathPrefix|The output prefix for all output files.|Required|1|| | ||
|known-intervals|k|PathToIntervals|The interval list over which known phased variants should be kept.|Optional|1|| | ||
|allow-missing-fields-in-vcf-header|m|Boolean|Allow missing fields in the VCF header.|Optional|1|true| | ||
|skip-mismatching-alleles|s|Boolean|Skip sites where the truth and call are both called but do not share the same alleles.|Optional|1|true| | ||
|intervals|l|PathToIntervals|Analyze only the given chromosomes in the interval list. The entire chromosome will be analyzed (start and end ignored).|Optional|1|| | ||
|modify-blocks|b|Boolean|Remove enclosed phased blocks and truncate overlapping blocks.|Optional|1|true| | ||
|debug-vcf|d|Boolean|Output a VCF with the called variants annotated by whether their phase matches the truth.|Optional|1|false| | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
--- | ||
title: AssignPrimers | ||
--- | ||
|
||
# AssignPrimers | ||
|
||
## Overview | ||
**Group:** SAM/BAM | ||
|
||
Assigns reads to primers post-alignment. Takes in a BAM file of aligned reads and a tab-delimited file with five columns | ||
(`chrom`, `left_start`, `left_end`, `right_start`, and `right_end`) which provide the 1-based inclusive start and | ||
end positions of the primers for each amplicon. The primer file must include headers, e.g.: | ||
|
||
``` | ||
chrom left_start left_end right_start right_end | ||
chr1 1010873 1010894 1011118 1011137 | ||
``` | ||
|
||
Optionally, a sixth column `id` may be given with a unique name for the amplicon. If not given, the | ||
coordinates of the amplicon's primers will be used: | ||
`<chrom>:<left_start>-<left_end>,<chrom>:<right_start>:<right_end>` | ||
|
||
Each read is assigned independently of its mate (for paired end reads). The primer for a read is assumed to be | ||
located at the start of the read in 5' sequencing order. Therefore, a positive strand | ||
read will use its aligned start position to match against the amplicon's left-most coordinate, while a negative | ||
strand read will use its aligned end position to match against the amplicon's right-most coordinate. | ||
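The strand-dependent matching rule above can be sketched as follows. This is only an illustration under stated assumptions: the `Amplicon` class and `assign_amplicon` function are hypothetical names, the `slop` default of 5 mirrors the `--slop` argument, and returning the first amplicon within range is a simplification (the tool may resolve multiple candidates differently).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Amplicon:
    """One row of the primer file (1-based inclusive coordinates)."""
    chrom: str
    left_start: int
    left_end: int
    right_start: int
    right_end: int


def assign_amplicon(
    amplicons: Sequence[Amplicon],
    chrom: str,
    start: int,
    end: int,
    positive_strand: bool,
    slop: int = 5,
) -> Optional[Amplicon]:
    """Hypothetical helper: return the first amplicon whose primer matches the read."""
    for amp in amplicons:
        if amp.chrom != chrom:
            continue
        if positive_strand:
            # Positive strand: compare the read start to the left-most coordinate.
            distance = abs(start - amp.left_start)
        else:
            # Negative strand: compare the read end to the right-most coordinate.
            distance = abs(end - amp.right_end)
        if distance <= slop:
            return amp
    return None
```

For the example amplicon `chr1 1010873 1010894 1011118 1011137`, a positive-strand read starting at 1010875 would be assigned, while one starting at 1010900 would not.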
|
||
For paired end reads, the assignment for the mate will also be stored in the current read, using the same procedure as | ||
above but using the mate's coordinates. This requires that the input BAM have the mate-cigar ("MC") SAM tag. Read | ||
pairs must have both ends mapped in forward/reverse configuration to have an assignment. Furthermore, the amplicon | ||
assignment may be different for a read and its mate. This may occur, for example, when tiling nearby amplicons if | ||
a large deletion occurs over a given primer, thereby "skipping" an amplicon. This may also occur if there are | ||
translocations across amplicons. | ||
|
||
The output will have the following tags added (default tag names shown; see the tag arguments below): | ||
- rp: the assigned primer coordinates (ex. `chr1:1010873-1010894`) | ||
- mp: the mate's assigned primer coordinates (ex. `chr1:1011118-1011137`) | ||
- ra: the assigned amplicon id | ||
- ma: the mate's assigned amplicon id (or `=` if the same as the assigned amplicon) | ||
|
||
The read sequence of the primer is not checked against the expected reference sequence at the primer's genomic | ||
coordinates. | ||
|
||
In some cases, large deletions within one end of a read pair may cause primary and supplementary alignments to be | ||
produced by the aligner, with the supplementary alignment containing the primer end of the read (5' sequencing order). | ||
In this case, the primer may not be assigned for this end of the read pair. Therefore, it is recommended to prefer | ||
or choose the primary alignment that has the closest aligned read base to the 5' end of the read in sequencing order. | ||
For example, from `bwa` version `0.7.16` onwards, the `-5` option may be used. Consider also using the `-q` option | ||
for `bwa` `0.7.16` as well, which is standard in `0.7.17` onwards when the `-5` option is used. | ||
|
||
The `--annotate-all` option may be used to annotate all alignments for a given read end (e.g. R1) with | ||
the same assignment. If the assignment differs across alignments for the same read end, no assignment is given. | ||
Furthermore, if the input BAM is neither `queryname` sorted nor `query` grouped, it will be sorted into queryname | ||
order to assign all alignments across a template simultaneously. The output is written in coordinate order. | ||
|
||
## Arguments | ||
|
||
|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| | ||
|----|----|----|-----------|---------|---------------|----------------| | ||
|input|i|PathToBam|Input BAM file.|Required|1|| | ||
|output|o|PathToBam|Output BAM file.|Required|1|| | ||
|metrics|m|FilePath|Output metrics file.|Required|1|| | ||
|primers|p|FilePath|File with primer locations.|Required|1|| | ||
|slop|S|Int|Match to primer locations +/- this many bases.|Optional|1|5| | ||
|unclipped-coordinates|U|Boolean|True to match based on the unclipped coordinates (adjusted for hard/soft clipping), otherwise match based on the aligned bases.|Optional|1|true| | ||
|primer-coordinates-tag||String|The SAM tag for the assigned primer coordinate.|Optional|1|rp| | ||
|mate-primer-coordinates-tag||String|The SAM tag for the mate's assigned primer coordinate.|Optional|1|mp| | ||
|amplicon-identifier-tag||String|The SAM tag for the assigned amplicon identifier.|Optional|1|ra| | ||
|mate-amplicon-identifier-tag||String|The SAM tag for the mate's assigned amplicon identifier.|Optional|1|ma| | ||
|annotate-all||Boolean|Annotate all R1 (or R2) with same value.|Optional|1|false| | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
--- | ||
title: AutoGenerateReadGroupsByName | ||
--- | ||
|
||
# AutoGenerateReadGroupsByName | ||
|
||
## Overview | ||
**Group:** SAM/BAM | ||
|
||
Adds read groups to a BAM file for a single sample by parsing the read names. | ||
|
||
Will add one or more read groups by parsing the read names. The read names should be of the form: | ||
|
||
``` | ||
<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> | ||
``` | ||
|
||
Each unique combination of `<instrument>:<run number>:<flowcell ID>:<lane>` will be its own read group. The ID of the | ||
read group will be an integer and the platform unit will be `<flowcell ID>.<lane>`. | ||
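The grouping rule above can be sketched as follows. This is an illustration only: the function names are hypothetical, read names are assumed to have exactly the seven colon-separated fields shown, and numbering the integer IDs from 1 in order of first appearance is an assumption.

```python
from typing import Dict, Iterable, Tuple

ReadGroupKey = Tuple[str, str, str, str]


def read_group_key(read_name: str) -> ReadGroupKey:
    """Hypothetical helper: extract (instrument, run number, flowcell ID, lane)."""
    fields = read_name.split(":")
    if len(fields) != 7:
        raise ValueError(f"Unexpected read name format: {read_name}")
    instrument, run_number, flowcell_id, lane = fields[:4]
    return (instrument, run_number, flowcell_id, lane)


def assign_read_groups(read_names: Iterable[str]) -> Dict[ReadGroupKey, Dict[str, str]]:
    """First pass: collect one read group per unique key."""
    groups: Dict[ReadGroupKey, Dict[str, str]] = {}
    for name in read_names:
        key = read_group_key(name)
        if key not in groups:
            # ID is an integer (numbering from 1 is an assumption); PU is flowcell.lane.
            groups[key] = {"ID": str(len(groups) + 1), "PU": f"{key[2]}.{key[3]}"}
    return groups
```

Reads from the same flowcell but different lanes end up in different read groups, each with its own platform unit.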
|
||
The input is assumed to contain reads for one sample and library. Therefore, the sample and library must be given | ||
and will be applied to all read groups. Read groups will be replaced if present. | ||
|
||
Two passes will be performed on the input: first to gather all the read groups, and second to write the output BAM | ||
file. | ||
|
||
## Arguments | ||
|
||
|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| | ||
|----|----|----|-----------|---------|---------------|----------------| | ||
|input|i|PathToBam|Input SAM or BAM file|Required|1|| | ||
|output|o|PathToBam|Output SAM or BAM file|Required|1|| | ||
|sample|s|String|The sample to insert into the read groups|Required|1|| | ||
|library|l|String|The library to insert into the read groups|Required|1|| | ||
|sequencing-center||String|The sequencing center from which the data originated|Optional|1|| | ||
|predicted-insert-size||Integer|Predicted median insert size, to insert into the read groups|Optional|1|| | ||
|program-group||String|Program group to insert into the read groups|Optional|1|| | ||
|platform-model||String|Platform model to insert into the groups (free-form text providing further details of the platform/technology used)|Optional|1|| | ||
|description||String|Description inserted into the read groups|Optional|1|| | ||
|run-date||Iso8601Date|Date the run was produced (ISO 8601: `YYYY-MM-DD` ), to insert into the read groups|Optional|1|| | ||
|comments||String|Comment(s) to include in the merged output file's header.|Optional|Unlimited|| | ||
|sort-order||SamOrder|The sort order for the output sam/bam file.|Optional|1|| | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
--- | ||
title: CallDuplexConsensusReads | ||
--- | ||
|
||
# CallDuplexConsensusReads | ||
|
||
## Overview | ||
**Group:** Unique Molecular Identifiers (UMIs) | ||
|
||
Calls duplex consensus sequences from reads generated from the same _double-stranded_ source molecule. Prior | ||
to running this tool, reads must have been grouped with `GroupReadsByUmi` using the `paired` strategy. Doing | ||
so will apply (by default) MI tags to all reads of the form `*/A` and `*/B` where the `/A` and `/B` suffixes | ||
with the same identifier denote reads that are derived from opposite strands of the same source duplex molecule. | ||
|
||
Reads from the same unique molecule are first partitioned by source strand and assembled into single | ||
strand consensus molecules as described by CallMolecularConsensusReads. Subsequently, for molecules that | ||
have at least one observation of each strand, duplex consensus reads are assembled by combining the evidence | ||
from the two single strand consensus reads. | ||
|
||
Because of the nature of duplex sequencing, this tool does not support fragment reads - if found in the | ||
input they are _ignored_. Similarly, read pairs for which consensus reads cannot be generated for one or | ||
other read (R1 or R2) are omitted from the output. | ||
|
||
The consensus reads produced are unaligned, due to the difficulty and error-prone nature of inferring the consensus | ||
alignment. Consensus reads should therefore be aligned afterwards, which should not be too expensive as there are | ||
likely far fewer consensus reads than input raw reads. Please see how best to use this tool within the best-practice | ||
pipeline: https://github.com/fulcrumgenomics/fgbio/blob/main/docs/best-practice-consensus-pipeline.md | ||
|
||
Consensus reads have a number of additional optional tags set in the resulting BAM file. The tag names follow | ||
a pattern where the first letter (a, b or c) denotes that the tag applies to the first single strand consensus (a), | ||
second single-strand consensus (b) or the final duplex consensus (c). The second letter is intended to capture | ||
the meaning of the tag (e.g. d=depth, m=min depth, e=errors/error-rate) and is upper case for values that are | ||
one per read and lower case for values that are one per base. | ||
|
||
The tags break down into those that are single-valued per read: | ||
|
||
``` | ||
consensus depth [aD,bD,cD] (int) : the maximum depth of raw reads at any point in the consensus reads | ||
consensus min depth [aM,bM,cM] (int) : the minimum depth of raw reads at any point in the consensus reads | ||
consensus error rate [aE,bE,cE] (float): the fraction of bases in raw reads disagreeing with the final consensus calls | ||
``` | ||
|
||
And those that have a value per base (duplex values are not generated, but can be generated by summing): | ||
|
||
``` | ||
consensus depth [ad,bd] (short[]): the count of bases contributing to each single-strand consensus read at each position | ||
consensus errors [ae,be] (short[]): the count of bases from raw reads disagreeing with the final single-strand consensus base | ||
consensus bases  [ac,bc] (string): the single-strand consensus bases | ||
consensus quals  [aq,bq] (string): the single-strand consensus qualities | ||
``` | ||
|
||
The per base depths and errors are both capped at 32,767. In all cases no-calls (Ns) and bases below the | ||
min-input-base-quality are not counted in tag value calculations. | ||
|
||
The `--min-reads` option can take 1-3 values, similar to `FilterConsensusReads`. For example: | ||
|
||
``` | ||
CallDuplexConsensusReads ... --min-reads 10 5 3 | ||
``` | ||
|
||
If fewer than three values are supplied, the last value is repeated (i.e. `5 4` -> `5 4 4` and `1` -> `1 1 1`). The | ||
first value applies to the final consensus read, the second value to one single-strand consensus, and the last | ||
value to the other single-strand consensus. It is required that if values two and three differ, | ||
the _more stringent value comes earlier_. | ||
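The padding and ordering rule for `--min-reads` can be sketched as follows. This is a minimal illustration, not the tool's code: the function name `expand_min_reads` is hypothetical, and "more stringent" is taken to mean the larger minimum.

```python
from typing import Sequence, Tuple


def expand_min_reads(values: Sequence[int]) -> Tuple[int, int, int]:
    """Hypothetical helper: pad 1-3 min-reads values to (duplex, strand one, strand two)."""
    if not 1 <= len(values) <= 3:
        raise ValueError("min-reads takes between one and three values")
    # Repeat the last value until there are three.
    padded = list(values) + [values[-1]] * (3 - len(values))
    # Values two and three apply to the single-strand consensuses; the more
    # stringent (assumed: larger) value must come first.
    if padded[1] < padded[2]:
        raise ValueError("the more stringent single-strand value must come earlier")
    return (padded[0], padded[1], padded[2])
```

Under these assumptions `5 4` expands to `5 4 4`, `1` expands to `1 1 1`, and `10 3 5` is rejected because the third value is more stringent than the second.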
|
||
## Arguments | ||
|
||
|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| | ||
|----|----|----|-----------|---------|---------------|----------------| | ||
|input|i|PathToBam|The input SAM or BAM file.|Required|1|| | ||
|output|o|PathToBam|Output SAM or BAM file to write consensus reads.|Required|1|| | ||
|read-name-prefix|p|String|The prefix for all consensus read names.|Optional|1|| | ||
|read-group-id|R|String|The new read group ID for all the consensus reads.|Optional|1|A| | ||
|error-rate-pre-umi|1|PhredScore|The Phred-scaled error rate for an error prior to the UMIs being integrated.|Optional|1|45| | ||
|error-rate-post-umi|2|PhredScore|The Phred-scaled error rate for an error after the UMIs have been integrated.|Optional|1|40| | ||
|min-input-base-quality|m|PhredScore|Ignore bases in raw reads that have Q below this value.|Optional|1|10| | ||
|trim|t|Boolean|If true, quality trim input reads in addition to masking low Q bases.|Optional|1|false| | ||
|sort-order|S|SamOrder|The sort order of the output, the same as the input if not given.|Optional|1|| | ||
|min-reads|M|Int|The minimum number of input reads to a consensus read.|Required|3|1| | ||
|max-reads-per-strand||Int|The maximum number of reads to use when building a single-strand consensus. If more than this many reads are present in a tag family, the family is randomly downsampled to exactly max-reads reads.|Optional|1|| | ||
|threads||Int|The number of threads to use while consensus calling.|Optional|1|1| | ||
|consensus-call-overlapping-bases||Boolean|Consensus call overlapping bases in mapped paired end reads|Optional|1|true| | ||
|