Adding docs for 2.3.0
nh13 committed Jul 31, 2024
1 parent 7190b7e commit 33305a3
Showing 57 changed files with 3,251 additions and 4 deletions.
6 changes: 4 additions & 2 deletions index.md
@@ -11,6 +11,8 @@

fgbio is a command line toolkit for working with genomic and particularly next generation sequencing data.

See the [latest available tools here](tools/latest).

## Quick Installation

The [conda](https://conda.io/) package manager (configured with [bioconda channels](https://bioconda.github.io/)) can be used to quickly install fgbio:
@@ -39,8 +41,8 @@ If the reported version on the first line starts with `1.8` or higher, you are a

Once you have Java installed and a release downloaded you can run:

* Run `java -jar fgbio-2.2.1.jar` to get a list of available tools
* Run `java -jar fgbio-2.2.1.jar <Tool Name>` to see detailed usage instructions on any tool
* Run `java -jar fgbio-2.3.0.jar` to get a list of available tools
* Run `java -jar fgbio-2.3.0.jar <Tool Name>` to see detailed usage instructions on any tool

When running tools, we recommend the following set of Java options as a starting point, though individual tools may need more or less memory depending on the input data:

513 changes: 513 additions & 0 deletions metrics/2.3.0/index.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion metrics/latest
53 changes: 53 additions & 0 deletions tools/2.3.0/AnnotateBamWithUmis.md
@@ -0,0 +1,53 @@
---
title: AnnotateBamWithUmis
---

# AnnotateBamWithUmis

## Overview
**Group:** SAM/BAM

Annotates existing BAM files with UMIs (Unique Molecular Indices, aka Molecular IDs,
Molecular barcodes) from separate FASTQ files. Takes an existing BAM file and either
one FASTQ file with UMI reads or multiple FASTQs if there are multiple UMIs per template,
matches the reads between the files based on read names, and produces an output BAM file
where each record is annotated with an optional tag (specified by `--attribute`) that
contains the read sequence of the UMI. Trailing read numbers (`/1` or `/2`) are
removed from FASTQ read names, as is any text after whitespace, before matching.
If multiple UMI segments are specified (see `--read-structure`) across one or more FASTQs,
they are delimited in the same order as FASTQs are specified on the command line.
The delimiter is controlled by the `--delimiter` option.
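
The name normalization and UMI joining described above can be sketched as follows. This is a minimal illustration, not fgbio's implementation: the function names are hypothetical, the order of the two normalization steps is an assumption, and the delimiter is passed in explicitly because its default is not stated here.

```python
import re

# Matches a trailing "/1" or "/2" read number.
_READ_NUMBER_SUFFIX = re.compile(r"/[12]$")


def normalize_fastq_name(name: str) -> str:
    """Return the name used for matching: drop text after whitespace, then a trailing /1 or /2.

    Assumes the leading '@' of the FASTQ header has already been removed.
    """
    parts = name.split(None, 1)  # split on the first run of whitespace
    first = parts[0] if parts else ""
    return _READ_NUMBER_SUFFIX.sub("", first)


def join_umis(segments, delimiter):
    """Join UMI segments in the order their FASTQs were given on the command line."""
    return delimiter.join(segments)
```

A BAM record whose name equals `normalize_fastq_name(fastq_name)` would then receive the joined UMI string in the tag chosen by `--attribute`.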

The `--read-structure` option may be used to specify which bases in the FASTQ contain UMI
bases. Otherwise it is assumed the FASTQ contains only UMI bases.

The `--sorted` option may be used to indicate that the FASTQ has the same reads and is
sorted in the same order as the BAM file.

At the end of execution, reports how many records were processed and how many were
missing UMIs. If any read from the BAM file did not have a matching UMI read in the
FASTQ file, the program will exit with a non-zero exit status. The `--fail-fast` option
may be specified to cause the program to terminate the first time it finds a record
without a matching UMI.
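
As a rough sketch of the reporting behaviour described above (assuming the default, unsorted mode in which UMIs are looked up by read name), the logic might look like the following. The function name, the `LookupError`, and the specific exit status `1` are assumptions made for illustration; the text only states that the exit status is non-zero.

```python
def annotate_records(bam_names, umis_by_name, fail_fast=False):
    """Pair each BAM read name with its UMI and count the reads with no UMI.

    bam_names:    iterable of read names from the BAM
    umis_by_name: dict mapping normalized FASTQ read name -> UMI sequence
    Returns (annotated, missing, exit_status).
    """
    annotated = []
    missing = 0
    for name in bam_names:
        umi = umis_by_name.get(name)
        if umi is None:
            missing += 1
            if fail_fast:
                # Mirrors --fail-fast: stop at the first record without a UMI.
                raise LookupError(f"No UMI found for read: {name}")
        annotated.append((name, umi))
    # Any missing UMI results in a non-zero exit status (1 is an assumed value).
    exit_status = 1 if missing else 0
    return annotated, missing, exit_status
```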

In order to avoid sorting the input files, the entire UMI fastq file(s) is read into
memory. As a result the program needs to be run with memory proportional to the size of
the (uncompressed) fastq(s). Use the `--sorted` option to traverse the UMI fastq and BAM
files assuming they are in the same order. More precisely, the UMI fastq file will be
traversed first, reading in the next set of BAM reads with same read name as the
UMI's read name. Those BAM reads will be annotated. If no BAM reads exist for the UMI,
no logging or error will be reported.

## Arguments

|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)|
|----|----|----|-----------|---------|---------------|----------------|
|input|i|PathToBam|The input SAM or BAM file.|Required|1||
|fastq|f|PathToFastq|Input FASTQ(s) with UMI reads.|Required|Unlimited||
|output|o|PathToBam|Output BAM file to write.|Required|1||
|attribute|t|String|The BAM attribute to store UMI bases in.|Optional|1|RX|
|qual-attribute|q|String|The BAM attribute to store UMI qualities in.|Optional|1||
|read-structure|r|ReadStructure|The read structure for the FASTQ, otherwise all bases will be used.|Required|Unlimited|+M|
|sorted|s|Boolean|Whether the FASTQ file is sorted in the same order as the BAM.|Optional|1|false|
|fail-fast||Boolean|If set, fail on the first missing UMI.|Optional|1|false|

40 changes: 40 additions & 0 deletions tools/2.3.0/AssessPhasing.md
@@ -0,0 +1,40 @@
---
title: AssessPhasing
---

# AssessPhasing

## Overview
**Group:** VCF/BCF

Assess the accuracy of phasing for a set of variants.

All phased genotypes should be annotated with the `PS` (phase set) `FORMAT` tag, which by convention is the
position of the first variant in the phase set (see the VCF specification). Furthermore, the alleles of a phased
genotype should use the `|` separator instead of the `/` separator, where the latter indicates the genotype is
unphased.

The input VCFs are assumed to be single sample: the genotype from the first sample is used.

Only bi-allelic heterozygous SNPs are considered.
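
The genotype conventions above can be illustrated with a small sketch. The helper names are hypothetical and the checks are simplified (for example, symbolic or spanning-deletion alleles are not handled); this is not how the tool itself is implemented.

```python
def is_phased(gt: str) -> bool:
    """A genotype is phased when its alleles are separated by '|' rather than '/'."""
    return "|" in gt and "/" not in gt


def is_biallelic_het_snp(ref: str, alts: list, gt: str) -> bool:
    """True for a site with one ALT allele, single-base REF/ALT, and a heterozygous call."""
    if len(alts) != 1:
        return False  # not bi-allelic
    if len(ref) != 1 or len(alts[0]) != 1:
        return False  # not a SNP
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return False  # not a fully called diploid genotype
    return alleles[0] != alleles[1]
```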

The input known phased variants can be subsetted using the known interval list, for example to keep only variants
from high-confidence regions.

If the intervals argument is supplied, only the set of chromosomes specified will be analyzed. Note that the full
chromosome will be analyzed and start/stop positions will be ignored.

## Arguments

|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)|
|----|----|----|-----------|---------|---------------|----------------|
|called-vcf|c|PathToVcf|The VCF with called phased variants.|Required|1||
|truth-vcf|t|PathToVcf|The VCF with known phased variants.|Required|1||
|output|o|PathPrefix|The output prefix for all output files.|Required|1||
|known-intervals|k|PathToIntervals|The interval list over which known phased variants should be kept.|Optional|1||
|allow-missing-fields-in-vcf-header|m|Boolean|Allow missing fields in the VCF header.|Optional|1|true|
|skip-mismatching-alleles|s|Boolean|Skip sites where the truth and call are both called but do not share the same alleles.|Optional|1|true|
|intervals|l|PathToIntervals|Analyze only the given chromosomes in the interval list. The entire chromosome will be analyzed (start and end ignored).|Optional|1||
|modify-blocks|b|Boolean|Remove enclosed phased blocks and truncate overlapping blocks.|Optional|1|true|
|debug-vcf|d|Boolean|Output a VCF with the called variants annotated by whether their phase matches the truth.|Optional|1|false|

71 changes: 71 additions & 0 deletions tools/2.3.0/AssignPrimers.md
@@ -0,0 +1,71 @@
---
title: AssignPrimers
---

# AssignPrimers

## Overview
**Group:** SAM/BAM

Assigns reads to primers post-alignment. Takes in a BAM file of aligned reads and a tab-delimited file with five columns
(`chrom`, `left_start`, `left_end`, `right_start`, and `right_end`) which provide the 1-based inclusive start and
end positions of the primers for each amplicon. The primer file must include headers, e.g.:

```
chrom left_start left_end right_start right_end
chr1 1010873 1010894 1011118 1011137
```

Optionally, a sixth column `id` may be given with a unique name for the amplicon. If not given, the
coordinates of the amplicon's primers will be used:
`<chrom>:<left_start>-<left_end>,<chrom>:<right_start>-<right_end>`
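
To make the primer file layout concrete, here is a minimal sketch of reading such a file, including the optional `id` column. The `Amplicon` class and `parse_primers` function are hypothetical names for illustration and do not mirror fgbio's own classes.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Amplicon:
    chrom: str
    left_start: int   # 1-based inclusive
    left_end: int
    right_start: int
    right_end: int
    id: Optional[str] = None  # optional sixth column


def parse_primers(text: str) -> List[Amplicon]:
    """Parse tab-delimited primer records; the first line must be the header."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    amplicons = []
    for row in reader:
        amplicons.append(Amplicon(
            chrom=row["chrom"],
            left_start=int(row["left_start"]),
            left_end=int(row["left_end"]),
            right_start=int(row["right_start"]),
            right_end=int(row["right_end"]),
            id=row.get("id") or None,  # absent or empty column -> None
        ))
    return amplicons
```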

Each read is assigned independently of its mate (for paired end reads). The primer for a read is assumed to be
located at the start of the read in 5' sequencing order. Therefore, a positive strand
read will use its aligned start position to match against the amplicon's left-most coordinate, while a negative
strand read will use its aligned end position to match against the amplicon's right-most coordinate.
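
The strand-dependent matching rule can be sketched as below. This is an illustration under assumptions: the names are hypothetical, the first matching amplicon wins (how the tool resolves multiple candidates is not described here), and the tolerance uses the `--slop` default of 5 from the arguments table.

```python
from typing import NamedTuple, Optional, Sequence


class Amplicon(NamedTuple):
    chrom: str
    left_start: int
    left_end: int
    right_start: int
    right_end: int


def assign_amplicon(amplicons: Sequence[Amplicon], chrom: str, start: int, end: int,
                    positive_strand: bool, slop: int = 5) -> Optional[Amplicon]:
    """Return the first amplicon whose primer position is within `slop` bases of the read.

    Positive-strand reads compare their start to the amplicon's left-most coordinate;
    negative-strand reads compare their end to the amplicon's right-most coordinate.
    """
    for amp in amplicons:
        if amp.chrom != chrom:
            continue
        if positive_strand:
            distance = abs(start - amp.left_start)
        else:
            distance = abs(end - amp.right_end)
        if distance <= slop:
            return amp
    return None
```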

For paired end reads, the assignment for the mate will also be stored in the current read, using the same procedure as
above but using the mate's coordinates. This requires the input BAM have the mate-cigar ("MC") SAM tag. Read
pairs must have both ends mapped in forward/reverse configuration to have an assignment. Furthermore, the amplicon
assignment may be different for a read and its mate. This may occur, for example, if tiling nearby amplicons and
a large deletion occurs over a given primer, thereby "skipping" an amplicon. This may also occur if there are
translocations across amplicons.

The output will have the following tags added:
- rp: the assigned primer coordinates (ex. `chr1:1010873-1010894`)
- mp: the mate's assigned primer coordinates (ex. `chr1:1011118-1011137`)
- ra: the assigned amplicon id
- ma: the mate's assigned amplicon id (or `=` if the same as the assigned amplicon)

The read sequence of the primer is not checked against the expected reference sequence at the primer's genomic
coordinates.

In some cases, large deletions within one end of a read pair may cause primary and supplementary alignments to be
produced by the aligner, with the supplementary alignment containing the primer end of the read (5' sequencing order).
In this case, the primer may not be assigned for this end of the read pair. Therefore, it is recommended to prefer
or choose the primary alignment that has the closest aligned read base to the 5' end of the read in sequencing order.
For example, from `bwa` version `0.7.16` onwards, the `-5` option may be used. Consider also using the `-q` option
for `bwa` `0.7.16` as well, which is standard in `0.7.17` onwards when the `-5` option is used.

The `--annotate-all` option may be used to annotate all alignments for a given read end (e.g. R1) with
the same assignment. If the assignment differs across alignments for the same read end, no assignment is given.
Furthermore, if the input BAM is neither `queryname` sorted nor `query` grouped, it will be sorted into queryname
order to assign all alignments across a template simultaneously. The output is written in coordinate order.

## Arguments

|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)|
|----|----|----|-----------|---------|---------------|----------------|
|input|i|PathToBam|Input BAM file.|Required|1||
|output|o|PathToBam|Output BAM file.|Required|1||
|metrics|m|FilePath|Output metrics file.|Required|1||
|primers|p|FilePath|File with primer locations.|Required|1||
|slop|S|Int|Match to primer locations +/- this many bases.|Optional|1|5|
|unclipped-coordinates|U|Boolean|True to match based on the unclipped coordinates (adjusted for hard/soft clipping), otherwise based on the aligned bases.|Optional|1|true|
|primer-coordinates-tag||String|The SAM tag for the assigned primer coordinate.|Optional|1|rp|
|mate-primer-coordinates-tag||String|The SAM tag for the mate's assigned primer coordinate.|Optional|1|mp|
|amplicon-identifier-tag||String|The SAM tag for the assigned amplicon identifier.|Optional|1|ra|
|mate-amplicon-identifier-tag||String|The SAM tag for the mate's assigned amplicon identifier.|Optional|1|ma|
|annotate-all||Boolean|Annotate all R1 (or R2) with same value.|Optional|1|false|

43 changes: 43 additions & 0 deletions tools/2.3.0/AutoGenerateReadGroupsByName.md
@@ -0,0 +1,43 @@
---
title: AutoGenerateReadGroupsByName
---

# AutoGenerateReadGroupsByName

## Overview
**Group:** SAM/BAM

Adds read groups to a BAM file for a single sample by parsing the read names.

Will add one or more read groups by parsing the read names. The read names should be of the form:

```
<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos>
```

Each unique combination of `<instrument>:<run number>:<flowcell ID>:<lane>` will be its own read group. The ID of the
read group will be an integer and the platform unit will be `<flowcell ID>.<lane>`.
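
A minimal sketch of the read-name parsing described above follows. The function names are hypothetical, and numbering read groups from 1 in order of first appearance is an assumption; the text only says the ID is an integer.

```python
def read_group_key(read_name: str):
    """Return (instrument, run number, flowcell ID, lane) from a seven-field read name."""
    fields = read_name.split(":")
    if len(fields) < 7:
        raise ValueError(f"Read name does not have seven ':'-separated fields: {read_name}")
    instrument, run_number, flowcell, lane = fields[:4]
    return (instrument, run_number, flowcell, lane)


def build_read_groups(read_names):
    """First pass: collect one read group per unique key.

    IDs are assigned as 1, 2, 3, ... in order of first appearance (an assumed numbering).
    The platform unit is '<flowcell ID>.<lane>'.
    """
    groups = {}
    for name in read_names:
        key = read_group_key(name)
        if key not in groups:
            groups[key] = {"ID": str(len(groups) + 1), "PU": f"{key[2]}.{key[3]}"}
    return groups
```

A second pass over the input would then look up each read's key in `groups` and write the record with the corresponding read group ID.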

The input is assumed to contain reads for one sample and library. Therefore, the sample and library must be given
and will be applied to all read groups. Read groups will be replaced if present.

Two passes will be performed on the input: first to gather all the read groups, and second to write the output BAM
file.

## Arguments

|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)|
|----|----|----|-----------|---------|---------------|----------------|
|input|i|PathToBam|Input SAM or BAM file|Required|1||
|output|o|PathToBam|Output SAM or BAM file|Required|1||
|sample|s|String|The sample to insert into the read groups|Required|1||
|library|l|String|The library to insert into the read groups|Required|1||
|sequencing-center||String|The sequencing center from which the data originated|Optional|1||
|predicted-insert-size||Integer|Predicted median insert size, to insert into the read groups|Optional|1||
|program-group||String|Program group to insert into the read groups|Optional|1||
|platform-model||String|Platform model to insert into the groups (free-form text providing further details of the platform/technology used)|Optional|1||
|description||String|Description inserted into the read groups|Optional|1||
|run-date||Iso8601Date|Date the run was produced (ISO 8601: `YYYY-MM-DD` ), to insert into the read groups|Optional|1||
|comments||String|Comment(s) to include in the merged output file's header.|Optional|Unlimited||
|sort-order||SamOrder|The sort order for the output sam/bam file.|Optional|1||

83 changes: 83 additions & 0 deletions tools/2.3.0/CallDuplexConsensusReads.md
@@ -0,0 +1,83 @@
---
title: CallDuplexConsensusReads
---

# CallDuplexConsensusReads

## Overview
**Group:** Unique Molecular Identifiers (UMIs)

Calls duplex consensus sequences from reads generated from the same _double-stranded_ source molecule. Prior
to running this tool, reads must have been grouped with `GroupReadsByUmi` using the `paired` strategy. Doing
so will apply (by default) MI tags to all reads of the form `*/A` and `*/B` where the /A and /B suffixes
with the same identifier denote reads that are derived from opposite strands of the same source duplex molecule.

Reads from the same unique molecule are first partitioned by source strand and assembled into single
strand consensus molecules as described by CallMolecularConsensusReads. Subsequently, for molecules that
have at least one observation of each strand, duplex consensus reads are assembled by combining the evidence
from the two single strand consensus reads.
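
The partitioning by source strand can be sketched from the `MI` tag convention alone. The following is an illustrative helper with hypothetical names, assuming `MI` values always end in `/A` or `/B`; it only counts reads per strand and keeps molecules observed on both strands, and does no consensus calling.

```python
from collections import defaultdict


def split_mi(mi: str):
    """Split an MI value such as '12/A' into (molecule id, strand)."""
    molecule, sep, strand = mi.rpartition("/")
    if not sep or strand not in ("A", "B"):
        raise ValueError(f"MI tag does not end in /A or /B: {mi}")
    return molecule, strand


def duplex_families(mi_tags):
    """Count reads per strand for each molecule; keep molecules seen on both strands."""
    counts = defaultdict(lambda: {"A": 0, "B": 0})
    for mi in mi_tags:
        molecule, strand = split_mi(mi)
        counts[molecule][strand] += 1
    return {m: c for m, c in counts.items() if c["A"] > 0 and c["B"] > 0}
```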

Because of the nature of duplex sequencing, this tool does not support fragment reads - if found in the
input they are _ignored_. Similarly, read pairs for which consensus reads cannot be generated for one or
other read (R1 or R2) are omitted from the output.

The consensus reads produced are unaligned, due to the difficulty and error-prone nature of inferring the consensus
alignment. Consensus reads should therefore be aligned afterwards, which should not be too expensive as there are
likely far fewer consensus reads than input raw reads. Please see how best to use this tool within the best-practice
pipeline: https://github.com/fulcrumgenomics/fgbio/blob/main/docs/best-practice-consensus-pipeline.md

Consensus reads have a number of additional optional tags set in the resulting BAM file. The tag names follow
a pattern where the first letter (a, b or c) denotes that the tag applies to the first single strand consensus (a),
second single-strand consensus (b) or the final duplex consensus (c). The second letter is intended to capture
the meaning of the tag (e.g. d=depth, m=min depth, e=errors/error-rate) and is upper case for values that are
one per read and lower case for values that are one per base.

The tags break down into those that are single-valued per read:

```
consensus depth [aD,bD,cD] (int) : the maximum depth of raw reads at any point in the consensus reads
consensus min depth [aM,bM,cM] (int) : the minimum depth of raw reads at any point in the consensus reads
consensus error rate [aE,bE,cE] (float): the fraction of bases in raw reads disagreeing with the final consensus calls
```

And those that have a value per base (duplex values are not generated, but can be generated by summing):

```
consensus depth [ad,bd] (short[]): the count of bases contributing to each single-strand consensus read at each position
consensus errors [ae,be] (short[]): the count of bases from raw reads disagreeing with the final single-strand consensus base
consensus bases  [ac,bc] (string): the single-strand consensus bases
consensus quals  [aq,bq] (string): the single-strand consensus qualities
```

The per base depths and errors are both capped at 32,767. In all cases no-calls (Ns) and bases below the
min-input-base-quality are not counted in tag value calculations.
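
Since per-base duplex values are not written but can be derived by summing, a downstream script might combine the single-strand arrays as in this sketch. The function name is hypothetical and the two arrays are assumed to have the same length.

```python
def sum_per_base(a_values, b_values):
    """Sum per-base values from the two single-strand consensus reads (e.g. ad + bd, or ae + be).

    Plain Python ints are returned, so sums above the 32,767 per-strand cap are not truncated.
    """
    if len(a_values) != len(b_values):
        raise ValueError("Per-base arrays must have the same length")
    return [a + b for a, b in zip(a_values, b_values)]
```

For example, summing the `ad` and `bd` arrays of a read gives the per-base duplex depth.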

The --min-reads option can take 1-3 values similar to `FilterConsensusReads`. For example:

```
CallDuplexConsensusReads ... --min-reads 10 5 3
```

If fewer than three values are supplied, the last value is repeated (i.e. `5 4` -> `5 4 4` and `1` -> `1 1 1`). The
first value applies to the final consensus read, the second value to one single-strand consensus, and the last
value to the other single-strand consensus. It is required that if values two and three differ,
the _more stringent value comes earlier_.
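
The expansion and ordering rule for `--min-reads` can be expressed as a short sketch. The function name is hypothetical and this is not fgbio's own validation code; it only restates the rule above.

```python
def expand_min_reads(values):
    """Expand 1-3 min-reads values to (duplex, single-strand one, single-strand two).

    The last supplied value is repeated to fill missing positions. If the two
    single-strand values differ, the more stringent (larger) one must come first.
    """
    if not 1 <= len(values) <= 3:
        raise ValueError("min-reads takes between one and three values")
    expanded = list(values) + [values[-1]] * (3 - len(values))
    _, strand_one, strand_two = expanded
    if strand_one < strand_two:
        raise ValueError("the more stringent single-strand value must come earlier")
    return tuple(expanded)
```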

## Arguments

|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)|
|----|----|----|-----------|---------|---------------|----------------|
|input|i|PathToBam|The input SAM or BAM file.|Required|1||
|output|o|PathToBam|Output SAM or BAM file to write consensus reads.|Required|1||
|read-name-prefix|p|String|The prefix for all consensus read names.|Optional|1||
|read-group-id|R|String|The new read group ID for all the consensus reads.|Optional|1|A|
|error-rate-pre-umi|1|PhredScore|The Phred-scaled error rate for an error prior to the UMIs being integrated.|Optional|1|45|
|error-rate-post-umi|2|PhredScore|The Phred-scaled error rate for an error post the UMIs have been integrated.|Optional|1|40|
|min-input-base-quality|m|PhredScore|Ignore bases in raw reads that have Q below this value.|Optional|1|10|
|trim|t|Boolean|If true, quality trim input reads in addition to masking low Q bases.|Optional|1|false|
|sort-order|S|SamOrder|The sort order of the output, the same as the input if not given.|Optional|1||
|min-reads|M|Int|The minimum number of input reads to a consensus read.|Required|3|1|
|max-reads-per-strand||Int|The maximum number of reads to use when building a single-strand consensus. If more than this many reads are present in a tag family, the family is randomly downsampled to exactly max-reads reads.|Optional|1||
|threads||Int|The number of threads to use while consensus calling.|Optional|1|1|
|consensus-call-overlapping-bases||Boolean|Consensus call overlapping bases in mapped paired end reads|Optional|1|true|
