diff --git a/index.md b/index.md index 8fc55c903..26243c494 100644 --- a/index.md +++ b/index.md @@ -11,6 +11,8 @@ fgbio is a command line toolkit for working with genomic and particularly next generation sequencing data. +See the [latest available tools here](tools/latest). + ## Quick Installation The [conda](https://conda.io/) package manager (configured with [bioconda channels](https://bioconda.github.io/)) can be used to quickly install fgbio: @@ -39,8 +41,8 @@ If the reported version on the first line starts with `1.8` or higher, you are a Once you have Java installed and a release downloaded you can run: -* Run `java -jar fgbio-2.2.1.jar` to get a list of available tools -* Run `java -jar fgbio-2.2.1.jar ` to see detailed usage instructions on any tool +* Run `java -jar fgbio-2.3.0.jar` to get a list of available tools +* Run `java -jar fgbio-2.3.0.jar ` to see detailed usage instructions on any tool When running tools we recommend the following set of Java options as a starting point though individual tools may need more or less memory depending on the input data: diff --git a/metrics/2.3.0/index.md b/metrics/2.3.0/index.md new file mode 100644 index 000000000..1a3369f6f --- /dev/null +++ b/metrics/2.3.0/index.md @@ -0,0 +1,513 @@ + +# fgbio Metrics Descriptions + +This page contains descriptions of all metrics produced by all fgbio tools. 
Within the descriptions +the type of each field/column is given, including two commonly used types: + +* `Count` is an integer representing the count of some item +* `Proportion` is a real number with a value between 0 and 1 representing a proportion or fraction + + +## Table of Contents + +|Metric Type|Description| +|-----------|-----------| +|[Amplicon](#amplicon)|A Locatable Amplicon class| +|[AssessPhasingMetric](#assessphasingmetric)|Metrics produced by `AssessPhasing` describing various statistics assessing the performance of phasing variants relative to a known set of phased variant calls| +|[AssignPrimersMetric](#assignprimersmetric)|Metrics produced by `AssignPrimers` that detail how many reads were assigned to a given primer and/or amplicon| +|[CallOverlappingConsensusBasesMetric](#calloverlappingconsensusbasesmetric)|Collects the number of reads or bases that were examined, had overlap, and were corrected as part of the CallOverlappingConsensusBases tool| +|[ClippingMetrics](#clippingmetrics)|Metrics produced by ClipBam that detail how many reads and bases are clipped respectively| +|[ConsensusVariantReviewInfo](#consensusvariantreviewinfo)|Detailed information produced by `ReviewConsensusVariants` on variants called in consensus reads| +|[DuplexFamilySizeMetric](#duplexfamilysizemetric)|Metrics produced by `CollectDuplexSeqMetrics` to describe the distribution of double-stranded (duplex) tag families in terms of the number of reads observed on each strand| +|[DuplexUmiMetric](#duplexumimetric)|Metrics produced by `CollectDuplexSeqMetrics` describing the set of observed duplex UMI sequences and the frequency of their observations| +|[DuplexYieldMetric](#duplexyieldmetric)|Metrics produced by `CollectDuplexSeqMetrics` that are sampled at various levels of coverage, via random downsampling, during the construction of duplex metrics| +|[ErccDetailedMetric](#erccdetailedmetric)|Metrics produced by `CollectErccMetrics` describing various per-transcript 
metrics related to the spike-in of ERCC (External RNA Controls Consortium) into an RNA-Seq experiment| +|[ErccSummaryMetrics](#erccsummarymetrics)|Metrics produced by `CollectErccMetrics` describing various summary metrics related to the spike-in of ERCC (External RNA Controls Consortium) into an RNA-Seq experiment| +|[ErrorRateByReadPositionMetric](#errorratebyreadpositionmetric)|Metrics produced by `ErrorRateByReadPosition` describing the number of base observations and substitution errors at each position within each sequencing read| +|[FamilySizeMetric](#familysizemetric)|Metrics produced by `CollectDuplexSeqMetrics` to quantify the distribution of different kinds of read family sizes| +|[InsertSizeMetric](#insertsizemetric)|Metrics produced by `EstimateRnaSeqInsertSize` to describe the distribution of insert sizes within an RNA-seq experiment| +|[PhaseBlockLengthMetric](#phaseblocklengthmetric)|Metrics produced by `AssessPhasing` describing the number of phased blocks of a given length| +|[PoolingFractionMetric](#poolingfractionmetric)|Metrics produced by `EstimatePoolingFractions` to quantify the estimated proportion of a sample mixture that is attributable to a specific sample with a known set of genotypes| +|[RunInfo](#runinfo)|Stores the result of parsing the run info (RunInfo| +|[SampleBarcodeMetric](#samplebarcodemetric)|Metrics for matching templates to sample barcodes primarily used in com| +|[SwitchMetric](#switchmetric)|Summary metrics regarding switchback reads found| +|[TagFamilySizeMetric](#tagfamilysizemetric)|Metrics produced by `GroupReadsByUmi` to describe the distribution of tag family sizes observed during grouping| +|[UmiCorrectionMetrics](#umicorrectionmetrics)|Metrics produced by `CorrectUmis` regarding the correction of UMI sequences to a fixed set of known UMIs| +|[UmiMetric](#umimetric)|Metrics produced by `CollectDuplexSeqMetrics` describing the set of observed UMI sequences and the frequency of their observations| + +## Metric File 
Descriptions + + +### Amplicon + +A Locatable Amplicon class. + +|Column|Type|Description| +|------|----|-----------| +|chrom|String|The chromosome for the amplicon| +|left_start|Int|The 1-based start position of the left-most primer| +|left_end|Int|The 1-based end position inclusive of the left-most primer| +|right_start|Int|The 1-based start position of the right-most primer| +|right_end|Int|The 1-based end position inclusive of the right-most primer| +|id|Option[String]|| + + +### AssessPhasingMetric + +Metrics produced by `AssessPhasing` describing various statistics assessing the performance of phasing variants +relative to a known set of phased variant calls. Included are methods for assessing sensitivity and accuracy from +a number of previous papers (e.g. http://dx.doi.org/10.1038%2Fng.3119). The N50, N90, and L50 statistics are defined as follows: +- The N50 is the longest block length such that the bases covered by all blocks this length and longer are at least +50% of the # of bases covered by all blocks. +- The N90 is the longest block length such that the bases covered by all blocks this length and longer are at least +90% of the # of bases covered by all blocks. +- The L50 is the smallest number of blocks such that the sum of the lengths of the blocks is `>=` 50% of the sum of +the lengths of all blocks. 
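The N50, N90, and L50 definitions above can be sketched in a few lines of Python. This is only an illustration of the definitions as stated, not the code `AssessPhasing` uses; the function name `nx_and_lx` and its `percent` parameter are hypothetical.

```python
def nx_and_lx(block_lengths, percent):
    """Return (N, L) for a percentage, e.g. percent=50 gives (N50, L50).

    Hypothetical helper: walks the blocks from longest to shortest and stops
    at the first block where the running total of covered bases reaches the
    requested percentage of all covered bases.
    """
    ordered = sorted(block_lengths, reverse=True)
    total = sum(ordered)
    covered = 0
    for num_blocks, length in enumerate(ordered, start=1):
        covered += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if covered * 100 >= percent * total:
            return length, num_blocks
    raise ValueError("block_lengths must contain at least one positive length")


# Five phased blocks covering 20 bases in total.
n50, l50 = nx_and_lx([2, 3, 4, 5, 6], 50)
n90, _ = nx_and_lx([2, 3, 4, 5, 6], 90)
```

For the five blocks in the example, the two longest blocks (6 and 5) cover 11 of the 20 bases, so the N50 is 5 and the L50 is 2; reaching 90% requires the four longest blocks, so the N90 is 3.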
+ + +|Column|Type|Description| +|------|----|-----------| +|num_called|Long|The number of variants called.| +|num_phased|Long|The number of variants called with phase.| +|num_truth|Long|The number of variants with known truth genotypes.| +|num_truth_phased|Long|The number of variants with known truth genotypes with phase.| +|num_called_with_truth_phased|Long|The number of variants called that had a known phased genotype.| +|num_phased_with_truth_phased|Long|The number of variants called with phase that had a known phased genotype.| +|num_truth_phased_in_called_block|Long|The number of known phased variants that were in a called phased block.| +|num_both_phased_in_called_block|Long|The number of called phase variants that had a known phased genotype in a called phased block.| +|num_short_switch_errors|Long|The number of short switch errors (isolated switch errors).| +|num_long_switch_errors|Long|The number of long switch errors (# of runs of consecutive switch errors).| +|num_switch_sites|Long|The number of sites that could be (short or long) switch errors (i.e. 
the # of sites with both known and called phased variants).| +|num_illumina_point_switch_errors|Long|The number of point switch errors (defined in http://dx.doi.org/10.1038%2Fng.3119).| +|num_illumina_long_switch_errors|Long|The number of long switch errors (defined in http://dx.doi.org/10.1038%2Fng.3119).| +|num_illumina_switch_sites|Long|The number of sites that could be (point or long) switch errors (defined in http://dx.doi.org/10.1038%2Fng.3119).| +|frac_phased|Double|The fraction of called variants with phase.| +|frac_phased_with_truth_phased|Double|The fraction of known phased variants called with phase.| +|frac_truth_phased_in_called_block|Double|The fraction of phased known genotypes in a called phased block.| +|frac_phased_with_truth_phased_in_called_block|Double|The fraction of called phased variants that had a known phased genotype in a called phased block.| +|short_accuracy|Double|The fraction of switch sites without short switch errors (`1 - (num_short_switch_errors / num_switch_sites)`).| +|long_accuracy|Double|The fraction of switch sites without long switch errors (`1 - (num_long_switch_errors / num_switch_sites)`).| +|illumina_point_accuracy|Double|The fraction of switch sites without point switch errors according to the Illumina method defining switch sites and errors (`1 - (num_illumina_point_switch_errors / num_illumina_switch_sites)`).| +|illumina_long_accuracy|Double|The fraction of switch sites without long switch errors according to the Illumina method defining switch sites and errors (`1 - (num_illumina_long_switch_errors / num_illumina_switch_sites)`).| +|mean_called_block_length|Double|The mean phased block length in the callset.| +|median_called_block_length|Double|The median phased block length in the callset.| +|stddev_called_block_length|Double|The standard deviation of the phased block length in the callset.| +|n50_called_block_length|Double|The N50 of the phased block length in the callset.| +|n90_called_block_length|Double|The 
N90 of the phased block length in the callset.| +|l50_called|Double|The L50 of the phased block length in the callset.| +|mean_truth_block_length|Double|The mean phased block length in the truth.| +|median_truth_block_length|Double|The median phased block length in the truth.| +|stddev_truth_block_length|Double|The standard deviation of the phased block length in the truth.| +|n50_truth_block_length|Double|The N50 of the phased block length in the truth.| +|n90_truth_block_length|Double|The N90 of the phased block length in the truth.| +|l50_truth|Double|The L50 of the phased block length in the truth.| + + +### AssignPrimersMetric + +Metrics produced by `AssignPrimers` that detail how many reads were assigned to a given primer and/or amplicon. + + +|Column|Type|Description| +|------|----|-----------| +|identifier|String|The amplicon identifier this metric collects over| +|left|Long|The number of reads assigned to the left primer| +|right|Long|The number of reads assigned to the right primer| +|r1s|Long|The number of R1 reads assigned to this amplicon| +|r2s|Long|The number of R2 reads assigned to this amplicon| +|pairs|Long|The number of read pairs where R1 and R2 are both assigned to this amplicon and are in FR orientation| +|frac_left|Double|The fraction of reads assigned to the left primer| +|frac_right|Double|The fraction of reads assigned to the right primer| +|frac_r1s|Double|The fraction of R1 reads assigned to this amplicon| +|frac_r2s|Double|The fraction of R2 reads assigned to this amplicon| +|frac_pairs|Double|The fraction of read pairs where R1 and R2 are both assigned to this amplicon and are in FR orientation| + + +### CallOverlappingConsensusBasesMetric + +Collects the number of reads or bases that were examined, had overlap, and were corrected as part of +the CallOverlappingConsensusBases tool. 
+ + +|Column|Type|Description| +|------|----|-----------| +|kind|CountKind|Template if the counts are per template, bases if counts are in units of bases.| +|total|Long|The total number of templates (bases) examined| +|overlapping|Long|The total number of templates (bases) that were overlapping| +|corrected|Long|The total number of templates (bases) that were corrected.| + + +### ClippingMetrics + +Metrics produced by ClipBam that detail how many reads and bases are clipped respectively. + + +|Column|Type|Description| +|------|----|-----------| +|read_type|ReadType|The type of read (i.e. Fragment, ReadOne, ReadTwo).| +|reads|Long|The number of reads examined.| +|reads_unmapped|Long|The number of reads that became unmapped due to clipping.| +|reads_clipped_pre|Long|The number of reads with any type of clipping prior to clipping with ClipBam.| +|reads_clipped_post|Long|The number of reads with any type of clipping after clipping with ClipBam, including reads that became unmapped.| +|reads_clipped_five_prime|Long|The number of reads with the 5' end clipped.| +|reads_clipped_three_prime|Long|The number of reads with the 3' end clipped.| +|reads_clipped_overlapping|Long|The number of reads clipped due to overlapping reads.| +|reads_clipped_extending|Long|The number of reads clipped due to a read extending past its mate.| +|bases|Long|The number of aligned bases after clipping.| +|bases_clipped_pre|Long|The number of bases clipped prior to clipping with ClipBam.| +|bases_clipped_post|Long|The number of bases clipped after clipping with ClipBam, including bases from reads that became unmapped.| +|bases_clipped_five_prime|Long|The number of bases clipped on the 5' end of the read.| +|bases_clipped_three_prime|Long|The number of bases clipped on the 3' end of the read.| +|bases_clipped_overlapping|Long|The number of bases clipped due to overlapping reads.| +|bases_clipped_extending|Long|The number of bases clipped due to a read extending past its mate.| + + +### 
ConsensusVariantReviewInfo + +Detailed information produced by `ReviewConsensusVariants` on variants called in consensus reads. Each +row contains information about a consensus _read_ that carried a variant or non-reference allele at a +particular variant site. The first 10 columns (up to `N`) contain information about the variant site and are repeated for each +consensus read reported at that site. The remaining fields are specific to the consensus read. + + +|Column|Type|Description| +|------|----|-----------| +|chrom|String|The chromosome on which the variant exists.| +|pos|Int|The position of the variant.| +|ref|String|The reference allele at the position.| +|genotype|String|The genotype of the sample in question.| +|filters|String|The set of filters applied to the variant in the VCF.| +|A|Int|The count of A observations at the variant locus across all consensus reads.| +|C|Int|The count of C observations at the variant locus across all consensus reads.| +|G|Int|The count of G observations at the variant locus across all consensus reads.| +|T|Int|The count of T observations at the variant locus across all consensus reads.| +|N|Int|The count of N observations at the variant locus across all consensus reads.| +|consensus_read|String|The consensus read name for which the following fields contain values.| +|consensus_insert|String|A description of the insert that generated the consensus read.| +|consensus_call|Char|The base call from the consensus read.| +|consensus_qual|Int|The quality score from the consensus read.| +|a|Int|The number of As in raw-reads contributing to the consensus base call at the variant site.| +|c|Int|The number of Cs in raw-reads contributing to the consensus base call at the variant site.| +|g|Int|The number of Gs in raw-reads contributing to the consensus base call at the variant site.| +|t|Int|The number of Ts in raw-reads contributing to the consensus base call at the variant site.| +|n|Int|The number of Ns in raw-reads contributing to 
the consensus base call at the variant site.| + + +### DuplexFamilySizeMetric + +Metrics produced by `CollectDuplexSeqMetrics` to describe the distribution of double-stranded (duplex) +tag families in terms of the number of reads observed on each strand. We refer to the two strands as `ab` and `ba` because we identify the two strands by observing the same pair of +UMIs (A and B) in opposite order (A->B vs B->A). Which strand is `ab` and which is `ba` is largely arbitrary, so +to make interpretation of the metrics simpler we use a definition here that for a given tag family +`ab` is the sub-family with more reads and `ba` is the sub-family with fewer reads. + + +|Column|Type|Description| +|------|----|-----------| +|ab_size|Int|The number of reads in the `ab` sub-family (the larger sub-family) for this double-strand tag family.| +|ba_size|Int|The number of reads in the `ba` sub-family (the smaller sub-family) for this double-strand tag family.| +|count|Count|The number of families with the `ab` and `ba` single-strand families of size `ab_size` and `ba_size`.| +|fraction|Proportion|The fraction of all double-stranded tag families that have `ab_size` and `ba_size`.| +|fraction_gt_or_eq_size|Proportion|The fraction of all double-stranded tag families that have `ab reads >= ab_size` and `ba reads >= ba_size`.| + + +### DuplexUmiMetric + +Metrics produced by `CollectDuplexSeqMetrics` describing the set of observed duplex UMI sequences and the +frequency of their observations. The UMI sequences reported may have been corrected using information +within a double-stranded tag family. For example if a tag family is comprised of three read pairs with +UMIs `ACGT-TGGT`, `ACGT-TGGT`, and `ACGT-TGGG` then a consensus UMI of `ACGT-TGGT` will be generated. UMI pairs are normalized within a tag family so that observations are always reported as if they came +from a read pair with read 1 on the positive strand (F1R2). 
Another way to view this is that for FR or RF +read pairs, the duplex UMI reported is the UMI from the positive strand read followed by the UMI from the +negative strand read. E.g. a read pair with UMI `AAAA-GGGG` and with R1 on the negative strand and R2 on +the positive strand, will be reported as `GGGG-AAAA`. + + +|Column|Type|Description| +|------|----|-----------| +|umi|String|The duplex UMI sequence, possibly-corrected.| +|raw_observations|Count|The number of read pairs in the input BAM that observe the duplex UMI (after correction).| +|raw_observations_with_errors|Count|The subset of raw observations that underwent any correction.| +|unique_observations|Count|The number of double-stranded tag families (i.e. unique double-stranded molecules) that observed the duplex UMI.| +|fraction_raw_observations|Proportion|The fraction of all raw observations that the duplex UMI accounts for.| +|fraction_unique_observations|Proportion|The fraction of all unique observations that the duplex UMI accounts for.| +|fraction_unique_observations_expected|Proportion|The fraction of all unique observations that are expected to be attributed to the duplex UMI based on the `fraction_unique_observations` of the two individual UMIs.| + + +### DuplexYieldMetric + +Metrics produced by `CollectDuplexSeqMetrics` that are sampled at various levels of coverage, via random +downsampling, during the construction of duplex metrics. The downsampling is done in such a way that the +`fraction`s are approximate, and not exact, therefore the `fraction` field should only be interpreted as a guide +and the `read_pairs` field used to quantify how much data was used. See `FamilySizeMetric` for detailed definitions of `CS`, `SS` and `DS` as used below. 
+ + +|Column|Type|Description| +|------|----|-----------| +|fraction|Proportion|The approximate fraction of the full dataset that was used to generate the remaining values.| +|read_pairs|Count|The number of read pairs upon which the remaining metrics are based.| +|cs_families|Count|The number of _CS_ (Coordinate & Strand) families present in the data.| +|ss_families|Count|The number of _SS_ (Single-Strand by UMI) families present in the data.| +|ds_families|Count|The number of _DS_ (Double-Strand by UMI) families present in the data.| +|ds_duplexes|Count|The number of _DS_ families that had the minimum number of observations on both strands to be called duplexes (default = 1 read on each strand).| +|ds_fraction_duplexes|Proportion|The fraction of _DS_ families that are duplexes (`ds_duplexes / ds_families`).| +|ds_fraction_duplexes_ideal|Proportion|The fraction of _DS_ families that should be duplexes under an idealized model where each strand, `A` and `B`, have equal probability of being sampled, given the observed distribution of _DS_ family sizes.| + + +### ErccDetailedMetric + +Metrics produced by `CollectErccMetrics` describing various per-transcript metrics related to the spike-in of ERCC +(External RNA Controls Consortium) into an RNA-Seq experiment. One metric per ERCC transcript will be present. 
+ + +|Column|Type|Description| +|------|----|-----------| +|name|String|The name (or ID) of the ERCC transcript.| +|concentration|Double|The expected concentration as input to `CollectErccMetrics`.| +|count|Long|The observed count of the number of read pairs (or single end reads).| +|normalized_count|Double|The observed count of the number of read pairs (or single end reads) normalized by the ERCC transcript length.| + + +### ErccSummaryMetrics + +Metrics produced by `CollectErccMetrics` describing various summary metrics related to the spike-in of ERCC +(External RNA Controls Consortium) into an RNA-Seq experiment. The correlation coefficients and linear regression are calculated based on the log2 observed read pair count normalized +by ERCC transcript length versus the log2 expected concentration. + + +|Column|Type|Description| +|------|----|-----------| +|total_reads|Long|The total number of reads considered.| +|ercc_reads|Long|The total number of reads mapping to an ERCC transcript.| +|fraction_ercc_reads|Double|The fraction of total reads that map to an ERCC transcript.| +|ercc_templates|Long|The total number of read pairs (or single end reads) mapping to an ERCC transcript.| +|total_transcripts|Int|The total number of ERCC transcripts with at least one read observed.| +|passing_filter_transcripts|Int|The total number of ERCC transcripts with at least the user-set minimum # of reads observed.| +|pearsons_correlation|Option[Double]|Pearson's correlation coefficient for correlation of concentration and normalized counts.| +|spearmans_correlation|Option[Double]|Spearman's correlation coefficient for correlation of concentration and normalized counts.| +|intercept|Option[Double]|The intercept of the linear regression.| +|slope|Option[Double]|The slope of the linear regression.| +|r_squared|Option[Double]|The r-squared of the linear regression.| + + +### ErrorRateByReadPositionMetric + +Metrics produced by `ErrorRateByReadPosition` describing the number of base 
observations and +substitution errors at each position within each sequencing read. Error rates are given for +the overall substitution error rate and also for each kind of substitution separately. If `collapsed` is `true`, then complementary substitutions are grouped together into the first 6 error rates, +e.g. `T>G` substitutions are reported as `A>C`. Otherwise, all 12 substitution rates are reported. + + +|Column|Type|Description| +|------|----|-----------| +|read_number|Int|The read number (0 for fragments, 1 for first of pair, 2 for second of pair).| +|position|Int|The position or cycle within the read (1-based).| +|bases_total|Count|The total number of bases observed at this position.| +|errors|Count|The total number of errors or non-reference basecalls observed at this position.| +|error_rate|Double|The overall error rate at the position.| +|a_to_c_error_rate|Double|The rate of `A>C` (and `T>G` when collapsed) errors at the position.| +|a_to_g_error_rate|Double|The rate of `A>G` (and `T>C` when collapsed) errors at the position.| +|a_to_t_error_rate|Double|The rate of `A>T` (and `T>A` when collapsed) errors at the position.| +|c_to_a_error_rate|Double|The rate of `C>A` (and `G>T` when collapsed) errors at the position.| +|c_to_g_error_rate|Double|The rate of `C>G` (and `G>C` when collapsed) errors at the position.| +|c_to_t_error_rate|Double|The rate of `C>T` (and `G>A` when collapsed) errors at the position.| +|g_to_a_error_rate|Option[Double]|The rate of `G>A` errors at the position.| +|g_to_c_error_rate|Option[Double]|The rate of `G>C` errors at the position.| +|g_to_t_error_rate|Option[Double]|The rate of `G>T` errors at the position.| +|t_to_a_error_rate|Option[Double]|The rate of `T>A` errors at the position.| +|t_to_c_error_rate|Option[Double]|The rate of `T>C` errors at the position.| +|t_to_g_error_rate|Option[Double]|The rate of `T>G` errors at the position.| +|collapsed|Boolean|| + + +### FamilySizeMetric + +Metrics produced by 
`CollectDuplexSeqMetrics` to quantify the distribution of different kinds of read family +sizes. Three kinds of families are described: +1. _CS_ or _Coordinate & Strand_: families of reads that are grouped together by their unclipped 5' + genomic positions and strands just as they are in traditional PCR duplicate marking +2. _SS_ or _Single Strand_: single-strand families that are each subsets of a CS family created by + also using the UMIs to partition the larger family, but not linking up families that are + created from opposing strands of the same source molecule. +3. _DS_ or _Double Strand_: families that are created by combining single-strand families that are from + opposite strands of the same source molecule. This does **not** imply that all DS families are composed + of reads from both strands; where only one strand of a source molecule is observed a DS family is + still counted. + + +|Column|Type|Description| +|------|----|-----------| +|family_size|Int|The family size, i.e. the number of read pairs grouped together into a family.| +|cs_count|Count|The count of families with `size == family_size` when grouping just by coordinates and strand information.| +|cs_fraction|Proportion|The fraction of all _CS_ families where `size == family_size`.| +|cs_fraction_gt_or_eq_size|Proportion|The fraction of all _CS_ families where `size >= family_size`.| +|ss_count|Count|The count of families with `size == family_size` when also grouping by UMI to create single-strand families.| +|ss_fraction|Proportion|The fraction of all _SS_ families where `size == family_size`.| +|ss_fraction_gt_or_eq_size|Proportion|The fraction of all _SS_ families where `size >= family_size`.| +|ds_count|Count|The count of families with `size == family_size` when also grouping by UMI and merging single-strand families from opposite strands of the same source molecule.| +|ds_fraction|Proportion|The fraction of all _DS_ families where `size == family_size`.| 
+|ds_fraction_gt_or_eq_size|Proportion|The fraction of all _DS_ families where `size >= family_size`.| + + +### InsertSizeMetric + +Metrics produced by `EstimateRnaSeqInsertSize` to describe the distribution of insert sizes within an +RNA-seq experiment. The insert sizes are computed in "transcript space", accounting for spliced +alignments, in order to get a true estimate of the size of the DNA fragment, not just its span on +the genome. + + +|Column|Type|Description| +|------|----|-----------| +|pair_orientation|PairOrientation|The orientation of the reads within a read-pair relative to each other. Possible values are FR, RF and TANDEM.| +|read_pairs|Long|The number of read pairs observed with the `pair_orientation`.| +|mean|Double|The mean insert size of the read pairs.| +|standard_deviation|Double|The standard deviation of the insert size of the read pairs.| +|median|Double|The median insert size of the read pairs.| +|min|Long|The minimum observed insert size of the read pairs.| +|max|Long|The maximum observed insert size of the read pairs.| +|median_absolute_deviation|Double|The median absolute deviation of the insert size of the read pairs.| + + +### PhaseBlockLengthMetric + +Metrics produced by `AssessPhasing` describing the number of phased blocks of a given length. The output will have +multiple rows, one for each observed phased block length. + + +|Column|Type|Description| +|------|----|-----------| +|dataset|String|The name of the dataset being assessed (i.e. "truth" or "called").| +|length|Long|The length of the phased block.| +|count|Long|The number of phased blocks of the given length.| + + +### PoolingFractionMetric + +Metrics produced by `EstimatePoolingFractions` to quantify the estimated proportion of a sample +mixture that is attributable to a specific sample with a known set of genotypes. 
+ + +|Column|Type|Description| +|------|----|-----------| +|sample|String|The name of the sample within the pool being reported on.| +|variant_sites|Count|How many sites were examined at which the reported sample is known to be variant.| +|singletons|Count|How many of the variant sites were sites at which only this sample was variant.| +|estimated_fraction|Proportion|The estimated fraction of the pool that comes from this sample.| +|standard_error|Double|The standard error of the estimated fraction.| +|ci99_low|Proportion|The lower bound of the 99% confidence interval for the estimated fraction.| +|ci99_high|Proportion|The upper bound of the 99% confidence interval for the estimated fraction.| + + +### RunInfo + +Stores the result of parsing the run info (RunInfo.xml) file from an Illumina run folder. + + +|Column|Type|Description| +|------|----|-----------| +|run_barcode|String|The unique identifier for the sequencing run and flowcell, stored as "_".| +|flowcell_barcode|String|The flowcell barcode.| +|instrument_name|String|The instrument name.| +|run_date|Iso8601Date|The date of the sequencing run.| +|read_structure|ReadStructure|The description of the logical structure of cycles within the sequencing run. This will only contain template and sample barcode segments, as the RunInfo.xml does not contain information about other segments (i.e. molecular barcodes and skips).| +|num_lanes|Int|The number of lanes in the flowcell.| + + +### SampleBarcodeMetric + +Metrics for matching templates to sample barcodes primarily used in com.fulcrumgenomics.fastq.DemuxFastqs. The number of templates will match the number of reads for an Illumina single-end sequencing run, while the number +of templates will be half the number of reads for an Illumina paired-end sequencing run (i.e. R1 & R2 observe the +same template). 
+ + +|Column|Type|Description| +|------|----|-----------| +|barcode_name|String|The name for the sample barcode, typically the sample name from the SampleSheet.| +|library_name|String|The name of the library, typically the library identifier from the SampleSheet.| +|barcode|String|The sample barcode bases. Dual index barcodes will have two sample barcode sequences delimited by a dash.| +|templates|Count|The total number of templates matching the given barcode.| +|pf_templates|Count|The total number of pass-filter templates matching the given barcode.| +|perfect_matches|Count|The number of templates that match perfectly the given barcode.| +|pf_perfect_matches|Count|The number of pass-filter templates that match perfectly the given barcode.| +|one_mismatch_matches|Count|The number of templates that match the given barcode with exactly one mismatch.| +|pf_one_mismatch_matches|Count|The number of pass-filter templates that match the given barcode with exactly one mismatch.| +|q20_bases|Count|The number of bases in a template with a quality score 20 or above| +|q30_bases|Count|The number of bases in a template with a quality score 30 or above| +|total_number_of_bases|Count|The total number of bases in the templates combined| +|fraction_matches|Proportion|The fraction of all templates that match the given barcode.| +|ratio_this_barcode_to_best_barcode|Proportion|The ratio of all templates matching this barcode to all templates matching the most prevalent barcode. For the most prevalent barcode this will be 1, for all others it will be less than 1 (except for the possible exception of when there are more unmatched templates than for any other barcode, in which case the value may be arbitrarily large). 
One over the lowest number in this column gives you the fold-difference in representation between barcodes.| +|pf_fraction_matches|Proportion|The fraction of all pass-filter templates that match the given barcode.| +|pf_ratio_this_barcode_to_best_barcode|Proportion|The rate of all pass-filter templates matching this barcode to all templates matching the most prevalent barcode. For the most prevalent barcode this will be 1, for all others it will be less than 1 (except for the possible exception of when there are more unmatched templates than for any other barcode, in which case the value may be arbitrarily large). One over the lowest number in this column gives you the fold-difference in representation between barcodes.| +|pf_normalized_matches|Proportion|The "normalized" matches to each barcode. This is calculated as the number of pass-filter templates matching this barcode over the mean of all pass-filter templates matching any barcode (excluding unmatched). If all barcodes are represented equally this will be| +|frac_q20_bases|Proportion|The fraction of bases in a template with a quality score 20 or above| +|frac_q30_bases|Proportion|The fraction of bases in a template with a quality score 30 or above| + + +### SwitchMetric + +Summary metrics regarding switchback reads found. + + +|Column|Type|Description| +|------|----|-----------| +|sample|String|The name of the sample sequenced.| +|library|String|The name of the library sequenced.| +|templates|Count|The total number of templates (i.e. 
inserts, unique read names) seen in the input.| +|aligned_templates|Count|The number of templates that had at least one aligned read.| +|switchback_templates|Count|The number of templates identified as having a switchback event.| +|fraction_switchbacks|Proportion|The fraction of all templates that appear to have switchbacks in them.| +|read_based_switchbacks|Count|The count of switchback_templates that were identified by looking for soft-clipped reverse complementary sequence at the ends of reads.| +|mean_length|Double|The mean length of the reverse complementary sequence in the `read_based_switchbacks`.| +|mean_offset|Double|The mean offset of the `read_based_switchbacks`.| +|tandem_based_switchbacks|Count|The count of switchback_templates that were identified because they had paired reads mapped in FF or RR orientations within the designated gap size range.| +|mean_gap|Double|The mean gap size for `tandem_based_switchbacks`| + + +### TagFamilySizeMetric + +Metrics produced by `GroupReadsByUmi` to describe the distribution of tag family sizes +observed during grouping. + + +|Column|Type|Description| +|------|----|-----------| +|family_size|Int|The family size, or number of templates/read-pairs belonging to the family.| +|count|Count|The number of families (or source molecules) observed with `family_size` observations.| +|fraction|Proportion|The fraction of all families of all sizes that have this specific `family_size`.| +|fraction_gt_or_eq_family_size|Proportion|The fraction of all families that have `>= family_size`.| + + +### UmiCorrectionMetrics + +Metrics produced by `CorrectUmis` regarding the correction of UMI sequences to a fixed set of known UMIs. 
+ + +|Column|Type|Description| +|------|----|-----------| +|umi|String|The corrected UMI sequence (or all `N`s for unmatched).| +|total_matches|Count|The number of UMI sequences that matched/were corrected to this UMI.| +|perfect_matches|Count|The number of UMI sequences that were perfect matches to this UMI.| +|one_mismatch_matches|Count|The number of UMI sequences that matched with a single mismatch.| +|two_mismatch_matches|Count|The number of UMI sequences that matched with two mismatches.| +|other_matches|Count|The number of UMI sequences that matched with three or more mismatches.| +|fraction_of_matches|Proportion|The fraction of all UMIs that matched or were corrected to this UMI.| +|representation|Double|The `total_matches` for this UMI divided by the _mean_ `total_matches` for all UMIs.| + + +### UmiMetric + +Metrics produced by `CollectDuplexSeqMetrics` describing the set of observed UMI sequences and the +frequency of their observations. The UMI sequences reported may have been corrected using information +within a double-stranded tag family. For example if a tag family is comprised of three read pairs with +UMIs `ACGT-TGGT`, `ACGT-TGGT`, and `ACGT-TGGG` then a consensus UMI of `ACGT-TGGT` will be generated, +and three raw observations counted for each of `ACGT` and `TGGT`, and no observations counted for `TGGG`. 
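
The worked example above can be reproduced with a small sketch. It assumes a simple per-position majority vote within a tag family; the correction logic actually used by `CollectDuplexSeqMetrics` may differ, and the function names here are illustrative only.

```python
from collections import Counter


def consensus_umi(umis):
    """Per-position majority vote across equal-length UMI strings.

    Ties are broken by first occurrence, which is an arbitrary choice
    made for this sketch.
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*umis))


def count_raw_observations(family):
    """Return the consensus duplex UMI and per-UMI raw observation counts.

    `family` is a list of duplex UMI strings such as "ACGT-TGGT", one per
    read pair in a single double-stranded tag family.
    """
    halves = [umi.split("-") for umi in family]
    left = consensus_umi([h[0] for h in halves])
    right = consensus_umi([h[1] for h in halves])

    # Every read pair counts as one raw observation of each corrected UMI.
    counts = Counter()
    counts[left] += len(family)
    counts[right] += len(family)
    return f"{left}-{right}", counts


consensus, counts = count_raw_observations(["ACGT-TGGT", "ACGT-TGGT", "ACGT-TGGG"])
print(consensus, dict(counts))
```

Because the counts are attributed to the corrected UMIs, the uncorrected `TGGG` never appears as a key, which matches the description of `raw_observations` being counted after correction.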
+ + +|Column|Type|Description| +|------|----|-----------| +|umi|String|The UMI sequence, possibly-corrected.| +|raw_observations|Count|The number of read pairs in the input BAM that observe the UMI (after correction).| +|raw_observations_with_errors|Count|The subset of raw-observations that underwent any correction.| +|unique_observations|Count|The number of double-stranded tag families (i.e unique double-stranded molecules) that observed the UMI.| +|fraction_raw_observations|Proportion|The fraction of all raw observations that the UMI accounts for.| +|fraction_unique_observations|Proportion|The fraction of all unique observations that the UMI accounts for.| diff --git a/metrics/latest b/metrics/latest index fae692e41..cc6612c36 120000 --- a/metrics/latest +++ b/metrics/latest @@ -1 +1 @@ -2.2.1 \ No newline at end of file +2.3.0 \ No newline at end of file diff --git a/tools/2.3.0/AnnotateBamWithUmis.md b/tools/2.3.0/AnnotateBamWithUmis.md new file mode 100644 index 000000000..32c1cfdea --- /dev/null +++ b/tools/2.3.0/AnnotateBamWithUmis.md @@ -0,0 +1,53 @@ +--- +title: AnnotateBamWithUmis +--- + +# AnnotateBamWithUmis + +## Overview +**Group:** SAM/BAM + +Annotates existing BAM files with UMIs (Unique Molecular Indices, aka Molecular IDs, +Molecular barcodes) from separate FASTQ files. Takes an existing BAM file and either +one FASTQ file with UMI reads or multiple FASTQs if there are multiple UMIs per template, +matches the reads between the files based on read names, and produces an output BAM file +where each record is annotated with an optional tag (specified by `attribute`) that +contains the read sequence of the UMI. Trailing read numbers (`/1` or `/2`) are +removed from FASTQ read names, as is any text after whitespace, before matching. +If multiple UMI segments are specified (see `--read-structure`) across one or more FASTQs, +they are delimited in the same order as FASTQs are specified on the command line. 
+The delimiter is controlled by the `--delimiter` option. + +The `--read-structure` option may be used to specify which bases in the FASTQ contain UMI +bases. Otherwise it is assumed the FASTQ contains only UMI bases. + +The `--sorted` option may be used to indicate that the FASTQ has the same reads and is +sorted in the same order as the BAM file. + +At the end of execution, reports how many records were processed and how many were +missing UMIs. If any read from the BAM file did not have a matching UMI read in the +FASTQ file, the program will exit with a non-zero exit status. The `--fail-fast` option +may be specified to cause the program to terminate the first time it finds a record +without a matching UMI. + +In order to avoid sorting the input files, the entire UMI fastq file(s) is read into +memory. As a result the program needs to be run with memory proportional to the size of +the (uncompressed) fastq(s). Use the `--sorted` option to traverse the UMI fastq and BAM +files assuming they are in the same order. More precisely, the UMI fastq file will be +traversed first, reading in the next set of BAM reads with the same read name as the +UMI's read name. Those BAM reads will be annotated. If no BAM reads exist for the UMI, +no logging or error will be reported. 
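
As a rough illustration of the read-name matching described in this overview, the sketch below normalizes FASTQ names (dropping text after whitespace and a trailing `/1` or `/2`) and builds an in-memory lookup of UMIs. The function names and the `(header, bases)` record shape are assumptions made for the example, not part of fgbio.

```python
import re


def normalize_fastq_name(header):
    """Reduce a FASTQ header line to the read name used for matching."""
    name = header[1:] if header.startswith("@") else header
    parts = name.split(None, 1)          # drop any text after whitespace
    name = parts[0] if parts else ""
    return re.sub(r"/[12]$", "", name)   # drop a trailing /1 or /2


def build_umi_lookup(fastq_records):
    """Map normalized read names to UMI bases.

    `fastq_records` is an iterable of (header, bases) tuples. Holding every
    record in a dict is why the unsorted mode needs memory proportional to
    the size of the uncompressed FASTQ.
    """
    return {normalize_fastq_name(header): bases for header, bases in fastq_records}


lookup = build_umi_lookup([("@q1/1 extra text", "ACGT"), ("@q2", "TTGA")])
print(lookup)
```

With `--sorted`, no such lookup is needed because the two files can be walked in step, which is the trade-off the final paragraph of the overview describes.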
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|The input SAM or BAM file.|Required|1|| +|fastq|f|PathToFastq|Input FASTQ(s) with UMI reads.|Required|Unlimited|| +|output|o|PathToBam|Output BAM file to write.|Required|1|| +|attribute|t|String|The BAM attribute to store UMI bases in.|Optional|1|RX| +|qual-attribute|q|String|The BAM attribute to store UMI qualities in.|Optional|1|| +|read-structure|r|ReadStructure|The read structure for the FASTQ, otherwise all bases will be used.|Required|Unlimited|+M| +|sorted|s|Boolean|Whether the FASTQ file is sorted in the same order as the BAM.|Optional|1|false| +|fail-fast||Boolean|If set, fail on the first missing UMI.|Optional|1|false| + diff --git a/tools/2.3.0/AssessPhasing.md b/tools/2.3.0/AssessPhasing.md new file mode 100644 index 000000000..9b2544916 --- /dev/null +++ b/tools/2.3.0/AssessPhasing.md @@ -0,0 +1,40 @@ +--- +title: AssessPhasing +--- + +# AssessPhasing + +## Overview +**Group:** VCF/BCF + +Assess the accuracy of phasing for a set of variants. + +All phased genotypes should be annotated with the `PS` (phase set) `FORMAT` tag, which by convention is the +position of the first variant in the phase set (see the VCF specification). Furthermore, the alleles of a phased +genotype should use the `|` separator instead of the `/` separator, where the latter indicates the genotype is +unphased. + +The input VCFs are assumed to be single sample: the genotype from the first sample is used. + +Only bi-allelic heterozygous SNPs are considered. + +The input known phased variants can be subsetted using the known interval list, for example to keep only variants +from high-confidence regions. + +If the intervals argument is supplied, only the set of chromosomes specified will be analyzed. Note that the full +chromosome will be analyzed and start/stop positions will be ignored. 
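
The genotype conventions above (phased alleles separated by `|`, and only bi-allelic heterozygous SNPs being considered) can be sketched as a simple filter. This is an illustration of the stated rules under simplifying assumptions, not the tool's own code; the function name and argument shapes are hypothetical, and the `PS` tag is not examined here.

```python
def is_phased_het_snp(ref, alts, gt):
    """Return True for a phased, bi-allelic, heterozygous SNP genotype.

    ref  -- the REF allele string
    alts -- list of ALT allele strings
    gt   -- the GT value for the first sample, e.g. "0|1" or "0/1"
    """
    if len(alts) != 1:                        # bi-allelic sites only
        return False
    if len(ref) != 1 or len(alts[0]) != 1:    # SNPs only, no indels
        return False
    if "|" not in gt:                         # "/" marks an unphased genotype
        return False
    alleles = gt.split("|")
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]


print(is_phased_het_snp("A", ["G"], "0|1"))
print(is_phased_het_snp("A", ["G"], "0/1"))
```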
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|called-vcf|c|PathToVcf|The VCF with called phased variants.|Required|1|| +|truth-vcf|t|PathToVcf|The VCF with known phased variants.|Required|1|| +|output|o|PathPrefix|The output prefix for all output files.|Required|1|| +|known-intervals|k|PathToIntervals|The interval list over which known phased variants should be kept.|Optional|1|| +|allow-missing-fields-in-vcf-header|m|Boolean|Allow missing fields in the VCF header.|Optional|1|true| +|skip-mismatching-alleles|s|Boolean|Skip sites where the truth and call are both called but do not share the same alleles.|Optional|1|true| +|intervals|l|PathToIntervals|Analyze only the given chromosomes in the interval list. The entire chromosome will be analyzed (start and end ignored).|Optional|1|| +|modify-blocks|b|Boolean|Remove enclosed phased blocks and truncate overlapping blocks.|Optional|1|true| +|debug-vcf|d|Boolean|Output a VCF with the called variants annotated by if their phase matches the truth|Optional|1|false| + diff --git a/tools/2.3.0/AssignPrimers.md b/tools/2.3.0/AssignPrimers.md new file mode 100644 index 000000000..8186fc364 --- /dev/null +++ b/tools/2.3.0/AssignPrimers.md @@ -0,0 +1,71 @@ +--- +title: AssignPrimers +--- + +# AssignPrimers + +## Overview +**Group:** SAM/BAM + +Assigns reads to primers post-alignment. Takes in a BAM file of aligned reads and a tab-delimited file with five columns +(`chrom`, `left_start`, `left_end`, `right_start`, and `right_end`) which provide the 1-based inclusive start and +end positions of the primers for each amplicon. The primer file must include headers, e.g: + +``` +chrom left_start left_end right_start right_end +chr1 1010873 1010894 1011118 1011137 +``` + +Optionally, a sixth column column `id` may be given with a unique name for the amplicon. 
If not given, the +coordinates of the amplicon's primers will be used: + `:-,::` + +Each read is assigned independently of its mate (for paired end reads). The primer for a read is assumed to be +located at the start of the read in 5' sequencing order. Therefore, a positive strand +read will use its aligned start position to match against the amplicon's left-most coordinate, while a negative +strand read will use its aligned end position to match against the amplicon's right-most coordinate. + +For paired end reads, the assignment for mate will also be stored in the current read, using the same procedure as +above but using the mate's coordinates. This requires the input BAM have the mate-cigar ("MC") SAM tag. Read +pairs must have both ends mapped in forward/reverse configuration to have an assignment. Furthermore, the amplicon +assignment may be different for a read and its mate. This may occur, for example, if tiling nearby amplicons and +a large deletion occurs over a given primer and therefore "skipping" an amplicon. This may also occur if there are +translocations across amplicons. + +The output will have the following tags added: +- ap: the assigned primer coordinates (ex. `chr1:1010873-1010894`) +- am: the mate's assigned primer coordinates (ex. `chr1:1011118-1011137`) +- ip: the assigned amplicon id +- im: the mate's assigned amplicon id (or `=` if the same as the assigned amplicon) + +The read sequence of the primer is not checked against the expected reference sequence at the primer's genomic +coordinates. + +In some cases, large deletions within one end of a read pair may cause a primary and supplementary alignments to be +produced by the aligner, with the supplementary alignment containing the primer end of the read (5' sequencing order). +In this case, the primer may not be assigned for this end of the read pair. 
Therefore, it is recommended to prefer +or choose the primary alignment that has the closest aligned read base to the 5' end of the read in sequencing order. +For example, from `bwa` version `0.7.16` onwards, the `-5` option may be used. Consider also using the `-q` option +for `bwa` `0.7.16` as well, which is standard in `0.7.17` onwards when the `-5` option is used. + +The `--annotate-all` option may be used to annotate all alignments for a given read end (eg. R1) with +the same assignment. If the assignment differs across alignments for the same read end, no assignment is given. +Furthermore, if the input BAM is neither `queryname` sorted nor `query` grouped, it will be sorted into queryname +order to assign all alignments cross a template simultaneously. The output is written in coordinate order. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input BAM file.|Required|1|| +|output|o|PathToBam|Output BAM file.|Required|1|| +|metrics|m|FilePath|Output metrics file.|Required|1|| +|primers|p|FilePath|File with primer locations.|Required|1|| +|slop|S|Int|Match to primer locations +/- this many bases.|Optional|1|5| +|unclipped-coordinates|U|Boolean|True to based on the unclipped coordinates (adjust based on hard/soft clipping), otherwise the aligned bases|Optional|1|true| +|primer-coordinates-tag||String|The SAM tag for the assigned primer coordinate.|Optional|1|rp| +|mate-primer-coordinates-tag||String|The SAM tag for the mate's assigned primer coordinate.|Optional|1|mp| +|amplicon-identifier-tag||String|The SAM tag for the assigned amplicon identifier.|Optional|1|ra| +|mate-amplicon-identifier-tag||String|The SAM tag for the mate's assigned amplicon identifier.|Optional|1|ma| +|annotate-all||Boolean|Annotate all R1 (or R2) with same value.|Optional|1|false| + diff --git a/tools/2.3.0/AutoGenerateReadGroupsByName.md 
b/tools/2.3.0/AutoGenerateReadGroupsByName.md new file mode 100644 index 000000000..594427fb4 --- /dev/null +++ b/tools/2.3.0/AutoGenerateReadGroupsByName.md @@ -0,0 +1,43 @@ +--- +title: AutoGenerateReadGroupsByName +--- + +# AutoGenerateReadGroupsByName + +## Overview +**Group:** SAM/BAM + +Adds read groups to a BAM file for a single sample by parsing the read names. + +Will add one or more read groups by parsing the read names. The read names should be of the form: + +``` +:::::: +``` + +Each unique combination of `:::` will be its own read group. The ID of the +read group will be an integer and the platform unit will be `.`. + +The input is assumed to contain reads for one sample and library. Therefore, the sample and library must be given +and will be applied to all read groups. Read groups will be replaced if present. + +Two passes will be performed on the input: first to gather all the read groups, and second to write the output BAM +file. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input SAM or BAM file|Required|1|| +|output|o|PathToBam|Output SAM or BAM file|Required|1|| +|sample|s|String|The sample to insert into the read groups|Required|1|| +|library|l|String|The library to insert into the read groups|Required|1|| +|sequencing-center||String|The sequencing center from which the data originated|Optional|1|| +|predicted-insert-size||Integer|Predicted median insert size, to insert into the read groups|Optional|1|| +|program-group||String|Program group to insert into the read groups|Optional|1|| +|platform-model||String|Platform model to insert into the groups (free-form text providing further details of the platform/technology used)|Optional|1|| +|description||String|Description inserted into the read groups|Optional|1|| +|run-date||Iso8601Date|Date the run was produced (ISO 8601: `YYYY-MM-DD` ), to insert into the read 
groups|Optional|1|| +|comments||String|Comment(s) to include in the merged output file's header.|Optional|Unlimited|| +|sort-order||SamOrder|The sort order for the output sam/bam file.|Optional|1|| + diff --git a/tools/2.3.0/CallDuplexConsensusReads.md b/tools/2.3.0/CallDuplexConsensusReads.md new file mode 100644 index 000000000..ed67c0daa --- /dev/null +++ b/tools/2.3.0/CallDuplexConsensusReads.md @@ -0,0 +1,83 @@ +--- +title: CallDuplexConsensusReads +--- + +# CallDuplexConsensusReads + +## Overview +**Group:** Unique Molecular Identifiers (UMIs) + +Calls duplex consensus sequences from reads generated from the same _double-stranded_ source molecule. Prior +to running this tool, reads must have been grouped with `GroupReadsByUmi` using the `paired` strategy. Doing +so will apply (by default) MI tags to all reads of the form `*/A` and `*/B` where the /A and /B suffixes +with the same identifier denote reads that are derived from opposite strands of the same source duplex molecule. + +Reads from the same unique molecule are first partitioned by source strand and assembled into single +strand consensus molecules as described by CallMolecularConsensusReads. Subsequently, for molecules that +have at least one observation of each strand, duplex consensus reads are assembled by combining the evidence +from the two single strand consensus reads. + +Because of the nature of duplex sequencing, this tool does not support fragment reads - if found in the +input they are _ignored_. Similarly, read pairs for which consensus reads cannot be generated for one or +other read (R1 or R2) are omitted from the output. + +The consensus reads produced are unaligned, due to the difficulty and error-prone nature of inferring the consensus +alignment. Consensus reads should therefore be aligned after, which should not be too expensive as likely there +are far fewer consensus reads than input raw reads. 
Please see how best to use this tool within the best-practice +pipeline: https://github.com/fulcrumgenomics/fgbio/blob/main/docs/best-practice-consensus-pipeline.md + +Consensus reads have a number of additional optional tags set in the resulting BAM file. The tag names follow +a pattern where the first letter (a, b or c) denotes that the tag applies to the first single strand consensus (a), +second single-strand consensus (b) or the final duplex consensus (c). The second letter is intended to capture +the meaning of the tag (e.g. d=depth, m=min depth, e=errors/error-rate) and is upper case for values that are +one per read and lower case for values that are one per base. + +The tags break down into those that are single-valued per read: + +``` +consensus depth [aD,bD,cD] (int) : the maximum depth of raw reads at any point in the consensus reads +consensus min depth [aM,bM,cM] (int) : the minimum depth of raw reads at any point in the consensus reads +consensus error rate [aE,bE,cE] (float): the fraction of bases in raw reads disagreeing with the final consensus calls +``` + +And those that have a value per base (duplex values are not generated, but can be generated by summing): + +``` +consensus depth [ad,bd] (short[]): the count of bases contributing to each single-strand consensus read at each position +consensus errors [ae,be] (short[]): the count of bases from raw reads disagreeing with the final single-strand consensus base +consensus errors [ac,bc] (string): the single-strand consensus bases +consensus errors [aq,bq] (string): the single-strand consensus qualities +``` + +The per base depths and errors are both capped at 32,767. In all cases no-calls (Ns) and bases below the +min-input-base-quality are not counted in tag value calculations. + +The --min-reads option can take 1-3 values similar to `FilterConsensusReads`. For example: + +``` +CallDuplexConsensusReads ... 
--min-reads 10 5 3 +``` + +If fewer than three values are supplied, the last value is repeated (i.e. `5 4` -> `5 4 4` and `1` -> `1 1 1`. The +first value applies to the final consensus read, the second value to one single-strand consensus, and the last +value to the other single-strand consensus. It is required that if values two and three differ, +the _more stringent value comes earlier_. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|The input SAM or BAM file.|Required|1|| +|output|o|PathToBam|Output SAM or BAM file to write consensus reads.|Required|1|| +|read-name-prefix|p|String|The prefix all consensus read names|Optional|1|| +|read-group-id|R|String|The new read group ID for all the consensus reads.|Optional|1|A| +|error-rate-pre-umi|1|PhredScore|The Phred-scaled error rate for an error prior to the UMIs being integrated.|Optional|1|45| +|error-rate-post-umi|2|PhredScore|The Phred-scaled error rate for an error post the UMIs have been integrated.|Optional|1|40| +|min-input-base-quality|m|PhredScore|Ignore bases in raw reads that have Q below this value.|Optional|1|10| +|trim|t|Boolean|If true, quality trim input reads in addition to masking low Q bases.|Optional|1|false| +|sort-order|S|SamOrder|The sort order of the output, the same as the input if not given.|Optional|1|| +|min-reads|M|Int|The minimum number of input reads to a consensus read.|Required|3|1| +|max-reads-per-strand||Int|The maximum number of reads to use when building a single-strand consensus. 
If more than this many reads are present in a tag family, the family is randomly downsampled to exactly max-reads reads.|Optional|1|| +|threads||Int|The number of threads to use while consensus calling.|Optional|1|1| +|consensus-call-overlapping-bases||Boolean|Consensus call overlapping bases in mapped paired end reads|Optional|1|true| + diff --git a/tools/2.3.0/CallMolecularConsensusReads.md b/tools/2.3.0/CallMolecularConsensusReads.md new file mode 100644 index 000000000..95b05224a --- /dev/null +++ b/tools/2.3.0/CallMolecularConsensusReads.md @@ -0,0 +1,94 @@ +--- +title: CallMolecularConsensusReads +--- + +# CallMolecularConsensusReads + +## Overview +**Group:** Unique Molecular Identifiers (UMIs) + +Calls consensus sequences from reads with the same unique molecular tag. + +Reads with the same unique molecular tag are examined base-by-base to assess the likelihood of each base in the +source molecule. The likelihood model is as follows: + +1. First, the base qualities are adjusted. The base qualities are assumed to represent the probability of a + sequencing error (i.e. the sequencer observed the wrong base present on the cluster/flowcell/well). The base + quality scores are converted to probabilities incorporating a probability representing the chance of an error + from the time the unique molecular tags were integrated to just prior to sequencing. The resulting probability + is the error rate of all processes from right after integrating the molecular tag through to the end of + sequencing. +2. Next, a consensus sequence is called for all reads with the same unique molecular tag base-by-base. For a + given base position in the reads, the likelihoods that an A, C, G, or T is the base for the underlying + source molecule respectively are computed by multiplying the likelihood of each read observing the base + position being considered. The probability of error (from 1.) 
is used when the observed base does not match + the hypothesized base for the underlying source molecule, while one minus that probability is used otherwise. + The computed likelihoods are normalized by dividing them by the sum of all four likelihoods to produce a + posterior probability, namely the probability that the source molecule was an A, C, G, or T from just after + integrating the molecular tag through to sequencing, given the observations. The base with the maximum posterior + probability is chosen as the consensus call, and the posterior probability is used as its raw base quality. +3. Finally, the consensus raw base quality is modified by incorporating the probability of an error prior to + integrating the unique molecular tags. Therefore, the probability used for the final consensus base + quality is the posterior probability of the source molecule having the consensus base given the observed + reads with the same molecular tag, all the way from sample extraction and through sample and library + preparation, through preparing the library for sequencing (e.g. amplification, target selection), and finally, + through sequencing. + +This tool assumes that reads with the same tag are grouped together (consecutive in the file). Also, this tool +calls each end of a pair independently, and does not jointly call bases that overlap within a pair. Insertion or +deletion errors in the reads are not considered in the consensus model. + +The consensus reads produced are unaligned, due to the difficulty and error-prone nature of inferring the consensus +alignment. Consensus reads should therefore be aligned after, which should not be too expensive as likely there +are far fewer consensus reads than input raw reads. 
Please see how best to use this tool within the best-practice +pipeline: https://github.com/fulcrumgenomics/fgbio/blob/main/docs/best-practice-consensus-pipeline.md + +Particular attention should be paid to setting the `--min-reads` parameter as this can have a dramatic effect on +both results and runtime. For libraries with low duplication rates (e.g. 100-300X exomes libraries) in which it +is desirable to retain singleton reads while making consensus reads from sets of duplicates, `--min-reads=1` is +appropriate. For libraries with high duplication rates where it is desirable to only produce consensus reads +supported by 2+ reads to allow error correction, `--min-reads=2` or higher is appropriate. After generation, +consensus reads can be further filtered using the _FilterConsensusReads_ tool. As such it is always safe to run +with `--min-reads=1` and filter later, but filtering at this step can improve performance significantly. + +Consensus reads have a number of additional optional tags set in the resulting BAM file. The tags break down into +those that are single-valued per read: + +``` +consensus depth [cD] (int) : the maximum depth of raw reads at any point in the consensus read +consensus min depth [cM] (int) : the minimum depth of raw reads at any point in the consensus read +consensus error rate [cE] (float): the fraction of bases in raw reads disagreeing with the final consensus calls +``` + +And those that have a value per base: + +``` +consensus depth [cd] (short[]): the count of bases contributing to the consensus read at each position +consensus errors [ce] (short[]): the number of bases from raw reads disagreeing with the final consensus base +``` + +The per base depths and errors are both capped at 32,767. In all cases no-calls (`N`s) and bases below the +`--min-input-base-quality` are not counted in tag value calculations. 
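
The three-step likelihood model described in this overview can be sketched for a single base position as follows. This is a simplified illustration, not fgbio's implementation: the rule used to combine two error probabilities (`either_error`) is an assumption, the mismatch likelihood follows the text literally rather than splitting the error among the three alternative bases, and plain products are used where real code would likely work in log space to avoid underflow on deep families. The defaults of 45 and 40 are taken from the `--error-rate-pre-umi` and `--error-rate-post-umi` rows of the arguments table.

```python
import math

BASES = "ACGT"


def phred_to_prob(q):
    return 10 ** (-q / 10)


def prob_to_phred(p):
    return -10 * math.log10(p)


def either_error(p1, p2):
    """Assumed combination rule: probability that at least one of two
    independent error processes introduces an error."""
    return p1 + p2 - p1 * p2


def call_consensus_base(observations, error_rate_pre_umi=45, error_rate_post_umi=40):
    """Call one consensus base from (base, phred_quality) observations."""
    post = phred_to_prob(error_rate_post_umi)

    # Step 1: fold the post-UMI error rate into each base quality.
    adjusted = [(b, either_error(phred_to_prob(q), post)) for b, q in observations]

    # Step 2: likelihood of each hypothesized source base, then normalize.
    likelihoods = {}
    for hypothesis in BASES:
        likelihood = 1.0
        for base, p_err in adjusted:
            likelihood *= (1 - p_err) if base == hypothesis else p_err
        likelihoods[hypothesis] = likelihood
    call = max(BASES, key=lambda h: likelihoods[h])
    posterior = likelihoods[call] / sum(likelihoods.values())

    # Step 3: fold in the pre-UMI error rate to get the final quality.
    p_final = either_error(1 - posterior, phred_to_prob(error_rate_pre_umi))
    return call, prob_to_phred(p_final)


print(call_consensus_base([("A", 30), ("A", 30), ("C", 30)]))
```

One consequence visible in the sketch is that the final quality can never exceed the pre-UMI error rate, however many reads agree, because errors made before the UMI was attached are shared by every read in the family.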
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|The input SAM or BAM file.|Required|1|| +|output|o|PathToBam|Output SAM or BAM file to write consensus reads.|Required|1|| +|rejects|r|PathToBam|Optional output SAM or BAM file to write reads not used.|Optional|1|| +|tag|t|String|The SAM attribute with the unique molecule tag.|Optional|1|MI| +|read-name-prefix|p|String|The Prefix all consensus read names|Optional|1|| +|read-group-id|R|String|The new read group ID for all the consensus reads.|Optional|1|A| +|error-rate-pre-umi|1|PhredScore|The Phred-scaled error rate for an error prior to the UMIs being integrated.|Optional|1|45| +|error-rate-post-umi|2|PhredScore|The Phred-scaled error rate for an error post the UMIs have been integrated.|Optional|1|40| +|min-input-base-quality|m|PhredScore|Ignore bases in raw reads that have Q below this value.|Optional|1|10| +|min-consensus-base-quality|N|PhredScore|Deprecated: will be removed in future versions; use FilterConsensusReads to filter consensus bases on quality instead. Mask (make 'N') consensus bases with quality less than this threshold.|Optional|1|2| +|min-reads|M|Int|The minimum number of reads to produce a consensus base.|Required|1|| +|max-reads||Int|The maximum number of reads to use when building a consensus. 
If more than this many reads are present in a tag family, the family is randomly downsampled to exactly max-reads reads.|Optional|1|| +|output-per-base-tags|B|Boolean|If true produce tags on consensus reads that contain per-base information.|Optional|1|true| +|sort-order|S|SamOrder|The sort order of the output, the same as the input if not given.|Optional|1|| +|debug|D|Boolean|Turn on debug logging.|Optional|1|false| +|threads||Int|The number of threads to use while consensus calling.|Optional|1|1| +|consensus-call-overlapping-bases||Boolean|Consensus call overlapping bases in mapped paired end reads|Optional|1|true| + diff --git a/tools/2.3.0/CallOverlappingConsensusBases.md b/tools/2.3.0/CallOverlappingConsensusBases.md new file mode 100644 index 000000000..c127caca8 --- /dev/null +++ b/tools/2.3.0/CallOverlappingConsensusBases.md @@ -0,0 +1,64 @@ +--- +title: CallOverlappingConsensusBases +--- + +# CallOverlappingConsensusBases + +## Overview +**Group:** SAM/BAM + +Consensus calls overlapping bases in read pairs. + +## Inputs and Outputs + +In order to correctly correct reads by template, the input BAM must be either `queryname` sorted or `query` grouped. +The sort can be done in streaming fashion with: + +``` +samtools sort -n -u in.bam | fgbio CallOverlappingConsensusBases -i /dev/stdin ... +``` + +The output sort order may be specified with `--sort-order`. If not given, then the output will be in the same +order as input. + +The reference FASTA must be given so that any existing `NM`, `UQ` and `MD` tags can be repaired. + +## Correction + +Only mapped read pairs with overlapping bases will be eligible for correction. + +Each read base from the read and its mate that map to same position in the reference will be used to create +a consensus base as follows: + +1. If the base agree, then the chosen agreement strategy (`--agreement-strategy`) will be used. +2. If the base disagree, then the chosen disagreement strategy (`--disagreement-strategy`) will be used. 
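
The two-step rule above, together with the strategy definitions given in this section, can be sketched for a single overlapping position as shown below. This is a minimal illustration under stated assumptions: the function name and tuple layout are hypothetical, and no upper cap is applied to summed qualities, which a real implementation would probably need.

```python
MASKED = ("N", 2)  # a masked base and its Phred quality, per the text in this section


def call_overlap(base1, qual1, base2, qual2,
                 agreement="Consensus", disagreement="Consensus"):
    """Return ((base, qual), (base, qual)) for a read and its mate at one
    reference position where the two overlap."""
    if base1 == base2:
        if agreement == "Consensus":
            q = qual1 + qual2                 # sum of the two qualities
            return (base1, q), (base2, q)
        if agreement == "MaxQual":
            q = max(qual1, qual2)
            return (base1, q), (base2, q)
        if agreement == "PassThrough":
            return (base1, qual1), (base2, qual2)
        raise ValueError(f"unknown agreement strategy: {agreement}")

    if disagreement not in ("MaskBoth", "MaskLowerQual", "Consensus"):
        raise ValueError(f"unknown disagreement strategy: {disagreement}")
    if disagreement == "MaskBoth" or qual1 == qual2:
        return MASKED, MASKED                 # equal qualities always mask both
    if disagreement == "MaskLowerQual":
        return ((base1, qual1), MASKED) if qual1 > qual2 else (MASKED, (base2, qual2))
    # Consensus: keep the higher-quality base with the quality difference.
    base = base1 if qual1 > qual2 else base2
    q = abs(qual1 - qual2)
    return (base, q), (base, q)


print(call_overlap("A", 30, "C", 20))
```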
+ +The agreement strategies are as follows: + +* Consensus: Call the consensus base and return a new base quality that is the sum of the two base qualities. +* MaxQual: Call the consensus base and return a new base quality that is the maximum of the two base qualities. +* PassThrough: Leave the bases and base qualities unchanged. + +In the context of disagreement strategies, masking a base will make the base an "N" with base quality phred-value "2". +The disagreement strategies are as follows: + +* MaskBoth: Mask both bases. +* MaskLowerQual: Mask the base with the lowest base quality, with the other base unchanged. If the base qualities + are the same, mask both bases. +* Consensus: Consensus call the base. If the base qualities are the same, mask both bases. Otherwise, call the + base with the highest base quality and return a new base quality that is the difference between the + highest and lowest base quality. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input SAM or BAM file of aligned reads.|Required|1|| +|output|o|PathToBam|Output SAM or BAM file.|Required|1|| +|metrics|m|FilePath|Output metrics file.|Required|1|| +|ref|r|PathToFasta|Reference sequence fasta file.|Required|1|| +|threads||Int|The number of threads to use while consensus calling.|Optional|1|1| +|sort-order|S|SamOrder|The sort order of the output. If not given, output will be in the same order as input if the input.|Optional|1|| +|agreement-strategy||AgreementStrategy|The strategy to consensus call when both bases agree. See the usage for more details|Optional|1|Consensus| +|disagreement-strategy||DisagreementStrategy|The strategy to consensus call when both bases disagree. 
See the usage for more details|Optional|1|Consensus| + diff --git a/tools/2.3.0/ClipBam.md b/tools/2.3.0/ClipBam.md new file mode 100644 index 000000000..828e4ee0e --- /dev/null +++ b/tools/2.3.0/ClipBam.md @@ -0,0 +1,62 @@ +--- +title: ClipBam +--- + +# ClipBam + +## Overview +**Group:** SAM/BAM + +Clips reads from the same template. Ensures that at least N bases are clipped from any end of the read (i.e. +R1 5' end, R1 3' end, R2 5' end, and R2 3' end). Optionally clips reads from the same template to eliminate overlap +between the reads. This ensures that downstream processes, particularly variant calling, cannot double-count +evidence from the same template when both reads span a variant site in the same template. + +Clipping overlapping reads is only performed on `FR` read pairs, and is implemented by clipping approximately half +the overlapping bases from each read. By default hard clipping is performed; soft-clipping may be substituted +using the `--soft-clip` parameter. + +Secondary alignments and supplemental alignments are not clipped, but are passed through into the +output. + +In order to correctly clip reads by template and update mate information, the input BAM must be either +`queryname` sorted or `query` grouped. If your input BAM is not in an appropriate order the sort can be +done in streaming fashion with, for example: + +``` +samtools sort -n -u in.bam | fgbio ClipBam -i /dev/stdin ... +``` + +The output sort order may be specified with `--sort-order`. If not given, then the output will be in the same +order as input. + +Any existing `NM`, `UQ` and `MD` tags are repaired, and mate-pair information updated. + +Three clipping modes are supported: +1. `Soft` - soft-clip the bases and qualities. +2. `SoftWithMask` - soft-clip and mask the bases and qualities (make bases Ns and qualities the minimum). +3. `Hard` - hard-clip the bases and qualities. 
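
As a rough illustration of how the three clipping modes differ, the sketch below clips `n` bases from the 3' end of a read held as a base string plus a list of qualities. It is not how fgbio represents or clips records: the function name `clip_three_prime` is hypothetical, the third return value stands in for the soft-clip length that would appear in the CIGAR, and the masked quality of 2 is an assumption for "the minimum".

```python
def clip_three_prime(bases, quals, n, mode="Hard", min_qual=2):
    """Clip n bases from the 3' end; returns (bases, quals, soft_clipped_length).

    min_qual=2 is an assumed value for the minimum quality used when masking.
    """
    n = min(n, len(bases))
    keep = len(bases) - n
    if mode == "Hard":
        # Hard clipping removes the bases and qualities from the record.
        return bases[:keep], list(quals[:keep]), 0
    if mode == "Soft":
        # Soft clipping keeps bases and qualities; only the alignment changes.
        return bases, list(quals), n
    if mode == "SoftWithMask":
        # Soft clipping plus masking: clipped bases become Ns with minimum quality.
        return bases[:keep] + "N" * n, list(quals[:keep]) + [min_qual] * n, n
    raise ValueError(f"unknown clipping mode: {mode}")
```

Clipping two bases from `ACGTAC` would, under these assumptions, leave `ACGT` for `Hard`, the unchanged `ACGTAC` with two soft-clipped bases for `Soft`, and `ACGTNN` for `SoftWithMask`.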
+ +The `--upgrade-clipping` parameter will convert all existing clipping in the input to the given more stringent mode: +from `Soft` to either `SoftWithMask` or `Hard`, and `SoftWithMask` to `Hard`. In all other cases, clipping remains +the same prior to applying any other clipping criteria. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input SAM or BAM file of aligned reads in coordinate order.|Required|1|| +|output|o|PathToBam|Output SAM or BAM file.|Required|1|| +|metrics|m|FilePath|Optional output of clipping metrics.|Optional|1|| +|ref|r|PathToFasta|Reference sequence fasta file.|Required|1|| +|clipping-mode|c|ClippingMode|The type of clipping to perform.|Optional|1|Hard| +|auto-clip-attributes|a|Boolean|Automatically clip extended attributes that are the same length as bases.|Optional|1|false| +|upgrade-clipping|H|Boolean|Upgrade all existing clipping in the input to the given clipping mode prior to applying any other clipping criteria.|Optional|1|false| +|read-one-five-prime||Int|Require at least this number of bases to be clipped on the 5' end of R1|Optional|1|0| +|read-one-three-prime||Int|Require at least this number of bases to be clipped on the 3' end of R1|Optional|1|0| +|read-two-five-prime||Int|Require at least this number of bases to be clipped on the 5' end of R2|Optional|1|0| +|read-two-three-prime||Int|Require at least this number of bases to be clipped on the 3' end of R2|Optional|1|0| +|clip-overlapping-reads||Boolean|Clip overlapping reads.|Optional|1|false| +|clip-bases-past-mate||Boolean|Clip reads in FR pairs that sequence past the far end of their mate.|Optional|1|false| +|sort-order|S|SamOrder|The sort order of the output. 
If not given, output will be in the same order as the input.|Optional|1|| + diff --git a/tools/2.3.0/CollectAlternateContigNames.md b/tools/2.3.0/CollectAlternateContigNames.md new file mode 100644 index 000000000..5f1913eba --- /dev/null +++ b/tools/2.3.0/CollectAlternateContigNames.md @@ -0,0 +1,38 @@ +--- +title: CollectAlternateContigNames +--- + +# CollectAlternateContigNames + +## Overview +**Group:** FASTA + +Collates the alternate contig names from an NCBI assembly report. + +The input should be the `*.assembly_report.txt` obtained from NCBI. + +The output will be a "sequence dictionary", which is a valid SAM file, containing the version header line and one +line per contig. The primary contig name (i.e. `@SQ.SN`) is specified with the `--primary` option, while alternate +names (i.e. aliases) are specified with the `--alternates` option. + +The `Assigned-Molecule` column, if specified as an `--alternate`, will only be used for sequences with +`Sequence-Role` `assembled-molecule`. + +When updating an existing sequence dictionary with `--existing` the primary contig names must match. I.e. the +contig name from the assembly report column specified by `--primary` must match the contig name in the existing +sequence dictionary (`@SQ.SN`). All contigs in the existing sequence dictionary must be present in the assembly +report. Furthermore, contigs in the assembly report not found in the sequence dictionary will be ignored.
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|FilePath|Input NCBI assembly report file.|Required|1|| +|output|o|PathToSequenceDictionary|Output sequence dictionary file.|Required|1|| +|primary|p|AssemblyReportColumn|The assembly report column for the primary contig name.|Optional|1|RefSeqAccession| +|alternates|a|AssemblyReportColumn|The assembly report column(s) for the alternate contig name(s)|Required|Unlimited|| +|sequence-roles|s|SequenceRole|Only output sequences with the given sequence roles. If none given, all sequences will be output.|Optional|Unlimited|| +|existing|d|PathToSequenceDictionary|Update an existing sequence dictionary file. The primary names must match.|Optional|1|| +|allow-mismatching-lengths|x|Boolean|Allow mismatching sequence lengths when using an existing sequence dictionary file.|Optional|1|false| +|skip-missing-alternates||Boolean|Skip contigs that have no alternates|Optional|1|true| + diff --git a/tools/2.3.0/CollectDuplexSeqMetrics.md b/tools/2.3.0/CollectDuplexSeqMetrics.md new file mode 100644 index 000000000..8b6f8d712 --- /dev/null +++ b/tools/2.3.0/CollectDuplexSeqMetrics.md @@ -0,0 +1,74 @@ +--- +title: CollectDuplexSeqMetrics +--- + +# CollectDuplexSeqMetrics + +## Overview +**Group:** Unique Molecular Identifiers (UMIs) + +Collects a suite of metrics to QC duplex sequencing data. + +## Inputs + +The input to this tool must be a BAM file that is either: + +1. The exact BAM output by the `GroupReadsByUmi` tool (in the sort-order it was produced in) +2. A BAM file that has MI tags present on all reads (usually set by `GroupReadsByUmi`) and has + been sorted with `SortBam` into `TemplateCoordinate` order. + +Calculation of metrics may be restricted to a set of regions using the `--intervals` parameter.
This +can significantly affect results as off-target reads in duplex sequencing experiments often have very +different properties than on-target reads due to the lack of enrichment. + +Several metrics are calculated related to the fraction of tag families that have duplex coverage. The +definition of "duplex" is controlled by the `--min-ab-reads` and `--min-ba-reads` parameters. The default +is to treat any tag family with at least one observation of each strand as a duplex, but this could be +made more stringent, e.g. by setting `--min-ab-reads=3 --min-ba-reads=3`. If different thresholds are +used then `--min-ab-reads` must be the higher value. + +## Outputs + +The following output files are produced: + +1. **.family_sizes.txt**: metrics on the frequency of different types of families of different sizes +2. **.duplex_family_sizes.txt**: metrics on the frequency of duplex tag families by the number of + observations from each strand +3. **.duplex_yield_metrics.txt**: summary QC metrics produced using 5%, 10%, 15%...100% of the data +4. **.umi_counts.txt**: metrics on the frequency of observations of UMIs within reads and tag families +5. **.duplex_qc.pdf**: a series of plots generated from the preceding metrics files for visualization +6. **.duplex_umi_counts.txt**: (optional) metrics on the frequency of observations of duplex UMIs within + reads and tag families. This file is only produced _if_ the `--duplex-umi-counts` option is used as it + requires significantly more memory to track all pairs of UMIs seen when a large number of UMI sequences are present. + +Within the metrics files the prefixes `CS`, `SS` and `DS` are used to mean: + +* **CS**: tag families where membership is defined solely on matching genome coordinates and strand +* **SS**: single-stranded tag families where membership is defined by genome coordinates, strand and UMI; + i.e. 50/A and 50/B are considered different tag families.
+* **DS**: double-stranded tag families where membership is collapsed across single-stranded tag families + from the same double-stranded source molecule; i.e. 50/A and 50/B become one family + +## Requirements + +For plots to be generated R must be installed and the ggplot2 package installed with suggested +dependencies. Successfully executing the following in R will ensure a working installation: + +```R +install.packages("ggplot2", repos="http://cran.us.r-project.org", dependencies=TRUE) +``` + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input BAM file generated by `GroupReadsByUmi`.|Required|1|| +|output|o|PathPrefix|Prefix of output files to write.|Required|1|| +|intervals|l|PathToIntervals|Optional set of intervals over which to restrict analysis.|Optional|1|| +|description|d|String|Description of data set used to label plots. Defaults to sample/library.|Optional|1|| +|duplex-umi-counts|u|Boolean|If true, produce the .duplex_umi_counts.txt file with counts of duplex UMI observations.|Optional|1|false| +|min-ab-reads|a|Int|Minimum AB reads to call a tag family a 'duplex'.|Optional|1|1| +|min-ba-reads|b|Int|Minimum BA reads to call a tag family a 'duplex'.|Optional|1|1| +|umi-tag|t|String|The tag containing the raw UMI.|Optional|1|RX| +|mi-tag|T|String|The output tag for UMI grouping.|Optional|1|MI| + diff --git a/tools/2.3.0/CollectErccMetrics.md b/tools/2.3.0/CollectErccMetrics.md new file mode 100644 index 000000000..20995a5d5 --- /dev/null +++ b/tools/2.3.0/CollectErccMetrics.md @@ -0,0 +1,51 @@ +--- +title: CollectErccMetrics +--- + +# CollectErccMetrics + +## Overview +**Group:** RNA-Seq + +Collects metrics for ERCC spike-ins for RNA-Seq experiments. + +Currently calculates per-transcript ERCC metrics and summarizes dose response, but does not calculate fold-change +response. 
+ +The input BAM should contain reads mapped to a reference containing the ERCC transcripts. The reference may have +additional contigs, for example, when concatenating the sample's reference genome and the ERCC transcripts. The +BAM should have sequence lines in the header matching the ERCC ids (ex. ERCC-00130 or ERCC-00004). + +The standard ERCC transcripts, including their unique IDs and concentrations, are taken from +[here](https://assets.thermofisher.com/TFS-Assets/LSG/manuals/cms_095046.txt). The second column lists the ERCC +transcript identifier which should be present as a sequence line in the BAM header, while columns four and five +give the concentration of the transcript for mixtures #1 and #2. + +The choice of mixture to use can be specified with the `--mixture-name` option (either `Mix1` or `Mix2`), or with +a file containing a custom list of transcripts and concentrations using the `--custom-mixture` option as follows. +The custom mixture file should be a tab-delimited file containing the following columns: + 1. ERCC ID - each ERCC ID should match a contig/reference-sequence name in the input SAM/BAM header. + 2. Concentration - the concentration (in `attomoles/ul`). +The custom mixture file should contain a header line with names for each column, though the actual values will be ignored. + +Three outputs will be produced: + 1. .ercc_summary_metrics.txt - summary statistics for total # of reads mapping to the ERCC transcripts and dose + response metrics. + 2. .ercc_detailed_metrics.txt - gives a per-ERCC-transcript expected concentration and observed fragment count. + 3. .ercc_metrics.pdf - plots the expected concentration versus the observed fragment count. + +Secondary and supplementary reads will be ignored. A read pair mapping to an ERCC transcript is counted only if both +ends of the pair map to the same ERCC transcript. A minimum mapping quality can be required for reads to be +counted as mapped to an ERCC transcript.
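
To make the custom mixture layout described above concrete, here is a small Python sketch that reads a two-column, tab-delimited file with a header line. It only illustrates the file format and is not part of fgbio: the function name `read_custom_mixture`, the header names, and the concentration values are all made up for the example.

```python
import csv
import io


def read_custom_mixture(handle):
    """Parse a tab-delimited custom mixture file into {ERCC ID: concentration}."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # the header line is required, but its column names are ignored
    mixture = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        ercc_id, concentration = row[0], float(row[1])
        mixture[ercc_id] = concentration  # concentration in attomoles/ul
    return mixture


# Hypothetical file contents; the concentrations are illustrative, not real ERCC values.
example = "ercc_id\tconcentration\nERCC-00130\t100.0\nERCC-00004\t25.0\n"
mixture = read_custom_mixture(io.StringIO(example))
```

Each key in the resulting mapping would need to match a contig name in the input SAM/BAM header.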
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input SAM or BAM file of aligned reads in coordinate order.|Required|1|| +|output|o|PathPrefix|Output prefix.|Required|1|| +|mixture-name|m|ErccMixture|The name of the standard ERCC mixture.|Optional|1|| +|custom-mixture||FilePath|Tab-delimited file containing ERCC IDs and expected concentrations.|Optional|1|| +|min-transcript-count|c|Int|Minimum # of counts required to include an ERCC transcript.|Optional|1|3| +|minimum-mapping-quality|M|Int|The minimum mapping quality|Optional|1|10| + diff --git a/tools/2.3.0/CopyUmiFromReadName.md b/tools/2.3.0/CopyUmiFromReadName.md new file mode 100644 index 000000000..52ba161f8 --- /dev/null +++ b/tools/2.3.0/CopyUmiFromReadName.md @@ -0,0 +1,35 @@ +--- +title: CopyUmiFromReadName +--- + +# CopyUmiFromReadName + +## Overview +**Group:** Unique Molecular Identifiers (UMIs) + +Copies the UMI at the end of the BAM's read name to the RX tag. + +The read name is split on `:` characters with the last field assumed to be the UMI sequence. The UMI +will be copied to the `RX` tag as per the SAM specification. If any read does not have a UMI composed of +valid bases (ACGTN), the program will report the error and fail. + +If a read name contains multiple UMIs they may be delimited (typically by a hyphen (`-`) or plus (`+`)). +The `--umi-delimiter` option specifies the delimiter on which to split. The resulting UMI in the `RX` tag +will always be hyphen delimited. + +Some tools (e.g. BCL Convert) may reverse-complement UMIs on R2 and add an 'r' prefix to indicate that the sequence +has been reverse-complemented. By default, the 'r' prefix is removed and the sequence is reverse-complemented +back to the forward orientation. 
The `--override-reverse-complement-umis` option disables the latter behavior, such that +the 'r' prefix is removed but the UMI sequence is left as reverse-complemented. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|The input BAM file.|Required|1|| +|output|o|PathToBam|The output BAM file.|Required|1|| +|remove-umi||Boolean|Remove the UMI from the read name.|Optional|1|false| +|field-delimiter||Char|Delimiter between the read name and UMI.|Optional|1|:| +|umi-delimiter||Char|Delimiter between UMI sequences.|Optional|1|+| +|override-reverse-complement-umis||Boolean|Do not reverse-complement UMIs prefixed with 'r'.|Optional|1|false| + diff --git a/tools/2.3.0/CorrectUmis.md b/tools/2.3.0/CorrectUmis.md new file mode 100644 index 000000000..ded2d774d --- /dev/null +++ b/tools/2.3.0/CorrectUmis.md @@ -0,0 +1,64 @@ +--- +title: CorrectUmis +--- + +# CorrectUmis + +## Overview +**Group:** Unique Molecular Identifiers (UMIs) + +Corrects UMIs stored in BAM files when a set of fixed UMIs is in use. If the set of UMIs used in +an experiment is known and is a _subset_ of the possible randomers of the same length, it is possible +to error-correct UMIs prior to grouping reads by UMI. This tool takes an input BAM with UMIs in a +tag (`RX` by default) and a set of known UMIs (either on the command line or in a file) and produces: + + 1. A new BAM with corrected UMIs in the same tag the UMIs were found in + 2. Optionally a set of metrics about the representation of each UMI in the set + 3. Optionally a second BAM file of reads whose UMIs could not be corrected within the specified parameters + +All of the fixed UMIs must be of the same length, and all UMIs in the BAM file must also have the same +length. Multiple UMIs that are concatenated with hyphens (e.g. `AACCAGT-AGGTAGA`) are split apart, +corrected individually and then re-assembled.
A read is accepted only if all the UMIs can be corrected. + +Correction is controlled by two parameters that are applied per-UMI: + + 1. _--max-mismatches_ controls how many mismatches (no-calls are counted as mismatches) are tolerated + between a UMI as read and a fixed UMI. + 2. _--min-distance_ controls how many more mismatches the next best hit must have + +For example, with two fixed UMIs `AAAAA` and `CCCCC` and `--max-mismatches=3` and `--min-distance=2` the +following would happen: + + - AAAAA would match to AAAAA + - AAGTG would match to AAAAA with three mismatches because CCCCC has five mismatches and 5 >= 3 + 2 + - AACCA would be rejected because it is 2 mismatches to AAAAA and 3 to CCCCC and 3 < 2 + 2 + +The set of fixed UMIs may be specified on the command line using `--umis umi1 umi2 ...` or via one or +more files of UMIs with a single sequence per line using `--umi-files umis.txt more_umis.txt`. If there +are multiple UMIs per template, leading to hyphenated UMI tags, the values for the fixed UMIs should +be single, non-hyphenated UMIs (e.g. if a record has `RX:Z:ACGT-GGCA`, you would use `--umis ACGT GGCA`). + +Records which have their UMIs corrected (i.e. the UMI is not identical to one of the expected UMIs but is +close enough to be corrected) will by default have their original UMI stored in the `OX` tag. This can be +disabled with the `--dont-store-original-umis` option. + +For a large number of input UMIs, the `--cache-size` option may be used to speed up the tool. To disable +using a cache, set the value to `0`.
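
The matching rule described above, including the worked example with fixed UMIs `AAAAA` and `CCCCC`, can be sketched in a few lines of Python. This is a minimal illustration of the rule as stated in the text, not fgbio's code: the function names `mismatches`, `correct_umi` and `correct_tag` are hypothetical, and caching and metrics are left out.

```python
def mismatches(observed, fixed):
    """Count positions that differ; a no-call (N) never equals a fixed base, so it counts."""
    if len(observed) != len(fixed):
        raise ValueError("UMIs must all be the same length")
    return sum(a != b for a, b in zip(observed, fixed))


def correct_umi(observed, fixed_umis, max_mismatches, min_distance):
    """Return the fixed UMI that `observed` corrects to, or None if it is rejected."""
    ranked = sorted((mismatches(observed, umi), umi) for umi in fixed_umis)
    best_mm, best_umi = ranked[0]
    # With a single fixed UMI there is no competitor, so the distance test always passes.
    next_mm = ranked[1][0] if len(ranked) > 1 else float("inf")
    if best_mm <= max_mismatches and next_mm - best_mm >= min_distance:
        return best_umi
    return None


def correct_tag(tag, fixed_umis, max_mismatches, min_distance):
    """Correct a possibly hyphenated UMI tag; reject it if any part cannot be corrected."""
    parts = [correct_umi(p, fixed_umis, max_mismatches, min_distance) for p in tag.split("-")]
    return None if any(p is None for p in parts) else "-".join(parts)
```

With `max_mismatches=3` and `min_distance=2`, `correct_umi("AAGTG", ["AAAAA", "CCCCC"], 3, 2)` would be expected to yield `AAAAA`, while `AACCA` would be rejected because the next best hit is only one mismatch further away.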
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input SAM or BAM file.|Required|1|| +|output|o|PathToBam|Output SAM or BAM file.|Required|1|| +|rejects|r|PathToBam|Reject BAM file to save unassigned reads.|Optional|1|| +|metrics|M|FilePath|Metrics file to write.|Optional|1|| +|max-mismatches|m|Int|Maximum number of mismatches between a UMI and an expected UMI.|Required|1|| +|min-distance|d|Int|Minimum distance (in mismatches) to next best UMI.|Required|1|| +|umis|u|String|Expected UMI sequences.|Optional|Unlimited|| +|umi-files|U|FilePath|File of UMI sequences, one per line.|Optional|Unlimited|| +|umi-tag|t|String|Tag in which UMIs are stored.|Optional|1|RX| +|dont-store-original-umis|x|Boolean|Don't store original UMIs upon correction.|Optional|1|false| +|cache-size||Int|The number of uncorrected UMIs to cache; zero will disable the cache.|Optional|1|100000| +|min-corrected||Double|The minimum ratio of kept UMIs to accept. A ratio below this will cause a failure (but all files will still be written).|Optional|1|| + diff --git a/tools/2.3.0/DemuxFastqs.md b/tools/2.3.0/DemuxFastqs.md new file mode 100644 index 000000000..06f47d5e1 --- /dev/null +++ b/tools/2.3.0/DemuxFastqs.md @@ -0,0 +1,171 @@ +--- +title: DemuxFastqs +--- + +# DemuxFastqs + +## Overview +**Group:** FASTQ + +Performs sample demultiplexing on FASTQs. + +**Please see https://github.com/fulcrumgenomics/fqtk for a faster and supported replacement** + +The sample barcode for each sample in the sample sheet will be compared against the sample barcode bases extracted from +the FASTQs, to assign each read to a sample. Reads that do not match any sample within the given error tolerance +will be placed in the 'unmatched' file. 
+ +The type of output is specified with the `--output-type` option, and can be BAM (`--output-type Bam`), +gzipped FASTQ (`--output-type Fastq`), or both (`--output-type BamAndFastq`). + +For BAM output, the output directory will contain one BAM file per sample in the sample sheet or metadata CSV file, +plus a BAM for reads that could not be assigned to a sample given the criteria. The output file names will be the +concatenation of sample id, sample name, and sample barcode bases (expected not observed), delimited by `-`. A +metrics file will also be output providing analogous information to the metric described +[SampleBarcodeMetric](http://fulcrumgenomics.github.io/fgbio/metrics/latest/#samplebarcodemetric). + +For gzipped FASTQ output, one or more gzipped FASTQs per sample in the sample sheet or metadata CSV file will be +written to the output directory. For paired end data, the output will have the suffix `_R1.fastq.gz` and +`_R2.fastq.gz` for read one and read two respectively. The sample barcode and molecular barcodes (concatenated) +will be appended to the read name and delimited by a colon. If the `--illumina-standards` option is given, then +the output read names and file names will follow the +[Illumina standards described here](https://help.basespace.illumina.com/articles/tutorials/upload-data-using-web-uploader/). + +The output base qualities will be standardized to Sanger/SAM format. + +FASTQs and associated read structures for each sub-read should be given: + +- a single fragment read should have one FASTQ and one read structure +- paired end reads should have two FASTQs and two read structures +- a dual-index sample with paired end reads should have four FASTQs and four read structures given: two for the + two index reads, and two for the template reads. + +If multiple FASTQs are present for each sub-read, then the FASTQs for each sub-read should be concatenated together +prior to running this tool (ex. 
`cat s_R1_L001.fq.gz s_R1_L002.fq.gz > s_R1.fq.gz`). + +[Read structures](https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures) are made up of `<number><operator>` +pairs much like the `CIGAR` string in BAM files. Four kinds of operators are recognized: + +1. `T` identifies a template read +2. `B` identifies a sample barcode read +3. `M` identifies a unique molecular index read +4. `S` identifies a set of bases that should be skipped or ignored + +The last `<number><operator>` pair may be specified using a `+` sign instead of a number to denote "all remaining +bases". This is useful if, e.g., fastqs have been trimmed and contain reads of varying length. Both reads must +have template bases. Any molecular identifiers will be concatenated using +the `-` delimiter and placed in the given SAM record tag (`RX` by default). Similarly, the sample barcode bases +from the given read will be placed in the `BC` tag. + +Metadata about the samples should be given in either an Illumina Experiment Manager sample sheet or a metadata CSV +file. Formats are described in detail below. + +The read structures will be used to extract the observed sample barcode, template bases, and molecular identifiers +from each read. The observed sample barcode will be matched to the sample barcodes extracted from the bases in +the sample metadata and associated read structures. + +## Sample Sheet +The read group's sample id, sample name, and library id all correspond to the similarly named values in the +sample sheet. Library id will be the sample id if not found, and the platform unit will be the sample name +concatenated with the sample barcode bases delimited by a `.`. + +The sample section of the sample sheet should contain information related to each sample with the following columns: + + * Sample_ID: The sample identifier unique to the sample in the sample sheet. + * Sample_Name: The sample name. + * Library_ID: The library identifier.
The combination of sample name and library identifier should be unique + across the samples in the sample sheet. + * Description: The description of the sample, which will be placed in the description field in the output BAM's + read group. This column may be omitted. + * Sample_Barcode: The sample barcode bases unique to each sample. The name of the column containing the sample barcode + can be changed using the `--column-for-sample-barcode` option. If the sample barcode is present + across multiple reads (ex. dual-index, or inline in both reads of a pair), then the expected + barcode bases from each read should be concatenated in the same order as the order of the reads' + FASTQs and read structures given to this tool. + +## Metadata CSV + +In lieu of a sample sheet, a simple CSV file may be provided with the necessary metadata. This file should +contain the same columns as described above for the sample sheet (`Sample_ID`, `Sample_Name`, `Library_ID`, and +`Description`). + +## Example Command Line + +As an example, if the sequencing run was 2x100bp (paired end) with two 8bp index reads both reading a sample +barcode, as well as an in-line 8bp sample barcode in read one, the command line would be + +``` +--inputs r1.fq i1.fq i2.fq r2.fq --read-structures 8B92T 8B 8B 100T \ + --metadata SampleSheet.csv --metrics metrics.txt --output output_folder +``` + +## Output Standards + +The following options affect the output format: + +1. If `--omit-fastq-read-numbers` is specified, then trailing /1 and /2 for R1 and R2 respectively, will not be +appended to the FASTQ read name. By default they will be appended. +2. If `--include-sample-barcodes-in-fastq` is specified, then the sample barcode will replace the last field in the +first comment in the FASTQ header, e.g. replace 'NNNNNN' in the header `@Instrument:RunID:FlowCellID:Lane:Tile:X:Y 1:N:0:NNNNNN` +3.
If `--illumina-file-names` is specified, the output files will be named according to the Illumina FASTQ file +naming conventions: + + a. The file extension will be `_R1_001.fastq.gz` for read one, and `_R2_001.fastq.gz` for read two (if paired end). + b. The per-sample output prefix will be `_S_L` (without angle brackets). + +Options (1) and (2) require the input FASTQ read names to contain the following elements: + +`@:::::: :::` + +[See the Illumina FASTQ conventions for more details.](https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/FASTQFiles_Intro_swBS.htm) + +The `--illumina-standards` option may not be specified with the three options above. Use this option if you +intend to upload to Illumina BaseSpace. This option implies: + +`--omit-fastq-read-numbers=true --include-sample-barcodes-in-fastq=false --illumina-file-names=true` + +[See the Illumina Basespace standards described here](https://help.basespace.illumina.com/articles/tutorials/upload-data-using-web-uploader/). + +To output with recent Illumina conventions (circa 2021) that match `bcl2fastq` and `BCLconvert`, use: + +`--omit-fastq-read-numbers=true --include-sample-barcodes-in-fastq=true --illumina-file-names=true` + +By default all input reads are output. If your input FASTQs contain reads that do not pass filter (as defined by the Y/N filter flag in the FASTQ comment) these can be filtered out during demultiplexing using the `--omit-failing-reads` option. + +To output only reads that are not control reads, as encoded in the `` field in the header comment, use the `--omit-control-reads` flag + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|inputs|i|PathToFastq|One or more input fastq files each corresponding to a sub-read (ex. 
index-read, read-one, read-two, fragment).|Required|Unlimited|| +|output|o|DirPath|The output directory in which to place sample BAMs.|Required|1|| +|metadata|x|FilePath|A file containing the metadata about the samples.|Required|1|| +|read-structures|r|ReadStructure|The read structure for each of the FASTQs.|Required|Unlimited|| +|metrics|m|FilePath|The file to which per-barcode metrics are written. If none given, a file named `demux_barcode_metrics.txt` will be written to the output directory.|Optional|1|| +|column-for-sample-barcode|c|String|The column name in the sample sheet or metadata CSV for the sample barcode.|Optional|1|Sample_Barcode| +|unmatched|u|String|Output BAM file name for the unmatched records.|Optional|1|unmatched.bam| +|quality-format|q|QualityEncoding|A value describing how the quality values are encoded in the FASTQ. Either Solexa for pre-pipeline 1.3 style scores (solexa scaling + 66), Illumina for pipeline 1.3 and above (phred scaling + 64) or Standard for phred scaled scores with a character shift of 33. If this value is not specified, the quality format will be detected automatically.|Optional|1|| +|threads|t|Int|The number of threads to use while de-multiplexing. The performance does not increase linearly with the # of threads and seems not to improve beyond 2-4 threads.|Optional|1|1| +|max-mismatches||Int|Maximum mismatches for a barcode to be considered a match.|Optional|1|1| +|min-mismatch-delta||Int|Minimum difference between number of mismatches in the best and second best barcodes for a barcode to be considered a match.|Optional|1|2| +|max-no-calls||Int|Maximum allowable number of no-calls in a barcode read before it is considered unmatchable.|Optional|1|2| +|sort-order||SortOrder|The sort order for the output sam/bam file (typically unsorted or queryname).|Optional|1|queryname| +|umi-tag||String|The SAM tag for any molecular barcode. 
If multiple molecular barcodes are specified, they will be concatenated and stored here.|Optional|1|RX| +|platform-unit||String|The platform unit (typically `-.`)|Optional|1|| +|sequencing-center||String|The sequencing center from which the data originated|Optional|1|| +|predicted-insert-size||Integer|Predicted median insert size, to insert into the read group header|Optional|1|| +|platform-model||String|Platform model to insert into the read group header (ex. miseq, hiseq2500, hiseqX)|Optional|1|| +|platform||String|Platform to insert into the read group header of BAMs (e.g. Illumina)|Optional|1|Illumina| +|comments||String|Comment(s) to include in the merged output file's header.|Optional|Unlimited|| +|run-date||Iso8601Date|Date the run was produced, to insert into the read group header|Optional|1|| +|output-type||OutputType|The type of outputs to produce.|Optional|1|| +|include-all-bases-in-fastqs||Boolean|Output all bases (i.e. all sample barcode, molecular barcode, skipped, and template bases) for every read with template bases (ex. read one and read two) as defined by the corresponding read structure(s).|Optional|1|false| +|illumina-standards||Boolean|Output FASTQs according to Illumina BaseSpace Sequence Hub naming standards. This is different from Illumina naming standards.|Optional|1|false| +|omit-fastq-read-numbers||Boolean|Do not include trailing /1 or /2 for R1 and R2 in the FASTQ read name.|Optional|1|false| +|include-sample-barcodes-in-fastq||Boolean|Insert the sample barcode into the FASTQ header.|Optional|1|false| +|illumina-file-names||Boolean|Name the output files according to the Illumina file name standards.|Optional|1|false| +|omit-failing-reads||Boolean|Keep only passing filter reads if true, otherwise keep all reads. Passing filter reads are determined from the comment in the FASTQ header.|Optional|1|false| +|omit-control-reads||Boolean|Do not keep reads identified as control if true, otherwise keep all reads.
Control reads are determined from the comment in the FASTQ header.|Optional|1|false| +|mask-bases-below-quality||Int|Mask bases with a quality score below the specified threshold as Ns|Optional|1|0| + diff --git a/tools/2.3.0/DownsampleAndNormalizeBam.md b/tools/2.3.0/DownsampleAndNormalizeBam.md new file mode 100644 index 000000000..50b7f88eb --- /dev/null +++ b/tools/2.3.0/DownsampleAndNormalizeBam.md @@ -0,0 +1,37 @@ +--- +title: DownsampleAndNormalizeBam +--- + +# DownsampleAndNormalizeBam + +## Overview +**Group:** SAM/BAM + +Downsamples a BAM in a biased way to a uniform coverage across regions. + +Attempts to downsample a BAM such that every base in the genome (or in the target `regions` if provided) +is covered by at least `coverage` reads. When computing coverage: + - Reads marked as secondary, duplicate or unmapped are not used + - A base can receive coverage from only one read with the same queryname (i.e. mate overlaps are not counted) + - Coverage is counted if a read _spans_ a base, even if that base is deleted in the read + +Reads are first sorted into a random order (by hashing read names). Reads are then consumed one template +at a time, and if any read adds coverage to a base that is under the target coverage, _all_ reads (including +secondary, unmapped, etc.) for that template are emitted into the output. + +Given the procedure used for downsampling, it is likely the output BAM will have coverage up to 2X the requested +coverage at regions in the input BAM that are i) well covered and ii) close to regions that are poorly +covered.
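
The downsampling procedure described above can be sketched as a greedy pass over templates in hashed-name order. The Python below is a simplified illustration under several assumptions, not the fgbio implementation: templates are given as a hypothetical mapping from read name to half-open `(start, end)` intervals on a single contig, the hash function and the way the seed is mixed in are arbitrary choices, and filtering by flags, mapping quality and target regions is omitted.

```python
import hashlib
from collections import defaultdict


def downsample_to_coverage(templates, target_coverage, seed=42):
    """Greedily keep templates until every covered base reaches target_coverage.

    templates: {read name: [(start, end), ...]} with half-open intervals (an assumed layout).
    Returns (names of kept templates, per-base coverage of the kept templates).
    """
    def order_key(name):
        # A pseudo-random but reproducible order obtained by hashing the read name.
        return hashlib.md5(f"{seed}:{name}".encode()).hexdigest()

    coverage = defaultdict(int)
    kept = []
    for name in sorted(templates, key=order_key):
        # A set means overlapping mates contribute coverage to a base only once.
        positions = set()
        for start, end in templates[name]:
            positions.update(range(start, end))
        # Keep the whole template if it covers any base still below the target.
        if any(coverage[pos] < target_coverage for pos in positions):
            kept.append(name)
            for pos in positions:
                coverage[pos] += 1
    return kept, coverage
```

Because a template is kept whenever it helps any single under-covered base, bases next to poorly covered regions can end up above the target, which is consistent with the caveat about coverage of up to 2X the requested value.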
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input SAM or BAM file.|Required|1|| +|output|o|PathToBam|Output SAM or BAM file.|Required|1|| +|coverage|c|Int|Desired minimum coverage.|Required|1|| +|min-map-q|m|Int|Minimum mapping quality to count a read as covering.|Optional|1|0| +|seed|s|Int|Random seed to use when randomizing order of reads/templates.|Optional|1|42| +|regions|l|PathToIntervals|Optional set of regions for coverage targeting.|Optional|1|| +|max-in-memory|M|Int|Maximum records to be held in memory while sorting.|Optional|1|1000000| + diff --git a/tools/2.3.0/ErrorRateByReadPosition.md b/tools/2.3.0/ErrorRateByReadPosition.md new file mode 100644 index 000000000..297f2aede --- /dev/null +++ b/tools/2.3.0/ErrorRateByReadPosition.md @@ -0,0 +1,54 @@ +--- +title: ErrorRateByReadPosition +--- + +# ErrorRateByReadPosition + +## Overview +**Group:** SAM/BAM + +Calculates the error rate by read position on coordinate sorted mapped BAMs. The output file contains +a row per read (first of pair, second of pair and unpaired), per position in read, with the total number +of bases observed, the number of errors observed, the overall error rate, and the rate of each kind of +substitution error. + +Substitution types are collapsed based on the reference or expected base, with only six substitution +types being reported: `A>C`, `A>G`, `A>T`, `C>A`, `C>G` and `C>T`. For example, `T>G` is grouped in +with `A>C`. + +Analysis can be restricted to a set of intervals via the `--intervals` option. Genomic positions can be +excluded from analysis by supplying a set of variants (either known variants in the sample or a catalog +of known variants such as dbSNP). 
For data believed to have low error rates it is recommended to use +both the `--intervals` and `--variants` options to restrict analysis to only regions expected to be +homozygous reference in the data. + +The following reads / bases are excluded from the analysis: + +- Unmapped reads +- Reads marked as failing vendor quality +- Reads marked as duplicates (unless `--include-duplicates` is specified) +- Secondary and supplementary records +- Soft-clipped bases in records +- Reads with MAPQ < `--min-mapping-quality` (default: 20) +- Bases with base quality < `--min-base-quality` (default: 0) +- Bases where either the read base or the reference base is non-ACGT + +An output text file is generated with the extension `.error_rate_by_read_position.txt` + +If R's `Rscript` utility is on the path and `ggplot2` is installed in the R distribution then a PDF +of error rate plots will also be generated with extension `.error_rate_by_read_position.pdf`. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input BAM file.|Required|1|| +|output|o|PathPrefix|Output metrics prefix. 
If not given, will use the input BAM basename.|Optional|1|| +|ref|r|PathToFasta|Reference sequence fasta file.|Required|1|| +|variants|v|PathToVcf|Optional file of variant sites to ignore.|Optional|1|| +|intervals|l|PathToIntervals|Optional list of intervals to restrict analysis to.|Optional|1|| +|include-duplicates|d|Boolean|Include duplicate reads, otherwise ignore.|Optional|1|false| +|min-mapping-quality|m|Int|The minimum mapping quality for a read to be included.|Optional|1|20| +|min-base-quality|q|Int|The minimum base quality for a base to be included.|Optional|1|0| +|collapse||Boolean|Collapse substitution types based on the reference or expected base, with only six substitution types being reported: `A>C`, `A>G`, `A>T`, `C>A`, `C>G` and `C>T`. For example, `T>G` is grouped in with `A>C`. Otherwise, all possible substitution types will be reported.|Optional|1|true| + diff --git a/tools/2.3.0/EstimatePoolingFractions.md b/tools/2.3.0/EstimatePoolingFractions.md new file mode 100644 index 000000000..ea9da2af0 --- /dev/null +++ b/tools/2.3.0/EstimatePoolingFractions.md @@ -0,0 +1,43 @@ +--- +title: EstimatePoolingFractions +--- + +# EstimatePoolingFractions + +## Overview +**Group:** SAM/BAM + +Examines sequence data generated from a pooled sample and estimates the fraction of sequence data +coming from each constituent sample. Uses a VCF of known genotypes for the samples within the +mixture along with a BAM of sequencing data derived from the pool. Performs a multiple regression +for the alternative allele fractions at each SNP locus, using as inputs the individual sample's genotypes. +Only SNPs that are bi-allelic within the pooled samples are used. + +Each sample's contribution of REF vs. ALT alleles at each site is derived in one of two ways: (1) if +the sample's genotype in the VCF has the `AF` attribute then the value from that field will be used, (2) if the +genotype has no `AF` attribute then the contribution is estimated based on the genotype (e.g. 
0/0 will be 100% +ref, 0/1 will be 50% ref and 50% alt, etc.). + +Various filtering parameters can be used to control which loci are used: + +- _--intervals_ will restrict analysis to variants within the described intervals +- _--min-genotype-quality_ will filter out any site with any genotype with GQ < n +- _--min-mean-sample-coverage_ requires that the coverage of a site in the BAM be >= `min-mean-sample-coverage * n_samples` +- _--min-mapping-quality_ filters out reads in the BAM with MQ < n +- _--min-base-quality_ filters out bases in the BAM with Q < n + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|vcf|v|PathToVcf|VCF of individual sample genotypes.|Required|1|| +|bam|b|PathToBam|Path to BAM file of sequencing data.|Required|1|| +|output|o|FilePath|Output file to write with pooling fractions.|Required|1|| +|intervals|l|PathToIntervals|Zero or more sets of regions to restrict analysis to.|Optional|Unlimited|| +|samples|s|String|Optional subset of samples from VCF to use.|Optional|Unlimited|| +|non-autosomes|n|String|Non-autosomal chromosomes to avoid.|Required|Unlimited|M, chrM, MT, X, chrX, Y, chrY| +|min-genotype-quality|g|Int|Minimum genotype quality. Use -1 to disable.|Optional|1|30| +|min-mean-sample-coverage|c|Int|Minimum (sequencing coverage @ SNP site / n_samples).|Optional|1|6| +|min-mapping-quality|m|Int|Minimum mapping quality.|Optional|1|20| +|min-base-quality|q|Int|Minimum base quality.|Optional|1|5| + diff --git a/tools/2.3.0/EstimateRnaSeqInsertSize.md b/tools/2.3.0/EstimateRnaSeqInsertSize.md new file mode 100644 index 000000000..7eed2a4cb --- /dev/null +++ b/tools/2.3.0/EstimateRnaSeqInsertSize.md @@ -0,0 +1,37 @@ +--- +title: EstimateRnaSeqInsertSize +--- + +# EstimateRnaSeqInsertSize + +## Overview +**Group:** RNA-Seq + +Computes the insert size for RNA-Seq experiments. 
+ +Computes the insert size by counting the # of bases sequenced in transcript space. The insert size is defined +as the distance between the first bases sequenced of each pair respectively (5' sequencing ends). + +This tool skips reads that overlap multiple genes, reads that aren't fully enclosed in a gene, and reads where the +insert size would disagree across transcripts from the same gene. Also skips reads that are unpaired, failed QC, +secondary, supplementary, pairs without both ends mapped, duplicates, and pairs whose reads map to different +chromosomes. Finally, skips transcripts where too few mapped read bases overlap exonic sequence. + +This tool requires each mapped pair to have the mate cigar (`MC`) tag. Use `SetMateInformation` to add the mate cigar. + +The output metric file will have the extension `.rna_seq_insert_size.txt` and the output histogram file will have +the extension `.rna_seq_insert_size_histogram.txt`. The histogram file gives for each orientation (`FR`, `RF`, `tandem`), +the number of read pairs that had the given insert size. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input BAM file.|Required|1|| +|ref-flat|r|FilePath|Input gene annotations in [RefFlat](http://genome.ucsc.edu/goldenPath/gbdDescriptionsOld.html#RefFlat) form|Required|1|| +|prefix|p|PathPrefix|Output prefix file. The file will have the extension `.rna_seq_insert_size.txt` if not given|Optional|1|| +|include-duplicates|d|Boolean|Include duplicates|Optional|1|false| +|deviations|D|Double|Generate mean and standard deviation by filtering to `median + deviations*median_absolute_deviation`. This is done because insert size data typically includes enough anomalous values from chimeras and other artifacts to make the mean and sd grossly misleading regarding the real distribution. 
"|Optional|1|10.0| +|minimum-mapping-quality|q|Int|Ignore reads with mapping quality less than this value.|Optional|1|30| +|minimum-overlap|m|Double|The minimum fraction of read bases that must overlap exonic sequence in a transcript|Optional|1|0.95| + diff --git a/tools/2.3.0/ExtractBasecallingParamsForPicard.md b/tools/2.3.0/ExtractBasecallingParamsForPicard.md new file mode 100644 index 000000000..e7319dc2b --- /dev/null +++ b/tools/2.3.0/ExtractBasecallingParamsForPicard.md @@ -0,0 +1,32 @@ +--- +title: ExtractBasecallingParamsForPicard +--- + +# ExtractBasecallingParamsForPicard + +## Overview +**Group:** Basecalling + +Extracts sample and library information from an sample sheet for a given lane. + +The sample sheet should be an Illumina Experiment Manager sample sheet. The tool writes two files to the output +directory: a barcode parameter file and a library parameter file. + +The barcode parameter file is used by Picard's `ExtractIlluminaBarcodes` and `CollectIlluminaBasecallingMetrics` to +determine how to match sample barcodes to each read. The parameter file will be written to the output directory +with name `barcode_params..txt`. + +The library parameter file is used by Picard's `IlluminaBasecallsToSam` to demultiplex samples and name the output +BAM file path for each sample output BAM file. The parameter file will be written to the output directory with name +`library_params..txt`. The path to each sample's BAM file will be specified in the library parameter +file. Each BAM file will have path `/...bam`. 
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|FilePath|The input sample sheet.|Required|1|| +|output|o|DirPath|The output folder to where per-lane parameter files should be written.|Required|1|| +|bam-output|b|DirPath|Optional output folder to where per-lane BAM files should be written, otherwise the output directory will be used.|Optional|1|| +|lanes|l|Int|The lane(s) (1-based) for which to write per-lane parameter files.|Required|Unlimited|| + diff --git a/tools/2.3.0/ExtractIlluminaRunInfo.md b/tools/2.3.0/ExtractIlluminaRunInfo.md new file mode 100644 index 000000000..0aa09f79f --- /dev/null +++ b/tools/2.3.0/ExtractIlluminaRunInfo.md @@ -0,0 +1,28 @@ +--- +title: ExtractIlluminaRunInfo +--- + +# ExtractIlluminaRunInfo + +## Overview +**Group:** Basecalling + +Extracts information about an Illumina sequencing run from the RunInfo.xml. + +The output file will contain a header column and a single column containing the following rows: + +1. `run_barcode:` the unique identifier for the sequencing run and flowcell, stored as `_`. +2. `flowcell_barcode:` the flowcell barcode. +3. `instrument_name`: the instrument name. +4. `run_date`: the date of the sequencing run. +5. `read_structure`: the description of the logical structure of cycles within the sequencing run, including which cycles + correspond to sample barcodes, molecular barcodes, template bases, and bases that should be skipped. +6. `number_of_lanes`: the number of lanes in the flowcell. 
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|FilePath|The input RunInfo.xml typically found in the run folder.|Required|1|| +|output|o|FilePath|The output file.|Required|1|| + diff --git a/tools/2.3.0/ExtractUmisFromBam.md b/tools/2.3.0/ExtractUmisFromBam.md new file mode 100644 index 000000000..1932082f3 --- /dev/null +++ b/tools/2.3.0/ExtractUmisFromBam.md @@ -0,0 +1,60 @@ +--- +title: ExtractUmisFromBam +--- + +# ExtractUmisFromBam + +## Overview +**Group:** SAM/BAM + +Extracts unique molecular indexes from reads in a BAM file into tags. + +Currently only unmapped reads are supported. + +Only template bases will be retained as read bases (stored in the `SEQ` field) as specified by the read structure. + +A read structure should be provided for each read of a template. For example, paired end reads should have two +read structures specified. The tags to store the molecular indices will be associated with the molecular index +segment(s) in the read structure based on the order specified. If only one molecular index tag is given, then the +molecular indices will be concatenated and stored in that tag. Otherwise the number of molecular indices in the +read structure should match the number of tags given. In the resulting BAM file each end of a pair will contain +the same molecular index tags and values. Additionally, when multiple molecular indices are present the +`--single-tag` option may be used to write all indices, concatenated, to a single tag in addition to the tags +specified in `--molecular-index-tags`. + +Optionally, the read names can be annotated with the molecular indices directly. In this case, the read name +will be formatted `+` where `` is the concatenation of read one's molecular indices. +Similarly for ``. 
+ +Mapping information will not be adjusted, as such, this tool should not be used on reads that have been mapped since +it will lead to a BAM with inconsistent records. + +The read structure describes the structure of a given read as one or more read segments. A read segment describes +a contiguous stretch of bases of the same type (ex. template bases) of some length and some offset from the start +of the read. Read structures are made up of `` pairs much like the CIGAR string in BAM files. +Four kinds of operators are recognized: + +1. `T` identifies a template read +2. `B` identifies a sample barcode read +3. `M` identifies a unique molecular index read +4. `S` identifies a set of bases that should be skipped or ignored + +The last `` pair may be specified using a '+' sign instead of a number to denote "all remaining +bases". This is useful if, e.g., fastqs have been trimmed and contain reads of varying length. + +An example would be `10B3M7S100T` which describes 120 bases, with the first ten bases being a sample barcode, +bases 11-13 being a molecular index, bases 14-20 ignored, and bases 21-120 being template bases. See +[Read Structures](https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures) for more information. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input BAM file.|Required|1|| +|output|o|PathToBam|Output BAM file.|Required|1|| +|read-structure|r|ReadStructure|The read structure, one per read in a template.|Required|Unlimited|| +|molecular-index-tags|t|String|SAM tag(s) in which to store the molecular indices.|Optional|Unlimited|| +|single-tag|s|String|Single tag into which to concatenate all molecular indices.|Optional|1|| +|annotate-read-names|a|Boolean|Annotate the read names with the molecular indices. 
See usage for more details.|Optional|1|false| +|clipping-attribute|c|String|The SAM tag with the position in read to clip adapters (e.g. `XT` as produced by Picard's `MarkIlluminaAdapters`).|Optional|1|| + diff --git a/tools/2.3.0/FastqToBam.md b/tools/2.3.0/FastqToBam.md new file mode 100644 index 000000000..a343c96bf --- /dev/null +++ b/tools/2.3.0/FastqToBam.md @@ -0,0 +1,78 @@ +--- +title: FastqToBam +--- + +# FastqToBam + +## Overview +**Group:** FASTQ + +Generates an unmapped BAM (or SAM or CRAM) file from fastq files. Takes in one or more fastq files (optionally +gzipped), each representing a different sequencing read (e.g. R1, R2, I1 or I2) and can use a set of read +structures to allocate bases in those reads to template reads, sample indices, unique molecular indices, or to +designate bases to be skipped over. + +Read structures are made up of `` pairs much like the CIGAR string in BAM files. Four kinds of +operators are recognized: + +1. `T` identifies a template read +2. `B` identifies a sample barcode read +3. `M` identifies a unique molecular index read +4. `S` identifies a set of bases that should be skipped or ignored + +The last `` pair may be specified using a `+` sign instead of a number to denote "all remaining +bases". This is useful if, e.g., fastqs have been trimmed and contain reads of varying length. For example +to convert a paired-end run with an index read and where the first 5 bases of R1 are a UMI and the second +five bases are monotemplate you might specify: + +``` +--input r1.fq r2.fq i1.fq --read-structures 5M5S+T +T +B +``` + +Alternatively, if you know your reads are of fixed length, you could specify: + +``` +--input r1.fq r2.fq i1.fq --read-structures 5M5S65T 75T 8B +``` + +For more information on read structures see the +[Read Structure Wiki Page](https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures) + +UMIs may be extracted from the read sequences, the read names, or both. 
If `--extract-umis-from-read-names` is +specified, any UMIs present in the read names are extracted; read names are expected to be `:`-separated with +any UMIs present in the 8th field. If this option is specified, the `--umi-qual-tag` option may not be used as +qualities are not available for UMIs in the read name. If UMI segments are present in the read structures those +will also be extracted. If UMIs are present in both, the final UMIs are constructed by first taking the UMIs +from the read names, then adding a hyphen, then the UMIs extracted from the reads. + +The same number of input files and read structures must be provided, with one exception: if supplying exactly +1 or 2 fastq files, both of which are solely template reads, no read structures need be provided. + +The output file can be sorted by queryname using the `--sort` option; the default is to produce a BAM +with reads in the same order as they appear in the fastq file. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToFastq|Fastq files corresponding to each sequencing read (e.g. 
R1, I1, etc.).|Required|Unlimited|| +|output|o|PathToBam|The output SAM or BAM file to be written.|Required|1|| +|read-structures|r|ReadStructure|Read structures, one for each of the FASTQs.|Optional|Unlimited|| +|sort|s|Boolean|If true, queryname sort the BAM file, otherwise preserve input order.|Optional|1|false| +|umi-tag|u|String|Tag in which to store molecular barcodes/UMIs.|Optional|1|RX| +|umi-qual-tag|q|String|Tag in which to store molecular barcode/UMI qualities.|Optional|1|| +|store-sample-barcode-qualities|Q|Boolean|Store the sample barcode qualities in the QT Tag.|Optional|1|false| +|extract-umis-from-read-names|n|Boolean|Extract UMI(s) from read names and prepend to UMIs from reads.|Optional|1|false| +|read-group-id||String|Read group ID to use in the file header.|Optional|1|A| +|sample||String|The name of the sequenced sample.|Required|1|| +|library||String|The name/ID of the sequenced library.|Required|1|| +|barcode|b|String|Library or Sample barcode sequence.|Optional|1|| +|platform||String|Sequencing Platform.|Optional|1|illumina| +|platform-unit||String|Platform unit (e.g. '..')|Optional|1|| +|platform-model||String|Platform model to insert into the group header (ex. miseq, hiseq2500, hiseqX)|Optional|1|| +|sequencing-center||String|The sequencing center from which the data originated|Optional|1|| +|predicted-insert-size||Integer|Predicted median insert size, to insert into the read group header|Optional|1|| +|description||String|Description of the read group.|Optional|1|| +|comment||String|Comment(s) to include in the output file's header.|Optional|Unlimited|| +|run-date||Iso8601Date|Date the run was produced, to insert into the read group header|Optional|1|| + diff --git a/tools/2.3.0/FilterBam.md b/tools/2.3.0/FilterBam.md new file mode 100644 index 000000000..8b9a98808 --- /dev/null +++ b/tools/2.3.0/FilterBam.md @@ -0,0 +1,36 @@ +--- +title: FilterBam +--- + +# FilterBam + +## Overview +**Group:** SAM/BAM + +Filters reads out of a BAM file. 
Removes reads that may not be useful in downstream processing or +visualization. By default will remove unmapped reads, reads with MAPQ=0, reads +marked as secondary alignments, reads marked as duplicates, and if a set of Intervals are provided, +reads that do not overlap any of the intervals. + +If `--min-insert-size` or `--min-mapped-bases` is specified, unmapped reads will also be removed +even if `--remove-unmapped-reads` is false. + +NOTE: this will usually produce a BAM file in which some mate-pairs are orphaned (i.e. read 1 or +read 2 is included, but not both), but does not update any flag fields. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input BAM file.|Required|1|| +|output|o|PathToBam|Output BAM file.|Required|1|| +|intervals|l|PathToIntervals|Optionally remove reads not overlapping intervals.|Optional|1|| +|remove-duplicates|D|Boolean|If true remove all reads that are marked as duplicates.|Optional|1|true| +|remove-unmapped-reads|U|Boolean|Remove all unmapped reads.|Optional|1|true| +|min-map-q|M|Int|Remove all mapped reads with MAPQ lower than this number.|Optional|1|1| +|remove-single-end-mappings|P|Boolean|Removes non-PE reads and any read whose mate pair is unmapped.|Optional|1|false| +|remove-secondary-alignments|S|Boolean|Remove all reads marked as secondary alignments.|Optional|1|true| +|min-insert-size||Int|Remove all reads with insert size < this value.|Optional|1|| +|max-insert-size||Int|Remove all reads with insert size > this value.|Optional|1|| +|min-mapped-bases|m|Int|Remove reads with fewer than this many mapped bases.|Optional|1|| + diff --git a/tools/2.3.0/FilterConsensusReads.md b/tools/2.3.0/FilterConsensusReads.md new file mode 100644 index 000000000..320c1948b --- /dev/null +++ b/tools/2.3.0/FilterConsensusReads.md @@ -0,0 +1,82 @@ +--- +title: FilterConsensusReads +--- + +# 
FilterConsensusReads + +## Overview +**Group:** Unique Molecular Identifiers (UMIs) + +Filters consensus reads generated by _CallMolecularConsensusReads_ or _CallDuplexConsensusReads_. +Two kinds of filtering are performed: + + 1. Masking/filtering of individual bases in reads + 2. Filtering out of reads (i.e. not writing them to the output file) + +Base-level filtering/masking is only applied if per-base tags are present (see _CallDuplexConsensusReads_ and +_CallMolecularConsensusReads_ for descriptions of these tags). Read-level filtering is always applied. When +filtering reads, secondary alignments and supplementary records may be removed independently if they fail +one or more filters; if either R1 or R2 primary alignments fail a filter then all records for the template +will be filtered out. + +The filters applied are as follows: + + 1. Reads with fewer than min-reads contributing reads are filtered out + 2. Reads with an average consensus error rate higher than max-read-error-rate are filtered out + 3. Reads with mean base quality of the consensus read, prior to any masking, less than min-mean-base-quality + are filtered out (if specified) + 4. Bases with quality scores below min-base-quality are masked to Ns + 5. Bases with fewer than min-reads contributing raw reads are masked to Ns + 6. Bases with a consensus error rate (defined as the fraction of contributing reads that voted for a different + base than the consensus call) higher than max-base-error-rate are masked to Ns + 7. For duplex reads, if require-single-strand-agreement is provided, masks to Ns any bases where the base was + observed in both single-strand consensus reads and the two reads did not agree + 8. 
Reads with a proportion of Ns higher than max-no-call-fraction *after* per-base filtering are filtered out + +When filtering _single-umi consensus_ reads generated by _CallMolecularConsensusReads_ a single value each +should be supplied for `--min-reads`, `--max-read-error-rate`, and `--max-base-error-rate`. + +When filtering duplex consensus reads generated by _CallDuplexConsensusReads_ each of the three parameters +may independently take 1-3 values. For example: + +``` +FilterConsensusReads ... --min-reads 10 5 3 --max-base-error-rate 0.1 +``` + +In each case if fewer than three values are supplied, the last value is repeated (i.e. `80 40` -> `80 40 40` +and `0.1` -> `0.1 0.1 0.1`). The first value applies to the final consensus read, the second value to one +single-strand consensus, and the last value to the other single-strand consensus. It is required that if +values two and three differ, the _more stringent value comes earlier_. + +In order to correctly filter reads in or out by template, the input BAM must be either `queryname` sorted or +`query` grouped. If your BAM is not already in an appropriate order, this can be done in streaming fashion with: + +``` +samtools sort -n -u in.bam | fgbio FilterConsensusReads -i /dev/stdin ... +``` + +The output sort order may be specified with `--sort-order`. If not given, then the output will be in the same +order as input. + +The `--reverse-per-base-tags` option controls whether per-base tags should be reversed before being used on reads +marked as being mapped to the negative strand. This is necessary if the reads have been mapped and the +bases/quals reversed but the consensus tags have not. If true, the tags written to the output BAM will be +reversed where necessary in order to line up with the bases and quals. 
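
The "last value is repeated" rule for the 1-3 valued duplex parameters can be illustrated with a small Python sketch. The helper name `expand_thresholds` and its `higher_is_stricter` argument are hypothetical and are not part of fgbio; the sketch only mirrors the expansion and ordering rules described above.

```python
def expand_thresholds(values, higher_is_stricter):
    """Expand 1-3 threshold values to exactly three (illustrative sketch).

    The result is ordered as (final consensus, one single-strand consensus,
    the other single-strand consensus). `higher_is_stricter` is an assumed
    helper flag: True for a minimum such as --min-reads, False for a
    maximum such as --max-base-error-rate.
    """
    vals = list(values)
    if not 1 <= len(vals) <= 3:
        raise ValueError("expected between one and three values")
    # Repeat the last supplied value until three values are present.
    vals += [vals[-1]] * (3 - len(vals))
    _, second, third = vals
    # If values two and three differ, the more stringent must come earlier.
    ordered = second >= third if higher_is_stricter else second <= third
    if not ordered:
        raise ValueError("the more stringent single-strand value must come first")
    return tuple(vals)


if __name__ == "__main__":
    print(expand_thresholds([80, 40], higher_is_stricter=True))
    print(expand_thresholds([0.1], higher_is_stricter=False))
```

For `--min-reads 10 5 3` this leaves the three values unchanged, while a single `--max-base-error-rate 0.1` is applied to the final consensus and to both single-strand consensuses.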
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|The input SAM or BAM file of consensus reads.|Required|1|| +|output|o|PathToBam|Output SAM or BAM file.|Required|1|| +|ref|r|PathToFasta|Reference fasta file.|Required|1|| +|reverse-per-base-tags|R|Boolean|Reverse [complement] per base tags on reverse strand reads.|Optional|1|false| +|min-reads|M|Int|The minimum number of reads supporting a consensus base/read.|Required|3|| +|max-read-error-rate|E|Double|The maximum raw-read error rate across the entire consensus read.|Required|3|0.025| +|max-base-error-rate|e|Double|The maximum error rate for a single consensus base.|Required|3|0.1| +|min-base-quality|N|PhredScore|Mask (make `N`) consensus bases with quality less than this threshold.|Required|1|| +|max-no-call-fraction|n|Double|Maximum fraction of no-calls in the read after filtering.|Optional|1|0.2| +|min-mean-base-quality|q|PhredScore|The minimum mean base quality across the consensus read.|Optional|1|| +|require-single-strand-agreement|s|Boolean|Mask (make `N`) consensus bases where the AB and BA consensus reads disagree (for duplex-sequencing only).|Optional|1|false| +|sort-order|S|SamOrder|The sort order of the output. If not given, output will be in the same order as input if the input is query name sorted or query grouped, otherwise queryname order.|Optional|1|| + diff --git a/tools/2.3.0/FilterSomaticVcf.md b/tools/2.3.0/FilterSomaticVcf.md new file mode 100644 index 000000000..2caac6b41 --- /dev/null +++ b/tools/2.3.0/FilterSomaticVcf.md @@ -0,0 +1,108 @@ +--- +title: FilterSomaticVcf +--- + +# FilterSomaticVcf + +## Overview +**Group:** VCF/BCF + +Applies one or more filters to a VCF of somatic variants. The VCF must contain genotype information for the +tumor sample. 
If the VCF also contains genotypes for one or more other samples, the `--sample` option must be +provided to specify the sample whose genotypes to examine and whose reads are present in the BAM file. + +Various options are available for filtering the reads coming from the BAM file, including +`--min-mapping-quality`, `--min-base-quality` and `--paired-reads-only`. The latter filters to only paired end +reads where both reads are mapped. Reads marked as duplicates, secondary alignments and supplementary alignments +are all filtered out. + +Each available filter may generate annotations in the `INFO` field of the output VCF and optionally, if a +threshold is specified, may apply one or more `FILTER`s to applicable variants. + +In previous versions of this tool, the only available filter was specific to A-base addition artifacts and was +referred to as the 'End Repair Artifact Filter.' This filter has been renamed to 'A-tailing Artifact Filter', but +its functionality is unchanged. The filter's associated command-line parameters, `INFO` field key, and `FILTER` +tag have also been renamed accordingly, as described below. + +## Available Filters + +### A-tailing Artifact Filter (previously 'End Repair Artifact Filter') + +The A-tailing artifact filter attempts to measure the probability that a single-nucleotide mismatch is the +product of errors in the template generated during the A-base addition steps that are common to many Illumina +library preparation protocols. The artifacts occur if/when a recessed 3' end is incorrectly filled in with one +or more adenines during A-base addition. Incorrect adenine incorporation presents specifically as errors to T at +the beginning of reads (and in very short templates, as matching errors to A at the ends of reads). + +The filter adds the `INFO` field `ATAP` (previously `ERAP`) to SNVs with an A or T alternate allele. 
This field +records the p-value representing the probability of the null hypothesis that the variant is a true mutation, so +lower p-values indicate that the variant is more likely an A-tailing artifact. If a threshold p-value is +specified, the `FILTER` tag `ATailingArtifact` (previously `EndRepairArtifact`) will be applied to variants with +p-values less than or equal to the threshold. + +Two options are available: + +* `--a-tailing-distance` (previously `--end-repair-distance`) allows control over how close to the ends of + reads/templates errors can be considered to be candidates for the A-tailing artifact. + Higher values decrease the power of the test, so this should be set as low as possible + given observed errors. +* `--a-tailing-p-value` (previously `--end-repair-p-value`) the p-value at or below which a filter should be + applied. If no value is supplied only the `INFO` annotation is produced and no `FILTER` + is applied. + +### End Repair Fill-in Artifact Filter + +The end repair fill-in artifact filter attempts to measure the probability that a single-nucleotide mismatch is +the product of an error in the template generated during the end repair fill-in step that is common to many +Illumina library preparation protocols, in which single-stranded 3' overhangs are filled in to create a blunt +end. These artifacts originate from single-stranded templates containing damaged bases, often as a consequence +of oxidative damage. These DNA lesions, for example 8-oxoguanine, undergo mismatched pairing, which after PCR +appear as mutations at the ends of reads. + +The filter adds the `INFO` field `ERFAP` to SNV records. This field records the p-value representing the +probability of the null hypothesis (i.e. that the variant is a true mutation), so lower p-values indicate that +the variant is more likely an end repair fill-in artifact. 
If a threshold p-value is specified, then the `FILTER` +tag `EndRepairFillInArtifact` will be applied to variants with p-values less than or equal to the threshold. + +Two options are available: + +* `--end-repair-fill-in-distance` allows control over how close to the ends of reads/templates errors can be + considered to be candidates for the artifact. Higher values decrease the + power of the test, so this should be set as low as possible given observed + errors. +* `--end-repair-fill-in-p-value` the p-value at or below which a filter should be applied. If no value is supplied + only the annotation is produced and no filtering is performed. + +## Performance Expectations + +By default `--access-pattern` will be set to `RandomAccess` and the input BAM will be queried using index-based +random access. Random access is mandatory if the input VCF is not coordinate sorted. If random access is not +requested and the input VCF is not coordinate sorted, then an exception will be raised on the first +non-coordinate increasing VCF record found. The BAM must be coordinate sorted in all cases and additionally be +indexed if random access is requested. + +Often, a VCF file will contain a sparse set of records that are scattered across a given territory within a +genome (or the records will be sparsely scattered genome-wide). If the territory of the VCF records is markedly +smaller than the territory of all aligned SAM records in the BAM file, then random access may be the most +efficient BAM access pattern. However, there are cases where random access will be less efficient such as when +the VCF is coordinate sorted and the variant call records are very densely packed across a similar territory as +compared to all aligned SAM records. Such a case is common in deeply sequenced hybrid selection NGS experiments +and setting `--access-pattern` to `Streaming` will often be the most efficient BAM access pattern.
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToVcf|Input VCF of somatic variant calls.|Required|1|| +|output|o|PathToVcf|Output VCF of filtered somatic variants.|Required|1|| +|bam|b|PathToBam|BAM file for the tumor sample.|Required|1|| +|sample|s|String|Sample name in VCF if `> 1` sample present.|Optional|1|| +|min-mapping-quality|m|Int|Minimum mapping quality for reads.|Optional|1|20| +|min-base-quality|q|Int|Minimum base quality.|Optional|1|20| +|paired-reads-only|p|Boolean|Use only paired reads mapped in pairs.|Optional|1|false| +|access-pattern|A|BamAccessPattern|The type of BAM access pattern to use.|Optional|1|RandomAccess| +|a-tailing-distance||Int|Distance from 5-prime end of read to implicate A-base addition artifacts. Set to :none: to deactivate the filter.|Optional|1|2| +|a-tailing-p-value||Double|Minimum acceptable p-value for the A-base addition artifact test.|Optional|1|| +|end-repair-fill-in-distance||Int|Distance from 5-prime end of read to implicate end repair fill-in artifacts. Set to :none: to deactivate the filter.|Optional|1|15| +|end-repair-fill-in-p-value||Double|Minimum acceptable p-value for the end repair fill-in artifact test.|Optional|1|| + diff --git a/tools/2.3.0/FindSwitchbackReads.md b/tools/2.3.0/FindSwitchbackReads.md new file mode 100644 index 000000000..e3ca27d02 --- /dev/null +++ b/tools/2.3.0/FindSwitchbackReads.md @@ -0,0 +1,89 @@ +--- +title: FindSwitchbackReads +--- + +# FindSwitchbackReads + +## Overview +**Group:** SAM/BAM + +Finds reads where a template switch occurred during library construction. 
Some library construction methods, +notably ultra-low-input shotgun methods, are prone to template switching events that create molecules +(templates, inserts) that, rather than being a linear copy of a segment of genomic DNA, are chimeras formed +by starting on one strand of DNA and then switching to the opposite strand. Frequently when the switch occurs +there may be a small offset between where the first strand was departed and the opposite strand joined. + +## Algorithm + +Templates that contain strand switch events (switch-backs) are found by this tool in two different ways: + +1. By looking at reads that contain soft-clipped bases at their 5' end that, when reverse complemented, match the + genome proximal to the 5'-most mapped base of the read. We call these matches + "read based switchbacks". Finding read based switchbacks is based on several parameters: + + 1. `max-offset` controls how far away to search for the reverse-complemented sequence. The default value of + `35` allows matches to be found when the soft-clipped sequence matches the genome _starting_ at most 35bp + from the 5' mapped position of the read, and reading in the opposite direction. + 2. `min-length` controls the minimum number of soft-clipped bases that must exist to trigger the search. + Given that the search looks at `2 * max-offset` locations (default=70) it is important that `min-length` + is set such that `4^min-length >> 2 * max-offset` in order to avoid false positives. + 3. `max-error-rate` allows for some mismatches to exist between the soft-clipped sequence and the genome when matching. + +2. By identifying templates with `FF` or `RR` (aka tandem) orientation where it is surmised that the template + switch occurred in the un-sequenced region of the template between R1 and R2. We call these `tandem based + switchbacks`.
This is controlled by a single parameter, `max-gap`, which causes the tool to only identify a + tandem read pair as a switch-back _if_ the gap between the end of the first read and the start of the second + read is within `+/- max-gap`. + +By default, when a switch-back template is identified, the primary reads are made unmapped (and the original +alignment stored in the OA tag) and all secondary and supplementary alignments are discarded. This can be +disabled with the `--dont-unmap` or `-d` option. + +All reads from a switch-back template are also tagged with an `sb` tag that describes the nature of the +switchback. If the template was identified based on soft-clipped sequence within a read the format is: + +```sb:Z:r,[read|mate],{offset},{length}``` + +If the template is identified due to its tandem pair orientation then the format is: + +```sb:Z:t,{gap}``` + +## Inputs and Outputs + +The tool takes as input a SAM or BAM file, and by default consumes from `stdin`. The primary output is also a +SAM or BAM file, and defaults to compressed BAM on `stdout`. This allows the tool to be run immediately after +an aligner in a pipe, e.g. `bwa mem ref.fa r1.fq r2.fq | fgbio -Xmx8g FindSwitchbackReads --ref=ref.fa | ...`. + +If the input is neither `queryname` sorted nor `queryname` grouped (i.e. all reads with the same name grouped +together) it will be sorted into `queryname` order by the tool before processing. + +By default the output BAM is produced in the order the reads were processed (i.e. the input ordering _or_ +queryname sorted if sorting was required). This can be overridden with the `--sort-order` option. + +A number of text files are also produced if the `--metrics` option is specified. E.g. when specifying +`--metrics=s1.switchback` the following files are produced: + +1.
`s1.switchback.lengths.txt`: A table of the distribution of observed switchback lengths in read-based switchbacks. +3. `s1.switchback.offsets.txt`: A table of the distribution of observed offsets in read-based switchbacks. +4. `s1.switchback.gaps.txt`: A table of the distribution of gap lengths in tandem-based switchbacks. +5. `s1.switchback.plots.pdf`: A PDF containing plots of the distributions from 2-4. + +Note: because this tool accesses the reference genome in a random manner it pre-loads the entire reference fasta +into memory. As a result the tool is best run with `-Xmx8g` to give it sufficient memory. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input BAM file.|Optional|1|/dev/stdin| +|output|o|PathToBam|Output BAM file.|Optional|1|/dev/stdout| +|metrics|m|PathPrefix|Metrics output file.|Optional|1|| +|sort-order|s|SamOrder|Output sort order.|Optional|1|| +|ref|r|PathToFasta|Reference genome fasta file.|Required|1|| +|max-offset|O|Int|Maximum offset between the ends of the two segments of the read on the reference. Set to 0 to disable read-based checks.|Optional|1|35| +|max-gap|g|Int|Maximum gap between R1 and R2 of tandem reads to call a template a switchback.
Set to 0 to disable tandem-based checks.|Optional|1|500| +|min-length|l|Int|Minimum match length of the switched back segment.|Optional|1|6| +|max-error-rate|e|Double|Maximum mismatch error rate of switchback match to genome.|Optional|1|0.1| +|dont-unmap|d|Boolean|If true, do NOT unmap reads from switchback templates.|Optional|1|false| + diff --git a/tools/2.3.0/FindTechnicalReads.md b/tools/2.3.0/FindTechnicalReads.md new file mode 100644 index 000000000..46f9a36cf --- /dev/null +++ b/tools/2.3.0/FindTechnicalReads.md @@ -0,0 +1,41 @@ +--- +title: FindTechnicalReads +--- + +# FindTechnicalReads + +## Overview +**Group:** SAM/BAM + +Find reads that are from technical or synthetic sequences in a BAM file. Takes in +a BAM file, extracts the read pairs and fragment reads that are unmapped, and tests +them to see if they are likely generated from a technical sequence (e.g. adapter +dimer). + +The identification of reads is done by testing the first N bases (controlled by the +`--match-length` parameter) of each read against all sub-sequences of length N from the +technical sequences. Sub-sequences are generated from both the sequences and the +reverse complement of the sequences, ignoring any sub-sequences that include `N`s. + +By default the output BAM file will contain all reads that matched to a sub-sequence of the +technical sequences and, if the read is paired, the read's mate. An option is +available to apply a tag to matched reads (`--tag`/`-t`), and if specified each matching +read will be tagged with the 0-based index of the sequence to which it matched. In +combination with tagging it is possible to output all reads (`-a`/`--all-reads`) which will +re-create the input BAM with the addition of tags on matching reads. + +The default set of sequences includes a range of different Illumina adapter sequences +with the sample index/barcode region masked to Ns.
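
The matching rule described above can be illustrated with a short Python sketch. This is not fgbio's implementation (the tool is not written in Python); the function names `build_kmers` and `match_read` are hypothetical, and the sketch assumes that errors are counted as simple mismatches and that a read is assigned to the first technical sequence with a sub-sequence within `max_errors`.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence; N stays N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq.upper()))


def build_kmers(sequences, match_length):
    """Map each N-free sub-sequence of length match_length (taken from each
    sequence and its reverse complement) to the 0-based index of its source."""
    kmers = {}
    for idx, seq in enumerate(sequences):
        for strand in (seq.upper(), reverse_complement(seq)):
            for i in range(len(strand) - match_length + 1):
                kmer = strand[i:i + match_length]
                if "N" not in kmer:
                    # Keep the first sequence index seen for a given kmer.
                    kmers.setdefault(kmer, idx)
    return kmers


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def match_read(read_bases, kmers, match_length, max_errors):
    """Return the index of a matching technical sequence, or None.
    Only the first match_length bases of the read are tested."""
    prefix = read_bases[:match_length].upper()
    if len(prefix) < match_length:
        return None
    for kmer, idx in kmers.items():
        if mismatches(prefix, kmer) <= max_errors:
            return idx
    return None
```

The returned index corresponds to the 0-based value that the text says is written when a tag is requested. The linear scan over all sub-sequences is chosen for readability only.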
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input SAM or BAM file|Required|1|| +|output|o|PathToBam|Output SAM or BAM file|Required|1|| +|match-length|m|Int|The number of bases at the start of the read to match against.|Optional|1|15| +|max-errors|e|Int|The maximum number of errors in the matched region.|Optional|1|1| +|sequences|s|String|The set of technical sequences to look for.|Required|Unlimited|AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACGCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAG, CTGTCTCTTATACACATCTCTGAGCGGGCTGGCAAGGCAGACCGNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACNNNNNNNNTCGTCGGCAGCGTCAGATGTGTATAAGAGACAG, CTGTCTCTTATACACATCTCCGAGCCCACGAGACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACNNNNNNNNACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG| +|all-reads|a|Boolean|Output all reads.|Optional|1|false| +|tag|t|String|Tag to set to indicate a read is a technical sequence.|Optional|1|| + diff --git a/tools/2.3.0/FixVcfPhaseSet.md b/tools/2.3.0/FixVcfPhaseSet.md new file mode 100644 index 000000000..99be0137f --- /dev/null +++ b/tools/2.3.0/FixVcfPhaseSet.md @@ -0,0 +1,41 @@ +--- +title: FixVcfPhaseSet +--- + +# FixVcfPhaseSet + +## Overview +**Group:** VCF/BCF + +Adds/fixes the phase set (PS) genotype field. + +The VCF specification allows phased genotypes to be annotated with the `PS` (phase set) `FORMAT` field. 
The value +should be a non-negative integer, corresponding to the position of the first variant in the phase set. Some tools +will output a non-integer value, as well as describe this field as having non-Integer type in the VCF header. This +tool will update the phase set (`PS`) `FORMAT` field to be VCF spec-compliant. + +The handling of unphased genotypes with phase-sets is controlled by the `-x` option: +- If `-x` is used the genotype will be converted to a phased genotype (e.g. `0/1` => `0|1`) +- Otherwise the phase-set (`PS`) will be removed from the genotype + +The `--keep-original` option may be used to store the original `PS` value in a new `OPS` field. The type +described in the header will match the original. + +This tool cannot fix phased variants without a phase set, or phased variant sets that have different phase set +values. + +In some cases, VCFs (e.g. from GIAB/NIST or Platinum Genomes) have illegal header lines, for example, a `PEDIGREE` +header line without an `ID` key-value field. The `-z` option can be used to remove those lines. This option is +included in this tool for convenience as those example VCFs in some cases have these illegal header lines, and it +is convenient to fix the phase set in addition to removing those illegal header lines.
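
The `-x` rule for unphased genotypes that carry a phase set can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not the tool's code; the function name `fix_genotype` and the representation of a genotype as a plain string are assumptions made for the example.

```python
from typing import Optional, Tuple


def fix_genotype(gt: str, ps: Optional[str], phase_with_ps: bool) -> Tuple[str, Optional[str]]:
    """Apply the unphased-genotype rule and return (genotype, phase set).

    gt            -- genotype string such as "0/1" (unphased) or "0|1" (phased)
    ps            -- phase set value, or None when the genotype has no PS
    phase_with_ps -- True models running with -x
    """
    is_unphased = "/" in gt
    if ps is None or not is_unphased:
        # Nothing to reconcile: either no phase set or already phased.
        return gt, ps
    if phase_with_ps:
        # With -x: keep the phase set and mark the genotype as phased.
        return gt.replace("/", "|"), ps
    # Without -x: keep the genotype unphased and drop the phase set.
    return gt, None
```

For example, `fix_genotype("0/1", "100", True)` yields a phased genotype that keeps its phase set, while the same call with `False` keeps `0/1` and drops the phase set.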
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToVcf|Input VCF.|Required|1|| +|output|o|PathToVcf|Output VCF.|Required|1|| +|keep-original|k|Boolean|Store the original phase set in the `OPS` field.|Optional|1|false| +|phase-genotypes-with-phase-set|x|Boolean|Set unphased genotypes with a PS FORMAT value to be phased.|Optional|1|false| +|remove-no-id-header-lines|z|Boolean|Remove header lines that do not contain an ID key-value, which is required in VCF.|Optional|1|false| + diff --git a/tools/2.3.0/GroupReadsByUmi.md b/tools/2.3.0/GroupReadsByUmi.md new file mode 100644 index 000000000..ef41b1182 --- /dev/null +++ b/tools/2.3.0/GroupReadsByUmi.md @@ -0,0 +1,99 @@ +--- +title: GroupReadsByUmi +--- + +# GroupReadsByUmi + +## Overview +**Group:** Unique Molecular Identifiers (UMIs) + +Groups reads together that appear to have come from the same original molecule. Reads +are grouped by template, and then templates are sorted by the 5' mapping positions of +the reads from the template, used from earliest mapping position to latest. Reads that +have the same end positions are then sub-grouped by UMI sequence. + +Accepts reads in any order (including `unsorted`) and outputs reads sorted by: + + 1. The lower genome coordinate of the two outer ends of the templates + 2. The sequencing library + 3. The assigned UMI tag + 4. Read Name + +If the input is not template-coordinate sorted (i.e. `SO:unsorted GO:query SS:unsorted:template-coordinate`), then +this tool will re-sort the input. The output will be written in template-coordinate order. + +During grouping, reads and templates are filtered out as follows: + +1. Templates are filtered if all reads for the template are unmapped +2. Templates are filtered if any non-secondary, non-supplementary read has mapping quality < `min-map-q` +3. 
Templates are filtered if R1 and R2 are mapped to different chromosomes and `--allow-inter-contig` is false +4. Templates are filtered if any UMI sequence contains one or more `N` bases +5. Templates are filtered if `--min-umi-length` is specified and the UMI does not meet the length requirement +6. Reads are filtered out if flagged as secondary and `--include-secondary` is false +7. Reads are filtered out if flagged as supplementary and `--include-supplementary` is false + +Grouping of UMIs is performed by one of four strategies: + +1. **identity**: only reads with identical UMI sequences are grouped together. This strategy + may be useful for evaluating data, but should generally be avoided as it will + generate multiple UMI groups per original molecule in the presence of errors. +2. **edit**: reads are clustered into groups such that each read within a group has at least + one other read in the group with at most `edits` differences and there are no inter-group + pairings with at most `edits` differences. Effective when there are small numbers of + reads per UMI, but breaks down at very high coverage of UMIs. +3. **adjacency**: a version of the directed adjacency method described in [umi_tools](http://dx.doi.org/10.1101/051755) + that allows for errors between UMIs but only when there is a count gradient. +4. **paired**: similar to adjacency but for methods that produce templates such that a read with A-B is related + to but not identical to a read with B-A. Expects the UMI sequences to be stored in a single SAM + tag separated by a hyphen (e.g. `ACGT-CCGG`) and allows for one of the two UMIs to be absent + (e.g. `ACGT-` or `-ACGT`). The molecular IDs produced have more structure than for single + UMI strategies and are of the form `{base}/{A|B}`. E.g. two UMI pairs would be mapped as + follows: `AAAA-GGGG -> 1/A`, `GGGG-AAAA -> 1/B`. + +Strategies `edit`, `adjacency`, and `paired` make use of the `--edits` parameter to control the matching of +non-identical UMIs.
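
As a rough illustration of the **edit** strategy, the Python sketch below clusters UMIs by single linkage: two UMIs end up in the same group whenever a chain of UMIs connects them in which each step differs by no more than `edits` mismatches. It is a simplified model under stated assumptions, not fgbio's implementation: the name `group_by_edits` is hypothetical, all UMIs are assumed to be the same length, and read counts (which matter for the adjacency and paired strategies) are ignored.

```python
from itertools import combinations


def mismatches(a, b):
    """Number of positions at which two equal-length UMIs differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def group_by_edits(umis, edits=1):
    """Single-linkage clustering of UMIs using a small union-find."""
    unique = sorted(set(umis))
    parent = {u: u for u in unique}

    def find(u):
        # Follow parent links to the group representative, compressing the path.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for a, b in combinations(unique, 2):
        if mismatches(a, b) <= edits:
            parent[find(b)] = find(a)  # merge the two groups

    groups = {}
    for u in unique:
        groups.setdefault(find(u), []).append(u)
    return sorted(groups.values())
```

With `edits=1`, the UMIs `AAAA`, `AAAT` and `AATT` fall into one group even though `AAAA` and `AATT` differ at two positions, because `AAAT` links them. This chaining is one way to see why such clustering can merge unrelated molecules when very many UMIs are observed.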
+ +By default, all UMIs must be the same length. If `--min-umi-length=len` is specified then reads that have a UMI +shorter than `len` will be discarded, and when comparing UMIs of different lengths, the first len bases will be +compared, where `len` is the length of the shortest UMI. The UMI length is the number of [ACGT] bases in the UMI +(i.e. does not count dashes and other non-ACGT characters). This option is not implemented for reads with UMI pairs +(i.e. using the paired assigner). + +If the `--mark-duplicates` option is given, reads will also have their duplicate flag set in the BAM file. +Each tag-family is treated separately, and a single template within the tag family is chosen to be the "unique" +template and marked as non-duplicate, while all other templates in the tag family are then marked as duplicate. +One limitation of duplicate-marking mode, vs. e.g. Picard MarkDuplicates, is that read pairs with one unmapped read +are duplicate-marked independently from read pairs with both reads mapped. + +Several parameters have different defaults depending on whether duplicates are being marked or not (all are +directly settable on the command line): + + 1. `--min-map-q` defaults to 0 in duplicate marking mode and 1 otherwise + 2. `--include-secondary` defaults to true in duplicate marking mode and false otherwise + 3. `--include-supplementary` defaults to true in duplicate marking mode and false otherwise + +Multi-threaded operation is supported via the `--threads/-@` option. This only applies to the Adjacency and Paired +strategies. Additionally the only operation that is multi-threaded is the comparisons of UMIs at the same genomic +position. Running with e.g. `--threads 8` can provide a _substantial_ reduction in runtime when there are many +UMIs observed at the same genomic location, such as can occur in amplicon sequencing or ultra-deep coverage data. 
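
The length rules for `--min-umi-length` can be written out as a small Python sketch. The helper names are hypothetical and this only restates the rules given above for single (non-paired) UMIs: length counts A, C, G and T characters only, UMIs shorter than the minimum are discarded, and UMIs of different lengths are compared over the length of the shorter one.

```python
def umi_length(umi):
    """Count only A, C, G and T characters (dashes and other characters are ignored)."""
    return sum(1 for base in umi if base in "ACGT")


def passes_min_length(umi, min_umi_length):
    """True if the UMI is long enough to be kept."""
    return umi_length(umi) >= min_umi_length


def comparable_prefixes(a, b):
    """Trim two UMIs to the length of the shorter one before comparison."""
    n = min(len(a), len(b))
    return a[:n], b[:n]
```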
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|The input BAM file.|Optional|1|/dev/stdin| +|output|o|PathToBam|The output BAM file.|Optional|1|/dev/stdout| +|family-size-histogram|f|FilePath|Optional output of tag family size counts.|Optional|1|| +|raw-tag|t|String|The tag containing the raw UMI.|Optional|1|RX| +|assign-tag|T|String|The output tag for UMI grouping.|Optional|1|MI| +|mark-duplicates|d|Boolean|Turn on duplicate marking mode.|Optional|1|false| +|include-secondary|S|Boolean|Include secondary reads.|Optional|1|| +|include-supplementary|U|Boolean|Include supplementary reads.|Optional|1|| +|min-map-q|m|Int|Minimum mapping quality for mapped reads.|Optional|1|| +|include-non-pf-reads|n|Boolean|Include non-PF reads.|Optional|1|false| +|strategy|s|Strategy|The UMI assignment strategy.|Required|1|| +|edits|e|Int|The allowable number of edits between UMIs.|Optional|1|1| +|min-umi-length|l|Int|The minimum UMI length. If not specified then all UMIs must have the same length, otherwise discard reads with UMIs shorter than this length and allow for differing UMI lengths.|Optional|1|| +|allow-inter-contig|x|Boolean|DEPRECATED: this option will be removed in future versions and inter-contig reads will be automatically processed.|Optional|1|true| +|threads|@|Int|Number of threads to use when comparing UMIs. Only recommended for amplicon or similar data.|Optional|1|1| + diff --git a/tools/2.3.0/HapCutToVcf.md b/tools/2.3.0/HapCutToVcf.md new file mode 100644 index 000000000..3b0ea2c29 --- /dev/null +++ b/tools/2.3.0/HapCutToVcf.md @@ -0,0 +1,66 @@ +--- +title: HapCutToVcf +--- + +# HapCutToVcf + +## Overview +**Group:** VCF/BCF + +Converts the output of `HAPCUT` (`HapCut1`/`HapCut2`) to a VCF. + +The output of `HAPCUT` does not include all input variants, but simply those variants that are in phased blocks. 
This tool takes the original VCF and the output of `HAPCUT`, and produces a VCF containing both the variants that +were phased and the variants that were not phased, such that all variants in the original file are in the output. + +The original VCF provided to `HAPCUT` is assumed to contain a single sample, as `HAPCUT` only supports a single +sample. + +By default, all phased genotypes are annotated with the `PS` (phase set) `FORMAT` tag, which by convention is the +position of the first variant in the phase set (see the VCF specification). Furthermore, this tool formats the +alleles of a phased genotype using the `|` separator instead of the `/` separator, where the latter indicates the +genotype is unphased. If the option to output phased variants in GATK's `ReadBackedPhasing` format is used, then +the first variant in a phase set will have `/` instead of `|` separating its alleles. Also, all phased variants +will have `PASS` set in their `FILTER` column, while unphased variants will have `NotPhased` set in their `FILTER` +column. Unlike GATK's `ReadBackedPhasing`, homozygous variants will always be unphased. + +More information about the purpose and operation of GATK's Read-backed phasing, including its output format, can +be found in the [GATK Forum](http://gatkforums.broadinstitute.org/gatk/discussion/45/purpose-and-operation-of-read-backed-phasing). + +Additional `FORMAT` fields for phased variants are provided corresponding to per-genotype information produced by +`HapCut1`: + + 1. The `RC` tag gives the counts of calls supporting `allele0` and `allele1` respectively. + 2. The `LC` tag gives the change in likelihood if this SNP is made homozygous or removed. + 3. The `MCL` tag gives the maximum change in likelihood if this SNP is made homozygous or removed. + 4. The `RMEC` tag gives the reduction in `MEC` score if we remove this variant altogether.
+ +Additional `FORMAT` fields for phased variants are provided corresponding to per-genotype information produced by +`HapCut2`: + + 1. The `PR` tag is `1` if `HapCut2` pruned this variant, `0` otherwise. + 2. The `SE` tag gives the confidence (`log10`) that there is not a switch error occurring immediately before + the SNV (closer to zero means that a switch error is more likely). + 3. The `NE` tag gives the confidence (`log10`) that the SNV is not a mismatch (single SNV) error (closer to + zero means a mismatch is more likely). + +HapCut2 should not be run with `--call-homozygous` as the genotypes may be different than the input and so is not +currently supported. + +For more information about `HapCut1`, see the source code or paper below. + source code: https://github.com/vibansal/hapcut + HapCut1 paper: An efficient and accurate algorithm for the haplotype assembly problem Bioinformatics. 2008 Aug + 15;24(16):i153-9. + +For more information about `HapCut2`, see the [source code](https://github.com/vibansal/HapCUT2) below. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|vcf|v|PathToVcf|The original VCF provided to `HapCut1`/`HapCut2`.|Required|1|| +|input|i|FilePath|The file produced by `HapCut1`/`HapCut2`.|Required|1|| +|output|o|PathToVcf|The output VCF with both phased and unphased variants.|Required|1|| +|gatk-phasing-format|r|Boolean|Output phased variants in GATK's `ReadBackedPhasing` format.|Optional|1|false| +|fix-ambiguous-reference-alleles||Boolean|Fix IUPAC codes in the original VCF to be VCF 4.3 spec-compliant (ex 'R' -> 'A'). 
Does not support BCF inputs.|Optional|1|false| + diff --git a/tools/2.3.0/HardMaskFasta.md b/tools/2.3.0/HardMaskFasta.md new file mode 100644 index 000000000..1e0770b3a --- /dev/null +++ b/tools/2.3.0/HardMaskFasta.md @@ -0,0 +1,21 @@ +--- +title: HardMaskFasta +--- + +# HardMaskFasta + +## Overview +**Group:** FASTA + +Converts soft-masked sequence to hard-masked in a FASTA file. All lower case bases are +converted to Ns; all other bases are left unchanged. Line lengths are also standardized +to allow easy indexing with `samtools faidx`. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToFasta|Input FASTA file.|Required|1|| +|output|o|PathToFasta|Output FASTA file.|Required|1|| +|line-length|l|Int|Line length for sequence lines.|Optional|1|100| + diff --git a/tools/2.3.0/MakeMixtureVcf.md b/tools/2.3.0/MakeMixtureVcf.md new file mode 100644 index 000000000..ed0477da8 --- /dev/null +++ b/tools/2.3.0/MakeMixtureVcf.md @@ -0,0 +1,42 @@ +--- +title: MakeMixtureVcf +--- + +# MakeMixtureVcf + +## Overview +**Group:** VCF/BCF + +Creates a VCF with one sample whose genotypes are a mixture of other samples'. + +The input VCF must contain all samples to be mixed, and may optionally contain other samples. +Sample mixtures can be specified in one of several ways: + + 1. `--samples s1 s2 s3`: specifies that the samples with names `s1`, `s2` and `s3` should be mixed equally. + 2. `--samples s1@0.1 s2@0.1 s3@0.8`: specifies that the three samples should be mixed at `0.1`, `0.1`, and `0.8` + respectively + 3. `--samples s1@0.1 s2 s3`: specifies that `s1` should form `0.1` of the mixture and that the remaining (`0.9`) + should be equally split amongst `s2` and `s3` + 4.
If no sample names are given, all samples in the input VCF will be used and mixed equally + +The input samples are assumed to be diploid with the allele fraction defined by the genotype (`0/0`, `0/1` or `1/1`). +The `--allele-fraction-field` option may be specified, in which case the allele fraction of the input genotypes +will be retrieved from the specified `FORMAT` field in the VCF genotypes. All genotypes except hom-ref and +no-call genotypes must have this field present if the option is supplied. + +If the `--no-call-is-hom-ref` flag is `true` (the default) then no-call genotypes in the input VCF are interpreted +as hom-ref genotypes. If it is `false`, any location with a no-call genotype will be emitted as a no-call in +the mixture, and will be filtered. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToVcf|Input VCF containing genotypes for the samples to be mixed.|Required|1|| +|output|o|PathToVcf|Output VCF of mixture sample.|Required|1|| +|samples|s|String|Samples to mix. See general usage for format and examples.|Optional|Unlimited|| +|output-sample-name|S|String|Output sample name.|Optional|1|mixture| +|no-call-is-hom-ref|N|Boolean|Treat no-calls for samples as hom-ref genotypes.|Optional|1|true| +|allele-fraction-field|a|String|Format field containing allele fraction.|Optional|1|| +|precision|p|Int|Digits of precision in generated allele fractions.|Optional|1|5| + diff --git a/tools/2.3.0/MakeTwoSampleMixtureVcf.md b/tools/2.3.0/MakeTwoSampleMixtureVcf.md new file mode 100644 index 000000000..ae3a1bff5 --- /dev/null +++ b/tools/2.3.0/MakeTwoSampleMixtureVcf.md @@ -0,0 +1,45 @@ +--- +title: MakeTwoSampleMixtureVcf +--- + +# MakeTwoSampleMixtureVcf + +## Overview +**Group:** VCF/BCF + +Creates a simulated tumor or tumor/normal VCF by in-silico mixing genotypes from two samples.
+ +The tumor genotypes are created by mixing the incoming genotypes for the +'tumor' sample and the incoming genotypes for the 'normal' sample with the 'tumor' +alleles accounting for `tumorFraction` of the resulting mixture and the 'normal' alleles +accounting for `1 - tumorFraction` of the resulting mixture (see `--tumor-fraction`). E.g. if the 'tumor' genotype +is `A/C`, the 'normal' genotype is `C/C`, and `tumorFraction` is set at 0.5, then the resulting +tumor genotype will be `A/C` with an allele fraction of `0.75`. The resulting allele fraction +is written to the `AF` info field in the VCF. + +In tumor-only mode (`--tumor-only`) only tumor genotypes are output. In tumor/normal mode genotypes for +the 'normal' sample are also emitted, and match the genotypes from the input sample. + +All loci (potentially restricted by intervals) that are variant in one or both samples are written +to the output VCF, though in several cases the variants will be filtered: + + * If either the tumor or the normal sample is no-called the resulting locus will have the + `unknown_gt` filter applied + * If the tumor and the normal have more than one alternative allele between them the + `multi_allelic` filter will be applied + * In tumor/normal mode (as opposed to tumor-only) loci that have an alt allele in the normal + sample will have the `alt_allele_in_normal` filter applied + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToVcf|Input VCF file.|Required|1|| +|output|o|PathToVcf|Output VCF file.|Required|1|| +|tumor|t|String|Name of the 'tumor' sample in the input VCF.|Required|1|| +|normal|n|String|Name of the 'normal' sample in the input VCF.|Required|1|| +|tumor-fraction|f|Double|What fraction of the mixture comes from the 'tumor' sample.|Optional|1|0.5| +|tumor-only|T|Boolean|Tumor only mode - only output tumor genotypes and don't filter
sites.|Optional|1|false| +|no-call-is-hom-ref|N|Boolean|Treat no-calls for either sample as hom-ref genotypes.|Optional|1|true| +|intervals|l|PathToIntervals|Optional set of intervals to restrict to.|Optional|1|| + diff --git a/tools/2.3.0/PickIlluminaIndices.md b/tools/2.3.0/PickIlluminaIndices.md new file mode 100644 index 000000000..dec9f8f4a --- /dev/null +++ b/tools/2.3.0/PickIlluminaIndices.md @@ -0,0 +1,35 @@ +--- +title: PickIlluminaIndices +--- + +# PickIlluminaIndices + +## Overview +**Group:** Utilities + +Picks a set of molecular indices that should work well together. + +The indexes to evaluate are generated randomly, unless the `--candidates` option is given. The latter +specifies a path to a file with one index per line from which indices are picked. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|length|l|Int|The length of each barcode sequence.|Optional|1|8| +|indices|n|Int|The number of indices desired.|Required|1|| +|edit-distance|e|Int|The minimum edit distance between two indices in the set.|Optional|1|3| +|output|o|FilePath|File to write indices to.|Required|1|| +|allow-reverses||Boolean|Allow indices that are lexical reverses of one another|Optional|1|false| +|allow-reverse-complements||Boolean|Allow indices that are reverse complements of one another|Optional|1|false| +|allow-palindromes||Boolean|Allow indices that are palindromic (`bases == rev(bases)`).|Optional|1|false| +|max-homopolymer||Int|Reject indices with a homopolymer of greater than this length.|Optional|1|2| +|min-gc||Double|The minimum GC fraction for a barcode to be accepted.|Optional|1|0.0| +|max-gc||Double|The maximum GC fraction for a barcode to be accepted.|Optional|1|0.7| +|threads|t|Int|Number of threads to use.|Optional|1|4| +|vienna-rna-dir||DirPath|The installation directory for `ViennaRNA`.|Optional|1|| +|min-delta-g||Double|The lowest acceptable 
secondary structure `deltaG`.|Optional|1|-10.0| +|adapters||String|The indexed adapter sequence into which the indices will be integrated.|Required|Unlimited|AATGATACGGCGACCACCGAGATCTACACNNNNNNNNACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG| +|avoid-sequence||String|Sequences that should be avoided. Any kmer of `length` that appears in these sequences and their reverse complements will be thrown out.|Required|Unlimited|AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACGCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAG, CTGTCTCTTATACACATCTCTGAGCGGGCTGGCAAGGCAGACCGNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACNNNNNNNNTCGTCGGCAGCGTCAGATGTGTATAAGAGACAG, CTGTCTCTTATACACATCTCCGAGCCCACGAGACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACNNNNNNNNACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG| +|candidates||FilePath|The candidate indices from which to choose with one index per line (nothing else), otherwise generate all possible indices|Optional|1|| + diff --git a/tools/2.3.0/PickLongIndices.md b/tools/2.3.0/PickLongIndices.md new file mode 100644 index 000000000..f64e0ba70 --- /dev/null +++ b/tools/2.3.0/PickLongIndices.md @@ -0,0 +1,58 @@ +--- +title: PickLongIndices +--- + +# PickLongIndices + +## Overview +**Group:** Utilities + +Picks a set of molecular indices that have at least a given number of mismatches between +them. 
Whereas `PickIlluminaIndices` attempts to pick a near-optimal set of indices, +`PickLongIndices` implements a significantly more efficient method based on generation of +random indices that can generate a large set of satisfactory indices in a small amount of +time and memory even for index lengths `>> 10bp`. + +Many options exist for controlling aspects of the indices picked, including length, edit +distance (mismatches only), gc range, homopolymer content, secondary structure etc. + +Secondary structure is predicted using ViennaRNA's `RNAfold` in DNA mode. To enable structure +checking both the `--vienna-rna-dir` and `--adapters` must be specified. Adapters must be +strings of A, C, G, and T with a single block of Ns (e.g. `ACGTNNNN` or `ACNNNNGT`). At runtime +the Ns are replaced with indices, and `deltaG` of the index-containing sequence is calculated. + +The number of indices requested may not be possible to produce given other constraints. +When this is the case the tool will output as many indices as possible, though fewer than +the requested number. In such cases it may be useful to try different values for `--attempts`. +This parameter controls how many attempts are made to find the next valid index before +quitting and outputting the accumulated indices. Higher values will yield incrementally more +indices but require significantly longer runtimes. + +A file of existing indices may be provided. Existing indices must be of the same length as +requested indices and composed of A, C, G and Ts, but are subject to no other constraints. +Index picking will then build a set comprised of the existing indices, and new indices which +satisfy all constraints. Existing indices are included in the generated output file.
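
The random picking strategy described above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not fgbio's implementation: the function names (`mismatches`, `longest_homopolymer`, `is_acceptable`, `pick_indices`) are hypothetical, the defaults simply mirror the documented defaults, and the secondary-structure, avoid-sequence, reverse, reverse-complement and existing-index handling are all omitted.

```python
import random


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length indices differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def longest_homopolymer(index: str) -> int:
    """Length of the longest run of a single repeated base."""
    longest = run = 1
    for prev, cur in zip(index, index[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def is_acceptable(index, picked, min_distance=3, max_homopolymer=2,
                  min_gc=0.2, max_gc=0.8):
    """Check GC fraction, homopolymer length and distance to picked indices."""
    gc = sum(base in "GC" for base in index) / len(index)
    if not (min_gc <= gc <= max_gc):
        return False
    if longest_homopolymer(index) > max_homopolymer:
        return False
    return all(mismatches(index, other) >= min_distance for other in picked)


def pick_indices(n, length=8, attempts=100000, seed=1):
    """Pick up to n indices; stop early once `attempts` candidates in a row fail."""
    rng = random.Random(seed)
    picked = []
    while len(picked) < n:
        for _ in range(attempts):
            candidate = "".join(rng.choice("ACGT") for _ in range(length))
            if is_acceptable(candidate, picked):
                picked.append(candidate)
                break
        else:
            # No valid index found within the attempt budget: give up and
            # return what has been accumulated so far.
            break
    return picked
```

The early exit in `pick_indices` is why a run may return fewer indices than requested, and why a larger attempt budget trades runtime for a slightly larger set.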
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|length|l|Int|The length of each index sequence.|Optional|1|8| +|number-of-indices|n|Int|The number of indices desired.|Required|1|| +|edit-distance|e|Int|The minimum edit distance between two indices in the set.|Optional|1|3| +|output|o|FilePath|File to write indices to.|Required|1|| +|allow-reverses||Boolean|Allow indices that are lexical reverses of one another|Optional|1|false| +|allow-reverse-complements||Boolean|Allow indices that are reverse complements of one another|Optional|1|false| +|allow-palindromes||Boolean|Allow indices that are palindromic (`index == revcomp(index)`).|Optional|1|false| +|max-homopolymer||Int|Reject indices with a homopolymer of greater than this length.|Optional|1|2| +|min-gc|g|Double|The minimum GC fraction for an index to be accepted.|Optional|1|0.2| +|max-gc|G|Double|The maximum GC fraction for an index to be accepted.|Optional|1|0.8| +|existing||FilePath|File of existing index sequences to integrate, one per line.|Optional|1|| +|seed|s|Int|Random seed value.|Optional|1|1| +|attempts|a|Int|Attempts to pick the next index before quitting.|Optional|1|100000| +|vienna-rna-dir||DirPath|The installation directory for `ViennaRNA`.|Optional|1|| +|temperature|t|Double|The temperature at which to predict secondary structure.|Optional|1|25.0| +|min-delta-g||Double|The lowest acceptable secondary structure `deltaG`.|Optional|1|-10.0| +|adapters||String|Adapter sequence(s) into which indices will be inserted.|Optional|Unlimited|| +|avoid-sequence||String|Any index sequence that appears in an avoid sequence or its reverse complement will be discarded.|Required|Unlimited|AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, 
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACGCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAG, CTGTCTCTTATACACATCTCTGAGCGGGCTGGCAAGGCAGACCGNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACNNNNNNNNTCGTCGGCAGCGTCAGATGTGTATAAGAGACAG, CTGTCTCTTATACACATCTCCGAGCCCACGAGACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACNNNNNNNNACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG| + diff --git a/tools/2.3.0/RandomizeBam.md b/tools/2.3.0/RandomizeBam.md new file mode 100644 index 000000000..5a556bf75 --- /dev/null +++ b/tools/2.3.0/RandomizeBam.md @@ -0,0 +1,23 @@ +--- +title: RandomizeBam +--- + +# RandomizeBam + +## Overview +**Group:** SAM/BAM + +Randomizes the order of reads in a SAM or BAM file. Randomization is done by sorting +on a hash of the `queryname` (and bases and quals if not query-grouping). By default +reads with the same query name are grouped together in the output file; this can be +turned off by specifying --query-group=false. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|The input SAM or BAM file.|Optional|1|/dev/stdin| +|output|o|PathToBam|The output SAM or BAM file.|Optional|1|/dev/stdout| +|seed|s|Int|Random seed.|Optional|1|42| +|query-group|q|Boolean|Group together reads by queryname.|Optional|1|true| + diff --git a/tools/2.3.0/RemoveSamTags.md b/tools/2.3.0/RemoveSamTags.md new file mode 100644 index 000000000..a3b32cb6f --- /dev/null +++ b/tools/2.3.0/RemoveSamTags.md @@ -0,0 +1,19 @@ +--- +title: RemoveSamTags +--- + +# RemoveSamTags + +## Overview +**Group:** SAM/BAM + +Removes SAM tags from a SAM or BAM file. If no tags to remove are given, the original file is produced. 
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input SAM or BAM.|Required|1|| +|output|o|PathToBam|Output SAM or BAM.|Required|1|| +|tags-to-remove|t|String|The tags to remove.|Optional|Unlimited|| + diff --git a/tools/2.3.0/ReviewConsensusVariants.md b/tools/2.3.0/ReviewConsensusVariants.md new file mode 100644 index 000000000..776c56bf6 --- /dev/null +++ b/tools/2.3.0/ReviewConsensusVariants.md @@ -0,0 +1,45 @@ +--- +title: ReviewConsensusVariants +--- + +# ReviewConsensusVariants + +## Overview +**Group:** Unique Molecular Identifiers (UMIs) + +Extracts data to make reviewing of variant calls from consensus reads easier. Creates +a list of variant sites from the input VCF (SNPs only) or IntervalList then extracts all +the consensus reads that do not contain a reference allele at the variant sites, and +all raw reads that contributed to those consensus reads. This will include consensus +reads that carry the alternate allele, a third allele, a no-call or a spanning +deletion at the variant site. + +Reads are correlated between consensus and grouped BAMs using a molecule ID stored +in an optional attribute, `MI` by default. In order to support paired molecule IDs +where two or more molecule IDs are related (e.g. see the Paired assignment strategy +in _GroupReadsByUmi_) the molecule ID is truncated at the last `/` if present +(e.g. `1/A => 1` and `2 => 2`). + +Both input BAMs must be coordinate sorted and indexed. + +A pair of output BAMs named `.consensus.bam` and `.grouped.bam` are created +with the relevant reads from each input BAM, and a review file `.txt` is +created. The review file contains details on each variant position along with detailed +information on each consensus read that supports the variant.
If the sample-name argument +is supplied and the input is VCF, genotype information for that sample will be retrieved. +If the sample-name isn't supplied and the VCF contains only a single sample then those +genotypes will be used. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|FilePath|Input VCF or IntervalList of variant locations.|Required|1|| +|sample|s|String|Name of the sample being reviewed.|Optional|1|| +|consensus-bam|c|PathToBam|BAM file of consensus reads used to call variants.|Required|1|| +|grouped-bam|g|PathToBam|BAM file of grouped raw reads used to build consensuses.|Required|1|| +|ref|r|PathToFasta|Reference fasta file.|Required|1|| +|output|o|PathPrefix|Basename of output files to create.|Required|1|| +|ignore-ns-in-consensus-reads|N|Boolean|Ignore N bases in the consensus reads.|Optional|1|false| +|maf|m|Double|Only output detailed information for variants at maf and below.|Optional|1|0.05| + diff --git a/tools/2.3.0/SetMateInformation.md b/tools/2.3.0/SetMateInformation.md new file mode 100644 index 000000000..6eca8bd93 --- /dev/null +++ b/tools/2.3.0/SetMateInformation.md @@ -0,0 +1,28 @@ +--- +title: SetMateInformation +--- + +# SetMateInformation + +## Overview +**Group:** SAM/BAM + +Adds and/or fixes mate information on paired-end reads. Sets the MQ (mate mapping quality), +`MC` (mate cigar string), ensures all mate-related flag fields are set correctly, and that +the mate reference and mate start position are correct. + +Supplementary records are handled correctly (updated with their mate's non-supplemental +attributes). Secondary alignments are passed through but are not updated. + +The input file must be query-name sorted or query-name grouped (i.e. all records from the same +query sequence must be adjacent in the file, though the ordering between queries is unspecified). 
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input SAM/BAM/CRAM file.|Optional|1|/dev/stdin| +|output|o|PathToBam|Output SAM/BAM/CRAM file.|Optional|1|/dev/stdout| +|ref|r|PathToFasta|Reference fasta, only needed if writing CRAM.|Optional|1|| +|allow-missing-mates|x|Boolean|If specified, do not fail when reads marked as paired are missing their mate pairs.|Optional|1|false| + diff --git a/tools/2.3.0/SortBam.md b/tools/2.3.0/SortBam.md new file mode 100644 index 000000000..2932f50c3 --- /dev/null +++ b/tools/2.3.0/SortBam.md @@ -0,0 +1,42 @@ +--- +title: SortBam +--- + +# SortBam + +## Overview +**Group:** SAM/BAM + +Sorts a SAM or BAM file. Several sort orders are available: + +1. **Coordinate**: sorts reads by their reference sequence and left-most aligned coordinate +2. **Queryname**: sort the reads by their query (i.e. read) name +3. **Random**: sorts the reads into a random order. The output is deterministic for any given input. +4. **RandomQuery**: sorts the reads into a random order but keeps reads with the same + queryname together. The ordering is deterministic for any given input. +5. **TemplateCoordinate**: The sort order used by `GroupReadsByUmi`. Sorts reads by + the earlier unclipped 5' coordinate of the read pair, the higher unclipped 5' coordinate of the + read pair, library, the molecular identifier (MI tag), read name, and if R1 has the lower + coordinates of the pair. + +Uses a temporary directory to buffer sets of sorted reads to disk. The number of reads kept in memory +affects memory use and can be changed with the `--max-records-in-ram` option. The temporary directory +to use can be set with the fgbio global option `--tmp-dir`.
+ +An example invocation might look like: + +```bash +java -Xmx4g -jar fgbio.jar --tmp-dir=/my/big/scratch/volume \ + SortBam --input=queryname.bam --sort-order=Coordinate --output coordinate.bam +``` + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input SAM or BAM.|Required|1|| +|output|o|PathToBam|Output SAM or BAM.|Required|1|| +|sort-order|s|SamOrder|Order into which to sort the records.|Optional|1|Coordinate| +|max-records-in-ram|m|Int|Max records in RAM.|Optional|1|1000000| + diff --git a/tools/2.3.0/SortFastq.md b/tools/2.3.0/SortFastq.md new file mode 100644 index 000000000..fe0c6500b --- /dev/null +++ b/tools/2.3.0/SortFastq.md @@ -0,0 +1,20 @@ +--- +title: SortFastq +--- + +# SortFastq + +## Overview +**Group:** FASTQ + +Sorts a FASTQ file. Sorts the records in a FASTQ file based on the lexicographic ordering +of their read names. Input and output files can be either uncompressed or gzip-compressed. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToFastq|Input fastq file.|Required|1|| +|output|o|PathToFastq|Output fastq file.|Required|1|| +|max-records-in-ram|m|Int|Maximum records to keep in RAM at one time.|Optional|1|500000| + diff --git a/tools/2.3.0/SortSequenceDictionary.md b/tools/2.3.0/SortSequenceDictionary.md new file mode 100644 index 000000000..15894f3b9 --- /dev/null +++ b/tools/2.3.0/SortSequenceDictionary.md @@ -0,0 +1,32 @@ +--- +title: SortSequenceDictionary +--- + +# SortSequenceDictionary + +## Overview +**Group:** FASTA + +Sorts a sequence dictionary file in the order of another sequence dictionary. + +The inputs are two `*.dict` files: one to be sorted, and the other to provide the order for the sorting.
+ +If there is a contig in the input dictionary that is not in the sorting dictionary, that contig will be appended +to the end of the sequence dictionary in the same relative order to other appended contigs as in the input dictionary. +Missing contigs can be omitted by setting `--skip-missing-contigs` to true. + +If there is a contig in the sorting dictionary that is not in the input dictionary, that contig will be ignored. + +The output will be a sequence dictionary, containing the version header line and one +line per contig. The fields of the entries in this dictionary will be the same as in input, but in the order of +`--sort-dictionary`. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToSequenceDictionary|Input sequence dictionary file to be sorted.|Required|1|| +|sort-dictionary|d|PathToSequenceDictionary|Input sequence dictionary file containing contigs in the desired sort order.|Required|1|| +|output|o|PathToSequenceDictionary|Output sequence dictionary file.|Required|1|| +|skip-missing-contigs||Boolean|Skip input contigs that have no matching contig in the sort dictionary rather than appending to the end of the output dictionary.|Optional|1|false| + diff --git a/tools/2.3.0/SplitBam.md b/tools/2.3.0/SplitBam.md new file mode 100644 index 000000000..f3bb0d84e --- /dev/null +++ b/tools/2.3.0/SplitBam.md @@ -0,0 +1,30 @@ +--- +title: SplitBam +--- + +# SplitBam + +## Overview +**Group:** SAM/BAM + +Splits a BAM into multiple BAMs, one per-read group (or library). + +The resulting BAMs will be named `..bam`, or `..bam` +when splitting by the library. All reads without a read group, or without a library when splitting by library, +will be written to `.unknown.bam`. If no such reads exist, then no such file will exist. 
+ +By default, async writing of BAM files is controlled by the `--async-io` common tool option to +increase performance. If the input BAM has significantly more read groups (or libraries) than your system has CPUs +it is recommended to disable this feature for this tool using `--no-async-writing`. Asynchronous reading is not +affected. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input SAM or BAM file.|Required|1|| +|output|o|PathPrefix|Output prefix for all SAM or BAM files (ex. output/sample-name).|Required|1|| +|split-by|s|SplitType|Split by library instead of read group|Optional|1|ReadGroup| +|unknown|u|String|The name to use for the unknown file|Optional|1|unknown| +|no-async-writing||Boolean|Do not write the records asynchronously. Use this to reduce memory usage when many read groups/libraries are present.|Optional|1|false| + diff --git a/tools/2.3.0/TrimFastq.md b/tools/2.3.0/TrimFastq.md new file mode 100644 index 000000000..a4f1d5c85 --- /dev/null +++ b/tools/2.3.0/TrimFastq.md @@ -0,0 +1,27 @@ +--- +title: TrimFastq +--- + +# TrimFastq + +## Overview +**Group:** FASTQ + +Trims reads in one or more line-matched fastq files to a specific read length. The +individual fastq files are expected to have the same set of reads, as would be the +case with an `r1.fastq` and `r2.fastq` file for the same sample. + +Optionally supports dropping of reads across all files when one or more reads +is already shorter than the desired trim length. + +Input and output fastq files may be gzipped. 
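
The trimming and exclusion behaviour described above can be illustrated with a small Python sketch. It is an illustration under stated assumptions rather than fgbio's code: the `FastqRecord` type and the `trim_matched` function are hypothetical names, and records are passed in memory instead of being streamed from (possibly gzipped) files.

```python
from typing import List, NamedTuple, Optional, Sequence


class FastqRecord(NamedTuple):
    name: str
    bases: str
    quals: str


def trim_matched(records: Sequence[FastqRecord],
                 lengths: Sequence[int],
                 exclude: bool = False) -> Optional[List[FastqRecord]]:
    """Trim one record per input file to its target length.

    `records` holds the line-matched records (e.g. R1 and R2 of one template)
    and `lengths` the trim length for each file. When `exclude` is set and any
    record is already shorter than its trim length, the whole set is dropped
    (returned as None) so the output files stay line-matched.
    """
    if exclude and any(len(r.bases) < n for r, n in zip(records, lengths)):
        return None
    return [r._replace(bases=r.bases[:n], quals=r.quals[:n])
            for r, n in zip(records, lengths)]
```

Dropping the record from every file at once, rather than only from the file in which it is short, is what keeps the outputs line-matched.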
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToFastq|One or more input fastq files.|Required|Unlimited|| +|output|o|PathToFastq|A matching number of output fastq files.|Required|Unlimited|| +|length|l|Int|Length to trim reads to (either one per input fastq file, or one for all).|Required|Unlimited|| +|exclude|x|Boolean|Exclude reads below the trim length.|Optional|1|false| + diff --git a/tools/2.3.0/TrimPrimers.md b/tools/2.3.0/TrimPrimers.md new file mode 100644 index 000000000..883890c22 --- /dev/null +++ b/tools/2.3.0/TrimPrimers.md @@ -0,0 +1,57 @@ +--- +title: TrimPrimers +--- + +# TrimPrimers + +## Overview +**Group:** SAM/BAM + +Trims primers from reads post-alignment. Takes in a BAM file of aligned reads +and a tab-delimited file with five columns (`chrom`, `left_start`, `left_end`, +`right_start`, and `right_end`) which provide the 1-based inclusive start and end +positions of the primers for each amplicon. The primer file must include headers, e.g: + +``` +chrom left_start left_end right_start right_end +chr1 1010873 1010894 1011118 1011137 +``` + +Paired end reads that map to a given amplicon position are trimmed so that the +alignment no longer includes the primer sequences. All other aligned reads have the +_maximum primer length trimmed_! + +Reads that are trimmed will have the `NM`, `UQ` and `MD` tags cleared as they are no longer +guaranteed to be accurate. If a reference is provided the reads will be re-sorted +by coordinate after trimming and the `NM`, `UQ` and `MD` tags recalculated. + +If the input BAM is not `queryname` sorted it will be sorted internally so that mate +information between paired-end reads can be corrected before writing the output file. + +The `--first-of-pair` option will cause only the first of pair (R1) reads to be trimmed +based solely on the primer location of R1.
This is useful when there is a target +specific primer on the 5' end of R1 but no primer sequenced on R2 (e.g. single gene-specific +primer target enrichment). In this case, the location of each target specific primer should +be specified in an amplicon's left or right primer exclusively. The coordinates of the +non-specific-target primer should be `-1` for both start and end, e.g: + +``` +chrom left_start left_end right_start right_end +chr1 1010873 1010894 -1 -1 +chr2 -1 -1 1011118 1011137 +``` + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input BAM file.|Required|1|| +|output|o|PathToBam|Output BAM file.|Required|1|| +|primers|p|FilePath|File with primer locations.|Required|1|| +|hard-clip|H|Boolean|If true, hard clip reads, else soft clip.|Optional|1|false| +|slop|S|Int|Match to primer locations +/- this many bases.|Optional|1|5| +|sort-order|s|SamOrder|Sort order of output BAM file (defaults to input sort order).|Optional|1|| +|ref|r|PathToFasta|Optional reference fasta for recalculating NM, MD and UQ tags.|Optional|1|| +|auto-trim-attributes|a|Boolean|Automatically trim extended attributes that are the same length as bases.|Optional|1|false| +|first-of-pair||Boolean|Trim only first of pair reads (R1s), otherwise both ends of a pair|Optional|1|false| + diff --git a/tools/2.3.0/UpdateDelimitedFileContigNames.md b/tools/2.3.0/UpdateDelimitedFileContigNames.md new file mode 100644 index 000000000..c4801f3d1 --- /dev/null +++ b/tools/2.3.0/UpdateDelimitedFileContigNames.md @@ -0,0 +1,32 @@ +--- +title: UpdateDelimitedFileContigNames +--- + +# UpdateDelimitedFileContigNames + +## Overview +**Group:** Utilities + +Updates the contig names in columns of a delimited data file (e.g. CSV, TSV). + +The name of each sequence must match one of the names (including aliases) in the given sequence dictionary.
The +new name will be the primary (non-alias) name in the sequence dictionary. Use `--skip-missing` to ignore lines +where a contig name could not be updated (i.e. missing from the sequence dictionary). + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|FilePath|Input delimited data file.|Required|1|| +|dict|d|PathToSequenceDictionary|The path to the sequence dictionary with contig aliases.|Required|1|| +|columns|c|Int|The column indices for the contig names (0-based).|Required|Unlimited|| +|delimiter|T|Char|The delimiter|Optional|1|\t| +|comment|H|String|Treat lines with this starting string as comments (always printed)|Optional|1|#| +|output|o|FilePath|Output delimited data file.|Required|1|| +|output-first-num-lines|n|Int|Output the first `N` lines as-is (always printed).|Optional|1|0| +|skip-missing||Boolean|Skip lines where a contig name could not be updated (i.e. missing from the sequence dictionary).|Optional|1|false| +|sort-order|s|SortOrder|Sort the output based on the following order.|Optional|1|Unsorted| +|contig||Int|The column index for the contig (0-based) for sorting. Use the first column if not given.|Optional|1|| +|position||Int|The column index for the genomic position (0-based) for sorting by coordinate.|Optional|1|| +|max-objects-in-ram||Int|The maximum number of objects to store in memory|Optional|1|1000000| + diff --git a/tools/2.3.0/UpdateFastaContigNames.md b/tools/2.3.0/UpdateFastaContigNames.md new file mode 100644 index 000000000..16339c7b0 --- /dev/null +++ b/tools/2.3.0/UpdateFastaContigNames.md @@ -0,0 +1,35 @@ +--- +title: UpdateFastaContigNames +--- + +# UpdateFastaContigNames + +## Overview +**Group:** FASTA + +Updates the sequence names in a FASTA. + +The name of each sequence must match one of the names (including aliases) in the given sequence dictionary. 
The +new name will be the primary (non-alias) name in the sequence dictionary. + +By default, the sort order of the contigs will be the same as the input FASTA. Use the `--sort-by-dict` option to +sort by the input sequence dictionary. Furthermore, the sequence dictionary may contain **more** contigs than the +input FASTA, and they won't be used. + +Use the `--skip-missing` option to skip contigs in the input FASTA that cannot be renamed (i.e. that are not present +in the input sequence dictionary); missing contigs will not be written to the output FASTA. Finally, use the +`--default-contigs` option to specify an additional FASTA which will be queried to locate contigs not present in +the input FASTA but present in the sequence dictionary. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToFasta|Input FASTA.|Required|1|| +|dict|d|PathToSequenceDictionary|The path to the sequence dictionary with contig aliases.|Required|1|| +|output|o|PathToFasta|Output FASTA.|Required|1|| +|line-length|l|Int|Line length of sequence lines.|Optional|1|100| +|skip-missing||Boolean|Skip missing source contigs (will not be outputted).|Optional|1|false| +|sort-by-dict||Boolean|Sort the contigs based on the input sequence dictionary.|Optional|1|false| +|default-contigs||PathToFasta|Add sequences from this FASTA when contigs in the sequence dictionary are missing from the input FASTA.|Optional|1|| + diff --git a/tools/2.3.0/UpdateGffContigNames.md b/tools/2.3.0/UpdateGffContigNames.md new file mode 100644 index 000000000..4a565411c --- /dev/null +++ b/tools/2.3.0/UpdateGffContigNames.md @@ -0,0 +1,25 @@ +--- +title: UpdateGffContigNames +--- + +# UpdateGffContigNames + +## Overview +**Group:** Utilities + +Updates the contig names in a GFF. + +The name of each sequence must match one of the names (including aliases) in the given sequence dictionary.
The +new name will be the primary (non-alias) name in the sequence dictionary. + +Please note: the output GFF will be in the same order as the input GFF. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|FilePath|Input GFF.|Required|1|| +|dict|d|PathToSequenceDictionary|The path to the sequence dictionary with contig aliases.|Required|1|| +|output|o|FilePath|Output GFF.|Required|1|| +|skip-missing||Boolean|Skip missing contigs.|Optional|1|false| + diff --git a/tools/2.3.0/UpdateIntervalListContigNames.md b/tools/2.3.0/UpdateIntervalListContigNames.md new file mode 100644 index 000000000..e78dacb70 --- /dev/null +++ b/tools/2.3.0/UpdateIntervalListContigNames.md @@ -0,0 +1,23 @@ +--- +title: UpdateIntervalListContigNames +--- + +# UpdateIntervalListContigNames + +## Overview +**Group:** FASTA + +Updates the sequence names in an Interval List file. + +The name of each sequence must match one of the names (including aliases) in the given sequence dictionary. The +new name will be the primary (non-alias) name in the sequence dictionary. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToIntervals|Input interval list.|Required|1|| +|dict|d|PathToSequenceDictionary|The path to the sequence dictionary with contig aliases.|Required|1|| +|output|o|PathToIntervals|Output interval list.|Required|1|| +|skip-missing||Boolean|Skip missing source contigs.|Optional|1|false| + diff --git a/tools/2.3.0/UpdateReadGroups.md b/tools/2.3.0/UpdateReadGroups.md new file mode 100644 index 000000000..ef861e7d6 --- /dev/null +++ b/tools/2.3.0/UpdateReadGroups.md @@ -0,0 +1,35 @@ +--- +title: UpdateReadGroups +--- + +# UpdateReadGroups + +## Overview +**Group:** SAM/BAM + +Updates one or more read groups and their identifiers. 
+ +This tool will replace each read group with a new read group, including a new read group identifier. If the read +group identifier is not to be changed, it is recommended to use `samtools reheader` or Picard's +`ReplaceSamHeader` instead as in this case only the header needs modification. If all read groups are to be +assigned to one read group, it is recommended to use Picard's `AddOrReplaceReadGroups`. Nonetheless, if the read +group identifier also needs to be changed, use this tool. + +Each read group in the input file will be mapped to one and only one new read group identifier, unless +`--ignore-missing-read-groups` is set. A SAM header file should be given with the new read groups and the ID field +for each read group containing the new read group identifier. An additional attribute (`FR`) should be provided +that gives the original read group identifier (`ID`) to which this new read group corresponds. + +If `--keep-read-group-attributes` is true, then any read group attribute not replaced will be kept in the new read +group. Otherwise, only the attributes in the provided SAM header file will be used.
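
To make the role of the `FR` attribute concrete, the following Python sketch builds the old-to-new identifier mapping from the lines of a replacement header. It is only an illustration: `read_group_mapping` is a hypothetical helper, not part of fgbio, and the identifiers and sample names in the usage example are invented.

```python
from typing import Dict, Iterable


def read_group_mapping(header_lines: Iterable[str]) -> Dict[str, str]:
    """Map each original read group ID (`FR`) to its new read group ID (`ID`).

    Only `@RG` lines are considered; every such line is assumed to carry both
    an `ID` and an `FR` attribute, as the text above requires.
    """
    mapping = {}
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@RG":
            continue
        # Each attribute is TAG:VALUE; split on the first colon only.
        attrs = dict(field.split(":", 1) for field in fields[1:])
        mapping[attrs["FR"]] = attrs["ID"]
    return mapping


# Hypothetical replacement header: old IDs "old1"/"old2" become "new1"/"new2".
example_header = [
    "@HD\tVN:1.6\tSO:unsorted",
    "@RG\tID:new1\tFR:old1\tSM:sampleA",
    "@RG\tID:new2\tFR:old2\tSM:sampleB",
]
example_mapping = read_group_mapping(example_header)
```

Here `example_mapping` pairs each original identifier with its replacement, which is the one-to-one correspondence the tool expects the header file to define.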
+ +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Input BAM file.|Required|1|| +|output|o|PathToBam|Output BAM file.|Required|1|| +|read-groups-file|r|PathToBam|A SAM header file with the replacement read groups (see detailed usage).|Required|1|| +|keep-read-group-attributes|k|Boolean|Keep all read group attributes that are not replaced.|Optional|1|false| +|ignore-missing-read-groups|g|Boolean|Keep all read groups not found in the replacement header, otherwise throw an error.|Optional|1|false| + diff --git a/tools/2.3.0/UpdateVcfContigNames.md b/tools/2.3.0/UpdateVcfContigNames.md new file mode 100644 index 000000000..22557c8fd --- /dev/null +++ b/tools/2.3.0/UpdateVcfContigNames.md @@ -0,0 +1,23 @@ +--- +title: UpdateVcfContigNames +--- + +# UpdateVcfContigNames + +## Overview +**Group:** VCF/BCF + +Updates the contig names in a VCF. + +The name of each sequence must match one of the names (including aliases) in the given sequence dictionary. The +new name will be the primary (non-alias) name in the sequence dictionary. + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToVcf|Input VCF.|Required|1|| +|dict|d|PathToSequenceDictionary|The path to the sequence dictionary with contig aliases.|Required|1|| +|output|o|PathToVcf|Output VCF.|Required|1|| +|skip-missing||Boolean|Skip missing contigs.|Optional|1|false| + diff --git a/tools/2.3.0/ZipperBams.md b/tools/2.3.0/ZipperBams.md new file mode 100644 index 000000000..dbbceb1b7 --- /dev/null +++ b/tools/2.3.0/ZipperBams.md @@ -0,0 +1,45 @@ +--- +title: ZipperBams +--- + +# ZipperBams + +## Overview +**Group:** SAM/BAM + +Zips together an unmapped and mapped BAM to transfer metadata into the output BAM.
+ +Both the unmapped and mapped BAMs _must_ be a) queryname sorted or grouped (i.e. all records with the same +name are grouped together in the file), and b) have the same ordering of querynames. If either of these are +violated the output is undefined! + +All tags present on the unmapped reads are transferred to the mapped reads. The options `--tags-to-reverse` +and `--tags-to-revcomp` will cause tags on the unmapped reads to be reversed or reverse complemented before +being copied to reads mapped to the negative strand. These options can take a mixture of two-letter tag names +and the names of tag sets, which will be expanded into sets of tag names. Currently the only named tag set +is "Consensus" which contains all the per-base consensus tags produced by fgbio consensus callers. + +By default the mapped BAM is read from standard input (stdin) and the output BAM is written to standard +output (stdout). This can be changed using the `--input/-i` and `--output/-o` options. + +By default the output BAM file is emitted in the same order as the input BAMs. This can be overridden +using the `--sort` option, though in practice it may be faster to do the following: + +``` +fgbio --compression 0 ZipperBams -i mapped.bam -u unmapped.bam -r ref.fa | samtools sort -@ $(nproc) +``` + +## Arguments + +|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)| +|----|----|----|-----------|---------|---------------|----------------| +|input|i|PathToBam|Mapped SAM or BAM.|Optional|1|/dev/stdin| +|unmapped|u|PathToBam|Unmapped SAM or BAM.|Required|1|| +|ref|r|PathToFasta|Path to the reference used in alignment. 
Must have accompanying .dict file.|Required|1|| +|output|o|PathToBam|Output SAM or BAM file.|Optional|1|/dev/stdout| +|tags-to-remove||String|Tags to remove from the mapped BAM records.|Optional|Unlimited|| +|tags-to-reverse||String|Set of optional tags to reverse on reads mapped to the negative strand.|Optional|Unlimited|| +|tags-to-revcomp||String|Set of optional tags to reverse complement on reads mapped to the negative strand.|Optional|Unlimited|| +|sort|s|SamOrder|Sort the output BAM into the given order.|Optional|1|| +|buffer|b|Int|Buffer this many read-pairs while reading the input BAMs.|Optional|1|5000| + diff --git a/tools/2.3.0/index.md b/tools/2.3.0/index.md new file mode 100644 index 000000000..3ff54f382 --- /dev/null +++ b/tools/2.3.0/index.md @@ -0,0 +1,116 @@ +--- +title: fgbio tools +--- + +# fgbio tools + +The following tools are available in fgbio version 2.3.0. +## Basecalling + +Tools for manipulating basecalling data. + +|Tool|Description| +|----|-----------| +|[ExtractBasecallingParamsForPicard](ExtractBasecallingParamsForPicard.md)|Extracts sample and library information from a sample sheet for a given lane| +|[ExtractIlluminaRunInfo](ExtractIlluminaRunInfo.md)|Extracts information about an Illumina sequencing run from the RunInfo| + +## FASTA + +Tools for manipulating FASTA files. + +|Tool|Description| +|----|-----------| +|[CollectAlternateContigNames](CollectAlternateContigNames.md)|Collates the alternate contig names from an NCBI assembly report| +|[HardMaskFasta](HardMaskFasta.md)|Converts soft-masked sequence to hard-masked in a FASTA file| +|[SortSequenceDictionary](SortSequenceDictionary.md)|Sorts a sequence dictionary file in the order of another sequence dictionary| +|[UpdateFastaContigNames](UpdateFastaContigNames.md)|Updates the sequence names in a FASTA| +|[UpdateIntervalListContigNames](UpdateIntervalListContigNames.md)|Updates the sequence names in an Interval List file| + +## FASTQ + +Tools for manipulating FASTQ files. 
+ +|Tool|Description| +|----|-----------| +|[DemuxFastqs](DemuxFastqs.md)|Performs sample demultiplexing on FASTQs| +|[FastqToBam](FastqToBam.md)|Generates an unmapped BAM (or SAM or CRAM) file from fastq files| +|[SortFastq](SortFastq.md)|Sorts a FASTQ file| +|[TrimFastq](TrimFastq.md)|Trims reads in one or more line-matched fastq files to a specific read length| + +## RNA-Seq + +Tools for RNA-Seq data + +|Tool|Description| +|----|-----------| +|[CollectErccMetrics](CollectErccMetrics.md)|Collects metrics for ERCC spike-ins for RNA-Seq experiments| +|[EstimateRnaSeqInsertSize](EstimateRnaSeqInsertSize.md)|Computes the insert size for RNA-Seq experiments| + +## SAM/BAM + +Tools for manipulating SAM, BAM, or related data. + +|Tool|Description| +|----|-----------| +|[AnnotateBamWithUmis](AnnotateBamWithUmis.md)|Annotates existing BAM files with UMIs (Unique Molecular Indices, aka Molecular IDs, Molecular barcodes) from separate FASTQ files| +|[AssignPrimers](AssignPrimers.md)|Assigns reads to primers post-alignment| +|[AutoGenerateReadGroupsByName](AutoGenerateReadGroupsByName.md)|Adds read groups to a BAM file for a single sample by parsing the read names| +|[CallOverlappingConsensusBases](CallOverlappingConsensusBases.md)|Consensus calls overlapping bases in read pairs| +|[ClipBam](ClipBam.md)|Clips reads from the same template| +|[DownsampleAndNormalizeBam](DownsampleAndNormalizeBam.md)|Downsamples a BAM in a biased way to a uniform coverage across regions| +|[ErrorRateByReadPosition](ErrorRateByReadPosition.md)|Calculates the error rate by read position on coordinate sorted mapped BAMs| +|[EstimatePoolingFractions](EstimatePoolingFractions.md)|Examines sequence data generated from a pooled sample and estimates the fraction of sequence data coming from each constituent sample| +|[ExtractUmisFromBam](ExtractUmisFromBam.md)|Extracts unique molecular indexes from reads in a BAM file into tags| +|[FilterBam](FilterBam.md)|Filters reads out of a BAM file| 
+|[FindSwitchbackReads](FindSwitchbackReads.md)|Finds reads where a template switch occurred during library construction| +|[FindTechnicalReads](FindTechnicalReads.md)|Find reads that are from technical or synthetic sequences in a BAM file| +|[RandomizeBam](RandomizeBam.md)|Randomizes the order of reads in a SAM or BAM file| +|[RemoveSamTags](RemoveSamTags.md)|Removes SAM tags from a SAM or BAM file| +|[SetMateInformation](SetMateInformation.md)|Adds and/or fixes mate information on paired-end reads| +|[SortBam](SortBam.md)|Sorts a SAM or BAM file| +|[SplitBam](SplitBam.md)|Splits a BAM into multiple BAMs, one per-read group (or library)| +|[TrimPrimers](TrimPrimers.md)|Trims primers from reads post-alignment| +|[UpdateReadGroups](UpdateReadGroups.md)|Updates one or more read groups and their identifiers| +|[ZipperBams](ZipperBams.md)|Zips together an unmapped and mapped BAM to transfer metadata into the output BAM| + +## Unique Molecular Identifiers (UMIs) + +Tools for manipulating UMIs & reads tagged with UMIs + +|Tool|Description| +|----|-----------| +|[CallDuplexConsensusReads](CallDuplexConsensusReads.md)|Calls duplex consensus sequences from reads generated from the same double-stranded source molecule| +|[CallMolecularConsensusReads](CallMolecularConsensusReads.md)|Calls consensus sequences from reads with the same unique molecular tag| +|[CollectDuplexSeqMetrics](CollectDuplexSeqMetrics.md)|Collects a suite of metrics to QC duplex sequencing data| +|[CopyUmiFromReadName](CopyUmiFromReadName.md)|Copies the UMI at the end of the BAM's read name to the RX tag| +|[CorrectUmis](CorrectUmis.md)|Corrects UMIs stored in BAM files when a set of fixed UMIs is in use| +|[FilterConsensusReads](FilterConsensusReads.md)|Filters consensus reads generated by CallMolecularConsensusReads or CallDuplexConsensusReads| +|[GroupReadsByUmi](GroupReadsByUmi.md)|Groups reads together that appear to have come from the same original molecule| 
+|[ReviewConsensusVariants](ReviewConsensusVariants.md)|Extracts data to make reviewing of variant calls from consensus reads easier| + +## Utilities + +Various utility programs. + +|Tool|Description| +|----|-----------| +|[PickIlluminaIndices](PickIlluminaIndices.md)|Picks a set of molecular indices that should work well together| +|[PickLongIndices](PickLongIndices.md)|Picks a set of molecular indices that have at least a given number of mismatches between them| +|[UpdateDelimitedFileContigNames](UpdateDelimitedFileContigNames.md)|Updates the contig names in columns of a delimited data file (e| +|[UpdateGffContigNames](UpdateGffContigNames.md)|Updates the contig names in a GFF| + +## VCF/BCF + +Tools for manipulating VCF, BCF, or related data. + +|Tool|Description| +|----|-----------| +|[AssessPhasing](AssessPhasing.md)|Assess the accuracy of phasing for a set of variants| +|[FilterSomaticVcf](FilterSomaticVcf.md)|Applies one or more filters to a VCF of somatic variants| +|[FixVcfPhaseSet](FixVcfPhaseSet.md)|Adds/fixes the phase set (PS) genotype field| +|[HapCutToVcf](HapCutToVcf.md)|Converts the output of 'HAPCUT' ('HapCut1'/'HapCut2') to a VCF| +|[MakeMixtureVcf](MakeMixtureVcf.md)|Creates a VCF with one sample whose genotypes are a mixture of other samples'| +|[MakeTwoSampleMixtureVcf](MakeTwoSampleMixtureVcf.md)|Creates a simulated tumor or tumor/normal VCF by in-silico mixing genotypes from two samples| +|[UpdateVcfContigNames](UpdateVcfContigNames.md)|Updates the contig names in a VCF| + + diff --git a/tools/latest b/tools/latest index fae692e41..cc6612c36 120000 --- a/tools/latest +++ b/tools/latest @@ -1 +1 @@ -2.2.1 \ No newline at end of file +2.3.0 \ No newline at end of file