VCFX_probability_filter¶
Overview¶
VCFX_probability_filter
filters a VCF file based on specified genotype probability values. It allows you to keep only variants where samples meet certain probability thresholds, using various comparison operators.
Usage¶
VCFX_probability_filter --filter-probability "<CONDITION>" < input.vcf > filtered.vcf
Options¶
Option | Description |
---|---|
-f, --filter-probability <condition> |
Specify the probability filter condition (e.g., GP>0.9 ) |
-h, --help |
Display help message and exit (handled by vcfx::handle_common_flags ) |
-v , --version |
Show program version and exit (handled by vcfx::handle_common_flags ) |
Description¶
VCFX_probability_filter
analyzes the genotype probability fields in the FORMAT column of a VCF file and filters variants based on a user-defined condition. The tool:
- Reads the VCF file line by line from standard input
- Parses the specified condition (e.g.,
GP>0.9
) - For each variant, examines the probability values in the specified field
- Keeps variants where all samples meet the condition
- Outputs the filtered variants to standard output
The condition must be specified in the format FIELD OPERATOR VALUE
, where:
- FIELD
is a valid field in the FORMAT column (e.g., GP
for genotype probabilities)
- OPERATOR
is one of: >
, <
, >=
, <=
, ==
, !=
- VALUE
is a numeric threshold
Output Format¶
The output is a standard VCF file with the same format as the input, but containing only variants that meet the specified probability condition. All header lines are preserved.
Examples¶
Basic Filtering¶
Filter for variants where all samples have a GP (genotype probability) value greater than 0.9:
VCFX_probability_filter --filter-probability "GP>0.9" < input.vcf > filtered.vcf
Using Different Operators¶
Filter for variants where all samples have a GP value less than 0.1:
VCFX_probability_filter --filter-probability "GP<0.1" < input.vcf > filtered.vcf
Exact Match¶
Filter for variants where all samples have a GP value exactly equal to 0.9:
VCFX_probability_filter --filter-probability "GP==0.9" < input.vcf > filtered.vcf
Different Probability Fields¶
Filter using a different probability field (e.g., PP
for posterior probability):
VCFX_probability_filter --filter-probability "PP>=0.8" < input.vcf > filtered.vcf
In a Pipeline¶
Use as part of a processing pipeline:
cat input.vcf | VCFX_probability_filter --filter-probability "GP>0.95" | other_vcf_tool > output.vcf
Probability Value Handling¶
Field Parsing¶
The tool locates the specified probability field (e.g., GP
) in the FORMAT column of each variant and extracts the corresponding values for each sample.
Value Comparison¶
The extracted probability values are compared to the specified threshold using the given operator. A variant passes the filter only if all samples meet the condition.
Multiple Probability Values¶
If a field contains multiple values (e.g., GP
often contains three values for a biallelic variant), the filter is applied to the first value that can be successfully extracted and converted to a number.
Handling Special Cases¶
Missing Values¶
If a sample has a missing value (.
) for the specified probability field, the variant is filtered out.
Malformed Values¶
If a probability value cannot be converted to a number, the variant is filtered out with a warning message.
Missing Fields¶
If the specified field is not found in the FORMAT column, the tool reports an error and exits.
Performance Considerations¶
- The tool processes the VCF file line by line, requiring minimal memory.
- No sorting or indexing is performed, preserving the original order of variants.
- Time complexity is linear with respect to the size of the input file.
Limitations¶
- The tool requires that the specified field exists in the FORMAT column of the VCF file.
- It only supports standard comparison operators and cannot handle complex conditions or multiple conditions.
- The tool does not handle cases where different samples might need different thresholds.
- Missing or malformed values cause the variant to be filtered out, which might not be desired in all cases.