VCFX_impact_filter
Overview
VCFX_impact_filter filters VCF variants based on their predicted functional impact level (HIGH, MODERATE, LOW, or MODIFIER) found in the INFO field of VCF records.
Usage
# Using file input (recommended for large files - 10-20x faster)
VCFX_impact_filter --filter-impact <LEVEL> -I input.vcf > filtered.vcf
# Using stdin
VCFX_impact_filter --filter-impact <LEVEL> < input.vcf > filtered.vcf
Options
| Option | Description |
|---|---|
| `-i`, `--filter-impact <LEVEL>` | Required. Impact level threshold. Must be one of: HIGH, MODERATE, LOW, MODIFIER |
| `-I`, `--input FILE` | Input VCF file. Uses memory-mapped I/O for 10-20x faster processing |
| `-q`, `--quiet` | Suppress warning messages |
| `-h`, `--help` | Display help message and exit (handled by `vcfx::handle_common_flags`) |
| `-v`, `--version` | Show program version and exit (handled by `vcfx::handle_common_flags`) |
Description
VCFX_impact_filter analyzes variant annotations in a VCF file and filters them based on a specified impact level threshold. The tool:
- Reads a VCF file line-by-line
- For each variant, extracts the impact level from the INFO field (looks for `IMPACT=...`)
- Classifies the impact into one of four levels (HIGH, MODERATE, LOW, or MODIFIER), or UNKNOWN if no level is recognized
- Keeps only variants with impact level greater than or equal to the specified threshold
- Adds an `EXTRACTED_IMPACT` field to the INFO column of retained variants
- Outputs the filtered VCF with the same format as the input
The impact level hierarchy used for filtering is: HIGH > MODERATE > LOW > MODIFIER > UNKNOWN
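The classification and threshold comparison described above can be sketched in Python. This is an illustration only, not the tool's actual implementation: the names `Impact`, `classify`, and `passes` are hypothetical, and the order in which keywords are checked is an assumption.

```python
from enum import IntEnum


class Impact(IntEnum):
    """Documented hierarchy: HIGH > MODERATE > LOW > MODIFIER > UNKNOWN."""
    UNKNOWN = 0
    MODIFIER = 1
    LOW = 2
    MODERATE = 3
    HIGH = 4


def classify(value: str) -> Impact:
    """Map a raw IMPACT value to a level by case-insensitive keyword search.

    Assumption: keywords are tried from most to least severe.
    """
    upper = value.upper()
    for level in (Impact.HIGH, Impact.MODERATE, Impact.LOW, Impact.MODIFIER):
        if level.name in upper:
            return level
    return Impact.UNKNOWN


def passes(value: str, threshold: Impact) -> bool:
    """Keep a variant when its level is greater than or equal to the threshold."""
    return classify(value) >= threshold
```

Under these assumptions, `passes("HIGH_MISSENSE", Impact.MODERATE)` is true, while an unrecognized value classifies as UNKNOWN and fails every threshold.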
Output Format
The output is a valid VCF file containing only variants that meet or exceed the specified impact threshold. Each retained variant will have an additional INFO field:
EXTRACTED_IMPACT=<value>
Where <value> is the original impact value extracted from the INFO field.
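As a rough sketch of how the extra field could be attached to a record, the Python below appends `EXTRACTED_IMPACT` to the eighth tab-separated column. The function names are hypothetical, and two details are assumptions rather than documented behavior: that the field is appended after the existing INFO entries with a semicolon, and that `.` or an empty string counts as an empty INFO field.

```python
def add_extracted_impact(info: str, impact_value: str) -> str:
    """Return the INFO string with EXTRACTED_IMPACT=<value> added."""
    tag = f"EXTRACTED_IMPACT={impact_value}"
    # Assumption: "." (VCF missing value) and "" both mean "no INFO entries",
    # so the new tag becomes the only INFO attribute.
    if info in ("", "."):
        return tag
    return f"{info};{tag}"


def annotate_line(line: str, impact_value: str) -> str:
    """Rewrite the INFO column (index 7) of a tab-separated VCF data line."""
    fields = line.rstrip("\n").split("\t")
    fields[7] = add_extracted_impact(fields[7], impact_value)
    return "\t".join(fields)
```

The other columns are passed through unchanged, which matches the statement that the output keeps the input's format.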
Examples
Filter for HIGH impact variants only
./VCFX_impact_filter --filter-impact HIGH < input.vcf > high_impact_variants.vcf
Filter for MODERATE or higher impact variants
./VCFX_impact_filter --filter-impact MODERATE < input.vcf > functional_variants.vcf
Combining with other tools
# Filter by impact then convert to another format
./VCFX_impact_filter --filter-impact HIGH < input.vcf | \
./VCFX_format_converter --format=bed > high_impact.bed
Handling Special Cases
- Case insensitivity: Impact values are matched case-insensitively (e.g., "high" and "HIGH" are treated the same)
- Extended impact values: Values like "HIGH_MISSENSE" are recognized by looking for the presence of standard impact keywords
- Missing IMPACT field: Variants without an IMPACT field in the INFO column are treated as "UNKNOWN" and are filtered out at every threshold, because UNKNOWN ranks below MODIFIER
- Empty INFO field: Properly handled by adding the EXTRACTED_IMPACT field as the only INFO attribute
- Multiple impact annotations: If multiple IMPACT fields are present, only the first one is considered
- Invalid impact values: Any impact value not recognized as one of the four standard levels is classified as "UNKNOWN"
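The extraction rules in this list (first IMPACT entry wins, missing entry means no value) might look like the following Python sketch. `extract_impact` is a hypothetical helper, and matching the key exactly as `IMPACT` is an assumption; the tool's own matching of the key may differ.

```python
from typing import Optional


def extract_impact(info: str) -> Optional[str]:
    """Return the value of the first IMPACT entry in an INFO string.

    Returns None when no IMPACT entry exists, which a caller would
    then treat as UNKNOWN.
    """
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        # Assumption: the key must be exactly "IMPACT" and carry a value.
        if sep and key == "IMPACT":
            return value  # stop at the first match; later ones are ignored
    return None
```

For an INFO string such as `DP=5;IMPACT=high;IMPACT=LOW`, this returns `high` with its original casing, so the value can be copied into `EXTRACTED_IMPACT` unchanged before any case-insensitive classification.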
Performance
The tool is optimized for efficiency:
- Memory-mapped I/O: When using -I/--input, files are memory-mapped for 10-20x faster processing
- SIMD acceleration: Uses AVX2/SSE2/NEON instructions for fast newline scanning
- Zero-copy parsing: Uses string_view for minimal memory allocation
- 1MB output buffering: Reduces system call overhead
- Processes very large VCF files with linear time complexity
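To illustrate the idea behind memory-mapped line scanning, here is a small Python sketch using the standard `mmap` module. It is not the tool's code: `iter_lines_mmap` is a hypothetical name, there is no SIMD involved, and each yielded slice is a copy, so it only approximates the zero-copy `string_view` approach mentioned above.

```python
import mmap
import os
from typing import Iterator


def iter_lines_mmap(path: str) -> Iterator[bytes]:
    """Yield each line of a file (without its newline) via a memory map."""
    with open(path, "rb") as fh:
        # mmap cannot map an empty file, so handle that case up front.
        if os.fstat(fh.fileno()).st_size == 0:
            return
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start, size = 0, len(mm)
            while start < size:
                end = mm.find(b"\n", start)  # scan for the next newline
                if end == -1:                # last line has no trailing newline
                    end = size
                yield mm[start:end]          # slicing copies the bytes out
                start = end + 1
```

Each byte is visited once during the newline search, which is consistent with the linear time complexity noted in the list.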
Limitations
- Only extracts and analyzes the first IMPACT field found in the INFO column
- Cannot differentiate between more detailed impact subclassifications (relies on basic HIGH/MODERATE/LOW/MODIFIER keywords)
- Assumes that functional impact annotations follow the standard convention of using one of the four recognized impact levels
- Does not account for the specific variant type (SNP, indel, etc.) when filtering
- No built-in options to combine impact filtering with other criteria (e.g., allele frequency)