Skip to content

VCFX_phred_filter

Overview

VCFX_phred_filter filters a VCF file based on the PHRED quality scores (QUAL column), removing variants that fall below a specified quality threshold to focus analysis on higher confidence variant calls.

Usage

VCFX_phred_filter [OPTIONS] [files...] > filtered.vcf
VCFX_phred_filter [OPTIONS] < input.vcf > filtered.vcf

Options

Option Description
-p, --phred-filter Set PHRED quality score threshold (default: 30.0)
-k, --keep-missing-qual Keep variants with missing quality values (represented as ".")
-h, --help Display help message and exit (handled by vcfx::handle_common_flags)
-v, --version Show program version and exit (handled by vcfx::handle_common_flags)

Description

VCFX_phred_filter examines the QUAL column of each variant record in a VCF file and filters out variants with quality scores below a specified threshold. The tool:

  1. Processes a VCF file line by line
  2. Extracts the PHRED quality score from the QUAL column (position 6) of each variant
  3. Compares the quality score against the specified threshold
  4. Retains variants where the quality meets or exceeds the threshold
  5. Passes through all header lines unchanged

This tool is particularly useful for: - Removing low-confidence variant calls for downstream analysis - Quality control of variant data - Reducing false positives in variant datasets - Applying consistent quality standards across multiple VCF files

Output Format

The output is a standard VCF file with the same format as the input, but containing only variants with PHRED quality scores at or above the specified threshold. All header lines are preserved.

Examples

Basic Usage with Default Threshold

# Filter variants using the default quality threshold (30) - from file (fastest)
VCFX_phred_filter input.vcf > filtered.vcf

# Filter variants from stdin
VCFX_phred_filter < input.vcf > filtered.vcf

Setting a Custom Threshold

# Keep only variants with quality scores of 20 or higher
VCFX_phred_filter -p 20 < input.vcf > q20_filtered.vcf

Keeping Variants with Missing Quality

# Keep variants with quality ≥ 30 or missing quality values
VCFX_phred_filter -p 30 -k < input.vcf > high_quality_with_missing.vcf

In a Pipeline

# Combine with other filters in a pipeline
cat input.vcf | \
VCFX_phred_filter -p 40 | \
grep "PASS" > high_quality_pass.vcf

Using Long Option Format

# Using long options for clarity
VCFX_phred_filter --phred-filter 25 --keep-missing-qual < input.vcf > filtered.vcf

Processing Multiple Files

# Process multiple VCF files (concatenated output)
VCFX_phred_filter -p 20 file1.vcf file2.vcf file3.vcf > combined_filtered.vcf

Quality Score Handling

The tool processes quality scores as follows:

  1. Standard numeric values: Directly compared to the threshold
  2. Missing quality values (represented as "."):
  3. By default, treated as 0.0 (filtered out)
  4. With the -k option, treated as extremely high (effectively infinity) to ensure they pass
  5. Invalid quality values: Logged with a warning and treated as 0.0 (filtered out)

Handling Special Cases

  • Missing quality values: Can be retained with the -k option
  • Invalid quality formats: Treated as 0.0 with a warning
  • Empty lines: Preserved with a single newline
  • Header lines: Preserved unchanged
  • Malformed VCF lines: Lines with fewer than 6 columns are skipped with a warning
  • Data before header: Skipped with a warning

Performance

The tool uses advanced optimization techniques for maximum throughput:

  1. Memory-mapped I/O: When file arguments are provided, uses mmap with MADV_SEQUENTIAL hints for optimal file reading
  2. Zero-copy QUAL extraction: Extracts QUAL field directly using pointer arithmetic without string allocation
  3. Output buffering: Uses 1MB buffer with periodic flushing for efficient output
  4. Fallback stdin processing: Uses buffered I/O for pipe input when no files specified
  5. Fast float parsing: Uses strtod for efficient quality value conversion

Benchmark Results

On a test file with 427K variants and 2,504 samples (1.4GB): - stdin processing: ~320 seconds - mmap file argument: ~12 seconds - Improvement: ~26x faster with file arguments

Limitations

  1. Focuses only on the QUAL column, not on per-sample or per-genotype quality metrics
  2. No support for filtering based on other quality-related fields (e.g., GQ in the genotype fields)
  3. Cannot apply different thresholds to different variant types
  4. No option to exclude variants based on upper quality bounds
  5. Doesn't provide statistics on filtered variants
  6. Cannot filter based on multiple quality metrics simultaneously
  7. Lacks region-specific filtering capabilities