VCFX_diff_tool¶

Overview¶

VCFX_diff_tool compares two VCF files and identifies variants that are unique to each file, providing a simple way to detect differences between variant sets.

Usage¶

VCFX_diff_tool --file1 <file1.vcf> --file2 <file2.vcf>

Options¶

Option	Description
`-a`, `--file1 <FILE>`	Required. Path to the first VCF file
`-b`, `--file2 <FILE>`	Required. Path to the second VCF file
`-h`, `--help`	Display help message and exit (handled by `vcfx::handle_common_flags`)
`-v`, `--version`	Show program version and exit (handled by `vcfx::handle_common_flags`)

Description¶

VCFX_diff_tool analyzes two VCF files, compares their variant content, and identifies differences between them. The tool:

Loads variants from both VCF files, ignoring header lines
Creates a normalized key for each variant based on chromosome, position, reference allele, and sorted alternate alleles
Identifies variants that are unique to each file by comparing these keys
Reports the differences in a readable format

This tool is particularly useful for: - Validating VCF file transformations - Checking tool outputs against expected results - Comparing variant calls between different callers or pipelines - Verifying that VCF manipulations haven't inadvertently altered variant content

Output Format¶

The output consists of two sections:

Variants unique to file1.vcf:
chrom:pos:ref:alt
chrom:pos:ref:alt
...

Variants unique to file2.vcf:
chrom:pos:ref:alt
chrom:pos:ref:alt
...

Where each variant is represented as a colon-separated string with chromosome, position, reference allele, and sorted alternate alleles.

Examples¶

Basic Usage¶

./VCFX_diff_tool --file1 original.vcf --file2 modified.vcf

Comparing Variant Caller Outputs¶

./VCFX_diff_tool --file1 caller1_output.vcf --file2 caller2_output.vcf > caller_differences.txt

Validate Processing Results¶

# Check that filtering didn't remove variants it shouldn't have
./VCFX_diff_tool --file1 expected_filtered.vcf --file2 actual_filtered.vcf

Handling Special Cases¶

Multi-allelic variants: Alternate alleles are sorted alphabetically to ensure consistent comparison even if the order differs between files (e.g., "A,G" and "G,A" are treated as identical)
Header differences: Header lines (starting with #) are ignored, so differences in metadata don't affect the comparison
Malformed VCF lines: Invalid lines are skipped with a warning
Empty files: Properly handled; will show all variants from the non-empty file as unique
Missing files: Reports an error if either file cannot be opened
Large files: Efficiently processes files with thousands of variants using hash-based comparison

Performance¶

The tool is optimized for efficiency: - Uses hash sets for O(1) lookups when comparing variants - Single-pass processing of each input file - Memory usage scales with the number of unique variants in both files - Can handle large VCF files with minimal overhead

Limitations¶

Compares only chromosome, position, reference, and alternate alleles; ignores other fields like quality, filter, and INFO
Cannot detect differences in sample genotypes
No support for partial matches or fuzzy comparisons (e.g., variants that differ only in quality)
Not designed to handle VCF files with extremely large numbers of variants (hundreds of millions)
Doesn't consider changes in INFO or FORMAT fields as differences
Cannot compare complex structural variants represented in different ways