VCFX_duplicate_remover¶

Overview¶

VCFX_duplicate_remover identifies and removes duplicate variant records from a VCF file based on their essential coordinates (chromosome, position, reference allele, and alternate allele), ensuring that each unique variant is represented only once in the output.

Usage¶

VCFX_duplicate_remover [OPTIONS] < input.vcf > deduplicated.vcf

Options¶

Option	Description
`-h`, `--help`	Display help message and exit (handled by `vcfx::handle_common_flags`)
`-v`, `--version`	Show program version and exit (handled by `vcfx::handle_common_flags`)

Description¶

VCFX_duplicate_remover processes a VCF file to detect and remove duplicate variant records. The tool:

Reads a VCF file from standard input line by line
Passes header lines (beginning with '#') through unchanged
For each data line, creates a normalized variant key consisting of:
Chromosome name (CHROM)
Position (POS)
Reference allele (REF)
Normalized alternate alleles (ALT)
Normalizes multi-allelic variants by sorting the comma-separated ALT alleles alphabetically
Tracks seen variants using a hash-based data structure
Outputs only the first occurrence of each unique variant, discarding subsequent duplicates
Writes the deduplicated VCF to standard output

This tool is particularly useful for: - Merging VCF files that may contain overlapping variants - Cleaning up datasets that have inadvertently duplicated records - Ensuring downstream analysis tools don't process the same variant multiple times

Output Format¶

The output is a valid VCF file with the same format as the input, but with duplicate variant records removed. The first occurrence of each unique variant is preserved, maintaining all original fields (ID, QUAL, FILTER, INFO, FORMAT, samples).

Examples¶

Basic Usage¶

# Remove duplicate variants from a VCF file
./VCFX_duplicate_remover < input.vcf > deduplicated.vcf

In a Pipeline¶

# Filter a VCF file and then remove duplicates
./VCFX_record_filter --quality ">20" < input.vcf | \
./VCFX_duplicate_remover > filtered_unique.vcf

Checking Results¶

# Count variants before and after deduplication
echo "Before: $(grep -v '^#' input.vcf | wc -l) variants"
echo "After: $(grep -v '^#' deduplicated.vcf | wc -l) variants"

Duplicate Detection¶

The tool determines uniqueness based on four key attributes:

Chromosome (CHROM): Exact string match of chromosome name
Position (POS): Numerical position on the chromosome
Reference Allele (REF): The reference sequence at that position
Alternate Alleles (ALT): The alternate alleles, normalized by sorting

For multi-allelic variants, the ALT field is normalized by: - Splitting the comma-separated list of alternate alleles - Sorting the alleles alphabetically - Re-joining them with commas

This normalization ensures that variants with the same alleles but in different order are correctly identified as duplicates. For example, "A,G,T" and "T,A,G" are treated as the same variant after normalization.

Handling Special Cases¶

The tool implements several strategies for handling edge cases:

Multi-allelic variants: Normalizes ALT fields by sorting to handle different orderings
Empty lines: Skipped without affecting output
Malformed lines: Lines that can't be parsed are skipped with a warning
Position parsing errors: If a POS field can't be parsed as an integer, it's set to 0
Header lines: All header lines are preserved in the output
Empty files: Properly handles empty input files, producing empty output
Files with only headers: Header lines are passed through correctly
Sample columns: Maintains all sample genotype data in the output

Performance¶

VCFX_duplicate_remover is designed for efficiency:

Single-pass processing with O(n) time complexity where n is the number of variants
Uses an optimized hash-based data structure for fast variant lookups
Minimal memory overhead, proportional to the number of unique variants
Handles large VCF files with millions of variants efficiently
String processing optimized for performance with minimal copying

Limitations¶

Retains the first occurrence of duplicate variants; quality or information in subsequent duplicates is discarded
No option to select which duplicate to keep (e.g., the one with highest quality)
No facility to annotate output variants with duplicate counts
Limited to exact matching; doesn't detect overlapping variants that might represent the same event
Doesn't consider INFO fields in uniqueness determination
Cannot handle duplicates based on sample-specific criteria
No option to only report duplicate variants without removing them