VCFX_haplotype_extractor¶

Overview¶

VCFX_haplotype_extractor reconstructs phased haplotype blocks from genotype data in a VCF file. It identifies stretches of phased variants on the same chromosome and outputs them as continuous haplotype blocks for each sample.

Usage¶

VCFX_haplotype_extractor [OPTIONS] < input.vcf > haplotypes.tsv

Options¶

Option	Description
`--block-size <SIZE>`	Maximum distance in base pairs between consecutive variants to be included in the same block (default: 100,000)
`--check-phase-consistency`	Enable checks for phase consistency between adjacent variants in a block
`-h`, `--help`	Display help message and exit (handled by `vcfx::handle_common_flags`)
`-v`, `--version`	Show program version and exit (handled by `vcfx::handle_common_flags`)

Description¶

VCFX_haplotype_extractor analyzes phased genotype data in a VCF file to reconstruct continuous haplotype blocks. The tool:

Reads a VCF file from standard input
Extracts phased genotype (GT) fields for each sample at each variant position
Groups consecutive phased variants into blocks based on:
Chromosome continuity (variants must be on the same chromosome)
Maximum distance threshold between adjacent variants (set with --block-size; default 100,000 bp)
Optional phase consistency checks across variants
Constructs haplotype strings representing the sequence of alleles on each chromosome
Outputs blocks of phased haplotypes in a tab-delimited format
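The grouping steps above can be sketched in Python. This is an illustrative outline, not the tool's actual implementation: the function name `group_into_blocks`, the tuple layout, and the strict "greater than" comparison against the threshold are assumptions, and the sketch simply treats any genotype without a "|" (including ".") as unphased.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (chrom, pos, one GT string per sample).
Variant = Tuple[str, int, List[str]]

def group_into_blocks(variants: Iterable[Variant],
                      block_size: int = 100_000) -> List[dict]:
    """Group phased variants into blocks by chromosome and distance."""
    blocks: List[dict] = []
    current = None
    for chrom, pos, genotypes in variants:
        # Skip the variant if any sample lacks a phased ("|") genotype.
        if any("|" not in gt for gt in genotypes):
            continue
        starts_new = (
            current is None
            or chrom != current["chrom"]          # chromosome changed
            or pos - current["end"] > block_size  # gap exceeds threshold (assumed strict)
        )
        if starts_new:
            current = {"chrom": chrom, "start": pos, "end": pos,
                       "haplotypes": list(genotypes)}
            blocks.append(current)
        else:
            current["end"] = pos
            # Append this variant's genotype to each sample's haplotype string.
            current["haplotypes"] = [
                hap + "|" + gt
                for hap, gt in zip(current["haplotypes"], genotypes)
            ]
    return blocks

variants = [
    ("chr1", 100, ["0|1", "1|1"]),
    ("chr1", 5_000, ["1|0", "0|1"]),
    ("chr1", 300_000, ["0|1", "0|0"]),  # more than 100 kb from the last variant
    ("chr2", 300_100, ["1|1", "0|1"]),  # different chromosome
]
blocks = group_into_blocks(variants)
for b in blocks:
    print(b["chrom"], b["start"], b["end"], *b["haplotypes"], sep="\t")
```

With this input the sketch yields three blocks: the first two chr1 variants share a block, while the distant chr1 variant and the chr2 variant each start a new one.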

This tool is valuable for:

- Identifying regions of continuous phasing in VCF files
- Preparing haplotype data for downstream analyses
- Reconstructing parental chromosomes from phased variant data
- Quality control of phasing algorithms

Output Format¶

The output is a tab-delimited text file with columns:

CHROM  START  END  SAMPLE_1_HAPLOTYPES  SAMPLE_2_HAPLOTYPES  ...

Where:

- CHROM: Chromosome name
- START: Start position of the haplotype block
- END: End position of the haplotype block
- SAMPLE_X_HAPLOTYPES: A pipe-delimited string representing the phased genotypes for that sample

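Each sample's haplotype column concatenates that sample's phased genotypes across the block, in variant order, with a pipe between consecutive genotypes. Because each genotype also contains a pipe, a block of three variants with genotypes 0|1, 1|0, and 0|1 appears as "0|1|1|0|0|1". For a diploid sample, the alleles at odd positions in the string belong to the first haplotype and those at even positions to the second.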
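To work with a haplotype column downstream, the string can be split back into per-variant genotypes and per-chromosome allele sequences. The helper below is a minimal sketch; `split_haplotype_string` is a hypothetical name, and it assumes every genotype in the block has the same ploidy (diploid by default).

```python
from typing import List, Tuple

def split_haplotype_string(hap: str, ploidy: int = 2) -> Tuple[List[str], List[List[str]]]:
    """Split a block haplotype string into genotypes and per-chromosome alleles."""
    alleles = hap.split("|")
    if len(alleles) % ploidy != 0:
        raise ValueError(f"allele count {len(alleles)} is not a multiple of ploidy {ploidy}")
    # Regroup consecutive alleles into one genotype per variant.
    genotypes = ["|".join(alleles[i:i + ploidy])
                 for i in range(0, len(alleles), ploidy)]
    # Take every ploidy-th allele to follow a single chromosome copy.
    chromosomes = [alleles[k::ploidy] for k in range(ploidy)]
    return genotypes, chromosomes

genotypes, chromosomes = split_haplotype_string("0|1|1|0|0|1")
print(genotypes)    # ['0|1', '1|0', '0|1']
print(chromosomes)  # [['0', '1', '0'], ['1', '0', '1']]
```

The number of variants in a block is then `len(genotypes)`, which is often more convenient than counting pipes directly.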

Examples¶

Basic Usage¶

./VCFX_haplotype_extractor < phased.vcf > haplotype_blocks.tsv

Custom Block Size¶

# Use a smaller maximum distance (50kb) to generate more, smaller blocks
./VCFX_haplotype_extractor --block-size 50000 < phased.vcf > small_blocks.tsv

With Phase Consistency Checking¶

# Enable checks for phase consistency between variants
./VCFX_haplotype_extractor --check-phase-consistency < phased.vcf > consistent_blocks.tsv

Filtering for Large Blocks¶

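# Keep only blocks containing at least 10 variants (plus the header line, if present).
# Assumes diploid samples: each genotype adds two alleles, so the first sample's
# haplotype string (column 4) must hold at least 20 pipe-separated alleles.
./VCFX_haplotype_extractor < phased.vcf | awk -F'\t' '$1 == "CHROM" || split($4, a, "|") >= 20' > large_blocks.tsv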

Phase Consistency¶

When the --check-phase-consistency option is enabled, the tool performs a basic check to detect potential phase inconsistencies:

For each new variant, the tool examines its phased alleles for each sample
It compares these with the last variant added to the current block
If it detects a phase "flip" (e.g., changing from "0|1" to "1|0"), it may start a new block
This helps identify regions where phasing may be inconsistent

This basic consistency checking is useful for identifying phase switches that might indicate errors in the phasing process or real recombination events.
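As a rough illustration of such a check, the sketch below flags a flip when a sample's heterozygous genotype reappears with its two alleles swapped. The exact rule the tool applies is not specified here, so this definition and the names `is_phase_flip` and `is_consistent_with_block` are assumptions made for the example.

```python
from typing import List

def is_phase_flip(previous_gt: str, new_gt: str) -> bool:
    """Return True if new_gt is the heterozygous previous_gt with alleles swapped."""
    prev = previous_gt.split("|")
    new = new_gt.split("|")
    if len(prev) != 2 or len(new) != 2:
        return False  # only diploid phased genotypes are compared in this sketch
    # Homozygous genotypes read the same in either orientation, so they never flip.
    return prev[0] != prev[1] and new == prev[::-1]

def is_consistent_with_block(last_genotypes: List[str], new_genotypes: List[str]) -> bool:
    """Compare a new variant with the last variant in the block, sample by sample."""
    return not any(is_phase_flip(p, n)
                   for p, n in zip(last_genotypes, new_genotypes))

print(is_phase_flip("0|1", "1|0"))  # True
print(is_phase_flip("0|1", "0|1"))  # False
print(is_consistent_with_block(["0|1", "1|1"], ["0|1", "1|1"]))  # True
```

A caller following the behavior described above would start a new block whenever `is_consistent_with_block` returns False.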

Handling Special Cases¶

The tool implements several strategies for handling edge cases:

Unphased genotypes: Any variant with unphased genotypes ("/" delimiter) is skipped and will not be included in haplotype blocks
Missing genotypes: Variants with missing genotypes (".") are handled, but may affect block formation
Multiallelic sites: Properly processed with the actual allele codes in the haplotype strings
Chromosome changes: Automatically starts a new block when the chromosome changes
Large distances: Starts a new block when the distance between consecutive variants exceeds the threshold
Empty input: Produces no output blocks but exits cleanly
Malformed VCF: Attempts to skip malformed lines with warnings
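The genotype-level cases in this list can be told apart with a small classifier. This is a sketch for pre-screening GT values before running the tool, not a description of its internals; `classify_genotype` and its category labels are hypothetical, and a genotype without any separator (for example a haploid call) is reported as unphased here.

```python
import re

def classify_genotype(gt: str) -> str:
    """Label a GT value as 'phased', 'unphased', 'missing', or 'malformed'."""
    alleles = re.split(r"[|/]", gt)
    if "." in alleles:
        return "missing"       # ".", ".|.", "./.", "0|." and similar
    if not all(a.isdigit() for a in alleles):
        return "malformed"     # empty or non-numeric allele codes
    if "|" in gt and "/" not in gt:
        return "phased"        # includes multiallelic codes such as "2|1"
    return "unphased"          # "/" separator, or no separator at all

for gt in ["0|1", "2|1", "0/1", ".", ".|.", "0|x"]:
    print(gt, classify_genotype(gt), sep="\t")
```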

Performance¶

VCFX_haplotype_extractor is designed for efficiency:

Single-pass processing with O(n) time complexity where n is the number of variants
Memory usage scales primarily with the number of samples and the size of the largest haplotype block
Streaming architecture allows processing large files without loading them entirely into memory
Block-based approach prevents excessive memory usage for very long chromosomes

Limitations¶

Requires phased genotypes - variants with unphased genotypes are skipped
Cannot join blocks across different chromosomes
Simple distance-based blocking may not align with biological recombination patterns
Basic phase consistency checking may not detect all inconsistencies
No ability to export or visualize the relationship between blocks
Does not account for potential errors in the original phasing
No special handling for reference gaps or known problematic regions