VCFX_haplotype_extractor
Overview
VCFX_haplotype_extractor reconstructs phased haplotype blocks from genotype data in a VCF file. It identifies stretches of phased variants on the same chromosome and outputs them as continuous haplotype blocks for each sample.
Usage
VCFX_haplotype_extractor [OPTIONS] < input.vcf > haplotypes.tsv
Options
Option | Description |
---|---|
--block-size <SIZE> | Maximum distance in base pairs between consecutive variants to be included in the same block (default: 100,000) |
--check-phase-consistency | Enable checks for phase consistency between adjacent variants in a block |
-h, --help | Display help message and exit (handled by vcfx::handle_common_flags) |
-v, --version | Show program version and exit (handled by vcfx::handle_common_flags) |
Description
VCFX_haplotype_extractor analyzes phased genotype data in a VCF file to reconstruct continuous haplotype blocks. The tool:
- Reads a VCF file from standard input
- Extracts phased genotype (GT) fields for each sample at each variant position
- Groups consecutive phased variants into blocks based on:
  - Chromosome continuity (variants must be on the same chromosome)
  - Maximum distance threshold (default 100 kb between adjacent variants)
  - Optional phase consistency checks across variants
- Constructs haplotype strings representing the sequence of alleles on each chromosome
- Outputs blocks of phased haplotypes in a tab-delimited format
This tool is valuable for:
- Identifying regions of continuous phasing in VCF files
- Preparing haplotype data for downstream analyses
- Reconstructing parental chromosomes from phased variant data
- Quality control of phasing algorithms
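The grouping steps listed in the Description can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the `Variant` tuple, `group_into_blocks`, and `format_block` are hypothetical names. The sketch assumes that a variant is skipped when any sample lacks the `|` separator and that a skipped variant does not close the current block.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record type for this sketch: (chrom, pos, per-sample GT strings).
Variant = Tuple[str, int, List[str]]


def group_into_blocks(
    variants: Iterable[Variant], max_distance: int = 100_000
) -> Iterator[List[Variant]]:
    """Yield lists of consecutive phased variants that form one block."""
    block: List[Variant] = []
    for chrom, pos, gts in variants:
        # Assumption: skip the variant if any sample lacks the "|" phase separator.
        if any("|" not in gt for gt in gts):
            continue
        if block:
            last_chrom, last_pos, _ = block[-1]
            # Close the block on a chromosome change or when the gap
            # to the previous variant exceeds the threshold.
            if chrom != last_chrom or pos - last_pos > max_distance:
                yield block
                block = []
        block.append((chrom, pos, gts))
    if block:
        yield block


def format_block(block: List[Variant]) -> str:
    """Render one block as CHROM, START, END, then one haplotype string per sample."""
    chrom, start, end = block[0][0], block[0][1], block[-1][1]
    n_samples = len(block[0][2])
    haplotypes = ["|".join(v[2][s] for v in block) for s in range(n_samples)]
    return "\t".join([chrom, str(start), str(end), *haplotypes])
```

Because `group_into_blocks` is a generator that holds only the current block, it mirrors the single-pass, streaming behaviour the tool describes, with memory bounded by the largest block.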
Output Format
The output is a tab-delimited text file with columns:
CHROM START END SAMPLE_1_HAPLOTYPES SAMPLE_2_HAPLOTYPES ...
Where:
- CHROM: Chromosome name
- START: Start position of the haplotype block
- END: End position of the haplotype block
- SAMPLE_X_HAPLOTYPES: A pipe-delimited string representing the phased genotypes for that sample
Each sample's haplotype column concatenates that sample's phased genotypes across the block, joined with pipes. Because each genotype is itself pipe-separated, a diploid block of three variants with genotypes 0|1, 1|0, and 0|1 appears as "0|1|1|0|0|1".
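Since the genotype separator and the within-genotype separator are the same character, downstream code has to regroup alleles by ploidy to recover individual genotypes. The following sketch shows one way to do that. It assumes diploid samples and tab-delimited data rows, and `split_haplotype` and `parse_block_line` are hypothetical helper names, not part of the tool.

```python
from typing import List, Tuple


def split_haplotype(hap: str, ploidy: int = 2) -> List[Tuple[str, ...]]:
    """Regroup a flat pipe-delimited allele string into per-variant genotypes."""
    alleles = hap.split("|")
    if len(alleles) % ploidy != 0:
        raise ValueError(f"allele count {len(alleles)} is not a multiple of {ploidy}")
    return [tuple(alleles[i:i + ploidy]) for i in range(0, len(alleles), ploidy)]


def parse_block_line(line: str):
    """Parse one data row: CHROM, START, END, then one haplotype column per sample."""
    chrom, start, end, *haps = line.rstrip("\n").split("\t")
    return chrom, int(start), int(end), [split_haplotype(h) for h in haps]


# The example string from the text regroups into three diploid genotypes.
print(split_haplotype("0|1|1|0|0|1"))
```

The number of variants in a block is then the length of any sample's regrouped list.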
Examples
Basic Usage
./VCFX_haplotype_extractor < phased.vcf > haplotype_blocks.tsv
Custom Block Size
# Use a smaller maximum distance (50kb) to generate more, smaller blocks
./VCFX_haplotype_extractor --block-size 50000 < phased.vcf > small_blocks.tsv
With Phase Consistency Checking
# Enable checks for phase consistency between variants
./VCFX_haplotype_extractor --check-phase-consistency < phased.vcf > consistent_blocks.tsv
Filtering for Large Blocks
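# Extract only blocks spanning at least 10 variants.
# Assumes diploid samples: each variant adds two alleles to the first sample column (field 4).
./VCFX_haplotype_extractor < phased.vcf | awk -F'\t' '{ n = split($4, a, "|"); if (n >= 20) print }' > large_blocks.tsv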
Phase Consistency
When the --check-phase-consistency option is enabled, the tool performs a basic check to detect potential phase inconsistencies:
- For each new variant, the tool examines its phased alleles for each sample
- It compares these with the last variant added to the current block
- If it detects a phase "flip" (e.g., changing from "0|1" to "1|0"), it may start a new block
- This helps identify regions where phasing may be inconsistent
This basic consistency checking is useful for identifying phase switches that might indicate errors in the phasing process or real recombination events.
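A literal reading of the "flip" example above can be sketched as follows. This is only an illustration of the described comparison, not the tool's actual logic: it assumes diploid genotypes, treats a flip as a heterozygous genotype whose alleles appear in reversed order at the next variant, and uses the hypothetical names `is_phase_flip` and `starts_new_block`.

```python
from typing import List


def is_phase_flip(prev_gt: str, new_gt: str) -> bool:
    """Return True if new_gt is the reversed form of a heterozygous prev_gt."""
    prev = prev_gt.split("|")
    new = new_gt.split("|")
    # Assumption: only diploid phased genotypes are compared.
    if len(prev) != 2 or len(new) != 2:
        return False
    # Homozygous genotypes read the same in both orders, so they never count as flips.
    return prev[0] != prev[1] and new == prev[::-1]


def starts_new_block(prev_gts: List[str], new_gts: List[str]) -> bool:
    """Compare the last variant in the block with the new variant, sample by sample."""
    return any(is_phase_flip(p, n) for p, n in zip(prev_gts, new_gts))
```

Under this reading a single flipped sample is enough to close the block. A stricter or looser rule (for example, requiring flips in several samples) would be an equally plausible design.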
Handling Special Cases
The tool implements several strategies for handling edge cases:
- Unphased genotypes: Any variant with unphased genotypes ("/" delimiter) is skipped and will not be included in haplotype blocks
- Missing genotypes: Variants with missing genotypes (".") are handled, but may affect block formation
- Multiallelic sites: Properly processed with the actual allele codes in the haplotype strings
- Chromosome changes: Automatically starts a new block when the chromosome changes
- Large distances: Starts a new block when the distance between consecutive variants exceeds the threshold
- Empty input: Produces no output blocks but exits cleanly
- Malformed VCF: Attempts to skip malformed lines with warnings
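Several of these cases come down to how a sample's GT value is classified. The sketch below shows one possible classification, under the assumption that GT is located through the FORMAT column and that any "." allele marks the genotype as missing. The names `extract_gt` and `classify_genotype` are hypothetical, and the tool's own handling of missing values may differ.

```python
import re


def extract_gt(format_field: str, sample_field: str) -> str:
    """Return the GT value for one sample, or "." if GT is absent."""
    keys = format_field.split(":")
    if "GT" not in keys:
        return "."
    values = sample_field.split(":")
    idx = keys.index("GT")
    return values[idx] if idx < len(values) else "."


def classify_genotype(gt: str) -> str:
    """Classify a GT string as missing, unphased, phased, or haploid."""
    alleles = re.split(r"[|/]", gt)
    if any(a in (".", "") for a in alleles):
        return "missing"
    if "/" in gt:
        return "unphased"
    if "|" in gt:
        return "phased"
    return "haploid"
```

Multiallelic calls such as 1|2 need no special treatment here, because the allele codes are carried through as strings.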
Performance
VCFX_haplotype_extractor is designed for efficiency:
- Single-pass processing with O(n) time complexity where n is the number of variants
- Memory usage scales primarily with the number of samples and the size of the largest haplotype block
- Streaming architecture allows processing large files without loading them entirely into memory
- Block-based approach prevents excessive memory usage for very long chromosomes
Limitations
- Requires phased genotypes; variants with unphased genotypes are skipped
- Cannot join blocks across different chromosomes
- Simple distance-based blocking may not align with biological recombination patterns
- Basic phase consistency checking may not detect all inconsistencies
- No ability to export or visualize the relationship between blocks
- Does not account for potential errors in the original phasing
- No special handling for reference gaps or known problematic regions