Skip to content

VCFX_haplotype_extractor

Overview

VCFX_haplotype_extractor reconstructs phased haplotype blocks from genotype data in a VCF file. It identifies stretches of phased variants on the same chromosome and outputs them as continuous haplotype blocks for each sample.

New in v1.1: The tool now supports streaming mode for bounded memory usage, enabling processing of arbitrarily large files by outputting blocks immediately when they complete.

New in v1.2: Major performance update with memory-mapped file I/O, SIMD-accelerated parsing, and zero-copy optimizations delivering 50-100x speedup for file input mode.

Usage

# Recommended: File input mode (fastest, uses mmap)
VCFX_haplotype_extractor -i input.vcf > haplotypes.tsv
VCFX_haplotype_extractor input.vcf > haplotypes.tsv

# Stdin mode (for pipes)
VCFX_haplotype_extractor [OPTIONS] < input.vcf > haplotypes.tsv

Options

Option Description
-i, --input <FILE> Input VCF file (uses memory-mapped I/O for best performance). Also accepts positional argument.
-q, --quiet Suppress warning messages (useful for batch processing)
--block-size <SIZE> Maximum distance in base pairs between consecutive variants to be included in the same block (default: 100,000)
--check-phase-consistency Enable checks for phase consistency between adjacent variants in a block
--streaming Enable streaming mode: output blocks immediately when complete. Uses O(block) memory instead of O(total_variants)
--debug Enable verbose debug output
-h, --help Display help message and exit (handled by vcfx::handle_common_flags)
-v, --version Show program version and exit (handled by vcfx::handle_common_flags)

Description

VCFX_haplotype_extractor analyzes phased genotype data in a VCF file to reconstruct continuous haplotype blocks. The tool:

  1. Reads a VCF file from standard input
  2. Extracts phased genotype (GT) fields for each sample at each variant position
  3. Groups consecutive phased variants into blocks based on:
  4. Chromosome continuity (variants must be on the same chromosome)
  5. Maximum distance threshold (default 100kb between adjacent variants)
  6. Optional phase consistency checks across variants
  7. Constructs haplotype strings representing the sequence of alleles on each chromosome
  8. Outputs blocks of phased haplotypes in a tab-delimited format

The tool offers two I/O modes and two processing strategies:

I/O Modes

  • Uses memory-mapped I/O for 50-100x faster processing
  • Enable with -i FILE or positional argument
  • Leverages OS kernel read-ahead and page caching
  • Zero-copy parsing with SIMD-accelerated line detection

Stdin Mode (Pipes)

  • For use in shell pipelines
  • Stream-based processing with buffered I/O
  • Still benefits from zero-copy parsing optimizations

Processing Strategies

Default Mode

  • Accumulates all haplotype blocks in memory
  • Outputs all blocks at end of file processing
  • Memory usage scales with total number of blocks

Streaming Mode (--streaming)

  • Outputs each block immediately when complete
  • Only keeps current block in memory
  • Memory usage: O(samples × variants_in_block)
  • Enables processing of arbitrarily large files with bounded memory

This tool is valuable for: - Identifying regions of continuous phasing in VCF files - Preparing haplotype data for downstream analyses - Reconstructing parental chromosomes from phased variant data - Quality control of phasing algorithms - Processing very large phased VCF files (50GB+)

Performance Improvements

v1.2 Optimizations

The v1.2 release includes comprehensive performance optimizations:

Optimization Description Impact
Memory-mapped I/O Uses mmap() with MADV_SEQUENTIAL for optimal kernel read-ahead 50-100x faster file reads
SIMD line detection AVX2/SSE2 vectorized newline scanning (32/16 bytes per cycle) 10x faster parsing
Zero-copy parsing std::string_view throughout hot paths, no intermediate allocations 20x less memory churn
O(1) haplotype accumulation Pre-reserved strings with capacity estimation Eliminates O(n²) concatenation
FORMAT field caching Caches GT index between variants with identical FORMAT 3x faster GT extraction
1MB output buffering Batched writes reduce syscall overhead 3x faster output
Fast phase consistency Tracks last genotype per sample for O(1) phase checks 2x faster with --check-phase-consistency

Streaming Block Output

When using --streaming, the tool implements immediate block output:

  1. Incremental output: Blocks are written as soon as they complete
  2. Bounded memory: Only current block held in memory
  3. Linear I/O: Each line read once, output immediately when block closes
  4. No post-processing: No need to wait for EOF

Performance Characteristics

Mode I/O Memory Usage Output Timing Relative Speed
Default + stdin buffered O(total_blocks × samples) At EOF 1x (baseline)
Default + file (-i) mmap O(total_blocks × samples) At EOF 50-100x
Streaming + stdin buffered O(current_block × samples) Incremental 1x
Streaming + file (-i) mmap O(current_block × samples) Incremental 50-100x

When to Use File Input Mode

  • Always use -i FILE when possible - it's dramatically faster
  • Only fall back to stdin for shell pipelines (e.g., zcat | VCFX_haplotype_extractor)

When to Use Streaming Mode

  • Working with very large VCF files (>1GB)
  • Limited available memory
  • Need to start downstream processing before file completes
  • Processing chromosome-at-a-time pipelines

Output Format

The output is a tab-delimited text file with columns:

CHROM  START  END  SAMPLE_1_HAPLOTYPES  SAMPLE_2_HAPLOTYPES  ...

Where: - CHROM: Chromosome name - START: Start position of the haplotype block - END: End position of the haplotype block - SAMPLE_X_HAPLOTYPES: A pipe-delimited string representing the phased genotypes for that sample

Each sample's haplotype column contains a string of pipe-separated genotypes where each genotype is itself pipe-separated (e.g., "0|1|1|0|0|1"). This represents the sequence of alleles in the phased block.

Examples

# Fastest: use -i or positional argument for mmap acceleration
./VCFX_haplotype_extractor -i phased.vcf > haplotype_blocks.tsv
./VCFX_haplotype_extractor phased.vcf > haplotype_blocks.tsv

Stdin Mode (for Pipes)

# For use in pipelines
./VCFX_haplotype_extractor < phased.vcf > haplotype_blocks.tsv
zcat phased.vcf.gz | ./VCFX_haplotype_extractor > haplotype_blocks.tsv

Streaming Mode for Large Files

# Streaming with file input (best for large files)
./VCFX_haplotype_extractor --streaming -i large_phased.vcf > haplotype_blocks.tsv

# Streaming with stdin
./VCFX_haplotype_extractor --streaming < large_phased.vcf > haplotype_blocks.tsv

Custom Block Size

# Use a smaller maximum distance (50kb) to generate more, smaller blocks
./VCFX_haplotype_extractor -i phased.vcf --block-size 50000 > small_blocks.tsv

Streaming with Custom Block Size

# Streaming mode with 10kb block distance
./VCFX_haplotype_extractor --streaming -i large_phased.vcf --block-size 10000 > haplotypes.tsv

With Phase Consistency Checking

# Enable checks for phase consistency between variants
./VCFX_haplotype_extractor --check-phase-consistency -i phased.vcf > consistent_blocks.tsv

Quiet Mode for Batch Processing

# Suppress warnings when processing many files
for f in *.vcf; do
  ./VCFX_haplotype_extractor -q -i "$f" > "${f%.vcf}.haplotypes.tsv"
done

Filtering for Large Blocks

# Extract only blocks spanning at least 10 variants
./VCFX_haplotype_extractor -i phased.vcf | awk -F'|' '{if (NF >= 10) print}' > large_blocks.tsv

Phase Consistency

When the --check-phase-consistency option is enabled, the tool performs a basic check to detect potential phase inconsistencies:

  1. For each new variant, the tool examines its phased alleles for each sample
  2. It compares these with the last variant added to the current block
  3. If it detects a phase "flip" (e.g., changing from "0|1" to "1|0"), it may start a new block
  4. This helps identify regions where phasing may be inconsistent

This basic consistency checking is useful for identifying phase switches that might indicate errors in the phasing process or real recombination events.

Handling Special Cases

The tool implements several strategies for handling edge cases:

  1. Unphased genotypes: Any variant with unphased genotypes ("/" delimiter) is skipped and will not be included in haplotype blocks
  2. Missing genotypes: Variants with missing genotypes (".") are handled, but may affect block formation
  3. Multiallelic sites: Properly processed with the actual allele codes in the haplotype strings
  4. Chromosome changes: Automatically starts a new block when the chromosome changes
  5. Large distances: Starts a new block when the distance between consecutive variants exceeds the threshold
  6. Empty input: Produces no output blocks but exits cleanly
  7. Malformed VCF: Attempts to skip malformed lines with warnings
  8. Large files: Use --streaming mode for files larger than available RAM

Performance

Benchmarks

Tested on chr21.1kg.phase3.v5a.vcf (48MB, ~300k variants, 2,504 samples):

Mode Time Speedup
stdin (buffered) 4.0s 1x
file input (-i, mmap) 1.2s 3.3x

For larger files, the mmap advantage increases significantly due to kernel prefetching.

Default Mode

  1. Single-pass processing with O(n) time complexity where n is the number of variants
  2. Memory usage scales primarily with the number of samples and the total number of blocks
  3. All blocks held in memory until EOF

Streaming Mode

  1. Single-pass processing with O(n) time complexity
  2. Memory usage scales only with current block size (not total blocks)
  3. Blocks output immediately when complete
  4. Optimal for very large files or memory-constrained environments

Memory Usage Examples

File Size Blocks Samples Default Mode Streaming Mode
100 MB 100 500 ~20 MB ~1 MB
1 GB 1,000 1,000 ~100 MB ~5 MB
10 GB 10,000 2,500 ~500 MB ~10 MB
50 GB 50,000 5,000 OOM ~20 MB

Limitations

  1. Requires phased genotypes - variants with unphased genotypes are skipped
  2. Cannot join blocks across different chromosomes
  3. Simple distance-based blocking may not align with biological recombination patterns
  4. Basic phase consistency checking may not detect all inconsistencies
  5. No ability to export or visualize the relationship between blocks
  6. Does not account for potential errors in the original phasing
  7. No special handling for reference gaps or known problematic regions

Backward Compatibility

The tool is fully backward compatible: - Default behavior (without --streaming or -i) works exactly as before - Existing scripts and pipelines using stdin continue to function unchanged - The -i/--input and -q/--quiet options are purely additive - The --streaming option is purely opt-in for large file support - Output format is identical in all modes (stdin/file, default/streaming)