Skip to content

VCFX_phase_checker

Overview

The VCFX_phase_checker tool filters a VCF file to retain only those variant lines where all sample genotypes are fully phased (using the pipe '|' phasing separator). This is particularly useful for downstream analyses that require complete phasing information.

Usage

VCFX_phase_checker [OPTIONS] < input.vcf > phased_output.vcf

Options

Option Description
-h, --help Display help message and exit (handled by vcfx::handle_common_flags)
-v, --version Show program version and exit (handled by vcfx::handle_common_flags)

Description

VCFX_phase_checker reads a VCF file from standard input and examines the GT (genotype) field for every sample in each variant line. It determines whether each genotype is fully phased using the following criteria:

  • A genotype is considered fully phased only if:
  • It uses the pipe character '|' as the separator between alleles (e.g., "0|1")
  • It contains no missing alleles (no ".")
  • It contains no unphased separators ("/")

The tool outputs only those variant lines where all sample genotypes meet these criteria. Lines that don't meet these criteria are skipped, and warnings are written to standard error.

Output

The output is a valid VCF file containing: - All header lines from the input file (unchanged) - Only the variant lines where all samples have fully phased genotypes - Warnings (to stderr) for each line that was skipped due to unphased genotypes

Examples

Basic Usage

./VCFX_phase_checker < input.vcf > phased_output.vcf

Capturing Warnings

./VCFX_phase_checker < input.vcf > phased_output.vcf 2> unphased_warnings.log

Counting Phased vs. Unphased Variants

# Count total variants
total=$(grep -v "^#" input.vcf | wc -l)
# Count phased variants
phased=$(./VCFX_phase_checker < input.vcf | grep -v "^#" | wc -l)
# Calculate percentage
echo "Phased variants: $phased / $total ($(echo "scale=2; 100*$phased/$total" | bc)%)"

Handling Special Cases

  • Haploid genotypes (e.g., "0" or "1"): These are not considered phased; the line will be skipped
  • Missing genotypes (e.g., "./." or ".|."): These are not considered phased; the line will be skipped
  • Missing GT field: Lines without a GT field in the FORMAT column are skipped with a warning
  • Multiallelic variants: These are treated the same as biallelic variants, as long as all alleles are phased
  • Non-VCF-compliant genotype notation: Any genotype that doesn't follow standard VCF format is not considered phased
  • Header lines: All header lines (starting with "#") are preserved in the output
  • Samples with different ploidy levels: Each sample is checked independently; if all are phased, the line is kept

Performance

The tool processes files line by line with minimal memory requirements, allowing it to handle very large VCF files efficiently.

Limitations

  • No option to make a best-effort phasing assumption
  • Cannot output partially phased lines or filter specific samples
  • Designed to be used as part of a pipeline, not as a standalone phasing tool