Skip to content

VCFX_multiallelic_splitter

Overview

VCFX_multiallelic_splitter takes a VCF file with multi-allelic variants (variants with multiple ALT alleles) and splits them into multiple bi-allelic variant lines, while properly handling genotypes and FORMAT/INFO fields with various number specifications.

Usage

VCFX_multiallelic_splitter [OPTIONS] < input.vcf > biallelic_output.vcf

Options

Option Description
--help, -h Display help message and exit (handled by vcfx::handle_common_flags)
-v, --version Show program version and exit (handled by vcfx::handle_common_flags)

Description

Multi-allelic variants in VCF files (where multiple alternate alleles are specified in a comma-separated list) can complicate analysis and be incompatible with tools that require bi-allelic variants. This tool converts multi-allelic variants into the equivalent set of bi-allelic variants.

Key features: - Maintains original VCF header information - Correctly processes INFO fields tagged with different Number attributes (A, R, G) - Properly adjusts genotypes and FORMAT fields for each resulting bi-allelic variant - Preserves phasing information in genotypes - Handles complex symbolic variants (e.g., <DEL>, <INS>) - Correctly manages missing or malformed fields

Output Format

The output is a standard VCF file containing: - All header lines from the input file (unchanged) - Bi-allelic variants only, with each multi-allelic variant split into multiple lines - Each split variant maintains the same CHROM, POS, ID, REF, QUAL, and FILTER values - INFO and FORMAT fields properly adjusted for each alternate allele

Examples

Basic Usage

./VCFX_multiallelic_splitter < multi_allelic.vcf > biallelic.vcf

Integration with Other Tools

# Split multi-allelic variants, then run analysis requiring bi-allelic variants
cat input.vcf | \
  ./VCFX_multiallelic_splitter | \
  ./vcf_analysis_tool > results.txt

Validation

# Validate that all variants in the output are indeed bi-allelic
./VCFX_multiallelic_splitter < input.vcf | \
  grep -v "^#" | awk -F'\t' '{print $5}' | grep -c ","
# Should output 0 if all variants are bi-allelic

Handling Special Cases

  • INFO fields:
  • Number=A fields (one value per alternate allele): Each split variant gets the corresponding value
  • Number=R fields (one value per allele including reference): Values are preserved properly
  • Number=G fields (one value per genotype): Recalculated for bi-allelic case
  • Number=1 or other fixed numbers: These values are copied unchanged

  • FORMAT fields:

  • AD (allelic depth): Properly subset for each resulting variant
  • PL (genotype likelihoods): Recalculated for each bi-allelic output
  • GT (genotype): Adjusted to reflect the new allele indices (0/2 may become 0/1 in split variant)

  • Genotype conversion:

  • For each variant, genotypes are only preserved if they involve the specific alt allele
  • Genotypes not involving the current alternate allele are set to missing (./.)
  • Phased genotypes maintain their phase information

  • Edge cases:

  • Missing data in FORMAT fields is properly handled
  • Symbolic alternate alleles are processed correctly
  • Star alleles (*) and non-ref symbolic alleles are supported

Performance

The tool processes VCF files line by line with minimal memory requirements, with performance primarily dependent on: - Number of samples in the VCF - Number of multi-allelic sites - Complexity of INFO and FORMAT fields

For very large VCF files with many samples, processing time scales linearly with file size.

Limitations

  • No command-line options to control the splitting behavior
  • Cannot selectively split only certain multi-allelic variants
  • May produce large output files when the input contains many multi-allelic variants with many samples
  • Cannot reconstruct the original multi-allelic variants from the split output in all cases