VCFX_multiallelic_splitter¶
Overview¶
VCFX_multiallelic_splitter takes a VCF file with multi-allelic variants (variants with multiple ALT alleles) and splits them into multiple bi-allelic variant lines, while properly handling genotypes and FORMAT/INFO fields with various number specifications.
Usage¶
VCFX_multiallelic_splitter [OPTIONS] < input.vcf > biallelic_output.vcf
Options¶
Option | Description |
---|---|
--help, -h | Display help message and exit (handled by vcfx::handle_common_flags) |
-v, --version | Show program version and exit (handled by vcfx::handle_common_flags) |
Description¶
Multi-allelic variants in VCF files (where multiple alternate alleles are specified in a comma-separated list) can complicate analysis and be incompatible with tools that require bi-allelic variants. This tool converts multi-allelic variants into the equivalent set of bi-allelic variants.
Key features:
- Maintains original VCF header information
- Correctly processes INFO fields tagged with different Number attributes (A, R, G)
- Properly adjusts genotypes and FORMAT fields for each resulting bi-allelic variant
- Preserves phasing information in genotypes
- Handles complex symbolic variants (e.g., <DEL>, <INS>)
- Correctly manages missing or malformed fields
Output Format¶
The output is a standard VCF file containing:
- All header lines from the input file (unchanged)
- Bi-allelic variants only, with each multi-allelic variant split into multiple lines
- Each split variant maintains the same CHROM, POS, ID, REF, QUAL, and FILTER values
- INFO and FORMAT fields properly adjusted for each alternate allele
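To make the shape of the output concrete, here is a minimal Python sketch of the row expansion only. It is an illustration, not the tool's C++ implementation: the function name `split_fixed_columns` is hypothetical, and the sketch copies INFO and sample columns unchanged, whereas the tool adjusts them for each alternate allele.

```python
from typing import List


def split_fixed_columns(line: str) -> List[str]:
    """Return one tab-separated line per ALT allele of a VCF data line.

    Hypothetical helper for illustration. Only the ALT column (index 4)
    changes; CHROM, POS, ID, REF, QUAL and FILTER are carried over as-is.
    INFO and sample columns are NOT adjusted here, unlike in the real tool.
    """
    fields = line.rstrip("\n").split("\t")
    alts = fields[4].split(",")
    if len(alts) < 2:
        # Already bi-allelic: pass the record through untouched.
        return ["\t".join(fields)]
    rows = []
    for alt in alts:
        record = fields.copy()
        record[4] = alt
        rows.append("\t".join(record))
    return rows


record = "1\t100\trs1\tA\tC,G\t50\tPASS\tDP=10"
for row in split_fixed_columns(record):
    print(row)
```

Both printed rows share the first four columns and the QUAL and FILTER values; only ALT differs (`C` on the first row, `G` on the second).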
Examples¶
Basic Usage¶
./VCFX_multiallelic_splitter < multi_allelic.vcf > biallelic.vcf
Integration with Other Tools¶
# Split multi-allelic variants, then run analysis requiring bi-allelic variants
cat input.vcf | \
./VCFX_multiallelic_splitter | \
./vcf_analysis_tool > results.txt
Validation¶
# Validate that all variants in the output are indeed bi-allelic
./VCFX_multiallelic_splitter < input.vcf | \
grep -v "^#" | awk -F'\t' '{print $5}' | grep -c ","
# Should output 0 if all variants are bi-allelic
Handling Special Cases¶
- INFO fields:
    - Number=A fields (one value per alternate allele): Each split variant gets the corresponding value
    - Number=R fields (one value per allele including reference): Values are preserved properly
    - Number=G fields (one value per genotype): Recalculated for bi-allelic case
    - Number=1 or other fixed numbers: These values are copied unchanged
- FORMAT fields:
    - AD (allelic depth): Properly subset for each resulting variant
    - PL (genotype likelihoods): Recalculated for each bi-allelic output
    - GT (genotype): Adjusted to reflect the new allele indices (0/2 may become 0/1 in split variant)
- Genotype conversion:
    - For each variant, genotypes are only preserved if they involve the specific alt allele
    - Genotypes not involving the current alternate allele are set to missing (./.)
    - Phased genotypes maintain their phase information
- Edge cases:
    - Missing data in FORMAT fields is properly handled
    - Symbolic alternate alleles are processed correctly
    - Star alleles (*) and non-ref symbolic alleles are supported
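The per-allele rules above can be sketched in a few lines of Python. This is a hedged illustration, not the tool's code: `subset_values` and `recode_gt` are hypothetical names, the Number=G case assumes diploid samples and the genotype ordering defined in the VCF specification, and the treatment of homozygous-reference genotypes (kept as 0/0) and of genotypes mixing two different ALT alleles (set to missing) is an assumption of the sketch that the tool may handle differently.

```python
import re


def subset_values(values, number, alt_index):
    """Pick the entries of a comma-separated field that belong to one ALT.

    values    -- list of strings taken from the multi-allelic record
    number    -- the header's Number attribute ("A", "R", "G", or other)
    alt_index -- 1-based index of the ALT allele being written out
    """
    a = alt_index
    if number == "A":        # one value per ALT allele
        wanted = [a - 1]
    elif number == "R":      # REF value first, then one value per ALT
        wanted = [0, a]
    elif number == "G":      # diploid genotypes 0/0, 0/a, a/a (assumption)
        het = a * (a + 1) // 2   # VCF ordering: index(j/k) = k(k+1)/2 + j
        wanted = [0, het, het + a]
    else:                    # Number=1 and other fixed counts: copy unchanged
        return list(values)
    # Use "." when the field is shorter than its Number attribute implies.
    return [values[i] if i < len(values) else "." for i in wanted]


def recode_gt(gt, alt_index):
    """Rewrite a GT string for the record that carries ALT number alt_index."""
    alleles = re.split(r"[/|]", gt)
    separators = re.findall(r"[/|]", gt)   # kept so phasing survives
    recoded = []
    for allele in alleles:
        if allele == "0":
            recoded.append("0")
        elif allele == str(alt_index):
            recoded.append("1")
        else:
            # Another ALT allele or a missing call: whole genotype is missing.
            recoded = ["."] * len(alleles)
            break
    result = recoded[0]
    for sep, allele in zip(separators, recoded[1:]):
        result += sep + allele
    return result


# Record with two ALT alleles, looking at the line written for ALT number 2.
print(subset_values(["0.1", "0.2"], "A", 2))
print(subset_values(["5", "3", "2"], "R", 2))
print(subset_values(["10", "20", "30", "40", "50", "60"], "G", 2))
print(recode_gt("0/2", 2), recode_gt("2|0", 2), recode_gt("0/1", 2))
```

The same `subset_values` call covers AD (a Number=R field) and PL (a Number=G field) when applied to each sample's FORMAT values.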
Performance¶
The tool processes VCF files line by line with minimal memory requirements, with performance primarily dependent on:
- Number of samples in the VCF
- Number of multi-allelic sites
- Complexity of INFO and FORMAT fields
For very large VCF files with many samples, processing time scales linearly with file size.
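As a rough way to gauge how much a given input will grow, the following Python sketch streams over VCF lines one at a time and counts how many records a split would produce (one per ALT allele). The helper name `count_split_records` is hypothetical and the script is independent of the tool; it only reads the ALT column, so memory use stays constant regardless of file size.

```python
from typing import Iterable, Tuple


def count_split_records(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (input_records, output_records) for a stream of VCF lines.

    Hypothetical helper: a record with n ALT alleles is assumed to yield
    n output records. Header lines, blank lines, and lines with fewer than
    five columns are skipped.
    """
    n_in = 0
    n_out = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # Split only far enough to isolate the ALT column (index 4).
        cols = line.split("\t", 5)
        if len(cols) < 5:
            continue
        n_in += 1
        n_out += cols[4].count(",") + 1
    return n_in, n_out


example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tC,G,T\t.\t.\t.\n",
    "1\t200\t.\tG\tT\t.\t.\t.\n",
]
print(count_split_records(example))
```

For a real file, passing `sys.stdin` (or an open file object) instead of the in-memory list keeps the same one-line-at-a-time behavior.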
Limitations¶
- No command-line options to control the splitting behavior
- Cannot selectively split only certain multi-allelic variants
- May produce large output files when the input contains many multi-allelic variants with many samples
- Cannot reconstruct the original multi-allelic variants from the split output in all cases