VCFX_annotation_extractor

Overview

VCFX_annotation_extractor extracts annotation fields from the INFO column of a VCF file and writes them as a table. It is intended for pulling specific annotations (such as functional impact, gene name, or any custom annotation) out of VCF files into an analysis-friendly TSV format.

Usage

VCFX_annotation_extractor --annotation-extract "FIELD1,FIELD2,..." < input.vcf > extracted.tsv

Options

Option	Description
`-a`, `--annotation-extract <FIELDS>`	Required. Comma-separated list of INFO field annotations to extract (e.g., "ANN,Gene,Impact")
`-h`, `--help`	Display help message and exit (handled by `vcfx::handle_common_flags`)
`-v`, `--version`	Show program version and exit (handled by `vcfx::handle_common_flags`)

Description

VCFX_annotation_extractor simplifies the extraction and analysis of variant annotations by:

Reading a VCF file from standard input
Parsing the INFO column to extract user-specified annotation fields
Handling multi-allelic variants by creating separate rows for each ALT allele
Aligning per-allele annotations (such as ANN) with the corresponding ALT allele
Producing a clean tab-delimited output with standardized columns

This tool is particularly useful for:

Converting complex VCF annotations into a format suitable for spreadsheet applications
Extracting specific annotation fields for focused analysis
Preparing variant annotation data for visualization or reporting
Working with multi-allelic variants where annotations correspond to specific alleles
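
The INFO-parsing step described above can be sketched in Python. This is an illustration only, not the tool's own implementation: the function names `parse_info` and `extract` are hypothetical, and the treatment of valueless flag entries (stored as an empty string) is an assumption.

```python
def parse_info(info):
    """Split a VCF INFO string such as 'DP=14;Gene=BRCA1' into a dict."""
    result = {}
    if info in ("", "."):
        # "." is the VCF placeholder for an empty INFO column.
        return result
    for entry in info.split(";"):
        if not entry:
            continue
        # partition() keeps everything after the first "=" as the value.
        key, _, value = entry.partition("=")
        result[key] = value  # assumption: flags without "=" map to ""
    return result


def extract(info, fields):
    """Return the requested fields in order, using "NA" when one is absent."""
    parsed = parse_info(info)
    return [parsed.get(name, "NA") for name in fields]


values = extract("DP=14;Gene=BRCA1;Impact=HIGH", ["Gene", "Impact", "AF"])
```

Here `values` holds one entry per requested field, in the order the fields were requested, which mirrors the column order of the TSV output.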

Output Format

The output is a tab-separated (TSV) file with the following columns:

CHROM  POS  ID  REF  ALT  <ANNOTATION1>  <ANNOTATION2>  ...

Where:

The first five columns are standard VCF fields (chromosome, position, ID, reference allele, alternate allele)
Each subsequent column contains the value of a requested annotation field
Missing values are represented by "NA"
Multi-allelic variants are split into multiple rows, one for each ALT allele
Per-allele annotations (like ANN) are properly aligned with their corresponding ALT allele
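
For downstream analysis, a table in this layout can be read with Python's standard `csv` module. The sketch below assumes the column layout described above; the sample rows and the helper name `read_rows` are made up for illustration.

```python
import csv
import io

# Hypothetical output for --annotation-extract "Gene,Impact".
SAMPLE_TSV = (
    "CHROM\tPOS\tID\tREF\tALT\tGene\tImpact\n"
    "1\t100\trs1\tA\tT\tBRCA1\tHIGH\n"
    "1\t200\t.\tG\tC\tNA\tLOW\n"
)


def read_rows(handle):
    """Yield one dict per output row, mapping the "NA" marker to None."""
    # QUOTE_NONE keeps quote characters inside annotation values untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {key: (None if value == "NA" else value) for key, value in row.items()}


rows = list(read_rows(io.StringIO(SAMPLE_TSV)))
```

Note that an annotation whose real value is the literal string "NA" cannot be told apart from a missing one in this format.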

Examples

Basic Usage - Extract Gene Annotations

./VCFX_annotation_extractor --annotation-extract "Gene" < input.vcf > genes.tsv

Extract Multiple Annotation Fields

./VCFX_annotation_extractor --annotation-extract "ANN,Gene,Impact,DP" < input.vcf > annotations.tsv

Process and Filter in a Pipeline

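# Extract annotations from PASS variants only (FILTER is column 7);
# matching on the column avoids picking up "PASS" elsewhere in the line
awk -F'\t' '/^#/ || $7 == "PASS"' input.vcf | ./VCFX_annotation_extractor --annotation-extract "ANN,Gene,Impact" > pass_annotations.tsv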

Analyze Impact Distribution

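# Extract impact annotations and count occurrences per output row
# (column 6 is the first extracted field; multi-allelic variants contribute one row per ALT allele)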
./VCFX_annotation_extractor --annotation-extract "Impact" < input.vcf | tail -n +2 | cut -f6 | sort | uniq -c
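
Because multi-allelic variants produce one row per ALT allele, the pipeline above counts alleles rather than sites. The following Python sketch shows both ways of counting on a saved TSV. The function name `impact_counts`, the `per_site` switch, and the sample data are illustrative assumptions.

```python
import csv
import io
from collections import Counter

# Hypothetical output: the first site is multi-allelic (ALT=T,G).
SAMPLE_TSV = (
    "CHROM\tPOS\tID\tREF\tALT\tImpact\n"
    "1\t100\t.\tA\tT\tHIGH\n"
    "1\t100\t.\tA\tG\tHIGH\n"
    "2\t50\t.\tC\tT\tLOW\n"
)


def impact_counts(handle, column="Impact", per_site=False):
    """Count values in one column, optionally once per (CHROM, POS, REF) site."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    seen = set()
    for row in reader:
        if per_site:
            key = (row["CHROM"], row["POS"], row["REF"], row[column])
            if key in seen:
                continue  # same site and value already counted
            seen.add(key)
        counts[row[column]] += 1
    return counts


per_row = impact_counts(io.StringIO(SAMPLE_TSV))
per_site = impact_counts(io.StringIO(SAMPLE_TSV), per_site=True)
```

`per_row` matches what `sort | uniq -c` reports, while `per_site` collapses the rows that were split from the same variant line.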

Multi-allelic Variant Handling

The tool handles multi-allelic variants specially:

Each ALT allele in a multi-allelic variant gets its own row in the output
For the ANN field, which is treated as a per-allele (Number=A) annotation, multiple comma-separated values are aligned in order with the corresponding ALT alleles
For all other annotations (like Gene, Impact), the same value is used for every ALT allele of a variant
If there are more ALT alleles than annotation values, "NA" is used for the excess ALT alleles

Example

For a variant line with ALT=T,G,C and ANN=missense,stop_gained,intergenic:

Three rows will be generated in the output (one for each ALT)
The annotations will be properly aligned: T→missense, G→stop_gained, C→intergenic
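
The alignment rule can be sketched in a few lines of Python. This is a sketch under the rules stated in this section, not the tool's own implementation; `expand_alleles` and `PER_ALLELE_FIELDS` are hypothetical names.

```python
# Per this documentation, only ANN is split per allele.
PER_ALLELE_FIELDS = {"ANN"}


def expand_alleles(alt, info, fields):
    """Return one [ALT, value1, value2, ...] row per ALT allele.

    alt    -- the raw ALT column, e.g. "T,G,C"
    info   -- dict of already-parsed INFO entries
    fields -- requested annotation names, in output order
    """
    rows = []
    for index, allele in enumerate(alt.split(",")):
        row = [allele]
        for name in fields:
            if name not in info:
                row.append("NA")  # annotation absent from this line
            elif name in PER_ALLELE_FIELDS:
                values = info[name].split(",")
                # More ALT alleles than values: pad with "NA".
                row.append(values[index] if index < len(values) else "NA")
            else:
                row.append(info[name])  # single value copied to every allele
        rows.append(row)
    return rows


rows = expand_alleles(
    "T,G,C",
    {"ANN": "missense,stop_gained,intergenic", "Gene": "BRCA1"},
    ["ANN", "Gene"],
)
```

With the example variant above, `rows` contains three entries, pairing T with missense, G with stop_gained, and C with intergenic, each carrying the same Gene value.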

Handling Special Cases

The tool implements several strategies for handling edge cases:

Missing annotations: If a requested annotation is not found, "NA" is output
Malformed VCF lines: Lines with fewer than 8 columns are skipped with a warning
Empty annotations: Empty annotation values are preserved and not replaced with "NA"
Multi-value annotations: Currently, only the ANN field is treated as multi-value and split by commas
Header parsing: The tool checks for proper VCF headers before processing data
Empty input: The tool correctly handles empty input files, producing only the header line
Invalid characters: The tool preserves all characters in annotation values, including special characters
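
Several of these rules can be illustrated on a single data line. The Python sketch below is an assumption-laden illustration: the name `extract_line` and the warning text are hypothetical, and allele splitting is left out to keep the focus on the edge cases.

```python
import io
import sys


def extract_line(line, fields, warn=sys.stderr):
    """Return [CHROM, POS, ID, REF, ALT, values...] or None for a malformed line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        # Malformed line: report it and skip (the wording is hypothetical).
        print("Warning: skipping line with fewer than 8 columns", file=warn)
        return None
    info = {}
    for entry in cols[7].split(";"):
        key, _, value = entry.partition("=")
        info[key] = value  # values are stored verbatim, special characters included
    # A present-but-empty value stays ""; only an absent key becomes "NA".
    return cols[:5] + [info.get(name, "NA") for name in fields]


row = extract_line("1\t100\t.\tA\tT\t50\tPASS\tGene=;Effect=a|b&c", ["Gene", "Impact", "Effect"])

log = io.StringIO()
skipped = extract_line("1\t100\t.\tA\tT", ["Gene"], warn=log)
```

In `row`, the empty Gene value is kept as an empty string while the absent Impact field becomes "NA"; `skipped` is `None` and the warning ends up in `log`.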

Performance

VCFX_annotation_extractor is designed for efficiency:

Single-pass processing reads the VCF file line-by-line without loading the entire file into memory
Efficient string parsing with optimized splitting functions
Uses hash maps for quick annotation lookups
Memory usage scales with the size of individual variant lines rather than the whole file
Output is streamed directly without intermediate storage
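
The streaming design can be sketched with a Python generator that holds only one line in memory at a time. This is a simplified illustration, not the tool's own code: `stream_rows` is a hypothetical name, and allele splitting and warnings are omitted.

```python
import io


def stream_rows(lines, fields):
    """Lazily yield the TSV header, then one output row per valid data line."""
    yield "\t".join(["CHROM", "POS", "ID", "REF", "ALT"] + fields)
    for line in lines:  # a file object yields one line at a time
        if line.startswith("#"):
            continue  # meta-information and column header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # malformed or blank line
        # partition("=")[::2] gives a (key, value) pair for each INFO entry.
        info = dict(entry.partition("=")[::2] for entry in cols[7].split(";"))
        yield "\t".join(cols[:5] + [info.get(name, "NA") for name in fields])


VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t.\tA\tT\t50\tPASS\tGene=BRCA1;DP=10\n"
)

out = list(stream_rows(io.StringIO(VCF), ["Gene"]))
empty = list(stream_rows(io.StringIO(""), ["Gene"]))
```

Passing `sys.stdin` instead of the `io.StringIO` object would process a piped file in the same way, and empty input yields only the header line.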

Limitations

Currently, only the ANN field is recognized as a per-allele (Number=A) field that needs to be split; other Number=A fields are not automatically detected
The VCF header is not parsed to determine automatically which fields are Number=A and which are Number=1
Cannot extract FORMAT fields or sample-specific information
The output does not include QUAL or FILTER columns from the input VCF
No wildcard or regex support for selecting annotation fields
Annotation fields with embedded tab or newline characters may cause issues in the output format
Limited error recovery for malformed INFO fields
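
Since the tool does not read the header to find per-allele fields, a user can check the `##INFO` declarations separately before choosing which fields to extract. The Python sketch below is one possible approach and is not part of the tool; `per_allele_fields` is a hypothetical helper, and the regular expression assumes the common key order in which `ID` comes directly before `Number`.

```python
import re

# Assumes declarations shaped like ##INFO=<ID=AF,Number=A,Type=Float,...>
INFO_HEADER = re.compile(r"^##INFO=<ID=([^,>]+),Number=([^,>]+)")


def per_allele_fields(header_lines):
    """Return the IDs of INFO fields declared with Number=A."""
    found = []
    for line in header_lines:
        match = INFO_HEADER.match(line)
        if match and match.group(2) == "A":
            found.append(match.group(1))
    return found


HEADER = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    '##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count">',
]

fields = per_allele_fields(HEADER)
```

Fields reported here other than ANN will not be split per allele by the tool, so their comma-separated values would need to be split in post-processing.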