VCFX_annotation_extractor¶
Overview¶
VCFX_annotation_extractor extracts annotation fields from a VCF file's INFO column and converts them into a tabular format. The tool is particularly useful for extracting specific annotations (such as functional impact, gene name, or any custom annotation) from VCF files into a more analysis-friendly TSV format.
Usage¶
VCFX_annotation_extractor --annotation-extract "FIELD1,FIELD2,..." < input.vcf > extracted.tsv
Options¶
Option | Description |
---|---|
-a , --annotation-extract <FIELDS> |
Required. Comma-separated list of INFO field annotations to extract (e.g., "ANN,Gene,Impact") |
-h , --help |
Display help message and exit (handled by vcfx::handle_common_flags ) |
-v , --version |
Show program version and exit (handled by vcfx::handle_common_flags ) |
Description¶
VCFX_annotation_extractor simplifies the extraction and analysis of variant annotations by:
- Reading a VCF file from standard input
- Parsing the INFO column to extract user-specified annotation fields
- Handling multi-allelic variants by creating separate rows for each ALT allele
- Aligning per-allele annotations (such as ANN) with the corresponding ALT allele
- Producing a clean tab-delimited output with standardized columns
This tool is particularly useful for: - Converting complex VCF annotations into a format suitable for spreadsheet applications - Extracting specific annotation fields for focused analysis - Preparing variant annotation data for visualization or reporting - Working with multi-allelic variants where annotations correspond to specific alleles
Output Format¶
The output is a tab-separated (TSV) file with the following columns:
CHROM POS ID REF ALT <ANNOTATION1> <ANNOTATION2> ...
Where: - The first five columns are standard VCF fields (chromosome, position, ID, reference allele, alternate allele) - Each subsequent column contains the value of a requested annotation field - Missing values are represented by "NA" - Multi-allelic variants are split into multiple rows, one for each ALT allele - Per-allele annotations (like ANN) are properly aligned with their corresponding ALT allele
Examples¶
Basic Usage - Extract Gene Annotations¶
./VCFX_annotation_extractor --annotation-extract "Gene" < input.vcf > genes.tsv
Extract Multiple Annotation Fields¶
./VCFX_annotation_extractor --annotation-extract "ANN,Gene,Impact,DP" < input.vcf > annotations.tsv
Process and Filter in a Pipeline¶
# Extract annotations from only PASS variants
grep -e "^#" -e "PASS" input.vcf | ./VCFX_annotation_extractor --annotation-extract "ANN,Gene,Impact" > pass_annotations.tsv
Analyze Impact Distribution¶
# Extract impact annotations and count occurrences
./VCFX_annotation_extractor --annotation-extract "Impact" < input.vcf | tail -n +2 | cut -f6 | sort | uniq -c
Multi-allelic Variant Handling¶
The tool handles multi-allelic variants specially:
- Each ALT allele in a multi-allelic variant gets its own row in the output
- For Number=A annotations (like ANN) that have multiple comma-separated values, each value is aligned with the corresponding ALT allele
- For single-value annotations (like Gene, Impact), the same value is used for all ALT alleles of a variant
- If there are more ALT alleles than annotation values, "NA" is used for the excess ALT alleles
Example¶
For a variant line with ALT=T,G,C
and ANN=missense,stop_gained,intergenic
:
- Three rows will be generated in the output (one for each ALT)
- The annotations will be properly aligned: T→missense, G→stop_gained, C→intergenic
Handling Special Cases¶
The tool implements several strategies for handling edge cases:
- Missing annotations: If a requested annotation is not found, "NA" is output
- Malformed VCF lines: Lines with fewer than 8 columns are skipped with a warning
- Empty annotations: Empty annotation values are preserved and not replaced with "NA"
- Multi-value annotations: Currently, only ANN field is treated as multi-value and split by commas
- Header parsing: The tool checks for proper VCF headers before processing data
- Empty input: The tool correctly handles empty input files, producing only the header line
- Invalid characters: The tool preserves all characters in annotation values, including special characters
Performance¶
VCFX_annotation_extractor is designed for efficiency:
- Single-pass processing reads the VCF file line-by-line without loading the entire file into memory
- Efficient string parsing with optimized splitting functions
- Uses hash maps for quick annotation lookups
- Memory usage scales with the size of individual variant lines rather than the whole file
- Output is streamed directly without intermediate storage
Limitations¶
- Currently, only the ANN field is recognized as a per-allele (Number=A) field that needs to be split; other Number=A fields are not automatically detected
- No VCF header parsing to automatically determine which fields are Number=A vs. Number=1
- Cannot extract FORMAT fields or sample-specific information
- The output does not include QUAL or FILTER columns from the input VCF
- No wildcard or regex support for selecting annotation fields
- Annotation fields with embedded tab or newline characters may cause issues in the output format
- Limited error recovery for malformed INFO fields