VCFX_info_parser

Overview

VCFX_info_parser extracts and formats specific INFO fields from a VCF file into a tabular format for easier analysis. The tool processes a VCF file line by line, parses the INFO column, and outputs only the requested fields in a clean TSV format.

Usage

VCFX_info_parser --info "FIELD1,FIELD2,..." < input.vcf > extracted_info.tsv

Options

| Option | Description |
|--------|-------------|
| `-i`, `--info <FIELDS>` | Required. Comma-separated list of INFO fields to extract (e.g., "DP,AF,SOMATIC") |
| `-h`, `--help` | Display help message and exit (handled by `vcfx::handle_common_flags`) |
| `-v`, `--version` | Show program version and exit (handled by `vcfx::handle_common_flags`) |

Description

VCFX_info_parser simplifies the process of extracting specific information from VCF files by:

Reading VCF data from standard input
Parsing the INFO column to extract user-specified fields
Producing a clean, tabular output with standardized headers
Properly handling flags, missing values, and malformed entries

This tool is particularly useful for:

- Extracting numeric values like depth (DP) or allele frequency (AF) for statistical analysis
- Converting complex VCF INFO fields into a format suitable for spreadsheet applications
- Creating simplified datasets focused on specific annotations
- Preparing data for visualization or report generation

Output Format

The output is a tab-separated file with the following columns:

CHROM  POS  ID  REF  ALT  FIELD1  FIELD2  ...

Where:

- The first five columns are standard VCF fields (chromosome, position, ID, reference allele, alternate allele)
- Each subsequent column contains the value of a requested INFO field
- Missing values are represented by a dot (.)
- Flag fields (INFO fields with no value) are also represented by a dot (.)
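The row layout described above can be illustrated with a minimal Python sketch. It is not the tool's actual implementation (the helper names `parse_info` and `info_row` are made up for illustration); it only mirrors the documented behavior, where missing fields and flags both come out as a dot.

```python
def parse_info(info):
    """Split an INFO string such as 'DP=10;AF=0.5;SOMATIC' into a dict.

    Flags (keys without '=') are stored with an empty string as their value.
    """
    parsed = {}
    for token in info.split(";"):
        if not token:
            continue
        key, sep, value = token.partition("=")
        parsed[key] = value if sep else ""
    return parsed


def info_row(vcf_line, fields):
    """Build one output row: CHROM, POS, ID, REF, ALT, then the requested fields."""
    cols = vcf_line.rstrip("\r\n").split("\t")
    info = parse_info(cols[7])  # INFO is the eighth VCF column
    # Missing fields (None) and valueless flags ("") both become "."
    values = [info.get(name) or "." for name in fields]
    return "\t".join(cols[:5] + values)


# Made-up record for illustration
line = "chr1\t100\trs1\tA\tG\t50\tPASS\tDP=10;SOMATIC"
print("\t".join(["CHROM", "POS", "ID", "REF", "ALT", "DP", "AF", "SOMATIC"]))
print(info_row(line, ["DP", "AF", "SOMATIC"]))
```

In this sketch the record yields `10` for DP and a dot for both AF (absent) and SOMATIC (a flag).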

Examples

Basic Usage - Extract Depth Information

./VCFX_info_parser --info "DP" < input.vcf > depth_data.tsv

Extract Multiple Fields

./VCFX_info_parser --info "DP,AF,MQ" < input.vcf > key_metrics.tsv
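For downstream analysis, a file such as `key_metrics.tsv` could be loaded with Python's standard `csv` module. The sketch below is only an illustration: the helper `column_mean` and the sample rows are made up, and it assumes the output starts with the header row shown under Output Format.

```python
import csv
import io
from statistics import mean


def column_mean(tsv_text, column):
    """Average a numeric column of the TSV, skipping '.' placeholders."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    values = [float(row[column]) for row in reader if row[column] != "."]
    return mean(values) if values else None


# Made-up sample in the documented layout, standing in for key_metrics.tsv
sample = (
    "CHROM\tPOS\tID\tREF\tALT\tDP\tAF\tMQ\n"
    "chr1\t100\trs1\tA\tG\t10\t0.5\t60\n"
    "chr1\t200\t.\tC\tT\t30\t.\t58\n"
)
print(column_mean(sample, "DP"))
```

To read a real file, the text could come from `open("key_metrics.tsv", newline="").read()` instead of the inline sample. Values that hold several comma-separated numbers (for example a per-allele AF) would need to be split before conversion to `float`.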

Working with Annotation Data

./VCFX_info_parser --info "Gene,IMPACT,Consequence" < annotated.vcf > gene_impacts.tsv

Pipeline Example

# Keep only records whose FILTER column is PASS, then extract specific INFO fields
awk -F'\t' '$7 == "PASS"' input.vcf | ./VCFX_info_parser --info "DP,AF,SOMATIC" > filtered_annotations.tsv

Handling Special Cases

The tool implements several strategies for handling edge cases:

Flag fields: INFO fields without values (flags like 'SOMATIC') are represented by a dot in the output, so a flag that is present looks the same as a field that is missing
Missing fields: If a requested INFO field is not present in a specific variant, a dot is printed
Malformed lines: Lines that don't conform to VCF format are skipped with a warning message
Empty input: The tool correctly handles empty input files
Header lines: VCF header lines (starting with #) are skipped
Line endings: LF and CRLF line endings are supported
Partial final line: Files without a final newline character are processed correctly
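The cases listed above can be sketched in Python as follows. This is an illustrative approximation, not the tool's code: the function name `extract`, the "fewer than 8 columns" test for malformed lines, the skipping of blank lines, and the wording of the warning are all assumptions.

```python
import sys


def extract(lines, fields):
    """Yield output rows (as lists) for an iterable of raw VCF lines."""
    for raw in lines:
        # Accepts LF, CRLF, or a final line with no newline at all
        line = raw.rstrip("\r\n")
        # Header lines are skipped; skipping blank lines is an assumption
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Assumption: a data line with fewer than 8 columns counts as malformed
        if len(cols) < 8:
            print("Warning: skipping malformed line", file=sys.stderr)
            continue
        info = {}
        for token in cols[7].split(";"):
            key, _, value = token.partition("=")
            info[key] = value  # flags end up with an empty value
        # Missing fields and flags both become "."
        yield cols[:5] + [info.get(name) or "." for name in fields]


# Made-up input mixing headers, CRLF, a malformed line, and no final newline
sample_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\tDP=5;SOMATIC\r\n",
    "broken line\n",
    "chr2\t7\trs9\tC\tT\t.\t.\tAF=0.1",
]
for row in extract(sample_lines, ["DP", "SOMATIC"]):
    print("\t".join(row))
```

Because `extract` is a generator that consumes one line at a time, it could be fed `sys.stdin` directly, and an empty input simply yields no rows.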

Performance

VCFX_info_parser is designed for efficiency:

Single-pass processing with line-by-line reading, allowing for streaming of very large files
Minimal memory footprint regardless of input file size
Efficient string parsing with no complex regular expressions
Fast lookup of INFO fields using hash maps

Limitations

No special handling of multi-allelic variants (each row is processed independently)
No built-in filtering capabilities (use in conjunction with other filtering tools)
Cannot split INFO fields with multiple values (e.g., CSQ fields from VEP)
Doesn't preserve VCF headers in the output
No option to include additional VCF columns (QUAL, FILTER) in the output
Cannot extract FORMAT fields or sample-specific information
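Since multi-valued INFO fields are not split by the tool, that step can be done on the extracted column afterwards. The sketch below assumes a VEP-style value, where entries are separated by commas and subfields by `|`. The helper `split_multi_value` and the subfield order are hypothetical; the real order is given in the CSQ line of the VCF header.

```python
def split_multi_value(value, subfields):
    """Split a VEP-style value into one dict per comma-separated entry."""
    if value == ".":
        return []  # missing value in the extracted TSV
    return [dict(zip(subfields, entry.split("|"))) for entry in value.split(",")]


# Hypothetical subfield order; check the CSQ header line of the actual VCF
names = ["Allele", "Consequence", "IMPACT", "SYMBOL"]
csq = "G|missense_variant|MODERATE|BRCA1,G|intron_variant|MODIFIER|BRCA1"
for entry in split_multi_value(csq, names):
    print(entry["Consequence"], entry["IMPACT"])
```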