VCFX_info_parser¶
Overview¶
VCFX_info_parser extracts and formats specific INFO fields from a VCF file into a tabular format for easier analysis. The tool processes a VCF file line by line, parses the INFO column, and outputs only the requested fields in a clean TSV format.
Usage¶
VCFX_info_parser --info "FIELD1,FIELD2,..." < input.vcf > extracted_info.tsv
Options¶
Option | Description |
---|---|
-i , --info <FIELDS> |
Required. Comma-separated list of INFO fields to extract (e.g., "DP,AF,SOMATIC") |
-h , --help |
Display help message and exit (handled by vcfx::handle_common_flags ) |
-v , --version |
Show program version and exit (handled by vcfx::handle_common_flags ) |
Description¶
VCFX_info_parser simplifies the process of extracting specific information from VCF files by:
- Reading VCF data from standard input
- Parsing the INFO column to extract user-specified fields
- Producing a clean, tabular output with standardized headers
- Properly handling flags, missing values, and malformed entries
This tool is particularly useful for: - Extracting numeric values like depth (DP) or allele frequency (AF) for statistical analysis - Converting complex VCF INFO fields into a format suitable for spreadsheet applications - Creating simplified datasets focused on specific annotations - Preparing data for visualization or report generation
Output Format¶
The output is a tab-separated file with the following columns:
CHROM POS ID REF ALT FIELD1 FIELD2 ...
Where: - The first five columns are standard VCF fields (chromosome, position, ID, reference allele, alternate allele) - Each subsequent column contains the value of a requested INFO field - Missing values are represented by a dot (.) - Flag fields (INFO fields with no value) are also represented by a dot (.)
Examples¶
Basic Usage - Extract Depth Information¶
./VCFX_info_parser --info "DP" < input.vcf > depth_data.tsv
Extract Multiple Fields¶
./VCFX_info_parser --info "DP,AF,MQ" < input.vcf > key_metrics.tsv
Working with Annotation Data¶
./VCFX_info_parser --info "Gene,IMPACT,Consequence" < annotated.vcf > gene_impacts.tsv
Pipeline Example¶
# Filter a VCF file and extract specific INFO fields
cat input.vcf | grep "PASS" | ./VCFX_info_parser --info "DP,AF,SOMATIC" > filtered_annotations.tsv
Handling Special Cases¶
The tool implements several strategies for handling edge cases:
- Flag fields: INFO fields without values (flags like 'SOMATIC') are represented by a dot in the output
- Missing fields: If a requested INFO field is not present in a specific variant, a dot is printed
- Malformed lines: Lines that don't conform to VCF format are skipped with a warning message
- Empty input: The tool correctly handles empty input files
- Header lines: VCF header lines (starting with #) are skipped
- Line endings: LF and CRLF line endings are supported
- Partial final line: Files without a final newline character are processed correctly
Performance¶
VCFX_info_parser is designed for efficiency:
- Single-pass processing with line-by-line reading, allowing for streaming of very large files
- Minimal memory footprint regardless of input file size
- Efficient string parsing with no complex regular expressions
- Fast lookup of INFO fields using hash maps
Limitations¶
- Cannot handle multi-allelic variants specially (each row is processed independently)
- No built-in filtering capabilities (use in conjunction with other filtering tools)
- Cannot split INFO fields with multiple values (e.g., CSQ fields from VEP)
- Doesn't preserve VCF headers in the output
- No option to include additional VCF columns (QUAL, FILTER) in the output
- Cannot extract FORMAT fields or sample-specific information