VCFX_field_extractor

Overview

VCFX_field_extractor extracts selected fields from VCF (Variant Call Format) files and writes them as a table. Users can select standard fields, INFO subfields, and sample-specific genotype fields from each VCF record.

Usage

VCFX_field_extractor --fields "FIELD1,FIELD2,..." [OPTIONS] < input.vcf > output.tsv

Options

Option	Description
`-f`, `--fields`	Required. Comma-separated list of fields to extract (no spaces between fields)
`-h`, `--help`	Display help message and exit (handled by `vcfx::handle_common_flags`)
`-v`, `--version`	Show program version and exit (handled by `vcfx::handle_common_flags`)

Description

VCFX_field_extractor processes a VCF file and extracts only the specified fields for each variant. The tool:

1. Reads a VCF file from standard input.
2. Identifies and parses the VCF header.
3. Extracts the requested fields from each variant line.
4. Writes the extracted fields in tab-separated format.

The tool can extract three types of fields:

- Standard VCF fields: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
- INFO subfields: any key that appears in the INFO column (e.g., DP, AF, TYPE)
- Sample-specific fields: fields from the genotype columns, specified as either:
  - SampleName:Subfield (e.g., SAMPLE1:GT for the genotype of SAMPLE1)
  - S<number>:Subfield (e.g., S1:DP for the depth of the first sample)
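To make the header-parsing step concrete, here is a minimal Python sketch of how sample names can be read from the `#CHROM` line so that SampleName:Subfield requests can later be matched to a column. It is an illustration only, not the tool's code; the function name `sample_names_from_header` and the example header are hypothetical.

```python
def sample_names_from_header(lines):
    """Return the sample names listed on the #CHROM header line, if any."""
    for line in lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\r\n").split("\t")
            # Columns 1-8 are fixed, column 9 is FORMAT, the rest are samples.
            return columns[9:]
        if not line.startswith("#"):
            break  # reached variant records without finding a #CHROM line
    return []


# Hypothetical two-sample header used only for demonstration.
header = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\tSAMPLE2\n",
]
print(sample_names_from_header(header))  # ['SAMPLE1', 'SAMPLE2']
```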

Field Types

Standard VCF Fields

These are the eight fixed columns in the VCF format:

- CHROM: Chromosome
- POS: Position
- ID: Variant identifier
- REF: Reference allele
- ALT: Alternate allele(s)
- QUAL: Quality score
- FILTER: Filter status
- INFO: Additional information

INFO Subfields

Any key found in the INFO column can be extracted directly by name. For example:

- DP: Read depth
- AF: Allele frequency
- TYPE: Variant type
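As a rough illustration of what looking up an INFO key by name involves, the Python sketch below splits an INFO string on `;` and `=`. It is not the tool's implementation: the helper name `info_lookup` is hypothetical, and reporting a valueless flag key as `1` is an assumption made for this sketch rather than documented behavior.

```python
def info_lookup(info, key, missing="."):
    """Return the value of `key` in a VCF INFO string, or `missing` if absent."""
    if info in ("", "."):
        return missing
    for token in info.split(";"):
        name, sep, value = token.partition("=")
        if name == key:
            # Assumption: a flag (a key with no "=value") is reported as "1".
            return value if sep else "1"
    return missing


info = "DP=14;AF=0.5;DB"          # hypothetical INFO column
print(info_lookup(info, "DP"))    # 14
print(info_lookup(info, "TYPE"))  # .
```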

Sample Fields

Sample fields are specified using one of these formats:

- SampleName:Subfield, where SampleName is the exact sample name from the VCF header
- S<number>:Subfield, where <number> is the 1-based index of the sample column

Common sample subfields include:

- GT: Genotype
- DP: Read depth
- GQ: Genotype quality
- Any other key listed in the FORMAT column of the VCF
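The two specifier forms can be pictured with the following Python sketch, which resolves a request against the FORMAT column and the sample columns of one record. The function `sample_lookup` is hypothetical, and trying an exact sample-name match before the S<number> form is an assumption of this sketch, not a statement about the tool's precedence rules.

```python
import re


def sample_lookup(spec, sample_names, format_col, sample_cols, missing="."):
    """Resolve 'SampleName:Subfield' or 'S<number>:Subfield' for one record."""
    sample, _, subfield = spec.partition(":")
    if not subfield:
        return missing
    if sample in sample_names:
        index = sample_names.index(sample)
    elif re.fullmatch(r"S[1-9]\d*", sample):
        index = int(sample[1:]) - 1  # S1 refers to the first sample column
    else:
        return missing
    keys = format_col.split(":")
    if index >= len(sample_cols) or subfield not in keys:
        return missing
    values = sample_cols[index].split(":")
    pos = keys.index(subfield)
    # Trailing subfields may be omitted in a sample column.
    return values[pos] if pos < len(values) else missing


names = ["SAMPLE1", "SAMPLE2"]   # hypothetical header sample names
fmt = "GT:DP"                    # FORMAT column of the record
cols = ["0/1:12", "1/1:7"]       # one entry per sample column
print(sample_lookup("SAMPLE1:GT", names, fmt, cols))  # 0/1
print(sample_lookup("S2:DP", names, fmt, cols))       # 7
print(sample_lookup("S3:GT", names, fmt, cols))       # .
```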

Output Format

The tool produces a tab-separated values (TSV) file with:

- A header row containing the requested field names
- One row per variant, with each requested field value
- Missing or invalid fields represented as `.`
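For downstream use, the output described above can be loaded with Python's standard `csv` module. The sketch below is one possible approach, not part of the tool: `read_extracted` is a hypothetical helper, the sample text is made-up data in the documented layout, and mapping `.` to `None` is a choice made here for convenience.

```python
import csv
import io


def read_extracted(handle, missing="."):
    """Yield one dict per variant row, keyed by the header names."""
    # QUOTE_NONE keeps any quote characters in field values as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {name: (None if value == missing else value)
               for name, value in row.items()}


# Made-up output in the documented layout: header row, then one row per variant.
sample_output = (
    "CHROM\tPOS\tDP\tSAMPLE1:GT\n"
    "chr1\t100\t14\t0/1\n"
    "chr1\t200\t.\t1/1\n"
)
rows = list(read_extracted(io.StringIO(sample_output)))
print(rows[0]["SAMPLE1:GT"])  # 0/1
print(rows[1]["DP"])          # None
```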

Examples

Basic Standard Fields

Extract chromosome, position, ID, reference, and alternate alleles:

VCFX_field_extractor --fields "CHROM,POS,ID,REF,ALT" < input.vcf > basic_fields.tsv

INFO Fields

Extract depth and allele frequency:

VCFX_field_extractor --fields "CHROM,POS,DP,AF" < input.vcf > info_fields.tsv

Sample Genotype Fields

Extract genotypes for specific samples:

VCFX_field_extractor --fields "CHROM,POS,SAMPLE1:GT,SAMPLE2:GT" < input.vcf > genotypes.tsv

Sample Fields by Index

Extract genotypes using sample indices:

VCFX_field_extractor --fields "CHROM,POS,S1:GT,S2:GT" < input.vcf > genotypes_by_index.tsv

Mixed Field Types

Combine different field types:

VCFX_field_extractor --fields "CHROM,POS,DP,AF,SAMPLE1:GT,SAMPLE1:DP" < input.vcf > mixed_fields.tsv

Handling Special Cases

Missing Fields

- If a requested field is not found in the VCF record, `.` is written in its place.
- This applies to missing INFO fields, invalid sample names, and non-existent format fields.

Malformed Records

- The tool attempts to handle malformed VCF lines gracefully.
- For lines with too few columns, missing fields are filled with `.`.
- Values are not type-checked: invalid data in a numeric field is preserved exactly as it appears in the input.
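The padding and pass-through behavior described above can be sketched in Python as follows. This is a simplified model covering only the eight standard columns; `standard_fields` is a hypothetical helper and does not reflect how the tool is actually written.

```python
STANDARD = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def standard_fields(line, wanted, missing="."):
    """Return the requested standard columns of one VCF line, padding short lines."""
    cols = line.rstrip("\r\n").split("\t")
    # Pad lines that have fewer than eight columns with the missing marker.
    cols += [missing] * (len(STANDARD) - len(cols))
    by_name = dict(zip(STANDARD, cols))
    # Values are returned as text, so a non-numeric QUAL passes through unchanged.
    return [by_name.get(name, missing) for name in wanted]


print(standard_fields("chr1\t100\trs1\tA", ["CHROM", "ALT", "QUAL"]))
# ['chr1', '.', '.']
print(standard_fields("chr1\t100\trs1\tA\tG\tabc\tPASS\tDP=3", ["POS", "QUAL"]))
# ['100', 'abc']
```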

Header-only Files

- If a VCF file contains only headers and no variant records, the tool outputs only the TSV header row of requested field names.

Performance Considerations

- The tool processes VCF files line by line, with minimal memory overhead.
- Extraction scales linearly with input size and number of requested fields.
- For large VCF files, consider extracting only the necessary fields to improve performance.

Limitations

- Cannot filter records (only extracts fields from all records)
- Cannot perform operations or calculations on the extracted fields
- Does not support complex expressions or conditionals
- Limited to tab-separated output format
- Cannot output field descriptions or metadata from the VCF header
- No direct support for multi-allelic splitting or normalization
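Because the tool does not filter records, filtering has to happen downstream. The Python sketch below shows one way to keep only rows of the TSV output whose DP value meets a threshold. The helper `filter_min_depth`, the threshold, and the sample lines are all hypothetical; rows whose value is `.` or non-numeric are dropped by choice in this sketch.

```python
def filter_min_depth(lines, min_dp, column="DP"):
    """Yield the header line, then rows whose `column` value is >= min_dp."""
    lines = iter(lines)
    header_line = next(lines, None)
    if header_line is None:
        return  # empty input: nothing to emit
    header = header_line.rstrip("\r\n").split("\t")
    idx = header.index(column)  # raises ValueError if the column was not extracted
    yield "\t".join(header)
    for line in lines:
        row = line.rstrip("\r\n").split("\t")
        try:
            depth = float(row[idx])
        except (ValueError, IndexError):
            continue  # "." or malformed value: skip the row
        if depth >= min_dp:
            yield "\t".join(row)


# Made-up extractor output for demonstration.
tsv = [
    "CHROM\tPOS\tDP\n",
    "chr1\t100\t14\n",
    "chr1\t200\t.\n",
    "chr1\t300\t3\n",
]
for out_line in filter_min_depth(tsv, min_dp=10):
    print(out_line)
```

In practice the same function could be fed `sys.stdin`, so that the extractor's output is piped straight into the filter.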