VCFX_validator¶

Overview¶

VCFX_validator is a utility tool for checking the validity of VCF files according to the basic VCF format specifications. It performs various checks on the file structure, header format, and data lines to ensure the file is properly formatted and contains valid data.

Usage¶

VCFX_validator [OPTIONS] < input.vcf

Options¶

Option	Description
`-h`, `--help`	Display help message and exit (handled by `vcfx::handle_common_flags`)
`-v`, `--version`	Show program version and exit (handled by `vcfx::handle_common_flags`)
`-s`, `--strict`	Enable stricter validation checks
`-d`, `--report-dups`	Report duplicate records to stderr

Description¶

VCFX_validator processes a VCF file to verify its structural validity by:

Reading the VCF file from standard input (plain or gzip/BGZF compressed)
Checking that all meta-information lines (starting with '##') are properly formatted
Validating that the #CHROM header line is present and has at least 8 required columns
For each data line:
Ensuring it has at least 8 columns
Verifying that CHROM is not empty
Confirming POS is a positive integer
Checking that REF and ALT contain only valid bases
Validating that QUAL is either '.' or a non-negative float
Ensuring FILTER is not empty
Checking INFO and FORMAT fields against header definitions
Validating genotype syntax
Detecting duplicate records when --report-dups is used
Reporting errors for any validation failures
Returning exit code 0 if the file is valid, or 1 if it contains errors

This tool is useful for validating VCF files before processing them with other tools, ensuring they meet the basic requirements of the VCF format specification.

Validation Details¶

Meta-Information Lines¶

Must start with '##'
No specific content validation beyond the prefix

#CHROM Header Line¶

Must be present in the file
Must start with '#CHROM'
Must have at least 8 columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO)
Must appear before any data lines

Data Lines¶

Must have at least 8 columns
CHROM: Must not be empty
POS: Must be a positive integer
ID: Can be empty or '.' (not validated)
REF: Must contain only A,C,G,T,N
ALT: Must contain only A,C,G,T,N
QUAL: Must be '.' or a non-negative float
FILTER: Must not be empty
INFO: Keys must be defined in the header. Numeric counts are validated when numeric. Flags are allowed.

Strict Mode¶

When --strict is used, additional checks are applied: - The number of columns in every data line must exactly match the #CHROM header. - If FORMAT/sample columns are present, each sample field must contain the same number of sub-fields as specified in the FORMAT column. - Any warning that would normally be emitted is treated as an error and causes the validator to exit with a non-zero status.

Examples¶

Basic Validation¶

Check if a VCF file is valid:

VCFX_validator < input.vcf > validated.vcf

Using Strict Mode¶

Enable stricter validation with additional checks:

VCFX_validator --strict < input.vcf > validated.vcf

When the input is valid, the original VCF is written unchanged to standard output, allowing VCFX_validator to be used as a filter in processing pipelines. Informational messages such as VCF file is valid. are printed to standard error.

Redirecting Error Messages¶

Save validation errors to a file:

VCFX_validator < input.vcf 2> validation_errors.txt

Example Output¶

For Valid Files¶

VCF file is valid.

For Invalid Files¶

Error: line 15 has <8 columns.

Error: line 42 POS must be >0.

Error: no #CHROM line found in file.

Special Case Handling¶

Empty Lines¶

Empty lines are ignored during validation

Malformed Header Lines¶

Lines starting with '#' that are neither '##' meta-information lines nor the '#CHROM' header line are considered errors

Missing #CHROM Line¶

If no #CHROM line is found in the file, an error is reported
Data lines encountered before a #CHROM line will cause validation to fail

Whitespace¶

Leading and trailing whitespace is trimmed from field values before validation

Performance Considerations¶

The tool processes the VCF file line by line, with minimal memory requirements
Performance scales linearly with the size of the input file
No external dependencies or reference files are required

Limitations¶

Does not validate VCF version compatibility
No validation of the content of meta-information lines beyond the '##' prefix