VCFX_missing_data_handler¶
Overview¶
VCFX_missing_data_handler identifies and processes missing genotype data in VCF files. It can either flag missing genotypes (default behavior) or impute them with a specified default value, ensuring consistent data representation for downstream analysis.
Usage¶
VCFX_missing_data_handler [OPTIONS] [files...] > processed.vcf
Options¶
Option | Description |
---|---|
--fill-missing , -f |
Impute missing genotypes with a default value |
--default-genotype , -d |
Specify the default genotype for imputation (default: "./.") |
--help , -h |
Display help message and exit (handled by vcfx::handle_common_flags ) |
-v , --version |
Show program version and exit (handled by vcfx::handle_common_flags ) |
Description¶
VCFX_missing_data_handler processes VCF files to identify and handle missing genotype data. The tool:
- Reads one or more VCF files (or standard input if no files specified)
- Identifies missing genotypes in each variant (empty, ".", "./.", or ".|.")
- Either:
- Leaves missing genotypes unchanged (default behavior)
- Replaces missing genotypes with a user-specified value
- Outputs the processed VCF data to standard output
This tool is particularly useful for: - Preparing VCF files for tools that don't handle missing genotypes well - Standardizing the representation of missing data - Imputing missing genotypes with reference (0/0) or other default values - Processing multiple VCF files with consistent handling of missing data
Output Format¶
The output is a valid VCF file with the same format as the input, but with missing genotypes either left as-is or replaced with the specified default value. All header lines are preserved.
Examples¶
Basic Usage (Flag Only)¶
# Process a single file, leaving missing genotypes as-is
VCFX_missing_data_handler < input.vcf > flagged_output.vcf
Impute Missing Data with Default Value¶
# Replace missing genotypes with the default value (./.):
VCFX_missing_data_handler --fill-missing < input.vcf > imputed_output.vcf
Impute with Custom Genotype¶
# Replace missing genotypes with homozygous reference (0/0):
VCFX_missing_data_handler --fill-missing --default-genotype "0/0" < input.vcf > ref_imputed.vcf
Process Multiple Files¶
# Process multiple files at once:
VCFX_missing_data_handler --fill-missing file1.vcf file2.vcf > combined_output.vcf
In a Pipeline¶
# Filter a VCF file and then handle missing data:
grep -v "^#" input.vcf | grep "PASS" | \
VCFX_missing_data_handler --fill-missing --default-genotype "0/0" > filtered_imputed.vcf
Missing Genotype Detection¶
The tool identifies the following representations of missing data:
- Empty genotype field
- Single dot: "."
- Pair of dots with slash: "./."
- Pair of dots with pipe: ".|."
Handling Special Cases¶
- No GT field in FORMAT: If the FORMAT column does not include a GT field, the variant line is left unchanged
- Invalid variant lines: Lines with fewer than 9 columns are passed through unchanged
- Multiple input files: Processes each file in sequence, properly handling headers
- Sample columns structure: Carefully preserves the structure of sample columns, only modifying the GT field
- Empty lines: Preserved with a single newline
- Header lines: Passed through unchanged
- Data before header: Able to handle invalid VCF files where data appears before the header (with a warning)
Performance¶
The tool is designed for efficiency:
- Line-by-line processing allows handling of arbitrarily large files
- No need to load the entire file into memory
- Efficient string splitting and joining operations
- Handles multiple files in a single run
Limitations¶
- No option to specify which samples should have their missing data imputed
- Cannot handle phased vs. unphased genotype distinction in imputation
- No support for probabilistic imputation based on population frequencies
- No ability to flag sites with a high proportion of missing data
- Cannot process only specific regions of a VCF file
- Imputes with the same value regardless of context or neighboring genotypes
- No reporting of the number or percentage of imputed genotypes