VCFX_missing_detector¶
Overview¶
VCFX_missing_detector flags every variant in a VCF file for which at least one sample has missing genotype data. The flag helps researchers find problematic variants, or samples with incomplete data, that may need special handling in downstream analyses.
Usage¶
VCFX_missing_detector [OPTIONS] [input.vcf]
VCFX_missing_detector [OPTIONS] < input.vcf > flagged.vcf
Options¶
| Option | Description |
|---|---|
| -i, --input FILE | Input VCF file (uses memory-mapping for best performance) |
| -t, --threads N | Number of threads for parallel scanning (default: auto) |
| -q, --quiet | Suppress informational messages |
| -h, --help | Display help message and exit |
| -v, --version | Show program version and exit |
Description¶
VCFX_missing_detector analyzes a VCF file to identify variants with missing genotype data. The tool:
- Reads the VCF file using memory-mapped I/O for high throughput
- Uses SIMD-accelerated scanning to detect '.' characters in sample columns
- Runs a multi-threaded pre-scan to determine whether any missing genotypes exist
- Adds MISSING_GENOTYPES=1 to the INFO field of each variant with missing data
- Uses zero-copy output for unchanged lines
This simple annotation allows researchers to easily:
- Filter variants based on missing data presence using standard VCF tools
- Identify data completeness issues that might affect analysis results
- Implement different handling strategies for variants with missing data
Output Format¶
The output is a valid VCF file with the same format as the input, but with an additional INFO field annotation for variants containing missing genotypes:
MISSING_GENOTYPES=1
This annotation is appended to the existing INFO field, or replaces the . placeholder if the INFO field is empty.
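The INFO update rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the tool's actual implementation; the function name `annotate_info` is hypothetical.

```python
MISSING_TAG = "MISSING_GENOTYPES=1"


def annotate_info(info: str) -> str:
    """Return an INFO value with the missing-genotype flag added (hypothetical helper)."""
    # An empty INFO column is written as "." in VCF, so the flag replaces it.
    if info == ".":
        return MISSING_TAG
    # Otherwise the flag is appended after the existing entries with ';'.
    return f"{info};{MISSING_TAG}"
```

With this sketch, `annotate_info(".")` gives `MISSING_GENOTYPES=1`, and `annotate_info("DP=14;AF=0.5")` gives `DP=14;AF=0.5;MISSING_GENOTYPES=1`.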
Examples¶
File Input Mode (Recommended)¶
Use memory-mapped file I/O for best performance:
VCFX_missing_detector -i input.vcf > flagged.vcf
VCFX_missing_detector input.vcf > flagged.vcf
Basic Usage (Stdin)¶
VCFX_missing_detector < input.vcf > flagged.vcf
In a Pipeline with Filtering¶
# Flag variants with missing genotypes, then filter to keep only complete variants
VCFX_missing_detector -i input.vcf | grep -v "MISSING_GENOTYPES=1" > complete_variants.vcf
Counting Missing Variants¶
# Count variants with missing genotypes
VCFX_missing_detector -i input.vcf | grep -c "MISSING_GENOTYPES=1"
Quiet Mode for Scripts¶
# Suppress progress messages
VCFX_missing_detector -q -i input.vcf > flagged.vcf
Missing Genotype Detection¶
The tool uses comprehensive logic to identify various forms of missing genotype data:
- Completely missing genotypes: Formats like ./., .|., or just .
- Partially missing diploid genotypes: When one allele is missing, like ./1 or 0/.
- Multi-field format handling: Properly extracts just the GT portion when other fields (like DP, GQ) are present
- Format field awareness: Correctly identifies the GT position in the FORMAT string
The tool examines each sample column independently and flags a variant if any sample has missing data.
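The detection rules above can be illustrated with a small Python sketch. It is an assumption-laden approximation rather than the tool's SIMD-based implementation: the helper names `genotype_is_missing` and `sample_is_missing` are hypothetical, and the sketch splits on every '/' or '|', so it does not reproduce the tool's diploid-only focus exactly.

```python
def genotype_is_missing(gt: str) -> bool:
    """True if any allele in the GT string is '.', e.g. './.', '.|.', '.', './1', '0/.'."""
    # Normalize the phased separator so a single split covers both '/' and '|'.
    alleles = gt.replace("|", "/").split("/")
    return "." in alleles


def sample_is_missing(format_field: str, sample: str) -> bool:
    """Check only the GT sub-field of one sample column (hypothetical helper)."""
    keys = format_field.split(":")
    if "GT" not in keys:
        return False  # no GT declared in FORMAT, so nothing to evaluate
    gt_pos = keys.index("GT")
    fields = sample.split(":")
    if gt_pos >= len(fields):
        return False  # assumption: a sample with no GT value is not flagged
    return genotype_is_missing(fields[gt_pos])
```

Note that a missing value in another field does not trigger the flag here: `sample_is_missing("GT:DP:GQ", "0|1:.:30")` is false because only the GT portion is inspected.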
Handling Special Cases¶
The tool implements several strategies for handling edge cases:
- GT absent from FORMAT: If GT is not listed in the FORMAT column, the variant is passed through unchanged
- No sample columns: Variants with fewer than 9 columns are passed through unchanged
- Empty INFO field: If the original INFO is "." (missing), it's replaced with "MISSING_GENOTYPES=1"
- Non-empty INFO field: The missing flag is appended with a semicolon separator
- Empty lines: Preserved with a single newline
- Header lines: Passed through unchanged
- Non-diploid genotypes: The tool focuses on diploid genotypes with a single delimiter ('/' or '|')
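The order in which these cases apply to a single line can be sketched as follows. This is a hedged Python approximation of the documented behavior, not the tool's code; `flag_line` is a hypothetical name, and the input is assumed to be one tab-separated line without its trailing newline.

```python
TAG = "MISSING_GENOTYPES=1"


def flag_line(line: str) -> str:
    """Return the line, with the flag added to INFO if any sample GT is missing."""
    # Empty lines and header lines pass through unchanged.
    if not line or line.startswith("#"):
        return line
    cols = line.split("\t")
    # Fewer than 9 columns: passed through unchanged, as in the list above.
    if len(cols) < 9:
        return line
    keys = cols[8].split(":")
    # GT not listed in FORMAT: pass through unchanged.
    if "GT" not in keys:
        return line
    gt_pos = keys.index("GT")
    for sample in cols[9:]:
        fields = sample.split(":")
        if gt_pos >= len(fields):
            continue  # assumption: a sample without a GT value is skipped
        if "." in fields[gt_pos].replace("|", "/").split("/"):
            # Replace the "." placeholder, or append with a ';' separator.
            cols[7] = TAG if cols[7] == "." else f"{cols[7]};{TAG}"
            return "\t".join(cols)
    return line
```

The function returns as soon as the first missing genotype is found, which matches the binary nature of the flag: one incomplete sample is enough to mark the variant.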
Performance Characteristics¶
The tool uses several optimizations for extreme performance:
- Memory-mapped I/O: Uses mmap() for file input to minimize syscall overhead
- SIMD acceleration: Uses NEON/SSE2/AVX2 instructions for fast '.' character scanning
- Multi-threaded pre-scan: Parallel scanning to determine if file has any missing genotypes
- Zero-copy fast path: Files with no missing genotypes are copied verbatim without per-line parsing
- Batched zero-copy output: Groups consecutive unchanged lines into single write() calls
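The batched-output idea can be illustrated with Python's `mmap` module. This is a conceptual sketch under stated assumptions, not the tool's implementation: `copy_with_batched_writes` and its `rewrite` callback are hypothetical, the input file is assumed to be non-empty, and slicing a Python mapping copies bytes, so the sketch shows only the batching of writes rather than true zero-copy output.

```python
import mmap


def copy_with_batched_writes(path, out, rewrite):
    """Copy the file at `path` to the binary stream `out`, one line at a time.

    `rewrite(line)` returns None to keep a line unchanged, or replacement bytes.
    Consecutive unchanged lines are emitted as a single slice of the mapping.
    Returns the number of write() calls issued.
    """
    writes = 0
    with open(path, "rb") as fh, mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        size = len(mm)
        run_start = 0  # first byte of the current run of unchanged lines
        pos = 0
        while pos < size:
            nl = mm.find(b"\n", pos)
            end = size if nl == -1 else nl + 1
            replacement = rewrite(mm[pos:end])
            if replacement is not None:
                # Flush the pending unchanged run in one call, then the new line.
                if run_start < pos:
                    out.write(mm[run_start:pos])
                    writes += 1
                out.write(replacement)
                writes += 1
                run_start = end
            pos = end
        # Flush whatever unchanged tail remains.
        if run_start < size:
            out.write(mm[run_start:size])
            writes += 1
    return writes
```

If `rewrite` never returns a replacement, the whole file goes out in a single write, which corresponds to the fast path for files with no missing genotypes.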
Benchmark Results (4GB VCF, chr21, 427K variants, 2504 samples)¶
| Mode | Time | Notes |
|---|---|---|
| mmap (-i) | ~7s | Multi-threaded pre-scan + zero-copy output |
| stdin | ~5 min | Buffered I/O |
Speedup: ~42x with memory-mapped I/O
Limitations¶
- Primarily designed for diploid genotypes; may not correctly identify missing data in haploid or polyploid contexts
- Limited to checking the GT field; does not evaluate other potential indicators of missing data
- No built-in functionality to annotate the percentage or count of samples with missing data
- No option to customize the INFO field tag name from the default "MISSING_GENOTYPES"
- Cannot perform sample-specific missing data analysis
- No threshold options (e.g., flag only if more than X% of samples have missing data)
- Limited to binary detection (missing/not missing) without quantifying the degree of missingness
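Because the tool reports only a binary flag, per-variant counts or fractions have to be computed downstream. The following is a minimal sketch of one way to do that, assuming tab-separated records with GT listed in FORMAT; the helper `missing_counts` is hypothetical and not part of VCFX.

```python
from typing import Optional, Tuple


def missing_counts(record: str) -> Optional[Tuple[int, int]]:
    """Return (samples with missing GT, total samples), or None if not applicable."""
    if record.startswith("#"):
        return None  # header line
    cols = record.rstrip("\n").split("\t")
    if len(cols) < 10:
        return None  # no sample columns to count
    keys = cols[8].split(":")
    if "GT" not in keys:
        return None  # GT not declared in FORMAT
    gt_pos = keys.index("GT")
    samples = cols[9:]
    n_missing = 0
    for sample in samples:
        fields = sample.split(":")
        gt = fields[gt_pos] if gt_pos < len(fields) else ""
        # Count the sample if any allele is '.', fully or partially missing.
        if "." in gt.replace("|", "/").split("/"):
            n_missing += 1
    return n_missing, len(samples)
```

Dividing the first value by the second gives the per-variant missingness fraction, which a script could compare against a threshold of its own choosing.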