VCFX_nonref_filter¶
Overview¶
VCFX_nonref_filter removes variants from a VCF file where all samples are homozygous reference (0/0), retaining only variants where at least one sample has an alternate allele or a missing genotype.
Usage¶
VCFX_nonref_filter [OPTIONS] [input.vcf]
VCFX_nonref_filter [OPTIONS] < input.vcf > filtered.vcf
Options¶
| Option | Description |
|---|---|
| -h, --help | Display help message and exit |
| -i FILE, --input FILE | Input VCF file (uses fast memory-mapped I/O) |
Description¶
VCFX_nonref_filter examines each variant in a VCF file and filters out those where all samples are homozygous reference (0/0). The tool:
- Processes a VCF file line by line
- For each variant, examines the genotype (GT) field of all samples
- Determines if every sample is definitively homozygous reference
- Retains variants where at least one sample has a non-reference allele or missing data
- Outputs a filtered VCF with only the retained variants
- Passes through all header lines unchanged
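The steps above can be sketched in a few lines of Python. This is an illustration of the filtering logic only, not the tool's C++ implementation: the names `filter_nonref` and `is_homref` are hypothetical, records are assumed to be well formed with a GT key in FORMAT, and special cases are left out for brevity.

```python
from typing import Iterable, Iterator


def is_homref(gt: str) -> bool:
    """True only when every allele in the genotype is exactly "0"."""
    return set(gt.replace("|", "/").split("/")) == {"0"}


def filter_nonref(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines plus records with at least one non-hom-ref sample."""
    for line in lines:
        if line.startswith("#"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        gt_index = fields[8].split(":").index("GT")  # position of GT in FORMAT
        genotypes = (sample.split(":")[gt_index] for sample in fields[9:])
        # Keep the record unless every sample is homozygous reference.
        if not all(is_homref(gt) for gt in genotypes):
            yield line
```

Because the function is a generator over lines, it can be fed an open file handle or `sys.stdin` without loading the whole VCF into memory.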
This tool is particularly useful for:
- Removing uninformative variants where no sample has an alternate allele
- Reducing VCF file size by filtering out invariant sites
- Focusing analysis on polymorphic sites
- Preparing variant files for downstream analysis tools that expect polymorphic sites
Output Format¶
The output is a standard VCF file with the same format as the input, but containing only variants where at least one sample has a non-reference allele or a missing genotype. All header lines are preserved.
Examples¶
Fast File Input (Recommended)¶
# Use memory-mapped I/O for maximum performance (roughly 50-100x faster than stdin mode on large files)
VCFX_nonref_filter -i input.vcf > filtered.vcf
# Equivalent syntax (file as positional argument)
VCFX_nonref_filter input.vcf > filtered.vcf
Standard Input (Slower)¶
# Traditional stdin mode (slower, but works with pipes)
VCFX_nonref_filter < input.vcf > filtered.vcf
Counting Filtered Variants¶
# Count how many variants were removed/retained
input_count=$(grep -v "^#" input.vcf | wc -l)
output_count=$(grep -v "^#" filtered.vcf | wc -l)
removed_count=$((input_count - output_count))
echo "Removed $removed_count homozygous reference variants out of $input_count total variants"
In a Pipeline¶
# stdin mode is required when using pipes
zcat input.vcf.gz | VCFX_nonref_filter > filtered.vcf
Combining with Other Filters¶
# Create a pipeline of filters (uses stdin mode)
cat input.vcf | \
VCFX_nonref_filter | \
VCFX_phred_filter --phred-filter 30 > filtered.vcf
Homozygous Reference Detection¶
The tool treats a genotype as homozygous reference only when every allele is present and equal to "0":
- If a genotype is missing (e.g., "./.", ".|.", or "."), it's considered NOT homozygous reference, and the variant is retained
- For each specified allele in a genotype:
  - The allele must be "0" for it to be considered reference
  - Any non-"0" allele (including "1", "2", etc.) is considered alternate
For example:
- "0/0" → Homozygous reference (filtered if all samples have this)
- "0/1" → Heterozygous (variant retained)
- "1/1" → Homozygous alternate (variant retained)
- "./." → Missing genotype (variant retained)
- "0/." → Partially missing (variant retained)
- "0/0/0" → Polyploid homozygous reference (filtered if all samples have reference)
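The classification rules can be expressed as a single predicate. The following Python sketch is illustrative only; `is_definitely_homref` is a hypothetical name, and the tool's own C++ code may structure the check differently.

```python
import re


def is_definitely_homref(gt: str) -> bool:
    """Return True only if every allele is present and equal to "0"."""
    # Split on either separator: "/" (unphased) or "|" (phased).
    alleles = re.split(r"[/|]", gt)
    # A missing allele (".") or any alternate allele fails the test.
    return all(allele == "0" for allele in alleles)


examples = ["0/0", "0|0", "0/1", "1/1", "./.", ".|.", ".", "0/.", "0/0/0"]
for gt in examples:
    verdict = "hom-ref" if is_definitely_homref(gt) else "variant retained"
    print(f"{gt:6} -> {verdict}")
```

A variant is dropped only when this predicate holds for every sample. A single sample for which it fails is enough to keep the record.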
Handling Special Cases¶
- Missing genotypes: Variants with samples having missing genotypes ("./.") are retained
- Partial missing: Genotypes with some missing alleles (e.g., "0/.") are considered not definitively homozygous reference, so the variant is retained
- No GT field: If the GT field is not present in the FORMAT column, the variant is retained
- Empty lines: Skipped in output
- Header lines: Preserved unchanged
- Malformed VCF lines: Lines with fewer than 10 columns (required for at least one sample) are passed through unchanged
- Data before header: Warning issued and lines passed through unchanged
- Windows line endings: CRLF (\r\n) line endings are handled correctly
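As a rough illustration of how these cases might be ordered, the Python sketch below makes a per-line decision. It is not the tool's code: `handle_line` and its `seen_header` argument are hypothetical, the warning text is invented for the example, and treating a sample with too few subfields as missing is an assumption of this sketch.

```python
import sys
from typing import Optional


def handle_line(line: str, seen_header: bool) -> Optional[str]:
    """Return the line to emit, or None when the line is dropped."""
    text = line.rstrip("\r\n")           # parse without LF or CRLF endings
    if not text:
        return None                       # empty lines are skipped
    if text.startswith("#"):
        return line                       # header lines are preserved
    if not seen_header:
        # Hypothetical warning text; the tool's actual message is not shown here.
        print("warning: data line before header", file=sys.stderr)
        return line                       # passed through unchanged
    fields = text.split("\t")
    if len(fields) < 10:
        return line                       # no sample columns: pass through
    format_keys = fields[8].split(":")
    if "GT" not in format_keys:
        return line                       # no GT field: variant retained
    gt_index = format_keys.index("GT")
    for sample in fields[9:]:
        parts = sample.split(":")
        # Assumption: a sample lacking the GT subfield counts as missing.
        gt = parts[gt_index] if gt_index < len(parts) else "."
        if set(gt.replace("|", "/").split("/")) != {"0"}:
            return line                   # missing, partial, or alternate allele
    return None                           # every sample is 0/0: filtered out
```

Stripping `\r\n` before parsing is what keeps a trailing carriage return from being read as part of the last sample's genotype.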
Performance¶
The tool offers two I/O modes with dramatically different performance characteristics:
File Input Mode (-i or positional argument) - Recommended¶
When a file path is provided, the tool uses memory-mapped I/O with several optimizations:
| Optimization | Description |
|---|---|
| Memory-mapped I/O | Direct file access via mmap, avoiding buffered I/O overhead |
| SIMD line scanning | AVX2/SSE2 vectorized newline detection (with memchr fallback) |
| Zero-copy parsing | Uses std::string_view to avoid string allocations |
| 1MB output buffering | Reduces system call overhead for output |
| Direct GT extraction | Extracts GT field without parsing entire sample columns |
| Early termination | Stops checking samples after first non-homref is found |
| FORMAT caching | Caches GT index when FORMAT string doesn't change |
Performance: Processes ~500MB/second on modern hardware.
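Three of the optimizations in the table (FORMAT caching, direct GT extraction, and early termination) are algorithmic rather than I/O-related, so they can be sketched in Python. The class and function names below are hypothetical, and the sketch does not attempt to reproduce the memory-mapped or SIMD parts of the C++ implementation.

```python
from typing import Iterable, Optional


class GTIndexCache:
    """Remember the GT position for the most recently seen FORMAT string."""

    def __init__(self) -> None:
        self._format: Optional[str] = None
        self._index = -1
        self.lookups = 0  # how many times FORMAT actually had to be parsed

    def index_for(self, format_field: str) -> int:
        if format_field != self._format:
            self.lookups += 1
            keys = format_field.split(":")
            self._index = keys.index("GT") if "GT" in keys else -1
            self._format = format_field
        return self._index


def extract_gt(sample: str, gt_index: int) -> str:
    """Pull out only the GT subfield instead of splitting the whole sample."""
    start = 0
    for _ in range(gt_index):
        start = sample.find(":", start) + 1
        if start == 0:
            return "."  # assumption: too few subfields counts as missing
    end = sample.find(":", start)
    return sample[start:] if end == -1 else sample[start:end]


def first_nonref_sample(samples: Iterable[str], gt_index: int) -> Optional[int]:
    """Return the position of the first sample that is not hom-ref, else None."""
    for i, sample in enumerate(samples):
        gt = extract_gt(sample, gt_index)
        if set(gt.replace("|", "/").split("/")) != {"0"}:
            return i  # early termination: later samples are never inspected
    return None
```

In a file where most records share one FORMAT string, `lookups` stays near 1, and for polymorphic sites the scan over samples usually stops well before the last column.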
Stdin Mode¶
Traditional line-by-line processing via std::getline(). Use this mode when:
- Input comes from a pipe (e.g., zcat file.vcf.gz | VCFX_nonref_filter)
- Processing compressed streams
- The file is small enough that performance doesn't matter
Note: Stdin mode is 50-100x slower than file input mode for large files.
Benchmarks¶
| File Size | File Mode | Stdin Mode | Speedup |
|---|---|---|---|
| 50MB | 0.1s | 5s | 50x |
| 500MB | 1.1s | 55s | 50x |
| 4GB | ~9s | timeout | >100x |
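As a quick sanity check, the throughput and speedup implied by the table can be recomputed from its raw timings. The short Python sketch below does that arithmetic. It assumes 4GB means 4096MB, and it leaves the speedup undefined for the row where stdin mode timed out.

```python
from typing import Optional

# (size in MB, file-mode seconds, stdin-mode seconds) taken from the table.
# Assumption: 4GB is read as 4096MB; None marks the stdin timeout.
benchmarks = [
    (50, 0.1, 5.0),
    (500, 1.1, 55.0),
    (4096, 9.0, None),
]


def throughput_mb_s(size_mb: float, seconds: float) -> float:
    """File-mode throughput in MB per second."""
    return size_mb / seconds


def speedup(file_s: float, stdin_s: Optional[float]) -> Optional[float]:
    """Ratio of stdin time to file-mode time, or None if stdin timed out."""
    return None if stdin_s is None else stdin_s / file_s


for size_mb, file_s, stdin_s in benchmarks:
    ratio = speedup(file_s, stdin_s)
    ratio_text = "n/a (timeout)" if ratio is None else f"{ratio:.0f}x"
    print(f"{size_mb:>5} MB: {throughput_mb_s(size_mb, file_s):.0f} MB/s, speedup {ratio_text}")
```

The file-mode rows work out to roughly 450 to 500 MB/s, which is in line with the throughput figure quoted for file input mode.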
Limitations¶
- No option to filter based on a subset of samples
- Cannot retain specific homozygous reference variants based on other criteria
- No support for filtering by percentage of non-reference samples
- Missing data is always treated as "not definitively homozygous reference"
- No built-in option to keep variants where the reference allele might be incorrect
- Cannot incorporate quality values in the filtering decision
- No reporting on the number of variants removed or statistics about filtered variants
See Also¶
- VCFX_phred_filter - Filter by quality scores
- VCFX_record_filter - General record filtering
- VCFX_population_filter - Filter by population criteria