Skip to content

VCFX_nonref_filter

Overview

VCFX_nonref_filter removes variants from a VCF file where all samples are homozygous reference (0/0), retaining only variants where at least one sample has an alternate allele or a missing genotype.

Usage

VCFX_nonref_filter [OPTIONS] [input.vcf]
VCFX_nonref_filter [OPTIONS] < input.vcf > filtered.vcf

Options

Option Description
-h, --help Display help message and exit
-i FILE, --input FILE Input VCF file (uses fast memory-mapped I/O)

Description

VCFX_nonref_filter examines each variant in a VCF file and filters out those where all samples are homozygous reference (0/0). The tool:

  1. Processes a VCF file line by line
  2. For each variant, examines the genotype (GT) field of all samples
  3. Determines if every sample is definitively homozygous reference
  4. Retains variants where at least one sample has a non-reference allele or missing data
  5. Outputs a filtered VCF with only the retained variants
  6. Passes through all header lines unchanged

This tool is particularly useful for: - Removing uninformative variants where no sample has an alternate allele - Reducing VCF file size by filtering out invariant sites - Focusing analysis on polymorphic sites - Preparing variant files for downstream analysis tools that expect polymorphic sites

Output Format

The output is a standard VCF file with the same format as the input, but containing only variants where at least one sample has a non-reference allele or a missing genotype. All header lines are preserved.

Examples

# Use memory-mapped I/O for maximum performance (100-1000x faster)
VCFX_nonref_filter -i input.vcf > filtered.vcf

# Equivalent syntax (file as positional argument)
VCFX_nonref_filter input.vcf > filtered.vcf

Standard Input (Slower)

# Traditional stdin mode (slower, but works with pipes)
VCFX_nonref_filter < input.vcf > filtered.vcf

Counting Filtered Variants

# Count how many variants were removed/retained
input_count=$(grep -v "^#" input.vcf | wc -l)
output_count=$(grep -v "^#" filtered.vcf | wc -l)
removed_count=$((input_count - output_count))
echo "Removed $removed_count homozygous reference variants out of $input_count total variants"

In a Pipeline

# stdin mode is required when using pipes
zcat input.vcf.gz | VCFX_nonref_filter > filtered.vcf

Combining with Other Filters

# Create a pipeline of filters (uses stdin mode)
cat input.vcf | \
VCFX_nonref_filter | \
VCFX_phred_filter --phred-filter 30 > filtered.vcf

Homozygous Reference Detection

The tool uses comprehensive logic to identify homozygous reference genotypes:

  1. If a genotype is missing (e.g., "./.", ".|.", or "."), it's considered NOT homozygous reference, and the variant is retained
  2. For each specified allele in a genotype:
  3. The allele must be "0" for it to be considered reference
  4. Any non-"0" allele (including "1", "2", etc.) is considered alternate

For example: - "0/0" → Homozygous reference (filtered if all samples have this) - "0/1" → Heterozygous (variant retained) - "1/1" → Homozygous alternate (variant retained) - "./." → Missing genotype (variant retained) - "0/." → Partially missing (variant retained) - "0/0/0" → Polyploid homozygous reference (filtered if all samples have reference)

Handling Special Cases

  • Missing genotypes: Variants with samples having missing genotypes ("./.") are retained
  • Partial missing: Genotypes with some missing alleles (e.g., "0/.") are considered not definitively homozygous reference, so the variant is retained
  • No GT field: If the GT field is not present in the FORMAT column, the variant is retained
  • Empty lines: Skipped in output
  • Header lines: Preserved unchanged
  • Malformed VCF lines: Lines with fewer than 10 columns (required for at least one sample) are passed through unchanged
  • Data before header: Warning issued and lines passed through unchanged
  • Windows line endings: CRLF (\r\n) line endings are handled correctly

Performance

The tool offers two I/O modes with dramatically different performance characteristics:

When a file path is provided, the tool uses memory-mapped I/O with several optimizations:

Optimization Description
Memory-mapped I/O Direct file access via mmap, avoiding buffered I/O overhead
SIMD line scanning AVX2/SSE2 vectorized newline detection (with memchr fallback)
Zero-copy parsing Uses std::string_view to avoid string allocations
1MB output buffering Reduces system call overhead for output
Direct GT extraction Extracts GT field without parsing entire sample columns
Early termination Stops checking samples after first non-homref is found
FORMAT caching Caches GT index when FORMAT string doesn't change

Performance: Processes ~500MB/second on modern hardware.

Stdin Mode

Traditional line-by-line processing via std::getline(). Use this mode when: - Input comes from a pipe (e.g., zcat file.vcf.gz | VCFX_nonref_filter) - Processing compressed streams - The file is small enough that performance doesn't matter

Note: Stdin mode is 50-100x slower than file input mode for large files.

Benchmarks

File Size File Mode Stdin Mode Speedup
50MB 0.1s 5s 50x
500MB 1.1s 55s 50x
4GB ~9s timeout >100x

Limitations

  1. No option to filter based on a subset of samples
  2. Cannot retain specific homozygous reference variants based on other criteria
  3. No support for filtering by percentage of non-reference samples
  4. Missing data is always treated as "not definitely homozygous reference"
  5. No built-in option to keep variants where the reference allele might be incorrect
  6. Cannot incorporate quality values in the filtering decision
  7. No reporting on the number of variants removed or statistics about filtered variants

See Also