VCFX_nonref_filter¶
Overview¶
VCFX_nonref_filter removes variants from a VCF file where all samples are homozygous reference (0/0), retaining only variants where at least one sample has an alternate allele or a missing genotype.
Usage¶
VCFX_nonref_filter [OPTIONS] < input.vcf > filtered.vcf
Options¶
Option | Description |
---|---|
-h , --help |
Display help message and exit (handled by vcfx::handle_common_flags ) |
-v , --version |
Show program version and exit (handled by vcfx::handle_common_flags ) |
Description¶
VCFX_nonref_filter examines each variant in a VCF file and filters out those where all samples are homozygous reference (0/0). The tool:
- Processes a VCF file line by line
- For each variant, examines the genotype (GT) field of all samples
- Determines if every sample is definitively homozygous reference
- Retains variants where at least one sample has a non-reference allele or missing data
- Outputs a filtered VCF with only the retained variants
- Passes through all header lines unchanged
This tool is particularly useful for: - Removing uninformative variants where no sample has an alternate allele - Reducing VCF file size by filtering out invariant sites - Focusing analysis on polymorphic sites - Preparing variant files for downstream analysis tools that expect polymorphic sites
Output Format¶
The output is a standard VCF file with the same format as the input, but containing only variants where at least one sample has a non-reference allele or a missing genotype. All header lines are preserved.
Examples¶
Basic Usage¶
# Filter out variants where all samples are homozygous reference
VCFX_nonref_filter < input.vcf > filtered.vcf
Counting Filtered Variants¶
# Count how many variants were removed/retained
input_count=$(grep -v "^#" input.vcf | wc -l)
output_count=$(grep -v "^#" filtered.vcf | wc -l)
removed_count=$((input_count - output_count))
echo "Removed $removed_count homozygous reference variants out of $input_count total variants"
In a Pipeline¶
# Filter VCF file first by quality, then by non-reference status
grep -v "^#" input.vcf | grep "PASS" | \
VCFX_nonref_filter > high_quality_nonref.vcf
Combining with Other Filters¶
# Create a pipeline of filters
cat input.vcf | \
VCFX_nonref_filter | \
VCFX_phred_filter --phred-filter 30 > filtered.vcf
Homozygous Reference Detection¶
The tool uses comprehensive logic to identify homozygous reference genotypes:
- If a genotype is missing (e.g., "./.", ".|.", or "."), it's considered NOT homozygous reference, and the variant is retained
- For each specified allele in a genotype:
- The allele must be "0" for it to be considered reference
- Any non-"0" allele (including "1", "2", etc.) is considered alternate
For example: - "0/0" → Homozygous reference (filtered if all samples have this) - "0/1" → Heterozygous (variant retained) - "1/1" → Homozygous alternate (variant retained) - "./." → Missing genotype (variant retained) - "0/." → Partially missing (variant retained) - "0/0/0" → Polyploid homozygous reference (filtered if all samples have reference)
Handling Special Cases¶
- Missing genotypes: Variants with samples having missing genotypes ("./.") are retained
- Partial missing: Genotypes with some missing alleles (e.g., "0/.") are considered not definitively homozygous reference, so the variant is retained
- No GT field: If the GT field is not present in the FORMAT column, the variant is retained
- Empty lines: Skipped in output
- Header lines: Preserved unchanged
- Malformed VCF lines: Lines with fewer than 10 columns (required for at least one sample) are passed through unchanged
- Data before header: Warning issued and lines passed through unchanged
Performance¶
The tool is designed for efficiency:
- Processes the VCF file line by line, with minimal memory requirements
- Fast determination of homozygous reference status using string parsing
- Early exit when a non-homozygous sample is found
- No requirements to load the entire file into memory
Limitations¶
- No option to filter based on a subset of samples
- Cannot retain specific homozygous reference variants based on other criteria
- No support for filtering by percentage of non-reference samples
- Missing data is always treated as "not definitely homozygous reference"
- No built-in option to keep variants where the reference allele might be incorrect
- Cannot incorporate quality values in the filtering decision
- No reporting on the number of variants removed or statistics about filtered variants