Skip to content

VCFX_genotype_query

Overview

VCFX_genotype_query filters VCF files to keep only those variant lines where at least one sample has a specified genotype. It's a powerful tool for extracting variants with specific genotypic patterns.

Usage

VCFX_genotype_query [OPTIONS] < input.vcf > filtered.vcf

Options

Option Description
--genotype-query, -g "GENOTYPE" Specify the genotype to query (e.g., "0/1", "1/1")
--strict Use strict string comparison (no phasing unification or allele sorting)
--help, -h Display help message and exit (handled by vcfx::handle_common_flags)
-v, --version Show program version and exit (handled by vcfx::handle_common_flags)

Description

VCFX_genotype_query reads a VCF file from standard input and outputs variants (plus all header lines) where at least one sample has a genotype matching the specified pattern in the 'GT' field.

By default, the tool uses flexible matching: - Unifies phasing separators ('|' and '/' are treated the same) - Sorts alleles numerically (e.g., "1/0" is treated the same as "0/1")

When the --strict option is used, the tool performs exact string matching on genotypes, maintaining phase status and allele order.

Output Format

The output is a standard VCF file containing: - All header lines from the input file (unchanged) - Only variant lines where at least one sample matches the specified genotype pattern

Examples

Basic Usage (Flexible Matching)

# Find all variants where any sample is heterozygous (0/1)
./VCFX_genotype_query --genotype-query "0/1" < input.vcf > heterozygous.vcf

Strict Matching

# Find variants where any sample has a specifically phased heterozygous genotype (0|1)
./VCFX_genotype_query --genotype-query "0|1" --strict < input.vcf > phased_heterozygous.vcf

Finding Homozygous Variants

# Find variants where any sample is homozygous for the alternate allele
./VCFX_genotype_query --genotype-query "1/1" < input.vcf > homozygous_alt.vcf

Finding Multi-allelic Genotypes

# Find variants where any sample has specific multi-allelic genotype
./VCFX_genotype_query --genotype-query "1/2" < input.vcf > multi_allelic.vcf

Handling Special Cases

  • Phased genotypes: By default, "0|1" and "0/1" are considered equivalent; use --strict to differentiate them
  • Multi-allelic variants: Can be queried with appropriate genotypes (e.g., "1/2", "0/2")
  • Missing genotypes: Can match against "././" or "."
  • Non-diploid genotypes: Supported in both flexible and strict modes
  • Malformed VCF lines: Lines with fewer than 10 columns are skipped with a warning
  • Missing GT field: Lines without a GT field in the FORMAT column are skipped

Performance

The tool processes VCF files line by line with minimal memory requirements, making it efficient for large files. Performance scales linearly with the number of samples in the VCF file.

Limitations

  • Only filters based on the presence of the target genotype in at least one sample
  • Cannot filter based on multiple genotype patterns in a single run
  • No support for complex queries (e.g., specific samples with specific genotypes)
  • Skips variants where the GT field is not present in the FORMAT column