VCFX_hwe_tester¶

Overview¶

VCFX_hwe_tester performs Hardy-Weinberg Equilibrium (HWE) testing on biallelic variants in a VCF file, calculating and reporting exact p-values that measure the degree of deviation from expected genotype frequencies.

Usage¶

VCFX_hwe_tester [OPTIONS] < input.vcf > hwe_results.txt

Options¶

Option	Description
`-h`, `--help`	Display help message and exit (handled by `vcfx::handle_common_flags`)
`-v`, `--version`	Show program version and exit (handled by `vcfx::handle_common_flags`)

Description¶

VCFX_hwe_tester analyzes each biallelic variant in a VCF file to determine whether its genotype frequencies conform to Hardy-Weinberg Equilibrium expectations. The tool:

Reads the VCF file line by line
Filters for biallelic variants (skips sites with multiple ALT alleles)
Counts homozygous reference (0/0), heterozygous (0/1), and homozygous alternate (1/1) genotypes
Calculates an exact p-value for Hardy-Weinberg Equilibrium
Reports results in a simple tab-delimited format

The exact test uses a full enumeration of the probability distribution to obtain an accurate p-value, rather than relying on chi-square approximations. A low p-value indicates significant deviation from Hardy-Weinberg Equilibrium, which might suggest: - Population stratification - Selection pressure - Non-random mating - Genotyping errors

Output Format¶

The output is a tab-delimited text file with the following columns:

CHROM  POS  ID  REF  ALT  HWE_pvalue

Where: - CHROM, POS, ID, REF, ALT are copied from the input VCF - HWE_pvalue is the calculated p-value for Hardy-Weinberg Equilibrium

Examples¶

Basic Usage¶

./VCFX_hwe_tester < input.vcf > hwe_results.txt

Filter by HWE p-value¶

# Extract variants with significant HWE deviation (p < 0.05)
./VCFX_hwe_tester < input.vcf | awk -F'\t' '{if(NR==1 || ($6!="HWE_pvalue" && $6<0.05)) print}' > hwe_significant.txt

Check for Genotyping Errors¶

# Find potential genotyping errors (extremely low HWE p-values)
./VCFX_hwe_tester < input.vcf | awk -F'\t' '{if(NR==1 || ($6!="HWE_pvalue" && $6<0.0001)) print}' > potential_errors.txt

Mathematical Details¶

The tool uses an exact test based on the multinomial distribution of genotypes. For each variant:

The observed counts of genotypes (homRef, het, homAlt) are calculated
The expected frequencies under HWE are computed as:
Expected homRef = p² × N
Expected het = 2pq × N
Expected homAlt = q² × N

where p = (2×homRef + het)/(2×N), q = 1-p, and N = total number of individuals

The p-value is calculated by summing the probabilities of all possible genotype configurations that are equally or less likely than the observed configuration

This approach provides accurate p-values even for low minor allele frequencies or small sample sizes.

Handling Special Cases¶

Multi-allelic variants: Skipped entirely (only biallelic variants are considered)
Missing genotypes: Excluded from counts when calculating HWE
Phased genotypes: Phase information is ignored; "0|1" is treated the same as "0/1"
Non-standard genotypes: Any genotype other than 0/0, 0/1, or 1/1 is excluded
No valid genotypes: If no valid genotypes are found, the p-value is reported as 1.0
Perfect equilibrium: For variants with genotype frequencies perfectly matching HWE expectations, the p-value is 1.0

Performance¶

The tool is optimized for efficiency: - Processes one variant at a time, keeping memory usage low - Caches logarithmic factorial values to speed up calculations - Uses numerical optimizations to handle large sample sizes - Scales linearly with the number of variants in the VCF file

Limitations¶

Only works with biallelic variants
Assumes diploid genotypes
No stratification by population or other groupings
No correction for multiple testing
May be less accurate for extremely rare variants with very few non-reference genotypes
No built-in filtering for variant quality or missing data threshold