VCFX_ld_calculator¶

Overview¶

VCFX_ld_calculator calculates pairwise linkage disequilibrium (LD) statistics between genetic variants in a VCF file, expressed as r² values. It can analyze variants across an entire file or within a specified genomic region.

Usage¶

VCFX_ld_calculator [OPTIONS] < input.vcf > ld_matrix.txt

Options¶

Option	Description
`--region <chr:start-end>`	Only compute LD for variants in the specified region
`-h`, `--help`	Display help message and exit (handled by `vcfx::handle_common_flags`)
`-v`, `--version`	Show program version and exit (handled by `vcfx::handle_common_flags`)

Description¶

VCFX_ld_calculator reads a VCF file and computes the pairwise linkage disequilibrium (r²) between genetic variants. Linkage disequilibrium is a measure of the non-random association between alleles at different loci, which is important for understanding genetic structure, identifying haplotype blocks, and designing association studies.

The tool operates as follows:

It reads the VCF file from standard input
It collects diploid genotypes for each variant, encoding them as:
0: Homozygous reference (0/0)
1: Heterozygous (0/1 or 1/0)
2: Homozygous alternate (1/1)
-1: Missing data or other scenarios (including multi-allelic variants)
For each pair of variants within the specified region (or the entire file if no region is specified), it computes pairwise r² values, ignoring samples with missing genotypes
It outputs a matrix of r² values along with variant identifiers

The r² calculation uses the standard formula: - Let X and Y be the genotype arrays for two variants - Calculate means of X and Y (meanX, meanY) - Calculate covariance: cov = average(XY) - meanX * meanY - Calculate variances: varX = average(X²) - meanX², varY similarly - r = cov / sqrt(varX * varY) - r² = r * r

Output Format¶

The output is a tab-delimited matrix of r² values with a header identifying the variants:

#LD_MATRIX_START
         chr1:100 chr1:200 chr1:300
chr1:100      1.0     0.4     0.2
chr1:200     0.4      1.0     0.6
chr1:300     0.2     0.6      1.0

If only one or no variants are found in the region, the tool outputs a message indicating that no pairwise LD could be calculated.

Examples¶

Basic Usage¶

Calculate LD for all variants in a VCF file:

VCFX_ld_calculator < input.vcf > ld_matrix.txt

Region-Specific LD¶

Calculate LD only for variants in a specific genomic region:

VCFX_ld_calculator --region chr1:10000-20000 < input.vcf > ld_matrix.txt

Integration with Other Tools¶

Filter for common variants first, then calculate LD:

cat input.vcf | VCFX_af_subsetter --af-filter '0.05-1.0' | VCFX_ld_calculator > common_variants_ld.txt

Handling Special Cases¶

Missing Genotypes: Samples with missing genotypes (./. or .|.) are skipped when calculating LD between variant pairs
Multi-allelic Variants: Genotypes involving alleles beyond the reference and first alternate (e.g., 1/2, 2/2) are treated as missing data
Single Variant: If only one variant is found in the region, the tool outputs a message stating that no pairwise LD can be calculated
Empty Region: If no variants are found in the specified region, the tool outputs a message stating that no pairwise LD can be calculated
Invalid Region Format: If the region format is invalid, the tool will display an error message

Performance¶

Time complexity is O(n²m) where n is the number of variants and m is the number of samples
Memory usage scales linearly with the number of variants and samples
For large datasets with many variants, consider using the --region option to limit the analysis to specific genomic regions
The tool processes the VCF file line by line, so it can handle large files without loading the entire file into memory

Limitations¶

Only supports biallelic variants; multi-allelic variants are treated as missing data
Requires diploid genotypes; haploid genotypes will be treated as missing data
Assumes standard VCF format with GT field in the FORMAT column
Does not support phased vs. unphased distinction; both "/" and "|" separators are treated the same
No built-in visualization of LD patterns; additional tools would be needed for heatmap creation