VCFX_ancestry_inferrer¶

Overview¶

VCFX_ancestry_inferrer infers the likely population ancestry for each sample in a VCF file by comparing sample genotypes to known population allele frequencies.

Usage¶

VCFX_ancestry_inferrer --frequency <freq_file> [OPTIONS] < input.vcf > ancestry_results.txt

Options¶

Option	Description
`--frequency <FILE>`	Required. Path to a file containing population-specific allele frequencies
`-h`, `--help`	Display help message and exit (handled by `vcfx::handle_common_flags`)
`-v`, `--version`	Show program version and exit (handled by `vcfx::handle_common_flags`)

Description¶

VCFX_ancestry_inferrer analyzes the genotypes of samples in a VCF file and compares them to known population-specific allele frequencies to determine the most likely ancestry for each sample. The tool:

Reads a frequency reference file containing population-specific allele frequencies
Processes the VCF file, examining each biallelic or multiallelic variant
For each sample, calculates ancestry scores by comparing observed genotypes to population frequency data
Assigns each sample to the population with the highest cumulative score
Outputs a simple table mapping each sample to its inferred population

The ancestry inference is based on the principle that individuals from a specific population are more likely to carry alleles at frequencies matching that population's known frequency distribution.

Output Format¶

The output is a tab-delimited text file with the following columns:

Sample  Inferred_Population

Where: - Sample is the sample name from the VCF file - Inferred_Population is the population with the highest ancestry score

Examples¶

Basic Usage¶

./VCFX_ancestry_inferrer --frequency population_freqs.txt < samples.vcf > ancestry_results.txt

Creating a Frequency Reference File¶

The frequency file should have the following tab-delimited format:

CHROM  POS  REF  ALT  POPULATION  FREQUENCY
1      100  A    G    EUR         0.75
1      100  A    G    AFR         0.10
1      100  A    G    EAS         0.25

Using with Multi-Population Data¶

# Combine ancestry results with other data
./VCFX_ancestry_inferrer --frequency global_freqs.txt < diverse_cohort.vcf | \
  join -t $'\t' -1 1 -2 1 - phenotype_data.txt > annotated_results.tsv

Algorithm¶

The ancestry inference algorithm works as follows:

For each variant in the VCF file:
For each sample with a non-reference genotype:
- Look up the frequency of that allele in each reference population
- Add the frequency value to that population's score for the sample
After processing all variants:
For each sample, find the population with the highest cumulative score
Assign the sample to that population

This approach assigns more weight to alleles that are common in a specific population but rare in others, making them more informative for ancestry inference.

Handling Special Cases¶

Multi-allelic variants: Each alternate allele is treated separately and looked up in the frequency reference
Phased genotypes: Phase information is ignored; both "0|1" and "0/1" are treated identically
Missing genotypes: Missing genotypes ("./.") are skipped and don't contribute to ancestry scores
Missing frequency data: Variants without corresponding frequency data are skipped
Identical scores: If two populations have identical scores for a sample, the first one alphabetically is assigned
Diploid genotypes: Both alleles contribute independently to the ancestry score
Empty VCF: Will produce no output rows (empty output file)
Unknown populations: Only populations defined in the frequency file will be considered

Performance¶

The tool is optimized for efficiency: - Uses hash maps for constant-time lookups of frequency data - Single-pass processing of the VCF file - Memory usage scales with: - The number of variants in the frequency file - The number of reference populations - The number of samples in the VCF

Limitations¶

Accuracy depends on the quality and relevance of the population frequency data
Works best with large numbers of variants (hundreds to thousands)
Not designed for detecting admixed individuals (reports only the highest-scoring population)
Assumes independence between variants (does not account for linkage disequilibrium)
No confidence scores or statistical measures of assignment certainty
Cannot handle non-biallelic complex variants (e.g., structural variants)
Doesn't account for sample relatedness within the input VCF