VCFX_ancestry_assigner¶

Overview¶

VCFX_ancestry_assigner assigns samples in a VCF file to ancestral populations using a likelihood-based approach based on population-specific allele frequencies.

Usage¶

VCFX_ancestry_assigner --assign-ancestry <freq_file> < input.vcf > ancestry_results.txt

Options¶

Option	Description
`-a`, `--assign-ancestry <FILE>`	Required. Path to a file containing population-specific allele frequencies
`-h`, `--help`	Display help message and exit (handled by `vcfx::handle_common_flags`)
`-v`, `--version`	Show program version and exit (handled by `vcfx::handle_common_flags`)

Description¶

VCFX_ancestry_assigner determines the most likely ancestral population for each sample in a VCF file by calculating genotype likelihoods across multiple populations. The tool:

Reads a tab-delimited file containing allele frequencies for different populations
Processes the genotypes for each sample in the VCF file
Computes likelihood scores for each possible ancestral population
Assigns each sample to the population with the highest likelihood score
Outputs a simple mapping of sample names to assigned populations

The tool uses a statistical approach that considers the probability of observing each genotype given the population-specific allele frequencies. For each genotype: - Homozygous reference (0/0): P = (1-f)² - Heterozygous (0/1): P = 2f(1-f) - Homozygous alternate (1/1): P = f²

Where f is the frequency of the alternate allele in a given population.

Output Format¶

The output is a tab-delimited text file with the following columns:

Sample  Assigned_Population

Where: - Sample is the sample name from the VCF file - Assigned_Population is the ancestral population with the highest likelihood score

Examples¶

Basic Usage¶

./VCFX_ancestry_assigner --assign-ancestry population_freqs.tsv < samples.vcf > ancestry_assignments.txt

Creating a Frequency Reference File¶

The frequency file should have the following tab-delimited format:

CHROM  POS  REF  ALT  EUR  ASN  AFR
chr1   10000  A    G    0.1  0.2  0.3
chr1   20000  C    T    0.2  0.3  0.4
chr2   15000  T    C    0.4  0.5  0.6

Using in a Pipeline¶

# Process VCF and append ancestry information as a new column in a metadata file
cat input.vcf | ./VCFX_ancestry_assigner --assign-ancestry freq.tsv | \
  join -t $'\t' metadata.txt - > metadata_with_ancestry.txt

Algorithm¶

The ancestry assignment uses a maximum likelihood approach:

For each sample and each variant:
Determine the sample's genotype (0/0, 0/1, or 1/1)
For each population, calculate the log-likelihood of observing that genotype:
- Log(P(0/0|pop)) = 2 * log(1-f)
- Log(P(0/1|pop)) = log(2) + log(f) + log(1-f)
- Log(P(1/1|pop)) = 2 * log(f)
Add this log-likelihood to the population's cumulative score
After processing all variants:
For each sample, identify the population with the highest cumulative log-likelihood
Assign the sample to that population

This approach is statistically sound and accounts for the probability distribution of genotypes under Hardy-Weinberg equilibrium.

Handling Special Cases¶

Missing genotypes: Genotypes denoted as "./." are skipped and don't contribute to likelihood calculations
Multi-allelic variants: Treated as biallelic by considering only the first alternate allele
Missing variants: Variants present in the VCF but not in the frequency file are skipped
Phased genotypes: Phase information is ignored; both "0|1" and "0/1" are treated identically
Equal likelihoods: If two populations have exactly the same likelihood (rare), the first one is assigned
No matching variants: If a sample has no variants that match the frequency file, it's assigned to a default population
Non-standard genotypes: Any genotype other than 0/0, 0/1, or 1/1 is skipped
Empty VCF: Will produce no output rows (empty output file)

Performance¶

The tool is optimized for efficiency: - Uses hash maps for fast lookup of variant frequency data - Single-pass processing of the VCF file - Calculates log-likelihoods to avoid numerical underflow with many variants - Memory usage scales with: - The number of variants in the frequency file - The number of reference populations - The number of samples in the VCF

Limitations¶

Requires a pre-existing set of population-specific allele frequencies
Assumes Hardy-Weinberg equilibrium for probability calculations
Does not account for linkage disequilibrium between variants
Cannot detect admixed individuals (assigns to a single population)
No confidence metrics for population assignment
Not designed for structural variants or complex multi-allelic sites
No support for non-diploid genotypes or unusual ploidy
Performance depends on the number and informativeness of the variants in the frequency file