VCFX_allele_freq_calc

Overview

The VCFX_allele_freq_calc tool calculates allele frequencies for variants in a VCF file. It reads a VCF file and outputs a TSV file with chromosome, position, ID, reference allele, alternate allele, and the calculated allele frequency.

Usage

VCFX_allele_freq_calc [OPTIONS] [input.vcf]
VCFX_allele_freq_calc [OPTIONS] < input.vcf > allele_frequencies.tsv

Options

Option              Description
-i, --input FILE    Input VCF file (uses memory-mapping for best performance)
-q, --quiet         Suppress informational messages
-h, --help          Display help message and exit
-v, --version       Show program version and exit

Description

VCFX_allele_freq_calc computes the allele frequency for each variant in a VCF file. The allele frequency is calculated as the number of alternate alleles divided by the total number of alleles (reference + alternate) across all samples, considering only non-missing genotypes.

The tool:

  • Parses the GT (genotype) field for each sample
  • Counts reference (0) and alternate (non-zero) alleles
  • Calculates frequency as: alternate_count / (reference_count + alternate_count)
  • Outputs results in a clean TSV format
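The calculation above can be sketched in a few lines of Python. This is an illustrative sketch only, not the tool's actual implementation: the function name `allele_frequency` is hypothetical, and two behaviours are assumptions not stated here: returning `None` when no alleles are called, and counting the called allele of a half-missing genotype such as `0/.`.

```python
def allele_frequency(genotypes):
    """Return alt / (ref + alt) over all called alleles.

    Returns None when no allele is called (an assumption of this sketch).
    """
    ref_count = 0
    alt_count = 0
    for gt in genotypes:
        # Phased (|) and unphased (/) separators are treated identically.
        for allele in gt.replace("|", "/").split("/"):
            if not allele.isdigit():
                continue  # missing allele ("."): skipped
            if allele == "0":
                ref_count += 1
            else:
                alt_count += 1  # any non-zero index counts as alternate
    total = ref_count + alt_count
    if total == 0:
        return None
    return alt_count / total


# 3 alternate alleles out of 6 called alleles; "./." contributes nothing.
print(allele_frequency(["0/0", "0|1", "1/1", "./."]))  # prints 0.5
```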

Output Format

The output is a tab-separated file with the following columns:

CHROM  POS  ID  REF  ALT  Allele_Frequency

Where Allele_Frequency is a value between 0.0 and 1.0, formatted with 4 decimal places.
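As a rough illustration of the row layout, the following Python sketch builds one tab-separated line with the frequency fixed to 4 decimal places. The helper `format_row` and the sample values are hypothetical and are not taken from the tool's source; the header line is printed here only to label the columns.

```python
COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "Allele_Frequency"]


def format_row(chrom, pos, var_id, ref, alt, freq):
    """Join the six output columns with tabs; frequency uses 4 decimal places."""
    return "\t".join([chrom, str(pos), var_id, ref, alt, f"{freq:.4f}"])


print("\t".join(COLUMNS))
print(format_row("1", 100, "rs123", "A", "G", 0.25))
```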

Examples

Pass the file by name (with -i or as a positional argument) to use memory-mapped file I/O for best performance:

VCFX_allele_freq_calc -i input.vcf > allele_frequencies.tsv
VCFX_allele_freq_calc input.vcf > allele_frequencies.tsv

Basic Usage (Stdin)

VCFX_allele_freq_calc < input.vcf > allele_frequencies.tsv

Pipe with Other Commands

# Keep header lines and PASS variants, then calculate allele frequencies
awk -F'\t' '/^#/ || $7 == "PASS"' input.vcf | VCFX_allele_freq_calc > filtered_allele_frequencies.tsv

Handling Special Cases

  • Phased genotypes: Both phased (|) and unphased (/) genotypes are handled the same way
  • Missing genotypes (./.): Missing genotypes are skipped in the frequency calculation
  • Multiallelic sites: All non-reference alleles are counted as "alternate" regardless of the specific ALT index
  • No GT field: Variants without a GT field are skipped
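The special cases above can be illustrated on a single VCF data line. The Python sketch below locates GT in the FORMAT column, returns `None` when the record has no GT field, and otherwise counts alleles per sample. The function name `count_alleles` and the example records are hypothetical, and this is not the tool's own parsing code.

```python
import re


def count_alleles(vcf_line):
    """Return (ref_count, alt_count) for one data line, or None if it has no GT."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return None  # no FORMAT or sample columns
    format_keys = fields[8].split(":")
    if "GT" not in format_keys:
        return None  # no GT field: the variant is skipped
    gt_index = format_keys.index("GT")

    ref_count = alt_count = 0
    for sample in fields[9:]:
        parts = sample.split(":")
        if gt_index >= len(parts):
            continue
        # Split on either separator: phased (|) or unphased (/).
        for allele in re.split(r"[/|]", parts[gt_index]):
            if not allele.isdigit():
                continue  # missing allele (".")
            if allele == "0":
                ref_count += 1
            else:
                alt_count += 1  # multiallelic: every non-zero index is "alternate"
    return ref_count, alt_count


line = "1\t100\trs1\tA\tG,T\t50\tPASS\t.\tGT:DP\t0|1:12\t1/2:9\t./.:0"
print(count_alleles(line))  # prints (1, 3)
```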

Performance Characteristics

The tool uses several optimizations for high performance:

  • Memory-mapped I/O: Uses mmap() for file input to minimize syscall overhead
  • SIMD acceleration: Uses NEON/SSE2/AVX2 instructions for fast line/tab scanning
  • Zero-copy parsing: Parses VCF fields without creating intermediate strings
  • Buffered output: Uses a 4 MB output buffer with direct write() syscalls
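To give a feel for the memory-mapped approach, here is a small Python sketch that scans a VCF through `mmap` and counts data lines without reading the file through a line-by-line stream. It only illustrates the general technique: the tool's own implementation and its SIMD scanning are not reproduced here, and the function name `count_data_lines` is hypothetical.

```python
import mmap


def count_data_lines(path):
    """Count non-header, non-empty lines by scanning a memory-mapped file.

    Note: mmap raises ValueError on an empty file, so callers should
    handle that case separately.
    """
    count = 0
    with open(path, "rb") as fh, \
            mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        size = len(mm)
        start = 0
        while start < size:
            end = mm.find(b"\n", start)  # locate the end of the current line
            if end == -1:
                end = size  # last line has no trailing newline
            if end > start and mm[start:start + 1] != b"#":
                count += 1
            start = end + 1
    return count
```

A call such as `count_data_lines("input.vcf")` returns the number of variant records, which is one quick sanity check against the row count of the TSV output.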

Benchmark Results (4GB VCF, chr21, 427K variants, 2504 samples)

Mode         Time
mmap (-i)    ~15s
stdin        ~5 min

Speedup: ~20x with memory-mapped I/O

Limitations

  • Requires the GT field to be present in the FORMAT column
  • Does not distinguish between different alternate alleles in multiallelic sites (all non-reference alleles are counted together)
  • Does not repair malformed VCF files; invalid lines are skipped with a warning where possible