VCFX_allele_counter¶
Overview¶
VCFX_allele_counter
counts the number of reference and alternate alleles in each sample for each variant in a VCF file. This tool provides a simple way to quantify allele occurrences across samples.
Usage¶
VCFX_allele_counter [OPTIONS] < input.vcf > allele_counts.tsv
Options¶
Option | Description |
---|---|
-s , --samples "Sample1 Sample2..." |
Optional. Specify sample names to calculate allele counts for (space-separated). If omitted, all samples are processed. |
-h , --help |
Display help message and exit (handled by vcfx::handle_common_flags ) |
-v , --version |
Show program version and exit (handled by vcfx::handle_common_flags ) |
Description¶
VCFX_allele_counter
processes a VCF file and counts reference and alternate alleles for each variant in each specified sample. The tool:
- Reads a VCF file from standard input
- Identifies sample columns from the VCF header
- For each variant and each sample:
- Extracts the genotype information
- Counts reference alleles (0) and alternate alleles (non-0)
- Outputs both counts in a tabular format
- Outputs a tab-separated file with allele counts for each variant-sample combination
This tool is particularly useful for: - Analyzing allele distribution across samples - Quantifying the presence of specific alleles - Preparing data for population genetics analyses - Validating genotype calls across samples
Output Format¶
The tool produces a tab-separated values (TSV) file with the following columns:
Column | Description |
---|---|
CHROM | Chromosome of the variant |
POS | Position of the variant |
ID | Variant identifier |
REF | Reference allele |
ALT | Alternate allele(s) |
Sample | Sample name |
Ref_Count | Number of reference alleles (0) in the sample's genotype |
Alt_Count | Number of alternate alleles (non-0) in the sample's genotype |
Examples¶
Basic Usage (All Samples)¶
Count alleles for all samples in a VCF file:
VCFX_allele_counter < input.vcf > allele_counts_all.tsv
Specific Samples¶
Count alleles for specific samples:
VCFX_allele_counter --samples "SAMPLE1 SAMPLE2" < input.vcf > allele_counts_subset.tsv
Using with Other Tools¶
Process the output for further analysis:
VCFX_allele_counter < input.vcf | awk -F'\t' '$8 > 0' > samples_with_alt_alleles.tsv
Allele Counting Method¶
Reference Alleles¶
The tool counts an allele as a reference allele when it has the value "0" in the genotype field. For example: - In genotype "0/0", there are 2 reference alleles - In genotype "0/1", there is 1 reference allele - In genotype "1/2", there are 0 reference alleles
Alternate Alleles¶
The tool counts an allele as an alternate allele when it has any non-zero numeric value in the genotype field. For example: - In genotype "0/0", there are 0 alternate alleles - In genotype "0/1", there is 1 alternate allele - In genotype "1/2", there are 2 alternate alleles - In genotype "1/1", there are 2 alternate alleles
Handling Special Cases¶
- Missing genotypes (e.g., "./.", ".|."): No counts are recorded for these samples
- Partial missing (e.g., "0/."): Only the valid allele is counted
- Non-numeric alleles: These are skipped and not counted
Handling Special Cases¶
Missing Data¶
- Genotypes with missing values (
./.
,.
) are skipped - Partial missing genotypes only count the valid alleles present
Multi-allelic Sites¶
- All non-reference alleles are counted as "alternate" regardless of their specific number
- For example, in a genotype "1/2", both alleles count as alternate alleles
- The tool does not differentiate between different alternate alleles
Phased Genotypes¶
- Phasing information is ignored for allele counting
- Phased genotypes (e.g., "0|1") are treated the same as unphased (e.g., "0/1")
Invalid Genotypes¶
- Non-numeric allele values are skipped
- Empty genotype fields are skipped
Performance Considerations¶
- Processes VCF files line by line, with minimal memory requirements
- Scales linearly with input file size and number of samples
- For very large VCF files with many samples, specifying a subset of samples can improve performance
Limitations¶
- Does not distinguish between different alternate alleles (e.g., "1" vs "2")
- No options for filtering by allele count thresholds
- Cannot account for genotype quality or read depth
- Limited to processing standard VCF genotype fields
- Does not produce summary statistics or aggregate counts
- No direct integration with population genetics metrics