VCFX_fasta_converter¶

Overview¶

VCFX_fasta_converter transforms VCF files into FASTA format, converting variant information into a multiple sequence alignment where each sample's sequence represents its genotypes across all variants.

Usage¶

VCFX_fasta_converter [OPTIONS] < input.vcf > output.fasta

Options¶

Option	Description
`-h`, `--help`	Display help message and exit (handled by `vcfx::handle_common_flags`)
`-v`, `--version`	Show program version and exit (handled by `vcfx::handle_common_flags`)

Description¶

VCFX_fasta_converter converts variant information from VCF format into a multiple sequence alignment in FASTA format. The tool:

Reads a VCF file with variant data and sample genotypes
Creates one FASTA entry for each sample in the VCF
Generates one position in the alignment for each variant in the VCF
Represents each genotype as a single character:
Homozygous genotypes (0/0, 1/1, etc.) are represented by the corresponding base
Heterozygous genotypes (0/1, 1/2, etc.) are represented by IUPAC ambiguity codes when possible
Complex genotypes (indels, multi-base variants) are represented as 'N'
Outputs a FASTA file with one sequence per sample, where each position corresponds to a variant in the VCF

This tool is useful for: - Creating alignments for phylogenetic analysis - Visualizing genetic variation across samples - Converting VCF data for use with tools that require FASTA format - Simplifying the representation of genetic variation

Output Format¶

The output is a standard FASTA file with one entry per sample:

>SAMPLE1
AGCTYRMKSW
>SAMPLE2
ATCGYRMNNA
>SAMPLE3
GACTYRSWNN

Each position in the sequence corresponds to a variant in the input VCF, with genotypes encoded as follows: - Homozygous reference (0/0): The reference base (e.g., 'A') - Homozygous alternate (1/1): The alternate base (e.g., 'G') - Heterozygous (0/1): IUPAC ambiguity code (e.g., 'R' for A/G) - Missing genotypes (./.): 'N' - Complex or unrepresentable genotypes: 'N'

Examples¶

Basic Usage¶

./VCFX_fasta_converter < variants.vcf > alignment.fasta

Viewing the Alignment¶

# Convert to FASTA and view with alignment viewer
./VCFX_fasta_converter < variants.vcf > alignment.fasta
aliview alignment.fasta

Building a Phylogenetic Tree¶

# Create a FASTA alignment from VCF and build a tree
./VCFX_fasta_converter < variants.vcf > alignment.fasta
iqtree -s alignment.fasta

IUPAC Ambiguity Codes¶

The tool uses standard IUPAC nucleotide ambiguity codes to represent heterozygous genotypes:

Code	Bases	Meaning
R	A/G	puRine
Y	C/T	pYrimidine
M	A/C	aMino
K	G/T	Keto
S	C/G	Strong (3 H-bonds)
W	A/T	Weak (2 H-bonds)
N	Any	aNy base or missing data

Handling Special Cases¶

Indels and multi-base variants: Represented as 'N' since they can't be unambiguously encoded as a single nucleotide
Multi-allelic sites: Processed using the appropriate IUPAC codes when possible
Phased vs. unphased genotypes: Treated identically (e.g., "0|1" and "0/1" both map to the same IUPAC code)
Missing genotypes: Represented as 'N' in the output sequence
Missing GT field: Variants without a genotype field are skipped
Malformed VCF lines: Skipped with a warning
Invalid nucleotide combinations: Represented as 'N' when no IUPAC code exists

Performance¶

The converter is optimized for efficiency: - Single-pass processing of the VCF file - Efficient string handling for sequence construction - Scales linearly with the number of variants and samples - Maintains a small memory footprint proportional to the number of samples

Limitations¶

Cannot represent structural variants, indels, or multi-base substitutions
Loss of information (quality scores, filters, etc.) from the original VCF
No support for non-diploid genotypes
Limited to the standard IUPAC ambiguity codes for representing heterozygosity
Not suitable for variants with complex ALT alleles
No option to include position information in the output
Cannot handle extremely large VCF files due to memory constraints (sequence storage)