VCFX_sorter¶

Overview¶

VCFX_sorter sorts VCF files by chromosome and position. It provides two sorting methods: standard lexicographic sorting (the default) and natural chromosome sorting, which orders numbered chromosomes by their numeric value.

Usage¶

VCFX_sorter [OPTIONS] < input.vcf > output.vcf

Options¶

Option	Description
`-h`, `--help`	Display help message and exit (handled by `vcfx::handle_common_flags`)
`-v`, `--version`	Show program version and exit (handled by `vcfx::handle_common_flags`)
`-n`, `--natural-chr`	Use natural chromosome sorting (chr1 < chr2 < chr10) instead of lexicographic sorting

Description¶

VCFX_sorter processes a VCF file to organize variants in a consistent order by:

1. Reading the VCF file from standard input
2. Preserving all header lines without modification
3. Loading all data lines into memory
4. Sorting the data lines by chromosome and position
5. Writing the header lines followed by the sorted data lines to standard output
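
The steps above can be sketched in a few lines of Python. This is only an illustration of the read, split, sort, and write flow, not the tool's actual implementation; the function names `sort_vcf_lines` and `chrom_pos_key` are hypothetical, and dropping blank lines is an assumption.

```python
def sort_vcf_lines(lines, key):
    """Split lines into header and data, then return header + sorted data."""
    header, data = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # assumption: blank lines are dropped
        # Header lines (## metadata and #CHROM) are kept in their original order.
        (header if line.startswith("#") else data).append(line)
    data.sort(key=key)  # all data lines are held in memory while sorting
    return header + data


def chrom_pos_key(line):
    """Default ordering: chromosome as a string, position as an integer."""
    fields = line.split("\t")
    return (fields[0], int(fields[1]))


example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr2\t1000\t.\tA\tT\t.\tPASS\t.\n",
    "chr1\t2000\t.\tG\tC\t.\tPASS\t.\n",
    "chr10\t1500\t.\tT\tA\t.\tPASS\t.\n",
]
for out_line in sort_vcf_lines(example, chrom_pos_key):
    print(out_line)
```

Passing the sort key as a parameter keeps the header handling separate from the ordering rule, so a different key can be swapped in without touching the rest of the flow.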

The tool supports two distinct sorting methods:

- Lexicographic sorting (default): Sorts chromosomes alphabetically (chr1, chr10, chr2, ...)
- Natural sorting: Sorts chromosomes in numeric order when possible (chr1, chr2, ..., chr10, ...)

This tool is particularly useful for:

- Preparing VCF files for downstream analysis tools that expect sorted input
- Merging multiple VCF files that need consistent ordering
- Improving readability and navigation of VCF files
- Making binary searches possible on VCF data

Sorting Details¶

Lexicographic Sorting¶

In the default lexicographic mode:

- Chromosomes are compared as strings (e.g., 'chr2' comes after 'chr10')
- Positions are compared numerically within the same chromosome
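
A small Python illustration of these two rules, using made-up records: the chromosome is compared as a string, and the position as an integer. It only demonstrates the comparison behaviour described here and is not taken from the tool's source.

```python
records = [("chr2", 1000), ("chr10", 1500), ("chr1", 2000), ("chr1", 900)]

# Tuples compare element by element: the chromosome string first,
# then the integer position.
lexicographic = sorted(records)

# "chr10" sorts before "chr2" because "1" < "2" at the fourth character.
# 900 sorts before 2000 as integers, although "2000" < "900" as strings.
print(lexicographic)
```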

Natural Chromosome Sorting¶

When the --natural-chr option is used:

1. The "chr" prefix (case-insensitive) is identified and split off from the rest of the name
2. Any leading digits in the remainder are parsed as a number
3. Remaining characters are treated as a suffix
4. Sorting precedence:
   - First by chromosome prefix (if different)
   - Then by numeric part (if both have numbers)
   - Then by suffix (if both have the same number)
   - Finally by position

This results in more intuitive ordering where chr1 < chr2 < chr10, instead of chr1 < chr10 < chr2.
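
The natural ordering can be approximated with the following Python sketch. The names `parse_chrom` and `natural_key` are hypothetical and this is not the tool's own code. Two details are assumptions, since the rules above do not settle them: the prefix is normalized to lower case, and names with a numeric part sort before names without one (so chr10 comes before chrX).

```python
import re


def parse_chrom(name):
    """Split a chromosome name into (prefix, number, suffix)."""
    prefix, rest = "", name
    if name[:3].lower() == "chr":
        # Assumption: the prefix is normalized to lower case.
        prefix, rest = "chr", name[3:]
    match = re.match(r"[0-9]+", rest)
    if match:
        return prefix, int(match.group()), rest[match.end():]
    return prefix, None, rest


def natural_key(chrom, pos):
    """Sort key: prefix, then numeric part, then suffix, then position."""
    prefix, number, suffix = parse_chrom(chrom)
    # Assumption: names with a numeric part sort before names without one.
    has_no_number = number is None
    return (prefix, has_no_number, number or 0, suffix, pos)


records = [("chr2", 1000), ("chr1", 2000), ("chr10", 1500), ("chrX", 5)]
ordered = sorted(records, key=lambda r: natural_key(*r))
print(ordered)
```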

Examples¶

Basic Lexicographic Sorting¶

Sort a VCF file using standard lexicographic chromosome ordering:

VCFX_sorter < unsorted.vcf > sorted.vcf

Natural Chromosome Sorting¶

Sort a VCF file using natural chromosome ordering:

VCFX_sorter --natural-chr < unsorted.vcf > sorted.vcf

Example Transformations¶

Lexicographic Sorting¶

Before:
chr2  1000  .  A  T  .  PASS  .
chr1  2000  .  G  C  .  PASS  .
chr10 1500  .  T  A  .  PASS  .

After:
chr1  2000  .  G  C  .  PASS  .
chr10 1500  .  T  A  .  PASS  .
chr2  1000  .  A  T  .  PASS  .

Natural Chromosome Sorting¶

Before:
chr2  1000  .  A  T  .  PASS  .
chr1  2000  .  G  C  .  PASS  .
chr10 1500  .  T  A  .  PASS  .

After:
chr1  2000  .  G  C  .  PASS  .
chr2  1000  .  A  T  .  PASS  .
chr10 1500  .  T  A  .  PASS  .

Handling Special Cases¶

Malformed Lines¶

Lines with fewer than 8 columns are skipped with a warning
Lines with an invalid position value are skipped with a warning
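
A minimal Python sketch of these two checks follows. The function name `parse_data_line` and the warning texts are hypothetical, and treating only plain ASCII digit strings as valid positions is an assumption; the tool's exact validity rule is not documented here.

```python
import sys


def parse_data_line(line):
    """Return (chrom, pos, line), or None if the line should be skipped."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        # Hypothetical warning text.
        print(f"Warning: skipping line with fewer than 8 columns: {line!r}",
              file=sys.stderr)
        return None
    pos_text = fields[1]
    # Assumption: a valid position consists only of ASCII digits.
    if not (pos_text.isascii() and pos_text.isdigit()):
        print(f"Warning: skipping line with invalid position: {line!r}",
              file=sys.stderr)
        return None
    return fields[0], int(pos_text), line
```

Returning None for skipped lines lets the caller filter them out before sorting, so one bad record does not abort the whole run.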

Empty Input¶

If no input is provided, the help message is displayed

Missing Header¶

If no #CHROM header line is found in the input, a warning is issued but processing continues

Complex Chromosome Names¶

Chromosomes with non-standard naming follow sorting rules based on the selected mode
Examples of parsing in natural mode:
"chr1" → prefix="chr", number=1, suffix=""
"chrX" → prefix="chr", number=none, suffix="X"
"chr10_alt" → prefix="chr", number=10, suffix="_alt"
"scaffold_123" → prefix="", number=none, suffix="scaffold_123"

Performance Considerations¶

The tool reads the entire VCF file into memory before sorting
Memory usage scales with the number of variants in the input file
Very large VCF files may require significant memory
Processing time is dominated by the sorting operation, which is O(n log n)

Limitations¶

No support for on-disk sorting of files too large to fit in memory
Cannot sort by other fields besides chromosome and position
Does not validate VCF format beyond basic column counting
No handling of compressed (gzipped) VCF files directly
Does not preserve the original relative order of variants that share the same chromosome and position