VCFX_sorter¶
Overview¶
VCFX_sorter
is a utility tool for sorting VCF files by chromosome and position. It provides two sorting methods: standard lexicographic sorting and natural chromosome sorting, which handles chromosome numbering in a more intuitive way.
Usage¶
VCFX_sorter [OPTIONS] < input.vcf > output.vcf
Options¶
Option | Description |
---|---|
-h , --help |
Display help message and exit (handled by vcfx::handle_common_flags ) |
-v , --version |
Show program version and exit (handled by vcfx::handle_common_flags ) |
-n , --natural-chr |
Use natural chromosome sorting (chr1 < chr2 < chr10) instead of lexicographic sorting |
Description¶
VCFX_sorter
processes a VCF file to organize variants in a consistent order by:
- Reading the VCF file from standard input
- Preserving all header lines without modification
- Loading all data lines into memory
- Sorting the data lines by chromosome and position
- Writing the header lines followed by the sorted data lines to standard output
The tool supports two distinct sorting methods: - Lexicographic sorting (default): Sorts chromosomes alphabetically (chr1, chr10, chr2, ...) - Natural sorting: Sorts chromosomes in numeric order when possible (chr1, chr2, ..., chr10, ...)
This tool is particularly useful for: - Preparing VCF files for downstream analysis tools that expect sorted input - Merging multiple VCF files that need consistent ordering - Improving readability and navigation of VCF files - Making binary searches possible on VCF data
Sorting Details¶
Lexicographic Sorting¶
In the default lexicographic mode: - Chromosomes are compared as strings (e.g., 'chr2' comes after 'chr10') - Positions are compared numerically within the same chromosome
Natural Chromosome Sorting¶
When the --natural-chr
option is used:
1. The "chr" prefix (case-insensitive) is identified and removed
2. Any leading digits are parsed as a number
3. Remaining characters are treated as a suffix
4. Sorting precedence:
- First by chromosome prefix (if different)
- Then by numeric part (if both have numbers)
- Then by suffix (if both have the same number)
- Finally by position
This results in more intuitive ordering where chr1 < chr2 < chr10, instead of chr1 < chr10 < chr2.
Examples¶
Basic Lexicographic Sorting¶
Sort a VCF file using standard lexicographic chromosome ordering:
VCFX_sorter < unsorted.vcf > sorted.vcf
Natural Chromosome Sorting¶
Sort a VCF file using natural chromosome ordering:
VCFX_sorter --natural-chr < unsorted.vcf > sorted.vcf
Example Transformations¶
Lexicographic Sorting¶
Before:
chr2 1000 . A T . PASS .
chr1 2000 . G C . PASS .
chr10 1500 . T A . PASS .
After:
chr1 2000 . G C . PASS .
chr10 1500 . T A . PASS .
chr2 1000 . A T . PASS .
Natural Chromosome Sorting¶
Before:
chr2 1000 . A T . PASS .
chr1 2000 . G C . PASS .
chr10 1500 . T A . PASS .
After:
chr1 2000 . G C . PASS .
chr2 1000 . A T . PASS .
chr10 1500 . T A . PASS .
Handling Special Cases¶
Malformed Lines¶
- Lines with fewer than 8 columns are skipped with a warning
- Lines with an invalid position value are skipped with a warning
Empty Input¶
- If no input is provided, the help message is displayed
Missing Header¶
- If no #CHROM header line is found in the input, a warning is issued but processing continues
Complex Chromosome Names¶
- Chromosomes with non-standard naming follow sorting rules based on the selected mode
- Examples of parsing in natural mode:
- "chr1" → prefix="chr", number=1, suffix=""
- "chrX" → prefix="chr", number=none, suffix="X"
- "chr10_alt" → prefix="chr", number=10, suffix="_alt"
- "scaffold_123" → prefix="", number=none, suffix="scaffold_123"
Performance Considerations¶
- The tool reads the entire VCF file into memory before sorting
- Memory usage scales with the number of variants in the input file
- Very large VCF files may require significant memory
- Processing time is dominated by the sorting operation, which is O(n log n)
Limitations¶
- No support for on-disk sorting of files too large to fit in memory
- Cannot sort by other fields besides chromosome and position
- Does not validate VCF format beyond basic column counting
- No handling of compressed (gzipped) VCF files directly
- Cannot maintain the original order of variants at the same chromosome and position