VCFX_compressor¶

Overview¶

VCFX_compressor provides simple compression and decompression functionality for VCF files using the zlib library. It allows users to compress VCF files for storage or transfer, and decompress them for analysis.

Usage¶

VCFX_compressor [OPTIONS] < input_file > output_file

Options¶

Option	Description
`-c`, `--compress`	Compress the input VCF file (read from stdin, write to stdout)
`-d`, `--decompress`	Decompress the input VCF.gz file (read from stdin, write to stdout)
`-h`, `--help`	Display help message and exit (handled by `vcfx::handle_common_flags`)
`-v`, `--version`	Show program version and exit (handled by `vcfx::handle_common_flags`)

Description¶

VCFX_compressor is a straightforward utility that enables compression and decompression of VCF files using zlib's DEFLATE algorithm. The tool:

Reads input data from standard input (stdin)
Processes the data in memory-efficient chunks
Applies compression or decompression based on the specified mode
Writes the processed data to standard output (stdout)

The compression mode produces output compatible with gzip, while the decompression mode can handle standard gzip-compressed files. This makes the tool interoperable with widely used genomics software that expects gzip-compressed VCF files.

Output Format¶

The output format depends on the chosen mode:

Compression mode: Produces a gzip-compatible compressed binary file
Decompression mode: Produces a plain text VCF file

Examples¶

Compressing a VCF File¶

# Basic compression
./VCFX_compressor --compress < input.vcf > output.vcf.gz

# Compress and view file size reduction
./VCFX_compressor --compress < input.vcf > output.vcf.gz
echo "Original size: $(wc -c < input.vcf) bytes"
echo "Compressed size: $(wc -c < output.vcf.gz) bytes"

Decompressing a VCF File¶

# Basic decompression
./VCFX_compressor --decompress < input.vcf.gz > output.vcf

# Decompress for analysis
./VCFX_compressor --decompress < input.vcf.gz | head -n 20

In a Pipeline¶

# Filter a VCF file, compress it, then decompress for viewing
cat input.vcf | grep -v "^#" | grep "PASS" | ./VCFX_compressor --compress > filtered.vcf.gz
./VCFX_compressor --decompress < filtered.vcf.gz | head

Data Processing¶

VCFX_compressor processes data in chunks to maintain memory efficiency. The default chunk size is 16KB, which provides a good balance between memory usage and processing efficiency. The tool:

Reads input data in 16KB chunks
Processes each chunk using zlib's compression/decompression functions
Writes processed data to output immediately as each chunk is completed
Continues until all input data has been processed

This streaming approach allows the tool to handle files of any size without loading the entire file into memory.

Handling Special Cases¶

The tool implements several strategies for handling edge cases:

Empty files: Properly handles empty input, producing valid empty output
Truncated inputs: When decompressing, detects and warns about truncated or incomplete compressed data
Invalid compressed data: Reports errors when attempting to decompress invalid or corrupted data
I/O errors: Provides error messages for issues with reading input or writing output
Incorrect usage: Enforces mutually exclusive selection of compression or decompression mode

Performance¶

VCFX_compressor is designed for efficiency:

Processes data in chunks, maintaining a low and consistent memory footprint
Uses zlib's optimized compression/decompression algorithms
Avoids unnecessary memory copying or buffering of the entire file
Provides reasonable compression ratios typical of gzip compression
Handles large files efficiently due to its streaming architecture

Limitations¶

Not BGZF compatible: Does not produce block-gzipped format required for indexed access via tabix
No compression level control: Uses zlib's default compression level with no user-configurable options
Single-threaded: Does not utilize multi-threading for potentially faster processing
No integrity verification: Does not verify the integrity of decompressed data
Limited format support: Only handles gzip compression, not other formats like bzip2 or xz
No indexing support: Does not maintain or generate indices for compressed files
Standard I/O only: Cannot directly specify input and output filenames (uses stdin/stdout)