VCFX_sample_extractor

Overview

VCFX_sample_extractor is a tool that extracts a subset of samples from a VCF file, allowing you to create a smaller, focused VCF containing only the samples of interest.

Usage

VCFX_sample_extractor [OPTIONS] < input.vcf > subset.vcf

Options

Option               Description
-s, --samples LIST   Comma- or space-separated list of sample names to extract
-h, --help           Display help message and exit (handled by vcfx::handle_common_flags)
-v, --version        Show program version and exit (handled by vcfx::handle_common_flags)
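
To make the accepted LIST format concrete, here is a minimal Python sketch of how such a value could be split into sample names. It is not the tool's actual parser: the function name parse_sample_list is hypothetical, and treating any mix of commas and whitespace as delimiters is an assumption. Dropping empty names mirrors the behavior described under Handling Special Cases.

```python
import re


def parse_sample_list(raw: str) -> list[str]:
    """Split a --samples value into names (hypothetical helper, not the tool's code).

    Assumption: commas and whitespace are both treated as delimiters,
    and empty names are ignored.
    """
    return [name for name in re.split(r"[,\s]+", raw) if name]


if __name__ == "__main__":
    print(parse_sample_list("SAMPLE1,SAMPLE2,SAMPLE3"))
    print(parse_sample_list("SAMPLE1 SAMPLE2 SAMPLE3"))
```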

Description

VCFX_sample_extractor reads a VCF file from standard input, identifies the samples specified on the command line, and writes a new VCF containing only those samples to standard output. This is useful for:

  • Reducing file size by extracting only relevant samples
  • Creating sample-specific VCF files for specialized analyses
  • Focusing on specific cohorts or subgroups
  • Complying with data-sharing permissions that cover only specific samples

The tool:

  1. Reads the VCF header to identify sample columns
  2. Keeps all meta-information and header lines
  3. Extracts only the specified samples, keeping them in their original input order and leaving their per-sample data unchanged
  4. Warns about any requested samples that aren't found in the input VCF
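
The steps above can be sketched in Python as a streaming filter. This is an illustration under stated assumptions, not the tool's C++ implementation: the function name extract_samples, the warning wording, and writing warnings to standard error are all hypothetical. The sketch assumes tab-separated lines with the nine fixed VCF columns (CHROM through FORMAT) before the sample columns.

```python
import sys
from typing import Iterable, Iterator


def extract_samples(lines: Iterable[str], wanted: set[str]) -> Iterator[str]:
    """Yield VCF lines restricted to the wanted samples (illustrative sketch)."""
    keep: list[int] = []  # column indices of selected samples, in input order
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Step 2: meta-information lines pass through unchanged.
            yield line
        elif line.startswith("#CHROM"):
            # Step 1: sample names start at column index 9 (after FORMAT).
            cols = line.split("\t")
            keep = [i for i, name in enumerate(cols) if i >= 9 and name in wanted]
            # Step 4: report requested samples that are absent (wording is hypothetical).
            found = {cols[i] for i in keep}
            for name in sorted(wanted - found):
                print(f"Warning: sample '{name}' not found", file=sys.stderr)
            yield "\t".join(cols[:9] + [cols[i] for i in keep])
        elif line:
            # Step 3: keep the fixed columns plus the selected sample columns.
            cols = line.split("\t")
            yield "\t".join(cols[:9] + [cols[i] for i in keep if i < len(cols)])


if __name__ == "__main__":
    for out_line in extract_samples(sys.stdin, {"SAMPLE1", "SAMPLE2"}):
        print(out_line)
```

Because the selected indices are collected by scanning the header left to right, the output order follows the input VCF rather than the order in which names were requested.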

Output Format

The output is a standard VCF file containing:

  • All header lines from the input file
  • A modified #CHROM header line that includes only the selected samples
  • All variant lines from the input with only the selected sample columns

Examples

Extract a Single Sample

./VCFX_sample_extractor --samples "SAMPLE1" < input.vcf > single_sample.vcf

Extract Multiple Samples with Comma Delimiter

./VCFX_sample_extractor --samples "SAMPLE1,SAMPLE2,SAMPLE3" < input.vcf > subset.vcf

Extract Multiple Samples with Space Delimiter

./VCFX_sample_extractor --samples "SAMPLE1 SAMPLE2 SAMPLE3" < input.vcf > subset.vcf

Process Large Files

# Extract a few samples from a large compressed VCF
zcat large_file.vcf.gz | ./VCFX_sample_extractor --samples "SAMPLE1,SAMPLE2" | gzip > subset.vcf.gz

Handling Special Cases

  • Missing samples: If a requested sample isn't found in the input VCF, a warning is issued but processing continues with the samples that were found
  • No samples found: If none of the requested samples are found in the input VCF, the output still contains the header lines and the variant lines, but with no sample columns
  • Malformed VCF: Lines with fewer than 8 columns are skipped with a warning
  • No sample columns: Input variant lines without sample columns (fewer than 10 columns) are skipped
  • Empty sample names: Empty sample names in the input list are ignored
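
The column-count rules above can be expressed as a small check. The following Python sketch is only an illustration: the function name classify_variant_line and its return labels are hypothetical, and it assumes tab-separated variant lines.

```python
def classify_variant_line(line: str) -> str:
    """Classify a variant line by its column count (hypothetical helper).

    Returns "malformed" for fewer than 8 columns, "no_samples" for fewer
    than 10 columns, and "ok" otherwise. The labels are illustrative only.
    """
    n_cols = len(line.rstrip("\n").split("\t"))
    if n_cols < 8:
        return "malformed"   # skipped with a warning
    if n_cols < 10:
        return "no_samples"  # skipped: no sample columns to extract
    return "ok"


if __name__ == "__main__":
    print(classify_variant_line("1\t100\t.\tA\tG"))
```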

Performance

The tool processes VCF files line by line, with minimal memory requirements even for very large VCF files. Performance scales with:

  • Number of samples in the input VCF (parsing time)
  • Number of samples being extracted (output size)

Limitations

  • No wildcards or regular expressions for sample name matching
  • Cannot extract samples based on properties or metadata
  • Cannot reorder samples in the output file (order follows the original VCF)
  • No option to rename samples in the output file
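
Since the tool itself does not match wildcards, one possible workaround is to expand patterns against the #CHROM header line beforehand and pass the resulting names to --samples. The Python sketch below is an assumption-laden illustration rather than part of VCFX: the function name expand_patterns is hypothetical, and it uses shell-style patterns via the standard library's fnmatch module.

```python
from fnmatch import fnmatchcase


def expand_patterns(header_line: str, patterns: list[str]) -> list[str]:
    """Return sample names from a #CHROM line that match any shell-style pattern.

    Hypothetical helper: assumes a tab-separated header with sample names
    starting at column index 9. Matching is case-sensitive.
    """
    samples = header_line.rstrip("\n").split("\t")[9:]
    return [s for s in samples if any(fnmatchcase(s, p) for p in patterns)]


if __name__ == "__main__":
    header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tCASE_01\tCASE_02\tCTRL_01"
    # The joined string can be used as the value of --samples.
    print(",".join(expand_patterns(header, ["CASE_*"])))
```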