VCFX_sample_extractor¶
Overview¶
VCFX_sample_extractor is a tool that extracts a subset of samples from a VCF file, allowing you to create a smaller, focused VCF containing only the samples of interest.
Usage¶
VCFX_sample_extractor [OPTIONS] < input.vcf > subset.vcf
Options¶
Option | Description |
---|---|
-s , --samples LIST |
Comma or space separated list of sample names to extract |
-h , --help |
Display help message and exit (handled by vcfx::handle_common_flags ) |
-v , --version |
Show program version and exit (handled by vcfx::handle_common_flags ) |
Description¶
VCFX_sample_extractor reads a VCF file from standard input, identifies the samples specified in the command line, and produces a new VCF file containing only those samples. This is useful for:
- Reducing file size by extracting only relevant samples
- Creating sample-specific VCF files for specialized analyses
- Focusing on specific cohorts or subgroups
- Compliance with data sharing permissions that allow sharing only specific samples
The tool: 1. Reads the VCF header to identify sample columns 2. Maintains all meta-information and header lines 3. Extracts only the specified samples, preserving order and data integrity 4. Warns about any requested samples that aren't found in the input VCF
Output Format¶
The output is a standard VCF file containing: - All header lines from the input file - A modified #CHROM header line that includes only the selected samples - All variant lines from the input with only the selected sample columns
Examples¶
Extract a Single Sample¶
./VCFX_sample_extractor --samples "SAMPLE1" < input.vcf > single_sample.vcf
Extract Multiple Samples with Comma Delimiter¶
./VCFX_sample_extractor --samples "SAMPLE1,SAMPLE2,SAMPLE3" < input.vcf > subset.vcf
Extract Multiple Samples with Space Delimiter¶
./VCFX_sample_extractor --samples "SAMPLE1 SAMPLE2 SAMPLE3" < input.vcf > subset.vcf
Process Large Files¶
# Extract a few samples from a large compressed VCF
zcat large_file.vcf.gz | ./VCFX_sample_extractor --samples "SAMPLE1,SAMPLE2" | gzip > subset.vcf.gz
Handling Special Cases¶
- Missing samples: If a requested sample isn't found in the input VCF, a warning is issued but processing continues with the samples that were found
- No samples found: If none of the requested samples are found in the input VCF, the output will contain only the header and variant lines with no sample columns
- Malformed VCF: Lines with fewer than 8 columns are skipped with a warning
- No sample columns: Input variant lines without sample columns (fewer than 10 columns) are skipped
- Empty sample names: Empty sample names in the input list are ignored
Performance¶
The tool processes VCF files line by line, with minimal memory requirements even for very large VCF files. Performance scales with: - Number of samples in the input VCF (parsing time) - Number of samples being extracted (output size)
Limitations¶
- No wildcards or regular expressions for sample name matching
- Cannot extract samples based on properties or metadata
- Cannot reorder samples in the output file (order follows the original VCF)
- No option to rename samples in the output file