How good is Genozip at compressing VCF files?
Relative sizes of already-compressed .vcf.gz files, before and after compression with genozip.
Used Genozip 15.0.4 with --best. --reference was used for compressing the two GVCF files. Datasets can be found here.
Compressing and uncompressing a VCF or BCF file
The examples on this page use VCF files. Genozip can also compress BCF files, with some limitations.
Compressing and uncompressing VCF is straightforward:
$ genozip myfile.vcf.gz
$ ls -lh myfile.vcf*
-rwxrwxrwx 1 divon divon 88M Aug 1 08:49 myfile.vcf.genozip
-rwxrwxrwx 1 divon divon 244M Feb 10 2020 myfile.vcf.gz
$ genounzip myfile.vcf.genozip
$ genocat myfile.vcf.genozip
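As an independent sanity check, the round trip can be verified by comparing a checksum of the original uncompressed data with the output of genocat. The sketch below is illustrative only: roundtrip_check is a hypothetical helper name, and it assumes that the output file name follows the pattern seen in the listing above (myfile.vcf.gz becomes myfile.vcf.genozip) and that no file with that name exists yet.

```shell
# roundtrip_check FILE.vcf.gz
# Hypothetical helper: compresses FILE.vcf.gz with genozip, then compares a
# checksum of the original uncompressed text with the output of genocat.
# Returns 0 when the two checksums match.
roundtrip_check() {
    src="$1"
    # Assumed naming, as in the listing above: myfile.vcf.gz -> myfile.vcf.genozip
    compressed="${src%.gz}.genozip"

    genozip "$src" || return 1

    # cksum reads stdin and prints "<crc> <byte count>"
    original_sum=$(gzip -dc "$src" | cksum)
    restored_sum=$(genocat "$compressed" | cksum)

    [ "$original_sum" = "$restored_sum" ]
}

# Usage (assumes genozip and genocat are installed):
#   roundtrip_check myfile.vcf.gz && echo "round trip OK"
```

The comparison is done on the uncompressed text rather than on the .gz bytes, since the same VCF content can be gzipped into different byte streams.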
Compressing using a reference file
It is possible to compress a VCF using a reference file; however, this provides a meaningful benefit only in certain cases:
1. GVCF files
2. Ultima Genomics Deep Variant files
3. VCF files with little or no sample or INFO data.
$ genozip --reference hs37d5.fa.gz myfile.g.vcf.gz
Note: when a particular FASTA file is used as a reference for the first time, Genozip produces a .ref.genozip file, which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can pass it to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly, for example:
$ genozip --make-reference hs37d5.fa.gz
Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly.
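The choice between the FASTA and an existing .ref.genozip file can be scripted. The following is a minimal sketch, not part of Genozip: pick_reference is a hypothetical helper, and it assumes the .ref.genozip file is named after the FASTA with its .fa or .fasta (optionally .gz) extension replaced, for example hs37d5.ref.genozip. Check the actual file name in your directory before relying on that assumption.

```shell
# pick_reference FASTA
# Hypothetical helper: prints the path of an existing .ref.genozip file next
# to the FASTA if one is found, otherwise prints the FASTA path itself.
pick_reference() {
    fasta="$1"
    base="${fasta%.gz}"     # drop an optional .gz
    base="${base%.fa}"      # drop .fa ...
    base="${base%.fasta}"   # ... or .fasta
    ref="${base}.ref.genozip"   # assumed naming of the reference file

    if [ -f "$ref" ]; then
        printf '%s\n' "$ref"
    else
        printf '%s\n' "$fasta"
    fi
}

# Usage (assumes genozip is installed):
#   genozip --reference "$(pick_reference hs37d5.fa.gz)" myfile.g.vcf.gz
```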
Compressing BCF files
Genozip does not support BCF natively - it uses bcftools to convert BCF files to the VCF format, and as such it requires bcftools to be installed for BCF compression to work.
To decompress to a BCF file (regardless of whether the original file was VCF or BCF), use the --bcf option. Like BCF compression, this option relies on bcftools:
$ genocat --bcf myfile.vcf.genozip
The --bcf option is implicit if --output is specified with a file name with a .bcf extension.
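Because BCF output depends on bcftools, a script may want to check for it before calling genocat. The sketch below is one possible way to do that: genocat_to_bcf is a hypothetical wrapper, and it relies on the behaviour described above, where an --output name ending in .bcf implies --bcf.

```shell
# genocat_to_bcf INPUT.genozip OUTPUT.bcf
# Hypothetical wrapper: refuses to run unless the output name ends in .bcf
# and bcftools can be found, then writes the BCF file with genocat.
genocat_to_bcf() {
    in="$1"
    out="$2"

    case "$out" in
        *.bcf) ;;
        *) echo "output name must end in .bcf: $out" >&2; return 2 ;;
    esac

    if ! command -v bcftools >/dev/null 2>&1; then
        echo "bcftools not found in PATH; it is required for BCF output" >&2
        return 1
    fi

    # The .bcf extension on --output makes --bcf implicit
    genocat --output "$out" "$in"
}

# Usage (assumes genocat and bcftools are installed):
#   genocat_to_bcf myfile.vcf.genozip myfile.bcf
```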
The following options modify the file in ways that improve compression. --optimize is an umbrella option that activates all these optimization options.
Option Fields affected Action
--optimize-sort INFO Sorts INFO annotations alphabetically by tag name.
--optimize-phred FORMAT/PL, FORMAT/GL (VCF v4.2 or earlier) Phred scores are rounded to the nearest integer and capped at 60
--GL-to-PL FORMAT/GL The GL annotation is converted to PL and Phred values are capped at 60
--GP-to-PP FORMAT/GP (VCF v4.3- ) The GP annotation is converted to PP and Phred values are capped at 60
--optimize-VQSLOD INFO/VQSLOD The value is rounded to 2 significant digits
More information: VCF optimizations
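To make the effect of rounding and capping concrete, the sketch below applies the rule described for --optimize-phred (round to the nearest integer, cap at 60) to a comma-separated list of Phred values. It is an illustration of the arithmetic only, not Genozip's implementation; cap_phred is a hypothetical name, and passing missing values (".") through unchanged is an assumption.

```shell
# cap_phred "v1,v2,..."
# Illustration only: rounds each non-negative Phred value to the nearest
# integer and caps it at 60. Missing values (".") are left as they are.
cap_phred() {
    printf '%s\n' "$1" | awk -F, -v OFS=, '{
        for (i = 1; i <= NF; i++) {
            if ($i == ".") continue
            v = int($i + 0.5)        # round to nearest (non-negative input)
            if (v > 60) v = 60       # cap at 60
            $i = v
        }
        print
    }'
}

# Usage:
#   cap_phred "0,12.4,75.6"    # prints 0,12,60
```

Since values above 60 and fractional parts are discarded, the original values cannot be recovered after such an optimization.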
Slicing & dicing your data with genocat
Here's a summary of the filtering and subsetting options available for VCF files. See genocat for more information.
--downsample Show only one in every X variants
--regions -r Exclude or include certain genomic regions
--regions-file -R Like --regions, but list of regions is specified in a file
--grep Show only variants containing the specified string
--grep-w -g Like --grep, but match whole words
--lines -n Show only variants from a given range of line numbers
--head Show only a certain number of variants from the start of the file
--tail Show only a certain number of variants from the end of the file
--no-header Show only the variants - exclude the VCF header
--header-only Show only the VCF header - exclude the variants
--header-one -1 Show only the #CHROM line of the VCF header and the variants
--samples -s Show a subset of samples
--drop-genotypes -G Output the data without the samples and FORMAT column
--GT-only Within samples output only genotype (GT) data - dropping the other tags
--snps-only Drop variants that are not a Single Nucleotide Polymorphism (SNP)
--indels-only Drop variants that are not Insertions or Deletions (indel)
Note: in multi-sample files with both INFO/DP and FORMAT/DP fields, subsetting with --drop-genotypes, --GT-only or --samples would normally cause INFO/DP to show as -1. Compressing with --secure-DP avoids this issue, at the expense of slightly worse compression.
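The options that take no argument combine naturally with standard Unix tools. The following sketch shows two small helpers built on options from the table above; count_snps and list_samples are hypothetical names, and the sketch assumes that --no-header and --snps-only can be used together in one genocat call.

```shell
# count_snps FILE.genozip
# Prints the number of SNP variants: the header is excluded and non-SNP
# variants are dropped, so every remaining line is one SNP record.
count_snps() {
    genocat --no-header --snps-only "$1" | wc -l | tr -d ' '
}

# list_samples FILE.genozip
# Prints one sample name per line. The last header line is the #CHROM line,
# and in VCF the sample names start at column 10 of that line.
list_samples() {
    genocat --header-only "$1" | tail -n 1 | cut -f 10- | tr '\t' '\n'
}

# Usage (assumes genocat is installed):
#   count_snps myfile.vcf.genozip
#   list_samples myfile.vcf.genozip
```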
The VCF specification requires VCF files to be sorted. Genozip does not: it can compress unsorted files as well.
Using the --sort option causes genozip to also sort the file while compressing it. This works well with mildly unsorted files, but may consume significant memory if the files are heavily unsorted. As with all data-modifying options, automatic verification is disabled when using this option.
$ genozip --sort myfile.vcf.gz
If a file was compressed with --sort, it is still possible to view it in its original, unsorted order:
$ genocat --unsorted myfile.vcf.genozip
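Since --sort disables automatic verification, it may be worth using it only on files that actually need sorting. The sketch below is one rough way to test this before compressing; vcf_is_sorted is a hypothetical helper, and its notion of "sorted" (each chromosome appears in a single contiguous block, with non-decreasing positions inside it) is a simplifying assumption that may differ from Genozip's own criteria.

```shell
# vcf_is_sorted FILE.vcf.gz
# Hypothetical helper: returns 0 if every chromosome forms one contiguous
# block and positions never decrease within a block, non-zero otherwise.
vcf_is_sorted() {
    gzip -dc "$1" | awk -F'\t' '
        /^#/ { next }                       # skip header lines
        $1 != chrom {                       # a new chromosome block starts
            if ($1 in seen) exit 1          # chromosome seen earlier: unsorted
            seen[$1] = 1; chrom = $1; pos = 0
        }
        $2 + 0 < pos { exit 1 }             # position went backwards
        { pos = $2 + 0 }
    '
}

# Usage (assumes genozip is installed):
#   if vcf_is_sorted myfile.vcf.gz; then
#       genozip myfile.vcf.gz
#   else
#       genozip --sort myfile.vcf.gz
#   fi
```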
Dual Coordinate VCF files (DVCF)
When compressing a VCF file, use the --chain option to add a second coordinate system, based on a second reference file, lifting over the coordinates and annotations. Subsequently, it is possible to view and manipulate the VCF file in either coordinate system, allowing different steps in a pipeline to operate using different coordinate systems (e.g., GRCh37 and GRCh38).
More information: Dual Coordinate VCF files.