How good is Genozip at compressing VCF files?
Relative sizes of already-compressed .vcf.gz files, before and after compression with genozip. 

Genozip 15.0.4 was used with --best. In addition, --reference was used for compressing the two GVCF files. Datasets can be found here.

Compressing and uncompressing a VCF or BCF file

The examples on this page use VCF files. Genozip is also capable of compressing BCF files, provided that bcftools is installed on your system.

Compressing

Compressing and uncompressing VCF is straightforward:

$ genozip myfile.vcf.gz

$ ls -lh myfile.vcf*

-rwxrwxrwx 1 divon divon  88M Aug 1  08:49 myfile.vcf.genozip
-rwxrwxrwx 1 divon divon 244M Feb 10  2020 myfile.vcf.gz

Uncompressing

$ genounzip myfile.vcf.genozip 

Viewing

$ genocat myfile.vcf.genozip 

VCF or BCF output can be selected explicitly with --vcf or --bcf respectively, or implicitly with --output and a filename ending with .vcf, .vcf.gz or .bcf. The re-compression level of the output VCF or BCF file can be set with --bgzf, where --bgzf=0 means "no compression at all".

Compressing VCF using a reference file 

It is possible to compress a VCF using a reference file, and in the following cases it is indeed recommended to do so, as the compression improvement is expected to be significant:

1. GVCF files

2. Structural variants files 

3. Illumina Genotyping VCF files

4. Ultima Genomics files

5. VCF files with little or no sample or INFO data.

Example:

$ genozip --reference hs37d5.fa.gz myfile.g.vcf.gz

Note: when a particular FASTA file is used as a reference for the first time, Genozip produces a .ref.genozip file which is placed in the same directory as the FASTA. If you already have a .ref.genozip file - you can use it as an argument to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):

$ genozip --make-reference hs37d5.fa.gz

Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly, or set $GENOZIP_REFERENCE to point to either the file or the directory containing the file.

--optimize : even better compression, with some caveats.

 

While Genozip is primarily a lossless compressor, the --optimize option is provided to allow lossy compression as well. The idea of lossy compression is this: data commonly contains information at a higher resolution than is actually needed for downstream analysis. Examples include the number of decimal digits in fractions and the maximum meaningful Phred score. Reducing the resolution allows significantly better compression. The compression-enhancing modifications to VCF data which genozip performs when --optimize is specified are described below. They are designed to have negligible impact on downstream analysis in many common cases; however, you should definitely validate this for your own specific use case.

Phred-score rounding: annotations containing Phred score values are modified such that the Phred values are rounded to the nearest integer and capped at 60. This method is applied to the following FORMAT fields: PL, SPL, PP, PRI, GQ and also to GP if it contains Phred scores.
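The rounding and capping rule can be illustrated with a minimal Python sketch. It is not Genozip's implementation: the function names are hypothetical, and rounding ties (x.5) upward is an assumption, since the text only says "nearest integer".

```python
import math

PHRED_CAP = 60  # cap stated in the text above


def round_phred(value, cap=PHRED_CAP):
    """Round a Phred value to the nearest integer, then cap it.

    Assumption: ties (x.5) round up; the documentation does not specify this.
    """
    return min(int(math.floor(value + 0.5)), cap)


def round_phred_field(field):
    """Apply round_phred to each item of a comma-separated value, e.g. a PL triplet.

    Missing items ('.') are passed through unchanged.
    """
    return ",".join(
        item if item == "." else str(round_phred(float(item)))
        for item in field.split(",")
    )


print(round_phred_field("0,37.6,512"))  # 0,38,60
```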

Conversion to Phred scores: the GL annotation, which contains likelihoods, is converted to the equivalent PL annotation that contains Phred scores. If the GP annotation contains probabilities, it is converted to the equivalent PP annotation that contains Phred scores. The Phred scores generated are rounded and capped as described above.
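As a rough illustration of these conversions, the sketch below assumes the usual definitions: GL holds log10-scaled likelihoods, so the Phred-scaled value is -10 × GL, and a probability p maps to -10 × log10(p). Whether Genozip applies any further normalization is not stated here, and the function names are hypothetical.

```python
import math

PHRED_CAP = 60  # same cap as in "Phred-score rounding"


def cap_round(phred):
    """Round to the nearest integer (ties up, an assumption) and cap at PHRED_CAP."""
    return min(int(math.floor(phred + 0.5)), PHRED_CAP)


def gl_to_pl(gl_values):
    """Convert log10 likelihoods (GL) to rounded, capped Phred scores (PL)."""
    return [cap_round(-10.0 * gl) for gl in gl_values]


def prob_to_phred(prob_values):
    """Convert probabilities to rounded, capped Phred scores.

    A probability of 0 has an infinite Phred score, so it maps to the cap.
    """
    return [
        PHRED_CAP if p <= 0 else cap_round(-10.0 * math.log10(p))
        for p in prob_values
    ]


print(gl_to_pl([-0.02, -1.3, -9.7]))     # [0, 13, 60]
print(prob_to_phred([0.98, 0.02, 0.0]))  # [0, 17, 60]
```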

INFO sorting: The annotations in the INFO field are re-ordered (but not otherwise modified) so that they appear in the same consistent order across all variants.
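The effect of INFO sorting can be sketched as follows. Alphabetical order by key is an assumption made for illustration only; the text says the order is consistent across variants but does not say which order Genozip uses. The function name is hypothetical.

```python
def sort_info(info):
    """Reorder the ';'-separated INFO entries by key, leaving each entry unchanged."""
    if info == ".":  # '.' denotes an empty INFO field
        return info
    entries = info.split(";")
    # Sort on the key only (the text before '='); flags such as DB have no '='.
    return ";".join(sorted(entries, key=lambda entry: entry.split("=", 1)[0]))


# Two variants whose INFO fields list the same keys in different orders
print(sort_info("DP=31;AC=2;DB;AF=1.00"))  # AC=2;AF=1.00;DB;DP=31
print(sort_info("AF=0.50;DB;DP=18;AC=1"))  # AC=1;AF=0.50;DB;DP=18
```

Once every variant lists its keys in the same order, the sequence of keys becomes highly repetitive, which is what makes it compress well.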

Floating-point number rounding: numerical INFO and FORMAT annotations containing fractions (also known as floating-point numbers), if not already optimized by one of the methods above, are rounded to 3 significant digits.
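A minimal sketch of rounding to 3 significant digits, using Python's general number format. This only illustrates the idea; the function name is hypothetical and Genozip's exact output formatting may differ (for instance, this sketch prints large values in scientific notation).

```python
def round_sig(text, digits=3):
    """Round a numeric string to `digits` significant digits.

    Values without a fractional part, and non-numeric values, are returned unchanged.
    """
    if "." not in text:
        return text
    try:
        value = float(text)
    except ValueError:
        return text
    # The 'g' format keeps `digits` significant digits and drops trailing zeros.
    return f"{value:.{digits}g}"


print(round_sig("0.123456"))  # 0.123
print(round_sig("12.3456"))   # 12.3
print(round_sig("42"))        # 42
```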

QUAL field: Usually, the QUAL field is rounded as described in "Floating-point number rounding". The exception is single-sample DRAGEN-produced files with a FORMAT/GP annotation, where it is rounded as described in "Phred-score rounding".

Note: genounzip and genozip --test verify that the data uncompresses to precisely the same data as it was after the modifications. Note that this does not test the correctness of the modifications themselves.

Note: --optimize can take an argument to fine-tune which fields get optimized, for example:

> genozip --optimize=VQSLOD,SOR   - optimizes only the VQSLOD and SOR fields, if they are optimizable

> genozip --optimize=^VQSLOD,SOR  - optimizes all optimizable fields, except VQSLOD and SOR.
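The include and exclude forms of the argument can be modelled with a short sketch. It mirrors only the behaviour described in the two examples above and is not Genozip's argument parser; the function name and the list of optimizable fields are hypothetical.

```python
def select_fields(spec, optimizable):
    """Return the fields from `optimizable` that an --optimize argument selects.

    A leading '^' inverts the selection: all optimizable fields except those listed.
    """
    if spec.startswith("^"):
        excluded = set(spec[1:].split(","))
        return [field for field in optimizable if field not in excluded]
    wanted = set(spec.split(","))
    return [field for field in optimizable if field in wanted]


fields = ["QUAL", "GQ", "PL", "VQSLOD", "SOR"]  # hypothetical optimizable fields
print(select_fields("VQSLOD,SOR", fields))   # ['VQSLOD', 'SOR']
print(select_fields("^VQSLOD,SOR", fields))  # ['QUAL', 'GQ', 'PL']
```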

Note: To see which fields are actually optimized use --stats (you can use this with genozip, genocat or genounzip):

> genozip --optimize --stats test.vcf

VCF file: test.vcf
Samples: 1   variants: 149,870   Contexts: 70   Vblocks: 11 x 4.0 MB   Sections: 938
Programs: DRAGEN
Fields optimized: QUAL,INFO,AF,GP,GQ,PL,PRI,AF,MQ,VQSLOD,R2_5P_bias,QD,FS,SOR,MQRankSum,ReadPosRankSum
Genozip version: 15.0.60 github

 


Genozip compression without and with --optimize, showing file sizes in MBs. The file tested is a GVCF file produced by Illumina DRAGEN obtained from here. Note: Compression performance may vary considerably depending on the specific data.

Slicing & dicing your data with genocat

Here's a summary of the filtering and subsetting options available for VCF files. See genocat for more information.

Option                              Effect

--downsample         Show only one in every X variants

--regions        -r  Exclude or include certain genomic regions

--regions-file   -R  Like --regions, but list of regions is specified in a file

--grep               Show only variants containing the specified string

--grep-w         -g  Like --grep, but match whole words

--lines          -n  Show only variants from a given range of line numbers

--head               Show only a certain number of variants from the start of the file

--tail               Show only a certain number of variants from the end of the file

--no-header          Show only the variants - exclude the VCF header

--header-only        Show only the VCF header - exclude the variants

--header-one     -1  Show only the #CHROM line of the VCF header and the variants

--samples        -s  Show a subset of samples

--drop-genotypes -G  Output the data without the samples and FORMAT column

--GT-only            Within samples output only genotype (GT) data - dropping the other tags

--snps-only          Drop variants that are not a Single Nucleotide Polymorphism (SNP)

--indels-only        Drop variants that are not Insertions or Deletions (indel)

Note: in multi-sample files with both INFO/DP and FORMAT/DP fields, subsetting with --drop-genotypes, --GT-only or
