How good is Genozip at compressing VCF files?
Figure (v15 VCF benchmark): relative sizes of already-compressed .vcf.gz files, before and after compression with Genozip.

Genozip 15.0.4 was used with --best. The --reference option was used for compressing the two GVCF files. Datasets can be found here.

Compressing and uncompressing a VCF or BCF file

The examples on this page use VCF files. Genozip can also compress BCF files, with some limitations.

Compressing

Compressing and uncompressing VCF is straightforward:

$ genozip myfile.vcf.gz

$ ls -lh myfile.vcf*

-rwxrwxrwx 1 divon divon  88M Aug 1  08:49 myfile.vcf.genozip
-rwxrwxrwx 1 divon divon 244M Feb 10  2020 myfile.vcf.gz
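
To put the two sizes in the listing above into proportion, the arithmetic can be sketched as follows. This is only an illustration: size_ratio is a hypothetical helper name (not a Genozip command), and the inputs are the rounded sizes in MB shown by ls.

```shell
# size_ratio NEW OLD: print NEW as a percentage of OLD, and how many times smaller it is.
# Hypothetical helper for illustration; not part of Genozip.
size_ratio() {
  awk -v new="$1" -v old="$2" 'BEGIN {
    printf "%.1f%% of the original size (%.1fx smaller)\n", new / old * 100, old / new
  }'
}

size_ratio 88 244   # sizes in MB, taken from the ls listing
```

For this example file, the .genozip file is roughly 36% of the size of the .vcf.gz file.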

Uncompressing

$ genounzip myfile.vcf.genozip 

Viewing

$ genocat myfile.vcf.genozip 

Compressing using a reference file 

A VCF can be compressed using a reference file, but this provides a meaningful benefit only in certain cases:

1. GVCF files 

2. Illumina Genotyping VCF files

3. VCF files with little or no sample or INFO data.

Example:

$ genozip --reference hs37d5.fa.gz myfile.g.vcf.gz

Note: when a particular FASTA file is used as a reference for the first time, Genozip produces a .ref.genozip file, which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can use it as the argument to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):

$ genozip --make-reference hs37d5.fa.gz

Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly.
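
The two notes above can be combined into a small wrapper that builds the .ref.genozip file once and reuses it afterwards. This is a minimal sketch under stated assumptions: ref_for and compress_gvcf are hypothetical helper names, and the rule for deriving the .ref.genozip file name from the FASTA name (hs37d5.fa.gz to hs37d5.ref.genozip) is an assumption that should be checked against the file Genozip actually creates.

```shell
# ref_for FASTA: guess the path of the .ref.genozip file that sits next to FASTA.
# The naming rule is an assumption, not taken from the Genozip documentation.
ref_for() {
  base=${1%.gz}                       # strip an optional .gz
  base=${base%.fa}; base=${base%.fasta}
  printf '%s.ref.genozip\n' "$base"
}

# compress_gvcf FASTA VCF: compress VCF against the reference built from FASTA.
compress_gvcf() {
  ref=$(ref_for "$1")
  # Build the reference explicitly only if it does not exist yet.
  [ -f "$ref" ] || genozip --make-reference "$1" || return 1
  genozip --reference "$ref" "$2"
}
```

Usage would look like compress_gvcf hs37d5.fa.gz myfile.g.vcf.gz. If the reference has since been moved, decompression can point to it directly, for example genounzip --reference hs37d5.ref.genozip myfile.g.vcf.genozip.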

Compressing BCF files

Genozip does not support BCF natively - it uses bcftools to convert BCF files to the VCF format, and as such it requires bcftools to be installed for BCF compression to work.

$ genozip myfile.bcf

To decompress to a BCF file (regardless of whether the original file was VCF or BCF), use the --bcf option. As above, this option relies on bcftools:

genocat --bcf myfile.vcf.genozip

The --bcf option is implicit if --output is specified with a file name with a .bcf extension.
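
Because BCF output depends on bcftools, a script may want to check for it before calling genocat. The following is a sketch only: to_bcf is a hypothetical wrapper name, and the error message is illustrative rather than anything Genozip prints.

```shell
# to_bcf INPUT.genozip OUTPUT.bcf: write INPUT as BCF, failing early if bcftools is missing.
# Hypothetical wrapper for illustration; not part of Genozip.
to_bcf() {
  if ! command -v bcftools >/dev/null 2>&1; then
    echo "bcftools not found in PATH; it is needed for BCF output" >&2
    return 1
  fi
  # A .bcf extension on --output implies --bcf, as described above.
  genocat --output "$2" "$1"
}
```

For example: to_bcf myfile.vcf.genozip myfile.bcf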

Optimizing compression

The following options modify the file in ways that improve compression. --optimize is an umbrella option that activates all these optimization options.

Option             Fields affected                   Action

--optimize-sort    INFO                              Sorts INFO annotations alphabetically by tag name.

--optimize-phred   FORMAT/PL                         Phred scores are rounded to the nearest integer and capped at 60.
                   FORMAT/PRI
                   FORMAT/PP
                   FORMAT/GL (VCF v4.2 or earlier)

--GL-to-PL         FORMAT/GL                         The GL annotation is converted to PL and Phred values are capped at 60.

--GP-to-PP         FORMAT/GP (VCF v4.3- )            The GP annotation is converted to PP and Phred values are capped at 60.

--optimize-VQSLOD  INFO/VQSLOD                       The value is rounded to 2 significant digits.

More information: VCF optimizations
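
To make the effect of two of these optimizations concrete, the sketch below imitates them on single field values with standard tools. It illustrates the described transformations only and is not Genozip's implementation: cap_phred and sort_info are hypothetical helper names, and details such as rounding halves upward and using byte-order (case-sensitive) sorting are assumptions.

```shell
# cap_phred VALUES: round each comma-separated Phred value to the nearest integer, cap at 60.
# Non-numeric entries (such as ".") are left untouched.
cap_phred() {
  printf '%s\n' "$1" | awk -F, -v OFS=, '{
    for (i = 1; i <= NF; i++) {
      if ($i !~ /^[0-9]+(\.[0-9]+)?$/) continue
      v = int($i + 0.5)          # nearest integer for non-negative values
      if (v > 60) v = 60         # cap
      $i = v
    }
    print
  }'
}

# sort_info INFO: sort semicolon-separated INFO annotations by tag name.
sort_info() {
  printf '%s\n' "$1" | tr ';' '\n' | LC_ALL=C sort -t= -k1,1 | paste -sd ';' -
}

cap_phred '0,12.4,59.6,255'      # prints 0,12,60,60
sort_info 'DP=10;AF=0.5;AC=1'    # prints AC=1;AF=0.5;DP=10
```

Both transformations discard information (fractional or large Phred values, the original tag order), which is why these options modify the file rather than only compressing it.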

Slicing & dicing your data with genocat

Here's a summary of the filtering and subsetting options available for VCF files. See genocat for more information.

Option               Effect

--downsample         Show only one in every X variants

--regions        -r  Exclude or include certain genomic regions

--regions-file   -R  Like --regions, but list of regions is specified in a file

--grep               Show only variants containing the specified string

--grep-w         -g  Like --grep, but match whole words

--lines          -n  Show only variants from a given range of line numbers

--head               Show only a certain number of variants from the start of the file

--tail               Show only a certain number of variants from the end of the file

--no-header          Show only the variants - exclude the VCF header

--header-only        Show only the VCF header - exclude the variants

--header-one     -1  Show only the #CHROM line of the VCF header and the variants

--samples        -s  Show a subset of samples

--drop-genotypes -G  Output the data without the samples and FORMAT column

--GT-only            Within samples output only genotype (GT) data - dropping the other tags

--snps-only          Drop variants that are not a Single Nucleotide Polymorphism (SNP)

--indels-only        Drop variants that are not Insertions or Deletions (indel)
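
To clarify what two of the options in the table above select, here are rough equivalents written with awk that operate on a plain-text VCF read from standard input. They are illustrations under assumptions, not how genocat works internally: downsample_vcf and drop_genotypes are hypothetical helper names, and which variant of every X is kept is an assumption.

```shell
# downsample_vcf X: keep all header lines and one in every X variants
# (here the first of each group of X, which is an assumption).
downsample_vcf() {
  awk -v x="$1" '/^#/ { print; next } (n++ % x) == 0'
}

# drop_genotypes: keep "##" meta lines unchanged; cut the #CHROM line and every
# variant down to the first 8 columns, dropping FORMAT and all sample columns.
drop_genotypes() {
  awk -F'\t' '/^##/ { print; next } {
    line = $1
    for (i = 2; i <= 8 && i <= NF; i++) line = line "\t" $i
    print line
  }'
}
```

For example, gzip -dc myfile.vcf.gz | downsample_vcf 10 | drop_genotypes mirrors the effect described for --downsample and --drop-genotypes on an uncompressed stream.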

Sorting

The VCF specification requires files to be sorted. Genozip does not: it can compress unsorted files as well.

 

Using the --sort option causes genozip to also sort the file while compressing it. This works well with mildly unsorted files, but may consume a lot of memory if a file is heavily unsorted. As with all data-modifying options, automatic verification is disabled when using this option.

genozip --sort myfile.vcf.gz 

If a file was compressed with --sort, it is still possible to view it in its original, unsorted order:

genocat --unsorted myfile.vcf.genozip
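
Before deciding whether --sort is needed, it can help to check whether a file is already sorted. The sketch below is one possible check, not a Genozip feature: vcf_is_sorted is a hypothetical helper name, and "sorted" is taken to mean that each CHROM forms one contiguous block with non-decreasing POS inside it, which is an assumption about the intended order.

```shell
# vcf_is_sorted: read a plain-text VCF on stdin; exit 0 if sorted, 1 otherwise.
# Hypothetical helper for illustration; not part of Genozip.
vcf_is_sorted() {
  awk -F'\t' '
    /^#/ { next }                          # skip header lines
    $1 != chrom {
      if ($1 in seen) { bad = 1; exit }    # contig reappears after another contig
      seen[$1] = 1; chrom = $1; pos = 0
    }
    $2 + 0 < pos { bad = 1; exit }         # position went backwards
    { pos = $2 + 0 }
    END { exit (bad ? 1 : 0) }
  '
}
```

For example: gzip -dc myfile.vcf.gz | vcf_is_sorted || genozip --sort myfile.vcf.gz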

Dual Coordinate VCF files (DVCF)

When compressing a VCF file, use the --chain option to add a second coordinate system, based on a second reference file, lifting over the coordinates and annotations. Subsequently, it is possible to view and manipulate the VCF file in either coordinate system, allowing different steps in a pipeline to operate using different coordinate systems (e.g., GRCh37 and GRCh38).

As with all data-modifying options, automatic verification is disabled when using this option.

More information: Dual Coordinate VCF files.
