How good is Genozip at compressing VCF files?
Figure (v15 VCF benchmark): relative sizes of already-compressed .vcf.gz files, before and after compression with Genozip.

Benchmark run with Genozip 15.0.4 using --best; --reference was used for compressing the two GVCF files. Datasets can be found here.

​
Compressing and uncompressing a VCF or BCF file

​

On this page, the examples use VCF files. Genozip can also compress BCF files, with some limitations.

​

Compressing​

​

Compressing and uncompressing VCF is straightforward:

​

$ genozip myfile.vcf.gz

​

$ ls -lh myfile.vcf*

-rwxrwxrwx 1 divon divon  88M Aug 1  08:49 myfile.vcf.genozip
-rwxrwxrwx 1 divon divon 244M Feb 10  2020 myfile.vcf.gz
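
The saving implied by this listing can be worked out with a little arithmetic. The following is a minimal sketch using the rounded sizes printed by ls (244M and 88M); the function name size_summary is illustrative and not part of Genozip.

```shell
# Report how much smaller the compressed file is than the original.
# Arguments: original size, compressed size (same unit, e.g. MB).
size_summary() {
  awk -v orig="$1" -v comp="$2" 'BEGIN {
    printf "ratio %.2fx, %.1f%% smaller\n", orig / comp, 100 * (1 - comp / orig)
  }'
}

size_summary 244 88   # sizes from the ls listing above
```

For these two files the result is a ratio of about 2.77x, meaning the .genozip file is roughly 64% smaller than the .vcf.gz it was made from. Because ls -lh rounds sizes, the figures are approximate.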

​

Uncompressing

​

$ genounzip myfile.vcf.genozip 

​

Viewing

​

$ genocat myfile.vcf.genozip 
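
Since genocat is used here for viewing, its output can be passed on to ordinary text tools. A small sketch, assuming genocat writes the VCF text to standard output; the helper name count_variants is hypothetical.

```shell
# Count data lines (variants) in VCF text read from stdin.
# Header lines start with '#', so everything else is a variant.
count_variants() {
  grep -vc '^#' || true   # grep exits non-zero when the count is 0
}

# Typical use (needs genocat and the file from the example above):
#   genocat myfile.vcf.genozip | count_variants
#   genocat myfile.vcf.genozip | less -S
```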

​

Compressing VCF using a reference file 
​​

A VCF file can be compressed against a reference file. This is recommended in the following cases, where the compression improvement is expected to be significant:

1. GVCF files

2. Structural variants files 

3. Illumina Genotyping VCF files

4. Ultima Genomics files

5. VCF files with little or no sample or INFO data.

​

Example:

​

$ genozip --reference hs37d5.fa.gz myfile.g.vcf.gz

​

Note: when a particular FASTA file is used as a reference for the first time, Genozip produces a .ref.genozip file, which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can pass it to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly, for example:

​​

$ genozip --make-reference hs37d5.fa.gz

​​

Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly, or set $GENOZIP_REFERENCE to point to either the file or the directory containing the file.
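
The two ways of pointing genounzip at the reference could look like the sketch below. The directory /data/refs, the helper name decompress_with_ref and the file name hs37d5.ref.genozip are assumptions for illustration (this page only says that a .ref.genozip file is created next to the FASTA). Only --reference and $GENOZIP_REFERENCE come from the notes above.

```shell
# Assumed location of the reference; replace with your own path.
REF_DIR="${REF_DIR:-/data/refs}"

# Option 1: pass the reference explicitly when decompressing.
# (hs37d5.ref.genozip is an assumed file name.)
decompress_with_ref() {
  genounzip --reference "$REF_DIR/hs37d5.ref.genozip" "$1"
}

# Option 2: export the directory (or the file itself) once, then
# run genounzip without --reference.
export GENOZIP_REFERENCE="$REF_DIR"

# Usage:
#   decompress_with_ref myfile.g.vcf.genozip
#   genounzip myfile.g.vcf.genozip
```

Setting the environment variable is convenient when many files share one reference; the explicit option is easier to audit in a one-off command.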

​

Compressing BCF files

​​

Genozip does not support BCF natively. It uses bcftools to convert BCF files to the VCF format, so bcftools must be installed for BCF compression to work.

​​

$ genozip myfile.bcf

​

To decompress to a BCF file (regardless of whether the original file was VCF or BCF), use the --bcf option. As above, this option relies on bcftools:

​

$ genocat --bcf myfile.vcf.genozip

​

The --bcf option is implicit if --output is specified with a file name with a .bcf extension.
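
Because both directions depend on bcftools, it can help to check for it before starting a long job. A minimal sketch follows; the helper names require_bcftools and to_bcf are hypothetical, and the genocat call relies only on the --output behaviour described above.

```shell
# Fail early if bcftools is not on PATH (Genozip needs it for BCF).
require_bcftools() {
  if ! command -v bcftools >/dev/null 2>&1; then
    echo "bcftools not found on PATH; it is required for BCF support" >&2
    return 1
  fi
}

# Write a BCF file; --bcf is implied by the .bcf extension of the output.
# Arguments: input .genozip file, output .bcf file.
to_bcf() {
  require_bcftools && genocat --output "$2" "$1"
}

# Usage:
#   to_bcf myfile.vcf.genozip myfile.bcf
```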

​

Optimizing compression

​

The following options modify the file in ways that improve compression. --optimize is an umbrella option that activates all these optimization options.

​​

Option             Fields affected                  Action

--optimize-sort    INFO                             Sorts INFO annotations alphabetically by tag name

--optimize-phred   FORMAT/PL                        Phred scores are rounded to the nearest integer and capped at 60
                   FORMAT/PRI
                   FORMAT/PP
                   FORMAT/GL (VCF v4.2 or earlier)

--GL-to-PL         FORMAT/GL                        The GL annotation is converted to PL and Phred values are capped at 60

--GP-to-PP         FORMAT/GP (VCF v4.3-)            The GP annotation is converted to PP and Phred values are capped at 60

--optimize-VQSLOD  INFO/VQSLOD                      The value is rounded to 2 significant digits

​

More information: VCF optimizations
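
To make the effect of these optimizations concrete, the sketch below reproduces the arithmetic described for --optimize-phred and --optimize-VQSLOD on single values. It illustrates the rules as stated on this page, not Genozip's actual implementation; the function names are hypothetical, and the handling of exact ties (values ending in .5) is an assumption.

```shell
# Round a non-negative Phred score to the nearest integer and cap it at 60.
cap_phred() {
  awk -v x="$1" 'BEGIN { r = int(x + 0.5); if (r > 60) r = 60; print r }'
}

# Round a value to 2 significant digits.
round_2sig() {
  awk -v x="$1" 'BEGIN { printf "%.2g\n", x }'
}

cap_phred 12.4      # prints 12
cap_phred 73.8      # prints 60
round_2sig 4.8327   # prints 4.8

# To apply all optimizations when compressing:
#   genozip --optimize myfile.vcf.gz
```

These options change the data, so the decompressed file will not be identical to the original; that is the trade-off for the better compression.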

​

Slicing & dicing your data with genocat
​

Here's a summary of the filtering and subsetting options available for VCF files. See genocat for more information.

​

Option                              Effect

--downsample         Show only one in every X variants

--regions        -r  Exclude or include certain genomic regions

--regions-file   -R  Like --regions, but list of regions is specified in a file

--grep               Show only variants containing the specified string

--grep-w         -g  Like --grep, but match whole words

--lines          -n  Show only variants from a given range of line numbers

--head               Show only a certain number of variants from the start of the file

--tail               Show only a certain number of variants from the end of the file

--no-header          Show only the variants - exclude the VCF header

--header-only        Show only the VCF header - exclude the variants

--header-one     -1  Show only the #CHROM line of the VCF header and the variants

--samples        -s  Show a subset of samples

--drop-genotypes -G  Output the data without the samples and FORMAT column

--GT-only            Within samples output only genotype (GT) data - dropping the other tags

--snps-only          Drop variants that are not a Single Nucleotide Polymorphism (SNP)

--indels-only        Drop variants that are not Insertions or Deletions (indel)

​

Note: in multi-sample files with both INFO/DP and FORMAT/DP fields, subsetting with --drop-genotypes, --GT-only or --samples would normally cause INFO/DP and INFO/QD to show as -1 and INFO/BaseCounts to show as '.'. Compressing with --secure-DP avoids this issue, at the expense of slightly worse compression.
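
A few of the argument-free options from the table can be wrapped as in the sketch below, together with --secure-DP for files that will later be subset. The wrapper names and the file name cohort.vcf.gz are hypothetical; options that take a value (such as --regions or --samples) are left out because their argument syntax is not described on this page.

```shell
# Thin wrappers around argument-free genocat options from the table above.
vcf_header()   { genocat --header-only "$1"; }      # VCF header only
vcf_variants() { genocat --no-header "$1"; }        # variants only
vcf_sites()    { genocat --drop-genotypes "$1"; }   # no samples or FORMAT

# For multi-sample files that will later be subset, compress with
# --secure-DP so that INFO/DP and INFO/QD survive the subsetting.
compress_for_subsetting() { genozip --secure-DP "$1"; }

# Usage (cohort.vcf.gz is a hypothetical file name):
#   compress_for_subsetting cohort.vcf.gz
#   vcf_sites cohort.vcf.genozip | head
```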

​

​