Compressing VCF files

How good is Genozip at compressing VCF files?

Relative sizes of already-compressed .vcf.gz files, before and after compression with genozip.

Used Genozip 15.0.4 with --best. --reference was used for compressing the two GVCF files. Datasets can be found here.

Compressing and uncompressing a VCF or BCF file

The examples on this page use VCF files. Genozip can also compress BCF files, with some limitations.

Compressing

Compressing and uncompressing VCF is straightforward:

$ genozip myfile.vcf.gz

$ ls -lh myfile.vcf*

-rwxrwxrwx 1 divon divon 88M Aug 1 08:49 myfile.vcf.genozip
-rwxrwxrwx 1 divon divon 244M Feb 10 2020 myfile.vcf.gz
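To put the listing above in numbers: 88M against 244M means the .genozip file is roughly 36% of the size of the .vcf.gz. The sketch below computes that figure for any pair of files. `percent_of` is a hypothetical helper written for this page, not a Genozip command; it relies only on `wc -c` and integer shell arithmetic, so the result is truncated to a whole percentage.

```shell
# percent_of SMALL BIG: print SMALL's size as a whole percentage of BIG's size.
# Hypothetical helper for illustration; not part of Genozip.
percent_of() {
    [ -f "$1" ] && [ -f "$2" ] || return 1
    small=$(wc -c < "$1" | tr -d ' ')   # strip padding some wc versions add
    big=$(wc -c < "$2" | tr -d ' ')
    [ "$big" -gt 0 ] || return 1        # avoid dividing by zero
    echo $(( small * 100 / big ))
}

# Only report on the example files if they exist in the current directory.
if [ -f myfile.vcf.genozip ] && [ -f myfile.vcf.gz ]; then
    echo "genozip file is $(percent_of myfile.vcf.genozip myfile.vcf.gz)% of the .vcf.gz"
fi
```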

Uncompressing

$ genounzip myfile.vcf.genozip

Viewing

$ genocat myfile.vcf.genozip

Compressing VCF using a reference file

It is possible to compress a VCF using a reference file, and in the following cases it is indeed recommended to do so, as the compression improvement is expected to be significant:

1. GVCF files

2. Structural variants files

3. Illumina Genotyping VCF files

4. Ultima Genomics files

5. VCF files with little or no sample or INFO data.

Example:

$ genozip --reference hs37d5.fa.gz myfile.g.vcf.gz

Note: when a particular FASTA file is used as a reference for the first time, Genozip produces a .ref.genozip file, which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can pass it to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly, for example:

$ genozip --make-reference hs37d5.fa.gz

Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly, or set $GENOZIP_REFERENCE to point to either the file or the directory containing the file.
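The notes above suggest a simple preference order: use an existing .ref.genozip if there is one, otherwise pass the FASTA and let Genozip generate it. The sketch below encodes that choice. `ref_for` is a hypothetical helper, and the naming rule it applies (strip `.gz`, then `.fa` or `.fasta`, then append `.ref.genozip`) is an assumption about how the generated file is named; check what actually appears next to your FASTA before relying on it.

```shell
# ref_for FASTA: print the .ref.genozip next to FASTA if it exists,
# otherwise print FASTA itself. Hypothetical helper; the naming rule
# below is an assumption, not taken from the Genozip documentation.
ref_for() {
    fasta=$1
    base=${fasta%.gz}
    base=${base%.fa}
    base=${base%.fasta}
    candidate="$base.ref.genozip"
    if [ -f "$candidate" ]; then
        echo "$candidate"
    else
        echo "$fasta"
    fi
}

# Run the real compression only when genozip and the example files are present.
if command -v genozip >/dev/null 2>&1 && [ -f hs37d5.fa.gz ] && [ -f myfile.g.vcf.gz ]; then
    genozip --reference "$(ref_for hs37d5.fa.gz)" myfile.g.vcf.gz
fi
```

If the reference is later moved, setting $GENOZIP_REFERENCE to its new directory before running genounzip should let decompression find it, as described in the note above.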

Compressing BCF files

Genozip does not support BCF natively - it uses bcftools to convert BCF files to the VCF format, and as such it requires bcftools to be installed for BCF compression to work.

$ genozip myfile.bcf

To decompress to a BCF file, regardless of whether the original file was VCF or BCF, use --bcf. As with compression, this option relies on bcftools:

$ genocat --bcf myfile.vcf.genozip

The --bcf option is implicit if --output is specified with a file name with a .bcf extension.
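Because both directions depend on bcftools, a script that handles BCF files may want to check for it up front. The following is a minimal sketch; `have_tool` is a hypothetical helper built on the standard `command -v`, and the warning text is made up for illustration.

```shell
# have_tool NAME: succeed if NAME is found on PATH.
# Hypothetical helper; not part of Genozip or bcftools.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if ! have_tool bcftools; then
    echo "bcftools not found on PATH; BCF compression and --bcf output will not work" >&2
elif have_tool genozip && [ -f myfile.bcf ]; then
    genozip myfile.bcf
fi
```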

Optimizing compression

The following options modify the file in ways that improve compression. --optimize is an umbrella option that activates all these optimization options.

Option            Fields affected                  Action

--optimize-sort   INFO                             Sorts INFO annotations alphabetically by tag name

--optimize-phred  FORMAT/PL                        Phred scores are rounded to the nearest integer and capped at 60
                  FORMAT/PRI
                  FORMAT/PP
                  FORMAT/GL (VCF v4.2 or earlier)

--GL-to-PL        FORMAT/GL                        The GL annotation is converted to PL and Phred values are capped at 60

--GP-to-PP        FORMAT/GP (VCF v4.3- )           The GP annotation is converted to PP and Phred values are capped at 60

--optimize-VQSLOD INFO/VQSLOD                      The value is rounded to 2 significant digits

More information: VCF optimizations
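To make "rounded to the nearest integer and capped at 60" concrete, the sketch below applies that transformation to a comma-separated list of Phred values with awk. It only illustrates the effect on the numbers; it is not Genozip's implementation, `cap_phred` is a hypothetical name, and it assumes every value is numeric and non-negative (missing values such as '.' are not handled).

```shell
# cap_phred: read comma-separated Phred values on stdin, round each to the
# nearest integer and cap it at 60. Illustration only; assumes numeric,
# non-negative input.
cap_phred() {
    awk -F, -v OFS=, '{
        for (i = 1; i <= NF; i++) {
            v = int($i + 0.5)      # round a non-negative value to nearest integer
            if (v > 60) v = 60     # cap at 60
            $i = v
        }
        print
    }'
}

echo "0,33.4,410" | cap_phred   # prints 0,33,60
```

Values above 60 all collapse to the same number, which reduces the variety of values in the affected fields.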

Slicing & dicing your data with genocat

Here's a summary of the filtering and subsetting options available for VCF files. See genocat for more information.

Option            Short  Effect

--downsample             Show only one in every X variants

--regions         -r     Include or exclude certain genomic regions

--regions-file    -R     Like --regions, but the list of regions is specified in a file

--grep                   Show only variants containing the specified string

--grep-w          -g     Like --grep, but match whole words

--lines           -n     Show only variants from a given range of line numbers

--head                   Show only a certain number of variants from the start of the file

--tail                   Show only a certain number of variants from the end of the file

--no-header              Show only the variants, excluding the VCF header

--header-only            Show only the VCF header, excluding the variants

--header-one      -1     Show only the #CHROM line of the VCF header and the variants

--samples         -s     Show a subset of samples

--drop-genotypes  -G     Output the data without the samples and FORMAT column

--GT-only                Within samples, output only genotype (GT) data, dropping the other tags

--snps-only              Drop variants that are not a Single Nucleotide Polymorphism (SNP)

--indels-only            Drop variants that are not Insertions or Deletions (indel)

Note: in multi-sample files with both INFO/DP and FORMAT/DP fields, subsetting with --drop-genotypes, --GT-only or --samples would normally cause INFO/DP and INFO/QD to show as -1 and INFO/BaseCounts to show as '.'. Compressing with --secure-DP avoids this issue, at the expense of slightly worse compression.
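Since genocat writes to standard output, these options combine naturally with ordinary shell pipelines. The sketch below counts variants with and without --snps-only. `count_variants` is a hypothetical helper that counts lines not starting with '#', and the file name is the example used earlier on this page.

```shell
# count_variants: count data lines on stdin, i.e. lines not starting with '#'.
# Hypothetical helper; "|| true" keeps a zero count from being treated as an error.
count_variants() {
    grep -vc '^#' || true
}

# Run against the example file only if genocat and the file are present.
if command -v genocat >/dev/null 2>&1 && [ -f myfile.vcf.genozip ]; then
    echo "all variants: $(genocat --no-header myfile.vcf.genozip | count_variants)"
    echo "SNPs only:    $(genocat --no-header --snps-only myfile.vcf.genozip | count_variants)"
fi
```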
