
Objective

Genozip Enterprise is designed for medium-to-large deployments, ranging from 50 TB to petabytes. It contains our best compression methods plus, of course, everything in Genozip Standard.

The functionality that Genozip Enterprise adds over Genozip Standard consists of three compression-enhancing methods: Deep, Pair and Optimize.

These options are also available in Genozip Premium.

Deep: co-compression of FASTQ and BAM

Screenshot 2023-06-22 143754.png

Genozip Deep™ is a patent-pending method for co-compressing a BAM file (or SAM or CRAM) along with all the FASTQs that contributed reads to the BAM. The result of this compression is a single file with the extension .deep.genozip. Uncompressing the deep file with genounzip outputs all the original files (losslessly, obviously!). This method results in very substantial savings vs compressing the FASTQs and the BAM separately, as can be appreciated from the benchmark above.

The exact compression ratios achieved depend heavily on the specifics of the data being compressed, but file size reductions of 85%-90% are not unusual with this method.
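To make the 85%-90% figure concrete, here is a minimal arithmetic sketch. The 50 TB baseline is a hypothetical example value, and the function name is ours for illustration. The text above does not state which baseline the reduction is measured against, so treat the result as a rough planning estimate only.

```python
def compressed_size(original_size: float, reduction_percent: float) -> float:
    """Return the size that remains after shrinking by reduction_percent."""
    if not 0 <= reduction_percent <= 100:
        raise ValueError("reduction_percent must be between 0 and 100")
    return original_size * (100 - reduction_percent) / 100


# Hypothetical 50 TB baseline, evaluated at both ends of the 85%-90% range.
baseline_tb = 50
for pct in (85, 90):
    print(f"{pct}% reduction: {compressed_size(baseline_tb, pct)} TB remaining")
```

Under these assumptions, the hypothetical 50 TB dataset would occupy roughly 5 to 7.5 TB after compression.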

Pair: co-compression of paired-end FASTQ files

v15 pair benchmark.png

The left bar shows sizes of each .fastq.genozip file when compressed separately, relative to the combined size of the .fastq.gz files, and the right bar shows the relative size of the .fastq.genozip file when co-compressed together using --pair.

With the --pair option, Genozip exploits redundancies between corresponding reads in a pair of FASTQ files to improve compression when they are compressed together. Typically, this results in shrinking the compressed file by an additional 10-15%.
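As a rough sketch of how this might be scripted, the Python helper below assembles a genozip command line with --pair. Only the genozip command name and the --pair flag come from the text above. The file names and helper names are hypothetical, and your installation may need additional arguments (for example a reference file), so check the Genozip documentation before relying on it.

```python
import shutil
import subprocess


def build_pair_command(r1_path: str, r2_path: str) -> list[str]:
    """Assemble a genozip invocation that co-compresses two paired FASTQs."""
    # Only --pair is taken from the text; any other arguments your setup
    # requires are not covered by this sketch.
    return ["genozip", "--pair", r1_path, r2_path]


def run_if_installed(command: list[str]) -> bool:
    """Run the command only if its executable is found on PATH."""
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True


# Hypothetical file names, printed rather than executed.
cmd = build_pair_command("sample_R1.fastq.gz", "sample_R2.fastq.gz")
print(" ".join(cmd))
```

Passing the arguments as a list, rather than as a single shell string, avoids quoting problems with unusual file names.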

By default, decompression recovers the two original FASTQs; however, it is also possible to output them together in interleaved format.
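For readers unfamiliar with interleaved format, the sketch below shows what it means: each R1 record is followed immediately by its R2 mate in a single stream. This is a plain-Python illustration of the format, not Genozip's implementation. The function names and the sample reads are made up for the example.

```python
from itertools import islice, zip_longest
from typing import Iterable, Iterator, Tuple


def fastq_records(lines: Iterable[str]) -> Iterator[Tuple[str, ...]]:
    """Group FASTQ lines into 4-line records (name, sequence, '+', quality)."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(r1_lines: Iterable[str], r2_lines: Iterable[str]) -> Iterator[str]:
    """Yield lines alternating one R1 record, then its R2 mate."""
    pairs = zip_longest(fastq_records(r1_lines), fastq_records(r2_lines))
    for rec1, rec2 in pairs:
        if rec1 is None or rec2 is None:
            raise ValueError("R1 and R2 contain different numbers of records")
        yield from rec1
        yield from rec2


# Made-up reads for illustration.
r1 = ["@read1/1", "ACGT", "+", "IIII", "@read2/1", "GGCC", "+", "FFFF"]
r2 = ["@read1/2", "TTAA", "+", "IIII", "@read2/2", "CCGG", "+", "FFFF"]
print("\n".join(interleave(r1, r2)))
```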


 

Optimize: even better compression, with some caveats

While Genozip is primarily a lossless compressor, the --optimize option is where we venture into "lossy compression" as well. The idea of lossy compression is this: it is common for data to contain information at a higher resolution than is actually needed for downstream analysis. Examples include the resolution of base quality scores and the number of decimal digits in fractions. If we reduce the resolution, we can gain significantly better compression. The compression-enhancing modifications that Genozip performs when --optimize is used are designed to have negligible impact on downstream analysis in many common cases; however, you should validate this for your own data.
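The general idea of trading resolution for compressibility can be illustrated with a toy example. The sketch below is not Genozip's algorithm: the bin width, the rounding precision and the synthetic quality string are all assumptions chosen for illustration, and zlib stands in as a generic compressor. The actual modifications made by --optimize are described on the format-specific pages referenced in this section.

```python
import random
import zlib


def bin_quality(quality: str, bin_width: int = 8) -> str:
    """Map each Phred+33 quality character to the lowest score in its bin."""
    binned = []
    for ch in quality:
        score = ord(ch) - 33
        binned.append(chr(score - score % bin_width + 33))
    return "".join(binned)


def round_fraction(text: str, digits: int = 2) -> str:
    """Reduce the number of decimal digits in a textual fraction."""
    return f"{float(text):.{digits}f}"


# Synthetic quality string with scores 0..40 (assumption: uniform random).
rng = random.Random(0)
quality = "".join(chr(33 + rng.randint(0, 40)) for _ in range(10_000))

original_size = len(zlib.compress(quality.encode()))
binned_size = len(zlib.compress(bin_quality(quality).encode()))
print(f"compressed bytes: original={original_size}, binned={binned_size}")
print(round_fraction("0.123456"))
```

Fewer distinct quality values means less information per character, so a general-purpose compressor needs fewer bytes. The cost is that the original scores cannot be recovered, which is why validation against your own downstream analysis matters.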

The additional savings with the Optimize method depend heavily on the details of the specific file being compressed. If the file is already highly optimized, --optimize might have limited effect. For other files, it might halve the file size or better, compared to genozip without --optimize.

Those who resent the Z in the word "optimize" will be glad to know that --optimise works as well 😊.

Details of the specific modifications can be found here: FASTQ, SAM/BAM/CRAM, VCF

Optimize for FASTQ

benchmark for optimize FASTQ 15.0.60.png

Genozip compression without and with --optimize, showing file sizes in MBs. The file tested is a FASTQ file produced by an MGI Tech sequencer, obtained from here

Optimize for BAM

benchmark for optimize BAM 15.0.60.png

Genozip compression without and with --optimize, showing file sizes in MBs. The file tested is a BAM file which consists of MGI Tech reads aligned with Illumina DRAGEN

Optimize for VCF

VCF benchmark for optimize 15.0.60.png

Genozip compression without and with --optimize, showing file sizes in MBs. The file tested is a GVCF file produced by Illumina DRAGEN obtained from here

Questions? support@genozip.com
