Losslessness

Genozip compression is lossless with respect to the underlying data: the data reconstructed during decompression is exactly identical to the original data that was compressed. Consequently, the MD5 values generated by the following commands:

.GZ files: zcat file.fastq.gz | md5sum

BAM: zcat file.bam | md5sum

CRAM: samtools view -u --no-PG file.cram | zcat | md5sum

BCF: bcftools view file.bcf | md5sum

are the same for the original file and the file reconstructed with genounzip.
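The comparison above can be wrapped in a small helper. This is only a sketch for gzip-compatible files (.gz, and BAM via BGZF); the function names payload_md5 and same_payload are hypothetical and not part of Genozip, and CRAM or BCF files would need the samtools or bcftools commands listed above instead.

```shell
# payload_md5 FILE: print the MD5 of the uncompressed data inside a gzip file.
# (Hypothetical helper name, not a Genozip command.)
payload_md5() {
    gzip -dc "$1" | md5sum | cut -d' ' -f1
}

# same_payload A B: succeed if both gzip files decompress to identical data,
# regardless of how each one was compressed.
same_payload() {
    [ "$(payload_md5 "$1")" = "$(payload_md5 "$2")" ]
}
```

For example, with an original file and its reconstruction (file names are placeholders): same_payload original.fastq.gz reconstructed.fastq.gz && echo "underlying data is identical"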

Verification of Losslessness

When uncompressing, genounzip (and genocat, with some exceptions) verifies that the reconstructed data is exactly identical to the source data, using an MD5 or Adler32 digest. More details: Verifying file integrity.

Exceptions to Losslessness

1. Compressing already-compressed files (gzip): Many tools that generate genomic files compress them into the gzip format (or a variant of it, BGZF). These typically have a .gz extension. BAM and BCF files, despite not having a .gz extension, are also gzip files.

When genozip compresses a gzip-compressed file, it first uncompresses the gzipped data to recover the original underlying data, and then compresses that data with Genozip. Similarly, when genounzip uncompresses a .genozip file, it recompresses the data back to gzip (or more precisely, BGZF). Since there are many gzip compression libraries, each with dozens of parameter combinations, it is possible that genounzip's gzip compression will achieve a slightly different compression level than the gzip compression of the original file, resulting in the final gzip-compressed file differing from the original gzip file. However, the underlying data is identical: "zcat myfile.gz | md5sum" should yield the same value for the original file and the reconstructed one. It is also the same value as reported by genozip when using the --md5 option.
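The effect can be reproduced without Genozip. The sketch below is an illustration only: it uses the standard gzip tool at two compression levels as a stand-in for two different gzip libraries or parameter sets, and the file names are made up for the example.

```shell
# Stand-in data; any file would do.
workdir=$(mktemp -d)
seq 1 20000 > "$workdir/data.txt"

# Compress the same data with two different gzip settings.
# -n omits the file name and timestamp from the gzip header.
gzip -1 -n -c "$workdir/data.txt" > "$workdir/fast.gz"
gzip -9 -n -c "$workdir/data.txt" > "$workdir/best.gz"

# Compare the compressed files byte by byte.
if cmp -s "$workdir/fast.gz" "$workdir/best.gz"; then
    compressed="identical"
else
    compressed="different"
fi

# Compare the underlying (uncompressed) data by MD5.
md5_fast=$(gzip -dc "$workdir/fast.gz" | md5sum | cut -d' ' -f1)
md5_best=$(gzip -dc "$workdir/best.gz" | md5sum | cut -d' ' -f1)

echo "compressed files: $compressed"
if [ "$md5_fast" = "$md5_best" ]; then
    echo "underlying data: identical"
else
    echo "underlying data: different"
fi
```

With typical gzip implementations the two .gz files differ while the two MD5 values match, which is the same situation as an original gzip file and the one reconstructed by genounzip.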

The --bgzf=exact option instructs genounzip to attempt to identify the gzip library used for compressing the original file. If the library is identified, the gzip compression will be reproduced precisely. Using this option may cause slower decompression as the gzip library used might be slower than the one used by genounzip by default.

To find out whether Genozip is able to identify the gzip library used for compressing a file, use genozip --show-gz, for example:

> genozip --show-gz my-file.bam
my-file.bam: Identified as generated with libdeflate_1.7 level 6

2. CRAM: Likewise, CRAM files have their own internal compression (not using BGZF), and reconstruction of a CRAM file by genounzip or genocat might result in a different compression level than the original file. There is no equivalent of --bgzf=exact for CRAM files.

3. Compressing already-compressed files (.bz2 .xz .zip): genozip is capable of compressing files that are already compressed with these methods. It does so by uncompressing the data just before recompressing with Genozip. In these cases, genounzip recompresses to .gz, not to .bz2 .xz or .zip.

To instruct genounzip to refrain from recompressing, use --bgzf=0.

4. BCF: Genozip does not compress BCF files natively - it first converts the data to VCF format. Because BCF stores floating point numbers in base-2 while VCF stores them in base-10, there is a theoretical possibility of rounding-error differences, between the original file and the reconstructed one, in annotations containing floating point numbers. This theoretical issue has not yet been observed in practice.
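The general mechanism behind such rounding differences can be illustrated with awk. This is not Genozip's or bcftools' actual conversion code: awk works in double precision whereas BCF stores 32-bit floats, and the 6-significant-digit format is an assumption chosen for the example. It only shows that a binary value written as short decimal text does not necessarily parse back to the same binary value.

```shell
# Illustration only: binary -> decimal text -> binary round trip.
result=$(awk 'BEGIN {
    x = 1 / 3                    # value held in binary floating point
    text = sprintf("%.6g", x)    # decimal text with 6 significant digits
    y = text + 0                 # decimal text parsed back to binary
    printf "text=%s round-trip=%s", text, (x == y ? "exact" : "inexact")
}')
echo "$result"
```

Here the round trip is reported as inexact, because 1/3 cannot be represented by 6 decimal digits. Values that are written with enough significant digits round-trip exactly, which is consistent with the issue not having been observed in practice.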

5. There are some cases in which you may request genozip to modify the source data before compressing it. In these cases, the digest is calculated on the data after the modifications. These cases are:

- Using --optimize

- Using --truncate, if the data was actually truncated

- Using --head

- Using --add-line-numbers

Questions? support@genozip.com
