
Losslessness

What do we mean by Losslessness
​

Genozip compression is lossless relative to the underlying data: the data reconstructed during decompression is identical to the original data that was compressed. In particular, the MD5 values generated by:

​

.gz files: zcat file.fastq.gz | md5sum

BAM: zcat file.bam | md5sum

CRAM: samtools view -u --no-PG file.cram | zcat | md5sum

BCF: bcftools view --no-version file.bcf | md5sum

​

are the same for the original file and the file reconstructed with genounzip.
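For environments without zcat and md5sum, the same check can be sketched in Python. This is a minimal illustration using only the standard library; the function name payload_md5 and the file names in the usage note are hypothetical, and it covers only the gzip/BGZF cases (.gz and BAM), not CRAM or BCF.

```python
import gzip
import hashlib

def payload_md5(path, chunk_size=1 << 20):
    """MD5 of the decompressed contents of a gzip or BGZF file,
    comparable to `zcat path | md5sum`."""
    digest = hashlib.md5()
    # gzip.open also reads multi-member gzip files, which is what BGZF is
    with gzip.open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With this sketch, payload_md5("original.fastq.gz") and payload_md5("reconstructed.fastq.gz") would be expected to return the same string.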

​

Verification of Losslessness

​

When uncompressing with genounzip (and, with some exceptions, genocat), Genozip verifies that the reconstructed data is identical to the source data, using an MD5 or Adler32 digest. More details: Verifying file integrity.
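The idea of digest-based verification can be sketched generically as follows. This is not Genozip's internal implementation, which is not described here; the function names stream_digest and verify are hypothetical, and the sketch only shows how an MD5 or Adler32 digest computed over the source can be compared with one computed over the reconstructed data.

```python
import hashlib
import zlib

def stream_digest(chunks, method="md5"):
    """Digest of a sequence of byte chunks, as a hex string."""
    if method == "md5":
        h = hashlib.md5()
        for chunk in chunks:
            h.update(chunk)
        return h.hexdigest()
    if method == "adler32":
        value = 1  # initial value defined for Adler-32
        for chunk in chunks:
            value = zlib.adler32(chunk, value)
        return format(value & 0xFFFFFFFF, "08x")
    raise ValueError("unknown digest method: %r" % method)

def verify(source_chunks, reconstructed_chunks, method="md5"):
    """True if both streams carry the same bytes, judged by their digests."""
    return stream_digest(source_chunks, method) == stream_digest(reconstructed_chunks, method)
```

The digest depends only on the bytes, not on how they are split into chunks, so a source read in one pass and a reconstruction written in blocks can still be compared.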

​

Losslessness and gz-compressed files 
​

Many tools that generate genomic files compress them into the gzip format (or a variant of it, BGZF). These files typically have a .gz extension. BAM and BCF files, despite not having a .gz extension, are also gzip files.

​​​

When genozip compresses a gzip-compressed file, it removes the gz-compression as it progresses through the file, compressing the data with the Genozip method instead. Similarly, when genounzip uncompresses a .genozip file, it re-compresses the data back to gzip format if needed. genounzip uses a very efficient method for gz-compression, which contributes to its speed, but this usually results in a slightly different gz-compression compared to the original file. The data itself remains identical.
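The distinction between the gz-compression and the data it carries can be illustrated with Python's standard gzip module. This is a generic sketch, not Genozip code: the helper same_payload and the sample FASTQ-like records are made up for illustration, and compression levels 1 and 9 merely stand in for two different gz-compressors.

```python
import gzip

def same_payload(gz_a, gz_b):
    """True if two gzip byte strings decompress to identical data."""
    return gzip.decompress(gz_a) == gzip.decompress(gz_b)

# Made-up FASTQ-like records, used only as sample input
data = b"".join(
    b"@read%d\nACGTACGTTTGACCA\n+\nIIIIIIIIIIIIIII\n" % i for i in range(2000)
)

fast = gzip.compress(data, compresslevel=1, mtime=0)
best = gzip.compress(data, compresslevel=9, mtime=0)

# The compressed bytes typically differ in size and content...
print(len(fast), len(best), fast == best)
# ...while the decompressed data is the same
print(same_payload(fast, best))
```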

​​​

The --bgzf=exact (or --bgzf=exact-strict) option instructs genounzip to gz-recompress to precisely the same gz-compression as the original file. Whether this is possible depends on whether genozip was able to identify the library used by the tool that created the original file: while genozip is compressing the data, it also tests several libraries known to be used by popular bioinformatics tools, to try to identify the one used for the original compression.
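The general idea of identifying the original compressor by trial can be sketched as below. This is a simplified illustration under stated assumptions, not Genozip's actual procedure: Genozip is described as testing several libraries, whereas this sketch only tries the compression levels of a single library (zlib), and the names raw_deflate and find_matching_level are hypothetical.

```python
import zlib

def raw_deflate(payload, level):
    """Compress payload as a raw deflate stream (no gzip/zlib header)."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(payload) + compressor.flush()

def find_matching_level(payload, original_deflate, levels=range(0, 10)):
    """Return the first zlib level that reproduces original_deflate
    byte for byte from payload, or None if no level does."""
    for level in levels:
        if raw_deflate(payload, level) == original_deflate:
            return level
    return None
```

If a candidate reproduces the original bytes exactly, re-compressing with that candidate can restore the original gz-compression; if none does, only the data (not the compression) can be restored.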

​

Using genozip --is-exactable, it is possible to test a .gz or .bam file (even before compressing it with genozip) to see whether --bgzf=exact would work.

​​​​

Note on index files: Unless the --bgzf=exact option is used, the change in gzip compression means that bai, tbi, csi and gzi index files need to be re-generated after genounzip. Luckily, genounzip automatically generates a bai index file when uncompressing .bam and .sam.gz files, and a tbi index file when uncompressing .vcf.gz files (unless --no-bai or --no-tbi is specified). The index file is generated during uncompression, with no noticeable performance penalty.
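Why a change in gz-compression invalidates these indexes can be sketched as follows. Indexes over BGZF files address data by "virtual offsets" that combine the byte position of a compressed block with a position inside the uncompressed block, so they depend on compressed block sizes. The sketch below is a simplification: plain zlib streams stand in for real BGZF blocks, the sample data is random, and the names are hypothetical.

```python
import random
import zlib

def virtual_offset(coffset, uoffset):
    """BGZF-style virtual offset: compressed block start in the upper
    48 bits, offset within the uncompressed block in the lower 16 bits."""
    if not 0 <= uoffset < (1 << 16):
        raise ValueError("uoffset must fit in 16 bits")
    return (coffset << 16) | uoffset

# Random DNA-like data standing in for the first block of a file
rng = random.Random(0)
block = "".join(rng.choices("ACGT", k=40000)).encode()

# The second block starts where the first compressed block ends, so its
# position depends on how well the first block was compressed
start_fast = len(zlib.compress(block, 1))
start_best = len(zlib.compress(block, 9))

# The same record (here 100 bytes into the second block) gets a different
# virtual offset whenever the two compressed sizes differ
print(virtual_offset(start_fast, 100), virtual_offset(start_best, 100))
```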

​

Losslessness and compressed files other than gz-compressed
​

CRAM: CRAM files have their own internal compression (which is not gzip), and reconstruction of a CRAM file by genounzip or genocat might result in a different compression level than the original file. There is no equivalent of --bgzf=exact for CRAM files.

​

.bz2, .xz or .zip: similar to removing gz-compression, genozip is also capable of removing these types of compression, but genounzip cannot re-compress into .bz2, .xz or .zip.

​​

Exceptions to Losslessness​​

​
1. There are some cases in which you may request genozip to modify the source data before compressing it. In these cases, the digest is calculated on the data after the modifications. These cases are:

​​​

Using --optimize 

Using --truncate, if the data was actually truncated

Using --head

Using --add-line-numbers

​​​

2. BCF: Genozip does not compress BCF files natively - it first converts the data to VCF format. Because BCF stores floating-point numbers in base-2 while VCF stores them in base-10, there is a theoretical possibility of rounding-error differences between the original file and the reconstructed one in annotations containing floating-point numbers. This theoretical issue has not yet been observed in practice.
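The kind of base-2/base-10 rounding effect referred to above can be illustrated generically. This sketch assumes 32-bit IEEE floats and a fixed number of significant decimal digits in the text form; the digits parameter and the function names are hypothetical, and the sketch does not reflect how many digits bcftools or Genozip actually write, nor does it reproduce any observed Genozip issue.

```python
import struct

def to_float32(x):
    """Round a Python float to the nearest 32-bit IEEE value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def survives_text_roundtrip(value, digits):
    """True if a 32-bit float keeps its exact bit pattern after being
    written as decimal text with `digits` significant digits and parsed back."""
    original = to_float32(value)
    text = "%.*g" % (digits, original)
    return struct.pack("<f", float(text)) == struct.pack("<f", original)

print(survives_text_roundtrip(0.5, 6))    # True: 0.5 is exact in both bases
print(survives_text_roundtrip(1 / 3, 6))  # False: 6 digits lose information
print(survives_text_roundtrip(1 / 3, 9))  # True: 9 digits are enough for 32-bit floats
```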
​

Questions? support@genozip.com


© 2024 Genozip Limited. All rights reserved. Genozip™ is a trademark. Our technology is patent-pending. Privacy Policy.
