Verifying file integrity

Overview

When compressing with genozip, as the source file is read from disk (or standard input), Genozip calculates a numeric value that is a function of the entire source file’s content. This value is known as the digest of the file. If the source file is itself compressed as .gz, .bz2, .xz or .zip, the digest is calculated on the source data after decompressing it.

The digest value is then stored in the compressed .genozip file.

When the .genozip file is decompressed with genounzip (and genocat, with some exceptions), a digest is again calculated on the final output data (but before recompressing it with BGZF). The output data digest is then compared to the digest stored in the .genozip file. If the two digests are identical, then we know with a very high probability (see below) that the uncompressed file is exactly identical to the source file. If they differ, genounzip will report an error. Short of intentional tampering with the file, or a bug in Genozip, this should never happen.
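The compare-on-decompress flow described above can be sketched in a few lines of Python. This is an illustration only, not Genozip's implementation: the function names are hypothetical, the digest is Adler32 from Python's zlib module, and only gzip-compressed sources are handled.

```python
import gzip
import zlib


def source_digest(raw):
    """Digest of the source data (hypothetical helper, not a Genozip API).

    If the source is gzip-compressed, the digest is taken over the
    decompressed content, mirroring the behaviour described above.
    """
    if raw[:2] == b"\x1f\x8b":  # gzip magic number
        raw = gzip.decompress(raw)
    return zlib.adler32(raw)


def verify_output(output, stored_digest):
    """Recompute the digest on the reconstructed data and compare it."""
    return zlib.adler32(output) == stored_digest


text = b"@read1\nACGT\n+\nIIII\n"
stored = source_digest(gzip.compress(text))  # digest recorded at compression time

print(verify_output(text, stored))                            # identical output
print(verify_output(text.replace(b"ACGT", b"ACGA"), stored))  # corrupted output
```

The first check passes because the reconstructed data matches the source byte for byte; the second fails because a single changed byte alters the digest.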

When running genozip (unless --no-test is specified), after the file is compressed, genounzip --test is automatically run in a separate process on the resulting output .genozip file, to ensure its integrity. genounzip --test, which may also be run separately, decompresses the file in memory in order to compare the digest - the resulting reconstructed data is then discarded and not written to disk.

Probabilities

By default, the digest is calculated using the Adler32 algorithm, resulting in a 32 bit digest. This means that the probability of a corrupt file passing the integrity test by chance is about 1 in 4 billion (the exact probability may differ, as it is subject to the statistical properties of the data).

Genozip also offers a stronger verification method using the MD5 algorithm, which may be invoked by compressing with genozip --md5. MD5 results in a 128 bit digest, bringing down the probability of a corrupt file passing the integrity test by chance to about 2^(-128), or approximately 1 in 340,000,000,000,000,000,000,000,000,000,000,000,000 (3.4 × 10^38).
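As a quick check of the arithmetic, the number of possible digest values for each width can be computed directly. The sketch assumes an idealized digest whose values are uniformly distributed, which, as noted above, is only an approximation for real data.

```python
ADLER32_BITS = 32
MD5_BITS = 128

adler32_space = 2 ** ADLER32_BITS  # number of distinct 32 bit digests
md5_space = 2 ** MD5_BITS          # number of distinct 128 bit digests

print(f"Adler32: 1 in {adler32_space:,}")  # Adler32: 1 in 4,294,967,296
print(f"MD5:     1 in {md5_space:.3e}")    # MD5:     1 in 3.403e+38
```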

MD5 versus Adler32

In addition to the different probabilities, there are two other notable differences between MD5 and Adler32 in Genozip:

1. MD5, unlike Adler32, also has cryptographic uses: an attacker who wants to maliciously replace a file with another file will find it very hard to modify the replacement so that it has the same MD5 digest as the original file. Therefore, recording the MD5 digest of the original file and comparing it to the MD5 digest of the final file provides a high level of confidence that this is indeed the same file. In contrast, intentionally adjusting a file to achieve any desired Adler32 digest is trivial. Note: MD5, while still widely used for cryptographic purposes, is considered by security experts to be outdated, with stronger algorithms available. If you need a stronger algorithm, please contact support@genozip.com.

2. Genozip compresses files by breaking them into VBlocks and compressing each VBlock using its own compute thread, which allows compressing many VBlocks in parallel using multiple CPU cores. If Adler32 is used, the digest is calculated (and verified) for each VBlock independently and there is no file-wide digest. In contrast, with MD5, the digest is calculated on the entire file, which means that it is calculated one VBlock at a time, in their correct order. This serialization of VBlocks introduces some inefficiencies in thread parallelization in both compression and decompression, so Genozip utilizes fewer cores than with Adler32, resulting in slower compression and decompression.
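Both differences can be illustrated with Python's standard library. The sketch below is not Genozip's code: the block size, the helper names and the thread pool are assumptions made for illustration. It shows two inputs that share an Adler32 digest but not an MD5 digest, then contrasts independent per-block Adler32 digests with a single MD5 digest that must consume the blocks in order.

```python
import hashlib
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 16  # hypothetical; tiny on purpose, unrelated to real VBlock sizes


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the data into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def adler32_per_block(blocks):
    """One independent digest per block; blocks can be processed in parallel."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(zlib.adler32, blocks))


def md5_whole_file(blocks):
    """A single file-wide digest; blocks must be fed one at a time, in order."""
    digest = hashlib.md5()
    for block in blocks:
        digest.update(block)
    return digest.hexdigest()


# Difference 1: an Adler32 collision is easy to construct by hand.
# Shifting three adjacent bytes by +1, -2, +1 leaves both running sums unchanged.
print(zlib.adler32(b"abc") == zlib.adler32(b"b`d"))                        # True
print(hashlib.md5(b"abc").digest() == hashlib.md5(b"b`d").digest())        # False

# Difference 2: per-block digests versus one order-dependent digest.
data = b"chr1\t100\tA\tG\n" * 10
blocks = split_into_blocks(data)
print(len(adler32_per_block(blocks)), "independent Adler32 digests")
print(md5_whole_file(blocks) == hashlib.md5(data).hexdigest())             # True
```

Feeding the same blocks to md5_whole_file in a different order yields a different digest, which is why the MD5 calculation cannot simply be spread across threads the way the per-block Adler32 calculation can.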

Cases in which genozip doesn’t calculate a digest

There are some cases in which you may request genozip to change the source data before compressing it. In these cases, the digest is not calculated. See Exceptions to Losslessness.

Cases in which genocat doesn’t verify the digest

When using any genocat option which results in intentional modification of the output data, the digest is not verified. These options include (partial list): --regions --grep --downsample --drop-genotypes --gt-only --no-header --header-only --header-one --MAPQ --FLAG --bases --samples --sequential --head --tail --lines --one-vb --one-component

Command line examples

Compressing with Adler32 digest:

genozip myfile.sam

Compressing with MD5 digest:

genozip --md5 myfile.sam

Compressing without verifying after compression:

genozip --no-test myfile.sam

Verifying a compressed file by decompressing in memory without writing to disk:

genounzip --test myfile.sam.genozip
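If the verification needs to be scripted, for example over a batch of files, the command above can be wrapped in a small helper. This is a sketch under stated assumptions: the helper names are hypothetical, and it assumes that genounzip signals a failed test through a non-zero exit status, which should be confirmed against your installed version.

```python
import shutil
import subprocess


def build_test_command(path):
    """Argument list for the in-memory verification command shown above."""
    return ["genounzip", "--test", path]


def verify_genozip_file(path):
    """Run genounzip --test on one file and report whether it succeeded.

    Assumption: a failed integrity test produces a non-zero exit status.
    """
    if shutil.which("genounzip") is None:
        raise FileNotFoundError("genounzip was not found on PATH")
    result = subprocess.run(build_test_command(path))
    return result.returncode == 0


print(build_test_command("myfile.sam.genozip"))
```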

Viewing the MD5 digest of a single file or the entire directory:

genols myfile.sam.genozip

genols

Viewing the flow of creating the digest (useful mostly for Genozip developers):

genozip --show-digest myfile.sam

genounzip --test --show-digest myfile.sam.genozip

genocat --show-digest myfile.sam.genozip

Viewing the digest as it appears in the .genozip file

Viewing the digest as it appears in the GENOZIP_HEADER section of the .genozip file, and the digest of the individual components as it appears in the TXT_HEADER sections or individual VBlocks (useful mostly for Genozip developers):

genocat --show-header=GENOZIP_HEADER myfile.sam.genozip

genocat --show-header=TXT_HEADER myfile.sam.genozip

genocat --show-header=VB_HEADER myfile.sam.genozip

Questions? support@genozip.com
