
Verifying file integrity
Overview
When compressing with genozip, as the source file is read from disk (or standard input), Genozip calculates a numeric value that is a function of the entire source file’s content, known as the digest of the file. If the source file is compressed with .gz, .bz2, .xz or .zip, the digest is calculated on the source data after decompressing it.
The digest value is then stored in the compressed .genozip file.
When the .genozip file is decompressed with genounzip (or with genocat, with some exceptions), a digest is again calculated on the final output data (but before recompressing it with BGZF). This output digest is then compared to the digest stored in the .genozip file. If the two digests are identical, then we know with very high probability (see below) that the decompressed file is exactly identical to the source file. If they differ, genounzip reports an error. Short of intentional tampering with the file, or a bug in Genozip, this should never happen.
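The idea of digesting the decompressed content, rather than the bytes on disk, can be sketched in Python. This is an illustration only and not Genozip's implementation: the function names `content_digest` and `demo` and the sample data are hypothetical, and MD5 from the standard library is used for the digest.

```python
import gzip
import hashlib
import os
import tempfile


def content_digest(path, chunk_size=1 << 20):
    """Return the MD5 of a file's content, decompressing first if the name ends in .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    digest = hashlib.md5()
    with opener(path, "rb") as f:
        # Read in chunks so that large files are never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo():
    """Write the same (made-up) data as plain and gzip files and digest both."""
    data = b"@read1\nACGT\n+\nIIII\n" * 1000
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, "sample.fastq")
        zipped = plain + ".gz"
        with open(plain, "wb") as f:
            f.write(data)
        with gzip.open(zipped, "wb") as f:
            f.write(data)
        return content_digest(plain), content_digest(zipped)


if __name__ == "__main__":
    plain_digest, gz_digest = demo()
    print(plain_digest == gz_digest)
```

Because the digest is taken over the decompressed bytes, the plain file and its gzip-compressed copy produce the same value, which is what allows a digest recorded at compression time to be compared against the reconstructed output later.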
When running genozip (unless --no-test is specified), after the file is compressed, genounzip --test is automatically run in a separate process on the resulting .genozip file to ensure its integrity. genounzip --test, which may also be run separately, decompresses the file in memory in order to compare the digests; the reconstructed data is then discarded and not written to disk.
Probabilities
By default, the digest is calculated using the XXH3 algorithm, resulting in a 64 bit digest.
Genozip also offers a verification method using the MD5 algorithm, which may be invoked by compressing with genozip --md5. MD5 results in a 128 bit digest.
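To put the digest sizes in perspective, here is a rough calculation. It rests on an idealized assumption that is not stated in the text above: that for accidental (non-malicious) corruption the digest behaves like a uniformly random function, so a corrupted file matches the stored digest with probability of about 1 in 2^bits for a single comparison. The function name is hypothetical.

```python
from fractions import Fraction


def undetected_probability(digest_bits):
    """Chance that random corruption yields the same digest (idealized uniform-hash model)."""
    return Fraction(1, 2 ** digest_bits)


if __name__ == "__main__":
    for name, bits in (("XXH3", 64), ("MD5", 128)):
        p = undetected_probability(bits)
        print(f"{name}: 1 in 2^{bits} (about {float(p):.1e})")
    # Prints:
    # XXH3: 1 in 2^64 (about 5.4e-20)
    # MD5: 1 in 2^128 (about 2.9e-39)
```

This model says nothing about deliberate tampering, which is where the two algorithms differ most.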
MD5 versus XXH3
In addition to the digest sizes, there are two other notable differences between MD5 and XXH3 in Genozip:
1. MD5, unlike XXH3, is a cryptographic hash. This means that an attacker who wants to maliciously replace a file with another will find it very hard to slightly modify their replacement file so that it has the same MD5 digest as the original file. Therefore, recording the MD5 digest of the original file and comparing it to the MD5 digest of the final file provides a high level of confidence that this is indeed the same file. In contrast, intentionally adjusting a file to achieve any desired XXH3 digest is relatively easy. Note: MD5, while still widely used for cryptographic purposes, is considered outdated by security experts, and stronger algorithms are available. If you need a stronger algorithm, please contact support@genozip.com.
2. Genozip compresses files by breaking them into VBlocks and compressing each VBlock in its own compute thread, which allows many VBlocks to be compressed in parallel on multiple CPU cores. If XXH3 is used, the digest is calculated (and verified) for each VBlock independently and there is no file-wide digest. In contrast, with MD5 the digest is calculated on the entire file, which means that VBlocks are digested one at a time, in their correct order. This serialization of VBlocks reduces thread parallelism in both compression and decompression, so Genozip utilizes fewer cores than with XXH3 and both compression and decompression are slower.
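The structural difference in point 2 can be sketched in Python. This is a simplified illustration, not Genozip's code: the function names and block size are hypothetical, and because XXH3 is not in the Python standard library, an 8-byte BLAKE2b digest stands in for the 64 bit per-block hash (BLAKE2b is cryptographic, unlike XXH3; only the per-block structure matters here).

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def split_blocks(data, block_size):
    """Cut the data into consecutive fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def per_block_digests(blocks, workers=4):
    """Digest every block independently; blocks can be processed in any order."""
    def digest_one(block):
        # Stand-in for a 64 bit per-block hash such as XXH3 (assumption, see lead-in).
        return hashlib.blake2b(block, digest_size=8).hexdigest()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(digest_one, blocks))


def whole_file_digest(blocks):
    """Single MD5 over the whole file: blocks must be fed strictly in order."""
    md5 = hashlib.md5()
    for block in blocks:
        md5.update(block)
    return md5.hexdigest()


if __name__ == "__main__":
    data = bytes(range(256)) * 100
    blocks = split_blocks(data, 1000)
    print(len(per_block_digests(blocks)), "independent block digests")
    print(whole_file_digest(blocks) == hashlib.md5(data).hexdigest())
```

In the per-block version no worker ever waits for another, whereas the running MD5 state has to pass through the blocks sequentially, which is the source of the serialization described above.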
Cases in which genocat doesn’t verify the digest
When using any genocat option that intentionally modifies the output data, the digest is not verified. These options include (partial list): --regions --grep --downsample --drop-genotypes --gt-only --no-header --header-only --header-one --MAPQ --FLAG --bases --samples --sequential --head --tail --lines --one-vb --one-component
Command line examples
Compressing with XXH3 digest:
genozip myfile.sam
Compressing with MD5 digest:
genozip --md5 myfile.sam
Compressing without verifying after compression:
genozip --no-test myfile.sam
Verifying a compressed file by uncompressing in memory without writing to disk:
genounzip --test myfile.sam.genozip
Viewing the MD5 digest of a single file, or of all files in the directory:
genols myfile.sam.genozip
genols
Questions? support@genozip.com
