Compression - Quick Guide


File types compressible with Genozip

Genozip is designed to compress the following genomic file formats:

FASTQ, BAM, SAM, CRAM, VCF, BCF, FASTA, GFF, GVF, GTF, BED, TRACK, 23andMe, Illumina LOC files.

Note: Files that are not of one of these formats are treated as generic and are compressible as well. Genozip is often better and faster than general-purpose compressors even for generic files. The support for generic files allows compression of entire directories with Genozip, even if they contain some generic files as well.

Simple compression and decompression

genozip sample.bam

genounzip sample.bam.genozip

Compressing with a reference

To achieve good compression for FASTQ and BAM files, it is highly recommended to compress with a reference. This is also recommended for certain VCF files.

genozip --reference hs37d5.fa.gz mydata.fq.gz

genounzip mydata.fq.genozip

Note: when a particular FASTA file is used as a reference for the first time, genozip produces a .ref.genozip file, which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can use it as an argument to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):

genozip --make-reference hs37d5.fa.gz
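The choice between the FASTA and an existing .ref.genozip file can be sketched as a small shell helper. This is only an illustration: pick_reference is a hypothetical function, not part of Genozip, and it assumes that the file generated for NAME.fa.gz is named NAME.ref.genozip, as the hs37d5 examples in this guide suggest.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Genozip): choose the argument for --reference.
# Assumption: the prebuilt file for NAME.fa.gz is NAME.ref.genozip in the same directory.
pick_reference() {
    local fasta="$1"
    local stem="${fasta%.gz}"   # drop a trailing .gz, if present
    stem="${stem%.fasta}"       # drop a .fasta extension, if present
    stem="${stem%.fa}"          # drop a .fa extension, if present
    local prebuilt="${stem}.ref.genozip"
    if [ -f "$prebuilt" ]; then
        printf '%s\n' "$prebuilt"   # reuse the existing .ref.genozip file
    else
        printf '%s\n' "$fasta"      # fall back to the FASTA itself
    fi
}

# Print (rather than run) the resulting command.
echo "genozip --reference $(pick_reference hs37d5.fa.gz) mydata.fq.gz"
```

The command is only printed, so the sketch can be tried without Genozip installed.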

Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to the reference file directly.

Note: If you do not want genounzip to require an external reference file, you may compress with --REFERENCE instead of --reference, which stores the relevant parts of the reference file data within the compressed file. This makes the compressed file larger.

Co-compressing a pair of FASTQ files

Genozip can take advantage of redundancies between corresponding reads in a pair of FASTQ files to improve compression when they are compressed together, using --pair:

genozip --reference hs37d5.fa.gz --pair mysample-R1.fastq.gz mysample-R2.fastq.gz

Co-compressing a BAM file and its related FASTQ files

Similarly, Genozip can take advantage of redundancies between a BAM file and the FASTQ files from which the BAM was generated, to significantly improve compression when they are compressed together, using --deep:

$ genozip --deep --reference hs37d5.fa.gz myfile.bam myfile-R1.fq.gz myfile-R2.fq.gz

$ ls -lh myfile*
-rw-------+ 1 57G Feb 7 2020 myfile.bam
-rw-------+ 1 31G May 3 23:28 myfile-R1.fq.gz
-rw-------+ 1 35G May 3 23:31 myfile-R2.fq.gz

-rw-------+ 1 16G Jun 21 19:27 myfile.deep.genozip

Important: for --deep to work, all FASTQ files that contributed reads to this BAM file must be included.
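As a rough worked example, the sizes in the listing above can be turned into an overall ratio. The figures are the rounded values printed by ls -lh, so the result is approximate, and the variable names are illustrative only.

```bash
#!/usr/bin/env bash
# Sizes in G, copied from the ls -lh listing above (already rounded by ls).
bam=57; r1=31; r2=35; deep=16

total=$((bam + r1 + r2))                        # combined size of the three inputs
tenths=$(( (total * 10 + deep / 2) / deep ))    # ratio x 10, rounded to nearest
ratio="$((tenths / 10)).$((tenths % 10))"

echo "inputs: ${total}G, compressed: ${deep}G, ratio: ${ratio}x"
```

With these figures, 123G of input is stored in 16G, a ratio of about 7.7 to 1 relative to the already-compressed BAM and gzip-compressed FASTQ files.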

Compressing and decompressing multiple files into a tar file (preserving directory structure)

genozip my_data/ --tar my_data.tar --subdirs

tar xvf my_data.tar |& genounzip --files-from - --replace

Using Genozip in a pipeline

Example:

my-bam-outputting-method | genozip --output mysample.bam.genozip

genocat mysample.bam.genozip | my-sam-inputting-method

Note: when piping data into genozip, genozip attempts to detect the file type from the data. The file type may also be given explicitly with --input, e.g., --input bam
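A minimal sketch of the explicit form, combining the --input and --output options mentioned above. The wrapper name compress_bam_stream is hypothetical, and my-bam-outputting-method remains the placeholder producer from the example.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: compress BAM data arriving on stdin, stating the
# file type explicitly instead of relying on auto-detection.
compress_bam_stream() {
    local out="$1"
    genozip --input bam --output "$out"
}

# Usage (placeholder producer, as in the example above):
#   my-bam-outputting-method | compress_bam_stream mysample.bam.genozip
```

Stating the type explicitly may be preferable in scripted pipelines, where a silent misdetection would be harder to notice.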

Slicing & dicing your data with genocat

Genozip allows you to slice and dice your data in many ways, using genocat's subsetting options.

Some examples:

genocat --regions ^Y,MT mysample.bam.genozip # Displays all alignments except Y and MT contigs.

genocat --regions chrM GRCh38.fa.genozip # Displays the sequence of chrM.

genocat --samples SMPL1,SMPL2 mysamples.vcf.genozip # Displays 2 samples.

genocat --grep 1101:2392 myreads.fq.genozip # Displays reads with “1101:2392” in the description.

genocat --downsample 10 mysample.fq.genozip # Displays 1 in 10 reads.

Removing the original file after successful compression or decompression

genozip --replace myfile.bam

genounzip --replace myfile.bam.genozip

Balancing compression, speed and memory

There is a tradeoff between the execution speed and memory consumption of genozip, and the compression achieved. By default, genozip attempts to strike a good balance. Options --fast, --best and --low-memory can be used to tell genozip to skew the balance in either direction:

genozip --best sample.bam # better, but slower, compression

genozip --fast sample.bam # faster, but less efficient compression

genozip --low-memory sample.bam # uses less RAM, but slower and less efficient compression

MD5

Genozip uses Adler32 to verify that a decompressed file is identical to its original. To use MD5 instead, and to have the MD5 value reported, use --md5. Note that this usually results in slower compression and decompression.

genozip --md5 sample.bam

See: Verifying file integrity

Encryption

For security and compliance, it is possible to encrypt the file while compressing it. Genozip uses AES (256 bits) for encryption, and using encryption has no negative impact on compression speed or the compression ratio.

genozip --password mysecret sample.bam

genounzip --password mysecret sample.bam.genozip

See: Encryption

Multi-threading

By default, genozip attempts to utilize as many cores as it can. For that, it sets the number of threads to be a bit more than the number of cores (a practice known as over-subscription), as at any given moment some threads might be idle, waiting for a resource to become available. The --threads option allows explicit specification of the number of compute threads to be used (in addition, a small number of I/O threads is used too, usually 1 or 2).

genozip --threads 20 sample.sam

On machines with a large number of cores (> 100), the bottleneck becomes the speed at which Genozip can read from and write to disk, and Genozip will not utilize all the cores available.

When running on a personal computer (Windows, Mac or WSL), genozip uses fewer cores than are available, to avoid starving the user interface threads of the operating system or of other running applications, which would cause the computer to feel "stuck". To override this behavior and utilize all available cores, use --threads to specify a number that is 10% higher than the number of cores on your machine.
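The "10% higher" figure can be computed in the shell. This is a sketch under assumptions: threads_for_all_cores is a hypothetical helper, and rounding up to a whole number of threads is a choice made here, not something this guide specifies.

```bash
#!/usr/bin/env bash
# Hypothetical helper: number of cores plus 10%, rounded up to a whole thread.
threads_for_all_cores() {
    local cores="$1"
    echo $(( (cores * 11 + 9) / 10 ))   # integer form of ceil(cores * 1.1)
}

# nproc is available on Linux and WSL; sysctl covers macOS.
cores=$(nproc 2>/dev/null || sysctl -n hw.ncpu)

# Print (rather than run) the resulting command.
echo "genozip --threads $(threads_for_all_cores "$cores") sample.sam"
```

For example, a machine with 8 cores gives --threads 9, and one with 20 cores gives --threads 22.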

Reference file caching in RAM

To speed up loading of reference data, Genozip caches (=stores) the reference data in RAM (=memory).

To instruct genozip not to cache the reference data, use --no-cache:

genozip --reference hs37d5.fa.gz --no-cache mydata.fq.gz

To see the list of reference genomes currently cached in RAM:

genols --cache

To remove a specific reference genome from RAM:

genozip --reference hs37d5.ref.genozip --no-cache

To remove all genomes from RAM:

genozip --no-cache

Suppressing automatic testing

By default, after compressing a file, genozip verifies the compression by decompressing the compressed file in memory and comparing the digest (Adler32 or MD5) of the original file to that of the decompressed file. Using the --no-test option suppresses this verification, saving execution time. This is strongly discouraged as the post-compression testing is an essential part of Genozip's strategy to ensure data integrity.

genozip --no-test myfile.bam

Memory (RAM) consumption

In genozip, each compute thread is assigned a segment of the input file, known as a VBlock. By default, the VBlock size is selected based on characteristics of the data; however, it may be set explicitly with --vblock. A larger VBlock usually results in better compression, while a smaller VBlock causes genozip to consume less RAM. The VBlock size can be observed at the top of the --stats report. genozip’s memory consumption scales linearly with (VBlock-size X number-of-threads).

genozip --vblock 32 sample.bam # 32 MB of source file data per VBlock

genocat and genounzip also consume memory linearly with (VBlock-size X number-of-threads), where VBlock-size is the value used by genozip when compressing the particular file (it cannot be modified by genocat or genounzip). Usually, genocat and genounzip consume significantly less memory than genozip.
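To compare two settings, it can help to compute the (VBlock-size X number-of-threads) product for each. The sketch below does only that. It is not a RAM estimate, since the constants of the linear relationship are not given here, and both the helper name vblock_threads and the 16 MB value are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: the product that memory use scales with,
# in units of (MB of VBlock) x (compute threads). Not a RAM estimate.
vblock_threads() {
    echo $(( $1 * $2 ))
}

base=$(vblock_threads 32 20)   # e.g. --vblock 32 --threads 20
half=$(vblock_threads 16 20)   # e.g. --vblock 16 --threads 20 (illustrative value)

echo "scaling product: ${half} vs ${base}"
```

Halving either the VBlock size or the thread count halves the product, so the part of memory use that depends on it should shrink roughly in proportion.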

When using a reference file, it is loaded into memory too. If multiple genozip / genocat / genounzip processes are running in parallel, only one copy of the reference file is loaded into memory and shared between all processes. Depending on how busy the computer is, the reference file data might persist in RAM even between consecutive runs, avoiding the need to load it again from disk. All this happens behind the scenes.

Use --low-memory to instruct genozip, genounzip or genocat to make tradeoffs in a way that uses less RAM, even at the expense of lesser compression or slower execution.

Questions? support@genozip.com