File types compressible with Genozip

Genozip is designed to compress the following genomic file formats:

FASTQ, BAM, SAM, CRAM, VCF, BCF, FASTA, GFF, GVF, GTF, BEDTRACK, Chain files, KRAKEN, PHYLIP, 23andMe, Illumina LOC files.

Note: Files in any other format are treated as GENERIC and are compressible as well. Genozip is often better and faster than general-purpose compressors even for GENERIC files. Support for GENERIC files allows entire directories to be compressed with Genozip, even if they contain some GENERIC files.

Simple compression and decompression

genozip sample.bam

 

genounzip sample.bam.genozip

Compressing with a reference

To achieve good compression for FASTQ and BAM files, it is highly recommended to compress with a reference. This is also recommended for certain VCF files such as GVCF files.

genozip --reference myfasta.fa.gz mydata.fq.gz

 

genounzip mydata.fq.genozip

Note: when a particular FASTA file is used as a reference for the first time, genozip produces a .ref.genozip file, which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can pass it to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly, for example:

genozip --make-reference myfasta.fa.gz

Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used by genozip. Alternatively, you can use --reference to point to the reference file directly.

Note: If you do not want genounzip to require an external reference file, you may compress with --REFERENCE instead of --reference, which stores the relevant parts of the reference file data within the compressed file. This makes the compressed file larger.
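The notes above suggest a simple pattern: pass an existing .ref.genozip file to --reference when one sits next to the FASTA, and fall back to the FASTA otherwise. The following is a minimal sketch of that pattern. The helper name pick_reference is hypothetical, and the naming rule it relies on (myfasta.fa.gz becomes myfasta.ref.genozip) is inferred from the file names used in the examples on this page, not a documented rule.

```shell
# pick_reference (hypothetical helper): print the path to pass to --reference.
# Assumption: the reference file for myfasta.fa.gz is named myfasta.ref.genozip
# and sits in the same directory, as the examples on this page suggest.
pick_reference() {
    fasta=$1
    base=${fasta%.gz}      # drop a trailing .gz, if any
    base=${base%.fa}       # drop a trailing .fa, if any
    ref=$base.ref.genozip
    if [ -f "$ref" ]; then
        printf '%s\n' "$ref"     # reuse the existing reference file
    else
        printf '%s\n' "$fasta"   # fall back to the FASTA itself
    fi
}

# Usage: only attempted when genozip is installed and the input files exist.
if command -v genozip >/dev/null 2>&1 && [ -f mydata.fq.gz ] && [ -f myfasta.fa.gz ]; then
    genozip --reference "$(pick_reference myfasta.fa.gz)" mydata.fq.gz
fi
```

If the FASTA uses a different extension (for example .fasta), the suffix handling in the helper would need to be adjusted.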

Reference genome caching in RAM

To speed up loading of reference data, Genozip caches (=stores) the reference data in RAM (=memory).

To instruct genozip not to cache the reference data, use --no-cache :

genozip --reference myfasta.fa.gz --no-cache mydata.fq.gz

To see the list of reference genomes currently cached in RAM:

genols --cache

To remove a specific reference genome from RAM:

genozip --reference myfasta.ref.genozip --no-cache 

To remove all genomes from RAM:

genozip --no-cache 

Compressing a pair of FASTQ files

Genozip can take advantage of redundancies between corresponding reads in a pair of FASTQ files to improve compression when they are compressed together, using --pair:

genozip --reference myfasta.fa.gz --pair mysample-R1.fastq.gz mysample-R2.fastq.gz
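When many samples follow the same naming scheme, the second file name can be derived from the first before calling genozip with --pair. The sketch below assumes the two files differ only in "-R1." versus "-R2.", as in the example above. The helper name mate_of is hypothetical, and other naming schemes would need a different substitution.

```shell
# mate_of (hypothetical helper): derive the R2 file name from the R1 name.
# Assumption: the files differ only in "-R1." vs "-R2.".
mate_of() {
    printf '%s\n' "$1" | sed 's/-R1\./-R2./'
}

r1=mysample-R1.fastq.gz
r2=$(mate_of "$r1")

# Compress the pair only when both files exist and genozip is installed.
if [ -f "$r1" ] && [ -f "$r2" ] && command -v genozip >/dev/null 2>&1; then
    genozip --reference myfasta.fa.gz --pair "$r1" "$r2"
else
    echo "skipping: need $r1, $r2 and genozip"
fi
```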

Compressing and decompressing multiple files into a tar file (preserving directory structure)

genozip my_data/ --tar my_data.tar --subdirs

tar xvf my_data.tar |& genounzip --files-from - --replace

See also: Archiving

Using Genozip in a pipeline

Example:

my-bam-outputting-method | genozip --output mysample.bam.genozip

genocat mysample.bam.genozip | my-sam-inputting-method

Note: when data is piped into genozip, genozip attempts to detect the file type from the data. The file type may also be given explicitly with --input, e.g., --input bam.

Slicing & dicing your data with genocat

Genozip allows you to slice and dice your data in many ways, using genocat's subsetting options.

Some examples:

genocat --regions ^Y,MT mysample.bam.genozip  # Displays all alignments except Y and MT contigs.

genocat --regions chrM GRCh38.fa.genozip      # Displays the sequence of chrM.

genocat --samples SMPL1,SMPL2 mysamples.vcf.genozip  # Displays 2 samples.

genocat --grep 1101:2392 myreads.fq.genozip   # Displays reads with “1101:2392” in the description.

genocat --downsample 10 mysample.fq.genozip   # Displays 1 in 10 reads.

Removing the original file after successful compression

genozip --replace myfile.bam

Balancing compression, speed and memory

There is a tradeoff between genozip's execution speed and memory consumption on the one hand, and the compression achieved on the other. By default, genozip attempts to strike a good balance. The options --fast, --best and --low-memory tell genozip to shift that balance:

genozip --best sample.bam  # better, but slower, compression 

genozip --fast sample.bam  # faster, but less efficient compression

genozip --low-memory sample.bam  # uses less RAM

MD5

Genozip verifies that a decompressed file is identical to its original using Adler32. To use MD5 instead, and to report the MD5 value, use --md5:

genozip --md5 sample.bam

See: Verifying file integrity

Encryption

genozip --password mysecret sample.bam

genounzip --password mysecret sample.bam.genozip

See: Encryption

Suppressing automatic testing

By default, after compressing a file, genozip verifies the compression by decompressing the compressed file in memory and comparing the digest (Adler32 or MD5) of the original file to that of the decompressed file. Using the --no-test option suppresses this verification, saving execution time (not recommended).

genozip --no-test myfile.bam

Multi-threading


By default, genozip attempts to utilize as many cores as are available. To do so, it sets the number of threads to be a bit more than the number of cores (a practice known as over-subscription), since at any given moment some threads might be idle, waiting for a resource to become available. The --threads option allows explicit specification of the number of compute threads to be used (a small number of I/O threads, usually 1 or 2, is used in addition).
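On a shared machine, one might prefer to cap genozip below the number of cores. The sketch below illustrates one way to pick a value for --threads. The helper name pick_threads and the choice to leave two cores free are assumptions made for illustration, not a Genozip recommendation.

```shell
# pick_threads (hypothetical helper): cores minus a reserve, never below 1.
pick_threads() {
    cores=$1
    reserve=$2
    n=$((cores - reserve))
    if [ "$n" -lt 1 ]; then n=1; fi
    printf '%s\n' "$n"
}

# getconf reports the number of online cores on most Linux and macOS systems;
# fall back to 1 if it is unavailable.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
threads=$(pick_threads "$cores" 2)

# Only attempted when genozip is installed and the input file exists.
if command -v genozip >/dev/null 2>&1 && [ -f sample.bam ]; then
    genozip --threads "$threads" sample.bam
fi
```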

Memory (RAM) consumption


In genozip, each compute thread is assigned a segment of the input file, known as a VBlock. By default, the VBlock size is selected based on characteristics of the data; however, it may be set explicitly with --vblock. A larger VBlock usually results in better compression, while a smaller VBlock causes genozip to consume less RAM. The VBlock size can be observed at the top of the --stats report. genozip’s memory consumption is linear with (VBlock-size X number-of-threads).

genocat and genounzip also consume memory linearly with (VBlock-size X number-of-threads), where VBlock-size is the value genozip used when compressing the particular file (it cannot be modified by genocat or genounzip). Usually, genocat and genounzip consume significantly less memory than genozip.
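Because memory consumption is linear with (VBlock-size X number-of-threads), comparing that product across two settings gives a rough sense of how the VBlock-dependent part of memory use changes. The sketch below only does that arithmetic. The helper name and the sample values (VBlock sizes in MB and thread counts) are made up for illustration, and the result is a relative indicator rather than a prediction of actual RAM usage, since the constants involved are not given here.

```shell
# vb_threads_product (hypothetical helper): VBlock size (MB) times thread count.
vb_threads_product() {
    printf '%s\n' $(( $1 * $2 ))
}

before=$(vb_threads_product 64 16)   # illustrative: 64 MB VBlocks, 16 threads
after=$(vb_threads_product 16 8)     # illustrative: 16 MB VBlocks, 8 threads

# Ratio of the two products: how much the part of memory use that scales
# with VBlock size and thread count would shrink.
echo "product before: $before, after: $after, ratio: $((before / after))x"
```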

When a reference file is used, it is loaded into memory too. If multiple genozip / genocat / genounzip processes are running in parallel, only one copy of the reference file is loaded into memory and shared between all processes. Depending on how busy the computer is, the reference data might persist in RAM even between consecutive runs, avoiding the need to load it again from disk. All of this happens behind the scenes.

Use --low-memory to instruct genozip, genounzip or genocat to make tradeoffs in a way that uses less RAM, even at the expense of lesser compression or slower execution.

Questions? support@genozip.com
