File types compressible with Genozip
Genozip is designed to compress the following genomic file formats:
FASTQ, BAM, SAM, CRAM, VCF, BCF, FASTA, GFF3, GVF, Chain files, KRAKEN, PHYLIP, 23andMe, Illumina LOC files.
Note: Files that are not of one of these formats are treated as GENERIC and are compressible as well. Genozip is often better and faster than general-purpose compressors even for GENERIC files. The support for GENERIC files allows compression of entire directories with Genozip, even if they contain some GENERIC files as well.
Simple compression and decompression
Compressing with a reference
To achieve good compression for FASTQ and BAM files, it is highly recommended to compress with a reference. This is also recommended for certain VCF files such as GVCF files.
genozip --reference myfasta.fa.gz mydata.fq.gz
Note: when a particular FASTA file is used as a reference for the first time, genozip produces a .ref.genozip file which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can use it as an argument to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):
genozip --make-reference myfasta.fa.gz
Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to the reference file directly.
Note: If you do not want genounzip to require an external reference file, you may compress with --REFERENCE instead of --reference, which stores the relevant parts of the reference file data within the compressed file. This makes the compressed file larger.
Compressing a pair of FASTQ files
Genozip can take advantage of redundancies between corresponding reads in a pair of FASTQ files to improve compression when they are compressed together, using --pair:
genozip --reference myfasta.fa.gz --pair mysample-R1.fastq.gz mysample-R2.fastq.gz
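When many samples need pairing, the R2 file name can be derived from the R1 name. The following is a minimal sketch, assuming the naming convention of the example above (<sample>-R1.fastq.gz and <sample>-R2.fastq.gz). The helpers mate_of and pair_commands are hypothetical and not part of Genozip, and the commands are only printed rather than run:

```shell
# Derive the R2 file name from an R1 file name.
# Assumes the <sample>-R1.fastq.gz / <sample>-R2.fastq.gz naming convention.
mate_of() {
    case "$1" in
        *-R1.fastq.gz) printf '%s\n' "${1%-R1.fastq.gz}-R2.fastq.gz" ;;
        *) echo "not an R1 file: $1" >&2; return 1 ;;
    esac
}

# Print one genozip --pair command per R1 file given as an argument.
# Files that do not match the naming convention are skipped.
pair_commands() {
    for r1 in "$@"; do
        r2=$(mate_of "$r1") || continue
        echo genozip --reference myfasta.fa.gz --pair "$r1" "$r2"
    done
}

pair_commands mysample-R1.fastq.gz
```

Removing the echo in pair_commands would run the commands instead of printing them.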
Compressing and decompressing multiple files into a tar file (preserving directory structure)
genozip my_data/ --tar my_data.tar --subdirs
tar xvf my_data.tar |& genounzip --files-from - --replace
See also: Archiving
Using Genozip in a pipeline
my-bam-outputting-method | genozip --output mysample.bam.genozip
genocat mysample.bam.genozip | my-sam-inputting-method
Note: when piping data into genozip, genozip attempts to detect the file type from the data. The file type may also be given explicitly with --input, e.g., --input bam
Slicing & dicing your data with genocat
Genozip allows you to slice and dice your data in many ways, using genocat's subsetting options.
genocat --regions ^Y,MT mysample.bam.genozip # Displays all alignments except Y and MT contigs.
genocat --regions chrM GRCh38.fa.genozip # Displays the sequence of chrM.
genocat --samples SMPL1,SMPL2 mysamples.vcf.genozip # Displays 2 samples.
genocat --grep 1101:2392 myreads.fq.genozip # Displays reads with “1101:2392” in the description.
genocat --downsample 10 mysample.fq.genozip # Displays 1 in 10 reads.
Removing the original file after successful compression
genozip --replace myfile.bam
Balancing compression and speed
There is a tradeoff between the execution speed of genozip and the compression it achieves. By default, genozip attempts to strike a good balance. The options --fast and --best tell genozip to skew the balance in either direction:
genozip --best sample.bam # better, but slower, compression
genozip --fast sample.bam # faster, but less efficient compression
genozip --md5 sample.bam
See: Verifying file integrity
genozip --password mysecret sample.bam
genounzip --password mysecret sample.bam.genozip
Suppressing automatic testing
By default, after compressing a file, genozip verifies the compression by decompressing the compressed file in memory and comparing the digest (Adler32 or MD5) of the original file to that of the decompressed file. Using the --no-test option suppresses this verification, saving execution time (not recommended).
genozip --no-test myfile.bam
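The idea behind this verification (compress, decompress to a stream, compare digests) can be illustrated outside of Genozip. The sketch below uses gzip and md5sum purely as stand-ins; it is not how genozip is implemented, and roundtrip_ok is a hypothetical helper. It assumes a system where md5sum is available (e.g., GNU coreutils):

```shell
# Round-trip check: compress, decompress to a stream, compare digests.
# gzip and md5sum are stand-ins; genozip performs its own check internally.
roundtrip_ok() {
    original_digest=$(md5sum < "$1" | cut -d' ' -f1)
    roundtrip_digest=$(gzip -c "$1" | gzip -dc | md5sum | cut -d' ' -f1)
    [ "$original_digest" = "$roundtrip_digest" ]
}

# Try it on a small temporary file.
tmpfile=$(mktemp)
printf 'ACGT\n' > "$tmpfile"
if roundtrip_ok "$tmpfile"; then
    echo "round trip verified"
fi
rm -f "$tmpfile"
```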
By default, genozip attempts to utilize as many cores as are available. To do so, it sets the number of threads to be a bit more than the number of cores (a practice known as over-subscription), because at any given moment some threads might be idle, waiting for a resource to become available. The --threads option allows explicit specification of the number of compute threads to be used (in addition, a small number of I/O threads is used, usually 1 or 2).
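On a shared machine it may be preferable to leave some cores free. The following is a minimal sketch of choosing a value for --threads; threads_for is a hypothetical helper, the reserve of 2 cores is an arbitrary assumption, and the resulting command is printed rather than run:

```shell
# Compute a thread count that leaves some cores free for other jobs.
# Never returns less than 1.
threads_for() {
    total=$1     # number of cores on the machine
    reserve=$2   # cores to leave free
    n=$((total - reserve))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# nproc is from GNU coreutils; fall back to 1 if it is unavailable.
cores=$(nproc 2>/dev/null || echo 1)
echo genozip --threads "$(threads_for "$cores" 2)" sample.bam
```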
Memory (RAM) consumption
In genozip, each compute thread is assigned a segment of the input file, known as a VBlock. By default, the VBlock size is selected based on characteristics of the data; however, it may be set explicitly with --vblock. A larger VBlock usually results in better compression, while a smaller VBlock causes genozip to consume less RAM. The VBlock size can be observed at the top of the --stats report. genozip’s memory consumption is linear with (VBlock-size × number-of-threads).
genocat and genounzip also consume memory linearly with (VBlock-size × number-of-threads), where VBlock-size is the value used by genozip for the particular file (it cannot be modified by genocat or genounzip). Usually, genocat and genounzip consume significantly less memory than genozip.
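The linear relationship above can be used to compare configurations. The sketch below computes a unitless scaling figure only; it does not predict actual RAM usage, since the constant factor depends on the data and the tool and is not given here. relative_footprint is a hypothetical helper:

```shell
# Relative memory footprint: proportional to VBlock size times thread count.
# The result is a scaling figure for comparing settings, not a byte count.
relative_footprint() {
    vblock=$1    # VBlock size, in any consistent unit
    threads=$2   # number of compute threads
    echo $((vblock * threads))
}

# With the same VBlock size, halving the thread count halves the figure.
echo "$(relative_footprint 16 8) vs $(relative_footprint 16 4)"
```

In practice this suggests two levers for reducing memory consumption when compressing: a smaller VBlock (--vblock) or fewer compute threads (--threads).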
When using a reference file, it is loaded to memory too. If multiple genozip / genocat / genounzip processes are running in parallel, only one copy of the reference file is loaded to memory and shared between all processes. Depending on how busy the computer is, the reference file data might persist in RAM even between consecutive runs, avoiding the need to load it again from disk. All of this happens behind the scenes.