Compress files.
genozip can compress any file, but is optimized for the following file types: VCF/BCF, SAM/BAM/CRAM, FASTQ, FASTA, GFF/GVF/GTF, BED, PHYLIP, Chain, Kraken, 23andMe and LOCS.
Usage
genozip [options]… [files or urls]…
One or more file names or URLs may be given; if none are given, standard input is used. A file name of - also means standard input.
Supported input file types, as recognized by their listed filename extension(s):
Type Filename extensions
FASTA fasta, fa, fas, faa, ffn, fnn, fna (possibly .gz .bgz .bz2 .xz)
FASTQ fastq, fq (possibly .gz .bgz .bz2 .xz)
SAM sam (possibly .gz .bgz .bz2 .xz)
BAM bam (possibly also .gz .bgz)
CRAM cram
VCF vcf (possibly .gz .bgz .bz2 .xz)
BCF bcf (possibly also .gz .bgz)
GFF gff3, gff, gvf, gtf (possibly .gz .bgz .bz2 .xz)
BED bed (possibly .gz .bgz .bz2 .xz)
PHYLIP phy (possibly .gz .bgz .bz2 .xz)
Chain chain (possibly .gz .bgz .bz2 .xz)
Kraken kraken (possibly .gz .bgz .bz2 .xz)
23andMe txt (possibly .zip)
LOCS locs (possibly .gz .bgz .bz2 .xz)
Generic any other file (possibly .gz .bgz .bz2 .xz)
Note: compressing .bcf, .cram, .xz or .zip files requires bcftools, samtools, xz or unzip respectively, to be installed.
Examples
genozip sample.bam
genozip sample.R1.fq.gz sample.R2.fq.gz --pair --reference hg19.ref.genozip -o sample.genozip
genozip --optimize --password 12345 ftp://ftp.ncbi.nlm.nih.gov/file2.vcf.gz
Options
-i, --input data-type
data-type is one of the supported file extensions listed in the table above (eg bam vcf.gz fq.xz). See "genozip --help=input" for the full list of accepted file types. This flag should be used when redirecting input data with a < or |, or if the input file type cannot be determined from its file name.
-f, --force
Force overwrite of the output file or force writing the compressed data to standard output.
-^, --replace
Replace the source file with the result file rather than leaving it unchanged.
-D, --subdirs
If a file name on the command line is a directory include all files of that directory (recursively).
-o, --output output-filename
Note: output-filename can also be a directory name, in which case the output file is written to the specified directory. If the name has a ‘/’ suffix (e.g. “-o my-dir/”), then the directory is created if it doesn’t already exist.
-9, --optimize, --optimise
Modify the file in ways that are likely insignificant for analytical purposes but significantly improve compression and
somewhat improve the speed of genocat --regions. This option activates all optimizations.
Note: files compressed with this option are NOT identical to the original file after decompression. For this reason, it is not possible to use this option in combination with --test or --md5.
Note: For the list of optimizations available for each data type, see below.
-b, --best
Best compression.
Note: Running with this option is a bit slower and consumes more memory.
Note: Subsetting files compressed with --best with genocat --regions or --regions-file is slower than usual.
Note: When using --best with SAM/BAM or FASTQ, --reference must be used as well (except for long-read FASTQ files and
long-read, unmapped SAM/BAM files). This requirement can be overridden with --best=NO_REF.
Tip: To avoid running out of memory on a low-resource personal computer, combine with limiting threads using --threads.
-F, --fast
Faster compression but lower compression ratio than normal. Files compressed with this option also decompress faster.
-p, --password password
Password-protected - encrypted with 256-bit AES. See Encryption.
--tar tarfilename.tar
Compress directly into a standard tar file. Each file is compressed independently and written directly into a standard tar file
as it is being formed. See Archiving.
Note: to decompress all files packaged in a tar file use:
tar xvf tarfilename.tar |& genounzip --files-from - --replace
-I, --input-size file-size-in-bytes
genozip configures its internal data structures to optimize execution speed based on the file size. When redirecting the input
file with < or | genozip cannot determine its size and this might result in slower execution. This problem can be overcome by using this option to inform genozip of the file size.
-t, --test
After compressing normally, decompress in memory (i.e. without writing the decompressed file to disk) and compare the
digest of the resulting decompressed file to that of the original file. The digest algorithm used is Adler32; this may be changed to the (slower) MD5 by combining with --md5. This option is set by default. See Verifying file integrity.
Note: Running genozip --test is the same as running genozip followed by genounzip --test.
Note: If the file is compressed (eg with .gz), the digest calculated is of the uncompressed file.
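The note above, that the digest is taken over the uncompressed data, can be illustrated with a small Python sketch. It uses the Adler32 and MD5 implementations from Python's standard library as stand-ins; whether genozip's digests are byte-for-byte identical to these is an assumption, and the helper name digests is hypothetical.

```python
import gzip
import hashlib
import zlib


def digests(uncompressed: bytes) -> tuple[int, str]:
    """Return (Adler32, MD5 hex) of the given bytes.

    Stand-ins for the two digest algorithms named above; not genozip's code.
    """
    return zlib.adler32(uncompressed), hashlib.md5(uncompressed).hexdigest()


# A tiny FASTQ record, and the same record as it would sit inside a .gz file.
original = b"@read1\nACGT\n+\nIIII\n"
gz_bytes = gzip.compress(original)

# The digest is computed on the decompressed content, so it is the same
# whether the input arrived as plain text or gzip-compressed.
from_plain = digests(original)
from_gz = digests(gzip.decompress(gz_bytes))
print(from_plain == from_gz)
```

The digest of the raw .gz bytes would generally differ from both, which is why the uncompressed content is the useful thing to compare.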
-X, --no-test
Disable --test.
-m, --md5
Use MD5 (rather than the default Adler32) to calculate the digest of the file. The MD5 digest is also viewable with genols. See Verifying file integrity.
Note: for compressed files (eg .fq.gz) the MD5 calculated is that of the original uncompressed file. This applies to BAM files too which are usually compressed internally with BGZF.
-q, --quiet
Don't show the progress indicator or warnings, and disable upgrade checks.
-Q, --noisy
The --quiet option is turned on by default when outputting to the terminal. --noisy stops the suppression of warnings.
--no-tip
Don't show a tip after compression.
-@, --threads number
Specify the maximum number of threads. By default genozip allocates 1.1 threads per core in order to maximize usage of all available cores. An exception is on Mac and Windows (including WSL) where the default allocation is 0.75 threads per core to maintain the operating system's UI's feeling of interactivity.
Note: For genounzip and genocat this limit is only approximate. For genozip, it is strictly enforced.
-B, --vblock megabytes
Set the maximum amount of textual input data, in megabytes (between 1 and 2048), that a thread processes at any
given time. By default genozip sets this value dynamically based on the characteristics of the file, and the value used is reported in
--stats. Smaller values will result in faster subsetting with genocat --regions and --grep. Larger values will result in better compression. Note that memory consumption of both genozip and genounzip is linear with the vblock value used for compression. --vblock can also accept exact bytes with the B suffix (for example: 10000000B).
-e, --reference filename
filename may be a FASTA file or a reference file generated from a FASTA file with genozip --make-reference.
The same reference file needs to be provided to genounzip or genocat.
While genozip is capable of compressing without a reference it can utilize a reference file to improve compression of
FASTQ, SAM/BAM and VCF files.
The improvement for FASTQ files is substantial; for SAM/BAM it may be significant, in particular for low coverage files; for VCF it is significant for GVCFs or if REFALT content is a significant percentage of the zip content
(see "% of zip" in --stats).
Note: this is equivalent to setting the environment variable $GENOZIP_REFERENCE to the reference filename.
-E, --REFERENCE filename
Similar to --reference except genozip copies the reference (or part of it) to the output file so there is no need to specify
--reference in genounzip and genocat.
Note: when using with --password: the copy of the reference file stored in the compressed file is never encrypted.
--match-chrom-to-reference
Used in combination with --reference. Contig (Chromosome) names are rewritten to match the names in the
reference file provided. Examples: 22➔chr22 ; chrM➔MT. See Matching contig names to reference.
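As a rough illustration of the two examples above (22➔chr22 and chrM➔MT), the following Python sketch tries a few common spelling variants of a contig name against the names in a reference. The function name match_contig and the exact set of variants tried are assumptions made for this sketch; the actual matching rules are those described in Matching contig names to reference.

```python
def match_contig(name: str, reference_names: set[str]) -> str:
    """Return the reference's spelling of a contig name, or the name unchanged.

    Hypothetical helper: only handles a 'chr' prefix and M/MT aliases.
    """
    if name in reference_names:
        return name

    # Strip an optional "chr" prefix to get the bare contig name.
    bare = name[3:] if name.startswith("chr") else name

    # Candidate spellings: with and without the prefix.
    candidates = [bare, "chr" + bare]

    # The mitochondrial contig is commonly spelled M or MT.
    if bare in ("M", "MT"):
        candidates += ["M", "MT", "chrM", "chrMT"]

    for candidate in candidates:
        if candidate in reference_names:
            return candidate
    return name  # no match: leave the name as it is


print(match_contig("22", {"chr22", "chrM"}))
print(match_contig("chrM", {"22", "MT"}))
```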
-w, --stats
Show the internal structure of a genozip file and the associated compression statistics.
-W, --STATS
Show more detailed statistics.
Note: specifying -W or -w twice results in the header line of the statistics being printed to stderr, so that it survives piping stdout to grep.
--show-filename
Show the file name for each file. Useful when logging the output.
--register
Register (or re-register) a license to use genozip.
--licfile filename
Point to a non-default location of the Genozip license file. See: Using Genozip on an HPC.
-d, --decompress
Same as running genounzip.
-l, --list
Same as running genols.
-T, --files-from filename
An alternative to providing input file names on the command line. filename is a text file containing a newline-separated list of files. If filename is - (a hyphen), the list is read from stdin rather than from a file.
--log filename
Send non-file output to a log file instead of the terminal.
--echo
Output the full command line upon successful or failed completion of execution. Useful if logging output.
--help
Show a link to this page.
--help=attributions
Show attributions.
-L, --license, --licence
Show the license terms and conditions for this product as accepted. Combine with --force to see the version of the license current to the version of Genozip used. If you wish to change your license to the most recent one - make sure your version of Genozip is the latest and re-register with genozip --register.
-V, --version
Display Genozip's version number.
VCF-specific options
--chain chain-file
Lifts a VCF to be a dual-coordinate VCF (DVCF). See: Dual-coordinate VCF files.
--dvcf-rename, --dvcf-drop
Used in combination with --chain to specify annotations that should be renamed or dropped when cross rendering Primary➝Luft or Luft➝Primary. See: Renaming and dropping annotations in a DVCF.
--show-lifts
Used in combination with --chain - output successful lifts to the rejects file too, not only rejected lifts.
See: Dual-coordinate VCF files.
--show-counts=o\$TATUS
Show summary statistics of variant lift outcome. This is set by default when using --chain.
See: Dual-coordinate VCF files.
--show-counts=COORDS
Show summary statistics of variant coordinates.
See: Dual-coordinate VCF files.
--show-chain
Used in combination with --chain - displays all chain file alignments.
--show-rename-tags
Show tags that are to be renamed. Used when compressing a DVCF or in combination with --chain.
See: Renaming and dropping annotations in a DVCF.
--sort
Causes genozip to generate a reconstruction plan that will allow genocat to show the file sorted. This is designed for mildly-unsorted files. If the file is highly unsorted this might result in genocat loading a big portion of the uncompressed file to memory (genocat --unsorted can be used to prevent sorting). This option is always set for dual-coordinates files unless overridden with --unsorted.
--unsorted
Don't generate a reconstruction plan.
--add-line-numbers
Replaces the ID field in each variant with a sequential line number starting from 1.
VCF optimizations. Applying these improves the compression. Note: --optimize (or -9) is a shortcut for combining all optimizations.
--optimize-sort
INFO subfields are sorted alphabetically.
Example: AN=21;AC=3 ➔ AC=3;AN=21
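A minimal Python sketch of the reordering shown in the example above. It assumes subfields are ordered by their name (the part before the = sign) using plain case-sensitive string comparison; the source does not state how genozip treats case or value-less flags, and the function name sort_info is hypothetical.

```python
def sort_info(info: str) -> str:
    """Sort the semicolon-separated INFO subfields by subfield name.

    Assumption: ordering is by the text before '=', case-sensitive.
    """
    subfields = info.split(";")
    return ";".join(sorted(subfields, key=lambda s: s.split("=", 1)[0]))


print(sort_info("AN=21;AC=3"))
```

Only the order of the subfields changes; their names and values are kept as they are, which is why this kind of change is usually harmless for downstream analysis.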
--optimize-phred
Applied to FORMAT/PL, FORMAT/PRI, FORMAT/PP and (VCF v4.2 or earlier) FORMAT/GL: Phred scores are rounded to the nearest integer and capped at 60.
Example: 0.40,17.75,270.4 ➔ 0,18,60
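The rounding and capping in the example above can be sketched in Python as follows. The function name optimize_phred is hypothetical, and two details are assumptions of this sketch rather than documented behavior: exact halves are rounded up, and a missing value (".") is passed through unchanged.

```python
import math


def optimize_phred(values: str, cap: int = 60) -> str:
    """Round each comma-separated Phred score to an integer and cap it."""
    out = []
    for v in values.split(","):
        if v == ".":  # assumption: missing values are left as they are
            out.append(v)
            continue
        rounded = math.floor(float(v) + 0.5)  # assumption: halves round up
        out.append(str(min(rounded, cap)))
    return ",".join(out)


print(optimize_phred("0.40,17.75,270.4"))
```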
--GL-to-PL
The FORMAT/GL field is converted to PL and Phred values are capped at 60.
Example: GL= -7.61618,-0.447624,-0.193264 ➔ PL= 60,4,2
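The example above is consistent with the usual relationship between the two fields: GL is a log10-scaled likelihood and PL is the Phred-scaled equivalent, so PL is approximately -10 × GL rounded to an integer. The Python sketch below applies that relationship and then the cap. The function name gl_to_pl and the treatment of exact halves are assumptions, and no further normalization of the PL values is attempted.

```python
import math


def gl_to_pl(gl: str, cap: int = 60) -> str:
    """Convert comma-separated GL values to capped, integer PL values."""
    pls = []
    for v in gl.split(","):
        phred = math.floor(-10.0 * float(v) + 0.5)  # Phred scale, rounded
        pls.append(str(min(phred, cap)))            # cap as described above
    return ",".join(pls)


print(gl_to_pl("-7.61618,-0.447624,-0.193264"))
```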
--GP-to-PP
Applicable to VCF v4.3 and later: The FORMAT/GP field is converted to PP and Phred values are capped at 60.
Example: GP= -7.61618,-0.447624,-0.193264 ➔ PP= 60,4,2
--optimize-VQSLOD
VQSLOD data: Number is rounded to 2 significant digits.
Example: -4.19494 ➔ -4.2
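Rounding to 2 significant digits, as opposed to 2 decimal places, can be sketched in Python with the general number format. The function name round_sig2 is hypothetical, and this is only an illustration of the rounding rule, not of how genozip formats its output.

```python
def round_sig2(text: str) -> str:
    """Round a number given as text to 2 significant digits."""
    # ".2g" keeps 2 significant digits and drops trailing zeros.
    return format(float(text), ".2g")


print(round_sig2("-4.19494"))
print(round_sig2("0.006351"))
```

Note that this format switches to exponent notation for large magnitudes, so a real implementation would need extra care there.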
SAM/BAM-specific options
-K, --kraken filename
Create a tx:i field, containing the Taxonomy ID of the alignment, based on the Kraken data. filename is a kraken2-generated file (genozipped or not).
See: Filtering BAM or FASTQ reads by species using kraken2.
SAM and BAM optimizations. Applying these improves the compression. Note: --optimize (or -9) is a shortcut for combining all optimizations.
--optimize-QUAL
The QUAL quality field and the secondary U2 quality field (if it exists) are modified to group quality scores into a smaller number of bins:
Old values | New value
2-9 | 6
10-19 | 15
20-24 | 22
25-29 | 27
... | ...
85-89 | 87
90-92 | 91
93 | Unchanged
This assumes a standard Sanger format of Phred quality scores 0➔93 encoded in ASCII 33➔126
Note: this follows Illumina’s quality bins for values up to Phred 39, and extends with additional similar bins for values of 40 and above common in some non-Illumina technologies.
Example: LSVIHINKHK ➔ IIIIFIIIFI
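The mechanics of binning can be sketched in Python as below. The sketch includes only the bins that are listed explicitly in the table; the bins elided by "..." are not filled in, so any score falling in that gap is deliberately passed through unchanged here, which is NOT what genozip does. The names LISTED_BINS, bin_phred and bin_qual are hypothetical.

```python
# (low, high, new value) for the bins listed explicitly in the table above.
# The elided bins are intentionally missing from this sketch.
LISTED_BINS = [
    (2, 9, 6),
    (10, 19, 15),
    (20, 24, 22),
    (25, 29, 27),
    (85, 89, 87),
    (90, 92, 91),
]


def bin_phred(q: int) -> int:
    """Map one Phred score to its bin value, if its bin is listed."""
    for low, high, new in LISTED_BINS:
        if low <= q <= high:
            return new
    return q  # 0, 1, 93, and (in this sketch only) the elided ranges


def bin_qual(qual: str) -> str:
    """Bin a quality string encoded as Phred+33 (ASCII 33 to 126)."""
    return "".join(chr(bin_phred(ord(c) - 33) + 33) for c in qual)


# Phred 2, 10, 20, 25 map to 6, 15, 22, 27 respectively.
print(bin_qual("#+5:"))
```

Because many distinct scores collapse to one value per bin, the quality string becomes far more repetitive, which is what makes it compress better.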
--optimize-ZM
ZM:B:s data: negative Ion Torrent flow signal values are changed to zero and positives are rounded to the nearest 10.
Example: -20,212,427 ➔ 0,210,430
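A short Python sketch of the rule in the example above. The function name optimize_zm is hypothetical, and rounding values that end in exactly 5 upwards is an assumption of this sketch.

```python
def optimize_zm(values: str) -> str:
    """Zero out negative flow signals and round the rest to the nearest 10."""
    out = []
    for v in values.split(","):
        n = int(v)
        if n < 0:
            out.append("0")
        else:
            out.append(str((n + 5) // 10 * 10))  # assumption: ...5 rounds up
    return ",".join(out)


print(optimize_zm("-20,212,427"))
```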
FASTQ-specific options
-2, --pair
Compress pairs of paired-end FASTQ files, resulting in compression ratios better than compressing the files individually. When using this option, every two consecutive files on the file list should be paired-end FASTQ files with an identical number of reads and consistent file names, and --reference or --REFERENCE must be specified. To display the genozip file interleaved, use genocat. To uncompress the genozip file back to its original FASTQ files, use genounzip.
--multiseq
Inform genozip that the sequences are somewhat similar to each other (e.g. multiple sequences of the same virus). genozip uses this information to improve the compression.
-K, --kraken filename
Incorporate the Taxonomy ID of each read into the file. For use with genocat --taxid. filename is a kraken2-generated file (genozipped or not).
See: Filtering BAM or FASTQ reads by species using kraken2.
FASTQ optimizations. Applying these improves the compression. Note: --optimize (or -9) is a shortcut for combining all optimizations.
--optimize-DESC
Replaces the description line with @filename:read_number. Also - if the 3rd line (the '+' line) contains a copy of the description it is shortened to just '+'.
Example: @A00488:61:HMLGNDSXX:4:1101:1561:1000 2:N:0:CTGAAGCT+ATAGAGGC ➔ @sample.100 (100 is the read sequential number within this FASTQ file)
--optimize-QUAL
The quality data is optimized as described for SAM/BAM above.
FASTA-specific options
--make-reference
Convert a FASTA file to be used as a reference in --reference or --REFERENCE.
Example: genozip --make-reference hs37d5.fa.gz
Example: cat *.fa | genozip --input fasta --make-reference - --output myref.ref.genozip
--multiseq
Inform genozip that the sequences are somewhat similar to each other (e.g. multiple sequences of the same virus). genozip uses this information to improve the compression.
-K, --kraken filename
Incorporate the Taxonomy ID of each read into the file. For use with genocat --taxid. filename is a kraken2-generated file (genozipped or not).
See: Filtering BAM or FASTQ reads by species using kraken2.
GFF/GVF/GTF-specific options
GFF/GVF/GTF optimizations. Applying these improves the compression. Note: --optimize (or -9) is a shortcut for combining all optimizations.
--optimize-sort
Attributes are sorted alphabetically.
Example: Notes=hi;ID=rs12 ➔ ID=rs12;Notes=hi
--optimize-Vf
Variant_freq data: Number is rounded to 2 significant digits.
Example: 0.006351 ➔ 0.0064
KRAKEN-specific options
--no-kmers
Drop SEQLEN and KMER fields, which are not required for subsequent use of the kraken.genozip with
genozip --kraken or genocat --kraken. This reduces the compression time and the .kraken.genozip file size by 60%-90%.