genozip

Compress files.

genozip can compress any file, but is optimally designed to compress the following file types: VCF/BCF, SAM/BAM/CRAM, FASTQ, FASTA, GFF/GVF/GTF, BED, 23andMe and LOCS.

Usage

genozip [options]… [files or urls]…

One or more file names or URLs may be given, or if omitted, standard input is used instead. - means standard input.

Supported input file types, as recognized by their listed filename extension(s):

Type Filename extensions

FASTA fasta, fa, fas, fsa, faa, ffn, fnn, fna (possibly .gz .bgz .bz2 .xz .zip)

FASTQ fastq, fq (possibly .gz .bgz .bz2 .xz .ora)

SAM sam (possibly .gz .bgz .bz2 .xz)

BAM bam (possibly also .gz .bgz)

CRAM cram

VCF vcf (possibly .gz .bgz .bz2 .xz)

BCF bcf (possibly also .gz .bgz)

GFF gff3, gff (possibly .gz .bgz .bz2 .xz)

GVF gvf (possibly .gz .bgz .bz2 .xz)

GTF gtf (possibly .gz .bgz .bz2 .xz)

BED bed (possibly .gz .bgz .bz2 .xz)

TRACK track (possibly .gz .bgz .bz2 .xz)

23andMe txt (possibly .zip)

LOCS locs (possibly .gz .bgz .bz2 .xz)

Generic any other file (possibly .gz .bgz .bz2 .xz .zip)

Note: Compressing .bcf, .cram, .xz, .zip or .ora files requires bcftools, samtools, xz, unzip or orad, respectively, to be installed. These are 3rd-party software packages not associated with Genozip.

Note: Compressing .ora might require the environment variable ORA_REF_PATH to be set to the directory containing the Ora reference file.

Examples

genozip sample.bam

genozip sample.R1.fq.gz sample.R2.fq.gz --pair --reference hg19.ref.genozip -o sample.genozip

genozip --optimize --password 12345 ftp://ftp.ncbi.nlm.nih.gov/file2.vcf.gz

Options

-i, --input data-type

data-type is one of the supported file extensions listed in the table above (eg bam, vcf.gz, fq.xz). See genozip --help=input for the full list of accepted file types. Use this flag when redirecting input data with < or |, or if the input file type cannot be determined from its file name.

-f, --force

Force overwrite of the output file or force writing the compressed data to standard output.

-^, --replace

Replace the source file with the result file rather than leaving it unchanged.

-D, --subdirs

If a file name on the command line is a directory, include all files in that directory (recursively).

-o, --output output-filename

Note: output-filename can also be a directory name, in which case the output file is written to the specified directory. If the name has a ‘/’ suffix (e.g. “-o my-dir/”), then the directory is created if it doesn’t already exist.

-9, --optimize, --optimise[=[^]field,field...]

Modify the file in ways that are usually¹ insignificant for analytical purposes but significantly improve compression. Details of the optimizations can be found here: FASTQ optimizations, SAM/BAM/CRAM optimizations, VCF optimizations.

An optional argument can be used to provide granular optimization instructions. For example:

genozip --optimize=QUAL,rq:f - optimizes only the QUAL and rq:f fields, if they are optimizable

genozip --optimize=^QUAL,rq:f - optimizes all optimizable fields, except QUAL and rq:f fields.

Note¹: you should verify that the effect on your downstream analysis is indeed insignificant for your particular analysis.

Note: genounzip and genozip --test of a file compressed with --optimize verify that the decompressed file is identical to the original file after optimizations were applied, but there is no testing for the correctness of the optimizations.

Note: combine with --stats to see which fields were actually optimized.

-b, --best

Best compression.

Note: Running with this option is a bit slower and consumes more memory.

Note: Subsetting files compressed with --best with genocat --regions or --regions-file is slower than usual.

Note: When using --best with SAM/BAM or FASTQ, --reference must be used as well (except for long-read FASTQ files and long-read, unmapped SAM/BAM files). This can be overridden with --force.

Tip: To avoid running out of memory on a low-resource personal computer, combine with limiting threads using --threads.

-F, --fast

Faster compression but lower compression ratio than normal. Files compressed with this option also decompress faster.

--low-memory

Uses less memory than normal, at the cost of lesser compression.

-p, --password password

Password-protected - encrypted with 256-bit AES. See Encryption.

--tar tarfilename.tar

Compress directly into a standard tar file. Each file is compressed independently and written directly into a standard tar file as it is being formed. See Archiving.

Note: The Linux genozip executables are automatically added to the tar file too (Linux only).

Note: to decompress all files packaged in a tar file use:

tar xvf tarfilename.tar |& genounzip --files-from - --replace

--sendto license-number

Compress the file such that only the user designated by license-number can decompress it. Use this option to compress a file ahead of sending it to another user. More details.

Note: The intended recipient can use genozip --license to find their license number.

Note: Compressing a file with --sendto is free, even for commercial use, while decompressing it requires Genozip Premium.

Note: --sendto is a licensing feature, not a security / access control feature. It is not designed to prevent a sophisticated hacker from accessing the data. To achieve better security, use in combination with --password.

--user-message filename

A message contained in the file filename will be included in the compressed file, and displayed when decompressing the file. The message may contain multiple lines, and can use any alphabet (including Chinese, Japanese, Korean...) in regular text or UTF-8 format. More details.

-I, --input-size file-size-in-bytes

Use this option to inform genozip of the file size (an approximation or educated guess is just fine). This is useful when the input file is redirected, and hence genozip cannot know its size: genozip configures its internal data structures to optimize execution speed based on the file size, and without it, execution might be slower.
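As an illustration of combining --input and --input-size with redirected input, here is a minimal sketch. The file sample.fq and its contents are stand-ins created only for this example, and the final genozip command is printed rather than executed; it assumes the usual pairing of - (standard input) with --output, as in the --make-reference example on this page.

```shell
# Create a tiny stand-in FASTQ file so the sketch is self-contained (hypothetical data).
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$tmpdir/sample.fq"

# Measure the size in bytes; tr strips the padding that some wc builds emit.
size=$(wc -c < "$tmpdir/sample.fq" | tr -d ' ')

# Print the pipeline that would pass the type and size of the redirected input.
echo "cat sample.fq | genozip --input fq --input-size $size --output sample.fq.genozip -"
```

Since an approximation is sufficient, the size could equally come from an estimate when the data is produced by another program rather than read from a file.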

-t, --test

After compressing normally, decompress in memory (i.e. without writing the decompressed file to disk) and compare the digest of the resulting decompressed file to that of the original file. The digest algorithm used is Adler32; this may be changed to (slower) MD5 by combining with --md5. This option is set by default. See Verifying file integrity.

Note: Running genozip --test is the same as running genozip followed by genounzip --test.

Note: If the file is compressed (eg with .gz), the digest calculated is of the uncompressed file.

-X, --no-test

Disable --test.

Note: Testing a file post-compression is a critical component of Genozip's data integrity strategy - this option should not be used unless you plan to verify the file's integrity in another way, such as uncompressing it or running genounzip --test.

-m, --md5

Use MD5 (rather than the default Adler32) to calculate the digest of the file. The MD5 digest is also viewable with genols. See Verifying file integrity.

Note: for compressed files (eg .fq.gz) the MD5 calculated is that of the original uncompressed file. This applies to BAM files too which are usually compressed internally with BGZF.

-q, --quiet

Don't show the progress indicator or warnings, and disable upgrade checks.

-Q, --noisy

The --quiet option is turned on by default when outputting to the terminal. --noisy stops the suppression of warnings.

--no-tip

Don't show a tip after compression.

--no-upgrade

Don't check for a new version of Genozip.

-@, --threads number

Specify the maximum number of threads. By default genozip allocates 1.1 threads per core in order to maximize usage of all available cores. An exception is on Mac and Windows (including WSL), where the default allocation is 0.75 threads per core to keep the operating system's user interface responsive.

Note: For genounzip and genocat this limit is only approximate. For genozip, it is strictly enforced.
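As an illustration of limiting threads on a low-resource machine, the following is a minimal sketch that caps genozip at half the available cores. The halving policy and the file name sample.bam are assumptions made for this example, not Genozip recommendations. The command is printed rather than executed; remove echo to run it.

```shell
# Count the available cores; fall back to getconf, then to 1, if nproc is missing.
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Arbitrary policy for this sketch: use half the cores, but at least one thread.
threads=$(( cores / 2 ))
if [ "$threads" -lt 1 ]; then
  threads=1
fi

# Print the resulting command line.
echo genozip --threads "$threads" sample.bam
```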

-B, --vblock megabytes

Set the maximum size (between 1 and 1024, in megabytes) of the textual input data that a thread processes at any given time. By default, genozip sets this value dynamically based on the characteristics of the file, and reports it in --stats. Smaller values result in faster subsetting with genocat --regions and --grep. Larger values result in better compression and also increase the number of cores genozip is able to utilize. Note that the memory consumption of both genozip and genounzip is linear with the vblock value used for compression. --vblock can also accept an exact number of bytes with the B suffix (for example: 10000000B).

Note: For certain FASTQ files, Genozip might decide to ignore this option and use larger blocks than requested to improve performance. To override this behavior, use --no-bgzf.

-e, --reference filename

filename may be a FASTA file or a reference file generated from a FASTA file with genozip --make-reference. The same reference file needs to be provided to genounzip or genocat.

While Genozip is capable of compressing without a reference, it can utilize a reference file to improve compression of FASTQ, SAM/BAM and VCF files.

The improvement for FASTQ files is substantial; for SAM/BAM it may be significant, in particular for low coverage files; for VCF it is significant in some cases.

Note: this is equivalent to setting the environment variable $GENOZIP_REFERENCE to the reference filename.

-E, --REFERENCE filename

Similar to --reference except genozip copies the reference (or part of it) to the output file so there is no need to specify --reference in genounzip and genocat.

Note: when using with --password: the copy of the reference file stored in the compressed file is never encrypted.

--no-cache

Don't store reference genome data in RAM. Can also be used to delete previously cached genomes. See reference genome caching.

-w, --stats

Show the internal structure of a genozip file and the associated compression statistics.

--print-filename

Show the file name for each file. Useful when logging the output.

--activate

Activates Genozip. Can be used in combination with --licfile (the order of the options is important in this case: --activate must come first).

--licfile filename

Point to a non-default location of the Genozip license file. See: Using Genozip on an HPC.

--truncate

Allow compression of truncated files. Genozip will discard the final defective BGZF / GZIL block and/or the final partial line (read, alignment, variant...), and compress only the data that is intact. The digest is computed only on the data actually compressed.

--no-bgzf

Uncompress the .gz source compression in the main thread, rather than uncompressing in parallel in compute threads. This option is only needed to overcome some rare edge cases (genozip will tell you if this is the case) - normally, this option should not be used (even when compressing non-BGZF files), as it will slow down Genozip.

-d, --decompress

Same as running genounzip.

-l, --list

Same as running genols.

-T, --files-from filename

An alternative to providing input file names on the command line. filename is a textual file containing a newline-separated list of files. If filename is - (a hyphen), the list is read from stdin rather than from a file.
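A minimal sketch of preparing such a list follows. The directory layout and the file names a.vcf, b.vcf and files.txt are hypothetical and are created only so the example is self-contained; the genozip command is printed rather than executed.

```shell
# Create a scratch directory with two stand-in VCF files and one unrelated file.
tmpdir=$(mktemp -d)
touch "$tmpdir/a.vcf" "$tmpdir/b.vcf" "$tmpdir/notes.md"

# Write one file name per line, the newline-separated format --files-from reads.
find "$tmpdir" -name '*.vcf' | sort > "$tmpdir/files.txt"
count=$(wc -l < "$tmpdir/files.txt" | tr -d ' ')

# Print the command that would compress every listed file.
echo "genozip --files-from $tmpdir/files.txt   # would compress $count files"
```

The same list can be piped instead of saved, by replacing the list file name with - so that it is read from stdin.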

--log filename

Send non-file output to a log file instead of the terminal.

--echo

Output the full command line upon successful or failed completion of execution. Useful if logging output.

--help

Show a link to this page.

--help=attributions

Show attributions.

-L, --license, --licence

Show the license terms and conditions for this product as accepted. Combine with --force to see the version of the license current to the version of Genozip used. If you wish to change your license to the most recent one - make sure your version of Genozip is the latest and re-activate with genozip --activate.

-V, --version

Display Genozip's version number.

FASTQ-specific options

-2, --pair

Compress pairs of paired-end FASTQ files, resulting in compression ratios better than compressing the files individually. When using this option, every two consecutive files on the file list should be paired-end FASTQ files with an identical number of reads and consistent file names, and --reference or --REFERENCE must be specified. To display the genozip file interleaved, use genocat. To uncompress the genozip file back to its original FASTQ files, use genounzip.
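The requirement that every two consecutive files form a pair can be met by building the file list programmatically. The sketch below assumes a .R1.fq.gz / .R2.fq.gz naming convention (as in the example near the top of this page) and uses empty stand-in files; the sample names s1 and s2 are hypothetical, and the command is printed rather than executed.

```shell
# Create empty stand-in files for two hypothetical samples.
tmpdir=$(mktemp -d)
touch "$tmpdir/s1.R1.fq.gz" "$tmpdir/s1.R2.fq.gz" \
      "$tmpdir/s2.R1.fq.gz" "$tmpdir/s2.R2.fq.gz"

# Build the argument list so that each R1 file is immediately followed by its R2 mate.
set --
pairs=0
for r1 in "$tmpdir"/*.R1.fq.gz; do
  r2="${r1%.R1.fq.gz}.R2.fq.gz"
  # Skip any R1 file whose mate is missing, so the list stays strictly paired.
  [ -f "$r2" ] || { echo "missing mate for $r1" >&2; continue; }
  set -- "$@" "$r1" "$r2"
  pairs=$((pairs + 1))
done

# Print the command with the interleaved R1/R2 file list.
echo genozip --pair --reference hg19.ref.genozip "$@"
```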

-A, --bamass filename.[bam|sam[.gz]|cram]

BAM-assisted compression of a FASTQ file: By letting Genozip inspect the corresponding BAM file, a FASTQ file can be compressed about twice as fast (in terms of CPU time) and up to 15% better. The BAM file is not required for uncompressing. Requires --reference or --REFERENCE and may be used in combination with --pair.

-3, --deep[=no-qual]

Losslessly co-compresses a SAM/BAM/CRAM file and all the FASTQ files that contributed to it. By leveraging redundancies between BAM and FASTQ data, Genozip typically shrinks the compressed data a further 40% vs compressing the BAM and the FASTQ files separately.

Using the argument =no-qual causes Deep to match read names and sequences between the BAM and FASTQ files, but not quality score data. This drastically reduces the amount of RAM consumed during testing and uncompressing, and results in a file size somewhere in between that of compressing the BAM and FASTQ files separately, and full Deep.

Example:

genozip --deep --reference hs37d.fa.gz mydata.bam mydata.R1.fq.gz mydata.R2.fq.gz

genounzip mydata.deep.genozip # reconstructs the 3 BAM and FASTQ files losslessly

--not-paired

Used in combination with --deep to inform Genozip that the two FASTQ files provided are not paired-end. Absent this option, Genozip guesses whether they are paired-end or not by analyzing their file names.

SAM/BAM-specific options

-3, --deep[=no-qual]

Losslessly co-compresses a SAM/BAM file and all the FASTQ files that contributed to it. By leveraging redundancies between BAM and FASTQ data, Genozip typically shrinks the compressed data a further 40% vs compressing the BAM and the FASTQ files separately.

Example:

genozip --deep --reference hs37d.fa.gz mydata.bam mydata.R1.fq.gz mydata.R2.fq.gz

genounzip mydata.deep.genozip # reconstructs the 3 BAM and FASTQ files losslessly

--no-gencomp

The gencomp method leverages redundancies between supplementary or secondary alignments and the matching primary alignment to improve compression. It can make a significant difference in compression of files rich in secondary or supplementary alignments, however it consumes considerable RAM and compression time. --no-gencomp disables this method.

--force-gencomp

Enable the gencomp method even in cases where it is disabled by default: if the file is unsorted, or if --low-memory or --fast are specified.

VCF-specific options

--add-line-numbers

Replaces the ID field in each variant with a sequential line number starting from 1.

--secure-DP

In some files, subsetting the file using genocat with --drop-genotypes, --GT-only or --samples would normally cause INFO/DP and INFO/QD to show as -1 and BaseCounts to show as '.'. Compressing with --secure-DP avoids this issue, at the expense of a slightly worse compression.

FASTA-specific options

--index

Enables genocat options --regions, --regions-file, --grep and --grep-w. This option could sometimes negatively impact compression, and in particular, --reference is ignored. This option is set automatically for files which contain up to 10,000 sequences which are assembled contigs (i.e. not sequencing reads).

--make-reference

Convert a FASTA file to be used as a reference in --reference or --REFERENCE.

Example: genozip --make-reference hs37d5.fa.gz

Example: cat *.fa | genozip --input fasta --make-reference - --output myref.ref.genozip
