How good is Genozip at compressing FASTQ files?
Relative sizes of .fastq files, before and after compression with genozip.
Tested with Genozip 15.0.4. --reference was used for all files, --pair was used for the first dataset (third bar from the left), and --best was not used. Datasets can be found here.
Compressing a FASTQ using a reference file
While Genozip is capable of compressing with or without a reference file, it is always a good idea to use one if possible, as demonstrated by this chart:
Relative sizes of a .fastq file generated with Illumina Novaseq WGS 30x coverage.
Showing the benefit of using --reference or --REFERENCE
$ genozip --reference hs37d5.fa.gz myfile-R1.fq.gz
genozip myfile-R1.fq.gz : Done (5 minutes 13 seconds, FASTQ compression ratio: 24.7 - better than .fastq.gz by a factor of 5.1)
testing: genounzip myfile-R1.fq.genozip : verified as identical to the original FASTQ
$ ls -lh myfile-R1.fq.*
-rw-------+ 1 divon divon 6.0G Sep 16 19:01 myfile-R1.fq.genozip
-rw-------+ 1 divon divon 31G May 3 23:28 myfile-R1.fq.gz
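The improvement factor reported by genozip can be roughly cross-checked from the two file sizes in the listing above. The sketch below is only an illustration: size_ratio is a hypothetical helper, not part of Genozip, and because ls -lh rounds the sizes, the result only approximates the reported factor of 5.1.

```shell
# Hypothetical helper (not part of Genozip): divide the size before
# compression by the size after it and print the ratio with one decimal.
size_ratio() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", before / after }'
}

# Sizes in GB as shown by ls -lh above: 31G (.fq.gz) and 6.0G (.fq.genozip)
size_ratio 31 6.0    # prints 5.2
```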
Note: when a particular FASTA file is used as a reference for the first time, genozip produces a .ref.genozip file which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can use it as an argument to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):
$ genozip --make-reference hs37d5.fa.gz
Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly.
Note: If you do not want genounzip to require an external reference file, you may compress with --REFERENCE instead of --reference, which stores the relevant parts of the reference file data within the compressed file. This makes the compressed file larger.
Note: If the species has multiple versions of its reference genome FASTA, any one of them will work. For example, for human data, one option would be to use hs37d5.fa.gz.
Note: If there is no reference genome for the target species, a reference genome of a closely related species may still work better than no reference at all.
Note: For meta-genomic applications, it is possible to use a reference FASTA that contains sequences from multiple species, up to a total of 4 Gbp.
Co-compressing a pair of FASTQ files
The left bar shows sizes of each .fastq.genozip file when compressed separately, relative to the combined size of the .fastq.gz files, and the right bar shows the relative size of the .fastq.genozip file when co-compressed together using --pair.
Genozip can take advantage of redundancies between corresponding reads in a pair of FASTQ files to improve compression when they are compressed together, using --pair. Typically, this results in shrinking the compressed file by 10-15%.
$ genozip --pair --reference hs37d5.fa.gz myfile-R1.fq.gz myfile-R2.fq.gz
genozip myfile-R1.fq.gz : Done (5 minutes 4 seconds)
genozip myfile-R2.fq.gz : Done (6 minutes 25 seconds, FASTQ compression ratio: 22.2 - better than .fastq.gz by a factor of 4.9)
$ ls -lh myfile-R*
-rw-------+ 1 divon divon 31G May 3 23:28 myfile-R1.fq.gz
-rw-------+ 1 divon divon 35G May 3 23:31 myfile-R2.fq.gz
-rw-------+ 1 divon divon 14G Sep 16 19:17 myfile-R1+2_001.fq.genozip
Uncompressing both files to their original file names:
genounzip myfile-R1+2.fq.genozip
Accessing R1 and R2 data separately:
genocat myfile-R1+2.fq.genozip --R1
genocat myfile-R1+2.fq.genozip --R2
Co-compressing BAM and FASTQ files (Genozip Deep™)
Genozip supports co-compressing a BAM file (or SAM or CRAM) along with all the FASTQs that contributed reads to the BAM. The result of this compression is a single file with the extension .deep.genozip. Uncompressing the deep file with genounzip outputs all the original files (losslessly, obviously!). This method results in very substantial savings vs compressing the FASTQs and the BAM separately:
The datasets can be found here.
Note: The Deep method requires inclusion of all FASTQs that contributed reads to the BAM.
Note: A lot can change in the data between a FASTQ file and the corresponding BAM file: sequences may have been trimmed, BQSR may have been applied to quality scores, read names may have changed, reads may have been filtered out, or duplicates may have been collapsed to consensus sequences, to name a few. Deep will still work, albeit a bit less efficiently, if such changes have occurred.
Note: The Deep method typically consumes 20%-50% less CPU compared to compressing the BAM and FASTQ files separately. However, it consumes significant RAM - the exact amount of RAM required varies, but as a rule of thumb it is about as much as the size of the BAM file.
$ genozip --deep --reference hs37d5.fa.gz myfile.bam myfile-R1.fq.gz myfile-R2.fq.gz
$ ls -lh myfile*
-rw-------+ 1 57G Feb 7 2020 myfile.bam
-rw-------+ 1 31G May 3 23:28 myfile-R1.fq.gz
-rw-------+ 1 35G May 3 23:31 myfile-R2.fq.gz
-rw-------+ 1 16G Jun 21 19:27 myfile.deep.genozip
The following options modify the file in ways that improve compression. --optimize is an umbrella option that activates all optimization options.
genozip myfile.fq.gz --reference hs37d5.ref.genozip --optimize-DESC
This replaces the description line with @filename.read_number. Also - if the 3rd line (the ‘+’ line) contains a copy of the description it is shortened to just ‘+’.
genozip myfile.fq.gz --reference hs37d5.ref.genozip --optimize-QUAL
The base quality data is optimized as follows:
Old values New value
Note: because optimizations modify the data of the file, automatic verification after compression is disabled, and genounzip --test is also not available.
Uncompress a file (note: if the file was created with --pair, this results in two output files):
genounzip myfile.fq.genozip
Uncompress a file into stdout (i.e. the terminal):
genocat myfile.fq.genozip
Uncompress a file and also generate a FAI index file, using samtools faidx. samtools needs to be installed for this option to work:
genounzip --index myfile.fq.genozip
Uncompress to a particular name. Whether or not the name has a .gz extension determines whether the output file is BGZF-compressed:
genounzip --output newname.fq.gz myfile.fq.genozip
Set the level of BGZF compression (for .fq.gz output format) - from 0 (no compression) to 12 (best yet slowest compression). With --bgzf=exact, genounzip attempts to recover the BGZF compression level of the original file.
genocat --bgzf 6 myfile.fq.genozip
genounzip --bgzf 6 myfile.fq.genozip
Uncompressing to a pipe
genocat myfile.fq.genozip | my-pipeline
In case the file was generated with --pair, genocat will output the data in interleaved format - the first read of R1, followed by the first read of R2, followed by the second read of R1 and so on. Many tools have a command line option for accepting a pair of FASTQ files in interleaved format, for example:
genocat myfile.R1+2.fq.genozip | my-pipeline # interleaved paired-end
If it is desired to output only one of the paired files at a time, use the --R1 or --R2 genocat options.
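To make the interleaved order described above concrete, here is a minimal sketch that builds the same ordering from two plain-text FASTQ files using awk. It is not how genocat works internally. interleave_fastq is a hypothetical helper, and it assumes both files hold the same number of reads, with exactly four lines per read.

```shell
# Hypothetical helper: interleave two uncompressed FASTQ files.
# $1 = R1 file, $2 = R2 file. Assumes 4 lines per read in both files.
interleave_fastq() {
    awk -v r2="$2" '{
        print                              # current line of the R1 read
        if (FNR % 4 == 0) {                # the R1 read is complete
            for (i = 0; i < 4; i++) {      # emit the matching R2 read
                if ((getline line < r2) > 0) print line
            }
        }
    }' "$1"
}
```

For example, interleave_fastq R1.fq R2.fq prints the first read of R1.fq, then the first read of R2.fq, then the second read of R1.fq, and so on (the file names here are placeholders).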
Slicing & dicing your data with genocat
Genozip allows you to slice and dice your data in many ways, using genocat's subsetting options.
When subsetting a file generated with --pair, use the --interleaved=both option (this is the default) to show the pair of reads only if both reads survived the filter, and --interleaved=either to show the pair of reads if either of them survived the filter.
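The difference between the two modes can be illustrated on an interleaved FASTQ stream with a small awk sketch. This only models the semantics described above and is not Genozip's implementation. grep_pairs is a hypothetical helper, and it assumes the filter is a fixed-string match anywhere in a read and that each pair occupies eight consecutive lines.

```shell
# Hypothetical helper: keep read pairs from interleaved FASTQ on stdin.
# $1 = fixed string to search for, $2 = "both" or "either".
grep_pairs() {
    awk -v pat="$1" -v mode="$2" '
        { rec[(NR - 1) % 8] = $0 }         # buffer the 8 lines of a pair
        NR % 8 == 0 {
            hit1 = hit2 = 0
            for (i = 0; i < 4; i++) if (index(rec[i], pat)) hit1 = 1   # R1 read
            for (i = 4; i < 8; i++) if (index(rec[i], pat)) hit2 = 1   # R2 read
            keep = (mode == "either") ? (hit1 || hit2) : (hit1 && hit2)
            if (keep) for (i = 0; i < 8; i++) print rec[i]
        }'
}
```

With mode both, a pair is printed only when the string occurs in both reads. With mode either, one matching read is enough, and the whole pair is still printed so the output stays interleaved.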
Here's a summary of the filtering and subsetting options available for FASTQ files. See genocat for more information.
Option          Short   Description
--downsample            Show only one in every X reads
--grep                  Show only reads containing the specified string
--grep-w        -g      Like --grep, but match whole words
--lines         -n      Show only reads from a given range of line numbers
--head                  Show only a certain number of reads from the start of the file
--tail                  Show only a certain number of reads from the end of the file
--taxid                 Include or exclude reads mapped to a certain taxonomy ID. See kraken.
--bases                 Filter reads based on the IUPAC nucleotide codes in the sequence data
--no-header             Show only the alignments - exclude the SAM header
--header-only           Show only the description line of each read
--seq-only              Show only the sequence line of each read
--qual-only             Show only the base qualities line of each read
Example: show only the sequence for each read:
genocat --seq-only myfile.fq.genozip
Example: show only the first (#0) read in every 10 reads:
genocat --downsample 10,0 myfile.fq.genozip
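As a rough model of which reads this selects, the sketch below applies the same rule to a plain-text FASTQ stream: with reads numbered from 0, keep those whose number modulo the rate equals the chosen offset. downsample_fastq is a hypothetical helper written for illustration. It assumes four lines per read and does not reflect how genocat implements the option.

```shell
# Hypothetical helper: keep one read in every $1 reads from FASTQ on stdin.
# $1 = rate (e.g. 10), $2 = which read within each group to keep (0-based).
downsample_fastq() {
    # int((NR - 1) / 4) is the 0-based read number of the current line
    awk -v rate="$1" -v shard="$2" 'int((NR - 1) / 4) % rate == shard'
}
```

Under these assumptions, downsample_fastq 10 0 keeps reads 0, 10, 20 and so on.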
Example: show reads with the string “ACCTTAAT” anywhere in the read (description, sequence or quality lines) - possibly a substring of a longer string:
genocat --grep ACCTTAAT myfile.fq.genozip
Example: show only reads in which all characters of the sequence are one of A,C,G,T,N:
genocat --bases ACGTN myfile.fq.genozip
Example: show only reads in which NOT all characters of the sequence are one of A,C,G,T,N:
genocat --bases ^ACGTN myfile.fq.genozip
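The two --bases examples above can be modelled on a plain-text FASTQ stream as follows. This is a sketch of the filtering rule only, not Genozip's code. filter_bases is a hypothetical helper that assumes four lines per read, treats the argument as a literal, case-sensitive set of characters, and interprets a leading ^ as negation in the same way as the examples.

```shell
# Hypothetical helper: filter FASTQ on stdin by the characters in the sequence.
# $1 = allowed characters, e.g. ACGTN; prefix with ^ to invert the filter.
filter_bases() {
    awk -v spec="$1" '
        BEGIN {
            invert = (substr(spec, 1, 1) == "^")
            if (invert) spec = substr(spec, 2)
            re = "^[" spec "]*$"           # sequence made only of allowed characters
        }
        { rec[(NR - 1) % 4] = $0 }         # buffer the 4 lines of a read
        NR % 4 == 0 {
            ok = (rec[1] ~ re) ? 1 : 0     # rec[1] is the sequence line
            if (ok != invert) for (i = 0; i < 4; i++) print rec[i]
        }'
}
```

When calling it from a shell, quote the inverted form, as in filter_bases '^ACGTN', so that the shell passes the caret through unchanged.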
Genozip has the ability to calculate approximate per-contig statistics (idxstats) directly from FASTQ data. See idxstats.
genocat --idxstats myfile.fq.genozip
Per-contig coverage and depth
Genozip has the ability to calculate approximate per-contig coverage and depth directly from FASTQ data.
See Coverage and Depth.
genocat --coverage myfile.fq.genozip
Genozip has the ability to estimate the sample's sex directly from FASTQ data. Not suitable for clinical applications. See Sex Classification.
genocat --sex myfile.fq.genozip
For a full list of options, see the genozip command line reference