How good is Genozip at compressing FASTQ files?
Relative sizes of .fastq files, before and after compression with genozip.
Used Genozip 15.0.4. --reference was used for compressing all files, --pair was used for the first dataset (third bar from the left), and --best was not used. Datasets can be found here.
​
​
Compressing a FASTQ using a reference file
​
While Genozip is capable of compressing with or without a reference file, it is always a good idea to use one if possible, as demonstrated by this chart:
Relative sizes of a .fastq file generated with Illumina Novaseq WGS 30x coverage.
Showing the benefit of using --reference or --REFERENCE
​
​
$ genozip --reference hs37d5.fa.gz myfile-R1.fq.gz
genozip myfile-R1.fq.gz : Done (5 minutes 13 seconds, FASTQ compression ratio: 24.7 - better than .fastq.gz by a factor of 5.1)
testing: genounzip myfile-R1.fq.genozip : verified as identical to the original FASTQ
​
$ ls -lh myfile-R1.fq.*
-rw-------+ 1 divon divon 6.0G Sep 16 19:01 myfile-R1.fq.genozip
-rw-------+ 1 divon divon 31G May 3 23:28 myfile-R1.fq.gz
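As a rough sanity check, the sizes printed by ls above can be turned back into the ratios reported in the genozip log. This is only illustrative arithmetic on the rounded figures shown (6.0G, 31G and a ratio of 24.7); the variable names are made up for this sketch.

```python
# Sizes as printed by `ls -lh` above (rounded, in GB).
genozip_gb = 6.0
gz_gb = 31.0

# Compression ratio reported in the genozip log above.
reported_ratio = 24.7

# How many times smaller the .genozip file is than the .fq.gz file.
gain_vs_gz = gz_gb / genozip_gb

# Uncompressed FASTQ size implied by the reported ratio.
implied_fastq_gb = reported_ratio * genozip_gb

print(f"gain vs .fq.gz: {gain_vs_gz:.2f}x")
print(f"implied uncompressed FASTQ: ~{implied_fastq_gb:.0f} GB")
```

The gain computed this way comes out slightly above 5.1, which is consistent with the reported factor given that ls rounds the file sizes.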
Note: when a particular FASTA file is used as a reference for the first time, genozip produces a .ref.genozip file which is placed in the same directory as the FASTA. If you already have a .ref.genozip file - you can use it as an argument to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):
​​
$ genozip --make-reference hs37d5.fa.gz
​​
Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly, or set $GENOZIP_REFERENCE to point to either the file or the directory containing the file.
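When genounzip is driven from a script, the two alternatives above might be wrapped as in the following minimal sketch. The helper name, its parameters and the /data/refs directory are hypothetical; only genounzip, --reference and $GENOZIP_REFERENCE come from the text above. The helper only builds the command and environment and does not run anything.

```python
import os

def build_genounzip_call(genozip_file, reference=None, reference_env=None):
    """Return (argv, env) for a genounzip run (hypothetical helper)."""
    argv = ["genounzip"]
    if reference is not None:
        # Point genounzip at the reference directly.
        argv += ["--reference", reference]
    argv.append(genozip_file)

    env = dict(os.environ)
    if reference_env is not None:
        # Either the reference file itself or the directory containing it.
        env["GENOZIP_REFERENCE"] = reference_env
    return argv, env

# /data/refs is a made-up directory used only for illustration.
argv, env = build_genounzip_call("myfile-R1.fq.genozip", reference_env="/data/refs")
print(argv)
```

The returned pair can then be passed to subprocess.run(argv, env=env, check=True) on a machine where genounzip is installed.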
​​
Note: If you do not want genounzip to require an external reference file, you may compress with --REFERENCE instead of --reference, which would store the relevant parts of the reference file data within the compressed file. This would obviously cause the compressed file to be larger.
​
Note: If the species has multiple versions of its reference genome FASTA, any one of them will work. For example, for human data, one option would be to use hs37d5.fa.gz.
​
Note: If there is no reference genome for the target species, a reference genome of a closely related species may still work better than no reference at all.
Note: For meta-genomic applications, it is possible to use a reference FASTA that contains sequences from multiple species, up to a total of 4 Gbp.​
​
Co-compressing a pair of FASTQ files
The left bar shows sizes of each .fastq.genozip file when compressed separately, relative to the combined size of the .fastq.gz files, and the right bar shows the relative size of the .fastq.genozip file when co-compressed together using --pair.
​​
​​
Genozip can take advantage of redundancies between corresponding reads in a pair of FASTQ files to improve compression when they are compressed together, using --pair. Typically, this results in shrinking the compressed file by 10-15%.
​​
$ genozip --pair --reference hs37d5.fa.gz myfile-R1.fq.gz myfile-R2.fq.gz
genozip myfile-R1.fq.gz : Done (5 minutes 4 seconds)
genozip myfile-R2.fq.gz : Done (6 minutes 25 seconds, FASTQ compression ratio: 22.2 - better than .fastq.gz by a factor of 4.9)
​
$ ls -lh myfile-R*
-rw-------+ 1 divon divon 31G May 3 23:28 myfile-R1.fq.gz
-rw-------+ 1 divon divon 35G May 3 23:31 myfile-R2.fq.gz
-rw-------+ 1 divon divon 14G Sep 16 19:17 myfile-R1+2_001.fq.genozip
​
Uncompressing both files to their original file names:
​
genounzip myfile-R1+2.fq.genozip
​
Accessing R1 and R2 data separately:
​
genocat myfile-R1+2.fq.genozip --R1
genocat myfile-R1+2.fq.genozip --R2
​​
Note: --pair works on paired-end files only, which are defined as two FASTQ files containing the same read names in the same order (possibly with a mate indicator such as /1 /2).
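To check up front whether two FASTQ files meet this definition, something like the following sketch could be used. It is an illustration of the definition only, not Genozip's own detection logic; the function names are made up, and it assumes well-formed four-line FASTQ records already read into lists of lines.

```python
import re

def strip_mate(name):
    """Drop a trailing /1 or /2 mate indicator from a read name."""
    return re.sub(r"/[12]$", "", name)

def read_names(fastq_lines):
    """Yield the read name from every description line (each 4th line)."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:
            # First word of the description line, without the leading '@'.
            yield line.rstrip("\n").split()[0][1:]

def looks_paired(r1_lines, r2_lines):
    """True if both files hold the same read names in the same order."""
    names1 = [strip_mate(n) for n in read_names(r1_lines)]
    names2 = [strip_mate(n) for n in read_names(r2_lines)]
    return names1 == names2

r1 = ["@read1/1", "ACGT", "+", "IIII", "@read2/1", "TTTT", "+", "IIII"]
r2 = ["@read1/2", "ACGA", "+", "IIII", "@read2/2", "TTTA", "+", "IIII"]
print(looks_paired(r1, r2))
```

For real files the lists would be replaced by lines streamed from the two files; holding all names in memory as done here is only reasonable for small inputs.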
​
Co-compressing BAM and FASTQ files (Genozip Deep™)
​
Genozip supports co-compressing a BAM file (or SAM or CRAM) along with all the FASTQs that contributed reads to the BAM. The result of this compression is a single file with the extension .deep.genozip. Uncompressing the deep file with genounzip outputs all the original files (losslessly, obviously!). This method results in very substantial savings vs compressing the FASTQs and the BAM separately:
The datasets can be found here.
​
Note: The Deep method requires inclusion of all FASTQs that contributed reads to the BAM.
​
Note: There is a lot that can change in the data between a FASTQ file and the corresponding BAM file: sequences may have been trimmed, BQSR may have been applied to quality scores, read names may have changed, reads may have been filtered out, or duplicates may have been collapsed to consensus sequences, to name a few. Deep will nevertheless still work, albeit a bit less efficiently, if such changes have occurred.
​
Note: The Deep method typically consumes 20%-50% less CPU compared to compressing the BAM and FASTQ files separately. However, it consumes significant RAM - the exact amount of RAM required varies, but as a rule of thumb it is about as much as the size of the BAM file.
​
Note: If running --deep with an even number of FASTQ files (and a BAM, obviously), Genozip tries to determine whether these are paired-end files by pairing the files based on their file names and the description line of the first read in each file. If Genozip determines that they are paired-end files, it applies the --pair method to further improve compression. However, if it incorrectly determines that they are paired-end while in fact they are not, an error would result. To overcome this, use --not-paired to inform Genozip that these files are not paired.
​​​
Example:
​​
$ genozip --deep --reference hs37d5.fa.gz myfile.bam myfile-R1.fq.gz myfile-R2.fq.gz
​​
$ ls -lh myfile*
-rw-------+ 1 57G Feb 7 2020 myfile.bam
-rw-------+ 1 31G May 3 23:28 myfile-R1.fq.gz
-rw-------+ 1 35G May 3 23:31 myfile-R2.fq.gz
-rw-------+ 1 16G Jun 21 19:27 myfile.deep.genozip
​
--optimize : even better compression, with some caveats.
​
While Genozip is primarily a lossless compressor, the --optimize option is provided to allow lossy compression as well. The idea of lossy compression is this: it is common for data to contain information in a higher resolution than we actually need for downstream analysis. Examples include the resolution of base quality scores and details contained in read names. If we could reduce the resolution, we could gain significantly better compression. The compression-enhancing modifications to FASTQ data which genozip performs when --optimize is specified are described below. They are designed to have negligible impact on downstream analysis in many common cases; however, you should definitely validate this for your own specific use case.
​
1. Modifications of the FASTQ Description line: Genozip replaces the QNAME part of the description line with @filename.read_number. If there is a second QNAME on the line (common in data downloaded from NCBI), it is removed. Other data on the Description line (barcodes, metadata etc.), as well as "/1" and "/2" indicating mates, are left intact.
​
2. Sequence line: No changes are made to the sequence line.
​
3. "+" line: This line is reduced to being just a "+" character.
​
4. The Quality scores line: The base quality scores ∈[0,93] (which appear in FASTQ as the ASCII characters '!' through '~') are binned, loosely following Illumina's method:
​
Old values New value
0-2 unchanged
3-9 6
10-19 15
20-24 22
25-29 27
...
85-89 87
90-92 91
93 93
​
Note: Quality score binning is not applied if the quality scores are already binned to 8 or fewer values.
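The binning table can be expressed as a small lookup, sketched below. This is not Genozip's implementation: it covers only the rows the table lists explicitly and deliberately does not guess the rows hidden behind "...". It takes the row that maps to 15 as covering scores 10 through 19, an assumption chosen because the MGI example below turns '1' (score 16) into '0' (score 15). All names are made up for the sketch.

```python
# Rows listed explicitly in the table: (lowest, highest, new value).
# Assumption: the row mapping to 15 is taken to cover scores 10-19.
LISTED_BINS = [
    (3, 9, 6),
    (10, 19, 15),
    (20, 24, 22),
    (25, 29, 27),
    (85, 89, 87),
    (90, 92, 91),
]

def bin_score(q):
    """Return the binned score, or None for rows the table elides."""
    if not 0 <= q <= 93:
        raise ValueError(f"quality score out of range: {q}")
    if q <= 2 or q == 93:
        return q  # listed as unchanged
    for low, high, new in LISTED_BINS:
        if low <= q <= high:
            return new
    return None  # 30-84: not listed in the table, so not guessed here

def bin_char(c):
    """Bin one FASTQ quality character (Phred+33 encoding)."""
    new = bin_score(ord(c) - 33)
    # Characters in the elided rows are left untouched by this sketch.
    return c if new is None else chr(new + 33)

print("".join(bin_char(c) for c in "#(18<"))
```

Because the elided rows are passed through unchanged, running this over a full quality string will not reproduce Genozip's output for scores between 30 and 84.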
​
Example 1: an MGI read from a file called test-mgi.fq.gz (obtained here):
​
Before:
​
@S200032449L1C001R00100000094/1
TCTATTTCTCCTTTCATTTCTATCGGCTTTTGTCTCATGTATTTCGATGCTCTGTTGTTAGGTGCATACACACTTAAGATTGTTATGTGTCTTTGGAGCA
+
@FDE@@DFDFF@@@F?DD@F@E@FFFFDDDDFDFDF?@F@?@DDFFE@DFDFDFD@FD@?FF@FF?D8F?F?F@DEEB?CD6@D?DFDF@FD@DFF1CFE
​
After:
​
@test-mgi.0/1
TCTATTTCTCCTTTCATTTCTATCGGCTTTTGTCTCATGTATTTCGATGCTCTGTTGTTAGGTGCATACACACTTAAGATTGTTATGTGTCTTTGGAGCA
+
BFFFBBFFFFFBBBFBFFBFBFBFFFFFFFFFFFFFBBFBBBFFFFFBFFFFFFFBFFBBFFBFFBF7FBFBFBFFFBBBF7BFBFFFFBFFBFFF0BFF
​
Example 2: An NCBI-style formatting of an Illumina read from a file called test-ncbi.fq.gz (obtained here):
​
Before:
​
@SRR1067582.1.1 D1Y24ACXX130315:6:2102:7429:73241 length=100
TTTGGGTTAGGGTTTGGGTTAGGGTTCGGGTTAGGGTTTGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGTTAGGGTTAGGGTGGTTGT
+SRR1067582.1.1 D1Y24ACXX130315:6:2102:7429:73241 length=100
<@@DFFAADFHGFFDHIIHIGHIICEFHGI8?;DHI;DGIII;CFHFICCCEEH;@BDCE6;6=?B;=5<CAAA(999>?@583>?#############@
​
After:
​
@test-ncbi.0 length=100
TTTGGGTTAGGGTTTGGGTTAGGGTTCGGGTTAGGGTTTGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGTTAGGGTTAGGGTGGTTGT
+
<BBFFFBBFFFFFFFFKKFKFFKKBFFFFK7B<FFK<FFKKK<BFFFKBBBFFF<BBFBF7<7<BB<<7<BBBB'777<BB770<B#############B
​
Note: when combining --deep and --optimize, only the Quality scores are optimized, while the Description and "+" lines are not.
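The Description-line and "+"-line changes shown in the two examples above can be sketched as follows. This is an illustration under stated assumptions rather than Genozip's code: the function names are made up, fields are assumed to be separated by single spaces, and the caller says whether a second QNAME is present (how Genozip detects that is not described here).

```python
import re

def optimize_description(line, basename, read_number, has_second_qname=False):
    """Rewrite a FASTQ description line as @basename.read_number + the rest."""
    fields = line.rstrip("\n").split(" ")
    qname, rest = fields[0], fields[1:]

    # Keep a trailing /1 or /2 mate indicator, if the QNAME has one.
    mate = re.search(r"/[12]$", qname)
    suffix = mate.group(0) if mate else ""

    # Hypothetical switch: drop the second QNAME (NCBI-style lines).
    if has_second_qname and rest:
        rest = rest[1:]

    return " ".join([f"@{basename}.{read_number}{suffix}"] + rest)

def optimize_plus_line(line):
    """Reduce the '+' line to a single '+' character."""
    return "+"

print(optimize_description("@S200032449L1C001R00100000094/1", "test-mgi", 0))
```

Calling it on the NCBI-style line from Example 2 with has_second_qname=True drops the original Illumina read name and keeps the length=100 field.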
​
Note: genounzip and genozip --test verify that the data uncompresses to precisely the same data as it was after the modifications. Note that this does not test the correctness of the modifications themselves.
​
Note: To optimize only QUAL data, but not the description line, use --optimize=QUAL. Conversely, to optimize only the description line, but not QUAL, use --optimize=^QUAL
​
Note: To see which fields are actually optimized use --stats (you can use this with genozip, genocat or genounzip):
​
> genozip test.fq.gz --optimize --stats
FASTQ file: test.fq.gz
reads: 100,000 Contexts: 15 Vblocks: 6 x 4.0 MB Sections: 70
Read name style: Genozip-opt@
Sequencer: MGI_Tech
Fields optimized: QNAME,QUAL
Genozip version: 15.0.60 github
Date compressed: 17/06/2024 23:02:12 Jerusalem Summer Time
​
Genozip compression without and with --optimize, showing file sizes in MBs. The file tested is a FASTQ file produced by an MGI Tech sequencer, obtained from here.
Uncompressing
​
Uncompress a file (note: if file was created with --pair this results in two output files):
​
genounzip myfile.fq.genozip
​
Uncompress a file into stdout (i.e. the terminal):
​
genocat myfile.fq.genozip
​
Uncompress a file and also generate a FAI index file, using samtools faidx. samtools needs to be installed for this option to work:
​
genounzip --index myfile.fq.genozip
​
Uncompress to a particular name. Whether or not the name has a .gz extension determines whether the output file is BGZF-compressed:
​
genounzip --output newname.fq.gz myfile.fq.genozip
​
Set the level of BGZF compression (for .fq.gz output format) - from 0 (no compression) to 12 (best yet slowest compression). With --bgzf=exact, genounzip attempts to recover the BGZF compression level of the original file.
​
genocat --bgzf 6 myfile.fq.genozip
genounzip --bgzf 6 myfile.fq.genozip
​
Uncompressing to a pipe
​
genocat myfile.fq.genozip | my-pipeline
​
In case the file was generated with --pair, genocat will output the data in interleaved format - the first read of R1, followed by the first read of R2, followed by the second read of R1 and so on. Many tools have a command line option for accepting a pair of FASTQ files in interleaved format, for example:
​
genocat myfile-R1+2.fq.genozip | my-pipeline # interleaved paired-end
​
If it is desired to output only one of the paired files at a time, use the --R1 or --R2 genocat options.
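For a downstream tool that does not accept interleaved input, the interleaved order described above (R1 read, then its R2 mate, and so on) can be split back into two streams. The following is a minimal sketch with made-up function names, assuming well-formed four-line FASTQ records.

```python
from itertools import islice

def records(lines):
    """Group an iterable of FASTQ lines into 4-line records."""
    it = iter(lines)
    while True:
        rec = list(islice(it, 4))
        if not rec:
            return
        if len(rec) != 4:
            raise ValueError("truncated FASTQ record")
        yield rec

def deinterleave(lines):
    """Split interleaved FASTQ lines into (r1_records, r2_records)."""
    r1, r2 = [], []
    for i, rec in enumerate(records(lines)):
        # Even-numbered records belong to R1, odd-numbered ones to R2.
        (r1 if i % 2 == 0 else r2).append(rec)
    if len(r1) != len(r2):
        raise ValueError("odd number of records in interleaved input")
    return r1, r2

interleaved = [
    "@a/1", "AC", "+", "II", "@a/2", "GT", "+", "II",
    "@b/1", "AA", "+", "II", "@b/2", "TT", "+", "II",
]
r1, r2 = deinterleave(interleaved)
print([rec[0] for rec in r1], [rec[0] for rec in r2])
```

In a pipeline the input would come from sys.stdin rather than a list, and for large files the records would be written out as they arrive instead of being collected in memory.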
​
Slicing & dicing your data with genocat
​
Genozip allows you to slice and dice your data in many ways, using genocat's subsetting options.
​
When subsetting a file generated with --pair, use the --interleaved=both option (this is the default) to show the pair of reads only if both reads survived the filter, and --interleaved=either to show the pair of reads if either of them survived the filter.
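The difference between the two modes can be illustrated with a short sketch. It only models the semantics described above; the function, the predicate and the sample reads are invented for the illustration.

```python
def filter_pairs(pairs, keep, mode="both"):
    """Keep (r1, r2) pairs according to a per-read predicate.

    mode="both":   keep the pair only if both reads pass the filter.
    mode="either": keep the pair if at least one read passes.
    """
    if mode not in ("both", "either"):
        raise ValueError(f"unknown mode: {mode}")
    combine = all if mode == "both" else any
    return [(r1, r2) for r1, r2 in pairs if combine((keep(r1), keep(r2)))]

# Made-up sequences; the filter keeps reads without any N base.
pairs = [("ACGT", "ACGT"), ("ACGT", "NNNN"), ("NNNN", "NNNN")]
no_n = lambda seq: "N" not in seq

print(filter_pairs(pairs, no_n, mode="both"))
print(filter_pairs(pairs, no_n, mode="either"))
```

In both modes reads are emitted as whole pairs, so the output stays valid interleaved data.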
​
Here's a summary of the filtering and subsetting options available for FASTQ files. See genocat for more information.
​
Option Effect
--downsample Show only one in every X reads
--grep Show only reads containing the specified string
--grep-w -g Like --grep, but match whole words
--lines -n Show only reads from a given range of line numbers
--head Show only a certain number of reads from the start of the file
--tail Show only a certain number of reads from the end of the file
--taxid Include or exclude reads mapped to a certain taxonomy ID. See kraken.
--bases Filter reads based on the IUPAC nucleotide codes in the sequence data
--no-header Show only the alignments - exclude the SAM header
--header-only Show only the description line of each read
--seq-only Show only the sequence line of each read
--qual-only Show only the base qualities line of each read
​
Example: show only the sequence for each read:
​
genocat --seq-only myfile.fq.genozip
​
Example: show only the first (#0) read in every 10 reads:
​
genocat --downsample 10,0 myfile.fq.genozip
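Read this way, the two numbers in the command above select one read out of every 10, namely the one at position 0 within each group. A minimal sketch of that selection rule follows; the parameter names rate and offset are made up, and the sketch models only the behaviour described in the example.

```python
def downsample(reads, rate, offset=0):
    """Keep one read in every `rate`: the one at position `offset` in each group."""
    if rate < 1 or not 0 <= offset < rate:
        raise ValueError("need rate >= 1 and 0 <= offset < rate")
    return [read for i, read in enumerate(reads) if i % rate == offset]

# Reads are represented by their 0-based index for illustration.
print(downsample(range(25), 10, 0))
```

Using a different offset with the same rate selects a disjoint subset of the reads.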
​
Example: show reads with the string “ACCTTAAT” anywhere in the read (description, sequence or quality lines) - possibly a substring of a longer string:
​
genocat --grep ACCTTAAT myfile.fq.genozip
​
Example: show only reads in which all characters of the sequence are one of A,C,G,T,N:
​
genocat --bases ACGTN myfile.fq.genozip
​
Example: show only reads in which NOT all characters of the sequence are one of A,C,G,T,N:
​
genocat --bases ^ACGTN myfile.fq.genozip
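The two --bases examples above amount to a set-membership test on the sequence, with a leading ^ inverting the result. The sketch below models just that rule; the function name is made up, and it treats the listed characters literally and case-sensitively, which is an assumption about details the examples do not cover.

```python
def passes_bases(sequence, spec):
    """Apply a --bases-style filter to one sequence.

    spec "ACGTN":  pass if every character of the sequence is in the set.
    spec "^ACGTN": pass if NOT every character is in the set.
    """
    negate = spec.startswith("^")
    allowed = set(spec[1:] if negate else spec)
    all_allowed = all(base in allowed for base in sequence)
    return not all_allowed if negate else all_allowed

print(passes_bases("ACGTN", "ACGTN"))
print(passes_bases("ACGRT", "^ACGTN"))
```

Every read passes exactly one of the two filters, so the plain and negated forms split a file into two complementary subsets.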
​
idxstats
​
Genozip has the ability to calculate approximate per-contig statistics (idxstats) directly from FASTQ data. See idxstats.
​
genocat --idxstats myfile.fq.genozip
​
Per-contig coverage and depth
​
Genozip has the ability to calculate approximate per-contig coverage and depth directly from FASTQ data.
See Coverage and Depth.
​
genocat --coverage myfile.fq.genozip
​​
For a full list of options, see the genozip command line reference
​
Questions? support@genozip.com