Compressing FASTQ files

How good is Genozip at compressing FASTQ files?

Relative sizes of .fastq files, before and after compression with genozip.

Measured with Genozip 15.0.4. --reference was used for all files, --pair was used for the first dataset (third bar from the left), and --best was not used. Datasets can be found here.

Compressing a FASTQ using a reference file

While Genozip is capable of compressing with or without a reference file, it is always a good idea to use one if possible, as demonstrated by this chart:

$ genozip --reference hs37d5.fa.gz myfile-R1.fq.gz

genozip myfile-R1.fq.gz : Done (5 minutes 13 seconds, FASTQ compression ratio: 24.7 - better than .fastq.gz by a factor of 5.1)

testing: genounzip myfile-R1.fq.genozip : verified as identical to the original FASTQ

$ ls -lh myfile-R1.fq.*

-rw-------+ 1 divon divon 6.0G Sep 16 19:01 myfile-R1.fq.genozip

-rw-------+ 1 divon divon 31G May 3 23:28 myfile-R1.fq.gz
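The "factor" reported in the log above is, in essence, the ratio between the sizes of the two files. A minimal sketch for computing it from the files on disk, assuming a POSIX shell with wc and awk available; size_factor is a hypothetical helper name, not a Genozip command:

```shell
# size_factor is a hypothetical helper, not part of Genozip.
# Usage: size_factor ORIGINAL COMPRESSED
# Prints how many times smaller COMPRESSED is than ORIGINAL.
size_factor() {
    orig_bytes=$(( $(wc -c < "$1") ))
    comp_bytes=$(( $(wc -c < "$2") ))
    awk -v a="$orig_bytes" -v b="$comp_bytes" 'BEGIN { printf "%.1f\n", a / b }'
}
```

For example: size_factor myfile-R1.fq.gz myfile-R1.fq.genozip. A value computed by hand from the rounded sizes that ls -lh prints (31G and 6.0G) can differ slightly from the figure in the log.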

Note: when a particular FASTA file is used as a reference for the first time, genozip produces a .ref.genozip file which is placed in the same directory as the FASTA. If you already have a .ref.genozip file - you can use it as an argument to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):

$ genozip --make-reference hs37d5.fa.gz

or alternatively:

$ genozip --make-reference ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly, or set $GENOZIP_REFERENCE to point to either the file or the directory containing the file.
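As a sketch of the environment-variable route described in the note above: the directory and the file name hs37d5.ref.genozip below are assumptions for illustration, so substitute the actual location of your reference file.

```shell
# Hypothetical directory that holds the reference file; adjust to your setup.
export GENOZIP_REFERENCE="$HOME/genozip-refs"

# Alternatively, point at the reference file itself (file name assumed):
# export GENOZIP_REFERENCE="$HOME/genozip-refs/hs37d5.ref.genozip"

# With the variable set, decompression needs no --reference argument:
# genounzip myfile-R1.fq.genozip
```

Placing the export line in a shell profile makes the setting available to every later genounzip or genocat run in that environment.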

Note: If you do not want genounzip to require an external reference file, you may compress with --REFERENCE instead of --reference, which would store the relevant parts of the reference file data within the compressed file. This would obviously cause the compressed file to be larger.

Note: If the species has multiple versions of its reference genome FASTA, any one of them will work. For example, for human data, one option would be to use hs37d5.fa.gz.

Note: If there is no reference genome for the target species, a reference genome of a closely related species may still work better than no reference at all.

Note: For meta-genomic applications, it is possible to use a reference FASTA that contains sequences from multiple species, up to a total of 4 Gbp.

Relative sizes of a .fastq file generated with Illumina Novaseq WGS 30x coverage. Showing the benefit of using --reference or --REFERENCE

PAIR: Co-compressing™ a pair of FASTQ files

The left bar shows sizes of each .fastq.genozip file when compressed separately, relative to
the combined size of the .fastq.gz files, and the right bar shows the relative size of the
.fastq.genozip file when co-compressed together using --pair.

Genozip can take advantage of redundancies between corresponding reads in a pair of FASTQ files to improve compression when they are compressed together, using --pair. Typically, this results in shrinking the compressed file by 10-15%.

$ genozip --pair --reference hs37d5.fa.gz myfile-R1.fq.gz myfile-R2.fq.gz

genozip myfile-R1.fq.gz : Done (5 minutes 4 seconds)
genozip myfile-R2.fq.gz : Done (6 minutes 25 seconds, FASTQ compression ratio: 22.2 - better than .fastq.gz by a factor of 4.9)

$ ls -lh myfile-R*

-rw-------+ 1 divon divon 31G May 3 23:28 myfile-R1.fq.gz
-rw-------+ 1 divon divon 35G May 3 23:31 myfile-R2.fq.gz

-rw-------+ 1 divon divon 14G Sep 16 19:17 myfile-R1+2.fq.genozip

Uncompressing both files to their original file names:

$ genounzip myfile-R1+2.fq.genozip

Accessing R1 and R2 data separately:

$ genocat myfile-R1+2.fq.genozip --R1

$ genocat myfile-R1+2.fq.genozip --R2

Note: --pair works on paired-end files only, which are defined as two FASTQ files containing the same read names in the same order (possibly with a mate indicator such as /1 /2).
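The definition in the note above can be checked on a sample before compressing. The following is a minimal sketch, not Genozip's own detection logic: it assumes uncompressed FASTQ with exactly four lines per read, and read_names and looks_paired are hypothetical helper names. It compares whole name lists in memory, so it is best run on a sample (for instance, the first few thousand reads of each file).

```shell
# Hypothetical helpers, not part of Genozip.

# Print the read name of every record (line 1 of each 4-line record),
# keeping only the first whitespace-delimited word and dropping a trailing /1 or /2.
read_names() {
    awk 'NR % 4 == 1 { name = $1; sub(/\/[12]$/, "", name); print name }' "$1"
}

# Succeeds (exit status 0) if both files list the same read names in the same order.
looks_paired() {
    [ "$(read_names "$1")" = "$(read_names "$2")" ]
}
```

Usage: looks_paired sample-R1.fq sample-R2.fq && echo "paired-end". A sample can be produced with, for example, gzip -dc myfile-R1.fq.gz | head -n 4000 > sample-R1.fq.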

DEEP: Co-compressing™ BAM and FASTQ files (Genozip Deep™)

Genozip supports co-compressing™ a BAM file (or SAM or CRAM) along with all the FASTQs that contributed reads to the BAM. The result of this compression is a single file with the extension .deep.genozip. Uncompressing the deep file with genounzip outputs all the original files (losslessly, obviously!). This method results in very substantial savings vs compressing the FASTQs and the BAM separately:

The datasets can be found here.

Note: The Deep method requires inclusion of all FASTQs that contributed reads to the BAM.

Note: A lot can change in the data between a FASTQ file and the corresponding BAM file: sequences may have been trimmed, BQSR may have been applied to quality scores, read names may have changed, reads may have been filtered out, or duplicates may have been collapsed to consensus sequences, to name a few. Deep will still work if such changes have occurred, albeit a bit less efficiently.

Note: The Deep method typically consumes 20%-50% less CPU compared to compressing the BAM and FASTQ files separately. However, it consumes significant RAM - the exact amount of RAM required varies, but as a rule of thumb it is about as much as the size of the BAM file. The RAM consumption is mostly during uncompressing.
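The rule of thumb above (RAM needed is roughly the size of the BAM file) can be turned into a quick pre-flight check. This is only a sketch of that rule, since the exact amount varies; deep_ram_check is a hypothetical helper name, not a Genozip command.

```shell
# deep_ram_check is a hypothetical helper, not part of Genozip.
# Usage: deep_ram_check BAM_FILE AVAILABLE_RAM_BYTES
# Prints "ok" if the BAM size (the rule-of-thumb RAM estimate) fits, else "tight".
deep_ram_check() {
    bam_bytes=$(( $(wc -c < "$1") ))
    if [ "$bam_bytes" -le "$2" ]; then
        echo "ok"
    else
        echo "tight"
    fi
}
```

On Linux, the available RAM in bytes can be read with awk '/MemAvailable/ { printf "%.0f\n", $2 * 1024 }' /proc/meminfo and passed as the second argument.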

Note: If running --deep with an even number of FASTQ files (and a BAM, obviously), Genozip tries to determine whether these are paired-end files by pairing the files based on their file names and the description line of the first read in each file. If Genozip determines that they are paired-end files, it applies the --pair method to further improve compression. However, if it incorrectly determines that they are paired-end while in fact they are not, an error would result. To overcome this, use --not-paired to inform Genozip that these files are not paired.

Example:

$ genozip --deep --reference hs37d5.fa.gz myfile.bam myfile-R1.fq.gz myfile-R2.fq.gz

$ ls -l
-rw-------+ 1 57G May 4 05:23 myfile.bam
-rw-------+ 1 31G May 3 23:28 myfile-R1.fq.gz
-rw-------+ 1 35G May 3 23:31 myfile-R2.fq.gz

-rw-------+ 1 16G Jun 21 19:27 myfile.deep.genozip

Reducing RAM consumption: Uncompression of a .deep.genozip file is very RAM hungry. Usually, most of the RAM consumption is due to the matching of the QUAL data between the BAM and FASTQ files. Getting rid of the QUAL redundancy between BAM and FASTQ also usually accounts for the majority of benefit of Deep. Using Deep, but without matching QUAL, drastically reduces the RAM consumption, with the resulting file size being somewhere in between compressing the FASTQ and BAM separately, and full Deep.

Example:

$ genozip --deep=no-qual --reference hs37d5.fa.gz myfile.bam myfile-R1.fq.gz myfile-R2.fq.gz

BAMASS: BAM-assisted compression of FASTQ files

A FASTQ file can be compressed about twice as fast (in terms of CPU time) and up to 15% better by allowing Genozip to inspect the alignments in the corresponding BAM file. The BAM file is not required for uncompressing.

Note: if both BAM and FASTQ are to be compressed, a still better option is to use --deep to co-compress them together.

The effect of using the --bamass option on CPU consumption: the blue bars show the CPU consumption of compressing FASTQ files with Genozip (taken as 100%), and the orange bars show the relative CPU consumption when compressing the same files with Genozip using the --bamass option, showing a reduction of 40-60% in CPU time.

Not shown here: In addition to reduction in CPU time, the compression was better as well: the compressed file size was smaller by 14% in the case of Illumina WGS and 4% in the case of MGI WGS.

Background: When Genozip compresses a FASTQ file, it uses its internal Approximate Aligner™ to find a region in the reference genome that is similar enough to the read at hand, so that the coordinates in the reference genome, plus a record of the discrepancy between that region of the reference and the actual read, can be used to describe the read more parsimoniously than the read itself, and hence achieve compression. Unlike regular FASTQ-to-BAM aligners, which try to find the true location on the DNA molecule from which the read originated, Genozip's Approximate Aligner™ only seeks some location on the reference which is similar enough to the read. This relaxed objective makes the Approximate Aligner™ extremely fast. However, its work still accounts for about half the CPU usage when compressing a FASTQ file.

Genozip uses the patent-pending BAM-Assisted Compression of FASTQ method to leverage the alignment information contained in the BAM file to reduce the need to use the Approximate Aligner™, typically slashing compression CPU time by 40-60%. In addition, in some cases, this may also improve the compression ratio by up to 15% (the higher end of the range was observed in Illumina FASTQs with binned quality scores as well as in Ultima Genomics data).

Example:

$ genozip myfile.R1.fastq.gz myfile.R2.fastq.gz --bamass myfile.bam --reference hs37d5.fa.gz

$ ls -l

-rw-rw-r--+ 1 72G Nov 19 11:38 myfile.bam

-rw-rw-r--+ 1 40G Nov 19 07:27 myfile.R1.fastq.gz

-rw-rw-r--+ 1 45G Nov 19 07:31 myfile.R2.fastq.gz

-rw-rw-r--+ 1 9G Nov 19 20:25 myfile.R1.fastq.genozip

-rw-rw-r--+ 1 11G Nov 19 20:28 myfile.R2.fastq.genozip

It is also possible to combine --bamass with --pair to further improve the compression ratio.

OPTIMIZE : even better compression, with some caveats, with --optimize

While Genozip is primarily a lossless compressor, the --optimize option is provided to allow lossy compression as well. The idea of lossy compression is this: it is common for data to contain information at a higher resolution than we actually need for downstream analysis. Examples include the resolution of base quality scores and details contained in read names. If we could reduce the resolution, we could gain significantly better compression. The compression-enhancing modifications to FASTQ data which genozip performs when --optimize is specified are described below. They are designed to have negligible impact on downstream analysis in many common cases; however, you should definitely validate this for your own specific use case.

1. Modifications of the FASTQ Description line: Genozip replaces the QNAME part of the description line with @filename.read_number. If there is a second QNAME on the line (common in data downloaded from NCBI), it is removed. Other data on the Description line (barcodes, metadata etc.), as well as "/1" and "/2" indicating mates, are left intact.

2. Sequence line: No changes are made to the sequence line.

3. "+" line: This line is reduced to being just a "+" character.

4. The Quality scores line: The base quality scores ∈[0,93] (which appear in FASTQ as the ASCII characters '!' through '~') are binned, loosely following Illumina's method:

Old values    New value
0-2           unchanged
3-9           6
10-19         15
20-24         22
25-29         27
...
85-89         87
90-92         91
93            93

Note: Quality score binning is not applied if the quality scores are already binned to 8 or fewer values.
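To make the binning concrete, here is a minimal awk sketch that applies only the low-score rows of the table to the quality line of each read. It is an illustration under stated assumptions, not Genozip's implementation: bin_qual is a hypothetical helper name; the second bin is taken to be 10-19 (consistent with Example 2 below, where '3', a score of 18, becomes '0', a score of 15); scores of 30 and above, including the rows elided in the table, are passed through unchanged, so the output differs from --optimize for those scores; and the "already binned" exception from the note above is not implemented.

```shell
# bin_qual is a hypothetical helper, not part of Genozip.
# Usage: bin_qual < in.fq   (uncompressed FASTQ, 4 lines per read, Phred+33)
bin_qual() {
    awk '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i - 33 }
    function bin(q) {
        if (q >= 3 && q <= 9) return 6
        if (q >= 10 && q <= 19) return 15
        if (q >= 20 && q <= 24) return 22
        if (q >= 25 && q <= 29) return 27
        return q   # 0-2, and every score from 30 up: left unchanged in this sketch
    }
    NR % 4 == 0 {  # the 4th line of each record holds the quality scores
        out = ""
        for (i = 1; i <= length($0); i++)
            out = out sprintf("%c", bin(ord[substr($0, i, 1)]) + 33)
        print out
        next
    }
    { print }
    '
}
```

For instance, the quality string #(39> (scores 2, 7, 18, 24, 29) comes out as #'07< (scores 2, 6, 15, 22, 27).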

Example 1: an MGI read from a file called test-mgi.fq.gz (obtained here):

Before:

@S200032449L1C001R00100000094/1
TCTATTTCTCCTTTCATTTCTATCGGCTTTTGTCTCATGTATTTCGATGCTCTGTTGTTAGGTGCATACACACTTAAGATTGTTATGTGTCTTTGGAGCA
+
@FDE@@DFDFF@@@F?DD@F@E@FFFFDDDDFDFDF?@F@?@DDFFE@DFDFDFD@FD@?FF@FF?D8F?F?F@DEEB?CD6@D?DFDF@FD@DFF1CFE

After:

@test-mgi.0/1
TCTATTTCTCCTTTCATTTCTATCGGCTTTTGTCTCATGTATTTCGATGCTCTGTTGTTAGGTGCATACACACTTAAGATTGTTATGTGTCTTTGGAGCA
+
BFFFBBFFFFFBBBFBFFBFBFBFFFFFFFFFFFFFBBFBBBFFFFFBFFFFFFFBFFBBFFBFFBF7FBFBFBFFFBBBF7BFBFFFFBFFBFFF0BFF

Example 2: An NCBI-style formatting of an Illumina read from a file called test-ncbi.fq.gz (obtained here):

Before:

@SRR1067582.1.1 D1Y24ACXX130315:6:2102:7429:73241 length=100
TTTGGGTTAGGGTTTGGGTTAGGGTTCGGGTTAGGGTTTGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGTTAGGGTTAGGGTGGTTGT
+SRR1067582.1.1 D1Y24ACXX130315:6:2102:7429:73241 length=100
<@@DFFAADFHGFFDHIIHIGHIICEFHGI8?;DHI;DGIII;CFHFICCCEEH;@BDCE6;6=?B;=5<CAAA(999>?@583>?#############@

After:

@test-ncbi.0 length=100
TTTGGGTTAGGGTTTGGGTTAGGGTTCGGGTTAGGGTTTGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGTTAGGGTTAGGGTGGTTGT
+
<BBFFFBBFFFFFFFFKKFKFFKKBFFFFK7B<FFK<FFKKK<BFFFKBBBFFF<BBFBF7<7<BB<<7<BBBB'777<BB770<B#############B

Note: when combining --deep and --optimize, only the Quality scores are optimized, while the Description and "+" line are not.

Note: genounzip and genozip --test verify that the data uncompresses to precisely the same data as it was after the modifications. Note that this does not test the correctness of the modifications themselves.

Note: To optimize only QUAL data, but not the description line, use --optimize=QUAL. Conversely, to optimize only the description line, but not QUAL, use --optimize=^QUAL.

Note: To see which fields are actually optimized use --stats (you can use this with genozip, genocat or genounzip):

$ genozip test.fq.gz --optimize --stats

FASTQ file: test.fq.gz
reads: 100,000 Contexts: 15 Vblocks: 6 x 4.0 MB Sections: 70
Read name style: Genozip-opt@
Sequencer: MGI_Tech
Fields optimized: QNAME,QUAL
Genozip version: 15.0.60 github
Date compressed: 17/06/2024 23:02:12 Jerusalem Summer Time

Genozip compression without and with --optimize, showing file sizes in MBs. The file tested is a FASTQ file produced by an MGI Tech sequencer, obtained from here.


Uncompressing

Uncompress a file (note: if file was created with --pair this results in two output files):

$ genounzip myfile.fq.genozip

Uncompress a file into stdout (i.e. the terminal):

$ genocat myfile.fq.genozip

Uncompress a file and also generate a FAI index file, using samtools faidx. samtools needs to be installed for this option to work:

$ genounzip --index myfile.fq.genozip

Uncompress to a particular name. Whether or not the name has a .gz extension determines whether the output file is BGZF-compressed:

$ genounzip --output newname.fq.gz myfile.fq.genozip

Set the level of BGZF compression (for .fq.gz output format) - from 0 (no compression) to 12 (best yet slowest compression). With --bgzf=exact, genounzip attempts to recover the BGZF compression level of the original file.

$ genocat --bgzf 6 myfile.fq.genozip

$ genounzip --bgzf 6 myfile.fq.genozip

Uncompressing to a pipe:

$ genocat myfile.fq.genozip | my-pipeline

In case the file was generated with --pair, genocat will output the data in interleaved format - the first read of R1, followed by the first read of R2, followed by the second read of R1 and so on. Many tools have a command line option for accepting a pair of FASTQ files in interleaved format, for example:

bwa mem -p

bowtie2 --interleaved

fastp --interleaved_in

$ genocat myfile.R1+2.fq.genozip | my-pipeline # interleaved paired-end
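For readers unfamiliar with the interleaved layout, the sketch below builds the same read order (R1 read, then its R2 mate, and so on) from two plain FASTQ files. It only illustrates the format described above and is not how genocat produces it; interleave_fastq is a hypothetical helper name, and the inputs are assumed to be uncompressed, with four lines per read and mates in the same order.

```shell
# interleave_fastq is a hypothetical helper, not part of Genozip.
# Usage: interleave_fastq R1.fq R2.fq
interleave_fastq() {
    awk -v r2="$2" '
    { print }                       # copy every R1 line through
    NR % 4 == 0 {                   # after each complete R1 record...
        for (i = 0; i < 4; i++)     # ...emit the matching 4-line R2 record
            if ((getline line < r2) > 0) print line
    }' "$1"
}
```

The output can be piped into any tool that accepts interleaved input, in the same way as the genocat command above.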

If it is desired to output only one of the paired files at a time, use the --R1 or --R2 genocat options.

Slicing & dicing your data with genocat

Genozip allows you to slice and dice your data in many ways, using genocat's subsetting options.

When subsetting a file generated with --pair, use the --interleaved=both option (this is the default) to show the pair of reads only if both reads survived the filter, and --interleaved=either to show the pair of reads if either of them survived the filter.

Here's a summary of the filtering and subsetting options available for FASTQ files. See genocat for more information.

Option          Effect
--downsample    Show only one in every X reads
--grep          Show only reads containing the specified string
--grep-w -g     Like --grep, but match whole words
--lines -n      Show only reads from a given range of line numbers
--head          Show only a certain number of reads from the start of the file
--tail          Show only a certain number of reads from the end of the file
--bases         Filter reads based on the IUPAC nucleotide codes in the sequence data
--no-header     Show only the alignments - exclude the SAM header
--header-only   Show only the description line of each read
--seq-only      Show only the sequence line of each read
--qual-only     Show only the base qualities line of each read

Example: show only the sequence for each read:

$ genocat --seq-only myfile.fq.genozip

Example: show only the first (#0) read in every 10 reads:

$ genocat --downsample 10,0 myfile.fq.genozip
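The selection rule in this example (keep read #0 out of every 10) can be sketched with awk on a plain FASTQ stream. This illustrates the semantics as described above rather than genocat's implementation; downsample_fastq is a hypothetical helper name, and reads are counted from 0 with four lines per read.

```shell
# downsample_fastq is a hypothetical helper, not part of Genozip.
# Usage: downsample_fastq RATE SHARD < in.fq
# Keeps the reads whose 0-based index, taken modulo RATE, equals SHARD.
downsample_fastq() {
    awk -v rate="$1" -v shard="$2" 'int((NR - 1) / 4) % rate == shard'
}
```

With RATE 10 and SHARD 0 this keeps reads 0, 10, 20 and so on; SHARD 3 would keep reads 3, 13, 23 and so on.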

Example: show reads with the string “ACCTTAAT” anywhere in the read (description, sequence or quality lines) - possibly a substring of a longer string:

$ genocat --grep ACCTTAAT myfile.fq.genozip

Example: show only reads in which all characters of the sequence are one of A,C,G,T,N:

$ genocat --bases ACGTN myfile.fq.genozip

Example: show only reads in which NOT all characters of the sequence are one of A,C,G,T,N:

$ genocat --bases ^ACGTN myfile.fq.genozip
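The two --bases examples above differ only in the leading ^. A rough stand-in for the same filter on a plain FASTQ stream is sketched below; it is not genocat's implementation. filter_bases is a hypothetical helper name, and the sketch assumes four lines per read and no tab characters in the description line.

```shell
# filter_bases is a hypothetical helper, not part of Genozip.
# Usage: filter_bases ACGTN < in.fq      keep reads made up only of A,C,G,T,N
#        filter_bases '^ACGTN' < in.fq   keep all the other reads
filter_bases() {
    paste - - - - | awk -F '\t' -v arg="$1" '
    BEGIN {
        negate = (substr(arg, 1, 1) == "^")
        if (negate) arg = substr(arg, 2)
        re = "^[" arg "]*$"
    }
    {
        all_in_set = ($2 ~ re)
        if (all_in_set != negate) print
    }' | tr '\t' '\n'
}
```

paste joins each 4-line record into one tab-separated line so that awk can test the sequence field, and tr restores the original layout afterwards.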

idxstats

Genozip has the ability to calculate approximate per-contig statistics (idxstats) directly from FASTQ data. See idxstats.

$ genocat --idxstats myfile.fq.genozip

Per-contig coverage and depth

Genozip has the ability to calculate approximate per-contig coverage and depth directly from FASTQ data.

See Coverage and Depth.

$ genocat --coverage myfile.fq.genozip

For a full list of options, see the genozip command line reference

Questions? support@genozip.com
