Compressing FASTQ

How good is Genozip at compressing FASTQ files?

Relative sizes of .fastq files, before and after compression with genozip.

Tested with Genozip 15.0.4. --reference was used for all files, --pair was used for the first dataset (third bar from the left), and --best was not used. Datasets can be found here.

Compressing a FASTQ using a reference file

While Genozip is capable of compressing with or without a reference file, it is always a good idea to use one if possible, as demonstrated by this chart:

Relative sizes of a .fastq file generated with Illumina Novaseq WGS 30x coverage.

Showing the benefit of using --reference or --REFERENCE

$ genozip --reference hs37d5.fa.gz myfile-R1.fq.gz

genozip myfile-R1.fq.gz : Done (5 minutes 13 seconds, FASTQ compression ratio: 24.7 - better than .fastq.gz by a factor of 5.1)

testing: genounzip myfile-R1.fq.genozip : verified as identical to the original FASTQ

$ ls -lh myfile-R1.fq.*

-rw-------+ 1 divon divon 6.0G Sep 16 19:01 myfile-R1.fq.genozip

-rw-------+ 1 divon divon 31G May 3 23:28 myfile-R1.fq.gz

Note: when a particular FASTA file is used as a reference for the first time, genozip produces a .ref.genozip file which is placed in the same directory as the FASTA. If you already have a .ref.genozip file - you can use it as an argument to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):

$ genozip --make-reference hs37d5.fa.gz

Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly, or set $GENOZIP_REFERENCE to point to either the file or the directory containing the file.
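As a minimal sketch of the environment-variable route described in the note above: the directory $HOME/refs is a hypothetical location chosen only for illustration, and the compressed file name is taken from the earlier example. The guard simply skips decompression when genounzip or the file is not present.

```shell
# Hypothetical directory that holds the reference file (an assumption for this sketch).
GENOZIP_REFERENCE="$HOME/refs"
export GENOZIP_REFERENCE

# Decompress only if genounzip is installed and the compressed file exists.
if command -v genounzip >/dev/null 2>&1 && [ -f myfile-R1.fq.genozip ]; then
    genounzip myfile-R1.fq.genozip
fi
```

Putting the export line in a shell startup file would make the setting apply to every session, which may be convenient when many files share one reference.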

Note: If you do not want genounzip to require an external reference file, you may compress with --REFERENCE instead of --reference, which would store the relevant parts of the reference file data within the compressed file. This would obviously cause the compressed file to be larger.

Note: If the species has multiple versions of its reference genome FASTA, any one of them will work. For example, for human data, one option would be to use hs37d5.fa.gz.

Note: If there is no reference genome for the target species, a reference genome of a closely related species may still work better than no reference at all.

Note: For meta-genomic applications, it is possible to use a reference FASTA that contains sequences from multiple species, up to a total of 4 Gbp.
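Before building a multi-species reference, it may help to check the combined FASTA against the size limit mentioned above. The following is a rough sketch that does not use Genozip at all: it counts sequence characters with awk. The function names are made up for this example, and reading "4 Gbp" as 4,000,000,000 bases is an assumption.

```shell
# Count sequence bases in a FASTA file (plain or gzip-compressed), skipping header lines.
count_bases() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { printf "%.0f\n", n }'
}

# Succeed if the total is within the limit.
# Default limit: 4000000000 bases (an assumed reading of "4 Gbp").
within_limit() {
    total=$(count_bases "$1")
    limit=${2:-4000000000}
    [ "$total" -le "$limit" ]
}
```

For example, `within_limit combined.fa.gz && genozip --make-reference combined.fa.gz` would only build the reference when the check passes (combined.fa.gz is a hypothetical file name).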

Co-compressing a pair of FASTQ files

The left bar shows sizes of each .fastq.genozip file when compressed separately, relative to the combined size of the .fastq.gz files, and the right bar shows the relative size of the .fastq.genozip file when co-compressed together using --pair.

Genozip can take advantage of redundancies between corresponding reads in a pair of FASTQ files to improve compression when they are compressed together, using --pair. Typically, this results in shrinking the compressed file by 10-15%.

$ genozip --pair --reference hs37d5.fa.gz myfile-R1.fq.gz myfile-R2.fq.gz

genozip myfile-R1.fq.gz : Done (5 minutes 4 seconds)
genozip myfile-R2.fq.gz : Done (6 minutes 25 seconds, FASTQ compression ratio: 22.2 - better than .fastq.gz by a factor of 4.9)

$ ls -lh myfile-R*

-rw-------+ 1 divon divon 31G May 3 23:28 myfile-R1.fq.gz
-rw-------+ 1 divon divon 35G May 3 23:31 myfile-R2.fq.gz

-rw-------+ 1 divon divon 14G Sep 16 19:17 myfile-R1+2.fq.genozip

Uncompressing both files to their original file names:

genounzip myfile-R1+2.fq.genozip

Accessing R1 and R2 data separately:

genocat myfile-R1+2.fq.genozip --R1

genocat myfile-R1+2.fq.genozip --R2

Co-compressing BAM and FASTQ files (Genozip Deep™)

Genozip supports co-compressing a BAM file (or SAM or CRAM) along with all the FASTQs that contributed reads to the BAM. The result of this compression is a single file with the extension .deep.genozip. Uncompressing the deep file with genounzip outputs all the original files (losslessly, obviously!). This method results in very substantial savings vs compressing the FASTQs and the BAM separately:

The datasets can be found here.

Note: The Deep method requires inclusion of all FASTQs that contributed reads to the BAM.

Note: A lot can change in the data between a FASTQ file and the corresponding BAM file: sequences may have been trimmed, BQSR may have been applied to quality scores, read names may have changed, reads may have been filtered out, or duplicates may have been collapsed to consensus sequences, to name a few. Deep will still work if such changes have occurred, albeit a bit less efficiently.

Note: The Deep method typically consumes 20%-50% less CPU compared to compressing the BAM and FASTQ files separately. However, it consumes significant RAM - the exact amount of RAM required varies, but as a rule of thumb it is about as much as the size of the BAM file.

Example:

$ genozip --deep --reference hs37d5.fa.gz myfile.bam myfile-R1.fq.gz myfile-R2.fq.gz

$ ls -lh myfile*
-rw-------+ 1 57G Feb 7 2020 myfile.bam
-rw-------+ 1 31G May 3 23:28 myfile-R1.fq.gz
-rw-------+ 1 35G May 3 23:31 myfile-R2.fq.gz

-rw-------+ 1 16G Jun 21 19:27 myfile.deep.genozip
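The RAM rule of thumb noted above (roughly as much RAM as the size of the BAM file) can be turned into a quick pre-flight check. This is only a sketch: the function names are invented for this example, the memory lookup reads /proc/meminfo and so assumes Linux, and the actual requirement varies from file to file.

```shell
# Succeed if available memory (bytes) is at least the BAM size (bytes).
# This encodes only the rule of thumb; real requirements vary.
deep_ram_ok() {
    bam_bytes=$1
    avail_bytes=$2
    [ "$avail_bytes" -ge "$bam_bytes" ]
}

# Size of a file in bytes.
file_bytes() {
    wc -c < "$1" | tr -d ' '
}

# Available memory in bytes on Linux (/proc/meminfo reports kB).
avail_mem_bytes() {
    awk '/^MemAvailable:/ { printf "%.0f\n", $2 * 1024 }' /proc/meminfo
}
```

A possible use is `deep_ram_ok "$(file_bytes myfile.bam)" "$(avail_mem_bytes)" || echo "RAM may be insufficient for --deep"`, run before starting the compression.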

Compression optimizations

The following options modify the file in ways that improve compression. --optimize is an umbrella option that activates all optimization options.

--optimize-DESC

genozip myfile.fq.gz --reference hs37d5.ref.genozip --optimize-DESC

This replaces the description line with @filename.read_number. Also - if the 3rd line (the ‘+’ line) contains a copy of the description it is shortened to just ‘+’.
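To make the effect concrete, the sketch below mimics the described transformation on a plain-text FASTQ stream using awk. It is an illustration, not Genozip's implementation: the function name is invented, the name embedded in the description is whatever the caller passes in, and numbering reads from 1 is an assumption.

```shell
# Illustration only: mimic the described effect on a plain-text FASTQ stream.
# $1 = name to embed in the description (assumed form); reads are numbered from 1 (assumption).
simulate_optimize_desc() {
    awk -v name="$1" '
        # Description line: remember the original text, print the replacement.
        NR % 4 == 1 { desc = substr($0, 2); n++; print "@" name "." n; next }
        # Third line: shorten to "+" only if it repeats the original description.
        NR % 4 == 3 { if ($0 == "+" desc) print "+"; else print; next }
        # Sequence and quality lines pass through unchanged.
        { print }
    '
}
```

For instance, `gzip -dc myfile.fq.gz | simulate_optimize_desc myfile | head -4` would show what the first read might look like after the change.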

--optimize-QUAL

genozip myfile.fq.gz --reference hs37d5.ref.genozip --optimize-QUAL

The base quality data is optimized as follows:

Old values    New value
0-2           unchanged
3-9           6
10-19         15
20-24         22
25-29         27
...
85-89         87
90-92         91
93            93
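The mapping can be sketched as a small shell function operating on numeric quality scores (not on the ASCII characters stored in the FASTQ file). It reads the third row as covering 10 through 19, and it assumes the elided rows continue the same 5-wide pattern seen in the rows shown (each bin mapped to its middle value). Both points are assumptions of this sketch, and the function name is invented.

```shell
# Map a numeric base quality score (0-93) to its binned value.
# Assumption: scores 20-89 fall into 5-wide bins mapped to the middle value,
# extrapolated from the rows shown in the table.
bin_qual() {
    q=$1
    if   [ "$q" -le 2 ];  then echo "$q"
    elif [ "$q" -le 9 ];  then echo 6
    elif [ "$q" -le 19 ]; then echo 15
    elif [ "$q" -le 89 ]; then echo $(( q / 5 * 5 + 2 ))
    elif [ "$q" -le 92 ]; then echo 91
    else                       echo 93
    fi
}
```

Reducing the number of distinct quality values in this way is what makes the quality line more compressible, at the cost of no longer being able to recover the original scores.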

Note: because optimizations modify the data of the file, automatic verification after compression is disabled, and genounzip --test is also not available.

Uncompressing

Uncompress a file (note: if the file was created with --pair, this results in two output files):

genounzip myfile.fq.genozip

Uncompress a file into stdout (i.e. the terminal):

genocat myfile.fq.genozip

Uncompress a file and also generate a FAI index file, using samtools faidx. samtools needs to be installed for this option to work:

genounzip --index myfile.fq.genozip

Uncompress to a particular name. Whether or not the name has a .gz extension determines whether the output file is BGZF-compressed:

genounzip --output newname.fq.gz myfile.fq.genozip

Set the level of BGZF compression (for .fq.gz output format) - from 0 (no compression) to 12 (best yet slowest compression). With --bgzf=exact, genounzip attempts to recover the BGZF compression level of the original file.

genocat --bgzf 6 myfile.fq.genozip

genounzip --bgzf 6 myfile.fq.genozip

Uncompressing to a pipe

genocat myfile.fq.genozip | my-pipeline

In case the file was generated with --pair, genocat will output the data in interleaved format - the first read of R1, followed by the first read of R2, followed by the second read of R1 and so on. Many tools have a command line option for accepting a pair of FASTQ files in interleaved format, for example:

bwa mem -p

bowtie2 --interleaved

fastp --interleaved_in

genocat myfile-R1+2.fq.genozip | my-pipeline # interleaved paired-end

If it is desired to output only one of the paired files at a time, use the --R1 or --R2 genocat options.
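The interleaved order described above can be illustrated without Genozip. The sketch below builds an interleaved stream from two plain-text FASTQ files and splits such a stream back into one mate, which is conceptually what --R1 and --R2 select. The function names are invented for this example, and both files are assumed to hold the same number of 4-line records.

```shell
# Interleave two FASTQ files read by read: R1 read 1, R2 read 1, R1 read 2, ...
# Assumes both files hold the same number of 4-line records.
interleave_fastq() {
    awk -v r2="$2" '
        { print }                       # pass through each line of the first file
        NR % 4 == 0 {                   # after each complete record of the first file...
            for (i = 0; i < 4; i++)     # ...emit one record from the second file
                if ((getline line < r2) > 0) print line
        }
    ' "$1"
}

# From an interleaved stream on stdin, keep only mate 1 or mate 2 ($1 = 1 or 2).
deinterleave_fastq() {
    awk -v want="$1" 'int((NR - 1) / 4) % 2 == want - 1'
}
```

A tool that cannot read interleaved input could be fed with `genocat myfile-R1+2.fq.genozip | deinterleave_fastq 1`, although using --R1 directly avoids the extra step.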

Slicing & dicing your data with genocat

Genozip allows you to slice and dice your data in many ways, using genocat's subsetting options.

When subsetting a file generated with --pair, use the --interleaved=both option (this is the default) to show the pair of reads only if both reads survived the filter, and --interleaved=either to show the pair of reads if either of them survived the filter.

Here's a summary of the filtering and subsetting options available for FASTQ files. See genocat for more information.

Option           Effect
--downsample     Show only one in every X reads
--grep           Show only reads containing the specified string
--grep-w -g      Like --grep, but match whole words
--lines -n       Show only reads from a given range of line numbers
--head           Show only a certain number of reads from the start of the file
--tail           Show only a certain number of reads from the end of the file
--taxid          Include or exclude reads mapped to a certain taxonomy ID. See kraken.
--bases          Filter reads based on the IUPAC nucleotide codes in the sequence data
--no-header      Show only the alignments - exclude the SAM header
--header-only    Show only the description line of each read
--seq-only       Show only the sequence line of each read
--qual-only      Show only the base qualities line of each read

Example: show only the sequence for each read:

genocat --seq-only myfile.fq.genozip

Example: show only the first (#0) read in every 10 reads:

genocat --downsample 10,0 myfile.fq.genozip
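To clarify what "read #0 in every 10 reads" selects, here is an equivalent filter written in awk for a plain-text FASTQ stream. It is a sketch of the selection rule as described in the example above, not of how genocat implements it, and the function name is invented.

```shell
# Keep read number k (0-based) out of every x reads of a plain-text FASTQ stream.
# $1 = x (group size), $2 = k (which read within each group to keep).
downsample_fastq() {
    awk -v x="$1" -v k="$2" 'int((NR - 1) / 4) % x == k + 0'
}
```

With x=10 and k=0 this keeps reads 0, 10, 20 and so on, counting from zero, so the output is roughly one tenth of the input.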

Example: show reads with the string “ACCTTAAT” anywhere in the read (description, sequence or quality lines) - possibly a substring of a longer string:

genocat --grep ACCTTAAT myfile.fq.genozip

Example: show only reads in which all characters of the sequence are one of A,C,G,T,N:

genocat --bases ACGTN myfile.fq.genozip

Example: show only reads in which NOT all characters of the sequence are one of A,C,G,T,N:

genocat --bases ^ACGTN myfile.fq.genozip
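The two --bases examples above amount to a per-read test on the sequence line. The following awk sketch applies the same test to a plain-text FASTQ stream, which may be handy for checking expectations on a small sample. The function name and its "invert" argument are invented for this example, and the sketch makes no claim about how genocat performs the filtering.

```shell
# Keep reads whose sequence consists only of the allowed characters.
# $1 = allowed characters (e.g. ACGTN); $2 = "invert" to keep the other reads instead.
filter_bases() {
    awk -v allowed="$1" -v mode="${2:-keep}" '
        { rec[(NR - 1) % 4] = $0 }                       # collect the 4 lines of a record
        NR % 4 == 0 {
            ok = (rec[1] ~ ("^[" allowed "]*$"))         # rec[1] is the sequence line
            if ((mode == "invert") ? !ok : ok)
                print rec[0] "\n" rec[1] "\n" rec[2] "\n" rec[3]
        }
    '
}
```

Here `filter_bases ACGTN` corresponds to the first example and `filter_bases ACGTN invert` to the second.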

idxstats

Genozip has the ability to calculate approximate per-contig statistics (idxstats) directly from FASTQ data. See idxstats.

genocat --idxstats myfile.fq.genozip

Per-contig coverage and depth

Genozip has the ability to calculate approximate per-contig coverage and depth directly from FASTQ data.

See Coverage and Depth.

genocat --coverage myfile.fq.genozip

Sex classification

Genozip has the ability to estimate the sample's sex directly from FASTQ data. Not suitable for clinical applications. See Sex Classification.

genocat --sex myfile.fq.genozip

For a full list of options, see the genozip command line reference.

Questions? support@genozip.com
