How good is Genozip at compressing FASTQ files?

According to our Benchmarks, for a wide range of analysis types, Genozip compresses significantly better and much faster than other available options. As a commercial product, Genozip comes with excellent technical support, and its ability to compress almost all relevant file formats (FASTQ, BAM, VCF and many more) adds to the convenience. Our customers and academic users routinely use Genozip to archive hundreds of terabytes of data, and integrate it into production pipelines.

For any large-scale deployment, we encourage you to compare and benchmark Genozip against our competitors. For reference, our main competitors in FASTQ compression are Petagene's Petasuite and Illumina's DRAGEN ORA.

Compressing a FASTQ using a reference file 

While Genozip is technically capable of compressing FASTQ files without using a reference, in practice, to achieve good compression ratios, compression should always be done against a reference genome.

$ genozip --reference GRCh38_full_analysis_set_plus_decoy_hla.fa.gz myfile.fq.gz

genozip myfile.fq.gz : Done (3 minutes 0 seconds, FASTQ compression ratio: 24.0 - better than .fastq.gz by a factor of 4.7)

testing: genounzip myfile.fastq.genozip : verified as identical to the original FASTQ

$ ls -lh myfile.fastq.* 
-rw-------+ 1 divon divon 6.2G Jul 21 19:53 myfile.fastq.genozip
-rw-------+ 1 divon divon 29G  Oct 12  2020 myfile.fastq.gz
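
The figures in this example can be cross-checked with a little arithmetic. The sketch below is only a sanity check on the numbers shown above. It assumes the sizes printed by ls -lh are accurate to the digits shown, and that the reported compression ratio of 24.0 is measured against the uncompressed FASTQ; the uncompressed size it prints is an estimate derived from those two assumptions, not a figure reported by Genozip.

```python
# Sizes as shown by `ls -lh` in the example above (rounded, in GB).
gz_size_gb = 29.0        # myfile.fastq.gz
genozip_size_gb = 6.2    # myfile.fastq.genozip

# How many times smaller the .genozip file is than the .gz file.
gain_over_gz = gz_size_gb / genozip_size_gb

# Assumption: the reported ratio (24.0) is relative to the uncompressed FASTQ,
# so the uncompressed size can be estimated from the .genozip size.
reported_ratio = 24.0
estimated_raw_gb = genozip_size_gb * reported_ratio

print(f"gain over .gz: {gain_over_gz:.1f}")
print(f"estimated uncompressed size: {estimated_raw_gb:.0f} GB")
```

The first value agrees with the "factor of 4.7" in the genozip output line.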

 

Note: when a particular FASTA file is used as a reference for the first time, genozip produces a .ref.genozip file, which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can use it as an argument to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):

$ genozip --make-reference hs37d5.fa.gz

Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly.

Note: If you do not want genounzip to require an external reference file, you may compress with --REFERENCE instead of --reference, which stores the relevant parts of the reference file data within the compressed file. This makes the compressed file larger.

Note: If the species has multiple versions of its reference genome FASTA, any one of them will work. For example, for human data, one option would be to use hs37d5.fa.gz.

Note: If there is no reference genome for the target species, a reference genome of a closely related species may still work better than no reference at all. 

Note: For meta-genomic applications, it is possible to use a reference FASTA that contains sequences from multiple species, up to a total of 4 Gbp.

Note: supported input file extensions include .fq .fq.gz .fq.bz2 .fq.xz and also .fastq .fastq.gz .fastq.bz2 .fastq.xz. For FASTQ files with a different extension, use --input fastq to inform Genozip that this is FASTQ data.
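
In a script that queues many files for compression, it can help to decide up front which ones need the --input fastq hint. The following is a minimal sketch that checks a file name against the extensions listed in the note above. The function name needs_input_flag is hypothetical, and the sketch assumes the match is case-sensitive, which the note does not state.

```python
# Extensions listed in the note above as recognized FASTQ inputs.
FASTQ_EXTENSIONS = (
    ".fq", ".fq.gz", ".fq.bz2", ".fq.xz",
    ".fastq", ".fastq.gz", ".fastq.bz2", ".fastq.xz",
)


def needs_input_flag(filename: str) -> bool:
    """Return True if the name does not end in one of the listed extensions."""
    # Assumption: matching is case-sensitive.
    return not filename.endswith(FASTQ_EXTENSIONS)


for name in ["sample.fq.gz", "sample.fastq", "sample.txt.gz"]:
    hint = "add --input fastq" if needs_input_flag(name) else "recognized by extension"
    print(f"{name}: {hint}")
```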

Compressing a pair of FASTQ files

Genozip can take advantage of redundancies between corresponding reads in a pair of FASTQ files to improve compression when they are compressed together, using --pair:

$ genozip --reference hs37d5.fa.gz --pair myfile-R1.fq.gz myfile-R2.fq.gz

genozip myfile-R1.fq.gz : Done (2 seconds)

genozip myfile-R2.fq.gz : Done (8 seconds, FASTQ compression ratio: 20.3 - better than .fq.gz by a factor of 4.3)

$ ls -l myfile*

-rwxrwxrwx 1 divon divon 3624227 Aug 21 12:50 myfile-R1+2.fastq.genozip

-rwxrwxrwx 1 divon divon 7338338 Aug 21 12:02 myfile-R1.fq.gz

-rwxrwxrwx 1 divon divon 8232187 Aug 21 12:02 myfile-R2.fq.gz

Uncompressing both files to their original file names:

genounzip myfile-R1+2.fq.genozip

Accessing R1 and R2 data separately:

genocat myfile-R1+2.fq.genozip --R1

genocat myfile-R1+2.fq.genozip --R2

Compressing related sequences

In files where the sequences (reads) are expected to be similar, for example long reads of similar virus genomes, conveying that expectation to Genozip using --multiseq will usually improve the compression:

genozip --multiseq myfile.fq.gz

Compression optimizations

 

The following options modify the file in ways that improve compression. --optimize is an umbrella option that activates all optimization options.

--optimize-DESC

genozip myfile.fq.gz --reference hs37d5.ref.genozip --optimize-DESC

This replaces the description line with @filename.read_number. Also - if the 3rd line (the ‘+’ line) contains a copy of the description it is shortened to just ‘+’.
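
To make the effect concrete, here is a minimal sketch of the rewrite described above, applied to a single four-line FASTQ record. It illustrates the described behavior and is not Genozip's implementation. The function name optimize_desc is hypothetical, and the sketch assumes that read numbers start at 1 and that "filename" means the base name without extensions; neither detail is stated above.

```python
def optimize_desc(record, filename, read_number):
    """Apply the described rewrite to one record: [description, sequence, plus, quality]."""
    desc, seq, plus, qual = record
    # The '+' line is shortened only if it repeats the description (minus the leading symbol).
    if plus[1:] == desc[1:]:
        plus = "+"
    # Assumption: the new description is '@' + file base name + '.' + a 1-based read number.
    return [f"@{filename}.{read_number}", seq, plus, qual]


# Made-up record for illustration.
record = ["@SRR001.1 length=8", "ACGTACGT", "+SRR001.1 length=8", "IIIIHHHH"]
for line in optimize_desc(record, "myfile", 1):
    print(line)
```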
 

--optimize-QUAL

genozip myfile.fq.gz --reference hs37d5.ref.genozip --optimize-QUAL

 

The base quality data is optimized as follows:

 

Old values  New value

2-9         6

10-19       15

20-24       22

25-29       27

...

85-89       87

90-92       91

93          93
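
The mapping in the table can be sketched in a few lines. This is an illustration under stated assumptions, not Genozip's code: it treats scores 10 through 19 as a single bin mapped to 15, it assumes the rows elided by "..." follow the same five-wide pattern as the rows shown (20-24 to 22, 25-29 to 27, and so on up to 85-89 to 87), it leaves scores 0 and 1 unchanged because the table does not list them, and it assumes Phred+33 encoding of the quality string.

```python
def bin_quality(q: int) -> int:
    """Map one Phred score to its binned value, following the rows of the table."""
    if 2 <= q <= 9:
        return 6
    if 10 <= q <= 19:
        return 15                  # assumption: this bin spans 10-19
    if 20 <= q <= 89:
        return (q // 5) * 5 + 2    # 20-24 -> 22, 25-29 -> 27, ..., 85-89 -> 87
                                   # assumption: the elided rows follow this pattern
    if 90 <= q <= 92:
        return 91
    return q                       # 93 stays 93; 0-1 are not listed, left unchanged


def bin_quality_string(qual: str, offset: int = 33) -> str:
    """Apply bin_quality to every character of a quality string (Phred+33 assumed)."""
    return "".join(chr(bin_quality(ord(c) - offset) + offset) for c in qual)


for q in (2, 9, 20, 24, 29, 85, 92, 93):
    print(q, "->", bin_quality(q))
```

Fewer distinct quality values means less entropy in the quality line, which is why this kind of binning improves compression.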

Note: because optimizations modify the data of the file, automatic verification after compression is disabled, and genounzip --test is also not available.

Uncompressing

 

Uncompress a file (note: if the file was created with --pair, this results in two output files):

genounzip myfile.fq.genozip    

Uncompress a file into stdout (i.e. the terminal):

genocat myfile.fq.genozip

Uncompress a file and also generate a FAI index file, using samtools faidx. samtools needs to be installed for this option to work:

genounzip --index myfile.fq.genozip         

Uncompress to a particular name. Whether or not the name has a .gz extension determines whether the output file is BGZF-compressed:

genounzip --output newname.fq.gz myfile.fq.genozip

Set the level of BGZF compression (for .fq.gz output format) - from 0 (no compression) to 12 (best yet slowest compression). With --bgzf=exact, genounzip attempts to recover the BGZF compression level of the original file.

genocat --bgzf 6 myfile.fq.genozip

genounzip --bgzf 6 myfile.fq.genozip

Uncompressing to a pipe

genocat myfile.fq.genozip | my-pipeline        

In case the file was generated with --pair, genocat will output the data in interleaved format - the first read of R1, followed by the first read of R2, followed by the second read of R1 and so on. Many tools have a command line option for accepting a pair of FASTQ files in interleaved format, for example:

bwa mem -p

bowtie2 --interleaved 

fastp --interleaved_in 

genocat myfile-R1+2.fq.genozip | my-pipeline  # interleaved paired-end
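
If a downstream script needs to reason about the interleaved order, the layout described above can be sketched as follows. This is a plain illustration of the ordering (first read of R1, first read of R2, second read of R1, and so on); the function names and the sample records are made up for the example and are not part of Genozip.

```python
def interleave(r1_records, r2_records):
    """Yield R1[0], R2[0], R1[1], R2[1], ... from two equal-length lists."""
    if len(r1_records) != len(r2_records):
        raise ValueError("R1 and R2 must contain the same number of reads")
    for rec1, rec2 in zip(r1_records, r2_records):
        yield rec1
        yield rec2


def deinterleave(records):
    """Split an interleaved sequence back into (R1 list, R2 list)."""
    records = list(records)
    return records[0::2], records[1::2]


# Each record is a (description, sequence, '+', quality) tuple; values are made up.
r1 = [("@read1/1", "ACGT", "+", "IIII"), ("@read2/1", "TTGA", "+", "IIHH")]
r2 = [("@read1/2", "CGTA", "+", "HHII"), ("@read2/2", "GGCA", "+", "IIII")]

mixed = list(interleave(r1, r2))
for desc, seq, plus, qual in mixed:
    print(desc, seq)
```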

If it is desired to output only one of the paired files at a time, use the --R1 or --R2 genocat options.
 

Slicing & dicing your data with genocat

Genozip allows you to slice and dice your data in many ways, using genocat's subsetting options.

When subsetting a file generated with --pair, use the --interleaved=both option (this is the default) to show a pair of reads only if both reads survived the filter, and --interleaved=either to show the pair if either read survived the filter.
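
The difference between the two modes comes down to whether the per-read filter results are combined with "and" or with "or". Here is a minimal sketch of that logic; the function filter_pairs, the keep predicate and the sample sequences are hypothetical and only illustrate the rule stated above.

```python
def filter_pairs(pairs, keep, mode="both"):
    """Yield (r1, r2) pairs according to how many mates pass `keep`."""
    if mode not in ("both", "either"):
        raise ValueError("mode must be 'both' or 'either'")
    # 'both' requires every mate to pass; 'either' requires at least one.
    combine = all if mode == "both" else any
    for r1, r2 in pairs:
        if combine([keep(r1), keep(r2)]):
            yield r1, r2


# Made-up sequences; the example filter keeps reads that contain no N.
pairs = [("ACGT", "TTGA"), ("ACNT", "GGCA"), ("NNNN", "NANA")]


def no_n(seq):
    return "N" not in seq


print(list(filter_pairs(pairs, no_n, mode="both")))
print(list(filter_pairs(pairs, no_n, mode="either")))
```

In both modes a pair is shown or dropped as a unit, so the output stays properly interleaved.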

Here's a summary of the filtering and subsetting options available for FASTQ files. See genocat for more information.

Option                           Effect

--downsample        Show only one in every X reads

--grep              Show only reads containing the specified string 

--grep-w        -g  Like --grep, but match whole words

--lines         -n  Show only reads from a given range of line numbers

--head              Show only a certain number of reads from the start of the file

--tail              Show only a certain number of reads from the end of the file

--taxid             Include or exclude reads mapped to a certain taxonomy ID. See kraken.

--bases             Filter reads based on the IUPAC nucleotide codes in the sequence data

--no-header         Show only the alignments - exclude the SAM header

--header-only       Show only the description line of each read

--seq-only          Show only the sequence line of each read

--qual-only         Show only the base qualities line of each read

Example: show only the sequence for each read:

 

genocat --seq-only myfile.fq.genozip

Example: show only the first (#0) read in every 10 reads:

genocat --downsample 10,0 myfile.fq.genozip
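
The selection rule in this example can be sketched as a modulo test on the read index. The sketch assumes that reads are counted from zero and grouped into consecutive blocks of 10, as the "(#0)" in the example suggests; the names downsample, rate and shard are made up for illustration.

```python
def downsample(reads, rate, shard=0):
    """Keep the read at position `shard` within every block of `rate` reads."""
    if not 0 <= shard < rate:
        raise ValueError("shard must be in the range 0..rate-1")
    return [read for i, read in enumerate(reads) if i % rate == shard]


# 25 made-up read names: read0 .. read24.
reads = [f"read{i}" for i in range(25)]
print(downsample(reads, 10, 0))
```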

Example: show reads with the string “ACCTTAAT” anywhere in the read (description, sequence or quality lines) - possibly a substring of a longer string:

genocat --grep ACCTTAAT myfile.fq.genozip

Example: show only reads in which all characters of the sequence are one of A,C,G,T,N:

genocat --bases ACGTN myfile.fq.genozip

Example: show only reads in which NOT all characters of the sequence are one of A,C,G,T,N:

 

genocat --bases ^ACGTN myfile.fq.genozip
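
The two --bases examples are complements of each other: every read passes exactly one of them. A minimal sketch of that test follows. The function passes_bases is hypothetical, and the sketch assumes the comparison is case-sensitive, which is not stated above.

```python
def passes_bases(seq: str, spec: str) -> bool:
    """Return True if `seq` passes a --bases style filter such as 'ACGTN' or '^ACGTN'."""
    negate = spec.startswith("^")
    allowed = set(spec[1:] if negate else spec)
    # True when every character of the sequence is in the allowed set.
    all_allowed = set(seq) <= allowed
    return not all_allowed if negate else all_allowed


# Made-up sequences; R is the IUPAC code for A or G.
for seq in ["ACGTN", "ACGRT"]:
    print(seq, passes_bases(seq, "ACGTN"), passes_bases(seq, "^ACGTN"))
```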

idxstats

Genozip has the ability to calculate approximate per-contig statistics (idxstats) directly from FASTQ data. See idxstats.

genocat --idxstats myfile.fq.genozip

Per-contig coverage and depth

Genozip has the ability to calculate approximate per-contig coverage and depth directly from FASTQ data.

See Coverage and Depth.

genocat --coverage myfile.fq.genozip

Sex classification

Genozip has the ability to estimate the sample's sex directly from FASTQ data. Not suitable for clinical applications. See Sex Classification.

genocat --show-sex myfile.fq.genozip

For a full list of options, see the genozip command line reference

Questions? support@genozip.com
