How good is Genozip at compressing FASTQ files?
According to our Benchmarks, for a wide range of analysis types, Genozip compresses significantly better and much faster than other available options. As a commercial product, Genozip comes with excellent technical support, and its ability to compress almost all relevant file formats (FASTQ, BAM, VCF and many more) adds to the convenience. Our customers and academic users routinely use Genozip to archive hundreds of terabytes of data, and integrate it into production pipelines.
For any large scale deployment, we encourage you to compare and benchmark Genozip against our competitors. For reference, our main competitors in the area of compression of FASTQ files are Petagene's Petasuite and Illumina's DRAGEN ORA.
Compressing a FASTQ using a reference file
While Genozip is technically capable of compressing FASTQ files without using a reference, in practice, to achieve good compression ratios, compression should always be done against a reference genome.
$ genozip --reference GRCh38_full_analysis_set_plus_decoy_hla.fa.gz myfile.fq.gz
genozip myfile.fq.gz : Done (3 minutes 0 seconds, FASTQ compression ratio: 24.0 - better than .fastq.gz by a factor of 4.7)
testing: genounzip myfile.fastq.genozip : verified as identical to the original FASTQ
$ ls -lh myfile.fastq.*
-rw-------+ 1 divon divon 6.2G Jul 21 19:53 myfile.fastq.genozip
-rw-------+ 1 divon divon 29G Oct 12 2020 myfile.fastq.gz
Note: when a particular FASTA file is used as a reference for the first time, genozip produces a .ref.genozip file which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can use it as an argument to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):
$ genozip --make-reference hs37d5.fa.gz
Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly.
Note: If you do not want genounzip to require an external reference file, you may compress with --REFERENCE instead of --reference, which stores the relevant parts of the reference file data within the compressed file. This makes the compressed file larger.
Note: If the species has multiple versions of its reference genome FASTA, any one of them will work. For example, for human data, one option would be to use hs37d5.fa.gz.
Note: If there is no reference genome for the target species, a reference genome of a closely related species may still work better than no reference at all.
Note: For metagenomic applications, it is possible to use a reference FASTA that contains sequences from multiple species, up to a total of 4 Gbp.
Note: supported input file extensions include .fq .fq.gz .fq.bz2 .fq.xz and also .fastq .fastq.gz .fastq.bz2 .fastq.xz. For FASTQ files with a different extension, use --input fastq to inform Genozip that this is FASTQ data.
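A wrapper script can apply the extension rule above before calling genozip. The following is a minimal sketch: the function names has_fastq_extension and genozip_command_for are hypothetical, the extension list is copied from the note above, and the command is only printed, not run (the --reference option is left out for brevity).

```shell
# Hypothetical helper: succeeds if the file name ends in one of the
# FASTQ extensions listed in the note above.
has_fastq_extension() {
  case "$1" in
    *.fq|*.fq.gz|*.fq.bz2|*.fq.xz|*.fastq|*.fastq.gz|*.fastq.bz2|*.fastq.xz)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Print (rather than run) the genozip command a wrapper might choose:
# add "--input fastq" only when the extension is not recognized.
genozip_command_for() {
  if has_fastq_extension "$1"; then
    echo "genozip $1"
  else
    echo "genozip --input fastq $1"
  fi
}
```

For example, genozip_command_for reads.txt prints a command that includes --input fastq, while genozip_command_for reads.fq.gz does not.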
Compressing a pair of FASTQ files
Genozip can take advantage of redundancies between corresponding reads in a pair of FASTQ files to improve compression when they are compressed together, using --pair:
$ genozip --reference hs37d5.fa.gz --pair myfile-R1.fq.gz myfile-R2.fq.gz
genozip myfile-R1.fq.gz : Done (2 seconds)
genozip myfile-R2.fq.gz : Done (8 seconds, FASTQ compression ratio: 20.3 - better than .fq.gz by a factor of 4.3)
$ ls -l myfile*
-rwxrwxrwx 1 divon divon 3624227 Aug 21 12:50 myfile-R1+2.fastq.genozip
-rwxrwxrwx 1 divon divon 7338338 Aug 21 12:02 myfile-R1.fq.gz
-rwxrwxrwx 1 divon divon 8232187 Aug 21 12:02 myfile-R2.fq.gz
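The reported factor can be cross-checked against the listing: it should roughly equal the combined size of the two .fq.gz files divided by the size of the .genozip file. A minimal sketch, where compression_factor is a hypothetical helper and not part of Genozip:

```shell
# Usage: compression_factor COMPRESSED_BYTES ORIGINAL_BYTES...
# Prints (sum of original sizes) / (compressed size), to one decimal place.
compression_factor() {
  compressed=$1
  shift
  total=0
  for size in "$@"; do
    total=$((total + size))
  done
  awk -v t="$total" -v c="$compressed" 'BEGIN { printf "%.1f\n", t / c }'
}

# Byte counts taken from the ls listing above.
compression_factor 3624227 7338338 8232187   # prints 4.3
```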
Uncompressing both files to their original file names:
genounzip myfile-R1+2.fq.genozip
Accessing R1 and R2 data separately:
genocat myfile-R1+2.fq.genozip --R1
genocat myfile-R1+2.fq.genozip --R2
Compressing related sequences
When the sequences (reads) in a file are expected to be similar, for example long reads of similar virus genomes, conveying that expectation to Genozip using --multiseq will usually improve the compression:
genozip --multiseq myfile.fq.gz
Compression optimizations
The following options modify the file in ways that improve compression. --optimize is an umbrella option that activates all optimization options.
--optimize-DESC
genozip myfile.fq.gz --reference hs37d5.ref.genozip --optimize-DESC
This replaces the description line with @filename.read_number. In addition, if the third line (the ‘+’ line) contains a copy of the description, it is shortened to just ‘+’.
--optimize-QUAL
genozip myfile.fq.gz --reference hs37d5.ref.genozip --optimize-QUAL
The base quality data is optimized as follows:
Old values New value
2-9 6
10-19 15
20-24 22
25-29 27
...
85-89 87
90-92 91
93 93
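The binning in the table can be sketched as a small function on numeric quality scores. This is an illustration, not Genozip's implementation. It rests on three assumptions: the row mapping to 15 covers 10-19, the elided rows continue the 5-wide bins of the listed rows (each bin mapped to its middle value), and scores 0-1, which the table does not list, are left unchanged. The name bin_qual is hypothetical.

```shell
# Usage: bin_qual SCORE
# Prints the binned value for a numeric base quality score, following the
# table above (assumptions are marked below).
bin_qual() {
  awk -v q="$1" 'BEGIN {
    if (q < 2)        n = q                      # not in the table: assumed unchanged
    else if (q <= 9)  n = 6                      # 2-9   -> 6
    else if (q <= 19) n = 15                     # 10-19 -> 15 (assumed range)
    else if (q <= 89) n = int(q / 5) * 5 + 2     # 20-24 -> 22, 25-29 -> 27, ..., 85-89 -> 87
    else if (q <= 92) n = 91                     # 90-92 -> 91
    else              n = q                      # 93    -> 93
    print n
  }'
}
```

For example, bin_qual 23 yields 22 and bin_qual 88 yields 87, matching the listed rows.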
Note: because optimizations modify the data of the file, automatic verification after compression is disabled, and genounzip --test is also not available.
Uncompressing
Uncompress a file (note: if the file was created with --pair, this results in two output files):
genounzip myfile.fq.genozip
Uncompress a file into stdout (i.e. the terminal):
genocat myfile.fq.genozip
Uncompress a file and also generate a FAI index file, using samtools faidx. samtools needs to be installed for this option to work:
genounzip --index myfile.fq.genozip
Uncompress to a particular name. Whether or not the name has a .gz extension determines whether the output file is BGZF-compressed:
genounzip --output newname.fq.gz myfile.fq.genozip
Set the level of BGZF compression (for .fq.gz output format) - from 0 (no compression) to 12 (best yet slowest compression). With --bgzf=exact, genounzip attempts to recover the BGZF compression level of the original file.
genocat --bgzf 6 myfile.fq.genozip
genounzip --bgzf 6 myfile.fq.genozip
Uncompressing to a pipe
genocat myfile.fq.genozip | my-pipeline
In case the file was generated with --pair, genocat will output the data in interleaved format - the first read of R1, followed by the first read of R2, followed by the second read of R1 and so on. Many tools have a command line option for accepting a pair of FASTQ files in interleaved format, for example:
genocat myfile-R1+2.fq.genozip | my-pipeline # interleaved paired-end
To output only one of the paired files at a time, use genocat's --R1 or --R2 option.
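To make the interleaved layout concrete, the sketch below splits an interleaved stream back into two files on the receiving end of a pipe. It assumes every read occupies exactly four lines (no wrapped sequence lines). The function name deinterleave and the output file names are hypothetical.

```shell
# Usage: <interleaved FASTQ on stdin> | deinterleave R1.fq R2.fq
# Assumes 4 lines per read; records alternate R1, R2, R1, R2, ...
deinterleave() {
  awk -v r1="$1" -v r2="$2" '{
    record = int((NR - 1) / 4)        # 0-based index of the current read
    if (record % 2 == 0) print > r1   # even records come from R1
    else                 print > r2   # odd records come from R2
  }'
}
```

A pipeline could then run genocat myfile-R1+2.fq.genozip | deinterleave R1.fq R2.fq, although when only one mate is needed the --R1 and --R2 options are the more direct route.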
Slicing & dicing your data with genocat
Genozip allows you to slice and dice your data in many ways, using genocat's subsetting options.
When subsetting a file generated with --pair, use the --interleaved=both option (this is the default) to show the pair of reads only if both reads survived the filter, and --interleaved=either to show the pair of reads if either of them survived the filter.
Here's a summary of the filtering and subsetting options available for FASTQ files. See genocat for more information.
Option Effect
--downsample Show only one in every X reads
--grep Show only reads containing the specified string
--grep-w -g Like --grep, but match whole words
--lines -n Show only reads from a given range of line numbers
--head Show only a certain number of reads from the start of the file
--tail Show only a certain number of reads from the end of the file
--taxid Include or exclude reads mapped to a certain taxonomy ID. See kraken.
--bases Filter reads based on the IUPAC nucleotide codes in the sequence data
--no-header Show only the alignments - exclude the SAM header
--header-only Show only the description line of each read
--seq-only Show only the sequence line of each read
--qual-only Show only the base qualities line of each read
Example: show only the sequence for each read:
genocat --seq-only myfile.fq.genozip
Example: show only the first (#0) read in every 10 reads:
genocat --downsample 10,0 myfile.fq.genozip
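The selection rule described here (keep read number 0 out of every 10) can be illustrated on an uncompressed FASTQ stream. This sketch only mirrors the rule as stated above and is not how genocat implements it. It assumes four lines per read, and downsample_fastq is a hypothetical name.

```shell
# Usage: downsample_fastq N K < reads.fq
# Keeps the read at 0-based position K within every group of N reads.
# Assumes each read occupies exactly 4 lines.
downsample_fastq() {
  awk -v n="$1" -v k="$2" 'int((NR - 1) / 4) % n == k'
}
```

With this reading, downsample_fastq 10 0 keeps reads 0, 10, 20 and so on, and downsample_fastq 10 3 keeps reads 3, 13, 23 and so on.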
Example: show reads with the string “ACCTTAAT” anywhere in the read (description, sequence or quality lines) - possibly a substring of a longer string:
genocat --grep ACCTTAAT myfile.fq.genozip
Example: show only reads in which all characters of the sequence are one of A,C,G,T,N:
genocat --bases ACGTN myfile.fq.genozip
Example: show only reads in which NOT all characters of the sequence are one of A,C,G,T,N:
genocat --bases ^ACGTN myfile.fq.genozip
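The two --bases examples above amount to a whole-sequence character-set test and its negation. The sketch below applies the same test to an uncompressed FASTQ stream, as an illustration of the semantics described here rather than of genocat's internals. It assumes four lines per read, and filter_bases is a hypothetical name.

```shell
# Usage: filter_bases SET [INVERT] < reads.fq
#   filter_bases ACGTN     keeps reads whose sequence uses only A,C,G,T,N
#   filter_bases ACGTN 1   keeps reads whose sequence has any other character
# Assumes each read occupies exactly 4 lines.
filter_bases() {
  awk -v set="$1" -v invert="${2:-0}" '
    { rec[(NR - 1) % 4] = $0 }                    # buffer the 4 lines of a read
    NR % 4 == 0 {
      all = (rec[1] ~ ("^[" set "]*$"))           # 1 if every base is in SET
      if (all != invert)
        print rec[0] "\n" rec[1] "\n" rec[2] "\n" rec[3]
    }'
}
```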
idxstats
Genozip has the ability to calculate approximate per-contig statistics (idxstats) directly from FASTQ data. See idxstats.
genocat --idxstats myfile.fq.genozip
Per-contig coverage and depth
Genozip has the ability to calculate approximate per-contig coverage and depth directly from FASTQ data.
See Coverage and Depth.
genocat --coverage myfile.fq.genozip
Sex classification
Genozip has the ability to estimate the sample's sex directly from FASTQ data. Not suitable for clinical applications. See Sex Classification.
genocat --show-sex myfile.fq.genozip
For a full list of options, see the genozip command line reference
Questions? support@genozip.com