How good is Genozip at compressing FASTQ files?

See Benchmarks.

Compressing a FASTQ using a reference file 

While Genozip is technically capable of compressing FASTQ files without using a reference, in practice, to achieve good compression ratios, compression should always be done against a reference genome.

$ genozip --reference GRCh38_full_analysis_set_plus_decoy_hla.fa.gz myfile.fq.gz

genozip myfile.fq.gz : Done (3 minutes 0 seconds, FASTQ compression ratio: 24.0 - better than .fastq.gz by a factor of 4.7)

testing: genounzip myfile.fastq.genozip : verified as identical to the original FASTQ

$ ls -lh myfile.fastq.* 
-rw-------+ 1 divon divon 6.2G Jul 21 19:53 myfile.fastq.genozip
-rw-------+ 1 divon divon 29G  Oct 12  2020 myfile.fastq.gz

 

Note: when a particular FASTA file is used as a reference for the first time, genozip produces a .ref.genozip file, which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can use it as an argument to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):

$ genozip --make-reference hs37d5.fa.gz

Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly.

Note: If you do not want genounzip to require an external reference file, you may compress with --REFERENCE instead of --reference, which stores the relevant parts of the reference file data within the compressed file. This makes the compressed file larger.

Note: If the species has multiple versions of its reference genome FASTA, any one of them will work. For example, for human data, one option would be to use hs37d5.fa.gz.

Note: If there is no reference genome for the target species, a reference genome of a closely related species may still work better than no reference at all. 

Note: For meta-genomic applications, it is possible to use a reference FASTA that contains sequences from multiple species, up to a total of 4 Gbp.

Note: supported input file extensions include .fq .fq.gz .fq.bz2 .fq.xz and also .fastq .fastq.gz .fastq.bz2 .fastq.xz. For FASTQ files with a different extension, use --input fastq to inform Genozip that this is FASTQ data.
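The extension rule in the note above can be sketched in Python as a small helper for a wrapper script. This is only an illustration: the function names needs_input_flag and genozip_command are hypothetical, and the match is assumed to be case-sensitive; only the extension list and the --input fastq option come from the note.

```python
# Extensions listed in the note above; any other name needs --input fastq.
FASTQ_EXTENSIONS = (
    ".fq", ".fq.gz", ".fq.bz2", ".fq.xz",
    ".fastq", ".fastq.gz", ".fastq.bz2", ".fastq.xz",
)


def needs_input_flag(filename):
    """Return True if the file name lacks a listed FASTQ extension."""
    # Assumption: the match is case-sensitive.
    return not filename.endswith(FASTQ_EXTENSIONS)


def genozip_command(filename):
    """Build the argument list for genozip (hypothetical helper)."""
    cmd = ["genozip"]
    if needs_input_flag(filename):
        cmd += ["--input", "fastq"]
    cmd.append(filename)
    return cmd


print(genozip_command("reads.txt"))
print(genozip_command("myfile.fq.gz"))
```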

Compressing a pair of FASTQ files

Genozip can take advantage of redundancies between corresponding reads in a pair of FASTQ files to improve compression when they are compressed together, using --pair:

$ genozip --reference hs37d5.fa.gz --pair myfile-R1.fq.gz myfile-R2.fq.gz

genozip myfile-R1.fq.gz : Done (2 seconds)

genozip myfile-R2.fq.gz : Done (8 seconds, FASTQ compression ratio: 20.3 - better than .fq.gz by a factor of 4.3)

$ ls -l myfile*

-rwxrwxrwx 1 divon divon 3624227 Aug 21 12:50 myfile-R1+2.fastq.genozip

-rwxrwxrwx 1 divon divon 7338338 Aug 21 12:02 myfile-R1.fq.gz

-rwxrwxrwx 1 divon divon 8232187 Aug 21 12:02 myfile-R2.fq.gz
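As a quick check, the "better than .fq.gz by a factor of 4.3" figure can be reproduced from the sizes in the listing above. The variable names are illustrative. The 20.3 ratio cannot be checked the same way, since the size of the uncompressed FASTQ data is not shown.

```python
# File sizes in bytes, copied from the ls output above.
r1_gz = 7_338_338           # myfile-R1.fq.gz
r2_gz = 8_232_187           # myfile-R2.fq.gz
paired_genozip = 3_624_227  # the combined .genozip file

# Combined size of the two gzip files relative to the single genozip file.
factor = (r1_gz + r2_gz) / paired_genozip
print(f"{factor:.1f}")  # prints 4.3
```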

Uncompressing both files to their original file names:

genounzip myfile-R1+2.fq.genozip

Accessing R1 and R2 data separately:

genocat myfile-R1+2.fq.genozip --R1

genocat myfile-R1+2.fq.genozip --R2

Compressing related sequences

In files where it is expected that the sequences (reads) are similar - for example in the case of long-reads of similar virus genomes, conveying that expectation to Genozip using --multiseq will usually improve the compression:

genozip --multiseq myfile.fq.gz

Compression optimizations

 

The following options modify the file in ways that improve compression. --optimize is an umbrella option that activates all optimization options.

--optimize-DESC

genozip myfile.fq.gz --reference hs37d5.ref.genozip --optimize-DESC

This replaces the description line with @filename.read_number. Also - if the 3rd line (the ‘+’ line) contains a copy of the description it is shortened to just ‘+’.
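The effect described above can be pictured with the following Python sketch. It is not Genozip's implementation: the function name is hypothetical, read numbering is assumed to start at 1, the exact form of the file name used is an assumption, and the input is assumed to consist of complete 4-line records.

```python
def optimize_desc(lines, filename):
    """Rewrite 4-line FASTQ records as described for --optimize-DESC.

    Assumptions: read numbers start at 1, and `filename` is used verbatim.
    """
    out = []
    for i in range(0, len(lines), 4):
        _desc, seq, _plus, qual = lines[i:i + 4]
        read_number = i // 4 + 1
        # New description line, and the third line shortened to a bare '+'.
        out += [f"@{filename}.{read_number}", seq, "+", qual]
    return out


record = ["@SRR1.1 extra text", "ACGT", "+SRR1.1 extra text", "IIII"]
print(optimize_desc(record, "myfile"))
```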
 

--optimize-QUAL

genozip myfile.fq.gz --reference hs37d5.ref.genozip --optimize-QUAL

 

The base quality data is optimized as follows:

 

Old values  New value
2-9         6
10-19       15
20-24       22
25-29       27
...
85-89       87
90-92       91
93          93
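The binning in the table can be expressed as a simple range lookup. The Python sketch below covers only the rows that are actually listed. The rows elided by "..." are deliberately left out, the second row is taken to span 10-19 (an assumption), and the names are illustrative rather than Genozip's own.

```python
# (lowest old value, highest old value, new value), listed rows only.
LISTED_BINS = [
    (2, 9, 6),
    (10, 19, 15),  # assumption: this row covers 10-19
    (20, 24, 22),
    (25, 29, 27),
    # rows between 30 and 84 are elided in the table and not reproduced here
    (85, 89, 87),
    (90, 92, 91),
    (93, 93, 93),
]


def bin_quality(score):
    """Return the binned value, or None if the score is not in a listed row."""
    for low, high, new_value in LISTED_BINS:
        if low <= score <= high:
            return new_value
    return None


print([bin_quality(q) for q in (7, 23, 28, 88, 93)])  # prints [6, 22, 27, 87, 93]
```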

Note: because optimizations modify the data of the file, automatic verification after compression is disabled, and genounzip --test is also not available.

Uncompressing

 

Uncompress a file (note: if file was created with --pair this results in two output files):

genounzip myfile.fq.genozip    

Uncompress a file into stdout (i.e. the terminal):

genocat myfile.fq.genozip

Uncompress a file and also generate a FAI index file, using samtools faidx. samtools needs to be installed for this option to work:

genounzip --index myfile.fq.genozip         

Uncompress to a particular name. Whether or not the name has a .gz extension determines whether the output file is BGZF-compressed:

genounzip --output newname.fq.gz myfile.fq.genozip

Set the level of BGZF compression (for .fq.gz output format) - from 0 (no compression) to 12 (best yet slowest compression). With --bgzf=exact, genounzip attempts to recover the BGZF compression level of the original file.

genocat --bgzf 6 myfile.fq.genozip

genounzip --bgzf 6 myfile.fq.genozip

Uncompressing to a pipe

genocat myfile.fq.genozip | my-pipeline        

In case the file was generated with --pair, genocat will output the data in interleaved format - the first read of R1, followed by the first read of R2, followed by the second read of R1 and so on. Many tools have a command line option for accepting a pair of FASTQ files in interleaved format, for example:

bwa mem -p

bowtie2 --interleaved 

fastp --interleaved_in 

genocat myfile-R1+2.fq.genozip | my-pipeline  # interleaved paired-end
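For readers unfamiliar with the interleaved layout, the Python sketch below shows the record order described above and how a downstream script could split such a stream back into R1 and R2. It assumes plain 4-line FASTQ records, and the function names are illustrative.

```python
def records(lines):
    """Group FASTQ lines into 4-line records."""
    return [lines[i:i + 4] for i in range(0, len(lines), 4)]


def interleave(r1_lines, r2_lines):
    """First read of R1, first read of R2, second read of R1, and so on."""
    out = []
    for rec1, rec2 in zip(records(r1_lines), records(r2_lines)):
        out.extend(rec1)
        out.extend(rec2)
    return out


def deinterleave(lines):
    """Split an interleaved stream back into (R1 lines, R2 lines)."""
    recs = records(lines)
    r1 = [line for rec in recs[0::2] for line in rec]
    r2 = [line for rec in recs[1::2] for line in rec]
    return r1, r2


r1 = ["@a/1", "AC", "+", "II", "@b/1", "GT", "+", "II"]
r2 = ["@a/2", "TT", "+", "JJ", "@b/2", "CC", "+", "JJ"]
mixed = interleave(r1, r2)
print(mixed[0], mixed[4], mixed[8], mixed[12])  # prints @a/1 @a/2 @b/1 @b/2
```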

If it is desired to output only one of the paired files at a time, use the --R1 or --R2 genocat options.
 

Slicing & dicing your data with genocat

Genozip allows you to slice and dice your data in many ways, using genocat's subsetting options.

When subsetting a file generated with --pair, use the --interleaved=both option (this is the default) to show a pair of reads only if both reads survived the filter, and --interleaved=either to show the pair if either of them survived the filter.
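The difference between the two modes amounts to requiring all versus any of the reads in a pair to pass the filter. A minimal Python sketch, assuming each pair is a tuple of two sequences and the filter is an arbitrary predicate (the names are hypothetical, not part of Genozip):

```python
def filter_pairs(pairs, keep, mode="both"):
    """Keep a pair if both (default) or either of its reads passes `keep`."""
    if mode not in ("both", "either"):
        raise ValueError("mode must be 'both' or 'either'")
    combine = all if mode == "both" else any
    return [pair for pair in pairs if combine(keep(read) for read in pair)]


pairs = [("ACGT", "ACGT"), ("ACNT", "ACGT"), ("NNNN", "ANNN")]


def no_n(seq):
    """Example filter: the read contains no N."""
    return "N" not in seq


print(filter_pairs(pairs, no_n, "both"))    # prints [('ACGT', 'ACGT')]
print(filter_pairs(pairs, no_n, "either"))  # prints [('ACGT', 'ACGT'), ('ACNT', 'ACGT')]
```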

Here's a summary of the filtering and subsetting options available for FASTQ files. See genocat for more information.

Option              Effect
--downsample        Show only one in every X reads
--grep              Show only reads containing the specified string
--grep-w        -g  Like --grep, but match whole words
--lines         -n  Show only reads from a given range of line numbers
--head              Show only a certain number of reads from the start of the file
--tail              Show only a certain number of reads from the end of the file
--taxid             Include or exclude reads mapped to a certain taxonomy ID. See kraken.
--bases             Filter reads based on the IUPAC nucleotide codes in the sequence data
--no-header         Show only the alignments - exclude the SAM header
--header-only       Show only the description line of each read
--seq-only          Show only the sequence line of each read
--qual-only         Show only the base qualities line of each read

Example: show only the sequence for each read:

 

genocat --seq-only myfile.fq.genozip

Example: show only the first (#0) read in every 10 reads:

genocat --downsample 10,0 myfile.fq.genozip
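The selection performed by the two numbers can be sketched as a modulo test on the zero-based read index. This Python fragment only illustrates the behaviour described in the example above; the parameter names rate and shard are hypothetical.

```python
def downsample(reads, rate, shard=0):
    """Keep read number `shard` (zero-based) out of every `rate` reads."""
    return [read for index, read in enumerate(reads) if index % rate == shard]


reads = [f"read{n}" for n in range(25)]
print(downsample(reads, 10, 0))  # prints ['read0', 'read10', 'read20']
```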

Example: show reads with the string “ACCTTAAT” anywhere in the read (description, sequence or quality lines) - possibly a substring of a longer string:

genocat --grep ACCTTAAT myfile.fq.genozip

Example: show only reads in which all characters of the sequence are one of A,C,G,T,N:

genocat --bases ACGTN myfile.fq.genozip

Example: show only reads in which NOT all characters of the sequence are one of A,C,G,T,N:

 

genocat --bases ^ACGTN myfile.fq.genozip
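The two --bases examples above are complements of each other, which the following Python sketch makes explicit. It is an illustration only: the function name is hypothetical, and matching is assumed to be case-sensitive.

```python
def bases_filter(sequences, spec):
    """Keep sequences made up only of the bases in `spec`.

    A leading '^' inverts the test, as in the second example above.
    Assumption: the comparison is case-sensitive.
    """
    negate = spec.startswith("^")
    allowed = set(spec[1:] if negate else spec)
    return [seq for seq in sequences if (set(seq) <= allowed) != negate]


seqs = ["ACGT", "ACGR", "NNNN"]
print(bases_filter(seqs, "ACGTN"))   # prints ['ACGT', 'NNNN']
print(bases_filter(seqs, "^ACGTN"))  # prints ['ACGR']
```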

idxstats

Genozip has the ability to calculate approximate per-contig statistics (idxstats) directly from FASTQ data. See idxstats.

genocat --idxstats myfile.fq.genozip

Per-contig coverage and depth

Genozip has the ability to calculate approximate per-contig coverage and depth directly from FASTQ data.

See Coverage and Depth.

genocat --coverage myfile.fq.genozip

Sex classification

Genozip has the ability to estimate the sample's sex directly from FASTQ data. Not suitable for clinical applications. See Sex Classification.

genocat --show-sex myfile.fq.genozip

For a full list of options, see the genozip command line reference

Questions? support@genozip.com
