Benchmarks

Below are some benchmarks measuring Genozip’s performance on a variety of file types.

For additional, peer-reviewed benchmarks, see Publications.

FASTQ

Title	Sequencer	Data	.fq.gz size	Genozip size	Size reduction
F1	Illumina NovaSeq	Human 30x WGS (R1+R2)	61.2 GB	13.1 GB	-79%
F3	MGI Tech DNBSEQ-G400	Human WGS (R1+R2)	99.9 GB	48.8 GB	-51%

Notes:

1. The size reduction shown is relative to files that are already compressed with gzip (.gz).

2. genozip options used: --best. For F1 and F3 --reference and --pair were used as well. For F2, genozip was used with the --multiseq option.
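
As a rough illustration of how the options in note 2 combine, the sketch below assembles genozip command lines as argument lists without running them. Only the flags --best, --reference, --pair and --multiseq come from the notes above. The file names (ref.fa.gz, sample_R1.fq.gz, sample_R2.fq.gz) and the helper name build_fastq_command are hypothetical, and the exact argument order genozip accepts should be checked against its own documentation.

```python
import shlex


def build_fastq_command(inputs, reference=None, paired=False, multiseq=False):
    """Assemble a genozip argument list from the options named in the notes.

    This only builds the list; it does not execute genozip.
    """
    cmd = ["genozip", "--best"]
    if reference is not None:
        cmd += ["--reference", reference]
    if paired:
        cmd.append("--pair")
    if multiseq:
        cmd.append("--multiseq")
    return cmd + list(inputs)


# Hypothetical file names standing in for a paired-end run such as F1 or F3.
paired_cmd = build_fastq_command(
    ["sample_R1.fq.gz", "sample_R2.fq.gz"],
    reference="ref.fa.gz",
    paired=True,
)
print(shlex.join(paired_cmd))
```

Keeping the command as a list rather than a single string avoids quoting problems if it is later passed to subprocess.run.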

BAM

Title	Name	Feature	Sequencer	Aligner	BAM size (bytes)	Genozip size (bytes)	Size reduction
T1	NA12878.final.cram	WGS - NovaSeq	Illumina NovaSeq	bwa	15,797,182,294	9,015,105,386	-43%
T2	NA12878_S1.bam	WGS - HiSeq 2000	Illumina HiSeq 2000	bwa	121,691,186,161	52,005,672,886	-57%
T3	NA12878.pacbio.bwa-sw.20140202.bam	WGS - PacBio CLR	PacBio	bwa sw	57,785,894,776	25,442,443,993	-56%
T4	ENCFF047UEJ.bam	RNA-seq - transcriptome alignments (STAR)	Illumina HiSeq 2500	STAR	2,065,298,931	324,674,984	-84%
T5	ENCFF575KZB.bam	RNA-seq - genome alignments (STAR)	Illumina HiSeq 2500	STAR	2,322,825,802	428,436,333	-82%
T6	ENCFF900XHI.bam	Long read RNA-seq	PacBio Sequel II	minimap2	1,308,956,918	82,497,181	-94%
T7	sorted_final_merged.bam	WGS (GIAB)	PacBio	blasr	146,870,854,017	31,796,498,533	-78%
T8	ENCFF069XDC.bam	DNase-Seq	Illumina HiSeq 4000	bwa sampe	6,668,114,400	1,598,034,373	-76%
T9	ENCFF669HBS.bam	STARR-seq	Illumina NovaSeq 6000	bowtie2	5,322,159,747	1,242,102,254	-77%
T10	ENCFF046VPK.bam	scRNA-seq	Illumina NextSeq 2000	STARsolo	1,306,491,967	277,976,393	-79%
T11	ENCFF460RWK.bam	totalRNA-seq	Illumina HiSeq 2500	STAR	4,171,989,444	1,473,370,983	-65%
T12	hgmm_10k_v3_possorted_genome_bam.bam	1:1 Mixture of Fresh Frozen Human and Mouse Cells	Illumina NovaSeq	STAR + cellranger	53,694,972,688	15,552,234,159	-71%
T13	ENCFF283TLK.bam	WGS - Nanopore MinION	Oxford Nanopore MinION	ngmlr	222,354,403,768	107,386,919,618	-52%
T14	ENCFF786GJA.bam	WGBS paired-end (Methylation)	Illumina HiSeq X Ten	Bismark	198,026,253,992	27,448,746,040	-86%

Notes:

1. genozip options used: --best, and --reference with the appropriate reference file (except for T4 and T12).

2. T1 is a CRAM file, and hence compression ratio is shown relative to CRAM, not BAM. Genozip compresses a CRAM by first converting it to BAM.
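
The "Size reduction" column can be reproduced from the two byte counts in each row. The sketch below is an illustration rather than Genozip's own tooling: the function name size_reduction_percent is hypothetical, and it assumes the column is the relative change in size, rounded to a whole percent.

```python
def size_reduction_percent(original_bytes, compressed_bytes):
    """Return the size change as a negative percentage, e.g. about -43 for a 43% saving."""
    if original_bytes <= 0:
        raise ValueError("original size must be positive")
    return (compressed_bytes / original_bytes - 1.0) * 100.0


# Byte counts copied from rows T1, T6 and T14 of the table above.
rows = {
    "T1": (15_797_182_294, 9_015_105_386),
    "T6": (1_308_956_918, 82_497_181),
    "T14": (198_026_253_992, 27_448_746_040),
}

for title, (original_size, genozip_size) in rows.items():
    print(f"{title}: {size_reduction_percent(original_size, genozip_size):.0f}%")
```

For these three rows the rounded results agree with the percentages listed in the table.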

VCF

Title	Data	.vcf.gz size	.genozip size	Size reduction
V1	3202 human samples from "1000 Genomes Project"	27 GB	7.8 GB	-71%
V2	1135 plant samples from "1001 Genomes - Arabidopsis thaliana"	132.4 GB	31.6 GB	-76%
V3	3K Rice samples from "3K Rice Genome"	1.9 GB	315.5 MB	-84%
V4	GVCF (single sample, human)	12.6 GB	763.5 MB	-94%
V5	Illumina Genotyping	35 MB	9.6 MB	-73%

Notes:

1. The size reduction shown is relative to files that are already compressed with gzip (.gz).

2. genozip options used: --best. For V4 and V5, --reference was used as well.
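
Several rows in this table mix units (for example, GB against MB), so comparing them requires converting both sizes to bytes first. The following is a minimal sketch under stated assumptions: the helper names parse_size and reduction_percent are hypothetical, and the sizes are assumed to use binary multiples (1 GB = 1024 MB), which the page does not state.

```python
# Assumption: binary multiples; the benchmark page does not say which convention it uses.
UNIT_FACTORS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_size(text):
    """Convert a string such as '315.5 MB' or '12MB' to a number of bytes."""
    text = text.strip()
    number, unit = text[:-2].strip(), text[-2:].upper()
    return float(number) * UNIT_FACTORS[unit]


def reduction_percent(original, compressed):
    """Relative size change, as a negative percentage, between two size strings."""
    return (parse_size(compressed) / parse_size(original) - 1.0) * 100.0


# Sizes copied from rows V3 and V4 of the table above.
for title, original, compressed in [
    ("V3", "1.9 GB", "315.5 MB"),
    ("V4", "12.6 GB", "763.5 MB"),
]:
    print(f"{title}: {reduction_percent(original, compressed):.0f}%")
```

Because the listed sizes are rounded to one decimal place, a recomputed percentage can differ by a point from the published one depending on the unit convention chosen.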

Other data types

Title	Data	Type	.gz size	.genozip size	Size reduction
A1	Covid-19 Multi-FASTA from coronavirus.innar.com	FASTA	254.9 MB	1.5 MB	-99%
G1	Gene annotation from the Telomere-to-telomere consortium	GFF3	91.8 MB	32.3 MB	-65%
G2	Homo_sapiens.GRCh38.108.gtf.gz from Ensembl	GTF	51.6 MB	11.3 MB	-78%
L1	Illumina s.locs file	LOCS (Illumina)	12 MB	5.9 KB	-99.95%
L2	Illumina s_X_XXXX.locs file from BaseSpace Demo Data	LOCS (Illumina)	4.4 MB	3.3 MB	-25%
M1	Consumer DNA test “raw data”	23andMe	5.1 MB	2.9 MB	-43%

Notes:

1. The size reduction shown is relative to files that are already compressed with gzip (.gz), except for M1, which is a file compressed with .zip.

2. genozip options used: --best. For A1 genozip was used with the --multiseq option.