
Benchmarks
Below are some benchmarks measuring Genozip’s performance on a variety of file types.
For additional, peer-reviewed benchmarks, see Publications.
FASTQ
Title | Sequencer | Data | .fq.gz size | Genozip size | Size reduction |
---|---|---|---|---|---|
F1 | Illumina NovaSeq | Human 30x WGS (R1+R2) | 61.2 GB | 13.1 GB | -79% |
F3 | MGI Tech DNBSEQ-G400 | Human WGS (R1+R2) | 99.9 GB | 48.8 GB | -50% |
Notes:
1. Size reductions are measured against files that are already compressed with gzip (.gz), not against uncompressed data.
2. genozip options used: --best. For F1 and F3, --reference and --pair were used as well. For F2, genozip was used with the --multiseq option.
BAM
Title | Name | Feature | Sequencer | Aligner | BAM size (bytes) | Genozip size (bytes) | Size reduction |
---|---|---|---|---|---|---|---|
T1 | NA12878.final.cram | WGS - NovaSeq | Illumina NovaSeq | bwa | 15,797,182,294 | 9,015,105,386 | -43% |
T2 | NA12878_S1.bam | WGS - HiSeq 2000 | Illumina HiSeq 2000 | bwa | 121,691,186,161 | 52,005,672,886 | -57% |
T3 | NA12878.pacbio.bwa-sw.20140202.bam | WGS - PacBio CLR | PacBio | bwa sw | 57,785,894,776 | 25,442,443,993 | -56% |
T4 | ENCFF047UEJ.bam | RNA-seq - transcriptome alignments (STAR) | Illumina HiSeq 2500 | STAR | 2,065,298,931 | 324,674,984 | -84% |
T5 | ENCFF575KZB.bam | RNA-seq - genome alignments (STAR) | Illumina HiSeq 2500 | STAR | 2,322,825,802 | 428,436,333 | -82% |
T6 | ENCFF900XHI.bam | Long read RNA-seq | PacBio Sequel II | minimap2 | 1,308,956,918 | 82,497,181 | -94% |
T7 | sorted_final_merged.bam | WGS (GIAB) | PacBio | blasr | 146,870,854,017 | 31,796,498,533 | -78% |
T8 | ENCFF069XDC.bam | DNase-Seq | Illumina HiSeq 4000 | bwa sampe | 6,668,114,400 | 1,598,034,373 | -76% |
T9 | ENCFF669HBS.bam | STARR-seq | Illumina NovaSeq 6000 | bowtie2 | 5,322,159,747 | 1,242,102,254 | -77% |
T10 | ENCFF046VPK.bam | scRNA-seq | Illumina NextSeq 2000 | STARsolo | 1,306,491,967 | 277,976,393 | -79% |
T11 | ENCFF460RWK.bam | totalRNA-seq | Illumina HiSeq 2500 | STAR | 4,171,989,444 | 1,473,370,983 | -65% |
T12 | hgmm_10k_v3_possorted_genome_bam.bam | 1:1 Mixture of Fresh Frozen Human and Mouse Cells | Illumina NovaSeq | STAR + cellranger | 53,694,972,688 | 15,552,234,159 | -71% |
T13 | ENCFF283TLK.bam | WGS - Nanopore MinION | Oxford Nanopore MinION | ngmlr | 222,354,403,768 | 107,386,919,618 | -52% |
T14 | ENCFF786GJA.bam | WGBS paired-end (Methylation) | Illumina HiSeq X Ten | Bismark | 198,026,253,992 | 27,448,746,040 | -86% |
Source: Lan, D., et al. (2022) Genozip 14 - advances in compression of BAM and CRAM files (preprint)
Notes:
1. genozip options used: --best, plus --reference with the appropriate reference file (except for T4 and T12).
2. T1 is a CRAM file, so its size reduction is shown relative to the CRAM size, not to a BAM size. Genozip compresses a CRAM file by first converting it to BAM.
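The "Size reduction" column can be reproduced from the two byte counts in each row. The sketch below assumes the column is the Genozip size divided by the original size, minus one, rounded to the nearest whole percent; the page does not state the formula, and the function name `size_reduction_percent` is illustrative only.

```python
def size_reduction_percent(original_bytes: int, genozip_bytes: int) -> int:
    """Size change relative to the original file, as a rounded percentage.

    Assumed formula (not stated on the page): (genozip / original - 1) * 100.
    A negative result means the Genozip file is smaller.
    """
    return round((genozip_bytes / original_bytes - 1) * 100)


# Byte counts copied from rows T4 and T14 of the table above.
rows = {
    "T4": (2_065_298_931, 324_674_984),
    "T14": (198_026_253_992, 27_448_746_040),
}

for title, (original, compressed) in rows.items():
    print(f"{title}: {size_reduction_percent(original, compressed)}%")
```

Under that assumption, the script prints -84% for T4 and -86% for T14, matching the table.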
VCF
Title | Data | .vcf.gz size | .genozip size | Size reduction |
---|---|---|---|---|
V1 | 3202 human samples from "1000 Genomes Project" | 27 GB | 7.8 GB | -71% |
V2 | 1135 plant samples from "1001 Genomes - Arabidopsis thaliana" | 132.4 GB | 31.6 GB | -76% |
V3 | 3K Rice samples from "3K Rice Genome" | 1.9 GB | 315.5 MB | -84% |
V4 | GVCF (single sample, human) | 12.6 GB | 763.5 MB | -94% |
V5 | Illumina Genotyping | 35 MB | 9.6 MB | -73% |
Notes:
1. Size reductions are measured against files that are already compressed with gzip (.gz), not against uncompressed data.
2. genozip options used: --best. For V4 and V5, --reference was used as well.
Other data types
Title | Data | Type | .gz size | .genozip size | Size reduction |
---|---|---|---|---|---|
A1 | Covid-19 Multi-FASTA from coronavirus.innar.com | FASTA | 254.9 MB | 1.5 MB | -99% |
G1 | Gene annotation from the Telomere-to-telomere consortium | GFF3 | 91.8 MB | 32.3 MB | -65% |
G2 | Homo_sapiens.GRCh38.108.gtf.gz from Ensembl | GTF | 51.6 MB | 11.3 MB | -78% |
L1 | Illumina s.locs file | LOCS (Illumina) | 12 MB | 5.9 KB | -99.95% |
L2 | Illumina s_X_XXXX.locs file from BaseSpace Demo Data | LOCS (Illumina) | 4.4 MB | 3.3 MB | -25% |
M1 | Consumer DNA test “raw data” | 23andMe | 5.1 MB | 2.9 MB | -43% |
Notes:
1. Size reductions are measured against files that are already compressed with gzip (.gz), except for M1, which is a file compressed with .zip.
2. genozip options used: --best. For A1, genozip was used with the --multiseq option.