Stats collected

Subject to your consent, every time a file is compressed with genozip, a record containing aggregate statistics regarding the performance of the compression algorithm and associated metadata is uploaded to the Genozip server.

We use these statistics to better understand how Genozip is used, so we know where to focus our efforts to improve it. We also use these records for customer support.

Note: if you have a Standard (i.e. paid) license, you will be prompted to decide whether you consent to statistics collection; if you do not consent, no statistics are collected. When using a free license, statistics are always collected (consent is included in the license agreement).

Retention policy: statistics records are retained indefinitely, but may be deleted if they are no longer needed, if required by law or regulation, or if requested by the user. To request deletion or receive a copy of the records submitted under your license, please email support@genozip.com.

The structure of a stats record is illustrated by the following example (one stats record per file compressed). This is the structure used by the most recent version of Genozip; it may continue to evolve as Genozip develops.

Field name
Example
Notes
timestamp
1/June/2022 9:54
Time this record was created
version
14.0.0
Version of Genozip used
data_type
BAM
Data type of the source file
txt_size
8.3 GB,3463182091
Approx size of uncompressed source file and exact size of compressed source file
src_codec
GZ,3.6
Codec of the source file and its compression ratio
genozip_gain
17.5
Compression ratio of genozip vs the uncompressed source file
fields_gain
QUAL,53.1%,15.0X; QNAME,18.1%,11.4X; SEQ,16.0%,49.9X; PNEXT,3.2%,11.5X; CIGAR,2.4%,12.1X; AS:i,1.4%,14.9X; TLEN,1.2%,19.1X; POS,1.2%,30.6X; MAPQ,1.1%,10.4X; FLAG,0.6%,30.1X; XS:i,0.6%,35.3X; TXT_HEADER,0.4%,3.4X; SA:Z,0.4%,2.0X; Other,0.2%,587.9X; RNEXT,0.1%,129.0X; XQ:i,0.0%,2.6X; RNAME,0.0%,466.9X; MD:Z,0.0%,1712.4X; NM:i,0.0%,2365.2X; BAM_BIN,0.0%,0.0X; RG:Z,0.0%,4757.6X
For each field: its name, the percentage of the genozip file taken up by this field, and the compression ratio of the field
contexts
DIVRQUAL,QUAL,27.8%,NONE,27.8%,N/A,0.0%,0,0.0%,; NONREF,SEQUENCE,5.9%,NONE,5.9%,N/A,0.0%,0,0.0%,; QUAL,QUAL,5.3%,NONE,5.2%,N/A,0.0%,1,0.0%,; SQBITMAP,SEQUENCE,4.8%,NONE,4.2%,RANB,0.6%,2,0.0%,; Q5NAME,QNAME,4.4%,BSC,0.0%,LZMA,4.0%,3536,0.4%,; Q6NAME,QNAME,4.0%,BSC,0.0%,BSC,3.7%,2304,0.3%,; DOMQRUNS,QUAL,18.5%,NONE,18.5%,N/A,0.0%,3,0.0%,; Q4NAME,QNAME,3.7%,BSC,0.0%,ARTw,3.6%,936,0.1%,; P2NEXT,PNEXT,3.0%,BSC,0.0%,BSC,2.9%,822,0.1%,; XS:i,XS:i,2.4%,NONE,2.4%,N/A,0.0%,1,0.0%,; CIGAR,CIGAR,2.3%,BSC,0.6%,BSC,1.5%,853,0.1%,; AS:i,AS:i,1.5%,NONE,0.6%,ARTb,0.9%,9,0.0%,; TLEN,TLEN,1.2%,ARTB,0.2%,ARTB,1.0%,16,0.0%,; P0OS0,POS,1.1%,BZ2,0.0%,ARTW,1.1%,170,0.0%,; Q2NAME,QNAME,0.8%,ARTB,0.0%,RANB,0.8%,5,0.0%,; Q1NAME,QNAME,0.8%,NONE,0.8%,N/A,0.0%,0,0.0%,; Q3NAME,QNAME,0.7%,NONE,0.7%,N/A,0.0%,0,0.0%,; QNAME,QNAME,0.6%,ARTB,0.0%,RANB,0.6%,3,0.0%,; F0LAG0,FLAG,0.6%,ARTB,0.0%,ARTB,0.5%,26,0.0%,;
Aggregate statistics of contexts. For each context: its name, parent name, % of genozip file, codec of local data, % of genozip file of local data, codec of b250 data, % of genozip file of b250 data, number of words in dictionary, % of genozip file of dictionary
hash_issues
TaOKEN,QNAME,512.0 KB,73%,SRR34514354.57574038,SRR10260032.79514335,SRR10260015.78254568,SRR10260015.71887571,SRR10260013.69705869,SRR10260015.55836671
Fields with statistical properties that slow Genozip down. 6 example values of the field are included to allow debugging of the issue
hash_issues
QNAME,,,,SRR11234134.1 1/2,SRR11234134.2 2/2,SRR11234134.3 3/2,SRR11234134.4 4/2,SRR11234134.5 5/2,SRR11234134.6 6/2
In SAM/BAM/FASTQ/KRAKEN - 6 examples of QNAME (read name), if QNAME is not of a format recognized by Genozip
features (FASTA)
VBs=196 X 16.0 MB; num_lines=12167; Nucleotide_bases; segconf.line_len=1649;
Features of the file that affect compression
features (--make-reference)
VBs=2998 X 1.0 MB; num_contigs=24; num_bases=3145129148;
Features of the file
features (FASTQ)
VBs=531 X 16.0 MB;num_lines=46009532;Qname=Illumina-old/;segconf.line_len=194;segconf.longest_seq_len=76;
Features of the file that affect compression
features (SAM/BAM)
VBs=4 X 28.1 MB; num_lines=99909; num_hdr_contigs=86; Sorted; Mapper=dragen; Paired-End; sag_type=BY_SA; mate=49%; saggy_near=0%; prim_far=0.01%; Qname=Illumina; segconf.line_len=344; segconf.longest_seq_len=151;
Features of the file that affect compression
features (VCF)
VBs=7 X 32.0 MB; num_lines=6907; num_samples=722; GVCF; segconf.line_len=20964;
Features of the file that affect compression
features (GENERIC)
VBs=1 X 16.0 MB; magic="MZ??????" 4D.5A.90.00.03.00.00.00; extension="exe"; segconf.line_len=0;
Features of the file that affect compression. "magic" is the first 8 bytes of the file; "extension" is the last component of the filename, following the final "."
flags
best; optimize; reference=EXTERNAL
Flags that affect compression
environment
OS=Windows_10.0.22000; cores=8;runtime=0h1'23";dist=conda;
Compute environment, distribution and genozip runtime
user_host
john@lab
User and host running genozip
license_num
442123256
Genozip license of user
programs (SAM/BAM)
MarkDuplicates;bwa;
Programs that generated the data, derived from the ID and PN subfields of the @PG header lines
programs (VCF)
VarScan2;
Programs that generated the data, derived from the ##source lines of the VCF header
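
If you receive a copy of your records and want to inspect them programmatically, the semicolon-separated values shown above can be split with a few lines of Python. This is a minimal sketch and not part of Genozip: parse_fields_gain and parse_features are hypothetical helper names, and the parsing rules are inferred only from the examples above (items separated by ";", fields_gain items of the form name,percent,ratio, and features items that are either key=value or a bare flag such as Sorted). Because the record structure may change between versions, and values such as "magic" could in principle contain a ";", treat it as a starting point only.

```python
def parse_fields_gain(value):
    """Split a fields_gain string into (name, percent_of_file, ratio) tuples.

    Assumes each item looks like 'QUAL,53.1%,15.0X', as in the example record.
    """
    rows = []
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after a trailing ';'
        # Split from the right so only the last two commas are treated as separators
        name, pct, ratio = item.rsplit(",", 2)
        rows.append((name, float(pct.rstrip("%")), float(ratio.rstrip("X"))))
    return rows


def parse_features(value):
    """Split a features string into a dict.

    'key=value' items map key to the value string; bare flags map to True.
    """
    features = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue
        key, sep, val = item.partition("=")
        features[key] = val if sep else True
    return features


gains = parse_fields_gain("QUAL,53.1%,15.0X; QNAME,18.1%,11.4X; SEQ,16.0%,49.9X")
feats = parse_features("VBs=4 X 28.1 MB; num_lines=99909; Sorted; Mapper=dragen;")

print(gains[0])         # ('QUAL', 53.1, 15.0)
print(feats["Mapper"])  # dragen
print(feats["Sorted"])  # True
```

Values are left as strings in parse_features because the examples mix numbers, sizes with units, and free text; convert individual keys as needed.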