
Genozip compression logs

Subject to your consent, every time a file is compressed with genozip, a record containing aggregate statistics regarding the performance of the compression algorithm and associated metadata is uploaded and logged on the Genozip server.

 

We are well aware of the sensitivity of genomic data, and we never log sequences, variants, file names, sample names, read group names, etc. Our logging is strictly limited to the aggregate statistics and metadata specified below.

We use these logs to help us provide you with technical support when needed, and to gain a better understanding of how Genozip is being used, so we know where to focus our efforts to improve it further.

Note: if you have a paid license (i.e. Standard, Enterprise, Premium or Paid Academic), you will be prompted to decide whether you consent to logging. If you do not consent, no logging occurs. When using a free license, logging always occurs (consent is included in the license agreement).

Retention policy: logs may be retained indefinitely, or may be deleted when no longer needed, when required by law or regulations, or when requested by the user. To request deletion or to receive a copy of the records submitted under your license, please email support@genozip.com.

The structure of a log record is illustrated by the following example (one record per file compressed). This is the structure used by the most recent version of Genozip, and it may continue to evolve as Genozip develops.

If you still have concerns regarding our logging, please contact us at support@genozip.com to find a solution.

Field name
Example
Notes
contexts
DIVRQUAL,QUAL,27.8%,NONE,27.8%,N/A,0.0%,0,0.0%,; NONREF,SEQUENCE,5.9%,NONE,5.9%,N/A,0.0%,0,0.0%,; QUAL,QUAL,5.3%,NONE,5.2%,N/A,0.0%,1,0.0%,; SQBITMAP,SEQUENCE,4.8%,NONE,4.2%,RANB,0.6%,2,0.0%,; Q5NAME,QNAME,4.4%,BSC,0.0%,LZMA,4.0%,3536,0.4%,; Q6NAME,QNAME,4.0%,BSC,0.0%,BSC,3.7%,2304,0.3%,; DOMQRUNS,QUAL,18.5%,NONE,18.5%,N/A,0.0%,3,0.0%,; Q4NAME,QNAME,3.7%,BSC,0.0%,ARTw,3.6%,936,0.1%,; P2NEXT,PNEXT,3.0%,BSC,0.0%,BSC,2.9%,822,0.1%,; XS:i,XS:i,2.4%,NONE,2.4%,N/A,0.0%,1,0.0%,; CIGAR,CIGAR,2.3%,BSC,0.6%,BSC,1.5%,853,0.1%,; AS:i,AS:i,1.5%,NONE,0.6%,ARTb,0.9%,9,0.0%,; TLEN,TLEN,1.2%,ARTB,0.2%,ARTB,1.0%,16,0.0%,; P0OS0,POS,1.1%,BZ2,0.0%,ARTW,1.1%,170,0.0%,; Q2NAME,QNAME,0.8%,ARTB,0.0%,RANB,0.8%,5,0.0%,; Q1NAME,QNAME,0.8%,NONE,0.8%,N/A,0.0%,0,0.0%,; Q3NAME,QNAME,0.7%,NONE,0.7%,N/A,0.0%,0,0.0%,; QNAME,QNAME,0.6%,ARTB,0.0%,RANB,0.6%,3,0.0%,; F0LAG0,FLAG,0.6%,ARTB,0.0%,ARTB,0.5%,26,0.0%,;
Aggregate statistics of contexts. For each context: its name, parent name, % of genozip file, codec of local data, % of genozip file of local data, codec of b250 data, % of genozip file of b250 data, number of words in dictionary, % of genozip file of dictionary.
data_type
BAM
Data type of the compressed file
environment
OS=Windows_10.0.22000; cores=8; physical_GB=16; runtime=0h1'23"; dist=conda; n_files=3; remote=0.0.0.0; local=174.22.10.11; glibc=2.27; filesystem=NTFS
Compute environment, distribution, genozip runtime, number of files compressed in this execution, and local and remote IP addresses
features (--make-reference)
VBs=2998 X 1.0 MB; num_contigs=24; num_bases=3145129148;
Features of the file
features (BED)
columns=10;sorted;
Features of the file that affect compression.
features (FASTA)
Nucleotide_bases;num_sequences=12311;
Features of the file that affect compression.
features (FASTA)
VBs=196 X 16.0 MB; num_lines=12167; Nucleotide_bases; segconf.line_len=1649;
Features of the file that affect compression
features (FASTQ)
VBs=531 X 16.0 MB;num_lines=46009532;Qname=Illumina-old/;segconf.line_len=194;segconf.longest_seq_len=76;Sequencer=Illumina;ref_nbases=2542341441;ref_ncontigs=25;
Features of the file that affect compression
features (GENERIC)
VBs=1 X 16.0 MB; magic="MZ??????????????????????@???????" 4D.5A.90.00.03.00.00.00.04.00.00.00.FF.FF.00.00.B8.00.00.00.00.00.00.00.40.00.00.00.00.00.00.00; extension="exe"; segconf.line_len=0;
Features of the file that affect compression. "magic" is the first 32 bytes of the file; "extension" is the component of the filename following the final ".", but if that component is 'gz', 'bz2', 'xz' or 'zip', the component before it is included too.
features (GFF)
num_fasta_sequences=1
Features of the file that affect compression.
features (SAM/BAM)
VBs=4 X 28.1 MB; num_lines=99909; num_hdr_contigs=86; Sorted; Mapper=dragen; Paired-End; sag_type=BY_SA; mate=49%; saggy_near=0%; prim_far=0.01%; Qname=Illumina; segconf.line_len=344; segconf.longest_seq_len=151;hdr_ncontigs=25;bisulfite;ref_nbases=2542341441;ref_ncontigs=25;
Features of the file that affect compression
features (VCF)
VBs=7 X 32.0 MB; num_lines=6907; num_samples=722; GVCF; segconf.line_len=20964; ref_nbases=2542341441;ref_ncontigs=25;
Features of the file that affect compression
features (reference file)
num_contigs=18;num_bases=124234153;
Features of the file that affect --make-reference.
fields_gain
QUAL,53.1%,15.0X; QNAME,18.1%,11.4X; SEQ,16.0%,49.9X; PNEXT,3.2%,11.5X; CIGAR,2.4%,12.1X; AS:i,1.4%,14.9X; TLEN,1.2%,19.1X; POS,1.2%,30.6X; MAPQ,1.1%,10.4X; FLAG,0.6%,30.1X; XS:i,0.6%,35.3X; TXT_HEADER,0.4%,3.4X; SA:Z,0.4%,2.0X; Other,0.2%,587.9X; RNEXT,0.1%,129.0X; XQ:i,0.0%,2.6X; RNAME,0.0%,466.9X; MD:Z,0.0%,1712.4X; NM:i,0.0%,2365.2X; BAM_BIN,0.0%,0.0X; RG:Z,0.0%,4757.6X
For each field: its name, % of the genozip file which is this field, and compression ratio of the field
flags
best; optimize; reference=EXTERNAL ; file_i=4/12
Flags that affect compression
genozip_gain
17.5
Compression ratio of genozip vs the uncompressed source file
hash_issues
TaOKEN,QNAME,512.0 KB,73%,SRR34514354.57574038,SRR10260032.79514335,SRR10260015.78254568,SRR10260015.71887571,SRR10260013.69705869,SRR10260015.55836671
In rare cases in which a certain field has statistical properties that cause Genozip to run slowly, 6 example values of the field are sent for diagnosis.
hash_issues
QNAME,,,,SRR11234134.1 1/2,SRR11234134.2 2/2,SRR11234134.3 3/2,SRR11234134.4 4/2,SRR11234134.5 5/2,SRR11234134.6 6/2
Read names and other similar fields: in extremely rare cases in which Genozip cannot efficiently parse the string due to unsupported formatting, 6 example values are sent for diagnosis.
hash_issues
A00910:85:HYGWJDSXX:1:1101:3025:1000_1:N:0:CAACGAGAGC+GAATTGAGTG;A00910:85:HYGWJDSXX:1:1101:3025:1000
When using --deep: the first FASTQ read name and the first BAM QNAME in the respective files. Sent for diagnosis in rare cases in which Genozip cannot determine the relationship between them.
license_num
442123256
The user's Genozip license number
programs (GFF)
Prodigal;
Programs that generated the data, deduced from the data format
programs (SAM/BAM)
MarkDuplicates;bwa;
Programs that generated the data, derived from the ID and PN subfields of the @PG header lines
programs (VCF)
VarScan2;
Programs that generated the data, derived from the ##source lines of the VCF header
qual_acgt (SAM/BAM/FASTQ)
I@A?;:>9786≐<,52――I;:986>7≐5<430/1――I;:97865/3140≐>,――I@A?>;≐:<98HG756
The most common base quality scores corresponding to each of A, C, G and T in the sequence, in descending order of frequency.
qual_histo (SAM/BAM/FASTQ)
ISD<72+$
The most common base quality scores, in descending order of frequency.
runtime
0h14'25"; 25,13
Runtime of genozip, and the average number of cores used to compress each component
src_codec
GZ,GZ
Codec of the source file (for each component)
timestamp
1/June/2022 9:54
Time this record was created
txt_size
8234126873,3463182091
Sizes of the file(s) prior to Genozip compression: the size after removal of source compression (e.g. .gz), and the original size (i.e. with source compression).
user_host
john@lab
User and host running genozip
version
15.0.25
Version of Genozip used
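
If you receive a copy of your records and want to inspect the list-valued fields programmatically, the following Python sketch may help. It is not part of Genozip: the function names and dictionary keys are hypothetical, and it assumes only the separators visible in the examples above (";" between entries, "," within an entry, with a trailing "," in contexts entries). It also sketches the "extension" rule described for features (GENERIC); how names with no "." or with only a compression suffix are handled is an assumption.

```python
# Illustrative sketch only; names and keys below are assumptions, not Genozip APIs.

CONTEXT_KEYS = ("name", "parent", "pct_file", "local_codec", "local_pct",
                "b250_codec", "b250_pct", "dict_words", "dict_pct")
PERCENT_KEYS = ("pct_file", "local_pct", "b250_pct", "dict_pct")
COMPRESSION_SUFFIXES = {"gz", "bz2", "xz", "zip"}


def parse_contexts(value: str) -> list[dict]:
    """Split a contexts value into one dict per context."""
    records = []
    for chunk in value.split(";"):
        parts = [p.strip() for p in chunk.split(",")]
        if parts and parts[-1] == "":
            parts.pop()  # drop the empty piece left by the trailing comma
        if len(parts) != len(CONTEXT_KEYS):
            continue  # skip empty chunks or entries with an unexpected shape
        rec = dict(zip(CONTEXT_KEYS, parts))
        for key in PERCENT_KEYS:
            rec[key] = float(rec[key].rstrip("%"))
        rec["dict_words"] = int(rec["dict_words"])
        records.append(rec)
    return records


def parse_fields_gain(value: str) -> list[tuple[str, float, float]]:
    """Split a fields_gain value into (field, % of file, compression ratio)."""
    out = []
    for chunk in value.split(";"):
        parts = [p.strip() for p in chunk.split(",")]
        if len(parts) != 3:
            continue  # skip empty chunks or entries with an unexpected shape
        name, pct, ratio = parts
        out.append((name, float(pct.rstrip("%")), float(ratio.rstrip("X"))))
    return out


def logged_extension(filename: str) -> str:
    """Apply the extension rule from the features (GENERIC) notes."""
    parts = filename.split(".")
    if len(parts) < 2:
        return ""  # assumption: a name without "." has no extension
    ext = parts[-1]
    # Assumption: the preceding component is added only if one exists
    # besides the base name (so "archive.zip" yields just "zip").
    if ext in COMPRESSION_SUFFIXES and len(parts) >= 3:
        return parts[-2] + "." + ext
    return ext


# Example values copied from the table above
contexts = parse_contexts(
    "Q5NAME,QNAME,4.4%,BSC,0.0%,LZMA,4.0%,3536,0.4%,; "
    "QUAL,QUAL,5.3%,NONE,5.2%,N/A,0.0%,1,0.0%,;"
)
gains = parse_fields_gain("QUAL,53.1%,15.0X; QNAME,18.1%,11.4X")
```

Entries that do not match the expected number of items are skipped rather than raising an error, since the record structure may change between Genozip versions.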