
Reference files
What is a reference file
​​​
A Genozip reference file is a file with a .ref.genozip file name extension. It is derived from a set of genomic sequences retrieved from one or more FASTA files. These sequences represent the genome(s) of the organism(s) from which the data to-be-compressed originated.
​
Genozip uses the reference file to better compress files that contain genomic sequences. Rather than storing a sequence itself, it stores the location of that sequence in the reference file, which yields a smaller representation of the data.
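
The idea of storing a location instead of a sequence can be illustrated with a toy sketch. This is not Genozip's actual algorithm or file format: the functions `encode_read` and `decode_read` and the miniature reference string are hypothetical, and only exact matches are handled.

```python
# Toy illustration only: NOT Genozip's actual algorithm or file format.
def encode_read(read: str, reference: str):
    """Return ("ref", position, length) if the read occurs verbatim in the
    reference, otherwise fall back to ("raw", read)."""
    pos = reference.find(read)
    if pos >= 0:
        return ("ref", pos, len(read))
    return ("raw", read)


def decode_read(token, reference: str) -> str:
    """Reverse encode_read(); needs the same reference that was used to encode."""
    if token[0] == "ref":
        _, pos, length = token
        return reference[pos:pos + length]
    return token[1]


reference = "ACGTACGTTTGACCAGTAGGCTA"  # hypothetical miniature "genome"
read = "TTGACCAG"

token = encode_read(read, reference)
print(token)                                   # two small integers replace the bases
print(decode_read(token, reference) == read)   # round trip restores the read
```

For realistic read lengths the two integers are far smaller than the sequence they stand for, and decoding clearly requires access to the same reference.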
A reference file is useful for compressing these file types:
​
- FASTQ files.
- SAM / BAM / CRAM files.
- VCF files - this is particularly effective for GVCF files.
- Certain types of FASTA files: those that contain sequencer reads (i.e. a FASTA that is essentially a FASTQ without the quality scores).
​
While Genozip can always compress without using a reference, if one is available, it is a good idea to use it – the compression will be better, and also faster. In particular, the effect is dramatic on FASTQ files and unaligned BAM files – these, really, should always be compressed using a reference file.
​​​​
​
Making a reference file
​​​​
A reference file is made like this:
​
$ genozip --make-reference hs37d5.fa.gz
​​​
It can also be made from a URL:
​​​
$ genozip --make-reference ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

A reference file can also be made by combining multiple FASTA files. In this case, it is the user's responsibility to make sure that contig names are unique across all of them:
​
$ cat organismA.fa.gz organismB.fa.gz | genozip --make-reference - --output myref.ref.genozip
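
Before combining FASTA files, contig-name uniqueness can be checked with a short script. The following is a minimal sketch, assuming the contig name is the header text after `>` up to the first whitespace (a common FASTA convention; how Genozip itself derives contig names is not stated here). All function names are hypothetical.

```python
import gzip
from collections import Counter
from typing import Iterable, List


def contig_names(lines: Iterable[str]) -> List[str]:
    """Return the contig name of every FASTA header line: the text after '>'
    up to the first whitespace (assumed convention)."""
    names = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


def duplicate_contigs(names: Iterable[str]) -> List[str]:
    """Return the sorted list of names that occur more than once."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


def contig_names_in_files(paths: Iterable[str]) -> List[str]:
    """Collect contig names from plain or gzip-compressed FASTA files."""
    names = []
    for path in paths:
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as fasta:
            names.extend(contig_names(fasta))
    return names


# In-memory example; real use would call contig_names_in_files([...]).
organism_a = [">chr1 organism A", "ACGT", ">chr2", "GGCC"]
organism_b = [">chr1 organism B", "TTAA", ">plasmid1", "CCGG"]
print(duplicate_contigs(contig_names(organism_a) + contig_names(organism_b)))
```

An empty result means the names are unique; any name listed would need to be renamed in one of the FASTA files before running --make-reference.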
​
Genozip supports references of up to 1 Tbp (~ 1 trillion bases), but if you plan on using references larger than about 10 Gbp, we advise contacting support@genozip.com to discuss.
​​​
​
Using a reference file
​​​​
The most common way of using a reference file is in external reference mode, as this achieves the best compression, for example:
​
$ genozip --reference hs37d5.ref.genozip myfile-R1.fq.gz
​​
In this mode, no data from the reference file is copied into the compressed file. Instead, the reference file is needed again to uncompress the data with genounzip or genocat:
​
$ genounzip --reference hs37d5.ref.genozip myfile-R1.fq.genozip
​​​
The --reference option may be omitted in genounzip and genocat, in which case the reference file is sought at the same filesystem location that was used when the file was compressed with genozip.
​
In contrast, in the stored reference mode, the parts of the reference data that are actually referenced by the file being compressed are stored in the compressed file itself, adding about 0.23 bytes per reference base to its size. For example, for a human WGS FASTQ or BAM file, this would increase the compressed file size by about 700 MB.
​
$ genozip --REFERENCE hs37d5.ref.genozip myfile-R1.fq.gz
​
The advantage is that the reference file is not needed to uncompress:
​
$ genounzip myfile-R1.fq.genozip
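
The size estimate for stored reference mode follows from simple arithmetic. The sketch below assumes a human genome of roughly 3.1 billion bases, all of which are referenced by the file; that genome size is an assumption, while the 0.23 bytes per base figure is the one quoted above.

```python
BYTES_PER_REFERENCE_BASE = 0.23   # figure quoted in the text
HUMAN_GENOME_BASES = 3.1e9        # assumption: approximate human genome size


def stored_reference_overhead_bytes(referenced_bases: float,
                                    bytes_per_base: float = BYTES_PER_REFERENCE_BASE) -> float:
    """Estimate how many bytes stored reference mode adds to a compressed file."""
    return referenced_bases * bytes_per_base


overhead = stored_reference_overhead_bytes(HUMAN_GENOME_BASES)
print(f"{overhead / 1e6:.0f} MB")   # on the order of 700 MB
```

A file that touches only part of the reference (for example, a targeted panel) would add proportionally less, since only the referenced parts are stored.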
​​
Note that the --reference and --REFERENCE command line options can also take a FASTA file name as an argument, instead of a .ref.genozip file. This is just a shortcut for convenience: Genozip searches for the corresponding .ref.genozip in the same directory as the FASTA, and if not found, it makes it.
​​
Using $GENOZIP_REFERENCE: this environment variable can be set to the path of a reference file, which genozip, genounzip or genocat will use as an external reference if the --reference option is omitted. For genounzip and genocat (but not genozip), it can also be set to a directory name, in which case the reference file that genozip used to generate the compressed file will be sought in that directory.
​
​
In-memory caching of reference files
​​
Since reference files are RAM-hungry and take a few seconds to load, genozip caches them in shared memory, so that the time-consuming loading occurs only the first time a particular reference file is used.
​
The reference data remains in RAM until the machine is rebooted, or it is explicitly removed. To see the references currently cached in memory:
​
$ genols --cache
hs37d5.ref.genozip (shmid=1 size=2797630464 loaded=2026-04-01 8:46:39)
GRCh38.ref.genozip (shmid=2 size=2817628528 loaded=2026-04-04 17:47:18)
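
To see how much RAM the cached references occupy in total, the listing can be parsed with a few lines of Python. This is a sketch that assumes the output format is exactly as in the sample above and that size is given in bytes; the regular expression and function name are hypothetical.

```python
import re

# Assumed line format, taken from the sample listing:
#   NAME (shmid=N size=BYTES loaded=TIMESTAMP)
CACHE_LINE = re.compile(
    r"^(?P<name>\S+) \(shmid=(?P<shmid>\d+) size=(?P<size>\d+) loaded=(?P<loaded>[^)]*)\)$"
)


def parse_cache_listing(text: str):
    """Parse a cache listing into a list of dicts, skipping unrecognized lines."""
    entries = []
    for line in text.splitlines():
        match = CACHE_LINE.match(line.strip())
        if match:
            entries.append({
                "name": match.group("name"),
                "shmid": int(match.group("shmid")),
                "size": int(match.group("size")),
                "loaded": match.group("loaded"),
            })
    return entries


listing = """\
hs37d5.ref.genozip (shmid=1 size=2797630464 loaded=2026-04-01 8:46:39)
GRCh38.ref.genozip (shmid=2 size=2817628528 loaded=2026-04-04 17:47:18)
"""

entries = parse_cache_listing(listing)
total = sum(entry["size"] for entry in entries)
print(f"{len(entries)} cached references, {total / 2**30:.2f} GiB in total")
```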
​
To remove them from memory:
​
$ genozip --no-cache
genozip: Unloading reference cache "hs37d5.ref.genozip"
genozip: Unloading reference cache "GRCh38.ref.genozip"
​
The actual removal of these shared memory segments will occur after all processes currently running and using the reference data have completed.
​
It is possible to instruct Genozip to load a copy of the reference from disk, neither seeking it in the cache, nor storing it there:
​
$ genozip --reference hs37d5.ref.genozip --no-cache myfile-R1.fq.gz
​​
​
Docker containers and caching of reference files​
​
Without shared memory, each execution in a Docker container loads the reference file from disk, and multiple containers running Genozip in parallel each hold their own copy of the reference in RAM. To avoid this, it is advisable to share the shared memory between Docker containers. Luckily, Docker allows doing just that using docker run --ipc.
One strategy is to have one Docker container hold the reference, while the other containers use it. To load reference data into cache, compress a tiny dummy file:
​​​
$ genozip --force --no-test --reference hs37d5.ref.genozip tiny.fq
​​
It is possible to repeat this command with another reference file in order to cache more than one reference in memory.
​​
Note that Genozip identifies reference files in cache by their full path, so for caching to work, the same path must be used in all containers.
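
The holder-container strategy might be set up along the following lines. This is a sketch under stated assumptions, not a tested recipe: the container name, image name, and paths are hypothetical, and the command lines are assembled in Python only so that they can be printed and inspected. The Docker options that matter are --ipc=shareable on the holder and --ipc=container:NAME on the others.

```python
import shlex

HOLDER = "genozip-ref-holder"            # hypothetical container name
IMAGE = "my-genozip-image"               # hypothetical image with genozip installed
VOLUME = "/data:/data"                   # hypothetical mount, identical in all containers
REF = "/data/refs/hs37d5.ref.genozip"    # must be the same full path in every container


def holder_commands():
    """Start a long-lived container with a shareable IPC namespace, then load
    the reference into its shared memory by compressing a tiny dummy file."""
    start = ["docker", "run", "-d", "--name", HOLDER, "--ipc=shareable",
             "-v", VOLUME, IMAGE, "sleep", "infinity"]
    load = ["docker", "exec", HOLDER, "genozip", "--force", "--no-test",
            "--reference", REF, "/data/refs/tiny.fq"]
    return [start, load]


def worker_command(fastq: str):
    """Run genozip in a container that joins the holder's IPC namespace."""
    return ["docker", "run", "--rm", f"--ipc=container:{HOLDER}",
            "-v", VOLUME, IMAGE, "genozip", "--reference", REF, fastq]


for command in holder_commands() + [worker_command("/data/myfile-R1.fq.gz")]:
    print(shlex.join(command))
```

Mounting the same volume at the same location in every container keeps the reference path identical, which the cache lookup requires.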
​
​
Viewing and subsetting the reference data
​​
In addition to their primary use for compressing files, reference files are also useful for analyzing the genome contained within:
they can be used to easily view sub-sequences of contigs in given regions (forward or reverse-complemented) using --regions and --regions-file, to find IUPAC non-ACGTN pseudo-bases in the file with --show-ref-iupacs, and to see properties of the contigs with --show-ref-contigs. See more here: Reference file options.
​​
Note on backward compatibility of reference files
​​
Genozip version 15, the current version, is able to use reference files made by older versions, as old as Genozip version 8. However, Genozip's reference file technology has seen a series of major improvements over time, so it is highly recommended to use a reference file made by Genozip 15.0.81 or later, which provides much faster and better compression.
​
Files compressed with a reference file made by Genozip version 15 can be uncompressed with any other reference file made by Genozip version 15 from the same FASTA file, regardless of the size parameter used in --make-reference (as long as the reference file was not made by a newer version of Genozip than the version trying to use it).
​
To find out which Genozip version was used to make a particular reference file, use:
​
$ genocat --stats myref.ref.genozip
​
Genozip's backward compatibility of reference files notwithstanding, the best practice is to store the reference file used to compress together with the archive of compressed files, and to use the same reference file to uncompress.
