
Compressing BAM files

How good is Genozip at compressing BAM files?
[Figure: v15 BAM benchmark]

Relative sizes of .bam files, before and after compression with genozip. 

Benchmarked with Genozip 15.0.4. All files were compressed with --reference; --best was not used. Datasets can be found here.

Compressing a BAM, SAM or CRAM file 

The examples in the rest of this page use BAM files. Genozip is also capable of compressing CRAM and SAM files. To compress CRAM, samtools must be installed on your system.

Compressing and uncompressing BAM is straightforward:

$ genozip myfile.bam

$ ls -lh myfile.bam*

-rw-rw-r--+ 1 divon divon  56G Apr 10  2022 myfile.bam
-rw-rw-r--+ 1 divon divon  16G Aug 1  18:44 myfile.bam.genozip

 

Compressing using a reference file 

Better (sometimes significantly so) compression can be achieved by providing a reference file.

$ genozip --reference GRCh38_full_analysis_set_plus_decoy_hla.fa.gz myfile.bam

$ ls -lh myfile.bam*

-rw-rw-r--+ 1 divon divon  56G Apr 10  2022 myfile.bam
-rw-rw-r--+ 1 divon divon  15G Aug 1  19:01 myfile.bam.genozip

 

Note: when a particular FASTA file is used as a reference for the first time, Genozip produces a .ref.genozip file, which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can use it as an argument to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):

$ genozip --make-reference hs37d5.fa.gz

Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly, or set $GENOZIP_REFERENCE to point to either the file or the directory containing the file.

Note: If you do not want genounzip to require an external reference file, you may compress with --REFERENCE instead of --reference, which stores the relevant parts of the reference file data within the compressed file. This makes the compressed file larger.

Note: It is best if this reference is the one with which the BAM file was aligned, but if the original reference is not available, a "similar enough" reference file might also be very beneficial (by similar we mean a reference file that contains at least the contigs that make up the bulk of the alignments in the BAM file).

Co-compressing BAM and FASTQ files (Genozip Deep™)

Genozip supports co-compressing a BAM file (or SAM or CRAM) along with all the FASTQs that contributed reads to the BAM. The result of this compression is a single file with the extension .deep.genozip. Uncompressing the deep file with genounzip outputs all the original files (losslessly, obviously!). This method results in very substantial savings vs compressing the FASTQs and the BAM separately:

[Figure: size comparison of Deep compression versus compressing the FASTQs and the BAM separately]

The datasets can be found here.

  • Note: The Deep method requires inclusion of all FASTQs that contributed reads to the BAM.
     

  • Note: A lot can change in the data between a FASTQ file and the corresponding BAM file: sequences may have been trimmed, BQSR may have been applied to quality scores, read names may have changed, reads may have been filtered out, or duplicates may have been collapsed to consensus sequences, just to name a few. Deep will still work, albeit a bit less efficiently, if such changes have occurred.
     

  • Note: The Deep method typically consumes 20%-50% less CPU compared to compressing the BAM and FASTQ files separately. However, it consumes significant RAM - the exact amount of RAM required varies, but as a rule of thumb it is about as much as the size of the BAM file.
     

Example:

$ genozip --deep --reference hs37d5.fa.gz myfile.bam myfile-R1.fastq.gz myfile-R2.fastq.gz 

$ ls -lh myfile*

-rw-------+ 1 57G Feb  7  2020 myfile.bam
-rw-------+ 1 31G May  3 23:28 myfile-R1.fastq.gz
-rw-------+ 1 35G May  3 23:31 myfile-R2.fastq.gz
-rw-------+ 1 16G Jun 21 19:27 myfile.deep.genozip

 
--optimize : even better compression, with some caveats.

While Genozip is primarily a lossless compressor, the --optimize option is provided to allow lossy compression as well. The idea of lossy compression is this: it is common for data to contain information at a higher resolution than is actually needed for downstream analysis. Examples include the resolution of base quality scores and the number of decimal digits in fractions. Reducing the resolution yields significantly better compression. The compression-enhancing modifications to SAM/BAM/CRAM data which genozip performs when --optimize is specified are described below. They are designed to have negligible impact on downstream analysis in many common cases; however, you should definitely validate this for your own specific use case.

Base quality scores binning: base quality scores ∈[0,93] (which appear in textual SAM files as the ASCII characters '!' through '~') are binned, loosely following Illumina's method. Binning is applied to the main QUAL field, as well as to the following tags which also contain base quality scores: QT:Z CY:Z BZ:Z and in the case of files generated by 10xGenomics or STARsolo, also to the following tags: UY:Z QX:Z sQ:Z 2Y:Z fq:Z. Note that it is not applied to OQ:Z.

Old values  New value

0-2         unchanged

3-9         6

10-19       15

20-24       22

25-29       27

...

85-89       87

90-92       91

93          93

Note: binning is not applied if the main QUAL field is already binned to 8 or fewer values.
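The binning table above can be sketched as a small lookup function. This is an illustration only, not Genozip's code: the names bin_quality and bin_qual_string are hypothetical, the whole 10-19 range is assumed to map to 15, and the rows elided by "..." in the table are assumed to follow the same five-wide pattern as the rows around them (each bin mapping to its middle value). The check for QUAL fields that are already binned to 8 or fewer values is not modelled.

```python
def bin_quality(q: int) -> int:
    """Map a base quality score in [0, 93] to its binned value."""
    if not 0 <= q <= 93:
        raise ValueError(f"quality score out of range: {q}")
    if q <= 2:
        return q   # 0-2: unchanged
    if q <= 9:
        return 6   # 3-9
    if q <= 19:
        return 15  # ASSUMPTION: the whole 10-19 range maps to 15
    if q <= 89:
        # 20-24 -> 22, 25-29 -> 27 and 85-89 -> 87 come from the table.
        # ASSUMPTION: the elided rows in between follow the same pattern.
        return (q // 5) * 5 + 2
    if q <= 92:
        return 91  # 90-92
    return 93      # 93


def bin_qual_string(qual: str) -> str:
    """Apply the binning to a textual SAM QUAL string ('!' is 0, '~' is 93)."""
    return "".join(chr(bin_quality(ord(c) - 33) + 33) for c in qual)
```

For example, bin_quality(23) gives 22 and bin_quality(91) gives 91, matching the table rows.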

Floating-point number rounding: all floating-point tags (XX:f and XX:B:f) are rounded. For BAM and CRAM, they are rounded to 10 significant bits (this provides an accuracy of approximately 3 significant decimal digits), and for SAM, they are rounded to 3 significant digits.

 

Note: when displaying optimized BAM/CRAM data in textual (SAM) format, these numbers might seem surprisingly not round. This is because they are displayed in base 10; rest assured that they are indeed round in base 2.
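To see why a value that is round in base 2 can look unrounded in base 10, here is a minimal sketch of rounding to 10 significant bits. It is an illustration under assumptions, not Genozip's implementation: the function name is hypothetical, and ties are resolved by Python's round() (round-half-to-even), which may differ from what Genozip does.

```python
import math


def round_to_significant_bits(x: float, bits: int = 10) -> float:
    """Round x so that its binary mantissa keeps only `bits` significant bits."""
    if x == 0.0 or not math.isfinite(x):
        return x
    # x == mantissa * 2**exponent, with 0.5 <= |mantissa| < 1
    mantissa, exponent = math.frexp(x)
    scale = 1 << bits
    # Scaling the mantissa to [2**(bits-1), 2**bits) and rounding to an
    # integer keeps exactly `bits` significant binary digits.
    return math.ldexp(round(mantissa * scale) / scale, exponent)
```

With this sketch, 0.1 becomes 819/8192 = 0.0999755859375: a long decimal expansion, but only 10 significant bits in binary. The relative error is at most about 2^-10, or roughly 0.1%, which is where the "approximately 3 significant decimal digits" comes from.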

IonTorrent flow signal rounding: Relevant only to IonTorrent data: the ZM:B flow signal field is modified such that negative values are changed to zero and positive values are rounded to the nearest 10.

Example: -20,212,427 ➔ 0,210,430
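The same transformation, sketched in a few lines. The function name is hypothetical, and the handling of exact ties (for example 215) is an assumption, since the description above does not say which way they round.

```python
def round_flow_signal(values: list[int]) -> list[int]:
    """Clamp negative flow signal values to 0 and round the rest to the nearest 10."""
    rounded = []
    for v in values:
        if v <= 0:
            rounded.append(0)
        else:
            # ASSUMPTION: ties such as 215 round up (to 220)
            rounded.append((v + 5) // 10 * 10)
    return rounded
```

Calling round_flow_signal([-20, 212, 427]) reproduces the example above.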

Note: genounzip and genozip --test verify that the data uncompresses to precisely the same data as it was after the modifications. Note that this does not test the correctness of the modifications themselves.

Note: --optimize can take an argument to fine-tune which fields get optimized, for example:

 

> genozip --optimize=QUAL,rq:f    - optimizes only the QUAL and rq:f fields, if they are optimizable

> genozip --optimize=^QUAL,rq:f  - optimizes all optimizable fields, except QUAL and rq:f
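The include/exclude semantics of the argument can be sketched as follows. This only models the behaviour described in the two examples; the function name and the idea of passing in the set of optimizable fields are hypothetical, and Genozip's actual parsing may differ.

```python
def fields_to_optimize(arg: str, optimizable: set[str]) -> set[str]:
    """Return the fields selected by an --optimize argument.

    A leading '^' means "every optimizable field except the listed ones";
    otherwise only the listed fields are selected, and only if optimizable.
    """
    negate = arg.startswith("^")
    listed = {name for name in arg[1 if negate else 0:].split(",") if name}
    if negate:
        return optimizable - listed
    return optimizable & listed
```

For a file whose optimizable fields are QUAL, rq:f and sd:f, the argument "QUAL,rq:f" selects QUAL and rq:f, while "^QUAL,rq:f" selects only sd:f.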

Note: To see which fields are actually optimized use --stats (you can use this with genozip, genocat or genounzip):

> genozip --optimize --stats test.bam

BAM file: test.bam
BAM alignments: 100,000 (in Prim VBs: 50 in Depn VBs: 53)
Contexts: 55  Vblocks: 10 x 4.0 MB  Sections: 658
Main VBs: 8 Prim VBs: 1 Depn VBs: 1
Sorting: Sorted by POS
Aligner: dragen
Buddying: sag_type=BY_SA mate=50% saggy_near=0% prim_far=0.02%
Read name style: MGI-R8
Programs: ID: Hash Table Build;ID: DRAGEN HW build
Fields optimized: QUAL,sd:f
Genozip version: 15.0.60 github

[Figure: benchmark for --optimize on BAM, Genozip 15.0.60]

Genozip compression without and with --optimize, showing file sizes in MB. The file tested is a BAM file which consists of MGI Tech reads aligned with Illumina DRAGEN. Note: Compression performance may vary considerably depending on the specific data. In this particular case, the incremental gain of --optimize is primarily due to binning of base quality scores.

Uncompressing

Uncompress a file:

genounzip myfile.bam.genozip    

Uncompress a file into stdout (i.e. the terminal). Note: this outputs the data in SAM (i.e. textual) format. Use --bam or --cram to output in BAM or CRAM formats respectively, or --bgzf to output gz-compressed SAM data.

genocat myfile.bam.genozip

Uncompress a file and also generate a BAI index file, using samtools index. samtools needs to be installed for this option to work:

genounzip --index myfile.bam.genozip         

.bam files are compressed internally with BGZF, as are .sam.gz files. Use --bgzf to specify the recompression level (from 0=no recompression to 5=maximum recompression). With --bgzf=exact, genounzip attempts to recover the BGZF compression level of the original file. Note: this does not apply to CRAM files, since they do not use BGZF.

genounzip --bgzf exact myfile.bam.genozip

genounzip --bgzf 6 myfile.sam.genozip

Uncompress to SAM, BAM or CRAM format by using genocat and either explicitly specifying the format with --sam, --bam or --cram, or implicitly by specifying a .cram, .bam, .sam or .sam.gz file name extension in --output.

genocat myfile.bam.genozip --output myfile.cram # outputs a CRAM file

Slicing & dicing your data with genocat

Here's a summary of the filtering and subsetting options available for BAM files. See genocat for more information.

Option                                 Effect

--downsample        Show only one in every X alignments

--regions       -r  Exclude or include certain genomic regions (works even if file is not sorted)

--regions-file  -R  Like --regions, but list of regions is specified in a file

--grep              Show only alignments containing the specified string

--grep-w        -g  Like --grep, but match whole words

--lines         -n  Show only alignments from a given range of line numbers

--head              Show only a certain number of alignments from the start of the file

--tail              Show only a certain number of alignments from the end of the file

--FLAG              Include or exclude alignments with certain SAM FLAGs

--MAPQ              Include or exclude alignments with at least a certain MAPQ value

--bases             Filter alignments based on the IUPAC nucleotide codes in the sequence data

--no-header         Show only the alignments - exclude the SAM header

--header-only       Show only the SAM header - exclude the alignments

--qnames            Show only alignments with a QNAME specified (or not) in a comma-separated list

--qnames-file       Show only alignments with a QNAME specified (or not) in a file

--seqs-file         Show only alignments with a SEQ specified (or not) in a file

idxstats

Genozip has the ability to calculate per-contig statistics (idxstats). See idxstats.

genocat --idxstats myfile.bam.genozip

Per-contig coverage and depth

Genozip has the ability to calculate per-contig coverage. See Coverage and Depth.

genocat --coverage myfile.bam.genozip

Retrieving one of the components in a Deep file

genocat --fastq myfile.deep.bam.genozip

genocat --R1 myfile.deep.bam.genozip

genocat --R2 myfile.deep.bam.genozip

genocat --R=3 myfile.deep.bam.genozip # in case Deep file consists of more than 2 FASTQs 

genocat --interleaved myfile.deep.bam.genozip

genocat --sam myfile.deep.bam.genozip

genocat --bam myfile.deep.bam.genozip

For a full list of options, see the genozip command line reference

Questions? support@genozip.com
