
Compressing BAM files

How good is Genozip at compressing BAM files?
[Figure: v15 BAM benchmark] Relative sizes of .bam files before and after compression with Genozip.

Benchmark setup: Genozip 15.0.4 was used; --reference was used for all files; --best was not used. Datasets can be found here.

Compressing a BAM, SAM or CRAM file 

The examples in the rest of this page use BAM files. Genozip can also compress SAM files and, with some limitations, CRAM files.

Compressing and uncompressing BAM is straightforward:

$ genozip myfile.bam

$ ls -lh myfile.bam*

-rw-rw-r--+ 1 divon divon  56G Apr 10  2022 myfile.bam
-rw-rw-r--+ 1 divon divon  16G Aug 1  18:44 myfile.bam.genozip

 

Compressing using a reference file 

Better (sometimes significantly so) compression can be achieved by providing a reference file.

$ genozip --reference GRCh38_full_analysis_set_plus_decoy_hla.fa.gz myfile.bam

$ ls -lh myfile.bam*

-rw-rw-r--+ 1 divon divon  56G Apr 10  2022 myfile.bam
-rw-rw-r--+ 1 divon divon  15G Aug 1  19:01 myfile.bam.genozip
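To put the two listings above side by side, the ratio of original to compressed size can be worked out with a little shell arithmetic. This is only an illustrative sketch: `ratio` is a helper name invented here, and because `ls -lh` rounds sizes to whole gigabytes, the results are approximate.

```shell
# ratio ORIGINAL COMPRESSED: print original/compressed to two decimals.
# "ratio" is a hypothetical helper, not part of Genozip.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 56 16   # without --reference (56G -> 16G)
ratio 56 15   # with --reference    (56G -> 15G)
```

With the sizes shown above this prints 3.50 and 3.73, so for this particular file the reference improves the ratio modestly.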

 

Note: when a particular FASTA file is used as a reference for the first time, Genozip produces a .ref.genozip file, which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can pass it to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):

$ genozip --make-reference hs37d5.fa.gz
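If a script needs to prefer an existing .ref.genozip file and fall back to the FASTA otherwise, the choice could be wrapped in a small helper along these lines. `pick_reference` and both of its arguments are hypothetical; the caller supplies the two paths, so no assumption is made here about how Genozip names the .ref.genozip file.

```shell
# pick_reference FASTA REF_GENOZIP
# Print REF_GENOZIP if that file exists, otherwise print FASTA.
# Hypothetical helper; both paths are supplied by the caller.
pick_reference() {
  if [ -f "$2" ]; then
    printf '%s\n' "$2"
  else
    printf '%s\n' "$1"
  fi
}

# Usage (paths are placeholders):
#   genozip --reference "$(pick_reference ref.fa.gz /path/to/existing.ref.genozip)" myfile.bam
```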

Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly.

Note: If you do not want genounzip to require an external reference file, you may compress with --REFERENCE instead of --reference, which stores the relevant parts of the reference file data within the compressed file. This makes the compressed file larger.

Note: It is best if this reference is the one to which the BAM file was aligned, but if the original reference is not available, a "similar enough" reference file can also be very beneficial (by similar we mean a reference file that contains at least the contigs that make up the bulk of the alignments in the BAM file).

Co-compressing BAM and FASTQ files (Genozip Deep™)

Genozip supports co-compressing a BAM file (or SAM or CRAM) along with all the FASTQs that contributed reads to the BAM. The result of this compression is a single file with the extension .deep.genozip. Uncompressing the deep file with genounzip outputs all the original files (losslessly, obviously!). This method results in very substantial savings vs compressing the FASTQs and the BAM separately:

[Figure: benchmark of Genozip Deep versus compressing the FASTQ and BAM files separately]

The datasets can be found here.

Note: The Deep method consumes significant RAM. The exact amount of RAM required varies, but as a rule of thumb it is about as much as the size of the BAM file.

Note: The Deep method requires inclusion of all FASTQs that contributed reads to the BAM.
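The RAM rule of thumb above can be turned into a rough pre-flight check. The sketch below is an illustration, not a Genozip feature: `deep_ram_ok` is a made-up helper that simply compares two byte counts, and how you obtain them (for example with `stat` and /proc/meminfo on Linux) depends on your system.

```shell
# deep_ram_ok BAM_BYTES AVAILABLE_RAM_BYTES
# Succeeds if available RAM is at least the BAM size (the rule of thumb above).
# Hypothetical helper; the real requirement varies, so treat this as a hint only.
deep_ram_ok() {
  [ "$2" -ge "$1" ]
}

# Example with literal byte counts: a 57 GiB BAM and 64 GiB of available RAM.
if deep_ram_ok $((57 * 1024 * 1024 * 1024)) $((64 * 1024 * 1024 * 1024)); then
  echo "likely enough RAM for --deep"
else
  echo "probably not enough RAM for --deep"
fi
```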

Example:

$ genozip --deep --reference hs37d5.fa.gz myfile.bam myfile-R1.fastq.gz myfile-R2.fastq.gz 

$ ls -lh myfile*

-rw-------+ 1 57G Feb  7  2020 myfile.bam
-rw-------+ 1 31G May  3 23:28 myfile-R1.fastq.gz
-rw-------+ 1 35G May  3 23:31 myfile-R2.fastq.gz
-rw-------+ 1 16G Jun 21 19:27 myfile.deep.genozip

Compressing CRAM files

Genozip does not support CRAM natively - it uses samtools to convert CRAM files to the BAM format, and as such it requires samtools to be installed for CRAM compression to work.

genozip --reference hs37d5.fa.gz myfile.cram

Note: use of a reference file is mandatory when compressing a CRAM file.

Note: When decompressing the file, it is decompressed into SAM or BAM format. Genozip does not support decompressing to CRAM format.
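Because CRAM compression depends on both samtools and a reference, a wrapper script may want to check for them before invoking genozip. The following is a minimal sketch: `compress_cram` and its exit codes are invented for illustration, and the only Genozip invocation it makes is the `genozip --reference ... myfile.cram` form shown above.

```shell
# compress_cram CRAM_FILE REFERENCE
# Hypothetical wrapper: verify the prerequisites described above, then compress.
compress_cram() {
  if [ -z "$2" ]; then
    echo "error: a reference file is mandatory for CRAM" >&2
    return 2
  fi
  if ! command -v samtools >/dev/null 2>&1; then
    echo "error: samtools is required to read CRAM files" >&2
    return 3
  fi
  genozip --reference "$2" "$1"
}

# Usage:
#   compress_cram myfile.cram hs37d5.fa.gz
```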

Compression optimizations

The following options modify the file in ways that improve compression. --optimize is an umbrella option that activates all optimization options.

--optimize-QUAL

genozip myfile.bam --optimize-QUAL

The base quality data is optimized as follows:

Old values  New value
2-9         6
10-19       15
20-24       22
25-29       27
...
85-89       87
90-92       91
93          93
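The binning can be expressed as a small function. This is a sketch of the table as read here, not Genozip's actual code: it assumes the second row covers scores 10-19, that the rows elided by "..." continue the same pattern of 5-wide bins mapped to their midpoint, and that scores 0-1 (not listed) are left unchanged. `bin_qual` is a hypothetical name.

```shell
# bin_qual SCORE: print the optimized value for one Phred quality score.
bin_qual() {
  q=$1
  if   [ "$q" -lt 2 ];  then echo "$q"   # assumption: 0-1 are not in the table
  elif [ "$q" -le 9 ];  then echo 6
  elif [ "$q" -le 19 ]; then echo 15     # assumption: this row covers 10-19
  elif [ "$q" -le 89 ]; then echo $(( q / 5 * 5 + 2 ))   # 5-wide bin -> midpoint
  elif [ "$q" -le 92 ]; then echo 91
  else echo 93
  fi
}

# Print a few sample mappings.
for s in 7 23 28 86 92 93; do
  printf '%s -> %s\n' "$s" "$(bin_qual "$s")"
done
```

Collapsing many distinct scores into a few representative values is what makes the quality string more compressible, at the cost of no longer being able to recover the original scores.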

--optimize-ZM

Relevant only to IonTorrent data: the ZM:B flow signal field is modified such that negative values are changed to zero and positives are rounded to the nearest 10.

Example: -20,212,427 ➔ 0,210,430
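The same transformation can be reproduced on a comma-separated list of values. This sketch only mirrors the rule as described; `optimize_zm` is a hypothetical helper, and rounding exact midpoints (such as 215) upward is an assumption, since the text does not say how ties are handled.

```shell
# optimize_zm "V1,V2,...": negatives become 0, other values are rounded
# to the nearest 10 (ties rounded up, which is an assumption).
optimize_zm() {
  printf '%s\n' "$1" | awk -F, '{
    for (i = 1; i <= NF; i++) {
      v = ($i < 0) ? 0 : int(($i + 5) / 10) * 10
      printf "%s%s", v, (i < NF ? "," : "\n")
    }
  }'
}

optimize_zm "-20,212,427"   # prints 0,210,430
```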

Note: because optimizations modify the data of the file, automatic verification after compression is disabled, and genounzip --test is also not available.

Example:

 

$ wget ftp://ftp-trace.ncbi.nih.gov/HG002_NA24385_SRR1767406_IonXpress_020_rawlib_24028.30k.b37.bam

$ genozip IonXpress_020_rawlib.b37.bam

$ genozip IonXpress_020_rawlib.b37.bam --optimize-ZM -o IonXpress_020_rawlib.b37.optimized.bam.genozip

$ ls -Ggh IonXpress_020_rawlib.b37*

 

-rw-rw-r--+ 1 26G Aug 1  23:53 IonXpress_020_rawlib.b37.bam
-rw-rw-r--+ 1 17G Aug 2  00:10 IonXpress_020_rawlib.b37.bam.genozip
-rw-rw-r--+ 1 12G Aug 2  00:17 IonXpress_020_rawlib.b37.optimized.bam.genozip

Uncompressing

Uncompress a file:

genounzip myfile.bam.genozip    

Uncompress a file into stdout (i.e. the terminal). Note: this outputs the data in SAM (i.e. textual) format. Use --bam to output in BAM (i.e. binary) format.

genocat myfile.bam.genozip

Uncompress a file and also generate a BAI index file using samtools index. samtools must be installed for this option to work:

genounzip --index myfile.bam.genozip         

Uncompress to a particular name and format. You may specify the format explicitly with --sam or --bam, or implicitly by giving a .bam, .sam or .sam.gz file name extension in --output.

 

genounzip --output newname.bam myfile.bam.genozip

.bam files are compressed internally with BGZF, as are .sam.gz files. Use --bgzf to set the level of BGZF compression - from 0 (no compression) to 12 (best yet slowest compression). With --bgzf=exact, genounzip attempts to recover the BGZF compression level of the original file.

genocat --bgzf 6 myfile.bam.genozip --sam --output myfile.sam.gz

genounzip --bgzf 6 myfile.bam.genozip
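When the level comes from a script variable, it can be sanity-checked against the range described above before it is passed on. A minimal sketch, with `valid_bgzf` being a made-up helper rather than anything Genozip provides:

```shell
# valid_bgzf VALUE: succeed if VALUE is an integer 0-12 or the word "exact".
valid_bgzf() {
  case $1 in
    exact) return 0 ;;
    '' | *[!0-9]*) return 1 ;;   # empty, or contains a non-digit
  esac
  [ "$1" -le 12 ]
}

level=6
if valid_bgzf "$level"; then
  echo "ok: --bgzf $level"
else
  echo "invalid BGZF level: $level" >&2
fi
```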

Slicing & dicing your data with genocat

Here's a summary of the filtering and subsetting options available for BAM files. See genocat for more information.

Option              Effect
--downsample        Show only one in every X alignments
--regions       -r  Exclude or include certain genomic regions (works even if the file is not sorted)
--regions-file  -R  Like --regions, but the list of regions is specified in a file
--grep              Show only alignments containing the specified string
--grep-w        -g  Like --grep, but match whole words
--lines         -n  Show only alignments from a given range of line numbers
--head              Show only a certain number of alignments from the start of the file
--tail              Show only a certain number of alignments from the end of the file
--taxid             Include or exclude alignments mapped to a certain taxonomy ID. See kraken.
--FLAG              Include or exclude alignments with certain SAM FLAGs
--MAPQ              Include or exclude alignments with at least a certain MAPQ value
--bases             Filter alignments based on the IUPAC nucleotide codes in the sequence data
--no-header         Show only the alignments - exclude the SAM header
--header-only       Show only the SAM header - exclude the alignments
--qnames-file       Show only alignments with a QNAME specified (or not) in a file
--seqs-file         Show only alignments with a SEQ specified (or not) in a file
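Since genocat writes to stdout, these options combine naturally with ordinary Unix tools. As a small sketch using only flags from the table, the hypothetical helpers below count the alignments and the header lines of a compressed file; they assume genocat emits one SAM text line per alignment, as described in the Uncompressing section.

```shell
# Hypothetical helpers built on flags listed in the table above.
# tr strips the padding that some wc implementations add.
count_alignments() {
  genocat --no-header "$1" | wc -l | tr -d ' '
}

count_header_lines() {
  genocat --header-only "$1" | wc -l | tr -d ' '
}

# Usage:
#   count_alignments myfile.bam.genozip
#   count_header_lines myfile.bam.genozip
```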

idxstats

Genozip has the ability to calculate per-contig statistics (idxstats). See idxstats.

genocat --idxstats myfile.bam.genozip

Per-contig coverage and depth

Genozip has the ability to calculate per-contig coverage. See Coverage and Depth.

genocat --coverage myfile.bam.genozip

Sex classification

Genozip has the ability to estimate the sample's sex. Not suitable for clinical applications. See Sex Classification.

genocat --sex myfile.bam.genozip

Converting SAM/BAM to FASTQ

Outputting a SAM/BAM file in FASTQ format:

genocat --fastq myfile.bam.genozip

More details: Converting SAM/BAM to FASTQ.

For a full list of options, see the genozip command line reference.

Questions? support@genozip.com
