
Compressing BAM files

How good is Genozip at compressing BAM files?
[Figure: v15 BAM benchmark]

Relative sizes of .bam files, before and after compression with genozip. 

Benchmarks were run with Genozip 15.0.4. --reference was used when compressing all files; --best was not used. Datasets can be found here.

​
Compressing a BAM, SAM or CRAM file 

​

In the rest of this page we will give examples of BAM files. Genozip is also capable of compressing SAM files, and with some limitations, CRAM files as well.

​

Compressing and uncompressing BAM is straightforward:

​

$ genozip myfile.bam

​

$ ls -lh myfile.bam*

​

-rw-rw-r--+ 1 divon divon  56G Apr 10  2022 myfile.bam
-rw-rw-r--+ 1 divon divon  16G Aug 1  18:44 myfile.bam.genozip
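
To put a number on the saving, the two file sizes can be divided. The helper below is a minimal sketch and not part of Genozip; the function name compression_ratio is made up for illustration.

```shell
# Hypothetical helper (not part of Genozip): print original size / compressed size.
compression_ratio() {
    # $(( ... )) strips the padding that some wc implementations add
    orig_bytes=$(( $(wc -c < "$1") ))
    comp_bytes=$(( $(wc -c < "$2") ))
    awk -v o="$orig_bytes" -v c="$comp_bytes" 'BEGIN { printf "%.1f\n", o / c }'
}
```

For example, `compression_ratio myfile.bam myfile.bam.genozip` would print a value of about 3.5 for the listing above, since ls -lh rounds the sizes it reports.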

 

Compressing using a reference file 
​

Better (sometimes significantly so) compression can be achieved by providing a reference file.

​

$ genozip --reference GRCh38_full_analysis_set_plus_decoy_hla.fa.gz myfile.bam

​

$ ls -lh myfile.bam*

​

-rw-rw-r--+ 1 divon divon  56G Apr 10  2022 myfile.bam

-rw-rw-r--+ 1 divon divon  15G Aug 1  19:01 myfile.bam.genozip


Note: when a particular FASTA file is used as a reference for the first time, Genozip produces a .ref.genozip file, which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can pass it to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):

​

$ genozip --make-reference hs37d5.fa.gz

​

Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly.

​​

Note: If you do not want genounzip to require an external reference file, you may compress with --REFERENCE instead of --reference, which stores the relevant parts of the reference file data within the compressed file. This makes the compressed file larger.

​

Note: It is best if this reference is the one with which the BAM file was aligned, but if the original reference is not available, a "similar enough" reference file might also be very beneficial (by similar we mean a reference file that contains at least the contigs that make up the bulk of the alignments in the BAM file).
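
When several BAM files share one reference, the same --reference argument is repeated for each of them. The sketch below only prints the commands so they can be reviewed first; plan_reference_compression is a hypothetical name, not a Genozip feature, and it assumes file names without spaces.

```shell
# Hypothetical helper: print one genozip command per BAM, all sharing one reference.
# It only prints the commands; it does not run genozip.
plan_reference_compression() {
    ref=$1
    shift
    for bam in "$@"; do
        printf 'genozip --reference %s %s\n' "$ref" "$bam"
    done
}
```

Running `plan_reference_compression hs37d5.fa.gz *.bam` lists the commands; once they look right, the output can be piped to sh.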

​

Co-compressing BAM and FASTQ files (Genozip Deep™)

​

Genozip supports co-compressing a BAM file (or SAM or CRAM) along with all the FASTQs that contributed reads to the BAM. The result of this compression is a single file with the extension .deep.genozip. Uncompressing the deep file with genounzip outputs all the original files (losslessly, obviously!). This method results in very substantial savings vs compressing the FASTQs and the BAM separately:

[Figure: Deep co-compression compared with compressing the FASTQs and the BAM separately]

The datasets can be found here.

​

  • Note: The Deep method requires inclusion of all FASTQs that contributed reads to the BAM.
     

  • Note: A lot can change in the data between a FASTQ file and the corresponding BAM file: sequences may have been trimmed, BQSR may have been applied to quality scores, read names may have changed, reads may have been filtered out, or duplicates may have been collapsed to consensus sequences, just to name a few. Deep will nevertheless still work, albeit a bit less efficiently, if such changes have occurred.
     

  • Note: The Deep method typically consumes 20%-50% less CPU compared to compressing the BAM and FASTQ files separately. However, it consumes significant RAM - the exact amount of RAM required varies, but as a rule of thumb it is about as much as the size of the BAM file.
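
The RAM rule of thumb can be turned into a quick pre-flight check. This is a rough sketch under that rule of thumb only; deep_ram_check is a hypothetical helper, and the actual memory use of a given run may differ.

```shell
# Hypothetical pre-flight check based on the rule of thumb above:
# Deep needs roughly as much RAM as the BAM file is large.
deep_ram_check() {
    bam_bytes=$(( $(wc -c < "$1") ))   # size of the BAM file in bytes
    ram_bytes=$2                        # RAM available to genozip, in bytes
    if [ "$ram_bytes" -ge "$bam_bytes" ]; then
        echo "ok"
    else
        echo "low"
    fi
}
```

For example, `deep_ram_check myfile.bam 68719476736` compares the BAM size against 64 GiB.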
     

Example:

​

$ genozip --deep --reference hs37d5.fa.gz myfile.bam myfile-R1.fastq.gz myfile-R2.fastq.gz 

​

$ ls -lh myfile*

-rw-------+ 1 57G Feb  7  2020 myfile.bam
-rw-------+ 1 31G May  3 23:28 myfile-R1.fastq.gz
-rw-------+ 1 35G May  3 23:31 myfile-R2.fastq.gz
-rw-------+ 1 16G Jun 21 19:27 myfile.deep.genozip

​

Compressing CRAM files

​

Genozip does not support CRAM natively - it uses samtools to convert CRAM files to the BAM format, and as such it requires samtools to be installed for CRAM compression to work.

​

genozip --reference hs37d5.fa.gz myfile.cram

​

Note: use of a reference file is mandatory when compressing a CRAM file.

​

Note: A compressed CRAM file is decompressed into SAM or BAM format. Genozip does not support decompressing to CRAM format.

​

Compression optimizations

​

The following options modify the file in ways that improve compression. --optimize is an umbrella option that activates all optimization options.

​

--optimize-QUAL

​

genozip myfile.bam --optimize-QUAL

​

The base quality data is optimized as follows:

​

Old values  New value

0-2         unchanged

3-9         6

10-19       15

20-24       22

25-29       27

...

85-89       87

90-92       91

93          93
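
The binning can be sketched as a small function that maps a single quality score. This is an illustration of the table, not Genozip's own code. It rests on two assumptions: the row that maps to 15 covers scores 10-19, and the rows elided by "..." follow the same 5-wide bins as the rows shown (20-24 to 22, 25-29 to 27, and so on).

```shell
# Illustrative only: map one base quality score to its binned value.
# Assumptions: the "15" row covers 10-19, and the elided rows use the same
# 5-wide bins as the rows shown in the table.
optimize_qual() {
    q=$1
    if   [ "$q" -le 2 ];  then echo "$q"                   # 0-2: unchanged
    elif [ "$q" -le 9 ];  then echo 6                      # 3-9
    elif [ "$q" -le 19 ]; then echo 15                     # 10-19 (assumed range)
    elif [ "$q" -le 89 ]; then echo $(( q / 5 * 5 + 2 ))   # 20-24 -> 22 ... 85-89 -> 87
    elif [ "$q" -le 92 ]; then echo 91                     # 90-92
    else echo 93                                           # 93
    fi
}
```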

​

--optimize-ZM

​

Relevant only to IonTorrent data: the ZM:B flow signal field is modified such that negative values are changed to zero and positives are rounded to the nearest 10.

​

Example: -20,212,427 ➔ 0,210,430
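
The same rule can be written as a short filter over a comma-separated list of values. This is a sketch of the rule as described, not Genozip's implementation; optimize_zm is a hypothetical name, and it assumes that exact ties (such as 215) round up, which the description does not specify.

```shell
# Sketch of the ZM rule for one comma-separated list of values on stdin.
# Assumption: exact ties (e.g. 215) round up; the description does not say.
optimize_zm() {
    awk -F, '{
        for (i = 1; i <= NF; i++) {
            # negatives become 0, everything else rounds to the nearest 10
            v = ($i < 0) ? 0 : int(($i + 5) / 10) * 10
            printf "%s%s", v, (i < NF ? "," : "\n")
        }
    }'
}
```

Feeding it the values from the example, `printf '%s\n' "-20,212,427" | optimize_zm` prints 0,210,430.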

​

Note: because optimizations modify the data of the file, automatic verification after compression is disabled, and genounzip --test is also not available.

​

Example:

 

$ wget ftp://ftp-trace.ncbi.nih.gov/HG002_NA24385_SRR1767406_IonXpress_020_rawlib_24028.30k.b37.bam

$ genozip IonXpress_020_rawlib.b37.bam

$ genozip IonXpress_020_rawlib.b37.bam --optimize-ZM -o IonXpress_020_rawlib.b37.optimized.bam.genozip

$ ls -Ggh IonXpress_020_rawlib.b37*

 

-rw-rw-r--+ 1 26G Aug 1  23:53 IonXpress_020_rawlib.b37.bam

-rw-rw-r--+ 1 17G Aug 2  00:10 IonXpress_020_rawlib.b37.bam.genozip

-rw-rw-r--+ 1 12G Aug 2  00:17 IonXpress_020_rawlib.b37.optimized.bam.genozip

​

Uncompressing

​

Uncompress a file:

​

genounzip myfile.bam.genozip    

​

Uncompress a file into stdout (i.e. the terminal). Note: this outputs the data in SAM (i.e. textual) format. Use --bam to output in BAM (i.e. binary) format.

​

genocat myfile.bam.genozip
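
Because genocat writes SAM text to stdout, its output can be piped into ordinary text tools. As a minimal sketch, the hypothetical helper below counts alignment records by skipping SAM header lines, which start with "@".

```shell
# Count alignment records in SAM text read from stdin.
# SAM header lines start with "@"; every other line is one alignment.
count_alignments() {
    grep -vc '^@'
}
```

Usage: `genocat myfile.bam.genozip | count_alignments`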

​

Uncompress a file and also generate a BAI index file using samtools index. samtools needs to be installed for this option to work:

​

genounzip --index myfile.bam.genozip         

​

Uncompress to a particular name and format. You may specify the format explicitly with --sam or --bam, or implicitly by giving a .bam, .sam or .sam.gz file name extension in --output.

 

genounzip --output newname.bam myfile.bam.genozip

​

.bam files are compressed internally with BGZF, as are .sam.gz files. Use --bgzf to set the level of BGZF compression - from 0 (no compression) to 12 (best yet slowest compression). With --bgzf=exact, genounzip attempts to recover the BGZF compression level of the original file.

​

genocat --bgzf 6 myfile.bam.genozip --sam --output myfile.sam.gz

genounzip --bgzf 6 myfile.bam.genozip

​

Slicing & dicing your data with genocat
​

Here's a summary of the filtering and subsetting options available for BAM files. See genocat for more information.

​

Option                           Effect

--downsample        Show only one in every X alignments

--regions       -r  Exclude or include certain genomic regions (works even if file is not sorted)

--regions-file  -R  Like --regions, but list of regions is specified in a file

--grep              Show only alignments containing the specified string

--grep-w        -g  Like --grep, but match whole words

--lines         -n  Show only alignments from a given range of line numbers

--head              Show only a certain number of alignments from the start of the file

--tail              Show only a certain number of alignments from the end of the file

--taxid             Include or exclude alignments mapped to a certain taxonomy ID. See kraken.

--FLAG              Include or exclude alignments with certain SAM FLAGs

--MAPQ              Include or exclude alignments with at least a certain MAPQ value

--bases             Filter alignments based on the IUPAC nucleotide codes in the sequence data

--no-header         Show only the alignments - exclude the SAM header

--header-only       Show only the SAM header - exclude the alignments

--qnames            Show only alignments with a QNAME specified (or not) in a comma-separated list

--qnames-file       Show only alignments with a QNAME specified (or not) in a file

--seqs-file         Show only alignments with a SEQ specified (or not) in a file

​

idxstats

​

Genozip has the ability to calculate per-contig statistics (idxstats). See idxstats.

​

genocat --idxstats myfile.bam.genozip
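
The per-contig numbers can be totalled with a short filter. This sketch assumes the output uses the four tab-separated columns of samtools idxstats (contig name, contig length, mapped reads, unmapped reads); check the idxstats page before relying on it. The name sum_idxstats is hypothetical.

```shell
# Sum the mapped and unmapped read counts over all contigs.
# Assumption: input has four tab-separated columns as in samtools idxstats:
# contig name, contig length, mapped reads, unmapped reads.
sum_idxstats() {
    awk -F'\t' '{ mapped += $3; unmapped += $4 }
        END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }'
}
```

Usage: `genocat --idxstats myfile.bam.genozip | sum_idxstats`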

​

Per-contig coverage and depth

​

Genozip has the ability to calculate per-contig coverage. See Coverage and Depth.

​

genocat --coverage myfile.bam.genozip

​

Sex classification

​

Genozip has the ability to estimate the sample's sex. Not suitable for clinical applications. See Sex Classification.

​

genocat --sex myfile.bam.genozip

​

Retrieving one of the components in a Deep file

​

genocat --fastq myfile.deep.genozip

genocat --R1 myfile.deep.genozip

genocat --R2 myfile.deep.genozip

genocat --R=3 myfile.deep.genozip # in case the Deep file consists of more than 2 FASTQs

genocat --interleaved myfile.deep.genozip

genocat --sam myfile.deep.genozip

genocat --bam myfile.deep.genozip

​

​

For a full list of options, see the genozip command line reference

​

Questions? support@genozip.com

​
