How good is Genozip at compressing BAM files?
According to our Benchmarks, for a wide range of analysis types, Genozip compresses significantly better and much faster than other available options. As a commercial product, it comes with excellent technical support, and its ability to compress almost all relevant file formats (FASTQ, BAM, VCF and many more) adds to the convenience. Our customers and academic users routinely use Genozip to archive hundreds of terabytes of data and integrate it into production pipelines.
For any large-scale deployment, we encourage you to benchmark Genozip against our competitors. For reference, our main competitors in BAM compression are PetaGene's PetaSuite and samtools (for generating CRAM files).
Compressing a BAM, SAM or CRAM file
The examples in the rest of this page use BAM files. Genozip can also compress SAM files and, with some limitations, CRAM files.
Compressing and uncompressing BAM is straightforward:
$ genozip myfile.bam
$ ls -lh myfile.bam*
-rw-rw-r--+ 1 divon divon 56G Apr 10 2022 myfile.bam
-rw-rw-r--+ 1 divon divon 16G Aug 1 18:44 myfile.bam.genozip
Compressing using a reference file
Better (sometimes significantly so) compression can be achieved by providing a reference file.
$ genozip --reference GRCh38_full_analysis_set_plus_decoy_hla.fa.gz myfile.bam
$ ls -lh myfile.bam*
-rw-rw-r--+ 1 divon divon 56G Apr 10 2022 myfile.bam
-rw-rw-r--+ 1 divon divon 15G Aug 1 19:01 myfile.bam.genozip
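The compression ratios implied by the two listings can also be computed from the exact file sizes rather than the rounded figures printed by ls -lh. The following is a minimal sketch and not part of Genozip: compression_ratio is a hypothetical helper name, and it assumes a POSIX shell with wc and awk available.

```shell
# compression_ratio ORIGINAL COMPRESSED
# Hypothetical helper (not a Genozip command): prints the size of ORIGINAL
# divided by the size of COMPRESSED, to two decimal places.
compression_ratio() {
    orig_bytes=$(wc -c < "$1")   # size of the original file, in bytes
    comp_bytes=$(wc -c < "$2")   # size of the compressed file, in bytes
    awk -v o="$orig_bytes" -v c="$comp_bytes" 'BEGIN { printf "%.2f\n", o / c }'
}

# Usage (file names taken from the examples above):
#   compression_ratio myfile.bam myfile.bam.genozip
```

With the rounded sizes shown above, 56G/16G is a ratio of about 3.5 without a reference, and 56G/15G is about 3.7 with one.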
Note: when a particular FASTA file is used as a reference for the first time, Genozip produces a .ref.genozip file, which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can pass it to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):
$ genozip --make-reference hs37d5.fa.gz
Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly.
Note: If you do not want genounzip to require an external reference file, you may compress with --REFERENCE instead of --reference, which stores the relevant parts of the reference data within the compressed file. This makes the compressed file larger.
Note: It is best if this reference is the one with which the BAM file was aligned, but if the original reference is not available, a "similar enough" reference file can still be very beneficial (by similar we mean a reference file that contains at least the contigs that make up the bulk of the alignments in the BAM file).
Compressing CRAM files
Genozip does not support CRAM natively - it uses samtools to convert CRAM files to the BAM format, and as such it requires samtools to be installed for CRAM compression to work.
genozip --reference hs37d5.fa.gz myfile.cram
Note: use of a reference file is mandatory when compressing a CRAM file.
Note: A compressed CRAM file is decompressed into SAM or BAM format. Genozip does not support decompressing to CRAM format.
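The two prerequisites for CRAM compression (samtools installed, reference supplied) can be checked up front in a script. The following is a sketch under stated assumptions: compress_cram is a hypothetical wrapper name, not a Genozip command, and it uses only the genozip --reference invocation shown above.

```shell
# compress_cram REFERENCE CRAM_FILE
# Hypothetical wrapper (not part of Genozip): verifies the prerequisites
# for CRAM compression before invoking genozip.
compress_cram() {
    ref=${1:-}
    cram=${2:-}
    # A reference file is mandatory for CRAM, so require both arguments
    if [ -z "$ref" ] || [ -z "$cram" ]; then
        echo "usage: compress_cram REFERENCE CRAM_FILE" >&2
        return 2
    fi
    # genozip relies on samtools to read CRAM, so fail early if it is missing
    if ! command -v samtools >/dev/null 2>&1; then
        echo "samtools not found in PATH; it is required for CRAM compression" >&2
        return 1
    fi
    genozip --reference "$ref" "$cram"
}

# Usage:
#   compress_cram hs37d5.fa.gz myfile.cram
```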
Compression optimizations
The following options modify the file in ways that improve compression. --optimize is an umbrella option that activates all optimization options.
--optimize-QUAL
genozip myfile.bam --optimize-QUAL
The base quality data is optimized as follows:
Old values   New value
2-9          6
10-19        15
20-24        22
25-29        27
...
85-89        87
90-92        91
93           93
--optimize-ZM
Relevant only to IonTorrent data: the ZM:B flow signal field is modified such that negative values are changed to zero and positives are rounded to the nearest 10.
Example: -20,212,427 ➔ 0,210,430
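The transformation can be illustrated on a comma-separated list of flow signal values. This is only a sketch of the rule as described above, not Genozip's implementation: optimize_zm is a hypothetical function name, and rounding exact ties (such as 215) upward is an assumption, since the text does not say how ties are handled.

```shell
# optimize_zm "v1,v2,..."
# Hypothetical illustration of the --optimize-ZM rule (not Genozip's code):
# negative values become 0, other values are rounded to the nearest 10.
# Assumption: exact ties (e.g. 215) round up.
optimize_zm() {
    printf '%s\n' "$1" | awk -F, -v OFS=, '{
        for (i = 1; i <= NF; i++) {
            if ($i < 0) $i = 0                       # clamp negatives to zero
            else        $i = int(($i + 5) / 10) * 10 # round to nearest 10
        }
        print
    }'
}

optimize_zm "-20,212,427"   # prints 0,210,430
```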
Note: because optimizations modify the data of the file, automatic verification after compression is disabled, and genounzip --test is also not available.
Example:
$ wget ftp://ftp-trace.ncbi.nih.gov/HG002_NA24385_SRR1767406_IonXpress_020_rawlib_24028.30k.b37.bam
$ genozip IonXpress_020_rawlib.b37.bam
$ genozip IonXpress_020_rawlib.b37.bam --optimize-ZM -o IonXpress_020_rawlib.b37.optimized.bam.genozip
$ ls -Ggh IonXpress_020_rawlib.b37*
-rw-rw-r--+ 1 26G Aug 1 23:53 IonXpress_020_rawlib.b37.bam
-rw-rw-r--+ 1 17G Aug 2 00:10 IonXpress_020_rawlib.b37.bam.genozip
-rw-rw-r--+ 1 12G Aug 2 00:17 IonXpress_020_rawlib.b37.optimized.bam.genozip
Uncompressing
Uncompress a file:
genounzip myfile.bam.genozip
Uncompress a file into stdout (i.e. the terminal). Note: this outputs the data in SAM (i.e. textual) format. Use --bam to output in BAM (i.e. binary) format.
genocat myfile.bam.genozip
Uncompress a file and also generate a BAI index file using samtools index. samtools needs to be installed for this option to work:
genounzip --index myfile.bam.genozip
Uncompress to a particular name and format. You may specify the format explicitly with --sam or --bam, or implicitly by giving a .bam, .sam or .sam.gz file name extension in --output.
genounzip --output newname.bam myfile.bam.genozip
.bam files are compressed internally with BGZF, as are .sam.gz files. Use --bgzf to set the level of BGZF compression - from 0 (no compression) to 12 (best yet slowest compression). With --bgzf=exact, genounzip attempts to recover the BGZF compression level of the original file.
genocat --bgzf 6 myfile.bam.genozip --sam --output myfile.sam.gz
genounzip --bgzf 6 myfile.bam.genozip
Slicing & dicing your data with genocat
Here's a summary of the filtering and subsetting options available for BAM files. See genocat for more information.
Option              Effect
--downsample        Show only one in every X alignments
--regions -r        Exclude or include certain genomic regions (works even if file is not sorted)
--regions-file -R   Like --regions, but list of regions is specified in a file
--grep              Show only alignments containing the specified string
--grep-w -g         Like --grep, but match whole words
--lines -n          Show only alignments from a given range of line numbers
--head              Show only a certain number of alignments from the start of the file
--tail              Show only a certain number of alignments from the end of the file
--taxid             Include or exclude alignments mapped to a certain taxonomy ID. See kraken.
--FLAG              Include or exclude alignments with certain SAM FLAGs
--MAPQ              Include or exclude alignments with at least a certain MAPQ value
--bases             Filter alignments based on the IUPAC nucleotide codes in the sequence data
--no-header         Show only the alignments - exclude the SAM header
--header-only       Show only the SAM header - exclude the alignments
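Because genocat writes SAM text to stdout, its output can be piped into standard Unix tools. As a minimal sketch, the helper below counts the alignments in a file using only the --no-header option from the table above. count_alignments is a hypothetical name, not a Genozip command, and the sketch relies on SAM text holding one alignment per line.

```shell
# count_alignments FILE.genozip
# Hypothetical helper: streams SAM text from genocat without the header
# and counts the lines, one per alignment.
count_alignments() {
    # tr strips the padding that some wc implementations add
    genocat --no-header "$1" | wc -l | tr -d ' '
}

# Usage:
#   count_alignments myfile.bam.genozip
```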
idxstats
Genozip has the ability to calculate per-contig statistics (idxstats). See idxstats.
genocat --idxstats myfile.bam.genozip
Per-contig coverage and depth
Genozip has the ability to calculate per-contig coverage. See Coverage and Depth.
genocat --coverage myfile.bam.genozip
Sex classification
Genozip has the ability to estimate the sample's sex. Not suitable for clinical applications. See Sex Classification.
genocat --show-sex myfile.bam.genozip
Converting SAM/BAM to FASTQ
Outputting a SAM/BAM file in FASTQ format:
genocat --fastq myfile.bam.genozip
More details: Converting SAM/BAM to FASTQ.
For a full list of options, see the genozip command line reference.
Questions? support@genozip.com