Relative sizes of .bam files, before and after compression with genozip.
Measured with Genozip 15.0.4. --reference was used when compressing all files; --best was not used. Datasets can be found here.
Compressing a BAM, SAM or CRAM file
In the rest of this page we will give examples of BAM files. Genozip is also capable of compressing SAM files, and with some limitations, CRAM files as well.
Compressing and uncompressing BAM is straightforward:
$ genozip myfile.bam
$ ls -lh myfile.bam*
-rw-rw-r--+ 1 divon divon 56G Apr 10 2022 myfile.bam
-rw-rw-r--+ 1 divon divon 16G Aug 1 18:44 myfile.bam.genozip
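As a quick sanity check on the listing above, the saving can be worked out from the two sizes. This is only a sketch: `size_ratio` is a helper name introduced here for illustration, and the inputs are the rounded values printed by `ls -lh`, so the result is approximate.

```shell
# size_ratio ORIGINAL COMPRESSED: print the compression ratio and space saved.
# Hypothetical helper for illustration; both sizes must be in the same unit.
size_ratio() {
  awk -v orig="$1" -v comp="$2" 'BEGIN {
    printf "%.1fx smaller, %.0f%% saved\n", orig / comp, (1 - comp / orig) * 100
  }'
}

# Sizes in GB, taken from the listing above (56G BAM, 16G .genozip).
size_ratio 56 16   # prints: 3.5x smaller, 71% saved
```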
Compressing using a reference file
Better (sometimes significantly so) compression can be achieved by providing a reference file.
$ genozip --reference GRCh38_full_analysis_set_plus_decoy_hla.fa.gz myfile.bam
$ ls -lh myfile.bam*
-rw-rw-r--+ 1 divon divon 56G Apr 10 2022 myfile.bam
-rw-rw-r--+ 1 divon divon 15G Aug 1 19:01 myfile.bam.genozip
Note: when a particular FASTA file is used as a reference for the first time, Genozip produces a .ref.genozip file, which is placed in the same directory as the FASTA. If you already have a .ref.genozip file, you can pass it to --reference instead of the FASTA. You can also generate a .ref.genozip file explicitly with (for example):
$ genozip --make-reference hs37d5.fa.gz
Note: genounzip needs the same reference file for decompression. It looks for it at the same location as the reference file used for genozip. Alternatively, you can use --reference to point to it directly.
Note: If you do not want genounzip to require an external reference file, you may compress with --REFERENCE instead of --reference, which stores the relevant parts of the reference file data within the compressed file. This makes the compressed file larger.
Note: It is best if this reference is the one with which the BAM file was aligned, but if the original reference is not available, a "similar enough" reference file can still be very beneficial (by similar we mean a reference file that contains at least the contigs that make up the bulk of the alignments in the BAM file).
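The notes above can be strung together into one workflow. The sketch below only prints the commands by default (set DRY_RUN=0 to execute them). The `run` wrapper is a helper introduced here, and the name of the generated reference file (the FASTA name with .ref.genozip in place of .fa.gz) is an assumption; check the directory of your FASTA for the actual file.

```shell
# Dry-run wrapper: prints each command by default; set DRY_RUN=0 to execute.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

fasta=hs37d5.fa.gz
# Assumption: the generated reference is named after the FASTA.
ref="${fasta%.fa.gz}.ref.genozip"

run genozip --make-reference "$fasta"                # build the reference once
run genozip --reference "$ref" myfile.bam            # compress against it
run genounzip --reference "$ref" myfile.bam.genozip  # point genounzip at it explicitly
run genozip --REFERENCE "$ref" myfile.bam            # or embed the reference data instead
```

The last two lines are alternatives: with --REFERENCE the compressed file carries the reference data it needs, so genounzip does not need to be pointed at an external reference.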
Co-compressing BAM and FASTQ files (Genozip Deep™)
Genozip supports co-compressing a BAM file (or SAM or CRAM) along with all the FASTQs that contributed reads to the BAM. The result of this compression is a single file with the extension .deep.genozip. Uncompressing the deep file with genounzip outputs all the original files (losslessly, obviously!). This method results in very substantial savings vs compressing the FASTQs and the BAM separately:
The datasets can be found here.
Note: The Deep method consumes significant RAM. The exact amount of RAM required varies, but as a rule of thumb it is about as much as the size of the BAM file.
Note: The Deep method requires inclusion of all FASTQs that contributed reads to the BAM.
$ genozip --deep --reference hs37d5.fa.gz myfile.bam myfile-R1.fastq.gz myfile-R2.fastq.gz
$ ls -lh myfile*
-rw-------+ 1 57G Feb 7 2020 myfile.bam
-rw-------+ 1 31G May 3 23:28 myfile-R1.fastq.gz
-rw-------+ 1 35G May 3 23:31 myfile-R2.fastq.gz
-rw-------+ 1 16G Jun 21 19:27 myfile.deep.genozip
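Because of the RAM rule of thumb mentioned above, it may be worth comparing the BAM size with the available memory before starting a Deep compression. The following is a minimal sketch; `deep_ram_ok` is a hypothetical helper, and the real requirement varies from file to file.

```shell
# deep_ram_ok BAM_BYTES AVAILABLE_BYTES
# Succeeds when available memory is at least the BAM size (the rule of thumb above).
# Hypothetical helper; it is a rough guide, not a guarantee.
deep_ram_ok() {
  [ "$2" -ge "$1" ]
}

# Example with explicit numbers: a 57 GiB BAM on a machine with 64 GiB available.
gib=$((1024 * 1024 * 1024))
if deep_ram_ok $((57 * gib)) $((64 * gib)); then
  echo "likely enough RAM for --deep"
else
  echo "probably not enough RAM for --deep"
fi
```

On Linux, the two numbers could be taken from `stat -c %s myfile.bam` (GNU coreutils) and from the MemAvailable line of /proc/meminfo (reported in kB); both are platform-specific assumptions.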
Compressing CRAM files
Genozip does not support CRAM natively - it uses samtools to convert CRAM files to the BAM format, and as such it requires samtools to be installed for CRAM compression to work.
genozip --reference hs37d5.fa.gz myfile.cram
Note: use of a reference file is mandatory when compressing a CRAM file.
Note: When decompressing the file, it is decompressed into SAM or BAM format. Genozip does not support decompressing to CRAM format.
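Since CRAM compression depends on both samtools and a reference file, a small pre-flight check can catch a missing prerequisite early. This is a sketch only; `have_tool` and `cram_preflight` are hypothetical helpers that check prerequisites and do not run genozip themselves.

```shell
# have_tool NAME: succeed if NAME is found on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# cram_preflight REFERENCE: report what is missing before compressing a CRAM.
# Returns non-zero if samtools is absent or no reference was given.
cram_preflight() {
  local status=0
  have_tool samtools || { echo "samtools not found on PATH"; status=1; }
  [ -n "$1" ] || { echo "a reference file is required for CRAM"; status=1; }
  return "$status"
}
```

It could then be used as `cram_preflight hs37d5.fa.gz && genozip --reference hs37d5.fa.gz myfile.cram`.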
The following options modify the file in ways that improve compression. --optimize is an umbrella option that activates all optimization options.
genozip myfile.bam --optimize-QUAL
The base quality data is optimized as follows:
Old values New value
Relevant only to IonTorrent data: the ZM:B flow signal field is modified such that negative values are changed to zero and positives are rounded to the nearest 10.
Example: -20,212,427 ➔ 0,210,430
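To make the rule concrete, here is a small awk sketch that applies it to a comma-separated list of values. It illustrates the rule as described above and is not Genozip's implementation; `optimize_zm` is a name introduced here, and rounding values that end in 5 upwards is an assumption, since the text does not say how ties are handled.

```shell
# optimize_zm VALUES: apply the ZM rule to a comma-separated list of integers.
# Negative values become 0; other values are rounded to the nearest 10.
# Assumption: values exactly halfway (ending in 5) are rounded up.
optimize_zm() {
  printf '%s\n' "$1" | awk -F, -v OFS=, '{
    for (i = 1; i <= NF; i++) {
      if ($i < 0) $i = 0
      else $i = int(($i + 5) / 10) * 10
    }
    print
  }'
}

optimize_zm "-20,212,427"   # prints: 0,210,430
```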
Note: because optimizations modify the data of the file, automatic verification after compression is disabled, and genounzip --test is also not available.
$ wget ftp://ftp-trace.ncbi.nih.gov/HG002_NA24385_SRR1767406_IonXpress_020_rawlib_24028.30k.b37.bam
$ genozip IonXpress_020_rawlib.b37.bam
$ genozip IonXpress_020_rawlib.b37.bam --optimize-ZM -o IonXpress_020_rawlib.b37.optimized.bam.genozip
$ ls -Ggh IonXpress_020_rawlib.b37*
-rw-rw-r--+ 1 26G Aug 1 23:53 IonXpress_020_rawlib.b37.bam
-rw-rw-r--+ 1 17G Aug 2 00:10 IonXpress_020_rawlib.b37.bam.genozip
-rw-rw-r--+ 1 12G Aug 2 00:17 IonXpress_020_rawlib.b37.optimized.bam.genozip
Uncompress a file:
Uncompress a file into stdout (i.e. the terminal). Note: this outputs the data in SAM (i.e. textual) format. Use --bam to output in BAM (i.e. binary) format.
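The two cases above might look like the following. This sketch assumes that genounzip restores the original file and that genocat is the tool that writes to stdout; the `run` wrapper is a helper introduced here that only prints the commands unless DRY_RUN=0 is set.

```shell
# Dry-run wrapper: prints each command by default; set DRY_RUN=0 to execute.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

run genounzip myfile.bam.genozip      # uncompress back to a BAM file
run genocat myfile.bam.genozip        # SAM (textual) output to stdout
run genocat --bam myfile.bam.genozip  # BAM (binary) output to stdout
```

When executing for real, the binary output of the last command is best redirected to a file or piped into another tool rather than printed to the terminal.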
Uncompress a file and also generate a BAI index file, using samtools index. samtools needs to be installed for this option to work:
genounzip --index myfile.bam.genozip
Uncompress to a particular name and format. You may specify the format explicitly with --sam or --bam, or implicitly by giving a .bam, .sam or .sam.gz file name extension in --output.
genounzip --output newname.bam myfile.bam.genozip
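The implicit rule can be pictured as a simple mapping from file name extension to format. The helper below is hypothetical and only mirrors the three extensions listed above; what Genozip does with any other extension is not covered here.

```shell
# output_format NAME: print the format implied by an --output file name.
# Hypothetical helper mirroring the rule above; not part of Genozip.
output_format() {
  case "$1" in
    *.sam.gz) echo "sam.gz" ;;   # SAM text, BGZF-compressed
    *.bam)    echo "bam" ;;
    *.sam)    echo "sam" ;;
    *)        echo "unknown" ;;  # not one of the three extensions listed above
  esac
}

output_format newname.bam      # prints: bam
output_format newname.sam.gz   # prints: sam.gz
```

The .sam.gz pattern is tested first so that it is reported as BGZF-compressed SAM, which is distinct from plain .sam output.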
.bam files are compressed internally with BGZF, as are .sam.gz files. Use --bgzf to set the level of BGZF compression - from 0 (no compression) to 12 (best yet slowest compression). With --bgzf=exact, genounzip attempts to recover the BGZF compression level of the original file.
genocat --bgzf 6 myfile.bam.genozip --sam --output myfile.sam.gz
genounzip --bgzf 6 myfile.bam.genozip
Slicing & dicing your data with genocat
Here's a summary of the filtering and subsetting options available for BAM files. See genocat for more information.
--downsample Show only one in every X alignments
--regions -r Exclude or include certain genomic regions (works even if file is not sorted)
--regions-file -R Like --regions, but list of regions is specified in a file
--grep Show only alignments containing the specified string
--grep-w -g Like --grep, but match whole words
--lines -n Show only alignments from a given range of line numbers
--head Show only a certain number of alignments from the start of the file
--tail Show only a certain number of alignments from the end of the file
--taxid Include or exclude alignments mapped to a certain taxonomy ID. See kraken.
--FLAG Include or exclude alignments with certain SAM FLAGs
--MAPQ Include or exclude alignments with at least a certain MAPQ value
--bases Filter alignments based on the IUPAC nucleotide codes in the sequence data
--no-header Show only the alignments - exclude the SAM header
--header-only Show only the SAM header - exclude the alignments
--qnames-file Show only alignments with a QNAME specified (or not) in a file
--seqs-file Show only alignments with a SEQ specified (or not) in a file
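To illustrate what a few of these options mean, the sketch below applies roughly equivalent filters to plain SAM text with awk. It does not call genocat. The input is a tiny made-up stream with abbreviated fields, and which alignment of each group --downsample keeps, and whether the header is retained, are assumptions made here.

```shell
# A tiny made-up SAM stream: one header line and four alignments (fields abbreviated).
sam_demo() {
  printf '@HD\tVN:1.6\n'
  printf 'r1\t0\tchr1\n'
  printf 'r2\t0\tchr1\n'
  printf 'r3\t16\tchr2\n'
  printf 'r4\t0\tchr2\n'
}

# SAM header lines start with '@'; alignment lines never do.
no_header()   { awk '!/^@/'; }   # similar in effect to --no-header
header_only() { awk '/^@/'; }    # similar in effect to --header-only

# Keep the header and every Nth alignment, similar in spirit to --downsample N.
# Assumption: the last alignment of each group of N is the one kept.
downsample()  { awk -v n="$1" '/^@/ { print; next } ++c % n == 0'; }

sam_demo | downsample 2   # prints the @HD line, then r2 and r4
```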
Genozip has the ability to calculate per-contig statistics (idxstats). See idxstats.
genocat --idxstats myfile.bam.genozip
Per-contig coverage and depth
Genozip has the ability to calculate per-contig coverage. See Coverage and Depth.
genocat --coverage myfile.bam.genozip
Genozip has the ability to estimate the sample's sex. Not suitable for clinical applications. See Sex Classification.
genocat --sex myfile.bam.genozip
Converting SAM/BAM to FASTQ
Outputting a SAM/BAM file in FASTQ format:
genocat --fastq myfile.bam.genozip
More details: Converting SAM/BAM to FASTQ.
For a full list of options, see the genozip command line reference.