
Index files (bai, tbi...)

Summary

After genounzip uncompresses a file, it is often necessary to re-index it, as the old index may no longer be valid. genounzip does this automatically for .bam, .sam.gz and .vcf.gz files.

Details

Most genomic file formats have accompanying index files:

- bam and sam.gz files can be indexed with either bai or csi indices, while cram files can be indexed with a crai index.

- vcf.gz files can be indexed with either tbi or csi indices, while bcf files are indexed with a csi index.

- Less common and usually not needed: fasta.gz and fastq.gz can be indexed with a fai index combined with gzi index.

The purpose of an index is to allow software tools to access specific objects (reads, alignments, variants, etc.) in the corresponding genomic file by going directly to the relevant location within the file, rather than reading the entire file from disk (a capability often termed random access).
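
The idea of random access through an index can be sketched with a toy example. The record layout, the function names and the assumption that records are sorted by chromosome are all hypothetical; real bai, tbi and csi indices are considerably more elaborate, so this only conveys the principle of seeking to a stored offset instead of scanning the file.

```python
import io


def build_offset_index(stream):
    """Map each chromosome name to the byte offset of its first record."""
    index = {}
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break
        chrom = line.split(b"\t", 1)[0]
        index.setdefault(chrom, offset)  # keep only the first offset seen
    return index


def records_for(stream, index, chrom):
    """Seek straight to chrom's first record and read until the name changes."""
    if chrom not in index:
        return []
    stream.seek(index[chrom])
    found = []
    for line in iter(stream.readline, b""):
        if line.split(b"\t", 1)[0] != chrom:
            break
        found.append(line)
    return found


# Hypothetical, chromosome-sorted records: chromosome, position, base.
data = io.BytesIO(b"chr1\t100\tA\nchr1\t250\tG\nchr2\t40\tT\nchr3\t7\tC\n")
index = build_offset_index(data)
chr2_records = records_for(data, index, b"chr2")
```

Here the lookup for chr2 never reads the chr1 records: it jumps to the stored offset and stops as soon as the chromosome name changes.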

Of these indices, bai, tbi, csi and gzi are closely tied to the BGZF file format. BGZF is a specific flavor of gz compression designed to allow indexing and random access. It is the file format output by bgzip and many other bioinformatics tools when creating bam, sam.gz, vcf.gz and bcf files. Indeed, bam and bcf files, while not having a .gz file name extension, are in fact compressed with the same BGZF compression. Note that BGZF differs from the output of standard gzip or pigz but is compatible with it: gunzip or pigz -d can uncompress a BGZF-compressed file.
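
The compatibility between BGZF and gzip can be illustrated with a minimal sketch. It builds BGZF blocks by hand, following the block layout in the SAM/BAM specification (a gzip member whose extra field carries a "BC" subfield holding the block size), and then uncompresses them with Python's ordinary gzip module. The function names are hypothetical, and is_bgzf assumes that "BC" is the first extra subfield, which is a simplification.

```python
import gzip
import struct
import zlib


def bgzf_block(payload: bytes, level: int = 6) -> bytes:
    """Wrap payload (at most 0xFF00 bytes) in a single BGZF block."""
    if len(payload) > 0xFF00:
        raise ValueError("payload too large for one BGZF block")
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)  # raw deflate
    cdata = compressor.compress(payload) + compressor.flush()
    block_size = 18 + len(cdata) + 8  # header + data + trailer
    header = struct.pack(
        "<BBBBIBBHBBHH",
        0x1F, 0x8B,          # gzip magic number
        8,                   # CM: deflate
        4,                   # FLG: FEXTRA set
        0,                   # MTIME
        0,                   # XFL
        0xFF,                # OS: unknown
        6,                   # XLEN: length of the extra field
        ord("B"), ord("C"),  # BGZF subfield identifier
        2,                   # SLEN: subfield payload length
        block_size - 1,      # BSIZE: total block size minus one
    )
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + cdata + trailer


def is_bgzf(data: bytes) -> bool:
    """Check for a gzip header with FEXTRA whose first subfield is 'BC'."""
    return (
        len(data) >= 18
        and data[:4] == b"\x1f\x8b\x08\x04"
        and data[12:16] == b"BC\x02\x00"
    )


# Two data blocks followed by an empty block, which serves as the EOF marker.
stream = bgzf_block(b"chr1\t100\n") + bgzf_block(b"chr1\t200\n") + bgzf_block(b"")
restored = gzip.decompress(stream)  # a plain gzip reader handles BGZF
```

Because every block is an independent gzip member, a reader can start uncompressing at any block boundary, which is what makes indexing possible.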

When genounzip or genocat reconstruct a file, they usually re-compress it with BGZF. The BGZF parameters that determine the speed and strength of this re-compression are selected to balance speed, compression ratio and genounzip's CPU core scalability, and also take into account the user's preference as conveyed with the --bgzf option. Because of this, the BGZF compression of the reconstructed file may differ from that of the original file, although the underlying data is identical (see losslessness for more details). Since the BGZF compression might have changed, the index files need to be re-generated to match the new BGZF compression.
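
Why a change in BGZF compression invalidates an index can be seen in a small sketch. The bai, tbi and csi formats locate data using BGZF virtual offsets, which include the byte position of a compressed block within the file. The code below compresses the same data at two zlib levels and compares where each block would start. The records, the per-block payload size and the two levels are illustrative assumptions, not the parameters Genozip actually uses.

```python
import zlib

BLOCK_PAYLOAD = 0xFF00  # uncompressed bytes per block (assumed for illustration)
BLOCK_OVERHEAD = 26     # 18-byte BGZF header + 8-byte CRC32/ISIZE trailer


def block_start_offsets(data: bytes, level: int) -> list[int]:
    """Return the compressed-file offset at which each BGZF block would start."""
    offsets, position = [], 0
    for start in range(0, len(data), BLOCK_PAYLOAD):
        offsets.append(position)
        compressor = zlib.compressobj(level, zlib.DEFLATED, -15)  # raw deflate
        chunk = data[start:start + BLOCK_PAYLOAD]
        cdata = compressor.compress(chunk) + compressor.flush()
        position += BLOCK_OVERHEAD + len(cdata)
    return offsets


# Hypothetical tab-separated records standing in for genomic data.
records = "".join(
    f"read{i}\t{(i * 7919) % 100000}\tACGTTGCA\n" for i in range(20000)
).encode()

fast = block_start_offsets(records, level=1)
small = block_start_offsets(records, level=9)
```

Both runs hold identical data in the same number of blocks, yet the blocks begin at different byte positions, so offsets recorded against one file would point to the wrong place in the other.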

genounzip and genocat automatically generate a bai index file for bam or sam.gz files, or a tbi index file for vcf.gz files. This works for both tools, including when subsetting the file. The index is generated at the same time as uncompression and has a negligible performance overhead (substantially less than 1%).

Generating fai and gzi indices for fastq and fasta files is available through the --index option. In this case, genounzip (or genocat) simply runs samtools faidx after uncompressing, which requires samtools to be installed.

The --no-bai or --no-tbi options can be used to instruct genounzip or genocat to refrain from creating an index file.

 

genozip --skip-index is useful when compressing many files using the --tar option or a shell wildcard. It instructs genozip not to compress bai, csi, tbi, crai and gzi index files, since these files need to be re-generated after uncompression anyway. It also avoids the situation in which, when uncompressing many files together, an uncompressed old index file overwrites the fresh index file automatically generated by genounzip.
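
The selection rule described above can be illustrated with a short sketch that drops index files from a list of paths by file name extension. This is not Genozip's implementation; the function name and the example file names are hypothetical, and only the five extensions come from the text.

```python
from pathlib import PurePath

# Index file extensions named in the description of --skip-index.
INDEX_SUFFIXES = (".bai", ".csi", ".tbi", ".crai", ".gzi")


def without_index_files(paths):
    """Return the paths whose file name does not end in an index extension."""
    return [
        path
        for path in paths
        if not PurePath(path).name.lower().endswith(INDEX_SUFFIXES)
    ]


# Hypothetical directory listing, as a shell wildcard might expand it.
listing = [
    "run1/a.bam",
    "run1/a.bam.bai",
    "run1/b.vcf.gz",
    "run1/b.vcf.gz.tbi",
    "run1/c.cram",
    "run1/c.cram.crai",
]
to_compress = without_index_files(listing)
```

Only the data files remain in to_compress, so no stale index can later be uncompressed on top of a freshly generated one.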

If keeping the original index files is required, it may be possible to instruct genounzip to re-compress with exactly the same BGZF compression that the original file used, by means of the --bgzf=exact option. When this option is used, genounzip does not generate a bai or tbi file. This works only if genozip was able to identify the library used to create the original file; you can check in advance whether this is the case for particular files by using genozip --is-exactable. Note: using --bgzf=exact is slower than letting genounzip generate a new index file, so it is usually not recommended.

Finally, genozip --show-bai or --show-tbi can be used to inspect the contents of a bai or tbi file and might be useful for bioinformatics software developers.

Questions? support@genozip.com

© 2024 Genozip Limited. All rights reserved. Genozip™ is a trademark. Our technology is patent-pending. Privacy Policy.
