Matching contig names of the file to those in the reference file
(Feature discontinued as of version 15.0.60)
Data Types
SAM, BAM, CRAM, VCF, GFF/GTF/GVF and 23andMe files
Overview
The unfortunate reality of the bioinformatics world is that contigs may appear with different names in different reference files, causing problems in analysis pipelines.
Examples:
- chr22 ⇆ 22
- M ⇆ chrM ⇆ MT ⇆ chrMT
- chr21_gl000210_random ⇆ GL000210.1
Genozip offers a command line option, --match-chrom-to-reference, that updates the contig names in a file to match those of the reference file.
Example
Notice that in the example below, the contig name 1 was updated to chr1 both in the SAM header and the 3rd column of the data line, which is the RNAME field.
$ cat example.sam
@HD VN:1.4 SO:coordinate
@SQ SN:1 LN:249250621
A00910:85:HYGWJDSXX:2:2502:31647:7701 99 1 9997 34 10M = 10159 324 CCCTTAACCC FFFFFFFF:F NM:i:2
$ genozip example.sam --reference hg19.fa.gz --match-chrom-to-reference
genozip example.sam : Done (1 second, SAM compression ratio: 14.4)
$ genocat example.sam.genozip
@HD VN:1.4 SO:coordinate
@SQ SN:chr1 LN:249250621
A00910:85:HYGWJDSXX:2:2502:31647:7701 99 chr1 9997 34 10M = 10159 324 CCCTTAACCC FFFFFFFF:F NM:i:2
Fields updated
SAM, BAM and CRAM:
- @SQ lines in the file header
- RNAME (column 3) and RNEXT (column 7)
- Contig names in the SA, OA and XA tags
VCF:
- ##contig lines in the file header
- CHROM (column 1)
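To make the SAM field locations concrete, here is a minimal Python sketch that applies a name mapping to one header line and one alignment line. It is an illustration only and not Genozip's implementation. The helper names rename_sq_line and rename_alignment_line and the mapping dictionary are hypothetical. The sketch assumes well-formed, tab-separated SAM text, and assumes that in SA, OA and XA values the contig name is the first comma-separated item of each semicolon-separated entry.

```python
def rename_sq_line(line, mapping):
    """Rename the SN: value of an @SQ header line; other lines pass through."""
    fields = line.split("\t")
    if fields[0] != "@SQ":
        return line
    return "\t".join(
        "SN:" + mapping.get(f[3:], f[3:]) if f.startswith("SN:") else f
        for f in fields
    )


def rename_alignment_line(line, mapping):
    """Rename RNAME, RNEXT and the contig names inside SA/OA/XA tags."""
    fields = line.split("\t")
    for col in (2, 6):  # 0-based indexes of RNAME (column 3) and RNEXT (column 7)
        fields[col] = mapping.get(fields[col], fields[col])  # "=" and "*" stay as-is
    for i in range(11, len(fields)):  # optional TAG:TYPE:VALUE fields
        tag, typ, value = fields[i].split(":", 2)
        if tag in ("SA", "OA", "XA") and typ == "Z":
            entries = []
            for entry in value.split(";"):
                if entry:  # skip the empty piece after a trailing ";"
                    contig, sep, rest = entry.partition(",")
                    entry = mapping.get(contig, contig) + sep + rest
                entries.append(entry)
            fields[i] = f"{tag}:{typ}:" + ";".join(entries)
    return "\t".join(fields)


mapping = {"1": "chr1"}  # hypothetical result of the name matching
print(rename_sq_line("@SQ\tSN:1\tLN:249250621", mapping))
print(rename_alignment_line(
    "read1\t99\t1\t9997\t34\t10M\t=\t10159\t324\tCCCTTAACCC\tFFFFFFFF:F"
    "\tNM:i:2\tSA:Z:1,500,+,10M,60,0;",
    mapping,
))
```

The VCF case would follow the same pattern, with the ##contig header lines and the first column taking the place of @SQ and RNAME.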
Method of converting contig names
Each contig name that appears in the file is looked up in the reference file, both in its unmodified form and as variations of its name. When a match is found in the reference file, the length of the contig, if known, is compared with the length in the reference to confirm that it is indeed the same contig. If no match is found in the reference file, the contig name is left unchanged.
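The lookup described above can be sketched in Python as follows. This is a hedged illustration, not Genozip's code: the function match_contig, the reference dictionary (contig name to length) and the toggle_chr helper are all hypothetical, and treating a length mismatch as "no match" is an assumption drawn from the description.

```python
def match_contig(name, length, reference, variations):
    """Return the reference's name for a contig, or the original name if unmatched.

    name       -- contig name as it appears in the file
    length     -- contig length from the file header, or None if unknown
    reference  -- dict mapping reference contig names to their lengths
    variations -- function returning alternative spellings of a name
    """
    for candidate in [name, *variations(name)]:
        ref_length = reference.get(candidate)
        if ref_length is None:
            continue  # this spelling does not exist in the reference
        if length is None or length == ref_length:
            return candidate  # name found and length (if known) agrees
    return name  # no match: leave the contig name unchanged


def toggle_chr(name):
    """Toy variation generator: add or remove a lowercase chr prefix."""
    return [name[3:]] if name.startswith("chr") else ["chr" + name]


reference = {"chr1": 249250621}  # hypothetical reference contents

print(match_contig("1", 249250621, reference, toggle_chr))  # chr1
print(match_contig("1", 1000, reference, toggle_chr))       # 1 (length differs)
print(match_contig("7", None, reference, toggle_chr))       # 7 (not in reference)
```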
The variations of the contig names considered are:
- For chromosomes with a numeric name (e.g. 22 or chr22), as well as X, Y, W and Z, the name is considered both with and without a lowercase chr prefix.
- NC_000001 to NC_000024 with a version number (e.g. NC_000001.10) are considered equivalent to 1, …, 22, X, Y (with or without a chr prefix).
- For the mitochondrial chromosome, four options are considered: M, MT, chrM and chrMT.
- For contigs that contain an embedded Accession Number, the following variations are considered:
  - chr4_gl383528_alt is interpreted as Accession Number GL383528, version 1
  - chrUn_JTFH01001867v2_decoy is interpreted as Accession Number JTFH01001867, version 2
  - GL000192.1 is interpreted as Accession Number GL000192, version 1
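The variation rules listed above can be approximated with a short Python sketch. It is not Genozip's implementation: the function names name_variations and parse_accession are hypothetical, the regular expressions are assumptions chosen to fit the examples shown, and a missing version is assumed to mean version 1, as in the chr4_gl383528_alt example.

```python
import re

MITO = ["M", "MT", "chrM", "chrMT"]
SEX = {"X", "Y", "W", "Z"}


def name_variations(name):
    """Return alternative spellings of a chromosome name (may be empty)."""
    if name in MITO:
        return [n for n in MITO if n != name]
    has_chr = name.startswith("chr")
    bare = name[3:] if has_chr else name
    if bare.isdigit() or bare in SEX:
        return [bare if has_chr else "chr" + bare]
    # RefSeq names NC_000001 to NC_000024 followed by a version number
    m = re.fullmatch(r"NC_0000(\d\d)\.\d+", name)
    if m and 1 <= int(m.group(1)) <= 24:
        n = int(m.group(1))
        bare = {23: "X", 24: "Y"}.get(n, str(n))
        return [bare, "chr" + bare]
    return []


def parse_accession(name):
    """Return (accession, version) embedded in a contig name, or None."""
    for token in name.split("_"):
        # Letters followed by digits, optionally followed by v<N> or .<N>
        m = re.fullmatch(r"([A-Za-z]{2,}\d{5,})(?:[v.](\d+))?", token)
        if m:
            version = int(m.group(2)) if m.group(2) else 1  # assumed default
            return m.group(1).upper(), version
    return None


print(name_variations("chr22"))         # ['22']
print(name_variations("MT"))            # ['M', 'chrM', 'chrMT']
print(name_variations("NC_000001.10"))  # ['1', 'chr1']
print(parse_accession("chr4_gl383528_alt"))           # ('GL383528', 1)
print(parse_accession("chrUn_JTFH01001867v2_decoy"))  # ('JTFH01001867', 2)
print(parse_accession("GL000192.1"))                  # ('GL000192', 1)
```

Two names whose parsed accession and version agree, such as chr21_gl000210_random and GL000210.1, would then be treated as candidates for the same contig.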
Questions? support@genozip.com