Matching contig names of the file to those in the reference file
SAM, BAM, VCF, GFF3, GVF, 23andMe, and chain files
The unfortunate reality of the bioinformatics world is that the same contig may appear under different names in different reference files, causing problems in analysis pipelines. For example:
- chr22 ⇆ 22
- M ⇆ chrM ⇆ MT ⇆ chrMT
- chr21_gl000210_random ⇆ GL000210.1
Genozip offers a command line option, --match-chrom-to-reference, that updates the contigs of a file to match those of the reference file.
Notice that in the example below, the contig name 1 was updated to chr1 both in the SAM header and the 3rd column of the data line, which is the RNAME field.
$ cat example.sam
@HD VN:1.4 SO:coordinate
@SQ SN:1 LN:249250621
A00910:85:HYGWJDSXX:2:2502:31647:7701 99 1 9997 34 10M = 10159 324 CCCTTAACCC FFFFFFFF:F NM:i:2
$ genozip example.sam --reference hg19.fa.gz --match-chrom-to-reference
genozip example.sam : Done (1 second, SAM compression ratio: 14.4)
$ genocat example.sam.genozip
@HD VN:1.4 SO:coordinate
@SQ SN:chr1 LN:249250621
A00910:85:HYGWJDSXX:2:2502:31647:7701 99 chr1 9997 34 10M = 10159 324 CCCTTAACCC FFFFFFFF:F NM:i:2
Data type    Fields updated
SAM, BAM     @SQ lines in the file header
             RNAME (column 3) and RNEXT (column 7)
             SA, OA and XA tags - contig names
VCF          ##contig lines in the file header
             CHROM (column 1)
Chain        qName and tName fields
GFF3, GVF    SequenceId (column 1)
             optional chr= attribute
23andMe      Chromosome (column 2)
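To make the SAM row of the table concrete, here is a rough Python sketch of what updating those fields amounts to on tab-separated SAM text. It is an illustration only, not Genozip's implementation: the function names rename_field and rename_sam_line and the mapping dictionary are hypothetical, and the sketch assumes well-formed lines without a trailing newline. It relies on the contig name being the first comma-separated item of each semicolon-terminated entry in the SA, OA and XA tags.

```python
def rename_field(value, mapping):
    """Return the mapped contig name, or the value unchanged if it is not mapped."""
    return mapping.get(value, value)


def rename_sam_line(line, mapping):
    """Apply a {old_name: new_name} mapping to one tab-separated SAM line (hypothetical helper)."""
    fields = line.split("\t")

    # Header lines: only the SN: field of @SQ lines carries a contig name.
    if line.startswith("@"):
        if fields[0] == "@SQ":
            fields = [
                "SN:" + rename_field(f[3:], mapping) if f.startswith("SN:") else f
                for f in fields
            ]
        return "\t".join(fields)

    # Data lines: RNAME is column 3 and RNEXT is column 7 (indices 2 and 6).
    # "=" and "*" are left alone because they are not keys of the mapping.
    for col in (2, 6):
        fields[col] = rename_field(fields[col], mapping)

    # Optional fields start at column 12; rewrite the contig in SA, OA and XA.
    for i in range(11, len(fields)):
        parts = fields[i].split(":", 2)
        if len(parts) == 3 and parts[0] in ("SA", "OA", "XA") and parts[1] == "Z":
            entries = []
            for entry in parts[2].split(";"):
                contig, comma, rest = entry.partition(",")
                entries.append(rename_field(contig, mapping) + comma + rest)
            parts[2] = ";".join(entries)
            fields[i] = ":".join(parts)
    return "\t".join(fields)


mapping = {"1": "chr1"}  # illustrative mapping, e.g. derived from a reference
print(rename_sam_line("@SQ\tSN:1\tLN:249250621", mapping))
```

In practice the --match-chrom-to-reference option does this work during compression, so a sketch like this is only useful for understanding which fields are affected.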
Method of converting contig names
Each contig name that appears in the file is searched for in the reference file, both in its unmodified form and in variations of its name. Once a match is found in the reference file, the length of the contig, if known, is compared with the length of the matching reference contig to make sure it is indeed the same contig. If no match is found in the reference file, the contig name is not changed.
The variations of the contig names considered are:
- For chromosomes named with a number (e.g. chr22) as well as X, Y, W and Z, the name is considered both with and without a lower-case chr prefix.
- NC_000001 to NC_000024 with a version number (e.g. NC_000001.10) are considered equivalent to 1, …, 22, X, Y (with or without a chr prefix).
- For the mitochondrial chromosome, four options are considered: M, MT, chrM and chrMT.
- For contigs that contain an embedded Accession Number, the accession number and its version are extracted from the name and used for matching, as in these examples:
Example contig name           Interpreted as…
chr4_gl383528_alt             Accession Number GL383528 version 1
chrUn_JTFH01001867v2_decoy    Accession Number JTFH01001867 version 2
GL000192.1                    Accession Number GL000192 version 1
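The rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Genozip's actual code: the names candidate_names, parse_accession and match_contig are hypothetical, the reference is modelled as a plain dictionary of contig name to length, and the accession pattern (two letters plus six digits, or four letters plus eight digits, with an optional v or . version suffix) is inferred only from the three examples in the table. A length mismatch is treated here as "not a match", which is one plausible reading of the length check.

```python
import re

CHROMOSOME = re.compile(r"^(?:chr)?(\d+|[XYWZ])$")
MITOCHONDRIAL = ("M", "MT", "chrM", "chrMT")
REFSEQ = re.compile(r"^NC_0000(0[1-9]|1\d|2[0-4])\.\d+$")
# Assumed accession shapes, based on the examples: GL383528, JTFH01001867.
ACCESSION = re.compile(
    r"(?<![A-Za-z0-9])([A-Za-z]{2}\d{6}|[A-Za-z]{4}\d{8})(?:[v.](\d+))?(?![A-Za-z0-9])"
)


def parse_accession(name):
    """Return (ACCESSION, version) embedded in a name, or None; version defaults to 1."""
    m = ACCESSION.search(name)
    if m is None:
        return None
    return m.group(1).upper(), int(m.group(2) or 1)


def candidate_names(name):
    """Return the name itself followed by the spelling variations to try."""
    names = [name]
    chrom = CHROMOSOME.match(name)
    refseq = REFSEQ.match(name)
    if chrom:
        names += [chrom.group(1), "chr" + chrom.group(1)]
    elif refseq:
        number = int(refseq.group(1))
        base = {23: "X", 24: "Y"}.get(number, str(number))
        names += [base, "chr" + base]
    elif name in MITOCHONDRIAL:
        names += MITOCHONDRIAL
    return list(dict.fromkeys(names))  # de-duplicate, keeping order


def match_contig(name, length, reference):
    """Return the matching reference contig name, or the original name if none matches.

    length may be None when the file does not state the contig length.
    """
    for candidate in candidate_names(name):
        if candidate in reference and length in (None, reference[candidate]):
            return candidate
    accession = parse_accession(name)
    if accession is not None:
        for ref_name, ref_length in reference.items():
            if parse_accession(ref_name) == accession and length in (None, ref_length):
                return ref_name
    return name


reference = {"chr1": 249250621}  # length taken from the @SQ line in the example
print(match_contig("1", 249250621, reference))  # prints chr1
print(match_contig("1", 1000, reference))       # length differs, prints 1
```

Trying the unmodified name first means a file whose names already agree with the reference is left untouched.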