Matching contig names of the file to those in the reference file

Data Types

SAM, BAM, VCF, GFF3, GVF, 23andMe, and chain files

Overview

The unfortunate reality of the bioinformatics world is that contigs may appear with different names in different reference files, causing problems in analysis pipelines.

Examples:

- chr22 ⇆ 22

- M ⇆ chrM ⇆ MT ⇆ chrMT

- chr21_gl000210_random ⇆ GL000210.1

Genozip offers a command line option, --match-chrom-to-reference, that updates the contig names in a file to match those in the reference file.

Example

In the example below, the contig name 1 is updated to chr1 in both the @SQ line of the SAM header and the RNAME field (column 3) of the alignment line.

$ cat example.sam

@HD VN:1.4 SO:coordinate
@SQ SN:1 LN:249250621
A00910:85:HYGWJDSXX:2:2502:31647:7701 99 1 9997 34 10M = 10159 324 CCCTTAACCC FFFFFFFF:F NM:i:2 

$ genozip example.sam --reference hg19.fa.gz --match-chrom-to-reference
genozip example.sam : Done (1 second, SAM compression ratio: 14.4)

$ genocat example.sam.genozip

@HD VN:1.4 SO:coordinate
@SQ SN:chr1 LN:249250621
A00910:85:HYGWJDSXX:2:2502:31647:7701 99 chr1 9997 34 10M = 10159 324 CCCTTAACCC FFFFFFFF:F NM:i:2 

Fields updated

Data type    Fields updated

SAM, BAM     @SQ lines in the file header
             RNAME (column 3) and RNEXT (column 7)
             Contig names in the SA, OA and XA tags

VCF          ##contig lines in the file header
             CHROM (column 1)

Chain        qName and tName fields

GFF3, GVF    SequenceId (column 1)
             Optional chr= attribute

23andMe      Chromosome (column 2)
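
To make the SAM row of the table concrete, here is a minimal Python sketch of renaming contigs in a single alignment line. It is an illustration only and not Genozip's implementation: the function rename_sam_record and the mapping argument are hypothetical names. It assumes a tab-separated line, and that each entry in the SA, OA and XA tags starts with the contig name followed by a comma, with entries separated by semicolons.

```python
def rename_sam_record(line, mapping):
    """Rename contigs in one tab-separated SAM alignment line (sketch only).

    mapping is a hypothetical dict of old contig name -> new contig name.
    """
    fields = line.rstrip("\n").split("\t")

    # RNAME is column 3 and RNEXT is column 7 (indices 2 and 6).
    # "*" and "=" stay as they are because they are not keys of the mapping.
    for idx in (2, 6):
        fields[idx] = mapping.get(fields[idx], fields[idx])

    # Optional tags start at column 12. SA, OA and XA hold ';'-separated
    # alignments whose first ','-separated item is the contig name.
    for i in range(11, len(fields)):
        tag, typ, value = fields[i].split(":", 2)
        if tag in ("SA", "OA", "XA"):
            entries = []
            for entry in value.split(";"):
                if entry:
                    name, sep, rest = entry.partition(",")
                    entry = mapping.get(name, name) + sep + rest
                entries.append(entry)
            fields[i] = f"{tag}:{typ}:" + ";".join(entries)

    return "\t".join(fields)


# Usage with a made-up read that carries a supplementary alignment tag
record = "read1\t99\t1\t9997\t34\t10M\t=\t10159\t324\tCCCTTAACCC\tFFFFFFFF:F\tNM:i:2\tSA:Z:2,100,+,10M,30,0;"
print(rename_sam_record(record, {"1": "chr1", "2": "chr2"}))
```

The header lines (@SQ) would need the same mapping applied to their SN: field; that part is left out to keep the sketch short.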

Method of converting contig names

Each contig name in the file is looked up in the reference file, first in its unmodified form and then in variations of the name. When a match is found, the length of the contig, if known, is compared with the length in the reference to confirm that it is indeed the same contig. If no match is found in the reference file, the contig name is left unchanged.
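
The lookup described above can be sketched in Python as follows. This is a simplified illustration, not Genozip's code: match_contig, reference and candidates are hypothetical names, the reference is modelled as a plain dict of contig name to length, and the sketch assumes that a candidate whose length disagrees is simply skipped (the text does not say what happens in that case).

```python
def match_contig(name, length, reference, candidates):
    """Return the reference contig name matching `name`, or `name` unchanged.

    name       -- contig name as it appears in the file
    length     -- contig length from the file header, or None if unknown
    reference  -- hypothetical dict: reference contig name -> length
    candidates -- alternative spellings of `name` to try
    """
    # Try the unmodified name first, then each variation.
    for candidate in [name, *candidates]:
        ref_length = reference.get(candidate)
        if ref_length is None:
            continue  # this spelling is not in the reference
        # Assumption: a known length that disagrees rules the candidate out.
        if length is None or length == ref_length:
            return candidate
    # No match: the contig name is left unchanged.
    return name


# Usage with the chr1 length from the hg19 example above
reference = {"chr1": 249250621}
print(match_contig("1", 249250621, reference, ["chr1"]))  # prints chr1
print(match_contig("7", None, reference, ["chr7"]))       # prints 7
```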

The variations of the contig names considered are:

- For chromosomes named with a number (e.g. 22 or chr22), as well as X, Y, W and Z, the name is considered both with and without a lowercase chr prefix.

- NC_000001 to NC_000024 with a version number (e.g. NC_000001.10) are considered equivalent to 1, …, 22, X, Y (with or without a chr prefix).

- For the mitochondrial chromosome, four options are considered: M, MT, chrM and chrMT.

- For contigs that contain an embedded Accession Number, the following variations are considered:

Example contig name          Interpreted as…

chr4_gl383528_alt            Accession Number GL383528 version 1
chrUn_JTFH01001867v2_decoy   Accession Number JTFH01001867 version 2
GL000192.1                   Accession Number GL000192 version 1
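
The rules above can be approximated with a short Python sketch. It is an illustration under stated assumptions and not Genozip's implementation: the function names name_variations and parse_accession and the regular expressions are hypothetical and cover only the name shapes shown on this page. As in the table, a name without an explicit version is read as version 1.

```python
import re


def name_variations(name):
    """Return alternative spellings of a chromosome name (sketch only)."""
    bare = name[3:] if name.startswith("chr") else name
    variations = []

    if bare in ("M", "MT"):
        # Mitochondrial chromosome: four spellings
        variations = ["M", "MT", "chrM", "chrMT"]
    elif bare.isdigit() or bare in ("X", "Y", "W", "Z"):
        # Numbered and sex chromosomes: with and without the chr prefix
        variations = [bare, "chr" + bare]
    else:
        # NC_000001 .. NC_000024 with a version number, e.g. NC_000001.10
        m = re.fullmatch(r"NC_0000(\d\d)\.\d+", name)
        if m and 1 <= int(m.group(1)) <= 24:
            number = int(m.group(1))
            bare = {23: "X", 24: "Y"}.get(number, str(number))
            variations = [bare, "chr" + bare]

    # The unmodified name is tried separately, so leave it out here.
    return [v for v in variations if v != name]


def parse_accession(name):
    """Return (accession, version) embedded in a contig name, or None."""
    # Shapes like chr4_gl383528_alt or chrUn_JTFH01001867v2_decoy
    m = re.fullmatch(r"chr[^_]+_([A-Za-z]+\d+)(?:v(\d+))?(?:_\w+)?", name)
    if m:
        return m.group(1).upper(), int(m.group(2) or 1)
    # Shapes like GL000192.1
    m = re.fullmatch(r"([A-Za-z]+\d+)\.(\d+)", name)
    if m:
        return m.group(1).upper(), int(m.group(2))
    return None


print(name_variations("chr22"))                       # prints ['22']
print(parse_accession("chrUn_JTFH01001867v2_decoy"))  # prints ('JTFH01001867', 2)
```

Two names that parse to the same accession and version, such as chr21_gl000210_random and GL000210.1, can then be treated as the same contig.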

Questions? support@genozip.com
