Obtaining a chain file

 

Here is a non-exhaustive list of chain files (also called LiftOver files). The creators of these chain files are not associated with Genozip in any way, and Genozip does not endorse, recommend or warrant the correctness of any particular file.

 

From Ensembl

From UCSC

Chain file format

 

The file format of chain files can be found here: UCSC chain file format.
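
As a rough illustration of that format, the sketch below parses a single chain header line. The field order (score, then target name, size, strand, start, end, then the same for the query, then an id) follows the UCSC description. The ChainHeader type, the parse_chain_header function and the example line are hypothetical and are not part of Genozip.

```python
from typing import NamedTuple


class ChainHeader(NamedTuple):
    score: int
    t_name: str
    t_size: int
    t_strand: str
    t_start: int
    t_end: int
    q_name: str
    q_size: int
    q_strand: str
    q_start: int
    q_end: int
    chain_id: str


def parse_chain_header(line: str) -> ChainHeader:
    """Parse one 'chain ...' header line of a UCSC-style chain file."""
    fields = line.split()
    if len(fields) != 13 or fields[0] != "chain":
        raise ValueError(f"not a chain header line: {line!r}")
    (_, score, t_name, t_size, t_strand, t_start, t_end,
     q_name, q_size, q_strand, q_start, q_end, chain_id) = fields
    return ChainHeader(int(score), t_name, int(t_size), t_strand,
                       int(t_start), int(t_end),
                       q_name, int(q_size), q_strand,
                       int(q_start), int(q_end), chain_id)


# Made-up header line, for illustration only (not taken from a real chain file)
example = "chain 1000 1 5000 + 100 900 chr1 6000 + 150 950 1"
header = parse_chain_header(example)
print(header.t_name, header.t_start, header.t_end)
```

The alignment data lines that follow each header are not handled here.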

--chain: Chain files and Genozip

 

Chain files are compressible with Genozip:

 

genozip GRCh37_to_GRCh38.chain

To make the compressed chain file usable with genozip --chain for lifting a VCF file, two reference files are required: the reference genome in Primary coordinates and the reference genome in Luft coordinates. For example:

genozip GRCh37_to_GRCh38.chain --reference hs37d5.ref.genozip --reference GRCh38_full_analysis_set_plus_decoy_hla.ref.genozip

The reference files themselves should be prepared from the corresponding FASTA file, for example:

genozip --make-reference hs37d5.fa.gz # outputs hs37d5.ref.genozip

--match-chrom-to-reference: Fix mismatching contigs

An unfortunate reality of the current state of bioinformatics is that multiple references exist with differing names for the same contigs. For example, some human references have a contig called “22” while others call the same contig “chr22”. Likewise, chain files may map between any combination of these options. Moreover, a chain file may refer to contigs that are absent from the specific reference used.

To address this problem, Genozip provides the option --match-chrom-to-reference, which updates contig names to match those in the specified references and removes alignments of contigs that don’t appear in the references:

 

genozip GRCh37_to_GRCh38.chain --match-chrom-to-reference --reference hs37d5.ref.genozip --reference GRCh38_full_analysis_set_plus_decoy_hla.ref.genozip
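
The kind of renaming and filtering this option performs can be pictured with a small Python sketch. It is not Genozip's implementation: it assumes the only naming difference is a "chr" prefix, and the match_contig function and the contig lists are hypothetical.

```python
from typing import Optional


def match_contig(name: str, reference_contigs: set[str]) -> Optional[str]:
    """Return the reference's name for `name`, or None if no variant is present."""
    candidates = [name]
    if name.startswith("chr"):
        candidates.append(name[3:])      # "chr22" -> "22"
    else:
        candidates.append("chr" + name)  # "22" -> "chr22"
    for candidate in candidates:
        if candidate in reference_contigs:
            return candidate
    return None


reference = {"chr1", "chr22", "chrX"}      # hypothetical reference contig names
chain_contigs = ["1", "22", "GL000192.1"]  # hypothetical chain file contigs

renamed = {c: match_contig(c, reference) for c in chain_contigs}
# Contigs with no counterpart in the reference are dropped
kept = {c: r for c, r in renamed.items() if r is not None}
print(kept)  # {'1': 'chr1', '22': 'chr22'}
```

Real references may differ in other ways as well (for example, mitochondrial contig names), which this sketch does not attempt to cover.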

--show-chain: Viewing chain file alignments

The chain file format is purposely dense, and hence hard to understand visually. To view the alignments contained in a chain file in a human-friendly format, use --show-chain:

$ genocat --show-chain GRCh37_to_GRCh38.chain.genozip

 

##fileformat=GENOZIP-CHAIN

#ALN_I PRIM_CONTIG PRIM_START PRIM_END LUFT_CONTIG LUFT_START LUFT_ENDS XSTRAND ALN_OVERLAP

1 1 10001   177417  chr1 10001   177417  -

2 1 227418  267719  chr1 257667  297968  -

3 1 317720  471368  chr1 347969  501617  X

4 1 521369  1566075 chr1 585989  1630695 -

5 1 1566076 1569784 chr1 1630697 1634405 -

6 1 1569785 1570918 chr1 1634409 1635542 -

7 1 1570919 1570922 chr1 1635547 1635550 -

8 1 1570923 1574299 chr1 1635561 1638937 -

9 1 1574300 1583669 chr1 1638939 1648308 -

Column 1 is the sequential number of this alignment.

 

Columns 2, 3 and 4 are coordinates in the Primary reference genome, and columns 5, 6 and 7 are coordinates in the Luft reference genome. Note that these are 1-based coordinates (as in VCF), whereas the chain file format uses 0-based coordinates.

Column 8 contains an X if the range in the Luft reference is reverse complemented.

Column 9 contains the ALN_I of another alignment whose Luft range overlaps this alignment’s Luft range, for cases in which multiple alignments have overlapping Luft ranges. If this alignment’s Luft range overlaps the Luft ranges of more than one other alignment, the column contains only one of them.
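
For downstream scripting, the --show-chain output can be read into records with a few lines of Python. The sketch below is an illustration only. It assumes whitespace-separated columns, that header lines start with "#", and that an empty column 9 is simply absent from the row. The Alignment type and both functions are hypothetical names. The to_zero_based helper assumes the usual UCSC 0-based, half-open convention.

```python
from typing import NamedTuple, Optional


class Alignment(NamedTuple):
    aln_i: int
    prim_contig: str
    prim_start: int   # 1-based, inclusive
    prim_end: int     # 1-based, inclusive
    luft_contig: str
    luft_start: int
    luft_end: int
    xstrand: bool               # True if column 8 is "X"
    aln_overlap: Optional[int]  # column 9, if present


def parse_show_chain(text: str) -> list[Alignment]:
    alignments = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and the '##' / '#' header lines
        overlap = int(fields[8]) if len(fields) > 8 else None
        alignments.append(Alignment(
            int(fields[0]), fields[1], int(fields[2]), int(fields[3]),
            fields[4], int(fields[5]), int(fields[6]),
            fields[7] == "X", overlap))
    return alignments


def to_zero_based(start: int, end: int) -> tuple[int, int]:
    """Convert a 1-based inclusive range to a 0-based half-open range."""
    return start - 1, end


sample = """\
##fileformat=GENOZIP-CHAIN
#ALN_I PRIM_CONTIG PRIM_START PRIM_END LUFT_CONTIG LUFT_START LUFT_ENDS XSTRAND ALN_OVERLAP
1 1 10001   177417  chr1 10001   177417  -
3 1 317720  471368  chr1 347969  501617  X
"""
rows = parse_show_chain(sample)
print(rows[1].xstrand)
print(to_zero_based(rows[0].prim_start, rows[0].prim_end))
```

For rows marked X, a chain file may express the range relative to the reverse strand, so to_zero_based alone is not expected to reproduce the values stored in the chain file for those rows.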

--show-chain-contigs: Viewing chain file contig information

 

$ genocat GRCh37_to_GRCh38.chain.genozip --show-chain-contigs

 

PRIMARY chain file contigs that also exist in the reference file:

PRIMARY 1 length=249250621

PRIMARY 2 length=243199373

PRIMARY 3 length=198022430

PRIMARY 4 length=191154276

PRIMARY 5 length=180915260

 

LUFT chain file contigs that also exist in the reference file:

LUFT chr1 length=248956422

LUFT chr2 length=242193529

LUFT chr3 length=198295559

LUFT chr4 length=190214555

LUFT chr5 length=181538259

Note: All contigs of the respective reference files are shown, even if not referred to in the chain data.
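
The contig listing can likewise be collected into a lookup table. This is a minimal sketch that assumes each data line has exactly the three fields shown above (PRIMARY or LUFT, the contig name, and length=N). The parse_chain_contigs function is a hypothetical helper, not part of Genozip.

```python
def parse_chain_contigs(text: str) -> dict[str, dict[str, int]]:
    """Map 'PRIMARY' / 'LUFT' to {contig name: length}."""
    contigs: dict[str, dict[str, int]] = {"PRIMARY": {}, "LUFT": {}}
    for line in text.splitlines():
        fields = line.split()
        # Data lines look like: PRIMARY 1 length=249250621
        # The longer heading lines are skipped by the field-count check.
        if (len(fields) == 3 and fields[0] in contigs
                and fields[2].startswith("length=")):
            contigs[fields[0]][fields[1]] = int(fields[2][len("length="):])
    return contigs


sample_contigs = """\
PRIMARY chain file contigs that also exist in the reference file:
PRIMARY 1 length=249250621
PRIMARY 2 length=243199373

LUFT chain file contigs that also exist in the reference file:
LUFT chr1 length=248956422
LUFT chr2 length=242193529
"""
contig_lengths = parse_chain_contigs(sample_contigs)
print(contig_lengths["LUFT"]["chr1"])  # 248956422
```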

Questions? support@genozip.com
