At a glance
Dual coordinates VCF files (or DVCFs), are VCF files that contain coordinates in two coordinate systems concurrently, for example, GRCh37 and GRCh38. A DVCF can be rendered in either Primary coordinates or Luft coordinates (Luft being our made-up past participle of Lift).
Both renditions are standard-compliant VCF files, and both contain precisely the same information, with the only difference being how the information is presented.
Crucially, since both renditions contain all the information needed for rendering the file in both coordinate systems, the file in either rendition can be processed through any bioinformatics tool or pipeline, and still maintain its dual coordinates. In other words, an analytics pipeline may now have some steps that operate on the file in one coordinate system, and other steps operate in the other coordinate system, and the same data can flow through these steps seamlessly.
An example - (1) using --chain to create a DVCF
Let’s walk through an example:
Consider this small VCF file, called mydata.vcf.gz, which is in the coordinates defined by GRCh37, a popular Homo Sapiens reference:
##fileformat=VCFv4.2
##reference=file:///references/grch37/reference.bin
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles"> ##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=OBSCURE,Number=1,Type=Float,Description="Obscure annotation",RendAlg="A_1" ##contig=<ID=1,length=249250621>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Person1 Person2
1 10285 . T C 4.4 PASS AC=3;AN=4 GT:AD:AF:PL 0/1:31,18:0.367:37,0,46 1/1
1 329162 . A T 4.6 PASS AC=3;AN=4 GT:AD:AF:PL 0/1:28,9:0.3:36,0,0 1/1
1 366042 . TA A 100 PASS . GT 1|0 0|0 1 20159588 . C A 100 PASS . GT:AD:AF:PL 0/1:28,9:0.3:36,0,0 1/1
We shall now convert this file to a Dual-coordinate VCF, containing both GRCh37 coordinates, as well as coordinates from another popular reference, GRCh38. This process is called lifting. We shall call GRCh37 the Primary coordinates, and GRCh38 the Luft coordinates.
Steps:
1. Prepare two reference files in genozip format: for Primary and Luft coordinates. The following results in the files named GRCh37.ref.genozip and GRCh38.ref.genozip
genozip --make-reference GRCh37.fa
genozip --make-reference GRCh38.fa
2. Prepare a chain file in genozip format. Genozip uses chain files in the UCSC chain file format. Links to some chain files can be found here. The following command line generates the file GRCh37_to_GRCh38.chain.genozip :
genozip GRCh37_to_GRCh38.chain.gz --reference GRCh37.ref.genozip --reference GRCh38.ref.genozip
Note that the first two steps are preparation steps that need to be executed only once.
3. Now that we have the chain file, we can convert any number of VCF files to DVCF:
genozip mydata.vcf.gz --chain GRCh37_to_GRCh38.chain.genozip --add-line-numbers
Note that as usual with genozip, .vcf .vcf.gz .vcf.bz2 and .vcf.xz are all accepted. Genozip does not require the VCF file to be indexed.
An example - (2) rendering the DVCF in Primary and Luft coordinates
Now, we have a dual-coordinate VCF file: mydata.vcf.genozip. We can render it in either coordinate:
Rendering in Primary coordinates (GRCh37), followed by Luft coordinates (GRCh38):
$ genocat mydata.d.vcf.genozip | tee mydata.37.vcf ##fileformat=VCFv4.2
##original_reference=file:///references/grch37/reference.bin
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles",RendAlg="R"
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed",RendAlg="A_1"
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype",RendAlg="GT"
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes",RendAlg="G"
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele",RendAlg="A_AN"
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes",RendAlg="NONE"
##INFO=<ID=OBSCURE,Number=1,Type=Float,Description="Obscure annotation",RendAlg="A_1"
##contig=<ID=1,length=249250621>
##dual_coordinates=PRIMARY
##chain=file:///C/Users/USER/projects/genozip/data/GRCh37_to_GRCh38.chain.genozip ##luft_reference=file://data/GRCh38_full_analysis_set_plus_decoy_hla.ref.genozip
##reference=file://data/hs37d5.ref.genozip
##INFO=<ID=LUFT,Number=4,Type=String,Description="Info for rendering variant in LUFT coords. See https://genozip.com/dvcf.html",Source="genozip",Version="12.0.25",RendAlg="NONE"
##INFO=<ID=PRIM,Number=4,Type=String,Description="Info for rendering variant in PRIMARY coords",Source="genozip",Version="12.0.25",RendAlg="NONE"
##INFO=<ID=Lrej,Number=1,Type=String,Description="Reason variant was rejected for LUFT coords",Source="genozip",Version="12.0.25",RendAlg="NONE"
##INFO=<ID=Prej,Number=1,Type=String,Description="Reason variant was rejected for PRIMARY coords",Source="genozip",Version="12.0.25",RendAlg="NONE"
##luft_contig=<ID=chr1,length=248956422>
##genozip_command="C:\Users\USER\projects\genozip\genocat-debug.exe -o docs/dvcf-example-files/mydata.37.vcf -f docs/dvcf-example-files/mydata.d.vcf.genozip" 2021-06-15 11:36:11 Cen. Australia Daylight Time
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Person1 Person2 1 10285 LN=1 T C 4.4 PASS AC=3;AN=4;LUFT=chr1,10285,T,- GT:AD:AF:PL 0/1:31,18:0.367:37,0,46 1/1 1 329162 LN=2 A T 4.6 PASS AC=3;AN=4;LUFT=chr1,490175,T,X GT:AD:AF:PL 0/1:28,9:0.3:36,0,0 1/1 1 366042 LN=3 TA A 100 PASS Lrej=RefNewAlleleIndel GT 1|0 0|0 1 20159588 LN=4 C A 100 PASS AC=3;AN=4;AF=0.75;OBSCURE=0.002;LUFT=chr1,19833095,A,- GT:AD:AF:PL 0/1:28,9:0.3:36,0,0 1/1
$ genocat mydata.d.vcf.genozip --luft | tee mydata.38.vcf
##fileformat=VCFv4.2
##primary_only=1 366042 LN=1 TA A 100 PASS Lrej=RefNewAlleleIndel GT 1|0 0|0
##original_reference=file:///references/grch37/reference.bin
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles",RendAlg="R"
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed",RendAlg="A_1"
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype",RendAlg="GT"
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes",RendAlg="G"
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele",RendAlg="A_AN"
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes",RendAlg="NONE"
##INFO=<ID=OBSCURE,Number=1,Type=Float,Description="Obscure annotation",RendAlg="A_1"
##primary_contig=<ID=1,length=249250621>
##dual_coordinates=LUFT
##chain=file:///C/Users/USER/projects/genozip/data/GRCh37_to_GRCh38.chain.genozip ##reference=file://data/GRCh38_full_analysis_set_plus_decoy_hla.ref.genozip
##primary_reference=file://data/hs37d5.ref.genozip
##INFO=<ID=LUFT,Number=4,Type=String,Description="Info for rendering variant in LUFT coords. See https://genozip.com/dvcf.html",Source="genozip",Version="12.0.25",RendAlg="NONE"
##INFO=<ID=PRIM,Number=4,Type=String,Description="Info for rendering variant in PRIMARY coords",Source="genozip",Version="12.0.25",RendAlg="NONE"
##INFO=<ID=Lrej,Number=1,Type=String,Description="Reason variant was rejected for LUFT coords",Source="genozip",Version="12.0.25",RendAlg="NONE"
##INFO=<ID=Prej,Number=1,Type=String,Description="Reason variant was rejected for PRIMARY coords",Source="genozip",Version="12.0.25",RendAlg="NONE"
##contig=<ID=chr1,length=248956422>
##genozip_command="C:\Users\USER\projects\genozip\genocat-debug.exe -vo docs/dvcf-example-files/mydata.38.vcf -f docs/dvcf-example-files/mydata.d.vcf.genozip" 2021-06-15 11:36:09 Cen. Australia Daylight Time
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Person1 Person2 chr1 10285 LN=1 T C 4.4 PASS AC=3;AN=4;PRIM=1,10285,T,- GT:AD:AF:PL 0/1:31,18:0.367:37,0,46 1/1 chr1 490175 LN=2 T A 4.6 PASS AC=3;AN=4;PRIM=1,329162,A,X GT:AD:AF:PL 0/1:28,9:0.3:36,0,0 1/1 chr1 19833095 LN=4 A C 100 PASS AC=1;AN=4;AF=0.25;OBSCURE=0.998;PRIM=1,20159588,C,- GT:AD:AF:PL 1/0:9,28:0.7:0,0,36 0/0
So what do we see here?
The first thing to appreciate, is that both mydata.37.vcf and mydata.38.vcf contain precisely the same information. In other words, you can genozip either file, and then render it again in both Primary and Luft coordinates.
Notice the ID field contains a sequential line number (starting with LN=). This is a result of the genozip
--add-line-numbers option and it will make it easier for us to correspond variants in the Primary rendition to the Luft rendition.
Take a look at the first file - the Primary rendition - at the variant with LN=4. The INFO/LUFT field contains the information of the Luft coordinate, which, in this case, is mapped to the Luft reference at CHROM=chr1 POS=19833095, and the REF (the third value in INFO/LUFT) is A - which would be change from the Primary (GRCh37) REF that is C, and identical to the Primary ALT which is an A. This is common situation which we call a REF⇆ALT switch.
Now take a look at the second file - the Luft rendition - at LN=4. Indeed, the REF and ALT have switched - and also the INFO/AC, INFO/AF,INFO/OBSCURE, FORMAT/GT, FORMAT/AD, FORMAT/AF and FORMAT/PL fields have changed too, to reflect this REF⇆ALT switch. In addition, the INFO/PRIM field contains the Primary coordinates of this variant.
The algorithms determining how to cross-rendered each INFO and FORMAT field, are defined in the RendAlg attribute of the ##INFO and ##FORMAT meta-information lines, and you may modify them as you wish. Notice, for example, that INFO/OBSCURE, an annotation made up for the purpose of this example, was given a RendAlg=A_1 in the original mydata.vcf.gz file, an algorithm that cross-renders the value to (1-value) in case of a REF⇆ALT switch. To learn more about RendAlg options, see Rendering a DVCF. Notice that we don’t bother setting RendAlg for the other INFO and FORMAT fields in mydata.vcf.gz as they are all well known annotations for which Genozip’s RendAlg choices behave as expected. These default choices can be seen in the ##INFO and ##FORMAT meta-information lines of both renditions.
In contrast, take a look at variant LN=2. In the Primary rendition has REF=A ALT=T and it in the Luft rendition it has REF=T ALT=A. This appears to be a REF⇆ALT switch, however, the INFO and FORMAT fields were not cross-rendered. This is because this variant is actually not a REF⇆ALT switch - rather, the chain file mapped the alignment on which this variant resides to the reverse strand in the Luft reference, as indicated by the X in the fourth value of the INFO/PRIM field. The nominal base change is due to the required reverse-complementing of REF and ALT (as in the VCF file format variants are always shown in forward strand terms), and this variant has actually not changed between the references.
Let us now take a look the variant with LN=3. It has an INFO/Lrej field indicating that Genozip cannot render it in Luft coordinates and hence it is a Primary-only variant (or in other words, it was rejected from lifting). The reason given in this case is RefNewAlleleIndel - which means that neither the REF nor the ALT of the Primary indel variant match the Luft reference at this position. There are many reasons lifting can fail - they are listed in the table below.
In the Luft rendition, this Primary-only variant, LN=3, is indeed missing in the VCF data lines, which now contain only three variants. However, notice that it still survives as a meta-information line with the key ##primary_only.
An example - (3) applying bioinformatics tools to a DVCF rendition
Now, let us take our Luft rendition VCF mydata.38.vcf and merge it (using bcftools merge in this example) with another VCF file, person3.vcf, which is in GRCh38 coordinates and has a single sample, Person3. Note that person3.vcf is not a DVCF. person3.vcf contains 2 variants on chr1: LN=1 (also present in mydata.38.vcf), and LN=5 (not present in mydata.38.vcf):
> cat person3.vcf ##fileformat=VCFv4.2
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Phred-scaled posterior probabilities for genotypes as defined in the VCF specification">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Person3
chr1 10285 LN=1 T C 100 PASS AC=1;AN=2 GT:AD 1/0:20,15
chr1 22000000 LN=5 A C 4 PASS AC=0;AN=2 GT:AD:GP 0/0:30,3:37.9,14.9,0.14
Notice that Person3 has the field FORMAT/GP in variant LN=5 that doesn’t exist in mydata.38.vcf.
The merged file will be a DVCF file looking like this:
> bgzip person3.vcf
> bgzip mydata.38.vcf
> bcftools index mydata.38.vcf.gz
> bcftools index person3.vcf.gz
> bcftools merge mydata.38.vcf.gz person3.vcf.gz | tee merged.38.vcf
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##primary_only=1 366042 LN=1 TA A 100 PASS Lrej=RefNewAlleleIndel GT 1|0 0|0
##original_reference=file:///references/grch37/reference.bin
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles",RendAlg="R"
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed",RendAlg="A_1"
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype",RendAlg="GT"
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes",RendAlg="G"
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele",RendAlg="A_AN"
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes",RendAlg="NONE"
##INFO=<ID=OBSCURE,Number=1,Type=Float,Description="Obscure annotation",RendAlg="A_1"
##primary_contig=<ID=1,length=249250621>
##dual_coordinates=LUFT
##chain=file:///C/Users/USER/projects/genozip/data/GRCh37_to_GRCh38.chain.genozip ##reference=file://data/GRCh38_full_analysis_set_plus_decoy_hla.ref.genozip ##primary_reference=file://data/hs37d5.ref.genozip
##INFO=<ID=LUFT,Number=4,Type=String,Description="Info for rendering variant in LUFT coords. See https://genozip.com/dvcf.html",Source="genozip",Version="12.0.25",RendAlg="NONE"
##INFO=<ID=PRIM,Number=4,Type=String,Description="Info for rendering variant in PRIMARY coords",Source="genozip",Version="12.0.25",RendAlg="NONE"
##INFO=<ID=Lrej,Number=1,Type=String,Description="Reason variant was rejected for LUFT coords",Source="genozip",Version="12.0.25",RendAlg="NONE"
##INFO=<ID=Prej,Number=1,Type=String,Description="Reason variant was rejected for PRIMARY coords",Source="genozip",Version="12.0.25",RendAlg="NONE"
##contig=<ID=chr1,length=248956422>
##genozip_command="C:\Users\USER\projects\genozip\genocat-debug.exe -vo docs/dvcf-example-files/mydata.38.vcf -f docs/dvcf-example-files/mydata.d.vcf.genozip" 2021-06-15 11:36:09 Cen. Australia Daylight Time
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Phred-scaled posterior probabilities for genotypes as defined in the VCF specification">
##bcftools_mergeVersion=1.11+htslib-1.11
##bcftools_mergeCommand=merge mydata.38.vcf.gz person3.vcf.gz; Date=Tue Jun 15 12:13:51 2021
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Person1 Person2 Person3
chr1 10285 LN=1 T C 100 PASS PRIM=1,10285,T,-;AN=6;AC=4 GT:AD:AF:PL 0/1:31,18:0.367:37,0,46 1/1:.:.:. 1/0:20,15:.:.
chr1 490175 LN=2 T A 4.6 PASS PRIM=1,329162,A,X;AN=4;AC=3 GT:AD:AF:PL 0/1:28,9:0.3:36,0,0 1/1:.:.:. ./.:.:.:.
chr1 19833095 LN=4 A C 100 PASS AF=0.25;OBSCURE=0.998;PRIM=1,20159588,C,-;AN=4;AC=1 GT:AD:AF:PL 1/0:9,28:0.7:0,0,36 0/0:.:.:. ./.:.:.:.
chr1 22000000 LN=5 A C 4 PASS AN=2;AC=0 GT:AD:GP ./.:.:. ./.:.:. 0/0:30,3:37.9,14.9,0.14
The merged file, as expected, is in Luft (i.e. GRCh38) coordinates, and the new variant, LN=5, has no DVCF INFO tag (i.e. no INFO/PRIM).
An example - (4) the modified file is still a DVCF
Now let us genozip the merged file and render it first in Luft coordinates:
> genozip merged.38.vcf
genozip merged.38.vcf : 0%
FYI: the number of samples in variant CHROM=1 POS=366043 is 2, different than the VCF column header line which has 3 samples Done (0 seconds)
> genocat merged.38.vcf.genozip --luft
##fileformat=VCFv4.2
##primary_only=1 366042 LN=1 TA A 100 PASS Lrej=RefNewAlleleIndel GT 1|0 0|0
##FILTER=<ID=PASS,Description="All filters passed"> ##original_reference=file:///references/grch37/reference.bin
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles",RendAlg="R"
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed",RendAlg="A_1"
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype",RendAlg="GT"
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes",RendAlg="G"
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele",RendAlg="A_AN"
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes",RendAlg="NONE"
##INFO=<ID=OBSCURE,Number=1,Type=Float,Description="Obscure annotation",RendAlg="A_1"
##primary_contig=<ID=1,length=249250621>
##dual_coordinates=LUFT
##chain=file:///C/Users/USER/projects/genozip/data/GRCh37_to_GRCh38.chain.genozip
##reference=file://data/GRCh38_full_analysis_set_plus_decoy_hla.ref.genozip
##primary_reference=file://data/hs37d5.ref.genozip
##INFO=<ID=LUFT,Number=4,Type=String,Description="Info for rendering variant in LUFT coords. See https://genozip.com/dvcf.html",Source="genozip",Version="12.0.25",RendAlg="NONE"
##INFO=<ID=PRIM,Number=4,Type=String,Description="Info for rendering variant in PRIMARY coords",Source="genozip",Version="12.0.25",RendAlg="NONE"
##INFO=<ID=Lrej,Number=1,Type=String,Description="Reason variant was rejected for LUFT coords",Source="genozip",Version="12.0.25",RendAlg="NONE"
##INFO=<ID=Prej,Number=1,Type=String,Description="Reason variant was rejected for PRIMARY coords",Source="genozip",Version="12.0.25",RendAlg="NONE"
##contig=<ID=chr1,length=248956422> ##genozip_command="C:\Users\USER\projects\genozip\genocat-debug.exe -vo docs/dvcf-example-files/mydata.38.vcf -f docs/dvcf-example-files/mydata.d.vcf.genozip" 2021-06-15 11:36:09 Cen. Australia Daylight Time
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Phred-scaled posterior probabilities for genotypes as defined in the VCF specification",RendAlg="G"
##bcftools_mergeVersion=1.11+htslib-1.11 ##bcftools_mergeCommand=merge mydata.38.vcf.gz person3.vcf.gz; Date=Tue Jun 15 12:13:51 2021
##genozip_command="C:\Users\USER\projects\genozip\genocat-debug.exe -vo ../../junk.vcf docs/dvcf-example-files/merged.38.vcf.genozip" 2021-06-15 12:18:35 Cen. Australia Daylight Time
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Person1 Person2 Person3
chr1 10285 LN=1 T C 100 PASS AN=6;AC=4;PRIM=1,10285,T,- GT:AD:AF:PL 0/1:31,18:0.367:37,0,46 1/1:.:.:. 1/0:20,15:.:.
chr1 490175 LN=2 T A 4.6 PASS AN=4;AC=3;PRIM=1,329162,A,X GT:AD:AF:PL 0/1:28,9:0.3:36,0,0 1/1:.:.:. ./.:.:.:.
chr1 19833095 LN=4 A C 100 PASS AF=0.25;OBSCURE=0.998;AN=4;AC=1;PRIM=1,20159588,C,- GT:AD:AF:PL 1/0:9,28:0.7:0,0,36 0/0:.:.:. ./.:.:.:.
chr1 22000000 LN=5 A C 4 PASS AN=2;AC=0;Prej=AddedVariant GT:AD:GP ./.:.:. ./.:.:. 1/1:3,30:37.9,14.9,0.14
Looking at the Luft rendition of the merged file, we notice that the new variant, LN=5, now contains a INFO/Prej field. Therefore, this is a Luft-only variant that cannot be expressed in Primary coordinates. This is always the case for variants that are added to an existing DVCF file, unless the new variants also originate from a DVCF file.
Also, note that when running genozip, we got a warning about the number of samples of variant LN=1. This is our Primary-only variant, that didn’t participate in the merge in Luft coordinates, and remains with two samples only. This is ok, as long as the DVCF file is the first file on the bctools merge command line, so that the missing sample aligns with the new sample Person3. It is perfectly acceptable to have missing samples in VCF.
Let us now take a look at the Primary rendition of the merged file:
> genocat merged.38.vcf.genozip
##fileformat=VCFv4.2
##luft_only=chr1 22000000 LN=5 A C 4 PASS AN=2;AC=0;Prej=AddedVariant GT:AD:GP ./.:.:. ./.:.:. 0/0:30,3:37.9,14.9,0.14
##FILTER=<ID=PASS,Description="All filters passed">
##original_reference=file:///references/grch37/reference.bin
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles",RendAlg="R"
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed",RendAlg="A_1"
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype",RendAlg="GT"
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes",RendAlg="G"
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele",RendAlg="A_AN"
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes",RendAlg="NONE"
##INFO=<ID=OBSCURE,Number=1,Type=Float,Description="Obscure annotation",RendAlg="A_1"
##contig=<ID=1,length=249250621>
##dual_coordinates=PRIMARY
##chain=file:///C/Users/USER/projects/genozip/data/GRCh37_to_GRCh38.chain.genozip ##luft_reference=file://data/GRCh38_full_analysis_set_plus_decoy_hla.ref.genozip
##reference=file://data/hs37d5.ref.genozip ##INFO=<ID=LUFT,Number=4,Type=String,Description="Info for rendering variant in LUFT coords. See https://genozip.com/dvcf.html",Source="genozip",Version="12.0.25",RendAlg="NONE"
##INFO=<ID=PRIM,Number=4,Type=String,Description="Info for rendering variant in PRIMARY coords",Source="genozip",Version="12.0.25",RendAlg="NONE"
##INFO=<ID=Lrej,Number=1,Type=String,Description="Reason variant was rejected for LUFT coords",Source="genozip",Version="12.0.25",RendAlg="NONE"
##INFO=<ID=Prej,Number=1,Type=String,Description="Reason variant was rejected for PRIMARY coords",Source="genozip",Version="12.0.25",RendAlg="NONE"
##luft_contig=<ID=chr1,length=248956422>
##genozip_command="C:\Users\USER\projects\genozip\genocat-debug.exe -vo docs/dvcf-example-files/mydata.38.vcf -f docs/dvcf-example-files/mydata.d.vcf.genozip" 2021-06-15 11:36:09 Cen. Australia Daylight Time
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Phred-scaled posterior probabilities for genotypes as defined in the VCF specification",RendAlg="G"
##bcftools_mergeVersion=1.11+htslib-1.11
##bcftools_mergeCommand=merge mydata.38.vcf.gz person3.vcf.gz; Date=Tue Jun 15 12:13:51 2021
##genozip_command="C:\Users\USER\projects\genozip\genocat-debug.exe -o ../../junk.vcf docs/dvcf-example-files/merged.38.vcf.genozip" 2021-06-15 12:20:53 Cen. Australia Daylight Time
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Person1 Person2 Person3
1 10285 LN=1 T C 100 PASS AN=6;AC=4;LUFT=chr1,10285,T,- GT:AD:AF:PL 0/1:31,18:0.367:37,0,46 1/1:.:.:. 1/0:20,15:.:.
1 329162 LN=2 A T 4.6 PASS AN=4;AC=3;LUFT=chr1,490175,T,X GT:AD:AF:PL 0/1:28,9:0.3:36,0,0 1/1:.:.:. ./.:.:.:.
1 366042 LN=1 TA A 100 PASS Lrej=RefNewAlleleIndel GT 1|0 0|0
1 20159588 LN=4 C A 100 PASS AF=0.75;OBSCURE=0.002;AN=4;AC=3;LUFT=chr1,19833095,A,- GT:AD:AF:PL 0/1:28,9:0.3:36,0,0 1/1:.:.:. ./.:.:.:.
Notice that our new variant, LN=5, which is a Luft-only variant, is indeed absent from the variant data lines when rendering in Primary coordinates, and is instead present in the ##luft_only meta-information line at the beginning of the file.
Also, look at the variant LN=4 - this is the variant with the REF⇆ALT switch. Notice how the INFO and FORMAT fields were cross-rendered for the new sample Person3 to reflect the REF⇆ALT switch even though this sample was not present when we lifted the file with genozip --chain. For the GP field, this happened even though the GP field was entirely absent from the file at the time of lifting:
GT: The values were flipped - 0/0 became 1/1
AD: The order of the values was reversed: 30,3 became 3,30
GP: The order of the values was reversed: 37.9,14.9,0.14 became 0.14,14.9,37.9
INFO/AC: The new value is (AN-value)
INFO/AF, INFO/AC and INFO/OBSCURE: The value is (1-value)
This demonstrates how a VCF file can continue to evolve, adding and dropping variants, adding, removing or modifying INFO and FORMAT fields, all while continuing to maintain both coordinates, renderable in either Primary or Luft coordinates as required by any particular step in an analysis process.
Annotation dropping and renaming
In some cases, an annotation tag (rather than value) changes between the Primary and Luft rendition, and in other cases it is desirable to drop annotations in the Luft rendition.
Genozip renames and drops certain annotations by default (which may be overridden if needed) and additional renaming or dropping may be specified using the command line options --dvcf-rename and --dvcf-drop.
See: Renaming and dropping annotations in a DVCF.
--single-coord: Converting back to a single coordinate (i.e. normal) VCF file
genocat --single-coord myfile.d.vcf.genozip
genocat --single-coord --luft myfile.d.vcf.genozip
The option --single-coord removes all DVCF-specific lines from the VCF header, and removes the DVCF INFO annotations, leaving the file as a normal VCF file in single coordinates - either the Primary coordinates, or Luft coordinates (when combined with --luft).
--show-dvcf and --show-ostatus - variant by variant info
Now let’s look at a interesting analytics tool: --show-dvcf is a useful tool for getting visibility into how Genozip handled each variant. It may also be used in combination with --luft:
> genocat merged.38.vcf.genozip --show-dvcf --header-one
#COORD oSTATUS CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Person1 Person2 Person3
BOTH OkRefSameSNP 1 10285 LN=1 T C 100 PASS AN=6;AC=4;LUFT=chr1,10285,T,- GT:AD:AF:PL 0/1:31,18:0.367:37,0,46 1/1:.:.:. 1/0:20,15:.:.
BOTH OkRefSameSNP 1 329162 LN=2 A T 4.6 PASS AN=4;AC=3;LUFT=chr1,490175,T,X GT:AD:AF:PL 0/1:28,9:0.3:36,0,0 1/1:.:.:. ./.:.:.:.
PRIM RefNewAlleleIndel 1 366042 LN=1 TA A 100 PASS Lrej=RefNewAlleleIndel GT 1|0 0|0 BOTH OkRefAltSwitchSNP 1 20159588 LN=4 C A 100 PASS AF=0.75;OBSCURE=0.002;AN=4;AC=3;LUFT=chr1,19833095,A,- GT:AD:AF:PL 0/1:28,9:0.3:36,0,0 1/1:.:.:. ./.:.:.:.
> genocat merged.38.vcf.genozip --show-dvcf --header-one --luft
#COORD oSTATUS CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Person1 Person2 Person3
BOTH OkRefSameSNP chr1 10285 LN=1 T C 100 PASS AN=6;AC=4;PRIM=1,10285,T,- GT:AD:AF:PL 0/1:31,18:0.367:37,0,46 1/1:.:.:. 1/0:20,15:.:.
BOTH OkRefSameSNP chr1 490175 LN=2 T A 4.6 PASS AN=4;AC=3;PRIM=1,329162,A,X GT:AD:AF:PL 0/1:28,9:0.3:36,0,0 1/1:.:.:. ./.:.:.:.
BOTH OkRefAltSwitchSNP chr1 19833095 LN=4 A C 100 PASS AF=0.25;OBSCURE=0.998;AN=4;AC=1;PRIM=1,20159588,C,- GT:AD:AF:PL 1/0:9,28:0.7:0,0,36 0/0:.:.:. ./.:.:.:.
LUFT AddedVariant chr1 22000000 LN=5 A C 4 PASS AN=2;AC=0;Prej=AddedVariant GT:AD:GP ./.:.:. ./.:.:. 1/1:3,30:37.9,14.9,0.14
Two columns were added at the start of each variant line - the first states the coordinates of the variant - Primary-only, Luft-only or Both coordinates. The second is the oSTATUS, states how the variant was handled. Dual-coordinate variants (“BOTH”) have an oStatus starting with Ok, while Primary-only and Luft-only variants have a status indicating the reason for being rejected from the opposite rendition.
Alternatively, --show-ostatus may be used to add the oSTATUS to the INFO field:
> genocat merged.38.vcf.genozip --show-ostatus --header-one
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Person1 Person2 Person3
1 10285 LN=1 T C 100 PASS AN=6;AC=4;LUFT=chr1,10285,T,-;oSTATUS=OkRefSameSNP GT:AD:AF:PL 0/1:31,18:0.367:37,0,46 1/1:.:.:. 1/0:20,15:.:.
1 329162 LN=2 A T 4.6 PASS AN=4;AC=3;LUFT=chr1,490175,T,0;oSTATUS=OkRefSameSNP GT:AD:AF:PL 0/1:28,9:0.3:36,0,0 1/1:.:.:. ./.:.:.:.
1 366042 LN=1 TA A 100 PASS Lrej=RefNewAlleleIndel;oSTATUS=RefNewAlleleIndel GT 1|0 0|0 1 20159588 LN=4 C A 100 PASS AF=0.75;OBSCURE=0.002;AN=4;AC=3;LUFT=chr1,19833095,A,-;oSTATUS=OkRefAltSwitchSNP GT:AD:AF:PL 0/1:28,9:0.3:36,0,0 1/1:.:.:. ./.:.:.:.
More information about how Genozip lifts and cross-renders each field can found here: Rendering a DVCF
The rejects file and --show-lifts
When running genozip --chain, a human-readable rejects file, mydata.d.vcf.genozip.rejects in our case, is generated containing additional information regarding the rejection reasons related to each rejected variant:
> cat mydata.d.vcf.genozip.rejects
##fileformat=GENOZIP-REJECTSv12.0.33
#oSTATUS CHROM POS REF ALT REASON RefNewAlleleIndel
1 366042 TA A new indel allele: PRIM=TAA LUFT(453295)=TAAAA
Here we can see why TA would be a new allele in the Luft reference: it is because the Primary REF allele relates to T followed by two As, and hence the ALT deletion allele refers to a T followed by a single A. Since the Luft reference contains four repeating As, and therefore the corresponding variant in Luft coordinates would therefore be
“REF=TAAA ALT=TA,T”, however at present Genozip is limited in that it cannot change a bi-allelic variant to a tri-allelic one, and hence the variant is rejected.
The option --show-lifts in combination with genozip --chain causes all lifts (not just rejected ones) to be reported in the rejects file:
> genozip mydata.vcf.gz --chain GRCh37_to_GRCh38.chain.genozip --show-lifts
> cat mydata.d.vcf.genozip.rejects
##fileformat=GENOZIP-REJECTSv12.0.33
#oSTATUS CHROM POS REF ALT REASON OkRefSameSNP
1 10285 T . SNP: REF unchanged OkRefSameSNP
1 329162 A . SNP: REF unchanged RefNewAlleleIndel
1 366042 TA A new indel allele: PRIM=TAA LUFT(453295)=TAAAA OkRefAltSwitchSNP
1 20159588 C A SNP: REF<>ALT switch
Overlapping variants and the overlaps file
There are cases where two or more variants which have distinct Primary coordinates, are mapped to the same Luft coordinates, due to overlapping alignments in the chain file. When this occurs, Genozip generates a file with the .overlaps extension, containing the CHROM and POS of overlapping variants in Luft coordintes. This file can be used to view these variants:
> genocat myfile.d.vcf.genozip --luft --regions-file myfile.d.vcf.genozip.overlaps
Filtering out these variants can be done with:
> genocat myfile.d.vcf.genozip --luft --regions-file ^myfile.d.vcf.genozip.overlaps
The format of the .overlaps file, after removing the comment lines, is also compatible with bcftools view
--regions-file.
Viewing summary statistics
To see summary statistics of how variants were handled, we can use --show-counts (this works both with genozip and genocat):
> genocat --show-counts=o\$TATUS merged.38.vcf.genozip
Showing counts of o$TATUS (did_i=16). Total items=5 Number of categories=4
OkRefSameSNP 2 40.00%
RefNewAlleleIndel 1 20.00%
AddedVariant 1 20.00%
OkRefAltSwitchSNP 1 20.00%
> genocat --show-counts=COORDS merged.38.vcf.genozip
Showing counts of COORDS (did_i=15). Total items=5 Number of categories=3
BOTH 3 60.00%
LUFT 1 20.00%
PRIM 1 20.00%
Other useful options
-
genocat --show-chain <my-chain.chain.genozip> - displays all the chain file alignments, including overlap information.
-
genocat --snps-only <myfile.vcf.genozip> - drops non-SNP variants.
-
genocat --indels-only <myfile.vcf.genozip> - drops non-indel variants.
-
genocat --reference <ref-file> --regions chr22:17000000-17000100 (or equivalently: chr22:17000000+101) - displays a region of a reference.
-
genocat --reference <ref-file> --regions chr22:17000000-16999900 (or equivalently: chr22:17000000-101) - displays a region in reverse-complement.
-
genozip --chain <my-chain.chain.genozip> --add-line-numbers <myfile.vcf> - replaces the ID field with a sequential variant number, making it easier to locate corresponding variants between Primary and Luft renditions.
oSTATUS list
oSTATUS | Occurs when… |
---|---|
AltsNotSameLenInsRev | Lift failed - a multi-allelic insertion at a reverse-strand chain alignment has ALTs with different lengths - these would have resulted in variants with different POS values |
OkRefSameSNP | Lift succeeded - REF unchanged for a SNP variant |
OkRefSameSNPRev | Lift succeeded - REF unchanged for a SNP variant - with strand reversal |
OkRefSameSNPIupac | Lift succeeded - REF considered unchanged as it matches a IUPAC “base” in the Luft reference |
OkRefSameIndel | Lift succeeded - REF unchanged for an indel variant |
OkRefSameNDNIRev | Lift succeeded - Lift succeeded - REF unchanged for an left-anchored non-Ins non-Del indel variant with strand reversal |
OkRefSameDelRev | Lift succeeded - Lift succeeded - REF unchanged for an left-anchored Deletion variant with strand reversal |
OkRefSameInsRev | Lift succeeded - Lift succeeded - REF unchanged for an left-anchored Insertion variant with strand reversal |
OkRefSameNotLeftAnc | Lift succeeded - REF unchanged for a variant that is neither a SNP nor left-anchored |
OkRefSameStructVariant | Lift succeeded - REF unchanged for a variant with a symbolic ALT allele |
OkRefAltSwitchSNP | Lift succeeded - REF⇆ALT switch for a SNP variant |
OkRefAltSwitchIndel | Lift succeeded - REF⇆ALT switch for an left-anchored indel variant |
OkRefAltSwitchDelToIns | Lift succeeded - Indel REF⇆ALT switch: The Deletion variant in Primary is incorporated in the Luft reference |
OkRefAltSwitchIndelRpts | Lift succeeded - Indel REF⇆ALT switch: Switched number of payload repeats in reference |
OkRefAltSwitchIndelFlank | Lift succeeded - Indel REF⇆ALT switch: REF bases the same, but switch called based on flanking regions |
OkRefAltSwitchNDNI | Lift succeeded - Indel REF⇆ALT switch: Left-anchored non-Ins non-Del indel |
OkRefAltSwitchWithGap | Lift succeeded - Indel REF⇆ALT switch: Deletion with payload in chain file gap |
OkRefAltSwitchNotLeftAnc | Lift succeeded - REF⇆ALT switch for a variant that is neither a SNP nor left-anchored |
OkNewRefSNP | Lift succeeded - REF changed (not to ALT) for a SNP variant with AF=1 or AC=AN |
ChromNotInPrimReference | Lift failed - Primary coordinates reference file doesn’t contain CHROM |
ChromNotInChainFile | Lift failed - chain file doesn’t have a mapping for CHROM |
RefNotMappedInChain | Lift failed - Chain file doesn’t contain a mapping covering REF |
NewAnchorNotInChrom | Lift failed - New left-anchor base (after reverse-complementing) is before beginning of the chromosome |
RefSplitInChain | Lift failed - REF is not fully within a single alignment in the chain file |
RefMismatchesReference | Lift failed - REF in VCF file mismatches the Primary coordinates reference file. There is an error in the VCF file, or the wrong reference file is provided |
RefMultiAltSwitchSNP | Lift failed - A REF⇆ALT switch for a multi-allelic SNP |
RefMultiAltSwitchIndel | Lift failed - A REF⇆ALT switch for a multi-allelic indel |
RefNewAlleleSNP | Lift failed - The Luft reference would introduce a new allele which is neither REF or ALT |
RefNewAlleleDelRefChgHas* | Lift failed - REF changed in a Deletion variant that has a “*” ALT |
RefNewAlleleDelRefChanged | Lift failed - REF changed in Deletion variant, but not REF<>ALT switch (i.e. Deletion not integrated into new reference) |
RefNewAlleleDelSameRef | Lift failed - REF bases match, but this is a new Deletion allele based on context |
RefNewAllelInsRefChanged | Lift failed - REF changed in Insertion variant |
RefNewAlleleInsSameRef | Lift failed - REF bases match, but this is a new Insertion allele based on context |
RefNewAlleleIndelNoSwitch | Lift failed - REF switched with one of the ALTs, but flanking regions differ |
RefNewAlleleNDNI | Lift failed - REF is a new allele a left-anchored non-Ins non-Del indel variant |
RefNewAlleleNotLeftAnc | Lift failed - The Luft reference would introduce a new allele for a variant that is neither a SNP nor left-anchored |
RefNewAlleleSV | Lift failed - The Luft reference is different than REF for a variant with a symbolic ALT allele |
XstrandNotLeftAnc | Lift failed - A variant that is neither a SNP nor left-anchored mapped to the reverse strand |
XstrandSV | Lift failed - A variant with a symbolic ALT allele mapped to the reverse strand |
ComplexRearrangements | Lift failed - Variant is a Complex Rearrangement |
INFO/END | Lift failed - POS and INFO/END are not on the same chain file alignment |
AddedVariant | Failed to cross-render a variant - Non-dual-coordinate variant added to a DVCF |
UnsupportedRefAlt | Failed to cross-render a variant - Combination of REF/ALT in Primary and Luft coordinates not supported by Genozip |
INFO/AC | Failed to cross-render AC: INFO/AN is missing for this variant |
INFO/tag | Failed to cross-render this INFO tag |
FORMAT/tag | Failed to cross-render this FORMAT tag |
Questions? support@genozip.com