
Handling GATK’s
“Unexpected base in allele bases” error
When running GATK’s HaplotypeCaller (and perhaps other commands?) on a BAM files that contains bases other than A,C,T,G (and N?), GATK throws an exception:
java.lang.IllegalArgumentException: Unexpected base in allele bases
This was observed both in GATK 3.5 and 4.1.
Genozip can directly filter the offending lines out of a BAM file:
Step 1: Compress the file with Genozip:
genozip myfile.bam
Step 2: Filter it:
genocat myfile.bam.genozip --bases ACGTN # Keep only lines in which SEQ has only A,C,T,G,N
genocat myfile.bam.genozip --bases ^ACGTN # See the offending lines
genocat myfile.bam.genozip --bases ^ACGTN --count # Count the number of offending lines
This also works for SAM and FASTQ files.
The list of IUPAC characters can be found here: IUPAC codes