top of page
Filtering

Filtering BAM, FASTQ or FASTA data by species using kraken2

Overview

kraken2 is a tool (not associated with Genozip) that analyzes a FASTQ or FASTA file against a database typically derived from

reference sequences of a large number of species (for example: human plus bacteria) and assigns an 

NCBI Taxonomy ID (taxid) to each read - generating a kraken output file containing these assignments.

 

Genozip allows filtering of SAM, BAM, FASTQ and FASTA files by taxid, using the kraken data (i.e. the kraken output file). This allows, for example, filtering contamination out of an existing BAM file without needing to re-map it.

Filtering an existing BAM (or SAM, FASTQ of FASTA) file based on existing kraken data

Step 1: Compress the BAM (or SAM, FASTQ of FASTA) file with genozip

genozip myfile.bam

Step 2: Prepare the kraken file

Option 1: From a FASTQ file or a pair of FASTQ files - follow the kraken2 instructions.

Option 2: From a .bam.genozip file:

mkfifo mysample.fq

genocat myfile.bam.genozip --fastq -o mysample.fq &

kraken2 --db $kraken_db --report mykrakendata.kraken.report --output mykrakendata.kraken mysample.fq

rm mysample.fq

Note: we use a named pipe (generated by mkfifo) for in-memory communication from genocat to kraken2. This is to overcome kraken’s inability to use a regular pipe for the input file.

Option 3: From a paired .fastq.genozip file:

mkfifo mysample.kraken mysample.R1.fq mysample.R2.fq

genocat myfile.fq.genozip --R1 -o mysample.R1.fq &

genocat myfile.fq.genozip --R2 -o mysample.R2.fq &

kraken2 --db kraken-db --paired --mykrakendata.kraken --report mykrakendata.kraken.report mysample.R1.fq mysample.R2.fq 

rm mysample.kraken mysample.R1.fq mysample.R2.fq

Note: in this case we split the paired FASTQ data to two different pipes to overcomes kraken’s inability to consume a paired fastq in interleaved format.

Step 3: Filter your file using the kraken data

To display only the lines of the BAM file containing human data (NCBI Taxonomy ID 9606):

 

genocat myfile.bam.genozip --kraken mykrakendata.kraken --taxid 9606

To display all the lines of the BAM file that DO NOT contain human data:

genocat myfile.bam.genozip --kraken mykrakendata.kraken --taxid ^9606

Note: A +0 suffix can be used include or exclude, in addition, also unclassified reads:

genocat myfile.bam.genozip --kraken mykrakendata.kraken --taxid 9606+0

The --kraken and --taxid options work the same for SAM/BAM, FASTQ and FASTA files.

Adding taxonomy information to a Genozip file

You can generate a SAM/BAM, FASTQ or FASTA file with taxonomy information baked in:

genozip myfile.bam --kraken mykrakendata.kraken

This stores internally in the Genozip file, for each line, its taxid. Unless you need the kraken file for other purposes, you can discard it at this point.

In the case of SAM and BAM (but not FASTQ and FASTA), the taxid is will be visible as an additional field tx:i, for example: tx:i:9606.

Now, you can filter the file, as before, but this time it will be a lot faster since Genozip doesn’t need to read the kraken file:

genocat myfile.bam.genozip --taxid 9606

You can see the breakdown of Taxonomy IDs in the file with:

genocat myfile.bam.genozip --show-counts TAXID

Finally, you can see the line-by-line taxid assignments with:

--show-kraken            all lines

--show-kraken=included   only included lines

--show-kraken=excluded   only excluded lines

genocat myfile.bam.genozip --show-kraken

Kraken data containing reads with /1 and /2

genozip --kraken and genocat --kraken and can handle kraken data containing reads with /1 or /2, similar the way two components are handled, as described above.

This can happen, for example, when feeding kraken2 with an interleaved FASTQ file rather than two separate FASTQ files.

Note: Genozip requires that if the kraken file has a read name with /1, then it must also contain the corresponding read name with /2 and vice versa - their order is not important and they needn’t be consecutive.

Assumptions

genocat --taxid makes the assumption that all read names (QNAMEs in SAM terminology) that appear in the viewed file also appear (possibly with a /1 or /2 suffix) in the kraken data. If a read name appears in the viewed file, but is absent from the kraken data, its line will be silently assigned the most common taxid according to the kraken data.

It is ok if the kraken data contains additional read names not present in the viewed file.

Questions? support@genozip.com

bottom of page