Filtering BAM, FASTQ or FASTA data by species using kraken2
kraken2 is a tool (not associated with Genozip) that analyzes a FASTQ or FASTA file against a database typically derived from
reference sequences of a large number of species (for example: human plus bacteria) and assigns an
Genozip allows filtering of SAM, BAM, FASTQ and FASTA files by taxid, using the kraken data (i.e. the kraken output file). This allows, for example, filtering contamination out of an existing BAM file without needing to re-map it.
Filtering an existing BAM (or SAM, FASTQ of FASTA) file based on existing kraken data
Step 1: Compress the BAM (or SAM, FASTQ of FASTA) file with genozip
Step 2: Prepare the kraken file
Option 1: From a FASTQ file or a pair of FASTQ files - follow the kraken2 instructions.
Option 2: From a .bam.genozip file:
genocat myfile.bam.genozip --fastq -o mysample.fq &
kraken2 --db $kraken_db --report mykrakendata.kraken.report --output mykrakendata.kraken mysample.fq
Note: we use a named pipe (generated by mkfifo) for in-memory communication from genocat to kraken2. This is to overcome kraken’s inability to use a regular pipe for the input file.
Option 3: From a paired .fastq.genozip file:
mkfifo mysample.kraken mysample.R1.fq mysample.R2.fq
genocat myfile.fq.genozip --R1 -o mysample.R1.fq &
genocat myfile.fq.genozip --R2 -o mysample.R2.fq &
kraken2 --db kraken-db --paired --mykrakendata.kraken --report mykrakendata.kraken.report mysample.R1.fq mysample.R2.fq
rm mysample.kraken mysample.R1.fq mysample.R2.fq
Note: in this case we split the paired FASTQ data to two different pipes to overcomes kraken’s inability to consume a paired fastq in interleaved format.
Step 3: Filter your file using the kraken data
To display only the lines of the BAM file containing human data (NCBI Taxonomy ID 9606):
genocat myfile.bam.genozip --kraken mykrakendata.kraken --taxid 9606
To display all the lines of the BAM file that DO NOT contain human data:
genocat myfile.bam.genozip --kraken mykrakendata.kraken --taxid ^9606
To filter by multiple taxonomy IDs, use a comma-separated list:
genocat myfile.bam.genozip --kraken mykrakendata.kraken --taxid 9606,570
Note: A +0 suffix can be used include or exclude, in addition, also unclassified reads:
genocat myfile.bam.genozip --kraken mykrakendata.kraken --taxid 9606+0
The --kraken and --taxid options work the same for SAM/BAM, FASTQ and FASTA files.
Adding taxonomy information to a Genozip file
You can generate a SAM/BAM, FASTQ or FASTA file with taxonomy information baked in:
genozip myfile.bam --kraken mykrakendata.kraken
This stores internally in the Genozip file, for each line, its taxid. Unless you need the kraken file for other purposes, you can discard it at this point.
In the case of SAM and BAM, the taxid is will be visible as an additional field tx:i, for example: tx:i:9606. In case of FASTQ, it will be added to the description line of each read, for example: taxid=9606.
Now, you can filter the file, as before, but this time it will be a lot faster since Genozip doesn’t need to read the kraken file:
genocat myfile.bam.genozip --taxid 9606
You can see the breakdown of Taxonomy IDs in the file with:
genocat myfile.bam.genozip --show-counts TAXID
Finally, you can see the line-by-line taxid assignments with:
--show-kraken all lines
--show-kraken=included only included lines
--show-kraken=excluded only excluded lines
genocat myfile.bam.genozip --show-kraken
Kraken data containing reads with /1 and /2
genozip --kraken and genocat --kraken and can handle kraken data containing reads with /1 or /2, similar the way two components are handled, as described above.
This can happen, for example, when feeding kraken2 with an interleaved FASTQ file rather than two separate FASTQ files.
Note: Genozip requires that if the kraken file has a read name with /1, then it must also contain the corresponding read name with /2 and vice versa - their order is not important and they needn’t be consecutive.
genocat --taxid makes the assumption that all read names (QNAMEs in SAM terminology) that appear in the viewed file also appear (possibly with a /1 or /2 suffix) in the kraken data. If a read name appears in the viewed file, but is absent from the kraken data, its line will be silently assigned the most common taxid according to the kraken data.
It is ok if the kraken data contains additional read names not present in the viewed file.