
Converting SAM/BAM to FASTQ
Usage
genocat --fastq genozip_files
Viewing only R1 reads:
genocat --fastq --FLAG +0x40 genozip_files
Viewing only R2 reads:
genocat --fastq --FLAG +0x80 genozip_files
Outputting all SAM fields on the FASTQ description line:
genocat --fastq=all genozip_files
Outputting fq.gz (BGZF-compressed FASTQ):
genocat myfile.bam.genozip --output myfile.fq.gz
Description
Displays the contents of the SAM/BAM data in FASTQ format:
● The alignments are output as FASTQ reads in the order in which they appear in the SAM/BAM file.
● Alignments with FLAG 0x10 (reverse complemented) have their SEQ reverse complemented and their QUAL reversed.
● Alignments with FLAG 0x0800 (supplementary) or 0x0100 (secondary) are dropped.
● Alignments with FLAG 0x40 (first segment in the template) have a /1 added after the read name unless --FLAG is specified as well.
● Alignments with FLAG 0x80 (last segment in the template) have a /2 added after the read name unless --FLAG is specified as well.
● If --output specifies a filename ending with .fq[.gz] or .fastq[.gz] then --fastq is activated implicitly.
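The rules above can be sketched in a few lines of Python. This is an illustration of the described behavior only, not genocat's implementation: the function names are hypothetical, the input is assumed to be tab-separated SAM text lines, and the `require_flags` parameter is an assumed stand-in for `--FLAG +value` (keep only alignments in which all of the given bits are set).

```python
# Illustrative sketch of the conversion rules; not genocat's actual code.
REVERSE = 0x10          # SEQ is stored reverse complemented
FIRST_SEGMENT = 0x40    # R1
LAST_SEGMENT = 0x80     # R2
SECONDARY = 0x100
SUPPLEMENTARY = 0x800

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sam_to_fastq(sam_lines, require_flags=None):
    """Return a list of FASTQ records (one 4-line string per read).

    require_flags is a hypothetical equivalent of `--FLAG +value`.
    """
    records = []
    for line in sam_lines:
        # Skip blank lines and SAM header lines (@HD, @SQ, ...)
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]

        # Secondary and supplementary alignments are dropped
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        # Optional filter: every requested bit must be set
        if require_flags is not None and (flag & require_flags) != require_flags:
            continue
        # Restore the original read orientation
        if flag & REVERSE:
            seq, qual = reverse_complement(seq), qual[::-1]
        # /1 and /2 are added only when no FLAG filter is in effect
        if require_flags is None:
            if flag & FIRST_SEGMENT:
                qname += "/1"
            elif flag & LAST_SEGMENT:
                qname += "/2"
        records.append(f"@{qname}\n{seq}\n+\n{qual}\n")
    return records
```

Calling `sam_to_fastq(lines)` mirrors `--fastq`, and `sam_to_fastq(lines, require_flags=0x40)` mirrors `--fastq --FLAG +0x40` under the assumptions stated above.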
Example
Consider a simple paired-end SAM file with three alignments, all with the same QNAME. The first two are the primary alignments of R1 and R2, and the third is a secondary alignment of R1:
$ genozip myfile.sam # compress with genozip
$ genocat myfile.sam.genozip
ST-E00185:547:HCNMNCCX2:5:1118:17269:28593 99 chr1 33656 0 10M = 33901 386 CCTAATGCTA AAAFFJJJJJ NM:i:0
ST-E00185:547:HCNMNCCX2:5:1118:17269:28593 147 chr1 33901 0 10M = 33656 -386 CACATTTTCT JJFAF7-7F7 NM:i:2
ST-E00185:547:HCNMNCCX2:5:1118:17269:28593 355 chr22 33656 0 10M = 33901 386 CCTAATGCTA AAAFFJJJJJ NM:i:1
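The FLAG values in the second column (99, 147 and 355) are bit fields. As a minimal sketch, the following Python snippet decodes them into the bit meanings defined by the SAM specification; the `decode_flag` helper and the short labels are illustrative names, not part of genozip.

```python
# Short labels for the FLAG bits defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first segment (R1)",
    0x80: "last segment (R2)",
    0x100: "secondary",
    0x200: "fails QC",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


for flag in (99, 147, 355):
    print(flag, hex(flag), ", ".join(decode_flag(flag)))
```

Decoded this way, 99 is a forward-strand R1, 147 is a reverse-strand R2, and 355 is an R1 with the secondary bit (0x100) set.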
Let’s now display the data as FASTQ. Notice that:
1. The last SAM line is eliminated because it is a secondary alignment.
2. The read names have /1 and /2 added to them.
3. The second alignment's sequence is reverse-complemented and its base qualities are reversed.
$ genocat myfile.sam.genozip --fastq
@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/1
CCTAATGCTA
+
AAAFFJJJJJ
@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/2
AGAAAATGTG
+
7F7-7FAFJJ
Let’s now output only the R1 reads (the first SAM line in this case). Notice that:
1. /1 is not added.
2. --fq is an alternative spelling for activating the --fastq option.
$ genocat myfile.sam.genozip --fq --FLAG +0x40
@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593
CCTAATGCTA
+
AAAFFJJJJJ
Finally, we can output all SAM fields to the FASTQ description lines with --fq=all:
$ genocat myfile.sam.genozip --fq=all
@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/1 FLAG:99 RNAME:chr1 POS:33656 MAPQ:0 CIGAR:10M RNEXT:= PNEXT:33901 TLEN:386 NM:i:0
CCTAATGCTA
+
AAAFFJJJJJ
@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/2 FLAG:147 RNAME:chr1 POS:33901 MAPQ:0 CIGAR:10M RNEXT:= PNEXT:33656 TLEN:-386 NM:i:2
AGAAAATGTG
+
7F7-7FAFJJ
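Judging from the output above, the description line lists SAM fields 2 to 9 as NAME:value pairs, followed by the optional fields unchanged. A rough Python sketch of that layout follows; the `fastq_description` helper is hypothetical, and the single-space separator is an assumption based on how the output is displayed here.

```python
# Names of SAM columns 2-9, in file order (per the SAM specification).
SAM_FIELD_NAMES = ["FLAG", "RNAME", "POS", "MAPQ", "CIGAR", "RNEXT", "PNEXT", "TLEN"]


def fastq_description(sam_fields, read_name):
    """Build a FASTQ description line in the style shown for --fq=all.

    sam_fields: the columns of one SAM alignment line, as strings.
    read_name:  the read name to emit, including any /1 or /2 suffix.
    """
    named = [f"{name}:{value}" for name, value in zip(SAM_FIELD_NAMES, sam_fields[1:9])]
    optional = sam_fields[11:]  # optional fields such as NM:i:0 are kept as-is
    # Assumption: fields are joined by a single space, as displayed above.
    return " ".join([f"@{read_name}", *named, *optional])
```

SEQ and QUAL (columns 10 and 11) are left out of the description line because they form the FASTQ record itself.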
Questions? support@genozip.com