Converting SAM/BAM to FASTQ

Usage

 

genocat --fastq genozip_files

Viewing only R1 reads:

 

genocat --fastq --FLAG +0x40 genozip_files

 

Viewing only R2 reads:

 

genocat --fastq --FLAG +0x80 genozip_files

Outputting all SAM fields on the FASTQ description line:

 

genocat --fastq=all genozip_files

Outputting fq.gz (BGZF-compressed FASTQ):

genocat myfile.bam.genozip --output myfile.fq.gz

Description

 

Displays the contents of the SAM / BAM data in FASTQ format:

●  Alignments are output as FASTQ reads in the order in which they appear in the SAM/BAM file.

●  Alignments with FLAG 0x10 (reverse complemented) have their SEQ reverse complemented and their QUAL reversed.

●  Alignments with FLAG 0x0800 (supplementary) or 0x0100 (secondary) are dropped.

●  Alignments with FLAG 0x40 (first segment in the template) have a /1 added after the read name unless --FLAG is specified as well.

●  Alignments with FLAG 0x80 (last segment in the template) have a /2 added after the read name unless --FLAG is specified as well.

●  If --output specifies a filename ending with .fq[.gz] or .fastq[.gz] then --fastq is activated implicitly.
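The conversion rules listed above can be sketched in Python. This is an illustration only, not genozip's implementation: the function name sam_line_to_fastq and the add_suffix parameter are hypothetical, and the sketch assumes tab-separated SAM text whose sequences use only the letters A, C, G, T and N. Passing add_suffix=False mimics the case in which --FLAG is specified as well.

```python
# Illustrative sketch of the rules above; not genozip's actual code.
# Assumption: only A, C, G, T, N (either case) are complemented;
# any other character is left unchanged.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

FLAG_REVERSE = 0x10         # SEQ is stored reverse complemented
FLAG_FIRST = 0x40           # first segment in the template (R1)
FLAG_LAST = 0x80            # last segment in the template (R2)
FLAG_SECONDARY = 0x100      # secondary alignment
FLAG_SUPPLEMENTARY = 0x800  # supplementary alignment


def sam_line_to_fastq(line, add_suffix=True):
    """Return a 4-line FASTQ record for one SAM line, or None if it is dropped."""
    fields = line.rstrip("\n").split("\t")
    qname, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]

    # Secondary and supplementary alignments are dropped.
    if flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
        return None

    # Restore the original read orientation.
    if flag & FLAG_REVERSE:
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]

    # Append /1 or /2 unless the caller asked not to (the --FLAG case).
    if add_suffix:
        if flag & FLAG_FIRST:
            qname += "/1"
        elif flag & FLAG_LAST:
            qname += "/2"

    return f"@{qname}\n{seq}\n+\n{qual}"
```

Applying this function to each line of a SAM file and discarding the None results yields the reads in their original order, as described in the first bullet.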

 

Example

Consider a simple paired-end SAM file with three alignments, all with the same QNAME. The first two are the primary alignments of R1 and R2, and the third is a secondary alignment of R1:

$ genozip myfile.sam # compress with genozip

 

$ genocat myfile.sam.genozip

ST-E00185:547:HCNMNCCX2:5:1118:17269:28593 99 chr1 33656 0 10M = 33901 386 CCTAATGCTA AAAFFJJJJJ NM:i:0

ST-E00185:547:HCNMNCCX2:5:1118:17269:28593 147 chr1 33901 0 10M = 33656 -386 CACATTTTCT JJFAF7-7F7 NM:i:2

ST-E00185:547:HCNMNCCX2:5:1118:17269:28593 355 chr22 33656 0 10M = 33901 386 CCTAATGCTA AAAFFJJJJJ NM:i:1
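The FLAG values of the three alignments above (99, 147 and 355) can be decoded bit by bit to confirm which is R1, which is R2 and which is secondary. The following is a minimal sketch; the helper name decode_flag and the short labels are hypothetical, while the bit values are those defined by the SAM specification.

```python
# Bit values as defined in the SAM specification; the labels are informal.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_template",
    0x80: "last_in_template",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


for flag in (99, 147, 355):
    print(flag, decode_flag(flag))
# 99 ['paired', 'proper_pair', 'mate_reverse', 'first_in_template']
# 147 ['paired', 'proper_pair', 'reverse', 'last_in_template']
# 355 ['paired', 'proper_pair', 'mate_reverse', 'first_in_template', 'secondary']
```

Only the second alignment has the reverse bit (0x10) set, and only the third has the secondary bit (0x100) set.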

Let’s now display the data as FASTQ. Notice that:

1. The last SAM line is eliminated because it is a secondary alignment.

2. The read names have /1 and /2 added to them.

3. The second alignment's sequence is reverse-complemented and its base qualities are reversed.

 

$ genocat myfile.sam.genozip --fastq

@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/1

CCTAATGCTA

+

AAAFFJJJJJ

@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/2

AGAAAATGTG

+

7F7-7FAFJJ

Let’s now output only the R1 reads (the first SAM line in this case). Notice that:

1. /1 is not added.

2. --fq is an alternative spelling for activating the --fastq option.

$ genocat myfile.sam.genozip --fq --FLAG +0x40

@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593

CCTAATGCTA

+

AAAFFJJJJJ

Finally, we can output all SAM fields to the FASTQ description lines with --fq=all:

$ genocat myfile.sam.genozip --fq=all

@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/1 FLAG:99 RNAME:chr1 POS:33656 MAPQ:0 CIGAR:10M RNEXT:= PNEXT:33901 TLEN:386 NM:i:0

CCTAATGCTA

+

AAAFFJJJJJ

@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/2 FLAG:147 RNAME:chr1 POS:33901 MAPQ:0 CIGAR:10M RNEXT:= PNEXT:33656 TLEN:-386 NM:i:2

AGAAAATGTG

+

7F7-7FAFJJ
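A downstream script may want to recover the SAM fields from description lines like the ones above. The following is a minimal sketch, assuming the layout matches this example exactly: whitespace-separated tokens, the eight named fields written as NAME:value, and any remaining tokens being SAM optional fields. The function name parse_description is hypothetical and is not part of genozip.

```python
# Field names as they appear in the --fq=all example above.
SAM_FIELDS = ("FLAG", "RNAME", "POS", "MAPQ", "CIGAR", "RNEXT", "PNEXT", "TLEN")


def parse_description(header):
    """Split a FASTQ description line into (read name, named fields, optional tags)."""
    tokens = header.split()
    name = tokens[0][1:] if tokens[0].startswith("@") else tokens[0]
    fields, tags = {}, []
    for token in tokens[1:]:
        # Split on the first colon only, so values may themselves contain colons.
        key, _, value = token.partition(":")
        if key in SAM_FIELDS:
            fields[key] = value
        else:
            tags.append(token)  # e.g. NM:i:0
    return name, fields, tags
```

For the first record above, this returns the read name with its /1 suffix, a dictionary such as {"FLAG": "99", "RNAME": "chr1", ...} with string values, and the list ["NM:i:0"].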

Questions? support@genozip.com
