
Example of a FASTQ to BAM pipeline using Genozip
Overview
This pipeline starts with a .fq.genozip file, cleans it up with fastp, maps it with bwa mem, sorts it with bamsort, removes duplicates with bamstreamingmarkduplicates and finally outputs a .bam.genozip file.
This pipeline has Genozip files on both ends, and uses no intermediate files, so it is efficient in storage usage.
Genozip adds very little overhead, as its CPU consumption is relatively minor in comparison with the tools used in this pipeline, in particular bwa mem.
Notes
• The input fastq.genozip files were originally created from pairs of fq.gz files:
genozip sample01-R1.fq.gz sample01-R2.fq.gz --pair
• Since the .fq.genozip file was generated with genozip --pair, genocat outputs FASTQ data in interleaved format, i.e. reads from R1 interleaved with reads from R2, which is the format expeceted by fastp --interleaved_in.
• This script is designed to have multiple instances of it running in parallel. Each instance will work on different files.
• Technical note: ${out}.doing_now is used as a mutex and touch/rm as mutex_lock/unlock. Since touch is not atomic, exclusivity is not 100% guarateed, but in practice it seems to work.
Script
#!/bin/bash
ref=GRCh38_full_analysis_set_plus_decoy_hla.fa
study=mystudy
fastq=myfastqdir
mapped=mymappeddir
files=($fastq/*.genozip)
processed=1
while (( processed == 1 )); do
processsed=0
for file in ${files[@]}
do
sample=`echo $file | grep -o -E sample'[[:digit:]]{2}'` # convert the file name to a sample name
out=$mapped/${sample}.bam.genozip
if [ -f $out ]; then continue; fi # already processed
if [ -f ${out}.doing_now ]; then continue; fi # another instance of this script is working on it
processed=1
touch ${out}.doing_now
echo =========================================
echo Sample $sample
echo =========================================
( genocat $file -e ${ref%.fa}.ref.genozip || >&2 echo "genocat exit=$?" )|\
( fastp --stdin --interleaved_in --stdout --html ${fastq}/${sample}.html --json ${fastq}/${sample}.json || >&2 echo "fastp exit=$?" )|\
( bwa mem $ref - -p -t 54 -T 0 -R "@RG\tID:$sample\tSM:$study\tPL:Illumina" || >&2 echo "bwa exit=$?" )|\
( samtools view -h -OSAM || >&2 echo "samtools exit=$?")|\
( bamsort fixmates=1 adddupmarksupport=1 inputformat=sam outputformat=sam inputthreads=5 outputthreads=5 sortthreads=30 level=1 || >&2 echo "bamsort exit=$?" )|\
( bamstreamingmarkduplicates inputformat=sam inputthreads=3 outputthreads=3 level=1 || >&2 echo "bamstreamingmarkduplicates exit=$?" )|\
( genozip -e $ref -i bam -o $out || >&2 echo "genozip exit=$?" )
rm ${out}.doing_now
done
done