top of page

Example of a FASTQ to BAM pipeline using Genozip

Overview

This pipeline starts with a .fq.genozip file, cleans it up with fastp, maps it with bwa mem, sorts it with bamsort, removes duplicates with bamstreamingmarkduplicates and finally outputs a .bam.genozip file.

This pipeline has Genozip files on both ends, and uses no intermediate files, so it is efficient in storage usage.

Genozip adds very little overhead, as its CPU consumption is relatively minor in comparison with the tools used in this pipeline, in particular bwa mem.

Notes

• The input fastq.genozip files were originally created from pairs of fq.gz files:

genozip sample01-R1.fq.gz sample01-R2.fq.gz --pair

 

• Since the .fq.genozip file was generated with genozip --pairgenocat outputs FASTQ data in interleaved format, i.e. reads from R1 interleaved with reads from R2, which is the format expeceted by fastp --interleaved_in.

• This script is designed to have multiple instances of it running in parallel. Each instance will work on different files.

• Technical note: ${out}.doing_now is used as a mutex and touch/rm as mutex_lock/unlock. Since touch is not atomic, exclusivity is not 100% guarateed, but in practice it seems to work.

Script

#!/bin/bash

ref=GRCh38_full_analysis_set_plus_decoy_hla.fa
study=mystudy
fastq=myfastqdir
mapped=mymappeddir

files=($fastq/*.genozip)

processed=1

while (( processed == 1 )); do
    processsed=0
    for file in ${files[@]}
    do
        sample=`echo $file | grep -o -E sample'[[:digit:]]{2}'`  # convert the file name to a sample name
        out=$mapped/${sample}.bam.genozip

        if [ -f $out ]; then continue; fi # already processed
        if [ -f ${out}.doing_now ]; then continue; fi # another instance of this script is working on it

        processed=1
        touch ${out}.doing_now

        echo =========================================
        echo Sample $sample
        echo =========================================

        ( genocat $file -e ${ref%.fa}.ref.genozip || >&2 echo "genocat exit=$?" )|\
        ( fastp --stdin --interleaved_in --stdout --html ${fastq}/${sample}.html --json ${fastq}/${sample}.json || >&2 echo "fastp exit=$?"   )|\
        ( bwa mem $ref - -p -t 54 -T 0 -R "@RG\tID:$sample\tSM:$study\tPL:Illumina" || >&2 echo "bwa exit=$?"     )|\
        ( samtools view -h -OSAM || >&2 echo "samtools exit=$?")|\
        ( bamsort fixmates=1 adddupmarksupport=1 inputformat=sam outputformat=sam inputthreads=5 outputthreads=5 sortthreads=30 level=1 || >&2 echo "bamsort exit=$?" )|\
        ( bamstreamingmarkduplicates inputformat=sam inputthreads=3 outputthreads=3 level=1                     || >&2 echo "bamstreamingmarkduplicates exit=$?" )|\
        ( genozip -e $ref -i bam -o $out || >&2 echo "genozip exit=$?" )

        rm ${out}.doing_now
    done
done

bottom of page