
Converting MultiFASTA to PHYLIP and back
Converting a MultiFASTA file to a PHYLIP file (all sequences must be the same length):
genozip mydata.fa.gz
genocat mydata.fa.genozip --phylip --output mydata.phy
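Because the conversion requires all sequences to be the same length, it can help to check this before compressing. The following is a minimal sketch, assuming a POSIX shell with awk, sort and uniq available. The function name fasta_lengths is hypothetical and is not part of genozip. It prints each distinct sequence length found in a FASTA stream, so a single line of output means all sequences are the same length.

```shell
# fasta_lengths: illustrative helper (not a genozip command).
# Reads FASTA from standard input and prints each distinct sequence length once.
fasta_lengths() {
    awk '
        { sub(/\r$/, "") }                                      # tolerate Windows line endings
        /^>/ { if (seen) print len; seen = 1; len = 0; next }   # header: flush the previous record
        { len += length($0) }                                   # sequence line: accumulate its length
        END { if (seen) print len }                             # flush the final record
    ' | sort -n | uniq
}
```

For a gzip-compressed file, this could be used as: gzip -dc mydata.fa.gz | fasta_lengths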
Converting a PHYLIP file to a MultiFASTA file:
genozip mydata.phy
genocat mydata.phy.genozip --fasta --output mydata.fa.gz
Note: the input files can be plain files or compressed with .gz, .bz2 or .xz. The output files can be plain files or .gz-compressed.
Note: if the PHYLIP input file doesn’t have a .phy (or .phy.gz / .phy.bz2 / .phy.xz) file name extension, you can tell genozip that this is a PHYLIP file with --input phy.
Questions? support@genozip.com
$ genozip myfile.sam # compress with genozip
$ genocat myfile.sam.genozip
ST-E00185:547:HCNMNCCX2:5:1118:17269:28593 99 chr1 33656 0 10M = 33901 386 CCTAATGCTA AAAFFJJJJJ NM:i:0
ST-E00185:547:HCNMNCCX2:5:1118:17269:28593 147 chr1 33901 0 10M = 33656 -386 CACATTTTCT JJFAF7-7F7 NM:i:2
ST-E00185:547:HCNMNCCX2:5:1118:17269:28593 355 chr22 33656 0 10M = 33901 386 CCTAATGCTA AAAFFJJJJJ NM:i:1
Let’s now display the data as FASTQ. Notice that:
1. The last SAM line is omitted because it is a secondary alignment (its FLAG, 355, has the 0x100 bit set).
2. The read names have /1 and /2 appended to them.
3. The second alignment's sequence is reverse-complemented and its base qualities are reversed, because its FLAG (147) has the reverse-strand bit (0x10) set.
$ genocat myfile.sam.genozip --fastq
@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/1
CCTAATGCTA
+
AAAFFJJJJJ
@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/2
AGAAAATGTG
+
7F7-7FAFJJ
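The transformation of the second read can be reproduced by hand. The following is a minimal sketch, assuming a shell with the rev and tr utilities available. The function names revcomp and revqual are hypothetical helpers, not genozip commands. It applies them to the SEQ and QUAL fields of the second SAM line.

```shell
# Reverse-complement a DNA sequence: reverse it, then swap A<->T and C<->G.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Reverse a base-quality string (qualities are only reversed, never complemented).
revqual() {
    printf '%s\n' "$1" | rev
}

revcomp CACATTTTCT    # prints AGAAAATGTG
revqual 'JJFAF7-7F7'  # prints 7F7-7FAFJJ
```

Both results match the /2 record in the FASTQ output above.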
Let’s now output only the R1 reads (the first SAM line in this case). Notice that:
1. /1 is not added to the read name.
2. --fq is an alternative spelling for activating the --fastq option.
$ genocat myfile.sam.genozip --fq --FLAG +0x40
@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593
CCTAATGCTA
+
AAAFFJJJJJ
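The FLAG values of the three SAM lines explain which reads appear in the outputs above. The following is a minimal sketch, assuming a shell with POSIX arithmetic expansion. The helper name describe_flag is hypothetical and not part of genozip. The bit meanings used (0x10 reverse strand, 0x40 first segment, 0x80 last segment, 0x100 secondary alignment) are those defined in the SAM specification.

```shell
# describe_flag: illustrative helper that names the FLAG bits relevant to FASTQ output.
describe_flag() {
    flag=$1
    out=""
    [ $(( flag & 0x10 ))  -ne 0 ] && out="$out reverse"    # SEQ is stored reverse-complemented
    [ $(( flag & 0x40 ))  -ne 0 ] && out="$out R1"         # first segment in the template
    [ $(( flag & 0x80 ))  -ne 0 ] && out="$out R2"         # last segment in the template
    [ $(( flag & 0x100 )) -ne 0 ] && out="$out secondary"  # secondary alignment
    echo "$flag:$out"
}

for f in 99 147 355; do describe_flag "$f"; done
# prints:
# 99: R1
# 147: reverse R2
# 355: R1 secondary
```

Only FLAG 99 is both an R1 read and not a secondary alignment, which is consistent with the single record shown for --FLAG +0x40.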
Finally, we can output all SAM fields to the FASTQ description lines with --fq=all:
$ genocat myfile.sam.genozip --fq=all
@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/1 FLAG:99 RNAME:chr1 POS:33656 MAPQ:0 CIGAR:10M RNEXT:= PNEXT:33901 TLEN:386 NM:i:0
CCTAATGCTA
+
AAAFFJJJJJ
@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/2 FLAG:147 RNAME:chr1 POS:33901 MAPQ:0 CIGAR:10M RNEXT:= PNEXT:33656 TLEN:-386 NM:i:2
AGAAAATGTG
+
7F7-7FAFJJ