
Converting MultiFASTA to PHYLIP and back

Converting a MultiFASTA file to a PHYLIP file (all sequences must be the same length):

 

genozip mydata.fa.gz

genocat mydata.fa.genozip --phylip --output mydata.phy
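Because the conversion requires all sequences to be the same length, it can help to check the MultiFASTA data before converting. The following is a minimal sketch in Python, not part of genozip; the function names read_fasta and same_length and the example records are hypothetical and used only for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs.

    Sequence lines belonging to the same record are concatenated.
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def same_length(records):
    """Return True if every sequence has the same length."""
    return len({len(seq) for _, seq in records}) <= 1


# Hypothetical example data: seq3 is shorter than the others.
example = ">seq1\nACGT\nAC\n>seq2\nACGTAC\n>seq3\nACG\n"
records = read_fasta(example)
print(same_length(records))      # all three records
print(same_length(records[:2]))  # only seq1 and seq2
```

For a real file, the text would be read from the (decompressed) FASTA file instead of the inline string.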

Converting a PHYLIP file to a MultiFASTA file:

genozip mydata.phy

genocat mydata.phy.genozip --fasta --output mydata.fa.gz

Note: the input files can be plain files, or compressed with .gz, .bz2 or .xz. The output files can be plain files or .gz.

Note: if the PHYLIP input file doesn’t have a .phy (or .phy.gz / .phy.bz2 / .phy.xz) file name extension, you can tell genozip that this is a PHYLIP file with --input phy.

Questions? support@genozip.com

$ genozip myfile.sam # compress with genozip

 

$ genocat myfile.sam.genozip

ST-E00185:547:HCNMNCCX2:5:1118:17269:28593 99 chr1 33656 0 10M = 33901 386 CCTAATGCTA AAAFFJJJJJ NM:i:0

ST-E00185:547:HCNMNCCX2:5:1118:17269:28593 147 chr1 33901 0 10M = 33656 -386 CACATTTTCT JJFAF7-7F7 NM:i:2

ST-E00185:547:HCNMNCCX2:5:1118:17269:28593 355 chr22 33656 0 10M = 33901 386 CCTAATGCTA AAAFFJJJJJ NM:i:1

Let’s now display the data as FASTQ. Notice that:

1. The last SAM line is eliminated because it is a secondary alignment.

2. The read names have /1 and /2 added to them.

3. The second alignment's sequence is reverse-complemented and its base qualities are reversed.

 

$ genocat myfile.sam.genozip --fastq

@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/1

CCTAATGCTA

+

AAAFFJJJJJ

@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/2

AGAAAATGTG

+

7F7-7FAFJJ
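The three rules above can be sketched in Python. This is an illustration of the logic only, under the assumption that it mirrors what the --fastq option does for these three lines; it is not genozip's implementation, and the function name sam_to_fastq is hypothetical. The flag bits used (0x10 reverse strand, 0x40 first in pair, 0x80 second in pair, 0x100 secondary alignment) are the standard SAM FLAG values.

```python
# Translation table for complementing bases (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

REVERSE = 0x10          # read is mapped to the reverse strand
FIRST_IN_PAIR = 0x40    # R1
SECOND_IN_PAIR = 0x80   # R2
SECONDARY = 0x100       # secondary alignment


def sam_to_fastq(sam_line):
    """Convert one SAM alignment line to a 4-line FASTQ record.

    Returns None for secondary alignments. SAM fields are tab-separated;
    split() is used here so that space-separated examples also work.
    """
    fields = sam_line.split()
    qname, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]

    if flag & SECONDARY:
        return None  # rule 1: secondary alignments are dropped

    # Rule 2: add /1 or /2 to the read name.
    if flag & FIRST_IN_PAIR:
        qname += "/1"
    elif flag & SECOND_IN_PAIR:
        qname += "/2"

    # Rule 3: restore the original read orientation.
    if flag & REVERSE:
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]

    return f"@{qname}\n{seq}\n+\n{qual}"


name = "ST-E00185:547:HCNMNCCX2:5:1118:17269:28593"
sam_lines = [
    f"{name} 99 chr1 33656 0 10M = 33901 386 CCTAATGCTA AAAFFJJJJJ NM:i:0",
    f"{name} 147 chr1 33901 0 10M = 33656 -386 CACATTTTCT JJFAF7-7F7 NM:i:2",
    f"{name} 355 chr22 33656 0 10M = 33901 386 CCTAATGCTA AAAFFJJJJJ NM:i:1",
]
fastq_records = [sam_to_fastq(line) for line in sam_lines]
for record in fastq_records:
    if record is not None:
        print(record)
```

Flag 147 has the 0x10 bit set, which is why only the second alignment is reverse-complemented; flag 355 has the 0x100 bit set, which is why the third line is dropped.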

Let’s now output only the R1 reads (the first SAM line in this case). Notice that:

1. /1 is not added.

2. --fq is an alternative spelling for activating the --fastq option.

$ genocat myfile.sam.genozip --fq --FLAG +0x40

@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593

CCTAATGCTA

+

AAAFFJJJJJ
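To see why only the first SAM line is output, the FLAG values of the three example lines can be tested against the 0x40 (first in pair) bit. The sketch below is in Python and assumes that --FLAG +0x40 keeps the lines in which the 0x40 bit is set, which is consistent with the output shown; the variable names are hypothetical.

```python
FIRST_IN_PAIR = 0x40  # R1 bit of the SAM FLAG field
SECONDARY = 0x100     # secondary alignment bit

# FLAG values of the three SAM lines in the example above.
flags = [99, 147, 355]

# Lines in which the 0x40 bit is set (assumed effect of --FLAG +0x40).
r1_flags = [f for f in flags if (f & FIRST_IN_PAIR) == FIRST_IN_PAIR]

# FASTQ output additionally drops secondary alignments.
fastq_flags = [f for f in r1_flags if not (f & SECONDARY)]

print(r1_flags)
print(fastq_flags)
```

Both the first line (FLAG 99) and the third line (FLAG 355) have the 0x40 bit set, but the third is a secondary alignment and is therefore absent from the FASTQ output.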

Finally, we can output all SAM fields to the FASTQ description lines with --fq=all:

$ genocat myfile.sam.genozip --fq=all

@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/1 FLAG:99 RNAME:chr1 POS:33656 MAPQ:0 CIGAR:10M RNEXT:= PNEXT:33901 TLEN:386 NM:i:0

CCTAATGCTA

+

AAAFFJJJJJ

@ST-E00185:547:HCNMNCCX2:5:1118:17269:28593/2 FLAG:147 RNAME:chr1 POS:33901 MAPQ:0 CIGAR:10M RNEXT:= PNEXT:33656 TLEN:-386 NM:i:2

AGAAAATGTG

+

7F7-7FAFJJ

