Compressing
$ genozip chm13.draft_v1.1.fasta.gz
genozip chm13.draft_v1.1.fasta.gz : Done (15 seconds, FASTA compression ratio: 4.8 - better than .fasta.gz by a factor of 1.4)
testing: genounzip chm13.draft_v1.1.fasta.genozip : verified as identical to the original FASTA
$ ls -lh chm13.draft_v1.1.*
-rw-rw-r--+ 1 divon divon 617M Aug 5 00:05 chm13.draft_v1.1.fasta.genozip
-rw-rw-r--+ 1 divon divon 852M May 8 2021 chm13.draft_v1.1.fasta.gz
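As a rough sanity check, the "factor of 1.4" reported by genozip can be reproduced from the two file sizes in the listing above. This is only an arithmetic sketch: the sizes are the rounded values printed by ls -lh, and the variable names are made up for the illustration.

```shell
# Sizes in MB as printed by `ls -lh` above (rounded, so the result is approximate).
gz_size=852        # chm13.draft_v1.1.fasta.gz
genozip_size=617   # chm13.draft_v1.1.fasta.genozip

# Ratio of the .gz size to the .genozip size, to one decimal place.
factor=$(LC_ALL=C awk -v gz="$gz_size" -v gnz="$genozip_size" \
  'BEGIN { printf "%.1f", gz / gnz }')

echo "The .genozip file is smaller than the .fasta.gz by a factor of about $factor"
```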
Uncompressing
$ genounzip chm13.draft_v1.1.fasta.genozip
Viewing and analyzing
$ genocat chm13.draft_v1.1.fasta.genozip
genocat options:
--sequential: outputs each sequence on a single line, with the newlines inside the sequence removed.
--header-only: shows only the description lines (no sequences).
--no-header: shows only the sequences, omitting the description lines.
--header-one: shows the description lines truncated at the first space or tab character.
--grep string: shows only the sequences in which string is contained in the header.
--grep-w string: same as --grep, but string must match a whole word.
--regions sequence-name[,sequence-name2...]: shows only the requested sequences. sequence-name is the prefix of the description line up to the first space, tab or newline.
--regions-file filename: same as --regions, but the list of sequence names is taken from a file.
--head[=lines]: shows the given number of lines from the top of the file. This is similar to piping genocat | head, but faster.
--tail[=lines]: similar to --head, but shows lines from the end of the file.
--lines [first]-[last] or [first]: shows a range of lines.
--phylip: views the data in PHYLIP format. See Converting MultiFASTA to PHYLIP and back.
--taxid: filters sequences by species using kraken2 data. See Filtering with Kraken.
--downsample rate[,shard]: technically works on FASTA files, but is usually not very useful. See Downsampling.
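To make two of the descriptions above concrete, here is a minimal sketch that imitates, with plain awk on an uncompressed FASTA, the effect described for --sequential and the sequence-name rule described for --regions. It is not how genocat is implemented and does not reproduce its exact output. The sample records and the function names sequential and select_regions are made up for the illustration, and dropping the leading ">" from the sequence name is an assumption.

```shell
# A tiny two-record FASTA, made up for this illustration.
# The second description line uses a tab after the sequence name.
fasta=$(printf '>seq1 Homo sapiens sample A\nACGTAC\nGTTT\n>seq2\tanother description\nNNNNAC\nGT\n')

# Effect described for --sequential: each record's sequence lines joined into one line.
sequential() {
  awk '
    /^>/ { if (seq != "") print seq; seq = ""; print; next }
         { seq = seq $0 }
    END  { if (seq != "") print seq }
  '
}

# Name rule described for --regions: the sequence name is the description line
# up to the first space or tab (assumption: without the leading ">").
# $1 is a comma-separated list of names, e.g. "seq2" or "seq1,seq2".
select_regions() {
  awk -v names="$1" '
    BEGIN { n = split(names, list, ","); for (i = 1; i <= n; i++) want[list[i]] = 1 }
    /^>/  { name = substr($0, 2); sub(/[ \t].*/, "", name); keep = (name in want) }
    keep  { print }
  '
}

sequential_out=$(printf '%s\n' "$fasta" | sequential)
regions_out=$(printf '%s\n' "$fasta" | select_regions seq2 | sequential)

printf '%s\n' "$sequential_out"
printf '%s\n' "$regions_out"
```

On a .genozip file the corresponding commands would be genocat --sequential and genocat --regions with the file name, which avoid decompressing to disk first.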
Multiseq FASTA files
We define a Multiseq file as a Multi-FASTA in which we expect the sequences to be quite similar to each other, for example a file consisting of many samples of the same virus species.
Genozip offers some additional functionality for Multiseq FASTAs:
1. Compressing: use the --multiseq option to inform Genozip that this is a Multiseq FASTA. Genozip uses this information to improve the compression.
2. Converting to PHYLIP format: See Converting MultiFASTA to PHYLIP and back.
Reference files
A FASTA file may be compressed as a Genozip reference file, using the --make-reference option:
$ genozip --make-reference chm13.draft_v1.1.fasta.gz
Reference files are a file format used in Genozip internally, and cannot be uncompressed. They are also created implicitly when files, such as FASTQ, BAM or VCF, are compressed with the --reference or --REFERENCE options.
The main role of reference files is to be used for compressing other files, using the --reference or --REFERENCE options, which usually results in significantly improved compression.
Beyond this primary use, reference files are also useful for analyzing the underlying FASTA: they can be used to easily view sub-sequences of contigs in certain regions (forward or reverse complemented) using --regions and --regions-file, to find IUPAC non-ACGTN pseudo-bases in the file with --show-ref-iupacs, and to see properties of the contigs with --show-ref-contigs. See more here: Reference file options.
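To illustrate what "IUPAC non-ACGTN pseudo-bases" means, the sketch below scans a small made-up FASTA with awk and reports every base that is not A, C, G, T or N, together with its contig name and 1-based position. It only illustrates the concept: it is not the implementation of --show-ref-iupacs and does not reproduce that option's output format. The contig names and sequences are invented.

```shell
# Made-up contigs for illustration; R, Y and M are IUPAC ambiguity codes.
iupac_out=$(
  printf '>contig1\nACGTRNNA\nYACGT\n>contig2\nACGTN\nMMAC\n' |
  awk '
    # New contig: remember its name and restart the position counter.
    /^>/ { name = substr($0, 2); pos = 0; next }
    {
      n = length($0)
      for (i = 1; i <= n; i++) {
        base = toupper(substr($0, i, 1))
        pos++
        # Report anything that is not one of A, C, G, T, N.
        if (index("ACGTN", base) == 0) print name, pos, base
      }
    }
  '
)

printf '%s\n' "$iupac_out"
```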
For a full list of options, see the genozip command line reference.
Questions? support@genozip.com