Compressing FASTA files

Compressing

​​

$ genozip chm13.draft_v1.1.fasta.gz
genozip chm13.draft_v1.1.fasta.gz : Done (15 seconds, FASTA compression ratio: 4.8 - better than .fasta.gz by a factor of 1.4)
testing: genounzip chm13.draft_v1.1.fasta.genozip : verified as identical to the original FASTA

​​

$ ls -lh chm13.draft_v1.1.*
-rw-rw-r--+ 1 617M Aug  5 00:05 chm13.draft_v1.1.fasta.genozip
-rw-rw-r--+ 1 852M May  8  2021 chm13.draft_v1.1.fasta.gz
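The "factor of 1.4" reported by genozip can be cross-checked against the file sizes in the listing above. This is a minimal sketch that uses the rounded sizes printed by ls -lh, so the result is only approximate:

```shell
# Sizes in MB, copied from the ls -lh listing (ls rounds them).
gz_size=852        # chm13.draft_v1.1.fasta.gz
genozip_size=617   # chm13.draft_v1.1.fasta.genozip

# Divide the .gz size by the .genozip size, keeping one decimal place.
ratio=$(LC_ALL=C awk -v a="$gz_size" -v b="$genozip_size" 'BEGIN { printf "%.1f", a / b }')
echo "fasta.gz is larger than fasta.genozip by a factor of $ratio"
```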

​​

Compressing FASTAs which are actually just FASTQ without quality scores
​

A special class of FASTA files consists of those that are essentially FASTQ without the quality scores: each sequence occupies a single line. For this type of FASTA, Genozip supports almost all FASTQ compression options.

​

Limitations: --pair is not yet supported for FASTA files, and --deep cannot yet handle multiple FASTA files. Please write to support@genozip.com if you need this.
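To check whether a FASTA file belongs to this class, its layout can be inspected with standard tools. The sketch below is an illustration only: the function name is_single_line_fasta and the sample files are hypothetical, and it does not describe how Genozip itself detects such files.

```shell
# Succeeds (exit status 0) if the file is a strict alternation of one '>'
# description line followed by exactly one sequence line.
is_single_line_fasta() {
    awk '
        NR % 2 == 1 && !/^>/ { bad = 1; exit }   # odd lines must be description lines
        NR % 2 == 0 &&  /^>/ { bad = 1; exit }   # even lines must be sequence lines
        END { exit (bad || NR % 2 != 0 || NR == 0) }
    ' "$1"
}

# Two tiny sample files: one single-line, one with a wrapped sequence.
tmpdir=$(mktemp -d)
printf '>read1\nACGT\n>read2\nGGCA\n' > "$tmpdir/single.fa"
printf '>contig1\nACGT\nTTGA\n'       > "$tmpdir/multi.fa"

if is_single_line_fasta "$tmpdir/single.fa"; then single=yes; else single=no; fi
if is_single_line_fasta "$tmpdir/multi.fa";  then multi=yes;  else multi=no;  fi
echo "single.fa: $single, multi.fa: $multi"
```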

​

​​

​

​Uncompressing

​

$ genounzip chm13.draft_v1.1.fasta.genozip

​

Viewing and analyzing

​

$ genocat chm13.draft_v1.1.fasta.genozip

​

genocat options:

​

--sequential: each sequence is output on a single line; newlines within the sequence are removed.

​

--header-only: shows only the description lines (no sequences).

​

--no-header: shows only the sequences, omitting the description lines.

​

--header-one: shows the description lines truncated at the first space or tab character.

​

--grep string: shows only the sequences in which string is contained in the header.

​

--grep-w string: same as --grep, but string must match a whole word.

​

--regions sequence-name[,sequence-name2...]: shows only the sequences requested. sequence-name is the prefix of the description line up to the first space, tab or newline.

​

--regions-file filename: same as --regions, but the list of sequence names is taken from a file.

​

--head[=lines]: shows the given number of lines from the top of the file. This is similar to piping genocat into head, but faster.

​

--tail[=lines]: similar to --head, but shows lines from the end of the file.

​

--lines [first]-[last] or [first]: shows a range of lines.

​

--downsample rate[,shard]: technically works on FASTA files, but is usually not very useful. See Downsampling.

​

Note: --grep, --grep-w, --regions and --regions-file require the file to be indexed during compression (only applicable when compressing FASTA files). This can be achieved by adding the --index option when compressing. Be aware, however, that --index sometimes reduces compression, and in particular causes --reference to be ignored. --index is set automatically for files that contain up to 10,000 sequences which are assembled contigs (i.e. not sequencing reads).
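The sequence names accepted by --regions follow the rule given above: the description line up to the first space or tab, without the leading ">". As a rough sketch, these names can be listed from an uncompressed FASTA with standard tools and joined into a comma-separated argument. The sample file and the compressed filename sample.fa.genozip are hypothetical, and the final command is only printed, not run:

```shell
# A tiny sample FASTA with a space-separated and a tab-separated description.
tmpdir=$(mktemp -d)
printf '>chr1 first contig\nACGT\n>chr2\tsecond contig\nGGCA\n' > "$tmpdir/sample.fa"

# For each description line, take the first whitespace-delimited field
# and drop the leading '>' to get the sequence name.
names=$(awk '/^>/ { print substr($1, 2) }' "$tmpdir/sample.fa")

# Join the names with commas, the form that --regions expects.
regions=$(printf '%s\n' "$names" | paste -sd, -)

# Print the genocat command that would use them (the file must have been
# compressed with --index, or indexed automatically, for this to work).
echo "genocat --regions $regions sample.fa.genozip"
```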

​​

​

Generating a Genozip reference file from a FASTA file
​

A FASTA file may be used to generate a Genozip reference file, using the --make-reference option:

​

$ genozip --make-reference chm13.draft_v1.1.fasta.gz

​

Reference files are a file format used internally by Genozip, and cannot be uncompressed. Their primary use is for compressing other files, using the --reference or --REFERENCE options, which usually results in significantly improved compression.
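The two steps, creating a reference file and then using it to compress another file, might be scripted as in the sketch below. It rests on two assumptions that should be checked against your installation: that the reference file is named by replacing .fasta.gz with .ref.genozip, and that a reads file called sample.fastq.gz exists (a hypothetical name). The commands run only if genozip and both input files are present; otherwise they are just printed.

```shell
fasta="chm13.draft_v1.1.fasta.gz"
# Assumption: --make-reference names its output <basename>.ref.genozip.
ref="${fasta%.fasta.gz}.ref.genozip"
# Hypothetical file to be compressed against the reference.
reads="sample.fastq.gz"

if command -v genozip >/dev/null 2>&1 && [ -f "$fasta" ] && [ -f "$reads" ]; then
    genozip --make-reference "$fasta"     # step 1: build the reference file
    genozip --reference "$ref" "$reads"   # step 2: compress using the reference
else
    echo "would run: genozip --make-reference $fasta"
    echo "would run: genozip --reference $ref $reads"
fi
```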

​

In addition to their primary use, reference files are also useful for analyzing the underlying FASTA: they can be used to easily view sub-sequences of contigs in certain regions (forward or reverse complemented) using --regions and --regions-file, to find IUPAC non-ACGTN pseudo-bases in the file with --show-ref-iupacs, and to see properties of the contigs with --show-ref-contigs. See more here: Reference file options.

​

Note: A reference file is also created implicitly when a FASTQ, SAM/BAM/CRAM, VCF or FASTA file is compressed with the --reference or --REFERENCE option and with a FASTA filename as the argument of the option.

​

For a full list of options, see the genozip command line reference.

​

Questions? support@genozip.com

​

© 2024 Genozip Limited. All rights reserved. Genozip™ is a trademark. Our technology is patent-pending. Privacy Policy.
