
Compressing GFF, GVF or GTF files

At a glance

Compressing

$ genozip myfile.gff3

genozip myfile.gff3 : Done (4 seconds, GFF3 compression ratio: 6.6)


$ ls -lh myfile.gff3*

-rwxrwxrwx 1 divon divon 26M  Aug 2 22:48 myfile.gff3

-rwxrwxrwx 1 divon divon 3.9M Aug 2 22:49 myfile.gff3.genozip
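The reported ratio can be cross-checked from the two file sizes. The following is a minimal sketch, not part of Genozip: compression_ratio is a hypothetical helper name, and it assumes the ratio is simply the original size divided by the compressed size. Sizes shown by ls -lh are rounded, so a ratio worked out by hand from them (26M / 3.9M) can differ slightly from the figure genozip prints.

```shell
# compression_ratio ORIGINAL COMPRESSED
# Hypothetical helper: prints (size of ORIGINAL) / (size of COMPRESSED) with one decimal.
compression_ratio() {
  orig_bytes=$(wc -c < "$1") || return 1
  comp_bytes=$(wc -c < "$2") || return 1
  awk -v o="$orig_bytes" -v c="$comp_bytes" \
    'BEGIN { if (c + 0 == 0) exit 1; printf "%.1f\n", o / c }'
}

# Usage with the files from the example above:
# compression_ratio myfile.gff3 myfile.gff3.genozip
```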

Uncompressing

$ genounzip myfile.gff3.genozip 

Viewing

$ genocat myfile.gff3.genozip 
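If the original file is still on disk, a byte-for-byte comparison against the genocat output is one way to confirm the round trip. This is a sketch under assumptions: verify_roundtrip is a hypothetical helper, and it assumes genocat writes the uncompressed data to standard output, as the Viewing example suggests. Files compressed with the optimization options described on this page are modified, so they would not be expected to match.

```shell
# verify_roundtrip ORIGINAL CMD [ARGS...]
# Hypothetical helper: runs CMD and compares its standard output with ORIGINAL.
# Returns 0 if they are identical, non-zero otherwise.
verify_roundtrip() {
  original=$1
  shift
  "$@" | cmp -s - "$original"
}

# Usage:
# verify_roundtrip myfile.gff3 genocat myfile.gff3.genozip && echo "round trip OK"
```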

Optimizing compression

Optimization options modify the file's content in ways that improve compression. --optimize is an umbrella option that activates all of them.

genozip --optimize-sort myfile.gff3.gz - Sorts ATTR subfields alphabetically.

genozip --optimize-Vf myfile.gff3.gz - Rounds the value of Variant_freq to 2 significant digits.
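To get a feel for what these two changes do to a line, the sketch below imitates them with awk on uncompressed text. It is an illustration only, not Genozip's implementation: sort_attrs and round_2sig are hypothetical helpers, the sketch assumes that "ATTR subfields" means the semicolon-separated attributes in column 9, and the exact number formatting Genozip produces may differ (for example, %.2g switches to exponent notation for very small values).

```shell
# sort_attrs: reads GFF lines on standard input and sorts the attributes
# in column 9 alphabetically. Hypothetical helper for illustration.
sort_attrs() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ || NF < 9 { print; next }   # pass header and short lines through unchanged
       {
         n = split($9, a, ";")
         # insertion sort of the attribute strings
         for (i = 2; i <= n; i++) {
           v = a[i]
           for (j = i - 1; j >= 1 && a[j] > v; j--) a[j + 1] = a[j]
           a[j + 1] = v
         }
         s = a[1]
         for (i = 2; i <= n; i++) s = s ";" a[i]
         $9 = s
         print
       }'
}

# round_2sig VALUE: prints VALUE rounded to 2 significant digits.
round_2sig() {
  awk -v x="$1" 'BEGIN { printf "%.2g\n", x }'
}

# Usage:
# sort_attrs < myfile.gff3 | head
# round_2sig 0.123456
```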

Slicing & dicing your data with genocat

Here's a summary of the filtering and subsetting options available for GFF3 / GVF files. See genocat for more information.

Option           Short  Effect

--downsample            Show only one in every X lines

--regions        -r     Exclude or include certain genomic regions

--regions-file   -R     Like --regions, but the list of regions is specified in a file

--grep                  Show only lines containing the specified string

--grep-w         -g     Like --grep, but match whole words

--lines          -n     Show only the lines in a given range of line numbers

--head                  Show only a certain number of lines from the start of the file

--tail                  Show only a certain number of lines from the end of the file

--no-header             Drop the GFF3 header lines

--header-only           Show only the GFF3 header lines

Example: display the lines containing “rs1357314184” as a whole word (exact match only):

genocat --grep-w rs1357314184 myfile.gff3.genozip

Example: display the lines containing “Dbxref=dbSNP_152:rs” (possibly a substring of a longer string):

genocat --grep Dbxref=dbSNP_152:rs myfile.gff3.genozip

Example: display positions 1000 to 2000 on contig 22:

genocat myfile.gff3.genozip -r 22:1000-2000   
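The filtered output can also be passed on to standard text tools. As a minimal sketch, assuming genocat writes its result to standard output, the hypothetical helper below (count_feature_types, not part of Genozip) tallies the feature types in column 3 of whatever lines it receives.

```shell
# count_feature_types: reads GFF lines on standard input and prints
# "count<TAB>type" for each value of column 3, most frequent first.
# Hypothetical helper for illustration.
count_feature_types() {
  awk -F '\t' '!/^#/ && NF >= 3 { n[$3]++ }
               END { for (t in n) print n[t] "\t" t }' |
    sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Usage: count the feature types within positions 1000 to 2000 on contig 22
# genocat myfile.gff3.genozip -r 22:1000-2000 | count_feature_types
```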

Supported formats & limitations

Genozip can compress the closely related formats GFF2, GFF3, GTF and GVF. It does not support GFF3 files that include a ##FASTA section, and it may not support some of the many other variations of the GFF format. If you have GFF data that Genozip fails to compress and you would like us to support it, please let us know!

Tip: if you need to compress a file whose format isn't currently supported by Genozip, you can always use --input generic.

Questions? support@genozip.com
