Display and analyze a file compressed with genozip.

Usage


genocat [options]… [files]…

 

One or more file names must be given.

 

General options

-e, --reference filename

Load a reference file prior to decompressing. Used only for files compressed with --reference. If not provided, genocat will use the same reference filename as used for genozip.

Note: this is equivalent to setting the environment variable $GENOZIP_REFERENCE to the reference filename.

-f, --force  

Force overwrite of the output file.

-D, --subdirs  

If a file name on the command line is a directory, include all files in that directory (recursively).

-o, --output output-filename

Output to this filename.

 

Note: output-filename can also be a directory name, in which case the output file is written to the specified directory. If the name has a ‘/’ suffix (e.g. “-o my-dir/”), then the directory is created if it doesn’t already exist.

-p, --password password

Provide password to access file(s) that were compressed with --password.

--count

Rather than displaying the file content just report the number of lines (FASTQ: reads ; CHAIN: sets) (excluding the header) that would have been displayed. Useful in combination with filtering options.

Limitation: cannot be used in combination with --downsample, --head, --tail or --lines.

-z, --bgzf level 

Compress the output to the BGZF format (.gz extension) using libdeflate at the compression level specified by the argument. level specifies the BGZF compression level from 0 (no compression) to 12 (best yet slowest compression). If you are not sure what value to choose - 6 is a popular option.

Use --bgzf=exact to instruct genocat to attempt to re-create the exact BGZF compression of the original file. Whether genocat succeeds depends on the compression library used by the application that generated the original file. See also: Compressing already-compressed files.

-q, --quiet  

Don't show the progress indicator or warnings.

-Q, --noisy

The --quiet option is turned on by default when outputting to the terminal. --noisy stops the suppression of warnings.

-@, --threads number

Specify the maximum number of threads. By default genozip allocates 1.1 threads per core in order to maximize usage of all available cores. The exception is Mac and Windows (including WSL), where the default allocation is 0.75 threads per core to keep the operating system's user interface responsive.

Note: For genounzip and genocat this limit is only approximate. For genozip, it is strictly enforced.

-w, --stats

Show the internal structure of a genozip file and the associated compression statistics.

-W, --STATS

Show more detailed statistics

Note: Specifying -w or -W twice causes the header line of the statistics to be printed to stderr, so that it survives piping stdout to grep.

--show-filename

Show the file name for each file.

-T, --files-from filename

An alternative to providing input file names on the command line. filename is a text file containing a newline-separated list of files. If filename is - (a hyphen), data is taken from stdin rather than from a file.

--log filename

Send non-file output to a log file instead of the terminal.

--echo

Output the full command line upon successful or failed completion of execution.

--help

Show a link to this page.

-L, --license, --licence

Show the license terms and conditions for this product as accepted. Combine with --force to see the version of the license current to the version of Genozip used. If you wish to change your license to the most recent one - make sure your version of Genozip is the latest and re-register with genozip --register.

-V, --version

Display Genozip's version number

--show-reference  

Show the name and MD5 of the reference file that needs to be provided to uncompress this file.

Subsetting options

Note: subsetting options are options that filter or otherwise modify the data

--downsample rate[,shard]

Applicable data types: all

Show only one in every rate lines (reads in the case of FASTQ ; sequences in the case of FASTA).

The optional shard parameter indicates which of the shards is shown - it must be a value between 0 and rate-1. 

Other subsetting options (if any) will be applied to the surviving lines.
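
The relationship between rate and shard can be sketched in Python. This is an illustration only, not Genozip's implementation: it assumes that a shard consists of the lines whose zero-based index modulo rate equals shard, and that shard defaults to 0. Neither detail is confirmed by this page, and the function name downsample is hypothetical.

```python
def downsample(lines, rate, shard=0):
    """Keep one line in every `rate` lines.

    Assumption (not confirmed by the genocat documentation): the kept lines
    are those whose zero-based index i satisfies i % rate == shard, and the
    default shard is 0.
    """
    if rate < 1:
        raise ValueError("rate must be at least 1")
    if not 0 <= shard < rate:
        raise ValueError("shard must be between 0 and rate-1")
    return [line for i, line in enumerate(lines) if i % rate == shard]


if __name__ == "__main__":
    reads = [f"read{i}" for i in range(10)]
    print(downsample(reads, 3))      # shard 0 of 3
    print(downsample(reads, 3, 1))   # shard 1 of 3
```

Under this assumption, running the same rate with every shard value from 0 to rate-1 would split the file into non-overlapping parts that together cover all lines.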

-r, --regions [^]chr | chr:pos | pos | chr:from-to | chr:from- | chr:-to | from-to | from- | -to | from+len [,...]

Applicable data types: VCF SAM/BAM GFF3/GVF FASTA 23andMe Chain Reference

Show one or more regions of the file.

Examples:

genocat myfile.vcf.genozip -r 22:1000-2000    # Positions 1000 to 2000 on contig 22

genocat -e myfile.ref.genozip -r 22:2000-1000 # Reverse complement of positions 1000 to 2000 on contig 22 (reference file only)

genocat myfile.sam.genozip -r 22:1000+151     # 151 bases, starting pos 1000, on contig 22

genocat -e myfile.ref.genozip -r 22:1000-151  # Reverse complement of 151 bases, from 1000 to 850, on contig 22 (reference file only)

genocat myfile.vcf.genozip -r -2000,2500-     # Two ranges on all contigs

genocat myfile.sam.genozip -r chr21,chr22     # Contigs chr21 and chr22 in their entirety

genocat myfile.vcf.genozip -r ^MT,Y           # All contigs, excluding MT and Y

genocat myfile.vcf.genozip -r ^-1000          # All contigs, excluding positions up to 1000

genocat myfile.fa.genozip  -r chrM            # Contig chrM

Note: genozip files are indexed automatically during compression. There is no separate indexing step or separate index file.


Note: Indels are considered part of a region if their start position is.

Note: Multiple -r arguments may be specified - this is equivalent to chaining their regions with a comma separator in a single argument.

Note: For SAM/BAM files, unlike samtools, this works even if the files are not sorted by position.

Note: For Reference files, use in combination with --reference (or -e).

Note: For FASTA and Chain files, only whole-contig regions are possible.

Note: For Chain files this applies to the Primary contig (qName).

Note: Combine with --gpos to see Global POSition values instead of positions on chromosomes.

-R, --regions-file [^]filename

Applicable data types: VCF SAM/BAM GFF3/GVF FASTA 23andMe Chain Reference

Show regions from a list in a tab-separated file. To include all regions except those in the file, prefix the filename with ^. If filename is - (or ^-), data is taken from stdin rather than from a file.

Example of a valid file: The first two rows (ignoring the comment line) produce the same 100-base region, and the third row is a single base:

# Comment lines starting with a # are ignored.

chr22 17000000 17000099

chr22 17000000 +100

chr22 17000000
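
The three row forms in the example file can be sketched in Python as follows. This is only an illustration of how such rows map to inclusive coordinates; parse_region_row is a hypothetical helper, not part of Genozip, and it covers only the forms shown above (rows with a position, plus comment and blank lines).

```python
def parse_region_row(row):
    """Parse one row of a regions file into (chrom, first, last), inclusive.

    Handles the three forms shown in the example file:
      chrom  pos  last     -> explicit end position
      chrom  pos  +len     -> length relative to pos
      chrom  pos           -> a single base
    Returns None for blank lines and '#' comment lines.
    """
    row = row.strip()
    if not row or row.startswith("#"):
        return None
    fields = row.split()  # splits on tabs (and, leniently, on spaces)
    chrom, first = fields[0], int(fields[1])
    if len(fields) == 2:
        return chrom, first, first
    end = fields[2]
    if end.startswith("+"):
        return chrom, first, first + int(end[1:]) - 1
    return chrom, first, int(end)


if __name__ == "__main__":
    for row in ["# comment", "chr22\t17000000\t17000099",
                "chr22\t17000000\t+100", "chr22\t17000000"]:
        print(parse_region_row(row))
```

The +100 form ends at 17000000 + 100 - 1 = 17000099, which is why the first two rows describe the same 100-base region.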

--grep string 

Show only lines (FASTA: sequences ; FASTQ: reads ; CHAIN: sets) in which string is a case-sensitive substring of the line (FASTA: of the description). This does not affect the display of the file header.

-g, --grep-w string

Same as --grep, but with whole-word matching.

-n, --lines [first]-[last] or [first]

Show a certain range of lines. first and last are numbers of lines in the file (starting from 1, excluding header).

Examples:

genocat --lines 1000-2000 # displays the 1001 lines between 1000 and 2000

genocat --lines=1000-     # displays all lines starting from 1000 (optional =)

genocat -n -2000          # displays lines 1 to 2000 (-n instead of --lines)

genocat -n 1000           # displays 10 lines starting from line 1000

Note on outputting as BAM: The numbering excludes the BAM header.

Note on FASTQ: The numbering is of reads rather than lines.

Note: Line numbers are taken before any additional filters are applied.
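
The four argument forms in the examples above can be summarized with a small Python sketch. It is an illustration of the documented behavior rather than Genozip's own parser; parse_lines_arg and its default_count parameter are hypothetical names.

```python
def parse_lines_arg(arg, default_count=10):
    """Translate a --lines argument into an inclusive (first, last) pair.

    `last` is None when the range is open-ended ("1000-").
    A bare number shows `default_count` lines starting from that line.
    """
    if "-" not in arg:
        first = int(arg)
        return first, first + default_count - 1
    first_text, last_text = arg.split("-", 1)
    first = int(first_text) if first_text else 1   # "-2000" starts at line 1
    last = int(last_text) if last_text else None   # "1000-" runs to the end
    return first, last


if __name__ == "__main__":
    for arg in ["1000-2000", "1000-", "-2000", "1000"]:
        print(arg, "->", parse_lines_arg(arg))
```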

--head[=num_lines]

Show num_lines lines from the start of the file (default is 10). Line count excludes header.

--tail[=num_lines]

Show num_lines lines from the end of the file (default is 10).

Subsetting: Filtering using kraken data

Applicable data types: SAM, BAM, FASTQ and FASTA - See: Filtering BAM, FASTQ or FASTA data by species using kraken2

-K, --kraken filename

Load kraken2 data. filename is a kraken2-generated file (genozipped or not). See more.

-k, --taxid [^]taxid[,taxid...][+0]

Show only lines that match the Taxonomy ID taxid. ^ for a negative search. +0 means taxid AND unclassified. Multiple taxids may be specified in a comma-separated list. Requires either using in combination with --kraken or for the file to have been compressed with genozip --kraken. See more.

--show-kraken[=INCLUDED|EXCLUDED]

In combination with --taxid, reports whether each line is included or excluded. =INCLUDED or =EXCLUDED reports only a subset of lines accordingly. Combine with --count for a fast report without displaying the file itself. See more.

VCF options

-s, --samples [^]sample_name[,...] or num_samples

Show a subset of samples (individuals). 

Examples:

genocat myfile.vcf.genozip -s HG00255,HG00256  # show two samples

genocat myfile.vcf.genozip -s ^HG00255,HG00256 # show all samples except these two

genocat myfile.vcf.genozip -s 5                # show the first 5 samples

Note: This does not change the INFO data (including the AC, AF, AN tags).

 

Note: sample_name is case-sensitive.

Note: Multiple -s arguments may be specified - this is equivalent to chaining their samples with a comma separator in a single argument.

Note: The INFO/DP field may display -1 for some variants if file was compressed with --best as it requires FORMAT/DP data of all samples to reconstruct.
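
The three forms of the --samples argument shown in the examples can be sketched in Python. This is an illustration only: select_samples is a hypothetical helper, and it assumes that the selected samples keep their order in the file and that an all-digit argument is always treated as num_samples.

```python
def select_samples(all_samples, spec):
    """Return the samples that --samples `spec` would show.

    Assumptions (not confirmed by the documentation): output keeps the
    file's sample order, and an all-digit spec means "first N samples".
    """
    if spec.isdigit():
        return all_samples[:int(spec)]
    negate = spec.startswith("^")           # ^ applies to the whole list
    names = set((spec[1:] if negate else spec).split(","))
    # Matching is case-sensitive, as noted above.
    return [s for s in all_samples if (s in names) != negate]


if __name__ == "__main__":
    samples = ["HG00254", "HG00255", "HG00256", "HG00257"]
    print(select_samples(samples, "HG00255,HG00256"))
    print(select_samples(samples, "^HG00255,HG00256"))
    print(select_samples(samples, "3"))
```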

-G, --drop-genotypes

Output the data without the samples and FORMAT column. 

Note: This does not change the INFO data (including the AC, AF, AN tags).

Note: The INFO/DP field may display -1 for some variants if file was compressed with --best as it requires the FORMAT/DP data to reconstruct.

--GT-only

Within samples output only genotype (GT) data - dropping the other tags.

Note: The INFO/DP field may display -1 for some variants if file was compressed with --best as it requires the FORMAT/DP data to reconstruct.

--snps-only

Drops variants that are not a Single Nucleotide Polymorphism (SNP).

--indels-only

Drops variants that are not Insertions or Deletions (indel).

--unsorted

If a file contains a reconstruction plan (see genozip --sort) the file will be displayed sorted by default. --unsorted overrides this behavior and shows the file in its unsorted form. This is useful if the file was highly unsorted causing sorting during genocat to consume a lot of memory.

-1, --header-one

Output only the last line of the header (the line with the field and sample names).

--bcf  

Output as BCF. Note: bcftools needs to be installed for this option to work.

--luft

Render a DVCF file in Luft coordinates (absent this option, a DVCF will be rendered in Primary coordinates).

See: Dual-coordinate VCF files

--single-coord

Remove all DVCF-specific lines from the VCF header and remove the DVCF INFO annotations. This leaves the file as a normal VCF file in single coordinates - either the Luft coordinates (when combined with --luft) or Primary coordinates.

See: Dual-coordinate VCF files

-y, --show-dvcf

For each variant show its coordinate system (Primary or Luft or Both) and its oStatus. May be used with or without --luft.

See: Dual-coordinate VCF files

--show-ostatus

Add oSTATUS to the INFO field. May be used with or without --luft.

See: Dual-coordinate VCF files

--show-counts=o\$TATUS

Show summary statistics of variant lift outcome.

See: Dual-coordinate VCF files

--show-counts=COORDS

Show summary statistics of variant coordinates.

See: Dual-coordinate VCF files

--no-PG

Suppresses adding a "##genozip_command" line to the VCF header.

--gpos

Replaces (CHROM,POS) with a coordinate in GPOS (Global POSition) terms. GPOS is a single genome-wide coordinate defined by a reference file, in which contigs appear in the order of the original FASTA data used to generate the reference file. Must be used in combination with --reference. The mapping of CHROM to GPOS can be viewed with

genocat --show-ref-contigs reference-file.ref.genozip.

BAM and SAM options

 

-H, --no-header

Don't output the SAM header lines.

-h, --header-only

Output only the SAM header lines.

--FLAG {+-^}value

Filter alignments based on the FLAG value: the argument is one of +, - or ^ followed by a decimal or hexadecimal value, or by a flag name (or its unique prefix) from the table below.

+   Includes alignments in which all flags in value are set in the line’s FLAG

-   Includes alignments in which no flags in value are set in the line’s FLAG

^   Excludes alignments in which all flags in value are set in the line’s FLAG

Example: --FLAG -192 includes only alignments in which neither FLAG 64 nor 128 are set. This can also be expressed as --FLAG -0xC0

Example: --FLAG +SUPP includes only alignments in which the SUPPLEMENTARY flag (2048) is set.

Decimal  Hex    Name           Meaning
1        0x1    MULTI          template having multiple segments in sequencing
2        0x2    ALIGNED        each segment properly aligned according to the aligner
4        0x4    UNMAPPED       segment unmapped
8        0x8    NUNMAPPED      next segment in the template unmapped
16       0x10   REVCOMP        SEQ being reverse complemented
32       0x20   NREVCOMP       SEQ of the next segment in the template being reverse complemented
64       0x40   FIRST          the first segment in the template
128      0x80   LAST           the last segment in the template
256      0x100  SECONDARY      secondary alignment
512      0x200  FILTERED       not passing filters, such as platform/vendor quality controls
1024     0x400  DUPLICATE      PCR or optical duplicate
2048     0x800  SUPPLEMENTARY  supplementary alignment
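
The +, - and ^ rules for --FLAG amount to simple bit-mask tests, which the following Python sketch spells out. It is an illustration of the documented semantics, not Genozip's code; flag_filter_keeps is a hypothetical name, and flag names such as SUPP are deliberately not handled here.

```python
def flag_filter_keeps(spec, flag):
    """Return True if an alignment with FLAG `flag` passes --FLAG `spec`.

    `spec` is '+', '-' or '^' followed by a decimal or 0x-prefixed hex value.
    Flag names (e.g. SUPP) are not handled in this sketch.
    """
    op, value = spec[0], int(spec[1:], 0)  # base 0 accepts both 192 and 0xC0
    masked = flag & value
    if op == "+":
        return masked == value   # all flags in value are set
    if op == "-":
        return masked == 0       # no flags in value are set
    if op == "^":
        return masked != value   # drop the line only if all flags are set
    raise ValueError(f"unknown operator {op!r}")


if __name__ == "__main__":
    # FLAG 99 = 1 + 2 + 32 + 64: the FIRST flag (64) is set.
    print(flag_filter_keeps("-192", 99))
    # FLAG 2064 = 2048 + 16: the SUPPLEMENTARY flag is set.
    print(flag_filter_keeps("+0x800", 2064))
```

Note that - and ^ differ when value contains several flags: -192 drops a line if either 64 or 128 is set, whereas ^192 drops it only if both are set.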

--MAPQ [^]value

Filter alignments based on the MAPQ value: include (or exclude, if value is prefixed with ^) lines with a MAPQ greater than or equal to value.

--bases [^]value

Filter alignments based on the IUPAC nucleotide codes of the sequence data.

Examples:

 

genocat --bases ACGT  # displays only lines in which all characters of the SEQ are one of A,C,G,T

genocat --bases ^ACGT # displays only lines that contain non-A,C,G,T characters (e.g. an N)

Note: In SAM/BAM, all alignments missing a sequence (i.e. SEQ=*) are included in positive --bases filters (the first example above) and excluded in negative ones.

Note: The list of IUPAC characters can be found here: IUPAC codes
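
The positive and negative forms of --bases, including the treatment of a missing sequence, can be sketched in Python as below. This is an illustration of the rules stated above; bases_filter_keeps is a hypothetical name, and the comparison is assumed to be case-sensitive, which this page does not specify.

```python
def bases_filter_keeps(spec, seq):
    """Return True if a line with sequence `seq` passes --bases `spec`.

    Assumption: characters are compared case-sensitively.
    """
    negative = spec.startswith("^")
    allowed = set(spec[1:] if negative else spec)
    if seq == "*":
        # SAM/BAM alignment without a sequence: kept by positive filters,
        # dropped by negative ones.
        return not negative
    all_allowed = all(base in allowed for base in seq)
    return not all_allowed if negative else all_allowed


if __name__ == "__main__":
    print(bases_filter_keeps("ACGT", "ACGTTGCA"))   # only A,C,G,T
    print(bases_filter_keeps("^ACGT", "ACGNTGCA"))  # contains an N
```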

--bam  

Output as BAM. This option is implicit if --output specifies a filename ending with .bam

--sam  

Output as SAM. This option is the default in genocat on SAM and BAM data and is implicit if --output specifies a filename ending with .sam

--fastq[=all]  

Output as FASTQ. This option is implicit if --output specifies a filename ending with .fq or .fastq. If --fastq=all is specified, all SAM fields are written to the FASTQ file. See more details: Converting SAM/BAM to FASTQ.

 

--no-PG

Suppress adding a @PG line to the file header.

FASTQ options

 

--interleaved[=both|either]  

For FASTQ data compressed with --pair: Show every pair of paired-end FASTQ files with their reads interleaved: first one read of the first file ; then a read from the second file ; then the next read from the first file and so on. Optional argument 'both' (default) or 'either' determines whether both reads of a pair or only one is required for the pair to survive when combining with a subsetting option such as --grep.
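
The difference between 'both' and 'either' can be sketched in Python. This is an illustration under stated assumptions, not Genozip's implementation: interleave and keep are hypothetical names, keep stands in for any subsetting option such as --grep, and a surviving pair is assumed to be output in full (both reads).

```python
def interleave(r1_reads, r2_reads, keep, mode="both"):
    """Interleave paired reads, applying a subsetting predicate `keep`.

    mode="both":   a pair survives only if both reads pass `keep`.
    mode="either": a pair survives if at least one read passes `keep`.
    Assumption: a surviving pair is emitted whole, R1 followed by R2.
    """
    if mode not in ("both", "either"):
        raise ValueError("mode must be 'both' or 'either'")
    combine = all if mode == "both" else any
    out = []
    for r1, r2 in zip(r1_reads, r2_reads):
        if combine((keep(r1), keep(r2))):
            out.extend((r1, r2))   # R1 first, then its mate from R2
    return out


if __name__ == "__main__":
    r1 = ["ACGT", "TTTT", "GGGA"]
    r2 = ["ACCC", "TTAA", "CCCC"]
    has_a = lambda read: "A" in read   # stand-in for a --grep style filter
    print(interleave(r1, r2, has_a, "both"))
    print(interleave(r1, r2, has_a, "either"))
```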

--R1 and --R2

View one of the two FASTQ files in a genozip file created with --pair.

--header-only

Output only the description lines.

--seq-only

Output only the sequence (nucleotide) lines.

--qual-only

Output only the quality lines.

Limitation: doesn't work on some long read data, because Genozip compresses long read quality scores with the LONGR codec that requires sequence data to reconstruct correctly.

--bases [^]value

Filter lines based on the IUPAC nucleotide codes of the sequence data. See SAM/BAM --bases option.

FASTA options

 

-1, --header-one

Output the sequence name up to the first space or tab.

-H, --no-header

Don't output the header (sequence name) lines.

-h, --header-only

Output only the header (sequence name) lines.

--sequential

Output in sequential format - each sequence in a single line.

--phylip

Output a Multi-FASTA in PHYLIP format. All sequences must be the same length. See Converting Multi-FASTA to PHYLIP.

PHYLIP options

--fasta

Output as Multi-FASTA. See Converting PHYLIP to FASTA.

Reference file options

 

--reference filename --regions regions [--header-only]

View one or more regions of a reference file.

 

Note: For reverse complement, use a reverse range, eg -r1000000-999995 or equivalently -r1000000-6

Note: --regions-file may be used instead of --regions

Note: Combine with --no-header to suppress output of the chromosome name.

Note: Short forms of the options (e.g., -e instead of --reference) are fine too.

--gpos

In combination with --reference and --regions or --regions-file - shows coordinates in GPOS (Global POSition) terms - a single genome-wide numeric coordinate - rather than (CHROM,POS).

--show-ref-contigs

Show the details of the reference file contigs.

--show-ref-iupacs

Show non-ACTGN IUPAC pseudo-bases in the reference file.

 

Chain file options

 

--show-chain  

Show chain file alignments.

 

--show-chain-contigs  

Show the details of the chain file contigs.

23andMe options

 

--vcf  

Output as VCF. --vcf must be used in combination with --reference to specify the reference file as listed in the header of the 23andMe file (usually this is GRCh37). Note: Indel variants ('DD' 'DI' 'II') as well as uncalled sites ('--') are discarded.

See: Converting 23andMe to VCF
 

Analysis options

 

--contigs 

Applicable data types: VCF SAM BAM FASTA GFF3/GVF 23andMe

List the names of the chromosomes (or contigs) included in the file. Alternative option names: --list-chroms --chroms.

--sex

Applicable data types: SAM BAM FASTQ

Determine whether the individual in a SAM/BAM file is male or female. Limitations apply when used on FASTQ. See Sex Classification.

--coverage[=all|=one]

Applicable data types: SAM BAM FASTQ

Show the coverage and depth of each contig. Values are approximate when used on FASTQ. See Coverage and Depth.

--idxstats

Applicable data types: SAM BAM FASTQ

Show the count of mapped and unmapped reads by contig. Values are approximate when used on FASTQ. Same output format as samtools idxstats. See idxstats.
