Sex Classification
EXPERIMENTAL FEATURE - NOT SUITED FOR CLINICAL APPLICATIONS
​
NOTE: This feature is no longer available as of Genozip version 15.0.42. To access this feature, use an earlier version.
​
Data Types
​
SAM, BAM and FASTQ
​
Usage
​
genocat --sex my-file.bam.genozip
genocat --sex my-file.fq.genozip
​
Description
​
Determines the genetic sex of the individual from which the data originated.
​
Important note: this feature was developed as a proof of concept and has been tested on limited datasets only. If you plan to rely on it, first test it extensively on your own datasets.
​
Output
​
Output is space-separated if sent to a terminal, and tab-separated if redirected to a file or pipe.
​
$ genocat --sex *.genozip
Sex     File                      DP_1    DP_X    DP_Y   1/X  X/Y
Male    sample01.10x.bam.genozip  13.188  6.921   5.887  1.9  1.2
Male    sample02.10x.bam.genozip  14.644  7.662   7.432  1.9  1.0
Male    sample03.10x.bam.genozip  10.993  5.614   6.135  2.0  0.9
Female  sample04.10x.bam.genozip  4.274   4.238   0.215  1.0  19.7
Male    sample05.10x.bam.genozip  12.497  6.565   4.320  1.9  1.5
Female  sample06.10x.bam.genozip  10.728  10.866  0.534  1.0  20.4

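When the report is redirected to a file or pipe, it can be loaded for downstream checks. The following is a minimal sketch in Python, assuming the redirected output keeps the same header row and column order as the terminal output above, separated by tabs. The function name parse_sex_report and the embedded sample text are illustrative only and are not part of Genozip.

```python
import csv
import io

# Two rows copied from the example output, written in the tab-separated
# form assumed for redirected output (header row included: an assumption).
SAMPLE_TSV = (
    "Sex\tFile\tDP_1\tDP_X\tDP_Y\t1/X\tX/Y\n"
    "Male\tsample01.10x.bam.genozip\t13.188\t6.921\t5.887\t1.9\t1.2\n"
    "Female\tsample04.10x.bam.genozip\t4.274\t4.238\t0.215\t1.0\t19.7\n"
)

NUMERIC_COLUMNS = ("DP_1", "DP_X", "DP_Y", "1/X", "X/Y")


def parse_sex_report(text):
    """Parse tab-separated report text into a list of dicts.

    Numeric columns are converted to float; Sex and File stay as strings.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        for col in NUMERIC_COLUMNS:
            row[col] = float(row[col])
        rows.append(row)
    return rows


records = parse_sex_report(SAMPLE_TSV)
for rec in records:
    print(rec["File"], rec["Sex"], rec["X/Y"])
```

To read a saved report instead of the embedded sample, pass the file's contents to parse_sex_report.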
Classifier algorithm
​
See: Sex Classifier algorithm (SAM/BAM) and Sex Classifier algorithm (FASTQ)
​
Advantages
​
• This method is fast because only the relevant fields, a small subset of the file, are read from disk.
​
• Since this algorithm counts bases rather than reads, it works just as well with data that has highly variable read lengths, as is common, for example, with long-read technologies.
​
• Since the algorithm uses both the X/Y and the Autosome/X depth ratios, it can detect XXY.
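To illustrate why two ratios carry more information than one, here is a minimal sketch in Python. It derives the 1/X and X/Y ratios from the per-contig depths in the Output table and applies a toy decision rule. The function names and both cutoff values are hypothetical, chosen only to separate the example rows; they are not Genozip's thresholds, and the actual classifier is described in the linked algorithm pages. The XXY case assumes idealized depths in a 2:2:1 proportion for chromosome 1, X and Y.

```python
def depth_ratios(dp_1, dp_x, dp_y):
    """Return (1/X, X/Y) depth ratios from per-contig depths."""
    ratio_1_x = dp_1 / dp_x
    # No Y coverage at all: treat X/Y as infinitely large.
    ratio_x_y = dp_x / dp_y if dp_y > 0 else float("inf")
    return ratio_1_x, ratio_x_y


def classify(ratio_1_x, ratio_x_y, y_present_max=5.0, single_x_min=1.5):
    """Toy decision rule. Both cutoffs are HYPOTHETICAL, for illustration only."""
    has_y = ratio_x_y < y_present_max      # Y depth is comparable to X depth
    single_x = ratio_1_x >= single_x_min   # X depth is about half the autosome depth
    if has_y and single_x:
        return "Male"
    if not has_y and not single_x:
        return "Female"
    if has_y and not single_x:
        # Y is present, yet X depth matches the autosome: consistent with XXY.
        return "XXY (candidate)"
    return "Unassigned"


# Depths copied from the Output table (sample01 and sample04).
print(classify(*depth_ratios(13.188, 6.921, 5.887)))
print(classify(*depth_ratios(4.274, 4.238, 0.215)))
# Idealized XXY depths (2:2:1), not real data.
print(classify(*depth_ratios(20.0, 20.0, 10.0)))
```

With X/Y alone, the idealized XXY sample would look like a sample with a Y chromosome; it is the 1/X ratio near 1 that sets it apart from a typical male.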
​
Limitations
​
• This feature has been tested on human data. It may or may not work for other species.
​
• Rare sex chromosome karyotypes such as XXXY or XYY are not detected.
​
• SAM/BAM lines with an undefined RNAME or CIGAR are ignored; therefore, this feature does not work on unaligned SAM/BAM files.
​
• This feature has not been validated for clinical diagnostics and is therefore not suitable for that purpose.
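The unaligned-file limitation above follows from how bases are attributed to contigs. As a rough illustration, the Python sketch below counts aligned bases per contig from SAM text lines and skips any line whose RNAME or CIGAR is "*", the SAM convention for an undefined value. It is not Genozip's implementation: the function name is hypothetical, and counting only the M, = and X CIGAR operations is an assumption made for this example.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
ALIGNED_OPS = {"M", "=", "X"}  # ASSUMPTION: only these operations count as aligned bases


def aligned_bases_per_contig(sam_lines):
    """Sum aligned bases per RNAME, ignoring lines with undefined RNAME or CIGAR."""
    totals = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment line
        rname, cigar = fields[2], fields[5]
        if rname == "*" or cigar == "*":
            continue  # undefined RNAME or CIGAR: the read contributes nothing
        totals[rname] += sum(
            int(length) for length, op in CIGAR_OP.findall(cigar) if op in ALIGNED_OPS
        )
    return totals


# Made-up example lines: two aligned reads of different lengths and one unaligned read.
example = [
    "@HD\tVN:1.6",
    "r1\t0\tchrX\t100\t60\t50M\t*\t0\t0\t" + "A" * 50 + "\t" + "I" * 50,
    "r2\t0\tchrX\t200\t60\t10S90M\t*\t0\t0\t" + "A" * 100 + "\t" + "I" * 100,
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t" + "A" * 50 + "\t" + "I" * 50,
]
totals = aligned_bases_per_contig(example)
print(dict(totals))
```

In a fully unaligned file every line looks like r3, so no contig accumulates any bases and there is nothing to compare. Because the totals are in bases, the 50-base and 100-base reads contribute in proportion to their aligned length rather than one count each.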
​
Additional limitations when used with FASTQ
​
• Sex assignment of a FASTQ file is an experimental feature, and the results should be treated with caution.
​
• Assignment of reads to contigs is based on the Genozip Aligner, which is designed to be very fast at the expense of accuracy. This feature works because the male vs. female signal is usually stronger than the Genozip Aligner's inaccuracies. However, when running on FASTQ, the algorithm is more conservative than when running on SAM/BAM, and reports “Unassigned” in some cases where it would confidently call a sex on SAM/BAM data.

• This has been tested on fastq.genozip files containing human data compressed with the GRCh38 reference genome. It is not yet known whether the algorithm works with data from other species or with other reference genomes.

Questions? support@genozip.com