Datasets used in benchmarks


WGS-Illumina-dragen: Human whole genome sequence, sequenced at 30X on Illumina NovaSeq, aligned with Illumina DRAGEN. .fastq.gz file sizes: 32,810,061,523 and 36,955,702,856; .bam file size: 60,135,227,548. Author's own DNA, sequenced at Dante Labs. Unpublished.

WGS-Nanopore-ngmlr: Human whole genome sequence, sequenced on Oxford Nanopore Technologies MinION, aligned with ngmlr. Data obtained from ENCODE experiment ENCSR935ZQZ, files ENCFF379FVU.fastq.gz (size 154,558,769,583) and ENCFF283TLK.bam (size 222,354,403,768).

RNAseq-PacBio-minimap2: Human RNA-seq sequenced on Pacific Biosciences Sequel II and aligned with minimap2. Obtained from ENCODE experiment ENCSR842KTN, files ENCFF860CBL.fastq.gz (size 692,194,222) and ENCFF594CLT.bam (size 857,865,322).

scRNAseq-Illumina-STAR: Human single-cell RNA-seq sequenced on Illumina NovaSeq 6000 and aligned with STAR. Obtained from ENCODE experiment ENCSR955TRT, files ENCFF313KPF.fastq.gz (file size 15,589,566,718), ENCFF386RLQ.fastq.gz (file size 5,034,262,950) and ENCFF011AYC.bam (file size 47,972,313,111).


GATK GVCF: .vcf.gz file size is 7,483,080,739. 30X human sample aligned with bwa, GVCF created with GATK HaplotypeCaller. Unpublished. 

DRAGEN GVCF: .vcf.gz file size is 2,431,892,621. File obtained from Illumina: NA12878-PCRF450-1.hard-filtered.gvcf.gz. Requires access to Illumina BaseSpace.

1000 Genomes Project: .vcf.gz file size is 1,216,886,729. File obtained from NCBI.

Arabidopsis thaliana: .vcf.gz file size is 142,110,301,308. File obtained from 1001 Genomes - A Catalog of Arabidopsis thaliana Genetic Variation.

3K Rice Genome: .vcf.gz file size is 2,056,330,848. File obtained from The 3,000 rice genomes project.

Illumina Genotyping Array: .vcf.gz file size is 36,223,855. File obtained from Illumina Files->3049a580-4f93-4cc3-a1b6-34be3eef4f98. Contact Illumina support for access.
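
Because each dataset above is identified partly by its file size, a downloaded copy can be sanity-checked against the listed figure. The following is a minimal sketch, not part of the benchmark itself: it assumes the listed sizes are in bytes and that the files were saved locally under the names given above. The helper names parse_size and check_sizes, and the choice of the two RNAseq-PacBio-minimap2 files as the default, are hypothetical and for illustration only.

```python
import os

# Sizes copied from the list above; assumed (not stated in the list) to be bytes.
EXPECTED_SIZES = {
    "ENCFF860CBL.fastq.gz": "692,194,222",
    "ENCFF594CLT.bam": "857,865,322",
}


def parse_size(text):
    """Convert a comma-grouped size such as '692,194,222' to an int."""
    return int(text.replace(",", "").strip())


def check_sizes(directory, expected=EXPECTED_SIZES):
    """Compare local file sizes with the listed ones.

    Returns a dict mapping each file name to 'ok', 'missing'
    or 'size mismatch'.
    """
    report = {}
    for name, size_text in expected.items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            report[name] = "missing"
        elif os.path.getsize(path) == parse_size(size_text):
            report[name] = "ok"
        else:
            report[name] = "size mismatch"
    return report
```

Calling check_sizes on the download directory returns one status per file. A matching size does not prove the content is identical, so it complements a checksum comparison rather than replacing one.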

