Datasets used in benchmarks
FASTQ & BAM
​
WGS-Illumina-dragen: Human whole genome sequence, sequenced at 30X on Illumina NovaSeq and aligned with Illumina DRAGEN. .fastq.gz file sizes: 32,810,061,523 and 36,955,702,856; .bam file size: 60,135,227,548. Author's own DNA, sequenced at Dante Labs. Unpublished.
​
WGS-Nanopore-ngmlr: Human whole genome sequence, sequenced on Oxford Nanopore Technologies MinION and aligned with ngmlr. Obtained from ENCODE experiment ENCSR935ZQZ, files ENCFF379FVU.fastq.gz (file size 154,558,769,583) and ENCFF283TLK.bam (file size 222,354,403,768).
​
RNAseq-PacBio-minimap2: Human RNA-seq, sequenced on Pacific Biosciences Sequel II and aligned with minimap2. Obtained from ENCODE experiment ENCSR842KTN, files ENCFF860CBL.fastq.gz (file size 692,194,222) and ENCFF594CLT.bam (file size 857,865,322).
​
scRNAseq-Illumina-STAR: Human single-cell RNA-seq, sequenced on Illumina NovaSeq 6000 and aligned with STAR. Obtained from ENCODE experiment ENCSR955TRT, files ENCFF313KPF.fastq.gz (file size 15,589,566,718), ENCFF386RLQ.fastq.gz (file size 5,034,262,950) and ENCFF011AYC.bam (file size 47,972,313,111).
​
VCF
​
GATK GVCF: .vcf.gz file size is 7,483,080,739. 30X human sample aligned with bwa, GVCF created with GATK HaplotypeCaller. Unpublished.
​
DRAGEN GVCF: .vcf.gz file size is 2,431,892,621. File obtained from Illumina: NA12878-PCRF450-1.hard-filtered.gvcf.gz. Requires access to Illumina BaseSpace.
​
1000 Genomes Project: .vcf.gz file size is 1,216,886,729. File obtained from NCBI.
​
Arabidopsis thaliana: .vcf.gz file size is 142,110,301,308. File obtained from 1001 Genomes - A Catalog of Arabidopsis thaliana Genetic Variation.
​
3K Rice Genome: .vcf.gz file size is 2,056,330,848. File obtained from The 3,000 rice genomes project.
​
Illumina Genotyping Array: .vcf.gz file size is 36,223,855. File obtained from Illumina Files->3049a580-4f93-4cc3-a1b6-34be3eef4f98. Contact Illumina support for access.
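The file sizes in this list are written as comma-grouped numbers, which can be hard to compare at a glance. The following is a minimal sketch for converting them to readable units. It assumes the figures are byte counts and uses decimal units (1 GB = 10**9 bytes); the helper names parse_size and human_readable are hypothetical and are not part of any benchmark tooling.

```python
def parse_size(text: str) -> int:
    """Convert a comma-grouped count such as '36,223,855' to an int."""
    return int(text.replace(",", ""))


def human_readable(num_bytes: int) -> str:
    """Format a byte count in decimal (SI) units, assuming 1 GB = 10**9 bytes."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1000,
        # or at the largest unit we know about.
        if value < 1000 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


# Sizes copied from the VCF entries above (assumed to be in bytes).
vcf_sizes = {
    "GATK GVCF": "7,483,080,739",
    "DRAGEN GVCF": "2,431,892,621",
    "3K Rice Genome": "2,056,330,848",
    "Illumina Genotyping Array": "36,223,855",
}

for name, size in vcf_sizes.items():
    print(f"{name}: {human_readable(parse_size(size))}")
```

The same two helpers apply unchanged to the FASTQ and BAM sizes listed earlier.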
​