Datasets used in benchmarks

FASTQ & BAM

WGS-Illumina-dragen: Human whole-genome sequence, sequenced at 30X on Illumina NovaSeq and aligned with Illumina DRAGEN. .fastq.gz file sizes: 32,810,061,523 and 36,955,702,856 bytes; .bam file size: 60,135,227,548 bytes. Author's own DNA, sequenced at Dante Labs. Unpublished.

WGS-Nanopore-ngmlr: Human whole-genome sequence, sequenced on Oxford Nanopore Technologies MinION and aligned with ngmlr. Obtained from ENCODE experiment ENCSR935ZQZ, files ENCFF379FVU.fastq.gz (file size 154,558,769,583 bytes) and ENCFF283TLK.bam (file size 222,354,403,768 bytes).

RNAseq-PacBio-minimap2: Human RNA-seq, sequenced on Pacific Biosciences Sequel II and aligned with minimap2. Obtained from ENCODE experiment ENCSR842KTN, files ENCFF860CBL.fastq.gz (file size 692,194,222 bytes) and ENCFF594CLT.bam (file size 857,865,322 bytes).

scRNAseq-Illumina-STAR: Human single-cell RNA-seq, sequenced on Illumina NovaSeq 6000 and aligned with STAR. Obtained from ENCODE experiment ENCSR955TRT, files ENCFF313KPF.fastq.gz (file size 15,589,566,718 bytes), ENCFF386RLQ.fastq.gz (file size 5,034,262,950 bytes) and ENCFF011AYC.bam (file size 47,972,313,111 bytes).

VCF

GATK GVCF: .vcf.gz file size is 7,483,080,739 bytes. 30X human sample aligned with bwa; GVCF created with GATK HaplotypeCaller. Unpublished.

DRAGEN GVCF: .vcf.gz file size is 2,431,892,621 bytes. File obtained from Illumina: NA12878-PCRF450-1.hard-filtered.gvcf.gz. Requires access to Illumina BaseSpace.

1000 Genomes Project: .vcf.gz file size is 1,216,886,729 bytes. File obtained from NCBI.

Arabidopsis thaliana: .vcf.gz file size is 142,110,301,308 bytes. File obtained from 1001 Genomes - A Catalog of Arabidopsis thaliana Genetic Variation.

3K Rice Genome: .vcf.gz file size is 2,056,330,848 bytes. File obtained from the 3,000 Rice Genomes Project.

Illumina Genotyping Array: .vcf.gz file size is 36,223,855 bytes. File obtained from Illumina Files->3049a580-4f93-4cc3-a1b6-34be3eef4f98. Contact Illumina support for access.
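
The file sizes on this page are given as raw numbers, which are hard to compare at a glance. As a minimal sketch, assuming the figures are byte counts and using decimal units (1 GB = 1,000,000,000 bytes), the helper below converts them to a readable form. The function name `human_size` and the `SIZES` dictionary are illustrative only and are not part of any benchmark tooling.

```python
# Hypothetical helper: format a byte count using decimal (SI) units.
def human_size(num_bytes: int) -> str:
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if size < 1000 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000


# A few sizes copied from the listing above (assumed to be in bytes).
SIZES = {
    "WGS-Illumina-dragen .bam": 60_135_227_548,
    "WGS-Nanopore-ngmlr .bam": 222_354_403_768,
    "Arabidopsis thaliana .vcf.gz": 142_110_301_308,
    "Illumina Genotyping Array .vcf.gz": 36_223_855,
}

if __name__ == "__main__":
    for name, size in SIZES.items():
        print(f"{name}: {human_size(size)}")
```

With binary units (1 GiB = 1,073,741,824 bytes) the displayed values would be somewhat smaller, so the unit convention should be stated whenever these sizes are quoted.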

