Why should we use the NA12878 dataset for benchmarking?

As far as I understand, the human genome sample called NA12878 provides high confidence variants for a human sample. It is being used as a benchmark for many genomic research projects.

Q: Why exactly is NA12878 such a popular benchmark dataset? Is it just because we have a set of high-confidence variants for it? We can also get high-confidence variants from the 1000 Genomes Project. Does it have something to do with the sequencing technology, or with the sample itself? Why do we want to benchmark our experiments with NA12878?

So to clarify for people unfamiliar with NA12878, that's the sample identification for a particular Utah woman. Her parents are NA12891 and NA12892. In human variation data-sets that's what we are given to identify individuals, an ID, sex and population. All the other data is removed to protect patient privacy. So the question is why was NA12878 (this Utah woman) chosen as the reference patient in genomic analysis…

I don't know the real practical answer but from what I gather it's inertia.

I don't have a full history for her, but I do know some things that make her genome a good choice for a benchmark. NA12878 is a fairly old sample to geneticists, and her DNA is included in multiple legacy projects, specifically HapMap.

She has a genetic disease (CYP2D6 mutation), which is probably what initially got her and her family included in genetic analysis. This is a better reason to study this genome in detail than it simply belonging to someone famous (i.e. Venter).

She's a Utah Mormon (small founder population with extensive pedigree information) and has 11 children (so recombination/inheritance analysis is possible). What this means is that a deep understanding of her genome will have applications to this population.

Her lymphoblastoid cell line GM12878 was included as a tier-1 ENCODE cell line. This means there are terabytes of epigenomic data for her as well.

When I choose to do human genome analysis, NA12878 is the obvious choice because of how much data is already available, which also means more data is going to be available. Thus I think the answer is largely inertia.

Also consider it this way: if you're trying to say your pipeline or sequencing technology is better than other ones, and everyone uses NA12878 to benchmark their technology, then it's wise to also use NA12878 so the results are comparable.

Comprehensively benchmarking applications for detecting copy number variation

Affiliations College of Computer Science, Sichuan University, Chengdu, China, Medical Big Data Center, Sichuan University, Chengdu, China, Zdmedical, Information polytron Technologies Inc. Chongqing, Chongqing, China

Contributed equally to this work with: Le Zhang, Wanyu Bai, Na Yuan

Roles Data curation, Writing – original draft, Writing – review & editing

Affiliation College of Computer Science, Sichuan University, Chengdu, China

Roles Data curation, Software

Affiliation BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, PR China


Genomic structural variations (SVs) are generally defined as deletions (DELs), insertions (INSs), duplications (DUPs), inversions (INVs), and translocations (TRAs) of at least 50 bp in size. SVs are often considered separately from small variants, including single nucleotide variants (SNVs) and short insertions and deletions (indels), as these are often formed by distinct mechanisms [1]. INVs and TRAs are balanced forms, with no net change in a genome, and the remaining SVs are imbalanced forms. Imbalanced deletions (DELs) and duplications (DUPs) are also referred to as copy number variations (CNVs), with DUPs comprising tandem and interspersed types depending on the distance between the duplicated copies [2, 3]. INSs are categorized into several classes based on the insertion sequences: mobile element insertions (MEIs), nuclear insertions of mitochondrial genome (NUMTs), viral element insertions (VEIs, as referred to in this study), and insertions of unspecified sequence.

SVs are largely responsible for the diversity and evolution of human genomes at both individual and population level [3,4,5,6]. The genomic difference between individuals caused by SVs has been estimated to be 3–10 times higher than that by SNVs [2, 6, 7]. Consequently, SVs could have higher impacts on gene functions and phenotypic changes than do SNVs and short indels. Accordingly, SVs are associated with a number of human diseases, including neurodevelopmental disorders and cancers [3, 8,9,10,11].

Two types of methods have been used to detect SVs: (1) array-based detection, including microarray comparative genome hybridization (array CGH), and (2) sequencing-based computational methods [2, 12]. Array-based methods are advantageous for high-throughput analysis, but they only detect certain types of SVs, have a lower sensitivity for small SVs, and have a lower resolution for determining breakpoints (BPs) than the sequencing-based methods. Although sequencing requires more time and money than the array-based method, it would be necessary for detecting a broad range of SVs to adopt the sequencing-based methods, as in recent projects aimed at identifying SVs on a population scale [6, 13,14,15].

Sequencing-based methods take several conceptual approaches to derive information about SVs from short read sequencing data [2, 9, 16,17,18]. Read pairs (RP) and read depth (RD) approaches utilize the discordant alignment features and depth features of paired-end reads that encompass or overlap an SV, respectively. The split read (SR) approach uses split (soft-clipped) alignment features of single-end or paired-end reads that span a BP of an SV. The assembly (AS) approach detects SVs by aligning the contigs, assembled with the entire or unmapped sequencing reads, to the reference sequence. A number of recently developed SV detection algorithms use a combination (CB) of the above four methods (here, we refer to these five basic SV detection methods as “methods” and each specific SV detection tool as an “algorithm”). Irrespective of the strategy, sequencing-based methods suffer from a high rate of miscalling of SVs because they involve errors in base call, alignment, or de novo assembly, especially in repetitive regions that cannot be spanned with short reads. To overcome the shortcomings of short read sequencing, long reads generated using single-molecule sequencing technology have recently been used to detect SVs in a human sample using the AS and/or SR approach [19,20,21,22]. However, the high cost and the low throughput of this strategy currently limit its general use.

Although the sequencing-based methods can in theory detect any type of SV, no single computational algorithm can accurately and sensitively detect all types and all sizes of SVs [23]. Therefore, most projects use multiple algorithms to call SVs, then merge the outputs to increase the precision and/or the recall [6, 13,14,15, 17, 24,25,26,27,28,29]. Many projects use popular SV detection algorithms, including BreakDancer [30], CNVnator [31], DELLY [32], GenomeSTRiP [33], Pindel [34], and Lumpy [35], which give calls with relatively high accuracy. Although one study has investigated the performance of 13 SV detection algorithms [36], there has been no systematic investigation of which algorithms can accurately detect which types of SVs. Importantly, while combining the results of multiple algorithms is common practice, there has been no systematic investigation into optimal strategies for doing so to reach the most complete characterization of SVs in a genome. In this study, we evaluated 69 algorithms for their precision and recall for both single and overlapping SV calls, using multiple simulated and real WGS datasets.
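The idea of merging calls from several algorithms and scoring them by precision and recall can be sketched as follows. This is only an illustration under stated assumptions: calls are treated as simple intervals (suitable for DELs or DUPs, not insertions), two calls are considered the same event when their reciprocal overlap is at least 50%, and all function names and example coordinates are hypothetical rather than taken from the study.

```python
from typing import List, Tuple

Call = Tuple[str, int, int]  # (chromosome, start, end); hypothetical minimal representation


def reciprocal_overlap(a: Call, b: Call) -> float:
    """Smaller of the two overlap fractions; 0.0 if the calls do not overlap."""
    if a[0] != b[0]:
        return 0.0
    inter = min(a[2], b[2]) - max(a[1], b[1])
    if inter <= 0:
        return 0.0
    return min(inter / (a[2] - a[1]), inter / (b[2] - b[1]))


def matches(call: Call, others: List[Call], min_ro: float = 0.5) -> bool:
    """True if the call matches any call in `others` (50% threshold is an assumption)."""
    return any(reciprocal_overlap(call, o) >= min_ro for o in others)


def precision_recall(calls: List[Call], truth: List[Call], min_ro: float = 0.5):
    """Precision = matched calls / all calls; recall = matched truth / all truth."""
    precision = sum(matches(c, truth, min_ro) for c in calls) / len(calls) if calls else 0.0
    recall = sum(matches(t, calls, min_ro) for t in truth) / len(truth) if truth else 0.0
    return precision, recall


def intersect_calls(a: List[Call], b: List[Call], min_ro: float = 0.5) -> List[Call]:
    """Keep calls from `a` that are also supported by `b` (an 'overlapping' call set)."""
    return [c for c in a if matches(c, b, min_ro)]


truth = [("chr1", 100, 200), ("chr1", 1000, 2000)]
tool_a = [("chr1", 110, 210), ("chr1", 5000, 6000)]
tool_b = [("chr1", 105, 195), ("chr1", 1000, 2000)]
overlap_ab = intersect_calls(tool_a, tool_b)
```

In this toy case, intersecting the two call sets removes the false call made only by the first tool, which raises precision without changing its recall.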

Data availability

Raw sequence data were previously published in Scientific Data ( and were deposited in the NCBI SRA with the accession codes SRX1049768–SRX1049855, SRX847862–SRX848317, SRX1388368–SRX1388459, SRX1388732–SRX1388743, SRX852932–SRX852936, SRX847094, SRX848742–SRX848744, SRX326642, SRX1497273 and SRX1497276. 10x Genomics Chromium bam files used are available at The benchmark vcf and bed files resulting from work in this manuscript are available in the NISTv.3.3.2 directory under each genome on the GIAB FTP release folder and, in the future, updated calls will be in the ‘recent’ directory under each genome. The data used in this manuscript and other datasets for these genomes are available at, as well as in NCBI BioProject No. PRJNA200694.

Reproducible integration of multiple sequencing datasets to form high-confidence SNP, indel, and reference calls for five human genome reference materials

Benchmark small variant calls from the Genome in a Bottle Consortium (GIAB) for the CEPH/HapMap genome NA12878 (HG001) have been used extensively for developing, optimizing, and demonstrating performance of sequencing and bioinformatics methods. Here, we improve and simplify the methods we use to integrate multiple sequencing datasets, with the intention of deploying a reproducible cloud-based pipeline for application to arbitrary human genomes. We use these reproducible methods to form high-confidence calls with respect to GRCh37 and GRCh38 for HG001 and 4 additional broadly-consented genomes from the Personal Genome Project that are available as NIST Reference Materials. Our new methods produce 17% more SNPs and 176% more indels than our previously published calls for HG001. We also phase 99.5% of the variants in HG001 and call about 90% of the reference genome with high-confidence, increased from 78% previously. Our calls only contain 108 differences from the Illumina Platinum Genomes calls in GRCh37, only 14 of which are ambiguous or likely to be errors in our calls. By comparing several callsets to our new calls, our previously published calls, and Illumina Platinum Genomes calls, we highlight challenges in interpreting performance metrics when benchmarking against imperfect high-confidence calls. Our new calls address some of these challenges, but performance metrics should always be interpreted carefully. Benchmarking tools from the Global Alliance for Genomics and Health are useful for stratifying performance metrics by variant type and genome context to elucidate strengths and weaknesses of a method. We also explore differences between comparing to high-confidence calls for the 5 GIAB genomes, and show that performance metrics for one pipeline are largely similar but not identical when comparing to the 5 genomes. 
Finally, to explore applicability of our methods for genomes that have fewer datasets, we form high-confidence calls using only Illumina and 10x Genomics data, and find that they have more high-confidence calls but a higher error rate. These newly characterized genomes have a broad, open consent with few restrictions on the availability of samples and data, enabling a uniquely diverse array of applications.

MoleculeNet Part 1: Datasets for Deep Learning in the Chemical and Life Sciences

This post was co-authored by Bharath Ramsundar from DeepChem.

Benchmark datasets are an important driver of progress in machine learning. Unlike computer vision and natural language processing, the diversity and complexity of datasets in chemical and life sciences make these fields largely resistant to attempts to curate benchmarks that are widely accepted in the community. In this post, we show how to add datasets to the MoleculeNet benchmark for molecular machine learning and make them programmatically accessible with the DeepChem API.

Molecular ML Dataset Curation

MoleculeNet [1] collects datasets in six major categories: quantum mechanics, physical chemistry, proteins, biophysics, physiology, and materials science. The “first generation” of MoleculeNet showed what a molecular ML benchmark might look like, and revealed some interesting trends with respect to data scarcity, class imbalances, and the power of physics-aware featurizations over model architectures for some datasets.

It isn’t easy to cover the entire breadth and depth of molecular ML, so that’s why MoleculeNet is evolving into a flexible framework for contributing datasets and benchmarking model performance in a standardized way, powered by DeepChem.

Why Should We Care about Benchmarks?

Image and speech recognition seem like gargantuan tasks, but they are really pretty simple compared to the kinds of problems we see in physics, chemistry, and biology. That’s why it’s comparatively rare to see anyone claim that a problem in physical or life science has been “solved” by machine learning. Better datasets, dataset generating methods, and robust benchmarks are essential ingredients to progress in molecular machine learning, maybe even more so than inventing new deep learning tricks or architectures.

In many subfields of deep learning, the standard avenue of progress goes something like

1. Pick a widely used benchmark dataset (e.g., ImageNet, CIFAR-10, or MNIST).

2. Develop and test a model architecture that achieves “state of the art” performance on some aspect of the benchmark.

3. Come up with an ad-hoc “theoretical” explanation for why your particular architecture outperforms the rest.

4. Publish your results in a top conference.

If you’re lucky, other researchers might even use your model or build on it for their own research before the next SOTA architecture comes out. There are obvious issues with this paradigm, including bias in datasets, distribution shifts, and the Goodhart-Strathern Law — when a metric becomes a target, it is no longer a good metric. Still, there’s no question that benchmarks provide a kind of clarity of purpose and fuel interest in machine learning research that is lacking in other fields.

Maybe more importantly, benchmarks encourage and reward researchers for creating high-quality datasets, which has historically been underappreciated in many fields. And benchmark datasets enable striking breakthroughs, like DeepMind’s AlphaFold, which was made possible by decades of effort assembling high-resolution protein structures. AlphaFold represents a sort of “ImageNet moment” in protein folding, meaning that a problem is “solved” in some sense.

MoleculeNet contains hundreds of thousands of compounds and measured/calculated properties, all accessible through the DeepChem API. It brings a flavor of the traditional evaluation frameworks popularized in ML conferences, but also provides a standardized way to contribute and access new datasets.

Contributing a Dataset to MoleculeNet

Dataset contribution has been significantly streamlined and documented. The first step is to open an issue on GitHub in the DeepChem repo to discuss the dataset you want to add, emphasizing what unique molecular ML tasks the dataset covers that aren’t already part of MolNet. If you created or curated a dataset yourself, this is a great way to share it with the molecular ML community! Next, you need to

  • Write a DatasetLoader class that inherits from deepchem.molnet.load_function.molnet_loader._MolnetLoader . This involves documenting any special options for the dataset and the targets or “tasks” for ML.
  • Implement a create_dataset function that creates a DeepChem Dataset by applying acceptable featurizers, splitters, and transformations.
  • Write a load_dataset function that documents the dataset and provides a simple way for users to load your dataset.

The QM9 MolNet loader source code is a nice, simple starting point for writing your own MolNet loader.

This framework allows a dataset to be used directly in an ML pipeline with any reasonable combination of featurization (converts raw inputs like SMILES strings into a machine-readable format), splitter (controls how training/validation/test sets are constructed), and transformations (e.g., if the targets need to be normalized before training).

Splitters are particularly important here. When comparing how different models perform on the same task, it’s crucial that each model “sees” the same training data and is evaluated on the same data. We also want to know how a model does on samples that are similar to what it’s seen before (using a randomized train/val/test split) versus how it does on samples that are dissimilar (e.g., using a split based on chemical substructures).
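To make the difference between the two kinds of split concrete, here is a minimal pure-Python sketch. It does not use DeepChem's splitter classes; the function names, the `(name, scaffold)` tuples, and the greedy group assignment are all illustrative assumptions. A random split lets similar molecules land on both sides, while a group-based split keeps every member of a group (for example, a shared substructure) on one side.

```python
import random
from collections import defaultdict


def random_split(items, frac_train=0.8, seed=0):
    """Shuffle with a fixed seed so every model sees the same train/test data."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * frac_train)
    return shuffled[:cut], shuffled[cut:]


def group_split(items, group_of, frac_train=0.8):
    """Assign whole groups to train or test, so no group is shared between them."""
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)
    train, test = [], []
    target = frac_train * len(items)
    # Largest groups first (ties broken by group name) for a deterministic result.
    for _, members in sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        if len(train) + len(members) <= target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Hypothetical molecules labelled with a made-up substructure group.
molecules = [
    ("m1", "benzene"), ("m2", "benzene"), ("m3", "benzene"),
    ("m4", "pyridine"), ("m5", "pyridine"),
    ("m6", "furan"), ("m7", "furan"), ("m8", "furan"),
    ("m9", "indole"), ("m10", "indole"),
]
rand_train, rand_test = random_split(molecules)
grp_train, grp_test = group_split(molecules, group_of=lambda m: m[1])
```

With the group-based split, the test set contains only substructures the model never saw during training, which is the harder and often more informative evaluation.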

Accessing Datasets with the DeepChem API

MolNet loaders make accessing datasets and pre-processing them for ML possible with a single line of Python code:
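As a hedged illustration of what such a call can look like, the sketch below wraps DeepChem's QM9 loader in a small helper. The helper name is hypothetical, DeepChem is imported lazily so the snippet can be read without it installed, and calling the helper downloads and featurizes the dataset, so treat it as an example rather than the exact line from the original post.

```python
def load_qm9_splits():
    """Load QM9 through the MolNet loader and unpack the three splits."""
    # Lazy import: DeepChem is only needed when the function is actually called.
    import deepchem as dc

    # The single loader call: featurization, splitting and transformation in one step.
    tasks, datasets, transformers = dc.molnet.load_qm9(splitter="random")
    train, valid, test = datasets
    return tasks, (train, valid, test), transformers
```

The loader returns the task names, a tuple of train/validation/test datasets, and the list of transformers that were applied, so predictions can later be mapped back to the original units.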

To actually make the dataset available through the DeepChem API, you simply provide a tarball or zipped folder to a DeepChem developer, who will add it to the DeepChem AWS S3 bucket. Finally, add documentation for your loader and dataset.

We Want YOUR Datasets!

After taking a look at the long list of datasets in MoleculeNet, you might find that there’s something crucial missing. The good news is that you (yes, YOU!) can contribute new datasets! If you’re not comfortable with Python programming, you can simply open an issue on GitHub, include information on why the dataset should be added to MolNet, and request help from a DeepChem developer. If you are comfortable programming, even better — you can follow the steps outlined above and make a contribution.

The real power of an open-source benchmark is that anyone can contribute; this allows MolNet to evolve and expand beyond what a single research group can support.

Next Steps: Molecular ML Model Performance

In the next post, we’ll discuss how to use DeepChem and MolNet scripts to add performance metrics for ML models.

Getting in touch

If you liked this tutorial or have any questions, feel free to reach out to Nathan over email or connect on LinkedIn and Twitter.

You can find out more about Nathan’s projects and publications on his website.


The TEMP2 pipeline

The overall pipeline of TEMP2 is shown in Figure 1 and contains three steps, described as follows.

Figure 1. Diagrams depicting how TEMP2 detects new germline and de novo transposon insertions. (A) Detection of new transposon insertions. The method contains three steps: alignment, clustering/classifying, and filtering. Paired-end reads are depicted as pairs of boxes each connected by a short horizontal line: open boxes for unmapped reads and colored boxes for mapped reads. The reference genome is represented as a blue line and a transposon element (TE) as a red line. The portion of a read mapped to the reference genome is marked in blue and the portion of the read mapped to a transposon is marked in red. Properly mapped read-pairs are connected by solid lines while discordant read-pairs are connected by dashed lines. Transposon-supporting read-pairs that are anchored at the same genomic locus (defined as within 95% of the fragment length of the sequencing library) are clustered. A 1p1 cluster is supported by multiple read-pairs with at least one read-pair on each side of the transposon insertion, a 2p cluster is supported by two or more read-pairs, but only on one side of the insertion, and unclustered read-pairs are singletons. (B) Estimation of the total number of de novo insertions of a transposon family. All raw reads (empty boxes) and singleton reads as defined in A (colored boxes) are aligned to the consensus sequence of each transposon family. According to where in the consensus the reads map, they are classified as end-mapping reads and center-mapping reads (see Materials and Methods). The total number of de novo insertions of a transposon family is defined as the difference between the actual number of end-mapping singleton reads and the expected number of end-mapping chimera reads, with the latter estimated using center-mapping singleton reads and all reads.

The first step of TEMP2 is mapping reads to the reference genome using the bwa mem algorithm (16) with the following command: bwa mem -T 20 -Y. Two types of read-pairs are then extracted from the mapping results: (i) discordantly mapped read-pairs for which one read is uniquely mapped to the reference genome while the other read is unmappable or mapped to multiple locations in the genome (Supplementary Figure S1 read-pairs #1–#8, #13–#36, and #41–#56); and (ii) split read-pairs that are properly mapped to just one location of the genome but the 5′-end of one read is soft-clipped (Supplementary Figure S1 read-pairs #9–#12 and #37–#40). The unmappable reads, multiply mapped reads, and split reads are then aligned to transposon consensus sequences using bwa mem, and the read-pairs that can be mapped to transposons are considered as reads that can support transposon insertions.

The second step of TEMP2 is clustering and classifying. Two transposon-supporting reads are placed into the same cluster if they satisfy either of the following two conditions: (i) they map to the same side of a transposon insertion and the distance between their mapped locations in the genome is smaller than the 95% quantile of the fragment length of the sequencing library, or (ii) they map to the opposite sides of a transposon insertion and their distance is smaller than twice the 95% quantile of the fragment length of the sequencing library. We then determine the breakpoints of an insertion using the soft-clipping site of split reads. If no split reads are available, we set the average coordinate of the 3′-ends of the supporting reads (Figure 1A) as the breakpoint. All insertions supported by read clusters are classified into three types according to their genomic location and read count: 1p1 (one-plus-one) insertions are supported by read-pairs on both sides of the insertion; 2p (two-plus) insertions are supported by two or more read-pairs, but these reads all come from one side of the insertion; and de novo insertions are supported by only one (i.e. singleton) read-pair. TEMP2 considers 1p1 and 2p insertions as germline insertions that are passed on to the next generation and uses singleton read-pairs to estimate the level of de novo insertions, which include the insertions into somatic genomes or the insertions into the germline genomes that do not lead to offspring. TEMP2 also allows users to set an insertion-frequency threshold for classifying whether an insertion is de novo, which is necessary when the sequencing library is constructed from a small number of cells because in such cases de novo insertions may be supported by multiple reads due to PCR amplification of the small amount of starting DNA.
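The two clustering conditions and the three insertion classes can be sketched in a few lines. This is not TEMP2's implementation: the `SupportingRead` type, the greedy pass over position-sorted reads on a single chromosome, and the example fragment-length quantile of 500 are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SupportingRead:
    pos: int   # genomic coordinate where the read anchors (hypothetical field)
    side: str  # "left" or "right" of the putative insertion


def same_cluster(a: SupportingRead, b: SupportingRead, q95: int) -> bool:
    """Condition (i): same side, distance < q95; condition (ii): opposite sides, distance < 2*q95."""
    limit = q95 if a.side == b.side else 2 * q95
    return abs(a.pos - b.pos) < limit


def cluster_reads(reads: List[SupportingRead], q95: int) -> List[List[SupportingRead]]:
    """Greedy clustering of position-sorted reads (a simplification, not TEMP2's code)."""
    clusters: List[List[SupportingRead]] = []
    for read in sorted(reads, key=lambda r: r.pos):
        if clusters and any(same_cluster(read, member, q95) for member in clusters[-1]):
            clusters[-1].append(read)
        else:
            clusters.append([read])
    return clusters


def classify(cluster: List[SupportingRead]) -> str:
    """1p1: support on both sides; 2p: two or more reads on one side; otherwise singleton."""
    if len({r.side for r in cluster}) == 2:
        return "1p1"
    return "2p" if len(cluster) >= 2 else "singleton"


reads = [
    SupportingRead(1000, "left"), SupportingRead(1200, "left"), SupportingRead(1700, "right"),
    SupportingRead(5000, "left"), SupportingRead(5100, "left"),
    SupportingRead(9000, "right"),
]
labels = [classify(c) for c in cluster_reads(reads, q95=500)]
```

Here the first three reads form a 1p1 cluster, the next two a 2p cluster, and the last read remains a singleton.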

The third step of TEMP2 is filtering. Three types of filtering are applied to remove false-positive insertions. First, TEMP2 discards insertions by a transposon into a location in the genome that is annotated to contain a copy of the same transposon, because the discordant reads that support such insertions are likely due to sequence alignment errors. Furthermore, we place these insertion positions on a blacklist to filter out other insertions detected at the same genomic positions, which often come from transposons in the same family, again suggesting alignment errors. Second, TEMP2 estimates the sequencing depth in the genomic region around each candidate insertion and compares it with the average sequencing depth across the whole genome. The number of mapped genome-sequencing reads that fall in a genomic window follows a bimodal distribution with one mode around the average coverage and the other mode much higher than five times the average coverage (Supplementary Figure S2A). Specifically, in our Illumina sequencing data, 0.226% of genomic windows had 5× or more reads than the overall genome coverage (27.1×). Thus, we filtered out the insertions located in genomic regions with 5× or higher sequencing depths. Third, TEMP2 merges the insertions at exactly the same genomic position (the vast majority of these insertions are from related transposon subfamilies) and assigns all supporting reads to the insertion with the most supporting reads. We conducted these three filtering steps immediately after calling potential transposon insertions to minimize the number of insertions and insertion-containing genomic regions that we need to examine, reducing TEMP2’s runtime.
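The second and third filters lend themselves to a short sketch. The data layout (tuples of chromosome, position, family and read count), the function names, and the example depths are hypothetical; only the 5× cutoff and the 27.1× average coverage come from the text.

```python
def filter_high_depth(insertions, local_depth, mean_depth, max_fold=5.0):
    """Drop insertions whose surrounding depth is max_fold times the genome average or more."""
    return [ins for ins in insertions if local_depth[ins] < max_fold * mean_depth]


def merge_same_position(insertions):
    """Merge insertions at the same (chrom, pos); the best-supported family receives all reads."""
    best = {}
    for chrom, pos, family, n_reads in insertions:
        key = (chrom, pos)
        if key not in best:
            best[key] = (family, n_reads, n_reads)  # (family, its own support, pooled support)
        else:
            top_family, top_support, pooled = best[key]
            if n_reads > top_support:
                best[key] = (family, n_reads, pooled + n_reads)
            else:
                best[key] = (top_family, top_support, pooled + n_reads)
    return [(c, p, fam, pooled) for (c, p), (fam, _, pooled) in best.items()]


# Hypothetical candidates: two related families called at the same site, one elsewhere.
candidates = [("2L", 100, "roo", 3), ("2L", 100, "rooA", 5), ("3R", 50, "copia", 2)]
merged = merge_same_position(candidates)

# Hypothetical local depths, compared with the 27.1x average reported in the text.
depths = {"ins_a": 30.0, "ins_b": 140.0, "ins_c": 400.0}
kept = filter_high_depth(list(depths), depths, mean_depth=27.1)
```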

After identifying germline transposon insertions, TEMP2 also estimates the frequency for each transposon insertion. Properly mapped unsplit read-pairs crossing more than 20 bp of an insertion breakpoint are defined as reference read-pairs. The frequency of each transposon insertion is estimated using the equation below:
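One plausible reading of these definitions, offered as an assumption and not as the published TEMP2 equation, is that the frequency is the fraction of read-pairs at a breakpoint that support the insertion: supporting / (supporting + reference). A minimal sketch with a hypothetical function name:

```python
def insertion_frequency(n_supporting: float, n_reference: float) -> float:
    """Fraction of read-pairs at a breakpoint that support the insertion (assumed form)."""
    total = n_supporting + n_reference
    if total == 0:
        raise ValueError("no read-pairs cover this breakpoint")
    return n_supporting / total


# Hypothetical counts: 5 supporting and 15 reference read-pairs.
freq = insertion_frequency(5, 15)
```

Under this reading, an insertion present in every genome would have no reference read-pairs and a frequency of 1.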

TEMP2 estimates the overall level of de novo transposon insertions for each transposon family in the whole genome using transposon-supporting singleton reads; however, TEMP2 does not make predictions on transposon insertions at individual loci. To detect de novo insertions, TEMP2 must guard against the chimera reads introduced during library construction, which are often singletons. Chimera reads should map to all locations in a transposon consensus sequence uniformly, while singleton reads that support transposon insertions should be enriched in the two ends of the transposon consensus sequence, as far into the interior of the consensus sequence as the fragment length of the sequencing library would allow. Thus, we can use singleton reads that map to the center region (the consensus sequence minus the two ends) to estimate the number of chimeric reads. TEMP2 determines the fragment lengths for all the read-pairs that map entirely to a unique location in the reference genome and then defines the end of a transposon as the 95th percentile fragment length minus 25 nts. The number of de novo insertions of a transposon family can be inferred by the difference between the number of end-mapping singleton reads and the number of center-mapping singleton reads; thus, the overall level of de novo insertions of a transposon family is:
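Following the Figure 1B description (observed end-mapping singleton reads minus the expected end-mapping chimera reads, with the latter estimated from center-mapping singleton reads and all reads), the estimate can be sketched as below. The exact scaling is an assumption: here the center-mapping singleton count is scaled by the end-to-center ratio observed in all reads, and negative values are clipped to zero. The function name and counts are hypothetical.

```python
def estimate_de_novo(end_singleton, center_singleton, end_all, center_all):
    """Observed end-mapping singletons minus expected chimeric end-mapping reads (assumed form)."""
    if center_all == 0:
        raise ValueError("no reads map to the center of the consensus")
    # Assumption: chimeras distribute along the consensus like all reads do,
    # so center singletons scaled by the end/center ratio predict chimeric end reads.
    expected_chimeric_end = center_singleton * end_all / center_all
    return max(0.0, end_singleton - expected_chimeric_end)


# Hypothetical counts for two families.
enriched = estimate_de_novo(end_singleton=60, center_singleton=10, end_all=2000, center_all=1000)
background = estimate_de_novo(end_singleton=5, center_singleton=10, end_all=2000, center_all=1000)
```

A family whose end-mapping singletons do not exceed the chimera expectation is reported at background level (zero here), as described for roo in the text that follows.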

TEMP2 outputs a confidence score (ranging from 0–100%) for each transposon family that equals one minus our estimated overall rate of chimera reads for this transposon family. Supplementary Figure S2B uses two example transposons to illustrate how TEMP2 estimates de novo insertion frequencies. Using our Illumina sequencing data, TEMP2 estimates that roo does not have a higher-than-background level of de novo insertions because its total number of singleton end-reads does not exceed the expected number of singleton end-reads, while Tirant is estimated to have 43 de novo insertions.

In a typical application when a sufficiently large number of cells (thousands or more) is used in the starting material to prepare the sequencing library, TEMP2 only considers singleton insertions as potential de novo insertions to estimate the genome-wide de novo insertion rate. In the rare cases when a limited number of genomes (hundreds or less) is used in the starting material, TEMP2 will not just consider singleton insertions, but will instead ask the user to provide the number of genomes in the starting material and then automatically set the insertion frequency threshold to be two times the theoretical frequency of de novo insertions for distinguishing potential de novo insertions from germline insertions.

To account for the cases of truncated de novo insertions such as 5′ truncated L1 elements (15,17,18), TEMP2 can also classify singleton reads that map to the two ends of fragmented transposons as insertion-supporting reads (using the ‘-T’ option), if there are enough reads (by default three or more reads at each end) to support these fragmented transposons elsewhere in the genome. Such fragmented transposons are used together with full-length transposons in the same family for computing end-mapping reads and center-mapping reads in the above equation for computing the overall rate of de novo insertions.

Simulated data

To benchmark the performance of TEMP2 and other transposon-detection methods, a set of Illumina sequencing data was simulated (see Supplementary Figure S3 for a summary). We simulated genomes with 400 new germline transposon insertions at different frequencies (0.25, 0.5, 0.75 and 1) and insertion lengths as follows. We first constructed 10 000 reference genomes (dm6) and then inserted 90 full-length transposons (randomly picked) and 10 partial-length transposons (6 I-element, 2 Doc, 2 F-element) into the same coordinates of 2500, 5000 or 7500 of the 10 000 reference genomes one by one. We also simulated 10 000 genomes with 20 somatic transposon insertions each. We inserted eight full-length 297, four full-length copia, three full-length Tirant, two partial-length Doc, one full-length 17.6, one full-length F-element and one full-length blood, hence 20 transposons in total, into different coordinates of the 10 000 simulated genomes one by one. Low mappability regions were excluded when inserting transposons.

Illumina read-pairs were then simulated using the ART algorithm (version 2.5.1) with parameters -ss HS25 -p -l 100 (read length) -m 450 (fragment size) -s 10 -na (19). For each of the 10 000 simulated genomes, we simulated Illumina read-pairs at 0.0001×, 0.0002×, 0.0003×, 0.0004×, 0.0005×, 0.001×, 0.002×, 0.003×, 0.004× and 0.005× genome coverage by adding parameter -f. In total, Illumina read-pairs at the sequencing depth of 1–50× genome coverage for 10 000 simulated genomes were generated for each genome set. Note that by 1–50× genome coverage, we mean that the total number of nucleotides that mapped to the reference genome was 1–50× the genome length. Two additional datasets with different percentages of chimera read-pairs (0.05% and 0.5%) were generated by combining two random reads into one read-pair.
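The pooled depth follows directly from the numbers above: 10 000 genomes at 0.0001× each add up to 1×, and at 0.005× each to 50×. The sketch below checks that arithmetic and assembles the ART argument list from the quoted parameters. The executable name `art_illumina`, the `-i`/`-o` options, and the file names are assumptions added for illustration; the source only lists the other flags.

```python
from fractions import Fraction

# Per-genome coverages quoted in the text, kept as strings for exact arithmetic.
PER_GENOME_COVERAGE = ["0.0001", "0.0002", "0.0003", "0.0004", "0.0005",
                       "0.001", "0.002", "0.003", "0.004", "0.005"]


def pooled_coverage(per_genome: str, n_genomes: int = 10_000) -> Fraction:
    """Total coverage when reads from n_genomes simulated genomes are pooled."""
    return Fraction(per_genome) * n_genomes


def art_command(reference_fasta: str, out_prefix: str, fold_coverage: str) -> list:
    """Argument list for one ART run; -i/-o and the file names are illustrative assumptions."""
    return ["art_illumina", "-ss", "HS25", "-p", "-l", "100", "-m", "450",
            "-s", "10", "-na", "-f", fold_coverage,
            "-i", reference_fasta, "-o", out_prefix]


pooled = [pooled_coverage(c) for c in PER_GENOME_COVERAGE]
cmd = art_command("genome_0001.fa", "genome_0001_reads", "0.0001")
```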

PacBio and Illumina whole-genome sequencing of Drosophila

For PacBio sequencing, female virgin flies (ISO-1 strain; two samples) were collected, starved for 1.5 h, and flash-frozen in liquid nitrogen. Genomic DNA was extracted and purified with standard procedures. The DNA library preparation for PacBio sequencing was performed by following the PacBio protocol called ‘procedure & checklist of 20 kb template preparation using the BluePippin size-selection system’. Briefly, the DNA was sheared by a Covaris g-TUBE device and purified using AMPure PB beads. The fragmented DNA was subjected to DNA damage repair and ligated with adapters. Then purified ligation products were size-selected using the BluePippin Size Selection system. After annealing and binding of SMRTbell templates and preparation for MagBead loading, the two libraries were run on the PacBio RS II and Sequel system in NextOmics (Wuhan, China), respectively. The sequencing results for each sample contained two SMRT cells.

For Illumina short-read sequencing, the whole bodies of 3–5-day-old female virgin flies (ISO-1 strain) were collected and used for DNA extraction. DNA quality was assessed by OD260/OD280 with Nanodrop and agarose gel electrophoresis. The library for Illumina sequencing was prepared as follows: (i) fragmentation with Covaris ultrasonicator, (ii) end-repair and phosphorylation of the 5′ ends, (iii) A-tailing of the 3′ ends, (iv) ligation of adapters, (v) 12 cycles of PCR to enrich for the ligated product. Sequencing was done with the Illumina HiSeq-2500 sequencer (run type: paired-end; read length: 125 nt) in Novogene (Tianjin, China).

Build a benchmark of transposon insertions using PacBio sequencing data

PacBio sequencing data were transformed to the FASTA format and then aligned to the dm6 genome using the Minimap2 algorithm (version 2.16) with parameters -x map-pb --MD (20). The mapping result was then provided to the Sniffles algorithm for structural variation detection with parameters -l 300 -s 1 (21). Only insertions longer than 300 bp were retained for further analysis because the shortest transposon in D. melanogaster is Stalker3, which is 372 nt in full length. The sequences of insertions were extracted and aligned to transposon consensus sequences using Minimap2 again to define new transposon insertions. A new transposon insertion is considered valid if both of the following conditions are satisfied: (i) the aligned length is longer than half of the insertion; and (ii) the alignment starts within 500 nt of the 5′-end of the insertion and ends within 500 nt of the 3′-end of the insertion. Transposon insertions within 50 bp were merged, and insertions with more than one supporting read were retained and considered as germline transposon insertions. Breakpoints of the insertions were set to the insertion sites that were supported by the most reads. The 5′-end and 3′-end of each inserted transposon were also annotated. To estimate insertion frequencies, genome-mapping PacBio reads around each breakpoint were tallied. Reads that cross a breakpoint by at least 50 bp were defined as reference reads, and reads split within 50 bp of the breakpoint were defined as supporting reads. Some PacBio reads were long enough to split in both the 5′-end and the 3′-end of an insertion, and these reads were counted as two supporting reads. The insertion frequencies were then estimated using the same equation as TEMP2:
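The two validity conditions for a PacBio-detected insertion can be written as a small predicate. This is a sketch under assumptions: coordinates are 0-based positions on the inserted sequence, "within 500 nt" is read as a gap of at most 500, and the function and parameter names are hypothetical.

```python
def is_valid_te_insertion(insertion_len, aligned_len, aln_start, aln_end, max_end_gap=500):
    """Check conditions (i) and (ii) for an inserted sequence aligned to a transposon consensus.

    aln_start and aln_end are the alignment bounds on the inserted sequence itself.
    """
    covers_half = aligned_len > insertion_len / 2               # condition (i)
    starts_near_5p = aln_start <= max_end_gap                   # condition (ii), 5' side
    ends_near_3p = insertion_len - aln_end <= max_end_gap       # condition (ii), 3' side
    return covers_half and starts_near_5p and ends_near_3p


# Hypothetical 6000-nt insertions with different alignment footprints.
full_length = is_valid_te_insertion(6000, aligned_len=5800, aln_start=100, aln_end=5900)
too_short = is_valid_te_insertion(6000, aligned_len=1900, aln_start=100, aln_end=5900)
late_start = is_valid_te_insertion(6000, aligned_len=5100, aln_start=800, aln_end=5900)
```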

We then manually inspected each of the 405 transposon insertions detected using the PacBio data. Among these 405 insertions, 73 were located in an annotated copy of the same transposon in the reference genome. We visualized the raw PacBio reads supporting each insertion in the IGV browser (v2.7.2) to examine detailed alignments (22). Furthermore, we manually aligned each inserted sequence back to the transposon consensus sequence. For 11 high-frequency insertions supported by many PacBio reads, the insertion sites indicated by the supporting PacBio reads were typically at exactly the same location in the reference genome or within a few base pairs of each other, indicating that these are true insertions. For the remaining 62 insertions, a portion of the supporting PacBio read could not be aligned to the reference genome because of the high sequencing error rate in that portion. However, when we manually aligned the portion back to the transposon consensus sequence, more than half of the portion could be aligned. Furthermore, their supporting PacBio reads pointed to positions in the reference genome that were far from one another (hundreds to thousands of base pairs apart), suggesting alignment errors. We deemed these insertions false positives. We further examined whether the 332 PacBio-detected insertions that were not in a copy of the same transposon could be supported by any Illumina reads. We first aligned Illumina reads to the reference genome via bwa mem using default parameters and then identified discordantly mapped read-pairs in the ±500 bp region flanking each of the 332 insertions. We aligned these discordant read-pairs to transposon consensus sequences via bwa mem using default parameters. If at least one discordant read-pair could be aligned to the inserted transposon, we deemed the insertion supported by Illumina reads.

Algorithm comparison

The main differences between the algorithms we assessed are listed in Supplementary Table S4. Algorithms were benchmarked on three sets of short-read whole-genome sequencing data: simulated D. melanogaster data, experimental D. melanogaster data that we produced, and human data from the NA12878 lymphoblastoid cell line downloaded from the 1000 Genomes Project.

For simulated and experimental D. melanogaster data, default parameters were used for each algorithm. To achieve a fair comparison, the same cutoff of transposon-supporting reads (five reads) was used for all algorithms. The sum of squared residuals (SSR) was defined as the sum of squared errors of the estimated de novo insertion rate across all transposons, including the transposons with 0 simulated insertions.
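Read as its name suggests, the SSR adds up the squared difference between the estimated and simulated value for every transposon, with transposons absent from either set counted as zero. A minimal Python sketch under that assumption follows; the function name, the dictionary inputs and the toy numbers are illustrative and are not taken from the study.

```python
def sum_of_squared_residuals(estimated, simulated):
    """Sum (estimated - simulated)^2 over the union of transposon names.

    Transposons missing from one dictionary are treated as 0 there, so
    transposons with 0 simulated insertions still contribute their error.
    """
    names = set(estimated) | set(simulated)
    return sum((estimated.get(n, 0.0) - simulated.get(n, 0.0)) ** 2 for n in names)


# Toy values only, chosen to make the arithmetic easy to follow.
toy_estimated = {"Doc": 1.5, "roo": 0.5}
toy_simulated = {"Doc": 1.0, "I-element": 0.0}
print(sum_of_squared_residuals(toy_estimated, toy_simulated))  # 0.5**2 + 0.5**2 + 0 = 0.5
```

Because spurious estimates for transposons with no simulated insertions are included, a method that reports many false de novo insertions is penalized even if it is accurate on the simulated ones.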

The SSRs were 0.3 for TEMP2 and 24.75 for TEMP (Figure 2F). When we considered only those seven transposons with non-zero simulated insertions, SSRs were 0.3 for TEMP2 and 17.69 for TEMP. The transposon library of D. melanogaster was downloaded from Flybase (23). Transposon insertions in the reference genome (dm6) were annotated using RepeatMasker with parameters -s -no_is -norna -nolow -e ncbi -cutoff 255 -div 40 -frag 20000 (24).

The performance of TEMP2 on simulated datasets. Simulated Illumina read-pairs at different sequencing depths were used for comparing the performance of TEMP2, TEMP, ERVcaller, MELT, RetroSeq and RelocaTE2 (in red, blue, green, yellow, purple and gray, respectively). Panels A–D, germline insertions. Panels E–F, somatic insertions. Except for panel E, for which three levels of chimeric read-pairs were tested, the datasets with 0.05% chimeric read-pairs were used for all other panels. (A) Performance of TEMP2 and other transposon-detection methods in detecting transposon insertions. Three panels of line plots depict the sensitivity, precision, and F1 score of detecting germline transposon insertions, respectively, as a function of sequencing depth. (B) Accuracies of TEMP2, TEMP and RetroSeq in estimating transposon-insertion frequencies. Line plots show the average error of estimated frequencies of germline transposon insertions as a function of sequencing depth. (C) Accuracies of TEMP2 and other transposon-detection methods in identifying the breakpoints in the reference genome. Line plots show the average distance between detected and simulated breakpoints of new germline transposon insertions. (D) Accuracies of TEMP2 and two other transposon-insertion methods in predicting the ends of inserted transposons. Line plots show the average distance between detected and simulated transposon ends of new germline insertions. (E) Accuracies of TEMP2 and TEMP in estimating somatic transposon insertion numbers. Line plots show the sum of squared residuals (SSR) of estimated somatic insertion numbers for all transposon subfamilies. Simulated data with 0%, 0.05%, and 0.5% chimeric read-pairs were tested and the results are displayed as solid, dashed and dot-dashed lines, respectively. This panel and panel F are benchmarked using simulated de novo insertions from six full-length transposons and one fragmented transposon (Doc). (F) Accuracies of TEMP2 and TEMP in estimating somatic transposon insertion numbers when the sequencing depth was set to 20×. Scatterplots compare simulated and estimated insertion numbers. Each dot denotes a transposon subfamily, and the 8 transposon subfamilies with simulated somatic insertions are in black while the other transposon subfamilies are in gray.
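For reference, the sensitivity, precision and F1 score plotted in panel A follow their usual definitions from counts of true-positive, false-positive and false-negative calls. A minimal Python sketch is given below. How a detected insertion is matched to a simulated one is not specified here, so the function simply takes the three counts as given; the function name and the toy counts are illustrative.

```python
def detection_metrics(true_positives, false_positives, false_negatives):
    """Return (sensitivity, precision, F1) from call counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    detected = true_positives + false_positives
    simulated = true_positives + false_negatives
    sensitivity = true_positives / simulated if simulated else 0.0
    precision = true_positives / detected if detected else 0.0
    total = sensitivity + precision
    f1 = 2 * sensitivity * precision / total if total else 0.0
    return sensitivity, precision, f1


# Toy counts: 80 simulated insertions recovered, 20 missed, 20 spurious calls.
print(detection_metrics(80, 20, 20))
```

F1 is the harmonic mean of sensitivity and precision, so it drops sharply when either one is low, which is why it is a convenient single summary across sequencing depths.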

We downloaded the .cram or .bam files of NA12878 low-depth and high-depth data from the 1000 Genomes Project. Although TEMP2 can work directly with these files, we wanted to ensure that the same parameters were used for genome mapping, so we extracted raw reads from these files using samtools (25) and then aligned the reads to hg38 using bwa mem with parameters ‘-T 20 -Y’ (16). Default parameters for ERVcaller, MELT, and RetroSeq were used to analyze the NA12878 data. We allowed 10% sequence divergence for TEMP2 and TEMP when aligning reads to transposon consensus sequences, the same as for MELT. To achieve a fair comparison of the algorithms, the same cutoff of transposon-supporting reads was used for each of the algorithms (3 for low-depth data and 10 for high-depth data). The transposon library, which contains Alu, SVA, and LINE1 consensus sequences, was downloaded from the MELT package (10). The reference insertion annotation of Alu, SVA, and LINE1 was also downloaded from the MELT package.
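The re-alignment step might be scripted roughly as below. This sketch only assembles the shell command lines and does not run them. The file names and the helper `realign_commands` are hypothetical, and the `samtools collate` / `samtools fastq` invocation is one common way to recover paired reads rather than the authors' documented procedure. Only the `bwa mem` options `-T 20 -Y` are taken from the text.

```python
import shlex


def realign_commands(alignment_file, reference_fasta, prefix):
    """Build two shell pipelines: extract paired reads, then re-align with bwa mem.

    Note: decoding a CRAM file may additionally require access to the reference
    it was compressed against; that is not handled in this sketch.
    """
    src = shlex.quote(alignment_file)
    ref = shlex.quote(reference_fasta)
    r1 = shlex.quote(f"{prefix}_1.fq.gz")
    r2 = shlex.quote(f"{prefix}_2.fq.gz")
    out = shlex.quote(f"{prefix}.hg38.bam")

    # Group mates together, then write read 1 and read 2 to separate FASTQ files.
    extract = (
        f"samtools collate -u -O {src} | "
        f"samtools fastq -1 {r1} -2 {r2} -0 /dev/null -s /dev/null -n -"
    )
    # Re-align with the parameters stated in the text and sort the output.
    align = f"bwa mem -T 20 -Y {ref} {r1} {r2} | samtools sort -o {out} -"
    return [extract, align]


for command in realign_commands("NA12878.cram", "hg38.fa", "NA12878"):
    print(command)
```

Re-aligning every sample with one fixed command keeps mapping differences from confounding the comparison between detection tools.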

We thank members of the Myers, Moffat, Boone, and Andrews laboratories for fruitful discussions. This research was funded by grants from the National Science Foundation (MCB 1818293), the National Institutes of Health (R01HG005084, R01HG005853), the Canadian Institutes for Health Research (MOP-142375), Ontario Research Fund, Genome Canada (Bioinformatics and Computational Biology program), and the Canada Research Chairs Program. M.B. was supported by a DFG Fellowship (Bi 2086/1-1).

Study conception: MB, MR, and CLM. Software and analysis: MR and MB. Result interpretation: MR, MB, MC, HNW, KRB, CB, JM, and CLM. Experiments: AHYT, KC, and MA. Manuscript drafting: MC, MA, HNW, KRB, BJA, CB, JM, MB, MR, and CLM. Funding: BJA, CB, JM, and CLM.


Next-generation sequencing is revolutionizing biological and clinical research. Long hampered by the difficulty and expense of obtaining genomic data, life scientists now face the opposite problem: faster, cheaper technologies are beginning to generate massive amounts of new sequencing data that are overwhelming our technological capacity to conduct genomic analyses (Mardis, 2010). Computational processing will soon become the bottleneck in genome sequencing research, and as a result, computational biologists are actively developing new tools to more efficiently and accurately process human genomes and call variants, e.g. SAMTools (Li et al., 2009), GATK (DePristo et al., 2011), Platypus, BreakDancer (Chen et al., 2009), Pindel (Ye et al., 2009) and Dindel (Albers et al., 2011).

Unfortunately, single-nucleotide polymorphism (SNP) callers disagree as much as 20% of the time (Lyon et al., 2012), and there is even less consensus in the outputs of structural variant algorithms (Alkan et al., 2011). Moreover, reproducibility, interpretability and ease of setup and use of existing software are pressing issues currently hindering clinical adoption (Nekrutenko and Taylor, 2012). Indeed, reliable benchmarks are required to measure accuracy, computational performance and software robustness, and thereby improve them.

In an ideal world, benchmarking data to evaluate variant calling algorithms would consist of several fully sequenced, perfectly known human genomes. However, ideal validation data do not exist in practice. Technical limitations, such as the difficulty in accurately sequencing low-complexity regions, along with budget constraints, such as the cost to generate high-coverage Sanger reads, limit the quality and scope of validation data. Nonetheless, significant resources have already been devoted to generating subsets of benchmarking data that are substantial enough to drive algorithmic innovation. Alas, the existing data are not curated, making them extremely difficult to access, interpret and ultimately use for benchmarking purposes.

Owing to the lack of curated ground truth data, current benchmarking efforts with sequenced human genomes are lacking. The majority of benchmarking today relies on either simulated data or a limited set of validation data associated with real-world datasets. Simulated data are valuable but do not tell the full story, as variant calling is often substantially easier using synthetic reads generated via simple generative models. Sampled data, as mentioned earlier, are not well curated, resulting in benchmarking efforts, such as the Genome in a Bottle Consortium (Zook and Salit, 2011) and the Comparison and Analytic Testing resource (GCAT), that rely on a single dataset with a limited quantity of validation data.

Rigorously evaluating predictions against a validation dataset presents several additional challenges. Consensus-based evaluation approaches, used in various benchmarking efforts (The 1000 Genomes Project Consortium, 2010; DePristo et al., 2011; Kedes and Campany, 2011), may be misleading. Indeed, different methods may in fact make similar errors, a fact that remains hidden without ground truth data. In cases where ‘noisy’ ground truth data are used, e.g. calls based on Sanger sequencing with some known error rate or using SNP chips with known error rates, accuracy metrics should account for the effect of this noise on predictive accuracy. Additionally, given the inherent ambiguity in the Variant Call Format (VCF) used to represent variants, evaluation can be quite sensitive to the (potentially inconsistent) representations of predicted and ground truth variants. Moreover, owing to the growing need to efficiently process raw sequencing data, computational performance is an increasingly important yet to date largely overlooked factor in benchmarking. There currently exist no benchmarking methodologies that account, in a consistent and principled fashion, for noise in validation data, ambiguity in variant representation or computational efficiency of variant calling methods.
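To illustrate the representation problem: the two VCF-style records below place a 2-bp deletion at different positions within a short repeat, yet they yield the same haplotype. The Python sketch compares variants by the sequence they produce rather than by their coordinates. It is a toy example with a made-up reference and hypothetical function names, and it does not describe how SMaSH itself resolves the ambiguity.

```python
def apply_variant(reference, pos, ref_allele, alt_allele):
    """Apply one VCF-style record (1-based POS, REF, ALT) to a reference string."""
    start = pos - 1
    end = start + len(ref_allele)
    if reference[start:end] != ref_allele:
        raise ValueError(f"REF allele {ref_allele!r} does not match reference at {pos}")
    return reference[:start] + alt_allele + reference[end:]


def same_haplotype(reference, record_a, record_b):
    """True if two records give identical sequences when applied to the reference."""
    return apply_variant(reference, *record_a) == apply_variant(reference, *record_b)


TOY_REFERENCE = "GTCACACAG"      # contains the repeat CACACA
left_aligned = (2, "TCA", "T")   # deletes the first CA unit
shifted = (4, "ACA", "A")        # deletes the second CA unit

print(apply_variant(TOY_REFERENCE, *left_aligned))  # GTCACAG
print(apply_variant(TOY_REFERENCE, *shifted))       # GTCACAG
print(same_haplotype(TOY_REFERENCE, left_aligned, shifted))  # True
```

A comparison that matches records on position and alleles alone would count one of these calls as a false positive and the other as a false negative, even though both describe the same underlying sequence.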

Without any standard datasets and evaluation methodologies, research groups inevitably perform ad hoc benchmarking studies, working with different datasets and accuracy metrics, and performing studies on a variety of computational infrastructures. Competition-based exercises (Earl et al., 2011; Kedes and Campany, 2011) are a popular route for benchmarking that aim to address some of these inconsistencies, but they are ephemeral by design and often suffer from the same data and evaluation pitfalls described earlier.

In short, the lack of consistency in datasets, computational frameworks and evaluation metrics across the field prevents simple comparisons across methodologies, and in this work, we make a first attempt at addressing these issues. We propose SMaSH, a standard methodology for benchmarking variant calling algorithms based on a suite of Synthetic, Mouse and Sampled Human data. SMaSH leverages a rich set of validation resources, in part bootstrapped from the patchwork of existing data. We provide free and open access to SMaSH, which consists of:

A set of five full genomes with associated deep coverage short-read datasets (real and synthetic)

Three contaminated variants of these datasets that mimic real-world use cases (M. DePristo, 2013, personal communication) and test the robustness of variant callers in terms of accuracy and required computational resources

Ground truth validation data for each genome along with detailed error profiles

Accuracy metrics that account for the uncertainty in validation data

Methodology to resolve the ambiguity in variant representations, resulting in stable measurements of accuracy and

Performance metrics to measure computational efficiency (and implicitly measure software robustness) that leverage the Amazon Web Services (AWS) cloud computing environment.

SMaSH is designed to facilitate progress in algorithm development by making it easier for researchers to evaluate their systems against each other.

Author information


CRUK Cambridge Institute, University of Cambridge, Cambridge, UK

Maurizio Callari, Stephen-John Sammut, Leticia De Mattos-Arruda, Alejandra Bruna, Oscar M. Rueda, Suet-Feung Chin & Carlos Caldas
