What good is the MinION?

This year, the Oxford Nanopore MinION has been shipped to some researchers for testing.

The advantage of a table-top sequencer for diagnostics and personalized medicine is obvious. Similarly, research "in the field" and forensic science could be aided greatly by this device. My question is about the applications in the research lab.

What does MinION do that a well-equipped lab with access to Sanger and next-gen sequencing cannot already do?

How well does the MinION work? How powerful is it?

What important scientific discoveries are being made using the MinION?

While the website of Oxford Nanopore has a list of publications, the most recent one is from 2013. This is before the early access program began - I am particularly interested in developments since February 2014.


Oxford Nanopore unveils portable genome sequencer – MinION

This diagram shows a protein nanopore set in an electrically resistant membrane bilayer. An ionic current is passed through the nanopore by setting a voltage across this membrane. Credit: Oxford Nanopore Technologies

(Phys.org)—U.K.-based Oxford Nanopore Technologies has made good on a promise made two years ago to produce an inexpensive genome sequencer based on nanopore technology. David Jaffe of the Broad Institute reported to an audience at the Advances in Genome Biology and Technology meeting in Florida last week that he had been asked by Oxford Nanopore to try out the new device, called the MinION, and has found it to be promising.

Oxford Nanopore Technologies created a stir in the biological sciences world in 2012 when representatives for the company announced that its research team had successfully created a sequencing device based on nanopore technology and that an inexpensive prototype would be delivered to researchers outside the company soon thereafter. The company apparently ran into difficulty ramping up its pore technology and had to find alternative materials—thus the two-year delay.

Nanopore sequencing works by pulling single strands of DNA through a pore—as each base passes through the pore, the change in conductivity is measured—information that a computer can use to identify the base. The advantage of this approach is that, at least in principle, any strand length can be sequenced.

The MinION isn't for sale yet—at this time the company is making prototypes available to a select few notables in the genome sequencing field—each will have to pony up $1000 as a down-payment for the honor. The device plugs into a port on a computer, which does the processing. Along with the device, which has been compared to the size of a pack of gum, Oxford Nanopore Technologies will send along DNA samples that researchers can use to learn how to use the new device—after that, they can sequence anything they want. In testing strands of two types of bacterial DNA sent to him, Jaffe reports having mixed results—long stretches of data were obtained correctly in some cases, but in others there were errors. In order to assemble the entire genome of one type of bacterium, for example, he had to also use a sequencing device made by another company. He noted that he is optimistic about the device, however, as reps from Oxford Nanopore have assured him that error rates can be reduced using certain techniques even as improvements to the device are being worked out back at the lab.

If the error rates of the MinION can be reduced to a practical level, the device could mark a turning point in genome sequencing, as it would be a device that could be carried and used for fieldwork. And though the cost of such a device has not been announced, Oxford Nanopore has repeatedly used terms such as inexpensive and affordable to describe it, indicating it will cost much less than other sequencing devices currently on the market, once it's ready for sale.


Nanopore MinION sequencing

The Oxford Nanopore MinION DNA sequencing system uses a novel nanopore technology for sequencing long stretches of DNA. Briefly, nanopores are inserted into a synthetic, electrically resistant membrane. A voltage is set across the membrane, establishing an ionic current through the pores. DNA molecules tagged with tether proteins are guided to the nanopores, and the electric field drives individual DNA strands through the biological nanopores at a rate controlled by a processive enzyme bound to the DNA at the pore orifice. Each base is identified by measuring the changes in current across the nanopore. Read length reflects the length of the DNA fragments submitted for sequencing.
More information about this technology can be found at https://nanoporetech.com.
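To make the measurement idea above concrete, here is a deliberately simplified toy sketch in Python: it simulates a noisy current trace for a short sequence and "calls" bases by matching each event's mean current to the nearest level in a small invented pore model. Real MinION basecalling segments raw current into k-mer events and uses neural networks; all levels and parameters below are made up for illustration.

```python
# Toy illustration only: real MinION basecalling uses neural networks over raw
# current, not a lookup table. All pore levels below are invented for the sketch.
import numpy as np

# Hypothetical mean current levels (pA) for single bases (real models use k-mers).
toy_model = {"A": 80.0, "C": 95.0, "G": 110.0, "T": 65.0}

def simulate_trace(seq, samples_per_base=10, noise=2.0, seed=0):
    """Simulate a noisy current trace for a DNA sequence."""
    rng = np.random.default_rng(seed)
    levels = np.repeat([toy_model[b] for b in seq], samples_per_base)
    return levels + rng.normal(0.0, noise, size=levels.size)

def call_bases(trace, samples_per_base=10):
    """Average the current over each event window and pick the closest model level."""
    calls = []
    for start in range(0, len(trace), samples_per_base):
        level = trace[start:start + samples_per_base].mean()
        calls.append(min(toy_model, key=lambda b: abs(toy_model[b] - level)))
    return "".join(calls)

trace = simulate_trace("GATTACA")
print(call_bases(trace))  # expected to recover "GATTACA" at this noise level
```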

MinION long reads are being used to close gaps and join contigs obtained from assemblies of Illumina reads of small genomes or genomic loci, and the MCIC Computational Biology Laboratory has the necessary bioinformatics tools set up.

Library preparation and sequencing

Library preparation entails a DNA fragmentation step, ligation of the adaptors (with or without indices), optional PCR amplification and final attachment of the tether/motor protein to the 5' end of the DNA molecules. Input DNA can be as low as 20 ng; however, for a PCR-free library preparation protocol at least 1-2 µg of DNA are needed. Preparation of intact, un-nicked, high-molecular-weight genomic DNA is crucial for obtaining long, good-quality reads.

Sequencing takes approximately 48 hours. During the run, data are collected locally and uploaded to the Nanopore cloud, where they are analysed using the Metrichor software. Metrichor also provides real-time feedback on the quality of the run. The length of the reads depends on the fragmentation size of the DNA and on its quality (no nicks). The system is capable of producing sequence reads of more than 200 kb, and the throughput of a flow cell is currently 100-200 Mb. Several different templates can be sequenced sequentially on a flow cell if a lower throughput per individual template is acceptable.

If you would like to start a MinION sequencing project please contact MCIC staff via e-mail or by calling 330-263-3828.


2 Answers

Short answer: yes, but you need to get permission (and modified software) from ONT before doing that.

… but that doesn't tell the whole story. This question has the potential to be very confusing, and that's through no fault of the questioner. The issue is that for the MinION, sequencing (or more specifically, generating the raw data in the form of an electrical signal trace) is distinct and separable from base calling. Many other sequencers also have distinct raw-data and base-calling phases, but they're not democratised to the degree they are on the MinION.

The "sequencing" part of MinION sequencing is carried out by ONT software, namely MinKNOW. As explained to me during PoreCampAU 2017, when the MinION is initially plugged into a computer it is missing the firmware necessary to carry out the sequencing. The most recent version of this firmware is usually downloaded at the start of a sequencing run by sending a request to ONT servers. In the usual case, you can't do sequencing without being able to access those servers, and you can't do sequencing without ONT knowing about it. However, ONT acknowledge that there are people out there who won't have Internet access when sequencing (e.g. sequencing Ebola in Africa, or metagenomic sequencing in the middle of the ocean), and an email to [email protected]> with reasons is likely to result in a quick software fix to the local sequencing problem.

Once the raw signals are acquired, the "base-calling" part of MinION sequencing can be done anywhere. The ONT-maintained basecaller is Albacore, and this gets the first model updates whenever the sequencing technology is changed (which happens a lot). Albacore is a local basecaller which can be obtained from ONT by browsing through their community pages (available to anyone who has a MinION). ONT switched to only allowing people to do basecalling locally in about April 2017, after establishing that using AWS servers was just too expensive. Albacore is open source and free-as-in-beer, but has a restrictive licensing agreement which limits the distribution (and modification) of the program. However, Albacore is not the only available basecaller. ONT provide a FOSS basecaller called nanonet. It's a little bit behind Albacore technology-wise, but ONT have said that all useful Albacore changes will eventually propagate through to nanonet. There is another non-ONT basecaller that I'm aware of which uses a neural network for basecalling: deepnano. Other basecallers exist, each at a varying distance behind the current technology, and I expect that more will appear in the future as the technology stabilises and more change-resistant computer scientists get in on the act.
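For readers who want to poke at the raw data themselves, the sketch below shows one way to pull a raw current trace out of a fast5 file before any basecalling. fast5 files are HDF5 containers; the internal path used here matches older single-read fast5 layouts and may differ between MinKNOW versions, and the file name is a placeholder.

```python
# Minimal sketch of pulling the raw current trace out of a (single-read) fast5 file.
# fast5 is an HDF5 container; the internal path below matches older single-read
# layouts and may differ between MinKNOW versions -- treat it as an assumption.
import h5py
import numpy as np

def read_raw_signal(fast5_path):
    """Return the raw DAC samples of the first read found in a single-read fast5."""
    with h5py.File(fast5_path, "r") as f:
        reads_group = f["Raw/Reads"]                  # assumed location of raw data
        first_read = next(iter(reads_group.values()))
        return np.asarray(first_read["Signal"])

if __name__ == "__main__":
    signal = read_raw_signal("example_read.fast5")    # hypothetical file name
    print(f"{signal.size} samples, mean level {signal.mean():.1f} (DAC units)")
```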

Edit: ONT has just pulled back the curtain on their basecalling software: all the repositories that I've looked at so far (except for the Cliveome) have been released under the Mozilla Public License (free and open source, with some conditions and limitations). Included in that software repository is Scrappie, which is their testing / bleeding-edge version of Albacore.


Materials and Methods

Barcode Testing With Known Species

We tested DNA barcoding on four different nematode species, Anisakis simplex, Panagrellus redivivus, Turbatrix aceti, and Caenorhabditis elegans. These species represent a subset of parasitic and free-living species with diverse lifestyles.

Anisakis simplex was dissected from fresh mackerel and stored in 70% ethanol, and one individual nematode was selected for DNA extraction. A. simplex is a marine parasite that uses crustaceans as intermediate hosts to infect teleosts and squids (Anderson, 2000). Although humans are accidental hosts for Anisakis spp., there has been a dramatic increase over the last decades in the reported prevalence of anisakiasis, a serious zoonotic disease (Chai et al., 2005).

Panagrellus redivivus was harvested from a fresh culture growing on oatmeal medium and used for DNA extraction. P. redivivus is a free-living nematode that has been used as a model system to study organ development, signal transduction, and toxicology and recently had its full genome and transcriptome sequenced (Srinivasan et al., 2013). The species is amongst others suggested as a comparative model for Strongyloides, as parasitic taxa are typically difficult to culture and analyze independently of their hosts (Blaxter et al., 1998).

Turbatrix aceti was harvested from a fresh culture in an apple cider vinegar medium and used for DNA extraction. The nematodes were washed in distilled water three times before DNA extraction, to mitigate an inhibiting effect of the vinegar medium on the subsequent Polymerase Chain Reaction (PCR). T. aceti is a free-living nematode that is mostly researched in relation to aging phenotypes, that are shared with other free-living nematodes such as Caenorhabditis elegans (Reiss and Rothstein, 1975). It is also used as live food in the larval stages of many fish species (Brüggemann, 2012). It lacks proper genetic studies, making it an interesting representative for the majority of nematode species that are mostly studied morphologically.

Caenorhabditis elegans strain N2 was grown on nematode growth medium (NGM) plates with E. coli OP50 for several days using standard procedures (Brenner, 1974) and subsequently harvested for DNA extraction.

DNA Extraction, PCR, and Sequencing

We extracted the DNA using the GeneJET Genomic DNA Purification Kit (ThermoFisher Scientific Ltd., Paisley, United Kingdom) according to the manufacturer’s instructions for mammalian tissue and rodent tail genomic DNA purification (protocol A), except that samples were lysed overnight (step 3) to ensure complete cuticle breakdown. DNA purity was measured on a NanoDrop 2000 spectrophotometer (software: NanoDrop2000, version 1.4.2; ThermoFisher Scientific Ltd., Paisley, United Kingdom).

We amplified an internal fragment of the 18S SSU rRNA gene from our DNA samples, using the primers and thermocycler protocol optimized by Floyd et al. (2005). This fragment is roughly 900 bp in length and widely used for nematode species identification. According to ONT’s instructions we adapted the primers from Floyd et al. (2005) to include an adapter tail at the 5′ end (“MinION tail,” in lowercase), which is compatible with the MinION workflows. This resulted in the following forward primer, Nem_18S_F_MinION: 5′ tttctgttggtgctgatattgcCGCGAATRGCTCATTACAACAGC 3′, and reverse primer, Nem_18S_R_MinION: 5′ acttgcctgtcgctctatcttcGGGCGGTATCTGATCGCC 3′. A different primer pair, SSU18A and SSU26R (Floyd et al., 2002), was initially tested with the MinION tails, but resulted in no PCR amplification for these samples. Each 25-μl PCR mix contained 2 μl purified DNA extract, 0.5 μl each of the forward and reverse primers (10 μM; Sigma-Aldrich/Merck Ltd., Poole, United Kingdom), 9.5 μl nuclease-free water (NFW; ThermoFisher Scientific Ltd., Paisley, United Kingdom), and 2X GoTaq Hot Start Colorless Master Mix (Promega, Southampton, United Kingdom). PCR was performed on a Bio-Rad T100 Thermal Cycler (Bio-Rad Laboratories Ltd., Watford, United Kingdom). The PCR protocol remained the same as in Floyd et al. (2005): initial denaturation for 5 min at 94°C, followed by 35 cycles of denaturation for 30 s at 94°C, annealing for 30 s at 54°C and extension for 1 min at 72°C, all followed by a final extension for 10 min at 72°C and cooling to 12°C.
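The tailed primers above are simply the lowercase ONT adapter tail prepended to the gene-specific sequence; the short Python sketch below rebuilds them from the parts quoted in the text, which can serve as a sanity check when ordering oligos.

```python
# Prepend the ONT "MinION tail" (lowercase in the text) to the gene-specific
# primers from Floyd et al. (2005), as described above.
FWD_TAIL = "tttctgttggtgctgatattgc"
REV_TAIL = "acttgcctgtcgctctatcttc"

primers = {
    "Nem_18S_F": "CGCGAATRGCTCATTACAACAGC",
    "Nem_18S_R": "GGGCGGTATCTGATCGCC",
}

tailed = {
    "Nem_18S_F_MinION": FWD_TAIL + primers["Nem_18S_F"],
    "Nem_18S_R_MinION": REV_TAIL + primers["Nem_18S_R"],
}

for name, seq in tailed.items():
    print(f"{name}: 5' {seq} 3' ({len(seq)} nt)")
```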

Successful amplification was confirmed using a 2% agarose gel (Agarose I, Molecular Biology Grade; Thermo Fisher Scientific Ltd., Paisley, United Kingdom) made with 1x TBE buffer (Thermo Fisher Scientific Ltd., Paisley, United Kingdom), with 1 μl of NovelJuice nucleic acid stain (Sigma-Aldrich/Merck Ltd., Poole, United Kingdom) loaded with each sample and the size ladder. PCR products were purified using the GeneJET PCR Purification Kit (Thermo Fisher Scientific Ltd., Paisley, United Kingdom) following the manufacturer’s instructions and eluted in 50 μl of Elution Buffer. DNA purity was measured on a NanoDrop 2000 spectrophotometer (software: NanoDrop2000, version 1.4.2; Thermo Fisher Scientific Ltd., Paisley, United Kingdom) and DNA concentration on a Qubit 1.0 (Thermo Fisher Scientific Ltd., Paisley, United Kingdom), using the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific Ltd., Paisley, United Kingdom). Both NanoDrop and Qubit measurements were taken twice per sample to confirm their accuracy.

We prepared the MinION library according to the 1D PCR barcoding amplicons/cDNA (SQK-LSK109) protocol from ONT (version PBAC12_9067_v109_revH_23MAY2018). This protocol incorporates a second PCR to attach ONT barcodes to our first-round PCR products as a means of indexing, allowing multiple samples to be run on one flow cell and subsequent demultiplexing in the bioinformatics stage. Briefly, the PCR Barcoding Kit (EXP-PBC001; ONT Ltd., Oxford, United Kingdom) was used to prepare a 100-μl PCR mix containing 2 μl barcode (10 μM; ONT Ltd., Oxford, United Kingdom), 48 μl first-round PCR product, and 50 μl 2X LongAmp Taq Master Mix [New England BioLabs (NEB) Inc., Hitchin, United Kingdom].

We tried to prepare the first-round PCR products in equimolar concentrations for the barcoding PCR, but due to large variations in DNA concentration between the samples we diluted the first-round PCR products of A. simplex and P. redivivus to between 100 and 150 fmol and used all of the first-round PCR product for T. aceti. A. simplex received barcode number 05, P. redivivus barcode 06 and T. aceti barcode 07. PCR was performed on a Bio-Rad T100 Thermal Cycler (Bio-Rad Laboratories Ltd., Watford, United Kingdom). The PCR protocol for an amplicon length of ∼1,000 bp (including primers) was as follows: initial denaturation 3 min at 95°C; 15 cycles of denaturation 15 s at 95°C, annealing 15 s at 62°C and extension 50 s at 65°C; final extension 50 s at 65°C; hold at 4°C. The PCR products were cleaned up with 1X Agencourt AMPure XP beads (Beckman Coulter Inc., Indianapolis, IN, United States). Finally, 1 μl of each purified second-round PCR product was quantified on the Qubit 1.0 (Thermo Fisher Scientific Ltd., Paisley, United Kingdom) using the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific Ltd., Paisley, United Kingdom).

The concentration of A. simplex and P. redivivus DNA was too high for Qubit quantification, so we prepared and quantified a 1/5 dilution in NFW (Thermo Fisher Scientific Ltd., Paisley, United Kingdom) that was taken forward. The second-round PCR products were pooled in roughly equimolar concentrations in 47 μl NFW (Thermo Fisher Scientific Ltd., Paisley, United Kingdom).

Library preparation continued using the reagents from the Ligation Sequencing Kit (SQK-LSK109; ONT Ltd., Oxford, United Kingdom), according to the manufacturer’s instructions. Briefly, we prepared 325 ng of pooled barcoded library in 47 μl NFW (ThermoFisher Scientific Ltd., Paisley, United Kingdom). The amplified product was end-repaired using the NEBNext Ultra II End-Repair/dA-tailing Module (NEB Inc., Hitchin, United Kingdom) for 5 min at 20°C and 5 min at 65°C, after which it was cleaned up with 1X Agencourt AMPure XP beads (Beckman Coulter Inc., Indianapolis, IN, United States). Adapter ligation was performed using NEB Blunt/TA Ligation Master Mix (NEB Inc., Hitchin, United Kingdom) and reagents provided in the SQK-LSK109 kit. Ligation took place for 10 min at room temperature. DNA was eluted in 15 μl Elution Buffer after being purified with 0.4X AMPure XP beads and washed with the Short Fragment Buffer provided in the SQK-LSK109 kit. 1 μl of the prepared library was quantified on the Qubit 1.0 (Thermo Fisher Scientific Ltd., Paisley, United Kingdom) using the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific Ltd., Paisley, United Kingdom) and gave a reading of 6.36 ng/μl, which equates to 102.9 fmol of library.

The protocol from ONT recommends loading 5-50 fmol of amplicon product onto the flow cell, so we diluted 5.44 μl of prepared library in 6.56 μl Elution Buffer to load 40 fmol of library onto the flow cell. The flow cell was primed for loading by flushing it with 1 ml priming mix (30 μl of Flush Tether in one tube of Flush Buffer), taking care to avoid the introduction of air bubbles. The library was prepared for loading by mixing 37.5 μl Sequencing Buffer, 25.5 μl Loading Beads and 12 μl diluted DNA library, after which the sample was added to a flow cell, type R9.5.1, through the SpotON sample port. Total library preparation time was estimated to be ∼3 h.
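The femtomole figures above come from converting a Qubit mass concentration into moles of double-stranded DNA. The sketch below shows that arithmetic, assuming an average mass of roughly 650 g/mol per dsDNA base pair; the amplicon length used is a placeholder rather than a value from the protocol, so the printed numbers will not exactly reproduce the 102.9 fmol quoted above.

```python
# Back-of-the-envelope conversion from a Qubit mass concentration to fmol,
# using an average mass of ~650 g/mol per dsDNA base pair. The amplicon length
# below is an assumed placeholder, not a value from the protocol.
def ng_to_fmol(nanograms, length_bp, grams_per_mol_per_bp=650.0):
    """Convert a dsDNA mass (ng) to femtomoles for a given fragment length."""
    grams = nanograms * 1e-9
    moles = grams / (length_bp * grams_per_mol_per_bp)
    return moles * 1e15

library_conc_ng_per_ul = 6.36      # Qubit reading reported above
eluate_volume_ul = 15.0            # elution volume from the protocol
assumed_length_bp = 1000           # placeholder amplicon length

total_fmol = ng_to_fmol(library_conc_ng_per_ul * eluate_volume_ul, assumed_length_bp)
fmol_per_ul = total_fmol / eluate_volume_ul
print(f"total: {total_fmol:.1f} fmol, {fmol_per_ul:.2f} fmol/ul")
print(f"volume for a 40 fmol load: {40.0 / fmol_per_ul:.2f} ul")
```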

We performed the sequencing run using MinKNOW (version 3.4.5; ONT Ltd., Oxford, United Kingdom) on the MinIT (a small, powerful computing unit that eliminates the need for a dedicated laptop; ONT Ltd., Oxford, United Kingdom), indicating the flow cell type and experimental kit used. As a metric of flow cell quality, the MinKNOW software assesses the active pore count of the flow cell in the multiplexer (MUX) scan before each run. Higher active pore counts indicate higher flow cell quality, with a maximum of 2,048 and a guaranteed level of 800. Our flow cell had 1,097 pores available for sequencing. The flow cell generated 116,620 reads in 10 min of sequencing, after which the run was stopped. The flow cell was subsequently washed using the Wash Kit (EXP-WSH002; ONT Ltd., Oxford, United Kingdom) with 150 μl Solution A, followed by 500 μl of Storage Solution, and stored in the fridge for re-use.

Portable DNA Extraction, PCR and Sequencing

In preparation for fieldwork, we tested whether the developed MinION procedure could also be performed on a fully portable system. We prepared the model organism C. elegans for MinION sequencing using a portable molecular lab, the Bentolab Pro 3 (Nature Biotechnology, 2016; Bento Bioworks Ltd., London, United Kingdom), and a multi tool (CMFTLi 10.8V Li-Ion Cordless Multifunction Tool, Clarke International Ltd., Epping, United Kingdom) as a low-cost handheld vortex. Most of the procedures are similar to those above, but working on the Bentolab required some essential adaptations to the protocols.

We extracted the DNA using the GeneJET Genomic DNA Purification Kit (ThermoFisher Scientific Ltd., Paisley, United Kingdom) according to the manufacturer’s instructions for mammalian tissue and rodent tail genomic DNA purification (protocol A). Adaptations to make this protocol work on the Bentolab were as follows. In step 3, the sample was divided over two 0.2 ml PCR tubes and briefly spun down using the Bentolab’s centrifuge (Bento Bioworks Ltd., London, United Kingdom). Subsequently, the two PCR tubes were incubated for 18 h at 56°C, using the Bentolab’s thermocycler (Bento Bioworks Ltd., London, United Kingdom) as a heating block; the thermocycler protocol performed 18 cycles of 1 h at 56°C. In step 4, the lysate was transferred to a 1.5 ml centrifuge tube, 20 μl RNase A was added and the tube was vortexed on the multi tool. Vortexing on a multi tool can be achieved by attaching the “straight saw blade” to the multi tool; this blade provides enough space for up to four centrifuge tubes at the same time. As a safety measure the sharp end of the saw blade was covered with duct tape. The centrifuge tube was then attached to the blade with duct tape, ensuring a tight fit. The multi tool was turned on at the highest speed (21,000 strokes/minute), creating an effect similar to a lab vortex. The vortexing in steps 5 and 6 was also performed using the multi tool. In step 7, the 2 ml collection tube of the GeneJET purification column was replaced by a 1.5 ml centrifuge tube with the cap cut off; the Bentolab’s centrifuge can only handle 1.5 ml tubes, and use of 2 ml collection/centrifuge tubes produces small plastic particles that can reduce the efficiency of the centrifuge lock system. Because of the reduced volume of the collection tube, the lysate was added to the prepared column at a maximum of 350 μl at a time, after which the sample was centrifuged for 1 min at 6,000 × g and the flowthrough discarded (three repeats were necessary to complete step 7). In step 8, 250 μl Wash Buffer I was added at a time, centrifuged for 1 min at 8,000 × g and the flowthrough discarded (two repeats were necessary to complete step 8). In step 9, 250 μl Wash Buffer II was added at a time, centrifuged for 4 min at 8,000 × g and the flowthrough discarded (two repeats were necessary to complete step 9, with the centrifuge time increased to compensate for the 8,000 × g maximum of the Bentolab’s centrifuge). An additional dry spin of 1 min at 8,000 × g was performed, after which the collection tube was discarded and replaced by a sterile 1.5 ml centrifuge tube. In step 10, 50 μl of Elution Buffer was added to the purification column.

PCR was prepared as described above, but this time performed using the Bentolab’s thermocycler (Bento Bioworks Ltd., London, United Kingdom). Also, aluminum foil was used as a sterile work environment as an alternative to a PCR hood. Aluminum foil was taped to the bench space using masking tape. Bleach (diluted 1:10 in water) was sprayed on the surface, left to sit for 3 min, and the surface was wiped with clean paper tissue. This process was repeated twice to decontaminate, after which 70% ethanol was used to remove any residual bleach. The Nem_18S_F/R_MinION primers did not work for C. elegans, so the primers from Floyd et al. (2002) were used with MinION tails (in lowercase): forward primer SSU18A_MinION: 5′ tttctgttggtgctgatattgcAAAGATTAAGCCATGCATG 3′ and reverse primer SSU26R_MinION: 5′ acttgcctgtcgctctatcttcCATTCTTGGCAAATGCTTTCG 3′. Each 25-μl PCR mix contained 2 μl purified DNA extract, 0.5 μl each of the forward and reverse primers (10 μM; Sigma-Aldrich/Merck Ltd., Poole, United Kingdom), 9.5 μl nuclease-free water (NFW; Thermo Fisher Scientific Ltd., Paisley, United Kingdom), and 2X GoTaq Hot Start Colorless Master Mix (Promega, Southampton, United Kingdom). PCR was performed on the Bentolab (Bento Bioworks Ltd., London, United Kingdom). The PCR protocol was adapted from Floyd et al. (2002): initial denaturation 5 min at 94°C; 35 cycles of denaturation 1 min at 94°C, annealing 1.5 min at 60°C and extension 2 min at 72°C; final extension 10 min at 72°C; hold at 12°C.

Successful amplification was confirmed using a 2% agarose gel (Agarose I, Molecular Biology Grade; Thermo Fisher Scientific Ltd., Paisley, United Kingdom) made with 1x TBE buffer (Thermo Fisher Scientific Ltd., Paisley, United Kingdom). For the Bentolab’s small gel chamber (Bento Bioworks Ltd., London, United Kingdom) we used 27.5 ml of 1X TBE buffer with 0.5 g agarose. The need for a scale was eliminated by using an Eppendorf tube marked with the volume corresponding to 0.5 g agarose. Agarose was melted into the TBE buffer using a traditional coffee pot, which has a typical conical shape, on the hob; we have also found this method to work on a camping stove. The gel was then poured into the chamber and left to set. The comb and shutters were removed and we added 45 ml 1x TBE buffer for the gel electrophoresis run (60 min at 60 V). We again used 1 μl NovelJuice (Sigma-Aldrich/Merck Ltd., Poole, United Kingdom) for the size ladder and per sample for DNA staining, as this DNA stain is safer to work with than traditional ethidium bromide and works both with UV transilluminators and with the blue LED transilluminator of the Bentolab (Bento Bioworks Ltd., London, United Kingdom).

The PCR product was cleaned up using the GeneJET PCR Purification Kit (Thermo Fisher Scientific Ltd., Paisley, United Kingdom) following the manufacturer’s instructions for DNA purification using a centrifuge (protocol A). Adaptations to make this protocol work on the Bentolab were as follows. In step 3, the 2 ml collection tube of the GeneJET purification column was replaced by a 1.5 ml centrifuge tube with the cap cut off (see the adaptations to the DNA purification protocol above for the reasoning). The solution from step 1 was added to the purification column, centrifuged for 1 min at 8,000 × g and the flowthrough discarded. In step 4, 350 μl Wash Buffer was added at a time, centrifuged for 1 min at 8,000 × g and the flowthrough discarded (two repeats were necessary to complete step 4). In step 5, a dry spin of 1.5 min at 8,000 × g was performed. In step 6, the collection tube was discarded and replaced by a clean 1.5 ml centrifuge tube; 50 μl of Elution Buffer was added to the purification column and centrifuged for 1 min at 8,000 × g.

We prepared the MinION library according to the 1D PCR barcoding amplicons/cDNA (SQK-LSK109) protocol from ONT (version PBAC12_9067_v109_revH_23MAY2018). As mentioned above, this protocol incorporates a second PCR to attach ONT barcodes to our first-round PCR products as a means of indexing. This not only allows multiple samples to be run on one flow cell, but also allows demultiplexing in the bioinformatics stage when a flow cell is reused: washing a flow cell after a run might leave some remnant DNA from previous runs, so the ONT barcodes help to identify the current sample in the bioinformatics stage. Briefly, the PCR Barcoding Kit (EXP-PBC001; ONT Ltd., Oxford, United Kingdom) was used to prepare a 100-μl PCR mix containing 2 μl barcode (10 μM; ONT Ltd., Oxford, United Kingdom), 2 μl first-round PCR product, 46 μl NFW (ThermoFisher Scientific Ltd., Paisley, United Kingdom) and 50 μl 2X LongAmp Taq Master Mix [New England BioLabs (NEB) Inc., Hitchin, United Kingdom]. C. elegans received barcode number 10. PCR was performed on the Bentolab thermocycler (Bento Bioworks Ltd., London, United Kingdom). The barcoding PCR protocol was slightly adjusted to accommodate the Bentolab’s inability to set 15-s cycle steps and its minimum thermocycler temperature of 10°C: initial denaturation 3 min at 95°C; 12 cycles of denaturation 20 s at 95°C, annealing 20 s at 62°C and extension 60 s at 65°C; final extension 50 s at 65°C; hold at 10°C. The PCR products were cleaned up with 1X Agencourt AMPure XP beads (Beckman Coulter Inc., Indianapolis, IN, United States) on a 3D-printed magnetic BOMB microtube rack (Oberacker et al., 2019).

Library preparation continued using the reagents from the Ligation Sequencing Kit (SQK-LSK109; ONT Ltd., Oxford, United Kingdom), according to the manufacturer’s instructions. Since we would not have an accurate way of quantifying DNA in the field, we based the volume of DNA used on a previous MinION run (Knot, unpublished data), preparing 33 μl of barcoded library in 47 μl NFW (Thermo Fisher Scientific Ltd., Paisley, United Kingdom). The amplified product was end-repaired using the NEBNext Ultra II End-Repair/dA-tailing Module (NEB Inc., Hitchin, United Kingdom) for 5 min at 20°C and 5 min at 65°C on the Bentolab thermocycler (Bento Bioworks Ltd., London, United Kingdom). The end-repaired library was cleaned up with 1X Agencourt AMPure XP beads (Beckman Coulter Inc., Indianapolis, IN, United States) on a 3D-printed magnetic BOMB microtube rack (Oberacker et al., 2019). Adapter ligation was performed using NEB Blunt/TA Ligation Master Mix (NEB Inc., Hitchin, United Kingdom) and reagents provided in the SQK-LSK109 kit. Ligation took place for 10 min at room temperature. DNA was eluted in 15 μl Elution Buffer after being purified with 0.4X AMPure XP beads and washed with the Short Fragment Buffer provided in the SQK-LSK109 kit.

The flow cell was primed for loading by flushing it with 1 ml priming mix (30 μl of Flush Tether in one tube of Flush Buffer), taking care to avoid the introduction of air bubbles. The library was prepared for loading by mixing 37.5 μl Sequencing Buffer, 25.5 μl Loading Beads and 12 μl DNA library, after which the sample was added to a flow cell, type R9.5.1, through the SpotON sample port. Total library preparation time was estimated to be ∼4.5 h.

We performed the sequencing run using MinKNOW (version 3.4.5; ONT Ltd., Oxford, United Kingdom) on the MinIT (a small, powerful computing unit that eliminates the need for a dedicated laptop; ONT Ltd., Oxford, United Kingdom), indicating the flow cell type and experimental kit used. To test whether old flow cells can still be useful for sequencing small barcoded amplicon libraries, we used a flow cell that had been used twice before, once in a 24 h run and once in a 2.5 h run. When reusing a flow cell the starting voltage has to be adjusted; we adjusted it to -225 mV, in line with ONT’s recommendation for the accumulated previous run time. As mentioned above, higher active pore counts indicate higher flow cell quality, with a maximum of 2,048 and a guaranteed level of 800 for new flow cells. The MUX scan indicated our flow cell had 43 pores available for sequencing. The flow cell generated 2,632 reads in 14 min of sequencing, after which the run was stopped.

Sanger Sequencing

Each of the samples used for the MinION sequencing was also sent for Sanger sequencing (GATC/Eurofins Genomics). Both forward and reverse strands were sequenced using the amplification primers as sequencing primers. Sanger sequence electropherograms were visually inspected and edited using 4Peaks version 1.8 (Nucleobytes B.V., Aalsmeer, the Netherlands). Edited forward strand and reverse complemented reverse strand sequences were aligned using Seaview version 4.7 (Gouy et al., 2010). Nucleotide mismatches were checked in the original electropherogram and resolved. A consensus sequence was derived for each sample and primer sequences trimmed from each end of it. The resulting sequences were 885 bp (A. simplex), 887 bp (P. redivivus), 832 bp (T. aceti), and 844 bp (C. elegans) long.
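As a small illustration of the orientation step described above, the sketch below reverse-complements the reverse Sanger read so that both strands read in the same direction and writes them to a single file ready for alignment and consensus building. The actual inspection, editing and alignment were done in 4Peaks and SeaView; the file names here are placeholders and Biopython is assumed to be available.

```python
# Sketch of preparing the two Sanger strands for alignment: the reverse read is
# reverse-complemented so both traces read in the same orientation. File names
# are placeholders; the actual editing/alignment was done in 4Peaks and SeaView.
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

fwd = SeqIO.read("sample_F.fasta", "fasta")                       # hypothetical input
rev = SeqIO.read("sample_R.fasta", "fasta")

rev_rc = SeqRecord(rev.seq.reverse_complement(),
                   id=rev.id + "_rc",
                   description="reverse strand, reverse-complemented")

# Write both orientated reads to one file, ready for alignment and consensus calling.
SeqIO.write([fwd, rev_rc], "sample_for_alignment.fasta", "fasta")
```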

Bioinformatic Analyses

The raw fast5 MinION reads were basecalled and demultiplexed using Guppy version 3.2.4+d9ed22f (ONT Ltd., Oxford, United Kingdom) to produce fastq files for each sample. Reads were classified as pass/fail based on a minimum quality score of 7. The fastq files were merged into one per sample and explored using NanoPlot (version 1.28.0), creating plots displaying log-transformed read length (“--loglength”). Barcode and primer trimming was performed using Porechop (version 0.2.4). A second round of demultiplexing requiring barcodes at both ends of the reads (“--require_two_barcodes”) was performed using Porechop. Subsequently, the MinION reads were processed using the default settings of the ONTrack pipeline (version 1.4.2; Maestri et al., 2019). Briefly, Seqtk seq was used to create fasta files complementary to the fastq files. Reads were clustered using VSEARCH (Rognes et al., 2016), after which the reads in the most abundant cluster were retained. Then 200 reads, randomly sampled using Seqtk sample, were aligned using MAFFT (Katoh et al., 2002), and EMBOSS cons was used to retrieve a draft consensus sequence from the MAFFT alignment. Another 200 randomly sampled reads (using Seqtk sample, different from the first set) were mapped to the draft consensus sequence using Minimap2 (Li, 2018) to polish the obtained consensus sequence. Samtools was used to filter and sort the alignment file and compress it to the bam format (Li et al., 2009). The nanopolish index and nanopolish variants --consensus modules from Nanopolish were used to obtain a polished consensus sequence. The ONTrack pipeline was run with three iterations, the standard value of the pipeline. This resulted in three polished consensus sequences, which were aligned with MAFFT to select the consensus sequence produced in the majority of the iterations. All scripts of the pipeline were run within a virtual machine (as part of the ONTrack pipeline), emulating an Ubuntu v18.04.2 LTS operating system, on a Mac laptop without using any internet connection. All the code used for the bioinformatic analyses and additional files necessary to replicate the analyses can be found at https://github.com/ieknot/MinION-DNA-barcoding-of-nematodes. MinION fastq and Sanger fasta accession numbers are reported in the results.
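The pass/fail split mentioned above (a minimum mean quality score of 7) is applied by Guppy during basecalling; the sketch below simply illustrates the underlying calculation on a demultiplexed fastq file, computing the mean read quality via the average per-base error probability. The file name is a placeholder.

```python
# Sketch of the pass/fail criterion applied during basecalling: a read "passes"
# if its mean quality is at least Q7. Guppy applies this internally; the snippet
# just illustrates the calculation on an existing fastq (file name is a placeholder).
import math
import gzip

MIN_Q = 7.0

def mean_read_quality(phred_scores):
    """Mean quality computed via the average per-base error probability."""
    probs = [10 ** (-q / 10) for q in phred_scores]
    return -10 * math.log10(sum(probs) / len(probs))

def split_pass_fail(fastq_path):
    passed, failed = 0, 0
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            fh.readline()                      # sequence line (unused here)
            fh.readline()                      # '+' separator
            quals = [ord(c) - 33 for c in fh.readline().strip()]
            if mean_read_quality(quals) >= MIN_Q:
                passed += 1
            else:
                failed += 1
    return passed, failed

print(split_pass_fail("barcode05.fastq"))      # hypothetical demultiplexed file
```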

To assess sequence accuracy, MinION raw reads and consensus reads were aligned to the corresponding Sanger-derived reference sequence using BLASTn (Altschul et al., 1990), with no sequence complexity masking (“-dust no -soft_masking false”). The consensus sequences were aligned to the corresponding Sanger sequence using the MUSCLE algorithm (Edgar, 2004) in Seaview version 4.7 (Gouy et al., 2010).
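If the BLASTn comparisons are written out in tabular format, per-read identity is easy to summarize. The sketch below assumes tabular output (-outfmt 6), which is an assumption beyond the flags quoted above, and a placeholder output file name; column 3 of that format is the percent identity of each hit.

```python
# Sketch of summarizing per-read accuracy from BLASTn tabular output. Running
# BLAST with "-outfmt 6" is an assumption here (the text only specifies the
# masking flags); column 3 of that format is percent identity.
import csv
from statistics import mean

def identities(blast_tab_path):
    """Return the percent-identity column for every hit in a -outfmt 6 file."""
    with open(blast_tab_path) as fh:
        return [float(row[2]) for row in csv.reader(fh, delimiter="\t")]

hits = identities("minion_vs_sanger.tsv")      # hypothetical output file
if hits:
    print(f"{len(hits)} hits, mean identity {mean(hits):.2f}%")
```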


I’ve always been interested in plants and have been diving deeper and deeper into the biological aspect of plants, especially floral morphology and physiology. One day I stumbled upon the concept of genetic engineering with plants and began collecting the resources and knowledge of how to do this, mainly on the internet and google forums.

As my interest grew, so too did my home lab. I collected lab equipment off eBay in various states of disrepair and taught myself a bit of electrical engineering. I fixed what seemed like a dead unit bought for $40 and brought it back to working order with a new value exceeding $2000.

After several years, I built up a fully functional genetic engineering laboratory in the 3rd bedroom of my mom’s apartment capable of doing everything from general molecular biology to genetic engineering and recently I added Next Generation Sequencing to my lab’s wheelhouse of skills.

Imagine a world where school children sequence one of the 400,000 plant species on this planet as part of their studies and get direct scientific attribution at an early age.


Results

ONT MARC data

The MARC phase 1 experiments were performed by five laboratories that sequenced the same E. coli strain, in duplicate, using R7.3 flow cells with two different sequencing kits: the SQKMAP005 (Phase 1a) and the SQKMAP005.1 (Phase 1b). Each sequence produced by the MinION was classified as pass or fail on the basis of base quality and converted to fastQ files using poretools [16] (see Methods section).

Because the main goal of this article is to evaluate the capability of ONT data to be used in resequencing analyses, in this section, we briefly report the principal characteristics of MARC experiments in terms of sequencing throughput, read length and quality distribution. A deep and comprehensive analysis of the characteristics of data generated by the MARC can be found in [ 15].

The total throughput of each experiment varies between and within laboratories, ranging from a minimum of 28 Mb to a maximum of 385 Mb, with a total number of sequenced reads that goes from around 6,000 to 45,000 (Table 1).

MARC experiments statistics

Exp name    Phase   Pass reads   Fail reads   Pass Mb   Fail Mb   ARS pass   ARS fail   Base prop (pass:fail)   Read prop (pass:fail)
Lab1_run1   1a      32,548       14,806       228.3     86.8      6,825.5    5,626      0.72:0.28               0.69:0.31
Lab1_run2   1a      17,805       11,303       120.9     65.7      6,658      5,729      0.65:0.35               0.61:0.39
Lab2_run1   1a       8,289        4,790        59.3     30.3      7,060      6,138      0.66:0.34               0.63:0.37
Lab2_run2   1a       2,901        1,708        21.1     10.1      7,200      5,681.5    0.68:0.32               0.63:0.37
Lab3_run1   1a      18,765        7,951       121.5     40.7      6,367      4,744      0.75:0.25               0.70:0.30
Lab3_run2   1a      19,169        7,538       156.8     48.4      8,007      6,152      0.76:0.24               0.72:0.28
Lab4_run1   1a      13,836       10,858        69.4     44.2      3,931      3,479      0.61:0.39               0.56:0.44
Lab4_run2   1a      19,024       12,341        98.3     52.8      4,352      3,563      0.65:0.35               0.61:0.39
Lab5_run1   1a      23,566        6,069       153.7     33.2      6,242      4,780      0.82:0.18               0.80:0.20
Lab5_run2   1a      17,673       26,351        48.2     39.4      1,528        439      0.55:0.45               0.40:0.60
Lab1_run1   1b      12,258       17,511        69.4     78.1      5,664      4,464      0.47:0.53               0.41:0.59
Lab1_run2   1b      14,235       10,162        72.4     24.7      4,738        667      0.75:0.25               0.58:0.42
Lab2_run1   1b       5,165        5,960        28.9     28.6      5,438      4,517.5    0.50:0.50               0.46:0.54
Lab2_run2   1b      28,054       30,044       206.5    178.2      7,261.5    5,944      0.54:0.46               0.48:0.52
Lab3_run1   1b      30,364       11,757       225.1     70.4      7,235      5,819      0.76:0.24               0.72:0.28
Lab3_run2   1b      14,800        6,569        94.1     34.1      6,285      5,164      0.73:0.27               0.69:0.31
Lab4_run1   1b       1,493        4,673         8.4     20.0      5,612      4,042      0.30:0.70               0.24:0.76
Lab4_run2   1b      11,484        5,856        65.5     27.3      5,381      4,371      0.71:0.29               0.66:0.34
Lab5_run1   1b      12,844        7,876        83.8     43.0      6,454      5,257.5    0.66:0.34               0.62:0.38
Lab5_run2   1b      11,126        5,894        72.8     31.7      6,382.5    5,113      0.70:0.30               0.65:0.35

Columns report the main characteristics of each experiment generated by the five laboratories of the MARC. For each experiment we report the phase (Phase), the number of reads (Pass reads and Fail reads), the throughput in Mb (Pass Mb and Fail Mb), the average read length in bases (ARS pass and ARS fail) and the proportions of throughput and reads between pass and fail reads (Base prop and Read prop). All statistics were calculated from the MARC fastQ files.
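The per-run figures in Table 1 are all derivable from the pass and fail fastq files. A minimal sketch of that calculation is below; the file names are placeholders for one MARC run.

```python
# Minimal sketch of recomputing the Table 1 per-run statistics (read count,
# throughput in Mb, average read length) from a pair of pass/fail fastq files.
# File names are placeholders for the MARC fastq files.
def fastq_stats(path):
    """Return (number of reads, total bases, mean read length) for a fastq file."""
    n_reads, n_bases = 0, 0
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:                 # sequence lines
                n_reads += 1
                n_bases += len(line.strip())
    return n_reads, n_bases, (n_bases / n_reads if n_reads else 0.0)

for label, path in [("pass", "lab1_run1_pass.fastq"), ("fail", "lab1_run1_fail.fastq")]:
    reads, bases, mean_len = fastq_stats(path)
    print(f"{label}: {reads} reads, {bases / 1e6:.1f} Mb, mean length {mean_len:.0f} bp")
```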


The average sequence length ranges between 5 and 7 kb for the great majority of the MARC experiments, and single reads range from hundreds of bp to tens of kb (Table 1 and Figure 1) across all 20 experiments. On average, pass sequences are longer (4-8 kb) than fail sequences (4-6 kb), and pass reads represent more than 60% of the total sequencing throughput in almost all experiments.

Size and quality distribution of pass and fail ONT sequences. Panel (A) shows read length distribution for pass and fail sequences of the 20 experiments performed by the MARC. Panel (B) reports the GC content percentage for pass and fail reads compared with randomly selected regions of the E. coli genome. Panels c, d and e show average read quality (C, the first 20 barplots are related to pass reads while the second 20 to fail reads), base quality distribution (D) and base quality as a function of sequence position (E) for pass and fail reads, respectively. A colour version of this figure is available at BIB online: https://academic.oup.com/bib.


The read GC content distribution is close to that of the E. coli reference genome for both pass and fail reads (Figure 1), and although the average quality of pass reads is clearly higher than that of fail reads, base quality does not depend on read position (except for roughly the first 100 bases; Figure 1E), demonstrating that DNA strand translocation through the nanopore is not affected by positional biases. This result is of fundamental importance because it suggests that the nanopore sequencing approach can generate high-quality sequences with no theoretical limit on length, except those introduced during sample preparation.

The results summarized in Table 1 and Figure 1 show that there are no significant differences between the experiments generated in Phases 1a and 1b in terms of sequencing throughput and median read length, in accordance with the data reported in [15].

Aligners and error rate estimation

The alignment of TGS sequences can be particularly challenging because of the large number of long reads they generate (from kb to tens of kb) and because of high error rates that are primarily InDels rather than substitutions [17]. The principal computational problem is how to align long (multi-kilobase) reads with moderate divergence from the genome (up to 20%, concentrated in InDels) at the same speed and sensitivity as SGS alignment methods.

At present, few methods have been tested or developed to properly map long reads generated by TGS platforms. Chaisson et al. [18] proposed a novel method (Basic Local Alignment with Successive Refinement, BLASR) that combines the data structures used in short-read mapping with alignment methods used in whole-genome alignment (see Methods section for more details). Heng Li extended the BWA-MEM algorithm [19] by combining relaxed Smith-Waterman scoring with heuristic filtering to support PacBio and ONT reads. Several papers proposed to map nanopore and PacBio data using the approach adopted by LAST [20]. LAST follows a three-step approach: it first finds initial matches between reads and the genome, then extends them with a gapless X-drop algorithm, and finally extends them with a gapped X-drop algorithm [21]. Recently, Jain et al. [17] proposed a novel approach, marginAlign [17], devised specifically for ONT data, which realigns reads against a reference genome by combining an HMM with the alignments generated by LAST and BWA (Burrows-Wheeler Aligner). Henceforth, we will refer to marginAlign with LAST as HMML and to marginAlign with BWA as HMMB.

To understand the capability of different alignment approaches to properly map ONT sequences against a reference genome, we applied the five aforementioned long reads alignment methods (BWA, BLASR, LAST, HMML and HMMB, see Methods section for more details) to the pass and fail sequences of the 20 MARC experiments.

BWA, HMML and HMMB produced soft-clipped alignments that represent 1.5-5% of mapped pass reads (1.5% for BWA and HMMB and 5% for HMML) and 10% of mapped fail reads (see Supplementary Table S1 for more details). Moreover, BWA was the only aligner to produce split mappings: 1% of pass reads and 8% of fail reads were split (Supplementary Table S1). Around 99% of pass reads and 80% of fail reads (Figure 2A) were aligned against the E. coli reference genome, and mapping performance depends strongly on sequence length (Figure 2B and C and Supplementary Figure S1): the longer the reads, the higher the fraction of sequences mapped by each method. At present, the best way to evaluate the likelihood that an alignment is correct is the mapping quality (MQ). This score is generally estimated by considering various factors, such as the number of base mismatches and the sizes of inserted or deleted regions in the alignment [22]. We analyzed the MQ values generated by BLASR and BWA (the only evaluated mappers that generate MQ) and found that around 99% of mapped pass reads and 90-95% of mapped fail reads (see Supplementary Table S1) have MQ ≥ 20. For this reason, all the subsequent analyses for BWA and BLASR were performed using reads with MQ ≥ 20. Although all five alignment strategies gave similar results, the LAST algorithm had the worst global performance and was the most influenced by sequence length for both pass and fail reads. None of the mapping strategies tested in this work was able to align 10% of the pass reads shorter than 1 kb or 40% of the fail reads shorter than 3 kb, suggesting that short reads with lower base qualities contain more sequencing errors than short reads with higher base qualities.
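The MQ ≥ 20 filter used for the BWA and BLASR alignments can be reproduced directly from the BAM files. A minimal sketch with pysam is below; the BAM file name is a placeholder.

```python
# Sketch of the MQ >= 20 filter applied to the BWA and BLASR alignments. The BAM
# file name is a placeholder; pysam is used to read the alignments.
import pysam

MIN_MQ = 20

def split_by_mapping_quality(bam_path, min_mq=MIN_MQ):
    """Count mapped reads above/below a mapping-quality threshold."""
    kept, dropped, unmapped = 0, 0, 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_unmapped:
                unmapped += 1
            elif read.mapping_quality >= min_mq:
                kept += 1
            else:
                dropped += 1
    return kept, dropped, unmapped

print(split_by_mapping_quality("lab1_run1_pass.bwa.bam"))   # hypothetical file
```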

Mapping algorithms comparison and sequencing error rate estimation. Panel (A) shows the proportion of mapped and unmapped reads for all five aligners (colors represent average read qualities). Panels (B) and (C) report aligner performance as a function of read length for pass (B) and fail (C) sequences. Sequencing error rate was estimated as a function of sequence position (D-I) and base quality (J-O). Error rate was estimated for substituted (D, G, J, M), inserted (E, H, K, N) and deleted bases (F, I, L, O) for pass (D-F, J-L) and fail (G-I, M-O) sequences. To simplify subplot grouping, panels (D-O) contain additional labels that describe variants and read type: Sub- (substitutions), Ins- (insertions), Del- (deletions), -Pass (pass reads) and -Fail (fail reads). A colour version of this figure is available at BIB online: https://academic.oup.com/bib.


As a further step, the aligned data were used to obtain a raw estimate of the ONT error rate for the three main sources of local errors: mismatches, insertions and deletions. To this end, for each mapping algorithm, we calculated the number of bases that are substituted, inserted and deleted with respect to the reference genome as a function of sequence position and read quality.
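One way to obtain such per-read counts from a BAM file is sketched below: insertions and deletions are read from the CIGAR, and substitutions are derived as NM minus the inserted and deleted bases, which assumes the aligner wrote an NM tag. The BAM file name is a placeholder, and this is only an approximation of the per-position bookkeeping used in the article.

```python
# Sketch of tallying inserted, deleted and substituted bases per aligned read.
# Insertions/deletions come from the CIGAR; substitutions are derived here as
# NM - ins - del, which assumes the aligner wrote an NM tag (an assumption).
import pysam

def per_read_errors(bam_path):
    """Yield (read name, substituted, inserted, deleted base counts)."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_unmapped or not read.has_tag("NM"):
                continue
            ins = sum(length for op, length in read.cigartuples if op == 1)   # I ops
            dele = sum(length for op, length in read.cigartuples if op == 2)  # D ops
            subs = read.get_tag("NM") - ins - dele
            yield read.query_name, subs, ins, dele

for name, subs, ins, dele in per_read_errors("lab1_run1_pass.bwa.bam"):       # placeholder
    print(name, subs, ins, dele)
```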

Although the results of these analyses reflect a combination of sequencing and alignment errors, the use of five different mapping strategies allowed us to mitigate the alignment effect and obtain a good estimate of the sequencing errors. To better evaluate the error rates estimated for ONT data, we compared these results with those obtained by the Illumina MiSeq and Pacific Biosciences platforms (see Methods).

Taken as a whole, panels D-O of Figure 2 show that the sequencing error rate depends only slightly on read position, while it is highly influenced by read quality. The fraction of substituted, inserted and deleted bases (Figure 2D-I) increases with sequence position until reaching a constant value at around 50-100 bp for all five aligners, with the exception of LAST on single-base substitutions. Conversely, the error rate for the three variant categories decreases as the average read base quality increases (Figure 2J-O), suggesting that read quality and read errors are highly correlated.

To evaluate this correlation, we used the alignment data generated by BLASR, BWA and LAST to estimate the Phred-scaled mismatch rate as Q = -10 log10(P), where P is the fraction of mismatches in each aligned read, and compared it with the predicted quality scores. The results of these analyses are reported in Supplementary Figure S2 and show that the predicted quality score accurately reflects the measured mismatch rate for both pass and fail reads (although for fail reads, the higher predicted quality scores are underestimated).
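A quick worked example of the Phred conversion used above; the mismatch fractions are arbitrary illustrative values.

```python
# Worked example of the Phred-scaled mismatch rate used above: Q = -10 * log10(P),
# where P is the per-read fraction of mismatched bases.
import math

def phred_from_mismatch_fraction(p):
    """Convert a mismatch fraction into a Phred-scaled quality value."""
    return -10.0 * math.log10(p)

for p in (0.10, 0.05, 0.01):
    print(f"P = {p:.2f}  ->  Q = {phred_from_mismatch_fraction(p):.1f}")
# P = 0.10 -> Q = 10.0; P = 0.05 -> Q = 13.0; P = 0.01 -> Q = 20.0
```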

All these analyses also demonstrate that the five mapping methods returned different results for the three error categories. The two marginAlign methods obtained the smallest error rate for single base substitutions, while LAST and BWA showed the best performance for small InDels (see Table 2 for more details).

Aligner   Sub pass   Ins pass   Del pass   Total pass   Proportion pass   Sub fail   Ins fail   Del fail   Total fail   Proportion fail
BWA       5.87       2.37       3.66       11.90        49:20:31          13.82      3.60       6.16       23.58        59:15:26
BLASR     3.64       3.65       5.14       12.43        29:29:42           7.44      6.12       9.25       22.81        33:27:40
LAST      8.01       2.43       3.71       14.15        57:17:26          39.11      5.19       5.11       49.41        79:11:10
HMMB      2.97       3.32       4.71       11.00        27:30:43           6.35      6.23       8.96       21.54        29:29:42
HMML      2.46       3.37       4.77       10.60        23:32:45           6.06      6.15       8.83       21.04        29:29:42
MiSeq     0.24       0          0           0.24        100:0:0            0.24      0          0           0.24        100:0:0
PacBio    1.15       6.73       3.29       11.17        10:60:30           1.15      6.73       3.29       11.17        10:60:30

Columns report the error rates (in percent) for substitutions (Sub), insertions (Ins) and deletions (Del) for pass and fail reads. 'Total' columns report the sum of the substitution, insertion and deletion error rates. 'Proportion' columns report the relative percentage of each error class (Sub:Ins:Del).


The total error rates (sum of the three error classes) for BWA and the two marginAlign approaches (the three best-performing methods) are around 11% (see Table 2), in accordance with the total error estimated in the first paper released by the MARC [ 15]. As expected, the average error rate for pass reads (11%) is much smaller than that obtained for fail sequences (around 21%, see Table 2). Interestingly, although PacBio sequences show a low error rate for substitutions (around 1%), they generate a total error rate comparable with ONT data as a consequence of frequent insertion errors, in accordance with previously published papers [ 23, 24]. As expected, the total error rate estimated for the SGS MiSeq reads (0.24%, Table 2) is almost two orders of magnitude smaller than that of the TGS technologies (around 11% for both PacBio and ONT, Table 2).

Error rate distribution

In resequencing studies, once the reads have been properly mapped, genomic variants are discovered by searching for differences between the reference genome and the aligned reads. For each genomic position, substitutions and small InDels (hereafter 'events') are inferred by comparing the number of reads that do not contain the reference allele with the total number of reads aligned to that position: a variant is called when the number of reads containing the same alternative allele is sufficiently large with respect to the total number of reads (for haploid genomes, a variant can be roughly called when at least half of the reads contain the same alternative allele). In this framework, although the error rate estimated in the previous section is a good approximation of sequencing accuracy (the capability of a sequencing technology to correctly sequence a DNA fragment), it cannot predict the number of false-positive events generated by a resequencing analysis, because that number depends on recurrent errors aligned at the same genomic position.

For this reason, for each position of the reference genome, we counted the number of reads that contain the same substituted, inserted or deleted bases. In this way, for each alignment, we estimated the 'recurrent' error distribution, which gives the probability of finding N reads containing the same error aligned at the same position of the reference genome. The study of these distributions allowed us to estimate the probability of detecting false-positive events and to assess the randomness of recurrent errors (the probability of finding N errors at the same position by chance).
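
A minimal sketch of this counting step, assuming the per-read errors have already been extracted from the alignments as (position, error type, allele) tuples; the data structure and function name are illustrative, not the pipeline actually used here.

```python
from collections import Counter

def recurrent_error_distribution(read_errors):
    """
    read_errors: iterable of (position, error_type, allele) tuples, one entry per
    erroneous base observed in an aligned read, e.g. (1432, 'sub', 'G').
    Returns a Counter mapping N -> number of genomic sites where exactly N reads
    share the same error.
    """
    per_site = Counter(read_errors)    # how many reads share each identical error
    return Counter(per_site.values())  # distribution of that recurrence count N

# Hypothetical toy example: three reads carry the same G substitution at position 1432
errors = [(1432, 'sub', 'G'), (1432, 'sub', 'G'), (1432, 'sub', 'G'), (2001, 'del', 'A')]
print(recurrent_error_distribution(errors))  # Counter({1: 1, 3: 1})
```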

To study the stochastic nature of recurrent errors, we used the sequencing error rates estimated in the previous section to simulate synthetic reads with randomly distributed substituted, inserted or deleted bases, and we calculated their recurrent error distribution. With this recipe, we obtained the probability distribution of finding N recurrent errors by chance and compared it with the distribution generated by each alignment by means of the Kolmogorov–Smirnov statistic.

The Kolmogorov–Smirnov statistic D quantifies the distance between two empirical distribution functions: the smaller D is, the closer the two distributions are. In our analyses, a small D value indicates that the recurrent error distributions of real and randomly generated reads are close, and consequently that real errors are randomly distributed along each read, independently of the genomic position to which they have been mapped. Conversely, large D values indicate that errors in real reads are not randomly distributed but fall at recurrent positions of the genome.
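
A hedged sketch of how such a comparison could be carried out with SciPy, assuming the recurrence counts for the real alignment are already available and that random errors are placed uniformly on a genome of roughly E. coli size; all numbers and names are placeholders.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

def simulate_recurrence(genome_size, n_errors):
    """Place n_errors uniformly at random on the genome and count how many land on each site."""
    positions = rng.integers(0, genome_size, size=n_errors)
    _, counts = np.unique(positions, return_counts=True)
    return counts

# real_counts would come from the alignment (number of reads sharing the same error per site);
# here it is a hypothetical placeholder
real_counts = np.array([1, 1, 2, 1, 5, 1, 1, 7, 1, 2])
random_counts = simulate_recurrence(genome_size=4_600_000, n_errors=500_000)

D, p_value = ks_2samp(real_counts, random_counts)
print(f"KS statistic D = {D:.3f}")  # large D -> real errors are not randomly distributed
```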

On one hand, all the D statistics estimated for MiSeq and PacBio alignments ( Figure 3) are close to zero, indicating that substitution, insertion and deletion errors are randomly generated during the sequencing process of these technologies. On the other hand, the D statistics obtained from ONT alignments suggest that the error distribution along the reads generated by the nanopore sequencing process is not completely random ( Figure 3 and Supplementary Figures S3–S8 ). LAST-based alignments (LAST and HMML) yielded D statistics close to one for all three error classes, while BWA, BLASR and HMMB gave D values larger than those obtained by MiSeq and PacBio, in particular for deleted bases.

Recurrent errors distribution analysis. Summary of the recurrent error distribution analyses for substituted (A–F), inserted (G–L) and deleted (M–R) bases. Panels (A, C, G, I, M and O) report the Kolmogorov–Smirnov statistic as a function of read quality, while panels (B, D, H, J, N and P) show the false-positive frequency as a function of average sequencing coverage. False-positive frequency is defined as the ratio between the total number of false positive events and the size of E. coli genome in bp. False-positive events are defined as genomic loci in which more than half of the aligned reads contain the same error. The barplots of panels (E, F, K, L, Q and R) report the base content of false positive events for substituted (E, F), inserted (K, L) and deleted (Q, R) bases. Each bar with suffix -R reports the distribution of nucleotides in which the false event occurs (for InDels the nucleotide before the event). Each bar with suffix -E contains the base content of the substituted/deleted/inserted bases. Panels (A, B, E, G, H, K, M, N and Q) report results for pass reads, while panels (C, D, F, I, J, L, O, P and R) for pass+fail reads. To simplify subplots grouping, all panels have title labels that describe variants and read type: Sub- (substitutions), Ins- (insertions), Del- (deletions), -Pass (pass reads) and -All (pass+fail reads). A colour version of this figure is available at BIB online: https://academic.oup.com/bib.


To evaluate the effect of recurrent errors on producing false-positive events, we counted the total number of genomic positions in which more than half of the mapped reads contain the same substituted, deleted or inserted bases. As expected, the frequency of false-positive events depends on sequencing coverage and read base quality. Figure 3 shows that increasing the coverage mitigates the effect of recurrent error biases and reduces the total number of false-positive events. Likewise, the removal of reads with low base quality increases the false-positive frequency by reducing sequencing coverage ( Supplementary Figures S9 and S10 ). Surprisingly, although PacBio data show a high sequencing error rate (of the same order of magnitude as ONT reads), they obtained the lowest false-positive rate for all three variant classes, with fewer than one false substitution every 100 kb and around one false InDel every 1 Mb. This result can be mainly ascribed to the nearly random nature of the error distribution along PacBio sequences (small D values). The performance of the MiSeq sequencer is similar to that of PacBio, a direct consequence of the low sequencing error rate of SGS reads reported in the previous section.
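
A small sketch of how the false-positive frequency could be computed under this definition, assuming per-site depth and per-site recurrent error counts have already been extracted from the alignments; names and toy numbers are illustrative.

```python
def false_positive_frequency(site_depth, site_error_count, genome_size):
    """
    site_depth: dict position -> number of reads aligned over that position.
    site_error_count: dict position -> largest number of reads sharing the same error there.
    A site is a false-positive event when more than half of its reads carry the same error.
    Returns false positives per bp of genome.
    """
    fp = sum(1 for pos, depth in site_depth.items()
             if site_error_count.get(pos, 0) > depth / 2)
    return fp / genome_size

# Toy example: one of two covered sites is a false-positive event
print(false_positive_frequency({100: 30, 200: 28}, {100: 20, 200: 3}, genome_size=4_600_000))
```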

Concerning ONT data, although BLASR proved to be the best aligner (in terms of false-positive frequency) for substitutions and insertions, and LAST for deletions, the global performance obtained by this sequencing approach is poor for all three variant classes. In the best experimental/computational setting (best aligner and coverage larger than 30×), ONT experiments produce around one false substitution and insertion every 10–50 kb and one false deletion every 1 kb, making the use of these data for small-variant discovery challenging. Moreover, combining pass and fail reads has little effect on reducing the false-positive frequency ( Figure 3).

As a further step, to understand the experimental and computational nature of recurrent errors, we studied the nucleotide content and the size distribution (for inserted and deleted bases) of all the false-positive events generated by each alignment. Although the five aligners produced slightly different results, the bar plots of Figure 3 and Supplementary Figure S11 show that recurrent errors follow specific nucleotide patterns that can be ascribed to intrinsic biases of the nanopore sequencing process. On one hand, recurrent substitution errors mainly affect C and G and, independently of the nucleotide they affect, the substituted bases are enriched in C and G. On the other hand, recurrently deleted bases principally involve A and T and mainly occur after A and T nucleotides of the genome. Supplementary Figure S11 also shows that the realignment strategy of marginAlign, irrespective of the mapper chosen for the primary alignment, introduces a bias that results in the loss of one or more nucleotides in poly-X homopolymers. Remarkably, inserted bases do not show any apparent bias, being equally distributed among the four nucleotides.

Moreover, we found that although the great majority of InDel calls are 1-base events for all the alignments, the two TGS data sets contain a significant fraction of inserted (PacBio) and deleted (PacBio and ONT) bases larger than 1 bp ( Supplementary Figure S12 ).

Taken as a whole, these results suggest that the translocation of C (G) through the nanopore is preferentially miscalled as G (C), while the translocation of A and T may result in the loss of one (or more) subsequent nucleotides by the sequencing/base-calling process.

Although it is difficult to fully explain the reasons for these errors, we speculate that both the deletions and the C–G miscalling can be mainly ascribed to algorithmic limits of the HMM underlying the Metrichor base caller. Taken as a whole, the results reported in this section can be of fundamental importance for improving the performance of base-calling methods and for the development of novel algorithms for the identification of small variants from ONT data.

Depth of coverage

At present, the most powerful method for the identification of CNVs in resequencing analyses is the depth of coverage (DOC) approach [ 25, 26].

The DOC approach is based on the simple idea that, during the sequencing process, reads are randomly and independently sampled from any location of the genome. Under this assumption, the number of reads mapping to a window of the reference genome should be proportional to the number of times the region appears in the DNA sample and should follow a Poisson distribution. Following this assumption, the copy number of any genomic region can be estimated by calculating the DOC of reads aligned to consecutive, non-overlapping windows of the genome. To assess the capability of ONT data to identify genomic regions involved in CNVs, we studied the statistical properties and biases of the DOC distribution and compared them with the other two sequencing technologies.
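
A minimal sketch of the window-based DOC computation, assuming read start positions and lengths have already been extracted from the alignments and ignoring CIGAR details; function and variable names are illustrative.

```python
import numpy as np

def depth_of_coverage(read_starts, read_lengths, genome_size, window=1000):
    """
    Compute depth of coverage in consecutive, non-overlapping windows from
    read start positions and lengths (a simplification that ignores alignment gaps).
    """
    per_base = np.zeros(genome_size, dtype=np.int32)
    for start, length in zip(read_starts, read_lengths):
        per_base[start:start + length] += 1
    n_windows = genome_size // window
    return per_base[:n_windows * window].reshape(n_windows, window).mean(axis=1)

# Under the uniform-sampling assumption, the window DOC should be approximately Poisson distributed.
```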

As a first step, we studied the relationship between DOC and classical genomic biases: local GC content and mappability (defined as the inverse of the number of times that a sequence originating from any position in the reference genome maps to the genome itself) calculated as in [ 27].

On one hand, the correlation between DOC and GC content has been previously reported in several papers for SGS data and is mainly owing to the amplification step of the sequencing process. On the other hand, the mappability bias is owing to the fact that the genome contains many repetitive elements, and aligning reads to these positions leads to ambiguous mapping. In Magi et al. [ 25], by analyzing Illumina, 454 and SOLiD reads, we observed that DOC is maximal for GC content between 35% and 60% and decreases at both extremes. In the same paper, we also found that the DOC distribution of highly mappable regions is closer to Poissonian than that of genomic regions with low mappability.
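
Local GC content per window can be computed directly from the reference sequence and then correlated with the window DOC values; a simple sketch, with an illustrative function name.

```python
def gc_content_per_window(sequence: str, window: int = 1000):
    """Fraction of G/C bases in consecutive, non-overlapping windows of the reference."""
    gc = []
    for i in range(0, len(sequence) - window + 1, window):
        chunk = sequence[i:i + window].upper()
        gc.append((chunk.count('G') + chunk.count('C')) / window)
    return gc

# The per-window DOC can then be correlated with these GC values (e.g. with numpy.corrcoef).
```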

On one hand, the results summarized in Figure 4 clearly show that ONT reads are only slightly affected by the two classical sequencing biases, with the exception of the LAST alignment, which is highly influenced by mappability. On the other hand, PacBio and MiSeq coverages strongly depend on local GC content, and this can be mainly ascribed to the PCR chemistry at the base of these two sequencing approaches.

DOC distributions and biases. The first column of panels (a–u) reports the histograms of DOC and the superimposed Poisson distribution (solid lines) for all the alignments. Second column shows the correlation between DOC and GC content, while third column the correlation between DOC and mappability. The rows of panels (a–u) show the results for different aligners/platforms: BWA (A–C), BLASR (D–F), LAST (G–I), HMML (J–L), HMMB (M–O), MiSeq (P–R) and PacBio (S–U). Panels (v1) and (w1) report the ID as a function of window size, (v2) and (w2) the error rate for duplications and (v3) and (w3) for deletions. Panels (v1), (v2) and (v3) refer to pass reads, while panels (w1), (w2) and (w3) refer to all (pass+fail) reads. A colour version of this figure is available at BIB online: https://academic.oup.com/bib.


As a further step, to understand the stochastic properties of the coverage distributions, we calculated the index of dispersion (ID) for different window sizes (10 bp, 20 bp, 50 bp, 100 bp, 200 bp, 500 bp, 1 kb, 2 kb, 5 kb, 10 kb and 20 kb). The ID, defined as the ratio between variance and mean, quantifies whether a set of observations is clustered or dispersed. In particular, an ID larger than one indicates overdispersed data that follow a negative binomial distribution, an ID smaller than one indicates underdispersed data that follow a binomial distribution, while ID = 1 indicates data with a Poisson distribution. In [ 25], we demonstrated that DOC distributions from SGS sequences exhibit an ID largely greater than one and that this overdispersion can be attributed to local GC content and mappability.
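
The index of dispersion itself is straightforward to compute from the per-window DOC values; a short sketch with a Poisson sanity check (names are illustrative).

```python
import numpy as np

def index_of_dispersion(window_doc):
    """ID = variance / mean of the per-window depth of coverage; ID close to 1 indicates Poisson-like data."""
    window_doc = np.asarray(window_doc, dtype=float)
    return window_doc.var(ddof=1) / window_doc.mean()

# For a Poisson process the ID should be close to 1:
rng = np.random.default_rng(1)
print(round(index_of_dispersion(rng.poisson(lam=30, size=10_000)), 2))
```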

All the ONT DOC distributions, with the exception of the LAST alignments, have an ID close to one ( Figure 4), which demonstrates the Poissonian nature of the nanopore sequencing process as a direct consequence of the low influence of GC content and mappability on these data. The large ID obtained by LAST can be mainly ascribed to the mappability bias of this alignment method, while the overdispersion of the PacBio distributions is principally owing to the GC content bias described above. Although MiSeq data are strongly affected by GC content, they show small ID values as a consequence of the small variance of these data.

As a final step, to evaluate the false-positive rate for CNV events, we calculated the fraction of genomic windows in which the 1-copy normalized DOC is larger than 1.5 (for duplications) or smaller than 0.5 (for deletions). ONT data obtained the best results for both duplicated and deleted regions, while PacBio reads gave the highest error rate, demonstrating a poor suitability for CNV analysis. Concerning the ONT alignments, the BWA mapping data obtain the smallest error rate, outperforming the other four methods. Moreover, the results reported in panels v1–w3 of Figure 4 show that ID and error rate decrease as the window size increases, and this trend is highly correlated with read size: MiSeq data start to decrease from window sizes larger than 100 bp, while TGS data from window sizes larger than 2 kb.
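
Under the same thresholds, the fraction of spurious CNV windows on a genome that is actually single copy can be estimated as follows; this is a sketch, and the normalization by the median window DOC is an assumption about how the 1-copy level is obtained.

```python
import numpy as np

def cnv_false_positive_fraction(window_doc):
    """
    Fraction of windows whose 1-copy normalized DOC exceeds 1.5 (spurious duplication)
    or falls below 0.5 (spurious deletion) on a single-copy genome.
    """
    norm = np.asarray(window_doc, dtype=float) / np.median(window_doc)
    dup_fp = np.mean(norm > 1.5)
    del_fp = np.mean(norm < 0.5)
    return dup_fp, del_fp
```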

Taken as a whole, these analyses demonstrate that nanopore sequencing is a uniform process in which reads are randomly and independently sampled. Notably, the error rate produced by 'all' reads (combining pass and fail) is much smaller than the error rate obtained with pass reads alone: although fail reads contain a large fraction of substituted, inserted and deleted bases, they increase coverage, which decreases the variance of the DOC distribution and consequently the number of false-positive windows.

Variants detection accuracy

To evaluate the detection rate of ONT data for substitutions, small InDels and CNVs, we aligned the MARC data (pass and combined pass and fail) and the other sequencing experiments against synthetic E. coli reference genomes.

Synthetic reference genomes were generated by substituting, inserting and removing bases from the E. coli reference genome (see the Methods section for more details). With this approach, we were able to simulate substitutions, small InDels from 1 to 50 bp, and deletions from 200 bp to 5000 kb in size. Moreover, by using a sophisticated strategy based on removing segmentally duplicated regions from the E. coli reference genome, we were able to simulate multiple-copy duplications (see Methods). After read mapping against the synthetic reference genomes, the detection rate for substitutions and small InDels was roughly estimated by calculating the proportion of modified loci in which more than half of the aligned reads contain the original reference allele. The detection rate was studied as a function of the local DOC of the modified loci and, for small InDels, as a function of variant size.
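
As an illustration of the first step (substitutions only; the simulation described in Methods also covers InDels and duplications), a synthetic reference with random single-base substitutions could be generated as follows; all names are illustrative.

```python
import random

def introduce_substitutions(reference: str, n_sites: int, seed: int = 0):
    """
    Return a synthetic reference with n_sites random single-base substitutions
    and the list of (position, original_base, new_base) changes that were introduced.
    """
    rng = random.Random(seed)
    seq = list(reference)
    positions = rng.sample(range(len(seq)), n_sites)
    changes = []
    for pos in positions:
        original = seq[pos]
        new = rng.choice([b for b in "ACGT" if b != original])
        seq[pos] = new
        changes.append((pos, original, new))
    return "".join(seq), changes
```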

The results of these analyses are summarized in Figure 5 and show that, as expected, MiSeq outperforms the TGS methods in detection accuracy for both substitutions and small InDels. PacBio obtained good results for substitution discovery but completely failed to detect small InDels. ONT data reached a detection rate of 0.9 for the discovery of substitutions with the two marginAlign mappers, and although the accuracy for small insertions was poor (<0.1), for small deletions BWA and marginAlign obtained detection rates of the order of ~0.3, much larger than that obtained by PacBio. As expected, the larger the InDel size, the lower the capability of all the alignment data to detect them. Remarkably (with the exception of MiSeq data), local DOC has little effect on the detection rate, while combining pass and fail reads reduces the sensitivity for both substitution and small InDel identification with respect to using only pass sequences.

Detection rate for substitutions and small InDels. Summary of the detection rate estimated with synthetic reference genomes. Panels (A) and (B) report the detection rate for substitutions as a function of base coverage. Panels (C–F) and (G–J) report the detection rate as a function of coverage and size for InDels respectively. The analyses were performed for pass reads (A, C, D, G, H) and for combined pass and fail reads (B, E, F, I, J). To simplify subplots grouping, all panels have title labels that describe variants and read type: Sub- (substitutions), Ins- (insertions), Del- (deletions), -Pass (pass reads) and -All (pass+fail reads). A colour version of this figure is available at BIB online: https://academic.oup.com/bib.


At present, few methods have been developed for calling variants with ONT data, and those that exist, Nanopolish [ 28] and marginCaller [ 17], can only search for substitutions. The Nanopolish variant caller first selects candidate variants on the basis of mismatches between the aligned reads and the reference genome and then groups them into sets of close variants. Each cluster of variants is used to generate a set of candidate haplotypes from the possible combinations of SNVs, and the haplotype that maximizes the probability of the event-level data is called as the sequence for the region. The marginCaller (part of the marginAlign tool) computes posterior alignment match probabilities between the bases in the reads and the reference by using an HMM-based realignment strategy.
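
A toy sketch of the haplotype-enumeration step described above; this mimics the idea rather than the actual Nanopolish implementation, and the scoring of each haplotype against the event-level data is omitted.

```python
from itertools import product

def candidate_haplotypes(reference_segment: str, snvs):
    """
    snvs: list of (offset_within_segment, alt_base) for one cluster of close SNVs.
    Enumerate every combination of reference/alternative alleles at the clustered sites.
    """
    haplotypes = []
    for choice in product([False, True], repeat=len(snvs)):
        seq = list(reference_segment)
        for use_alt, (offset, alt) in zip(choice, snvs):
            if use_alt:
                seq[offset] = alt
        haplotypes.append("".join(seq))
    return haplotypes

# Two clustered SNVs give 2**2 = 4 candidate haplotypes to score against the signal data
print(candidate_haplotypes("ACGTACGT", [(1, 'T'), (5, 'A')]))
```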

Unfortunately, Nanopolish performance could not be tested owing to the lack of raw Fast5 files, but the analyses performed with marginCaller on the synthetic variant data set ( Supplementary Figure S13 ) demonstrate that this tool can reach a detection rate of around 0.99. However, these analyses also show that the high detection rate of marginCaller comes at the expense of a significant number of false-positive substitutions in C and G, demonstrating that the HMM algorithm at the base of this tool cannot mitigate the effect of the recurrent error bias of ONT reads.

To evaluate the accuracy of the different sequencing technologies in identifying genomic regions involved in CNVs, we calculated the 1-copy normalized DOC for different window sizes. The absolute number of DNA copies of each simulated variant was estimated by calculating the median DOC of the windows within the region; a deletion is called if this value is smaller than 0.5, while a duplication is called if it is larger than 1.5. The results reported in Figure 6 and Supplementary Figure S16 clearly show that, although all the sequencing technologies are capable of correctly identifying deleted regions (0 copies), only MiSeq and ONT reads aligned with BWA are able to identify duplications with high accuracy and to estimate the exact number of DNA copies even for highly duplicated regions. Moreover, ONT–BWA data obtained the best correlation between simulated and predicted copy number, outperforming the MiSeq data. Notably, Supplementary Figure S14 demonstrates that sequencing coverage has little effect on the CNV detection rate.
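
A minimal sketch of this calling rule, assuming the per-window DOC of the candidate region and the 1-copy DOC level are already known; names and numbers are illustrative.

```python
import numpy as np

def call_cnv(region_window_doc, one_copy_doc):
    """
    Estimate the copy number of a candidate region from the median of its window DOC,
    normalized by the 1-copy DOC; call a deletion below 0.5 and a duplication above 1.5.
    """
    copies = np.median(region_window_doc) / one_copy_doc
    if copies < 0.5:
        return "deletion", copies
    if copies > 1.5:
        return "duplication", copies
    return "normal", copies

print(call_cnv([61, 58, 64, 60], one_copy_doc=30))  # a duplication at roughly 2 copies
```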

Detection rate for CNVs and absolute number of DNA copies prediction. Panels (A–D) show the detection rate of simulated deletions (A, B) and duplications (C, D) as a function of window size. Panels (E and F) report the correlation between the simulated and predicted absolute number of DNA copies for all the aligners/platforms. Panels (A, C and E) show the results for pass reads, while panels (B, D and F) show those for combined pass and fail reads. A colour version of this figure is available at BIB online: https://academic.oup.com/bib.


These results, combined with those reported in the previous section, demonstrate that ONT data can be readily used to identify CNVs with high accuracy.


Deborah Pardo on the mobile postdoc….

On the social compromises of early career research scientists…

Reaching an early-career position in science implies that you and I have already made some sacrifices in our personal and social lives. I wanted to write this blog because we don't often talk about it. I'm not sure it will change things very much, but it's an attempt to take steps towards finding the best balance, to be both happy in our lives and professionally efficient.

I come from a very nice place on the south-eastern coast of France, where there is a strong local culture and people are convinced they are living in the best place in the world. I am sure this applies to a lot of your home towns… or not. By wanting to be a researcher, by definition you have to move and spend some time abroad. By going abroad I don't only mean travelling, but actually living in a foreign country and sharing other people's cultures. Moving… such a great feeling: feeding your thirst for discovery, meeting interesting people, seeing the world, opening your mind and so much more. But on the other side it creates a GAP.

A gap, first, between cultures that is sometimes bigger than you would have expected. For my first postdoc I did not go very far and ended up in Cambridge in the UK. This is truly a great place and a very stimulating environment to work in… but I just sometimes don't understand people. Everything is always "amazing" and "lovely", but please could you tell me what you really think? From my cultural perspective this seems inefficient; it would help everyone move forward if we could distinguish between what is very good, average or bad. There is no shame in telling someone that they are heading in the wrong direction and why. But English people, as far as I understand, just don't want to offend… so I end up having to rely on pages like this: http://www.buzzfeed.com/lukelewis/what-british-people-say-versus-what-they-mean. This is at times funny but sometimes it becomes tiring. And this is just Europe; I can't imagine the gap between cultures when you change continents.

The second gap that you encounter when you expatriate for your job is with your childhood friends and family. Once, one of my best friends told me just before I left for an ERASMUS exchange in Sweden: "I don't understand how you can choose your job over your boyfriend, I could never do that, and it is so selfish for the poor guy staying alone here". Well, I guess some of you have been through this as well. It is very hard not to be understood and/or supported. To reach the point where we are now, we had to make choices. And it starts with a passion, a conviction, something stronger than just going to a random boring job to earn money. I believe the need to discover and better understand the processes around us is not something intrinsically felt by everyone. Those who don't feel it might never understand… and in the meantime the gap grows between you and your old friends and family members. With the amount of work you do and the little money you receive, the gap grows even more. Family members getting sick, grandparents getting old, newborns that you saw only once, a whole wedding organisation missed and great parties that you could not attend because it was too expensive to come back just for that. I guess we are all facing that and it is hard, but in the end you have to be really strong to take it and keep working very hard while sticking to who you really are. At some point in our career we might actually reach a moment where this is not such a big issue anymore…

Another gap is of course in your relationships. It seems like long-distance relationships, or no relationship at all, are an early-career researcher's speciality. Again, you either need to find someone really understanding or someone as passionate as you (although this might complicate things even more), but the string of short-term contracts in different cities or countries might become a real issue. And I am not even talking about building a family. For both men and women in research it is hardly ever the right time to have a child, as you need to maximise your working time so much in order to be recognised by your peers. I would argue, though, that it is even harder for women: first to feel ready, to have the right contract, an understanding boss, and not to be too scared of dropping research for a while. Unless your partner and friends are in the same position, you end up being so much slower than everyone else (this also includes becoming a homeowner) that the gap might grow again: between you, your friends, your partner, your family harassing you, the time you biologically still have to conceive, and the idea of your life you had before…

The final gap I can think of is with yourself. Why am I doing this? Can't I just give up everything and go back to my loved ones, or to a random boring job to earn money like everyone else? Why am I always thinking about work during evenings, weekends and holidays? Who cares anyway about the tiny, very specific questions I am investigating? Is it really the life I want/deserve? I find it hard sometimes to believe in what I am doing, and I believe I am not alone in this. Also, in research jobs in general, the work has no end. Therefore, the more you work, the further you can go! But this just opens new questions, and you need to work more to answer them… In the end I think we just need to sit down and think about what the right balance is for us. Because we pretty much all know it already: we just love what we are doing so deeply that things are not going to change. But at least it was nice to try to be aware of these things by writing this blog; I hope you like it!


What are the drawbacks of Oxford Nanopore Sequencing?

So I came across the MinION, a sequencer that enables real-time, long-read sequencing at a cost of a few thousand dollars, and I am wondering what the drawbacks of such a sequencer are. I also saw they have some larger sequencers, such as the PromethION, and I am curious how they perform compared to Illumina. Could these devices be used for RNA-seq in academia, or are they just not as viable? It looks like all their sequencers depend on "nanopore" technology, but I found out that the technology is relatively old, with the idea originating from around 2010, so why did no one else move on this type of sequencing if it is much more straightforward?

I found out that the technology is relatively old with the idea originating from around 2010

Uh? That would be extremely young. I also don’t know where you’re getting this number from. In reality, the idea of using nanopores for sequencing originated in the 1990s. It just took a very long time to develop a suitable pore and refine the technique to the point of usability.

By the time the first nanopore sequencers became available (the first commercial availability was 2015), the market was already firmly in the hands of Solexa/Illumina.

And in 2015, my lab basically found it to be a random base generator. Considering it's only been a few years, it's definitely progressed fast.

Yes, they can definitely be used for cDNA sequencing (and, for that matter, true native RNA sequencing). Accuracy concerns are a moot point for cDNA gene/transcript counting once you get above 90% (and Nanopore is near 95% with their most recent base callers). All that really matters is whether or not a sequence can be reliably assigned to its originating transcript.

It's my opinion that Nanopore cDNA sequencing runs have comparable (or possibly better) sensitivity and specificity than Illumina, with a lower cost, faster turnaround time, and true isoform-level results. This is because there's a lower noise floor (i.e. at most 2-5 reads for Nanopore vs 100 for Illumina), which compensates for a lower read count (e.g. 1M reads for Nanopore vs 40M reads for Illumina). In addition to that, the longer reads make it more likely that mapped reads will uniquely hit an isoform, and more likely that mapped reads cover the entirety of a transcript. I've got a graph showing different Illumina read lengths vs Nanopore here:

The amount of effort I have to put into convincing people to give them a try. Oxford Nanopore rely mainly on word of mouth to market their products, which means they rely a lot on the unskilled marketing ability of research scientists (such as me). Their staff will go to places to give talks, but generally only when they've been invited. I get money by analysing other people's data, so it's beneficial for me to put in that effort.

Heated discussions with critics who keep spreading misinformation like, "it'll never be used in a clinical setting", despite years-old published research contradicting those statements. I tend to assume that those statements are personal opinions / expectations, so try to avoid directly refuting those statements.

People tend to equate "cheap" with "easy". It's a very different process from other sequencing, even ignoring that labs tend to get sequencing done as a service rather than doing it themselves. Nanopore offer training, but it's very expensive, so most people try to muddle through themselves (hence the tendency for sequencers to collect dust).

Oxford Nanopore's official sample prep protocols are not properly versioned, and have lots of traps for naive users (especially users who aren't used to working within a sequencing service facility). ONT also haven't said explicitly that it's okay to copy protocols into a proper protocol versioning service (e.g. protocols.io), so there are various dances involved in what gets reported in research papers.

Software for working with nanopore reads is very new, tends to be command-line based, and hence is not very user-friendly.

MinION sequencers (the $1000 ones) can't be used to provide commercial sequencing services to other people [Nanopore's response is that other people should just buy their own MinION].

The cost for GridION and PromethION is so low that it sets up "too good to be true" suspicions for funding (the commercial value of flow cells included in the sequencer purchase is similar to the capital cost of the sequencer).

Technology updates. By the time a particular technology gets into a published paper, it's a common occurrence that Nanopore has moved on and created a better thing (and has "archived" / deleted their protocols for the published technology).


Accessibility + Speed = Breakthroughs

Since its launch in 2015, the MinION has put DNA sequencing in the hands of the most curious scientists, some of whom work in the hardest-to-reach places in the world. Examples are frequently posted to Oxford Nanopore's Twitter feed @nanopore.

Another, posted just this week, describes how MinIT and MinION have been used on board a marine research vessel in Alaska to perform onboard scientific analyses of seawater.

The work focuses on DNA analysis of microbial life in the sea and helps the researchers understand the marine ecosystem, biodiversity in the ocean and how climate change can affect the microorganisms there.

Focused on performance — both speed and accuracy — Oxford Nanopore uses NVIDIA GPUs for data analysis along with all their DNA sequencing devices.

With MinIT on NVIDIA AGX, they're approaching a 10x performance improvement over previous versions to help unlock real-time human and plant genomics. Their benchtop PromethION product is powered by NVIDIA Volta GPUs and can crank out a human genome for under $800.