Abstract-DNA sequencing based on nanopore sensors is now entering the marketplace. The ability to interface this technology to established CMOS microelectronics promises significant improvements in functionality and miniaturization. Among the key functions to benefit from this interface will be basecalling, the conversion of raw electronic molecular signatures to nucleotide sequence predictions. This paper presents the design and performance potential of custom CMOS basecallers embedded alongside nanopore sensors. A basecalling architecture implemented in 32-nm technology is discussed with the ability to process the equivalent of 20 human genomes per day in real-time at a power density of 5 W/cm², assuming a 3-mer nanopore sensor.
I. INTRODUCTION
With the recent emergence of miniature molecular readers into biological laboratories throughout the world, the field of DNA sequencing is beginning another major phase of evolution. In particular, nanopore-based molecular sensor arrays housed in palm-sized packages have entered the market over the last two years with the ability to process DNA at rates in excess of 50 bp/s (basepairs per second) over 100s of channels for dozens of hours in a row, leading to data generation in the 100s of GB per "modest" experiment. These numbers are astonishing compared to the metrics of their proof-of-concept predecessors reported 20 years ago [1]. As the technology improves, in no small part due to its close interface with CMOS (complementary metal-oxide-semiconductor) microelectronics, substantially higher performance can be expected.
At this time, the information processing load from such devices is handled by traditional desktop and cloud-based machines. However, the extremely compact physical dimensions of the new sensor platform call out for a similarly scaled compute resource. In tandem, molecular sensing with embedded measurement processing offers extremely broad application opportunities for DNA sequencing in particular and the automated genomic laboratory in general [2].
This paper focuses on a particular aspect of this processing, basecalling: the conversion of raw electronic signals produced by DNA-sensor interactions to predictions of the molecule's base-pair constituents (i.e. adenine, cytosine, guanine, thymine). Although this is just one part of a sophisticated sequencing pipeline [3], the large amount of raw data this step must process makes it especially well suited to a dedicated compute engine. This advantage is amplified if we seek real-time basecalling functionality alongside the small form-factors already inherent to micro/nano technologies. Such a combination of size and speed in DNA sequencing machines should be particularly critical in promoting the vision of ubiquitous genomics [4].
The authors are with the Department of Electrical Engineering and Computer Science, Lassonde School of Engineering, York University, 4700 Keele St, Toronto, Canada (magiero@cse.yorku.ca).
Fig. 1. a) Representation of biological pore cross-section undergoing DNA translocation and its associated electrochemical current, $I_{signal}$. b) The character of the electronic signal from the nanopore before amplification. c) Generic mixed-signal CMOS amplification, filtering, and event detection responsible for converting $I_{signal}$ to a digital sequence of event values, $V_{signal}$, to be processed by the basecaller.
In this paper we detail the potential of CMOS technology for real-time embedded DNA basecalling from nanopore sensors. In particular we invoke nanopore-related sensor, circuit, system, and algorithmic developments to explore and comment on the core physical resources (i.e. CMOS chip area, speed, and power consumption) that can be expected of such basecallers.
II. NANOPORE-BASED SEQUENCING
Nanopore sensors and their application to sequencing have been extensively studied and described in the literature over the last two decades [5]. Presently, nanopore-based DNA sequencers utilizing biological pores, as sketched in Fig. 1a), have achieved the most impressive results "in the field" (e.g. real-time Ebola surveillance [6]). These are modified protein complexes that occur in nature (e.g. as virulence factors) and form openings with diameters on the order of 1-2 nm through which DNA can be threaded. As indicated in Fig. 1b), the minute ionic current fluctuations $I_{signal}$ (on the order of 10 pA from biological pores) induced by the translocation of DNA through the pore assume a piecewise constant (albeit corrupted by noise) profile versus time. Each plateau is associated with an event indicative of a discrete structural feature of the molecule segment currently in the pore.
978-1-4577-0220-4/16/$31.00 ©2016 IEEE
To process the small signals available from the sensor, an amplification and signal processing chain is needed, as shown in Fig. 1c). Typically this consists of an electrode capable of forming Ohmic contacts in an electrolyte, a transimpedance amplifier (TIA), a low-pass filter (LPF), an analog-to-digital (A/D) converter and finally an event detector (ED). The TIA-LPF-A/D chain amplifies, conditions, and digitizes the raw signal while the ED predicts the event levels contained therein in terms of a voltage $V_{signal}$. The output of this chain is suitable for processing by the basecalling engine.
In the majority of experimental nanopore studies these functions have been accomplished with off-the-shelf technology, a set-up that encumbers the apparatus with significant parasitics and hence compromises the event rate (i.e. the maximum $I_{signal}$ frequency) that can be accurately processed. Instead, employing co-packaged nanopore-CMOS TIAs has been shown to boost the workable event rate by orders of magnitude [7]; integrating the remaining functions noted above in silicon naturally follows [8]. As noted, the integrated function of interest in this paper is the basecaller (BC) that follows the ED.
III. BASECALLING
Ideally, $I_{signal}$ and hence $V_{signal}$ would assume only four possible levels in relation to the four unique bases (A, C, G, T) that constitute DNA. In this case, the BC following the ED could be realized as a threshold detector referenced to some model of the output, that is, a training-based model associating event levels with bases in a one-to-one fashion. Endowing the ED with optimum filtering properties (e.g. matched filter) and the BC with optimum detection properties (e.g. maximum-likelihood) would achieve the basecalling function sought. This picture is complicated by the fact that in practice multiple features contribute to the pore output, resulting in the observation of many more event levels in $V_{signal}$ than four. The most prominent aspect of this issue is that multiple bp's occupy the nanopore constriction at any given moment; therefore a new event in $I_{signal}$ generated by the entrance of a bp into the pore is dependent on the $L - 1$ bp's that preceded it. For example, with three nt's substantively occupying the pore we may expect any one of $4^3 = 64$ different levels associated with $I_{signal}$ and ultimately $V_{signal}$. A BC for such a system replaces the threshold detector noted above with a sequence detection device. The integrated CMOS hardware requirements of such an implementation for sequence detection are explored in this paper.
A. Sequence Detection
BC sequence detector design is guided by the well-known hidden Markov model (HMM) formalism which, in this case, is applied to the properties of the nanopore sensor. The HMM description of the system's evolution over time can be expressed in the form of a trellis diagram (see Fig. 2). The trellis enumerates the possible states of our system, $4^L$ in our case, and accounts for all the possible ways or transitions (arrows in Fig. 2) in which any one state could transition into another state over a discrete time step (i.e. from stage $k$ to stage $k + 1$). In a basic BC, as assumed in this work, a total of $4^{L+1}$ transitions (4 from/to each state) are accounted for between adjacent stages. As elaborated below, the transitions are numerically weighted with branch metrics $a_{ij}$ inversely proportional to the probability that a state $i$ at time $k$ becomes state $j$ at time $k + 1$.
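The trellis bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under an assumed state model: each state is taken to be a string of L bases, and a transition drops the oldest base and appends a new one, which yields the 4 transitions from/to each state quoted in the text. The function names are hypothetical and do not come from the paper.

```python
from itertools import product

BASES = "ACGT"

def trellis_states(L):
    """Enumerate all 4**L states, each written as a length-L string of bases."""
    return ["".join(p) for p in product(BASES, repeat=L)]

def successors(state):
    """Assumed shift model: drop the oldest base and append one of 4 new bases."""
    return [state[1:] + b for b in BASES]

def transitions(L):
    """All (from_state, to_state) pairs between two adjacent trellis stages."""
    return [(s, t) for s in trellis_states(L) for t in successors(s)]

if __name__ == "__main__":
    L = 3
    # Number of states and number of transitions per stage for this L.
    print(len(trellis_states(L)), len(transitions(L)))
```

For L = 3 this enumerates 64 states and 256 transitions per stage, matching the $4^L$ and $4^{L+1}$ counts given in the text.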
Over some discrete time-span K the BC observes K event levels from the nanopore. Over this set of observations the BC decides, in a statistically optimum fashion, on the sequence of states that the nanopore sensor effectively traversed due to DNA translocation (e.g. a traversal represented by the sequence of red arrows in Fig. 2). A dynamic programming strategy is used to reduce this sequence-based decision making procedure from having to enumerate O(exp(N · L)) possible trajectories through the trellis to O(N exp L). We briefly describe this algorithm now.
B. Viterbi Detection
A BC sequence-based decision over the HMM can be made in a statistically optimum fashion, in the maximum-likelihood sense, by employing the Viterbi algorithm (VA). In essence this algorithm seeks to identify the path through the trellis over N time steps whose path metric (i.e. the sum of its constituent branch metrics $a_{ij}$) is a minimum. The VA achieves this in an iterative fashion by: 1) calculating the $a_{ij}$ values as some distance
$$a_{ij}[k] = \left\| V_{signal}[k] - \hat{V}_{signal}(i, j) \right\| \qquad (1)$$
where $V_{signal}[k]$ is the measured event at time $k$ and $\hat{V}_{signal}(i, j)$ is the expected event for a transition from state $i$ to $j$; 2) using these values to update the set of (4, candidate) path metrics going into each possible state $j$ at $k + 1$ and retaining the minimum,
$$\Gamma_j[k+1] = \min_i \left( \Gamma_i[k] + a_{ij}[k] \right)$$
and 3) storing the state $i$ from which the last step constituting $j$ originated; the latter serving as a means of recording the states visited by the survivor paths retained at each $k + 1$ state.
After M iterations over this three-step procedure a sequence with terminus state
$$\mathrm{out} = \arg\min_j \left( \Gamma_j \right)$$
is selected and the preceding $M - 1$ states identified by referring to the associated values stored as part of step 3). The entire process is repeated to extract the next M-state sequence.
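The three-step procedure above can be illustrated with a minimal software sketch. It is not the authors' hardware design: it assumes states are strings of L bases, that each expected event level is indexed by the (L+1)-mer formed by a transition, and it traces back once over the whole read rather than in M-state windows. The level table produced by `toy_levels` is invented purely for illustration, and an absolute difference stands in for the distance in step 1).

```python
from itertools import product

BASES = "ACGT"

def toy_levels(L):
    """Hypothetical expected-level table: one distinct level per (L+1)-mer."""
    kmers = ["".join(p) for p in product(BASES, repeat=L + 1)]
    return {kmer: float(idx) for idx, kmer in enumerate(kmers)}

def viterbi_basecall(events, levels, L):
    """Return the base sequence whose path metric over `events` is minimal."""
    states = ["".join(p) for p in product(BASES, repeat=L)]
    metric = {s: 0.0 for s in states}   # no prior preference for a start state
    history = []                        # per stage: survivor predecessor of each state
    for v in events:
        new_metric, pointers = {}, {}
        for j in states:
            best_i, best = None, float("inf")
            for b in BASES:
                i = b + j[:-1]                       # candidate predecessor of j
                a_ij = abs(v - levels[i + j[-1]])    # step 1: branch metric
                cand = metric[i] + a_ij              # step 2: candidate path metric
                if cand < best:
                    best_i, best = i, cand
            new_metric[j], pointers[j] = best, best_i  # step 3: store survivor
        metric = new_metric
        history.append(pointers)
    # Traceback from the state holding the minimum path metric.
    state = min(metric, key=metric.get)
    path = [state]
    for pointers in reversed(history):
        state = pointers[state]
        path.append(state)
    path.reverse()
    # The first state contributes L bases; each later state adds its newest base.
    return path[0] + "".join(s[-1] for s in path[1:])

if __name__ == "__main__":
    L = 2
    levels = toy_levels(L)
    read = "ACGTTGCA"
    events = [levels[read[k:k + L + 1]] for k in range(len(read) - L)]
    print(viterbi_basecall(events, levels, L))
```

With noise-free events drawn from the toy table, the decoder recovers the original read; the inner loop over four predecessors corresponds to the add-compare-select work performed per state.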
C. BC Performance
A plot of the performance achievable for a VA-enabled BC is shown in Fig. 3. The base-error-rate (BER), a measure of the fraction of bases incorrectly predicted by the BC, is used to quantify behaviour as a function of the signal-to-noise ratio (SNR). Three examples, a 5-mer sensor (i.e. L = 4), a 4-mer, and a 3-mer sensor modelled on the device discussed in [9], are considered. In anticipation of embedded fixed-point CMOS implementations, performance examples at finite bit depths are shown alongside their ideal (i.e. un-quantized) counterparts. Further, to simplify ensuing hardware constructions, a simple $\ell_1$-norm is used to calculate (1).
Raw basecalling (i.e. before application of further bioinformatics) of good quality can attain BERs around $10^{-2}$ to $10^{-3}$ [10], and in the examples considered these are achieved at SNRs of roughly 19, 18, and 23 dB for the 5, 4, and 3-mer sensors respectively. The exact results are of course heavily dependent on sensor specifics and these examples are meant to be illustrative. The degradation of BER with a decrease in bit depth, however, is to be expected and serves as a key constraint on physical design. The character of this degradation is shown in Fig. 4, which indicates BER improvements of 40-90% as the bit width is increased from 6 to 10 bits.
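To make the role of bit depth concrete, the following sketch quantizes an event level to an unsigned d-bit code and evaluates an $\ell_1$ branch metric on the codes. The uniform quantizer, the normalized full-scale range, and the function names are assumptions made for illustration; the paper does not specify its fixed-point scheme.

```python
def quantise(value, d, full_scale=1.0):
    """Map a value in [0, full_scale] to an unsigned d-bit integer code (assumed uniform)."""
    max_code = (1 << d) - 1
    clipped = min(max(value, 0.0), full_scale)   # saturate out-of-range inputs
    return round(clipped / full_scale * max_code)

def l1_branch_metric(measured_code, expected_code):
    """l1 distance between a measured and an expected event code."""
    return abs(measured_code - expected_code)

if __name__ == "__main__":
    # Two nearby event levels collapse to nearly identical codes at low bit depth.
    for d in (6, 8, 10):
        a, b = quantise(0.500, d), quantise(0.505, d)
        print(d, a, b, l1_branch_metric(a, b))
```

At coarser bit depths, closely spaced expected levels map to the same or adjacent codes, so the branch metrics lose resolving power; this is one plausible reading of why BER degrades as d shrinks.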
IV. ARCHITECTURAL AND PHYSICAL IMPLEMENTATION
We characterize two extremes of VA-based BC design to help define a design-space for nanopore-based CMOS basecallers: a uni-processor (UP) serial iterative architecture geared to process one state per cycle and a node-parallel (NP) architecture geared to process one stage per cycle. These approaches trade size for speed; the UP minimizes area at the expense of speed while the NP does the opposite.
Fig. 4. Relative BER at 20-dB SNR (referenced to BER at bit depth of 6) vs. bit depth for 5, 4, and 3-mer sensors.
A functional diagram of the NP is shown in Fig. 5; it is composed of three main blocks: the state, the stage, and the traceback. The state block shown is one of $4^{L-1}$ identical components (the $i$th block is drawn in Fig. 5) working in parallel in the NP architecture to produce branch (via the BMG: branch metric generator) and path metrics. The input ($V_{signal}$), with bit-depth $d$, from the ED is fed into all state blocks which, in aggregate, calculate all $4^L$ candidate path metrics in one clock cycle and then whittle them down to $4^{L-1}$ with the help of the minimum-argument function $\arg\min_i$ that effectively creates pointers ($pt_l$) to minimum path values in sets of 4 states. The UP is identical to the NP save the fact that it has only one simple state block that executes the $4^L$ states in series.
The state blocks' results are gathered and processed by the stage block, which finds the global minimum path metric ($\Gamma_g$) (computed by argmin) as well as a pointer to it ($pt_g$). The state block also re-references all accrued path metric calculations to it (via the subtractor shown in Fig. 5) so as to prevent overflow.
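The overflow-prevention step described above, re-referencing every path metric to the global minimum, can be sketched as follows. This is an illustrative software analogue with hypothetical names, not a description of the circuit in Fig. 5.

```python
def normalise(path_metrics):
    """Subtract the global minimum from every path metric.

    Returns the re-referenced metrics and a pointer (index) to the state
    that held the global minimum.
    """
    g = min(path_metrics)
    pt_g = path_metrics.index(g)   # first state attaining the minimum
    return [m - g for m in path_metrics], pt_g

if __name__ == "__main__":
    metrics = [17, 12, 30, 12]
    rebased, pt_g = normalise(metrics)
    print(rebased, pt_g)
```

Because the same value is subtracted from every metric, their ordering (and therefore every later minimum decision) is unchanged, while the stored values stay small enough for a fixed-width register.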
The pointers generated in the state and stage blocks are used to populate (with $pt_l$) and drive (with $pt_g$) the traceback block's M-register FIFO. Specifically, the collection of $4^{L-1}$
An example layout of a VA BC in a 32-nm CMOS technology is shown in Fig. 6. Chip areas vary from $120^2$ µm² for a 64-state NP design to $510^2$ µm² for a 1024-state chip (d = 6), a footprint ratio of 16× in line with the boost in states processed. Chip power shows a similar characteristic, with an average value of 1.5 mW for the 64-state NP and 20 mW for its 1024-state counterpart. The UP area advantage progressively increases as the bit depth grows, with UP area dropping from 75% to 60% to 55% of NP area at d = 6, 8, 10, respectively.
V. DISCUSSION
Fig. 7 conveys a more complete picture of the scalability of the VA BC for real-time performance over input $f_{in}$ (effectively the rate of DNA translocation through the nanopore) and clock $f_{clk}$ frequencies. In particular it considers the potential of multiple 64-state BC (both NP and UP) cores arrayed over a 25-mm² application-specific integrated circuit (ASIC) die with the intention of processing 1000s of nanopore signal channels, a scenario already realized in marketed nanopore sensors and certainly true of emerging core sequencing facilities.
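As a rough sanity check on the multi-core scenario, the arithmetic below combines the 64-state NP figures quoted in Section IV (a $120^2$ µm² footprint and 1.5 mW average power) with the 25-mm² die. It is a back-of-envelope sketch that assumes the die is tiled with cores alone, ignoring pads, routing, and the analog front-end, so the core count is an upper bound; the operating point behind the average power figure is not stated in the text, so the density is indicative only.

```python
def max_cores(die_area_mm2, core_side_um):
    """Upper bound on the number of square cores that fit by area alone."""
    die_area_um2 = die_area_mm2 * 1_000_000   # 1 mm^2 = 1e6 um^2
    return die_area_um2 // (core_side_um ** 2)

def power_density_w_per_cm2(core_power_mw, core_side_um):
    """Power per unit area of a single square core, in W/cm^2."""
    core_side_cm = core_side_um * 1e-4        # 1 um = 1e-4 cm
    return (core_power_mw * 1e-3) / (core_side_cm ** 2)

if __name__ == "__main__":
    print(max_cores(25, 120))
    print(power_density_w_per_cm2(1.5, 120))
```

Under these assumptions the die holds on the order of 1700 cores, consistent with the stated aim of serving 1000s of channels, and the per-core power density sits near 10 W/cm².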
The $f_{in}$ is swept from $10^2$ Hz (speeds realized in state-of-the-art commercial sequencers with modified nanopore proteins) to $10^6$ Hz (the unfettered rate of bp translocation through nanopores) [5]. Roughly, each NP core can multiplex well below the 100 W/cm² capability of contemporary air cooling technology [11]).
At the other extreme, processing input at the $f_{in} = 10^6$ Hz peak, the ASIC can achieve $H = 7 \times 10^9$ bp/s (200k human genomes/day), a rate competitive with the abilities of core sequencing facilities. The chip can accomplish this at $f_{clk} = 100$ MHz and with a power flux of roughly 40 W/cm². Although its throughput performance would become seriously compromised for state counts in excess of 64, a UP approach can come close to this level, achieving $H = 3.7 \times 10^9$ bp/s for peak $f_{in}$ with 35 W/cm².
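The conversion from throughput in bp/s to genomes per day can be checked with a short calculation. The genome length of 3.1 × 10^9 bp used here is an assumed round figure for a haploid human genome, not a value taken from the paper.

```python
SECONDS_PER_DAY = 24 * 60 * 60
HUMAN_GENOME_BP = 3.1e9   # assumption: approximate haploid human genome length

def genomes_per_day(bp_per_second, genome_bp=HUMAN_GENOME_BP):
    """Convert a basecalling throughput in bp/s to genome equivalents per day."""
    return bp_per_second * SECONDS_PER_DAY / genome_bp

if __name__ == "__main__":
    print(genomes_per_day(7e9))     # NP ASIC at peak f_in
    print(genomes_per_day(3.7e9))   # UP ASIC at peak f_in
```

With this genome length, the NP rate works out to roughly 195,000 genomes per day, in line with the 200k figure quoted, and the UP rate to roughly 103,000.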
