The large manufacturing scale expected from self-assembling circuit fabrication ($10^{12}$ to $10^{19}$ devices) enables new varieties of computer architecture capable of solving highly demanding computational problems. However, these fabrication processes are in their infancy and even at maturity are expected to incur heavy yield penalties compared to conventional silicon technologies. To retain the advantages of this manufacturing scale, new architectures must efficiently use large collections of very simple circuits. This paper describes two such architectures that are enabled by self-assembly and examines their performance.
Introduction
New device technologies are widely expected to take over as the continued scaling of conventional silicon technology yields diminishing returns and approaches the 'red brick' wall identified by the International Technology Roadmap for Semiconductors [1, 2]. A wide array of new devices and architectures explores the application of nanoscale research to computer design [3]. Some of these emerging computer designs represent fundamental shifts in the methods used to compute, and some are clever applications of conventional logic design to novel nanoscale device technologies.
The self-assembling computer architectures described in this paper represent two ends of a computational spectrum. One end of this spectrum includes the design space used by conventional computer designs to execute stored programs at 'run-time' using intricate networks of electrical circuitry. The other end of this spectrum includes a design space used by DNA computing schemes to perform a computation at 'assembly-time' using the orchestration of DNA molecules and an intricate series of biochemical processing steps. The continuity of this spectrum embodies the trade-offs between run-time performance and assembly-time complexity.
The first architecture, the decoupled array multi-processor (DAMP), is similar to a single-instruction, multiple-data (SIMD) design. It relies heavily on run-time computation and therefore sits on the run-time side of the computational spectrum. The second architecture, an oracle, is similar to a content-addressable memory, with the important distinction that this memory self-assembles both itself and its contents during fabrication. It therefore sits on the assembly-time side of the spectrum.
Both architectures are enabled by a guided self-assembly process that uses DNA hybridization to precisely structure nanoscale circuitry. Like many other emerging technologies, this process is being developed as a replacement or supplement to the photolithographic patterning processes used with conventional silicon technologies. The designs presented here are simple enough to be realized by self-assembly yet powerful enough to exploit the vast manufacturing scale promised by this new class of fabrication.
way for methods that use DNA as a scaffolding or assembly agent [6-10]. This inherently bottom-up approach is supported by work developing compatible nanoelectronic components that are amenable to DNA-guided self-assembly [11, 12].
On-going research into DNA-guided self-assembly will clarify the details of how to efficiently assemble nanoelectronic components. In the meantime, we base our discussion of computer architecture on recent work in self-assembly that has begun to discern important challenges. The basic premise is that hybridization between complementary DNA strands can form complex networks of electronic components in vast quantities. These networks can be designed to function as logic blocks and have been simulated to estimate their electrical performance [13, 14] .
However, even with a per assembly-event yield of 98%, large structures are likely to form only a fraction of the time. Therefore, it is important to rethink the way large systems are designed (and emphasize smaller, simpler components) if the great potential of self-assembly is to be realized.
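The effect of per-event yield on large structures can be illustrated with a short calculation. This is a minimal sketch that assumes every assembly event succeeds independently with the same probability; the function name and the event counts are illustrative and do not come from the fabrication process itself.

```python
def structure_yield(per_event_yield: float, num_events: int) -> float:
    """Probability that all assembly events in one structure succeed.

    Assumes independent events with identical yield (an assumption made
    here for illustration only).
    """
    return per_event_yield ** num_events


if __name__ == "__main__":
    # Structures that need more assembly events form far less often.
    for events in (10, 100, 1000):
        print(events, structure_yield(0.98, events))
```

Under this assumption a structure that needs 100 events forms about 13% of the time, which is why small, simple components are favoured.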
Nanoscale computer design
Considerable work has examined how computer systems might migrate from traditional silicon technology to nanoscale and self-assembling technologies. Defect tolerance at the nanoscale is an architectural issue that arises because fabrication properties differ from those of conventional silicon technology. The Teramac computer was an early adopter of reconfigurable defect tolerance [15] and other work has demonstrated the advantages of reconfiguration over traditional redundancy [16, 17]. The application of nanoelectronics to parallel computing has emerged as a new area of focus between materials science, chemistry, physics, and computer design [13, 18-22]. Alternative approaches to von Neumann computing have re-emerged as certain aspects of nanoelectronics mature. Cellular architectures employ local interactions between elements to compute [22-25]. This form of communication is expected to dominate at the nanoscale [26]. Reconfigurable array architectures are also being investigated to cope with the high defect rates expected of nanoelectronics [21, 27, 28]. The great promise of quantum computing has driven investigations into practical issues including architectures [29, 30] and communications [31]. Sections 2 and 3 describe two architectures that are enabled by developments in self-assembled nanoelectronics.
The decoupled array multi-processor
The early limitations of self-assembling technologies require designs to use small circuitry and little interaction or communication with the external (i.e., macroscopic) world. Single-bit, serial processing elements are well suited to such limitations. They require less circuitry than parallel multi-bit implementations and have simple interfacing requirements.
The particular active component and assembly process we consider here is theoretical and based on prior work in emerging device modelling [14, 32] and fabrication [33] of silicon rod surrounding gate (or ring-gated) field-effect transistors. The process, described in [34], uses DNA-functionalized silicon rod FETs and precise control over the type of rod (e.g., heavily doped for wires, N+, P+, or insulating) and the sequence of the DNA strands covalently bonded to each end. The DNA strands direct the ends of the rods to assemble based on their sequence, as in mesoscale DNA self-assembly [35-37]. Post-process metallization after the DNA has assembled the silicon rods renders the strands conductive [38].
System overview
The decoupled array multi-processor (DAMP) is similar to a single-instruction multiple-data (SIMD) machine (e.g., the CM-2) with two important differences: the DAMP has no inter-processor communication, and it has many more processors ($\sim 10^{12}$). The more significant difference is the lack of any communication hardware between processors, because most efficient solutions to common parallel algorithms require inter-processor communication of some sort.
The processors in the DAMP have no way to communicate with each other (i.e., they are decoupled from each other) except through a shared control unit. This limited form of interprocessor communication limits the DAMP to embarrassingly parallel problems that involve little to no data sharing.
The number of processors also differs dramatically from typical SIMD machines. With on the order of $10^{12}$ processing elements, the DAMP far exceeds typical processor counts. However, each individual processor is far simpler than the processors used in SIMD machines. The basic structure of the DAMP is illustrated in figure 1. The node controller sends control signals to each processor node in parallel. Each processor node can detect a processor-generated signal from an individual 'ringer' circuit embedded in each processor.
This reduced output capacity (from the processor's perspective) is a conservative worst-case scenario to reduce the fabrication complexity of each processor. Figure 2 illustrates the basic execution model for a DAMP processor. Each processor has five 16-bit registers and conditionally executes the instruction stream depending on the value of its wait-status bit.
Execution model
In this bit-serial design, the least significant bit (LSB) is the first bit to participate in each operation: the bit at the bottom of a register in figure 2. Through separate shift controls, the accumulator can shift independently from the R0-R4 registers, enabling relative data shifts. The operational unit is a full adder that can provide either the carry-out or the sum signal to the accumulator input. Each register R0-R4 can receive either its own LSB or the accumulator output as input during a shift. The six status bits can be used to implement a wide range of conditional operations [13].
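The LSB-first operation of a bit-serial full adder can be sketched in a few lines. This is an illustrative model only: it captures the one-bit-per-step data flow through a full adder, not the register shift controls or status bits, and the function names are hypothetical.

```python
def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """One-bit full adder: returns (sum, carry_out)."""
    total = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return total, carry_out


def bit_serial_add(x: int, y: int, width: int = 16) -> tuple[int, int]:
    """Add two width-bit operands one bit per step, LSB first.

    Returns the width-bit result and the final carry-out.
    """
    carry = 0
    result = 0
    for i in range(width):
        bit_sum, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit_sum << i
    return result, carry


if __name__ == "__main__":
    print(bit_serial_add(3, 5))
    print(bit_serial_add(0xFFFF, 1))
```

A 16-bit add therefore takes 16 adder steps, which is the price paid for needing only a single full adder per processor.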
The 16-bit accumulator, R0, and R1 have the ability to load a random constant that is probabilistically unique to each processor through the use of a random assembly event that selects either a one or zero for each bit in the register (e.g., by a conductor or insulator attaching randomly to a latch's load line). The random constant is used as an index or a seed for selecting a portion of a problem space. At run-time each processor can supplement these assembly-time input bits with a counter or value from the node controller in the LSB positions.
The DAMP processors ($\sim 10^{12}$) can probabilistically evaluate any 40-bit input space with only one run of a program. The program instructs each processor to manipulate its random constant to produce an answer to a problem. If the answer is satisfactory (e.g., below some threshold) the processor with the answer can alert the node controller (with its ringer circuit) and a binary search over that node's input space can begin. The search is complete when all bits of the random constant used by the target processor are determined.
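How much of an input space a single run samples can be estimated with a simple occupancy calculation. The sketch below assumes each processor's random constant is drawn independently and uniformly over the input space, which is an idealization of the random assembly events; the function name is hypothetical.

```python
import math


def expected_coverage(num_processors: float, input_bits: int) -> float:
    """Expected fraction of a 2**input_bits space hit by at least one processor.

    Assumes independent, uniformly distributed random constants
    (an idealization made for this estimate).
    """
    space = 2.0 ** input_bits
    # 1 - (1 - 1/space)**num_processors, evaluated in a numerically stable way.
    return -math.expm1(num_processors * math.log1p(-1.0 / space))


if __name__ == "__main__":
    print(expected_coverage(1e12, 40))
    print(expected_coverage(1e12, 36))
```

Under these assumptions roughly 60% of a 40-bit space is sampled in one run. Supplementing the assembly-time bits with a run-time counter, as described above, is one way to change which values are visited on later runs.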
The self-assembly fabrication technology used to build the DAMP demands that each processor be simple. This drives the use of bit-serial processors and a simple controller. As a consequence all instructions are software encoded without the need for microcode on each processor (e.g., like VLIW). A brief list of instructions that can be efficiently implemented by the DAMP, including the gate level implementation details, transistor level nanorod layout for the processing elements, and the behavioural simulation details, can be found in [13] .
System operation
The large number of processors in the DAMP ($\sim 10^{12}$) leads to a very high peak performance on 16-bit operands (e.g., local 16-bit adds), as illustrated in figure 3. However, peak performance is only a simple measure of usefulness. The real power of a computing machine comes from the problems it can solve.
The DAMP can be used to solve vast global optimization problems. Many science and engineering problems can be posed as global optimization problems that seek to find the largest or smallest value for an objective function over a domain. The challenge in solving these problems comes from the large number of variables and multiple local minima that deceive search algorithms. Consider the hypothetical objective function shown in figure 4. This function has many local minima and the global minimum, indicated by the black arrow, has a very narrow aperture for the search algorithm to find. Stochastic global optimization is a method of sampling an objective function at random points in the problem space and comparing the results at each point. Since the time required to exhaustively search the problem space at a resolution sufficient to be useful is far too large, the best local minimum is chosen from local searches starting at a random set of starting locations. A new set of random points is selected that concentrates the search around the best-found solution. That is, the search continues but focuses on a few of the last best answers. Typical calculations for each sample include the objective function and numerical derivatives (gradients) at the point. If the objective function has a well-behaved and computable gradient, this can be used as a local indicator of how to choose the next best solution since the objective changes along the gradient. This gradient descent approach is very sensitive to numerical instability because of how derivatives amplify high frequency changes in the objective function. The gradient descent approach is also very susceptible to entrapment by local minima.
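The sample-then-concentrate idea can be sketched for a one-dimensional objective. This is a simplified illustration, not the method used on the DAMP: it omits the local searches and gradient calculations, and the parameter names (`samples`, `rounds`, `shrink`) are assumptions chosen for this example.

```python
import math
import random


def stochastic_search(f, lo, hi, samples=200, rounds=20, shrink=0.5, seed=0):
    """Random sampling that re-centres on the best point found so far.

    Each round draws `samples` random points in a window, keeps the best,
    then halves (by `shrink`) the window around that best point.
    """
    rng = random.Random(seed)
    center = (lo + hi) / 2.0
    radius = (hi - lo) / 2.0
    best_x = center
    best_y = f(best_x)
    for _ in range(rounds):
        for _ in range(samples):
            x = center + rng.uniform(-radius, radius)
            x = min(hi, max(lo, x))  # stay inside the problem domain
            y = f(x)
            if y < best_y:
                best_x, best_y = x, y
        center = best_x   # concentrate the next round on the best answer
        radius *= shrink
    return best_x, best_y


if __name__ == "__main__":
    # A hypothetical objective with many local minima.
    bumpy = lambda x: 0.1 * x * x + math.sin(5.0 * x)
    print(stochastic_search(bumpy, -10.0, 10.0))
```

Because the window shrinks around whichever sample is currently best, a narrow global minimum that is missed in the early rounds can be lost, which is the motivation for sampling with as many points as possible.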
Parallel pattern search (PPS) has emerged as a popular technique used to optimize difficult objective functions [39-42]. This method uses a search along each dimension of the problem space to find the global minimum. Each iteration of the search begins from the optimal point found in the last round of evaluations. This technique has provable convergence to the minimum as long as certain rules are followed for adjusting the step size along each dimension and for how objective values are compared.
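A basic coordinate (compass) pattern search gives a feel for how such a method proceeds. The sketch below is a serial, textbook-style variant written for illustration; it is not the specific PPS algorithm of [39-42], and the step-halving rule and function names are assumptions. The trial points in each iteration are independent, which is what makes the method amenable to parallel evaluation.

```python
def pattern_search(f, x0, step=1.0, tol=1e-6, max_iter=10_000):
    """Minimize f by polling +/- step along every coordinate direction.

    All trial points are generated from the current best point. If none
    improves on it, the step size is halved.
    """
    x = list(x0)
    fx = f(x)
    for _ in range(max_iter):
        if step < tol:
            break
        trials = []
        for i in range(len(x)):
            for sign in (1.0, -1.0):
                t = list(x)
                t[i] += sign * step
                trials.append((f(t), t))  # each evaluation is independent
        best_f, best_t = min(trials, key=lambda pair: pair[0])
        if best_f < fx:
            x, fx = best_t, best_f
        else:
            step *= 0.5
    return x, fx


if __name__ == "__main__":
    bowl = lambda v: (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2
    print(pattern_search(bowl, [0.0, 0.0]))
```

The set of positive and negative coordinate directions used here is one simple example of a positive spanning set.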
The DAMP can be used to solve continuous variable minimization problems that are much larger in dimensionality than those solvable today.
The pseudo-code below demonstrates how the DAMP can be used with 32-bit fixed-point variable optimization problems.
The vector $x_k$ is the best-known solution after each step. The problem space is spanned by a positive spanning set $D$ (footnote 5), where $d_i$ is a unit vector from $D$ along the $i$th dimension of the problem space. The functions $C_x(x_i)$ and $C_y(y)$ are used to verify that the input and output vectors, respectively, satisfy the problem constraints.
The program is run at each processor node. Since there are $2^{28}$ processors per node, the random number generated at each processor has only 28 bits of significance. This means that to cover a 32-bit random number space each processor must run the program 16 times with a new 4-bit low order value each time. The value, k, is simply incremented between loops. Each processor takes its k and uses it to compute a new input vector. The particular dimension that the processor searches ($d_i$) is specific to the processor node. The new input vector is checked against the input constraints and if they are satisfied the function (F) is evaluated. The output from the objective function is checked against the output constraints and if they are satisfied the processor participates in a minimization query, or MIN-QUERY. This query is conducted by each processor node and searches, bit by bit, for the smallest objective function value found by any of its processors.
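The loop described in this paragraph can be sketched in software at a much smaller scale. The code below is an interpretation of the prose, not a transcription of the authors' program: the way the random constant and counter are combined into an offset along $d_i$, the MSB-first form of the bit-by-bit query, and all names (`min_query`, `run_node`, `consts`, `c_x`, `c_y`) are assumptions made for illustration. Objective values are modelled as non-negative integers standing in for fixed-point numbers.

```python
def min_query(values, width):
    """Find the smallest value bit by bit, most significant bit first.

    At each bit the controller asks whether any still-active processor
    holds a 0 in that position; a 'ring' keeps only those processors.
    """
    active = list(values)
    if not active:
        return None
    result = 0
    for bit in reversed(range(width)):
        zeros = [v for v in active if not (v >> bit) & 1]
        if zeros:
            active = zeros          # some processor rang: this bit is 0
        else:
            result |= 1 << bit      # silence: every candidate has a 1 here
    return result


def run_node(F, x_k, d_i, c_x, c_y, consts, ctr_bits=4, width=32):
    """One processor node: every processor evaluates F once per counter value."""
    best = None
    for k in range(1 << ctr_bits):
        outputs = []
        for r in consts:                        # conceptually in parallel
            t = (r << ctr_bits) | k             # assembly-time bits + run-time bits
            x = [xj + t * dj for xj, dj in zip(x_k, d_i)]
            if not c_x(x):
                continue                        # input constraint violated
            y = F(x)
            if c_y(y):
                outputs.append(y)               # participates in MIN-QUERY
        found = min_query(outputs, width)
        if found is not None and (best is None or found < best):
            best = found
    return best


if __name__ == "__main__":
    F = lambda x: (x[0] - 20) ** 2
    print(run_node(F, [0], [1], lambda x: True, lambda y: y < 2 ** 16,
                   consts=[3, 5, 1], ctr_bits=2, width=16))
```

Each MIN-QUERY costs one controller round per output bit regardless of how many processors participate, which is what makes the query practical for very large processor counts.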
In addition to the details of the MIN-QUERY algorithm and implementation details of each instruction, a practical application of this method to the thermal intercept problem and performance results can be found in [13] .
Oracles
Similar to the DAMP, oracles are assembled using electrically active components (silicon rods) and DNA (metallized during post-processing) to form large arrays of simple circuitry. However, unlike the DAMP, oracles are designed to solve large portions of a particular problem during their fabrication rather than relying on purely run-time computation. That is, the oracles are a hybrid between the general-purpose electrical designs described in section 2 and electrical circuitry inspired by DNA computing. This section describes the architecture of an oracle and two simple designs.
Footnote 5: A positive spanning set is a set of vectors that can be combined using nonnegative scalars to form all possible vectors in a constrained space.
System overview
An oracle is designed to solve a problem space during its assembly. This architecture is enabled by DNA-guided self-assembly because DNA hybridization can enforce well-defined sets of pairing rules. So far, these rules have been used here to define the geometric structure of circuitry (e.g., in the DAMP) but, as in DNA computing, the rules can also be used to compute a result. Abstractly, an oracle contains a large number of question and answer pairs. Questions are posed to the machine and it generates a response if the question is contained in any of the oracle's question/answer pairs. In this fashion, the oracle is like a large content-addressable memory (CAM) that has been preloaded with the answers to a certain problem. An oracle differs from a CAM in the method by which the question and answer pairs are entered into the machine. To fully cover an input space of $k$ bits, the CAM requires $O(2^k)$ steps to load the answers (each of which must be computed). Each address is a question represented by up to $k$ bits with its associated answer, just like a lookup table. In comparison, the oracle requires $O(k)$ steps to assemble and no run-time loading steps. The answers are determined by the manner in which the oracle is assembled. The self-assembly of each question and answer pair provides the oracle with the answers (with a high probability but not a certainty that a given question and answer pair will exist within the oracle). If a particular question and answer pair did not form during the oracle's assembly then the oracle cannot solve that instance of the problem.
A simple oracle
A simple example of an oracle is the addition oracle (not useful in itself, but illustrative). The addition oracle has a simply defined problem and a brief functional description, and performs all calculations at assembly time. Full adders solve multi-bit addition problems by chaining the carry-out from one full adder to the carry-in of the next full adder. In a similar way the addition oracle will chain carry-outs to carry-ins, but at assembly-time rather than at run-time.
Implementation
Each line in the truth table is converted to a 'tile' that represents a particular input and output combination as in [43] . The difference in this work is that each 'tile' is implemented by the self-assembly of nanoelectronic components and is electrically active. Many of the challenges in fabricating large networks from DNA will remain in fabricating the DNA scaffolded nanoelectronic circuitry considered here. However, progress is being made in this area [44] and discoveries in the fundamental properties of DNA self-assembly will apply to this theory directly.
The tile conversion method is similar to the way a carry-select adder speculatively pre-computes a carry pattern and at run-time selects the proper carry path, with the exception that the oracle pre-computes the entire carry path. The tiles that correspond to table 1 are shown in figure 5. Each tile has on its left side a carry input and an output. In this case, the top portion of the tile is the carry-in and the bottom portion the carry-out. The carries are depicted in such a way that they fit together like the pieces of a jigsaw puzzle.
The iterative nature of the function (e.g., output from step $i$ produces input for step $i + 1$) allows strings of these tiles to implement an instance of the function's evaluation. For example, figure 6 illustrates a simple 4-bit string made from the tiles in figure 5. This particular example is an instance of the addition function for '3 + 5 = 8'. The shape of the carries on each tile dictates how the string is formed. Valid strings must match each carry-out with the corresponding carry-in. In this fashion, the tiles perform an assembly-time computation as they form valid strings. They assemble only into valid solutions for addition because the carries must match at each stage. A complete addition oracle is the collection of all possible N-bit strings of tiles. Each string represents one particular input and output combination, for example the '3 + 5 = 8' string shown in figure 6. In this case, the string will respond with '8' to the question 'What is 3 + 5?'. For all other questions the string will be silent.
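The carry-matching rule can be modelled in software by treating each line of the full-adder truth table as a tile and enumerating every string whose carries agree. This is a sketch of the assembly logic only, under the assumptions that the first tile's carry-in is 0 and that strings are listed LSB tile first; the names `TILES`, `assemble_strings` and `decode` are hypothetical.

```python
from itertools import product

# One tile per truth-table line: (a, b, carry_in, sum, carry_out).
TILES = [(a, b, cin, a ^ b ^ cin, (a & b) | (cin & (a ^ b)))
         for a, b, cin in product((0, 1), repeat=3)]


def assemble_strings(n_bits, initial_carry=0):
    """Enumerate every valid n-bit string of tiles, LSB tile first.

    A tile may attach only if its carry-in matches the previous carry-out.
    """
    partial = [([], initial_carry)]
    for _ in range(n_bits):
        partial = [(tiles + [t], t[4])
                   for tiles, carry in partial
                   for t in TILES if t[2] == carry]
    return [tiles for tiles, _ in partial]


def decode(tiles):
    """Read the operands and the sum encoded by one string."""
    a = sum(t[0] << i for i, t in enumerate(tiles))
    b = sum(t[1] << i for i, t in enumerate(tiles))
    s = sum(t[3] << i for i, t in enumerate(tiles))
    return a, b, s


if __name__ == "__main__":
    table = {(a, b): s for a, b, s in map(decode, assemble_strings(4))}
    print(len(table), table[(3, 5)])
```

Every string that can form encodes a correct sum (modulo $2^N$, since the final carry-out is not part of the string), so the set of strings acts as a complete lookup table without any sum having been computed at run-time.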
Again, the addition oracle is simply an illustration of an oracle rather than an exemplar of its usefulness. The circuit complexity of each string is determined by the circuitry needed to read the string and respond to queries. A possible circuit for the addition oracle is shown in figure 7 .
The A, B, and S signals illustrated in figure 7 carry the input query and response bits, respectively, for each tile. The OE i and IE i signals are the output and input enable signals, respectively, that coordinate the individual tiles in a string so that the string responds to a query if and only if all tiles in the string match the input query. The input enable signal is passed downward along the string and at the very last tile is reflected upward as the output enable signal. Each tile can interrupt the input enable signal depending on the value of the current query, or the latched A i and B i input signals. Input queries are serially shifted into all strings (i.e., the circuits that implement each string) simultaneously. When the A and B values match the particular inputs of a string, all the tiles latch their sum values into the S i latches. The only strings that respond to the query are those that have successfully reflected the input enable signal to their output enable line. The output enable signal can be used to trigger a ringer circuit that creates an oscillating signal that can be detected by an external receiver. This method is useful for problems that require only a single bit of output (e.g., NP-complete problems). Alternatively, the output enable signal can be used, as shown in figure 7, to load the sum bit into a D-latch that can be shifted downward along the string to a ringer at the bottom that responds to the shift-out from the string.
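The enable-signal handshake can be modelled as a simple chain: the input enable survives only if every tile matches its query bits, and the reflected output enable then releases the sum. The sketch below abstracts away the latches, shifting and ringer, and all names (`build_oracle`, `respond`, `ask`, `missing`) are hypothetical. The `missing` argument models strings that failed to form during assembly.

```python
def build_oracle(n_bits, missing=()):
    """All n-bit addition strings, minus any (a, b) pairs that did not assemble.

    Each string is a list of per-bit tiles (a_bit, b_bit, sum_bit), LSB first.
    """
    strings = []
    for a in range(1 << n_bits):
        for b in range(1 << n_bits):
            if (a, b) in missing:
                continue
            s = (a + b) % (1 << n_bits)
            strings.append([((a >> i) & 1, (b >> i) & 1, (s >> i) & 1)
                            for i in range(n_bits)])
    return strings


def respond(tiles, query_a, query_b):
    """Return the string's sum if every tile matches the query, else None."""
    input_enable = True
    for i, (a, b, _s) in enumerate(tiles):
        # A tile whose stored bits differ from the query interrupts the signal.
        matches = (a == ((query_a >> i) & 1)) and (b == ((query_b >> i) & 1))
        input_enable = input_enable and matches
    output_enable = input_enable    # the last tile reflects IE back as OE
    if not output_enable:
        return None                 # the string stays silent
    return sum(s << i for i, (_a, _b, s) in enumerate(tiles))


def ask(oracle, query_a, query_b):
    """Broadcast one query to every string and collect the responses."""
    answers = (respond(tiles, query_a, query_b) for tiles in oracle)
    return [ans for ans in answers if ans is not None]


if __name__ == "__main__":
    print(ask(build_oracle(4), 3, 5))
    print(ask(build_oracle(4, missing={(3, 5)}), 3, 5))
```

The second call shows the failure mode described earlier: if the relevant string never assembled, the oracle is silent for that question.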
Generalization of the oracle
The addition oracle serves as a simple model for other oracle designs. Carry chains are simple input and output constraints that can be generalized to include more complex relationships. An oracle can solve a problem that can be expressed using the form illustrated in figure 8. The functions $F_i$ and $G_i$ take the $f_{i-1}$ and $X_i$ inputs and generate the $f_i$ and $g_i$ outputs, respectively.
Each input ($f_{i-1}$ and $X_i$) and output ($f_i$ and $g_i$) is a bit vector. To aid in initializing the system, $f_{-1}$ is assumed to be $\alpha$, which is a constant defined at assembly-time.
Equations (1)-(3) describe the addition oracle using the form illustrated in figure 8. These equations are derived from the truth table for addition, shown in table 1. The input vector $X$ has two elements, $A$ and $B$, that represent the input operands. Equation (1) is the carry-out bit, and equation (2) is the sum bit.
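For reference, the standard full-adder expressions written in the $F_i$, $G_i$ form are given below. This is a reconstruction from the truth table for addition and the description above, not a copy of the authors' typeset equations. In particular, taking the third equation to be the initial condition with $\alpha = 0$ (no carry into the least significant bit) is an assumption.

```latex
\begin{align}
f_i &= F_i(f_{i-1}, X_i) = A_i B_i + f_{i-1}\,(A_i \oplus B_i) \tag{1} \\
g_i &= G_i(f_{i-1}, X_i) = A_i \oplus B_i \oplus f_{i-1}       \tag{2} \\
f_{-1} &= \alpha = 0                                           \tag{3}
\end{align}
```

Here $X_i = (A_i, B_i)$, $+$ denotes logical OR, and $\oplus$ denotes exclusive OR.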
Hamiltonian path oracle
The Hamiltonian path (HAM-PATH) problem is NP-complete and represents what is considered to be an intractable problem. The problem consists of finding a path through a connected graph of nodes that visits each node exactly once. The HAM-PATH oracle computes all paths through a fully connected graph at assembly-time in a manner very similar to the way Adleman solved the HAM-PATH problem using DNA [45] . The difference between the oracle and Adleman's approach is that the oracle solves the problem for any instance of the problem (with fewer than a fixed number of nodes).
Adleman's solution encodes each edge in a graph as a DNA fragment that has two 'sticky' ends representing the starting and ending nodes of the edge. Each node in the graph is allocated a sequence of DNA, and any edge that starts at that node will use this sequence on one end. The other sticky end of the DNA fragment uses the complement of the DNA sequence assigned to the ending node. All of the fragments are mixed together and form strings of edges (in the form of DNA fragments) that represent feasible paths through the graph. Since Hamiltonian paths visit each node once, only strings with as many edges as there are nodes in the graph are feasible Hamiltonian paths. All other strings are discarded (using biochemical techniques). Cycles in the graph need special treatment [1] . The entire process takes on the order of weeks from start to finish.
The HAM-PATH oracle solves all instances of the problem by solving it for a fully connected graph and then discarding solutions at run-time based upon a particular input graph. Paths from the fully connected graph that do not appear in the problem instance are deleted at run-time. This idea is illustrated in figure 9.
Like the addition oracle, the HAM-PATH oracle uses random strings of tiles to perform an assembly-time computation. The addition oracle formed all N -bit sums at assembly time. Likewise, the HAM-PATH oracle forms all paths through the fully connected graph. At run-time the HAM-PATH oracle selects the edges that exist in the current problem instance. After selecting the edges in the problem instance one or more computing elements within the HAM-PATH oracle responds (electrically) to indicate that a Hamiltonian path exists through the graph if and only if it has a solution.
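The split between assembly-time enumeration and run-time edge selection can be illustrated with a small software model. This sketch assumes directed edges and represents each assembled string as a node ordering; it models only the logic, not the tile circuitry, and the function names are hypothetical.

```python
from itertools import permutations


def assemble_paths(num_nodes):
    """Assembly time: every node ordering is a Hamiltonian path
    through the fully connected graph on num_nodes nodes."""
    return list(permutations(range(num_nodes)))


def query(paths, edges):
    """Run time: keep only the paths whose every hop is an edge of the
    input graph (edges are treated as directed (u, v) pairs)."""
    edge_set = set(edges)
    return [p for p in paths
            if all((u, v) in edge_set for u, v in zip(p, p[1:]))]


def has_hamiltonian_path(paths, edges):
    """True if at least one assembled string would respond."""
    return len(query(paths, edges)) > 0


if __name__ == "__main__":
    paths = assemble_paths(4)          # built once, reused for every instance
    print(query(paths, [(0, 1), (1, 2), (2, 3)]))
    print(has_hamiltonian_path(paths, [(0, 1), (0, 2), (0, 3)]))
```

The enumeration grows factorially with the number of nodes, which is why the approach depends on the manufacturing scale of self-assembly and applies only to graphs below a fixed size.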
The design of the circuitry for each HAM-PATH tile is more complicated than that of the addition oracle tiles because the tiles need to remove nodes from a set and then respond to selected graph edges. However, the generalization in section 3.4 still holds and a more detailed account of this design and its circuitry can be found in [13]. The circuit-level simulation results of this architecture estimate a run-time cycle period of ∼10 µs per graph instance to compute the optimal path. This is approximately 40 000 times faster than the estimated Earth Simulator performance (200 ms).
Conclusion
The challenges that self-assembly presents to system design demand alternative approaches to large-scale computing. The dramatic change in the fabrication scale ($10^{12}$ devices) and severe constraints on circuit size must be balanced to exploit the advantages of DNA self-assembly. The two architectures we have proposed are enabled by self-assembly because they rely on its unique properties (i.e., scale and rule-based assembly) and can solve more complex problems than is possible today (e.g., the 15-node HAM-PATH) with promising direction toward more practical applications (e.g., large global optimization problems).
