Abstract-The use of graphical processors for distributed computation revolutionized the field of high-performance scientific computing. As the Moore's Law era of computing draws to a close, the development of non-von Neumann systems, namely neuromorphic processing units and quantum annealers, is again opening new territory for computational methods. While these technologies are still in their nascent stages, we discuss their potential to advance computing in two domains: machine learning and solving constraint satisfaction problems. Each of these processors utilizes a fundamentally different theoretical model of computation, which raises questions about how best to use them in the design and implementation of applications. While many processors are being developed with a specific domain target, the ubiquity of spin-glass models and neural networks provides an avenue for multi-functional applications. This hints at the future infrastructure needed to integrate many next-generation processing units into conventional high-performance computing systems.
I. INTRODUCTION: GPUS CHANGED THE HPC LANDSCAPE, WHAT'S ON THE POST-MOORE'S HORIZON?
In 2005, it was shown that graphical processing units (GPUs) could outperform central processing units (CPUs) for LU decomposition problems [1]. GPUs can be networked into clusters for distributed computing, and the use of matrix-based computations changed the landscape for scientific and high-performance computing [2], [3], [4], [5]. Most notably, the incorporation of GPUs into neural network training has allowed for significant advances in the fields of deep learning and artificial intelligence [6], [7].
With the proliferation of non-von Neumann architectures and devices, there is an expectation that these devices will lead to a similar disruption in scientific computing. However, these are young technologies and there are many questions still to be answered. The paramount question is: on what application will these devices show significant speedup, computational advantage or supremacy? Secondly, what is the domain that stands most to benefit from these new technologies? Third, how will they perform in heterogeneous workflow?
This manuscript has been authored by UT-Battelle, LLC, under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a nonexclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for the United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan.
These questions have not yet been fully answered. Additionally, they are further complicated by the fact that many of the devices we discuss in this survey are not fully mature technologies. Neuromorphic computing focuses on emulating the computational capabilities of a mammalian brain. Quantum computing focuses on exploiting the complex dynamics of interacting sub-atomic particles in order to execute computations. However, the future development of both platforms needs to address similar questions: is the goal of these devices to fully simulate the specific dynamical systems they are derived from, or should these devices be developed to extract better performance on existing algorithms? A third option exists: can the development of the devices be done in conjunction with the development of new classes of algorithms?
We focus on how near-term devices perform on existing algorithms. Neuromorphic systems and their use of spiking neurons have been shown to execute deep learning tasks with very low power consumption. Quantum annealers use quantum tunneling to provide improvements over simulated or thermal annealing, and have been shown to solve optimization problems faster. While the systems we discuss have been designed to offer benefits to known algorithms, we argue that they are not accelerators optimized for a single class of algorithms.
We structure our survey in terms of the levels of computation defined by Marr [8]: implementational, computational and algorithmic. We use these levels to organize our survey by identifying computational models that can facilitate the execution of different kinds of algorithms across several next-generation processors. This is summarized in Table I. At the implementational level, we discuss three classes of next-generation processors: quantum processing units (QPUs), neural processing units (NPUs), and tensor processing units (TPUs). Two of these are non-von Neumann (NPUs, QPUs), and we include TPUs in our discussion as they are a natural continuation of the progression from vector-based computation (CPUs) to matrix-based computation (GPUs) to tensor-based computation (TPUs). We focus our discussion on the following near-term devices:
• TPU: Google's custom design called a Tensor Processing Unit™ [9]
A common feature of many near-term and current-generation hardware platforms is a strong emphasis on sparse coding or sparse embedding. This is motivated by the design of neural networks that can be executed in an energy-efficient manner on neuromorphic platforms, the design of spiking neuron systems that are closer to the physiology of actual biological systems, and the design of spin glass embeddings that are less susceptible to loss of coherence in quantum annealers.
For each implementational device, we summarize the algorithm class on which it is expected to outperform current technologies. These devices are not developed to be all-purpose computing systems; instead, each is designed to implement a specific class of computational model. The goal of this survey is to emphasize that a hardware implementation may be designed to operate using a specific computational model, but that does not limit the hardware to solving only a specific algorithmic class.
II. COMPUTATIONAL MODELS

A. Ising spin glass: computing with interacting spins
Spin glasses store and retrieve information using two-body correlated spin states. Solving optimization problems with spin glasses first requires posing the problem as an Ising (or Potts) model Hamiltonian. This Hamiltonian acts as the cost function to optimize, which is done through a search over all possible spin states with the goal of finding the spin configuration that minimizes the cost function. The lowest energy state is determined by the positive or negative correlations between neighboring spins. On a conventional processor this search can be executed through various simulated annealing methods or Markov Chain Monte Carlo methods. The fundamental challenge faced when solving these problems on conventional processors is the possibility of a system converging to a local (rather than global) minimum of the cost function.
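To make the search described above concrete, the following is a minimal sketch of single-spin-flip simulated annealing on a small Ising model. Several details are assumptions made for illustration and are not taken from the cited works: the sign convention E(s) = sum J_ij s_i s_j + sum h_i s_i, the geometric cooling schedule, the parameter values, and the names `ising_energy` and `simulated_annealing`.

```python
import math
import random


def ising_energy(spins, couplings, biases):
    """E(s) = sum_(i,j) J_ij * s_i * s_j + sum_i h_i * s_i, with s_i in {-1, +1}.

    The sign convention is an assumption of this sketch.
    """
    pair_term = sum(J * spins[i] * spins[j] for (i, j), J in couplings.items())
    bias_term = sum(h * s for h, s in zip(biases, spins))
    return pair_term + bias_term


def simulated_annealing(couplings, biases, sweeps=2000, t_start=5.0, t_end=0.05, seed=0):
    """Metropolis single-spin-flip annealing with a geometric cooling schedule."""
    rng = random.Random(seed)
    n = len(biases)
    spins = [rng.choice((-1, 1)) for _ in range(n)]
    energy = ising_energy(spins, couplings, biases)
    best_spins, best_energy = list(spins), energy
    for sweep in range(sweeps):
        temperature = t_start * (t_end / t_start) ** (sweep / max(sweeps - 1, 1))
        for _ in range(n):
            i = rng.randrange(n)
            spins[i] = -spins[i]  # propose flipping one spin
            delta = ising_energy(spins, couplings, biases) - energy
            # Always accept downhill moves; accept uphill moves with Boltzmann probability.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                energy += delta
                if energy < best_energy:
                    best_spins, best_energy = list(spins), energy
            else:
                spins[i] = -spins[i]  # reject: undo the flip
    return best_spins, best_energy


# Hypothetical 4-spin ferromagnetic chain with a small bias on spin 0.
couplings = {(0, 1): -1.0, (1, 2): -1.0, (2, 3): -1.0}
biases = [-0.5, 0.0, 0.0, 0.0]
best_spins, best_energy = simulated_annealing(couplings, biases)
print(best_spins, best_energy)
```

In this toy instance the all-down configuration is a local minimum that a purely greedy search can get stuck in; accepting uphill moves at finite temperature is what allows the search to leave it.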
Graphically, spin glass models are undirected graphs with weighted edges and vertices. They can be constructed with binary or bipolar spin states (Ising) or with q spin states (Potts). The theoretical development of Ising-glass and Potts-glass models has grown from the study of spin dynamics in magnetic and paramagnetic systems. These systems can have unique dynamics due to the competition between local and global spin-spin interactions, often exhibiting phase transitions between ordered and disordered systems [16]. Spin glasses can be used to solve CSPs and NP-hard optimization problems [17], [18]. The phase transition between ordered and disordered systems can be used to identify "difficult" or "easy" phases of a computational problem [19].
Spin glasses can also serve as the theoretical framework for label propagation [20]. The close similarity of spin-glass models and label propagation [21] has led to wide use of interacting spin systems in solving community detection problems [22], [23], [24], [19]. Spin-glass based models of community detection have shown an advantage over classical algorithms by being resolution-limit free [25].
The Hopfield network is a class of recurrent neural network with completely symmetric connections and no hidden units [26] , [27] . We choose to include it in this section based on how it is trained. These networks are distinguished from other ANNs in that the weights associated with a model are determined through various "learning" rules, not through training over data. There are a number of learning rules for the synaptic weights, including Willshaw learning [28] , Storkey learning [29] or projection learning [30] .
B. Artificial Neural Networks: Computing with Nonlinear Neurons
Artificial Neural Networks (ANNs) use nonlinear neurons and weighted connections to uncover features in data, compress and retrieve data, and to generate data. Neural networks have been applied to a large variety of tasks that will be useful in a supercomputing setting. Data analytics is one of the key areas in which neural networks will play a role, both in the development of analytics tools (i.e., training for neural networks) and in the deployment of existing analytics tools (i.e., inference for neural networks).
ANNs encompass a wide variety of topologies. A general feed-forward neural network is a layered construction formed by stacking several perceptrons to build a multilayer perceptron (MLP). This is an edge-weighted, acyclic network with deterministic, nonlinear units connected with directed edges. In contrast to the Hopfield neural networks described above, MLPs and other ANNs can contain hidden units. These units are not controlled by the user, and their values are determined by the weights and biases of the network. ANNs are powerful computational models, and increasing the number of layers they contain can result in a network that is a universal approximator [31], [32]. Currently, the most widely used discriminative models utilize MLPs [33] or one of their derivatives, such as convolutional neural networks (CNNs) [34]. These networks are trained using stochastic gradient descent-based back-propagation.
Boltzmann machines are a type of neural network with symmetric edges. These are ANNs composed of stochastic neural units used to define generative models. These networks are fully connected, weighted graphs, and training is used to find a global energy minimum [35]. The computational complexity that results from this complete connectivity makes them impossible to train for anything but trivial problems on conventional hardware [36]. Thus, a modification called a restricted Boltzmann machine (RBM) is typically used, in which the neurons are partitioned into a bipartite graph and the network is trained using contrastive divergence [37]. RBMs can then be stacked into a deep structure called a deep belief network (DBN) [38], which is used to build generative models.
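As an illustration of the training procedure named above, the following is a minimal sketch of a single-step contrastive divergence (CD-1) update for a small binary RBM. It is written under stated assumptions and is not the implementation of the cited works: the function name `cd1_update`, the learning rate, the layer sizes and the two toy training patterns are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cd1_update(weights, visible_bias, hidden_bias, v0, rng, lr=0.1):
    """One CD-1 step on a single binary training vector v0 (parameters updated in place)."""
    n_visible, n_hidden = len(visible_bias), len(hidden_bias)
    # Positive phase: hidden probabilities and a binary sample, given the data.
    ph0 = [sigmoid(hidden_bias[j] + sum(v0[i] * weights[i][j] for i in range(n_visible)))
           for j in range(n_hidden)]
    h0 = [1 if rng.random() < p else 0 for p in ph0]
    # Negative phase: one Gibbs step down to the visible layer and back up.
    pv1 = [sigmoid(visible_bias[i] + sum(h0[j] * weights[i][j] for j in range(n_hidden)))
           for i in range(n_visible)]
    v1 = [1 if rng.random() < p else 0 for p in pv1]
    ph1 = [sigmoid(hidden_bias[j] + sum(v1[i] * weights[i][j] for i in range(n_visible)))
           for j in range(n_hidden)]
    # Update: data-driven statistics minus reconstruction-driven statistics.
    for i in range(n_visible):
        for j in range(n_hidden):
            weights[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        visible_bias[i] += lr * (v0[i] - v1[i])
    for j in range(n_hidden):
        hidden_bias[j] += lr * (ph0[j] - ph1[j])


# Hypothetical toy problem: 4 visible units, 2 hidden units, two training patterns.
rng = random.Random(0)
weights = [[0.0, 0.0] for _ in range(4)]  # bipartite: visible-to-hidden edges only
visible_bias = [0.0] * 4
hidden_bias = [0.0] * 2
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
for _ in range(100):
    for v in data:
        cd1_update(weights, visible_bias, hidden_bias, v, rng)
```

The bipartite structure is what keeps each phase cheap: all hidden units can be sampled independently given the visible layer, and vice versa.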
DBNs have been utilized in the literature for unsupervised feature recognition in both audio [39] and image [40], [41] data. The ability to do unsupervised feature recognition is a key capability in a variety of contexts for future supercomputing systems. For training purposes, access to a co-processor that can quickly train Boltzmann machines or restricted Boltzmann machines for unsupervised feature recognition would be extremely useful in making realistic use of those systems. This is an especially compelling use case in a heterogeneous supercomputing node: while one processor runs scientific simulations that generate data, a co-processor could perform training and inference with a Boltzmann machine on that data in parallel. The result would be Boltzmann machines that are trained on, or perform feature recognition on, data as it is generated.
Recurrent neural networks retain the general layered construction of a feed-forward network, but introduce connections within layers. These networks, such as long short-term memory networks (LSTMs) [42], are able to model temporal relations in data. Recurrent neural networks have seen much success in providing a new state-of-the-art for domains ranging from computer vision [43] to speech processing [44].
C. Spiking Neural Networks: Event-Driven Computing
Spiking neural networks (SNNs) are a third type of artificial neural network that incorporates time into the way that the network processes information [45], most notably by computing through the transmission and reception of electrical pulses ("spikes"). Spin glass systems and SNNs share a common feature of parallelization, where a single unit is able to effect change in multiple neighboring units simultaneously. In SNN systems, the times at which inputs are received in the system matter in how the network processes that information. Neurons in SNNs may be deterministic or stochastic, giving a degree of flexibility in the way computation is done in these systems.
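The dependence on input timing can be illustrated with a minimal sketch of a deterministic, discrete-time leaky integrate-and-fire neuron. The neuron model, the function name `lif_spike_times`, and the values of the weight, leak and threshold are assumptions chosen for illustration; they do not correspond to any specific device discussed in this survey.

```python
def lif_spike_times(input_times, weight=0.7, leak=0.5, threshold=1.0, steps=20):
    """Discrete-time leaky integrate-and-fire neuron (illustrative model).

    Each step the membrane potential decays by the factor `leak`, rises by
    `weight` for an input spike arriving at that step, and the neuron fires
    and resets to zero once the potential reaches `threshold`.
    """
    inputs = set(input_times)
    potential = 0.0
    output_times = []
    for t in range(steps):
        potential *= leak
        if t in inputs:
            potential += weight
        if potential >= threshold:
            output_times.append(t)
            potential = 0.0
    return output_times


# Two input spikes close together push the neuron over threshold ...
print(lif_spike_times([2, 3]))
# ... while the same two spikes far apart leak away before they can add up.
print(lif_spike_times([2, 8]))
```

Both calls deliver the same number of input spikes with the same weight; only their arrival times differ, and that alone determines whether the neuron fires.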
One of the use cases of SNNs is to emulate biological neurons to execute brain-like computing, as they are more closely aligned with how biological brains operate than traditional artificial neural networks. The need for more brain-like characteristics has driven device development in terms of supporting synaptic plasticity [46] and having neuron models that can emulate neuronal dynamics [47]. Spiking neural networks can be constructed and trained like ANNs by modifying training methods such as back-propagation to allow for spiking data, and this has been the dominant application for NPUs. However, they have also been trained by a variety of other mechanisms, including unsupervised, Hebbian-style learning [48] and evolutionary optimization [49].
III. ALGORITHMIC CLASSES
Each of the computational models has different characteristics that make it well-suited to different types of algorithmic tasks. Each of the next-generation processors identified above has been developed to execute an algorithmic approach efficiently with a specific computational model. For example, NPUs have been designed primarily to efficiently deploy deep learning with SNNs [50]. However, the theoretical concepts underlying several models overlap significantly, which has led to the execution of additional algorithmic tasks. For example, SNNs have recently been used on NPUs to execute CSPs [51]. Table I gives a brief summary of the differences between the various models, as well as some of the applications for which they are well-suited and the most appropriate post-Moore's era technologies for those network types. We will discuss each of the next-generation processors in terms of its computational model and algorithmic class, and also identify potentially new use cases.
A. Constraint Satisfaction Problems
CSPs such as graph coloring and 3-SAT belong to the computational complexity class NP [52], [53], [54]. A wider range of problems, including Traveling Salesman, are classified as NP-hard. No efficient (polynomial-time) algorithm is known for these problems; in the worst case, known exact methods must search over a set of possible solutions that may be exponentially large. Many algorithms have been developed to solve CSPs [55], and it is hoped that many next-generation processors can offer significant computational advantages in solving these problems. Simulated annealing [56], [57] can be used to solve optimization problems, utilizing the phase transitions in a spin-glass system to find a global minimum.
Addressable memory models and pattern recall problems search over a set of stored patterns [58], [59], [60], [28]. An edge-weighted network can store a set of binary or bipolar patterns, each of which is retrieved by supplying a portion of the stored pattern as a key and searching for the matching stored pattern. Hopfield networks can be utilized as an auto-associative memory model known as a content-addressable memory (CAM) model [26]. The edge weights of the network are determined using Hebbian learning (or another learning rule); the units contained in the key are then fixed, and the network updates the remaining units to complete the pattern. The topology of the network for a CAM model of order N is a fully connected clique K_N, and the storage capacity with perfect recall is upper bounded by 0.15N [26], [61], [58], [59]. A bidirectional memory model stores patterns of order N in the edge weights of a fully connected biclique K_{N/2,N/2} [60]. Multiple learning rules [28], [29], [30] and network topologies [62], [63], [64], [65], [66] have been developed with the goal of increasing the pattern storage capacity of associative memory models.
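The recall procedure described above (Hebbian weights, clamped key units, iterative update of the remaining units) can be sketched as follows. This is an illustrative sketch only: the function names `hebbian_weights` and `recall`, the 1/N normalization, the choice to initialize free units to +1, and the two 8-unit bipolar patterns are assumptions made for the example.

```python
def hebbian_weights(patterns):
    """Hebbian rule: w_ij = (1/N) * sum over patterns of p_i * p_j, with no self-connections."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j] / n
    return weights


def recall(weights, key, max_sweeps=20):
    """Complete a pattern from a partial key.

    `key` maps unit index -> clamped bipolar value. Clamped units stay fixed;
    the remaining units start at +1 and are updated one at a time until stable.
    """
    n = len(weights)
    state = [key.get(i, 1) for i in range(n)]
    free_units = [i for i in range(n) if i not in key]
    for _ in range(max_sweeps):
        changed = False
        for i in free_units:
            local_field = sum(weights[i][j] * state[j] for j in range(n))
            new_value = 1 if local_field >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state


# Two hypothetical stored patterns of order N = 8 (chosen to be orthogonal).
pattern_a = [1, 1, 1, 1, -1, -1, -1, -1]
pattern_b = [1, -1, 1, -1, 1, -1, 1, -1]
weights = hebbian_weights([pattern_a, pattern_b])

# Supply the first four units as the key and let the network fill in the rest.
completed = recall(weights, {0: 1, 1: 1, 2: 1, 3: 1})
print(completed == pattern_a)
```

The weight matrix here is dense, reflecting the fully connected clique K_N; this is the connectivity that sparser representations aim to reduce when the model is embedded on hardware with limited connectivity.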
Many graph algorithms can be classified as subcategories of CSPs; for this survey, we focus our discussion on community detection. Community detection is both a search and a classification problem. For a graph G(V, E), the goal of a community detection task is to identify subsets of the vertex set which can be considered "related." This is a problem encountered in many scientific fields where structured, relational data is generated. Many algorithms exist for solving this problem [67], [68], [69]. Methods based on modularity matrix analysis [70] have a well-known resolution limit [71], and hierarchical clustering [72], [73], [74] may infer spurious associations. Label propagation [20], by contrast, has a significant advantage in that it does not require knowledge about the number of communities to search for in a graph.
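A minimal sketch of label propagation on a toy graph shows why the number of communities does not need to be supplied. Every vertex starts with its own label and repeatedly adopts the most common label among its neighbors. The sketch makes simplifying assumptions that differ from the usual formulation: vertices are visited in a fixed order and ties are broken toward the largest label so that the example is deterministic, whereas label propagation is normally run with random ordering and random tie-breaking. The function name and the example graph are hypothetical.

```python
from collections import Counter


def label_propagation(adjacency, max_sweeps=100):
    """Asynchronous label propagation with deterministic ordering and tie-breaking."""
    labels = {node: node for node in adjacency}  # every vertex starts in its own community
    for _ in range(max_sweeps):
        changed = False
        for node in sorted(adjacency):
            counts = Counter(labels[neighbor] for neighbor in adjacency[node])
            if not counts:
                continue  # isolated vertex keeps its own label
            top = max(counts.values())
            # Tie-break toward the largest label (an assumption of this sketch).
            best = max(label for label, count in counts.items() if count == top)
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Two triangles joined by a single edge (2-3).
adjacency = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
labels = label_propagation(adjacency)
print(labels)
```

The number of distinct labels that survive is an output of the procedure rather than an input, which is the advantage noted above.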
B. Deep Learning
The advantages of using ANNs to model relationships in data are well known [75]. ANNs are powerful models that can extract complex relationships from data, without relying on an underlying theoretical model of how the data is generated. Deep Learning is a diverse field of research and encompasses algorithms used to train these networks [7]. These learning methods can be supervised, semi-supervised, or unsupervised. There are also a number of biologically inspired models such as evolutionary learning or reinforcement learning. Training large ANNs over large sets of data is a computationally intensive task which has been significantly accelerated through the use of GPUs.
As a universal approximator, an ANN can approximate the probability distribution P(X) that generated a set of training data (X). Training an ANN determines the weights and biases in the set θ = {w, b}. This is inherently an optimization problem, as the search focuses on identifying the parameter set that minimizes (or maximizes) a given cost function J(θ). Large networks are generally trained using gradient descent (or stochastic gradient descent), where the parameters are iteratively updated depending on the gradient (or an estimate of the gradient) of the cost function. Generative models (such as RBMs) are trained with gradient-based estimation methods such as Gibbs sampling or contrastive divergence.
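The iterative update of θ = {w, b} can be illustrated on the smallest possible case. The sketch below uses full-batch gradient descent on a single linear unit with a mean-squared-error cost; the model, the cost function, the learning rate and the function name `train_linear` are assumptions chosen to keep the example short, and a stochastic variant would estimate the same gradient from a random subset of the data at each step.

```python
def train_linear(xs, ys, lr=0.1, epochs=500):
    """Fit y = w * x + b by minimizing J(theta) = mean((w * x + b - y) ** 2)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the cost with respect to each parameter.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient to reduce the cost.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical training data generated by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = train_linear(xs, ys)
print(round(w, 3), round(b, 3))
```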
C. Common feature: Sparse coding
The near-term devices we discuss in this survey emphasize a need for sparseness in network design or CSP construction. A common coding feature that has shown promising results on both NPUs and QPUs is sparse coding [76] with locally competitive algorithms (LCAs) [77]. This is a coding technique which is motivated by how the brain encodes information. Sparse coding with local competition has been implemented on Intel's Loihi chip and has been shown to outperform conventional processors for solving LASSO optimization [15]. Sparse coding has also been shown to improve computational time for solving optimization problems on a QPU [78].
IV. TENSOR PROCESSING UNITS
Currently, the most computationally intensive and widely used types of neural networks in deep learning research are CNNs and LSTM networks. As with their predecessor, the MLP, the overwhelming majority of the computation used for inference lies within a weighted-sum computation, which can be mapped to multiply-and-accumulate operations. Thus, for a wide variety of deep learning approaches that are rapidly being deployed to production systems, there exists a shared computational unit at their core. With the surge in the use of these networks, a custom ASIC implementation of this computation is now becoming an economically feasible proposition for many large data companies. As such, researchers at Google have developed a custom design called a Tensor Processing Unit™ (TPU) [9].
The current-generation TPU focuses heavily on concerns that could previously be ignored when deep learning computation was largely done as research or as batch processes run periodically against data. Stakeholders are now looking to utilize deep learning in near real time. They are looking to collect data (e.g., speak into their phone), have the deep learning system operate on that data (e.g., transcribe voice to text), and then be provided actionable information on that data (e.g., a list of relevant search results). Such a system demands low latency, low power consumption, and high throughput. In order to provide a low-latency inference system, the TPU is designed with a deterministic execution model, rather than a model that incorporates optimizations such as caches or out-of-order execution, which can come at the expense of guaranteed latency. This design allows the TPU to meet latency requirements while achieving throughput improvements of 15X-30X and power efficiency improvements of 30X-70X over current CPUs and Kepler-generation GPUs [9].
Moreover, the TPU designers have proposed some minor changes that would allow future designs to achieve 3X improvements in throughput while more than doubling the power efficiency.
This system represents the current state-of-the-art for deep learning hardware. However, it has been noted that many NP-hard problems are tensor in nature [79] and recently, CSPs have been constructed for execution on a TPU [80].
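The weighted-sum computation that this section identifies as the shared core of MLP, CNN and LSTM inference can be written out explicitly as a sequence of multiply-and-accumulate steps. The sketch below is plain Python for illustration only and says nothing about how the TPU datapath is organized; the function name `dense_layer_mac` and the example values are hypothetical.

```python
def dense_layer_mac(weights, biases, inputs):
    """Weighted sums for one fully connected layer, one multiply-accumulate at a time."""
    outputs = []
    for row, bias in zip(weights, biases):
        accumulator = bias
        for w, x in zip(row, inputs):
            accumulator += w * x  # one multiply-and-accumulate operation
        outputs.append(accumulator)
    return outputs


# Hypothetical layer with 2 inputs and 2 output units.
weights = [[1, 2], [3, 4]]
biases = [0, 1]
inputs = [5, 6]
print(dense_layer_mac(weights, biases, inputs))  # prints [17, 40]
```

A layer with M outputs and K inputs needs M x K such operations per input vector, which is why a dedicated unit for this one operation covers most of the inference workload.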
V. NEURAL PROCESSING UNITS
A. State of the art: neuromorphic processors
The current generation of commercially available neuromorphic hardware consists of digital systems such as IBM's TrueNorth Neurosynaptic Processor [12] and SpiNNaker [81]. Both systems can implement many known classes of neural networks used in machine learning, such as CNNs, RBMs and liquid state machines [50], [82], [83], [84], [81]. Programming neuromorphic hardware requires the specification of many parameters needed to implement systems of coupled spiking neurons. For the TrueNorth chip, the layout and available connections show dense local connectivity (intra-core) and sparse global connectivity (inter-core). In the current hardware, only a limited number of bits of precision are allocated for specifying the various system parameters. We expect that other neuromorphic hardware systems (as they are made available) will be similarly highly optimized for a particular network model and topology and will have their own sets of restrictions. As such, there are non-trivial issues associated with training neural networks for use on neuromorphic hardware platforms. Costly off-chip training may offset any speedup or reduction in energy cost gained by using a neuromorphic system as a co-processor. In this section, we therefore present two use cases that require little or no off-chip training.
Spiking neural networks can be constructed in the same manner as ANNs. However, executing deep learning with an SNN deployed on neuromorphic architecture requires modifying the network construction as well as the learning algorithms to incorporate spike-based signals. SNNs typically implement a learning rule called Spike Timing Dependent Plasticity (STDP), which adjusts synaptic weights in the network based on the firing activity in the network [48]. However, SNNs can also be trained using a variety of other mechanisms, including gradient descent-based methods [85] and evolutionary algorithms [86]. Spike-based implementations of back-propagation [87], [88], CNNs [50], Gibbs sampling [83], RBMs [82], DBNs [89] and recurrent neural networks [90] have been developed for IBM's TrueNorth Neurosynaptic System and the SpiNNaker system.
An increasingly common use case of spiking neural networks is as the "liquid" in a liquid state machine. Liquid state machines are a special use case of a recurrent spiking neural network with a read-out layer that is similar to the final layer in a traditional neural network [91] . Liquid state machines have been successfully used to analyze spatio-temporal data such as speech [92] and video [93] . The "liquid" itself is an untrained recurrent spiking neural network; the read-out layer is typically trained using some form of gradient descent or back-propagation. A co-processor that implements a liquid state machine could perform inference on spatio-temporal data as the data is being generated by another processor in the environment. Unlike many of the other discussed models, liquid state machines are very well-suited to deal with temporal data and could be used in combination with other neural network models for data analysis. Because the recurrent spiking neural network is untrained, it can be deployed immediately onto any spiking neuromorphic system that can realize sufficiently complex networks.
Though both TrueNorth and SpiNNaker are fully digital systems, there is a large variety of spiking neuromorphic hardware implementations that utilize emerging circuit components such as memristors to implement spike-based systems. Some of these systems have already been utilized to implement liquid state machines [94], [95]. Systems that include memristors and other emerging device types typically consume much less energy than their fully digital counterparts, which are already more energy-efficient than other, more traditional architectures. Since energy and power consumption will likely be limiting factors in post-Moore's era supercomputing, novel neuromorphic systems that consume significantly less energy than most computing systems will be well-suited to a future supercomputing environment.
Application to CSPs
While NPUs can efficiently execute neural networks, the training of these networks is usually done on conventional processors or GPUs. There has been growing interest in applications for NPUs that avoid the need for pre-training of large neural networks. This has led to the mapping of CSPs and label propagation algorithms onto NPUs. The main challenge in adapting difficult computational problems onto NPUs is how to translate spiking data into relevant information.
CSPs have been mapped onto the SpiNNaker system and solved using systems of stochastic neurons to minimize an energy function [96], [51]. Whereas QPUs utilize quantum effects to escape spurious minima, the spike-based annealing used in Refs. [96], [51] traverses energy barriers using intermediary network states. Pattern recall tasks have been mapped to spiking neuron systems [97], [98], [45], [99], [100] and deployed on many different neuromorphic systems [101], [102], [103].
The synchronicity of spiking neurons with excitatory and inhibitory connections has been exploited to implement label propagation algorithms using leaky integrate-and-fire neurons or Kuramoto oscillators [104], [105], [106] under global driving and inhibition. A separate spike-based approach to label propagation has been implemented by combining spin-glass dynamics with locally driven leaky integrate-and-fire neurons [107], [108] and deployed on IBM's TrueNorth [109].
VI. QUANTUM PROCESSING UNITS
A. State of the art: quantum annealers
Spin glass models have found success in a number of algorithms, but the search over all spin states is not guaranteed to find the global energy minimum. Simulated annealing relies on thermal excitations to search over the energy landscape, but these may not be sufficient to allow the system to escape spurious minima. Similar to simulated annealing, quantum annealing finds the solution of an Ising system by searching over its energy landscape [110], [111], [112] and can be applied to many NP-hard problems, specifically quadratic unconstrained binary optimization problems [113], [114]. Quantum annealing is expected to have significant computational advantages over simulated annealing by leveraging quantum effects such as tunneling through barriers and superposition to escape local minima [115], [116]. However, the full capability of quantum annealing, and the problems that could arise as systems are scaled up, remain an active area of debate [117], [118], [119], [120], [121].
Currently, the (commercially available) state-of-the-art quantum hardware is the quantum annealer from D-Wave [10], [11]. The interacting qubits are arranged in a fixed geometry, and the hardware can be conceptualized as a graph with weighted vertices and edges (Chimera graph layout). Programming the annealer to solve an optimization problem first requires the construction of an Ising spin glass system, which defines the needed edge weights (couplings) and vertex weights (biases). The current D-Wave 2000Q processor contains 2000 qubits, but the size of the largest problem that can be solved on the system is limited by the fixed hardware graph. Other annealing platforms have been developed, mainly for the simulation of quantum systems [122] of up to 51 qubits, and D-Wave has been used to demonstrate critical behavior of quantum systems with over 400 simulated qubits [123]. QPUs have mainly been used to solve SAT problems and CSPs [124], [125], [126] and other general optimization problems [127], [128].
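The step of constructing an Ising system from a problem statement can be sketched for a toy constraint. The example below takes the constraint "exactly one of three binary variables is 1", writes it as a quadratic unconstrained binary optimization (QUBO) penalty, and converts it to couplings and biases using the substitution x = (1 + s) / 2. The energy convention E(s) = sum h_i s_i + sum J_ij s_i s_j, the function names, and the toy constraint are assumptions of this sketch; it ignores the further step of embedding the resulting graph onto the fixed hardware layout.

```python
def qubo_to_ising(qubo, n):
    """Convert a QUBO {(i, j): Q_ij, i <= j} over x in {0, 1} to Ising biases and couplings.

    Uses x_i = (1 + s_i) / 2 and returns (biases, couplings, offset) such that
    qubo_energy(x) == ising_energy(s) + offset for matching x and s.
    """
    biases = [0.0] * n
    couplings = {}
    offset = 0.0
    for (i, j), q in qubo.items():
        if i == j:
            biases[i] += q / 2
            offset += q / 2
        else:
            couplings[(i, j)] = couplings.get((i, j), 0.0) + q / 4
            biases[i] += q / 4
            biases[j] += q / 4
            offset += q / 4
    return biases, couplings, offset


def qubo_energy(qubo, x):
    return sum(q * x[i] * x[j] for (i, j), q in qubo.items())


def ising_energy(spins, couplings, biases):
    pair_term = sum(J * spins[i] * spins[j] for (i, j), J in couplings.items())
    bias_term = sum(h * s for h, s in zip(biases, spins))
    return pair_term + bias_term


# Penalty (x0 + x1 + x2 - 1)^2 with the constant term dropped:
# -x_i on the diagonal and +2 * x_i * x_j off the diagonal.
qubo = {(0, 0): -1, (1, 1): -1, (2, 2): -1, (0, 1): 2, (0, 2): 2, (1, 2): 2}
biases, couplings, offset = qubo_to_ising(qubo, 3)
print(biases, couplings, offset)
```

The resulting `couplings` and `biases` are the edge and vertex weights referred to above; the constant `offset` shifts every energy equally and does not affect which configuration is the minimum.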
For pattern recall tasks, the use of quantum annealing simplifies the search for a stored pattern by replacing the iterative update rule with a single system annealing. Classical associative memory models can be mapped to quantum systems for pattern retrieval [129]. While quantum annealing is projected to significantly increase the capacity of an associative memory model [130], finding sparser representations of the Hopfield network can also improve associative memory models for quantum annealing. For bidirectional memory models on quantum annealers [131], the choice of learning rule affects the performance of the model and allows for flexibility in the face of limited bits of precision. Using neural coding can generate sparser representations of Hopfield networks embedded onto a QPU [132].
Deep learning and graph algorithms
The most computationally intensive part of the contrastive divergence algorithm for unrestricted Boltzmann machines is the annealing to reach thermal equilibrium, followed by sampling at the equilibrium state to estimate the probabilities required for contrastive divergence. The D-Wave quantum annealing processor performs this type of calculation naturally. Due to the Chimera topology of the D-Wave, however, a fully connected Boltzmann machine utilizing all qubits is not realizable. To get around this issue, Potok et al. proposed a limited Boltzmann machine, which has greater connectivity than the restricted Boltzmann machine but still operates within the constraints of the D-Wave configuration [133]. Utilizing the D-Wave as a co-processor allows training for the limited Boltzmann machine that would take months on a conventional von Neumann architecture to occur in minutes to hours. Because Boltzmann machines have more sophisticated representations due to their greater interconnections, their data analysis capabilities are theoretically greater as well. QPUs will allow researchers to apply Boltzmann machines to real tasks in a reasonable amount of time.
Many graph algorithms can be formulated as CSPs and deployed on QPUs. These include the Traveling Salesman problem [134], network flow problems [135], and graph coloring [136]. Additional studies have shown how quantum annealers can be used to implement graph partitioning and community detection [137], [138], spanning tree identification [139], and graph isomorphism [140].
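As one example of such a formulation, graph coloring can be written as a quadratic unconstrained binary optimization (QUBO) problem, the input form accepted by annealers. The sketch below uses a common one-hot encoding and is not necessarily the construction of [136]: binary variable x(v, c) is 1 when vertex v receives color c, each vertex is penalized unless exactly one of its color variables is set, and each edge is penalized when its endpoints share a color. The function names and the `penalty` parameter are hypothetical, and brute-force search replaces the annealer.

```python
from itertools import product


def coloring_qubo(vertices, edges, num_colors, penalty=1.0):
    """Build QUBO coefficients {(var_a, var_b): value}, with var = (vertex, color).

    Expanding penalty * (1 - sum_c x_vc)^2 for binary x gives a linear term of
    -penalty per variable, +2 * penalty per pair of colors on one vertex, and a
    constant (dropped here). Each edge adds +penalty per shared color.
    """
    q = {}

    def add(a, b, val):
        key = (a, b) if a <= b else (b, a)
        q[key] = q.get(key, 0.0) + val

    for v in vertices:
        for c in range(num_colors):
            add((v, c), (v, c), -penalty)
            for d in range(c + 1, num_colors):
                add((v, c), (v, d), 2 * penalty)
    for u, v in edges:
        for c in range(num_colors):
            add((u, c), (v, c), penalty)
    return q


def qubo_energy(q, assignment):
    """Evaluate sum of q_ab * x_a * x_b; diagonal entries act as linear terms."""
    return sum(val * assignment[a] * assignment[b] for (a, b), val in q.items())


def brute_force_minimum(q):
    """Exhaustive search standing in for an anneal; feasible only for tiny graphs."""
    variables = sorted({var for pair in q for var in pair})
    best = None
    for bits in product((0, 1), repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        e = qubo_energy(q, assignment)
        if best is None or e < best[0]:
            best = (e, assignment)
    return best


def decode(assignment):
    """Map each vertex to its selected color (valid for one-hot assignments)."""
    return {v: c for (v, c), bit in assignment.items() if bit}
```

With the constant dropped, the minimum energy equals the negative of the penalty times the number of vertices exactly when a proper coloring exists. For a triangle with three colors, the search therefore returns energy -3 and assigns a distinct color to each vertex.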
B. Addendum: circuit model quantum computing
Quantum annealing is not the only platform for quantum computing. The circuit model of quantum computing, in which individual qubits are manipulated through the application of unitary gates, is predicted to outperform classical computers on a number of tasks, such as database searching [141] and prime number factorization [142]. This model of quantum computing will require a large number of qubits in order to execute fault-tolerant computations, and at the time of writing this survey we are not aware of any devices capable of implementing these tasks. Recent studies have shown how ANNs can be implemented in the circuit model of quantum computing [143] and deployed on near-term devices.
VII. SCALABILITY AND INTEGRATION INTO HPC
Thermal effects present a significant obstacle to QPU scalability, and recent results posit that they may dilute any possible quantum speedup [120]. Currently, problem instances that are too large to be directly embedded onto the existing hardware are divided into smaller sub-problems [144]. Most of the focus has been on how to incorporate QPUs into existing HPC workflows [145], [146]. Hybrid classical-quantum solvers have been put forth by D-Wave, in which computationally difficult parts of a computation are off-loaded to a QPU [135].
NPUs can operate at room temperature and have been developed to be more scalable, but less emphasis has been placed on how to incorporate them into larger HPC workflows. Currently, they are incorporated into heterogeneous workflows by training ANNs on CPUs and GPUs prior to deployment on an NPU.
VIII. CONCLUSIONS
We have provided a review of several recent applications of neural networks on next-generation hardware platforms. These examples emphasize that there are both familiar and unfamiliar elements in the design and implementation of neural networks on these new systems. A common feature in these designs is the relevance of graphical structures to indicate associations, patterns, and feedback. While the use of these concepts is natural for programming neuromorphic and/or quantum systems, the migration of existing applications is less familiar. However, there are problems that find a natural connection with these data structures, including graph-analytic methods and machine learning tasks. The translation of existing applications to neuromorphic and quantum processing and other future technologies may therefore be expected to proceed in stages, with early adoption by select user communities. We expect that success in these areas will set the stage for later adoption by broader user groups. The transition between these stages will depend on the infrastructure that provides support for translating data structures from conventional representations to the structures necessary for next-generation hardware.
There have been growing concerns that emerging computing paradigms, like neuromorphic computing and quantum computing, will provide such disruptive capability as to undermine existing investments in software infrastructure [147], [148]. However, as the applications presented here show, the disruption lies largely at the interfaces between the specialized methods allocated to these processors and the routines which may make use of them. In this regard, neuromorphic and quantum processing may best befit the accelerator model that has become increasingly popular with the successful use of GPUs. Designs for these systems emphasize the expectation of extreme heterogeneity in future HPC systems and the need for infrastructure to manage, schedule, and tune these systems. However, we must emphasize that the successful use of neuromorphic and quantum processing to date does not discourage this approach but rather makes clear that such efforts are worthwhile.
