Neuromorphic Nearest-Neighbor Search Using Intel's Pohoiki Springs by Frady, E. Paxon et al.
E. Paxon Frady, Garrick Orchard, David Florey, Nabil Imam, Ruokun Liu, Joyesh Mishra, Jonathan Tse,
Andreas Wild, Friedrich T. Sommer, Mike Davies
Intel Labs, Intel Corporation
Abstract—Neuromorphic computing applies insights from neuroscience to uncover innovations in computing technology. In the brain,
billions of interconnected neurons perform rapid computations at extremely low energy levels by leveraging properties that are foreign
to conventional computing systems, such as temporal spiking codes and finely parallelized processing units integrating both memory
and computation. Here, we showcase the Pohoiki Springs neuromorphic system, a mesh of 768 interconnected Loihi chips that
collectively implement 100 million spiking neurons in silicon. We demonstrate a scalable approximate k-nearest neighbor (k-NN)
algorithm for searching large databases that exploits neuromorphic principles. Compared to state-of-the-art conventional CPU-based
implementations, we achieve superior latency, index build time, and energy efficiency when evaluated on several standard datasets
containing over 1 million high-dimensional patterns. Further, the system supports adding new data points to the indexed database
online in O(1) time unlike all but brute force conventional k-NN implementations.
1 INTRODUCTION
The brain was an inspiration even for the pioneers of
computing, like John von Neumann [1]. For historical
and practical reasons, however, the von Neumann architecture
of classical computers looks very different from brains. In
a traditional computer, memory (DRAM) and computing
(CPU) are physically separated, information is processed
according to a sequential program specification mediated by
a central clock, and information is represented in digital bi-
nary strings. By contrast, in brains, information is processed
fundamentally in parallel with memory and computation
tightly intertwined and distributed over circuits of synapses
and neurons. Although there are emergent rhythms in the
brain that coordinate computation as needed, there is no
central clock. Finally, communication and computation in
neural circuits involve both analog and digital operations.
Neurons integrate synaptic input in an analog manner,
which is advantageous for efficient temporal computation,
but their outputs are binary-valued spikes, which is advan-
tageous for communication. Here we will demonstrate how
these brain-inspired principles can be applied to perform
efficient k-nearest neighbor search on Intel’s Pohoiki Springs
neuromorphic research platform.
2 LOIHI AND POHOIKI SPRINGS
The Loihi neuromorphic research chip is a 128-core, fully
digital and asynchronous design implementing an ad-
vanced spiking neural network feature set [2]. Its highly
specialized architecture is optimized for efficiently commu-
nicating and processing event-driven spike messages. Loihi
is fabricated with Intel’s standard 14nm CMOS process
technology. Each one of its 128 cores implements up to
1024 digital spiking neurons with features such as variable
weight precision, hierarchical and compressed network
routing tables, and microcode-programmed synaptic
learning rules. Additionally, each Loihi chip includes three
embedded x86 processors responsible for interacting with
the neuromorphic cores on short timescales and converting
off-chip data between conventional encodings and spikes.
Fig. 1. Pohoiki Springs 5 RU Chassis
Loihi also includes inter-chip communication interfaces
allowing it to scale to thousands of chips in a two-
dimensional mesh. Loihi has been used in several mesh-
based systems to date, ranging from Kapoho Bay, a 2-chip
(1x2 mesh) USB form factor device, to Nahuku, a 32-chip (4x8
mesh) custom plug-in card, to Pohoiki Beach, an early version
of the Pohoiki chassis instantiating two Nahuku cards.
Pohoiki Springs, shown in Fig. 1, is the latest evolution
in Loihi systems. It expands on Pohoiki Beach to a capacity
of 24 Nahuku cards in a standard 19” five rack-unit chassis.
The fully configured Pohoiki Springs chassis contains the
following components:
• 24 Nahuku cards organized into three columns of 8
cards each, for a total of 768 Loihi chips in a 12x64
mesh.
arXiv:2004.12691v1 [cs.NE] 27 Apr 2020
• Three Arria 10 FPGAs, one per column of Nahuku
cards, that each include both the Arria 10 SX FPGAs
and ARM processors that interface with the mesh
of Loihi chips. The FPGA fabric converts the ARM
AXI bus to the Loihi proprietary communications
protocol, and the processor implements the network-
ing stack. Each ARM CPU serves as the host for
its allocation of Nahuku cards, responsible for data
I/O and CPU-coded algorithmic interaction with its
mesh of Loihi chips. The hosts communicate with the
remote super host CPU over an integrated Ethernet
network.
• One x86-based system, a Core i5 CPU on an ATX
motherboard form factor located in the rear of the
Pohoiki Springs chassis. This x86 system, referred to
as the super host, is used for orchestration, config-
uration, and other command and control duties. It
can also take part in neuromorphic computation by
injecting data into and interpreting results from the
768 chip mesh via the Arria 10 ARM hosts.
• One embedded Ethernet switch that consolidates all
internal Ethernet traffic into a single interface at the
rear of the chassis.
Loihi implements a barrier synchronization mechanism
that allows its architecture to seamlessly scale up from one
core to Pohoiki Springs’ heterogeneous mesh of 98,304 neu-
romorphic cores and 2,304 embedded x86 cores. Whether
within or between chips, all cores exchange barrier messages
with their neighbors that signal the completion of each
algorithmic timestep. The asynchronous, blocking nature of
the barrier handshakes allows timesteps to run in a variable
amount of real time depending on the amount of compu-
tation and communication the mesh collectively requires
on each timestep. For pure computational workloads such
as nearest neighbor classification, this feature allows the
system to complete computations in the minimum time
possible, providing latency and power benefits.
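The component counts quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only figures stated in the text (24 Nahuku cards, 32 chips per card, 128 cores and three x86 processors per chip, up to 1024 neurons per core); the variable names are illustrative.

```python
# Counts quoted in the text; the variable names are illustrative only.
NAHUKU_CARDS = 24
CHIPS_PER_CARD = 32        # Nahuku is a 32-chip (4x8 mesh) card
CORES_PER_CHIP = 128
NEURONS_PER_CORE = 1024    # upper bound per core
X86_PER_CHIP = 3

chips = NAHUKU_CARDS * CHIPS_PER_CARD
neuro_cores = chips * CORES_PER_CHIP
x86_cores = chips * X86_PER_CHIP
max_neurons = neuro_cores * NEURONS_PER_CORE

print(chips, neuro_cores, x86_cores, max_neurons)
# 768 98304 2304 100663296
```

The last figure is the upper bound behind the "100 million spiking neurons" quoted in the abstract.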
3 NEAREST NEIGHBOR SEARCH
As a first demonstration of a highly scalable neuromorphic
algorithm on Pohoiki Springs, we apply the neuromorphic
properties introduced above to nearest-neighbor search, a
problem that appears in numerous applications, such as
pattern recognition, computer vision, recommendation sys-
tems, and data compression. Given a database of a large
number M of N-dimensional data points, the k-nearest
neighbor (k-NN) algorithm maps a specified search key to
the k closest matching entries.
The performance of different k-NN algorithms is mea-
sured by the time complexity of a search, as well as the
time and space complexity required to prepare and store
the data structures used to perform the search, referred to
as the search index.
There are several exact k-NN implementations, such as
those based on space partitioning, for example using k-d
trees or R trees [3, 4]. However, exact approaches suffer
from the curse of dimensionality [5] and for large high-
dimensional databases they are too computationally expen-
sive to use in practice on conventional hardware. In recent
years a variety of efficient approximate k-NN implementa-
tions have been developed and are in wide use today. These
employ diverse approaches such as dimension reduction,
locality sensitive hashing, and compressed sensing [6, 7, 5].
Recent efforts to fairly benchmark these methods have
shown that even these approximate methods must choose
between minimizing either query time or index preparation
time [8].
Here we focus on the case of nearest neighbor search on
the unit sphere where distance refers to angular distance
or cosine similarity between (normalized) vectors. For this
distance metric, exact nearest neighbor search can be per-
formed by a matrix vector product (MVP) between a large
data matrix of high dimensional pattern vectors and the
search key. Given data matrix $D \in \mathbb{R}^{N \times M}$ and search key
$\tilde{d} \in \mathbb{R}^{N}$, the matrix vector multiplication approach to k-NN
yields a score vector of matches, the match vector:
$$m = D^\top \tilde{d} \qquad (1)$$
The entry of the match vector with maximum amplitude
identifies the nearest neighbor. For the k-NN problem, the
set of k components with largest amplitudes represents the
solution.
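For reference, the exact matrix-vector formulation of Eq. (1) can be sketched on a conventional CPU with NumPy. This is a minimal illustration on synthetic data, not the neuromorphic implementation; the function name `knn_by_matvec` and the test data are assumptions made for the example.

```python
import numpy as np

def knn_by_matvec(D, d_query, k):
    """Exact k-NN on the unit sphere via Eq. (1).

    D       : (N, M) data matrix whose columns are unit-norm patterns
    d_query : (N,) search key
    Returns the indices of the k best matches, best first.
    """
    d_query = d_query / np.linalg.norm(d_query)
    m = D.T @ d_query                       # match vector, Eq. (1)
    top = np.argpartition(-m, k - 1)[:k]    # k largest entries, unordered
    return top[np.argsort(-m[top])]         # order them by match value

rng = np.random.default_rng(0)
D = rng.standard_normal((16, 200))
D /= np.linalg.norm(D, axis=0, keepdims=True)

# Query with a slightly corrupted copy of stored pattern 42.
query = D[:, 42] + 0.01 * rng.standard_normal(16)
result = knn_by_matvec(D, query, k=3)
print(result)
```

The cost of this baseline grows linearly with both N and M, which is what motivates the dimensionality reduction and parallel evaluation described next.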
Our simple approximate algorithm computes and
searches the matrix-vector product (1) on neuromorphic
hardware. By encoding the data using spike-timing patterns,
we can implement k-NN on Pohoiki Springs at large scale.
3.1 Data encoding
Coding conventional data in a manner that is amenable for
sparse, spike-based processing is a key aspect of neuromor-
phic algorithm design. We explain our approach using the
Tiny Images dataset as an example. Similar processing is
applied to GIST-960 [9] and GloVe [10] datasets, and can also
be applied to other datasets. For the Tiny Images dataset, the
data dimensionality is N = 32 × 32 × 3 = 3072 pixels per
image, and the number of data points M will scale up to
10^6. To start, the data is mean-centered and normalized.
Image and other data can be reduced in dimensionality
quite easily, often with minimal loss of information. In this
work, we use PCA and ICA to transform input data patterns
to lower dimensional representations. For large datasets, a
representative subset can be used to compute the transform
matrix $D_{PCA} = U_{PCA} \Sigma V_{PCA}$. A subset of 20,000 training
data points from Tiny Images was used to compute the
principal components $V_{PCA} \in \mathbb{R}^{N_C \times N}$, with the top
$N_C = 500$ kept for dimensionality reduction, down from 3072.
Following PCA reduction, the fast ICA algorithm [11] is
used to find the ICA mixing matrix $M_{ICA} \in \mathbb{R}^{N_C \times N_C}$. The
mixing matrix is a unitary matrix that rotates the image into
a basis with sparse coefficients. The PCA-ICA combination
provides an encoding matrix,
$$C = M_{ICA} V_{PCA} \qquad (2)$$
The matrix C is computed offline once and stored for later
online use to encode search keys $\tilde{d}$. Specifically, an image
is represented as the sparse coefficients of the ICA basis
vectors (Fig. 2)
$$\tilde{e} = C \tilde{d} \qquad (3)$$
Fig. 2. In the preprocessing phase, a subset of the data is used to
compute the encoding matrix C, which is used to produce a sparse
representation of the data in a reduced dimensionality.
The vector $\tilde{e}$ is a sparse representation of the image in a
reduced dimensionality, and k-NN can be performed in the
lower dimensional space. To do so, we encode the dataset in
this reduced space, with
$$E = C D \qquad (4)$$
Dot products in this reduced space will then remain very
close to the true dot product, with $E^\top \tilde{e} \approx D^\top \tilde{d}$. Without
dimensionality reduction, where $N_C = N$, these dot products
would be exact. By choosing $N_C < N$, we (1) lower
the computational cost of the nearest neighbor search with
minimal accuracy loss and (2) obtain a sparse lower dimen-
sional encoding of the search key that may be efficiently
transferred to Pohoiki Springs as spikes.
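The encoding pipeline of Eqs. (2)-(4) can be sketched in NumPy as follows. Two assumptions are made loudly here: the data are synthetic, and the ICA mixing matrix is replaced by a random orthogonal matrix as a stand-in, since the sketch only aims to show the shapes involved and the dot-product approximation, not the sparsifying effect of fast ICA.

```python
import numpy as np

def pca_components(D_subset, n_components):
    """Top principal directions of a subset of the data.

    D_subset : (N, M_sub) mean-centered data, one pattern per column
    Returns V_PCA with shape (n_components, N).
    """
    # Rows of Vt are orthonormal directions in the N-dimensional input space.
    _, _, Vt = np.linalg.svd(D_subset.T, full_matrices=False)
    return Vt[:n_components]

rng = np.random.default_rng(0)
N, M, N_C = 64, 500, 16

# Synthetic low-rank data plus a little noise (illustrative only).
D = rng.standard_normal((N, 8)) @ rng.standard_normal((8, M))
D += 0.01 * rng.standard_normal((N, M))
D -= D.mean(axis=1, keepdims=True)               # mean-center
D /= np.linalg.norm(D, axis=0, keepdims=True)    # normalize each pattern

V_pca = pca_components(D[:, :200], N_C)          # computed on a subset

# ASSUMPTION: random orthogonal stand-in for the ICA mixing matrix M_ICA.
M_ica, _ = np.linalg.qr(rng.standard_normal((N_C, N_C)))

C = M_ica @ V_pca            # Eq. (2), shape (N_C, N)
d_key = D[:, 0]
e_key = C @ d_key            # Eq. (3)
E = C @ D                    # Eq. (4), shape (N_C, M)

err = np.max(np.abs(E.T @ e_key - D.T @ d_key))
print(f"max dot-product error after reduction: {err:.2e}")
```

Because the rows of C are orthonormal in this sketch, the reduced dot products equal the dot products of the data projected onto the retained subspace, so the error is governed by the energy in the discarded components.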
3.2 k-NN with spiking neurons
Our neuromorphic algorithm computes k-NN classification
with a single layer of integrate-and-fire neurons, where
each neuron’s membrane voltage represents the match of
a particular data point with the search key. The synaptic
weights feeding into a neuron encode the stored data point,
and the spike timing of presynaptic spikes represents a search
key. By pruning small components from the sparse search
key representation $\tilde{e}$, we reduce the amount of information
that has to be communicated by spikes without significantly
degrading the accuracy.
To represent search keys with spike timing, we adopt
previous approaches of spike time latency codes, in which
earlier spikes represent larger magnitudes [12, 13]. To repre-
sent negative amplitudes, the number of inputs is double
the dimension of the vector $\tilde{e}$. Negative amplitudes are
turned into positive amplitudes and represented as dual
components in the second half of the input vector. Thus
large positive and negative amplitudes in the coefficients
will both result in early spikes. The inputs therefore have
antagonistic receptive fields, like ‘on-cells’ and ‘off-cells’
seen in neuroscience.
A search key $\tilde{e} \in \mathbb{R}^{N_C}$ is represented by a spike pattern
$\tilde{s}(t) \in \mathbb{R}^{2N_C}$ within the input window $T$. In this demonstration
the window length is $T = 60$ timesteps,
$$\tilde{s}_i(t) = \begin{cases}
\delta\left(t - T\,(1 - \tilde{e}_i/\tilde{e}_{max})\right) & \text{if } \tilde{e}_i > \theta_e \\
\delta\left(t - T\,(1 + \tilde{e}_{(i-N_C)}/\tilde{e}_{max})\right) & \text{if } -\tilde{e}_{(i-N_C)} > \theta_e \\
0 & \text{otherwise}
\end{cases} \qquad (5)$$
where $\tilde{e}_{max} = \max |\tilde{e}_i|$ and $\delta(t)$ denotes the Kronecker delta
function over discrete timesteps $t \in [0, T]$. Note that the
neurons $1, \ldots, N_C$ encode positive $\tilde{e}_i$, while the neurons
$N_C+1, \ldots, 2 N_C$ encode negative $\tilde{e}_i$. The larger the absolute
value of $\tilde{e}_i$, the earlier the corresponding spike. Pruning of
small components is implemented by the threshold variable
$\theta_e > 0$. Components with absolute values $|\tilde{e}|/\tilde{e}_{max}$ smaller
than the threshold are dropped. This reduces spike traffic on
the hardware, but of course not all information of the input
vector is transmitted. The setting used here, $\theta_e = 0.1$, typically
removes about one quarter to one third of spikes.
Fig. 3. Computation of dot product with temporal coding.
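The latency code of Eq. (5) can be illustrated with a short NumPy function. This is a sketch under two assumptions that the text does not fix: the threshold is applied to the normalized magnitude (as in the prose description of pruning), and spike times are rounded to the nearest integer timestep. The function name and the convention of marking silent inputs with -1 are illustrative.

```python
import numpy as np

def encode_spike_times(e_key, T=60, theta_e=0.1):
    """Latency-code a search key into 2*N_C input channels, following Eq. (5).

    Returns an integer array of length 2*N_C holding the spike time of each
    input channel, or -1 for channels that stay silent.
    """
    e_key = np.asarray(e_key, dtype=float)
    n_c = e_key.size
    times = np.full(2 * n_c, -1, dtype=int)
    e_max = np.max(np.abs(e_key))
    if e_max == 0:
        return times
    rel = e_key / e_max
    pos = rel > theta_e            # "on" channels: positive components
    neg = -rel > theta_e           # "off" channels: negative components
    # ASSUMPTION: spike times are rounded to the nearest timestep.
    times[:n_c][pos] = np.rint(T * (1.0 - rel[pos])).astype(int)
    times[n_c:][neg] = np.rint(T * (1.0 + rel[neg])).astype(int)
    return times

spikes = encode_spike_times([1.0, -0.5, 0.05, 0.0])
print(spikes)
# [ 0 -1 -1 -1 -1 30 -1 -1]
```

In the example, the largest component spikes at timestep 0, the negative component of half that magnitude spikes on its dual channel at timestep 30, and the two small components are pruned.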
The synaptic weight matrix is a concatenation of the
preprocessed data matrix $E$ and a sign-inverted version of
it: $W = [E, -E] \in \mathbb{R}^{2N_C \times M_s}$.
Each input spike is broadcast to all pattern match neu-
rons where it is weighted by the synaptic strength and
integrated to each neuron’s synaptic current. The postsy-
naptic currents are again integrated in a standard integrate-
and-fire neuron. To perform these computations, the Loihi
chip is configured to implement neurons with the following
discrete time dynamics:
$$U_i(t+1) = U_i(t) + \sum_j W_{ij} \tilde{s}_j(t) \qquad (6)$$
$$V_i(t+1) = V_i(t) + U_i(t) \qquad (7)$$
Ui and Vi represent the synaptic current and voltage in
neuron i. When the voltage crosses threshold θV , the neuron
emits a spike and Vi is reset to 0. A long refractory period
prevents pattern match neurons from spiking more than
once.
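The dynamics of Eqs. (6) and (7) can be simulated directly. The sketch below is a plain NumPy model, not Loihi code; it assumes that a neuron fires when its voltage reaches or exceeds the threshold and models the long refractory period by allowing each neuron to fire only once. The function name and array layout are illustrative.

```python
import numpy as np

def simulate_if_layer(W, spike_times, n_steps, v_thresh):
    """Simulate one layer of integrate-and-fire pattern match neurons.

    W           : (n_inputs, n_neurons) synaptic weights
    spike_times : (n_inputs,) integer spike time per input, -1 for silent
    Returns the first-spike time of each neuron (-1 if it never fired).
    """
    spike_times = np.asarray(spike_times)
    n_neurons = W.shape[1]
    U = np.zeros(n_neurons)                  # synaptic currents
    V = np.zeros(n_neurons)                  # membrane voltages
    first_spike = np.full(n_neurons, -1, dtype=int)
    for t in range(n_steps):
        s = (spike_times == t).astype(float)       # input spikes at step t
        # Both updates use the values from step t: Eq. (6) and Eq. (7).
        U, V = U + s @ W, V + U
        fired = (V >= v_thresh) & (first_spike < 0)
        first_spike[fired] = t + 1
        V[fired] = 0.0                             # reset; no second spike
    return first_spike

W = np.array([[2.0, 1.0],
              [1.0, 3.0]])
out = simulate_if_layer(W, spike_times=[0, 2], n_steps=5, v_thresh=6.5)
print(out)
# [4 5]
```

In this toy case, neuron 0 receives its large weight from the earlier input spike and therefore reaches threshold one step before neuron 1.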
The spike encoding of the search key (5) together with
synaptic multiplication and neuronal integrate-and-fire dy-
namics lead to a temporal code of the output spikes that
reflects the order of the dot products between search key
and the data points, $E^\top \tilde{e}$. Note that the area under the
curve of a neuron’s synaptic current, (6), can be written
as $A_i = \sum_{t=0}^{T} \sum_j W_{ij} \tilde{s}_j = \sum_j E_{ij} \tilde{e}_j / \tilde{e}_{max}$, which is proportional
to the dot product. The quantity $A_i$ is computed
by the integration of the current in the voltage variable
(7) (Fig. 3, right). The temporal order of output spikes,
generated when the voltages exceed threshold θV , reflects
the approximate order of matches in the search. Thus, the
detection of the first k output spikes implements k-NN
classification. Neurons that are too weakly activated by the
search key to surpass the threshold θV do not spike at all.
They represent weak matches that are excluded from even
being ranked.
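A short worked derivation may help connect Eqs. (5)-(7) to the dot product. It assumes that spike times are integers (no rounding error), that no neuron has reached threshold within the window, and it writes $a_j$ for the non-negative amplitude carried by input channel $j$ and $t_j$ for its spike time; this notation is introduced here for illustration only.

$$
\begin{aligned}
t_j &= T\,(1 - a_j/\tilde{e}_{max}) && \text{spike time of an active input, Eq. (5)} \\
U_i(t) &= \sum_j W_{ij}\,[\,t > t_j\,] && \text{each spike adds a step to the current, Eq. (6)} \\
V_i(T+1) &= \sum_{t=0}^{T} U_i(t) = \sum_j W_{ij}\,(T - t_j) && \text{the voltage accumulates the current, Eq. (7)} \\
&= \frac{T}{\tilde{e}_{max}} \sum_j W_{ij}\,a_j
\end{aligned}
$$

With $W = [E, -E]$ and negative components carried on the dual channels, the final sum equals the dot product between the stored pattern and the pruned search key, so the voltage at the end of the window is proportional to the match value up to the components removed by $\theta_e$.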
The approach presents tradeoffs between computation
precision, energy consumption, and time, which can be
adjusted to a particular problem by parameter settings:
• Threshold θe governs the trade-off between sparsity
in input spike patterns and representing components
of the search key $\tilde{e}$ with small absolute values.
• Threshold θV governs the average integration time
and thereby the precision of the returned match list.
Thus, raising the threshold increases precision at the
cost of compute time.
• Length of input window T determines the discretiza-
tion error in the representation of the search key
components represented by spike times.
• Synaptic resolution determines the discretization er-
ror in the representation of the stored data points.
• Threshold θW for synaptic pruning.
The adjustments of certain parameters should be coor-
dinated for achieving best performance at minimal resource
use: Integration window and synaptic resolution determine
the resolutions for search key and data points, respectively.
It is reasonable to choose similar resolutions for search
key and data points. Similarly, input threshold and synap-
tic pruning threshold should be adjusted to similar cut-
off levels. The dot product is computed most precisely in
neurons that spike exactly at the end of the input window,
so the spike threshold should be tuned jointly with the input
window.
4 IMPLEMENTATION
To map nearest neighbor search to Pohoiki Springs, we take
a modular approach. Subsets of the data are stored on indi-
vidual Loihi chips, and the full database is distributed across
the 768 chip mesh. The module defines the architecture
for one chip, e.g. it sets neuron model parameters, sets the
weights, and instantiates the neurons used to broadcast the
input keys.
To execute a search, the input query vector is converted
into a temporal spike code, and a network of routing neu-
rons distributes the query spikes throughout the mesh. The
similarity comparison is computed by a layer of output
neurons that integrate the contribution of the input spikes.
The spike times of the output neurons are detected by spike
counters in the x86 processors embedded in each Loihi chip,
which send the results back to the hosts and super host over
message-passing channels. Finally, the super host merges
and filters the top k matches based on the timestamps
attached to the returned messages.
4.1 Single chip nearest neighbor search
The k-NN module for a single chip consists of 1,000 spiking
inputs and 2,400 spiking output neurons, with almost full
connectivity. A subset of $M_s$ = 2,400 patterns will be stored
on each chip in the weights between input and output
neurons. The data to store is represented by the matrix $D_s$,
with dimension $N \times M_s$. This data is encoded by the matrix
$E_s = C D_s$, as in (4).
The input weight matrix $W_s = [E_s, -E_s]$ is encoded
in the same manner as the input search coefficient vector,
and is correspondingly doubled in length to match the
negative components. The weights $W_s$ are rescaled to the
range [−128, 127], rounded to integer values and stored as
synaptic weights on a chip. We tuned the system such that
the best matches would produce spikes around timestep 60.
The relationship between the timing of the output spikes
from a single chip query and the dot product is visualized
in Fig. 5.
Fig. 4. In the programming phase, image data is encoded and stored on
individual Loihi chips. The process is repeated for each chip.
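The rescaling and rounding of the weights can be sketched as follows. The text states only the target range and the rounding to integers; the choice of a single scale factor per chip derived from the largest absolute weight is an assumption of this sketch, as is the function name.

```python
import numpy as np

def quantize_weights(W_s):
    """Rescale weights to the signed 8-bit range and round to integers.

    ASSUMPTION: one scale factor per weight matrix, chosen so that the
    largest absolute weight maps to 127.
    """
    W_s = np.asarray(W_s, dtype=float)
    w_max = np.max(np.abs(W_s))
    scale = 127.0 / w_max if w_max > 0 else 1.0
    W_q = np.clip(np.rint(W_s * scale), -128, 127).astype(np.int8)
    return W_q, scale

W_q, scale = quantize_weights([[1.0, -0.25],
                               [0.1, 0.0]])
print(W_q)
# [[127 -32]
#  [ 13   0]]
```

The rounding step is the source of the discretization error in the stored data points that the synaptic resolution parameter controls.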
Fig. 5. The normalized spike times are compared to the empirical match
value.
4.2 Distributing the search over many Loihi chips
The full pipeline of the execution phase is illustrated in Fig.
6. Given a query vector $\tilde{d}$, the super host CPU computes
$\tilde{e} = C \tilde{d}$ and $\tilde{s}(t)$, which is then sent to the Pohoiki Springs
host CPUs for further distribution into the Loihi mesh as a
list of spike indices and spike times. Each host sends $\tilde{s}(t)$
to the first x86 processor embedded in its column of 256
Loihi chips (Fig. 6, K). The embedded processor K injects
the spikes into a network of routing neurons (Fig. 6, R) that
distribute the spikes to the routing neurons in neighboring
Loihi chips. The neighboring chips in turn further route
these spikes to their neighbors, propagating throughout the
column as a wavefront of query spikes, advancing one layer
of chips per timestep.
The routing neurons in each chip also project to a local
population of integrate-and-fire neurons implementing that
chip’s similarity calculations over its stored patterns. The
spikes activate each subset Ws of the matrix W in parallel
on different chips throughout the mesh, influencing the tem-
poral integration of all pattern match neurons appropriately
as illustrated in Fig. 3 (Fig. 6, MVP).
4.3 Detecting and aggregating match results
As the pattern match neurons integrate to threshold, sig-
nifying close matches, they send spikes to hardware spike
counters contained within their local embedded x86 processors
(Fig. 6, M) for match aggregation. On each timestep,
the processors detect candidate pattern matches by nonzero
counter values and send these as messages to their neighboring
processors in the same propagating wavefront manner
as for the query distribution.
Fig. 6. Nearest neighbor search on Pohoiki Springs.
Each processor in the wavefront sequence asyn-
chronously aggregates all of the results it receives before
communicating the results onwards to the next processor.
Any time a processor has sent k results, it will stop sending
messages and will communicate to the processors before it
that they should also stop sending results. The final root
processor on the last Loihi chip sends the fully aggregated
result back to the host and super host as soon as it has k
results to send.
The output message from the last processor is an ordered
list of the first k matches (or more than k if there are ties).
The ordering directly reflects the order in which matches
were found and does not require the host and super host
CPUs to do any sorting. Each match also includes the
timestep on which it was found, which can be used to
identify and break ties for greater recall accuracy.
The super host is responsible for aggregating the final k
results by merging and filtering the three ordered lists of k
matches it receives from the Arria 10 ARM hosts, a negligi-
ble extra computation. Although our search implementation
seamlessly scales up to the entire 768-chip mesh, controlled
by a single host, the gain in extra I/O bandwidth from par-
titioning the Loihi mesh into three host columns outweighs
the cost of merging three sequences of k matches to one.
In fact, due to the highly asymmetric dimensions of each
column of Loihi chips (4×64), the barrier synchronization
time per column of 14.3µs is only marginally faster than the
barrier synchronization time across all 768 (12×64) chips,
16.2µs.
For latency-optimized searches with coarse temporal dis-
cretization of the input window, typically the final timestep
will include on the order of k tied entries. For best possible
recall accuracy, the super host can perform a final k′-NN
search over the final tied entries, where k′ < k is the number
needed to complete a full set of k nearest neighbors. Since
the number of ties to search is orders of magnitude smaller
than the size M of the full dataset, this extra postprocessing
step adds a small additional latency to the query, which is
included in the results that follow.
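The super host's merge and tie-breaking steps can be sketched in a few lines of Python. This is an illustration under assumptions: each column is taken to return a list of (timestep, pattern id) pairs already ordered by timestep, ties are resolved by exact dot products in the encoded space, and the function names are hypothetical.

```python
import heapq
import numpy as np

def merge_columns(column_lists, k):
    """Merge per-column match lists that are already ordered by timestep.

    Returns (settled, tied): matches strictly earlier than the timestep of
    the k-th match, and all matches sharing that final timestep.
    """
    merged = list(heapq.merge(*column_lists, key=lambda m: m[0]))
    if len(merged) <= k:
        return merged, []
    cutoff = merged[k - 1][0]
    settled = [m for m in merged if m[0] < cutoff]
    tied = [m for m in merged if m[0] == cutoff]
    return settled, tied

def break_ties(settled, tied, k, E, e_key):
    """Fill the remaining k' slots with the best tied entries."""
    ids = [pid for _, pid in settled]
    k_prime = k - len(ids)
    if k_prime <= 0 or not tied:
        return ids[:k]
    tied_ids = np.array([pid for _, pid in tied])
    scores = E[:, tied_ids].T @ e_key        # exact match values
    best = tied_ids[np.argsort(-scores)[:k_prime]]
    return ids + best.tolist()

# Three columns report (timestep, pattern id) pairs.
columns = [[(61, 5), (63, 2)], [(62, 7), (63, 9)], [(63, 4)]]
settled, tied = merge_columns(columns, k=3)

E = np.zeros((1, 10))
E[0, [2, 4, 9]] = [0.1, 0.9, 0.5]            # toy encoded patterns
top_k = break_ties(settled, tied, 3, E, np.array([1.0]))
print(top_k)
# [5, 7, 4]
```

Only the entries in the final timestep are rescored, so the extra work scales with the number of ties rather than with the dataset size.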
5 EXPERIMENTAL RESULTS
For benchmarking, we follow the procedures described in
Aumüller et al. [8]. Additionally, we measure and estimate
power consumption in order to compare energy expendi-
tures between different implementations. The ground truth
is based on the normalized dot product (cosine distance) as
computed on a CPU. We validate the algorithm on the Tiny
Images dataset [14], as well as GIST-960 [9] and GloVe [10].
5.1 Performance evaluation
Our first experiment measures the recall performance of k-
NN, that is, how well the algorithm returns the same results
as the ground truth. In Fig. 7, left, we show the results of
searching Tiny Images datasets of varying sizes from 76,800
to one million. For the k = 1 case, we classify an input that is
randomly chosen from the dataset and pixel-wise corrupted
with Gaussian noise (as in Fig. 6). For the other cases, we
query the dataset with an input that was excluded from the
dataset. Recall is calculated as the fraction of the returned k
data points that are no further from the search key than any
data point in the ground truth top-k set (Fig. 7, left).
For three one-million pattern datasets, we also evaluated
the $(1+\epsilon)$-approximate recall performance, which defines
an expanded window in which the top k = 100 nearest
neighbors may be found. Fig. 7, right, shows approximate
recall as a function of $\epsilon$ for Tiny Images, GIST-960 and GloVe
datasets.
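The two recall measures can be sketched as one function. The version below is written in terms of distances and assumes the common convention that a returned point counts as correct if its distance is at most $(1+\epsilon)$ times the distance of the k-th true neighbor; with $\epsilon = 0$ it reduces to the recall definition given above. The function name and inputs are illustrative.

```python
import numpy as np

def approximate_recall(returned_dists, true_dists, k, eps=0.0):
    """Fraction of the k returned points within (1 + eps) of the k-th true distance.

    returned_dists : distances from the key to the k returned points
    true_dists     : distances from the key to every point in the database
    """
    kth_true = np.sort(np.asarray(true_dists, dtype=float))[k - 1]
    returned = np.asarray(returned_dists, dtype=float)[:k]
    return float(np.sum(returned <= (1.0 + eps) * kth_true)) / k

true_dists = [0.1, 0.2, 0.4, 0.8]
exact = approximate_recall([0.1, 0.25], true_dists, k=2)
relaxed = approximate_recall([0.1, 0.25], true_dists, k=2, eps=0.5)
print(exact, relaxed)
# 0.5 1.0
```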
We characterize the system’s query latency over the
range of times that Pohoiki Springs responds with the first
and kth match. Since the neuromorphic algorithm identifies
solutions in a temporally ordered manner, the closest match
is always found before the last match, and this latency
spread increases with increasing dataset size (Fig. 8, right).
Fig. 7. Recall performance.
Fig. 8. Search timing and latency measurements on GIST-960.
Depending on spike traffic, each barrier synchronized
timestep of the computation can have a different duration
(Fig. 8, left). In the absence of excessive spike activity, the
system typically sustains just over 13µs per timestep for the
1M-pattern dataset workload and 5.8µs per timestep when
processing 76,800-pattern datasets. However, slowdowns
are observed during other periods. The first slowdown is
noticeable near the end of the input window when many of
the smaller coefficients above threshold are communicated
as spikes, which have to be routed throughout the mesh.
Output spikes begin to arrive near timestep 80 indicating
nearest neighbors, slowing down the system. More time is
needed to collect output spikes for larger k. Interestingly,
the observed slowdowns are due to the load on Loihi’s
embedded x86 cores related to processing the incoming
and outgoing spikes, not as a result of congestion in the
neuromorphic mesh interconnect or cores.
5.2 Power and energy
The total power of Pohoiki Springs, including power sup-
plies, FPGAs, ARM hosts, ATX motherboard, and Ethernet
switch, is measured at the plug while running queries at
maximum throughput. Estimates of the different power
components for the Loihi chips are obtained by extrapolat-
ing measurements on an instrumented board containing 32
Loihi chips and running 76,800-pattern search queries.
Table 1 provides a breakdown of the Loihi mesh power
consumption for a variety of sustained query workloads.
Static power is due to leakage when all circuits are fully
powered. Almost all leakage can be attributed to the neuromorphic
cores, which dominate chip area. The x86 power is
dynamic power consumed by the x86 processors, approximately
90% of which is idle power. Neuro power is dynamic
power attributed to the neuromorphic cores.
TABLE 1
Power breakdown per query.
k      Size      Static    x86       Neuro
1      76,800    3.34 W    2.09 W    1.56 W
10     76,800    3.34 W    2.10 W    1.80 W
100    76,800    3.34 W    2.14 W    1.58 W
1      1M        53.4 W    32.2 W    16.2 W
10     1M        53.4 W    31.7 W    17.1 W
100    1M        53.4 W    31.2 W    10.8 W
Pohoiki wall power    1M    258 W
CPU1 TDP              *     140 W
TABLE 2
Energy breakdown per query (mJ).
k      Size      Static    Reset    x86     Neuro    Total    CPU1
1      76,800    2.34      0.19     2.75    1.33     6.59
10     76,800    2.57      0.19     2.90    1.67     7.31
100    76,800    3.14      0.19     3.21    1.72     8.25
1      1M        102       3.06     61.6    31.0     198      8,978†
10     1M        119       3.06     70.4    38.0     230      9,380†
100    1M        187       3.06     109     37.9     338      9,648†
†Does not include system DRAM energy.
Table 2 further breaks down the energy consumption of a
single search query. A reset phase occurs after each query to
prepare the system for the next query. Total dynamic energy
therefore includes both the energy required to reset and
the energy required to query. The query dynamic energy
is further broken down into x86 and Neuro components
by isolating the embedded x86 processor workload and
measuring it separately.
For extrapolation to the 1M-pattern workload, static
power and x86 idle power are assumed to remain constant
per chip. Neuro and x86 dynamic energy per chip (in excess
of idle activity) are assumed to scale linearly with the
number of Loihi timesteps. Reset energy per chip is constant
for every query.
Table 2 also provides the approximate energy that a Core
i9-7920X CPU1 requires to perform the same $E^\top \tilde{e}$ matrix-vector
product k-NN search that Pohoiki Springs computes
with spiking neurons. The energy is estimated based on
the measured runtime of a NumPy float32 implementa-
tion (non-batched) multiplied by the CPU’s thermal design
power.
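Since the CPU energy is defined as runtime multiplied by thermal design power, the runtime implied by Table 2 can be recovered by dividing by the 140 W TDP from Table 1. The sketch below does this for the 1M-pattern rows and also forms the ratio to the Loihi mesh total. These are derived figures, not measurements reported in the text, and the ratio compares a TDP-based CPU estimate (without DRAM) against the mesh total rather than wall power.

```python
CPU_TDP_W = 140.0                                        # Table 1
cpu_energy_mJ = {1: 8978.0, 10: 9380.0, 100: 9648.0}     # Table 2, 1M patterns
mesh_energy_mJ = {1: 198.0, 10: 230.0, 100: 338.0}       # Table 2, Total column

implied_runtime_ms = {}
energy_ratio = {}
for k in (1, 10, 100):
    implied_runtime_ms[k] = cpu_energy_mJ[k] / CPU_TDP_W   # mJ / W = ms
    energy_ratio[k] = cpu_energy_mJ[k] / mesh_energy_mJ[k]
    print(f"k={k}: implied CPU runtime {implied_runtime_ms[k]:.1f} ms, "
          f"energy ratio {energy_ratio[k]:.0f}x")
# k=1: implied CPU runtime 64.1 ms, energy ratio 45x
# k=10: implied CPU runtime 67.0 ms, energy ratio 41x
# k=100: implied CPU runtime 68.9 ms, energy ratio 29x
```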
5.3 Dataset processing and programming
Before searches may be executed, a given dataset must be
processed and Pohoiki Springs must be configured. This
happens over a series of three steps: (1) a dataset prepro-
cessing step to compute the encoding matrix C, (2) an index
build step to compute the weights to be programmed into
Pohoiki Springs, and (3) a programming step that writes all
computed weights to the Loihi mesh.
Dataset preprocessing entails computing PCA and ICA
on a subset of the dataset. This step optimizes the data
encoding for the Pohoiki Springs algorithm and only needs
to be computed once per class of data. For data with a few
thousand dimensions such as images, typically a subset of
10 thousand or more is needed. Here, we use 20,000 samples
for computing the encoding matrix. It takes 68, 71, and 132
seconds to compute the C matrices for GloVe, GIST-960
and Tiny Images, respectively, using an Intel Core i9-7920X
CPU1.
The index build step involves transforming the given
dataset by the encoding matrix C, i.e. computing (4), and
writing the resulting weight submatrices Ws for each Loihi
chip to disk. This is implemented as a batched NumPy com-
putation for each chip’s subset of the dataset. For compari-
son to conventional k-NN implementations, we measure the
time required to generate a single chip’s weights and scale
the time to the size of the dataset.
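The index build step can be sketched as a batched NumPy computation over per-chip slices of the dataset. This is a minimal illustration: the generator name is hypothetical, the 2,400 patterns per chip come from Section 4.1, and writing to disk and weight quantization are omitted.

```python
import numpy as np

def build_chip_weights(C, D, patterns_per_chip=2400):
    """Yield (chip_index, W_s) for each chip's slice of the dataset.

    C : (N_C, N) encoding matrix
    D : (N, M) dataset, one pattern per column
    W_s stacks E_s = C @ D_s and its sign-inverted copy: shape (2*N_C, M_s).
    """
    M = D.shape[1]
    for chip, start in enumerate(range(0, M, patterns_per_chip)):
        E_s = C @ D[:, start:start + patterns_per_chip]    # Eq. (4) per chip
        yield chip, np.concatenate([E_s, -E_s], axis=0)

rng = np.random.default_rng(0)
C = rng.standard_normal((4, 8))
D = rng.standard_normal((8, 5000))
chunks = list(build_chip_weights(C, D))
print([(chip, W_s.shape) for chip, W_s in chunks])
# [(0, (8, 2400)), (1, (8, 2400)), (2, (8, 200))]
```

Each slice is independent of the others, which is why the build time scales linearly with dataset size and can be estimated from a single chip's weights.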
In the final programming phase, the encoded dataset
weights are loaded and written to each chip in the mesh
along with all other routing tables and register values re-
quired to configure the k-NN application. This is a very slow
step due to the current unoptimized state of the Pohoiki
Springs I/O subsystem. The programming time for a 192-
chip column was measured to be 893 seconds, or about
4.6 seconds per Loihi chip. Incrementally adding additional
data points to the system requires on the order of 1ms to
encode and program.
5.4 Comparison to state-of-the-art
In Table 3, we compare the system’s performance results on
GIST-960 to state-of-the-art k-NN implementations, Annoy
[15], Inverted file with exact post-verification (IVF) [16], and
Hierarchical Navigable Small World Graph (HNSW) [17, 18].
Comparison results were taken from Aumüller et al. [8].
Note that our algorithm computes angular distance (cosine
similarity) and uses angular distance as ground-truth, while
the conventional implementations operate on Euclidean dis-
tances. As a baseline reference point, we also compare per-
formance results to our PCA/ICA-compressed brute force
algorithm executed on an Intel i9-7920X CPU1.
The Pohoiki Springs query latency includes 220µs of pre-
processing on the CPU to compute the ICA-transformed key
$\tilde{e}$, Eq. (3), and 300µs of CPU postprocessing to exhaustively
break ties in the final timestep. These extra times contribute
to the search latency but do not affect throughput since
they can be computed concurrently with unrelated Pohoiki
Springs queries. Conversely, search throughput is degraded
by the reset time of 230µs on Pohoiki Springs that falls off
the latency critical path.
Our results show that neuromorphic k-NN classification
achieves comparable recall accuracy to the other algorithms,
reporting 77-97% of the true top k results, with 3-4x better
search latency and throughput than Annoy and IVF.
Our algorithm is also favorable in its simplicity, which
supports a fast index build time and the smallest memory
footprint (Table 3, Index size). Hence, while the highly
query-optimized HNSW algorithm outperforms Pohoiki
Springs in search speed by about 2x, it vastly underperforms
it in index build time. Further, because the Pohoiki Springs
implementation organizes its index as a simple distributed
array of data points, encoded by dense network weights,
inserting a new point online during execution is an O(1)
operation that requires negligible time (feasibly under 1ms).
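The constant-time insertion property can be illustrated with a toy in-memory model of the index. The class below is a stand-in for the distributed weight array, not the Loihi programming interface; its name and methods are hypothetical. The point it shows is that adding a pattern costs one encoding step and one column write, independent of how many patterns are already stored.

```python
import numpy as np

class PatternStore:
    """Toy model of the index: a preallocated array of encoded patterns."""

    def __init__(self, C, capacity):
        self.C = np.asarray(C, dtype=float)            # encoding matrix
        self.E = np.zeros((self.C.shape[0], capacity))
        self.count = 0

    def insert(self, d_new):
        """Encode one pattern and write it to the next free slot."""
        if self.count >= self.E.shape[1]:
            raise RuntimeError("no free pattern slot")
        # Cost depends on N and N_C only, not on self.count.
        self.E[:, self.count] = self.C @ np.asarray(d_new, dtype=float)
        self.count += 1
        return self.count - 1

    def nearest(self, d_key):
        """Index of the stored pattern with the largest match value."""
        e_key = self.C @ np.asarray(d_key, dtype=float)
        return int(np.argmax(self.E[:, :self.count].T @ e_key))

store = PatternStore(np.eye(3), capacity=4)
store.insert([1.0, 0.0, 0.0])
store.insert([0.0, 1.0, 0.0])
hit = store.nearest([0.1, 0.9, 0.0])
print(hit)
# 1
```

Tree- and graph-based indexes, by contrast, generally have to update their search structure on insertion, which is the contrast the text draws.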
6 DISCUSSION
Fundamentally, the computation of (1) and the subsequent
top-k search can be parallelized to a very fine level of granu-
larity. This property is difficult to exploit with conventional
architectures because communication and processor over-
head come to dominate at high levels of concurrency. The
Pohoiki Springs neuromorphic architecture supports sparse
spiking representations and low-overhead barrier synchro-
nization, and these features can be harnessed to provide
a finely parallelized implementation of k-NN classification
that is fast, scalable, and energy efficient.
6.1 Neuromorphic algorithm and data encoding
Here we propose a simple neuromorphic implementation
of k-NN classification using a layer of conventional spiking
neurons, each one receiving inputs through synapses that
represent a data point by the synaptic strength. The algo-
rithmic innovation lies in how the input to these neurons
is encoded and combined with the synaptic and neural
dynamics to produce the desired computation with minimal
spike traffic. We use a latency code in which larger ampli-
tudes come early and spikes representing small amplitudes
are suppressed. With this input encoding and leak-less inte-
gration, the resulting membrane voltages exactly represent
the matches (dot products) at the end of the input window.
However, the computation becomes approximate due to
discretization and the translation of the membrane voltages
into output spikes based on a fixed chosen threshold.
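The relation between the latency code and the dot product can be illustrated with a minimal sketch. It assumes a particular leak-less neuron model that is not spelled out in this section: each input spike switches on a constant current equal to its synaptic weight, and the membrane voltage accumulates that current on every timestep. The helper names, the [0, 1] amplitude range, and the integer weights are illustrative choices.

```python
import numpy as np

T = 60  # integration window in timesteps, as in the text


def latency_encode(x, T=T):
    """Quantize amplitudes in [0, 1] and map them to spike times.

    Returns (spike_times, q), where q is the quantized amplitude (0..T) and
    spike_times[i] = T - q[i], or -1 where q[i] == 0 (spike suppressed).
    """
    q = np.rint(np.clip(x, 0.0, 1.0) * T).astype(np.int64)
    spike_times = np.where(q > 0, T - q, -1)   # large amplitude -> early spike
    return spike_times, q


def integrate(weights, spike_times, T=T):
    """Leak-less integration: an input spike adds its weight to the current."""
    current = 0
    voltage = 0
    for t in range(T):
        current += int(weights[spike_times == t].sum())
        voltage += current                     # no leak: voltage only accumulates
    return voltage


rng = np.random.default_rng(0)
x = rng.random(16) * (rng.random(16) < 0.3)    # sparse, non-negative query
w = rng.integers(0, 8, size=16)                # one stored pattern as weights
times, q = latency_encode(x)
v_end = integrate(w, times)
assert v_end == int(w @ q)                     # voltage at end of window = dot product
```

A spike at time T - q contributes its weight for exactly q timesteps, so the final voltage equals the dot product of the weights with the quantized input. Components that quantize to zero emit no spike and cost no work.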
Data preprocessing consists of PCA and ICA for
dimensionality reduction and sparse spike encoding. This
step is relatively cheap because the transform only needs to
be computed once, on a representative sample of the data.
It is optional in cases where the data is already sparse with
manageable dimensionality.
The search implementation uses brute-force parallelism
to perform a simple dot product computation, compared
to the complex hashing and search strategies of conven-
tional state-of-the-art nearest neighbor search algorithms.
Such a brute-force neuromorphic implementation achieves
efficiency at scale on Pohoiki Springs by taking advantage of
the architecture’s fine granularity of distributed, co-located
memory and computing elements in combination with
rapid synchronization of temporally coded and integrated
spike timing patterns.
6.2 Nearest-neighbor search results
Our neuromorphic approximate k-NN implementation on
Pohoiki Springs uniquely optimizes both index build time
and search speed compared to state-of-the-art approximate
nearest neighbor search implementations with equal recall
accuracy. Although batched implementations on both CPUs
and GPUs can boost query throughput well beyond the
levels evaluated here, to 1,000-50,000 queries per second
[16], the latencies of those implementations are at least
100x worse.
TABLE 3
Performance comparison

Algorithm                  Recall  Query latency  Throughput  Index build  Index size  Supports incremental
                                   (ms)           (s⁻¹)       time (s)     (kB)        insertions
Annoy (CPU2)         0.0   0.76    13.7           73.2        638          5,176,296   No
                     0.01  0.97    13.7           73.2
IVF (FAISS) (CPU2)   0.0   0.77    9.64           104         1297         5,153,480   No
                     0.01  0.97    9.64           104
HNSW (nmslib) (CPU2) 0.0   0.78    1.41           710         13,253       9,249,460   No
                     0.01  0.98    1.41           710
PCA/ICA (CPU1)       0.0   1.0     72             13.9        30           1,953,125   Yes
+ batched (x100)     0.0   1.0     2,254          44.4
Pohoiki Springs      0.0   0.77    3.03           366         30           1,953,125   Yes
                     0.01  0.97    3.03           366

All Annoy, IVF, and HNSW numbers come from Aumüller et al. [8], in particular GIST-960 dataset values from https://github.com/erikbern/ann-benchmarks.

Additionally, our neuromorphic implementation supports
adding new patterns to the search dataset in O(1)
complexity, on the timescale of milliseconds. Conventionally,
only brute force k-NN implementations can support
O(1) pattern insertion, which then comes at the cost of
O(M) search latency. For the million-pattern datasets eval-
uated, this difference represents over 20x slower search
speeds. The ability to add to the search database without
interrupting online operation may be highly desirable in
latency-sensitive settings where the database needs to in-
clude points derived from events happening in real time.
Such applications could include algorithmic trading of fi-
nancial assets, security monitoring, and anomaly detection
in general.
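The trade-off on the conventional side can be sketched in a few lines of NumPy. The class below is a hypothetical illustration, not the CPU baseline used in this work: it stores unit-normalized patterns in a preallocated dense array, so an insertion writes a single row regardless of how many patterns are stored, while every search computes one dot product per stored pattern.

```python
import numpy as np


class BruteForceIndex:
    """Dense array of unit-normalized patterns (hypothetical helper class)."""

    def __init__(self, dim, capacity):
        self.data = np.zeros((capacity, dim), dtype=np.float32)
        self.size = 0

    def add(self, x):
        """O(1) in the number of stored patterns: write one row."""
        if self.size == len(self.data):
            raise ValueError("index is full")
        x = np.asarray(x, dtype=np.float32)
        self.data[self.size] = x / np.linalg.norm(x)
        self.size += 1
        return self.size - 1

    def search(self, query, k):
        """O(M): one dot product per stored pattern, then top-k selection."""
        k = min(k, self.size)
        if k == 0:
            return np.empty(0, dtype=np.int64)
        q = np.asarray(query, dtype=np.float32)
        scores = self.data[:self.size] @ (q / np.linalg.norm(q))
        top = np.argpartition(-scores, k - 1)[:k]   # unordered top k
        return top[np.argsort(-scores[top])]        # best match first


index = BruteForceIndex(dim=4, capacity=8)
index.add([1, 0, 0, 0])
index.add([0, 1, 0, 0])
new_id = index.add([1, 1, 0, 0])    # inserted online, no index rebuild
print(index.search([0.9, 1.0, 0, 0], k=1))
```

A newly added pattern is searchable by the very next query because there is no auxiliary index structure to rebuild.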
The neuromorphic approach described also allows for
simple adjustments to trade off performance in accuracy for
improvements in latency. These adjustments can be made
dynamically without requiring hours of index re-building
[18]. By stretching or compressing the encoding of the
input spike times (which is done on the CPU), as well as
adjusting the thresholds of the output neurons, one may
dynamically configure k-NN search with higher resolution
or lower latency, as desired. However, accuracy is limited
due to various sources of noise.
One source of noise in the implementation comes from
discretization error. The exact timing of input and output
spikes is locked to discrete timesteps. With an integration
window of 60 timesteps, the dynamic range of each input
dimension is approximately six bits. Similarly, output spikes
are discretized into timesteps, giving finite resolution.
Stretching the time scale over more timesteps reduces this
discretization noise, at the cost of longer execution time.
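The six-bit figure follows from counting distinguishable spike times. A short calculation, with an illustrative function name, shows how the input resolution scales with the window length:

```python
import math


def input_resolution_bits(window_timesteps):
    """Bits of dynamic range when an amplitude maps to one of T spike times."""
    return math.log2(window_timesteps)


for T in (30, 60, 120):
    print(T, round(input_resolution_bits(T), 2))
```

Each doubling of the window adds one bit of input resolution and doubles the integration time.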
The more significant source of error is the temporal
coding of output spikes, which is critical for efficiently
identifying the top k matches. In the computation, the de-
sired dot product is exactly proportional to a pattern match
neuron’s membrane voltage only at the end of the input
window. However, in order to search for near matches,
the parameters must be tuned to permit spiking at times
away from the exact end-of-window, thereby introducing
inaccuracies. In general, the thresholds should be tuned so
that the pattern match neurons spike near the end of the
integration window. Threshold re-tuning is easily executed
and can be rapidly broadcast to all cores in the system.
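The role of the threshold can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the implementation used on Pohoiki Springs: it assumes non-negative integer weights, the same constant-current leak-less neuron as a simple model, and an idealized threshold set to the k-th largest exact dot product, which a real system would not know in advance. All names are illustrative.

```python
import numpy as np


def first_spike_times(W, q, threshold, T=60):
    """First timestep at which each neuron crosses threshold (T if never)."""
    M = W.shape[0]
    spike_in = np.where(q > 0, T - q, -1)      # latency code: large -> early
    current = np.zeros(M, dtype=np.int64)
    voltage = np.zeros(M, dtype=np.int64)
    t_out = np.full(M, T, dtype=np.int64)
    for t in range(T):
        current += W[:, spike_in == t].sum(axis=1)
        voltage += current                     # leak-less integration
        newly = (voltage >= threshold) & (t_out == T)
        t_out[newly] = t
    return t_out


rng = np.random.default_rng(1)
W = rng.integers(0, 8, size=(1000, 32))                      # stored patterns
q = rng.integers(0, 61, size=32) * (rng.random(32) < 0.3)    # sparse quantized query
exact = W @ q                                                # ground-truth matches
k = 10
threshold = np.sort(exact)[-k]                 # idealized tuning (assumption)
t_out = first_spike_times(W, q, threshold)
candidates = np.flatnonzero(t_out < 60)        # neurons that spiked in the window
print(len(candidates), "candidates for k =", k)
```

With non-negative weights the voltage never decreases, so the neurons that spike within the window are exactly those whose final dot product reaches the threshold. Their spike times, however, reflect the voltage trajectory rather than the final value, so ranking candidates by spike time alone is only approximate, and a lower threshold moves spikes earlier and widens that gap.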
The main shortcoming of the demonstrated implementa-
tion is that it only supports a dot product (cosine) distance
metric. Many practical k-NN applications require Euclidean,
Hamming, or other distance metrics. This limits the appli-
cation space for the neuromorphic implementation in its
current form.
6.3 Conclusion
The approximate k-NN classification implementation devel-
oped here exploits some, though certainly not all, of the
fundamental neuromorphic properties of Pohoiki Springs.
First, it exploits fine-grain hardware parallelism with fully
integrated memory and computation. The computation of
the closest k matches is distributed over Pohoiki Springs’
100,000 cores in which the patterns themselves are stored.
Second, the algorithm uses the timing of events to en-
code information and to simplify computation. In this case,
the multiply-accumulations of a conventional matrix-vector
multiply operation are replaced by event-driven weight
lookups and integration over time. Finally, the implemen-
tation intentionally introduces and exploits computational
sparsity. The algorithm transforms the input data repre-
sentations to prefer zero components over nonzero ones,
which the hardware then exploits by implicitly skipping all
computation related to the zeros.
On the other hand, this example does not exploit
many other important neuromorphic properties provided
by Loihi. All weights and network parameters in the system
are precomputed offline and remain static once loaded
into the system. This leaves the plasticity features of the
hardware untouched. The computation is shallow and feed-
forward, with the neuromorphic domain only responsi-
ble for computing a single matrix-vector product. Search
latency and dynamic energy remain dominated by von
Neumann processing, which is not ideal. In general, we
expect greater gains as a greater proportion of the overall ap-
plication falls within the neuromorphic domain, especially
as recurrent feedback loops are introduced to accelerate
convergence and to support pattern superposition. Such en-
hancements, the focus of ongoing work, promise to greatly
boost the network's storage capacity and performance.
Some aspects of the results suffer from a lack of op-
timization at both hardware and software levels, a conse-
quence of the early prototype status of the Pohoiki Springs
system. Full utilization of the system resources would in-
crease pattern capacity by at least 6x. Programming times
could be reduced by well over 10x with optimized software
and I/O infrastructure. Much of the algorithmic latency
and energy is dominated by the relatively trivial ancillary
computation mapped to Loihi's embedded x86 processors,
which were only minimally customized for their role in
neuromorphic interfacing. Loihi itself is research silicon and
factors of performance improvement are feasible with de-
sign optimizations, especially relating to multi-chip scaling.
Nevertheless, the k-NN implementation prototyped here
as the first application to run on Pohoiki Springs compares
favorably to state-of-the-art solutions running on highly
mature and optimized conventional computing systems.
The nearest neighbor search problem is central in a large
variety of applications, and this is just one of a wide space
of algorithms supported by Loihi and Pohoiki Springs. This
suggests a promising future for neuromorphic systems as
the technology is further matured and advanced to produc-
tion standards.
7 METHODS
All CPU performance measurements referenced in this work
were obtained from two systems. The systems, as annotated
in the text, have the following properties:
• CPU1: Intel Core i9-7920X CPU (12 cores,
Hyper-Threading enabled, 2.90 GHz, 16.5 MB cache) with
128 GB RAM. OS: Ubuntu 16.04.6 LTS. Python ver-
sion 3.5.2, NumPy version 1.18.2. Energy measure-
ments were obtained using Intel SoC Watch version
2.7.0 over a duration of 120 seconds with continu-
ously repeating workloads.
• CPU2: Used to obtain the measurements referenced
from [8]. As described in that work, evaluations
were performed in Docker containers on Amazon
EC2 c5.4xlarge instances equipped with Intel Xeon
Platinum 8124M CPU (16 cores, 3.00 GHz, 25 MB
cache) and 32 GB of RAM.
All software run on Pohoiki Springs used a development
version of Intel’s Nx SDK advanced from release 0.9.5.rc1.
With the exception of CPU2 measurements quoted from
[8], all performance results are based on testing as of March
2020 and may not reflect all publicly available security
updates. No product can be absolutely secure.
REFERENCES
[1] J. Von Neumann, The computer and the brain. Yale
University Press, reprint 2012, 1958.
[2] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao,
S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain et al.,
“Loihi: A neuromorphic manycore processor with on-
chip learning,” IEEE Micro, vol. 38, no. 1, pp. 82–99,
2018.
[3] K. He and J. Sun, “Computing nearest-neighbor fields
via propagation-assisted kd-trees,” in 2012 IEEE Confer-
ence on Computer Vision and Pattern Recognition. IEEE,
2012, pp. 111–118.
[4] K. L. Cheung and A. W.-C. Fu, “Enhanced nearest
neighbour search on the r-tree,” ACM SIGMOD Record,
vol. 27, no. 3, pp. 16–21, 1998.
[5] M. S. Charikar, “Similarity Estimation Techniques from
Rounding Algorithms,” STOC, pp. 380–388, 2002.
[6] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and
L. Schmidt, “Practical and Optimal LSH for Angular
Distance,” NIPS, vol. 28, pp. 1–9, 2015.
[7] S. Har-peled, P. Indyk, and R. Motwani, “Approximate
Nearest Neighbor : Towards Removing the Curse of
Dimensionality,” Theory of Computing, vol. 8, pp. 321–
350, 2012.
[8] M. Aumüller, E. Bernhardsson, and A. Faithfull,
“Ann-benchmarks: A benchmarking tool for approximate
nearest neighbor algorithms,” in International Conference
on Similarity Search and Applications. Springer,
2017, pp. 34–49.
[9] H. Jégou, M. Douze, and C. Schmid, “Product
Quantization for Nearest Neighbor Search,” IEEE
Transactions on Pattern Analysis and Machine Intelligence,
vol. 33, no. 1, pp. 117–128, Jan. 2011. [Online].
Available: https://hal.inria.fr/inria-00514462
[10] J. Pennington, R. Socher, and C. D. Manning, “Glove:
Global vectors for word representation,” in Proceedings
of the 2014 conference on empirical methods in natural
language processing (EMNLP), 2014, pp. 1532–1543.
[11] A. Hyvärinen, “Fast and Robust Fixed-Point
Algorithms for Independent Component Analysis,” IEEE
Transactions on Neural Networks, vol. 10, no. 3, pp. 626–
634, 1999.
[12] J. J. Hopfield, “Pattern recognition computation using
action potential timing for stimulus representation,”
Nature, vol. 376, no. 6535, pp. 33–36, 1995.
[13] S. Thorpe, D. Fize, and C. Marlot, “Speed of processing
in the human visual system.” pp. 520–522, 1996.
[14] A. Torralba, R. Fergus, and W. T. Freeman, “80 million
tiny images: A large data set for nonparametric object
and scene recognition,” IEEE transactions on pattern
analysis and machine intelligence, vol. 30, no. 11, pp.
1958–1970, 2008.
[15] E. Bernhardsson, Annoy: Approximate Nearest Neighbors
in C++/Python, 2018, python package version 1.13.0.
[Online]. Available: https://pypi.org/project/annoy/
[16] J. Johnson, M. Douze, and H. Jégou,
“Billion-scale similarity search with gpus,” CoRR, vol.
abs/1702.08734, 2017. [Online]. Available:
http://arxiv.org/abs/1702.08734
[17] L. Boytsov and B. Naidan, “Engineering efficient
and effective non-metric space library,” in Similarity
Search and Applications - 6th International Conference,
SISAP 2013, A Coruña, Spain, October 2-4, 2013,
Proceedings, ser. Lecture Notes in Computer Science,
N. R. Brisaboa, O. Pedreira, and P. Zezula, Eds., vol.
8199. Springer, 2013, pp. 280–293. [Online]. Available:
https://doi.org/10.1007/978-3-642-41062-8_28
[18] Y. A. Malkov and D. A. Yashunin, “Efficient
and robust approximate nearest neighbor search
using hierarchical navigable small world graphs,”
CoRR, vol. abs/1603.09320, 2016. [Online]. Available:
http://arxiv.org/abs/1603.09320
