Using the Automata Processor for Fast Pattern Recognition in High Energy
  Physics Experiments - A Proof of Concept by Wang, Michael H. L. S. et al.
Michael H. L. S. Wang (a,∗), Gustavo Cancelo (a), Christopher Green (a), Deyuan Guo (b), Ke Wang (b), Ted Zmuda (a)
(a) Fermi National Accelerator Laboratory, Batavia, IL 60510, USA
(b) University of Virginia, Charlottesville, VA 22904, USA
Abstract
We explore the Micron Automata Processor (AP) as a suitable commodity technology that can address the growing computational
needs of pattern recognition in High Energy Physics (HEP) experiments. A toy detector model is developed for which an electron
track confirmation trigger based on the Micron AP serves as a test case. Although the Micron AP is primarily meant for high-speed
text-based searches, we demonstrate a proof of concept for its use in a HEP trigger application.
Keywords: Pattern Recognition, Tracking, Trigger, Finite Automata
1. Introduction
Pattern recognition occupies a central role in the reconstruc-
tion and analysis chains of practically all High Energy Physics
(HEP) experiments. The ability to recognize a charged parti-
cle track from a pattern of detector “hits”, for example, is ex-
tremely important because the trajectory of a particle carries
crucial information on its properties and provides a powerful
signature to isolate it from unwanted background. With the
advent of electronic detectors and readout systems nearly 50
years ago [1], tasks such as the manual scanning of tracks were
transformed into computational problems. Such pattern recog-
nition problems have grown more challenging with every new
generation of experiment due to the trend towards more com-
plex event topologies and higher particle densities. To cope
with this trend, offline reconstruction applications have so far
relied on the rough doubling of transistors in many-core archi-
tectures every two years (Moore’s Law), while online applica-
tions with real-time requirements have traditionally relied on
custom hardware solutions. Unfortunately, Moore’s Law be-
comes less dependable as we enter a period of diminishing per-
formance returns where power dissipation issues from leakage
currents dominate as we approach the atomic scale [2]. On the
other hand, custom hardware solutions often entail more techni-
cal risks and require more investments in manpower and capital.
In this paper, we take a different approach by exploring
emerging commercial technologies designed to deal with the
deluge of digital information in today’s data-centered economies.
One such technology is the Micron Automata Processor (AP) [3]
which is specifically targeted at pattern matching applications
like those in the Internet search industry and bioinformatics [4–
6]. It is a direct hardware implementation of Non-deterministic
Finite Automata (NFA) and can simultaneously apply thousands
of rules to find patterns in data streams at a constant rate of 1
Gbps/chip. As a proof of concept to demonstrate its feasibility
for HEP pattern finding applications, we develop a simple toy
detector model typical of those found at modern hadron colliders
and use the Micron AP to implement an electron track
confirmation trigger.

∗Corresponding author
Email address: mwang@fnal.gov (Michael H. L. S. Wang)
2. The Automata Processor
2.1. Hardware Architecture
Because the Micron AP is based on a new and radical, non-
von Neumann architecture, this section will provide a descrip-
tion of the hardware, focusing on aspects relevant to our evalu-
ation.
2.1.1. A Memory-based Design
The Micron AP is derived from conventional SDRAM technology
and its hardware architecture (see Figure 1) is conveniently
understood in terms of a two-dimensional memory array.
Each input byte on an 8-bit wide bus serves as a row address
which is presented to an 8-to-2^8 decoder that selects one
out of 256 possible rows. A column of 256 cells in this array,
together with additional logic for state information, comprises
the basic building block of the AP known as a State Transition
Element (STE). Any cell or combination of cells in an STE can
be programmed to recognize any subset of 2^8 possible values or
symbols. When the address decoder selects a cell in an enabled
column or STE, which is programmed to recognize its associ-
ated symbol, the stored value of “1” is output to indicate symbol
recognition. This causes the STE output to change state and can
be used to enable other downstream STEs. Multiple STEs, each
programmed to recognize specific symbols, can be chained to-
gether to recognize patterns or strings of symbols. Such pat-
terns need not be limited to ASCII strings and could just as
well represent hit addresses associated with a particle trajectory
in a tracking detector. The interconnections between the STEs
are provided by the routing matrix structure, represented by the
block at the bottom of Figure 1, which plays the role of the
column address and decode operations.

Preprint submitted to Nuclear Instruments and Methods in Physics Research A June 30, 2016
© 2016. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/
arXiv:1602.08524v2 [physics.ins-det] 28 Jun 2016
[Figure 1 labels: 8-bit input symbol; 8-to-256 “row address” decoder; Row Enable (0) to Row Enable (255), common to all columns; 256-row by 49,152-column memory array; STE (0) to STE (N), each with logic and memory column bits (1 x 256); STE inputs and outputs; State Transition Clock (common); Automata Routing Matrix Structure.]
Figure 1: The 2D memory array architecture of the Micron Automata Processor
adapted from conventional SDRAM technology.
2.1.2. Reporting Pattern Matches
The Micron AP was specifically designed to perform high-
speed pattern recognition. To be useful in this application, there
must be an efficient way to tell whether pattern matches were
found and provide details on these matches. To this end, any
AP building block such as an STE, counter, or boolean can be
configured to generate a signal known as a report event when-
ever it recognizes an input symbol. This way, the last element in
a string of STEs programmed to recognize a sequence of sym-
bols in an expression, can generate such a signal to indicate a
matching pattern in the input data stream.
[Figure 2 labels: Automata Processor core with two half cores (64 KB each), each containing three output regions with their output event memory; output event buffer; reporting event vector of 1024 bits with a 64-bit flow input byte offset; 1024 vectors; reporting results (DDR3).]
Figure 2: Reporting event vector readout architecture.
The portion of the AP architecture relevant to match report-
ing is shown in Figure 2. The AP is divided into 2 half cores,
each of which has 3 output regions. Each region has a local stor-
age area known as the output event memory with room for 1024
output event vectors (or report vectors) that are up to 1088 bits
wide. Each vector has a 64-bit preamble used to indicate the off-
set from the start of the input data stream. The remaining 1024
bits are used to identify the element in the region that generated
the report event. In this respect, the AP is unique because it
provides both spatial and temporal information on the matches.
The maximum number of reporting events per symbol cycle is
therefore 2 half cores × 3 regions/half core × 1024 events/region/cycle =
6144 events/cycle. The compilation process will fail if the total
number of reporting elements exceeds this number.
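
As a quick check of the numbers quoted above, the report capacity and vector width can be reproduced with a few lines of Python. This is only an arithmetic sketch; the constant names are ours and do not correspond to anything in the AP SDK.

```python
# Arithmetic sketch of the reporting limits described in the text.
# All names are illustrative, not AP SDK identifiers.
HALF_CORES = 2                # the AP core is divided into 2 half cores
REGIONS_PER_HALF_CORE = 3     # each half core has 3 output regions
ELEMENT_BITS = 1024           # bits identifying the reporting element in a region
PREAMBLE_BITS = 64            # offset from the start of the input data stream

# Maximum number of report events that can be flagged in one symbol cycle.
max_reports_per_cycle = HALF_CORES * REGIONS_PER_HALF_CORE * ELEMENT_BITS

# Maximum width of one output event vector.
max_vector_bits = PREAMBLE_BITS + ELEMENT_BITS

print(max_reports_per_cycle, max_vector_bits)
```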
An output event vector is generated on every symbol cycle
for which there was a report event in that region. Since each
vector is associated with a symbol cycle, events occurring on
the same cycle require a single vector while events on differ-
ent cycles require separate vectors. When symbol processing
completes, the vectors are transferred from each region’s out-
put event memory to the output event buffer, where they can be
read by external hardware.
2.1.3. Additional Latencies
Ideally, the total processing time would depend only on the
number of input symbols. In reality, there are overheads tied to
the internal memory transfers described above which introduce
additional latencies. There is an overhead associated with trans-
ferring each vector from a region’s event memory to the output
buffer. There is also a start-up overhead associated with trans-
ferring the first vector to the output buffer. Just to determine
that a region is empty also incurs an overhead. The overhead
due to transferring each vector depends on its size and can be
reduced by configuring the AP to use smaller vectors.
2.2. Programming the Automata Processor
The Micron AP is programmed using tools provided in the
AP Software Development Kit (SDK). The process begins by
creating a human-readable representation of the automata which
can be done through a graphical tool known as the AP Work-
bench. This representation is transformed into machine-readable
form in the compilation step which involves optimization and
placing and routing the automata elements onto the AP fabric.
This produces a binary relocatable image that is loaded and ex-
ecuted on the AP hardware.
3. Proof of Principle: A Pixel-augmented Electron Confir-
mation Trigger
To investigate the feasibility of the Micron AP for HEP pat-
tern recognition applications, we consider an electron confir-
mation trigger application where isolated high pT electrons are
verified and confirmed in a hadron collider detector by match-
ing energy clusters in an electromagnetic (EM) calorimeter with
charged particle tracks in a pixel-based tracking detector. A
simplified block diagram depicting the trigger hardware architecture
is shown in Figure 3. Incoming pixel detector hit data
are decomposed into the R−φ (bend plane) and R−Z (non-bend
plane) views and fed into two separate banks of AP chips that
perform the electron track confirmation in each view [7]. The
reporting event vectors in each view are read out of each bank
and fed into external logic that determines if there is at least one
pair of reporting events (one from each view) that are correlated
in time. If this condition is satisfied, a trigger accept signal is
generated.

[Figure 3 blocks: detector hits; pre-processor; hits in R−φ view and hits in R−Z view; banks of AP chips for each view; report vectors in R−φ view and in R−Z view; coincidence logic and trigger generation; trigger accept.]
Figure 3: Block diagram of Automata-Processor-based electron track confirmation trigger hardware.

Layer  Radius (cm)  Faces  Modules  ROCs  Pixels (φ)  Pixels (z)  Total pixels
1      2.99         12     96       1536  1920        3328        6 389 760
2      6.99         28     224      3584  4480        3328        14 909 440
3      10.98        44     352      5632  7040        3328        23 429 120
4      15.97        64     512      8192  10 240      3328        34 078 720
Length of the toy pixel detector: 54.88 cm
Table 1: Toy pixel detector specifications. Each face is a ladder consisting of 8 sensor modules. Each sensor module is made up of a pixel sensor bonded to 2 rows
of 8 Read-Out-Chips (ROCs) each. Each ROC has 80 rows × 52 columns of pixels. Each pixel measures 165 × 98 µm.
3.1. Toy Detector
For our proof-of-principle studies, we developed a toy de-
tector model consisting of 4 concentric cylinders (layers) ap-
proximating the barrel portion of the CMS Phase-1 pixel de-
tector [8]. The geometry and specifications of this model are
described in Table 1. All pixels in the entire detector have di-
mensions measuring 165 × 98 µm with the longer axis oriented
along the beam axis (z). The pixels in each layer form a uniform
grid laid out over the entire cylindrical surface. This means that
the face of each pixel is really a cylindrical tube segment instead
of a flat rectangular area. Each of the 4 detector layers
is uniformly divided in azimuthal angle into faces or ladders
that are parallel to the z-axis and run the entire length of the
cylinder. Each ladder is divided in z into 8 equal sections or
modules. Each module is further subdivided into 2 rows (along
azimuthal direction) by 8 columns (along z) of Read-Out-Chips
(ROCs). Each ROC, in turn, consists of 80 rows by 52 columns
of pixels. Pixels are uniquely identified using a cylindrical co-
ordinate system (R, φ, z), consisting of two orthogonal views
which are projections onto the R− φ and R− z planes. Pixel ad-
dresses or coordinates are encoded using the 16-bit and 14-bit
data formats shown in Figure 4 for the R−φ and R−z views, re-
spectively. This format makes it convenient to specify any pixel
in terms of layer, face, module, ROC, and pixel row and column
on the ROC. The entire pixel detector is immersed in a uniform
solenoidal magnetic field oriented along the z-axis, resulting in
charged particle trajectories that are curved in the R − φ (bend)
view and straight in the R − z (non-bend) view.
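
The segmentation described above fixes the pixel counts of each layer once the number of faces is known. The following Python sketch recomputes them from the hierarchy (faces, modules, ROCs, pixels); the face counts per layer are taken from Table 1, and the function and constant names are ours.

```python
# Recompute the per-layer counts of the toy detector from its hierarchy.
# Face counts per layer come from Table 1; all names are illustrative.
FACES_PER_LAYER = {1: 12, 2: 28, 3: 44, 4: 64}
MODULES_PER_FACE = 8           # each ladder is divided in z into 8 modules
ROC_ROWS, ROC_COLS = 2, 8      # ROCs per module: 2 along phi by 8 along z
PIX_ROWS, PIX_COLS = 80, 52    # pixels per ROC: 80 rows (phi) by 52 columns (z)

def layer_counts(faces):
    """Return (modules, rocs, pixels_phi, pixels_z, total_pixels) for a layer."""
    modules = faces * MODULES_PER_FACE
    rocs = modules * ROC_ROWS * ROC_COLS
    pixels_phi = faces * ROC_ROWS * PIX_ROWS
    pixels_z = MODULES_PER_FACE * ROC_COLS * PIX_COLS
    return modules, rocs, pixels_phi, pixels_z, pixels_phi * pixels_z

for layer, faces in FACES_PER_LAYER.items():
    print(layer, layer_counts(faces))
```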
[Figure 4 layout, fields in the order shown in the figure:
R−φ (16 bits, numbered 15 to 0): Layer (2 bits), Face (6 bits), ROC (1 bit), Row (7 bits)
R−z (14 bits, numbered 13 to 0): Layer (2 bits), Module (3 bits), ROC (3 bits), Column (6 bits)]
Figure 4: Data encoding scheme for pixel addresses in the R−φ and R−z views.
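
To make the encoding concrete, the sketch below packs and unpacks pixel addresses with the field widths of Figure 4. Two things are assumptions on our part: that the fields are packed from the most significant bit in the order Layer, Face (or Module), ROC, Row (or Column), and that all indices, including the layer, are numbered from 0. The function names are hypothetical.

```python
# Sketch of the Figure 4 address formats. ASSUMPTIONS: fields are packed
# MSB-first in the order listed, and all indices start at 0.

def _pack(fields):
    """Pack (value, width) pairs MSB-first into a single integer."""
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word

def encode_rphi(layer, face, roc, row):
    """16-bit R-phi address: layer(2) | face(6) | roc(1) | row(7)."""
    return _pack([(layer, 2), (face, 6), (roc, 1), (row, 7)])

def encode_rz(layer, module, roc, column):
    """14-bit R-z address: layer(2) | module(3) | roc(3) | column(6)."""
    return _pack([(layer, 2), (module, 3), (roc, 3), (column, 6)])

def decode_rphi(addr):
    """Inverse of encode_rphi."""
    return (addr >> 14) & 0x3, (addr >> 8) & 0x3F, (addr >> 7) & 0x1, addr & 0x7F

def split_bytes(addr):
    """Split an address into (least significant byte, most significant byte)."""
    return addr & 0xFF, (addr >> 8) & 0xFF

addr = encode_rphi(layer=3, face=63, roc=1, row=79)
print(hex(addr), decode_rphi(addr), split_bytes(addr))
```

Splitting each address into two bytes matters later because the AP consumes 8-bit symbols.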
For the purposes of defining the regions of interest (ROI)
used in detector data readout and defining the pattern banks
used by the automata processor (both discussed in more detail
below), we divide the pixel detector into logical sections in the
R − φ and R − z views (see Figure 5). In the R − φ view, the
detector is divided into 72 overlapping azimuthal sectors, each
measuring 25° in φ. In the R − z view, layer 1 is divided into
32 equal sections in z and each of layers 2, 3, & 4 into 16 equal
sections in z. Taken together, the logical divisions in both views
form cylindrical tube segments in each layer which we will re-
fer to as tiles. An example of these tiles being used to define
the ROIs for track searching is shown in Figure 6.
Surrounding the pixel detector is a larger concentric cylin-
der representing the barrel portion of the EM calorimeter with
a radius of 129 cm and η coverage of ±1.479. We assume a
single crystal barrel granularity of 180-fold in φ (half that of the
CMS detector) and 2 × 85 fold in η.
[Figure 5 panels: R−φ view and R−z view, showing layers L1 to L4 with z sections numbered 0 to 31 and 0 to 15.]
Figure 5: The toy detector is divided into 72 overlapping sectors in the R− φ view. Examples of two such neighboring sectors, the 12th in red and the 13th in green,
each measuring 25°, are shown above. In the R − z view, each layer is divided into equal sections in z: 32 for layer 1 and 16 for all other layers. Shown in blue,
red, and green are examples of combinations of 4 sections (1 from each layer) forming roads of straight tracks originating from the luminous region and within the
fiducial volume of the detector.
Figure 6: Two views of an event with 2 EM calorimeter clusters (one of which is represented by the yellow bar in (b)) above the threshold. The search for tracks
associated with the clusters is done using only hits in the region-of-interest (ROI) defined by the 4 curved white tiles. There are 2 ROIs in the figure, each pointing
at an EM cluster.
Figure 7: A simulated event consisting of a single Z → ee interaction is shown in (a). The two yellow tracks are the electrons from the Z. Other electrons are drawn
in cyan. All other charged tracks are drawn in magenta. Photons are drawn as green dashed lines and all other neutral tracks are drawn as blue dashed lines. The
downward-pointing narrow opening angle pair drawn in cyan is from a photon conversion in the 4th layer. The image in (b) shows a simulated event with a Z → ee
interaction that is overlaid with 50 pileup interactions.
3.2. Simulated Events
Simulated events are generated with proton-proton colli-
sions at 14 TeV center-of-mass energies using Pythia 6.4 [9]
with CMS Tune Z2* [10] parameters. Each event consists of
a Z → ee signal interaction overlaid with pileup interactions
which are pure Monte-Carlo minimum bias interactions. The
number of pileup interactions overlaid on each signal interac-
tion is randomly chosen from a Poisson distribution. The posi-
tions of the primary interaction vertices along the z-axis are ran-
domly selected from a Gaussian distribution centered at z = 0
cm with σz = 5 cm. For simplicity, the transverse positions of
the primary vertices are always centered at x = y = 0 cm. Fig-
ure 7-a shows a 3D view of a simulated event consisting only of
a single Z → ee signal interaction. Figure 7-b shows an event
with a Z → ee interaction overlaid with 50 pileup interactions.
Starting from the production vertices, all particles are tracked
through the toy detector model with a uniform 4 T axial magnetic
field. All unstable particles are allowed to decay randomly
into a channel selected according to their branching fractions
and with exponential decay length distributions based on their
proper lifetime. Photons are converted into electron-positron
pairs with the appropriate small opening angle distribution at a
frequency determined by the pair production cross section (for
both nuclear and electronic fields) in the material of the detector
concentrated in the 4 pixel detector layers. We use a fictitious
sensor module material with an effective Z and density (ρ) de-
termined from the various components of an actual CMS pixel
sensor module. To keep things simple in this proof-of-concept
investigation, no other physics processes such as multiple scat-
tering and energy loss mechanisms are simulated. All charged
particles traversing a pixel detector layer generate a single pixel
hit for that layer with 100% efficiency. The address of the hit
is determined from the coordinates of the intersection of the
trajectory with the cylinder. In reality, a charged particle can
produce a cluster of neighboring hits in a layer, which could
lead to more fake tracks in the pattern recognition stage. This is
not included in our simplified simulation. Because our selection
criteria require EM clusters to have pT ≥ 5 GeV, only electromagnetic
particles (electrons and photons) above this threshold
are propagated beyond layer 4 of the pixel detector to the EM
calorimeter barrel. Upon reaching the calorimeter, the total en-
ergy of such an EM particle is deposited into the calorimeter
crystal it intersects to generate an EM cluster.
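
For orientation, the azimuthal position of a hit in each layer can be estimated with the standard relation between transverse momentum and bending radius, ρ [m] ≈ pT [GeV] / (0.3 B [T]). The Python sketch below is not the simulation code used in this study; it assumes a track that starts on the beam axis (x = y = 0) with no energy loss or scattering, and the sign convention for the bend direction is our choice.

```python
import math

# Layer radii of the toy detector, from Table 1 (cm).
LAYER_RADII_CM = (2.99, 6.99, 10.98, 15.97)

def bend_radius_cm(pt_gev, b_tesla=4.0):
    """Bending radius in cm from the standard relation rho[m] = pT / (0.3 B)."""
    return 100.0 * pt_gev / (0.3 * b_tesla)

def hit_phi(pt_gev, phi0, charge, r_cm, b_tesla=4.0):
    """Azimuth (rad) where a track from the beam axis crosses radius r_cm.

    ASSUMPTIONS: the track originates at x = y = 0, and positive charge
    bends towards decreasing phi (an arbitrary sign convention).
    """
    rho = bend_radius_cm(pt_gev, b_tesla)
    if r_cm > 2.0 * rho:
        raise ValueError("track curls up before reaching this radius")
    return phi0 - charge * math.asin(r_cm / (2.0 * rho))

# Example: a 5 GeV electron (charge -1) emitted at phi0 = 0.
for r in LAYER_RADII_CM:
    print(r, hit_phi(5.0, 0.0, -1, r))
```

At the 5 GeV threshold the bending radius is a little over 4 m, so the deflection across the four pixel layers is only about 0.02 rad.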
Four different 1000-event samples were generated having
Poisson means for the number of overlaid pileup interactions of
N = 50, 80, 110, & 140. All four samples used the same set of
Z → ee signal interactions.
4. Implementing the Track Finder on the Automata Proces-
sor
4.1. Basic Automata Network and Principle of Operation
The application of the automata processor as a fast pattern
search engine is based on the idea of maintaining a database or
dictionary of all possible patterns against which symbols in an
input data stream are simultaneously compared for matches [11,
12]. The patterns in this dictionary can be all the possible words
in a language or, in our case, all physically possible charged par-
ticle trajectories in a tracking detector. In our present case, each
possible trajectory is defined by the addresses of 4 pixel hits
representing the intersection of the trajectory with each layer
of the detector. On the automata processor, each pattern of 4
pixel addresses representing a trajectory in the dictionary is as-
sociated with the automata network having the topology shown
in Figure 8. As shown, this network consists of 9 columns of
2 STEs each. The pixel addresses are stored in the STE pairs
labeled Lx lo and Lx hi with x=1,2,3,4 denoting the detector
layer. Since each STE only has 8 bits of symbol recognition
capability, the least and most significant bytes of each address
are stored separately in Lx lo and Lx hi, respectively. The STE
pair to the left of each (Lx lo, Lx hi) pair acts as a latch that
re-enables the Lx lo STE on its right only on odd clock cycles.
This ensures that the pair of bytes denoting a pixel address is
contiguous and can occur anywhere in the input data stream. The
very first pair of STEs on the left of the figure imposes additional
constraints for each pattern. Assuming that the very first
symbol in an input data stream represents the energy of the EM
cluster for which we are trying to find a track match, the STE
labeled ET requires this energy to be within the possible range
for the stored track pattern. Below this, the STE labeled Calo
holds the address of the EM calorimeter crystal that the trajectory
intersects.

Figure 8: An automata network programmed to generate a report on matching a sparse and ordered sequence of exactly 4 pixel hit addresses. The “R” on the bottom
right STE indicates that it is a reporting element.
To illustrate the operation of this automata network, con-
sider an input data stream from the detector, consisting of pixel
hit addresses read out sequentially by layer, starting from the
innermost. Assume that the very first symbol in the input data
stream is an 8-bit number representing the EM calorimeter clus-
ter energy and that the second symbol represents the address of
the EM calorimeter crystal associated with the cluster. If the
energy of the measured cluster falls within the range of the ET
STE, its output is enabled, thereby activating the Calo STE. If
the crystal address matches that in the Calo STE, this STE en-
ables its output and activates the L1 lo STE. After this point,
if a symbol anywhere in the data stream matches that in L1 lo,
the output of this STE is enabled, activating L1 hi. If the im-
mediately following symbol matches that in L1 hi, then L2 lo is
activated to wait for matching hits in the 2nd layer of the detec-
tor. On the other hand, if no match is found for L1 hi, we need
to keep reactivating L1 lo every other clock cycle until we find
a layer 1 match. This is the purpose of the OC 0x pair of STEs
which reactivates L1 lo on every odd clock cycle, re-initializing
the search for the byte-pair representing the layer 1 hit. These
steps are repeated for every detector layer until either 4 match-
ing pixel hits are found or the input data stream is exhausted.
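
The behaviour of a single pattern can be mimicked in software. The Python sketch below is a simplified emulation written for illustration, not the ANML macro or any AP SDK code: it checks the ET and Calo symbols, then scans the stream in aligned byte pairs (standing in for the latch STEs) for the four hit addresses in layer order. The class and function names, and the example addresses, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class TrackPattern:
    et_min: int            # lower edge of the accepted cluster-energy range
    et_max: int            # upper edge of the accepted cluster-energy range
    calo: int              # 8-bit calorimeter crystal address
    hits: Sequence[int]    # one 16-bit pixel address per layer, innermost first

def build_stream(et, calo, hits_by_layer):
    """ET symbol, Calo symbol, then every hit as (low byte, high byte)."""
    stream = [et & 0xFF, calo & 0xFF]
    for layer_hits in hits_by_layer:          # innermost layer first
        for addr in layer_hits:
            stream += [addr & 0xFF, (addr >> 8) & 0xFF]
    return bytes(stream)

def first_report_offset(stream, pattern) -> Optional[int]:
    """Symbol offset at which the pattern would report, or None if no match."""
    if len(stream) < 2:
        return None
    if not (pattern.et_min <= stream[0] <= pattern.et_max):
        return None                           # ET STE never fires
    if stream[1] != pattern.calo:
        return None                           # Calo STE never fires
    layer = 0
    # Step in aligned byte pairs, as enforced by the latch STEs.
    for i in range(2, len(stream) - 1, 2):
        want = pattern.hits[layer]
        if stream[i] == (want & 0xFF) and stream[i + 1] == (want >> 8):
            layer += 1                        # move on to the next layer
            if layer == len(pattern.hits):
                return i + 1                  # last high byte triggers the report
    return None

pattern = TrackPattern(30, 60, 17, [0x0123, 0x4234, 0x8345, 0xC456])
hits = [[0x0100, 0x0123], [0x4234], [0x8300, 0x8345], [0xC456, 0xC400]]
print(first_report_offset(build_stream(45, 17, hits), pattern))
```

On the hardware, every pattern in a bank performs this comparison on the same symbol at once, whereas a software version would have to loop over patterns.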
4.2. Generating the Dictionary or Pattern Banks
Figure 9: Generating all possible track patterns with a single electron gun Monte
Carlo event generator for a given region of the detector.
We generate our pattern banks separately for the R−φ bend
view and for the R−z non-bend view [7]. For the R−φ view, the
patterns are generated for each ±12.5 degree φ-sector described
earlier in Section 3.1. For the R − z view, we start with the
same subdivisions along z in each detector layer described in
Section 3.1. We will refer to each division as a window. We
take all possible combinations of 4 windows (one from each
layer) forming roads containing straight lines originating from
the luminous region and ending in the EM calorimeter barrel.
Patterns are then generated for each road.
To generate the pattern banks, we use a single electron gun
Monte Carlo event generator to produce all the possible tracks
above pT = 5 GeV. Figure 9 shows the process of generating
these patterns for a given region of the detector. Electrons of
both charges are generated and propagated through the toy de-
tector model described in Section 3.1 generating pixel hits as
they traverse each layer. Just as with our simulated Z → ee
samples that are overlaid with pileup interactions, we do not
simulate energy loss or multiple scattering when generating the
pattern banks.
[Figure 10 labels: input data stream; ET and CALO STEs; five latch STE pairs (marked *); STE pairs L1_lo/L1_hi, L2_lo/L2_hi, L3_lo/L3_hi and L4_lo/L4_hi, with the layer 2 and layer 3 pairs appearing twice; transitions labeled Layer 1 hit to Layer 4 hit; two reporting elements (R).]
Figure 10: An automata network programmed to generate a report on matching a sequence of 4 pixel hit addresses allowing up to one missing hit. The “R”s on the
two rightmost STEs indicate that these are reporting elements.
To keep the total number of patterns manageable, we use a
resolution of 4 pixels for all detector layers in each view. There
is a total of 72 pattern banks in the R − φ view with each bank
having ∼1163 unique track patterns. Due to the rotational sym-
metry of the ideal toy detector about the z axis, each bank has
an identical set of patterns modulo a rotation in φ. One could,
in principle, use a single bank to represent the entire detector
in the R − φ view after the appropriate rotation is applied to the
hits of a given sector. We chose not to do this since perfect ro-
tational symmetry may not be present in a real detector. For the
R− z view, we consolidate all roads with a common layer 1 and
layer 4 window into a single bank resulting in a total of 244 pat-
tern banks. The average number of track patterns in each bank
is ∼4662. For the ideal toy detector, there is a mirror symme-
try about the x − y plane in this view. The set of track patterns
in each half is identical modulo a translation in z. Again, we
chose not to exploit this symmetry since it may not be present
in a real detector.
These pattern banks are programmed into the automata pro-
cessor using the C-API provided in the AP SDK. For each entry
in a generated pattern bank, an instance of a macro represent-
ing the basic automata network shown in Figure 8 is created by
substituting its parameters with the values associated with the
current pattern. The resulting automata network containing all
instances is then compiled into an object file that is loaded into
the automata processor.
We can compile 2,496 instances of the macro shown in Fig-
ure 8 onto the current version of the AP chip. Without taking
advantage of rotational or translational symmetries, this would
require about 34 chips for the R − φ view and 456 chips for the
R − Z view. With the current AP boards that have 4 ranks of 8
chips each, this translates to about 1 board for the R − φ view
and 14 boards for the R − Z view.
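
The chip and board estimates follow directly from the bank sizes quoted above. The short Python sketch below reproduces the arithmetic using the approximate average bank sizes from the text; the names are ours.

```python
# Reproduce the chip and board estimates from the quoted bank sizes.
MACROS_PER_CHIP = 2496        # instances of the Figure 8 macro per AP chip
CHIPS_PER_BOARD = 4 * 8       # 4 ranks of 8 chips each

def chips_needed(n_banks, patterns_per_bank):
    """Whole chips needed to hold all patterns (integer ceiling division)."""
    total_patterns = n_banks * patterns_per_bank
    return -(-total_patterns // MACROS_PER_CHIP)

chips_rphi = chips_needed(72, 1163)    # R-phi view: 72 banks of ~1163 patterns
chips_rz = chips_needed(244, 4662)     # R-z view: 244 banks of ~4662 patterns

print(chips_rphi, chips_rphi / CHIPS_PER_BOARD)
print(chips_rz, chips_rz / CHIPS_PER_BOARD)
```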
4.3. Available Features and Capabilities and Possible Improve-
ment
The simple automata network shown in Figure 8, requir-
ing hits in all 4 layers of the detector, suffices for the present
purpose of demonstrating a proof of principle. However, it is
possible to design an automata-based algorithm that can deal
with missing hits due to inefficiencies in a real detector. An
automata network representing a track pattern allowing up to
one missing hit in any layer is shown in Figure 10. In the
general case when more than 1 missing hit is allowed, the total
number of STEs representing a pattern is given by
$N_{ste} = 2(2N_l + 2N_l N_m - 3N_m - 1)$, where $N_l$ and $N_m$ are the number
of detector layers and number of allowed missing hits, respectively.
When there are no missing hits, the total number of STEs
is simply $N_{ste} = 4N_l + 2$.
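
The two expressions can be wrapped in a small helper to see how quickly the STE count grows. This is a direct transcription of the formulas above into Python; treating the zero-missing-hit case as a separate branch reflects the fact that the text gives it its own expression.

```python
def n_ste(n_layers, n_missing):
    """Number of STEs per pattern, transcribed from the expressions in the text."""
    if n_missing == 0:
        return 4 * n_layers + 2
    return 2 * (2 * n_layers + 2 * n_layers * n_missing - 3 * n_missing - 1)

print(n_ste(4, 0))   # network requiring hits in all 4 layers
print(n_ste(4, 1))   # 4 layers with one missing hit allowed
```

For 4 layers the first case gives the 18 STEs of Figure 8, and allowing one missing hit raises the count to 24.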
Another interesting feature of the AP is the STE’s ability
to recognize a range of values instead of a specific one. This
makes it possible to employ variable resolution patterns that
offer a way to reduce the total number of patterns in a bank [13].
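
Range recognition follows naturally from the memory-column picture of Section 2.1.1: an STE is a 256-bit column, and setting several bits makes it accept several symbols. The Python sketch below models this with an integer bit mask; it illustrates the idea only and is not AP SDK code.

```python
def ste_mask(symbols):
    """Build a 256-bit mask with one bit set per accepted 8-bit symbol."""
    mask = 0
    for s in symbols:
        if not 0 <= s <= 0xFF:
            raise ValueError("symbols must be 8-bit values")
        mask |= 1 << s
    return mask

def ste_matches(mask, symbol):
    """True if the row selected by the input symbol holds a stored 1."""
    return bool((mask >> symbol) & 1)

exact = ste_mask([0x12])                 # recognizes a single value
coarse = ste_mask(range(0x10, 0x14))     # recognizes a range of 4 values

print(ste_matches(exact, 0x13), ste_matches(coarse, 0x13))
```

A coarse STE of this kind can stand in for several neighboring addresses, which is how a variable-resolution pattern would cover more than one track.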
Lastly, the number of STEs (18) in Figure 8 needed to represent
a specific track pattern is largely due to the limited alphabet
size of 8 bits. If our alphabet size were 16 bits, we could
reduce the number of STEs by a factor of 3 to just 4 STEs repre-
senting the 4 hits plus two additional STEs for the calorimeter
energy and position. This would reduce the total number of
automata chips needed for our pattern banks.
5. Testing the AP-based Electron Confirmation Trigger with
the Simulated Samples
To satisfy the requirements of the electron track trigger ap-
plication described in Section 3, the AP must, first of all, be
able to reconstruct tracks for electron/photon discrimination.
Secondly, it must be able to accomplish this task within the
available latency of the trigger [14]. This section will focus
on its track finding ability. The next section will focus on its
processing times.
5.1. Testing Procedure
For each of the simulated events described in Section 3.2,
we check to see if there are EM calorimeter clusters. Selecting
only EM clusters with pT > 5 GeV, we then read out all the
pixel hits within the ROI associated with the cluster (see Figure 6).
The extents of the ROI in the R − φ view are defined by
the boundaries of the φ sector (1 of 72 described in Section 3.1)
whose bisector is closest in azimuthal distance to the cluster.
Considering only hits within the ROI is a sensible and practical
approach in an actual implementation because it avoids having
to deal with the sheer amount of data involved in reading out
the entire pixel detector.

Pileup         EM clusters         Track match   Eff.   Rejection  Purity
interactions   Total   e     γ     e      γ      (%)    factor     (%)
50             1242    837   405   837    9      100    45         99
80             1395    839   556   839    17     100    33         98
110            1515    844   671   844    26     100    26         97
140            1648    844   804   844    56     100    14         94
Table 2: This table summarizes the ability of the Automata Processor algorithm to identify electrons and reject photons for each of the 4 simulated samples used
in our study. The total number of EM clusters and their breakdown into electrons and photons are shown in columns 2-4. Columns 5 and 6 show the number of
electron clusters and photon clusters, respectively, for which a matching track was found. The last three columns show the electron identification efficiency, photon
rejection factor, and purity as defined in the text.
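
The sector lookup can be sketched as follows. With 72 sectors covering the full azimuth, uniformly spaced bisectors would be 5° apart, and each sector would extend ±12.5° around its bisector. The numbering scheme in this Python sketch (sector 0 centered at φ = 0°) is an assumption of ours, as are the function names.

```python
N_SECTORS = 72
SPACING_DEG = 360.0 / N_SECTORS     # 5 degrees between neighboring bisectors
HALF_WIDTH_DEG = 12.5               # each sector measures 25 degrees in phi

def nearest_sector(phi_deg):
    """Index of the sector whose bisector is closest to phi_deg.

    ASSUMPTION: sector k has its bisector at k * 5 degrees.
    """
    return round((phi_deg % 360.0) / SPACING_DEG) % N_SECTORS

def sector_bounds(k):
    """(lower, upper) azimuthal edges of sector k in degrees."""
    center = k * SPACING_DEG
    return center - HALF_WIDTH_DEG, center + HALF_WIDTH_DEG

k = nearest_sector(7.4)
print(k, sector_bounds(k))
```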
In this proof-of-principle study, we also assume single (dou-
ble) crystal calorimeter resolution in the R − Z (R − φ) view
and precise knowledge of the interaction vertex associated with
the Z → ee signal. For the purposes of this study, we assume
that the latter knowledge is provided by an independent sub-
detector system such as an outer tracker based on silicon strips.
In the R − z view, we first construct a straight line defined by
the calorimeter cluster coordinates (center of the single crystal)
and the primary interaction vertex. We then find the layer 1 and
layer 4 windows this line intersects. Together with this pair, all
layer 2 and layer 3 windows that form roads (see Section 4.2)
with the pair are used to define the ROI extents in the R−z view.
If such a ROI exists within the acceptance of the pixel detector,
we will refer to the EM cluster plus primary vertex pair as re-
constructable. All the pixel hits within the ROI defined this way
are read out sequentially by layer starting with the innermost.
The sequence of the hits within each layer in the data stream
does not matter to the automata track finding algorithm. This
data stream of pixel hits arranged by layer is appended to two
8-bit quantities, the first for the EM cluster energy and the sec-
ond for the EM calorimeter crystal coordinate. The 8-bit stream
is fed into the automata hardware containing the pattern banks
corresponding to the ROI. This is done separately for the R − φ
and R− z views. In the hardware, all the instances of the macro
represented by Figure 8 associated with the bank are simultane-
ously presented with the input data stream to generate reports
in case of matches. A trigger accept is generated if matches are
reported in both views on the same clock cycle.
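
The coincidence requirement amounts to intersecting the sets of report offsets from the two views. The Python sketch below illustrates this under the assumption that both views stream the same hits in the same order, so that a given hit produces its report at the same symbol offset in each view; the function name is hypothetical and this is not the external trigger logic itself.

```python
def trigger_accept(rphi_report_offsets, rz_report_offsets):
    """True if at least one report offset is common to both views.

    ASSUMPTION: both views process the same hit sequence, so matching
    reports carry identical symbol offsets.
    """
    return bool(set(rphi_report_offsets) & set(rz_report_offsets))

print(trigger_accept([13, 21], [9, 21]))   # offset 21 is reported in both views
print(trigger_accept([13], [9, 21]))       # no common offset
```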
5.2. Results for Electron Identification and Photon Rejection
All reconstructable EM clusters are required to have pT > 5
GeV and to originate from the beam axis. The total number of
EM clusters satisfying these criteria for each of the 1000-event
simulated samples overlaid with a different number of pileup
interactions are shown in the second column of Table 2. A
breakdown of these numbers into those originating from elec-
trons and those from photons is shown in the third and fourth
columns. The fifth column of the table shows the number of
electron clusters for which a matching track was found. The
same number for photons is shown in the sixth column.
For this study, we define the electron identification efficiency
as $e = N^{e}_{\mathrm{EMtrk}}/N^{e}_{\mathrm{EM}}$ and the photon
rejection factor as
$R_{\gamma} = N^{\gamma}_{\mathrm{EM}}/N^{\gamma}_{\mathrm{EMtrk}}$.
$N^{e}_{\mathrm{EM}}$ and $N^{\gamma}_{\mathrm{EM}}$ are the number of
reconstructable EM clusters originating from electrons and photons
(columns 3 and 4 of Table 2), respectively, which satisfy the
requirements described at the beginning of this section.
$N^{e}_{\mathrm{EMtrk}}$ and $N^{\gamma}_{\mathrm{EMtrk}}$ are the
corresponding numbers of EM clusters for which there is a matching
track in the pixel detector (columns 5 and 6 of Table 2). The results
for $e$ and $R_{\gamma}$ are shown in the last two columns of Table 2.
For all samples, we see that the AP algorithm correctly finds a
matching track for every electron EM cluster satisfying our
requirements. The fraction of photon EM clusters satisfying our
requirements that are misidentified increases from 2% to 7% as the
number of the pileup interactions in the sample, and hence detector
occupancy, increases. The last column in Table 2 shows the purity
$P_{e} = N^{e}_{\mathrm{EMtrk}}/(N^{e}_{\mathrm{EMtrk}} + N^{\gamma}_{\mathrm{EMtrk}})$,
which we define as the fraction of all EM clusters satisfying our
track trigger requirements that originated from an electron.
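As a worked illustration of the three definitions above, the short
sketch below computes the efficiency, rejection factor, and purity from
the four cluster counts. The function name is hypothetical, and the
counts used in the usage note are invented for illustration; they are
not the values of Table 2.

```python
def trigger_metrics(n_e_em, n_e_emtrk, n_g_em, n_g_emtrk):
    """Return (efficiency, photon rejection factor, purity).

    n_e_em, n_g_em:       reconstructable EM clusters from electrons, photons
    n_e_emtrk, n_g_emtrk: those clusters with a matching pixel track
    """
    efficiency = n_e_emtrk / n_e_em
    # With no misidentified photons the rejection factor is unbounded.
    rejection = n_g_em / n_g_emtrk if n_g_emtrk else float("inf")
    purity = n_e_emtrk / (n_e_emtrk + n_g_emtrk)
    return efficiency, rejection, purity
```

For example, `trigger_metrics(200, 200, 100, 5)` describes a hypothetical
sample in which every electron cluster is matched and 5% of the photon
clusters are misidentified. Since the rejection factor is the inverse of
the misidentified fraction, the 2% and 7% fractions quoted above
correspond to rejection factors of roughly 50 and 14.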
6. Processing Time
The amount of time it takes the AP to find matching tracks
in each view (which we will refer to as symbol processing time)
depends only on the number of hits in the ROI. Because we are
using 16-bit hits while the AP uses native 8-bit symbols, the total
number of input symbols is twice the number of hits. Since
one symbol is processed per AP clock cycle, the symbol processing
time, expressed in symbol cycles, is simply equal to the number of
input symbols.
This time does not represent the total processing time since the
match results in each view need to be read out of the AP chips
and undergo further processing to find coincident matches in
each view before making a trigger decision. This additional
time includes (a) the internal data-transfer time within the AP
to move the match results from the local event memories in the
six output regions to the output event buffers (see Section 2.1.2),
and (b) the external processing time in external logic to find co-
incident reports in both views. The internal transfer time de-
pends on the size of the report vectors and their number in each
[Figure 11, two panels: R − φ view and R − Z view, each assuming 512-bit
report vectors; x-axis: number of pileups; y-axis: symbol cycles.]
Figure 11: Shown above are the symbol cycles to process the hits in each view and transfer the report events to the output event buffers on an AP chip. This is
plotted as a function of the simulated samples overlaid with N = 50, 80, 110, and 140 pileup interactions.
of the six regions. In addition, there are overheads associated
with determining a region to be empty and a startup overhead
for reading the first vector. What we refer to as the external
processing time includes the data-transfer time from the AP’s
event output buffer to the external logic across the DDR3 inter-
face. The symbol processing and internal data transfer will be
collectively referred to as core processing since they both occur
on the AP chip. The first subsection below will focus on the
core processing followed by a second subsection devoted to the
external processing.
6.1. Core Processing Time
To calculate the number of symbol processing cycles, we
simply multiply the number of hits by two. For both views,
we add an additional symbol cycle to process the 8-bit symbol
representing the coordinates of the calorimeter crystal. For the
R − φ view, we add one more symbol cycle to account for the
8-bit symbol representing the energy of the cluster.
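The counting rule above amounts to the following small calculation. The
function name and the view labels are hypothetical and serve only to
illustrate the arithmetic.

```python
def symbol_cycles(n_hits: int, view: str) -> int:
    """Symbol processing cycles for one view of one EM cluster."""
    cycles = 2 * n_hits   # each 16-bit hit is two 8-bit symbols
    cycles += 1           # calorimeter crystal coordinate (both views)
    if view == "r-phi":
        cycles += 1       # cluster energy (R-phi view only)
    elif view != "r-z":
        raise ValueError("view must be 'r-phi' or 'r-z'")
    return cycles
```

For a ROI with 100 hits, this gives 201 symbol cycles in the R − Z view
and 202 in the R − φ view.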
As we explained earlier, two report events occurring on the
same symbol cycle are saved in either the same or different
event vectors depending on the output region they occurred in.
As we also pointed out, it takes a finite number of symbol cy-
cles to transfer each vector. Because of this, when calculat-
ing the internal data-transfer time, we randomly assign a report
event to one of the six output regions. This assumes there is
no correlation between the automata instances representing all
the patterns in our bank and that they are uniformly distributed
throughout the 6 regions. We use an event vector divisor of 2 to
reduce the vector width to 512 bits and assign the appropriate
number of symbol cycles to transfer a vector of this size. We
also take into account the overheads associated with “reading”
an empty region and transferring the first vector. The symbol
processing times plus internal data-transfer times for each view
are shown as a function of the simulated sample in Figure 11.
For a symbol cycle of 7.5 ns, these translate to 1.29, 1.55, 1.85,
and 2.15 µs, respectively, for the 50, 80, 110, and 140 pileup
samples in the R − φ view. The corresponding times for the
R − Z view are, respectively, 1.49, 1.94, 2.53, and 3.25 µs.
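The estimate of the internal data-transfer time described above can be
sketched as a small simulation. This is a simplified model under stated
assumptions rather than the code used for this study: the per-vector
cost, the empty-region overhead, and the first-vector overhead are left
as arguments because their values are not repeated in this section, and
the way they are summed is an assumption of the sketch.

```python
import random

SYMBOL_CYCLE_NS = 7.5   # symbol cycle quoted in the text
N_REGIONS = 6           # output regions per AP chip


def internal_transfer_cycles(report_cycles, vector_cost, empty_cost,
                             first_vector_cost, rng=None):
    """Estimate the symbol cycles needed to move reports to the output buffers.

    report_cycles is a list with the symbol cycle of each report event.
    Each event is assigned to a random region; events sharing a cycle
    and a region share one event vector.
    """
    rng = rng or random.Random()
    vectors = [set() for _ in range(N_REGIONS)]
    for cycle in report_cycles:
        vectors[rng.randrange(N_REGIONS)].add(cycle)

    total = 0
    for region in vectors:
        # Non-empty regions pay per vector; empty ones pay a fixed overhead.
        total += len(region) * vector_cost if region else empty_cost
    if any(vectors):
        total += first_vector_cost   # startup overhead for the first vector
    return total


def cycles_to_us(cycles):
    """Convert symbol cycles to microseconds."""
    return cycles * SYMBOL_CYCLE_NS / 1000.0
```

As a consistency check on the numbers quoted above, `cycles_to_us(172)`
returns 1.29, so the 1.29 µs quoted for the 50 pileup sample in the
R − φ view corresponds to about 172 symbol cycles.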
6.2. External Processing Time
After all detector hit data associated with an EM cluster are
processed by the AP banks for the R−φ and R−Z views, the last
step of our track finding algorithm requires checking for at least
one pair of reporting events (one from each view) in which the
R−Z event occurred one symbol cycle ahead of the R−φ event.
This 1-cycle difference is due to the fact that the automaton
representing a track in the R − Z view has one less STE than
that for the R − φ view. This requirement does not guarantee
a common source for both events, but those that do originate
from the same track will necessarily exhibit this correlation in
time, and this constraint can only help reduce fake rates.
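The timing requirement above can be illustrated in software with a
simple set lookup. The function name is hypothetical, and the sketch
only mirrors the logic of the check, not any hardware implementation.

```python
def has_coincidence(rz_cycles, rphi_cycles, offset=1):
    """True if some R-Z report occurred `offset` cycles before an R-phi report.

    rz_cycles and rphi_cycles hold the symbol cycles on which report
    events occurred in the R-Z and R-phi views for one EM cluster.
    """
    rz = set(rz_cycles)   # store the R-Z report times for fast lookup
    return any((cycle - offset) in rz for cycle in rphi_cycles)
```

For example, `has_coincidence([10, 40], [41])` is true because the R − Z
report on cycle 40 precedes the R − φ report on cycle 41 by one cycle,
whereas `has_coincidence([10], [10])` is false.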
To estimate the processing time associated with finding at
least one pair of correlated reporting events, we implemented
the Content Addressable Memory (CAM) reference design de-
scribed in Reference [16] on a 100 MHz Xilinx Virtex-6 LX240T
Field Programmable Gate Array (FPGA). Our studies are based
on simulations done using the Xilinx ISE Design Suite. We as-
sume that the report vectors can be read out of the Micron AP’s
8-bit wide bus at 1066 MHz. In our simulations, the 576-bit
data (64-bit header + 512-bit reduced-size vector) associated
with each vector is read out from the R − Z banks and the 64-
bit header containing the temporal information is extracted and
written into FPGA memory. Once the headers from all R − Z
[Figure 12, two panels titled "CPU processing time": (a) single threaded,
(b) multithreaded using OpenMP; x-axis: number of pileups; y-axis: CPU
cycles (×10³); curves: nomatch, all, matched.]
Figure 12: The plots above show the number of cycles it takes an Intel Core i7 CPU to process the pixel track trigger algorithm described in Ref. [15]. Plots (a) &
(b) show the number of CPU cycles for the single threaded and multithreaded cases, respectively, as a function of the simulated data samples. Dashed-line plots with
upright triangular markers are obtained using only EM clusters for which no track match is found. Dashed-line plots with inverted triangular markers are obtained
using only EM clusters that have at least one matching track. The solid-line plots and solid circular markers are obtained using all EM clusters. The cycles in these
plots are measured using the Intel CPU’s time stamp counter.
view vectors associated with an EM cluster are stored in mem-
ory, we loop over the corresponding headers from the R−φ view
vectors and present them to the CAM to find the first match. As
soon as this is found, a trigger accept signal is generated and
the process is repeated for the next EM cluster.
For the 1000-event sample overlaid with 140 pileup interac-
tions, the average time to execute this match finding step on the
FPGA is 0.37 µs, which is insignificant compared to the core
processing time.
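To give a sense of scale for the readout assumed in these simulations,
the sketch below computes the time to move one report vector across the
bus. It assumes one bus-width transfer per clock at the quoted frequency
and no protocol overhead; both are simplifications, and the function
name is hypothetical.

```python
def vector_readout_ns(header_bits=64, vector_bits=512,
                      bus_bits=8, bus_mhz=1066.0):
    """Time in ns to read one header plus report vector over the bus."""
    total_bits = header_bits + vector_bits
    transfers = -(-total_bits // bus_bits)   # ceiling division
    return transfers / bus_mhz * 1000.0      # cycles / MHz = us, then to ns
```

With the default values, the 576-bit record takes 72 transfers, or
about 67.5 ns per vector under these assumptions.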
7. Comparison with other Processor Architectures
In order to put the AP in better perspective, we compare it
with other processor architectures. We implement the pixel-
based tracking trigger algorithm described in Ref. [15] on a
CPU and a GPU. For this algorithm, which is functionally equiv-
alent to that used on the AP, we assume the same detector layout
and parameters, apply the same energy threshold cuts, and look
at an identical set of hits contained within a ROI defined in the
same way.
The CPU and GPU results in Sections 7.1 and 7.2 are pre-
sented within the context of the electron confirmation trigger
chosen to demonstrate our proof of concept. The relevant quan-
tity in this case is the total amount of time it takes to process
a single event (processing latency) and generate a trigger deci-
sion. In order to avoid dropping events, this must be less than
the time available to temporarily store an event prior to a trig-
ger decision, which is dictated by the limited size of the event
buffers. For the CMS experiment, this available time is on the
order of 10 µs. On the other hand, processing latencies are not
as important for other applications like offline reconstruction or
online triggers with sufficiently large event buffers. Although
such applications may still demand high processing rates, they
can easily be satisfied by the addition of more parallel execu-
tion units. In Section 7.3 we discuss the CPU performance with
this latter case in mind by considering how many processing
units (cores) can be used in parallel to achieve the same pro-
cessing rate as the AP-based system. We do not do this for the
GPU since it is much more difficult to determine and control
the mapping of threads (and their organization into blocks) to
the 15 streaming multiprocessors of 192 cores each, in order to
define some processing unit that can be reliably scaled.
Similar approaches to track finding, based on matching pat-
terns stored in a bank, have been implemented using CAMs or
Associative Memories (AMs) on FPGAs and custom Applica-
tion Specific Integrated Circuits (ASICs) [11, 12]. Since com-
parisons with such solutions provide a more level playing field
than with CPUs and GPUs, we conclude this section by briefly
describing a recent implementation of a CAM-based track
finder on an FPGA and comparing its capabilities with those of
the AP.
7.1. CPU Comparison
We compile a single threaded C-version of the algorithm
described in Ref. [15] using the Intel C compiler (v16.0.1). We
then run it on a 3.3 GHz Intel Core i7 (5820K) processor using
the same four simulated data samples. Using the Intel CPU’s
time stamp counters, we measure the number of CPU cycles to
[Figure 13, titled "CPU cycles vs threads"; x-axis: number of threads;
y-axis: CPU cycles (×10³); curves: 140, 110, 80, and 50 pileups.]
Figure 13: The plot above shows the CPU cycles presented in Figure 12 as a
function of the number of threads for each data sample.
execute the trigger for each EM cluster and plot the mean as a
function of the sample in Figure 12a with a solid line and solid
circular markers. For a 0.3 ns CPU clock cycle, these results
translate to 3.38, 7.60, 16.7, and 32.1 µs, respectively, for the
50, 80, 110, and 140 pileup samples.
The processing cycles are also shown separately for EM
clusters with at least one matching track (dashed line with up-
right triangular markers) and EM clusters with no matching
tracks (dashed line with inverted triangular markers). Clusters
with matching tracks require less processing time because the
algorithm quits forming all possible combinations of 4 hits as
soon as a matching track is found. The opposite is true for
clusters with no matching tracks since the algorithm ends up
attempting all possible combinations.
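The difference between the two classes of clusters can be illustrated
with a minimal sketch of a combinatorial search with early exit. This is
not the algorithm of Ref. [15]: the function name is hypothetical, and
the track criterion is passed in as a placeholder predicate.

```python
from itertools import product


def find_track(layers, is_track):
    """Search 4-hit combinations, one hit per layer, stopping at the first match.

    layers is a list of four hit lists; is_track is a placeholder
    predicate deciding whether a combination forms a track.
    Returns (matching combination or None, number of combinations tried).
    """
    tried = 0
    for combo in product(*layers):
        tried += 1
        if is_track(combo):
            return combo, tried   # quit as soon as a match is found
    return None, tried            # no match: every combination was tried
```

With three hits in each of four layers, a cluster with no matching track
forces all 81 combinations to be tried, while a cluster whose first
combination matches stops after a single try.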
These results clearly exhibit a quadratic rise in processing
times, which increase by about an order of magnitude (from
3.38 to 32.1 µs) in going from 50 to 140 pileup interactions.
In contrast, the results in Figure 11 show that the AP
processing times scale almost linearly as the number of pileup
interactions in the samples is increased. They increase by a
factor of about 2 (1.7× in the R − φ view and 2.2× in the
R − Z view) in going from 50 to 140 pileup interactions.
Using OpenMP, we created a multithreaded version of the
code described above and compiled it with the same version
of the Intel C compiler. The results for this multithreaded im-
plementation on the Intel CPU described above, with 6 physical
cores (12 logical cores because of Hyper-Threading), are shown
in Figure 12b for the four simulated samples. This implementa-
tion runs the algorithm using 2 OpenMP threads on each of the
6 physical cores. The meanings of the markers and line types
used in the plot are identical to those in Figure 12a. The average
processing times are 5.23, 7.85, 11.7, and 17.5 µs, respectively,
for the 50, 80, 110, and 140 pileup samples. The result for the
140 pileup sample is ∼ 2× faster than the single threaded CPU
result and ∼ 5× slower than the AP result.
The plots in Figure 13 show the processing cycles as a func-
tion of the number of OpenMP threads for each simulated sam-
ple. These results show that using more threads becomes more
advantageous as the number of hits (which increases with num-
ber of pileup interactions in the sample) that can be processed in
parallel increases. However, as the plot for the 140 pileup sam-
ple indicates, using more cores only helps up to a certain point.
Processing performance flattens out beyond 3 physical cores
(6 threads) and begins to worsen when we exceed 2 threads
per physical core (beyond 12 threads). Increasing the number
of cores beyond a certain point has no effect in reducing the
single-event processing time.
7.2. GPU Comparison
[Figure 14, titled "GPU processing time"; x-axis: number of pileups;
y-axis: CPU cycles (×10³); curves: nomatch, all, matched.]
Figure 14: The plot above shows the number of cycles to process the pixel
track trigger algorithm described in Ref. [15] as a function of the simulated
data samples for the nVidia Tesla GPU described in the text. Data-transfer
times between host and GPU are not included in these results. The cycles are
measured using the host Intel CPU’s time stamp counter. The three different
types of markers used have the same meaning as in Figure 12.
We also implemented the algorithm described in Ref. [15]
on an nVidia Tesla K40c GPU (745 MHz) using nVidia’s CUDA
programming environment. In this case, the loops over 4-layer
hit combinations were unrolled using parallel thread blocks where
each thread block dealt with one hit combination from Layers
1 and 4. Multiple threads in each block then dealt with Layer
2 and 3 hit combinations in parallel. The number of processing
cycles as a function of sample for the GPU are shown in Fig-
ure 14. The processing cycles are measured using the host Intel
CPU’s time stamp counters. The legend used in the graph is
identical to that for the CPU with results shown for all clusters
and separately for the two classes of clusters described above.
The average GPU processing times are 34.8, 38, 44.8, and 53.7
µs, respectively, for the 50, 80, 110, and 140 pileup samples.
The GPU results do not show improvement over the CPU
results, mainly because it is more complicated for the GPU to
break out of a loop upon the first successful track match, or for
it to skip to the next iteration of a loop through continue state-
ments. This makes the execution time dependent on the slowest
thread. The parallel capabilities of the GPU (with 2880 cores)
are also not fully exploited by our test case where we have sig-
nificantly reduced the combinatorics by considering only hits
within the ROI. Furthermore, one must also take into account
additional latencies associated with data transfers between the
CPU and GPU, which contribute an additional > 91K CPU cy-
cles (> 27 µs) to the total cycles (time). One place where the
GPU does better, however, is on events in the tail of the ex-
ecution time distributions of the CPU. Such events have little
influence on the spread of the GPU distributions because the
GPU excels at dealing with problems that have more massive
parallelism.
The algorithm described in Ref. [15] used in the CPU/GPU
comparisons above is functionally but not algorithmically equiv-
alent to our automata-based track finder. Using regular expres-
sions to represent hit patterns, we also implemented an algorith-
mically equivalent NFA-based solution on the same GPU used
above. We measure ∼ 10 seconds to execute the algorithm for
each cluster which is about 6 orders of magnitude longer than
on the AP.
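The idea of representing a hit pattern as a regular expression can be
illustrated with the sketch below. It is not the GPU implementation
referred to above: the function name, the big-endian 16-bit hit
encoding, and the assumption that the stream contains only hit symbols
(no header symbols) are all hypothetical. Python's `re` module is also a
backtracking engine rather than an NFA simulator, so the sketch shows
only the pattern representation, not the execution model.

```python
import re


def compile_track_pattern(hits):
    """Compile a regular expression for one pattern of four 16-bit hits.

    The hits must appear in the given order, each aligned on a 2-byte
    boundary, with any number of other 2-byte hits in between.
    """
    skip = rb"(?:..)*?"   # lazily skip whole 2-byte hits
    body = b"".join(skip + re.escape(h.to_bytes(2, "big")) for h in hits)
    # Anchoring at the start keeps every literal hit on an even offset.
    # DOTALL lets "." match any byte value, including 0x0A.
    return re.compile(rb"\A" + body, re.DOTALL)
```

A stream is then tested with `compile_track_pattern(hits).match(stream)`,
which returns a match object when all four hits of the pattern occur in
order and `None` otherwise.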
7.3. Note on CPU Processing Rate
For applications in which the processing latency is irrele-
vant and only the processing rate matters, it is interesting to see
how many CPU cores it takes to process events in parallel in
order to match the rate of the AP-based system. Using the re-
sults for the sample overlaid with 140 pileup interactions and
assuming that the R − φ and R − Z views are done in parallel,
the average single-event processing time (including the external
processing time) for the AP-based system is 3.62 µs, which is
equivalent to an event rate of approximately 276 kHz. Since the
corresponding time on a single core of the Intel i7 CPU is 32.1
µs, it would require about 9 CPU cores to match the processing
rate of an AP-based system consisting of 490 AP chips.
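The arithmetic behind these numbers can be reproduced with the short
sketch below, which uses only values quoted in the text. The function
names are hypothetical.

```python
import math


def event_rate_khz(latency_us):
    """Event rate in kHz for a given single-event processing time in us."""
    return 1000.0 / latency_us


def cores_to_match(cpu_latency_us, ap_latency_us):
    """Smallest number of CPU cores whose combined rate matches the AP rate."""
    return math.ceil(cpu_latency_us / ap_latency_us)


# Core processing time of the slower (R-Z) view plus the external time.
ap_latency_us = 3.25 + 0.37
```

Here `ap_latency_us` evaluates to 3.62 µs, `event_rate_khz(3.62)` gives
approximately 276 kHz, and `cores_to_match(32.1, 3.62)` gives 9.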
7.4. Associative Memories on an FPGA
Unlike ASICs, FPGAs are programmable, off-the-shelf de-
vices and should offer a fair comparison with the AP. Refer-
ence [17] describes the development of a Pattern Recognition
Associative Memory (PRAM) on modern FPGAs as part of
Fermilab’s tracking trigger R&D program for the LHC experi-
ments. The PRAM, which is based on CAMs, is implemented
using a mid-range Xilinx Kintex UltraScale KU040. Up to
4,000 patterns, for a detector consisting of 6 layers with 15-bit
hit addresses, were stored, consuming 78% of the FPGA
resources [18]. The device was successfully operated at 250 MHz
with a fixed output latency of 7 clock cycles. It is also important
to note that, in this design, hits from all 6 detector layers
are presented simultaneously to the PRAM on 6 parallel 15-bit
input buses, in contrast to the single 8-bit bus of the AP. The
FPGA has an advantage over the AP in terms of maximum pattern
capacity per chip and pattern finding speed. The advantage
of the AP over the FPGA is in the ease with which it can be
programmed using the Micron AP SDK.
8. Conclusion
We have demonstrated a proof-of-concept use of the Micron
Automata Processor in an electron track confirmation trigger
for HEP. Even the current, first version of this technology shows
some promise for HEP trigger applications requiring low pro-
cessing latencies. In the AP’s current form, CAM-based FPGA
and ASIC implementations still surpass it in terms of pattern
storage capacity and processing performance. It is clear that
for specialized applications requiring the highest performance,
such as the most demanding aspects of the lowest level trigger
in a high-luminosity LHC environment, custom ASIC-based
solutions may be the most sensible if not the only approach.
However, the availability of an off-the-shelf, dedicated pattern
matching engine that is easy to program and suitable for HEP
applications, provides a new alternative for situations (e.g. less
demanding triggers and offline reconstruction) in which custom
or even FPGA-based solutions would not have been considered
previously.
Compared with other commodity off-the-shelf solutions like
CPUs and GPUs, the AP requires over two orders of magnitude
fewer hardware cycles to perform our sample track finding
application. With a clock cycle of 7.5 ns on the AP versus
0.3 ns on the CPU, this lower hardware cycle count translates
into a processing latency of 3.62 µs on the AP that is
∼ 5× lower than that of the multithreaded CPU implementation.
Lower processing latencies are crucial for the online trigger ap-
plication considered in this paper. On the other hand, if we
disregard processing latency and pay attention only to process-
ing rate, it requires only 9 CPU cores to achieve the same pro-
cessing rate as an AP-based system consisting of 490 AP chips.
Such requirements can be satisfied by a commodity server with
dual 6-core CPUs.
When comparing the results, one must keep in mind that
the Intel CPU used in this study is the 5th generation of a very
mature technology. Pre-production evaluation versions of the
AP are only becoming available now at the time of this writing.
In our evaluation, we were also conservative in our choice of
configurable parameters such as the size of the report vectors.
Choosing a smaller size can further reduce the times associated
with the internal transfer and readout of these vectors. Future
versions of the AP may also incorporate larger alphabet sizes.
Doubling the current symbol recognition capability from 8-bits
to 16-bits, for example, will cut the time to process 16-bit hit
addresses in our track finding algorithm in half. The possibility
of such improvements, coupled with the results presented
in this paper, suggests that this may be a promising technology
worthy of more detailed consideration in real-world HEP pattern
recognition applications. Lastly, this work marks the first use
of this technology for the recognition of visual patterns.
It opens up a whole new realm which may even include
image processing applications in fields like astronomy.
9. Acknowledgements
This paper is dedicated to the memory of Simon Kwan who
worked tirelessly on the CMS pixel detector and upgrade project
and who inspired our choice of the pixel-augmented electron
trigger as a proof-of-concept application. We are grateful to
David Christian, Aurore Savoy-Navarro, Chang-Seong Moon,
Tiehui Ted Liu, Jin-Yuan Wu, Zijun Xu, and Ken Treptow for
valuable discussions. We thank the supportive staff at the
University of Virginia’s Center for Automata Processing and Micron
Technology for their technical assistance. Fermilab is operated
by Fermi Research Alliance, LLC under Contract No.
DE-AC02-07CH11359 with the United States Department of
Energy.
References
[1] G. Charpak, R. Bouclier, T. Bressani, J. Favier, Č. Zupančič, The
use of multiwire proportional counters to select and localize charged
particles, Nuclear Instruments and Methods 62 (3) (1968) 262–268.
doi:10.1016/0029-554X(68)90371-6.
[2] N. S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. S.
Hu, M. J. Irwin, M. Kandemir, V. Narayanan, Leakage current:
Moore’s law meets static power, Computer 36 (12) (2003) 68–75.
doi:10.1109/MC.2003.1250885.
[3] P. Dlugosch, D. Brown, P. Glendenning, M. Leventhal, H. Noyes, An ef-
ficient and scalable semiconductor architecture for parallel automata pro-
cessing, IEEE Transactions on Parallel and Distributed Systems 25 (12)
(2014) 3088.
[4] K. Wang, Y. Qi, J. J. Fox, M. R. Stan, K. Skadron, Association rule mining
with the Micron Automata Processor, in: Proceedings of the 2015 IEEE
International Parallel and Distributed Processing Symposium, IPDPS ’15,
2015, pp. 689–699. doi:10.1109/IPDPS.2015.101.
[5] K. Zhou, J. J. Fox, K. Wang, Y. Qi, D. E. Brown, K. Skadron, Brill tagging
on the Micron Automata Processor, in: Proceedings of the 2015 IEEE
International Conference on Semantic Computing, ICSC ’15, 2015, pp.
236–239. doi:10.1109/ICOSC.2015.7050812.
[6] I. Roy, S. Aluru, Finding motifs in biological sequences using the Micron
Automata Processor, in: Proceedings of the 2014 IEEE 28th International
Parallel and Distributed Processing Symposium, IPDPS ’14,
IEEE Computer Society, Washington, DC, USA, 2014, pp. 415–424.
doi:10.1109/IPDPS.2014.51.
[7] Instead of this decomposition into 2 separate views, it would be possible,
in principle, to have a single bank consisting of 3 dimensional patterns.
Unfortunately, this is not practical because the number of patterns that
needs to be stored makes it prohibitive in terms of resource usage on the
automata processor.
[8] A. Dominguez, et al., CMS Technical Design Report for the Pixel Detec-
tor Upgrade, CMS-TDR-011 (2012).
[9] T. Sjöstrand, S. Mrenna, P. Skands, Pythia 6.4 physics and manual,
Journal of High Energy Physics 05 (2006) 026.
URL http://stacks.iop.org/1126-6708/2006/i=05/a=026
[10] The CMS collaboration, Study of the underlying event at forward rapidity
in pp collisions at √s = 0.9, 2.76, and 7 TeV, Journal of High Energy
Physics 4 (72) (2013) 1–35. doi:10.1007/JHEP04(2013)072.
[11] M. Dell’Orso, L. Ristori, VLSI structures for track finding, Nuclear
Instruments and Methods in Physics Research Section A: Accelerators,
Spectrometers, Detectors and Associated Equipment 278 (2) (1989) 436–440.
doi:10.1016/0168-9002(89)90862-0.
[12] A. Annovi, A. Bardi, M. Bitossi, S. Chiozzi, C. Damiani, M. Dell’Orso,
P. Giannetti, P. Giovacchini, G. Marchiori, I. Pedron, M. Piendibene,
L. Sartori, F. Schifano, F. Spinella, S. Torre, R. Tripiccione, A VLSI
processor for fast track finding based on content addressable memories,
IEEE Transactions on Nuclear Science 53 (4) (2006) 2428–2433.
doi:10.1109/TNS.2006.876052.
[13] A. Annovi, et al., A new Variable Resolution Associative Memory for
High Energy Physics, in: Proceedings, 2nd International Conference
on Advancements in Nuclear Instrumentation, Measurement Methods
and their Applications (ANIMMA 2011), CERN, Geneva, 2011.
doi:10.1109/ANIMMA.2011.6172856.
[14] Available trigger latency at Level-1 is on the order of ∼ 10 µs for the era
of the High-Luminosity LHC.
[15] C.-S. Moon, A. Savoy-Navarro, Level-1 pixel based tracking trigger
algorithm for LHC upgrade, Journal of Instrumentation 10 (2015) C10001.
URL http://stacks.iop.org/1748-0221/10/i=10/a=C10001
[16] K. Locke, Parameterizable content-addressable memory, Xilinx, Inc.,
XAPP1151 v1.0 (March 2011).
[17] J. Olsen, J. Hoff, T. Liu, J. Wu, Z. Xu, A new way to implement high
performance pattern recognition associative memory in modern FPGAs,
poster presented at Topical Workshop on Electronics for Particle Physics
(TWEPP), September 28–October 2, Instituto Superior Técnico, Lisbon,
Portugal (2015).
[18] CAM-based ASIC implementations have achieved twice this density with
8K patterns per chip and the goal is to achieve 128K patterns per chip in
the next generation of the chip [19, 20].
[19] A. Andreani, A. Annovi, R. Beccherle, M. Beretta, M. Citterio,
F. Crescioli, A. Colombo, P. Giannetti, V. Liberali, J. Shojaii, A. Stabile,
Next generation associative memory devices for the FTK tracking
processor of the ATLAS experiment, in: Nuclear Science Symposium and
Medical Imaging Conference (NSS/MIC), 2013 IEEE, 2013, pp. 1–6.
doi:10.1109/NSSMIC.2013.6829550.
[20] M. Shochet, L. Tompkins, V. Cavaliere, P. Giannetti, A. Annovi, G. Volpi,
Fast TracKer (FTK) Technical Design Report, Tech. Rep. CERN-LHCC-
2013-007. ATLAS-TDR-021, CERN, Geneva, ATLAS Fast Tracker Technical
Design Report (Jun 2013).
URL https://cds.cern.ch/record/1552953
