ORIGAMI: A Heterogeneous Split Architecture for In-Memory Acceleration
  of Learning by Falahati, Hajar et al.
Hajar Falahati§‡ Pejman Lotfi-Kamran‡ Mohammad Sadrosadati[ Hamid Sarbazi-Azad[‡
§Iran University of Science and Technology ‡Institute for Research in Fundamental Sciences (IPM) [Sharif University of Technology
§hfalahati@iust.ac.ir ‡{hfalahati, plotfi, azad}@ipm.ir [sadrosadati@ce.sharif.edu [azad@sharif.edu
ABSTRACT
One of the major challenges in processing machine learning
(ML) algorithms is the memory bandwidth bottleneck. In-
memory acceleration has the potential to address this problem.
However, a solution based on in-memory acceleration needs
to address two challenges. First, in-memory accelerators
should be general enough to support a large set of different
ML algorithms. Second, the solution should be efficient
enough to utilize the bandwidth while meeting the limited
power and area budgets of the logic layer of a 3D-stacked
memory. We observe that previous work fails to simultane-
ously address both challenges.
In this work, we propose ORIGAMI, which includes a heterogeneous
set of in-memory accelerators to support the compute demands of
different ML algorithms, and also uses an off-the-shelf compute
platform (e.g., FPGA, GPU, or TPU) in conjunction with the
in-memory accelerators to utilize the bandwidth without violating
the strict area and power budgets. ORIGAMI offers a
pattern-matching technique that identifies similar patterns of
computation across a set of ML algorithms and extracts a compute
engine for each pattern. These compute engines constitute the
heterogeneous accelerators that are integrated on the logic layer
of a 3D-stacked memory. The combination of these compute engines
can execute any type of ML algorithm. To utilize the available
bandwidth without violating the area and power budgets of the
logic layer, ORIGAMI comes with a computation-splitting compiler
that divides an ML algorithm between the in-memory accelerators
and an out-of-the-memory platform in a balanced way and with
minimum inter-platform communication.
The combination of pattern matching and split execution of-
fers a new design point for the acceleration of ML algorithms.
The evaluation results across 12 popular ML algorithms show
that ORIGAMI outperforms the state-of-the-art accelerator
with 3D-stacked memory in terms of performance and energy-
delay product (EDP) by 1.5× and 29× (up to 1.6× and 31×),
respectively. Furthermore, the results are within a 1% margin
of an ideal system that has unlimited compute resources on
the logic layer of a 3D-stacked memory.
1. INTRODUCTION
Machine learning (ML) is set to revolutionize the way
individuals and society interact with and use machines.
These advances, however, are predicated on high-performance
platforms for the training phase, in which a model is fitted
to a training dataset. The trained model is then used to
evaluate unseen data in the inference phase. Training ML
models is highly compute intensive and, at the same time,
puts a lot of pressure on the memory system [1–18]. Given
these characteristics, in-memory acceleration [6–20] is a
natural fit for accelerating ML algorithms.
With the advent of 3D-stacked memories [21–25], in-memory
acceleration [6,8,9,11,12,14–20] has become a feasible solution.
Several inspiring pieces of work have devised in-memory
accelerators for ML algorithms, but they mostly focus on the
inference phase [6, 8, 9, 12, 14–18, 26] or on the training phase
of special kinds of ML algorithms [11] that can be handled with
the compute units of the inference phase, i.e.,
multiply-accumulate (MAC) units.
An ideal in-memory accelerator for training ML algorithms
should be (1) general to support different kinds of ML
algorithms, as there are variations in the compute patterns
of different ML algorithms, and (2) efficient to capture
the available bandwidth of 3D-stacked memories [21–23],
while meeting the limited power and area budgets of these
memories. We observe that previous work limits the potential
capability of in-memory accelerators, as none of it provides
all the necessary features. As an example, integrating
general-purpose units [27] inside the 3D-stacked memory
captures only up to 16% of the available bandwidth
(see § 2 for more detail).
We set out to explore in-memory acceleration with
heterogeneous compute units to support a wide range of
ML algorithms. Investigating a wide range of popular ML
algorithms, we observe that they share common compute
patterns, each of which can be executed on a specialized
compute unit with low area and power overheads.
The combination of these compute units can execute any
type of ML algorithm. Constrained by the limited area and
power budgets of a 3D-stacked memory, however, even these
highly optimized compute units capture only 47% of the memory
bandwidth (see § 2 for more details). Although the captured
bandwidth is much larger than that of the general-purpose
units, it is still lower than the total available bandwidth. We
conclude that in-memory accelerators alone cannot utilize
the whole available bandwidth, even with light-weight
compute units, because of the limited area and power budgets
of the 3D-stacked memory.
To capture all the available bandwidth, we aim to enable
a split execution between the light-weight heterogeneous
in-memory engines and an out-of-the-memory compute platform.
We observe that existing 3D interfaces can transfer
about 63% of the internal bandwidth to an out-of-memory
compute platform [21, 22]. Inside the memory, we integrate
as many compute units as the area and power budgets allow to
capture the 3D-stacked memory bandwidth. We drive an
out-of-the-memory compute platform with the unused portion of
the bandwidth. To fully utilize the two platforms, we observe
that ML algorithms are composed of many parallel regions,
which facilitate execution of ML algorithms over the two
platforms, in-memory accelerators and an out-of-the-memory
platform, with minimal inter-platform communication.
arXiv:1812.11473v2 [cs.LG] 9 Jan 2019
ORIGAMI¹ is a hardware-software solution that combines
compute patterns with a heterogeneous set of in-memory ac-
celerators, and splits the execution over the in-memory accel-
erators and an out-of-the-memory platform. ORIGAMI trans-
lates the common compute patterns of different ML algo-
rithms into heterogeneous compute engines that should be
integrated on the logic layer of 3D-stacked memories. More-
over, ORIGAMI efficiently distributes parts of the computa-
tion of an ML algorithm to an out-of-the-memory compute
platform to capture all of the available memory bandwidth
provided by a 3D-stacked DRAM.
This paper makes the following contributions:
• We extract common compute patterns and parallelism types
of a set of different ML algorithms.
• We propose ORIGAMI that benefits from a set of heteroge-
neous in-memory accelerators derived from the identified
compute patterns, and splits the computation over the in-
memory accelerators and an out-of-the-memory compute
platform using the identified parallelism types.
• We show that ORIGAMI outperforms the state-of-the-art
solution in terms of performance and energy-delay product
(EDP) by 1.5× and 29× (up to 1.6× and 31×), respectively.
Moreover, ORIGAMI is within a 1% margin of an ideal
system, which has unlimited compute resources on the
logic layer of a 3D-stacked memory.
2. MOTIVATION
There are two phases in processing ML algorithms: (1) a
training phase that optimizes the model parameters over a
training dataset, and (2) an inference phase where the trained
model is deployed to process new unseen data. While both
phases are computationally intensive, the training phase
demands more compute resources for two reasons. First,
the training phase is a superset of the inference phase. The
inference phase includes only multiply-accumulate (MAC)
operations, whereas the training phase, which optimizes different
objective functions, also includes other operations such as
non-linear functions. Second, to achieve high accuracy for the
trained model, ML algorithms require copious amounts of processing
power to iterate over vast amounts of training data [1–5, 10].
The intrinsic parallelism of ML algorithms has inspired both
academia and industry to explore acceleration platforms such
as FPGAs [27–33], GPUs [34–37], and ASICs [1,2,5,32,38–49].
However, the large memory footprints of ML algorithms
limit the potential performance benefits of acceleration.
2.1 Memory Bandwidth Bottleneck
To keep compute resources busy, accelerators need to
transfer huge amounts of data, which makes the memory
subsystem a serious bottleneck in terms of bandwidth and
energy [1–18,26,27,29,37]. A large body of work has explored
in-memory processing, built upon DRAM [45], SRAM [50],
non-volatile memories [7, 10, 13], and 3D-stacked
memories [6, 8, 9, 11, 12, 14–18, 26] for performance improvements
and energy savings.
¹ The small number of basic origami folds can be combined in a
variety of ways to make intricate designs.
Many pieces of prior work propose in-memory accelerators
built upon 3D-stacked memories, as these memories are
commercially available (e.g., HMC [21, 22]) and in-memory
processing within them is feasible. 3D-stacked
memories stack multiple DRAM dies on top of each other
inside a package. These dies are vertically connected via
thousands of low-capacitance through-silicon vias (TSVs) to
a logic die in which the memory controllers are located.
High-speed signaling circuits connect the logic die to the
active die (e.g., CPU, GPU, or FPGA) outside the memory.
Taken together, 3D-stacked memories provide
massive bandwidth at an access energy 3 to 5 times
lower than that of conventional DRAM.
2.2 Challenges of In-memory Acceleration
While in-memory accelerators have the potential to address
the memory bandwidth bottleneck, they face two main challenges.
First, there are many types of ML algorithms, and an in-memory
accelerator should be able to accelerate them effectively.
Second, there is a significant constraint on the area and power
usage of the logic die in
3D-stacked memories [7, 8, 10, 11, 13, 15, 24, 26, 51].
To evaluate in-memory accelerators, we define two parameters:
(1) Generality: how flexibly the architecture supports
different kinds of ML algorithms; (2) Efficiency:
how much of the available bandwidth the architecture can utilize
subject to area and power constraints.
To achieve generality in accelerating a wide range of
ML algorithms with different objective functions, an in-
memory accelerator may use general-purpose execution units
to execute different operations including various non-linear
operations (e.g., Sigmoid) in the training phase. General-
purpose execution units can provide generality. However,
they are expensive in terms of area and power. On the other
hand, to achieve efficiency, we need to integrate as many
general-purpose execution units as needed to capture the
whole available bandwidth.
Prior in-memory accelerators based on 3D-stacked memories
support either the inference phase [6, 8, 9, 12, 14–18, 26] or the
training phase of special kinds of ML algorithms such as
Convolutional Neural Networks (CNNs) [11]. As no
previous in-memory accelerator supports the training
phase of different types of ML algorithms, we implement
general-purpose units similar to those in prior work [27, 32].
Considering the available power and area budgets of 3D-
stacked memories, these general-purpose units only capture
80 GB/s (16%) of the available bandwidth (out of 512 GB/s,
more details in § 6.2). The captured bandwidth is much lower
than the total bandwidth of 3D-stacked memories, which
shows that general-purpose in-memory accelerators fail to
utilize the available bandwidth.
2.3 Holistic In-memory Approach
To alleviate the bandwidth bottleneck and accelerate the
training phase of a wide range of ML algorithms, we propose
a holistic in-memory approach, called ORIGAMI, which sat-
isfies both ML requirements and limitations of 3D-stacked
memories. While we focus on accelerating the training phase
of ML algorithms, the proposed idea can also be applied to
the inference phase, as the training phase is a superset of the
inference phase.
ORIGAMI benefits from low-overhead compute engines
as in-memory accelerators on the 3D-stacked memory to
capture as much 3D-stacked memory bandwidth as possible
(240 GB/s of 512 GB/s). ORIGAMI uses the rest of the
bandwidth (272 GB/s of 512 GB/s) to drive an out-of-the-
memory compute platform, which can be an ASIC, GPU, FPGA,
TPU, or any other type of compute platform.
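As a quick check of the bandwidth budget quoted above, the following sketch recomputes the shares from the figures given in the text. The constant and function names are illustrative and are not taken from the paper.

```python
# Bandwidth figures quoted in the text, in GB/s. Names are illustrative.
TOTAL_BW = 512            # internal bandwidth of the 3D-stacked memory
GENERAL_PURPOSE_BW = 80   # captured by general-purpose in-memory units
IN_MEMORY_BW = 240        # captured by the light-weight compute engines
EXTERNAL_LINK_BW = 320    # most the external interface can deliver


def share(part, total=TOTAL_BW):
    """Return `part` as a percentage of `total`."""
    return 100.0 * part / total


# Bandwidth left over for the out-of-the-memory platform.
leftover_bw = TOTAL_BW - IN_MEMORY_BW

# The leftover can only be used if the external interface can carry it.
fits_on_link = leftover_bw <= EXTERNAL_LINK_BW

print(f"general-purpose units: {share(GENERAL_PURPOSE_BW):.1f}% of internal bandwidth")
print(f"light-weight engines:  {share(IN_MEMORY_BW):.1f}% of internal bandwidth")
print(f"external interface:    {share(EXTERNAL_LINK_BW):.1f}% of internal bandwidth")
print(f"left for external platform: {leftover_bw} GB/s (fits on link: {fits_on_link})")
```

The rounded shares correspond to the 16%, 47%, and 63% figures used in the text, and the leftover of 272 GB/s stays below the 320 GB/s that the external interface can deliver.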
We build ORIGAMI upon two key ideas:
Pattern-Aware Execution. ORIGAMI exploits pattern-aware
execution to accelerate different ML algorithms with
low-overhead heterogeneous compute engines on the logic
die of a 3D-stacked memory. First, ORIGAMI identifies
the compute patterns that ML algorithms have in common. Second,
ORIGAMI implements each compute pattern with a specific
hardware unit, called a compute engine, that has low area and
power overheads. The combination of these heterogeneous
compute engines can accelerate different types of ML algorithms.
Split Execution. 3D-stacked memories provide the active die
with external bandwidth of up to 320 GB/s. This observation
motivates us to integrate as many compute engines as possible
on the logic die and feed the unused portion of the available
bandwidth to the active die. The combination of accelerators
on the logic die and the compute platform on the active die
utilizes the whole 3D-stacked DRAM bandwidth.
To partition an ML algorithm efficiently between the in-memory
accelerators and the out-of-the-memory platform, a
partitioning algorithm should have three features. (1) Concurrency:
Partitioned parts should be able to run simultaneously.
(2) Minimum inter-communications: Partitioned parts should
have no or minimum inter-platform communications. (3)
Load Balancing: Partitioned parts should be proportional to
the compute capabilities of the two platforms. The compute
capability of a platform is limited by two factors: its compute
throughput, which depends on the speed of the hardware, and its
memory throughput, which depends on the available memory bandwidth.
To obtain all of these features, we exploit the various
parallelism types inside ML algorithms. ORIGAMI extracts
three parallelism types from ML algorithms, which guarantee
concurrency and minimum inter-communications. Then,
ORIGAMI uses an assignment algorithm to partition these
concurrent parts based on the compute capabilities of the two
platforms, which guarantees load balancing.
We compare some prior work [11, 19, 20, 27, 33, 48, 49]
in Table 1. In-memory and Training columns show if the
method uses an in-memory accelerator and targets the training
phase, respectively. The other five columns show whether the
method offers generality, split execution, concurrency, load-
balancing, and minimum inter-communications, respectively.
As summarized in Table 1, pieces of prior work that use in-memory
accelerators [11, 19, 20] do not support execution of
the training phase of different kinds of ML algorithms. While
TABLA [27] is a general method to accelerate the training
phase of ML algorithms, it suffers from the memory bandwidth
problem. Other pieces of work [20, 33, 48, 49] benefit from
split execution but do not support acceleration of the training
phase of different kinds of ML algorithms. Scalpel [48],
Proger PIM [20], and Resource Partitioning [33] execute
only special parts of algorithms over multiple resources
and run the rest on just one compute resource. Such
techniques fail to provide concurrency and load balancing.
Solutions that partition ML algorithms based on the
types of layers, e.g., Scaledeep [49], which assigns
memory-intensive parts to the in-memory platform and
compute-intensive parts to the out-of-the-memory platform,
neglect minimum inter-communications and do not always
distribute the computation in a load-balanced manner. As shown
in the table, none of the prior work has all the required features.
Table 1: Characteristics of previous ML accelerators in supporting:
in-memory acceleration (In-memory), training phase (Training),
generality (Generality), split execution (Split), concurrency (Conc),
load-balancing (L-B), and minimum inter-communications (Min I-C).
Approach                    In-memory  Training  Generality  Split  Conc  L-B  Min I-C
TABLA [27]                  No         Yes       Yes         No     No    No   No
Scalpel [46]                No         Yes       No          Yes    No    No   Yes
Resource Partitioning [31]  No         No        No          Yes    No    No   No
Scaledeep [47]              No         Yes       No          Yes    No    No   No
Neurocube [11]              Yes        Yes       No          No     No    No   No
CMP-PIM [19]                Yes        No        No          No     No    No   No
Proger PIM [20]             Yes        Yes       No          Yes    Yes   Yes  Yes
3. PATTERN-AWARE EXECUTION
To accelerate different ML algorithms, one solution is
to use general-purpose execution units. However, as we
discussed in § 2, although general-purpose execution units
(e.g., [27]) can accelerate a wide range of ML algorithms,
they utilize only a small fraction of the available memory
bandwidth provided by a 3D-stacked memory due to their
large area overhead. To address this problem, this work
proposes to integrate light-weight accelerators whose
power and area are much smaller
than those of general-purpose accelerators and that, at the same
time, provide enough generality to accelerate different kinds
of ML algorithms. Our key idea is to identify common
compute patterns of different ML algorithms and map them
to light-weight hardware accelerators.
3.1 Compute Patterns
We thoroughly examine the compute graphs of different
ML algorithms and break the graphs down to several compute
patterns. We observe that some of these compute patterns are
common across different ML algorithms. Each ML algorithm
has an objective function, (f), and a set of weights, (W), a.k.a.
models,² that map the elements of an input vector, (X), to the
output, (Y), as shown in Equation 1.
$$\exists\, W \;\; \min \sum_{i} f(W, X_i, Y_i) \qquad (1)$$
The objective function is a cost function that measures the
distance between the predicted output and the
actual output for the corresponding input dataset. By solving an
optimization problem over the training data, ML algorithms
gradually minimize the objective function.
Stochastic Gradient Descent (SGD) is a widely used algorithm
for gradually minimizing objective functions [52–58].
Because of its popularity, we consider SGD as the optimization
algorithm in this paper; however, our work is general enough
to accommodate other optimization algorithms as well.
² We use weight and model interchangeably in this paper.
Equation 2 shows how SGD solves the optimization problem
defined in Equation 1.
$$W^{(t+1)} = W^{(t)} - \mu \times \frac{\partial \left( \sum_{i} f(W^{(t)}, X_i, Y_i) \right)}{\partial \left( W^{(t)} \right)} \qquad (2)$$
SGD updates $W^{(t)}$ by computing $W^{(t+1)}$ in the direction
opposite to the gradient of the objective function, ∂(f), which
moves the weights toward a minimum of the objective function.
ML algorithms use the updated weights with another m input
vectors in the next iteration.
Parameter µ is the learning rate of the ML algorithm.
Considering the objective functions and parameters of
different ML algorithms, we extract four types of compute
patterns in the compute graphs of different ML algorithms.
The combination of these compute patterns optimizes the
objective function. Three types of these compute patterns are
common among different ML algorithms, and one of them is
algorithm dependent.
• Common Compute Patterns.
1. REDUCTION Compute Pattern. The first compute pattern
of different ML algorithms calculates the dot product of
the input vector, (X), and the weight vector, (W). We refer
to this dot product, $\sum_i X_i \times W_{ji}$ with $i \in [0,k)$ and $j \in [0,n)$, as
REDUCTION. This compute pattern is needed to compute
the predicted output.
2. COMPARATOR Compute Pattern. The predicted output
of an ML algorithm is compared against a threshold or the
known output, (Y), usually by subtraction. We
refer to this compute pattern as COMPARATOR. This
compute pattern calculates the output of the objective function,
delta.
3. OPTIMIZATION Compute Pattern. Using an optimiza-
tion method, an ML algorithm updates the models to min-
imize the output of the objective function, delta. This
compute pattern, W (t+1) = W (t) − µ × delta, is referred
to as OPTIMIZATION.
• Algorithm-Dependent Compute Pattern. In addition to
the aforementioned compute patterns, some ML algorithms
need to perform extra operations in their objective functions
to calculate the predicted output and the delta value. These
extra operations differ from one ML algorithm to another,
and include basic operations (e.g., −, +, ∗, <, and >) and
non-linear operations (e.g., Sigmoid, Gaussian, Sigmoid
Symmetric, and Log). We refer to this compute pattern as
algorithm-dependent (a.k.a. SPECIAL).
Table 2 shows the compute patterns of the ML algorithms
that we considered. For example, the compute patterns of
LogReg, as shown in Figure 1, include REDUCTION (1),
SIGMOID (2), COMPARATOR (3), and OPTIMIZATION (4).
This table shows that different ML algorithms have common
compute patterns: REDUCTION, COMPARATOR, and OPTIMIZATION.
Note that LogReg has the same compute patterns
as 2D-Reg. The reason is that LogReg and 2D-Reg exploit the
same objective function to optimize one-dimensional and
two-dimensional models, respectively.
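To make the four LogReg compute patterns concrete, here is a minimal sketch of one training step written as a chain of REDUCTION, SIGMOID, COMPARATOR, and OPTIMIZATION. The function names are illustrative. The per-weight update `w_i - mu * delta * x_i` assumes the standard logistic-regression gradient and is not quoted from the paper.

```python
import math


def reduction(x, w):
    """REDUCTION: dot product of the input vector and the weight vector."""
    return sum(xi * wi for xi, wi in zip(x, w))


def sigmoid(z):
    """SPECIAL (algorithm-dependent) pattern for LogReg: the Sigmoid function."""
    return 1.0 / (1.0 + math.exp(-z))


def comparator(predicted, expected):
    """COMPARATOR: difference between predicted and known output (delta)."""
    return predicted - expected


def optimization(w, x, delta, mu):
    """OPTIMIZATION: move every weight against the gradient.

    Assumes the usual logistic-regression gradient, delta * x_i.
    """
    return [wi - mu * delta * xi for wi, xi in zip(w, x)]


def logreg_step(w, x, y, mu):
    """One SGD step on a single sample, chaining the four patterns."""
    z = reduction(x, w)
    predicted = sigmoid(z)
    delta = comparator(predicted, y)
    return optimization(w, x, delta, mu)


if __name__ == "__main__":
    weights = logreg_step(w=[0.0, 0.0], x=[1.0, 2.0], y=1.0, mu=0.1)
    print(weights)
```

Dropping the `sigmoid` call leaves only the three common patterns, which is the sense in which the SPECIAL pattern is the only algorithm-dependent part.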
3.2 Light-Weight Compute Engines
Compute patterns in the compute graphs of ML algorithms
can easily be mapped to a set of heterogeneous compute
engines such as REDUCTION UNIT, COMPARATOR UNIT,
OPTIMIZATION UNIT, and NON-LINEARITY UNIT, where
the compute engines are customized to execute one particular
compute pattern efficiently. Each compute engine (accelerator),
as shown in Figure 2, is named after its corresponding
compute pattern.
Figure 1: Compute patterns of Logistic Regression (LogReg)
algorithm.
Reduction Compute Engine (REDUCTION UNIT). To perform the
REDUCTION compute pattern, we need an array of
multipliers, along with an arrangement of adders to accumulate
the products into one final output. The REDUCTION UNIT
compute engine, k/RU, is an array of k multipliers, each
evaluating the product of xi and wi, followed by as many
adders as required to aggregate the products
into one final output, which is saved in the sum register.
Figure 2 (1) shows a REDUCTION UNIT of size 8, '8/RU', in
which eight multiplications are performed, and the products
are accumulated using several levels of adders.
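The behavior of a k/RU can be sketched in software as k independent multiplications followed by pairwise additions, level by level. This is only a functional model under the assumption that the adders are arranged as a balanced tree; the function name and the returned level count are illustrative.

```python
def reduction_unit(x, w):
    """Functional model of a k/RU: k multipliers, then levels of pairwise adders.

    Returns the final sum and the number of adder levels that were needed.
    The balanced-tree arrangement of adders is an assumption of this sketch.
    """
    if len(x) != len(w) or not x:
        raise ValueError("x and w must be non-empty and of equal length")

    # Multiplier array: all k products are independent of each other.
    level = [xi * wi for xi, wi in zip(x, w)]

    adder_levels = 0
    while len(level) > 1:
        # Each adder combines two neighbouring values of the current level.
        next_level = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            # An odd element is passed through unchanged to the next level.
            next_level.append(level[-1])
        level = next_level
        adder_levels += 1

    return level[0], adder_levels


if __name__ == "__main__":
    total, levels = reduction_unit([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8)
    print(total, levels)
```

For an 8/RU the model uses three adder levels (four, two, and one adder), matching the "several levels of adders" in the description.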
Comparator Compute Engine (COMPARATOR UNIT).
The COMPARATOR UNIT, labeled as (2) in Figure 2, performs
a subtraction between the predicted output, PO, and the
expected output, EO, and generates a difference to be used
for updating the models.
Optimization Compute Engine (OPTIMIZATION UNIT).
The OPTIMIZATION UNIT, labeled as (3) in Figure 2, is a serial
chain of two multiplier units and a subtractor unit, which
updates one element of the model array. To update n weights, we
need to integrate n instances of the OPTIMIZATION UNIT.
Non-Linearity Compute Engine (NON-LINEARITY UNIT).
The last compute pattern, SPECIAL, is algorithm-specific and
mostly performs non-linear operations. Linear operations of
SPECIAL are performed by other compute engines. Regard-
ing the non-linear operations, one realization of this accel-
erator is a general-purpose ALU that has a special unit for
non-linear operations. However, this realization is expensive
in terms of area and power.
Table 2: Compute patterns for several ML algorithms.
Benchmarks
2D-Reg RU COMP
Reco RU MUL
SIG
Bprop RU SIG SIG
SVM RU MUL
LogReg RU COMP SIG
LinReg RU OPT
MUL
SIG MUL MUL MUL MULRU RU RU
COMP
COMP
COMP
COMP
COMP COMP
OPT
OPT
OPT
OPT
OPT
Computational Blocks
Figure 2: Reduction, Comparator, and Optimization compute
engines.
Figure 3: Three different types of parallelism, namely
block_level, partial_level, and model_level. Panels: (1) model_level,
(2) partial_level, (3) block_level.
To address this problem, we borrow the idea of implementing
non-linear functions using lookup tables from prior
work [59–62], but instead of lookup tables, we use the
die-stacked DRAM to hold the outputs of non-linear functions.
This is mainly because the area and power budgets of the
logic die are limited and we prefer to dedicate the available
area and power to the other compute engines.
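The idea of replacing a non-linear function by precomputed outputs can be sketched as follows. A Python list stands in for the table that the text places in die-stacked DRAM. The input range, the number of entries, and the nearest-entry lookup are assumptions of this sketch, not parameters given in the paper.

```python
import math


def build_sigmoid_table(lo=-8.0, hi=8.0, entries=1024):
    """Precompute Sigmoid outputs on a uniform grid over [lo, hi].

    The range and table size are illustrative assumptions.
    """
    step = (hi - lo) / (entries - 1)
    table = [1.0 / (1.0 + math.exp(-(lo + i * step))) for i in range(entries)]
    return table, lo, step


def lookup_sigmoid(z, table, lo, step):
    """Approximate Sigmoid(z) by reading the nearest precomputed entry."""
    index = round((z - lo) / step)
    # Inputs outside the tabulated range saturate to the first or last entry.
    index = max(0, min(len(table) - 1, index))
    return table[index]


if __name__ == "__main__":
    table, lo, step = build_sigmoid_table()
    for z in (-3.0, 0.0, 3.0):
        exact = 1.0 / (1.0 + math.exp(-z))
        print(z, lookup_sigmoid(z, table, lo, step), exact)
```

The lookup costs one memory read and no arithmetic beyond index computation, which is the trade the text describes: memory capacity is spent so that logic-die area and power can go to the other compute engines.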
4. SPLIT EXECUTION
As the area and power budgets of the logic layer of 3D-
stacked memories are limited, in-memory accelerators cannot
utilize the whole bandwidth offered by 3D-stacked DRAM
even if we use light-weight compute engines. One way to
address this limitation is to use out-of-the-memory resources
on the active die in parallel with the in-memory accelerators.
The out-of-the-memory resource can be an ASIC, GPU, FPGA,
TPU, or any other type of compute platform. To this end,
we need to split the execution across the two platforms. Moreover,
split execution improves performance if both platforms
are fully utilized and inter-platform communications
are minimal.
Analyzing the compute graphs of ML algorithms, we devise
a platform-aware partitioning mechanism. We observe
that there are three different types of parallelism inside ML
algorithms. Based on these parallelism types and the spec-
ifications of the two platforms (heterogeneous accelerators
on the logic die and the out-of-the-memory compute plat-
form), we partition the compute graph over the platforms to
maximize resource utilization with minimal inter-platform
communications.
4.1 Parallelism Types
We observe that there are up to three types of parallelism,
as shown in Figure 3, in the compute graph of an ML algo-
rithm.
1. Model_level parallelism. Some ML algorithms optimize
more than one weight array and each weight array works
on completely distinct data arrays. For example, Recom-
mender Systems (Reco) algorithm optimizes two indepen-
dent weights: movie_feature and users_feature. Thus, its
compute graph consists of independent compute subgraphs,
labeled 1 in Figure 3.
2. Partial_level parallelism. The weight arrays of some
ML algorithms such as Recommender Systems (Reco), 2D-Regression
(2D-Reg), and Back-propagation (BProp) have
more than one dimension, $W_{ji}$ with $i \in [0,k)$ and $j \in [0,n)$, where
$n > 1$. In such cases, the compute graph consists of up to n
independent compute subgraphs. These compute subgraphs
work on all elements of an input array and a portion of
the elements of the weight array, as labeled 2 in Figure 3.
With this type of parallelism, we can create two subgraphs
that work on two distinct portions of the weight array,
model[a][i] and model[b][i], where $a \in [0,n_1)$, $b \in [n_1,n)$,
and $i \in [0,k)$.
3. Block_level parallelism. There is internal parallelism
in the REDUCTION and OPTIMIZATION compute patterns,
as they operate on the weight and input arrays, $W_i$ and
$X_i$ where $i \in [0,k)$. As shown by the compute subgraphs
labeled 3 in Figure 3, these compute patterns can be broken
into two parts, which operate on distinct elements of the input
and weight arrays. While the two parts of the OPTIMIZATION
compute pattern are independent of each other, the
two parts of the REDUCTION compute pattern generate
partial sums that need to be aggregated (i.e., inter-platform
communication is required). To benefit from this source of
parallelism, we partition the input and weight arrays into two parts.
The REDUCTION compute pattern is executed on both
partitions in parallel on the two platforms, and the two partial
results are then aggregated. To
minimize the inter-platform communications, the OPTIMIZATION
computation is executed on the same two
partitions, each on its own platform.
As ML algorithms always include REDUCTION and OPTI-
MIZATION compute patterns, our analysis reveals that there is
at least one source of parallelism in an ML algorithm (while
some algorithms benefit from two or even all three sources
of parallelism).
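Block_level parallelism, as described in the list above, can be illustrated with a short sketch in which the two platforms are modeled as two slices of the same arrays. The function names and the `split` index are illustrative, and the per-element update rule `w_i - mu * delta * x_i` is an assumption consistent with an optimization unit built from two multipliers and a subtractor.

```python
def partial_reduction(x, w):
    """REDUCTION over one partition of the input and weight arrays."""
    return sum(xi * wi for xi, wi in zip(x, w))


def block_level_reduction(x, w, split):
    """Run REDUCTION on two disjoint partitions and aggregate the partial sums."""
    in_memory_sum = partial_reduction(x[:split], w[:split])   # in-memory engines
    external_sum = partial_reduction(x[split:], w[split:])    # external platform
    # Exchanging one partial sum is the only inter-platform communication.
    return in_memory_sum + external_sum


def block_level_optimization(w, x, delta, mu, split):
    """Update each partition of the weights on the platform that already holds it."""
    in_memory_part = [wi - mu * delta * xi for wi, xi in zip(w[:split], x[:split])]
    external_part = [wi - mu * delta * xi for wi, xi in zip(w[split:], x[split:])]
    # The two halves are independent, so no communication is needed here.
    return in_memory_part + external_part


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    w = [4, 3, 2, 1]
    for split in range(len(x) + 1):
        print(split, block_level_reduction(x, w, split))
```

Whatever split point is chosen, the aggregated result equals the unpartitioned dot product, so the split can be picked purely for load balance.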
4.2 Platform-Aware Partitioning
Using the three types of parallelism, we partition an ML
algorithm using Algorithm 1. The partitioning algorithm receives
the compute graph, G, and the specifications of the two
platforms (in-memory and out-of-the-memory, labeled as
MEM and External in the algorithm, respectively), and statically
partitions the compute graph into two subgraphs to be
assigned to the two platforms. The partitioning algorithm
attempts to maximize the resource utilization using two key
ideas: (1) minimizing inter-platform communications, and
(2) splitting the execution over the two platforms in a
load-balanced manner. As a result, the algorithm makes sure that
the two platforms finish their execution at about the same
time. The static nature of the partitioning algorithm relieves
the hardware control unit of the overhead of runtime load
balancing. The platform-aware partitioning algorithm follows
two steps:
• Minimizing inter-platform communications. First, we
extract all the available types of parallelism in an ML
algorithm. Second, out of the available types of parallelism,
we pick the best one to partition the compute graph
into two subgraphs (line 5). The model_level parallelism
has the highest priority as it includes two independent
subgraphs working on different weights and inputs. The
next priority belongs to the partial_level parallelism, in
which the compute graph is divided into two independent
subgraphs working on distinct parts of the weight array.
It ranks below model_level parallelism because, unlike in
model_level parallelism, the two partitions operate on
the same input. The block_level parallelism has the lowest
priority because it requires inter-platform communications,
while the other two types of parallelism (i.e., model_level
and partial_level) have no inter-platform communications.
• Providing load-balanced partitioning. To offer a
load-balanced partitioning between the two compute platforms,
we set the size of each partition (subgraph) based on the
throughput of the corresponding platform (lines 7-26). To
this end, we calculate the throughput of a platform using
Equation 3:
$$\text{Throughput} = \min(\text{Memory BW}, \text{Compute BW}) \qquad (3)$$
For memory bandwidth, we assign as much bandwidth as
needed to the heterogeneous compute engines on the logic
die. The rest of the bandwidth is given to the out-of-the-memory
platform. To calculate the compute bandwidth, we
use Equation 4:
$$\text{Compute BW} = \sum_{i} \sum_{j} I_{ij} \times \text{data size} \times \text{frequency} \qquad (4)$$
where $I_{ij}$ indicates the number of inputs of the ith resource,
data size indicates the size of each input in bytes, and
frequency is the operating frequency of the platform (lines
7-9). Assume that a platform has a 1 GHz frequency,
320 GB/s memoryBW, and 1000 32-bit multipliers. Its
compute bandwidth is 8 TB/s (1000 multipliers × 2 inputs ×
4 bytes × 1 GHz) and its throughput is 320 GB/s.
To offer a load-balanced partitioning, we partition the
compute graph of an ML algorithm based on the throughput
ratio of the two platforms, rateThroughput.
1. Block_level parallelism. We divide the k weights into
two parts in proportion to rateThroughput (k =
num1 + num2; num1 = rateThroughput × num2) (lines
12-15).
2. Partial_level parallelism. We divide the second
dimension of the weight array, whose size is n, into
two parts of sizes n1 and n2 in proportion to
rateThroughput (n = n1 + n2; n1 = rateThroughput
× n2) (lines 16-19).
3. Model_level parallelism. We sort the independent
models by size. Starting from the largest, we assign
models, one by one, to the platform with the larger throughput
until the partitioning ratio reaches rateThroughput. Because
model_level parallelism offers only coarse-grained
partitioning (i.e., a whole model is assigned to a
platform), the partitioning ratio might not exactly
reach rateThroughput. Consequently, after model_level
partitioning is done, we apply the other available types of
parallelism (e.g., block_level or partial_level) to bring
the ratio to rateThroughput.
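The throughput calculation of Equations 3 and 4 and the proportional split used for block_level and partial_level parallelism can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PlatformSpec` class, its field names, and the `split` helper are hypothetical, all resources of a platform are assumed identical, and rounding to the nearest integer is our choice rather than something the text specifies.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PlatformSpec:
    """Hypothetical description of one compute platform."""
    memory_bw_gbs: float      # memory bandwidth given to the platform (GB/s)
    num_resources: int        # number of compute resources, e.g. multipliers
    inputs_per_resource: int  # inputs consumed by each resource per cycle
    data_size_bytes: int      # size of each input in bytes
    frequency_ghz: float      # operating frequency (GHz)


def compute_bw_gbs(spec: PlatformSpec) -> float:
    """Equation 4 for identical resources: inputs x data size x frequency."""
    return (spec.num_resources * spec.inputs_per_resource
            * spec.data_size_bytes * spec.frequency_ghz)


def throughput_gbs(spec: PlatformSpec) -> float:
    """Equation 3: the smaller of memory bandwidth and compute bandwidth."""
    return min(spec.memory_bw_gbs, compute_bw_gbs(spec))


def split(total: int, rate: float) -> Tuple[int, int]:
    """Split `total` items into two parts whose ratio is close to `rate`."""
    part1 = round(total * rate / (1.0 + rate))
    return part1, total - part1


# The example from the text: 1 GHz, 320 GB/s, 1000 32-bit multipliers.
paper_example = PlatformSpec(320.0, 1000, 2, 4, 1.0)
print(compute_bw_gbs(paper_example))   # 8000.0 GB/s, i.e. 8 TB/s
print(throughput_gbs(paper_example))   # 320.0
```

With a hypothetical throughput ratio of 2, `split(300, 2.0)` assigns 200 of 300 weights to the faster platform and 100 to the slower one, which satisfies num1 = rateThroughput × num2.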
Algorithm 1: Platform-Aware Partitioning
1  Algorithm Platform-Aware Partitioning()
   input : G: Compute Graph
           External-Spec: Out-of-the-memory Specifications
           MEM-Spec: In-memory Specifications
   output: External-Partitions: subgraphs assigned to the out-of-the-memory platform
           MEM-Partitions: subgraphs assigned to in-memory compute engines
2    External-Partitions ← empty()
3    MEM-Partitions ← empty()
4    queue ← empty()
5    pMode = parallelismAnalyzer(G)
6    queue.push(G)
7    External-Spec.throughput = min(External-Spec.memoryBW, External-Spec.computeBW)
8    MEM-Spec.throughput = min(MEM-Spec.memoryBW, MEM-Spec.computeBW)
9    rateThroughput = External-Spec.throughput / MEM-Spec.throughput
10   while (!queue.empty()) do
11     G = queue.pop()
12     if (pMode == Block_level) then
13       num1 ← num-model * (rateThroughput/(1+rateThroughput))
14       num2 ← num-model - num1
15       G1, G2 += partition(G, num1, num2)
16     end
17     else if (pMode == Partial_level) then
18       dim1 ← N_dim * (rateThroughput/(1+rateThroughput))
19       dim2 ← N_dim - dim1
20       G1, G2 += partition(G, dim1, dim2)
21     end
22     else if (pMode == Model_level) then
23       g1, g2 ← partition(G, rateThroughput)
24       G1 += g1
25       rateThroughput = update(g1, g2, rateThroughput)
26       pMode = parallelismAnalyzer(g2)
27       queue.push(g2)
28     end
29   end
30   External-Partitions.insert(G1)
31   MEM-Partitions.insert(G2)
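The model_level branch of Algorithm 1 can be read as a greedy assignment. The sketch below is one possible interpretation rather than the authors' implementation: the function name, the dictionary of model sizes, and the convention that `rate` is the throughput of the faster platform divided by that of the slower one are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def model_level_partition(model_sizes: Dict[str, int],
                          rate: float) -> Tuple[List[str], List[str], float]:
    """Greedy model_level split (hypothetical helper).

    Models are visited from largest to smallest and given to the faster
    platform until the next model would push its share past the target.
    Returns (models on faster platform, models on slower platform,
    achieved work ratio faster/slower).
    """
    total = sum(model_sizes.values())
    target_fast = total * rate / (1.0 + rate)   # work the faster platform should get
    ordered = sorted(model_sizes.items(), key=lambda kv: kv[1], reverse=True)

    fast: List[str] = []
    slow: List[str] = []
    assigned = 0
    stopped = False
    for name, size in ordered:
        if not stopped and assigned + size <= target_fast:
            fast.append(name)
            assigned += size
        else:
            stopped = True          # everything that remains goes to the slower platform
            slow.append(name)

    remaining = total - assigned
    achieved = assigned / remaining if remaining else float("inf")
    return fast, slow, achieved
```

For three hypothetical models of sizes 600, 300, and 100 and a ratio of 2, the faster platform receives only the largest model and the achieved ratio is 1.5. The gap between 1.5 and 2 is what the finer-grained block_level or partial_level partitioning would then close.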
5. ORIGAMI
We propose a heterogeneous split architecture for in-
memory acceleration of ML algorithms, called ORIGAMI.
ORIGAMI is a holistic approach that combines pattern-aware
and split execution to accelerate different ML algorithms
over a set of heterogeneous compute engines on the logic
die and a compute platform on the active die. ORIGAMI is a
hardware-software solution that spans several abstraction
levels: the programming layer (§ 5.1), the compiler
layer (§ 5.2), the architecture layer (§ 5.3), and the hardware layer
(§ 5.4). Figure 4 shows the main components of ORIGAMI.
[Figure 4 diagram, numbered stages: (1) Graph Extractor (High-Level Specification, Computation Graph); (2) Pattern Extractor (Reduction, Non-Linear, Comparator, Optimization); (3) Partition Analyzer (Model_level, Partial_level, Block_level; FPGA Specifications, 3D-Stacked Specifications; FPGA Computation Sub-Graphs, 3D-Stacked Computation Sub-Graphs); (4) Code Generator (Compute Platform Instructions, 3D-Stacked Instructions; Compute Platform, 3D-Stacked Memory).]
Figure 4: ORIGAMI Workflow.
5.1 Programming Layer
The programming layer includes a programming interface
and a graph extractor unit to translate the high-level spec-
ification of an ML algorithm to its corresponding compute
graph, as shown in Figure 4, labeled 1.
Programming Interface. The programming interface receives
a high-level specification, which includes learning
parameters, data declaration, and mathematical declaration
of an ML algorithm. The learning parameters include the
learning rate and the number of features. The data declaration
specifies various types of data such as the training input
vectors (a.k.a., input or model_input), the real outputs (a.k.a.,
model_output), and the weights (a.k.a., model_parameters
or model). The mathematical declaration specifies how the
objective function of the ML algorithm is computed, using
mathematical operations, to update the weights. The mathe-
matical operations can be expressed in three categories: (1)
basic operations, such as −,+,∗,<,>, (2) group operations,
such as ∑, ‖‖, Π, and (3) non-linear operations, such as
Sigmoid, Gaussian, Sigmoid Symmetric, and Log.
Graph Extractor. Given the high-level specification
of an ML algorithm, the graph extractor unit extracts the
corresponding compute graph.
5.2 Compiler Layer
At the compiler layer, ORIGAMI performs five operations:
(1) extracts compute patterns, (2) detects parallelism types,
(3) partitions the compute graph into two load-balanced parts
with minimum inter-part communications, (4) assigns each
part to a compute platform, and (5) schedules the execution
of the algorithm over the two platforms. Performing these
operations at the compiler layer reduces the runtime overhead
and eases management of the simultaneous execution,
which simplifies the control mechanism in the 3D-stacked
memory. The compiler layer consists of two key components, as
shown in Figure 4:
2 Pattern extractor works in three steps: (1) it creates
three pattern subgraphs for the three common compute
patterns; (2) it runs a pattern-matching algorithm, adopted from
graph algorithms [63, 64], to find all instances of these
compute patterns in the compute graph, and treats the remaining parts of
the compute graph as instances of the SPECIAL compute
pattern; and (3) it clusters all nodes in each instance into a
coarse-grained node in the pattern compute graph.
3 Partition analyzer detects the parallelism types in
the pattern compute graph of an ML algorithm and partitions
the graph into two parts to be executed on the heterogeneous
compute engines on the logic die and an out-of-the-memory
compute platform (i.e., an FPGA in this paper; see footnote 3). The partition
analyzer follows Algorithm 1 to split the pattern compute
graph into two load-balanced partitions with minimum
inter-platform communications.
Footnote 3: ORIGAMI is general and can benefit from different kinds of
out-of-the-memory compute platforms, such as FPGAs and GPUs.
Without loss of generality, in this work, we assume that the compute
platform on the active die is an FPGA. We use FPGA as an example
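The clustering step of the pattern extractor (step 3 of the component labeled 2 above) can be illustrated with a deliberately simplified sketch. It does not perform the subgraph pattern matching adopted from [63, 64]; instead it swaps in a much simpler technique, labeling each node by its operation and merging connected nodes that share a label. The op-to-pattern table, the function name, and the graph encoding are all hypothetical.

```python
from collections import defaultdict

# Hypothetical op-to-pattern table; any op not listed is treated as SPECIAL.
PATTERN_OF = {"mul": "REDUCTION", "add": "REDUCTION",
              "lt": "COMPARATOR", "gt": "COMPARATOR",
              "update": "OPTIMIZATION"}


def cluster_patterns(ops, edges):
    """Group connected nodes that share a pattern into coarse-grained nodes.

    ops:   {node name: op name}
    edges: list of (src, dst) pairs of the compute graph
    Returns a list of (pattern, sorted list of member nodes).
    """
    label = {node: PATTERN_OF.get(op, "SPECIAL") for node, op in ops.items()}

    # Keep only edges whose endpoints belong to the same pattern.
    adj = defaultdict(set)
    for src, dst in edges:
        if label[src] == label[dst]:
            adj[src].add(dst)
            adj[dst].add(src)

    seen, clusters = set(), []
    for start in ops:                     # dict order keeps the result deterministic
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:                      # depth-first walk over same-pattern neighbors
            node = stack.pop()
            members.append(node)
            for nxt in adj[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        clusters.append((label[start], sorted(members)))
    return clusters
```

On a toy graph with two multiplies feeding an add, followed by a sigmoid and a weight update, this yields one REDUCTION cluster of three nodes, one SPECIAL node, and one OPTIMIZATION node.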
5.3 Architecture Layer
The code generator uses our proposed instruction set archi-
tecture (ISA) to prepare executable code and static scheduling
for the heterogeneous compute engines.
5.3.1 ISA
The proposed ISA is a RISC instruction set that consists
of two flags, two types of registers (computation and
synchronization), and three types of instructions (communication,
computation, and synchronization). The input and output of
each compute engine are hardwired to dedicated computation
registers. In addition to the computation registers, there are two
registers for synchronizing the execution of the compute
engines and the out-of-the-memory platform.
Communication Instructions transfer data from memory
locations to registers and vice versa (mov %src, %des).
Computation Instructions use the compute engines to
perform computation in the 3D-stacked memory. There are
three computation instructions:
1. reduce %Num: enables the Num-th REDUCTION compute
engine to operate on its input registers and store the result
in its output register.
2. comparator %Num: enables the Num-th COMPARATOR compute
engine to operate on its input registers and store the result in its
output register.
3. optimization %Num: enables the Num-th OPTIMIZATION
compute engine to operate on its input registers and
store the result in its output register.
Synchronization Instructions. Synchronization instruc-
tions handle the required interactions between the two com-
pute platforms.
ORIGAMI utilizes three parallelism types to split the
execution of the compute graph over the two platforms. With
the model_level and partial_level parallelism types, there is no
communication between the two compute platforms;
hence, there is no need for synchronization. However, with
the block_level parallelism type, as shown in Figure 3, the two
platforms need to communicate.
In block_level, the heterogeneous compute engines on the
logic die execute a part of the reduction and optimization
compute patterns, while the out-of-the-memory compute platform
executes the rest. With the reduction compute pattern,
the partial sums of the two platforms need to be aggregated.
The optimization compute engine cannot start the execution
until the two partial sums are aggregated. For this purpose,
one platform (called master) is in charge of aggregating the
partial sums and generating the final result, while the other
platform (called slave) should transfer its partial sum to the
master. When the master is done with its partial sum, it needs
to wait to receive the partial sum of the slave. On receiving the
partial sum, the master aggregates the partial sums and sends
the final result to the slave, which is waiting for the result to
start execution of the optimization compute pattern.
Footnote 3 (continued): and leave examination of other platforms for future work.
To this end, the proposed ISA uses two flags, M_ready
and S_ready, two synchronization registers, M_delta and
S_psum, and three instructions, set, wait, and clr. Without
loss of generality, we assume that the out-of-the-memory
compute platform is the master.
1. set %f: sets the value of flag %f.
After preparing its partial sum, the in-memory controller
writes the partial sum to S_psum and sets S_ready with
the set instruction. The master checks the S_ready
flag and reads the partial sum from S_psum once the flag
indicates that it is ready.
2. wait %f: waits for the flag %f to be set.
The in-memory controller waits for M_ready to be
set and then reads the value of the M_delta register. The master
computes the delta, which is needed for the optimization
compute pattern, writes it to the M_delta register, and sets the
M_ready flag.
3. clr %f: resets the value of flag %f.
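The handshake described above can be mimicked in software with two threads and two events. This is a minimal sketch and not a model of the hardware: `run_handshake` is a hypothetical helper, `threading.Event` stands in for the M_ready and S_ready flags, a dictionary stands in for the S_psum and M_delta registers, the choice of which side clears each flag is an assumption, and the delta is defined here as aggregated sum minus an expected value purely for illustration.

```python
import threading


def run_handshake(slave_inputs, master_inputs, expected):
    """Simulate one block_level synchronization round between two platforms."""
    flags = {"S_ready": threading.Event(), "M_ready": threading.Event()}
    regs = {"S_psum": 0.0, "M_delta": 0.0}
    seen = {}

    def slave():
        # In-memory controller: publish the partial sum, then wait for delta.
        regs["S_psum"] = sum(slave_inputs)
        flags["S_ready"].set()               # set %S_ready
        flags["M_ready"].wait()              # wait %M_ready
        seen["slave"] = regs["M_delta"]
        flags["M_ready"].clear()             # clr %M_ready

    def master():
        # Out-of-the-memory platform: aggregate both partial sums.
        own_psum = sum(master_inputs)
        flags["S_ready"].wait()              # block until S_ready is set
        total = own_psum + regs["S_psum"]
        flags["S_ready"].clear()             # clr %S_ready
        regs["M_delta"] = total - expected   # hypothetical delta definition
        seen["master"] = regs["M_delta"]
        flags["M_ready"].set()               # set %M_ready

    threads = [threading.Thread(target=slave), threading.Thread(target=master)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


print(run_handshake([1.0, 2.0], [3.0, 4.0], expected=6.0))
```

Both sides end up with the same delta (4.0 in this toy run), and neither can proceed to the optimization step before the aggregation has happened, which is the ordering the set and wait instructions enforce.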
5.3.2 Static Scheduling
The code generator unit receives the part of the compute graph
that needs to be executed on the heterogeneous compute
engines on the logic die, and transforms it into a sequence of
instructions to be executed by the in-memory controller, as
we explain in §5.4.
5.4 Hardware Layer
ORIGAMI adds a set of heterogeneous compute engines
and an in-memory controller to the logic die of a 3D-stacked
memory. The in-memory controller is a lightweight unit that
executes the instructions of §5.3.1.
6. EVALUATION
6.1 Experimental Setup
Benchmarks and Datasets. Table 3 summarizes the set of
benchmarks that are used for evaluation of ORIGAMI, and
their descriptions including model topology, number of fea-
tures, and number of input vectors. To evaluate the sensitivity
of ORIGAMI’s performance improvement to the size of the
model used by a given ML algorithm, we use two distinct
models (shown as M1 and M2 in Table 3) for each evaluated
benchmark. The benchmarks include the state-of-the-art ML
algorithms. The Back-propagation (BProp) algorithm trains
models to detect handwritten digits [65,66] and speech [67].
The Linear Regression (LinReg) algorithm is widely used
in finance and image processing to predict prices [68] and
texture of images [69]. The Logistic Regression (LogReg)
algorithm trains models to detect tumors [70] and cancer [71].
The Support Vector Machine (SVM) algorithm is used in com-
puter vision and medical diagnosis domains to detect human
faces [72] and cancer [73]. The Recommender Systems (Reco)
algorithm is widely used in processing movie datasets such as
Movielens datasets [74,75] and the Netflix Prize datasets [76].
The 2D-Regression (2D-Reg) algorithm trains models to detect
different kinds of tumors [77] and cancers [73].
FPGA Platform. We evaluate ORIGAMI in the context of a
3D-stacked memory on top of an active die that includes
a Virtex UltraScale+ (DS923) VU13P FPGA. Table 4
reports the key FPGA parameters. We synthesize the hard-
ware in the FPGA platform with Vivado Design Suite
v2017.2 to extract the FPGA design parameters.
ASIC Implementation. We use Synopsys Design Compiler
(L-2016.03-SP5) and a TSMC 45-nm standard cell library at
313 MHz frequency, the frequency of HMC stacked mem-
ory [11, 21, 22, 78], to synthesize the accelerators and obtain
the area, delay, and energy numbers. We use CACTI-P [79]
to measure the area and power of the registers and on-chip
SRAMs.
Memory Model. The 3D-stacked memory is modeled af-
ter an HMC stacked memory [11, 21, 22, 78]. Each vault
delivers up to 16 GB/s bandwidth to the logic die and 10
GB/s bandwidth to the active die [21, 22]. The available area
to accelerators in each vault is 1.5mm2 [6, 21]. We extract
the 3D-stacked memory model parameters from the data
sheet [21]. Table 4 reports the parameters of the memory
model used in our evaluations.
Cycle-Level Simulation. Using the ASIC and FPGA syn-
thesis numbers and the configurations of the memory models,
we develop a cycle-level architectural simulator to measure
the performance and energy consumption of ORIGAMI. The
ORIGAMI simulator includes the timing of the memory ac-
cesses and faithfully models the parameters of the ASIC
and FPGA implementations. Table 5 lists the major micro-
architectural parameters of ORIGAMI.
Comparison Metrics. We evaluate benefits of ORIGAMI with
six ML algorithms, listed in Table 3, in terms of performance
and energy-delay product (EDP).
Comparison Points. We compare six different platforms,
namely (1) FPGA, (2) PIM-GU, (3) PIM-CE, (4) ORIGAMI (our
approach), (5) ORIGAMI-IIC, and (6) PIM-GU-Unlimited.
ORIGAMI represents our approach, in which we use both
the FPGA and the compute engines on the logic die of the 3D-
stacked memory. ORIGAMI assigns as much of the internal
bandwidth as possible to the compute engines and delivers
the rest of the bandwidth to the FPGA (see Table 4).
Table 5 lists the available resources on the logic die that
ORIGAMI uses for computation.
ORIGAMI-IIC is a variant of ORIGAMI with ideal
inter-platform communication that incurs no delay and uses no
bandwidth. ORIGAMI-IIC otherwise uses the same configuration
as ORIGAMI. We use this comparison point to evaluate
the effect of the delay and bandwidth usage of inter-platform
communication on ORIGAMI's effectiveness.
The state-of-the-art FPGA-based accelerator to train dif-
ferent ML algorithms is TABLA [27]. It has been shown
that it outperforms GPU and CPU implementations [27]. We
implement ALUs of TABLA in an FPGA connected to a 3D-
stacked memory. We refer to this design as FPGA.
In-memory accelerators focus on the inference phase [6,
8, 9, 12, 14–18, 26] or the training phase of a restricted set of
ML algorithms such as CNNs [11]. As there is no previous
in-memory accelerator that supports the training phase of
different types of ML algorithms, we compare ORIGAMI with
a design that uses the general-purpose ALUs similar to
those in prior work [27, 32] on the logic die of the 3D-stacked
memory. We refer to this design as PIM-GU. To
understand the limitations of previous in-memory accelerators,
we compare the results to an ideal but impractical platform,
PIM-GU-Unlimited, with enough general-purpose ALUs on
the logic die of the 3D-stacked memory to fully utilize the
available bandwidth. PIM-CE evaluates a variant of ORIGAMI that
uses only the compute engines on the logic die.
We use this comparison point to show the importance of
split execution. Moreover, this comparison point shows the
effectiveness of heterogeneous compute engines as compared
to general-purpose ALUs.
Table 3: Benchmarks.
Benchmark                Abbreviation   Description                                              # of Features   Model Topology   # of Input Vectors
Back Propagation         Bprop (M1)     Handwritten digit pattern recognition                    5,000           1000x1000x200    20,000,000
Back Propagation         Bprop (M2)     Hierarchical acoustic modeling for speech recognition    13,338          351x1000x40      35,819,788
Linear Regression        LinReg (M1)    Stock price prediction                                   80,000          80,000           1,305,030
Linear Regression        LinReg (M2)    Image texture recognition                                1,638,400       1,638,400        1,161,915
Logistic Regression      LogReg (M1)    Tumor classification using gene expression               20,000          20,000           9,698,600
Logistic Regression      LogReg (M2)    Prostate cancer diagnosis based on the gene expression   1,206,600       1,206,600        3,344,380
Support Vector Machine   SVM (M1)       Human face detection                                     174,000         174,000          16,959,800
Support Vector Machine   SVM (M2)       Cancer diagnosis based on the gene expression            249,515         249,515          7,295,540
Recommender Systems      Reco (M1)      MovieLens recommender system                             903,030         9,030,300        244,040,960
Recommender Systems      Reco (M2)      Netflix recommender system                               2,776,508       27,765,080       1,004,982,870
2D-Regression            2D-Reg (M1)    Brain tumor classification                               96,000          96,000           18,621,312
2D-Regression            2D-Reg (M2)    Financial forecasting                                    289,584         289,584          8,026,512
Table 4: Major parameters of the FPGA and the 3D-stacked memory [11].
FPGA
  Model                        UltraScale+ VU13P
  Peak Frequency               250 MHz
  Total Number of LUTs         1,728 K
  Total Number of Flip-Flops   3,456 K
  Total Number of DSPs         12,288
  BRAM Size                    94.5 Mb
  UltraRAM Size                360 Mb
  Technology Node              16 nm
3D-Stacked Memory
  Model                        HMC v2.1
  Number of Vaults             32
  Capacity                     8 GB
  Number of Banks/Vault        16
  Number of Links/Package      4
  Internal Access Latency      27.5 ns [22]
  External Access Latency      27.5 ns [22]
  Internal Transfer Energy     3.7 pJ/bit [22]
  External Transfer Energy     10 pJ/bit [22]
  Area per Vault (mm2)         1.5
  Total Bandwidth              512 GB/s
  Logic Die Frequency          313 MHz
Table 5: Parameters of ORIGAMI.
  # of Multipliers per Reduction Unit      8
  # of Adders per Reduction Unit           7
  # of Reduction Units per Logic Die       8
  # of Optimization Units                  64
  # of Multipliers per Optimization Unit   2
  Total Area (mm2)                         41.3
  Maximum Captured Bandwidth (GB/s)        240
Table 6 shows the specifications of the ORIGAMI accelerators
and the ALU of prior work [27, 32]. The table shows that
we can only include 32 ALUs on the logic die (one per vault), as the area of
one ALU is 1.2 mm2 and the available area in a single vault
is 1.5 mm2. Consequently, PIM-GU exploits only 80 GB/s of the
available bandwidth.
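One way to arrive at a figure close to 80 GB/s is to apply Equation 4 to the 32 ALUs. The short calculation below is a sketch under assumptions that the text does not state explicitly: each ALU is taken to consume two 32-bit operands per cycle (by analogy with the multiplier example given for Equation 4) and to run at the 313 MHz logic-die frequency.

```python
import math

NUM_VAULTS = 32          # number of vaults (Table 4)
VAULT_AREA_MM2 = 1.5     # area available to accelerators in each vault
ALU_AREA_MM2 = 1.2       # area of one general-purpose ALU, as stated in the text
INPUTS_PER_ALU = 2       # assumption: two operands per ALU per cycle
DATA_SIZE_BYTES = 4      # assumption: 32-bit operands
FREQUENCY_GHZ = 0.313    # logic-die frequency (313 MHz)

# Only whole ALUs fit in a vault.
alus_per_vault = math.floor(VAULT_AREA_MM2 / ALU_AREA_MM2)
total_alus = NUM_VAULTS * alus_per_vault

# Equation 4 with identical resources.
compute_bw_gbs = total_alus * INPUTS_PER_ALU * DATA_SIZE_BYTES * FREQUENCY_GHZ

print(total_alus)                  # 32
print(round(compute_bw_gbs, 1))    # 80.1
```

Under these assumptions PIM-GU is compute-bound: its compute bandwidth of roughly 80 GB/s is far below the 512 GB/s total memory bandwidth listed in Table 4, so Equation 3 gives a throughput of about 80 GB/s.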
Table 6: Area, power, and latency of the compute engines of
ORIGAMI and a general-purpose ALU in a 45-nm technology
node.
Compute unit          Area (mm2)   Power (uW)   Latency (ns)
Comparator Engine     0.02         0.4          3
Reduction Engine      0.20         13           15
Optimization Engine   0.45         4            12
General Purpose ALU   1.1          32           30
6.2 Experimental Results
Performance Analysis. To evaluate the effect of ORIGAMI on
accelerating the training phase of ML algorithms, we measure
the execution time across the evaluated benchmarks. Figure 5
shows the speedup (higher is better) of different platforms
with respect to FPGA. We make four key observations.
First, ORIGAMI outperforms FPGA in terms of execution
time, by 1.55× on average (up to 1.6×). ORIGAMI exploits
all the available bandwidth, while FPGA only captures the
external bandwidth. Second, the speedup of ORIGAMI is
within ≈1% of PIM-GU-Unlimited. The reason for this level
of speedup is the effectiveness of ORIGAMI in capturing all
the available memory bandwidth. ORIGAMI maximally uti-
lizes the memory bandwidth by judiciously distributing the
computations between the FPGA and the accelerators on the
logic die of the 3D-stacked memory.
Third, ORIGAMI offers the same speedup as ORIGAMI-IIC,
which shows how well our partitioning algorithm minimizes
the inter-platform communications. Our evaluations show
that the bandwidth overhead in ORIGAMI is less than 0.001%
and ORIGAMI effectively hides the delay overhead. Fourth,
PIM-CE outperforms PIM-GU by 2.9×, which shows the ef-
fectiveness of the heterogeneous compute engines as com-
pared to general-purpose units. Moreover, ORIGAMI com-
bines heterogeneous compute engines with split execution
to capture the whole available bandwidth, and hence, outper-
forms PIM-CE by 2.1×.
Energy Analysis. We measure energy-delay product (EDP)
of all benchmarks on all compute platforms. Figure 6 shows
the EDP reduction (higher is better) of the compute
platforms, normalized to FPGA.
We make four key observations. First, ORIGAMI improves EDP
over FPGA by 29×, on average (up to 31×). This is due to two
reasons: (1) in ORIGAMI, a portion of data communications
is local, as it executes a portion of computations inside the
3D-stacked memory, and (2) ORIGAMI exploits light-weight
compute engines in both in-memory and out-of-the-memory
platforms, which consume less energy than the general-
purpose ALUs of FPGA. Second, EDP of ORIGAMI is 86.1×
and 2.1× lower than PIM-GU’s and PIM-GU-Unlimited’s, re-
spectively. Although PIM-GU and PIM-GU-Unlimited per-
form all the communications and computations inside the
3D-stacked memory, their general-purpose ALUs consume
more energy than compute engines of ORIGAMI. Third,
PIM-CE outperforms both PIM-GU and PIM-GU-Unlimited
by 67.3× and 1.7×, on average, respectively. This is because
PIM-CE exploits heterogeneous compute engines whose energy
usage is significantly lower than general-purpose ALUs
in PIM-GU and PIM-GU-Unlimited. Fourth, ORIGAMI and
ORIGAMI-IIC offer very close EDP due to low inter-platform
communications offered by split execution of ORIGAMI.
[Figure 5 plot. Y-axis: Speedup (0.0X to 1.8X). X-axis: Bprop, LinReg, LogReg, SVM, Reco, and 2D-Reg (M1 and M2 each), plus Geomean. Series: PIM-GU, PIM-CE, ORIGAMI, ORIGAMI-IIC, PIM-GU-Unlimited. Data labels shown: 0.25, 0.75, 1.55, 1.60.]
Figure 5: Speedup of the competing compute platforms over FPGA.
[Figure 6 plot. Y-axis: Normalized EDP Reduction (0.0X to 35.0X). X-axis: Bprop, LinReg, LogReg, SVM, Reco, and 2D-Reg (M1 and M2 each), plus Geomean. Series: PIM-GU, PIM-CE, ORIGAMI, ORIGAMI-IIC, PIM-GU-Unlimited. Data labels shown: 0.34, 22.70, 29.05, 13.81.]
Figure 6: EDP reduction of the competing platforms, normalized to FPGA.
[Figure 7 plot. Y-axis: 0.0% to 100.0%. Bars: Speedup and Normalized EDP Reduction. Series: Pattern-aware, Split.]
Figure 7: Breakdown of speedup and EDP reduction between
pattern-aware execution and split execution of ORIGAMI.
Sources of Benefit. ORIGAMI exploits two execution tech-
niques, pattern-aware execution and split execution, to ef-
fectively run different kinds of ML algorithms. To shed
light on the importance of these two techniques, we compare
ORIGAMI against PIM-GU. As compared to PIM-GU, ORIGAMI
offers two advantages: (1) heterogeneous compute engines
instead of general-purpose ALUs, and (2) split execution
over the compute engines and the out-of-the-memory plat-
form. Figure 7 shows the contribution of each technique to
the speedup and EDP reduction of ORIGAMI as compared
to PIM-GU. The figure shows that heterogeneous compute
engines are responsible for 48% of the speedup and 78% of
the EDP reduction. Likewise, split execution is responsible
for 52% of the speedup and 22% of the EDP reduction. These
results clearly show that both techniques are necessary for
the success of ORIGAMI.
Sensitivity to Parallelism Types. To better illustrate the
source of benefit in ORIGAMI, Figures 8 and 9 show the effect
of different parallelism types (i.e., block_level, partial_level,
and model_level) on the speedup and EDP across the eval-
uated benchmarks. The first bar, block_level, shows the
results when only block_level parallelism is enabled. The
second bar, partial_level, shows the speedup and EDP
when both block_level and partial_level parallelism types are
enabled. Finally, the last bar, model_level, illustrates the
results when all three parallelism types (the default mode in
ORIGAMI) are enabled. We make three observations.
First, by enabling block_level parallelism, ORIGAMI out-
performs FPGA in terms of speedup and EDP, by 1.52× and
25.24×, on average, respectively. It is due to the fact that all
benchmarks benefit from the block_level parallelism, thus
by leveraging block_level parallelism, ORIGAMI splits the
execution over both compute engines in 3D-stacked memory
and the out-of-the-memory FPGA platform. Out of twelve
benchmarks, six (LinReg (M1), LinReg (M2),
LogReg (M1), LogReg (M2), SVM (M1), and SVM (M2)) have only
block_level parallelism. Thus, these benchmarks see no
speedup or improvement in EDP by enabling partial_level
and model_level parallelism types.
Second, by enabling both block_level and partial_level,
on average, ORIGAMI achieves 1.56× and 28.10× improvement
in execution time and EDP over FPGA. BProp, Reco,
and 2D-Reg achieve higher speedup and lower EDP over
block_level. As an example, partial_level improves
the speedup and EDP of BProp by ≈10% and 26.75%, respec-
tively, over block_level.
Third, by enabling all three parallelism types, ORIGAMI
improves speedup and EDP of Reco by 1.6× and 31.0×, on
average, respectively, over FPGA. Other benchmarks do not
have model_level parallelism and achieve no improvements
by enabling model_level. Reco has two independent models
and achieves the highest speedup and the lowest EDP with
model_level. model_level improves the speedup and EDP
reduction of Reco as compared to other parallelism types
by 6% and 1.50×, on average, respectively. Although BProp
includes three models, they are not independent; thus, BProp
does not have model_level parallelism. These results assert
the importance of exploring various types of parallelism
to fully benefit from ORIGAMI.
[Figure 8 plot. Y-axis: Speedup (0.0X to 1.8X). X-axis: Bprop, LinReg, LogReg, SVM, Reco, and 2D-Reg (M1 and M2 each), plus Geomean. Series: Block Level, Partial Level, Model Level. Data labels shown: 1.52, 1.54, 1.55.]
Figure 8: Speedup sensitivity to the three parallelism types.
[Figure 9 plot. Y-axis: Normalized EDP Reduction (0.0X to 35.0X). X-axis: Bprop, LinReg, LogReg, SVM, Reco, and 2D-Reg (M1 and M2 each), plus Geomean. Series: Block Level, Partial Level, Model Level. Data labels shown: 25.24, 28.07, 29.05.]
Figure 9: EDP sensitivity to the three parallelism types.
7. RELATED WORK
Our proposal, ORIGAMI, is fundamentally different from
prior work in the following directions: (1) ORIGAMI extracts
compute patterns of ML algorithms and translates them into
heterogeneous compute engines on the logic die of a 3D-
stacked memory, (2) ORIGAMI splits execution of ML algo-
rithms over the heterogeneous compute engines and an out-
of-the-memory compute platform to utilize all the available
bandwidth, and (3) ORIGAMI exploits an optimization algo-
rithm to split the computation of ML algorithms between two
platforms in a load-balanced manner and with minimum inter-
platform communications to maximize resource utilization.
There has been a wealth of architectures for in-memory
accelerators that integrate logic and memory onto a single
die to enable higher memory bandwidth and lower access
energy [1, 2, 5–7, 9–14, 16–20, 27–31, 34–47, 80]. Most of
these in-memory architectures accelerate the inference phase
of ML algorithms. Some in-memory accelerators, such as
Neurocube [11] and Proger PIM [20], accelerate both the
training and inference phases but only target CNNs and do not
work for other ML algorithms.
Prior work exploits ASIC [1, 2, 5, 38–47], GPU [34–37],
FPGA [27–31], and multi-computing-node [47,80] platforms
to accelerate ML algorithms. While effective, these tech-
niques do not benefit from in-memory processing. Some
prior work used split execution to accelerate ML algorithms.
Shen, et al. [33] partitioned FPGA resources to process dif-
ferent subsets of convolutional layers of CNNs. Scalpel [48]
customizes DNN pruning over SIMD-aware weight prun-
ing and node pruning. Park, et al. [32] distribute only the
optimization part of the training phase of different ML al-
gorithms over FPGA and n ASIC units. Consequently, their
technique does not offer load balancing. Scaledeep [49]
uses heterogeneous processing tiles that are customized for
compute-intensive and memory-intensive parts of training
DNNs. Proger PIM [20] uses CPU, fixed-function PIM
and a programmable PIM unit for training different CNN
models. These techniques are fundamentally different from
our proposed technique since none of them is in-memory
processing and they do not offer all the features
needed for efficient in-memory processing.
8. CONCLUSION
During the training phase, ML algorithms process large
amounts of data, iteratively, which consumes significant band-
width and energy. Although in-memory accelerators provide
high memory bandwidth and consume less energy, they suffer
from lack of generality or efficiency. We propose ORIGAMI,
a holistic approach that exploits heterogeneous compute en-
gines on the logic die to efficiently cover a wide range of ML
algorithms and splits the execution of ML algorithms over the
in-memory compute engines and an out-of-memory compute
platform to use all the available bandwidth. The evaluation
results show that ORIGAMI outperforms the best-performing
prior work, in terms of performance and energy-delay product
(EDP), by up to 1.6× and 31×, respectively. ORIGAMI also
improves average performance and energy efficiency by 1.5×
and 21×, respectively.
9. REFERENCES
[1] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for
energy-efficient dataflow for convolutional neural networks,” in Intl.
Symp. on Computer Architecture (ISCA), June 2016.
[2] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen,
and O. Temam, “Shidiannao: Shifting vision processing closer to the
sensor,” in Intl. Symp. on Computer Architecture (ISCA), June 2015.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural
information processing systems, 2012.
[4] K. Simonyan and A. Zisserman, “Very deep convolutional networks
for large-scale image recognition,” 2014.
[5] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and
A. Moshovos, “Cnvlutin: Ineffectual-neuron-free deep neural network
computing,” in Intl. Symp. on Computer Architecture (ISCA), 2016.
[6] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “Tetris:
Scalable and efficient neural network acceleration with 3d memory,” in
Intl. Conf. on Architectural Support for Programming (ASPLOS), Apr.
2017.
[7] L. Song, X. Qian, H. Li, and Y. Chen, “Pipelayer: A pipelined
reram-based accelerator for deep learning,” in Intl. Symp. on High
Performance Computer Architecture (HPCA), Feb. 2017.
[8] L. Nai, R. Hadidi, J. Sim, H. Kim, P. Kumar, and H. Kim, “GraphPIM:
Enabling Instruction-Level PIM Offloading in Graph Computing
Frameworks,” in Intl. Symp. on High Performance Computer
Architecture (HPCA), May 2017.
[9] K. Hsieh, E. Ebrahimi, G. Kim, N. Chatterjee, M. O'Connor,
N. Vijaykumar, O. Mutlu, and S. W. Keckler, “Transparent Offloading
and Mapping (TOM): Enabling Programmer-Transparent Near-Data
Processing in GPU Systems,” in Intl. Symp. on Computer Architecture
(ISCA), June 2016.
[10] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P.
Strachan, M. Hu, R. S. Williams, and V. Srikumar, “Isaac: A
convolutional neural network accelerator with in-situ analog
arithmetic in crossbars,” June 2016.
[11] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay,
“Neurocube: A Programmable Digital Neuromorphic Architecture
with High-Density 3D Memory,” in Intl. Symp. on Computer
Architecture (ISCA), June. 2016.
[12] H. Asghari-Moghaddam, Y. Hoon Son, J. Ho Ahn, and N. Sung Kim,
“Chameleon: Versatile and Practical Near-DRAM Acceleration
Architecture for Large Memory Systems,” in Intl. Symp. on
Microarchitecture (MICRO), Oct. 2016.
[13] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie,
“Prime: a novel processing-in-memory architecture for neural network
computation in reram-based main memory,” in Intl. Symp. on
Computer Architecture (ISCA), June. 2016.
[14] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim, “NDA:
Near-DRAM Acceleration Architecture Leveraging Commodity
DRAM Devices and Standard Memory Modules,” in Intl. Symp. on
High Performance Computer Architecture (HPCA), Feb. 2015.
[15] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A Scalable
Processing-in-Memory Accelerator for Parallel Graph Processing,” in
Intl. Symp. on Computer Architecture (ISCA), June. 2015.
[16] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim,
“DRAMA: An Architecture for Accelerated Processing Near Memory,”
CAL, vol. 14, no. 1, 2015.
[17] Q. Zhu, T. Graf, H. Sumbul, L. Pileggi, and F. Franchetti,
“Accelerating Sparse Matrix-Matrix Multiplication with 3D-Stacked
Logic-in-Memory Hardware,” in Intl. Conf. on High Performance
Extreme Computing (HPEC), 2013.
[18] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and
Y. LeCun, “Neuflow: A runtime reconfigurable dataflow processor for
vision,” in CVPRW, June. 2011.
[19] S. Angizi, Z. He, A. S. Rakin, and D. Fan, “Cmp-pim: an
energy-efficient comparator-based processing-in-memory neural
network accelerator,” in Design Automation Conference (DAC),
June. 2018.
[20] J. Liu, H. Zhao, M. A. Ogleari, D. Li, and J. Zhao,
“Processing-in-memory for energy-efficient neural network training: A
heterogeneous approach,” in Intl. Symp. on Microarchitecture (MICRO),
Oct. 2018.
[21] Hybrid Memory Cube Consortium, Hybrid Memory Cube
Specification 2.1, 6 2014. Rev. 10.0.
[22] “Hybrid memory cube.”
[23] P. Rosenfeld, Performance Exploration of the Hybrid Memory Cube.
PhD thesis, 2014.
[24] T. Zhang, K. Wang, Y. Feng, Y. Chen, Q. Li, B. Shao, J. Xie, X. Song,
L. Duan, Y. Xie, et al., “A 3d soc design for H.264 application with
on-chip dram stacking,” in 3DIC, Nov. 2010.
[25] M. Ghosh and H.-H. S. Lee, “Smart refresh: An enhanced memory
controller design for reducing energy in conventional and 3D
die-stacked DRAMs,” in Intl. Symp. on Microarchitecture (MICRO),
Dec. 2007.
[26] L. Zhou, S. Pan, J. Wang, and A. V. Vasilakos, “Machine learning on
big data: Opportunities and challenges,” 2017.
[27] D. Mahajan, J. Park, E. Amaro, H. Sharma, A. Yazdanbakhsh, J. K.
Kim, and H. Esmaeilzadeh, “Tabla: A unified template-based
framework for accelerating statistical machine learning,” in Intl. Symp.
on High Performance Computer Architecture (HPCA), Mar. 2016.
[28] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing
FPGA-based Accelerator Design for Deep Convolutional Neural
Networks,” in Intl. Symp. on Field-Programmable Gate Arrays
(FPGA), Feb. 2015.
[29] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao,
A. Misra, and H. Esmaeilzadeh, “From high-level deep neural models
to fpgas,” in Intl. Symp. on Microarchitecture (MICRO), Oct. 2016.
[30] A. R. Putnam, D. Bennett, E. Dellinger, J. Mason, and P. Sundararajan,
“Chimps: A high-level compilation flow for hybrid cpu-fpga
architectures,” in Intl. Conf. on Field Programmable Logic and
Applications (FPL), Sep. 2008.
[31] C. Farabet, Y. LeCun, K. Kavukcuoglu, E. Culurciello, B. Martini,
P. Akselrod, and S. Talay, “Large-scale fpga-based convolutional
networks,” Scaling up Machine Learning: Parallel and Distributed
Approaches, 2011.
[32] J. Park, H. Sharma, D. Mahajan, J. K. Kim, P. Olds, and
H. Esmaeilzadeh, “Scale-out acceleration for machine learning,” in
Intl. Symp. on Microarchitecture (MICRO), Oct. 2017.
[33] Y. Shen, M. Ferdman, and P. Milder, “Maximizing cnn accelerator
efficiency through resource partitioning,” in Intl. Symp. on Computer
Architecture (ISCA), June. 2017.
[34] S. G. Elango, Convolutional Neural Network Acceleration on GPU by
Exploiting Data Reuse. PhD thesis, San Jose State University, 2017.
[35] K.-S. Oh and K. Jung, “GPU implementation of neural networks,”
Pattern Recognition, vol. 37, no. 6, pp. 1311–1314, 2004.
[36] A. Guzhva, S. Dolenko, and I. Persiantsev, “Multifold acceleration of
neural network computations using GPU,” in Intl. Conf. on Artificial
Neural Networks (ICANN), Sep. 2009.
[37] K. Li, J. Chen, W. Chen, and J. Zhu, “Saberlda: Sparsity-aware
Learning of Topic Models on GPUs,” in Intl. Conf. on Architectural
Support for Programming Languages and Operating Systems (ASPLOS),
Apr. 2017.
[38] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, J. Kyung Kim,
V. Chandra, and H. Esmaeilzadeh, “Bit fusion: Bit-level dynamically
composable architecture for accelerating deep neural networks,” in
Intl. Symp. on Computer Architecture (ISCA), Jun. 2018.
[39] V. Akhlaghi, A. Yazdanbakhsh, K. Samadi, H. Esmaeilzadeh, and
R. K. Gupta, “Snapea: Predictive early activation for reducing
computation in deep convolutional neural networks,” in Intl. Symp. on
Computer Architecture (ISCA), June. 2018.
[40] A. Yazdanbakhsh, K. Samadi, H. Esmaeilzadeh, and N. S. Kim,
“GANAX: A Unified SIMD-MIMD Acceleration for Generative
Adversarial Network,” in Intl. Symp. on Computer Architecture (ISCA),
June. 2018.
[41] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and
A. Moshovos, “Stripes: Bit-serial Deep Neural Network Computing,”
in Intl. Symp. on Microarchitecture (MICRO), Oct. 2016.
[42] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan,
B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An
Accelerator for Compressed-sparse Convolutional Neural Networks,”
in Intl. Symp. on Computer Architecture (ISCA), June. 2017.
[43] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J.
Dally, “EIE: Efficient Inference Engine on Compressed Deep Neural
Network,” in Intl. Symp. on Computer Architecture (ISCA), June. 2016.
[44] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and
Y. Chen, “Cambricon-X: An Accelerator for Sparse Neural Networks,”
in Intl. Symp. on Microarchitecture (MICRO), Oct. 2016.
[45] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen,
Z. Xu, N. Sun, et al., “Dadiannao: A machine-learning supercomputer,”
in Intl. Symp. on Microarchitecture (MICRO), Dec. 2014.
[46] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng,
X. Zhou, and Y. Chen, “Pudiannao: A polyvalent machine learning
accelerator,” in Intl. Conf. on Architectural Support for Programming
Languages and Operating Systems (ASPLOS), June. 2015.
[47] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., “In-datacenter
performance analysis of a tensor processing unit,” in Intl. Symp. on
Computer Architecture (ISCA), June. 2017.
[48] J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and S. Mahlke,
“Scalpel: Customizing dnn pruning to the underlying hardware
parallelism,” in Intl. Symp. on Computer Architecture (ISCA), June.
2017.
[49] S. Venkataramani, A. Ranjan, S. Banerjee, D. Das, S. Avancha,
A. Jagannathan, A. Durg, D. Nagaraj, B. Kaul, P. Dubey, et al.,
“Scaledeep: A scalable compute architecture for learning and
evaluating deep networks,” in Intl. Symp. on Computer Architecture
(ISCA), June. 2017.
[50] J. Zhang, Z. Wang, and N. Verma, “A machine-learning classifier
implemented in a standard 6t sram array,” in Intl. Symp. on VLSI
Circuits (VLSI-Circuits), June. 2016.
[51] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “Pim-enabled instructions: A
low-overhead, locality-aware processing-in-memory architecture,” in
Intl. Symp. on Computer Architecture (ISCA), June. 2015.
[52] C. De Sa, M. Feldman, C. Ré, and K. Olukotun, “Understanding and
optimizing asynchronous low-precision stochastic gradient descent,”
in Intl. Symp. on Computer Architecture (ISCA), June. 2017.
[53] L. Bottou, “Stochastic gradient learning in neural networks,”
Neuro-Nîmes, 1991.
[54] L. Bottou, “Stochastic gradient descent tricks,” in Neural networks:
Tricks of the trade, 2012.
[55] S. Zhang, C. Zhang, Z. You, R. Zheng, and B. Xu, “Asynchronous
stochastic gradient descent for dnn training,” in Intl. Conf. on
Acoustics, Speech and Signal Processing (ICASSP), May. 2013.
[56] R. Ormándi, I. Hegedűs, and M. Jelasity, “Asynchronous peer-to-peer
data mining with stochastic gradient descent,” in Euro-Par, 2011.
[57] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient
descent and its application to data-parallel distributed training of
speech dnns,” in Interspeech, 2014.
[58] M. Li, T. Zhang, Y. Chen, and A. J. Smola, “Efficient mini-batch
training for stochastic optimization,” in KDD, 2014.
[59] J. Kaufmann, “Signal conditioner with symbol addressed lookup table
producing values which compensate linear and non-linear distortion
using transversal filter,” 1998. US Patent 5,778,029.
[60] K. Engel, M. Kraus, and T. Ertl, “High-quality pre-integrated volume
rendering using hardware-accelerated pixel shading,” in Proc. Conf. on
Graphics Hardware (HWWS), 2001.
[61] T. A. Keahey and E. L. Robertson, “Techniques for non-linear
magnification transformations,” in Intl. Symp. on Information
Visualization (ISIV), Oct. 1996.
[62] J. Mielikainen et al., “Lossless compression of hyperspectral images
using lookup tables,” Signal Process, 2006.
[63] C. Schulz, “Graph partitioning and graph clustering in theory and
practice,” in Institute for Theoretical Informatics Karlsruhe Institute of
Technology (KIT), 2016.
[64] G. W. Flake, R. E. Tarjan, and K. Tsioutsiouliklis, “Graph clustering
and minimum cut trees,” Internet Mathematics, 2004.
[65] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010.
[66] “A variant of mnist dataset with 8 millions records.”
[67] J. P. Pinto, Multilayer Perceptron Based Hierarchical Acoustic
Modeling for Automatic Speech Recognition. PhD thesis, EPFL, 2010.
[68] B. Zhou, “High-frequency data and volatility in foreign-exchange
rates,” Journal of Business & Economic Statistics, vol. 14, no. 1, 1996.
[69] S. Dhanya and R. V. Kumari, “Comparison of various texture
classification methods using multiresolution analysis and linear
regression modelling,” Springerplus, vol. 5, no. 54, 2016.
[70] M. Segal, K. Dahlquist, and B. Conklin, “Regression approaches for
microarray data analysis,” Journal of Computational Biology, vol. 10,
no. 6, 2003.
[71] D. Singh, P. Febbo, K. Ross, D. Jackson, J. Manola, C. Ladd,
P. Tamayo, A. Renshaw, A. D'Amico, J. Richie, E. Lander, M. Loda,
P. Kantoff, T. Golub, and W. Sellers, “Gene expression correlates of
clinical prostate cancer behavior,” Cancer Cell, vol. 1, no. 2, 2002.
[72] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and
V. Vapnik, “Feature selection for svms,” in NIPS, 2000.
[73] “Integrated cancer repository for cancer research.”
[74] I. Cantador, P. Brusilovsky, and T. Kuflik, “Movielens dataset,” in
HetRec, 2011.
[75] Grouplens, “Movielens dataset,” 2017.
[76] “Netflix prize data set.”
[77] “Integrated cancer repository for cancer research.”
[78] J. Jeddeloh and B. Keeth, “Hybrid Memory Cube New DRAM
Architecture Increases Density and Performance,” in Intl. Symp. on
VLSI Technology (VLSIT), June. 2012.
[79] S. Li, K. Chen, J. H. Ahn, J. B. Brockman, and N. P. Jouppi,
“CACTI-P: Architecture-level Modeling for SRAM-based Structures
with Advanced Leakage Reduction Techniques,” in Intl. Conf. on
Computer-Aided Design (ICCAD), Nov. 2011.
[80] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S.
Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow,
A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser,
M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray,
C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar,
P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals,
P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng,
“TensorFlow: Large-scale machine learning on heterogeneous
distributed systems,” arXiv:1603.04467 [cs], 2016.
