TDO-CIM: Transparent Detection and Offloading for Computation In-memory by Vadivel, Kanishkan et al.
Kanishkan Vadivel1, Lorenzo Chelini1,2, Ali BanaGozar1, Gagandeep Singh1,2, Stefano Corda1,2, Roel Jordans1,
Henk Corporaal1
1Eindhoven University of Technology, 2IBM Research Zürich
{k.vadivel, l.chelini}@tue.nl
Abstract—Computation in-memory is a promising non-von
Neumann approach that aims to eliminate data
transfer to and from the memory subsystem. Although many
architectures have been proposed, compiler support for such
architectures still lags behind. In this paper, we close this
gap by proposing an end-to-end compilation flow for in-memory
computing based on the LLVM compiler infrastructure. Starting
from sequential code, our approach automatically detects, opti-
mizes, and offloads kernels suitable for in-memory acceleration.
We demonstrate our compiler tool-flow on the PolyBench/C
benchmark suite and evaluate the benefits of our proposed in-
memory architecture simulated in Gem5 by comparing it with a
state-of-the-art von Neumann architecture.
Index Terms—LLVM, compute in memory, memristor, pattern
matching, Polly, Loop Tactics
I. INTRODUCTION
As we are moving toward exascale computing, the memory
wall [1] is becoming one of the toughest challenges for the
traditional von Neumann architecture. Not only does the cost
of moving data dwarf the cost of a floating-point operation but
also the memory bandwidth is not able to meet the demand
of today’s applications [2]. Consequently, new architectures
that depart radically from the traditional von Neumann
architecture are starting to emerge. Computing in-memory (CIM) is
one of them. CIM aims at processing information and storing
computation data on the same physical unit using emerging
devices referred to as memristive devices. Memristive devices
such as phase change memory devices (PCM) can store data
within their conductance state, which can be changed by
altering the amorphous/crystalline phase within the device [3].
Computation, on the other hand, is carried out through various
physical mechanisms such as Ohm’s and Kirchhoff’s laws.
Memristor devices are organized in a computational memory
unit, which we refer to as a CIM tile. As storing and
processing happen in the same physical device, CIM eliminates
the overhead of data movement between the CPU
and the main memory, enabling data-intensive tasks to be
executed efficiently [4]. A large body of research exists
on the architecture side [5]–[7]. However, before CIM can
be established as the de facto solution for HPC and IoT
applications, a leap forward is needed in the compilation
toolchain and software stack, which is the purpose of this
paper. Our contributions are:
• An end-to-end compilation flow for CIM devices, which
automatically and transparently invokes CIM acceleration
without any user intervention, thereby enabling legacy
code to exploit in-memory acceleration.
• A lightweight run-time library for data allocation, transfer,
and execution of computational tasks on the CIM device.
• An evaluation of the benefits of CIM computation in terms
of energy and performance, comparing it with a
current state-of-the-art von Neumann architecture using
the PolyBench/C benchmark suite.
Fig. 1. Cross section of a PCM device (a) and its programming pulses (b).
[Panel (a) labels: top electrode, phase change material, programmable region, heater, insulator, bottom electrode. Panel (b): temperature versus time for the reset, set, and read pulses, with levels T melt, T crys, and T room.]
II. COMPUTE IN MEMORY ARCHITECTURE
In this section, we first briefly discuss the physics behind
the PCM device (Section II-A). Afterward, we show how
such devices can be interconnected together in a crossbar-
like structure (Section II-B) to create the basic block of our
accelerator (Section II-C).
A. Memristor Basics
PCM is a type of non-volatile memory that stores informa-
tion by changing the cell resistance, switching between amor-
phous and crystalline states. The transition between the two
states happens as a consequence of the application of external
voltages that exceed the threshold voltage of a device. Figure 1
(a) shows a cross-section of a PCM device. The phase change
material is sandwiched between two electrodes, and current
is applied through the heater in order to change the material
state. A short but intense pulse, known as the reset pulse, is
used to bring the material into the amorphous (high-resistance)
phase. Conversely, to switch back to low resistance, the
set pulse, a lower and longer pulse, is applied. To read the
device, an even lower pulse (the read pulse) is used (Figure 1 (b)).
Thanks to the excellent scaling capabilities of PCM devices,
which allow main memory capacity to be increased in a
cost-effective and power-efficient way, it is expected that PCM
will play a significant role in future memory architectures [8].
But before this can happen, one main challenge needs to be
addressed: endurance. PCM devices can withstand 10^6 to 10^8 writes
before they wear out, limiting the lifetime of a PCM-based
system to a few years [9]. Despite a lot of effort on
architecture support for wear-leveling and smart algorithms
for data re-placement, no prior work tries to address this
endurance problem at compile time. TDO-CIM addresses this
obstacle by revisiting two common compiler transformations:
tiling and fusion (Section III-B).
arXiv:2007.00060v1 [cs.AR] 30 Jun 2020
Fig. 2. Overview of the emulated system (a). A more detailed view of our developed CIM accelerator (b). A memristor-based crossbar (c). Timeline of a
kernel execution on our CIM accelerator (d).
[Panel (a): host (ARM), CIM accelerator, and main memory. Panel (b): micro engine, context registers, DMA, row buffers, column buffers, PCM crossbar, digital logic, and output buffers. Panel (c): conductances G 0,0 to G 2,2 driven by voltages v0 to v2, with S&H and ADC on the columns, computing I = v.G. Panel (d): the host prepares data in shared memory and writes the CIM configuration registers, then triggers the accelerator; the accelerator fills its buffers, computes, accumulates, stores the result, and signals result ready.]
B. CIM Tile
The electrical conductance/resistance of the PCM depends
on the material phase of the device. A single PCM can achieve
several resistance levels that can be exploited for in-memory
computations [10]. Each resistance level can be used to encode
a particular binary value. For instance, a PCM device with 2^M
levels can support a maximum of M-bit computation at full
resolution. To support higher resolutions with a low-precision
device, multiple columns in a crossbar can be used [11].
Each column in the crossbar computes partial results, and the
final result is computed by a weighted sum using traditional
CMOS technology. Figure 2 (c) shows how PCM devices can
be organized in a crossbar-like structure to execute matrix-
vector multiplication. A matrix can be stored in the crossbar
as the conductance state of the PCM devices (G_{x,y} values).
Afterward, the input vector is fed as a set of voltages to the
crossbar, which multiplies them by the conductance values. The
resulting current sensed at each column is the analog dot-product
result [12]. The output currents are converted back
into digital signals by analog to digital converters (ADC).
To further improve the energy efficiency, ADCs are shared
amongst multiple columns which are reused using sample and
holds (S&H) [13]. In addition to the previously mentioned
analog components, a digital interface is required to hook the
CIM tile (Figure 2 (b)) with traditional CMOS logic. The
digital interface is composed of row/column buffers, output
buffers, and a digital logic block. The row/column buffers act
as data and mask registers for the crossbar [14]. During a
write operation, the column buffers contain the data that has
to be written on the crossbar, and the row buffers supply
a row-enable signal. Similarly, during a compute operation,
the column buffers supply a column-enable signal and the row
buffers latch the inputs. The computed result is stored in
the output buffers. The digital logic block implements scalar
compute functionality (i.e., reduction functions) to perform
post-processing on the crossbar result.
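As a rough illustration of the crossbar operation described in this section, the sketch below models an ideal crossbar in software: the conductances G[x][y] hold the matrix, the input voltages v[x] drive the rows, and each column current is the dot product of v with that column (I = v.G). The function name crossbarGemv and the use of double-precision values are assumptions made for illustration only; analog effects such as noise, ADC quantization, and S&H sharing are deliberately ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Idealized software model of an analog crossbar (illustrative only).
// G[x][y] is the conductance at row x, column y; v[x] is the voltage on row x.
// By Ohm's law each cell contributes v[x] * G[x][y] to its column, and by
// Kirchhoff's current law the contributions add up along each column.
std::vector<double> crossbarGemv(const std::vector<std::vector<double>>& G,
                                 const std::vector<double>& v) {
    const std::size_t rows = G.size();
    const std::size_t cols = rows ? G[0].size() : 0;
    assert(v.size() == rows);

    std::vector<double> I(cols, 0.0);  // one output current per column
    for (std::size_t x = 0; x < rows; ++x)
        for (std::size_t y = 0; y < cols; ++y)
            I[y] += v[x] * G[x][y];
    return I;
}
```

Calling crossbarGemv with a 3x3 matrix and a 3-element vector returns the three column sums that the ADCs would digitize in the real tile.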
C. Accelerator Organization
A CIM tile, a micro-engine, and a DMA unit for load and
store operations make a standalone accelerator. The core is the
CIM tile, which computes a standard matrix-vector multiplication
(GEMV) of complexity O(N^2) in O(1) time.
The matrix-matrix computation (GEMM) can be implemented
as a series of matrix-vector operations, and therefore the
accelerator supports both GEMV and GEMM.
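The decomposition of GEMM into a series of GEMVs can be sketched as follows. This is a host-side illustration under stated assumptions, not the micro-engine's actual control flow: the helper names gemv and gemmViaGemv are hypothetical, and the matrix A is assumed to stay fixed (as it would when resident in the crossbar) while the columns of B are streamed one at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// y = A * x: stands in for the single primitive the CIM tile provides.
std::vector<float> gemv(const Matrix& A, const std::vector<float>& x) {
    std::vector<float> y(A.size(), 0.0f);
    for (std::size_t i = 0; i < A.size(); ++i) {
        assert(A[i].size() == x.size());
        for (std::size_t k = 0; k < x.size(); ++k)
            y[i] += A[i][k] * x[k];
    }
    return y;
}

// C = A * B computed as one GEMV per column of B.
Matrix gemmViaGemv(const Matrix& A, const Matrix& B) {
    const std::size_t M = A.size();
    const std::size_t K = B.size();
    const std::size_t N = K ? B[0].size() : 0;

    Matrix C(M, std::vector<float>(N, 0.0f));
    std::vector<float> column(K);
    for (std::size_t j = 0; j < N; ++j) {
        for (std::size_t k = 0; k < K; ++k)
            column[k] = B[k][j];            // stream column j of B as the input vector
        const std::vector<float> y = gemv(A, column);
        for (std::size_t i = 0; i < M; ++i)
            C[i][j] = y[i];                 // column j of the result
    }
    return C;
}
```

Each iteration of the outer loop corresponds to one GEMV on the tile, so an N-column GEMM costs N GEMV invocations in this simplified view.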
The accelerator uses a shared global memory interface for
data sharing and exposes a set of context registers to the
system via a memory-mapped IO interface. Context registers
are used for control and offloading, and are read or written by
the host. The micro-engine translates the high-level parameters
stored in the context registers into a series of circuit-level
operations such as loading the data from shared memory to
row/column buffers, configuring the mask values, triggering
the computation on CIM tile, and writing back the results
from the output buffers to the shared memory. Additionally,
it manages the control flow involved in decomposing GEMM
to a series of GEMVs and supports double buffering for all
the registers in the accelerator to hide the data latency of the
memory accesses. Figure 2 (d) shows a timeline of the events
that happen after the host triggers the CIM accelerator.
D. Hardware Model
Figure 2 (a) shows our emulated system with a host, main
memory, and a CIM accelerator connected through a system
bus. We implement the CIM accelerator as a cycle-accurate
model integrated into the Gem5 simulator [15]. The accelerator
is based on a port-mapped IO (PMIO) and a DMA interface.
The PMIO interface exposes context registers to the system,
and the DMA provides a memory interface to the accelerator.
The host mimics a dual-core Arm-A7 processor based on [16].
For the experiments in the paper, we run the simulator in
full-system mode to capture the effects of the operating system,
device drivers, and hardware interactions.
Fig. 3. Overview of the CIM’s software stack.
[Layers, top to bottom: user applications and the CIM runtime library (user space); the CIM driver with memory management, read, and write (kernel space); the CIM hardware with context registers for status, command, and DMA.]
E. Software Stack
A software stack (Figure 3) allows applications running
in the user space to interact directly with the hardware.
The software stack is divided into kernel-space and user-
space. At the lowest level of the stack, the kernel-space
CIM driver reads and writes to the context registers of the
accelerator through an ioctl system call. In addition, the driver
translates the virtual addresses used by the host processor to
physical addresses, as the accelerator can work only with
physical addresses. On the other hand, the user-space CIM
API is responsible for encoding CIM runtime library calls
into context register parameters. Furthermore, with the help of
the CIM driver, it implements the support for allocating and
releasing the physically-contiguous pages in shared memory
via the contiguous memory allocator (CMA) APIs exposed
by the Linux kernel [17]. The use of CMA offers two main
benefits compared to the traditional malloc-based approach:
1) the size of the shared memory region is not limited by
the page boundary; 2) there is no need for explicit memory
management in the driver routines, which diminishes overhead
in the host.
To enforce memory coherence in the shared memory region,
the kernel driver triggers a cache flush on the host side before
invoking the accelerator. The accelerator, for its part, uses only
uncacheable requests for memory access, which automatically
enforces memory coherence. Once the accelerator completes
its execution, it updates the status in a specific context register.
The host can either busy-wait on this register or continue with
other tasks and check its status periodically.
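The completion handshake described above can be sketched with a small single-threaded mock. Everything here is hypothetical: the CimStatus values, the MockCimDevice type, and the tick-based notion of time are illustration devices, since the text does not give the real register layout or driver interface.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical status encoding; the actual context-register layout is not given.
enum class CimStatus : std::uint32_t { Idle, Busy, Done };

// Mock accelerator: finishes a fixed number of simulated cycles after trigger.
struct MockCimDevice {
    CimStatus status = CimStatus::Idle;  // models the status context register
    std::uint32_t remaining = 0;         // simulated cycles until completion

    void trigger(std::uint32_t latencyCycles) {
        remaining = latencyCycles;
        status = (latencyCycles == 0) ? CimStatus::Done : CimStatus::Busy;
    }

    // Advance simulated time by one cycle.
    void tick() {
        if (status == CimStatus::Busy && --remaining == 0)
            status = CimStatus::Done;
    }
};

// Host side: poll the status register until the device is no longer busy.
// Returns the number of polls; a real host could do other work between polls.
std::uint32_t waitForCompletion(MockCimDevice& dev) {
    std::uint32_t polls = 0;
    while (dev.status == CimStatus::Busy) {
        dev.tick();  // in hardware, time advances on its own
        ++polls;
    }
    return polls;
}
```

With a simulated latency of five cycles, the host observes the Done status after five polls; replacing the tight loop with periodic checks models the second option mentioned in the text.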
III. OVERVIEW OF THE CIM COMPILER
The high-level design of our compilation flow is shown in
Figure 4. It follows a classical compiler design with a front-end,
a mid-level optimizer, and target-specific back-ends. We
extend this flow by introducing 1) Loop Tactics [18], [19], a
state-of-the-art declarative optimizer, in the mid-level, which
enables automatic detection and offload of specific
computational patterns; and 2) a lightweight runtime library that
provides optimized performance and memory usage for the
CIM device. The library has been designed to be used directly
by the application programmer or by an optimizer (e.g., Loop
Tactics). It exposes a host-callable C API, similar to what
cuBLAS and MKL offer for Nvidia GPUs and Intel CPUs,
respectively.
A. A Bird’s-eye View of TDO-CIM
The entry point in our compilation flow is an application
written in a high-level language (e.g., C++). To handle a
variety of languages, front-ends lower the high-level language
to an intermediate representation (IR) on which all the
subsequent optimizations are expressed. For our work we use the
LLVM compiler infrastructure, hence adopting its intermediate
representation LLVM-IR. Given an application, we can use
any of the LLVM-based front-ends (e.g., Clang) to lower
the high-level language to LLVM-IR. At the LLVM-IR level we
rely on the polyhedral optimizer Polly [20] to detect, extract,
and model compute kernels. Internally, Polly represents the
schedule of each detected kernel as a tree, which we refer to as
a schedule tree. A schedule tree [21] represents the
execution strategy of a kernel by mapping each dynamic
statement instance to its execution order. This mapping is
implicitly defined by the parent-child relation of the nodes within
the tree. Loop optimizations and device mapping are expressed
as tree modifications and carried out by Loop Tactics, which
works as a set of additional passes within Polly. Loop Tactics’
passes consume schedule trees and output a CIM-optimized schedule.
The modified tree is then passed back to Polly, which lowers it
back to an imperative AST and then further down to LLVM-IR.
In the back-end, LLVM-IR is lowered to the final executable. It
is at this stage of the compilation pipeline that we link our
executable with the CIM runtime library. Listing 1 shows at the
top a GEMM kernel in C++ code, and at the bottom the
mid-level optimizer output as pseudo-C++. For our example, we
assume single-precision operands. The GEMM kernel has been
detected by Loop Tactics and replaced with a function call
to the CIM runtime library (polly_cimBlasSGemm). BLAS
parameters (e.g., the value of alpha or the leading dimensions) are
automatically collected or computed by Loop Tactics. In
addition, Loop Tactics inserts an initialization call to configure the
CIM hardware (polly_cimInit) as well as all the function
calls to orchestrate the data movement to and from the device
(e.g., polly_cimMalloc and polly_cimDevToHost).
for (int i = 0; i < M; ++i)
  for (int j = 0; j < N; ++j) {
    C[i][j] = beta * C[i][j];
    for (int k = 0; k < K; ++k)
      C[i][j] += alpha * A[i][k] * B[k][j];
  }

/* ... */
// initialize CIM device 0
polly_cimInit(0);
// allocate data on CIM device
polly_cimMalloc((void**)&cim_C, M*N*4);
polly_cimMalloc((void**)&cim_A, M*K*4);
polly_cimMalloc((void**)&cim_B, K*N*4);
// execute GEMM kernel on CIM device
polly_cimBlasSGemm(transA, transB, M, N, K,
                   &alpha, cim_A, lda, cim_B, ldb,
                   &beta, cim_C, ldc);
// copy C back to host
polly_cimDevToHost(cim_C, host_C, M*N*4);
Listing 1. High-level code for a general matrix multiplication (GEMM)
kernel (top). Loop Tactics generated code to offload the GEMM kernel to the
CIM accelerator (bottom).
Fig. 4. LLVM-based compilation flow developed for the CIM accelerator.
[Stages: front-end (C++/C lowered by Clang to LLVM-IR); mid-optimizer (Polly with the Loop Tactics passes, tree matcher and tree builder, producing CIM-optimized LLVM-IR); back-end (host backend, assembler, and linker with the CIM runtime library, producing the executable).]
B. TDO-CIM specific optimizations
We revisit loop fusion and tiling in the light of the new
CIM computing paradigm, aiming to minimize write operations
to the crossbar and thereby enhance endurance.
Revisited Loop Fusion: Loop fusion is a performance-
oriented transformation that combines two loop nests in a new
single-loop nest. In our case, we focus on a specific case of
loop fusion: kernel fusion. Consider two consecutive kernels
X and Y, with Y following X directly. We fuse X and Y if both
kernels have the same access patterns (i.e., both are GEMM
kernels) and are independent. Two kernels are independent if
Y doesn’t read from or write to any output of X, and Y does
not write to any input of X. An example is shown in Listing 2.
for (int i = 0; i < M; ++i)
  for (int j = 0; j < N; ++j) {
    for (int k = 0; k < K; ++k)
      s1: C[i][j] += A[i][k] * B[k][j];
  }
for (int i = 0; i < M; ++i)
  for (int j = 0; j < N; ++j) {
    for (int k = 0; k < K; ++k)
      s2: D[i][j] += A[i][k] * E[k][j];
  }
Listing 2. Independent kernels with shared input (A matrix). TDO-CIM
exploits shared inputs to increase endurance by avoiding multiple writes on
the memristor crossbar.
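The independence condition used for kernel fusion can be sketched at the granularity of whole arrays. This is a deliberate simplification for illustration: the KernelAccesses type and the independent function are hypothetical names, and a polyhedral optimizer such as Polly reasons about precise access relations rather than array names.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>

// Arrays touched by one kernel, tracked by name (illustrative simplification).
struct KernelAccesses {
    std::set<std::string> reads;   // arrays the kernel reads (its inputs)
    std::set<std::string> writes;  // arrays the kernel writes (its outputs)
};

// True when the two sets share no array name.
static bool disjoint(const std::set<std::string>& a,
                     const std::set<std::string>& b) {
    return std::none_of(a.begin(), a.end(),
                        [&b](const std::string& s) { return b.count(s) != 0; });
}

// Y directly follows X. The kernels are independent if Y neither reads nor
// writes any output of X, and Y does not write any input of X.
bool independent(const KernelAccesses& X, const KernelAccesses& Y) {
    return disjoint(Y.reads, X.writes) &&
           disjoint(Y.writes, X.writes) &&
           disjoint(Y.writes, X.reads);
}
```

For the two kernels in Listing 2, s1 reads A, B, and C and writes C, while s2 reads A, E, and D and writes D, so the check succeeds even though both kernels read A; shared read-only inputs do not create a dependence.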
Fig. 5. Impact of TDO-CIM fusion transformation for the code in Listing 2.
[Plot: system lifetime (years) versus PCM cell endurance (number of writes in million, 10 to 40) for the naive mapping and the “smart” mapping.]
By fusing two kernels, we get the following advantages:
1) We reduce the number of calls to the runtime library by
using batched operations. The GEMMs in Listing 2 will be
replaced by a single call to polly_cimBlasGemmBatched
instead of two calls to polly_cimBlasSGemm. The interface
for the batched operation is similar to the one provided for
polly_cimBlasSGemm, the only exception being that it takes
arrays of pointers instead of single pointers. 2) We increase
endurance by exploiting possible shared inputs. The A matrix
(Listing 2) is shared and remains constant; we exploit this
by writing only A in the crossbar and streaming B and E
from the input buffers. This allows writing only one matrix
on the crossbar, in contrast with a naive mapping where B
and E would have been written, and A streamed from the
input buffers. Figure 5 shows the expected lifetime for the
PCM crossbar, comparing the naive mapping and the “smart”
mapping applied by TDO-CIM. The x-axis shows the PCM
cell endurance in the range from 10 million to 40 million
writes, which lies within the expected endurance range of a PCM
device (10^6 to 10^8 writes). The expected lifetime is computed by
applying the following equation [9]:
$$\mathrm{SystemLifetime} = \frac{\mathrm{CellEndurance} \times S}{B} \qquad (1)$$
where S is the crossbar size, 512 KB in our case, while B
is the write traffic in GB/s for the kernel in Listing 2. B is
obtained by dividing the total number of writes by the kernel
execution time. We assume square matrices of 4096 byte-elements
and the writes to be distributed uniformly across the
entire crossbar. As can be seen from Figure 5, the “smart”
mapping improves endurance by a factor of 2.
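Equation (1) can be evaluated with a few lines of code. The sketch below is only a unit-conversion aid: the function name systemLifetimeYears is hypothetical, the write traffic is expressed in bytes per second rather than GB/s, and any traffic value passed in is an assumption, since the text does not report the measured value of B.

```cpp
#include <cassert>
#include <cmath>

// Equation (1): lifetime [s] = cell endurance [writes] * S [bytes] / B [bytes/s],
// assuming writes are spread uniformly over the crossbar. Result is in years.
double systemLifetimeYears(double cellEndurance,
                           double crossbarBytes,
                           double writeBytesPerSecond) {
    const double secondsPerYear = 365.0 * 24.0 * 3600.0;
    const double lifetimeSeconds =
        cellEndurance * crossbarBytes / writeBytesPerSecond;
    return lifetimeSeconds / secondsPerYear;
}
```

Because the lifetime is inversely proportional to B, a mapping that halves the write traffic doubles the lifetime for any fixed cell endurance, which matches the factor of 2 reported for the “smart” mapping.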
TABLE I
CIM AND HOST SYSTEM CONFIGURATION.
CIM Parameter                               Value
PCM Crossbar Technology (256x256 @8-bit)    IBM PCM 2x(256x256 @4-bit)
Compute and Write Latency/8-bit             1µs and 2.5µs
Compute Energy/8-bit                        200fJ (2x 100fJ/4-bit PCM)
Write Energy/8-bit                          200pJ (2x 100pJ/4-bit PCM)
Energy for Mixed signal circuit             3.9nJ (@1.2GHz)
Input/Output buffer Energy (1.5KB)          5.4pJ/byte-access
Digital Logic                               40pJ/GEMV for weighted sum and 2.11pJ/extra ALU operation
Energy for DMA and microEngine              <0.78nJ
Host CPU                                    Spec
2xArm-A7 @1.2GHz                            2GB LPDDR3 @933MHz
L1-I/D-32KB, L2-2MB                         128pJ/inst (including cache; see footnote 1)
Revisited Tiling Transformation: Let us now focus on
our revisited tiling optimization. Tiling is a well-known
transformation to improve locality by reducing the reuse distance
of memory accesses to the same location. Consider statement
s1 in Listing 2 and assume that matrix A does not fit in the
CIM crossbar. We use tiling to split A into multiple tiles
such that the working set of a single tile fits in the CIM
crossbar. We then apply loop interchange on the tile loops
jj and kk such that we can reuse tiles of A in consecutive
executions of the point loops, hence once more increasing
endurance. The outcome of our tiling transformation is shown
in Listing 3, where the point loops will be replaced by a call
to polly_cimBlasSGemm.
// tile loops
for (int ii = 0; ii < SIZE; ii += TILE)
  for (int kk = 0; kk < SIZE; kk += TILE)
    for (int jj = 0; jj < SIZE; jj += TILE)
      // point loops
      for (int i = ii; i < ii + TILE; i++)
        for (int j = jj; j < jj + TILE; j++)
          for (int k = kk; k < kk + TILE; k++)
            C[i][j] += A[i][k] * B[k][j];
Listing 3. Loop tiling and interchange to fit the operand on the CIM crossbar
and reduce the number of writes by reusing A tiles in consecutive executions
of the point loops.
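The effect of interchanging the jj and kk tile loops can be made concrete by counting how often a different A tile has to be written. The sketch below rests on an assumption that is not stated in the text: the crossbar holds exactly one A tile, identified by (ii, kk), and is rewritten only when that tile changes. The function name countATileWrites is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Counts crossbar (re)writes of A tiles for a tiled GEMM with
// tilesPerDim tiles along each dimension.
//   interchanged == true : tile-loop order ii, kk, jj (as in Listing 3)
//   interchanged == false: tile-loop order ii, jj, kk (before interchange)
std::size_t countATileWrites(std::size_t tilesPerDim, bool interchanged) {
    std::size_t writes = 0;
    bool loaded = false;                       // is any tile resident yet?
    std::size_t residentI = 0, residentK = 0;  // tile currently in the crossbar

    for (std::size_t ii = 0; ii < tilesPerDim; ++ii)
        for (std::size_t outer = 0; outer < tilesPerDim; ++outer)
            for (std::size_t inner = 0; inner < tilesPerDim; ++inner) {
                // kk is the middle loop after interchange, the innermost before.
                const std::size_t kk = interchanged ? outer : inner;
                if (!loaded || residentI != ii || residentK != kk) {
                    ++writes;                  // a new A tile must be written
                    loaded = true;
                    residentI = ii;
                    residentK = kk;
                }
            }
    return writes;
}
```

Under this model, T tiles per dimension need T^2 tile writes with the interchanged order but T^3 without it, because every innermost iteration then switches to a different A tile.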
IV. DEMONSTRATION AND EVALUATION
In this section, we quantify the benefits of CIM computation
for a set of linear-algebra kernels from the PolyBench/C
benchmark suite compiled with TDO-CIM.
a) Experimental Setup: We use the system shown in
Figure 2 (a). We select an energy efficient dual-core Arm-
A7 with a shared L2 cache. The simulator is a cycle-accurate
model that imitates the functionality of the memristor com-
putations and surrounding digital blocks [14]. The memristor
crossbar is an 8-bit 256x256 PCM crossbar based on IBM’s
4-bit PCM [4]. To mimic an 8-bit cell with a 4-bit cell, two
adjacent columns are used, one for 4 MSBs and the other for
4 LSBs. The final result is computed by a weighted sum of
MSB and LSB columns in the digital logic block. The energy
and latency model for the crossbar and mixed-signal circuitry
is from [4] and [13] respectively. The energy model for the
rest of the digital blocks is based on a synthesis report of
commercial 40nm finFET technology. Table I summarises our
system configuration and energy model.
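The way two 4-bit columns emulate one 8-bit cell can be checked with integer arithmetic. The sketch below is a digital stand-in for the analog computation: each 8-bit weight is split into its 4 MSBs and 4 LSBs, each half accumulates its own column sum, and the digital logic combines them with weights 16 and 1. The function name slicedDot is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Dot product of 8-bit weights and 8-bit inputs using two 4-bit "columns".
std::uint32_t slicedDot(const std::vector<std::uint8_t>& weights,
                        const std::vector<std::uint8_t>& inputs) {
    assert(weights.size() == inputs.size());
    std::uint32_t msbColumn = 0;  // column storing the 4 most significant bits
    std::uint32_t lsbColumn = 0;  // column storing the 4 least significant bits
    for (std::size_t i = 0; i < weights.size(); ++i) {
        const std::uint32_t in = inputs[i];
        msbColumn += in * static_cast<std::uint32_t>(weights[i] >> 4);
        lsbColumn += in * static_cast<std::uint32_t>(weights[i] & 0x0F);
    }
    // Weighted sum performed by the digital logic block: MSB column counts 2^4.
    return 16u * msbColumn + lsbColumn;
}
```

The result equals the ordinary 8-bit dot product because each weight w satisfies w = 16 * (w >> 4) + (w & 0x0F), and the sum distributes over both halves.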
Footnote 1: Based on Ara: Energy-Efficient RISC-V, Matheus et al. 2019.
Footnote 2: clang -O3 -march=native (host);
clang -O3 -march=native -enable-loop-tactics (host+CIM).
b) Performance Evaluation: We use the compilation
strings shown in footnote 2 for the host and the host+CIM,
respectively. Dynamic instruction count and run-time are
profiled in Gem5 by inserting ROI markers. For energy estimates,
we use the numbers shown in Table I. We do not include
DRAM energy numbers in the estimates as the host and CIM-
accelerator generate the same amount of traffic by accessing
the same data. Figure 6 (left) shows the energy numbers
obtained for the reference platform (Arm-A7), and for the
Arm-A7+CIM where the kernel execution is performed on
the in-memory accelerator. For the host, the energy numbers
include the energy spent on computation and in the memory
hierarchy. For the CIM, the energy numbers incorporate the
energy spent on the driver (host side) and in the accelerator.
GEMM-like kernels (2mm, 3mm, gemm, and conv) achieve
good energy improvements over the reference system.
This is not the case for GEMV-like kernels (bicg, mvt,
gesummv) due to their low compute intensity. From the CIM
perspective, the compute intensity for a given kernel can be
formulated as (number of MAC operations) / (number of CIM writes),
which is very low for GEMV-like kernels, as can be seen in
Figure 6 (left). With such low compute intensity, the energy is
dominated by the overhead in the host for offloading computations
to the accelerator and by the number of writes, which are costly
for the CIM device (200pJ/byte, see Table I). Figure 6 (right)
shows the energy-delay-product (EDP). It follows the same trend
as the energy plot. We gain for GEMM-like kernels (up to 612x)
while we lose for GEMV-like kernels.
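The gap in compute intensity between GEMM-like and GEMV-like kernels follows directly from the ratio above. The sketch below uses a simplified model that is an assumption of this example, not the simulator's accounting: the A operand (M x K elements) is written to the crossbar once, the other operand is streamed, and GEMV is treated as GEMM with N = 1. The names KernelShape and macsPerCimWrite are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Shape of C[M x N] += A[M x K] * B[K x N]; a GEMV has N == 1.
struct KernelShape {
    std::size_t M, N, K;
};

// Compute intensity = MAC operations / CIM writes, assuming A is written
// to the crossbar exactly once and B is streamed through the input buffers.
double macsPerCimWrite(const KernelShape& s) {
    const double macs = static_cast<double>(s.M) * s.N * s.K;
    const double writes = static_cast<double>(s.M) * s.K;
    return macs / writes;
}
```

Under this model a 256x256x256 GEMM performs 256 MACs per written element, while a 256x256 GEMV performs only one, so each costly crossbar write is amortized far less in the GEMV case.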
V. RELATED WORK
Code offloading: Several works address the issue,
TOM [22] being perhaps the very first of them. TOM proposes
an offloading decision based on a simple cost function. The
idea is to statically identify the code section with the highest
potential in bandwidth saving. Similarly, Pattnaik et al.
propose an affinity prediction model based on memory-related
metrics to decide where a given kernel should be executed
(i.e., main CPU or in-memory accelerator) [23]. The previously
mentioned works target GPUs as in-memory accelerators. In
our case, on the other hand, we target a memristor crossbar,
which means that only specific kernels can be offloaded,
as the accelerator is capable of executing only GEMM- and
GEMV-like kernels. Nair et al. propose code offloading
based on OpenMP 4.0 user annotations [24]. In contrast, our
approach is completely transparent to the application and does
not require any user intervention to exploit CIM acceleration.
CAIRO relies on an LLC cache profiler and analytical models
to decide potential offloading candidates [25]. The LLC
profiler is not integrated into the compilation flow and requires
characterizing the behavior of each kernel offline. Other works
expose CIM acceleration via an API [6], [7], which requires
significant changes in the application, reducing application
readiness and hindering widespread adoption.
Fig. 6. Energy (left) and Energy-delay-product (right) improvement for CIM computation.
[Plots over 2mm, 3mm, gemm, conv, gesummv, bicg, mvt, and their mean: energy in mJ for Host (Arm-A7) and Host+CIM, MACs per CIM write, EDP improvement, and performance (runtime) improvement.]
Enhance PCM endurance: Software and hardware
wear-leveling techniques to distribute write operations uniformly
across the memory module have been studied extensively.
Hardware techniques require additional storage tables to keep
track of heavily written blocks, which are periodically
remapped to the least worn ones [9]. Software techniques,
on the other hand, rely on a lazy write-back policy [9],
dynamic data management [26], and data migration and
recomputation [27]. All previous approaches are orthogonal to
TDO-CIM, which tries to enhance endurance at compile time by
intelligently mapping array references to the CIM crossbar.
VI. CONCLUSION
We present an end-to-end compilation flow for in-memory
computing. Our approach automatically identifies, optimizes,
and offloads computing kernels to our in-memory accelerator.
We compile a set of linear-algebra kernels from the
PolyBench/C benchmark suite and demonstrate the benefits of
in-memory computation by comparing our in-memory architecture
simulated in Gem5 with a state-of-the-art von Neumann
architecture. The results show the benefits of in-memory
computing by achieving an average energy reduction of 32.6x and
an energy-delay-product improvement of 612x. We expect our
compiler and Gem5 emulator to boost research in the field by
providing a transparent and automatic flow to compile entire
applications for the CIM architecture and to perform
design-space exploration by tweaking our simulator.
ACKNOWLEDGMENTS
This research is supported by EC Horizon 2020 Research
and Innovation Program through MNEMOSENE project under
Grant 780215 and the NeMeCo grant agreement, id. 676240.
REFERENCES
[1] W. A. Wulf and S. A. McKee, “Hitting the memory wall: Implications
of the obvious,” SIGARCH Comput. Archit. News, vol. 23, no. 1, pp.
20–24, Mar. 1995.
[2] Singh et al., “Near-memory computing: Past, present, and future,”
Microprocessors and Microsystems, vol. 71, p. 102868, 2019.
[3] M. Le Gallo, A. Sebastian, R. Mathis, M. Manica, H. Giefers, T. Tuma,
C. Bekas, A. Curioni, and E. Eleftheriou, “Mixed-precision in-memory
computing,” Nature Electronics, vol. 1, no. 4, p. 246, 2018.
[4] M. Le Gallo, A. Sebastian, G. Cherubini, H. Giefers, and E. Eleftheriou,
“Compressed sensing with approximate message passing using in-
memory computing,” IEEE Transactions on Electron Devices, vol. 65,
no. 10, pp. 4304–4312, Oct 2018.
[5] M. N. Bojnordi and E. Ipek, “Memristive boltzmann machine: A
hardware accelerator for combinatorial optimization and deep learning,”
in 2016 IEEE International Symposium on High Performance Computer
Architecture (HPCA), March 2016, pp. 1–13.
[6] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie,
“Prime: A novel processing-in-memory architecture for neural network
computation in reram-based main memory,” in 2016 ACM/IEEE 43rd
Annual International Symposium on Computer Architecture (ISCA), June
2016, pp. 27–39.
[7] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, “Pinatubo: A
processing-in-memory architecture for bulk bitwise operations in
emerging non-volatile memories,” in 2016 53rd ACM/EDAC/IEEE Design
Automation Conference (DAC), June 2016, pp. 1–6.
[8] S. Raoux, F. Xiong, M. Wuttig, and E. Pop, “Phase change materials
and phase change memory,” MRS bulletin, vol. 39, no. 8, pp. 703–710,
2014.
[9] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras,
and B. Abali, “Enhancing lifetime and security of pcm-based main
memory with start-gap wear leveling,” in Proceedings of the 42nd
Annual IEEE/ACM International Symposium on Microarchitecture, ser.
MICRO 42. New York, NY, USA: ACM, 2009, pp. 14–23.
[10] N. K. Upadhyay, H. Jiang, Z. Wang, S. Asapu, Q. Xia, and
J. Joshua Yang, “Emerging memory devices for neuromorphic com-
puting,” Advanced Materials Technologies, vol. 4, no. 4, p. 1800589,
2019.
[11] Zidan et al., “Field-programmable crossbar array (fpca) for recon-
figurable computing,” IEEE Transactions on Multi-Scale Computing
Systems, vol. 4, no. 4, pp. 698–710, 2017.
[12] Q. Xia and J. J. Yang, “Memristive crossbar arrays for brain-inspired
computing,” Nature materials, vol. 18, no. 4, p. 309, 2019.
[13] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Stra-
chan, M. Hu, R. S. Williams, and V. Srikumar, “Isaac: A convolutional
neural network accelerator with in-situ analog arithmetic in crossbars,”
in 2016 ACM/IEEE 43rd Annual International Symposium on Computer
Architecture (ISCA), June 2016, pp. 14–26.
[14] A. BanaGozar, K. Vadivel, S. Stuijk, H. Corporaal, S. Wong, M. A.
Lebdeh, J. Yu, and S. Hamdioui, “Cim-sim: computation in memory
simulator,” in Proceedings of the 22nd International Workshop on
Software and Compilers for Embedded Systems. ACM, 2019, pp. 1–4.
[15] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu,
J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell,
M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The gem5 simulator,”
SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1–7, Aug. 2011.
[16] F. A. Endo, D. Couroussé, and H.-P. Charles, “Micro-architectural
simulation of embedded core heterogeneity with gem5 and mcpat,” in
Proceedings of the 2015 Workshop on Rapid Simulation and Perfor-
mance Evaluation: Methods and Tools, ser. RAPIDO ’15. New York,
NY, USA: ACM, 2015, pp. 7:1–7:6.
[17] M. Nazarewicz. (2012) A deep dive into cma. [Online]. Available:
https://lwn.net/Articles/486301/
[18] O. Zinenko, L. Chelini, and T. Grosser, “Declarative Transformations in
the Polyhedral Model,” Inria ; ENS Paris - Ecole Normale Supérieure de
Paris ; ETH Zurich ; TU Delft ; IBM Zürich, Research Report RR-9243,
Dec. 2018. [Online]. Available: https://hal.inria.fr/hal-01965599
[19] L. Chelini et al., “Declarative loop tactics for domain-specific
optimization,” ACM TACO, Nov. 2019. [Online]. Available:
http://doi.acm.org/10.1145/3372266
[20] T. Grosser, A. Groesslinger, and C. Lengauer, “Polly - performing
polyhedral optimizations on a low-level intermediate representation,” Parallel
Processing Letters, vol. 22, no. 04, p. 1250010, 2012.
[21] S. Verdoolaege, S. Guelton, T. Grosser, and A. Cohen, “Schedule trees,”
in International Workshop on Polyhedral Compilation Techniques,
Vienna, Austria, Jan. 2014.
[22] K. Hsieh, E. Ebrahim, G. Kim, N. Chatterjee, M. O’Connor, N. Vijayku-
mar, O. Mutlu, and S. W. Keckler, “Transparent offloading and mapping
(tom): Enabling programmer-transparent near-data processing in gpu
systems,” in 2016 ACM/IEEE 43rd Annual International Symposium on
Computer Architecture (ISCA), June 2016, pp. 204–216.
[23] A. Pattnaik, X. Tang, A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir,
O. Mutlu, and C. R. Das, “Scheduling techniques for gpu architectures
with processing-in-memory capabilities,” in 2016 International Confer-
ence on Parallel Architecture and Compilation Techniques (PACT), Sep.
2016, pp. 31–44.
[24] R. Nair, S. F. Antao, C. Bertolli, P. Bose, J. R. Brunheroto, T. Chen,
C.-Y. Cher, C. H. A. Costa, J. Doi, C. Evangelinos, B. M. Fleischer,
T. W. Fox, D. S. Gallo, L. Grinberg, J. A. Gunnels, A. C. Jacob,
P. Jacob, H. M. Jacobson, T. Karkhanis, C. Kim, J. H. Moreno, J. K.
O’Brien, M. Ohmacht, Y. Park, D. A. Prener, B. S. Rosenburg, K. D.
Ryu, O. Sallenave, M. J. Serrano, P. D. M. Siegl, K. Sugavanam, and
Z. Sura, “Active memory cube: A processing-in-memory architecture for
exascale systems,” IBM Journal of Research and Development, vol. 59,
no. 2/3, pp. 17:1–17:14, March 2015.
[25] R. Hadidi, L. Nai, H. Kim, and H. Kim, “Cairo: A compiler-assisted
technique for enabling instruction-level offloading of processing-in-
memory,” ACM Trans. Archit. Code Optim., vol. 14, no. 4, pp.
48:1–48:25, Dec. 2017. [Online]. Available:
http://doi.acm.org/10.1145/3155287
[26] J. Hu, C. J. Xue, Q. Zhuge, W. Tseng, and E. H.-M. Sha, “Towards energy
efficient hybrid on-chip scratch pad memory with non-volatile memory,”
in 2011 Design, Automation Test in Europe, March 2011, pp. 1–6.
[27] Hu et al., “Reducing write activities on non-volatile memories in
embedded cmps via data migration and recomputation,” in Proceedings
of the 47th Design Automation Conference, ser. DAC ’10. New York,
NY, USA: ACM, 2010, pp. 350–355.
