The advent of 3D memory stacking technology, which integrates a logic layer and stacked memories, is expected to be one of the most promising memory technologies to mitigate the memory wall problem by leveraging the concept of near-memory processing (NMP). With the ability to process data locally within the logic layer of stacked memory, a variety of emerging big data applications can achieve significant performance and energy-efficiency benefits. Various approaches to the NMP logic layer architecture have been studied to utilize the advantage of stacked memory. While significant acceleration of specific kernel operations has been derived from previous NMP studies, an NMP-based system using an NMP logic architecture capable of handling some specific kernel operations can suffer from performance and energy efficiency degradation caused by a significant communication overhead between the host processor and NMP stack.
INTRODUCTION
Recently, emerging applications such as machine learning (ML), high-performance computing (HPC), natural language processing (NLP), and big data analytics have spread rapidly across various domains. However, the tremendous cost of data movement through the deep memory hierarchy of state-of-the-art computer systems has become a major issue that threatens both performance and energy efficiency. The advent of 3D memory stacking technology, which integrates a logic layer and stacked memory, is expected to be one of the most promising memory technologies to mitigate the memory wall problem by utilizing the concept of near-memory processing (NMP). Because data can be processed locally within the logic layer of the stacked memory, various studies have investigated NMP logic layer architectures capable of acceleration and offloading near the stacked memory.
49:2 H. Lim and G. Park
Prior studies have focused on accelerating the specific kernel operations involved in applications, and NMP logic architectures based on dedicated accelerators or multiple programmable cores have been introduced as solutions (Thanh-Hoang et al. 2016; Akin et al. 2014; Zhu et al. 2013). However, big data applications containing various kernel operations are expected to be among the primary target applications executed on NMP-based systems (Guz et al. 2015). Although previous studies achieved significant acceleration of specific kernel operations, the data consumed by kernel operations that the NMP logic layer does not support must be transferred to the host processor through a deep memory hierarchy. As a result, previous NMP logic layer architectures can suffer from a significant communication overhead between the host processor and the NMP stack, which degrades the performance and energy efficiency of the NMP-based system. Therefore, an NMP logic layer architecture that can both handle and accelerate the various kernel operations contained in an application, and thereby reduce communication overhead, is critical to fully exploiting the benefits of an NMP-based system.
In this study, we propose a Triple Engine Processor (TEP), a heterogeneous near-memory processor for processing the various kernel operations included in applications. Our TEP consists of three different data processing engines: an energy-efficient programmable core for control-flow and memory reference operations, reconfigurable processors for data-parallel computing operations, and a dedicated hardware engine (or a set of dedicated hardware engines) for specific kernel operations that can deliver much higher performance than processing in the programmable core. In our evaluation, the proposed TEP achieves significant performance improvements, about 3.4 times and 1.9 times better on average than the baseline 3D memory system and an NMP engine consisting of 16 in-order cores, respectively. In addition, our TEP consumes 33% less energy and reduces data traffic by 52% compared to the baseline 3D memory system. In summary, the key contributions of our work are as follows:
• To design the proposed TEP, we first extract the NMP kernels from various applications that can gain performance and energy efficiency when allocated to the NMP stack, and we provide a detailed analysis of their characteristics.
• We analyze the key operations contained in the extracted NMP kernels and classify them into three operation types according to the characteristics of the key operations. Based on the classified operation types, we explore several representative architecture models with the different features required to handle them efficiently, and derive three architectures for the NMP logic layer.
• We propose TEP, a heterogeneous near-memory processor designed with the three architecture models derived from the analysis of applications. We also provide the operating mechanism, communication interface, and programming model to enable efficient data processing in our TEP-based system.
• Finally, we evaluate various NMP-based models introduced in previous studies. We demonstrate the effectiveness of a heterogeneous architecture model in the logic layer of the NMP-based system by evaluating the proposed TEP and previous NMP-based models in terms of performance, energy, and communication overhead.
The remainder of this article is organized as follows. Section 2 explains the background for this study. Section 3 introduces the target applications used in this study. Section 4 classifies the operation types and explores diverse architectures for NMP logic. Section 5 presents our proposed TEP. Section 6 explains our evaluation methodology. Section 7 demonstrates the simulation results. Section 8 presents prior works. Finally, Section 9 concludes this study.
BACKGROUND
In this section, we provide background on the baseline NMP architecture and Hybrid Memory Cube (HMC).
Baseline NMP Architecture
Figure 1 illustrates an overview of our baseline NMP architecture. In general, an NMP-based system is composed of a high-performance host processor and an NMP stack. The host processor communicates with the NMP stack using high-speed I/O links, as shown in Figure 1. The NMP stack is constructed with a logic layer and DRAM dies based on 3D integration technology like HMC (Jeddeloh et al. 2012). An NMP stack can be daisy-chained to other NMP stacks, or many NMP stacks can constitute processor/memory-centric networks to utilize the processor I/O bandwidth in a system. The stacked DRAM dies are divided into vertically aligned partitions, called vaults in HMC, and each vault has a memory controller (aka vault controller) in the logic layer, as shown in Figure 1. The vault controller is responsible for ensuring DRAM-specific timing constraints, DRAM sequencing, error detection and correction, and so on.
Application Acceleration in NMP-Based System
In an NMP-based system, the NMP logic is responsible for data acceleration/offloading by being located close to the vertically stacked memory. Like SIMD/GPU engines, the NMP logic accelerates some specific parts of an application, such as specific operations, functions, code segments, or threads. The host processor, on the other hand, executes the remaining segments that are not allocated to the NMP stack and processes the output data produced by the NMP logic.
Therefore, the kernel allocation policy for the NMP engine can have a decisive impact on the performance and energy efficiency of NMP-based systems. In addition, the characteristics of the kernel to be allocated to the NMP engine can be utilized as one of the key factors for designing the NMP engine (Loh et al. 2013). For example, an NMP engine with limited functionality and careless kernel allocation for that engine can cause inefficient (or useless) data movement between the host processor and the NMP stack, which increases the communication overhead.
TARGET APPLICATIONS AND PROFILING
Target Applications
Table 1 shows the 11 target applications used in this study. We select diverse applications from various application domains, such as ML, NLP, HPC, in-memory DB, computer vision, and automotive systems, which have been considered in previous NMP studies (Farmahini-Farahani et al. 2015).
Profiling and Analysis Methodology
In this study, we use the GPROF (Graham et al. 1982) and KCacheGrind (2013) profiling tools to extract the major functions that contribute most to program execution time. The profiling platform is a high-performance multi-core system with a 6-core CPU (Intel i7-5930K @3.50GHz) and 32GB of total system memory. The 15MB last-level cache (LLC) is shared across all cores. For every application except Memcached, 3D-LiDAR, and K-means, we extract major functions until the sum of their execution times exceeds 95% of the total; the three exceptions contain functions that perform file or network I/O event handling. The LLC misses per kilo instruction (MPKI) of the target applications and the memory intensity, meaning the memory traffic caused by certain functions, are measured in the simulation framework. (The detailed simulation configuration will be presented in Section 6.)
Profiling Results
Table 2 summarizes the analysis results of the target applications. As shown in Table 2, most applications have many functions, but only some of them contribute significantly to the total execution time (see the "% time" column in Table 2) and generate almost all of the memory traffic generated by the application (see the "memory intensity" column in Table 2). Thus, we can expect significant benefits in both performance and energy efficiency by applying the concept of NMP to these applications. In this study, we focus on designing the architecture of an NMP engine by considering the extracted major functions as target NMP kernels to be executed in NMP logic and analyzing the characteristics of the key operations included in the NMP kernels.
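The cumulative-time cutoff used to pick major functions, and the MPKI metric, can be sketched in C as follows. This is only an illustrative sketch: the type func_profile and the functions select_major_functions and llc_mpki are hypothetical names, and the profile values would come from a tool such as GPROF rather than from this code.

```c
#include <stddef.h>
#include <stdlib.h>

/* One profiled function: its name and its share of total execution time (%). */
typedef struct {
    const char *name;
    double pct_time;
} func_profile;

/* qsort comparator: orders functions by descending time share. */
static int by_time_desc(const void *a, const void *b)
{
    double x = ((const func_profile *)a)->pct_time;
    double y = ((const func_profile *)b)->pct_time;
    return (x < y) - (x > y);
}

/* Sorts funcs in place (hottest first) and returns how many of the hottest
 * functions are needed before their cumulative share exceeds threshold_pct. */
size_t select_major_functions(func_profile *funcs, size_t n, double threshold_pct)
{
    double sum = 0.0;
    size_t k = 0;

    qsort(funcs, n, sizeof funcs[0], by_time_desc);
    while (k < n && sum <= threshold_pct)
        sum += funcs[k++].pct_time;
    return k;
}

/* Last-level cache misses per kilo instruction. */
double llc_mpki(unsigned long long llc_misses, unsigned long long instructions)
{
    if (instructions == 0)
        return 0.0;
    return 1000.0 * (double)llc_misses / (double)instructions;
}
```

With a threshold of 95.0, the first k entries of the sorted array would be treated as the candidate NMP kernels of an application.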
ARCHITECTURE DESIGN BASED ON CLASSIFICATION OF OPERATION TYPES
In this section, we analyze the key operations included in the extracted NMP kernels and classify them into three operation types according to the characteristics of the key operations. We also analyze the architectural features required to efficiently support the classified operation types and explore several representative architecture models. Finally, we derive three architectures for the NMP logic used in the proposed baseline architecture.
Three Different Types of Kernel Operations
(1) Data-parallel Computing Operation (DCO): The first type of operation is the DCO, which computes over a large amount of data with high data-level parallelism. Figure 2(a) shows a code example of the DCO type, composed of code segments that can be processed in parallel, such as computing blocks with large loop bodies and data-flow computations. SRR, and 3D-LiDAR) contain a small number of DCO types, but they account for a large part of the execution time. Thus, we can expect performance and energy efficiency benefits by accelerating the DCO type at the NMP logic layer.
(2) Memory Manipulation Operation (MMO): The second type of operation is the MMO, which references a large amount of data stored in memory. The target applications contain a large number of memory reference operations such as data movement, initialization, copying, scatter/gather, and comparison. Since the MMO type causes large amounts of data traffic in the memory hierarchy by referencing bulk data, many previous studies have considered it one of the operations suitable for offloading to the NMP logic layer (Loh et al. 2013; Akin et al. 2015). Figure 2(
Table 2, most NMP kernels extracted from the target applications contain a large number of the CFO type, but the CFO type has a smaller impact on execution time than other types.
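To make the three operation types concrete, the following C sketch shows one small routine in the spirit of each type. These are illustrative stand-ins written for this explanation, not the code of Figure 2; all function names are hypothetical, and the CFO example follows the later description of the CFO type as control-flow operations whose data determines the next program flow.

```c
#include <stddef.h>
#include <string.h>

/* DCO-style: every iteration is independent, so the loop body exposes
 * data-level parallelism over a large array. */
void dco_scale_add(float *out, const float *a, const float *b, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = k * a[i] + b[i];
}

/* MMO-style: a bulk memory reference with essentially no arithmetic. */
void mmo_copy_rows(void *dst, const void *src, size_t row_bytes, size_t rows)
{
    memcpy(dst, src, row_bytes * rows);
}

/* CFO-style: the loaded data decides which path the program takes next. */
typedef struct node {
    int key;
    struct node *next;
} node;

const node *cfo_find(const node *head, int key)
{
    while (head != NULL) {
        if (head->key == key)
            return head;       /* branch taken depends on the data */
        head = head->next;
    }
    return NULL;
}
```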
Architecture Design for Processing Three Operation Types
• Architecture model for the DCO type: The DCO type is computed with a limited set of operators, such as addition, subtraction, multiplication, division, and square root, and the order in which the operators are executed differs for each NMP kernel. Therefore, efficiently processing the DCO type requires an architecture with both programmability and the ability to process various types of operations in parallel. Several architectural models, such as fine-grain reconfigurable arrays (FPGAs), coarse-grain reconfigurable arrays (CGRAs), and SIMD/GPU-based accelerators, can be considered to meet these requirements. While all of these models can handle DCO types by utilizing resources that allow for parallel processing, the CGRA can be considered the appropriate architectural model for the DCO types, which involve mostly parallel workloads. This decision is based on previous results showing that CGRAs provide higher performance and lower energy consumption for parallel workload processing than the SIMD/GPU engine and need much less configuration data than FPGAs (Huang et al. 2013; Govindaraju et al. 2012; Tripp et al. 2007). In addition, low energy is important, because the near-memory processor has more stringent power/thermal constraints. Thus, this article focuses on CGRAs to support the DCO types.
• Architecture model for MMO type: To handle the MMO type in the NMP logic layer, we can take one of two approaches, similar to previous studies. First, because most MMO types consist of fixed-function operations such as memory read-manipulation-write, a dedicated hardware accelerator (Akin et al. 2015; Seshadri et al. 2013) can be designed for the operations. Dedicated hardware accelerators provide much higher performance and energy efficiency than programmable architecture models. However, because such an accelerator is designed for specific operations, it is very difficult to apply to other operations (or applications). On the other hand, we can consider a programmable architecture model such as the energy-efficient in-order core used in previous NMP research. Programmable engines overcome the disadvantages of dedicated hardware accelerators, but they deliver relatively low performance. Given the tradeoffs between the two architectures, we can use an energy-efficient in-order core or a dedicated hardware accelerator (or a set of dedicated hardware accelerators) to process the MMO type. If specific tasks have a significant impact on the overall application performance, then a hardware accelerator or set of accelerators is designed to accelerate those tasks. The target kernel operation and the architecture of the accelerator will be discussed in Section 5.
• Architecture model for CFO type: As shown in Figures 2(b) and 2(c), the data consumed by the CFO type is shared with the DCO and MMO types and is often used to determine the next program flow. To handle the CFO type, including control-flow operations, an architecture model with flexibility and programmability is required. Thus, we can consider a low-energy programmable core (e.g., an in-order core) employed in an NMP logic layer, as in prior studies. Previous studies have focused on NMP engine architectures that accelerate the DCO or MMO type rather than the CFO type, but the CFO type can be a major cause of increased communication overhead between the host processor and the NMP stack, which harms both performance and energy efficiency. In an NMP-based system with an NMP logic layer that cannot handle the CFO type, the data consumed by the CFO type must be transferred to the host processor through a deep memory hierarchy. For instance, CGRA-based accelerators specialized for the DCO type (Farmahini-Farahani et al. 2015) or dedicated hardware logic designed for the MMO type (Akin et al. 2015; Seshadri et al. 2013) must rely on host processing to handle the data included in the CFO type. In other words, the data consumed by operations not supported in the NMP engine must be passed to the host processor, and the NMP engine must wait until the host processing is completed and the data is stored in the main memory (i.e., until the data to be used is updated). As a result, global data transfer through printed circuit boards (PCBs) increases in proportion to the amount of shared data, and this can diminish the benefits obtained from an NMP-based system. According to Jeddeloh et al. (2012) and Pugsley et al. (2014), global data transfer through a PCB, which is located between the host processor and the NMP stack, consumes more than 60 times the energy of a TSV transfer.
TRIPLE ENGINE PROCESSOR (TEP)
This section introduces TEP, a heterogeneous near-memory processor that can efficiently handle a variety of kernel operations. The proposed TEP consists of three computing engines determined through application analysis and architectural exploration. Finally, this section presents a programming model and operating mechanism for the proposed TEP.
Overall Architecture
As shown in Figure 3(a), our TEP is integrated in a memory control interface of the NMP logic layer. Because the proposed TEP is integrated into the memory control interface, the TEP can easily access data distributed across multiple vaults. In addition, we can use the logic already implemented in the standard HMC's memory interface, such as refresh controllers, DRAM sequencers, read/write buffers, and vault routers. Our TEP can issue memory requests to target vault controllers through a vault router, and each vault controller schedules the memory requests. Figure 3(b) shows the overall architecture of the TEP proposed in this study. Our TEP consists of three main data processing units: an energy-efficient in-order core for the CFO and MMO types, CGRA-based accelerators for the DCO type, and a dedicated hardware accelerator for the memory compare operation. The architecture details of each component are as follows:
• Control Computing Unit (CCU): In this study, the CCU is designed as a single-issue in-order core like the ARM Cortex-A5 processor (ARM CortexA5 Processor). The CCU manages the other two accelerators, is responsible for communication with the host processor, and can merge/retrieve the output data produced by the two accelerators. In addition, the CCU is responsible for processing the CFO and MMO types. In this study, the memory compare operation of the MMO type is accelerated by a dedicated hardware accelerator (the architecture is introduced below). The CCU has a translation lookaside buffer (TLB) and an input/output memory management unit (IOMMU) to support virtual address translation, and a direct memory access (DMA) engine to access the required data from the stacked memory. The virtual memory-related issues will be discussed in Section 5.4. Our CCU has a 16KB single-level instruction cache with a 64-byte line size and an 8KB stream buffer to retain a temporal data stream (Jouppi 1990). In this study, we employ a stream buffer to exploit the sequential memory access patterns of the MMO type, such as memmove, memcpy, memset, and swap. A stream buffer consists of a simple FIFO queue that maintains the prefetched data streams, and the CCU releases the modified data for writing to the main memory through a vault router. In our architecture, the CCU issues prefetching requests for successive lines starting at the first memory request, and the stream buffer is flushed when a miss occurs. Finally, the CCU manages the register files to enable communication with the host processor and the other two accelerators. The communication model of our TEP will be explained in Section 5.3.
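The hit, advance, and flush behavior of the CCU stream buffer described above can be approximated by the following C sketch. It is a simplified functional model under stated assumptions: a 64-byte stream-buffer line (the text gives this size only for the instruction cache), prefetches that always arrive in time, and hypothetical names stream_buffer and sb_access. The FIFO depth and the write-back path are not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

#define SB_LINE_BYTES 64u   /* assumed line size; not stated for the stream buffer */

typedef struct {
    uint64_t head_line;     /* line currently at the head of the FIFO */
    bool     valid;         /* false until the first request starts a stream */
} stream_buffer;

/* Returns true on a stream-buffer hit and false on a miss.
 * A miss flushes the buffer and restarts sequential prefetching
 * at the line that missed. */
bool sb_access(stream_buffer *sb, uint64_t addr)
{
    uint64_t line = addr / SB_LINE_BYTES;

    if (sb->valid && line == sb->head_line)
        return true;                    /* same line, still at the head */

    if (sb->valid && line == sb->head_line + 1) {
        sb->head_line = line;           /* pop the head; next prefetched line */
        return true;
    }

    sb->head_line = line;               /* miss: flush and restart the stream */
    sb->valid = true;
    return false;
}
```

A sequential pattern such as the one produced by memcpy misses only on its first access in this model, while a jump to an unrelated address flushes the stream.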
• Reconfigurable Processing Unit (RPU): The RPU is designed with CGRAs and accelerates the DCO type, including data-parallel computing blocks and data-flow operations. The RPU is constructed with four CGRAs to enable concurrent computations with partitioned data, and the computation block can be divided into smaller slices. In this study, we use DySER, which has significant benefits in performance and energy efficiency, as the basic architecture of our RPU (Govindaraju et al. 2012; Farmahini-Farahani et al. 2015). We modify the architecture, including the number of functional units (FUs), the placement of FUs, and the configuration memory, to reflect the analysis results of the target applications. The RPU is composed of a grid of word-sized FUs that perform various arithmetic and logic operations and are connected by a configurable interconnect fabric. The RPU also has some internal memory near each FU for temporary results. The FUs of our RPUs are unified dual-mode integer/floating point units, meaning that they can compute in either integer mode or floating point mode (Govindaraju et al. 2012; Farmahini-Farahani et al. 2015). Most FUs can perform addition, subtraction, and multiplication, while a few FUs can perform more complex operations, such as division and square root. In this study, to route efficiently between FUs, we place FUs according to the arithmetic priority used in the computation and the routing distance from the RPU input interface; for example, the FU performing multiplication is placed closer to the input interface than the FU performing addition. Table 3 summarizes the details of the RPU with 64 FUs used in this article (see Section 6).
Table 3. Simulation Configuration
A configuration context contains the data to be mapped to the RPU for various dataflow graphs. It is retained in the memory until it is required and is then cached in the configuration memory local to the CCU. Note that the RPU relies on a compiler to create its configuration and to insert instructions in the application to communicate with the CCU. The role of the compiler and programming interface for the proposed TEP will be explained in Section 5.2. The configuration memory consists of a 64-entry fully associative history table maintained with the least recently used (LRU) policy.
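As an illustration of a 64-entry fully associative table with LRU replacement, consider the following C sketch. It is a software model only: the tag (a hypothetical 32-bit configuration identifier), the timestamp-based LRU bookkeeping, and the names cfg_table and cfg_lookup are assumptions made for this example, not details given for the TEP hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define CFG_ENTRIES 64

typedef struct {
    uint32_t id[CFG_ENTRIES];        /* hypothetical configuration identifier (tag) */
    uint64_t last_use[CFG_ENTRIES];  /* logical time of the most recent use */
    bool     valid[CFG_ENTRIES];
    uint64_t clock;                  /* logical time, incremented per lookup */
} cfg_table;

/* Returns true if the configuration is already cached (hit).
 * On a miss, installs it in a free entry or replaces the LRU entry. */
bool cfg_lookup(cfg_table *t, uint32_t id)
{
    int victim = 0;

    t->clock++;

    /* Fully associative search: every entry is a candidate. */
    for (int i = 0; i < CFG_ENTRIES; i++) {
        if (t->valid[i] && t->id[i] == id) {
            t->last_use[i] = t->clock;
            return true;
        }
    }

    /* Miss: prefer the first free entry, otherwise the least recently used. */
    for (int i = 0; i < CFG_ENTRIES; i++) {
        if (!t->valid[i]) {
            victim = i;
            break;
        }
        if (t->last_use[i] < t->last_use[victim])
            victim = i;
    }

    t->id[victim] = id;
    t->valid[victim] = true;
    t->last_use[victim] = t->clock;
    return false;
}
```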
• Dedicated Hardware Accelerator(s) (Memory Compare Unit (MCU) as an example):
Although most kernel operations included in the application can be handled by employing the CCU and RPU as engines of the TEP, a dedicated accelerator (or a set of hardware accelerators) can additionally be designed for certain operations to achieve higher performance and energy efficiency than processing in a programmable core. In this work, we present the MCU as an example of a dedicated hardware engine that accelerates the memory compare operation (i.e., memcmp() in the C/C++ standard library) with pointer-chasing in a linked-list data structure. According to Saunders (2011), the memory compare operation suffers from very long delays on conventional processor architectures because it compares the memory regions byte-by-byte. In our target applications, we identified both a memory compare operation for pointer-chasing in a linked-list structure and a hash function that calculates a hash index, both of which contribute a significant portion of the execution time in the Memcached application (see Table 2). Thus, among the target applications, the MCU accelerates the kernel operation of the Memcached application. Although this article presents the MCU only for the memory compare operation, additional dedicated accelerators can also be designed for other tasks requiring much higher performance or energy efficiency than processing in the programmable core. For instance, another dedicated hardware accelerator for the memory swap operation, which accounts for approximately 62% of the total execution time of the ME application, can be designed as a component of the set of dedicated hardware accelerators in the proposed TEP.
To achieve a fast memory compare operation with pointer-chasing, our MCU is designed to have a coarse-grained comparator, simple control logic, and a table buffer, as shown in Figure 3. In our MCU, the comparator has two inputs: one is the KEY value of the table buffer fetched from memory, and the other is the KEY value requested by the host processor, which is stored in a register. The coarse-grained comparator consists of eight comparators, each of which can compare 4 bytes; thus, it can compare 32 bytes at once.
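The eight-lane comparison can be mimicked in software as shown in the following C sketch, which compares two keys in 32-byte steps, each step made of eight 4-byte lane comparisons. The handling of a final partial step by zero-padding both inputs is an assumption of this sketch, and compare_step and keys_equal are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LANES      8                        /* eight comparators */
#define LANE_BYTES 4                        /* each compares 4 bytes */
#define STEP_BYTES (LANES * LANE_BYTES)     /* 32 bytes per step */

/* One comparator step over at most 32 bytes. Shorter inputs are
 * zero-padded on both sides (an assumption of this sketch). */
static bool compare_step(const uint8_t *a, const uint8_t *b, size_t len)
{
    uint8_t pa[STEP_BYTES] = {0};
    uint8_t pb[STEP_BYTES] = {0};
    bool equal = true;

    memcpy(pa, a, len);
    memcpy(pb, b, len);

    /* In hardware the eight lanes work in parallel; here they are a loop. */
    for (int lane = 0; lane < LANES; lane++) {
        uint32_t wa, wb;
        memcpy(&wa, pa + lane * LANE_BYTES, LANE_BYTES);
        memcpy(&wb, pb + lane * LANE_BYTES, LANE_BYTES);
        if (wa != wb)
            equal = false;
    }
    return equal;
}

/* Compares two keys of the same length, 32 bytes per step. */
bool keys_equal(const void *key_a, const void *key_b, size_t len)
{
    const uint8_t *a = key_a;
    const uint8_t *b = key_b;

    while (len > 0) {
        size_t chunk = len < STEP_BYTES ? len : STEP_BYTES;
        if (!compare_step(a, b, chunk))
            return false;
        a += chunk;
        b += chunk;
        len -= chunk;
    }
    return true;
}
```

Under this model, a 41-byte key needs two comparator steps, whereas a byte-by-byte loop would need up to 41 iterations.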
Programming Model for the TEP
First, this article assumes that a compiler capable of recognizing the proposed TEP is available. Programmers must add several primitives to the program code so that the host processor can recognize the NMP kernels of the application and assign tasks to each processing engine. Note that the task allocation and implementation methodologies will be explained in Section 6.1. This section describes the programming model for recognizing the NMP kernels, the method of mapping the computation block onto the four RPUs, and the application programming interface that triggers the MCU.
• DCO mapping model for CGRAs: Figure 4(a) shows the partition model used to map a computation sub-region onto each RPU. A large DCO region can be partitioned into several smaller computation regions, as in MapReduce, to allow concurrent CGRA computations on partitioned data (Farmahini-Farahani et al. 2015; Islam et al. 2014). Some boundary data may be duplicated to enable independent RPU computations. The RPUs' output data is processed and merged by the CCU after the RPU computations are completed.
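The partitioning of a DCO region across the four RPUs, with duplicated boundary data, might look like the following C sketch for a one-dimensional region. The even split, the symmetric boundary width halo, and the names rpu_slice and partition_dco are assumptions made for illustration; the actual partition model is the one shown in Figure 4(a).

```c
#include <stddef.h>

#define NUM_RPUS 4

typedef struct {
    size_t begin;       /* first element this RPU computes */
    size_t end;         /* one past the last element it computes */
    size_t load_begin;  /* first element it loads, including duplicated boundary */
    size_t load_end;    /* one past the last element it loads */
} rpu_slice;

/* Splits n elements into NUM_RPUS contiguous slices. Each slice also loads
 * up to `halo` neighboring elements on each side so that the RPUs can
 * compute independently of one another. */
void partition_dco(size_t n, size_t halo, rpu_slice out[NUM_RPUS])
{
    size_t base = n / NUM_RPUS;
    size_t extra = n % NUM_RPUS;    /* the first `extra` slices get one more */
    size_t pos = 0;

    for (size_t r = 0; r < NUM_RPUS; r++) {
        size_t len = base + (r < extra ? 1 : 0);

        out[r].begin = pos;
        out[r].end = pos + len;
        out[r].load_begin = pos > halo ? pos - halo : 0;
        out[r].load_end = (pos + len + halo < n) ? pos + len + halo : n;
        pos += len;
    }
}
```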
• MCU API: We define an API to trigger MCU as follows.
void *mcucmp(const void *bucket_addr, unsigned int offset_key, unsigned int offset_nkey, unsigned int offset_next, const void *target_addr, size_t length);
The mcucmp API has six parameters, each of which is stored in a register of the MCU register file. First, the bucket_addr parameter is the starting address of the first item to be searched in the indexed hash bucket (see lines 5-9 in Figure 2(c)). The offset_key, offset_nkey, and offset_next parameters are used to locate the field containing the KEY value, the field containing the length of the KEY, and the next-pointer field of the linked-list data structure. Finally, the target_addr and length parameters are the address and the length of the KEY requested for retrieval. The defined API has a very simple and intuitive form, and a programmer can accelerate the memory compare operation by passing just a few parameters without any modification or re-organization of the data.
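What the MCU computes for one mcucmp() call can be expressed as a plain software reference model, sketched below in C. This is not the MCU implementation: the function is named mcucmp_model to keep it distinct from the real API, and the item layout is assumed (KEY bytes stored inline at offset_key, a one-byte length field at offset_nkey, and a native pointer at offset_next). The demo_item structure is likewise hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical item layout used only to exercise the model. */
typedef struct demo_item {
    struct demo_item *next;   /* next item in the hash bucket chain */
    uint8_t nkey;             /* length of the KEY in bytes */
    char key[32];             /* KEY bytes stored inline */
} demo_item;

/* Software reference model of the search triggered by mcucmp():
 * walk the linked list starting at bucket_addr and return the first
 * item whose KEY equals the requested KEY, or NULL if none matches. */
void *mcucmp_model(const void *bucket_addr, unsigned int offset_key,
                   unsigned int offset_nkey, unsigned int offset_next,
                   const void *target_addr, size_t length)
{
    const unsigned char *item = bucket_addr;

    while (item != NULL) {
        uint8_t nkey;
        const unsigned char *next;

        memcpy(&nkey, item + offset_nkey, sizeof nkey);
        if (nkey == length &&
            memcmp(item + offset_key, target_addr, length) == 0)
            return (void *)item;

        /* Pointer chasing: follow the next-pointer field. */
        memcpy(&next, item + offset_next, sizeof next);
        item = next;
    }
    return NULL;
}
```

A caller would pass the field offsets obtained with offsetof, for example offsetof(demo_item, key), so the data structure itself needs no reorganization.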
Communication Model for the TEP-based System
The CCU in our TEP is responsible for communication between the host processor and the two acceleration units (i.e., the RPUs and the MCU). First, the host processor invokes the proposed TEP by identifying the directive, inserted by the NMP kernel recognizer, that marks the NMP kernel, and initializes a finish register (❶). The host processor periodically polls the finish register to check for the completion of kernel operations by the TEP or for exceptions. The CCU then processes the operations allocated to it (e.g., the CFO or MMO type) (❷). When the CCU reaches an RPU instruction inserted by the RPU instruction generator, the CCU initializes a status register and configures the RPUs to run the required configuration for the target kernel (❸). The RPU computes the instructions in a computation sub-region after the input data included in each memory sub-region is ready in each FU. The CCU periodically polls the status register to check for RPU processing completion or errors. The output data produced by the RPUs can be processed and merged by the CCU according to the inserted instructions. After the RPU acceleration is completed, the status register is updated (❹). When the CCU reaches the MCU's API (i.e., mcucmp()), the status register is initialized and the CCU writes the required parameters into the MCU register file (❺). After the MCU acceleration is completed, the MCU updates the status register and writes the address of the retrieved item into a return register used to inform the host processor (❻). The CCU then processes the remaining instructions, stores the output data into the main memory (❼), and notifies the host processor of the completion of the kernel processing by updating the finish register (❽). If the item address retrieved by the MCU is written to the return register, then the host processor may read the address from the return register and perform additional operations.
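The register handshake in this communication model can be illustrated with the following single-threaded C sketch. It models only the finish, status, and return registers for an MCU offload; the register encoding (REG_BUSY, REG_DONE), the structure tep_regs, and all function names are assumptions, and the accelerator is a mock that completes immediately so that the polling loop terminates.

```c
#include <stdbool.h>
#include <stdint.h>

enum { REG_BUSY = 0, REG_DONE = 1 };   /* assumed register encoding */

typedef struct {
    volatile uint32_t finish;   /* polled by the host processor */
    volatile uint32_t status;   /* polled by the CCU */
    volatile uint64_t ret;      /* return register: address of the found item */
} tep_regs;

/* Step 1: the host invokes the TEP and initializes the finish register. */
void host_invoke(tep_regs *r)
{
    r->finish = REG_BUSY;
}

/* Host-side poll of the finish register. */
bool host_kernel_done(const tep_regs *r)
{
    return r->finish == REG_DONE;
}

/* Mock MCU (step 6): writes the return register, then updates status. */
static void mock_mcu(tep_regs *r, uint64_t found_item)
{
    r->ret = found_item;
    r->status = REG_DONE;
}

/* CCU side of one kernel that contains a single MCU offload. */
void ccu_run_kernel(tep_regs *r, uint64_t found_item)
{
    /* Step 2: CFO/MMO work assigned to the CCU would execute here. */

    r->status = REG_BUSY;              /* step 5: initialize status, pass parameters */
    mock_mcu(r, found_item);           /* accelerator runs (synchronously in this mock) */

    while (r->status != REG_DONE) {    /* the CCU polls the status register */
    }

    /* Step 7: remaining instructions and output stores would execute here. */

    r->finish = REG_DONE;              /* step 8: notify the host */
}
```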
Other Issues for TEP Operations
• Cache coherency and memory consistency: In our architecture, the host processor and the TEP work serially and do not operate on the same data simultaneously. Therefore, when the TEP is enabled, the host processor waits for the TEP processing to complete. In the proposed TEP, the RPUs and the MCU are executed serially as well. Note that the effect of the sequential execution model of the TEP on performance and energy consumption is reflected in our simulation results in Section 7. The TEP has no permission to access the on-chip cache memory of the host processor. Therefore, the data accessed by the TEP should not be retained in the processor's on-chip caches. To avoid inconsistencies between the cache contents of the host processor and main memory, the memory region used by the TEP (i.e., the data consumed in the NMP kernels) is defined as a non-cacheable region. The host processor accesses data in this region by using non-temporal instructions (e.g., MOVNTQ, MOVNTPS, and MASKMOVQ in the x86 architecture) that bypass the cache hierarchy.
Even though the MCU performs read-only operations without any manipulation/modification of the data, a cache inconsistency problem could arise from the SET command (e.g., add(KEY, VALUE)), which stores a new KEY-VALUE pair item. Since the new item is stored in the processor's on-chip cache memory, the MCU cannot access the inserted KEY data. To solve this problem, we bypass the cache memory for the data used in SET commands by using non-temporal instructions.
• Virtual-to-Physical address translation: Because virtual addresses are delivered to our TEP, it must translate the provided virtual addresses to physical addresses before accessing the stacked memory. To allow for address translation, we employ an IOMMU and a TLB in the CCU. TLB misses are served by the OS executed on the host processor, similar to an IOMMU in a conventional system. The TLB has 16 entries with a 2MB page size to accelerate address translation. We use large pages for the entire system to minimize TLB misses in big data applications (Navarro et al. 2002).
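A 16-entry TLB with 2MB pages can be modeled functionally as in the following C sketch. The fully associative organization, the round-robin refill policy, and the names tlb, tlb_translate, and tlb_fill are assumptions of this sketch; in the system described above, a miss would be served by the OS on the host processor, which is represented here only by the caller invoking tlb_fill.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 16
#define PAGE_SHIFT  21                                   /* 2MB pages */
#define PAGE_MASK   ((UINT64_C(1) << PAGE_SHIFT) - 1)

typedef struct {
    uint64_t vpn;     /* virtual page number */
    uint64_t pfn;     /* physical frame number */
    bool     valid;
} tlb_entry;

typedef struct {
    tlb_entry e[TLB_ENTRIES];
    unsigned  next_fill;      /* round-robin refill pointer (assumed policy) */
} tlb;

/* Translates va to *pa. Returns false on a TLB miss, in which case
 * *pa is left unchanged and the translation must be supplied by the OS. */
bool tlb_translate(const tlb *t, uint64_t va, uint64_t *pa)
{
    uint64_t vpn = va >> PAGE_SHIFT;

    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (t->e[i].valid && t->e[i].vpn == vpn) {
            *pa = (t->e[i].pfn << PAGE_SHIFT) | (va & PAGE_MASK);
            return true;
        }
    }
    return false;
}

/* Installs a translation after a miss has been resolved. */
void tlb_fill(tlb *t, uint64_t vpn, uint64_t pfn)
{
    tlb_entry *e = &t->e[t->next_fill];

    e->vpn = vpn;
    e->pfn = pfn;
    e->valid = true;
    t->next_fill = (t->next_fill + 1) % TLB_ENTRIES;
}
```

With 2MB pages, the 16 entries together cover 32MB of virtual address space.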
• Area overhead estimation: To estimate the area overhead of the proposed architecture, we use the McPAT and CACTI 6.5 tools, assuming 40nm technology (Li et al. 2009; Muralimanohar et al. 2009). We take the logic area of an ARM Cortex-A5, which is the baseline architecture of the CCU, from the value reported in ARM CortexA5 Processor (CortexA5). The CCU has a total area of 1.25mm² (0.27mm² of logic, 0.13mm² of 16KB L1 instruction cache, 0.61mm² of stream buffer, and 0.24mm² of TLB). To estimate the area of the RPU, we use the area of 64 FUs and 81 switches from Farmahini-Farahani et al. (2015). The area of one RPU is 1.07mm² (four RPUs occupy a total area of 4.31mm²). For the MCU, the area of a comparator is estimated from Kumar et al. (2014), and it is about 0.1mm². In total, the area of the proposed TEP is approximately 5.66mm², which corresponds to only 2.5% of the area of an 8Gb DRAM die (e.g., 226mm² (Shevgoor et al. 2013)).
• Discussion of the thermal issues: Unlike a conventional host processing system, the NMP engine has more stringent power/thermal constraints. The proposed TEP consists of three computing engines, namely an in-order core, four CGRAs, and a dedicated hardware unit, and these do not consume as much energy as a high-performance processor. We believe that the thermal effect of the TEP on stacked DRAM devices will not be significant, because the proposed TEP has far fewer processing engines than the designs in earlier studies that analyzed the thermal effects of NMP-based processing engines and found no significant thermal issues (Pugsley et al. 2014; Farmahini-Farahani et al. 2015; Eckert et al. 2014).
EXPERIMENTAL METHODOLOGY
TEP-based Simulation Methodology
To allocate tasks to the TEP, we implement an NMP kernel recognizer that recognizes the NMP kernels, an RPU instruction generator that generates special instructions for the RPU, and an MCU instruction parser that reads information from the defined APIs. In this study, we implement these in PINTool (Luk et al. 2005).
• NMP kernel recognizer: First, we implement an NMP kernel recognizer in PINTool that recognizes the NMP kernels to be allocated to the TEP and marks the starting point of each NMP kernel in the list of instructions that will be executed on the system. For this study, applications must be modified to run on a TEP-based system so that they conform to a TEP-aware programming model, which includes indicating the NMP kernels to be assigned to the TEP, tagging the RPU pragma, and using the MCU's API.
• RPU instruction generator: We implement an RPU instruction generator into PINTool to create the RPU configuration and insert additional instructions to communicate with the CCU. Similar to Farmahini-Farahani et al. (2015) , the generator identifies a pragma directive indicating a DCO region and it partitions the DCO region into a memory sub-region for loads, stores, and address calculations, and a computation sub-region for all other instructions. In this phase, the RPU instruction generator analyzes the data dependency within the marked RPU region and creates the configuration context including computation and memory LD/ST slices. Finally, the generator creates the RPU configuration for the computation sub-region and inserts communication instructions into the memory sub-region.
• Other instruction parser: We implement an MCU instruction parser in PINTool that can recognize the defined MCU's API. The parser extracts the pieces of information that are required to process memory compare operations with pointer chasing based on the MCU's API. Finally, CFO and MMO instructions are also converted to simulation segments executable in our simulation environment.
Finally, simulation execution contexts including the RPU configuration, MCU instruction, and CCU/MMO instruction are assigned to each processing engine in our TEP, and the host processor invokes the proposed TEP according to the TEP call instruction inserted into the host processor's executable code.
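The partitioning step of the RPU instruction generator (loads, stores, and address calculations go to the memory sub-region; everything else goes to the computation sub-region) can be illustrated with a small sketch. This is not the PINTool implementation: the `Instr` record, the `kind` labels, and `partition_dco_region` are hypothetical names introduced here only to make the classification rule concrete.

```python
from dataclasses import dataclass

# Instruction kinds that belong to the memory sub-region (assumed labels).
MEMORY_KINDS = {"load", "store", "addr_calc"}


@dataclass(frozen=True)
class Instr:
    name: str  # illustrative mnemonic
    kind: str  # "load", "store", "addr_calc", or any compute kind


def partition_dco_region(region):
    """Split a marked DCO region into (memory, computation) sub-regions,
    preserving program order within each sub-region."""
    memory, compute = [], []
    for ins in region:
        if ins.kind in MEMORY_KINDS:
            memory.append(ins)
        else:
            compute.append(ins)
    return memory, compute


# A toy loop body: c[i] = a[i] * b[i] + 1
region = [
    Instr("lea", "addr_calc"),
    Instr("ld_a", "load"),
    Instr("ld_b", "load"),
    Instr("mul", "alu"),
    Instr("add", "alu"),
    Instr("st_c", "store"),
]
mem, comp = partition_dco_region(region)
print([i.name for i in mem])   # memory sub-region
print([i.name for i in comp])  # computation sub-region
```

In the flow described above, the computation sub-region would then be turned into an RPU configuration, and communication instructions would be inserted into the memory sub-region; both steps are outside the scope of this sketch.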
Simulation Environment
To evaluate our architecture, we use the MacSIM simulator (MacSIM 2016), a cycle-accurate x86-64 system simulation tool whose frontend is PINTool. We also use DRAMSim2, in which an HMC-based NMP stack is modeled, for DRAM timing simulation (Rosenfeld et al. 2011). Table 3 summarizes the parameters used in this study.
• Host processors: Table 3(a) summarizes the architectural parameters of the host processor. We model the Intel Xeon Processor E3-1275, a high-performance quad-core host processor running at 4GHz, in which each core executes instructions out of order with a 256-entry RoB (Intel Xeon Processor 2015). Each core has private L1 and L2 caches, and a last-level L3 cache is shared by all cores.
• TEP-based system: The TEP is the proposed architecture, and it is composed of a CCU, four RPUs, and an MCU. The architecture details for the proposed TEP configuration are summarized in Tables 3(b) , 3(c), and 3(d).
• HMC-based memory configuration: We model the HMC-based timing parameters used in Rosenfeld (2014) for the stacked memory. We assume the stacked memory has four links, 16 partitions, two banks per partition, and 32 TSVs per vault. For the simulation, we use a 16-vault organization, because it is reported to have the highest throughput with 256 total banks based on Rosenfeld (2014). We use tRP = 17, tCCD = 6, tRCD = 17, tCL = 17, tWR = 19, and tRAS = 34 cycles as the DRAM timing parameters (see Table 3(f)).
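To give a feel for what the DRAM timing parameters above imply, the sketch below derives first-order access latencies from them. The three-case breakdown (row hit, closed row, row conflict) and the relation tRC = tRAS + tRP are standard textbook simplifications assumed here; they ignore queuing, refresh, and link latency and are not taken from the simulator configuration.

```python
# DRAM timing parameters in cycles, as listed in the text (Table 3(f)).
timing = {"tRP": 17, "tCCD": 6, "tRCD": 17, "tCL": 17, "tWR": 19, "tRAS": 34}

# First-order read latency estimates (assumed simplification):
row_hit = timing["tCL"]                                       # row already open
row_closed = timing["tRCD"] + timing["tCL"]                   # activate, then read
row_conflict = timing["tRP"] + timing["tRCD"] + timing["tCL"]  # precharge first

# Minimum time between activations of the same bank.
t_rc = timing["tRAS"] + timing["tRP"]

print(f"row hit:      {row_hit} cycles")
print(f"row closed:   {row_closed} cycles")
print(f"row conflict: {row_conflict} cycles")
print(f"tRC:          {t_rc} cycles")
```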
Architecture Models for the Evaluation
We compare the performance and energy consumption of the proposed TEP with other architectures. The list of evaluated NMP architecture models is as follows:
• Host-only system: This model is the baseline architecture, which comprises a high-performance host processor and HMC-based memory stacks (Table 3(a)) without any accelerator. In this model, the target application runs on the host processor.
• Host with RPUs+MCU: This model is an HMC-based system with RPUs and an MCU integrated with the host processor. In other words, this model assumes that the proposed RPUs and MCU are integrated into the host processor as on-chip accelerators. In this model, the RPUs and MCU accelerate DCO and MMO (i.e., the memory compare operation), respectively, and the data to be consumed by the accelerators is delivered through the LLC of the host processor (Intel's Hybrid CPU-FPGA 2016).
• NMP with in-order cores: This model represents an NMP layer with 16 single-issue in-order cores, where each core has a single-level instruction and data cache. In this model, NMP kernels are executed on the in-order cores in the NMP layer, and L1 misses are sent to the DRAM array through direct TSV connections. In other words, we assume that each in-order core is integrated with a vault controller connected to TSVs, similar to prior NMP studies. The details of the NMP with in-order cores are summarized in Table 3(e).
• NMP with RPUs+MCU: This model represents an NMP logic layer that consists of RPUs and an MCU without the CCU. Thus, instead of the CCU, the host processor is responsible for CFO and MMO type processing. To allow communication between the host processor and the NMP stack, we model a simple controller that includes control logic such as register files, a TLB, an MMU, a DMA engine, and configuration memory (see Tables 3(c) and 3(d)).
• TEP: This model is the proposed TEP architecture with a CCU, RPUs, and MCU (see Tables 3(b) , 3(c), and 3(d)).
Estimation of Energy Consumption
In this study, we approximate the energy consumption of the evaluated models by using an analytic model that has been used to investigate the energy consumption of NMP-based systems. Table 4 shows the major parameters used in this study for estimating the energy consumption. We assume that each host core and each in-order core, excluding on-chip caches, consumes 10W and 80mW of active power, respectively, based on prior work including Pugsley et al. (2014). Because there are four cores in the host processor and 16 cores in the "NMP with in-order cores" model, these consume 40W and 1.28W in total, respectively. The power consumed by each memory component (e.g., on-chip caches, stream buffer, and TLB) in the host processor and TEP is calculated from the dynamic power of each component and the simulation statistics. Thus, the memory access characteristics are reflected in the estimation of overall energy consumption. In our TEP, the energy consumed by each FU and switch of the proposed RPU is based on Farmahini-Farahani et al. (2015). We assume that most of the energy consumed by an MCU operation is due to the comparators and use 22.4pJ for the comparator, as reported by Krashinsky (2011). We use CACTI 6.5 to obtain the power consumption of the cache memory, the RPU's configuration memory, the TLB, and the stream buffer. On the stacked DRAM side, the energy consumed for each DRAM access is estimated based on Muralimanohar et al. (2009) and Udipi et al. (2010). The PCB and TSV transfer energy values are estimated based on Pugsley et al. (2014), Udipi et al. (2011), and Woo et al. (2010). Note that the static power of the major components associated with the proposed TEP is also reflected in the energy consumption.
Table 4. Energy Parameters for Estimating the Energy Consumption
The static power of the RPU (6.8mW) and the MCU (0.64mW) is assumed to be in proportion to the area of the CCU as investigated in Marowka et al. (2012) .
In this study, the dynamic power of the proposed TEP is calculated below. First, total dynamic energy consumption is presented as a summation of the energy consumed by the host processor, TEP, global data transfer, and NMP memory:
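Written out from the sentence above, the summation has the following form. The symbol names are assumptions introduced here for readability and may differ from the notation used in the remaining equations.

```latex
% Total dynamic energy as the sum of the four contributions named in the text.
% Symbol names are illustrative assumptions.
\begin{equation}
E_{\mathrm{total}} = E_{\mathrm{host}} + E_{\mathrm{TEP}} + E_{\mathrm{global}} + E_{\mathrm{mem}}
\tag{1}
\end{equation}
```

Here E_host and E_TEP denote the energy consumed by the host processor and the TEP, E_global the energy of global (off-chip) data transfer, and E_mem the energy of the NMP memory.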
In Equation (1), the energy consumption of the host processor and the TEP can be expanded as follows:
In Equation (2), we calculate the energy consumption of the host processor and cache memory based on equations defined in prior work. The energy consumption of the proposed TEP, including the CCU, RPU, and MCU, can be calculated as follows:
The energy consumed by the stacked memory, including the data transfer energy of the TSVs, can be calculated by Equation (7):
Finally, the global transfer energy is calculated as the energy consumed by off-chip data transfer through the PCB-based bus:
EVALUATION RESULTS
Performance Results
Figure 5(a) shows the speedup of the NMP kernels extracted from the target applications. In the evaluation result, our TEP achieves about 5.3 times and 2 times speedup on average compared to the "Host-only system" and "NMP with in-order cores" models, respectively. Figure 5(b) shows the overall speedup results of the target applications. Our TEP achieves the highest performance improvement among the evaluated architecture models. On average, the proposed TEP delivers 3.4 times and 1.9 times better performance than the "Host-only system" and "NMP with in-order cores" model, respectively. The following four major observations can be made based on the performance results.
(1) First, high- and mid-MPKI applications (e.g., Memcached, Liblinear, LDA, 3D-LiDAR, HPCCG, Stemmer, RBM, SRR, and ME) achieve better performance on every NMP-based model than on the host-processing models (i.e., the "Host-only system" and "Host with RPUs+MCU" models). Thus, we infer that the extracted NMP kernels are a major source of the high MPKI of these applications and that performance can be improved by processing the NMP kernels locally on the NMP-based engine. On the other hand, low-MPKI applications (e.g., miniMD and k-means) do not experience significant performance improvement on conventional NMP models (e.g., the "NMP with in-order cores" and "NMP with RPUs+MCU" models) compared to the "Host with RPUs+MCU" model. Therefore, the LLC MPKI of an application can be used as one of the major metrics to determine whether the application should be allocated to the NMP stack.
(2) We can identify a distinct performance effect according to the contribution of the three operation types classified in this study (i.e., DCO, MMO, and CFO), based on a comparison of the "NMP with RPUs+MCU" and "NMP with in-order cores" models. In Figure 5(a), applications with a high contribution of the DCO type (e.g., LDA, 3D-LiDAR, HPCCG, Liblinear, RBM, miniMD, and k-means) achieve an outstanding speedup on the "NMP with RPUs+MCU" model compared to the "NMP with in-order cores" model. In contrast, some applications with a high contribution of the MMO or CFO type (e.g., Stemmer, SRR, and ME) achieve higher performance on the "NMP with in-order cores" model than on the "NMP with RPUs+MCU" model. This result means that the "NMP with in-order cores" model can handle the MMO and CFO types efficiently, whereas the "NMP with RPUs+MCU" model cannot handle those types itself and suffers from a significant overhead when they are processed on the host processor. In the case of the Memcached application, a remarkable performance improvement is obtained from the acceleration provided by the MCU (about 53% improvement with the "NMP with RPUs+MCU" over the "NMP with CCU+RPUs"), because the MCU is designed to accelerate the memory compare operation, which is one of the major performance bottlenecks of the application. This result shows that the "NMP with in-order cores" and "NMP with RPUs+MCU" models can each achieve significant performance improvements for applications whose operations suit the respective structure. However, neither structure alone can improve the performance of all applications.
(3) Our proposed TEP provides an outstanding speedup for all NMP kernels. As shown in Figure 5(a), our TEP achieves on average about 5.3 times and 2 times greater speedup for the NMP kernels than the "Host-only system" and "NMP with in-order cores" models, respectively. In particular, the proposed TEP provides better performance than the "NMP with RPUs+MCU" model even for the LDA and 3D-LiDAR applications, in which the MMO and CFO types contribute only a small part of the total execution time. As a result, Figure 5 shows the necessity of a heterogeneous NMP engine, consisting of specialized accelerators and a flexible computing engine, that is capable of handling various operations and accelerating certain kernels in order to effectively utilize the benefits of NMP-based systems.
(4) Finally, we found that the data shared between NMP and non-NMP kernels, which send and receive parameters among the kernels (i.e., functions) within a program, can significantly influence the overall performance of an application (see Figure 5(b)). To analyze this side effect on the overall performance of applications, we calculate the expected speedup that would result if the kernel speedup obtained with the proposed TEP (Figure 5(a)) were applied to the total NMP kernel contribution, without any performance loss due to communication of the shared data between NMP and non-NMP kernels.
For this study, we use Amdahl's law as shown in Equation (9) (Gustafson 1988). In other words, the legend entry "Expected speedup by TEP" in Figure 5(b) denotes the maximum performance that can be obtained by a system equipped with the proposed TEP:
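In its standard form, Amdahl's law applied to this setting reads as follows. The symbols are assumptions introduced here: f_NMP is the fraction of total execution time contributed by the NMP kernels, and S_kernel is the kernel speedup obtained with the TEP (Figure 5(a)).

```latex
% Amdahl's law: only the NMP-kernel fraction f_NMP is accelerated by S_kernel.
% Symbol names are illustrative assumptions.
\begin{equation}
S_{\mathrm{expected}} = \frac{1}{\left(1 - f_{\mathrm{NMP}}\right) + \dfrac{f_{\mathrm{NMP}}}{S_{\mathrm{kernel}}}}
\tag{9}
\end{equation}
```

As a purely hypothetical example, f_NMP = 0.8 and S_kernel = 5 would give an expected speedup of 1 / (0.2 + 0.16), which is about 2.8.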
In Figure 5(b), the HPCCG, Memcached, Stemmer, RBM, and miniMD applications, which share a large amount of data between the NMP and non-NMP kernels, achieve lower performance than the expected performance. In contrast, the LDA, 3D-LiDAR, Liblinear, SRR, ME, and k-means applications, which consume most of their data within the extracted NMP kernels, achieve performance improvements close to the expected speedup because of their relatively low overhead. Therefore, for an application in which the non-NMP kernels account for a very small share of the execution time, completely offloading the application to the NMP logic layer can be considered as a method for minimizing the communication overhead due to data sharing between the NMP and non-NMP kernels.
Data Traffic
Figure 6(a) shows the overall data traffic from the stacked memory to the on-chip memory of the host processor on the evaluated systems. In Figure 6(a), the data traffic (left y-axis) is normalized to that of the "Host-only system" model. The proposed TEP reduces the overall data traffic by 52% on average compared to the "Host-only system" model. In particular, the 3D-LiDAR, Liblinear, SRR, ME, and k-means applications, which consume most of their data within the NMP kernels, experience a significant reduction of data traffic of about 52% on average compared to the "Host-only system" model. However, these applications suffer from higher data traffic on the "NMP with RPUs+MCU" model than on the proposed architecture, because they must communicate frequently with the host processor to handle the data involved in the MMO and CFO types within the NMP kernels. Thus, this result shows the necessity of a programmable core in the NMP engine to reduce the communication overhead of NMP-based systems and improve performance.
Figure 6(a), on the other hand, shows that the "NMP with in-order cores" model is as effective as the proposed TEP in reducing the data traffic between the host processor and the NMP logic layer, because it processes all NMP kernels extracted from the target applications. Thus, in terms of data traffic, Figure 6(a) implies that the kernel coverage of the NMP architecture, that is, its ability to handle the various tasks contained in the application, is important for reducing data traffic on NMP systems.
Furthermore, we measure the data traffic caused by shared data between the NMP and non-NMP kernels on our TEP architecture. The shared data proportion of each application is presented as a line chart on the right y-axis in Figure 6(a). As shown in Figure 6(a), the HPCCG, Stemmer, RBM, and miniMD applications, which showed a large performance gap from the expected performance (Figure 5(b)), share about 22% of the data between the NMP and non-NMP kernels. This result means that a large amount of data is consumed by the host processor for executing the non-NMP kernels, even though these kernels make a very small contribution to the total execution time. In addition, Figure 6(a) shows that applications with low MPKI, such as miniMD and k-means, experience high traffic overhead on the NMP models evaluated in this article. This result shows that data sharing between the kernels of an application has a larger impact on the traffic overhead for applications with low MPKI, so these applications would not be suitable for the NMP model.
Energy Consumption
Figure 6(b) shows the energy consumption of each application on the evaluated models. In Figure 6(b), each bar details the energy consumed by each architectural component. DRAM access refers to the energy consumed by the DRAM devices, on-chip transfers include the energy consumed by the on-chip cache memories and PCB transfers, and execution and scheduling covers the energy consumed by the host processor and the TEP engine. The proposed TEP provides a significant energy saving of about 33% on average compared to the "Host-only system" model. Our TEP provides a significant energy saving on on-chip data transfer by processing the kernel operations locally within the NMP logic layer, which consists of an energy-efficient core and two accelerators.
Figure 6(b) also shows that the proportion of the operation types classified in this study can have a significant effect on energy consumption, as was the case for the performance results. First, applications with a dominant contribution of the DCO type achieve high energy savings in models with RPUs (i.e., "Host with RPUs+MCU," "NMP with RPUs+MCU," and "TEP") that process the DCO type in parallel. In contrast, applications with a dominant contribution of the MMO or CFO type achieve a high energy reduction in models that include a programmable core (i.e., "NMP with in-order cores" and "TEP"). In addition, the energy consumption of the "NMP with RPUs+MCU" model, which cannot handle various operations, confirms that an NMP logic layer with restricted processing ability is unlikely to offer as high an energy saving potential as the versatile NMP models. Finally, Figure 6(b) shows that the applications sharing a large amount of data between the NMP and non-NMP kernels experience higher energy consumption for on-chip data transfer.
RELATED WORK
In the 1990s, various studies for processing-in-memory (PiM) were performed as early forms of the NMP model. Prior works related to PiM tried to overcome the memory wall problem between the processor and memory by integrating the data processing logics into a single die (Mai et al. 2000; Draper et al. 2002) . Although they delivered several outstanding results, the PiM model has been difficult to employ in the industry due to the high cost to make a single chip and lack of killer applications. This section briefly summarizes the prior studies based on the classification of the architecture models that were considered as a primary architecture in an NMP logic layer.
NMP Engine with Fully Programmable Cores
An NMP engine constructed with fully programmable cores has been investigated in various NMP-related studies. One study introduced a near-data processing accelerator consisting of a number of in-order cores to process large-scale graph algorithms and proposed a low-cost message passing mechanism with specialized prefetchers to support efficient communication between vault domains. Another designed an analytical model to estimate the energy consumed in an NMP system consisting of low-power in-order cores. Pugsley et al. (2014) and Islam et al. (2014) employed a number of in-order cores as a near-data processing engine to offload MapReduce workloads. Hong et al. (2016) proposed an in-memory accelerator constructed with a number of in-order cores to minimize the memory bottleneck caused by pointer-chasing operations. In Yitbarek et al. (2016), Intel Atom processors are employed in an NMP logic layer to accelerate data-intensive kernels, and Scrbak et al. (2015) investigated the number of cores, cache size, and core frequency in terms of performance and energy efficiency on MapReduce workloads to derive the design parameters for an NMP engine with in-order cores.
As these studies show, NMP engines with fully programmable cores can perform a variety of tasks and are easily adaptable to a variety of applications. However, it is difficult to expect fully programmable cores to match the performance and power consumption of NMP engines designed for specific tasks.
NMP Engine with Reconfigurable Logics
Reconfigurable architectures have been studied in various domains, such as multimedia, signal processing, and pattern matching, to exploit the spatial parallelism of large loop bodies. This research has focused on two categories: CGRAs, which compute with word granularity (Goldstein et al. 1999; Govindaraju et al. 2012; Huang et al. 2013), and FPGAs, which compute with bit granularity (Hauser et al. 1997; Hartenstein et al. 2001; Mishra et al. 2006). Reconfigurable architectures have shown outstanding benefits in terms of performance and energy efficiency compared to fully programmable architectures for many data-parallel applications.
Recently, reconfigurable architectures have been reconsidered in various NMP studies as efficient near-memory accelerators capable of providing benefits in terms of performance and energy efficiency. Farmahini-Farahani et al. (2015) implemented an acceleration engine with multiple CGRAs in the NMP logic layer to accelerate large-scale loop bodies of big data applications. Heterogeneous reconfigurable logic has also been proposed as near-data processing units to improve power and area efficiency. GPU-based NMP engines have also been studied to accelerate the parallel workloads involved in big data applications. GPGPU execution units with 3D-stacked DRAM have been proposed for in-memory computing, and Hsieh et al. (2016) proposed a compiler-based mechanism to automatically identify the code regions to offload and to map data to a GPU-based NMP engine.
The reconfigurable/GPU-based NMP engines have resulted in improved performance by accelerating the data-parallel workloads. However, they can experience a high communication overhead between the host processor and NMP logic layer.
NMP Engine with Dedicated HW Logics
In many NMP-related studies, architectures with fixed-function hardware logic have been investigated for various workloads. Hong et al. (2016) introduced dedicated hardware logic to accelerate pointer-chasing operations on a linked-list data structure, and Lee et al. (2015) proposed hardware to minimize the overhead of atomic operations for machine-learning workloads. Seshadri et al. (2013) proposed hardware logic that supports copy and initialization operations on bulk data in the stacked DRAM. Akin et al. (2014) and Akin et al. (2015) proposed a memory accelerator, together with a mathematical framework, to handle data reorganization operations such as shuffle, pack/unpack, and swap. Thanh-Hoang et al. (2016) introduced dedicated hardware to accelerate data movement between multiple accelerators implemented in the stacked memory. An NMP engine based on dedicated hardware logic provides significantly improved performance for a target kernel. However, it can experience high communication overheads with the host processor due to its limited functionality, and it is difficult to adapt to other applications.
In summary, most prior studies used a single type of computing engine for the NMP engine. NMP engines designed with one type of computing engine, such as dedicated hardware or fully programmable logic, can experience high communication overheads and low performance for certain tasks. Unlike prior works, this study proposes a heterogeneous NMP architecture with three different types of computing engines, which can accelerate and execute three types of kernel operations extracted from various applications.
CONCLUSION
With the advent of 3D memory stacking technology, various approaches have been explored for near-memory processing that can accelerate and offload certain kernel operations included in a variety of big data application areas. While most previous works have delivered improved performance and energy efficiency for some specific kernel operations with a specific type of NMP engine, this study shows that an NMP engine with a single type of processing engine has the following drawbacks. NMP engines with dedicated hardware logic designed for a specific task can provide fast processing and low power consumption but can suffer a relatively high communication overhead with the host processor compared to NMP engines with programmable cores. On the other hand, while the NMP engines with energy-efficient in-order cores adopted in previous studies can be applied to a variety of applications, it is difficult to expect them to match the performance and power consumption of dedicated hardware specialized for specific kernels.
In this study, we propose TEP, a near-memory processor with three types of processing engines to execute and accelerate the various kernel operations included in emerging applications. To design the TEP, we categorize the kernel operations extracted from various applications into three types (CFO, MMO, and DCO) and then analyze the architectural features needed to process these three types efficiently. The proposed heterogeneous NMP architecture, TEP, is designed with three computing engines: a CCU designed with an energy-efficient in-order core for the CFO and MMO types, an RPU designed with CGRAs for the DCO type, and a dedicated hardware accelerator (or a set of them). A representative hardware accelerator engine for memory compare operations, called the MCU, is designed as an example of such a set of dedicated hardware accelerators.
In our evaluation results, the proposed TEP provides a speedup of about 3.4 times and a reduction of about 52% in data traffic (33% energy saving) on average compared to the non-NMP system. Through this study, we have demonstrated the effectiveness of a heterogeneous NMP engine that can handle the diverse kernel operations included in applications by evaluating various NMP-based models. Furthermore, we have observed how the data shared between NMP and non-NMP kernels can interfere with the benefits of an NMP-based system. In our future work, we will study heterogeneous NMP architectures with new combinations of architectural models such as GPU, SIMD, and FPGA, as well as an NMP architecture that completely offloads applications to minimize communication overhead.
