Introduction
Given the limitations of conventional computer systems, new heterogeneous multiprocessing systems are being widely developed. To review the design of current computer architectures, we need to look back at past designs and workloads and understand their strengths and weaknesses. Because we are creating a new design, we also need to look ahead: at the upcoming technologies that can be implemented, and at the problems the architecture will be expected to solve.
Looking at conventional architectures, whether single-core or multicore, one of the main limitations is the communication network that connects the processors to the memory. It is a limited shared resource that creates a bottleneck, a problem also known as the von Neumann bottleneck. One way to create a cheaper architecture that avoids this limitation is to add computational modules directly in the memory. This architecture has two main advantages. First, the computational modules no longer use the shared memory bus; they use an interconnection network inside the memory instead. Second, the computational modules do not incur cache-miss penalties, since they have direct access to all of the operating memory.
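The effect of moving computation into the memory can be illustrated with a deliberately simple traffic count. The sketch below is not part of the proposed design: the function names, the array-summation workload, and the assumption that each in-memory core returns exactly one partial result are all illustrative, and caches and command traffic are ignored.

```python
# Toy model: count the data words that cross the shared processor-memory bus
# when summing an array. All names and the workload are illustrative assumptions.

def bus_words_conventional(n_elements: int) -> int:
    """The main processor reads every element over the shared bus."""
    return n_elements


def bus_words_pim(n_elements: int, n_pim_cores: int) -> int:
    """Each in-memory core sums its slice locally and returns one partial sum."""
    # Only one partial result per active core crosses the shared bus.
    return min(n_pim_cores, n_elements)


n = 1_000_000
print("conventional:", bus_words_conventional(n))
print("with 16 in-memory cores:", bus_words_pim(n, n_pim_cores=16))
```

Under these assumptions the bus traffic for the reduction depends on the number of in-memory cores rather than on the size of the data set.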
Theoretical Preliminaries
A computer is used to solve a wide range of problems. Solving a problem with electrons is done through layers of transformations. First, we need to describe the problem. The description is written in natural language, which contains many ambiguities and multiple meanings.
The first transformation step is describing the problem as an algorithm. The algorithm goes through the solution step by step, without any ambiguities or unknowns. The next step is to translate the algorithm into a programming language, a formally constructed language designed to communicate instructions to a computer. The runtime system implements the behavior of the constructs of the language. The instruction set architecture (ISA) is the part of the computer architecture related to programming; it bridges the gap between software and hardware. Below the ISA is the microarchitecture, which is not visible to the software. The microarchitecture is the way a specific instruction set is implemented in the processor. It is sometimes referred to as computer organization, and it runs programs by performing the following steps: 1. Read the instruction and the data related to it, 2. Decode the instruction, 3. Execute the instruction on the corresponding data, 4. Write the result out. This sequence is called the instruction cycle and is repeated continuously until the power is turned off. Different instructions require different numbers of cycles to complete.
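The four steps of the instruction cycle can be sketched as a small interpreter. This is a minimal illustration, not a model of any real ISA: the fixed four-field instruction format, the three opcodes, and the `run` function are assumptions made for the example, and the loop stops at the end of the program rather than running until power-off.

```python
import operator
from typing import Callable, Dict, List, Tuple

# Hypothetical fixed instruction format: (opcode, dst, src_a, src_b),
# where dst, src_a and src_b are addresses in a small word-addressed memory.
Instruction = Tuple[str, int, int, int]

# Hypothetical opcode table used by the decode step.
OPERATIONS: Dict[str, Callable[[int, int], int]] = {
    "ADD": operator.add,
    "SUB": operator.sub,
    "MUL": operator.mul,
}


def run(program: List[Instruction], memory: List[int]) -> List[int]:
    """Run a toy program with the four-step instruction cycle."""
    memory = list(memory)  # work on a copy of the initial memory image
    pc = 0                 # program counter
    while pc < len(program):
        # 1. Read the instruction and the data related to it.
        opcode, dst, src_a, src_b = program[pc]
        a, b = memory[src_a], memory[src_b]
        # 2. Decode the instruction.
        if opcode not in OPERATIONS:
            raise ValueError(f"unknown opcode: {opcode}")
        operation = OPERATIONS[opcode]
        # 3. Execute the instruction on the corresponding data.
        result = operation(a, b)
        # 4. Write the result out.
        memory[dst] = result
        pc += 1
    return memory


# memory[2] = memory[0] + memory[1]; memory[3] = memory[2] * memory[2]
print(run([("ADD", 2, 0, 1), ("MUL", 3, 2, 2)], [2, 3, 0, 0]))  # [2, 3, 5, 25]
```

In a real microarchitecture each of these steps can take a different number of clock cycles, which is why different instructions take different amounts of time to complete.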
System Description of PIM Architecture
The system model is designed with one or more conventional main processors (MP), together with supplementary independent computing cores integrated in the operating memory (mPIMs). The main goal of the proposed design is to achieve improvements while keeping the changes invisible at the ISA level, so that existing applications can run on the new architecture without software changes. Placing processing cores in the memory provides low latency and fast communication between the cores inside the memory. Developing small homogeneous cores is economical, and the cores themselves have lower power consumption.
The main processor, like other conventional architectures, contains multiple levels of cache. Although this speeds up the MP, the communication network (CN) remains critical to the speed of the entire system. To reduce frequent memory accesses, PIM nodes are used. Part of the computation is done at the memory level, reducing the load that the MP places on the CN. Designing a heterogeneous multicore, multiprocessor architecture with a single ISA is challenging, and multiple factors have to be taken into account. Bigger and faster cores have higher performance, that is, more instructions per second, but they are less energy efficient. Small cores have lower performance, but are much more energy efficient and easier to design. When building the test bench for a single PIM core, we need to take into consideration the current limitations of the technology. The recently introduced Hybrid Memory Cube, developed by the HMC Consortium, will allow us to develop processing elements tightly connected with the memory layers. The consortium was initially founded by Micron Technology and Samsung Electronics; however, it was later joined by developer …
Stages of the Hardware Design
The development of the PIM core is performed with Xilinx ISE WebPACK. ISE (Integrated Synthesis Environment) is a tool for the synthesis and analysis of HDL designs, as well as for simulation and implementation.
This is an iterative process of correcting and testing a model until its final integration and completion. The Xilinx system allows multiple changes throughout the development cycle, without the need to physically rebuild the elements after alterations.
PIM Core Model
The designed PIM core contains the following main components:
The external data flow is loaded into a DP Memory Module (DPM) built from three dual-port sections. The ports are used to load the addresses for reading and writing the operating memory cells. Instructions are fetched and decoded by the AU. The DPM Control module coordinates the data flow towards the different sections of the internal memory, allowing two sections to be accessed and controlled simultaneously. Access to and from the PIM module goes through the I/O module, designed with two 32-bit channels and three buffers per channel.
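The I/O module described above, with two 32-bit channels and three buffers per channel, can be approximated by a small behavioral model. The sketch is written in Python for illustration only and is not the HDL design: the `IOModule` class, the FIFO ordering of the buffers, and the choice to reject a word when a channel is full are assumptions that the text does not specify.

```python
from collections import deque
from typing import Deque, List, Optional

WORD_MASK = 0xFFFFFFFF    # 32-bit channel width (from the text)
NUM_CHANNELS = 2          # two channels (from the text)
BUFFERS_PER_CHANNEL = 3   # three buffers per channel (from the text)


class IOModule:
    """Behavioral sketch of the PIM I/O module (FIFO order is an assumption)."""

    def __init__(self) -> None:
        self.channels: List[Deque[int]] = [deque() for _ in range(NUM_CHANNELS)]

    def push(self, channel: int, word: int) -> bool:
        """Buffer one word; return False if all buffers of the channel are full."""
        queue = self.channels[channel]
        if len(queue) >= BUFFERS_PER_CHANNEL:
            return False  # assumed behavior: the sender has to retry later
        queue.append(word & WORD_MASK)  # keep only the low 32 bits
        return True

    def pop(self, channel: int) -> Optional[int]:
        """Return the oldest buffered word, or None if the channel is empty."""
        queue = self.channels[channel]
        return queue.popleft() if queue else None


io = IOModule()
accepted = [io.push(0, word) for word in (10, 20, 30, 40)]
print(accepted)   # the fourth word is rejected because channel 0 is full
print(io.pop(0))  # oldest word on channel 0
```

A model of this kind is only useful for reasoning about buffering and back-pressure; timing, port arbitration, and the interaction with the DPM Control module are outside its scope.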
The design is written in a hardware description language (HDL). For the memory control I/O ports, we have the following module:
With the Xilinx ISE WebPACK package, we were able to complete the design and verification of a complete PIM core. We obtained initial numerical indicators of usage efficiency, along with reference markers to compare against the analytical model of PIM.
Simulation Model and Results
To simulate the proposed structural design and measure the performance of a single-core PIM, we use the benchmarks of the TPC, an industry standards organization that defines performance benchmarks. TPC stands for Transaction Processing Performance Council, and its benchmarks are mostly used to evaluate the performance of physical servers before they are put into operation. The TPC benchmark On-Line Transaction Processing (OLTP) workload is as follows:
The results from the designed PIM core can be used for future development and for comparison with analytical models of computer architectures with supplemental processing power in the memory. The models can be used to evaluate and benchmark complete systems with multiple cores, including conventional processors.
