INTRODUCTION
With the myriad of new computer architectures appearing these days, it is not easy to cover all of them in a single comparative study. Instead, we present a comparative study of only a few recent, commercially available systems representing the new architectural styles (as opposed to academic projects). For less recent contributions, and for contributions of academic research, interested readers are referred to past survey papers by the authors and others [1, 2, 3, 4]. Each of the solutions presented here is a proven, commercially successful representative of a particular computing style. Examining their comparative advantages and disadvantages helps to envision future trends for such systems. We focus on a single node, noting that each of the surveyed systems can be connected to further such nodes.
II.
PROBLEM STATEMENT
Several years ago, it seemed that the performance improvement attributed to Moore's law, obtained by simply increasing clock frequency and improving instruction-level parallelism, both enabled by reductions in feature size, was reaching its limit [5]. At the clock frequencies of today's systems, power consumption has emerged as a new serious limitation.
In order to deal with these limitations, designers have resorted to several different approaches. The first is to put a larger number of simpler processing cores on the same chip. The second is to use configurable logic as a computing unit that is configured to process a large amount of data in parallel, at clock frequencies one order of magnitude lower. These solutions can often achieve better performance at lower power consumption.
III.
EXISTING SOLUTIONS
In this section, we give a short overview of four existing solutions to the problem specified above, namely CBEA, SGI RASC, Maxeler MaxNodes, and NVidia GPU computing. In general, the reviewed solutions are well-established commercial solutions that give good results under the conditions of interest for their operating environments and problem-solving domains, but each of them has some comparative advantages and disadvantages.
This section starts with a classification of existing solutions in the domain of contemporary computing architectures. It continues with an overview which, for each example, gives the following main points: (a) a short overview of the solution, and (b) the essential elements of the approach in terms of architecture and programming model. It ends with a concluding subsection in which we critically compare all solutions in their present state.
A. Selection Criteria
Basically, two paradigms exist for parallel computation on multiprocessors: the message passing interface (MPI), usually available where communication among processors happens over a network, and shared memory or symmetric multiprocessing (SMP), usually applied in bus-based systems where communication happens through a shared memory. Heterogeneous multiprocessing refers to the use of different processing cores to maximize performance. In most existing work, the purpose is to achieve an efficient mapping of a set of threads onto the processors. Two possibilities have been presented that address this purpose: simultaneous multithreading (SMT) [6, 7, 8] and the chip multiprocessor (CMP) [5, 9, 10, 11, 12]. An SMT chip is based on a superscalar processor with a set of units for the parallel execution of instructions. Given a set of threads, the resources are dynamically allocated by extracting several instructions to be executed in parallel. Threads that need long memory accesses are preempted to avoid idle states of the processor.
Instead of using only one superscalar core to execute instructions in parallel, CMPs use many processor cores to compute threads in parallel. Each processor has a small first-level local instruction and data cache. A second-level cache is available to all the processors on the chip. CMPs target applications consisting of a set of independent computations that can be easily mapped to threads. This is usually the case in database or web servers, where transactions do not rely on each other. In CMPs, threads with long memory accesses are preempted to allow others to use the processor [1].
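The mapping of independent transactions onto threads can be illustrated with a minimal Python sketch. The function names (handle_transaction, serve) and the workload are hypothetical and chosen only for illustration. In CPython, threads do not give a real speedup for compute-bound work, so the sketch shows the structure of the mapping rather than its performance.

```python
from concurrent.futures import ThreadPoolExecutor


def handle_transaction(request_id: int) -> int:
    """Hypothetical transaction: it touches no shared state and
    does not depend on the result of any other transaction."""
    return sum(i * i for i in range(request_id))


def serve(requests, workers=4):
    """Map independent transactions onto a pool of worker threads.

    Because the transactions are independent, any worker may take
    any request, and no synchronization between workers is needed.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the input requests.
        return list(pool.map(handle_transaction, requests))
```

Calling serve(range(100)) would distribute one hundred such transactions over four workers; on a CMP, each worker would correspond to a hardware thread on one of the cores.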
A combination of the chip multiprocessor paradigm and flexible hardware accelerators can be used to increase the computation speed of applications. In this case, FPGAs can be used as target devices in which a set of processors and a set of custom hardware accelerators are implemented and communicate over a network. However, current FPGA solutions for high-performance computing still use fast microprocessors, and the results are systems with very high power consumption and power dissipation.
The adaptivity of the whole system can be reached by modifying the computation infrastructure. This can be done at compile-time using full reconfiguration, or at run-time by means of partial device reconfiguration.
B. Presentation of existing solutions
We shall now present details of each of the selected solutions. Of these, SGI RASC and Maxeler Dataflow Engines (MaxNodes) belong to the domain of symmetric multicore processors accompanied by a reconfigurable hardware accelerator. The CBEA is a representative of heterogeneous multicore processors. Finally, the NVidia Fermi GPU is an example of a heterogeneous system where a CPU is accompanied by a highly multithreaded single instruction/multiple data (SIMD) accelerator.
1.
Cell Broadband Engine Architecture (CBEA)
The Cell Broadband Engine Architecture (CBEA), jointly developed by Sony, Toshiba, and IBM, was conceived as a next-generation chip architecture for multimedia and compute-intensive processing [13]. IBM offers two chips based on CBEA. One of them is at the core of Roadrunner, the world's first petaflop supercomputer. Fig. 1 shows the major components of a Cell/B.E. processor: the main processing element (the Power processor element, or PPE), the parallel processing accelerators (synergistic processor elements, or SPEs), the on-chip interconnect (a bidirectional data ring known as the element interconnect bus, or EIB), and the I/O interfaces (the memory interface controller, or MIC, and the Cell Broadband Engine interface, or BEI). The SPEs are designed as accelerators that can significantly improve performance on applications in which large portions of the code consist of many independent calculations. This is done by extracting portions of the problem that, together with their data, fit on one SPE, and executing them there. Mapping data means that software has to transfer input data from the main memory to the local memories of the SPEs, and transfer results from the local memory of a particular SPE back to the main memory. For these data transfers, software uses DMA controllers and overlaps the transfers with program execution, thus avoiding performance degradation.
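The overlap of DMA transfers with program execution is commonly achieved by double buffering. The following Python sketch only mimics that pattern: dma_get and process_chunk are hypothetical stand-ins for a DMA transfer into a local store and for the computation performed on it, and a single background thread plays the role of the DMA controller. It is not Cell SDK code.

```python
from concurrent.futures import ThreadPoolExecutor


def dma_get(main_memory, start, size):
    """Stand-in for a DMA transfer from main memory to a local store."""
    return main_memory[start:start + size]


def process_chunk(chunk):
    """Stand-in for the computation an accelerator core runs on local data."""
    return [x * 2 for x in chunk]


def double_buffered(main_memory, chunk_size):
    """Process main_memory chunk by chunk, prefetching the next chunk
    while the current one is being processed."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        # Start the first transfer before any computation begins.
        pending = dma.submit(dma_get, main_memory, 0, chunk_size)
        for start in range(0, len(main_memory), chunk_size):
            current = pending.result()  # wait until the buffer is filled
            next_start = start + chunk_size
            if next_start < len(main_memory):
                # Issue the next transfer, then compute on the current buffer.
                pending = dma.submit(dma_get, main_memory, next_start, chunk_size)
            results.extend(process_chunk(current))
    return results
```

The chunk size in such a scheme would be bounded by the capacity of the local memory, since two buffers have to reside in it at the same time.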
One of the main disadvantages of the Cell/B.E. processor is the complexity of its programming model. The programmer has to write separate programs for the PPE and the SPEs, and compile them with different compilers because the PPE and the SPEs have different instruction sets. It is the programmer's responsibility to decide how to distribute calculations across the SPEs in order to get a real speedup, and to implement proper communication and synchronization between the parts of the program. If an inappropriate decision is made, it is easy to get worse performance than the same program achieves using only one processor. There are also some promising solutions, such as compilers with OpenMP support, which relieve the programmer of the burden of dealing with the complexity of the processor [14].
2.
SGI RASC
SGI RASC is the Reconfigurable Application Specific Computing (RASC) module. It is SGI's technology that enables users to develop application-specific hardware using reconfigurable logic elements. The SGI RASC is tightly integrated with NUMALink, over which it has access to global shared memory. The SGI RASC is designed so that it can be reconfigured at runtime in order to meet current application requirements. Fig. 2 shows the block diagram of the SGI RASC module. The main components of SGI RASC are the Algorithm FPGAs intended for implementing processing hardware, additional memory for storing the current working set of data, the Loader FPGA responsible for configuring the Algorithm FPGAs at runtime, and the TIO Application Specific Integrated Circuit (ASIC) for interfacing the RASC module to NUMALink, over which it communicates with the processors and their memories.
A summary of the SGI approach to FPGA programming is as follows: (a) write an application in the C programming language for a system microprocessor, (b) identify the computation-intensive routine(s), (c) generate a bitstream using Core Services and the language of choice, (d) replace the routines with RASC abstraction layer (rasclib) calls, which support both a C and a Fortran90 interface, and (e) run the application and debug it with GDB. The source of complexity in this solution is generating an appropriate bitstream optimized for the particular application, and interfacing it to the software.
3.
Maxeler Dataflow Engines
Maxeler Technologies' core competence is in delivering substantially increased performance for HPC applications through the rewriting of applications for Dataflow Engines. By mapping compute-intensive algorithms directly into parallel hardware, tightly coupled to a conventional CPU through a high-speed I/O bus, complete applications can be accelerated by orders of magnitude over conventional CPU implementations. By exploiting massive parallelism at the bit level, Maxeler MaxNode solutions deliver performance far in excess of CPUs at approximately a tenth of the clock frequency and power consumption. Fig. 3 sketches the architecture of a Maxeler hardware acceleration system, which comprises one or more Dataflow Engines attached to a set of memories and connected to a host CPU via PCI Express channels [16] [17]. MaxRing interconnects (not shown in Fig. 3) establish high-bandwidth communication channels between the Dataflow Engines on the accelerator. Accelerating an application involves looking at the entire application, adapting the program structure, and partitioning the execution flow and data layout between CPUs and accelerators. The program of the Dataflow Engines comprises arithmetic data paths for the computations (the kernels) and modules orchestrating the data I/O for these kernels (the manager).
The Maxeler system is programmed using Java libraries supplied with the system. Using these libraries, the programmer describes the structure of the system. This approach hides most of the implementation and hardware-specific details from the programmer, such as the implementation of the interface to the software and other details that the programmer has to deal with when using a Hardware Description Language (HDL). The programmer is still required to deal with the structure of the hardware. This means that the programmer has to identify a part of the algorithm that is described with sequential instructions but contains many parallel calculations, and then convert it to a dataflow structure with the same function as the sequential solution. An efficient dataflow structure, called a streaming kernel, strongly emphasizes the regularity of the data flow. Such streaming kernels lend themselves to deeply pipelined dataflow implementations, which are the key to achieving high performance in custom hardware.
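The conversion from a sequential formulation to a dataflow structure can be illustrated by analogy. The sketch below is written in Python, not in the Maxeler Java libraries, and all names in it are hypothetical. It expresses a three-point moving average first as an index-based loop and then as a chain of stream stages, where each stage consumes one element per step, much as a pipelined kernel would consume one element per clock cycle.

```python
from collections import deque


def moving_average_sequential(data):
    """Sequential formulation: an index-based loop over an array."""
    out = []
    for i in range(2, len(data)):
        out.append((data[i - 2] + data[i - 1] + data[i]) / 3)
    return out


def window3(stream):
    """Stream stage: expose the current element and its two predecessors."""
    buf = deque(maxlen=3)
    for x in stream:
        buf.append(x)
        if len(buf) == 3:
            yield tuple(buf)


def average(stream):
    """Stream stage: in hardware, two adders and one divider."""
    for a, b, c in stream:
        yield (a + b + c) / 3


def moving_average_stream(stream):
    """Dataflow formulation: stages connected into a pipeline.

    No stage uses random access; data only flows forward, which is
    the regularity that makes deep pipelining possible.
    """
    return average(window3(stream))
```

Both formulations compute the same values, for example list(moving_average_stream(iter(data))) equals moving_average_sequential(data) for any list of numbers, but only the second one makes the flow of data between operations explicit.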
4.
NVidia Fermi Graphics Processing Unit (GPU)
A GPU is a symmetric multi-core processor that is exclusively accessed and controlled by the CPU, making the two a heterogeneous system. The GPU operates asynchronously from the CPU, enabling concurrent execution and memory transfer. The GPU is the most pervasive parallel processor to date. Today's GPUs greatly outpace CPUs in arithmetic throughput and memory bandwidth, making them the ideal processor to accelerate a variety of data parallel applications.
The Fermi-based GPU (see Fig. 4) features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 streaming multiprocessors (SMs) of 32 cores each. Each CUDA core has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Also, each core has its own local memory. Each SM has a shared memory accessible by the cores in that SM. The GPU has a global memory that is accessible by each core in the system. The GPU has six 64-bit memory partitions, forming a 384-bit memory interface, and supports up to a total of 6 GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI Express. The GigaThread global scheduler distributes thread blocks to the SM thread schedulers. Threads are further scheduled by the two warp schedulers that exist inside each SM [18].
Fig. 4. The Fermi architecture
From the programmer's point of view, there are two address spaces, one of the host system and one of the GPU. Aware of that, the programmer writes a program in a high-level language (HLL), e.g. C, identifies functions that contain many parallel, data-independent calculations, designates them for compilation and execution on the GPU, and uses an appropriate API to manage the resources on the GPU. Each function intended for execution on the GPU, called a kernel, is executed by one or more threads that run in parallel on the GPU cores. Although the programmer writes only one program in an HLL, the programmer is responsible for partitioning the problem correctly in order to get a speedup from the system. If the programmer is not careful (e.g. partitions the problem so that the working set cannot fit into fast memory), it is easy to degrade performance.
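The kernel model described above can be mimicked in a few lines of Python. This is a sequential emulation for illustration only, not CUDA code: launch, vector_add_kernel and the index names are hypothetical, the two copies stand in for transfers between the host and GPU address spaces, and a real GPU would run the logical threads in parallel.

```python
def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """Body executed by one logical thread: it handles a single element."""
    i = block_idx * block_dim + thread_idx
    if i < len(out):  # guard: the grid may be larger than the data
        out[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Sequential emulation of a kernel launch over a grid of blocks."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


def vector_add(a, b, block_dim=32):
    """Host-side code: copy inputs, launch the kernel, copy the result back."""
    # Copies emulate the transfer from the host to the GPU address space.
    dev_a, dev_b = list(a), list(b)
    dev_out = [0] * len(dev_a)
    # Enough blocks to cover all elements (ceiling division).
    grid_dim = (len(dev_a) + block_dim - 1) // block_dim
    launch(vector_add_kernel, grid_dim, block_dim, dev_a, dev_b, dev_out)
    # Copy emulates the transfer of the result back to the host.
    return list(dev_out)
```

The choice of block_dim is the kind of partitioning decision left to the programmer; the value 32 used here is only an assumed default.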
C. Conclusion about existing solutions
A performance comparison is given in Table I [2, 19, 20]. Besides peak performance, the important parameters for assessing the performance potential for parallel algorithms are the bandwidth of shared memory access and the speed of the interconnection network. Bear in mind that not all solutions are from the same technology generation; the year of issue can be used as an indication of the generation to which a particular solution belongs.
The three architectural styles we have reviewed have distinct benefits and drawbacks, which in turn affect how well they perform for different applications. The major challenge for application developers is to bridge the gap between theoretical and experienced performance, that is, to write well-balanced applications where there is no apparent single bottleneck. This, of course, varies from application to application, but also for a single application on different architectures.
The GPU is the best performing architecture when it comes to single precision floating point arithmetic, with an order of magnitude better performance compared to the others. When considering the price and performance per watt, the GPU also comes out favorably. However, the GPU performs best when the same operation is independently executed on a large set of data, disfavoring algorithms that require extensive synchronization and communication [21] .
The CBEA still performs well in comparison with state-of-the-art CPUs, even though it is now five years old. Furthermore, it offers a very flexible architecture where each core can run a separate program. Synchronization and communication are fast, with over 200 GB/s bandwidth on the interconnect bus. This makes the CBEA extremely versatile, yet somewhat more difficult to program than the GPU. It is also a very well-performing architecture when it comes to double precision.
The FPGA performance is difficult to quantify in terms of floating point operations, as the number of operations that can be executed in parallel depends on the specific application implemented. All FPGAs are particularly well suited to algorithms where fixed point, integer, or bit operations are the key. For such tasks, the FPGA has outstanding raw performance and an especially good performance per watt ratio. If floating point operations are needed, FPGA implementations benefit from tuning the precision to the minimum required level, independent of the IEEE-754 standard. For Maxeler FPGA solutions in particular, the peak performance potential of the chip is not a good guide to the overall achieved performance, since the inherent flexibility of the architecture often enables good solutions to be devised for a wide range of applications where more rigid solutions (even with higher peak performance) struggle.
D. Programming considerations
It is widely believed that the major barrier to the adoption of heterogeneous and reconfigurable computing technology is the lack of adequate programming systems that offer a level of abstraction higher than that currently provided by available HDLs. Tools supporting high-level programming specifications would tremendously accelerate the development cycle of applications for reconfigurable systems and facilitate the migration of already developed algorithms to these systems, a key aspect for their widespread use and acceptance.
IV.
THE TREND, THE PROMISING SOLUTIONS, AND OPEN QUESTIONS FOR FUTURE
We believe that future applications will require heterogeneous processing. However, one must not overlook legacy software that has existed for decades. Such software is economically infeasible to redesign, and must use library calls to benefit from heterogeneous processing. This has the drawback captured by Amdahl's law, where the serial part of the code quickly becomes the bottleneck. New applications and algorithms, however, can be designed for existing and future architectures.
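The effect of the serial part can be quantified with the usual statement of Amdahl's law: if a fraction p of the run time can be accelerated by a factor n, the overall speedup is 1 / ((1 - p) + p / n). The short Python sketch below evaluates this expression; the function name and the sample values are chosen here for illustration and do not come from measurements in this paper.

```python
def amdahl_speedup(parallel_fraction: float, n: float) -> float:
    """Overall speedup when a fraction of the run time is accelerated n times.

    parallel_fraction: share of the original run time that is accelerated (0..1).
    n: speedup factor applied to that share (e.g. number of accelerator cores).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n)
```

For a legacy application in which library calls cover 90% of the run time, the speedup stays below 10 regardless of how fast the accelerator is, because the remaining serial 10% is untouched.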
In the future, we see it as likely that GPUs will provide new high-speed interconnect links, similar to those of current FPGAs. With CUDA 4.0 technology, there is the possibility of a shared address space, where both the CPU and the GPU can transparently access all memory. The programming language for GPUs, at first a limited subset of C, is now shifting towards full C++.
The CBEA roadmap is uncertain, but we find it likely that similar designs will be used in the future. The major strengths of the CBEA are the versatility of the synergistic processing units and the use of local store memory in conjunction with DMA. We believe such features will become increasingly used as power constraints become more and more pressing.
A growth rate of 200% was observed in the capacity of the Xilinx FPGA in less than 10 years, while in the meantime a 50% reduction in power consumption could be reached, with prices also enjoying similar improvements [1]. The obvious path for FPGAs is to simply continue increasing the clock frequency and decreasing the production cost. However, the inclusion of hard arithmetic units is an interesting trend. We believe that the trend towards an increased amount of special-purpose on-chip hardware, such as more floating point support, will continue. This will rapidly broaden the spectrum of algorithms that are suitable for FPGAs. Traditionally, the challenges to using FPGAs effectively fall into two categories: ease of use and performance. Ease of use issues include the following: the methodology for generating the "program" or bitstream for the FPGA; the ability to debug an application running on both the microprocessor and the FPGA; and the interface between the application and the system, or Application Programming Interface (API). Performance challenges include the optimization of data movement (partitioning) between microprocessors and FPGAs, and the partitioning between multiple CPU+FPGA pairs, which drives the scalability of the system topology. One commercial solution that addresses all of these problems is the Maxeler dataflow programming model and its tools.
V.
CONCLUSION
For certain data-intensive applications, more cores do not mean better performance, according to Sandia's simulation [22] : "After about 8 cores, there's no improvement, and further it even decreases". Therefore, it is difficult to envision efficient use of hundreds of traditional CPU cores. The use of hundreds of accelerator cores in conjunction with a handful of traditional CPU cores, on the other hand, appears to be a sustainable roadmap.
Currently, there is no one-size-fits-all approach for different computing domains, including general purpose computing. The three architectural styles we have described currently address different needs. The GPU provides mass appeal through highly parallel and accessible accelerator technology, where communication and synchronization are avoided. The CBEA offers a highly versatile architecture, where each core can run a separate program with fast inter-core communication. Finally, for applications relying on very large datasets and complex numerical computations, FPGA-based solutions offer extreme performance, especially the Maxeler dataflow machine.
