Abstract-We show the design of specialized compute fabrics that maintain the efficiency of full custom hardware while providing enough flexibility to execute a whole class of coarse-grain linear algebra operations. The broad vision of this project is to develop integrated and specialized hardware/software solutions that are co-optimized and co-designed across all layers ranging from the basic hardware foundations all the way to the application through standard linear algebra packages.
I. INTRODUCTION AND BACKGROUND
Power consumption is becoming the limiting factor for continued semiconductor technology scaling. Future chips will use heterogeneous solutions to cope with "dark" silicon by using custom IP cores, thereby keeping the chip within the power budget. Application-specific on-chip hardware accelerators can provide orders of magnitude improvements in both power and performance [1]. However, full custom design is expensive in many respects. The question that we try to answer in this project is: can we design specialized compute fabrics that maintain the efficiency of full custom hardware while providing enough flexibility to execute a whole class of coarse-grain operations?
We aim to answer this question for the domain of linear algebra computations. The reason for the interest in implementing matrix-matrix multiplication and related kernels is that these operations are what deliver high performance to many crucial applications [2]. In the proposed work, we pursue this goal by asking and answering questions on how to achieve high-performance, low-power hardware implementations for this important class of operations. We expect the insights of this work to impact future computational advances, both in scientific high-performance computing and in the embedded, mobile, or cyber-physical domains.
We propose a shift in the way transistors are employed. Efficiency as measured by the energy and area per operation has to be the new optimization goal. So, what if we start from a clean slate specifically for the domain of matrix computations? How can we best utilize the given blanket of transistors? What is the right level of flexibility versus optimality and how can we realize existing methods for efficient basic linear algebra operations directly in specialized hardware and software? We propose to study these questions by investigating the hardware and software foundations for novel classes of linear algebra compute fabrics.
The broad vision of this project is to develop integrated and specialized hardware/software solutions that are co-optimized and co-designed across all layers ranging from the basic hardware foundations all the way to the application programming support through standard linear algebra packages. In the process, we aim to study upper limits on the performance/power ratios that can be achieved, and to investigate both the limitations of current architectures and the opportunities for targeted improvements in future architectures specially designed to efficiently support this crucial class of operations. Further, the study will focus on task scheduling and load balancing for more general-purpose applications in which the LAP is part of a larger heterogeneous system and is employed to accelerate specific parts of a larger application.
We have provided evidence that these opportunities indeed exist for the domain of dense linear algebra computations [3], [4]. The conclusion is that a Linear Algebra Processor (LAP), implemented in current 45nm technology at 1.4 GHz, can achieve more than 1200 GFLOPS in single precision and 600 GFLOPS in double precision general matrix-matrix multiply (SGEMM/DGEMM) for less than 25 Watts. This represents orders of magnitude higher efficiency than current conventional CPU and GPU architectures.
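To make the efficiency claim concrete, the quoted throughput and power figures can be turned into GFLOPS per Watt. The short sketch below only divides the numbers stated above; the helper name `gflops_per_watt` is illustrative, and because the text gives "more than" for throughput and "less than" for power, the results should be read as lower bounds rather than measured values.

```python
# Figures quoted in the text: throughput in GFLOPS and a power ceiling in Watts.
SGEMM_GFLOPS = 1200.0
DGEMM_GFLOPS = 600.0
POWER_CEILING_W = 25.0


def gflops_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency expressed as throughput divided by power."""
    return gflops / watts


# Throughput is a lower bound and power is an upper bound,
# so both ratios are lower bounds on the efficiency.
sp_lower_bound = gflops_per_watt(SGEMM_GFLOPS, POWER_CEILING_W)
dp_lower_bound = gflops_per_watt(DGEMM_GFLOPS, POWER_CEILING_W)

print(f"SGEMM: more than {sp_lower_bound:.0f} GFLOPS/W")
print(f"DGEMM: more than {dp_lower_bound:.0f} GFLOPS/W")
```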
II. ARCHITECTURE AND PROGRAMMING MODEL
We target an ASIC implementation, rather than programmable hardware such as FPGAs, so that we can fully exploit state-of-the-art technologies. Within this context, our goal is to develop a fixed architecture that avoids the inefficiencies of general-purpose processors and GPUs. We remove the overheads of the program pipeline and arrange the processing elements in a more efficient way. Finally, in contrast with full custom specialized accelerators, our architecture is flexible enough to execute different matrix operations while retaining the same level of efficiency. As will be described later (Section III), the architecture is carefully co-designed with the algorithm choices, picking the right algorithm for the architecture and vice versa.
The Linear Algebra Core (LAC) architecture consists of a 2D array of n_r × n_r Processing Elements (PEs), with n_r = 4 in the figure. Each PE has a single-cycle Multiply-ACcumulate (MAC) unit and local Static Random-Access Memory (SRAM). PEs are connected by simple, low-overhead horizontal and vertical broadcast buses. LAC control is distributed: each PE has a state machine that drives a predetermined, hardcoded sequence of communication, storage, and computation steps for each supported BLAS operation.
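The way a broadcast-bus PE array can carry out matrix multiplication may be easier to see in a small software model. The sketch below is an assumption-laden illustration, not the hardware design itself: it models an n_r × n_r grid in which PE (i, j) keeps one accumulator, each step places one column of A on the row buses and one row of B on the column buses, and every PE performs a single MAC. The function name `lac_gemm` and the list-of-lists data layout are hypothetical, and local SRAM, timing, and the per-PE state machines are not modeled.

```python
NR = 4  # PEs per side of the array, matching n_r = 4 in the text


def lac_gemm(A, B, C):
    """Return C + A @ B, computed as a sequence of rank-1 updates.

    A is NR x k, B is k x NR, C is NR x NR (lists of lists).
    PE (i, j) owns the accumulator for element (i, j) of the result.
    """
    k = len(B)
    acc = [row[:] for row in C]  # one accumulator per PE
    for p in range(k):
        # Row buses: every PE in row i receives A[i][p].
        row_bus = [A[i][p] for i in range(NR)]
        # Column buses: every PE in column j receives B[p][j].
        col_bus = [B[p][j] for j in range(NR)]
        # All PEs perform one multiply-accumulate in the same step.
        for i in range(NR):
            for j in range(NR):
                acc[i][j] += row_bus[i] * col_bus[j]
    return acc


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6], [7, 8]]   # 4 x 2
    B = [[1, 0, 2, 0], [0, 1, 0, 2]]       # 2 x 4
    C = [[0] * NR for _ in range(NR)]
    for row in lac_gemm(A, B, C):
        print(row)
```

In this model each of the k steps is one rank-1 update of the whole NR × NR block, and no operand is exchanged through a shared register file: each value on a bus is consumed directly by the PEs that need it.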
We use row and column buses in a 2D arrangement of PEs, as shown in Figure 1. This arrangement naturally maps optimized matrix multiplication kernels onto the broadcast buses and, by fully exploiting the communication network, eliminates the need for communication through a register file [3]. By distributing the control, we also remove the overheads of control and data communication between processing elements. Further, we exploit the concept of separating the memory interface from the communication interface by streaming the data through a particular channel to the core. We first examined the feasibility of our design methodology by mapping Cholesky factorization onto the LAC, carefully selecting the right variant of the algorithm.
Figure 2 shows how a user application can use a LAP. Libraries such as libflame [5] and LAPACK provide built-in software layers that decompose big problems into smaller subproblems. Routines with higher-level functionality (e.g., an LU factorization) are called from the host application; internal routines then recursively break larger problems into smaller subroutine calls to BLAS and Communication & Packing routines until the problems reach a certain size. These small problems (for example, 128 × 128) are the basic units of data on which atomic computations are performed. In a typical general-purpose solution, these kernels are all implemented very efficiently in the target machine's assembly code [6].
One can view the LAP as an accelerator for these atomic kernel operations, and the atomic size of the kernels depends on the LAP kernel sizes. Instead of calling the assembly-coded kernel on the host processor, the necessary information, including the data location address and the type of kernel, is passed to the LAP through the device driver. After finishing the operation, the LAP puts the computed data back in memory. The LAP can overlap communication with computation by pipelining multiple operations.
Here, scheduling and load balancing play a key role. A good partitioning of tasks between resources yields a significant performance gain while saving power.
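The decomposition of a large problem into atomic kernel calls can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: `blocked_gemm` and `gemm_kernel` are hypothetical names, not the libflame, LAPACK, or LAP driver interfaces, and the kernel here runs on the host as a stand-in for work that would be handed to an accelerator. The returned list of descriptors mimics the kind of information (kernel type and data location) that the text says is passed through the device driver.

```python
def gemm_kernel(A, B, C, i0, j0, p0, tile):
    """Stand-in for one atomic kernel: update one tile of C in place."""
    for i in range(i0, min(i0 + tile, len(C))):
        for j in range(j0, min(j0 + tile, len(C[0]))):
            s = 0
            for p in range(p0, min(p0 + tile, len(B))):
                s += A[i][p] * B[p][j]
            C[i][j] += s


def blocked_gemm(A, B, C, tile=128, kernel=gemm_kernel):
    """Compute C += A @ B as a sequence of tile-sized kernel calls.

    Returns the list of issued call descriptors (kernel type, offsets).
    """
    m, n, k = len(C), len(C[0]), len(B)
    calls = []
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                # In an accelerated system this call would be replaced by
                # passing a descriptor to the device; here it runs locally.
                kernel(A, B, C, i0, j0, p0, tile)
                calls.append(("GEMM", i0, j0, p0))
    return calls


if __name__ == "__main__":
    n = 5
    A = [[i + j for j in range(n)] for i in range(n)]
    B = [[i - j for j in range(n)] for i in range(n)]
    C = [[0] * n for _ in range(n)]
    issued = blocked_gemm(A, B, C, tile=2)
    print(len(issued), "kernel calls issued")
```

Because the tile updates for different (i0, j0) pairs touch disjoint parts of C, they are independent tasks; that independence is what a scheduler could exploit when distributing work across resources.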
III. ALGORITHMS
In this section, we briefly outline the capabilities of our proposed architecture and the applications that we have mapped onto it. We describe how small modifications on the architecture side, guided by a deep understanding of the algorithm side, can yield more flexibility while maintaining efficiency.
A. Level-3 BLAS
The details of the algorithm/architecture implementation of the GEMM operation on the LAC are presented in [3]. There, detailed core power, area, utilization, and bandwidth trade-offs are presented. The LAC can perform GEMM very efficiently by carefully choosing the right algorithm to maintain locality. In [4], we extended our analysis to a whole chip, called the Linear Algebra Processor (LAP), with multiple LACs, and analyzed the on-chip and off-chip system memory hierarchy trade-offs for the mentioned metrics. In that study, we also built a power model using state-of-the-art reports and methodologies, and presented a power breakdown of our LAP compared to the CPUs and GPUs available in the literature. We developed analytical formulae that predict the utilization of the GEMM operation at different layers of the memory hierarchy for CPUs [7] and GPUs [4]. The formulae were used to identify the sources of low utilization in current GPUs' memory hierarchies.
Figure 3. 45nm GEMM efficiency of various cores [3], [8].
We further showed the details of mapping the rest of the level-3 BLAS operations in [8]. That work shows that, with a negligible modification to the core architecture (adding the reciprocal unit), a LAC can perform all of the level-3 BLAS operations with high utilization and negligible loss in efficiency.
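One way to see why a reciprocal unit helps with the remaining level-3 BLAS operations is the triangular solve (TRSM), which needs a division by each diagonal element. The sketch below is an illustration under that assumption, not the mapping described in [8]: it solves L X = B for a lower-triangular L using one reciprocal per diagonal entry, so that everything else reduces to multiply-accumulate work. The function name `trsm_lower` is hypothetical.

```python
def trsm_lower(L, B):
    """Solve L @ X = B for X, with L lower triangular (lists of lists).

    Uses one reciprocal per diagonal element of L; all remaining
    arithmetic is multiplications and subtractions.
    """
    n = len(L)
    m = len(B[0])
    X = [row[:] for row in B]
    for i in range(n):
        inv_diag = 1.0 / L[i][i]  # the only division for this row
        for j in range(m):
            s = X[i][j]
            for p in range(i):
                s -= L[i][p] * X[p][j]  # MAC-style update with earlier rows
            X[i][j] = s * inv_diag
    return X


if __name__ == "__main__":
    L = [[2.0, 0.0], [1.0, 4.0]]
    B = [[2.0, 4.0], [9.0, 6.0]]
    print(trsm_lower(L, B))
```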
B. Matrix Factorizations
Cholesky, LU (with partial pivoting), and QR factorization are fundamental algorithms for solving linear systems of equations. These operations contain a selection of level-3 BLAS operations together with their own inner kernels. We redesigned the architecture of the floating-point unit in our PEs to make the core flexible enough to perform these factorizations. Our extensions to the design of the floating-point unit remove the inherent complexities imposed by limitations in the floating-point representation of real numbers [9]. The modified core is equipped with reciprocal and inverse-square-root functionality.
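As an illustration of how inverse-square-root functionality fits a factorization kernel, the sketch below computes an unblocked Cholesky factorization in which the only special function is one inverse square root per column. This is a textbook right-looking variant written for clarity, not necessarily the variant chosen for the LAC, and the function name `cholesky_rsqrt` is hypothetical. The diagonal entry is recovered as d × (1/√d), and the column scaling becomes a multiplication rather than a division.

```python
import math


def cholesky_rsqrt(A):
    """Return lower-triangular L with L @ L^T = A (A symmetric positive definite).

    Only the lower triangle of A is read. Each column uses a single
    inverse square root; all other work is multiply-accumulate.
    """
    n = len(A)
    W = [row[:] for row in A]              # working copy of A
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = W[j][j]
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        rsqrt = 1.0 / math.sqrt(d)         # inverse square root of the pivot
        L[j][j] = d * rsqrt                # sqrt(d) = d * (1 / sqrt(d))
        for i in range(j + 1, n):
            L[i][j] = W[i][j] * rsqrt      # scale the column by multiplication
        # Rank-1 update of the trailing lower triangle.
        for i in range(j + 1, n):
            for k in range(j + 1, i + 1):
                W[i][k] -= L[i][j] * L[k][j]
    return L


if __name__ == "__main__":
    A = [[4.0, 2.0], [2.0, 5.0]]
    print(cholesky_rsqrt(A))
```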
IV. RESULTS
In this section, we present selected results from our publications on this project. We start with core-level exploration and then continue with system-level results.
A. Core level results
First, we present power and area efficiency comparisons between the LAC and a few other architecture cores in 45nm technology (see Figure 3). A LAC performs an order of magnitude better in power efficiency for DGEMM than a low-power ARM processor or Fermi's SM cores. Figure 4 shows one example of the trade-off between the core's local store size and the off-core bandwidth, and its effect on the core utilization for representative level-3 BLAS operations. We observe that the rest of the level-3 BLAS kernels maintain utilization comparable to GEMM, as also shown in Figure 9.
B. System level results
Here, we present a few selected charts and a table from our analysis and from the power model we have developed. We carefully designed the memory hierarchy around our cores and measured the power and area consumption in different scenarios for the GEMM operation and different problem sizes.
Figure 5. LAC efficiency for level-3 BLAS algorithms at 1.1 GHz [8].
bandwidth and the size of on-chip memory. The performance could reach up to 600 GFLOPS while requiring very low off-chip bandwidth.
Having studied the system design space, we combined the analyses with a power model that we developed. This power model uses state-of-the-art research tools and is used to derive the power breakdown for different architectures. Figure 7 shows the power breakdown for both the LAP and Nvidia's Fermi GTX480 chip. The significance of this graph is that it shows the main sources of energy waste in more general architectures compared to an application-specific design.
One main issue in system design is the effect of the memory hierarchy on efficiency. While a core or an accelerator IP might be very power efficient, the memory hierarchy needed to optimize the data traffic to such a core consumes power and reduces the overall efficiency. Figure 8 shows an analysis of core- and chip-level efficiencies for the studied architectures and for a LAP in which we vary the number of cores to match the throughput of existing architectures.
Finally, Figure 9 summarizes key metrics for various systems running GEMM as a representative matrix computation. For this table, we included estimates for our 15-core LAP design and for all other architectures scaled to 45nm technology. We note that for a single-precision LAP at around 1.4 GHz clock frequency, the estimated performance/power ratio is an order of magnitude better than for GPUs. The double-precision LAP design yields around 30 times higher efficiency compared to CPUs. The power density is also significantly lower, as most of the LAP area is used for the local store. The performance/area ratio of our LAP is in all cases equal to or better than that of other processors.
V. CONCLUSIONS AND OUTLOOK
This work presents a detailed analysis of algorithm-architecture design for the linear algebra class of applications. We have developed a cycle-accurate simulator that performs such operations. However, one main challenge is to examine such a design and its effectiveness in a heterogeneous system, and to perform task scheduling and load balancing. We plan to extend our cycle-accurate simulator and integrate it with other multi-core simulators to study detailed design trade-offs at both the core and chip levels. This integration will allow cycle-accurate modeling of the dynamic power consumption of different design choices. We also plan to research the possibility of mapping more complex linear algebra operations, or a different class of applications, by relaxing the architectural design axis.
Figure 9. 45nm scaled performance and area of various systems running GEMM for single-precision (top) and double-precision (bottom) [3].
