Abstract-The Parallella is a hybrid computing platform that came into existence as the result of a Kickstarter project by Adapteva. It is composed of the high-performance, energy-efficient, manycore Epiphany chip (used as a coprocessor) and one Zynq-7000 series chip, which serves as the main processor, normally runs a regular Linux OS, and implements "glue logic" in its internal FPGA to communicate with the many interfaces of the Parallella. In this paper an Epiphany-accelerated BLAS library for the Parallella platform is presented (it could also be suitable for similar hybrid platforms that include the Epiphany chip as a coprocessor). The BLIS framework was used for the actual instantiation of the BLAS. Previous implementations of matrix-matrix multiplication on this platform achieved very good performance inside the Epiphany chip (up to 85% of peak), but not for the complete Parallella platform (due to inter-chip data transfer bandwidth limitations). The main purpose of this work was to get closer to practical Linear Algebra applications for the entire Parallella platform, with Scientific Computing and, in the long run, Big Data applications in view.
Introduction
In recent times there has been interest in the use of hybrid platforms for scientific computation in large clusters. RISC-based clusters, and ARM-based ones in particular, are also of interest, because of the low power consumption that they can achieve and because new consumer products have made them ubiquitous, lowering their cost. The Parallella platform [1] combines both: it is a hybrid platform based on an ARM CPU, with a manycore RISC device (the Epiphany) as a co-processor [2]. In this work the real and practical possibilities of the Parallella platform for Scientific Computing are explored. The Linpack benchmark was chosen to be run on a cluster of Parallella nodes, but it was found that there was no Epiphany-accelerated BLAS implementation for the platform. Therefore, a BLAS library was "instantiated" with the BLIS framework [3], after writing an Epiphany-accelerated sgemm micro-kernel for it. The micro-kernel uses a "SUMMA-like" algorithm [4], and improves on the performance of current implementations (which use Cannon's algorithm [5]). The achieved matrix-matrix multiplication performance is the best for this platform presently known to the author [6] [7] [8], if the host processing and off-chip data transfer are taken into account.
The Parallella board
The Parallella board [2] has one Zynq 7010 or 7020 chip acting as "the host processor", one Epiphany chip acting as a "co-processor", and a 1 GB DRAM chip, of which 32 MB are accessible to both the host and the coprocessor (shared DRAM). It also contains many interfaces, such as Ethernet, USB, and a slot for an SD card. The Zynq SoC [9] has a dual-core ARM Cortex-A9 CPU, an embedded FPGA, and many on-chip interfaces. The FPGA is used to implement the "elink" that is needed to communicate with the Epiphany chip. The Epiphany chip [10] consists of a 2D array of cores ("eCores") connected by a mesh network-on-chip. Each core contains a RISC CPU, a DMA engine, 32 KB of local memory and a network interface. The option chosen here to program the Parallella architecture was the eSDK [11] provided by Adapteva, which consists of a series of C functions that allow communication between the host and the Epiphany, and between eCores within the Epiphany. It is important to note that the Epiphany kernels can be written in C (although some assembly code may be needed to achieve the best performance).
Software Architecture

BLIS
BLIS is a portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries [3]. When instantiated, it generates a new BLAS-like API, which its creators designed to improve on the old BLAS library, and it also generates the classic FORTRAN BLAS library. A micro-kernel was written to accelerate the "sgemm" function by offloading the main calculations to the Epiphany coprocessor. When a BLAS user calls the "sgemm" function, the BLIS code partitions the input and output matrices and sends small multiplications of predefined size to be performed by the micro-kernel.
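The partitioning described above can be illustrated with a minimal sketch in plain Python. It is not BLIS code: the names `micro_kernel` and `blocked_sgemm` and the tile sizes `mr` and `nr` are hypothetical, and the real framework also blocks the k dimension and packs its operands, which is omitted here.

```python
def micro_kernel(k, alpha, a_panel, b_panel, beta, c_tile):
    """Hypothetical stand-in for the accelerated micro-kernel:
    c_tile := beta * c_tile + alpha * (a_panel x b_panel), updated in place."""
    rows, cols = len(c_tile), len(c_tile[0])
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for p in range(k):
                acc += a_panel[i][p] * b_panel[p][j]
            c_tile[i][j] = beta * c_tile[i][j] + alpha * acc


def blocked_sgemm(alpha, a, b, beta, c, mr, nr):
    """Split C into tiles of at most mr x nr and hand each tile to the micro-kernel."""
    m, k, n = len(a), len(b), len(b[0])
    for i0 in range(0, m, mr):
        for j0 in range(0, n, nr):
            a_panel = a[i0:i0 + mr]                      # (<= mr) x k panel of A
            b_panel = [row[j0:j0 + nr] for row in b]     # k x (<= nr) panel of B
            c_tile = [row[j0:j0 + nr] for row in c[i0:i0 + mr]]
            micro_kernel(k, alpha, a_panel, b_panel, beta, c_tile)
            for di, row in enumerate(c_tile):            # copy the tile back into C
                c[i0 + di][j0:j0 + nr] = row
    return c
```

In this sketch the framework owns the loops over tiles, and only `micro_kernel` would need to be replaced by a call that offloads work to the coprocessor.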
A Separate Linux process
To save the "initialize/finalize" time required in every call of the micro-kernel, the initialization and finalization code was placed in an entirely separate process that runs as a "Linux service". With that solution, there is a Host-Coprocessor shared RAM and also a Host-Host shared RAM. They will be called the HC-RAM and the HH-RAM, respectively. If it were possible to use the same space for both kinds of communication some time could be saved, but that has not yet been implemented in this work.
We will call ir and or the ratios of the input loading time (host) and of the postprocessing time (host) to the total time, respectively. The "sgemm inner micro-kernel" follows a "SUMMA-like" scheme, and then does some postprocessing. The input matrices (a1, b1) are divided into blocks of KSUB columns and KSUB rows, respectively. The main loop iterates over those blocks, sending one block of size (m × KSUB) from input a1 and one block of size (KSUB × n) from input b1 on each iteration to an "Epiphany Task". The task performs the outer product of each column of a_ti with the corresponding row of b_ti, and sums those products. The result is the partial result matrix c_ti (for task i), which can then be summed by the (host) sgemm inner micro-kernel or accumulated in the coprocessor local memory, depending on the implementation. After that, the micro-kernel multiplies the resulting matrix by α and adds β · c_in to produce the final result of the sgemm micro-kernel. Using a "command" variable, the host micro-kernel can tell the coprocessor to do the initialization steps only once, then accumulate the results of many KSUB blocks, and send the final result back in the last iteration. Thus a lot of time is saved (most importantly, the time needed to "send the results back"). The disadvantage of accumulating is that the results (of size m × n) must be stored entirely in the local memory, which limits the maximum possible size of m and n. Larger m and/or n are needed to reduce the input time ratio (ir), so there is a clear trade-off between improving the or and the ir ratios.
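The host-side loop over KSUB blocks, with accumulation on the coprocessor and a single final read-back, can be sketched as follows. This is a plain-Python simulation under stated assumptions: `FakeCoprocessor`, its `task` method and the `first`/`last` flags are hypothetical stand-ins for the Epiphany Task and the "command" variable, whose actual encoding is not given in the text.

```python
class FakeCoprocessor:
    """Hypothetical stand-in for the Epiphany side: holds an m x n accumulator."""

    def __init__(self):
        self.acc = None

    def task(self, a_blk, b_blk, first, last):
        m, ksub, n = len(a_blk), len(b_blk), len(b_blk[0])
        if first:  # "initialize" command: clear the accumulator once
            self.acc = [[0.0] * n for _ in range(m)]
        for p in range(ksub):  # one outer product per column of a_blk / row of b_blk
            for i in range(m):
                for j in range(n):
                    self.acc[i][j] += a_blk[i][p] * b_blk[p][j]
        # Only the last command sends the accumulated result back to the host.
        return [row[:] for row in self.acc] if last else None


def sgemm_inner(alpha, a1, b1, beta, c_in, ksub, coproc):
    """Return alpha * (a1 x b1) + beta * c_in, feeding the coprocessor KSUB-sized blocks."""
    m, k, n = len(a1), len(b1), len(b1[0])
    starts = list(range(0, k, ksub))
    result = None
    for idx, k0 in enumerate(starts):
        a_blk = [row[k0:k0 + ksub] for row in a1]   # m x KSUB block of a1
        b_blk = b1[k0:k0 + ksub]                    # KSUB x n block of b1
        result = coproc.task(a_blk, b_blk,
                             first=(idx == 0),
                             last=(idx == len(starts) - 1))
    # Host postprocessing: scale by alpha and add beta * c_in.
    return [[alpha * result[i][j] + beta * c_in[i][j] for j in range(n)]
            for i in range(m)]
```

The sketch makes the trade-off visible: the accumulator is m × n regardless of KSUB, so accumulating on the coprocessor avoids repeated read-backs but fixes the local memory cost of the result.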
Epiphany kernel
Due to the memory restrictions, it is very important to organize the code and buffers in the local memory carefully. Figure 1 shows the local memory map for one core in this implementation.
Epiphany Task. The outer layer of the Epiphany kernel will be called an "Epiphany Task". Again, the algorithm is "SUMMA-like" [4]. The input is divided among the cores in blocks of size (m × KSUB/CORES) for a_ti and (KSUB/CORES × n) for b_ti (those blocks will be called a_ti-cj and b_ti-cj). Each core calculates the corresponding outer products and sums KSUB/CORES of them to obtain a partial result (c_ti-cj) that, in turn, is summed with the partial results of the other cores (resulting in c_ti). Usually, the solution would be to move partial input data among the cores, as moving results would be more costly, but in this case (due to some special characteristics of the Epiphany) the implementation moves the partial results instead. An inter-core pipeline (figure 2) was designed to move those intermediate results. Epiphany Column Iterations, the Epiphany Task is completed.
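The division of one task among the cores can be sketched in plain Python as below. The function names are hypothetical, the cores are simulated sequentially, and the final sum over cores is done in one step here, whereas the implementation described in this paper moves partial results through an inter-core pipeline. The sketch assumes that KSUB is a multiple of the number of cores.

```python
def core_partial(a_slice, b_slice):
    """Partial result of one core: sum of the outer products of each column
    of a_slice with the matching row of b_slice."""
    m, kk, n = len(a_slice), len(b_slice), len(b_slice[0])
    c = [[0.0] * n for _ in range(m)]
    for p in range(kk):
        for i in range(m):
            for j in range(n):
                c[i][j] += a_slice[i][p] * b_slice[p][j]
    return c


def epiphany_task(a_t, b_t, cores):
    """Split the KSUB dimension among the cores and sum their partial results."""
    ksub = len(b_t)
    assert ksub % cores == 0, "sketch assumes KSUB is a multiple of CORES"
    per_core = ksub // cores
    m, n = len(a_t), len(b_t[0])
    c_t = [[0.0] * n for _ in range(m)]
    for core in range(cores):
        lo, hi = core * per_core, (core + 1) * per_core
        a_slice = [row[lo:hi] for row in a_t]   # m x (KSUB/CORES) block for this core
        b_slice = b_t[lo:hi]                    # (KSUB/CORES) x n block for this core
        part = core_partial(a_slice, b_slice)
        for i in range(m):
            for j in range(n):
                c_t[i][j] += part[i][j]
    return c_t
```

Because each core only needs its own slices of the inputs, no input data has to travel between cores; only the m × n partial results do.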
Epiphany K Iteration. Each "Epiphany Column Iteration" is divided into CORES "Epiphany K Iterations". On each Epiphany K Iteration a partial result block of size m × NSUB is calculated by each core, and sent to the next core in the defined pipeline (figure 2) to be accumulated with other partial results. For sending and receiving the partial results, two buffers are defined and are interchanged on even and odd K iterations. Before and after every K Iteration a barrier is used to synchronize the cores.
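One way such a pipeline could work is sketched below as a sequential plain-Python simulation. The schedule is an assumption, not taken from the text: the cores are arranged in a ring, core j works on column block (j - step) mod CORES at each K Iteration, and each core ends up holding one finished m × NSUB block. The names `sub_matmul` and `column_iteration` and the buffer layout are hypothetical.

```python
def sub_matmul(a_slice, b_slice, col0, nsub):
    """Single-core product of a_slice with columns col0..col0+nsub-1 of b_slice."""
    m, kk = len(a_slice), len(b_slice)
    out = [[0.0] * nsub for _ in range(m)]
    for p in range(kk):
        for i in range(m):
            for j in range(nsub):
                out[i][j] += a_slice[i][p] * b_slice[p][col0 + j]
    return out


def column_iteration(a_slices, b_slices, nsub):
    """Simulate CORES K Iterations over a ring of cores (assumed schedule).

    a_slices[j] and b_slices[j] are the input slices owned by core j.
    Returns one finished m x NSUB block per core.
    """
    cores = len(a_slices)
    # Two receive buffers per core, swapped on even and odd K Iterations.
    bufs = [[None, None] for _ in range(cores)]
    for step in range(cores):
        rd, wr = step % 2, (step + 1) % 2
        for j in range(cores):
            blk = (j - step) % cores              # column block handled this step
            part = sub_matmul(a_slices[j], b_slices[j], blk * nsub, nsub)
            if step > 0:                          # add what the previous core sent
                incoming = bufs[j][rd]
                part = [[x + y for x, y in zip(r1, r2)]
                        for r1, r2 in zip(part, incoming)]
            bufs[(j + 1) % cores][wr] = part      # "send" to the next core
        # A barrier would synchronize the cores here before buffers are swapped.
    final = cores % 2
    return [bufs[blk][final] for blk in range(cores)]
```

Reading from one buffer while the neighbour writes into the other is what makes the even/odd interchange necessary: within a K Iteration no core overwrites data that another core is still reading.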
subMatmul.
The function "subMatmul" can be thought of as "the single-core version of the Epiphany K Iteration". It is a single-core matrix-matrix multiplication function that accepts inputs of size m × KSUB/CORES for a and KSUB/CORES × n for b, and outputs the resulting m × NSUB product matrix. This function is implemented in assembly language. The implementation is strongly based on that of previous work [6] (see section VII of [6] for details).
Results
Custom Tests
All the processing times in these tests were measured on the host side with functions from the "time.h" C library. The results can be seen in tables 1 and 2.
BLIS Tests
The BLAS library was compiled with the micro-kernel, and the BLIS standard tests were run. The micro-kernel used is the one that calls a different OS process to calculate the results. As can be seen in table 3, the results are very similar to those of the "custom" tests.
Table 4 shows the results for the whole sgemm function with m = n = K = 4096. The performance penalty with respect to the kernel performance is not too large.
As the version of the HPL Linpack Benchmark code that was available to the author uses Double Precision, a "dgemm" kernel was implemented which sends the data to the "sgemm inner kernel". The precision of the results is, therefore, expected to be close to that of Single Precision. In the process, some performance was lost (see table 5 ). That version was called the "false dgemm". 
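The idea behind the "false dgemm" can be sketched in plain Python: the double precision inputs are rounded to single precision, the product is computed with single precision arithmetic, and the result is handed back as doubles. The helper names are hypothetical, and `struct` is used only to emulate single precision rounding; the actual kernel forwards the data to the "sgemm inner kernel" instead.

```python
import struct


def to_f32(x):
    """Round a Python float (double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sgemm_f32(alpha, a, b, beta, c):
    """alpha * (a x b) + beta * c, rounding every intermediate to single precision."""
    m, k, n = len(a), len(b), len(b[0])
    out = []
    for i in range(m):
        row = []
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc = to_f32(acc + to_f32(a[i][p] * b[p][j]))
            row.append(to_f32(to_f32(alpha * acc) + to_f32(beta * c[i][j])))
        out.append(row)
    return out


def false_dgemm(alpha, a, b, beta, c):
    """dgemm-shaped wrapper: doubles in, doubles out, single precision inside."""
    a32 = [[to_f32(x) for x in row] for row in a]
    b32 = [[to_f32(x) for x in row] for row in b]
    c32 = [[to_f32(x) for x in row] for row in c]
    return sgemm_f32(to_f32(alpha), a32, b32, to_f32(beta), c32)
```

For inputs such as 0.1, which are not exactly representable in single precision, the result of `false_dgemm` differs from the true double precision product by roughly single precision rounding error, which is the loss of accuracy mentioned above.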
HPL Linpack Tests
Finally, the High Performance Linpack Benchmark [12] was run with the parameters and results specified in table 7. It was run with a process grid of 1 × 1, in one node.
The results of the HPL benchmark showed that the sgemm implementation works correctly, up to Single Precision, but the performance is far lower than that of the sgemm operation alone. The lower performance could be due to a poor choice of algorithm parameters for the benchmark, or to the influence of the other BLAS functions that are called.
Conclusion and Future Work
A complete, Epiphany-accelerated BLAS library was instantiated using the BLIS framework. The performance achieved by the matrix-matrix multiplication kernel was better than that of any previous implementation (to the author's knowledge), when program loading and initialization are not taken into account (which is the standard in previous work [6] [7] [8]). For the more practical kernel, which is used as a Linux service, the performance is lower due to the interprocess communication (which could, most likely, be improved), but it is still an interesting result for a first BLAS implementation. The results for the High Performance Linpack are far lower than expected, given the sgemm results. That may be due to a poor choice of parameters for the algorithm, or to the low performance of the Level-2 BLAS functions.
There are many possible improvements for this implementation. Some of them are discussed below.
A "b-streaming" Solution
One way to improve the ir ratio would be to use a solution in which the values of B are only copied to the local memory as needed. That solution could make use of more free space for the input A.
An "output-streaming" Solution
If the output is not stored entirely in the local memory, it is possible to use larger values for m and n. In that kind of solution, though, it is not possible to accumulate results for more than one KSUB block in the coprocessor. Shrinking RES2 makes some more space available for the input A. It is also possible to increase the value of m by reducing the value of KSUB, but then more partial result sums have to be done in the host. Unfortunately, host access to the shared portion of the RAM (HC-RAM) is very slow (at the moment it is accessed with the eSDK "e_read" function), which limits that kind of improvement (larger m and n mean a better ir ratio). The "output-streaming" implementation was what the author originally had in mind when implementing the "SUMMA-like" algorithm.
NEON or FPGA acceleration
The NEON SIMD engine in the ARM host or the FPGA in the Zynq could be used both for the Level-2 BLAS operations and for summing the partial results produced by the Epiphany.
