Search CORE

856 research outputs found

Exact and heuristic allocation of multi-kernel applications to multi-FPGA platforms

Author: Casu Mario R.
Cortadella Jordi
Lavagno Luciano
Lazarescu Mihai T.
Shan Junnan
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2019
Field of study

FPGA-based accelerators demonstrated high energy efficiency compared to GPUs and CPUs. However, single FPGA designs may not achieve sufficient task parallelism. In this work, we optimize the mapping of high-performance multi-kernel applications, like Convolutional Neural Networks, to multi-FPGA platforms. First, we formulate the system level optimization problem, choosing within a huge design space the parallelism and number of compute units for each kernel in the pipeline. Then we solve it using a combination of Geometric Programming, producing the optimum performance solution given resource and DRAM bandwidth constraints, and a heuristic allocator of the compute units on the FPGA cluster.Peer ReviewedPostprint (author's final draft

Crossref

UPCommons. Portal del coneixement obert de la UPC

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Performance and Power Optimization of Multi-kernel Applications on Multi-FPGA Platforms

Author: SHAN JUNNAN
Publication venue: country:Italy
Publication date: 29/03/2021
Field of study

L'abstract è presente nell'allegato / the abstract is in the attachmen

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

CNN-on-AWS: Efficient Allocation of Multi-Kernel Applications on Multi-FPGA Platforms

Author: Casu Mario R.
Cortadella Jordi
Lavagno Luciano
Lazarescu Mihai T.
Shan Junnan
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/02/2021
Field of study

Multi-FPGA platforms, like Amazon AWS F1, can run in the cloud multi-kernel pipelined applications, like Convolutional Neural Networks (CNNs), with excellent performance and lower energy consumption than CPUs or GPUs. We propose a method to efficiently map these applications on multi-FPGA platforms to maximize application throughput. Our methodology finds, for the given resources, the optimal number of parallel instances of each kernel in the pipeline and their allocation to one or more among the available FPGAs. We obtain this by formulating and solving a mixed-integer, non-linear optimization problem, in which we model the performance of each component and the duration of the phases in which the accelerated computation can be split into, namely: 1) data transfer from a host CPU to the DDR memory of each FPGA, 2) data transfer from FPGA DDR to FPGA on-chip memory, 3) kernel computation on the FPGA, 4) data transfer from FPGA on-chip memory to FPGA DDR, 5) data transfer from FPGA DDR to host. Finding the optimal solution using a Mixed-Integer Non-Linear Programming (MINLP) solver is often highly inefficient. Hence, we provide a fast heuristic method that according to our experiments can be much more efficient than the MINLP solver and finds comparable results. For larger problems (more CNN layers), our heuristic method can quickly find (several thousand times faster) much better solutions than the MINLP solver, even if we run the latter for a very long time

UPCommons. Portal del coneixement obert de la UPC

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

Author: Bouganis Christos-Savvas
Kouris Alexandros
Venieris Stylianos I.
Publication venue
Publication date: 19/02/2018
Field of study

In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201

arXiv.org e-Print Archive

Spiral - Imperial College Digital Repository

Run-time management for future MPSoC platforms

Author: Nollet V.
Publication venue: Technische Universiteit Eindhoven
Publication date: 01/01/2008
Field of study

In recent years, we are witnessing the dawning of the Multi-Processor Systemon- Chip (MPSoC) era. In essence, this era is triggered by the need to handle more complex applications, while reducing overall cost of embedded (handheld) devices. This cost will mainly be determined by the cost of the hardware platform and the cost of designing applications for that platform. The cost of a hardware platform will partly depend on its production volume. In turn, this means that ??exible, (easily) programmable multi-purpose platforms will exhibit a lower cost. A multi-purpose platform not only requires ??exibility, but should also combine a high performance with a low power consumption. To this end, MPSoC devices integrate computer architectural properties of various computing domains. Just like large-scale parallel and distributed systems, they contain multiple heterogeneous processing elements interconnected by a scalable, network-like structure. This helps in achieving scalable high performance. As in most mobile or portable embedded systems, there is a need for low-power operation and real-time behavior. The cost of designing applications is equally important. Indeed, the actual value of future MPSoC devices is not contained within the embedded multiprocessor IC, but in their capability to provide the user of the device with an amount of services or experiences. So from an application viewpoint, MPSoCs are designed to ef??ciently process multimedia content in applications like video players, video conferencing, 3D gaming, augmented reality, etc. Such applications typically require a lot of processing power and a signi??cant amount of memory. To keep up with ever evolving user needs and with new application standards appearing at a fast pace, MPSoC platforms need to be be easily programmable. Application scalability, i.e. the ability to use just enough platform resources according to the user requirements and with respect to the device capabilities is also an important factor. Hence scalability, ??exibility, real-time behavior, a high performance, a low power consumption and, ??nally, programmability are key components in realizing the success of MPSoC platforms. The run-time manager is logically located between the application layer en the platform layer. It has a crucial role in realizing these MPSoC requirements. As it abstracts the platform hardware, it improves platform programmability. By deciding on resource assignment at run-time and based on the performance requirements of the user, the needs of the application and the capabilities of the platform, it contributes to ??exibility, scalability and to low power operation. As it has an arbiter function between different applications, it enables real-time behavior. This thesis details the key components of such an MPSoC run-time manager and provides a proof-of-concept implementation. These key components include application quality management algorithms linked to MPSoC resource management mechanisms and policies, adapted to the provided MPSoC platform services. First, we describe the role, the responsibilities and the boundary conditions of an MPSoC run-time manager in a generic way. This includes a de??nition of the multiprocessor run-time management design space, a description of the run-time manager design trade-offs and a brief discussion on how these trade-offs affect the key MPSoC requirements. This design space de??nition and the trade-offs are illustrated based on ongoing research and on existing commercial and academic multiprocessor run-time management solutions. Consequently, we introduce a fast and ef??cient resource allocation heuristic that considers FPGA fabric properties such as fragmentation. In addition, this thesis introduces a novel task assignment algorithm for handling soft IP cores denoted as hierarchical con??guration. Hierarchical con??guration managed by the run-time manager enables easier application design and increases the run-time spatial mapping freedom. In turn, this improves the performance of the resource assignment algorithm. Furthermore, we introduce run-time task migration components. We detail a new run-time task migration policy closely coupled to the run-time resource assignment algorithm. In addition to detailing a design-environment supported mechanism that enables moving tasks between an ISP and ??ne-grained recon??gurable hardware, we also propose two novel task migration mechanisms tailored to the Network-on-Chip environment. Finally, we propose a novel mechanism for task migration initiation, based on reusing debug registers in modern embedded microprocessors. We propose a reactive on-chip communication management mechanism. We show that by exploiting an injection rate control mechanism it is possible to provide a communication management system capable of providing a soft (reactive) QoS in a NoC. We introduce a novel, platform independent run-time algorithm to perform quality management, i.e. to select an application quality operating point at run-time based on the user requirements and the available platform resources, as reported by the resource manager. This contribution also proposes a novel way to manage the interaction between the quality manager and the resource manager. In order to have a the realistic, reproducible and ??exible run-time manager testbench with respect to applications with multiple quality levels and implementation tradev offs, we have created an input data generation tool denoted Pareto Surfaces For Free (PSFF). The the PSFF tool is, to the best of our knowledge, the ??rst tool that generates multiple realistic application operating points either based on pro??ling information of a real-life application or based on a designer-controlled random generator. Finally, we provide a proof-of-concept demonstrator that combines these concepts and shows how these mechanisms and policies can operate for real-life situations. In addition, we show that the proposed solutions can be integrated into existing platform operating systems

CiteSeerX

Repository TU/e

Pure OAI Repository

Alternative Processor within Threshold: Flexible Scheduling on Heterogeneous Systems

Author: Karia Stavan Satish
Publication venue: RIT Scholar Works
Publication date: 01/03/2017
Field of study

Computing systems have become increasingly heterogeneous contributing to higher performance and power efficiency. However, this is at the cost of increasing the overall complexity of designing such systems. One key challenge in the design of heterogeneous systems is the efficient scheduling of computational load. To address this challenge, this paper thoroughly analyzes state of the art scheduling policies and proposes a new dynamic scheduling heuristic: Alternative Processor within Threshold (APT). This heuristic uses a flexibility factor to attain efficient usage of the available hardware resources, taking advantage of the degree of heterogeneity of the system. In a GPU-CPU-FPGA system, tested on workloads with and without data dependencies, this approach improved overall execution time by 16% and 18% when compared to the second-best heuristic

RIT Scholar Works

Heterogeneity-aware scheduling and data partitioning for system performance acceleration

Author: Yu Teng
Publication venue: The University of St Andrews
Publication date: 14/04/2020
Field of study

Over the past decade, heterogeneous processors and accelerators have become increasingly prevalent in modern computing systems. Compared with previous homogeneous parallel machines, the hardware heterogeneity in modern systems provides new opportunities and challenges for performance acceleration. Classic operating systems optimisation problems such as task scheduling, and application-specific optimisation techniques such as the adaptive data partitioning of parallel algorithms, are both required to work together to address hardware heterogeneity. Significant effort has been invested in this problem, but either focuses on a specific type of heterogeneous systems or algorithm, or a high-level framework without insight into the difference in heterogeneity between different types of system. A general software framework is required, which can not only be adapted to multiple types of systems and workloads, but is also equipped with the techniques to address a variety of hardware heterogeneity. This thesis presents approaches to design general heterogeneity-aware software frameworks for system performance acceleration. It covers a wide variety of systems, including an OS scheduler targeting on-chip asymmetric multi-core processors (AMPs) on mobile devices, a hierarchical many-core supercomputer and multi-FPGA systems for high performance computing (HPC) centers. Considering heterogeneity from on-chip AMPs, such as thread criticality, core sensitivity, and relative fairness, it suggests a collaborative based approach to co-design the task selector and core allocator on OS scheduler. Considering the typical sources of heterogeneity in HPC systems, such as the memory hierarchy, bandwidth limitations and asymmetric physical connection, it proposes an application-specific automatic data partitioning method for a modern supercomputer, and a topological-ranking heuristic based schedule for a multi-FPGA based reconfigurable cluster. Experiments on both a full system simulator (GEM5) and real systems (Sunway Taihulight Supercomputer and Xilinx Multi-FPGA based clusters) demonstrate the significant advantages of the suggested approaches compared against the state-of-the-art on variety of workloads."This work is supported by St Leonards 7th Century Scholarship and Computer Science PhD funding from University of St Andrews; by UK EPSRC grant Discovery: Pattern Discovery and Program Shaping for Manycore Systems (EP/P020631/1)." -- Acknowledgement

University of St. Andrews - Pure

St Andrews Research Repository

Accelerating Smith-Waterman Alignment of Long DNA Sequences with OpenCL on FPGA

Author: Botella Guillermo
De Giusti Armando Eduardo
García Sanchez Carlos
Naiouf Marcelo
Prieto-Matias Manuel
Rucci Enzo
Publication venue
Publication date: 27/04/2017
Field of study

With the greater importance of parallel architectures such as GPUs or Xeon Phi accelerators, the scientific community has developed efficient solutions in the bioinformatics field. In this context, FPGAs begin to stand out as high performance devices with moderate power consumption. This paper presents and evaluates a parallel strategy of the well-known Smith-Waterman algorithm using OpenCL on Intel/Altera’s FPGA for long DNA sequences. We efficiently exploit data and pipeline parallelism on a Intel/Altera Stratix V FPGA reaching upto 114 GCUPS in less than 25 watt power requirements.Publicado en Lecture Notes in Computer Science book series (LNCS, vol. 10209).Facultad de Informátic

Accelerating Smith-Waterman Alignment of Long DNA Sequences with OpenCL on FPGA

Author: Botella Guillermo
De Giusti Armando Eduardo
García Sanchez Carlos
Naiouf Marcelo
Prieto-Matias Manuel
Rucci Enzo
Publication venue
Publication date: 07/10/2019
Field of study

Run Time Approximation of Non-blocking Service Rates for Streaming Systems

Author: Beard Jonathan C.
Chamberlain Roger D.
Publication venue
Publication date: 12/04/2015
Field of study

Stream processing is a compute paradigm that promises safe and efficient parallelism. Modern big-data problems are often well suited for stream processing's throughput-oriented nature. Realization of efficient stream processing requires monitoring and optimization of multiple communications links. Most techniques to optimize these links use queueing network models or network flow models, which require some idea of the actual execution rate of each independent compute kernel within the system. What we want to know is how fast can each kernel process data independent of other communicating kernels. This is known as the "service rate" of the kernel within the queueing literature. Current approaches to divining service rates are static. Modern workloads, however, are often dynamic. Shared cloud systems also present applications with highly dynamic execution environments (multiple users, hardware migration, etc.). It is therefore desirable to continuously re-tune an application during run time (online) in response to changing conditions. Our approach enables online service rate monitoring under most conditions, obviating the need for reliance on steady state predictions for what are probably non-steady state phenomena. First, some of the difficulties associated with online service rate determination are examined. Second, the algorithm to approximate the online non-blocking service rate is described. Lastly, the algorithm is implemented within the open source RaftLib framework for validation using a simple microbenchmark as well as two full streaming applications.Comment: technical repor

arXiv.org e-Print Archive

Crossref