976 research outputs found

    Numerical Representation of Directed Acyclic Graphs for Efficient Dataflow Embedded Resource Allocation

    Get PDF
    International audienceStream processing applications running on Heterogeneous Multi-Processor Systems on Chips (HMPSoCs) require efficient resource allocation and management, both at compile-time and at runtime. To cope with modern adaptive applications whose behavior can not be exhaustively predicted at compile-time, runtime managers must be able to take resource allocation decisions on-the-fly, with a minimum overhead on application performance. Resource allocation algorithms often rely on an internal modeling of an application. Directed Acyclic Graph (DAGs) are the most commonly used models for capturing control and data dependencies between tasks. DAGs are notably often used as an intermediate representation for deploying applications modeled with a dataflow Model of Computation (MoC) on HMPSoCs. Building such intermediate representation at runtime for massively parallel applications is costly both in terms of computation and memory overhead. In this paper, an intermediate representation of DAGs for resource allocation is presented. This new representation shows improved performance for run-time analysis of dataflow graphs with less overhead in both computation time and memory footprint. The performances of the proposed representation are evaluated on a set of computer vision and machine learning applications

    SPICE²: A Spatial, Parallel Architecture for Accelerating the Spice Circuit Simulator

    Get PDF
    Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8X speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase, sparse dataflow parallelism in the Sparse Matrix-Solve phase and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code-generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms. We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X(1.4--23X) across a range of non-linear device models and Matrix-Solve by 2.4X(0.6--13X) across various benchmark matrices while delivering a mean combined speedup of 2.8X(0.2--11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm). With our high-level framework, we can also accelerate Single-Precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures. We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (\eg multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.</p

    A Compilation Flow for Parametric Dataflow: Programming Model, Scheduling, and Application to Heterogeneous MPSoC

    Get PDF
    International audienceEfficient programming of signal processing applications on embedded systems is a complex problem. High level models such as Synchronous dataflow (SDF) have been privileged candidates for dealing with this complexity. These models permit to express inherent application parallelism, as well as analysis for both verification and optimization. Parametric dataflow models aim at providing sufficient dynamicity to model new applications, while at the same time maintaining the high level of analyzability needed for efficient real life implementations. This paper presents a new compilation flow that targets parametric dataflows. Built on the LLVM compiler infrastructure, it offers an actor based C++ programming model to describe parametric graphs, a compilation front-end providing graph analysis features, and a retargetable back-end to map the application on real hardware. This paper gives an overview of this flow, with a specific focus on scheduling. The crucial gap between dataflow models and real hardware on which actor firing is not atomic, as well as the consequences on FIFOs sizing and execution pipelining are taken into account.The experimental results illustrate our compilation flow applied to compilation of 3GPP LTE-Advanced demodulation on a heterogeneous MPSoC with distributed scheduling features. This achieves performances similar to time-consuming hand made optimizations

    Specialization and reconfiguration of lightweight mobile processors for data-parallel applications

    Get PDF
    The worldwide utilization of mobile devices makes the segment of low power mobile processors leading in the entire computer industry. Customers demand low-cost, high-performance and energy-efficient mobile devices, which execute sophisticated mobile applications such as multimedia and 3D games. State-of-the-art mobile devices already utilize chip multiprocessors (CMP) with dedicated accelerators that exploit data-level parallelism (DLP) in these applications. Such heterogeneous system design enable the mobile processors to deliver the desired performance and efficiency. The heterogeneity however increases the processors complexity and manufacturing cost when adding extra special-purpose hardware for the accelerators. In this thesis, we propose new hardware techniques that leverage the available resources of a mobile CMP to achieve cost-effective acceleration of DLP workloads. Our techniques are inspired by classic vector architectures and the latest reconfigurable architectures, which both achieve high power efficiency when running DLP workloads. The high requirement of additional resources for these two architectures limits their applicability beyond high-performance computers. To achieve their advantages in mobile devices, we propose techniques that: 1) specialize the lightweight mobile cores for classic vector execution of DLP workloads; 2) dynamically tune the number of cores for the specialized execution; and 3) reconfigure a bulk of the existing general purpose execution resources into a compute hardware accelerator. Specialization enables one or more cores to process configurable large vector operands with new special purpose vector instructions. Reconfiguration goes one step further and allow the compute hardware in mobile cores to dynamically implement the entire functionality of diverse compute algorithms. The proposed specialization and reconfiguration techniques are applicable to a diverse range of general purpose processors available in mobile devices nowadays. However, we chose to implement and evaluate them on a lightweight processor based on the Explicit Data Graph Execution architecture, which we find promising for the research of low-power processors. The implemented techniques improve the mobile processor performance and the efficiency on its existing general purpose resources. The processor with enabled specialization/reconfiguration techniques efficiently exploits DLP without the extra cost of special-purpose accelerators.La utilización de dispositivos móviles a nivel mundial hace que el segmento de procesadores móviles de bajo consumo lidere la industria de computación. Los clientes piden dispositivos móviles de bajo coste, alto rendimiento y bajo consumo, que ejecuten aplicaciones móviles sofisticadas, tales como multimedia y juegos 3D.Los dispositivos móviles más avanzados utilizan chips con multiprocesadores (CMP) con aceleradores dedicados que explotan el paralelismo a nivel de datos (DLP) en estas aplicaciones. Tal diseño de sistemas heterogéneos permite a los procesadores móviles ofrecer el rendimiento y la eficiencia deseada. La heterogeneidad sin embargo aumenta la complejidad y el coste de fabricación de los procesadores al agregar hardware de propósito específico adicional para implementar los aceleradores. En esta tesis se proponen nuevas técnicas de hardware que aprovechan los recursos disponibles en un CMP móvil para lograr una aceleración con bajo coste de las aplicaciones con DLP. Nuestras técnicas están inspiradas por los procesadores vectoriales clásicos y por las recientes arquitecturas reconfigurables, pues ambas logran alta eficiencia en potencia al ejecutar cargas de trabajo DLP. Pero la alta exigencia de recursos adicionales que estas dos arquitecturas necesitan, limita sus aplicabilidad más allá de las computadoras de alto rendimiento. Para lograr sus ventajas en dispositivos móviles, en esta tesis se proponen técnicas que: 1) especializan núcleos móviles ligeros para la ejecución vectorial clásica de cargas de trabajo DLP; 2) ajustan dinámicamente el número de núcleos de ejecución especializada; y 3) reconfiguran en bloque los recursos existentes de ejecución de propósito general en un acelerador hardware de computación. La especialización permite a uno o más núcleos procesar cantidades configurables de operandos vectoriales largos con nuevas instrucciones vectoriales. La reconfiguración da un paso más y permite que el hardware de cómputo en los núcleos móviles ejecute dinámicamente toda la funcionalidad de diversos algoritmos informáticos. Las técnicas de especialización y reconfiguración propuestas son aplicables a diversos procesadores de propósito general disponibles en los dispositivos móviles de hoy en día. Sin embargo, en esta tesis se ha optado por implementarlas y evaluarlas en un procesador ligero basado en la arquitectura "Explicit Data Graph Execution", que encontramos prometedora para la investigación de procesadores de baja potencia. Las técnicas aplicadas mejoraran el rendimiento del procesador móvil y la eficiencia energética de sus recursos para propósito general ya existentes. El procesador con técnicas de especialización/reconfiguración habilitadas explota eficientemente el DLP sin el coste adicional de los aceleradores de propósito especial

    Dataflow Computing with Polymorphic Registers

    Get PDF
    Heterogeneous systems are becoming increasingly popular for data processing. They improve performance of simple kernels applied to large amounts of data. However, sequential data loads may have negative impact. Data parallel solutions such as Polymorphic Register Files (PRFs) can potentially accelerate applications by facilitating high speed, parallel access to performance-critical data. Furthermore, by PRF customization, specific data path features are exposed to the programmer in a very convenient way. PRFs allow additional control over the registers dimensions, and the number of elements which can be simultaneously accessed by computational units. This paper shows how PRFs can be integrated in dataflow computational platforms. In particular, starting from an annotated source code, we present a compiler-based methodology that automatically generates the customized PRFs and the enhanced computational kernels that efficiently exploit them
    corecore