
    Mixing multi-core CPUs and GPUs for scientific simulation software

    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some application paradigms. Software languages and systems such as NVIDIA's CUDA and the Khronos consortium's Open Computing Language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very-many-core GPUs for data parallelism is necessary. We describe our use of hybrid applications using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the applications-level programmer, and offer some suggested areas for language development and integration between coarse-grained and fine-grained multi-thread systems. We discuss results from three common simulation algorithmic areas, including: partial differential equations; graph cluster metric calculations; and random number generation. We report on programming experiences and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs; a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scientific applications developers.
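
    A minimal sketch of the hybrid pattern this abstract describes: one host CPU thread per GPU, each thread binding to its device with cudaSetDevice and launching a data-parallel kernel. The kernel, problem size, and names below are illustrative placeholders, not the paper's actual simulation codes.

```cpp
// Hypothetical sketch: coarse-grained host threads each drive one GPU, while
// each GPU runs a fine-grained data-parallel kernel. Placeholder workload.
#include <cuda_runtime.h>
#include <thread>
#include <vector>
#include <cstdio>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;              // trivial data-parallel work
}

void worker(int device, int n) {
    cudaSetDevice(device);                     // bind this host thread to one GPU
    float *d = nullptr;
    cudaMalloc((void **)&d, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    cudaDeviceSynchronize();                   // wait for this device's work
    cudaFree(d);
    std::printf("device %d done\n", device);
}

int main() {
    int devices = 0;
    cudaGetDeviceCount(&devices);
    std::vector<std::thread> pool;
    for (int d = 0; d < devices; ++d)          // one controlling thread per GPU
        pool.emplace_back(worker, d, 1 << 20);
    for (auto &t : pool) t.join();
    return 0;
}
```

    In this arrangement the threading model (here std::thread) supplies the coarse-grained parallelism across devices, while CUDA supplies the fine-grained data parallelism within each device, which is the integration issue the paper discusses.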

    Aggregation of Descriptive Regularization Methods with Hardware/Software Co-Design for Remote Sensing Imaging

    This study considers the problem of high-resolution imaging of the remote sensing (RS) environment, formalized in terms of a nonlinear ill-posed inverse problem of nonparametric estimation of the power spatial spectrum pattern (SSP) of the wavefield scattered from an extended remotely sensed scene (referred to as the scene image). However, existing reconstructive-imaging techniques in many RS application areas are unsuitable for (near) real-time implementation. In this work, we address a new aggregated descriptive-regularization (DR) method and a Hardware/Software (HW/SW) co-design for SSP reconstruction from uncertain speckle-corrupted measurement data in a computationally efficient parallel fashion that meets (near) real-time image processing requirements. The hardware design is performed via efficient systolic arrays (SAs). Finally, the efficiency of the aggregated descriptive-regularization and HW/SW co-design method, both in resolution enhancement and in computational-complexity reduction, is presented via numerical simulations and by performance analysis of an implementation based on a Xilinx XC4VSX35-10ff668 Field Programmable Gate Array (FPGA).
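
    As a hedged illustration of the descriptive-regularization idea, a generic Tikhonov-type formulation of the SSP estimate is sketched below; the symbols are illustrative placeholders and this is not necessarily the exact estimator aggregated in the paper.

```latex
% Generic descriptive-regularization (Tikhonov-type) sketch, not the paper's
% exact estimator: b = discretized SSP, y = speckle-corrupted measurements,
% S = signal-formation operator, A = descriptive (a priori) constraint
% operator, alpha = regularization parameter.
\hat{\mathbf{b}} \;=\; \arg\min_{\mathbf{b} \,\ge\, 0}\;
    \bigl\| \mathbf{y} - \mathbf{S}\,\mathbf{b} \bigr\|_2^2
    \;+\; \alpha \,\bigl\| \mathbf{A}\,\mathbf{b} \bigr\|_2^2
```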

    A General-Purpose Graphics Processing Unit (GPGPU)-Accelerated Robotic Controller Using a Low Power Mobile Platform

    Robotic controllers have to execute various complex independent tasks repeatedly. Massive processing power is required by the motion controllers to compute the solution of these computationally intensive algorithms. General-purpose graphics processing unit (GPGPU)-enabled mobile phones can be leveraged for acceleration of these motion controllers. Embedded GPUs can replace several dedicated computing boards with a single powerful and less power-consuming GPU. In this paper, an inverse-kinematics-based numeric controller is proposed and realized using the GPGPU of a handheld mobile device. This work is an extension of a desktop GPU-accelerated robotic controller presented at DAS'16, where a comparative analysis of different sequential and concurrent controllers is discussed. First, the inverse kinematic algorithm is sequentially realized using an Arduino Due microcontroller, and a field-programmable gate array (FPGA) is used for its parallel implementation. Execution speeds of these controllers are compared with two different GPGPU architectures (Nvidia Quadro K2200 and Nvidia Shield K1 Tablet), programmed with the Compute Unified Device Architecture (CUDA) computing language. Experimental data shows that the proposed mobile platform-based scheme outperforms the FPGA by a factor of 5 and boasts a 100-fold speedup over the Arduino-based sequential implementation.
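
    A hedged sketch of the batch-parallel pattern such a GPGPU controller relies on: one CUDA thread solves the closed-form inverse kinematics of a two-link planar arm for one target point. This is an invented illustration of the parallelization style, not the paper's controller or kinematic model.

```cpp
// Hypothetical batch inverse kinematics: each thread computes the joint
// angles (theta1, theta2) of a 2-link planar arm for one target (x, y).
__global__ void ik2link(const float *tx, const float *ty,
                        float *theta1, float *theta2,
                        int n, float l1, float l2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = tx[i], y = ty[i];
    // Law-of-cosines solution for the elbow joint, clamped for reachability.
    float c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0f * l1 * l2);
    c2 = fminf(1.0f, fmaxf(-1.0f, c2));
    float t2 = acosf(c2);                        // elbow-down solution
    float t1 = atan2f(y, x) - atan2f(l2 * sinf(t2), l1 + l2 * cosf(t2));
    theta1[i] = t1;
    theta2[i] = t2;
}
```

    A host loop would copy the target coordinates to device memory and launch, for example, ik2link<<<(n + 255) / 256, 256>>>(...), which mirrors the batched GPU evaluation that gives the embedded GPU its advantage over the sequential microcontroller implementation.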

    Dataflow/Actor-Oriented language for the design of complex signal processing systems

    Signal processing algorithms are becoming more and more complex, and the algorithm-architecture adaptation and design processes can no longer rely only on the intuition of the designers to build efficient systems. Specific tools and methods are needed to cope with the increasing complexity of both algorithms and platforms. This paper presents a new framework which allows the specification, design, simulation and implementation of a system operating at a higher level of abstraction compared to current approaches. The framework is based on the usage of a new actor/dataflow-oriented language called CAL. This language has been specifically designed for modelling complex signal processing systems. CAL dataflow models expose the intrinsic concurrency of the algorithms by employing the notions of actor programming and dataflow. Concurrency and parallelism are very important aspects of embedded system design as we enter the multicore era. The design framework is composed of a simulation platform and of the Cal2C and CAL2HDL code generators. This paper describes in detail the principles on which such code generators are based and shows how efficient software (C) and hardware (VHDL and Verilog) code can be generated from appropriate CAL models. Results on a real design case, an MPEG-4 Simple Profile decoder, show that systems obtained with the hardware code generator outperform the hand-written VHDL version both in terms of performance and resource usage. Concerning the C code generator, the results show that the synthesized C software, mapped on a SystemC scheduler platform, is much faster than the simulated CAL dataflow program and approaches hand-written C versions.
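
    As a hedged illustration of the actor/dataflow notion the framework builds on, expressed in C++ rather than CAL syntax (the actor, its ports, and the scheduler loop are invented for the example), an actor consumes tokens from input FIFOs and fires only when its firing rule is satisfied:

```cpp
// Illustrative dataflow actor (not CAL syntax): an "Add" actor fires when
// both input FIFOs hold a token, consumes one token from each, and
// produces one output token. Names and structure are invented.
#include <queue>
#include <cstdio>

struct AddActor {
    std::queue<int> inA, inB, out;   // token FIFOs on the actor's ports

    bool canFire() const {           // firing rule: one token on each input
        return !inA.empty() && !inB.empty();
    }
    void fire() {                    // action: consume inputs, produce output
        int a = inA.front(); inA.pop();
        int b = inB.front(); inB.pop();
        out.push(a + b);
    }
};

int main() {
    AddActor add;
    add.inA.push(3);
    add.inB.push(4);
    while (add.canFire()) add.fire();       // trivial scheduler: fire while enabled
    std::printf("%d\n", add.out.front());   // prints 7
    return 0;
}
```

    Because each actor only communicates through its FIFOs and fires independently when enabled, a network of such actors exposes the concurrency that the Cal2C and CAL2HDL generators can then map to software schedulers or to hardware.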

    Next generation of Exascale-class systems: ExaNeSt project and the status of its interconnect and storage development

    The ExaNeSt project started in December 2015 and is funded by the EU H2020 research framework (call H2020-FETHPC-2014, n. 671553) to study the adoption of clusters of low-cost, Linux-based, power-efficient 64-bit ARM processors for Exascale-class systems. The ExaNeSt consortium pools partners with industrial and academic research expertise in storage, interconnects and applications who share a vision of a European Exascale-class supercomputer. The common goal is designing and implementing a physical rack prototype together with its cooling system, the non-volatile memory (NVM) architecture, and a unified low-latency interconnect able to test different options for network and storage. Furthermore, the consortium's goal is to provide real HPC applications to validate the system. In this paper we describe the unified data and storage network architecture, reporting on the status of development of different testbeds and highlighting preliminary benchmark results obtained through the execution of scientific, engineering and data-analytics scalable application kernels.