16,457 research outputs found

    A Reconfigurable Vector Instruction Processor for Accelerating a Convection Parametrization Model on FPGAs

    Full text link
    High Performance Computing (HPC) platforms allow scientists to model computationally intensive algorithms. HPC clusters increasingly use General-Purpose Graphics Processing Units (GPGPUs) as accelerators; FPGAs provide an attractive alternative to GPGPUs for use as co-processors, but they are still far from being mainstream due to a number of challenges faced when using FPGA-based platforms. Our research aims to make FPGA-based high performance computing more accessible to the scientific community. In this work we present the results of investigating the acceleration of a particular atmospheric model, Flexpart, on FPGAs. We focus on accelerating the most computationally intensive kernel from this model. The key contribution of our work is the architectural exploration we undertook to arrive at a solution that best exploits the parallelism available in the legacy code, and is also convenient to program, so that eventually the compilation of high-level legacy code to our architecture can be fully automated. We present the three different types of architecture, comparing their resource utilization and performance, and propose that an architecture where there are a number of computational cores, each built along the lines of a vector instruction processor, works best in this particular scenario, and is a promising candidate for a generic FPGA-based platform for scientific computation. We also present the results of experiments done with various configuration parameters of the proposed architecture, to show its utility in adapting to a range of scientific applications.Comment: This is an extended pre-print version of work that was presented at the international symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART2014), Sendai, Japan, June 911, 201

    A low cost reconfigurable soft processor for multimedia applications: design synthesis and programming model

    Get PDF
    This paper presents an FPGA implementation of a low cost 8 bit reconfigurable processor core for media processing applications. The core is optimized to provide all basic arithmetic and logic functions required by the media processing and other domains, as well as to make it easily integrable into a 2D array. This paper presents an investigation of the feasibility of the core as a potential soft processing architecture for FPGA platforms. The core was synthesized on the entire Virtex FPGA family to evaluate its overall performance, scalability and portability. A special feature of the proposed architecture is its simple programming model which allows low level programming. Throughput results for popular benchmarks coded using the programming model and cycle accurate simulator are presented

    Solving the Cauchy-Riemann equations on parallel computers

    Get PDF
    Discussed is the implementation of a single algorithm on three parallel-vector computers. The algorithm is a relaxation scheme for the solution of the Cauchy-Riemann equations; a set of coupled first order partial differential equations. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, and SIMD machine with 16K bit serial processors; FLEX/32, an MIMD machine with 20 processors; and CRAY/2, an MIMD machine with four vector processors. The machine architectures are briefly described. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Conclusions are presented

    Vector processing-aware advanced clock-gating techniques for low-power fused multiply-add

    Get PDF
    The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need a retailoring for the mobile market that they are entering now. Floating-point (FP) fused multiply-add (FMA), being a functional unit with high power consumption, deserves special attention. Although clock gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques ensure power savings without jeopardizing the timing. We evaluate the proposed techniques using both synthetic and “real-world” application-based benchmarking. Using vector masking and vector multilane-aware clock gating, we report power reductions of up to 52%, assuming active VFU operating at the peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector FP instructions. Finally, when evaluating all techniques together, using “real-world” benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion.The research leading to these results has received funding from the RoMoL ERC Advanced Grant GA 321253 and is supported in part by the European Union (FEDER funds) under contract TTIN2015-65316-P. The work of I. Ratkovic was supported by a FPU research grant from the Spanish MECD.Peer ReviewedPostprint (author's final draft

    High accuracy computation with linear analog optical systems: a critical study

    Get PDF
    High accuracy optical processors based on the algorithm of digital multiplication by analog convolution (DMAC) are studied for ultimate performance limitations. Variations of optical processors that perform high accuracy vector-vector inner products are studied in abstract and with specific examples. It is concluded that the use of linear analog optical processors in performing digital computations with DMAC leads to impractical requirements for the accuracy of analog optical systems and the complexity of postprocessing electronics

    Parallel Algorithms for Summing Floating-Point Numbers

    Full text link
    The problem of exactly summing n floating-point numbers is a fundamental problem that has many applications in large-scale simulations and computational geometry. Unfortunately, due to the round-off error in standard floating-point operations, this problem becomes very challenging. Moreover, all existing solutions rely on sequential algorithms which cannot scale to the huge datasets that need to be processed. In this paper, we provide several efficient parallel algorithms for summing n floating point numbers, so as to produce a faithfully rounded floating-point representation of the sum. We present algorithms in PRAM, external-memory, and MapReduce models, and we also provide an experimental analysis of our MapReduce algorithms, due to their simplicity and practical efficiency.Comment: Conference version appears in SPAA 201
    • …
    corecore