481 research outputs found

    FDTD/K-DWM simulation of 3D room acoustics on general purpose graphics hardware using compute unified device architecture (CUDA)

    The growing demand for reliable prediction of sound fields in rooms has resulted in the adoption of various approaches for physical modelling, including the Finite Difference Time Domain (FDTD) and the Digital Waveguide Mesh (DWM). Whilst considered versatile and attractive methods, they suffer from dispersion errors that increase with frequency and vary with the direction of propagation, thus imposing a high-frequency calculation limit. Attempts have been made to reduce such errors by considering different mesh topologies, by spatial interpolation, or by simply oversampling the grid. As the latter approach is computationally expensive, its application to three-dimensional problems has often been avoided. In this paper, we propose an implementation of the FDTD on general-purpose graphics hardware, allowing for high sampling rates whilst maintaining reasonable calculation times. Dispersion errors are consequently reduced and the high-frequency limit is increased. A range of graphics processors is evaluated and compared with traditional CPUs in terms of accuracy, calculation time and memory requirements.
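The leapfrog update at the heart of an acoustic FDTD scheme can be sketched in one dimension (the paper treats 3D rooms on GPUs; the function and parameter names below are illustrative assumptions, not taken from the paper). Pressure and particle velocity live on staggered grids and are updated alternately; the Courant number controls both stability and the dispersion behaviour the abstract discusses.

```python
def fdtd_1d(n_cells=64, n_steps=100, courant=1.0):
    """Minimal 1D acoustic FDTD sketch on a staggered (Yee-style) grid.

    courant is the Courant number c*dt/dx; stability requires courant <= 1.
    Returns the final pressure field as a list.
    """
    p = [0.0] * n_cells            # pressure at cell centres
    u = [0.0] * (n_cells + 1)      # particle velocity at cell faces (staggered)
    p[n_cells // 2] = 1.0          # impulse excitation in the middle

    for _ in range(n_steps):
        # velocity update from the pressure gradient (interior faces only;
        # boundary faces stay at zero, i.e. rigid walls)
        for i in range(1, n_cells):
            u[i] -= courant * (p[i] - p[i - 1])
        # pressure update from the velocity divergence
        for i in range(n_cells):
            p[i] -= courant * (u[i + 1] - u[i])
    return p
```

Oversampling the grid, as the abstract notes, means shrinking dx (and dt with it), which multiplies the per-step work; in 3D the cost grows with the fourth power of the oversampling factor, which is what motivates offloading to graphics hardware.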

    FPGA Acceleration of Domain-specific Kernels via High-Level Synthesis

    The abstract is in the attachment.

    Efficient parallel LOD-FDTD method for Debye-dispersive media

    The locally one-dimensional finite-difference time-domain (LOD-FDTD) method is a promising implicit technique for solving Maxwell’s equations in numerical electromagnetics. This paper describes an efficient message passing interface (MPI)-parallel implementation of the LOD-FDTD method for Debye-dispersive media. Its computational efficiency is demonstrated to be superior to that of the parallel ADI-FDTD method. We demonstrate the effectiveness of the proposed parallel algorithm in the simulation of a bio-electromagnetic problem: deep brain stimulation (DBS) in the human body. The work described in this paper and the research leading to these results has received funding from the European Community’s Seventh Framework Programme FP7/2007-2013, under grant agreement no. 205294 (HIRF SE project), and from the Spanish National Projects TEC2010-20841-C04-04, CSD2008-00068, and the Junta de Andalucia Project P09-TIC-5327.
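What makes LOD-FDTD implicit is that each one-dimensional sweep requires solving a tridiagonal linear system along every grid line, rather than a pointwise explicit update. A generic Thomas-algorithm solver for such systems can be sketched as follows (this is a standard textbook routine, assumed here as an illustration; it is not the paper's MPI implementation, and the Debye dispersion terms are omitted).

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system A x = d with sub-diagonal a, diagonal b
    and super-diagonal c (lists of length n; a[0] and c[-1] are unused).
    Returns the solution x as a list. O(n) work, vs O(n^3) for a dense solve."""
    n = len(d)
    cp = [0.0] * n   # modified super-diagonal (forward sweep)
    dp = [0.0] * n   # modified right-hand side (forward sweep)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):   # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Because each grid line's system is independent of the others, the line solves can be distributed across MPI ranks, which is the kind of parallelism the abstract exploits.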

    Effective data parallel computing on multicore processors

    The rise of chip multiprocessing, or the integration of multiple general-purpose processing cores on a single chip (multicores), has impacted all computing platforms, including high-performance, server, desktop, mobile, and embedded processors. Programmers can no longer expect continued increases in software performance without developing parallel, memory-hierarchy-friendly software that can effectively exploit the chip-level multiprocessing paradigm of multicores. The goal of this dissertation is to demonstrate a design process for data parallel problems that starts with a sequential algorithm and ends with a high performance implementation on a multicore platform. Our design process combines theoretical algorithm analysis with practical optimization techniques. Our target multicores are quad-core processors from Intel and the eight-SPE IBM Cell B.E. Target applications include Matrix Multiplication (MM), Finite Difference Time Domain (FDTD), LU Decomposition (LUD), and Power Flow Solver based on Gauss-Seidel (PFS-GS) algorithms. These applications are popular computation methods in science and engineering problems and are characterized by unit-stride (MM, LUD, and PFS-GS) or 2-point stencil (FDTD) memory access patterns. The main contributions of this dissertation include a cache- and space-efficient algorithm model, integrated data pre-fetching and caching strategies, and in-core optimization techniques. Our multicore-efficient implementations of the above applications outperform naïve parallel implementations by at least 2x and scale well with problem size and with the number of processing cores.
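Cache-efficient handling of a unit-stride kernel like matrix multiplication is commonly done by tiling (blocking) the loops so each tile of the operands stays resident in cache while it is reused. A minimal sketch of the idea, assuming square list-of-lists matrices (this illustrates the general technique, not the dissertation's tuned Intel or Cell B.E. kernels):

```python
def matmul_blocked(A, B, block=32):
    """Cache-blocked multiply of square matrices (lists of lists).

    The three outer loops walk block x block tiles; the inner loops reuse
    each tile of A and B many times while it is still cache-resident.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = A[i][k]  # hoisted: constant over the j loop
                        for j in range(jj, min(jj + block, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

In a compiled language the block size would be tuned to the L1/L2 cache capacity of the target core; pure Python will not show the speedup, but the loop structure is the same one a C implementation would use.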

    Computer Modeling Using The Finite-Difference Time-Domain (FDTD) Method for Electromagnetic Wave Propagation

    The Finite-Difference Time-Domain (FDTD) technique is a numerical analysis modeling method that finds solutions of the partial derivatives in Maxwell’s equations for electromagnetic problems. In FDTD, the electric and magnetic field components are staggered in time and space by a method developed by Yee, and approximate solutions are found using a set of update equations. In every simulation that utilizes the FDTD method, time and memory size are the two significant considerations. This study focused on reducing the computation time, as time-marching the electric and magnetic field components at each of the FDTD problem cells is computationally expensive. Based on the findings of this study, the issue of time can be addressed by parallelizing the code. Since the FDTD field components are independent, the FDTD algorithm can be divided into small tasks that can be executed concurrently. Two approaches were taken to parallelize the one- and two-dimensional FDTD code: the Compute Unified Device Architecture (CUDA) approach and the Open Computing Language (OpenCL) approach. The serial FDTD C code was implemented and accelerated using CUDA. A comparison between the serial and parallel algorithms (C, CUDA, MATLAB) showed a speedup factor of 505 with the GPU-GPU method and 5 with the CPU-GPU method, for a one-dimensional space problem. The FDTD code was also implemented and executed with OpenCL. OpenCL is important since it is open-source and freely available; in contrast to CUDA, which supports only NVIDIA CUDA-enabled GPUs, code written in OpenCL is portable and can be executed on any parallel processing platform, such as CPUs, GPUs, DSPs, FPGAs, and others.
    A total-time speedup of 22x was recorded with OpenCL (PCL) with respect to CPU-C, with 10,000 iterations and a 150,000-cell grid size.
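The independence that makes FDTD parallelize well can be seen in a per-cell formulation: each half-step touches only a cell and its immediate neighbours, so in a CUDA or OpenCL kernel every thread computes one index i (in CUDA, roughly i = blockIdx.x * blockDim.x + threadIdx.x). The pure-Python sketch below maps the same per-cell body over a 1D Yee grid serially; the function names and coefficients are illustrative assumptions, not the thesis code.

```python
def update_h(ez, hy, coeff):
    """H-field half-step of a 1D Yee update.

    Each iteration is independent of the others, so on a GPU each i
    would be one thread. len(hy) == len(ez) - 1 (staggered grid).
    """
    for i in range(len(hy)):
        hy[i] += coeff * (ez[i + 1] - ez[i])

def update_e(ez, hy, coeff):
    """E-field half-step: interior points only; the boundary values of
    ez stay at zero (a perfect-electric-conductor boundary)."""
    for i in range(1, len(ez) - 1):
        ez[i] += coeff * (hy[i] - hy[i - 1])
```

Because update_h reads only ez and writes only hy (and vice versa), there are no write conflicts between threads within a half-step, which is what lets the GPU-GPU variant keep both fields resident on the device between steps.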

    Locality-Aware Concurrency Platforms

    Modern computing systems from all domains are becoming increasingly parallel. Manufacturers are taking advantage of the increasing number of available transistors by packaging more and more computing resources together on a single chip or within a single system. These platforms generally contain many levels of private and shared caches in addition to physically distributed main memory. Some memory is therefore more expensive to access than other memory, and high-performance software must treat memory locality as a first-order consideration. Memory locality is often difficult for application developers to consider directly, however, since many of these NUMA effects are invisible to the application programmer and show up only as low performance. Moreover, on parallel platforms, performance depends on both locality and load balance, and these two metrics are often at odds with each other; directly considering locality and load balance at the application level may therefore make the application much more complex to program. In this work, we develop locality-conscious concurrency platforms for multiple structured parallel programming models, including streaming applications, task-graphs, and parallel for loops. Throughout, the idea is to minimally disrupt the application programming model, so that the application developer is either unimpacted or must only provide high-level hints to the runtime system. The runtime system then schedules the application to provide good locality of access while, at the same time, also providing good load balance. In particular, we address cache locality for streaming applications through static partitioning and develop an extensible platform to execute partitioned streaming applications. For task-graphs, we extend a task-graph scheduling library to guide scheduling decisions towards better NUMA locality with the help of user-provided locality hints.
    CilkPlus parallel for loops utilize a randomized dynamic scheduler to distribute work, which, in many loop-based applications, results in poor locality at all levels of the memory hierarchy. We address this issue with a novel parallel for loop implementation that achieves good cache and NUMA locality while dynamically maintaining good load balance.
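The locality/load-balance tension described above can be made concrete with a statically chunked parallel for loop: giving each worker one contiguous range keeps consecutive iterations (and the cache lines and NUMA pages they touch) together, at the cost of the dynamic rebalancing a randomized work-stealing scheduler like CilkPlus provides. A minimal stdlib sketch of the static end of that trade-off (the function name and chunking policy are illustrative, not the dissertation's runtime):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(n, body, workers=4):
    """Apply body(i) for every i in range(n), assigning each worker one
    contiguous chunk of the iteration space.

    Contiguous chunks preserve spatial locality; the cost is that a slow
    chunk cannot be stolen, so load balance relies on chunks being even.
    """
    chunk = (n + workers - 1) // workers   # ceil(n / workers)

    def run(start):
        for i in range(start, min(start + chunk, n)):
            body(i)

    with ThreadPoolExecutor(max_workers=workers) as ex:
        # one task per chunk; list() forces completion before returning
        list(ex.map(run, range(0, n, chunk)))
```

A locality-aware runtime in the spirit of the dissertation would sit between these extremes: start from contiguous assignments, then rebalance dynamically only when a worker runs dry.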

    GPGPU Processing in CUDA Architecture

    The future of computation is the Graphical Processing Unit, i.e. the GPU. Given the promise that graphics cards have shown in the field of image processing and accelerated rendering of 3D scenes, and the computational capability these GPUs possess, they are developing into great parallel computing units. It is quite simple to program a graphics processor to perform general parallel tasks, and after understanding the various architectural aspects of the graphics processor, it can be used to perform other taxing tasks as well. In this paper, we show how CUDA can fully utilize the tremendous power of these GPUs. CUDA is NVIDIA's parallel computing architecture; it enables dramatic increases in computing performance by harnessing the power of the GPU. This paper discusses CUDA and its architecture, compares CUDA C/C++ with other parallel programming languages such as OpenCL and DirectCompute, and lists common myths about CUDA and why the future seems promising for CUDA. Comment: 16 pages, 5 figures, Advanced Computing: an International Journal (ACIJ) 201