256,816 research outputs found

    A Domain Specific Approach to High Performance Heterogeneous Computing

    Users of heterogeneous computing systems face two problems: first, understanding the trade-offs between the observable characteristics of their applications, such as latency and quality of the result; and second, exploiting knowledge of these characteristics to allocate work to distributed computing platforms efficiently. A domain specific approach addresses both of these problems. By considering a subset of operations or functions, models of the observable characteristics, or domain metrics, may be formulated in advance and populated at run-time for task instances. These metric models can then be used to express the allocation of work as a constrained integer program, which can be solved using heuristics, machine learning or Mixed Integer Linear Programming (MILP) frameworks. These claims are illustrated using the example domain of derivatives pricing in computational finance, with the domain metrics of workload latency (makespan) and pricing accuracy. For a large, varied workload of 128 Black-Scholes and Heston model-based option pricing tasks, running on a diverse array of 16 multicore CPU, GPU and FPGA platforms, predictions made by models of both the makespan and accuracy are generally within 10% of the run-time performance. When these models are used as inputs to machine learning and MILP-based workload allocation approaches, latency improvements of up to 24 and 270 times over the heuristic approach are seen. Comment: 14 pages, preprint draft, minor revisio
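    To make the allocation problem in the abstract above concrete, the following is a minimal sketch of a greedy makespan heuristic, not the authors' method. It assumes a hypothetical table latency[t][p] holding the predicted run time of task t on platform p (the kind of value a latency model might supply); the function name and the numbers are illustrative only.

```python
def greedy_allocate(latency):
    """Assign each task to one platform, trying to keep the makespan small.

    latency[t][p] is the (hypothetical) predicted run time of task t on
    platform p. Returns (assignment, makespan), where assignment[t] is the
    index of the platform chosen for task t.
    """
    n_platforms = len(latency[0])
    load = [0.0] * n_platforms          # accumulated run time per platform
    assignment = [None] * len(latency)

    # Handle the tasks that are expensive even on their best platform first.
    order = sorted(range(len(latency)), key=lambda t: -min(latency[t]))
    for t in order:
        # Pick the platform that would finish earliest after taking task t.
        p = min(range(n_platforms), key=lambda q: load[q] + latency[t][q])
        load[p] += latency[t][p]
        assignment[t] = p

    return assignment, max(load)


if __name__ == "__main__":
    # Three tasks, two platforms; all numbers are made up for illustration.
    predicted = [[4, 2], [3, 6], [2, 2]]
    print(greedy_allocate(predicted))
```

    A heuristic of this kind gives no optimality guarantee; an integer-programming formulation would instead search for the assignment that minimizes the largest platform load subject to each task being placed exactly once.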

    Efficient Generation of Parallel Spin-images Using Dynamic Loop Scheduling

    High performance computing (HPC) systems have undergone a significant increase in their processing capabilities. Modern HPC systems combine large numbers of homogeneous and heterogeneous computing resources. Scalability is, therefore, essential for scientific applications to efficiently exploit the massive parallelism of modern HPC systems. This work introduces an efficient version of the parallel spin-image algorithm (PSIA), called EPSIA. The PSIA is a parallel version of the spin-image algorithm (SIA). The (P)SIA is used in various domains, such as 3D object recognition, categorization, and 3D face recognition. EPSIA extends the PSIA by integrating various well-known dynamic loop scheduling (DLS) techniques. The present work: (1) proposes EPSIA, a novel, flexible version of PSIA; (2) showcases the benefits of applying DLS techniques to optimize the performance of the PSIA; (3) assesses the performance of the proposed EPSIA through several scalability experiments. The performance results are promising and show that, using well-known DLS techniques, the EPSIA outperforms the PSIA by factors of 1.2 and 2 for homogeneous and heterogeneous computing resources, respectively.
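    As an illustration of what a dynamic loop scheduling technique computes, the sketch below generates chunk sizes under guided self-scheduling (GSS), where each request receives the remaining iterations divided by the number of workers, rounded up. The abstract does not say which DLS techniques EPSIA integrates, so GSS is an assumed example and the function name is hypothetical.

```python
import math


def gss_chunks(n_iterations, n_workers):
    """Return the sequence of chunk sizes handed out by guided self-scheduling.

    Each scheduling request takes ceil(remaining / n_workers) iterations, so
    chunks start large (low scheduling overhead) and shrink toward the end
    (better load balance).
    """
    remaining = n_iterations
    chunks = []
    while remaining > 0:
        size = math.ceil(remaining / n_workers)
        chunks.append(size)
        remaining -= size
    return chunks


if __name__ == "__main__":
    # 20 loop iterations shared among 4 workers.
    print(gss_chunks(20, 4))  # [5, 4, 3, 2, 2, 1, 1, 1, 1]
```

    Static scheduling would instead hand every worker a single chunk of 5 iterations up front, which is cheaper to schedule but cannot adapt when some iterations or some processors are slower than others.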

    High performance lattice reduction on heterogeneous computing platform

    The final publication is available at Springer via http://dx.doi.org/10.1007/s11227-014-1201-2
    The lattice reduction (LR) technique has become very important in many engineering fields. However, its high complexity makes it difficult to use in real-time applications, especially those that deal with large matrices. As a solution, the modified block LLL (MB-LLL) algorithm was introduced, where several levels of parallelism were exploited: (a) fine-grained parallelism was achieved through the cost-reduced all-swap LLL (CR-AS-LLL) algorithm introduced together with the MB-LLL by Józsa et al. (Proceedings of the tenth international symposium on wireless communication systems, 2013) and (b) coarse-grained parallelism was achieved by applying the block-reduction concept presented by Wetzel (Algorithmic number theory. Springer, New York, pp 323-337, 1998). In this paper, we present the cost-reduced MB-LLL (CR-MB-LLL) algorithm, which significantly reduces the computational complexity of the MB-LLL in two ways: by relaxing the first LLL condition while executing the LR of submatrices, which delays the update of the Gram-Schmidt coefficients, and by using less costly procedures during the boundary checks. The effects of complexity reduction and implementation details are analyzed and discussed for several architectures. A mapping of the CR-MB-LLL on a heterogeneous platform is proposed and compared with implementations running on a dynamic-parallelism-enabled GPU and a multi-core CPU. The proposed mapping allows dynamic scheduling of kernels, where the overhead introduced is hidden by the use of several CUDA streams. Results show that the CR-MB-LLL algorithm on the heterogeneous platform outperforms the multi-core CPU in execution time and is more efficient than the CR-AS-LLL algorithm for large matrices.
    Financial support for this study was provided by grants TAMOP-4.2.1./B-11/2/KMR-2011-0002, TAMOP-4.2.2/B-10/1-2010-0014 from the Pazmany Peter Catholic University, European Union ERDF, Spanish Government through TEC2012-38142-C04-01 project and Generalitat Valenciana through PROMETEO/2009/013 project.
    Jozsa, CM.; Domene Oltra, F.; Vidal Maciá, AM.; Piñero Sipán, MG.; González Salvador, A. (2014). High performance lattice reduction on heterogeneous computing platform. Journal of Supercomputing. 70(2):772-785. https://doi.org/10.1007/s11227-014-1201-2
    Józsa CM, Domene F, Piñero G, González A, Vidal AM (2013) Efficient GPU implementation of lattice-reduction-aided multiuser precoding. In: Proceedings of the tenth international symposium on wireless communication systems (ISWCS 2013)
    Wetzel S (1998) An efficient parallel block-reduction algorithm. In: Buhler JP (ed) Algorithmic number theory. Lecture notes in computer science, vol 1423. Springer, Berlin, Heidelberg, pp 323–337
    Wubben D, Seethaler D, Jaldén J, Matz G (2011) Lattice reduction. Signal Process Mag IEEE 28(3):70–91
    Lenstra AK, Lenstra HW, Lovász L (1982) Factoring polynomials with rational coefficients. Math Ann 261(4):515–534
    Bremner MR (2012) Lattice basis reduction: an introduction to the LLL algorithm and its applications. CRC Press, USA
    Wu D, Eilert J, Liu D (2008) A programmable lattice-reduction aided detector for MIMO-OFDMA. In: 4th IEEE international conference on circuits and systems for communications (ICCSC 2008), pp 293–297
    Barbero LG, Milliner DL, Ratnarajah T, Barry JR, Cowan C (2009) Rapid prototyping of Clarkson’s lattice reduction for MIMO detection. In: IEEE international conference on communications (ICC’09), pp 1–5
    Gestner B, Zhang W, Ma X, Anderson D (2011) Lattice reduction for MIMO detection: from theoretical analysis to hardware realization. IEEE Trans Circ Syst I Regul Pap 58(4):813–826
    Shabany M, Youssef A, Gulak G (2013) High-throughput 0.13-μm CMOS lattice reduction core supporting 880 Mb/s detection. IEEE Trans Very Large Scale Integr (VLSI) Syst 21(5):848–861
    Luo Y, Qiao S (2011) A parallel LLL algorithm. In: Proceedings of the fourth international C* conference on computer science and software engineering, pp 93–101
    Backes W, Wetzel S (2011) Parallel lattice basis reduction - the road to many-core. In: IEEE 13th international conference on high performance computing and communications (HPCC)
    Ahmad U, Amin A, Li M, Pollin S, Van der Perre L, Catthoor F (2011) Scalable block-based parallel lattice reduction algorithm for an SDR baseband processor. In: 2011 IEEE international conference on communications (ICC)
    Villard G (1992) Parallel lattice basis reduction. In: Papers from the international symposium on symbolic and algebraic computation (ISSAC’92). ACM, New York
    Domene F, Józsa CM, Vidal AM, Piñero G, Gonzalez A (2013) Performance analysis of a parallel lattice reduction algorithm on many-core architectures. In: Proceedings of the 13th international conference on computational and mathematical methods in science and engineering
    Gestner B, Zhang W, Ma X, Anderson DV (2008) VLSI implementation of a lattice reduction algorithm for low-complexity equalization. In: 4th IEEE international conference on circuits and systems for communications (ICCSC 2008), pp 643–647
    Burg A, Seethaler D, Matz G (2007) VLSI implementation of a lattice-reduction algorithm for multi-antenna broadcast precoding. In: IEEE international symposium on circuits and systems (ISCAS 2007), pp 673–676
    Bruderer L, Studer C, Wenk M, Seethaler D, Burg A (2010) VLSI implementation of a low-complexity LLL lattice reduction algorithm for MIMO detection. In: Proceedings of 2010 IEEE international symposium on circuits and systems (ISCAS

    CPU-GPU Heterogeneous Code Acceleration of a Finite Volume Computational Fluid Dynamics Solver

    This work deals with the CPU-GPU heterogeneous code acceleration of a finite-volume CFD solver that uses multiple CPUs and GPUs at the same time. First, a high-level description is given of the CFD solver, called SENSEI, its discretization, and its CPU-GPU heterogeneous computing workflow, which leverages MPI and OpenACC. Then, a performance model for CPU-GPU heterogeneous computing requiring ghost cell exchange is proposed to help estimate the performance of the heterogeneous implementation. The scaling performance of the CPU-GPU heterogeneous computing is presented for a supersonic inlet test case and compared with the pure multi-CPU and multi-GPU performance, to show the advantages of leveraging the computational power of both the CPU and the GPU. Using CPUs and GPUs together as workers improves performance beyond what pure CPUs or pure GPUs achieve, and the advantage can be fairly estimated by the performance model proposed in this work. Finally, conclusions are drawn to provide 1) suggestions for application users who are interested in leveraging the computational power of the CPU and GPU to accelerate their own scientific computing simulations and 2) feedback for hardware architects who are interested in designing a better CPU-GPU heterogeneous system for heterogeneous computing.

    Efficient Utilization of Fine-Grained Parallelism using a microHeterogeneous Environment

    The goal of this thesis is to propose a new computing paradigm, called microHeterogeneous computing or mHC, which incorporates PCI (or other high speed local system bus) based processing elements (vector processors, digital signal processors, etc.) into a general purpose machine. In this manner the benefits of heterogeneous computing on scientific applications can be achieved while avoiding some of the limitations. Overall performance is increased by exploiting fine-grained parallelism on the most efficient architecture available, while reducing the high communication overhead and costs of traditional heterogeneous environments. Furthermore, mHC based machines can be combined into a cluster, allowing both the coarse-grained and fine-grained parallelism to be fully exploited in order to achieve even greater levels of performance. An existing high performance computing API (GSL) was chosen as the interface to the system to allow for easy integration with applications that were previously developed using this API. The ensuing chapters will provide the motivation for this work, an overview of heterogeneous computing, and the details pertaining to microHeterogeneous computing. The framework implemented to demonstrate a microHeterogeneous computing environment will be examined as well as the results. Finally, the future of microHeterogeneous computing will be discussed.