Search CORE

11 research outputs found

Code Generation and Global Optimization Techniques for a Reconfigurable PRAM-NUMA Multicore Architecture

Author
Publication venue: 'Linkoping University Electronic Press'
Publication date
Field of study

Parallel and Distributed Computing

Author
Publication venue: 'IntechOpen'
Publication date: 20/04/2021
Field of study

The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing

Directory of Open Access Books (DOAB)

Computer science I like proceedings of miniconference on 4.11.2011

Author: Penttonen Martti (ed.)
Publication venue: University of Eastern Finland
Publication date
Field of study

UEF Electronic Publications

A configurable vector processor for accelerating speech coding algorithms

Author: Konstantia Koutsomyti (7201031)
Publication venue
Publication date: 01/01/2007
Field of study

The growing demand for voice-over-packer (VoIP) services and multimedia-rich applications has made increasingly important the efficient, real-time implementation of low-bit rates speech coders on embedded VLSI platforms. Such speech coders are designed to substantially reduce the bandwidth requirements thus enabling dense multichannel gateways in small form factor. This however comes at a high computational cost which mandates the use of very high performance embedded processors. This thesis investigates the potential acceleration of two major ITU-T speech coding algorithms, namely G.729A and G.723.1, through their efficient implementation on a configurable extensible vector embedded CPU architecture. New scalar and vector ISAs were introduced which resulted in up to 80% reduction in the dynamic instruction count of both workloads. These instructions were subsequently encapsulated into a parametric, hybrid SISD (scalar processor)–SIMD (vector) processor. This work presents the research and implementation of the vector datapath of this vector coprocessor which is tightly-coupled to a Sparc-V8 compliant CPU, the optimization and simulation methodologies employed and the use of Electronic System Level (ESL) techniques to rapidly design SIMD datapaths

Loughborough University Institutional Repository

A distributed multi-threaded data partitioner with space-filling curve orders

Author: Sasidharan Aparna
Publication venue
Publication date: 01/08/2018
Field of study

The problem discussed in this thesis is distributed data partitioning and data re-ordering on many-core architectures. We present extensive literature survey, with examples from various application domains - scientific computing, databases and large-scale graph processing. We propose a low-overhead partitioning framework based on geometry, that can be used to partition multi-dimensional data where the number of dimensions is >=2. The partitioner linearly orders items with good spatial locality. Partial output is stored on each process in the communication group. Space-filling curves are used to permute data - Morton order is the default curve. For dimensions <=3, we have options to generate Hilbert-like curves. Two metrics used to determine partitioning overheads are memory consumption and execution time, although these two factors are dependent on each other. The focus of this thesis is to reduce partitioning overheads as much as possible. We have described several optimizations to this end - incremental adjustments to partitions, careful dynamic memory management and using multi-threading and multi-processing to advantage. The quality of partitions is an important criteria for evaluating a partitioner. We have used graph partitioners as base-implementations against which our partitions are compared. The degree and edge-cuts of our partitions are comparable to graph partitions for regular grids. For irregular meshes, there is still room for improvement. No comparisons have been made for evaluating partitions of datasets without edges. We have deployed these partitions on two large applications - atmosphere simulation in 2D and adaptive mesh refinement in 3D. An adaptive mesh refinement benchmark was built to be part of the framework, which later became a testcase for evaluating partitions and load-balancing schemes. The performance of this benchmark is discussed in detail in the last chapter

Illinois Digital Environment for Access to Learning and Scholarship Repository

Case Studies on Optimizing Algorithms for GPU Architectures

Author: Brown Shawn
Publication venue: University of North Carolina at Chapel Hill Graduate School
Publication date: 01/01/2015
Field of study

Modern GPUs are complex, massively multi-threaded, and high-performance. Programmers naturally gravitate towards taking advantage of this high performance for achieving faster results. However, in order to do so successfully, programmers must first understand and then master a new set of skills – writing parallel code, using different types of parallelism, adapting to GPU architectural features, and understanding issues that limit performance. In order to ease this learning process and help GPU programmers become productive more quickly, this dissertation introduces three data access skeletons (DASks) – Block, Column, and Row -- and two block access skeletons (BASks) – Block-By-Block and Warp-by-Warp. Each “skeleton” provides a high-performance implementation framework that partitions data arrays into data blocks and then iterates over those blocks. The programmer must still write “body” methods on individual data blocks to solve their specific problem. These skeletons provide efficient machine dependent data access patterns for use on GPUs. DASks group n data elements into m fixed size data blocks. These m data block are then partitioned across p thread blocks using a 1D or 2D layout pattern. The fixed-size data blocks are parameterized using three C++ template parameters – nWork, WarpSize, and nWarps. Generic programming techniques use these three parameters to enable performance experiments on three different types of parallelism – instruction-level parallelism (ILP), data-level parallelism (DLP), and thread-level parallelism (TLP). These different DASks and BASks are introduced using a simple memory I/O (Copy) case study. A nearest neighbor search case study resulted in the development of DASKs and BASks but does not use these skeletons itself. Three additional case studies – Reduce/Scan, Histogram, and Radix Sort -- demonstrate DASks and BASks in action on parallel primitives and also provides more valuable performance lessons.Doctor of Philosoph

Carolina Digital Repository

A PRAM-NUMA model of computation for addressing low-TLP workloads

Author: Forsell Martti
Publication venue
Publication date: 01/01/2010
Field of study

It is possible to implement the parallel random access machine (PRAM) on a chip multiprocessor (CMP) efficiently with an emulated shared memory (ESM) architecture to gain easy parallel programmability crucial to wider penetration of CMPs to general purpose computing. This implementation relies on exploitation of the slack of parallel applications to hide the latency of the memory system instead of caches, sufficient bisection bandwidth to guarantee high throughput, and hashing to avoid hot spots in intercommunication. Unfortunately this solution can not handle workloads with low thread-level parallelism (TLP) efficiently because then there is not enough parallel slackness available for hiding the latency. In this paper we show that integrating non-uniform memory access (NUMA) support to the PRAM implementation architecture can solve this problem and provide a natural way for migration of the legacy code written for a sequential or multi-core NUMA machine. The obtained PRAM-NUMA hybrid model is defined and architectural implementation of it is outlined on our ECLIPSE ESM CMP framework. A high-level programming language example is given

Directory of Open Access Journals

VTT Research System

A PRAM-NUMA model of computation for addressing low-TLP workloads

Author: Forsell Martti
Publication venue
Publication date: 01/01/2011
Field of study

VTT Research System

A PRAM-NUMA model of computation for addressing low-TLP workloads

Author: Forsell Martti
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2010
Field of study

VTT Research System

A PRAM-NUMA Model of Computation for Addressing Low-TLP Workloads

Author
Publication venue: 'IJNC Editorial Committee'
Publication date: 01/01/2011
Field of study

Crossref