Search CORE

3,067 research outputs found

Variable-based multi-module data caches for clustered VLIW processors

Author: Abella Ferrer Jaume
Gibert Codina Enric
González Colás Antonio María
Sánchez Navarro Jesús
Vera Rivera Francisco Javier
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2005
Field of study

Memory structures consume an important fraction of the total processor energy. One solution to reduce the energy consumed by cache memories consists of reducing their supply voltage and/or increase their threshold voltage at an expense in access time. We propose to divide the L1 data cache into two cache modules for a clustered VLIW processor consisting of two clusters. Such division is done on a variable basis so that the address of a datum determines its location. Each cache module is assigned to a cluster and can be set up as a fast power-hungry module or as a slow power-aware module. We also present compiler techniques in order to distribute variables between the two cache modules and generate code accordingly. We have explored several cache configurations using the Mediabench suite and we have observed that the best distributed cache organization outperforms traditional cache organizations by 19%-31% in energy-delay and by 11%-29% in energy-delay. In addition, we also explore a reconfigurable distributed cache, where the cache can be reconfigured on a context switch. This reconfigurable scheme further outperforms the best previous distributed organization by 3%-4%.Peer ReviewedPostprint (published version

UPCommons. Portal del coneixement obert de la UPC

pocl: A Performance-Portable OpenCL Implementation

Author: Berg Heikki
de La Lama Carlos Sánchez
Jääskeläinen Pekka
Raiskila Kalle
Schnetter Erik
Takala Jarmo
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015
Field of study

OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear; multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse due to multiple proprietary vendor implementations with different characteristics, and, thus, required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions to enable support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This data can be exploited by the later generic compiler passes for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both already commercialized and on those that are still under research. The paper describes how the portability of the implementation is achieved. Our results show that most of the benchmarked applications when compiled using pocl were faster or close to as fast as the best proprietary OpenCL implementation for the platform at hand.Comment: This article was published in 2015; it is now openly accessible via arxi

arXiv.org e-Print Archive

Trepo - Institutional Repository of Tampere University

Scheduling and Allocation of Non-Manifest Loops on Hardware Graph-Models

Author: Etalle S.
Krol T.
Mansour O.
Publication venue: STW Technology Foundation
Publication date: 01/01/2001
Field of study

We address the problem of scheduling non-manifest data dependant periodic loops for high throughput DSP-applications based on a streaming data model. In contrast to manifest loops, non-manifest data dependant loops are loops where the number of iterations needed in order to perform a calculation is data dependant and hence not known at compile time. For the case of manifest loops, static scheduling techniques have been devised which produce near optimal schedules. Due to the lack of exact run-time execution knowledge of non-manifest loops, these static scheduling techniques are not suitable for tackling scheduling problems of DSP-algorithms with non-manifest loops embedded in them. We consider the case where (a) a-priori knowledge of the data distribution, and (b) worst case execution time of the non-manifest loop are known and a constraint on the total execution time has been given. Under these conditions dynamic schedules of the non-manifest data dependant loops within the DSP-algorithm are possible. We show how to construct hardware which dynamically schedules these non-manifest loops. The sliding window execution, which is the execution of a non-manifest loop when the data streams through it, of the constructed hardware will guarantee real time performance for the worst case situation. This is the situation when each non-manifest loop requires its maximum number of iterations

CiteSeerX

Repository TU/e

Pure OAI Repository

University of Twente Research Information

Recommended from our members

Elixir: synthesis of parallel irregular algorithms

Author: Prountzos Dimitrios
Publication venue
Publication date: 11/02/2016
Field of study

Algorithms in new application areas like machine learning and data analytics usually operate on unstructured sparse graphs. Writing efficient parallel code to implement these algorithms is very challenging for a number of reasons. First, there may be many algorithms to solve a problem and each algorithm may have many implementations. Second, synchronization, which is necessary for correct parallel execution, introduces potential problems such as data-races and deadlocks. These issues interact in subtle ways, making the best solution dependent both on the parallel platform and on properties of the input graph. Consequently, implementing and selecting the best parallel solution can be a daunting task for non-experts, since we have few performance models for predicting the performance of parallel sparse graph programs on parallel hardware. This dissertation presents a synthesis methodology and a system, Elixir, that addresses these problems by (i) allowing programmers to specify solutions at a high level of abstraction, and (ii) generating many parallel implementations automatically and using search to find the best one. An Elixir specification consists of a set of operators capturing the main algorithm logic and a schedule specifying how to efficiently apply the operators. Elixir employs sophisticated automated reasoning to merge these two components, and uses techniques based on automated planning to insert synchronization and synthesize efficient parallel code. Experimental evaluation of our approach demonstrates that the performance of the Elixir generated code is competitive to, and can even outperform, hand-optimized code written by expert programmers for many interesting graph benchmarks.Computer Science

Texas ScholarWorks

Parallelizing Sequential Programs With Statistical Accuracy Tests

Author: Kim Deokhwan
Misailovic Sasa
Rinard Martin
Publication venue
Publication date: 05/08/2010
Field of study

We present QuickStep, a novel system for parallelizing sequential programs. QuickStep deploys a set of parallelization transformations that together induce a search space of candidate parallel programs. Given a sequential program, representative inputs, and an accuracy requirement, QuickStep uses performance measurements, profiling information, and statistical accuracy tests on the outputs of candidate parallel programs to guide its search for a parallelizationthat maximizes performance while preserving acceptable accuracy. When the search completes, QuickStep produces an interactive report that summarizes the applied parallelization transformations, performance, and accuracy results for the automatically generated candidate parallel programs. In our envisioned usage scenarios, the developer examines this report to evaluate the acceptability of the final parallelization and to obtain insight into how the original sequential program responds to different parallelization strategies. Itis also possible for the developer (or even a user of the program who has no software development expertise whatsoever) to simply use the best parallelization out of the box without examining the report or further investigating the parallelization. Results from our benchmark set of applications show that QuickStep can automatically generate accurate and efficient parallel programs---the automatically generated parallel versions of five of our six benchmark applications run between 5.0 and 7.7 times faster on 8 cores than the original sequential versions. Moreover, a comparison with the Intel icc compiler highlights how QuickStep can effectively parallelize applications with features (such as the use of modern object-oriented programming constructs or desirable parallelizations with infrequent but acceptable data races) that place them inherently beyond the reach of standard approaches

DSpace@MIT