112 research outputs found

OpenCL + OpenSHMEM Hybrid Programming Model for the Adapteva Epiphany Architecture

Author: D Wentzlaff
J Ross
JE Stone
M Baker
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 11/08/2016

There is interest in exploring hybrid OpenSHMEM + X programming models to extend the applicability of the OpenSHMEM interface to more hardware architectures. We present a hybrid OpenCL + OpenSHMEM programming model for device-level programming for architectures like the Adapteva Epiphany many-core RISC array processor. The Epiphany architecture comprises a 2D array of low-power RISC cores with minimal uncore functionality connected by a 2D mesh Network-on-Chip (NoC). The Epiphany architecture offers high computational energy efficiency for integer and floating point calculations as well as parallel scalability. The Epiphany-III is available as a coprocessor in platforms that also utilize an ARM CPU host. OpenCL provides good functionality for supporting a co-design programming model in which the host CPU offloads parallel work to a coprocessor. However, the OpenCL memory model is inconsistent with the Epiphany memory architecture and lacks support for inter-core communication. We propose a hybrid programming model in which OpenSHMEM provides a better solution by replacing the non-standard OpenCL extensions introduced to achieve high performance with the Epiphany architecture. We demonstrate the proposed programming model for matrix-matrix multiplication based on Cannon's algorithm, showing that the hybrid model addresses the deficiencies of using OpenCL alone to achieve good benchmark performance.
Comment: 12 pages, 5 figures, OpenSHMEM 2016: Third workshop on OpenSHMEM and Related Technologie

arXiv.org e-Print Archive

Crossref
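
The abstract above uses matrix-matrix multiplication based on Cannon's algorithm as its benchmark. As a rough illustration of that algorithm only, here is a minimal sequential sketch in plain Python that simulates a p x p grid of cores. It is not the paper's OpenCL + OpenSHMEM code; the function name `cannon_matmul` and the list-of-lists matrix layout are assumptions made for this example, and the block "shifts" that a real implementation would perform as inter-core transfers are modeled here by re-indexing lists.

```python
def cannon_matmul(a, b, p):
    """Multiply square matrices a and b on a simulated p x p grid of cores.

    Hypothetical helper for illustration: a and b are n x n lists of lists,
    and n must be divisible by p. Each simulated core owns one s x s block.
    """
    n = len(a)
    assert n % p == 0, "matrix size must be divisible by the grid size"
    s = n // p  # edge length of one block

    def block(m, i, j):
        # Copy block (i, j) out of the full matrix m.
        return [row[j * s:(j + 1) * s] for row in m[i * s:(i + 1) * s]]

    def mul_add(c, x, y):
        # c += x * y for s x s blocks.
        for i in range(s):
            for k in range(s):
                xik = x[i][k]
                for j in range(s):
                    c[i][j] += xik * y[k][j]

    # Initial skew: core (i, j) starts with A block (i, i+j) and B block (i+j, j).
    A = [[block(a, i, (i + j) % p) for j in range(p)] for i in range(p)]
    B = [[block(b, (i + j) % p, j) for j in range(p)] for i in range(p)]
    C = [[[[0] * s for _ in range(s)] for _ in range(p)] for _ in range(p)]

    for _ in range(p):
        # Every core multiplies the two blocks it currently holds.
        for i in range(p):
            for j in range(p):
                mul_add(C[i][j], A[i][j], B[i][j])
        # Shift A blocks one step left along rows, B blocks one step up along columns.
        A = [[A[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        B = [[B[(i + 1) % p][j] for j in range(p)] for i in range(p)]

    # Gather the per-core result blocks into one n x n matrix.
    out = [[0] * n for _ in range(n)]
    for i in range(p):
        for j in range(p):
            for r in range(s):
                out[i * s + r][j * s:(j + 1) * s] = C[i][j][r]
    return out
```

Each core only ever exchanges blocks with its immediate grid neighbours, which is why the algorithm maps naturally onto a 2D mesh such as the one the abstract describes.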

Tools for Designing Parallel Programs

Author: Буза М.К.
Publication venue: Інститут проблем штучного інтелекту МОН України та НАН України
Publication date: 01/01/2010

Modern tools for designing parallel programs are analyzed. The properties of loops in programs are identified and a classification of loops is given. A function library for loop parallelization is proposed. A general scheme for transforming sequential programs into sequential-parallel form is developed.

Наукова електронна бібліотека періодичних видань НАН України (Vernadsky National Library of Ukraine)

The Glasgow Parallel Reduction Machine: Programming Shared-memory Many-core Systems using Parallel Task Composition

Author: Tousimojarad Ashkan
Vanderbauwhede Wim
Publication venue: 'Open Publishing Association'
Publication date: 01/12/2013

We present the Glasgow Parallel Reduction Machine (GPRM), a novel, flexible framework for parallel task-composition based many-core programming. We allow the programmer to structure programs into task code, written as C++ classes, and communication code, written in a restricted subset of C++ with functional semantics and parallel evaluation. In this paper we discuss the GPRM, the virtual machine framework that enables the parallel task composition approach. We focus the discussion on GPIR, the functional language used as the intermediate representation of the bytecode running on the GPRM. Using examples in this language we show the flexibility and power of our task composition framework. We demonstrate the potential using an implementation of a merge sort algorithm on a 64-core Tilera processor, as well as on a conventional Intel quad-core processor and an AMD 48-core processor system. We also compare our framework with OpenMP tasks in a parallel pointer chasing algorithm running on the Tilera processor. Our results show that the GPRM programs outperform the corresponding OpenMP codes on all test platforms, and can greatly facilitate writing of parallel programs, in particular non-data parallel algorithms such as reductions.
Comment: In Proceedings PLACES 2013, arXiv:1312.221

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

Enlighten

Virtual Machine Support for Many-Core Architectures: Decoupling Abstract from Concrete Concurrency Models

Author: A. Peymandoust
Alastair R. Beresford
Andreas Gal Albert Noll
Bram Adams
Bratin Saha
Carl Hewitt
Charles Antony Richard Hoare
Charles R. Johns
Chen-Yong Cher
Colin Blundell
David Ungar
David Wentzlaff
Doug Lea
ECMA International
Edward A. Lee
Freescale Semiconductor
Georg Sorst
Gul Agha
Hans Schippers
Haris Volos
Intel Corporation
James Gosling
Jim Gray
John A. Trono
John S. Danaher
John Zigman
José M. Piquer
Kevin Casey
Kevin Williams
Larry Seiler
Lukasz Ziarek
M. Anton Ertl
Mark S. Miller
Maurice Herlihy
Michael Haupt
Michael R. Marty
Nir Shavit
Pascal Costanza
Philipp Haller
Rajesh K. Karmani
Robert D. Blumofe
Robert Virding
Simon Gay
Sriram Srinivasan
Stefan Marr
Stijn Timbermont
Theo D'Hondt
Thomas Kistler
Tom Van Cutsem
Uwe Kastens
Vijay A. Saraswat
Virendra J. Marathe
Wenzhang Zhu
Wolfgang De Meuter
Xu Wang
Yaoqing Gao
Publication venue: 'Open Publishing Association'
Publication date: 01/02/2010

The upcoming many-core architectures require software developers to exploit concurrency to utilize available computational power. Today's high-level language virtual machines (VMs), which are a cornerstone of software development, do not provide sufficient abstraction for concurrency concepts. We analyze concrete and abstract concurrency models and identify the challenges they impose for VMs. To provide sufficient concurrency support in VMs, we propose to integrate concurrency operations into VM instruction sets. Since there will always be VMs optimized for special purposes, our goal is to develop a methodology to design instruction sets with concurrency support. Therefore, we also propose a list of trade-offs that have to be investigated to advise the design of such instruction sets. As a first experiment, we implemented one instruction set extension for shared memory and one for non-shared memory concurrency. From our experimental results, we derived a list of requirements for a full-grown experimental environment for further research

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

Kent Academic Repository

A Multi-core architecture for a hybrid information system

Author: Chang Victor
Hamid Norhazlina
Walters Robert John
Wills Gary Brian
Publication venue: 'Elsevier BV'
Publication date: 20/12/2017

This paper presents our proposed multi-core architecture for a hybrid information system (HIS), together with the related work, system design, theory, experiments, analysis and discussion. We describe different cluster designs, communication between different types of chips and clusters, and network queuing methods. Our aim is to achieve quality, reliability and resilience. To demonstrate this, we focus on the latency of messages communicated in our system: how it arises and what can trigger its increase. We then experiment with the Store-and-Forward flow control method, the Wormhole flow control method, cluster size and message size to get a better understanding. Our analysis allows us to reduce latency and avoid its sharp increase. We justify our research contributions, particularly in the areas of “traffic analysis and management” and “performance analysis of transmission control” of HIS systems

Southampton (e-Prints Soton)

Crossref

Teesside University's Research Repository
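
The abstract above compares message latency under Store-and-Forward and Wormhole flow control for varying message sizes. The sketch below uses the common textbook zero-load approximations for the two schemes, not the paper's own model or measurements. The function names, parameter names and units (cycles, bits per cycle) are assumptions made for this example, and contention, header overhead and buffering effects are ignored.

```python
def store_and_forward_latency(hops, message_bits, link_bits_per_cycle, router_delay_cycles):
    """Zero-load latency in cycles when every hop receives the whole message first.

    Assumed textbook model: the full serialization time is paid at each hop.
    """
    serialization = message_bits / link_bits_per_cycle
    return hops * (router_delay_cycles + serialization)


def wormhole_latency(hops, message_bits, link_bits_per_cycle, router_delay_cycles):
    """Zero-load latency in cycles when the message is pipelined through the routers.

    Assumed textbook model: the serialization time is paid only once.
    """
    serialization = message_bits / link_bits_per_cycle
    return hops * router_delay_cycles + serialization


if __name__ == "__main__":
    # Illustrative numbers only: 4 hops, 32-bit links, 2-cycle routers.
    for bits in (128, 512, 2048):
        saf = store_and_forward_latency(4, bits, 32, 2)
        wh = wormhole_latency(4, bits, 32, 2)
        print(f"{bits:5d} bits: store-and-forward {saf:6.1f} cycles, wormhole {wh:6.1f} cycles")
```

Under these assumptions the gap between the two schemes widens with both hop count and message size, which is one reason message size and cluster size are natural parameters to vary in such experiments.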

Library Cache Coherence

Author: Cho Myong Hyon
Devadas Srinivas
Khan Omer
Lis Mieszko
Shim Keun Sup
Publication date: 02/05/2011

Directory-based cache coherence is a popular mechanism for chip multiprocessors and multicores. The directory protocol, however, requires multicast for invalidation messages and the collection of acknowledgement messages, which can be expensive in terms of latency and network traffic. Furthermore, the size of the directory increases with the number of cores. We present Library Cache Coherence (LCC), which requires neither broadcast/multicast for invalidations nor waiting for invalidation acknowledgements. A library is a set of timestamps that are used to auto-invalidate shared cache lines, and delay writes on the lines until all shared copies expire. The size of library is independent of the number of cores. By removing the complex invalidation process of directory-based cache coherence protocols, LCC generates fewer network messages. At the same time, LCC also allows reads on a cache block to take place while a write to the block is being delayed, without breaking sequential consistency. As a result, LCC has 1.85X less average memory latency than a MESI directory-based protocol on our set of benchmarks, even with a simple timestamp choosing algorithm; moreover, our experimental results on LCC with an ideal timestamp scheme (though not implementable) show the potential of further improvement for LCC with more sophisticated timestamp schemes

DSpace@MIT
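
The Library Cache Coherence abstract above describes timestamps that auto-invalidate shared copies and writes that are delayed until all shared copies expire. The following is a toy, single-line model of that idea only, not the LCC protocol from the paper. The class name `TimestampedLine`, the explicit `now` and `lease` arguments, the single pending write, and the rule that a read issued during a delayed write gets a lease capped at the write's commit time are all assumptions made for this sketch.

```python
class TimestampedLine:
    """Toy model of one shared cache line whose read copies expire by timestamp."""

    def __init__(self, value):
        self.value = value
        self.latest_expiry = 0   # latest time at which any handed-out copy is still valid
        self.pending = None      # (commit_time, new_value) for a delayed write, if any

    def _apply_pending(self, now):
        # A delayed write takes effect once its commit time has been reached.
        if self.pending is not None and now >= self.pending[0]:
            self.value = self.pending[1]
            self.pending = None

    def read(self, now, lease):
        """Return (value, expiry): the copy invalidates itself at time `expiry`."""
        self._apply_pending(now)
        expiry = now + lease
        if self.pending is not None:
            # Assumption: copies handed out during a delayed write must not outlive it.
            expiry = min(expiry, self.pending[0])
        self.latest_expiry = max(self.latest_expiry, expiry)
        return self.value, expiry

    def write(self, now, new_value):
        """Schedule a write and return the time at which it takes effect."""
        self._apply_pending(now)
        commit_time = max(now, self.latest_expiry)  # wait for every copy to expire
        self.pending = (commit_time, new_value)     # simplification: one pending write
        self._apply_pending(now)
        return commit_time


if __name__ == "__main__":
    line = TimestampedLine(1)
    print(line.read(now=0, lease=10))       # a reader takes a copy with a lease
    print(line.write(now=3, new_value=2))   # the write must wait for that lease
    print(line.read(now=5, lease=10))       # reads still proceed while the write waits
    print(line.read(now=10, lease=5))       # after the commit time the new value is visible
```

No invalidation or acknowledgement messages appear anywhere in the model: the writer only needs to know the latest expiry time, which is the property the abstract highlights.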

A Unified Operating System for Clouds and Manycore: fos

Author: Agarwal Anant
Beckmann Nathan
Belay Adam
Gruenwald Charles, III
Miller Jason
Modzelewski Kevin
Wentzlaff David
Youseff Lamia
Publication date: 20/11/2009

Single chip processors with thousands of cores will be available in the next ten years, and clouds of multicore processors afford the operating system designer thousands of cores today. Constructing operating systems for manycore and cloud systems faces similar challenges. This work identifies these shared challenges and introduces our solution: a factored operating system (fos) designed to meet the scalability, faultiness, variability of demand, and programming challenges of OS's for single-chip thousand-core manycore systems as well as current day cloud computers. Current monolithic operating systems are not well suited for manycores and clouds, as they have taken an evolutionary approach to scaling, such as adding fine grain locks and redesigning subsystems; however, these approaches do not increase scalability quickly enough. fos addresses the OS scalability challenge by using a message passing design and is composed of a collection of Internet inspired servers. Each operating system service is factored into a set of communicating servers which in aggregate implement a system service. These servers are designed much in the way that distributed Internet services are designed, but provide traditional kernel services instead of Internet services. Also, fos embraces the elasticity of cloud and manycore platforms by adapting resource utilization to match demand. fos facilitates writing applications across the cloud by providing a single system image across both future 1000+ core manycores and current day Infrastructure as a Service cloud computers. In contrast, current cloud environments do not provide a single system image and introduce complexity for the user by requiring different programming models for intra- vs inter-machine communication, and by requiring the use of non-OS standard management tools

DSpace@MIT