Search CORE

19 research outputs found

HaTS: Hardware-Assisted Transaction Scheduler

Author: Chen Zhanhao
Hassan Ahmed
Kishi Masoomeh Javidi
Nelson Jacob
Palmieri Roberto
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 23rd International Conference on Principles of Distributed Systems (OPODIS 2019)
Publication date: 01/01/2020

In this paper we present HaTS, a Hardware-assisted Transaction Scheduler. HaTS improves the performance of concurrent applications by classifying the executions of their atomic blocks (or in-memory transactions) into scheduling queues, according to their so-called conflict indicators. The goal is to group transactions that conflict while letting non-conflicting transactions proceed in parallel. Two core innovations characterize HaTS. First, HaTS does not assume that precise information about incoming transactions is available in order to classify them. It relaxes this assumption by exploiting the inherent conflict resolution provided by Hardware Transactional Memory (HTM). Second, HaTS dynamically adjusts the number of scheduling queues in order to capture the actual application contention level. Performance results using the STAMP benchmark suite show up to 2x improvement over state-of-the-art HTM-based scheduling techniques.
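The queue-classification idea in this abstract can be illustrated with a minimal sketch. It is not the HaTS implementation: the class name `QueueScheduler`, the modulo mapping from conflict indicator to queue, and the use of one software lock per queue (in place of HTM conflict resolution) are all assumptions made for illustration.

```python
import threading


class QueueScheduler:
    """Toy scheduler with one lock per scheduling queue (hypothetical design)."""

    def __init__(self, num_queues):
        # Assumption: a plain lock per queue stands in for the serialization
        # that a real scheduler would combine with hardware transactions.
        self.locks = [threading.Lock() for _ in range(num_queues)]

    def queue_for(self, conflict_indicator):
        # Same indicator -> same queue, so likely-conflicting work is grouped.
        return hash(conflict_indicator) % len(self.locks)

    def run(self, conflict_indicator, transaction):
        # Transactions in one queue run one at a time; different queues
        # can proceed in parallel.
        with self.locks[self.queue_for(conflict_indicator)]:
            return transaction()


scheduler = QueueScheduler(num_queues=4)
counter = {"value": 0}


def increment():
    counter["value"] += 1


# All eight transactions share indicator 7, so they land in one queue.
threads = [threading.Thread(target=scheduler.run, args=(7, increment))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Integer indicators are used here so that the queue mapping is easy to follow. Changing `num_queues` at run time, which the abstract describes, is deliberately left out because the adjustment policy is not given in this listing.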

Dagstuhl Research Online Publication Server

ACOTES project: Advanced compiler technologies for embedded streaming

Author: Albert Cohen
Alex Ramírez
Andrea Ornstein
Antoniu Pop
Ayal Zaks
Cupertino Miranda
Cédric Bastoul
David Ródenas
Dorit Nuzman
E. Blossom
E.A. Lee
Eduard Ayguadé
Erven Rohou
Harm Munk
Ira Rosen
J. Hoogerbrugge
Konrad Trifunović
Louis-Noël Pouchet
M. Gschwind
M. Wolfe
Marc Duranton
Marco Cornero
Menno Lindwer
Mohammed Fellahi
Paul Carpenter
Philippe Dumont
R. Allen
R.G. Scarborough
Razya Ladelsky
Roger Ferrer
S. Campanoni
Sebastian Pop
Uzi Shvadron
Xavier Martorell
Zbigniew Chamski
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011

Streaming applications are built of data-driven, computational components that consume and produce unbounded data streams. Streaming-oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task: the computation must be carefully partitioned and mapped to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer's productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another, more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of the Advanced Compiler Technologies that we developed to support Embedded Streaming. Peer reviewed. Postprint (published version).

HAL-CentraleSupelec

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

UPCommons. Portal del coneixement obert de la UPC

INRIA a CCSD electronic archive server

HAL-MINES ParisTech

The University of Manchester - Institutional Repository

HAL-Rennes 1

Putting Queens in Carry Chains

Author: Nägel Bernd
Preußer Thomas B.
Spallek Rainer G.
Publication venue: Technische Universität Dresden
Publication date: 14/11/2012

This paper describes an FPGA implementation of a solution-counting solver for the N-Queens Puzzle. The proposed algorithmic mapping utilizes the fast carry-chain logic found on modern FPGA architectures in order to achieve a regular and efficient design. Starting from an initial full chessboard mapping, several optimization strategies are explored. We also describe the infrastructure that we have constructed for computing the currently unknown solution count of the 26-Queens Puzzle. Finally, we compare the performance of the concrete FPGA device mappings we used, both with each other and against general-purpose CPUs.
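For readers unfamiliar with solution counting, the following is a minimal software sketch of the problem the solver addresses. It uses the standard bitmask backtracking technique on a CPU and is not the paper's FPGA carry-chain mapping; the function name `count_queens` is chosen here for illustration.

```python
def count_queens(n):
    """Count all placements of n non-attacking queens on an n x n board."""
    full = (1 << n) - 1  # one bit per column

    def place(cols, diag_l, diag_r):
        # cols: occupied columns; diag_l / diag_r: squares in the current
        # row that are attacked along the two diagonal directions.
        if cols == full:
            return 1
        total = 0
        free = full & ~(cols | diag_l | diag_r)
        while free:
            bit = free & -free  # lowest free column in this row
            free ^= bit
            total += place(cols | bit,
                           ((diag_l | bit) << 1) & full,
                           (diag_r | bit) >> 1)
        return total

    return place(0, 0, 0)


print(count_queens(8))  # prints 92
```

The run time grows very quickly with the board size, which is why large instances such as the 26-Queens count mentioned in the abstract call for dedicated hardware and infrastructure.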

Technische Universität Dresden: Qucosa

Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures

Author: A. Duran
C. Augonnet
H. Topcuoglu
V.J. Jiménez
Publication venue: HAL CCSD
Publication date: 01/01/2009

Multicore architectures featuring specialized accelerators are getting an increasing amount of attention, and this success will probably influence the design of future High Performance Computing hardware. Unfortunately, programmers are having a hard time trying to exploit all these heterogeneous computing units efficiently, and most existing efforts simply focus on providing tools to offload some computations to available accelerators. Recently, some runtime systems have been designed that exploit the idea of scheduling (as opposed to offloading) parallel tasks over the whole set of heterogeneous computing units. Scheduling tasks over heterogeneous platforms makes it necessary to use accurate prediction models in order to assign each task to its most adequate computing unit. A deep knowledge of the application is usually required to build per-task performance models, based on the algorithmic complexity of the underlying numeric kernel. We present an alternate, auto-tuning performance prediction approach based on performance history tables dynamically built during the application run. This approach does not require the programmer to provide any specific information. We show that, thanks to the use of a carefully chosen hash function, our approach quickly achieves accurate performance estimations automatically. Our approach even outperforms regular algorithmic performance models with several linear algebra numerical kernels.
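The history-table idea in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' runtime: the class `HistoryModel`, its method names, the choice of Python's built-in `hash` over the kernel name, computing unit and data sizes, and the use of a running mean as the prediction are all hypothetical.

```python
from collections import defaultdict


class HistoryModel:
    """Toy performance model built from measured run times (hypothetical)."""

    def __init__(self):
        # hash key -> [number of samples, total measured seconds]
        self._table = defaultdict(lambda: [0, 0.0])

    @staticmethod
    def _key(kernel, unit, sizes):
        # Assumption: tasks with the same kernel, unit and data sizes
        # behave alike, so they share one table entry.
        return hash((kernel, unit, tuple(sizes)))

    def record(self, kernel, unit, sizes, seconds):
        entry = self._table[self._key(kernel, unit, sizes)]
        entry[0] += 1
        entry[1] += seconds

    def predict(self, kernel, unit, sizes):
        # Returns the mean of past measurements, or None without history.
        entry = self._table.get(self._key(kernel, unit, sizes))
        return entry[1] / entry[0] if entry else None

    def best_unit(self, kernel, units, sizes):
        # Pick the unit with the lowest predicted time among those with history.
        known = [(self.predict(kernel, u, sizes), u) for u in units]
        known = [(t, u) for t, u in known if t is not None]
        return min(known)[1] if known else None


model = HistoryModel()
model.record("gemm", "cpu", (512, 512), 0.4)
model.record("gemm", "cpu", (512, 512), 0.6)
model.record("gemm", "gpu", (512, 512), 0.1)
chosen = model.best_unit("gemm", ["cpu", "gpu"], (512, 512))
```

No programmer-supplied cost formula is needed: the table fills in as tasks run, which matches the auto-tuning behaviour the abstract describes at a high level.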

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Performance Analysis of Adaptive GPS Signal Detection in Urban Interference Environment using the Monte Carlo Approach

Author: Ch. Kabakchiev
H. Rohling
I. Garvanov
V. Behar
Publication venue: 'IntechOpen'
Publication date: 28/02/2011

IntechOpen

Crossbar-based memristive logic-in-memory architecture

Author: Abustelema Angel
Papandroulikadis Georgios
Rubio Sola Jose Antonio
Sirakoulis Georgios
Vourkas Ioannis
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017

The use of memristors and resistive random access memory (ReRAM) technology to perform logic computations has drawn considerable attention from researchers in recent years. However, the topological aspects of the underlying ReRAM architecture and its organization have received less attention, as the focus has mainly been on device-specific properties for functionally complete logic gates through conditional switching in ReRAM circuits. A careful investigation and optimization of the target geometry is thus highly desirable for the implementation of logic-in-memory architectures. In this paper, we propose a crossbar-based in-memory parallel processing system in which, through the heterogeneity of the resistive cross-point devices, we achieve local information processing in a state-of-the-art ReRAM crossbar architecture with vertical group-accessed transistors as cross-point selector devices. We primarily focus on the array organization, information storage, and processing flow, while proposing a novel geometry for the cross-point selection lines to mitigate current sneak-paths during an arbitrary number of possible parallel logic computations. We demonstrate the proper functioning and potential capabilities of the proposed architecture through SPICE-level circuit simulations of half-adder and sum-of-products logic functions. We compare certain features of the proposed logic-in-memory approach with another work from the literature, and present an analysis of circuit resources, integration density, and logic computation parallelism. Peer reviewed. Postprint (author's final draft).

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Anuari: 2011-2012. Departament d'Arquitectura de Computadors (DAC)

Publication date: 12/02/2013

UPCommons. Portal del coneixement obert de la UPC

Parallel Programming with Global Asynchronous Memory: Models, C++ APIs and Implementations

Author: Drocco Maurizio
Publication date: 01/01/2017

In the realm of High Performance Computing (HPC), message passing has been the programming paradigm of choice for over twenty years. The durable MPI (Message Passing Interface) standard, with send/receive communication, broadcast, gather/scatter, and reduction collectives, is still used to construct parallel programs where each communication is orchestrated by the developer, based on precise knowledge of data distribution and overheads; collective communications simplify the orchestration but might induce excessive synchronization. Early attempts to bring the shared-memory programming model, with its programming advantages, to distributed computing, referred to as the Distributed Shared Memory (DSM) model, faded away; one of the main issues was to combine performance and programmability with the memory consistency model. The recently proposed Partitioned Global Address Space (PGAS) model is a modern revamp of DSM that exposes data placement to enable optimizations based on locality, but it still addresses (simple) data-parallelism only and it relies on expensive sharing protocols. We advocate an alternative programming model for distributed computing based on a Global Asynchronous Memory (GAM), aiming to avoid coherency and consistency problems rather than solving them. We materialize GAM by designing and implementing a distributed smart pointers library, inspired by C++ smart pointers. In this model, public and private pointers (resembling C++ shared and unique pointers, respectively) are moved around instead of messages (i.e., data), thus relieving the user of the burden of minimizing transfers. On top of smart pointers, we propose a high-level C++ template library for writing applications in terms of dataflow-like networks, namely GAM nets, consisting of stateful processors exchanging pointers in a fully asynchronous fashion.
We demonstrate the validity of the proposed approach, from the expressiveness perspective, by showing how GAM nets can be exploited to implement both standalone applications and higher-level parallel programming models, such as data and task parallelism. As for the performance perspective, preliminary experiments show both close-to-ideal scalability and negligible overhead with respect to state-of-the-art benchmark implementations. For instance, the GAM implementation of a high-quality video restoration filter sustains a 100 fps throughput over 70%-noisy high-quality video streams on a 4-node cluster of Graphics Processing Units (GPUs), with minimal programming effort.

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Institutional Research Information System University of Turin