Search CORE

160 research outputs found

Performance Aspects of Synthesizable Computing Systems

Author: Schleuniger Pascal
Publication venue: Technical University of Denmark
Publication date: 01/01/2014
Field of study

Replicable parallel branch and bound search

Author: Abu-Khzam
Alba
Aldinucci
Archibald
Bernard Gendron
Blair Archibald
Blumofe
Bomze
Butenko
Chandra
Chu
Ciaran McCreesh
Cole
Dean
Depolli
de Bruin
Eblen
Everitt
Fukagawa
Hall
Harvey
Jones
Konc
Lai
Laporte
Li
Li
Li
Maier
Martello
Martello
Matoušek
McCreesh
McCreesh
McCreesh
McCreesh
Moisan
Morrison
Nikolaev
Okubo
Olivier
Patrick Maier
Phil Trinder
Pisinger
Poldner
Prim
Prosser
Regula
Reinders
Reinelt
Robert Stewart
Salkin
San Segundo
San Segundo
Segundo
Segundo
Segundo
Tomita
Tomita
Tomita
Trienekens
Vogels
Walsh
Wu
Xiang
Yan
Publication venue: 'Elsevier BV'
Publication date: 12/07/2017
Field of study

Combinatorial branch and bound searches are a common technique for solving global optimisation and decision problems. Their performance often depends on good search order heuristics, refined over decades of algorithms research. Parallel search necessarily deviates from the sequential search order, sometimes dramatically and unpredictably, e.g. by distributing work at random. This can disrupt effective search order heuristics and lead to unexpected and highly variable parallel performance. The variability makes it hard to reason about the parallel performance of combinatorial searches. This paper presents a generic parallel branch and bound skeleton, implemented in Haskell, with replicable parallel performance. The skeleton aims to preserve the search order heuristic by distributing work in an ordered fashion, closely following the sequential search order. We demonstrate the generality of the approach by applying the skeleton to 40 instances of three combinatorial problems: Maximum Clique, 0/1 Knapsack and Travelling Salesperson. The overheads of our Haskell skeleton are reasonable: giving slowdown factors of between 1.9 and 6.2 compared with a class-leading, dedicated, and highly optimised C++ Maximum Clique solver. We demonstrate scaling up to 200 cores of a Beowulf cluster, achieving speedups of 100x for several Maximum Clique instances. We demonstrate low variance of parallel performance across all instances of the three combinatorial problems and at all scales up to 200 cores, with median Relative Standard Deviation (RSD) below 2%. Parallel solvers that do not follow the sequential search order exhibit far higher variance, with median RSD exceeding 85% for Knapsack

arXiv.org e-Print Archive

Enlighten: Research Data (University of Glasgow)

Crossref

Heriot Watt Pure

Stirling Online Research Repository (RIOXX)

Sheffield Hallam University Research Archive

Enlighten

Stirling Online Research Repository

LEVERAGING DATA-FLOW INFORMATION FOR EFFICIENT SCHEDULING OF TASK-PARALLEL PROGRAMS ON HETEROGENEOUS SYSTEMS

Author: Simsek Osman
Publication venue
Publication date: 01/08/2020
Field of study

The University of Manchester - Institutional Repository

FPGA-based Query Acceleration for Non-relational Databases

Author: Dann Jonas Christian
Publication venue
Publication date: 01/01/2024
Field of study

Database management systems are an integral part of today’s everyday life. Trends like smart applications, the internet of things, and business and social networks require applications to deal efficiently with data in various data models close to the underlying domain. Therefore, non-relational database systems provide a wide variety of database models, like graphs and documents. However, current non-relational database systems face performance challenges due to the end of Dennard scaling and therefore performance scaling of CPUs. In the meanwhile, FPGAs have gained traction as accelerators for data management. Our goal is to tackle the performance challenges of non-relational database systems with FPGA acceleration and, at the same time, address design challenges of FPGA acceleration itself. Therefore, we split this thesis up into two main lines of work: graph processing and flexible data processing. Because of the lacking benchmark practices for graph processing accelerators, we propose GraphSim. GraphSim is able to reproduce runtimes of these accelerators based on a memory access model of the approach. Through this simulation environment, we extract three performance-critical accelerator properties: asynchronous graph processing, compressed graph data structure, and multi-channel memory. Since these accelerator properties have not been combined in one system, we propose GraphScale. GraphScale is the first scalable, asynchronous graph processing accelerator working on a compressed graph and outperforms all state-of-the-art graph processing accelerators. Focusing on accelerator flexibility, we propose PipeJSON as the first FPGA-based JSON parser for arbitrary JSON documents. PipeJSON is able to achieve parsing at line-speed, outperforming the fastest, vectorized parsers for CPUs. Lastly, we propose the subgraph query processing accelerator GraphMatch which outperforms state-of-the-art CPU systems for subgraph query processing and is able to flexibly switch queries during runtime in a matter of clock cycles

Heidelberger Dokumentenserver

Stealthy Opaque Predicates in Hardware -- Obfuscating Constant Expressions at Negligible Overhead

Author: Hoffmann Max
Paar Christof
Publication venue
Publication date: 08/05/2018
Field of study

Opaque predicates are a well-established fundamental building block for software obfuscation. Simplified, an opaque predicate implements an expression that provides constant Boolean output, but appears to have dynamic behavior for static analysis. Even though there has been extensive research regarding opaque predicates in software, techniques for opaque predicates in hardware are barely explored. In this work, we propose a novel technique to instantiate opaque predicates in hardware, such that they (1) are resource-efficient, and (2) are challenging to reverse engineer even with dynamic analysis capabilities. We demonstrate the applicability of opaque predicates in hardware for both, protection of intellectual property and obfuscation of cryptographic hardware Trojans. Our results show that we are able to implement stealthy opaque predicates in hardware with minimal overhead in area and no impact on latency

arXiv.org e-Print Archive

Ruhr-Universität Bochum (RUB): Open Journal Systems

Efficiently and Transparently Maintaining High SIMD Occupancy in the Presence of Wavefront Irregularity

Author: Cole Stephen V
Publication venue: Washington University Open Scholarship
Publication date: 15/08/2017
Field of study

Demand is increasing for high throughput processing of irregular streaming applications; examples of such applications from scientific and engineering domains include biological sequence alignment, network packet filtering, automated face detection, and big graph algorithms. With wide SIMD, lightweight threads, and low-cost thread-context switching, wide-SIMD architectures such as GPUs allow considerable flexibility in the way application work is assigned to threads. However, irregular applications are challenging to map efficiently onto wide SIMD because data-dependent filtering or replication of items creates an unpredictable data wavefront of items ready for further processing. Straightforward implementations of irregular applications on a wide-SIMD architecture are prone to load imbalance and reduced occupancy, while more sophisticated implementations require advanced use of parallel GPU operations to redistribute work efficiently among threads. This dissertation will present strategies for addressing the performance challenges of wavefront- irregular applications on wide-SIMD architectures. These strategies are embodied in a developer framework called Mercator that (1) allows developers to map irregular applications onto GPUs ac- cording to the streaming paradigm while abstracting from low-level data movement and (2) includes generalized techniques for transparently overcoming the obstacles to high throughput presented by wavefront-irregular applications on a GPU. Mercator forms the centerpiece of this dissertation, and we present its motivation, performance model, implementation, and extensions in this work

Washington University St. Louis: Open Scholarship

PYDAC: A DISTRIBUTED RUNTIME SYSTEM AND PROGRAMMING MODEL FOR A HETEROGENEOUS MANY-CORE ARCHITECTURE

Author: Huang Bin
NC DOCKS at The University of North Carolina at Charlotte
Publication venue
Publication date: 01/01/2014
Field of study

Heterogeneous many-core architectures that consist of big, fast cores and small, energy-efficient cores are very promising for future high-performance computing (HPC) systems. These architectures offer a good balance between single-threaded perfor- mance and multithreaded throughput. Such systems impose challenges on the design of programming model and runtime system. Specifically, these challenges include (a) how to fully utilize the chip’s performance, (b) how to manage heterogeneous, un- reliable hardware resources, and (c) how to generate and manage a large amount of parallel tasks. This dissertation proposes and evaluates a Python-based programming framework called PyDac. PyDac supports a two-level programming model. At the high level, a programmer creates a very large number of tasks, using the divide-and-conquer strategy. At the low level, tasks are written in imperative programming style. The runtime system seamlessly manages the parallel tasks, system resilience, and inter- task communication with architecture support. PyDac has been implemented on both an field-programmable gate array (FPGA) emulation of an unconventional het- erogeneous architecture and a conventional multicore microprocessor. To evaluate the performance, resilience, and programmability of the proposed system, several micro-benchmarks were developed. We found that (a) the PyDac abstracts away task communication and achieves programmability, (b) the micro-benchmarks are scalable on the hardware prototype, but (predictably) serial operation limits some micro-benchmarks, and (c) the degree of protection versus speed could be varied in redundant threading that is transparent to programmers

The University of North Carolina at Greensboro

Metacomputing on clusters augmented with reconfigurable hardware

Author: Healy Philip D.
Publication venue: 'University College Cork'
Publication date: 01/01/2006
Field of study

Cork Open Research Archive