15 research outputs found

    FastFlow tutorial

    Full text link
    FastFlow is a structured parallel programming framework targeting shared-memory multicores. Its layered design, together with the optimized communication mechanisms underlying the streaming networks it exposes to application programmers as algorithmic skeletons, supports the development of efficient fine-grain parallel applications. FastFlow is available open source at SourceForge (http://sourceforge.net/projects/mc-fastflow/). This work introduces FastFlow programming techniques and describes the different ways to parallelize existing C/C++ code using FastFlow, including its use as a software accelerator. In short, it is a tutorial on FastFlow.
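
    To make the skeleton style concrete, here is a minimal three-stage FastFlow pipeline in the spirit of the tutorial; exact headers and helper names may differ across FastFlow releases.

        #include <iostream>
        #include <ff/ff.hpp>   // FastFlow v3 umbrella header; older releases split headers
        using namespace ff;

        // First stage: emit the integers 1..10 as a stream, then end-of-stream.
        struct Source : ff_node_t<long> {
            long* svc(long*) {
                for (long i = 1; i <= 10; ++i) ff_send_out(new long(i));
                return EOS;
            }
        };
        // Middle stage: square each item in place and forward it.
        struct Square : ff_node_t<long> {
            long* svc(long* x) { *x *= *x; return x; }
        };
        // Last stage: print and free each item; GO_ON keeps the stage alive.
        struct Sink : ff_node_t<long> {
            long* svc(long* x) { std::cout << *x << '\n'; delete x; return GO_ON; }
        };

        int main() {
            Source s; Square m; Sink k;
            ff_Pipe<> pipe(s, m, k);   // a three-stage streaming network
            return pipe.run_and_wait_end() < 0 ? 1 : 0;
        }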

    Analysing and Reducing Costs of Deep Learning Compiler Auto-tuning

    Get PDF
    Deep Learning (DL) is significantly impacting many industries, including automotive, retail, and medicine, enabling autonomous driving, recommender systems, and genomics modelling, amongst other applications. At the same time, demand for complex and fast DL models is continually growing. The most capable models tend to exhibit the highest operational costs, primarily due to their large computational resource footprint and inefficient utilisation of the computational resources employed by DL systems. In an attempt to tackle these problems, DL compilers and auto-tuners emerged, automating the traditionally manual task of DL model performance optimisation. While auto-tuning improves model inference speed, it is a costly process, which limits its wider adoption within DL deployment pipelines. The high operational costs associated with DL auto-tuning have multiple causes. During operation, DL auto-tuners explore large search spaces, consisting of billions of tensor programs, to propose candidates that improve DL model inference latency. Subsequently, DL auto-tuners measure candidate performance in isolation on the target device, which constitutes the majority of auto-tuning compute time. Suboptimal candidate proposals, combined with their serial measurement on an isolated target device, lead to prolonged optimisation time and reduced resource availability, ultimately reducing the cost-efficiency of the process. In this thesis, we investigate the reasons behind prolonged DL auto-tuning and quantify their impact on optimisation costs, revealing directions for improved DL auto-tuner design. Based on these insights, we propose two complementary systems: Trimmer and DOPpler. Trimmer improves tensor program search efficacy by filtering out poorly performing candidates, and controls end-to-end auto-tuning using cost objectives, monitoring optimisation cost. Simultaneously, DOPpler breaks the long-held assumption of serial candidate measurement by successfully parallelising measurements intra-device, with minimal penalty to optimisation quality. Through extensive experimental evaluation of both systems, we demonstrate that they significantly improve the cost-efficiency of auto-tuning (by up to 50.5%) across a plethora of tensor operators, DL models, auto-tuners, and target devices.
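
    The Trimmer-style idea of filtering weak candidates and tuning under an explicit cost objective can be sketched as follows; Candidate, tune, and the toy cost model are hypothetical illustrations, not the thesis's implementation.

        #include <algorithm>
        #include <cstddef>
        #include <iostream>
        #include <random>
        #include <vector>

        // Hypothetical candidate: a tensor-program variant with a cost-model
        // score (predicted latency) and a true latency revealed by measurement.
        struct Candidate { double predicted_ms; double true_ms; };

        // Keep only the fraction of candidates the cost model ranks best, then
        // "measure" them until a wall-clock budget is spent: filtering weak
        // proposals and stopping on a cost objective.
        double tune(std::vector<Candidate> cands, double keep_frac, double budget_ms) {
            std::sort(cands.begin(), cands.end(),
                      [](const Candidate& a, const Candidate& b) {
                          return a.predicted_ms < b.predicted_ms; });
            std::size_t keep = std::max<std::size_t>(
                1, static_cast<std::size_t>(cands.size() * keep_frac));
            cands.resize(keep);

            double spent = 0.0, best = 1e30;
            for (const Candidate& c : cands) {
                if (spent >= budget_ms) break;  // cost objective reached: stop
                spent += c.true_ms;             // measurement dominates tuning time
                best = std::min(best, c.true_ms);
            }
            return best;
        }

        int main() {
            std::mt19937 rng(42);
            std::uniform_real_distribution<double> lat(1.0, 50.0);
            std::vector<Candidate> cands(1000);
            for (auto& c : cands) {
                c.true_ms = lat(rng);
                c.predicted_ms = c.true_ms + lat(rng) * 0.2;  // noisy cost model
            }
            std::cout << "best latency: " << tune(cands, 0.1, 500.0) << " ms\n";
        }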

    SMARTS: Exploiting Temporal Locality and Parallelism through Vertical Execution

    Get PDF
    In the solution of large-scale numerical problems, parallel computing is becoming simultaneously more important and more difficult. The complex organization of today's multiprocessors, with several levels of memory hierarchy, has forced the scientific programmer to choose between simple but unscalable code and scalable but extremely complex code that does not port to other architectures. This paper describes how the SMARTS runtime system and the POOMA C++ class library for high-performance scientific computing work together to exploit data parallelism in scientific applications while hiding the details of managing parallelism and data locality from the user. We present innovative algorithms, based on the macro-dataflow model, for detecting data parallelism and efficiently executing data-parallel statements on shared-memory multiprocessors. We also describe how these algorithms can be implemented on clusters of SMPs.
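
    The locality benefit of vertical execution can be illustrated with a serial sketch: two dependent data-parallel statements are evaluated block by block, so each block is reused while still cached. The real SMARTS runtime additionally schedules such per-block iterates across threads with dependence tracking; this fragment only shows the traversal order.

        #include <algorithm>
        #include <iostream>
        #include <vector>

        // Two data-parallel statements, b = a + 1 and c = b * 2, evaluated
        // "vertically": for each cache-sized block, both statements run back
        // to back while the block is hot, instead of sweeping the whole
        // array twice. Each block iteration is one iterate whose dependency
        // is the producing statement on the same block.
        int main() {
            const std::size_t n = 1 << 20, block = 4096;
            std::vector<double> a(n, 1.0), b(n), c(n);

            for (std::size_t lo = 0; lo < n; lo += block) {   // one iterate per block
                std::size_t hi = std::min(lo + block, n);
                for (std::size_t i = lo; i < hi; ++i) b[i] = a[i] + 1.0;  // statement 1
                for (std::size_t i = lo; i < hi; ++i) c[i] = b[i] * 2.0;  // statement 2
            }
            std::cout << c[0] << '\n';  // prints 4
        }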

    Modeling of an Adaptive Parallel System with Malleable Applications in a Distributed Computing Environment

    Get PDF
    Adaptive parallel applications that can change resources during execution promise increased application performance and better system utilization. Furthermore, they open the opportunity to develop a new class of parallel applications driven by unpredictable data and events. The research issues in an adaptive parallel system are complex and interrelated, and the nature and complexity of the relationships among them are not well researched and understood. Before developing adaptive applications or infrastructure support for them, these issues need to be investigated and studied in detail. One way of understanding and investigating these issues is through modeling and simulation. A model for adaptive parallel systems has been developed to enable investigating the impact of malleable workloads on performance. The model can be used to determine how different model parameters impact the performance of the system and to determine the relationships among them. Subsequently, a discrete-event simulator has been developed to numerically simulate the model. Using the simulator, the impact on system performance of variation in the number of malleable jobs in the workload, the flexibility, the negotiation cost, and the adaptation cost has been studied. The results and conclusions of these simulation experiments are presented in this dissertation. In general, the simulation results reveal that performance improves with an increase in the number of malleable jobs in a workload, and that performance saturates at a certain rigid-to-malleable job mix; a high percentage of malleable jobs is not necessary to achieve significant improvement. Performance generally improves as flexibility increases up to a certain point, then saturates. The negotiation cost impacts performance, but not significantly. The number of negotiations for a given workload increases with the number of malleable jobs up to a certain point, and then decreases as the number of malleable jobs increases further. Performance degrades as the application adaptation cost increases, and the impact of the adaptation cost on performance is much more significant than that of the negotiation cost.
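
    A discrete-event simulator of this kind reduces to a clock and a time-ordered event queue. The following minimal C++ skeleton (names illustrative, not the dissertation's simulator) shows the event loop with a toy processor-negotiation action.

        #include <functional>
        #include <iostream>
        #include <queue>
        #include <vector>

        // An event carries a timestamp and an action; the loop always
        // advances the clock to the earliest pending event.
        struct Event {
            double time;
            std::function<void()> action;
            bool operator>(const Event& o) const { return time > o.time; }
        };

        struct Simulator {
            double now = 0.0;
            std::priority_queue<Event, std::vector<Event>, std::greater<Event>> q;
            void schedule(double delay, std::function<void()> a) {
                q.push({now + delay, std::move(a)});
            }
            void run() {
                while (!q.empty()) {
                    Event e = q.top(); q.pop();
                    now = e.time;
                    e.action();
                }
            }
        };

        int main() {
            Simulator sim;
            int free_procs = 16;
            // A malleable job arrives, negotiates for part of what is free,
            // and on completion returns its processors -- a point where the
            // scheduler could renegotiate other running malleable jobs.
            sim.schedule(1.0, [&] {
                int granted = free_procs / 2;   // toy negotiation policy
                free_procs -= granted;
                std::cout << "t=" << sim.now << " job granted " << granted << " procs\n";
                sim.schedule(5.0, [&, granted] {
                    free_procs += granted;
                    std::cout << "t=" << sim.now << " job done, " << free_procs << " procs free\n";
                });
            });
            sim.run();
        }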

    A domain-extensible compiler with controllable automation of optimisations

    Get PDF
    In high-performance domains like image processing, physics simulation, or machine learning, program performance is critical. Programmers called performance engineers are responsible for the challenging task of optimising programs. Two major challenges prevent modern compilers targeting heterogeneous architectures from reliably automating optimisation. First, domain-specific compilers such as Halide for image processing and TVM for machine learning are difficult to extend with the new optimisations required by new algorithms and hardware. Second, automatic optimisation is often unable to achieve the required performance, and performance engineers often fall back to painstaking manual optimisation. This thesis shows the potential of the Shine compiler to achieve domain-extensibility, controllable automation, and high-performance code generation. Domain-extensibility facilitates adapting compilers to new algorithms and hardware. Controllable automation enables performance engineers to gradually take control of the optimisation process. The first research contribution is to add 3 code generation features to Shine, namely synchronisation barrier insertion, kernel execution, and storage folding. Adding these features requires making novel design choices in terms of compiler extensibility and controllability. The rest of this thesis builds on these features to generate code with runtime competitive with established domain-specific compilers. The second research contribution is to demonstrate how extensibility and controllability are exploited to optimise a standard image processing pipeline for corner detection. Shine achieves 6 well-known image processing optimisations, 2 of which are not supported by Halide. Our results on 4 ARM multi-core CPUs show that the code generated by Shine for corner detection runs up to 1.4× faster than the Halide code. However, we observe that controlling rewriting is tedious, motivating the need for more automation. The final research contribution is to introduce sketch-guided equality saturation, a semi-automated technique that allows performance engineers to guide program rewriting by specifying rewrite goals as sketches: program patterns that leave details unspecified. We evaluate this approach by applying 7 realistic optimisations to matrix multiplication. Without guidance, the compiler fails to apply the 5 most complex optimisations even given an hour and 60GB of RAM. With the guidance of at most 3 sketch guides, each 10 times smaller than the complete program, the compiler applies the optimisations in seconds using less than 1GB of RAM.
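
    The notion of a sketch, a program pattern leaving details unspecified, can be illustrated with a toy term matcher in which a hole matches any subterm. This is an illustration only, not Shine's sketch language or its equality-saturation machinery.

        #include <iostream>
        #include <string>
        #include <vector>

        // A term is an operator with subterms; in a sketch, the operator
        // "?" is a hole that matches any concrete subterm.
        struct Term {
            std::string op;
            std::vector<Term> args;
        };

        // A sketch matches a program if every specified operator lines up,
        // while holes leave the corresponding details unspecified.
        bool matches(const Term& sketch, const Term& prog) {
            if (sketch.op == "?") return true;            // hole: match anything
            if (sketch.op != prog.op) return false;
            if (sketch.args.size() != prog.args.size()) return false;
            for (std::size_t i = 0; i < sketch.args.size(); ++i)
                if (!matches(sketch.args[i], prog.args[i])) return false;
            return true;
        }

        int main() {
            // Concrete program: map(f, split(32, xs)). The sketch only
            // requires "a map over a split", leaving the rest unspecified.
            Term prog   {"map", {{"f", {}}, {"split", {{"32", {}}, {"xs", {}}}}}};
            Term sketch {"map", {{"?", {}}, {"split", {{"?", {}},  {"?", {}}}}}};
            std::cout << (matches(sketch, prog) ? "sketch matches\n" : "no match\n");
        }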

    Automatic scheduling of image processing pipelines

    Get PDF

    Analysis and architectural support for parallel stateful packet processing

    Get PDF
    The evolution of network services is closely related to the network technology trend. Originally, network nodes forwarded packets from a source to a destination by executing lightweight, or even negligible, packet-processing workloads. As links provide more complex services, packet processing demands the execution of more computationally intensive applications. Complex network applications deal with both the packet header and the payload (i.e. packet contents) to provide upper-layer network services, such as enhanced security, system utilization policies, and video-on-demand management. Applications that provide complex network services require two key capabilities that distinguish them from lower-layer network applications: a) deep packet inspection, which examines the packet payload, typically searching for a matching string or regular expression, and b) stateful processing, which keeps track of information from previous packet processing, unlike applications that retain no data across packets. In most cases, deep packet inspection also integrates stateful processing. Computer architecture research aims to maximize system throughput to sustain the required network processing performance as well as other demands, such as memory and I/O bandwidth. In fact, different processor architectures exist depending on the degree to which hardware resources are shared among streams (i.e. hardware contexts). Multicore architectures present multiple processing engines within a single chip that share cache levels of the memory hierarchy and the interconnection network. Multithreaded architectures integrate multiple streams in a single processing engine sharing functional units, the register file, the fetch unit, and inner levels of the cache hierarchy. Scalable multicore multithreaded architectures emerge as a solution to meet the requirements of high-throughput systems. We use the term massively multithreaded architectures for architectures that comprise tens to hundreds of streams distributed across multiple cores on a chip. Nevertheless, the efficient utilization of these architectures depends on the application characteristics. On one hand, emerging network applications show large computational workloads with significant variations in packet-processing behavior, so it is important to analyze the behavior of each packet's processing to optimally assign packets to threads (i.e. software contexts) and reduce negative interactions among them. On the other hand, network applications present Packet-Level Parallelism (PLP), in which several packets can be processed in parallel. As in other paradigms, dependencies among packets limit the amount of PLP. Lower-layer network applications show negligible packet dependencies; in contrast, complex upper-layer applications show dependencies among packets that reduce the amount of PLP. In this thesis, we address the limitations of parallelism in stateful network applications to maximize the throughput of advanced network devices. The dissertation comprises three complementary sets of contributions: network analysis, workload characterization, and an architectural proposal. The network analysis evaluates the impact of network traffic on stateful network applications. We especially study the impact of network traffic aggregation on memory hierarchy performance, and we categorize and characterize network applications according to their data management. The results point out that stateful processing presents reduced instruction-level parallelism and a high rate of long-latency memory accesses. Our analysis reveals that stateful applications expose a variety of levels of parallelism related to stateful data categories. Thus, we propose MultiLayer Processing (MLP) as an execution model to exploit multiple levels of parallelism. MLP is a thread-migration-based mechanism that increases the synergy among streams in the memory hierarchy and alleviates contention in critical sections of parallel stateful workloads.
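
    The stateful-processing property discussed above amounts to a flow table consulted on every packet. The toy C++ fragment below (names illustrative) shows per-flow state updated across packets; a parallel version would contend on exactly this shared structure, which is the pressure MLP is designed to relieve.

        #include <cstdint>
        #include <iostream>
        #include <unordered_map>

        // Simplified flow identifier (a 4-tuple here; real classifiers also
        // use the protocol field).
        struct FlowKey {
            uint32_t src_ip, dst_ip;
            uint16_t src_port, dst_port;
            bool operator==(const FlowKey& o) const {
                return src_ip == o.src_ip && dst_ip == o.dst_ip &&
                       src_port == o.src_port && dst_port == o.dst_port;
            }
        };
        struct FlowKeyHash {
            std::size_t operator()(const FlowKey& k) const {
                return (std::size_t(k.src_ip) * 31 + k.dst_ip) * 31
                       + ((std::size_t(k.src_port) << 16) | k.dst_port);
            }
        };
        // Per-flow state persisting across packets: this is what makes the
        // application stateful and drives long-latency memory accesses.
        struct FlowState { uint64_t packets = 0, bytes = 0; };

        int main() {
            std::unordered_map<FlowKey, FlowState, FlowKeyHash> flows;
            FlowKey k{0x0A000001, 0x0A000002, 1234, 80};
            for (int i = 0; i < 3; ++i) {
                FlowState& st = flows[k];   // flow-table lookup on every packet
                ++st.packets;
                st.bytes += 1500;
            }
            std::cout << "flow saw " << flows[k].packets << " packets\n";
        }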

    Exploring Hardware Agnostic Multiarrays in Magnolia

    Get PDF
    We present a specification and implementation of a generic multiarray API, based on A Mathematics of Arrays, in the general-purpose research language Magnolia. We show how we can lift reasoning about arrays to a more abstract level, and how this enables us to precisely manipulate arrays independently of hardware memory layouts.
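
    The layout-independence idea can be made concrete with the index calculus of A Mathematics of Arrays: a multi-index is flattened through a strides vector derived from the shape, so changing the layout only changes the strides. The sketch below is illustrative C++, not the Magnolia API.

        #include <cstddef>
        #include <iostream>
        #include <vector>

        // Map a multi-index to a flat offset through a strides vector; the
        // same abstract array can be laid out row-major, column-major, or
        // otherwise by supplying different strides.
        std::size_t flatten(const std::vector<std::size_t>& idx,
                            const std::vector<std::size_t>& strides) {
            std::size_t off = 0;
            for (std::size_t d = 0; d < idx.size(); ++d) off += idx[d] * strides[d];
            return off;
        }

        // Row-major strides: the last dimension varies fastest.
        std::vector<std::size_t> row_major_strides(const std::vector<std::size_t>& shape) {
            std::vector<std::size_t> s(shape.size(), 1);
            for (std::size_t d = shape.size(); d > 1; --d)
                s[d - 2] = s[d - 1] * shape[d - 1];
            return s;
        }

        int main() {
            std::vector<std::size_t> shape{2, 3, 4};
            auto s = row_major_strides(shape);           // {12, 4, 1}
            std::cout << flatten({1, 2, 3}, s) << '\n';  // 1*12 + 2*4 + 3*1 = 23
        }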

    Space Network Control Conference on Resource Allocation Concepts and Approaches

    Get PDF
    The results of the Space Network Control (SNC) Conference are presented. In the late 1990s, when the Advanced Tracking and Data Relay Satellite System becomes operational, Space Network communication services will be supported and controlled by the SNC. The goals of the conference were to survey existing resource allocation concepts and approaches, to identify solutions applicable to the Space Network, and to identify avenues of study in support of SNC development. The conference was divided into three sessions: (1) Concepts for Space Network Allocation; (2) SNC and User Payload Operations Control Center (POCC) Human-Computer Interface Concepts; and (3) Resource Allocation Tools, Technology, and Algorithms. Key recommendations addressed approaches to achieving higher levels of automation in the scheduling process.