

COLAB: A Collaborative Multi-factor Scheduler for Asymmetric Multicore Processors

Author: Janjic Vladimir
Leather Hugh
Petoumenos Pavlos
Thomson John Donald
Yu Teng
Zhu Mingcan
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2019

Funding: Partially funded by the UK EPSRC grants Discovery: Pattern Discovery and Program Shaping for Many-core Systems (EP/P020631/1) and ABC: Adaptive Brokerage for Cloud (EP/R010528/1); Royal Academy of Engineering under the Research Fellowship scheme.

Increasingly prevalent asymmetric multicore processors (AMP) are necessary for delivering performance in the era of limited power budget and dark silicon. However, the software fails to use them efficiently. OS schedulers, in particular, handle asymmetry only under restricted scenarios. We have efficient symmetric schedulers, efficient asymmetric schedulers for single-threaded workloads, and efficient asymmetric schedulers for single program workloads. What we do not have is a scheduler that can handle all runtime factors affecting AMP for multi-threaded multi-programmed workloads. This paper introduces the first general purpose asymmetry-aware scheduler for multi-threaded multi-programmed workloads. It estimates the performance of each thread on each type of core and identifies communication patterns and bottleneck threads. The scheduler then makes coordinated core assignment and thread selection decisions that still provide each application its fair share of the processor's time. We evaluate our approach using the GEM5 simulator on four distinct big.LITTLE configurations and 26 mixed workloads composed of PARSEC and SPLASH2 benchmarks. Compared to the state-of-the-art Linux CFS and AMP-aware schedulers, we demonstrate performance gains of up to 25%, and of 5% to 15% on average, depending on the hardware setup.

Postprint

Crossref

The University of Manchester - Institutional Repository

University of Dundee Online Publications

University of St. Andrews - Pure

St Andrews Research Repository

Approximating ReLU on a Reduced Ring for Efficient MPC-based Private Inference

Author: Maeng Kiwan
Suh G. Edward
Publication date: 09/09/2023

Secure multi-party computation (MPC) allows users to offload machine learning inference to untrusted servers without having to share their privacy-sensitive data. Despite its strong security properties, MPC-based private inference has not been widely adopted in the real world due to its high communication overhead. When evaluating ReLU layers, MPC protocols incur a significant amount of communication between the parties, making the end-to-end execution time multiple orders of magnitude slower than its non-private counterpart. This paper presents HummingBird, an MPC framework that reduces the ReLU communication overhead significantly by using only a subset of the bits to evaluate ReLU on a smaller ring. Based on theoretical analyses, HummingBird identifies bits in the secret share that are not crucial for accuracy and excludes them during ReLU evaluation to reduce communication. With its efficient search engine, HummingBird discards 87–91% of the bits during ReLU and still maintains high accuracy. On a real MPC setup involving multiple servers, HummingBird achieves on average 2.03–2.67x end-to-end speedup without introducing any errors, and up to 8.64x average speedup when some amount of accuracy degradation can be tolerated, due to its up to 8.76x communication reduction.
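The idea of evaluating the ReLU sign test on a smaller ring can be illustrated with a plaintext sketch. This is not HummingBird's protocol: the 64-bit ring, the two-party additive sharing, the function names, and the bit positions `hi` and `lo` are all assumptions chosen for illustration, and the reduced shares are simply added in the clear, whereas a real MPC protocol would compute the sign jointly without revealing them.

```python
import secrets

RING_BITS = 64  # assumed full ring Z_{2^64}; not taken from the abstract


def share(x, bits=RING_BITS):
    """Split a signed integer x into two additive shares modulo 2**bits."""
    mod = 1 << bits
    s0 = secrets.randbelow(mod)
    s1 = (x - s0) % mod
    return s0, s1


def reduce_share(s, hi, lo):
    """Keep only bits lo..hi-1 of a share: an element of the ring Z_{2^(hi-lo)}."""
    return (s >> lo) & ((1 << (hi - lo)) - 1)


def is_negative_reduced(s0, s1, hi, lo):
    """Sign test on the reduced ring (done in the clear here, for illustration only)."""
    width = hi - lo
    total = (reduce_share(s0, hi, lo) + reduce_share(s1, hi, lo)) % (1 << width)
    return ((total >> (width - 1)) & 1) == 1


def relu_reduced(x, hi, lo):
    """ReLU whose sign decision looks only at bits lo..hi-1 of each share."""
    s0, s1 = share(x)
    return 0 if is_negative_reduced(s0, s1, hi, lo) else x
```

Under these assumptions, dropping only high bits (`lo = 0`) gives the exact sign whenever the value fits in `hi` bits as a signed number. Dropping `lo` low bits as well can lose a carry, so a positive value smaller than `2**lo` may be treated as negative and zeroed, which bounds the error per activation by `2**lo`.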

arXiv.org e-Print Archive

Challenges and Opportunities in Near-Threshold DNN Accelerators around Timing Errors

Author: Basu Prabal
Chakraborty Koushik
Gundi Noel Daniel
Pandey Pramesh
Patrick Mitchell Craig
Roy Sanghamitra
Shabanian Tahmoures
Publication venue: Hosted by Utah State University Libraries
Publication date: 16/10/2020

AI evolution is accelerating, and Deep Neural Network (DNN) inference accelerators are at the forefront of ad hoc architectures that are evolving to support the immense throughput required for AI computation. However, much more energy-efficient design paradigms are needed to realize the complete potential of AI evolution and curtail energy consumption. The Near-Threshold Computing (NTC) design paradigm can serve as the best candidate for providing the required energy efficiency. However, NTC operation is plagued with ample performance and reliability concerns arising from timing errors. In this paper, we dive deep into DNN architecture to uncover some unique challenges and opportunities for operation in the NTC paradigm. By performing rigorous simulations in a TPU systolic array, we reveal the severity of timing errors and their impact on inference accuracy at NTC. We analyze various attributes (such as data–delay relationship, delay disparity within arithmetic units, utilization pattern, hardware homogeneity, and workload characteristics) and uncover unique localized and global techniques to deal with timing errors in NTC.

DigitalCommons@USU

Mixed-data-model heterogeneous compilation and OpenMP offloading

Author: Benini L.
Capotondi A.
Forsberg B.
Grosser T.
Kurth A.
Marongiu A.
Wolters K.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2020

Heterogeneous computers combine a general-purpose host processor with domain-specific programmable many-core accelerators, uniting high versatility with high performance and energy efficiency. While the host manages ever-more application memory, accelerators are designed to work mainly on their local memory. This difference in addressed memory leads to a discrepancy between the optimal address width of the host and the accelerator. Today 64-bit host processors are commonplace, but few accelerators exceed 32-bit addressable local memory, a difference expected to increase with 128-bit hosts in the exascale era. Managing this discrepancy requires support for multiple data models in heterogeneous compilers. So far, compiler support for multiple data models has not been explored, which hampers the programmability of such systems and inhibits their adoption. In this work, we perform the first exploration of the feasibility and performance of implementing a mixed-data-model heterogeneous system. To support this, we present and evaluate the first mixed-data-model compiler, supporting arbitrary address widths on host and accelerator. To hide the inherent complexity and to enable high programmer productivity, we implement transparent offloading on top of OpenMP. The proposed compiler techniques are implemented in LLVM and evaluated on a 64+32-bit heterogeneous SoC. Results on benchmarks from the PolyBench-ACC suite show that memory can be transparently shared between host and accelerator at overheads below 0.7% compared to 32-bit-only execution, enabling mixed-data-model computers to execute at near-native performance.

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Accelerating Generic Graph Neural Networks via Architecture, Compiler, Partition Method Co-Design

Author: Guo Cong
Guo Minyi
Leng Jingwen
Lu Shuwen
Zhang Zhihui
Zhou Yangjie
Publication date: 16/08/2023

Graph neural networks (GNNs) have shown significant accuracy improvements in a variety of graph learning domains, sparking considerable research interest. To translate these accuracy improvements into practical applications, it is essential to develop high-performance and efficient hardware acceleration for GNN models. However, designing GNN accelerators faces two fundamental challenges: the high bandwidth requirement of GNN models and the diversity of GNN models. Previous works have addressed the first challenge by using more expensive memory interfaces to achieve higher bandwidth. For the second challenge, existing works either support specific GNN models or have generic designs with poor hardware utilization. In this work, we tackle both challenges simultaneously. First, we identify a new type of partition-level operator fusion, which we utilize to internally reduce the high bandwidth requirement of GNNs. Next, we introduce partition-level multi-threading to schedule the concurrent processing of graph partitions, utilizing different hardware resources. To further reduce the extra on-chip memory required by multi-threading, we propose fine-grained graph partitioning to generate denser graph partitions. Importantly, these three methods make no assumptions about the targeted GNN models, addressing the challenge of model variety. We implement these methods in a framework called SwitchBlade, consisting of a compiler, a graph partitioner, and a hardware accelerator. Our evaluation demonstrates that SwitchBlade achieves an average speedup of $1.85\times$ and energy savings of $19.03\times$ compared to the NVIDIA V100 GPU. Additionally, SwitchBlade delivers performance comparable to state-of-the-art specialized accelerators.

arXiv.org e-Print Archive

Circuits and Systems Advances in Near Threshold Computing

Publication venue: 'MDPI AG'
Publication date: 11/01/2022

Modern society is witnessing a sea change in ubiquitous computing, in which people have embraced computing systems as an indispensable part of day-to-day existence. Computation, storage, and communication abilities of smartphones, for example, have undergone monumental changes over the past decade. However, global emphasis on creating and sustaining green environments is leading to a rapid and ongoing proliferation of edge computing systems and applications. As a broad spectrum of healthcare, home, and transport applications shift to the edge of the network, near-threshold computing (NTC) is emerging as one of the promising low-power computing platforms. An NTC device sets its supply voltage close to its threshold voltage, dramatically reducing the energy consumption. Despite showing substantial promise in terms of energy efficiency, NTC is yet to see widescale commercial adoption. This is because circuits and systems operating with NTC suffer from several problems, including increased sensitivity to process variation, reliability problems, performance degradation, and security vulnerabilities, to name a few. To realize its potential, we need designs, techniques, and solutions to overcome these challenges associated with NTC circuits and systems. The readers of this book will be able to familiarize themselves with recent advances in electronics systems, focusing on near-threshold computing

Directory of Open Access Books (DOAB)

Adaptive Microarchitectural Optimizations to Improve Performance and Security of Multi-Core Architectures

Author: Holtryd Nadja
Publication date: 01/01/2023

With the current technological barriers, microarchitectural optimizations are increasingly important to ensure performance scalability of computing systems. The shift to multi-core architectures increases the demands on the memory system and amplifies the role of microarchitectural optimizations in performance improvement. In a multi-core system, microarchitectural resources such as the cache are usually shared to maximize utilization, but sharing can also lead to contention and lower performance. This can be mitigated through partitioning of shared caches. However, microarchitectural optimizations, which were long assumed to be fundamentally secure, can be used in side-channel attacks to extract secrets such as cryptographic keys. Timing-based side-channels exploit predictable timing variations due to the interaction with microarchitectural optimizations during program execution. Going forward, there is a strong need to be able to leverage microarchitectural optimizations for performance without compromising security. This thesis contributes three adaptive microarchitectural resource management optimizations to improve security and/or performance of multi-core architectures, and a systematization-of-knowledge of timing-based side-channel attacks. We observe that to achieve high-performance cache partitioning in a multi-core system three requirements need to be met: i) fine granularity of partitions, ii) locality-aware placement, and iii) frequent changes. These requirements lead to high overheads for current centralized partitioning solutions, especially as the number of cores in the system increases. To address this problem, we present an adaptive and scalable cache partitioning solution (DELTA) using a distributed and asynchronous allocation algorithm. The allocations occur through core-to-core challenges, where applications with larger performance benefit will gain cache capacity. The solution is implementable in hardware, due to low computational complexity, and can scale to large core counts. According to our analysis, better performance can be achieved by coordinating multiple optimizations for different resources, e.g., off-chip bandwidth and cache, but this is challenging due to the increased number of possible allocations which need to be evaluated. Based on these observations, we present a solution (CBP) for coordinated management of the optimizations: cache partitioning, bandwidth partitioning, and prefetching. Efficient allocations, considering the inter-resource interactions and trade-offs, are achieved using local resource managers to limit the solution space. The continuously growing number of side-channel attacks leveraging microarchitectural optimizations prompts us to review attacks and defenses to understand the vulnerabilities of different microarchitectural optimizations. We identify the four root causes of timing-based side-channel attacks: determinism, sharing, access violation, and information flow. Our key insight is that eliminating any of the exploited root causes, in any of the attack steps, is enough to provide protection. Based on our framework, we present a systematization of the attacks and defenses on a wide range of microarchitectural optimizations, which highlights their key similarities. Shared caches are an attractive attack surface for side-channel attacks, while defenses need to be efficient since the cache is crucial for performance. To address this issue, we present an adaptive and scalable cache partitioning solution (SCALE) for protection against cache side-channel attacks. The solution leverages randomness and provides quantifiable and information-theoretic security guarantees using differential privacy. The solution closes the performance gap to a state-of-the-art non-secure allocation policy for a mix of secure and non-secure applications.

Chalmers Research

An Updated Survey of Efficient Hardware Architectures for Accelerating Deep Convolutional Neural Networks

Author: Bussolino Beatrice
Capra Maurizio
Marchisio Alberto
Martina Maurizio
Masera Guido
Shafique Muhammad
Publication venue: 'MDPI AG'

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Fault-Tolerant Circuits and Interconnects for Biomedical Implantable Devices

Author: Chacón-Rodríguez Alfonso
García-Ramírez Ronny
Molina-Robles Roberto Carlos
Rímolo-Donadio Renato
Publication venue: 'Instituto Tecnologico de Costa Rica'
Publication date: 01/01/2020

Research Project (Code 1360014), Instituto Tecnológico de Costa Rica. Vicerrectoría de Investigación y Extensión (VIE). Escuela de Ingeniería Electrónica, 2020.

Implantable medical devices (IMDs) are safety-critical systems with very low power requirements that are used for the long-term treatment of various medical conditions. IMDs use an ever-growing number of components (sensors, actuators, processors, memory blocks) that must communicate with one another on a System on Chip (SoC). In this project, different types of interconnects (point-to-point, bus, network-on-chip) were evaluated with respect to their fault tolerance, power consumption, and communication capabilities. One of the project's outputs is a scalable database of implantable medical systems reported in the literature up to 2018, built to capture the state of the art and the trends in incorporating electronic systems into this kind of solution. Based on this initial study, an interconnect evaluation framework was proposed; it incorporates a topology generator and a design flow for evaluating these topologies in terms of power and fault tolerance at the simulation level, together with a proposed metric for comparing different architectures at the pre-synthesis level (before the design is consolidated). Finally, an integrated-circuit (IC) level design and implementation of an interconnect solution tailored to IMDs was incorporated into the design of a custom microprocessor. This project was carried out in cooperation with the Erasmus Medical Center (Erasmus MC) in the Netherlands and the Universidad Católica del Uruguay.

Repositorio Institucional del Instituto Tecnologico de Costa Rica

Warping Cache Simulation of Polyhedral Programs

Author: Morelli Canberk
Reineke Jan
Publication venue: Saarländische Universitäts- und Landesbibliothek
Publication date: 01/01/2022

Techniques to evaluate a program’s cache performance fall into two camps: 1. Traditional trace-based cache simulators precisely account for sophisticated real-world cache models and support arbitrary workloads, but their runtime is proportional to the number of memory accesses performed by the program under analysis. 2. Relying on implicit workload characterizations such as the polyhedral model, analytical approaches often achieve problem-size-independent runtimes, but so far have been limited to idealized cache models. We introduce a hybrid approach, warping cache simulation, that aims to achieve applicability to real-world cache models and problem-size-independent runtimes. Like prior analytical approaches, we focus on programs in the polyhedral model, which allows us to reason about the sequence of memory accesses analytically. Combining this analytical reasoning with information about the cache behavior obtained from explicit cache simulation allows us to soundly fast-forward the simulation. By this process of warping, we accelerate the simulation so that its cost is often independent of the number of memory accesses.
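The first camp described above, trace-based simulation, can be sketched in a few lines. This is a minimal illustration of a conventional set-associative LRU simulator, not the warping technique; the function names, the default cache geometry, and the row-major trace generator are assumptions made for this sketch. It shows why the cost grows with the trace: the loop touches every access once.

```python
from collections import OrderedDict


def simulate_lru(trace, num_sets=64, ways=4, line_size=64):
    """Return (hits, misses) of a set-associative LRU cache over byte addresses."""
    # One OrderedDict per cache set; insertion order tracks recency of use.
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = misses = 0
    for addr in trace:  # one step per memory access
        line = addr // line_size
        index = line % num_sets
        tag = line // num_sets
        cache_set = sets[index]
        if tag in cache_set:
            cache_set.move_to_end(tag)  # mark as most recently used
            hits += 1
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used line
            cache_set[tag] = True
    return hits, misses


def row_major_trace(n, elem_size=8, base=0):
    """Addresses touched by a row-major sweep over an n-by-n array (hypothetical workload)."""
    return [base + (i * n + j) * elem_size for i in range(n) for j in range(n)]
```

For example, `simulate_lru(row_major_trace(64))` walks 4096 accesses over 512 distinct cache lines, so each line misses once and the remaining accesses hit. Doubling `n` quadruples the trace length and therefore the simulation time, which is the scaling behavior the abstract attributes to trace-based simulators.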

arXiv.org e-Print Archive

Universaar
