The Sparse Abstract Machine
We propose the Sparse Abstract Machine (SAM), an abstract machine model for
targeting sparse tensor algebra to reconfigurable and fixed-function spatial
dataflow accelerators. SAM defines a streaming dataflow abstraction with sparse
primitives that encompass a large space of scheduled tensor algebra
expressions. SAM dataflow graphs naturally separate tensor formats from
algorithms and are expressive enough to incorporate arbitrary iteration
orderings and many hardware-specific optimizations. We also present Custard, a
compiler from a high-level language to SAM that demonstrates SAM's usefulness
as an intermediate representation. We automatically bind from SAM to a
streaming dataflow simulator. We evaluate the generality and extensibility of
SAM, explore the performance space of sparse tensor algebra optimizations using
SAM, and show SAM's ability to represent dataflow hardware.
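As a rough illustration of the streaming idea behind such dataflow graphs, the sketch below co-iterates two coordinate-sorted sparse streams and intersects them, the way an elementwise multiplication of sparse operands is typically lowered. It is a minimal Python analogy, not SAM's actual primitive set or interface.

```python
# Minimal sketch (Python analogy, not SAM's primitives): sparse tensor levels as
# streams of (coordinate, value) pairs, with elementwise multiplication realized
# as an intersection of two coordinate-sorted streams.

def intersect(stream_a, stream_b):
    """Co-iterate two sorted sparse streams and emit matching coordinates."""
    a, b = iter(stream_a), iter(stream_b)
    ca, va = next(a, (None, None))
    cb, vb = next(b, (None, None))
    while ca is not None and cb is not None:
        if ca == cb:
            yield ca, va * vb                    # coordinates match: multiply values
            ca, va = next(a, (None, None))
            cb, vb = next(b, (None, None))
        elif ca < cb:
            ca, va = next(a, (None, None))       # advance the stream that is behind
        else:
            cb, vb = next(b, (None, None))

x = [(0, 2.0), (3, 1.5), (7, 4.0)]               # sparse vector as a coordinate stream
y = [(3, 2.0), (5, 1.0), (7, 0.5)]
print(list(intersect(x, y)))                     # [(3, 3.0), (7, 2.0)]
```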
TeAAL: A Declarative Framework for Modeling Sparse Tensor Accelerators
Over the past few years, the explosion in sparse tensor algebra workloads has
led to a corresponding rise in domain-specific accelerators to service them.
Due to the irregularity present in sparse tensors, these accelerators employ a
wide variety of novel solutions to achieve good performance. At the same time,
prior work on design-flexible sparse accelerator modeling does not express this
full range of design features, making it difficult to understand the impact of
each design choice and compare or extend the state-of-the-art.
To address this, we propose TeAAL: a language and compiler for the concise
and precise specification and evaluation of sparse tensor algebra
architectures. We use TeAAL to represent and evaluate four disparate
state-of-the-art accelerators--ExTensor, Gamma, OuterSPACE, and SIGMA--and
verify that it reproduces their performance with high accuracy. Finally, we
demonstrate the potential of TeAAL as a tool for designing new accelerators by
showing how it can be used to speed up Graphicionado--by on BFS and
on SSSP.
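To make the flavor of such declarative specifications concrete, here is a hedged sketch: an Einsum plus a chosen rank (loop) order drive a tiny sparse matrix-multiply interpreter. The spec syntax and the dict-of-dicts operand format are invented for illustration and are not TeAAL's actual input language.

```python
# Illustrative only: an Einsum-style spec plus a rank (loop) order, executed over
# sparse operands stored as dict-of-dicts. Not TeAAL's real syntax or semantics.

spec = {
    "einsum": "Z[m, n] = A[m, k] * B[k, n]",   # the computation to evaluate
    "rank_order": ["m", "k", "n"],             # Gustavson-style iteration order
}

def run_matmul(A, B):
    """Evaluate the Einsum above with the loop nest fixed to the m, k, n order."""
    Z = {}
    for m in A:                       # only rows with nonzeros
        for k in A[m]:                # only nonzero k in row m (sparse iteration)
            for n in B.get(k, {}):    # only nonzero n in row k of B
                Z.setdefault(m, {})
                Z[m][n] = Z[m].get(n, 0.0) + A[m][k] * B[k][n]
    return Z

A = {0: {1: 2.0}, 1: {0: 3.0, 2: 1.0}}
B = {0: {0: 1.0}, 1: {2: 4.0}, 2: {1: 5.0}}
print(run_matmul(A, B))   # {0: {2: 8.0}, 1: {0: 3.0, 1: 5.0}}
```

What this sketch sidesteps is precisely the modeling work: turning an arbitrary rank order and per-tensor format into a loop nest mapped onto an accelerator's buffers and compute units.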
Fundamentals
Volume 1 establishes the foundations of this new field. It covers all the steps from data collection, through summarization and clustering, to the different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are examined with respect to their resource requirements and to how scalability can be enhanced on diverse computing architectures, ranging from embedded systems to large computing clusters.
LIPIcs, Volume 277, GIScience 2023, Complete Volume
KAPLA: Pragmatic Representation and Fast Solving of Scalable NN Accelerator Dataflow
Dataflow scheduling decisions are of vital importance to neural network (NN)
accelerators. Recent scalable NN accelerators support a rich set of advanced
dataflow techniques. The problems of comprehensively representing and quickly
finding optimized dataflow schemes thus become significantly more complicated
and challenging. In this work, we first propose comprehensive and pragmatic
dataflow representations for temporal and spatial scheduling on scalable
multi-node NN architectures. An informal hierarchical taxonomy highlights the
tight coupling across different levels of the dataflow space as the major
difficulty for fast design exploration. A set of formal tensor-centric directives accurately expresses various inter-layer and intra-layer schemes and allows for quickly determining their validity and efficiency. We then build a
generic, optimized, and fast dataflow solver, KAPLA, which makes use of the
pragmatic directives to explore the design space with effective validity check
and efficiency estimation. KAPLA decouples the upper inter-layer level for fast
pruning, and solves the lower intra-layer schemes with a novel bottom-up cost
descending method. The dataflows KAPLA finds incur only 2.2% and 7.7% energy overheads for training and inference, respectively, compared to exhaustively searched optimal schemes. It also outperforms random and machine-learning-based approaches, delivering more optimized results with orders-of-magnitude faster search.
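As a toy illustration of the validity-check-plus-cost-estimation loop that such solvers run, the sketch below enumerates matmul tilings, discards those that overflow an assumed on-chip buffer, and ranks the rest with a crude traffic model. The dimensions, buffer size, and cost model are invented and are much simpler than KAPLA's.

```python
# Toy sketch of dataflow-scheme search: enumerate tilings, check validity against
# an assumed buffer capacity, and rank candidates with a crude traffic model.
# All numbers and the cost model are illustrative, not KAPLA's.

from itertools import product

M, N, K = 512, 512, 512          # matmul dimensions mapped onto the accelerator
BUFFER_WORDS = 64 * 1024         # hypothetical on-chip buffer capacity

def divisors(x):
    return [d for d in range(1, x + 1) if x % d == 0]

def is_valid(tm, tn, tk):
    """Validity check: tiles of A, B, and the output must fit on chip together."""
    return tm * tk + tk * tn + tm * tn <= BUFFER_WORDS

def traffic(tm, tn, tk):
    """Crude efficiency estimate: off-chip words moved, output-stationary schedule."""
    a_reads = M * K * (N // tn)  # A is re-read once per column tile of B
    b_reads = K * N * (M // tm)  # B is re-read once per row tile of A
    return a_reads + b_reads + M * N

best = min(
    (t for t in product(divisors(M), divisors(N), divisors(K)) if is_valid(*t)),
    key=lambda t: traffic(*t),
)
print("best (tm, tn, tk):", best, "estimated traffic:", traffic(*best))
```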
LIPIcs, Volume 258, SoCG 2023, Complete Volume
12th International Conference on Geographic Information Science: GIScience 2023, September 12–15, 2023, Leeds, UK
Design and Code Optimization for Systems with Next-generation Racetrack Memories
With the rise of computationally expensive application domains such as machine learning, genomics, and fluid simulation, the quest for high-performance, energy-efficient computing has gained unprecedented momentum. The significant increase in computing and memory devices in modern systems has resulted in an unsustainable surge in energy consumption, a substantial portion of which is attributed to the memory system. The scaling of conventional memory technologies, and their suitability for next-generation systems, is also questionable. This has led to the emergence and rise of non-volatile memory (NVM) technologies. Today, several NVM technologies, in different stages of development, are competing for rapid access to the market.
Racetrack memory (RTM) is one such non-volatile memory technology, promising SRAM-comparable latency, reduced energy consumption, and unprecedented density compared to other technologies. However, RTM is sequential in nature, i.e., data in an RTM cell must be shifted to an access port before it can be accessed. These shift operations incur performance and energy penalties. An ideal RTM, requiring at most one shift per access, can easily outperform SRAM. In the worst-case shifting scenario, however, RTM can be an order of magnitude slower than SRAM.
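The shift cost can be made concrete with a small sketch: with a single access port, each access costs as many shifts as the distance between the port's current offset and the accessed position, so the data layout directly determines the penalty. The trace, layouts, and one-port model below are illustrative only.

```python
# Illustrative one-port RTM shift-cost model: each access shifts the track by the
# distance between the port's current offset and the accessed position.

def shift_count(trace, layout, port_start=0):
    """Total shifts for an access trace, given a layout: variable -> track position."""
    port, shifts = port_start, 0
    for var in trace:
        pos = layout[var]
        shifts += abs(pos - port)   # shift until 'var' sits under the access port
        port = pos
    return shifts

trace = ["a", "c", "a", "c", "b", "a", "c"]      # 'a' and 'c' are accessed together often
declaration_order = {"a": 0, "b": 1, "c": 2}     # naive placement
tuned_placement   = {"a": 0, "c": 1, "b": 2}     # co-accessed variables placed adjacently
print(shift_count(trace, declaration_order))     # 10 shifts
print(shift_count(trace, tuned_placement))       # 7 shifts
```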
This thesis presents an overview of RTM device physics, its evolution, its strengths and challenges, and its application in the memory subsystem. We develop tools that enable the programming and modeling of RTM-based systems. For shift minimization, we propose a set of techniques, including optimal, near-optimal, and evolutionary algorithms, for efficient scalar and instruction placement in RTMs. For array accesses, we explore schedule and layout transformations that eliminate the costly long shifts in RTMs. We present an automatic compilation framework that analyzes static control-flow programs and transforms the loop traversal order and memory layout to maximize accesses to consecutive RTM locations and minimize shifts. We also develop a simulation framework, RTSim, that models various RTM parameters and enables accurate architecture-level simulation.
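A similarly small sketch shows why the loop traversal order matters once a layout is fixed: traversing a row-major RTM-resident array in row-major order keeps consecutive accesses adjacent on the track, while a column-major traversal forces long shifts. Sizes and the one-port cost model are again illustrative.

```python
# Illustrative effect of traversal order on shifts for a row-major 4x4 array that
# occupies consecutive positions on a single-port RTM track.

N = 4

def addr(i, j):
    return i * N + j                        # row-major position on the track

def shifts(order):
    """Total shifts for visiting all elements in the given (i, j) order."""
    port, total = 0, 0
    for i, j in order:
        total += abs(addr(i, j) - port)
        port = addr(i, j)
    return total

row_major = [(i, j) for i in range(N) for j in range(N)]   # matches the layout
col_major = [(i, j) for j in range(N) for i in range(N)]   # strided accesses

print("row-major traversal:", shifts(row_major), "shifts")     # 15
print("column-major traversal:", shifts(col_major), "shifts")  # 81
```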
Finally, to demonstrate RTM's potential in non-von-Neumann, in-memory computing paradigms, we exploit its device attributes to implement logic and arithmetic operations. As a concrete use case, we implement an entire hyperdimensional computing framework in RTM to accelerate the language recognition problem. Our evaluation shows considerable performance and energy improvements compared to conventional von Neumann models and state-of-the-art accelerators.
Task-based Runtime Optimizations Towards High Performance Computing Applications
The last decades have witnessed a rapid improvement of computational capabilities in high-performance computing (HPC) platforms thanks to hardware technology scaling. HPC architectures benefit from mainstream hardware advances, with many-core systems, deep hierarchical memory subsystems, non-uniform memory access, and an ever-increasing gap between computational power and memory bandwidth. This has necessitated continuous adaptations across the software stack to maintain high hardware utilization. In this HPC landscape of potentially million-way parallelism, task-based programming models associated with dynamic runtime systems are becoming more popular, as they foster developer productivity at extreme scale by abstracting the underlying hardware complexity.
In this context, this dissertation highlights how a software bundle powered by a task-based programming model can address the heterogeneous workloads engendered by HPC applications, namely data redistribution, geostatistical modeling, and 3D unstructured mesh deformation. Data redistribution aims to reshuffle data to optimize some objective for an algorithm; that objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal of increasing efficiency and thereby reducing the time-to-solution of the algorithm. Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Meshing the deformable contour of moving 3D bodies is an expensive operation that poses significant computational challenges in fluid-structure interaction (FSI) applications. Therefore, this dissertation proposes Redistribute-PaRSEC, ExaGeoStat-PaRSEC, and HiCMA-PaRSEC to efficiently tackle these HPC applications at extreme scale; they are evaluated on multiple HPC clusters, including AMD-, Intel-, and Arm-based CPU systems and an IBM-based multi-GPU system. This multidisciplinary work emphasizes the need for runtime systems to go beyond their primary responsibility of task scheduling on massively parallel hardware in order to serve next-generation scientific applications.
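As a plain-Python analogy (not PaRSEC) for the task-based model these systems build on, the sketch below expresses work as tasks whose data dependencies are carried by futures, leaving a thread-pool scheduler to decide when each task runs; function names and data shapes are invented for illustration.

```python
# Minimal task-based analogy (plain Python, not PaRSEC): tasks plus data
# dependencies carried by futures; the pool schedules ready tasks onto workers.

from concurrent.futures import ThreadPoolExecutor

def load(name):                 # leaf task: produce a data tile
    return list(range(4))

def scale(tile, alpha):         # depends on one upstream tile
    return [alpha * x for x in tile]

def combine(a, b):              # depends on two upstream tasks
    return [x + y for x, y in zip(a, b)]

with ThreadPoolExecutor(max_workers=4) as pool:
    fa = pool.submit(load, "A")                      # independent tasks can overlap
    fb = pool.submit(load, "B")
    fs = pool.submit(scale, fa.result(), 2.0)        # dependency satisfied by waiting on fa
    out = pool.submit(combine, fs.result(), fb.result()).result()

print(out)   # [0, 3, 6, 9]
```

A real task-based runtime tracks such dependencies asynchronously instead of blocking on each future, and adds data-aware placement across nodes and accelerators.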
Computing graph neural networks: A survey from algorithms to accelerators
Graph Neural Networks (GNNs) have exploded onto the machine learning scene in recent years owing to their capability to model and learn from graph-structured data. Such an ability has strong implications in a wide variety of fields whose data are inherently relational, for which conventional neural networks do not perform well. Indeed, as recent reviews attest, research in the area of GNNs has grown rapidly and has led to the development of a variety of GNN algorithm variants as well as to the exploration of ground-breaking applications in chemistry, neurology, electronics, and communication networks, among others. At the current stage of research, however, the efficient processing of GNNs is still an open challenge for several reasons. Besides their novelty, GNNs are hard to compute due to their dependence on the input graph, their combination of dense and very sparse operations, and the need to scale to huge graphs in some applications. In this context, this article aims to make two main contributions. On the one hand, a review of the field of GNNs is presented from the perspective of computing. This includes a brief tutorial on GNN fundamentals, an overview of the evolution of the field over the last decade, and a summary of the operations carried out in the multiple phases of different GNN algorithm variants. On the other hand, an in-depth analysis of current software and hardware acceleration schemes is provided, from which a hardware-software, graph-aware, and communication-centric vision for GNN accelerators is distilled. This work is possible thanks to funding from the European Union's Horizon 2020 research and innovation programme under Grant No. 863337 (WiPLASH project) and the Spanish Ministry of Economy and Competitiveness under contract TEC2017-90034-C2-1-R (ALLIANCE project), which receives funding from FEDER.
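To ground the "combination of dense and very sparse operations" characterization, the following hedged sketch runs one message-passing layer: an irregular, graph-dependent aggregation of neighbor features followed by a small dense update. The graph, mean aggregator, and shapes are illustrative; GNN variants differ in both steps.

```python
# Illustrative single GNN layer: sparse, graph-dependent aggregation followed by a
# dense feature update. Not tied to any specific GNN variant or accelerator.

import numpy as np

def gnn_layer(adj, X, W):
    """adj: {node: [neighbors]}, X: node features (N x F), W: dense weights (F x F')."""
    H = np.zeros_like(X)
    for v, neighbors in adj.items():          # irregular, input-graph-dependent part
        if neighbors:
            H[v] = X[neighbors].mean(axis=0)  # aggregate neighbor features
    return np.maximum(H @ W, 0.0)             # dense update + ReLU

adj = {0: [1, 2], 1: [0], 2: [0, 1], 3: []}   # a tiny 4-node graph
X = np.arange(8, dtype=float).reshape(4, 2)   # two features per node
W = np.eye(2)
print(gnn_layer(adj, X, W))
```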