2,597 research outputs found
Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures
We propose a methodology to address the programmability issues derived from the emergence of new-generation shared-memory NUMA architectures. For this purpose, we employ dense matrix factorizations and matrix inversion (DMFI) as a use case, and we target two modern architectures (AMD Rome and Huawei Kunpeng 920) that exhibit configurable NUMA topologies. Our methodology pursues performance portability across different NUMA configurations by proposing multi-domain implementations for DMFI plus a hybrid task- and loop-level parallelization that configures multi-threaded executions to fix core-to-data binding, exploiting locality at the expense of minor code modifications. In addition, we introduce a generalization of the multi-domain implementations for DMFI that offers support for virtually any NUMA topology in present and future architectures. Our experimentation on the two target architectures for three representative dense linear algebra operations validates the proposal, reveals insights on the necessity of adapting both the codes and their execution to improve data access locality, and reports performance across architectures and inter- and intra-socket NUMA configurations competitive with state-of-the-art message-passing implementations, maintaining the ease of development usually associated with shared-memory programming.This research was sponsored by project PID2019-107255GB of Ministerio de Ciencia, Innovación y Universidades; project S2018/TCS-4423 of Comunidad de Madrid; project 2017-SGR-1414 of the Generalitat de Catalunya and the Madrid Government under the Multiannual Agreement with UCM in the line Program to Stimulate Research for Young Doctors in the context of the V PRICIT, project PR65/19-22445. This project has also received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955558. The JU receives support from the European Union’s Horizon 2020 research and innovation programme, and Spain, Germany, France, Italy, Poland, Switzerland, Norway. The work is also supported by grants PID2020-113656RB-C22 and PID2021-126576NB-I00 of MCIN/AEI/10.13039/501100011033 and by ERDF A way of making Europe.Peer ReviewedPostprint (published version
Recommended from our members
A survey of behavioral-level partitioning systems
Many approaches have been developed to partition a system's behavioral description before a structural implementation is synthesized. We highlight the foundations and motivations for behavioral partitioning. We survey behavioral partitioning approaches, discussing abstraction levels, goals, major steps, and key assumptions in each
HL-Pow: A Learning-Based Power Modeling Framework for High-Level Synthesis
High-level synthesis (HLS) enables designers to customize hardware designs
efficiently. However, it is still challenging to foresee the correlation
between power consumption and HLS-based applications at an early design stage.
To overcome this problem, we introduce HL-Pow, a power modeling framework for
FPGA HLS based on state-of-the-art machine learning techniques. HL-Pow
incorporates an automated feature construction flow to efficiently identify and
extract features that exert a major influence on power consumption, simply
based upon HLS results, and a modeling flow that can build an accurate and
generic power model applicable to a variety of designs with HLS. By using
HL-Pow, the power evaluation process for FPGA designs can be significantly
expedited because the power inference of HL-Pow is established on HLS instead
of the time-consuming register-transfer level (RTL) implementation flow.
Experimental results demonstrate that HL-Pow can achieve accurate power
modeling that is only 4.67% (24.02 mW) away from onboard power measurement. To
further facilitate power-oriented optimizations, we describe a novel design
space exploration (DSE) algorithm built on top of HL-Pow to trade off between
latency and power consumption. This algorithm can reach a close approximation
of the real Pareto frontier while only requiring running HLS flow for 20% of
design points in the entire design space.Comment: published as a conference paper in ASP-DAC 202
Energy-Aware Data Management on NUMA Architectures
The ever-increasing need for more computing and data processing power demands for a continuous and rapid growth of power-hungry data center capacities all over the world. As a first study in 2008 revealed, energy consumption of such data centers is becoming a critical problem, since their power consumption is about to double every 5 years. However, a recently (2016) released follow-up study points out that this threatening trend was dramatically throttled within the past years, due to the increased energy efficiency actions taken by data center operators. Furthermore, the authors of the study emphasize that making and keeping data centers energy-efficient is a continuous task, because more and more computing power is demanded from the same or an even lower energy budget, and that this threatening energy consumption trend will resume as soon as energy efficiency research efforts and its market adoption are reduced. An important class of applications running in data centers are data management systems, which are a fundamental component of nearly every application stack. While those systems were traditionally designed as disk-based databases that are optimized for keeping disk accesses as low a possible, modern state-of-the-art database systems are main memory-centric and store the entire data pool in the main memory, which replaces the disk as main bottleneck. To scale up such in-memory database systems, non-uniform memory access (NUMA) hardware architectures are employed that face a decreased bandwidth and an increased latency when accessing remote memory compared to the local memory.
In this thesis, we investigate energy awareness aspects of large scale-up NUMA systems in the context of in-memory data management systems. To do so, we pick up the idea of a fine-grained data-oriented architecture and improve the concept in a way that it keeps pace with increased absolute performance numbers of a pure in-memory DBMS and scales up on NUMA systems in the large scale. To achieve this goal, we design and build ERIS, the first scale-up in-memory data management system that is designed from scratch to implement a data-oriented architecture. With the help of the ERIS platform, we explore our novel core concept for energy awareness, which is Energy Awareness by Adaptivity. The concept describes that software and especially database systems have to quickly respond to environmental changes (i.e., workload changes) by adapting themselves to enter a state of low energy consumption. We present the hierarchically organized Energy-Control Loop (ECL), which is a reactive control loop and provides two concrete implementations of our Energy Awareness by Adaptivity concept, namely the hardware-centric Resource Adaptivity and the software-centric Storage Adaptivity. Finally, we will give an exhaustive evaluation regarding the scalability of ERIS as well as our adaptivity facilities
Effective Cache Apportioning for Performance Isolation Under Compiler Guidance
With a growing number of cores in modern high-performance servers, effective
sharing of the last level cache (LLC) is more critical than ever. The primary
agenda of such systems is to maximize performance by efficiently supporting
multi-tenancy of diverse workloads. However, this could be particularly
challenging to achieve in practice, because modern workloads exhibit dynamic
phase behaviour, which causes their cache requirements & sensitivities to vary
at finer granularities during execution. Unfortunately, existing systems are
oblivious to the application phase behavior, and are unable to detect and react
quickly enough to these rapidly changing cache requirements, often incurring
significant performance degradation. In this paper, we propose Com-CAS, a new
apportioning system that provides dynamic cache allocations for co-executing
applications. Com-CAS differs from the existing cache partitioning systems by
adapting to the dynamic cache requirements of applications just-in-time, as
opposed to reacting, without any hardware modifications. The front-end of
Com-CAS consists of compiler-analysis equipped with machine learning mechanisms
to predict cache requirements, while the back-end consists of proactive
scheduler that dynamically apportions LLC amongst co-executing applications
leveraging Intel Cache Allocation Technology (CAT). Com-CAS's partitioning
scheme utilizes the compiler-generated information across finer granularities
to predict the rapidly changing dynamic application behaviors, while
simultaneously maintaining data locality. Our experiments show that Com-CAS
improves average weighted throughput by 15% over unpartitioned cache system,
and outperforms state-of-the-art partitioning system KPart by 20%, while
maintaining the worst individual application completion time degradation to
meet various Service-Level Agreement (SLA) requirements
- …