A High-performance, Energy-efficient Modular DMA Engine Architecture
Data transfers are essential in today's computing systems as latency and
complex memory access patterns are increasingly challenging to manage. Direct
memory access engines (DMAEs) are critically needed to transfer data
independently of the processing elements, hiding latency and achieving high
throughput even for complex access patterns to high-latency memory. With the
prevalence of heterogeneous systems, DMAEs must operate efficiently in
increasingly diverse environments. This work proposes a modular and highly
configurable open-source DMAE architecture called intelligent DMA (iDMA), split
into three parts that can be composed and customized independently. The
front-end implements the control plane binding to the surrounding system. The
mid-end accelerates complex data transfer patterns such as multi-dimensional
transfers, scattering, or gathering. The back-end interfaces with the on-chip
communication fabric (data plane). We assess the efficiency of iDMA in various
instantiations: In high-performance systems, we achieve speedups of up to 15.8x
with only 1 % additional area compared to a base system without a DMAE. We
achieve an area reduction of 10 % while improving ML inference performance by
23 % in ultra-low-energy edge AI systems over an existing DMAE solution. We
provide area, timing, latency, and performance characterization to guide its
instantiation in various systems.
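The three-part split lends itself to a compact software model. The following C++ sketch is illustrative only (the types and names are assumptions, not the actual iDMA RTL or driver interface): the front-end accepts a transfer descriptor, the mid-end unrolls its two-dimensional pattern into plain 1D bursts, and the back-end would issue those bursts on the data plane.

    // C++ sketch of the three-part decomposition described above; all
    // types and names are illustrative assumptions, not iDMA's interfaces.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // A plain 1D copy as handed from the mid-end to the back-end (data plane).
    struct BurstRequest {
        std::uint64_t src, dst;
        std::size_t   length;  // bytes
    };

    // A 2D transfer descriptor as the front-end (control plane) accepts it.
    struct TransferDescriptor {
        std::uint64_t src, dst;
        std::size_t   size;        // bytes per innermost row
        std::size_t   reps;        // number of rows
        std::uint64_t src_stride;  // byte distance between consecutive rows
        std::uint64_t dst_stride;
    };

    // Mid-end: unroll the 2D pattern into 1D bursts.
    std::vector<BurstRequest> midend_unroll(const TransferDescriptor& d) {
        std::vector<BurstRequest> bursts;
        bursts.reserve(d.reps);
        for (std::size_t r = 0; r < d.reps; ++r)
            bursts.push_back({d.src + r * d.src_stride,
                              d.dst + r * d.dst_stride, d.size});
        return bursts;
    }

    int main() {
        TransferDescriptor d{0x1000, 0x2000, 64, 4, 256, 256};
        return static_cast<int>(midend_unroll(d).size());  // 4 bursts
    }

A scatter- or gather-capable mid-end would emit bursts from an address list instead of a fixed stride, leaving the front-end and back-end untouched; that independence is the point of the modular split.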
Optimization of condensed matter physics application with OpenMP tasking model
The Density Matrix Renormalization Group (DMRG++) is a condensed matter physics application used to study the superconductivity properties of materials. Its main computation is building the Hamiltonian matrix, which requires sparse matrix-vector multiplications. This paper presents task-based parallelization and optimization strategies for the Hamiltonian algorithm. The algorithm is implemented as a mini-application in C++ and parallelized with OpenMP. The optimization leverages tasking features, such as dependencies and priorities, included in the OpenMP 4.5 standard. The code refactoring targets performance as much as programmability. The optimized version achieves a speedup of 8.0x with 8 threads and 20.5x with 40 threads on a POWER9 computing node, while reducing memory consumption to 90 MB with respect to the original code, by adding fewer than ten OpenMP directives.
This work is partially supported by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology (project TIN2015-65316-P), by the Generalitat de Catalunya (contract 2017-SGR-1414), and by the BSC-IBM Deep Learning Research Agreement, under JSA "Application porting, analysis and optimization for POWER and POWER AI". This work was partially supported by the Scientific Discovery through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research and Basic Energy Sciences, Division of Materials Sciences and Engineering. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
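As a flavor of the tasking features mentioned, the following minimal, self-contained sketch (not the DMRG++ mini-application itself; the blocking scheme and names are illustrative) decomposes a CSR sparse matrix-vector product into OpenMP 4.5 tasks using the depend and priority clauses:

    #include <cstdio>
    #include <vector>

    // Minimal CSR sparse matrix.
    struct CSR {
        std::vector<int>    rowptr, col;
        std::vector<double> val;
        int rows;
    };

    // Row-block SpMV: each task owns a disjoint slice of y. The depend
    // clause (on a representative element of the slice) lets the runtime
    // order later consumers of y, and priority() hints scheduling order.
    void spmv_tasks(const CSR& A, const double* x, double* y, int block) {
        #pragma omp parallel
        #pragma omp single
        for (int r0 = 0; r0 < A.rows; r0 += block) {
            const int r1 = (r0 + block < A.rows) ? r0 + block : A.rows;
            #pragma omp task depend(out: y[r0]) priority(1) firstprivate(r0, r1)
            for (int i = r0; i < r1; ++i) {
                double s = 0.0;
                for (int k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
                    s += A.val[k] * x[A.col[k]];
                y[i] = s;
            }
        }
    }

    int main() {
        // 2x2 identity matrix in CSR form.
        CSR A{{0, 1, 2}, {0, 1}, {1.0, 1.0}, 2};
        double x[2] = {3.0, 4.0}, y[2] = {0.0, 0.0};
        spmv_tasks(A, x, y, 1);
        std::printf("%g %g\n", y[0], y[1]);  // prints: 3 4
    }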
Solutions for the optimization of the software interface on an FPGA-based NIC
The theme of the research is the study of solutions for the optimization of the software interface on FPGA-based Network Interface Cards. The research activity was carried out in the APE group at INFN (Istituto Nazionale di Fisica Nucleare), which has historically been active in the design of high-performance, scalable networks for clusters of hybrid (CPU/GPU) nodes.
The results of the research are validated on two projects the APE group is currently working on, both of which allow fast prototyping of solutions and hardware-software co-design: APEnet (a PCIe FPGA-based 3D torus network controller) and NaNet (an FPGA-based family of NICs mainly dedicated to real-time, low-latency computing systems such as fast control systems or High Energy Physics Data Acquisition Systems). NaNet is also used to validate a GPU-controlled device driver that improves network performance, i.e., further lowers communication latency, when used in combination with existing user-space software.
This research is also producing results in the Horizon 2020 FET-HPC ExaNeSt project, which aims to prototype and develop solutions for some of the crucial problems on the way towards production of Exascale-level supercomputers; the APE group is actively contributing to the development of its network and interconnect infrastructure.
PRISM: an intelligent adaptation of prefetch and SMT levels
Current microprocessors include hardware knobs to optimize for specific workloads. In general, these knobs are set to a default configuration during the boot process of the machine. This default behavior is not beneficial for all types of workloads, and the knobs are controlled by no one but the end user, who needs to know which configuration is best for the workload being run. Two of these knobs are: (1) the Simultaneous MultiThreading (SMT) level, which specifies the number of threads that can run simultaneously on a physical CPU, and (2) the data prefetch engine, which manages prefetches from memory. Parallel programming models are here to stay, and one programming model that has succeeded in allowing programmers to easily parallelize applications is Open Multi-Processing (OpenMP). Moreover, microprocessor architectures are becoming so complex that end users cannot afford to optimize their workloads for every architectural detail. These architectural knobs can help increase performance, but an automatic, adaptive system is needed to manage them. In this work we propose an independent library for OpenMP runtimes that increases performance by up to 220% (14.7% on average) while reducing dynamic power consumption by up to 13% (2% on average) on a real POWER8 processor.
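The abstract does not spell out the adaptation algorithm, so the following is only a hedged sketch of the kind of exploration loop such a library could run; the three helper functions are stubs standing in for platform-specific mechanisms (e.g., SMT mode changes and DSCR-based prefetch control on POWER8), not PRISM's actual API:

    #include <cstdio>

    // Stubs standing in for platform-specific mechanisms; these are
    // assumptions for illustration, not PRISM's interface.
    static void set_smt_level(int threads_per_core) { (void)threads_per_core; }
    static void set_prefetch_depth(int depth)       { (void)depth; }
    static double measure_ipc_for_interval()        { return 1.0; }

    int main() {
        const int smt_levels[] = {1, 2, 4, 8};  // POWER8 supports SMT1/2/4/8
        const int pf_depths[]  = {0, 2, 4, 6};  // illustrative prefetch depths

        double best_ipc = -1.0;
        int best_smt = 8, best_pf = 4;

        // Exploration phase: run the workload for one measurement interval
        // under each configuration and remember the best-performing pair.
        for (int s : smt_levels)
            for (int p : pf_depths) {
                set_smt_level(s);
                set_prefetch_depth(p);
                const double ipc = measure_ipc_for_interval();
                if (ipc > best_ipc) { best_ipc = ipc; best_smt = s; best_pf = p; }
            }

        set_smt_level(best_smt);
        set_prefetch_depth(best_pf);
        std::printf("chosen SMT=%d, prefetch depth=%d (IPC %.2f)\n",
                    best_smt, best_pf, best_ipc);
    }

A real runtime would interleave measurement with useful work and re-explore when the workload's phase changes, rather than sweeping all configurations up front.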
Photonic Interconnection Networks for Applications in Heterogeneous Utility Computing Systems
Growing demands on heterogeneous utility computing in future cloud and high-performance computing systems are driving the development of interconnects between processors and hardware accelerators with greater performance, flexibility, and dynamism. Recent innovations in the field of utility computing have led to the emergence of heterogeneous compute elements. By leveraging the computing advantages of hardware accelerators alongside typical general-purpose processors, performance efficiency can be maximized. The network linking these compute nodes is increasingly becoming the bottleneck in these architectures, restricting hardware accelerators to localized computing.
A high-bandwidth, agile interconnect is an imperative enabler for hardware accelerator delocalization in heterogeneous utility computing. A redesign of these systems' interconnect and architecture will be essential to establishing high-bandwidth, low-latency, efficient, and dynamic heterogeneous systems that can meet the challenges of next-generation utility computing.
By leveraging an optics-based approach, this dissertation presents the design and implementation of optically-connected hardware accelerators (OCHA) that exploit the distance-independent energy dissipation and bandwidth density of photonic transceivers, in combination with the flexibility, efficiency, and data parallelization offered by optical networks. By replacing electronic buses with an optical interconnection network, architectures that delocalize hardware accelerators, and that are otherwise infeasible, can be created.
With delocalized optically-connected hardware accelerator nodes accessible by processors at run time, the system can alleviate the network latency issues that plague current heterogeneous systems. Accelerators that would otherwise sit idle, waiting for their master CPUs to feed them data, can instead operate at high utilization rates, leading to dramatic improvements in overall system performance.
This work presents a prototype optically-connected hardware accelerator module and a custom optical-network-aware, dynamic hardware accelerator allocator that communicate transparently and optically across an optical interconnection network. The hardware accelerators and processor are optimized to enable hardware acceleration across an optical network using fast packet switching. The versatility of the optical network enables additional performance benefits, including optical multicasting to exploit the data parallelism found in many accelerated data sets. The integration of hardware acceleration, heterogeneous computing, and optics constitutes a critical step for both computing and optics.
The massive data parallelism, application-dependent location and function, as well as the network latency and bandwidth limitations facing networks today, complement the strengths of optical communications-based systems well. Moreover, ongoing efforts focusing on the development of low-cost optical components and subsystems suitable for the computing environment may benefit from the high-volume heterogeneous computing market. This work therefore takes the first steps in merging the areas of hardware acceleration and optics by developing architectures, protocols, and systems to interface the two technologies, and by demonstrating areas of potential benefit and areas for future work. Next-generation heterogeneous utility computing systems will indubitably benefit from the use of efficient, flexible, and high-performance optically connected hardware acceleration.
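As a purely hypothetical illustration of what an optical-network-aware allocator might do (the policy, cost model, and names below are mine, not the dissertation's), a greedy variant could hand each request the idle accelerator with the cheapest optical path:

    #include <cstdio>
    #include <cstdlib>
    #include <limits>
    #include <vector>

    struct Accelerator {
        int  port;   // position on the optical switch fabric
        bool busy;
    };

    // Pick the idle accelerator with the cheapest optical path from the
    // requesting CPU's port; returns -1 when none is free. Port distance
    // is a toy stand-in for switch-reconfiguration and path-setup cost.
    int allocate(std::vector<Accelerator>& accs, int cpu_port) {
        int best = -1, best_cost = std::numeric_limits<int>::max();
        for (int i = 0; i < static_cast<int>(accs.size()); ++i) {
            if (accs[i].busy) continue;
            const int cost = std::abs(accs[i].port - cpu_port);
            if (cost < best_cost) { best_cost = cost; best = i; }
        }
        if (best >= 0) accs[best].busy = true;
        return best;
    }

    int main() {
        std::vector<Accelerator> accs = {{0, false}, {3, false}, {7, true}};
        std::printf("allocated accelerator %d\n", allocate(accs, 2));  // -> 1
    }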
The readying of applications for heterogeneous computing
High performance computing is approaching a potentially significant change in architectural design. With pressure on cost and on the sheer amount of power consumed, additional architectural features are emerging that require a rethink of the programming models deployed over the last two decades.
Today's emerging high performance computing (HPC) systems maximise performance per unit of power consumed, with the result that the constituent parts of a system are made up of a range of different specialised building blocks, each with its own purpose. This heterogeneity is not limited to the hardware components; it also appears in the mechanisms that exploit them. These multiple levels of parallelism, instruction sets, and memory hierarchies result in truly heterogeneous computing in all aspects of the global system.
These emerging architectural solutions will require the software to exploit tremendous amounts of on-node parallelism and indeed programming models to address this are emerging. In theory, the application developer can design new software using these models to exploit emerging low power architectures. However, in practice, real industrial scale applications last the lifetimes of many architectural generations and therefore require a migration path to these next generation supercomputing platforms.
Identifying that migration path is non-trivial: with applications spanning many decades, consisting of many millions of lines of code and multiple scientific algorithms, any change to the programming model will be extensive and invasive, and the chosen model may turn out to be the wrong one for the application in question.
This makes exploration of these emerging architectures and programming models using the applications themselves problematic. Additionally, the source code of many industrial applications is not available either due to commercial or security sensitivity constraints.
This thesis highlights this problem by assessing current and emerging hardware with an industrial-strength code, demonstrating the issues described. In turn, it looks at the methodology of using proxy applications in place of real industry applications to assess their suitability on the next generation of low-power HPC offerings. It shows there are significant benefits to be realised in using proxy applications, in that fundamental issues inhibiting exploration of a particular architecture are easier to identify and hence address.
The maturity and performance portability of a number of alternative programming methodologies are evaluated on a number of architectures, highlighting the broader adoption of these proxy applications both within the author's own organisation and across the industry as a whole.
Intelligent systems for efficiency and security
As computing becomes ubiquitous and personalized, resources like energy, storage and time are becoming increasingly scarce and, at the same time, computing systems must deliver in multiple dimensions, such as high performance, quality of service, reliability, security and low power. Building such computers is hard, particularly when the operating environment is becoming more dynamic, and systems are becoming heterogeneous and distributed.
Unfortunately, computers today manage resources with many ad hoc heuristics that are suboptimal, unsafe, and cannot be composed across the computer’s subsystems. Continuing this approach has severe consequences: underperforming systems, resource waste, information loss, and even life endangerment.
This dissertation research develops computing systems which, through intelligent adaptation, deliver efficiency along multiple dimensions. The key idea is to manage computers with principled methods from formal control. It is with these methods that the multiple subsystems of a computer sense their environment and configure themselves to meet system-wide goals.
To achieve the goal of intelligent systems, this dissertation makes a series of contributions, each building on the previous. First, it introduces the use of formal MIMO (Multiple Input Multiple Output) control for processors, to simultaneously optimize many goals like performance, power, and temperature. Second, it develops the Yukta control system, which uses coordinated formal controllers in different layers of the stack (hardware and operating system). Third, it uses robust control to develop a fast, globally coordinated and decentralized control framework called Tangram, for heterogeneous computers. Finally, it presents Maya, a defense against power side-channel attacks that uses formal control to reshape the power dissipated by a computer, confusing the attacker. The ideas in the dissertation have been demonstrated successfully with several prototypes, including one built along with AMD (Advanced Micro Devices, Inc.) engineers. These designs significantly outperformed the state of the art.
The research in this dissertation brought formal control closer to computer architecture and has been well received in both domains. It includes the first application of full-fledged MIMO control to processors, the first use of robust control in computer systems, and the first application of formal control to side-channel defense. It makes a significant stride towards intelligent systems that are efficient, secure, and reliable.
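For readers unfamiliar with MIMO control, the generic discrete-time state-space form that such controllers build on is

    \[
      x_{k+1} = A\,x_k + B\,u_k, \qquad y_k = C\,x_k,
    \]

where, in the processor setting sketched here (the exact signals are the dissertation's and are not reproduced), the input vector u_k collects several actuators at once (e.g., frequency and resource allocation), the output vector y_k collects the measured goals (e.g., performance, power, temperature), and the controller chooses u_k so that y_k tracks its targets; robust control additionally preserves stability under bounded modeling error.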
Methodology for complex dataflow application development
This thesis addresses problems inherent to the development of complex applications for reconfigurable systems. Many projects fail to complete, or take much longer than originally estimated, when they rely on the traditional iterative software development processes typically used with conventional computers. Even though designer productivity can be increased by abstract programming and execution models, e.g., dataflow, development methodologies that consider the specific properties of reconfigurable systems do not exist.
The first contribution of this thesis is a design methodology to facilitate systematic development of complex applications using reconfigurable hardware in the context of High-Performance Computing (HPC). The proposed methodology is built upon a careful analysis of the original application, a software model of the intended hardware system, an analytical prediction of performance and on-chip area usage, and an iterative architectural refinement to resolve identified bottlenecks before writing a single line of code targeting the reconfigurable hardware. It is successfully validated using two real applications, and both achieve state-of-the-art performance.
The second contribution extends this methodology to provide portability between devices in two steps. First, additional tool support for contemporary multi-die Field-Programmable Gate Arrays (FPGAs) is developed. An algorithm to automatically map logical memories to heterogeneous physical memories, with special attention to die boundaries, is proposed. As a result, only the proposed algorithm managed to successfully place and route all designs used in the evaluation, while the second-best algorithm failed on one third of all large applications. Second, best practices for performance portability between different FPGA devices are collected and evaluated on a financial use case, showing efficient resource usage on five different platforms.
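The thesis's mapping algorithm is not reproduced here, but a greedy sketch conveys the problem shape: place the largest logical memories first, and never let a single memory straddle a die boundary. The capacity model and names are illustrative assumptions:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct LogicalMem  { int id; int kbits; };        // required capacity
    struct PhysicalMem { int die; int kbits_free; };  // memory budget per die

    // Largest-first, first-fit placement: each logical memory must fit
    // entirely within one die, avoiding costly die-boundary crossings.
    bool map_memories(std::vector<LogicalMem> logical,
                      std::vector<PhysicalMem>& dies) {
        std::sort(logical.begin(), logical.end(),
                  [](const LogicalMem& a, const LogicalMem& b) {
                      return a.kbits > b.kbits;
                  });
        for (const auto& m : logical) {
            auto it = std::find_if(dies.begin(), dies.end(),
                [&](const PhysicalMem& d) { return d.kbits_free >= m.kbits; });
            if (it == dies.end()) return false;  // no single die fits: fail
            it->kbits_free -= m.kbits;
        }
        return true;
    }

    int main() {
        std::vector<PhysicalMem> dies = {{0, 4000}, {1, 4000}, {2, 4000}};
        std::vector<LogicalMem> mems = {{0, 3000}, {1, 2500}, {2, 1500}};
        std::printf("mapped: %s\n", map_memories(mems, dies) ? "yes" : "no");
    }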
The third contribution applies the extended methodology to a real, highly demanding emerging application from the radiotherapy domain. A Monte-Carlo based simulation of dose accumulation in human tissue is accelerated using the proposed methodology to meet the real-time requirements of adaptive radiotherapy.