1,178 research outputs found
Automatic Loop Kernel Analysis and Performance Modeling With Kerncraft
Analytic performance models are essential for understanding the performance
characteristics of loop kernels, which consume a major part of CPU cycles in
computational science. Starting from a validated performance model one can
infer the relevant hardware bottlenecks and promising optimization
opportunities. Unfortunately, analytic performance modeling is often tedious
even for experienced developers since it requires in-depth knowledge about the
hardware and how it interacts with the software. We present the "Kerncraft"
tool, which eases the construction of analytic performance models for streaming
kernels and stencil loop nests. Starting from the loop source code, the problem
size, and a description of the underlying hardware, Kerncraft can ideally
predict the single-core performance and scaling behavior of loops on multicore
processors using the Roofline or the Execution-Cache-Memory (ECM) model. We
describe the operating principles of Kerncraft with its capabilities and
limitations, and we show how it may be used to quickly gain insights by
accelerated analytic modeling.Comment: 11 pages, 4 figures, 8 listing
The "MIND" Scalable PIM Architecture
MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a
Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on
each chip and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real time execution, and active power management. This paper describes the major elements and operational methods of the MIND
architecture
Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels
Achieving optimal program performance requires deep insight into the
interaction between hardware and software. For software developers without an
in-depth background in computer architecture, understanding and fully utilizing
modern architectures is close to impossible. Analytic loop performance modeling
is a useful way to understand the relevant bottlenecks of code execution based
on simple machine models. The Roofline Model and the Execution-Cache-Memory
(ECM) model are proven approaches to performance modeling of loop nests. In
comparison to the Roofline model, the ECM model can also describes the
single-core performance and saturation behavior on a multicore chip. We give an
introduction to the Roofline and ECM models, and to stencil performance
modeling using layer conditions (LC). We then present Kerncraft, a tool that
can automatically construct Roofline and ECM models for loop nests by
performing the required code, data transfer, and LC analysis. The layer
condition analysis allows to predict optimal spatial blocking factors for loop
nests. Together with the models it enables an ab-initio estimate of the
potential benefits of loop blocking optimizations and of useful block sizes. In
cases where LC analysis is not easily possible, Kerncraft supports a cache
simulator as a fallback option. Using a 25-point long-range stencil we
demonstrate the usefulness and predictive power of the Kerncraft tool.Comment: 22 pages, 5 figure
S-Net for multi-memory multicores
Copyright ACM, 2010. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the 5th ACM SIGPLAN Workshop on Declarative Aspects of Multicore Programming: http://doi.acm.org/10.1145/1708046.1708054S-Net is a declarative coordination language and component technology aimed at modern multi-core/many-core architectures and systems-on-chip. It builds on the concept of stream processing to structure dynamically evolving networks of communicating asynchronous components. Components themselves are implemented using a conventional language suitable for the application domain. This two-level software architecture maintains a familiar sequential development environment for large parts of an application and offers a high-level declarative approach to component coordination. In this paper we present a conservative language extension for the placement of components and component networks in a multi-memory environment, i.e. architectures that associate individual compute cores or groups thereof with private memories. We describe a novel distributed runtime system layer that complements our existing multithreaded runtime system for shared memory multicores. Particular emphasis is put on efficient management of data communication. Last not least, we present preliminary experimental data
Streaming 1.9 Billion Hypersparse Network Updates per Second with D4M
The Dynamic Distributed Dimensional Data Model (D4M) library implements
associative arrays in a variety of languages (Python, Julia, and Matlab/Octave)
and provides a lightweight in-memory database implementation of hypersparse
arrays that are ideal for analyzing many types of network data. D4M relies on
associative arrays which combine properties of spreadsheets, databases,
matrices, graphs, and networks, while providing rigorous mathematical
guarantees, such as linearity. Streaming updates of D4M associative arrays put
enormous pressure on the memory hierarchy. This work describes the design and
performance optimization of an implementation of hierarchical associative
arrays that reduces memory pressure and dramatically increases the update rate
into an associative array. The parameters of hierarchical associative arrays
rely on controlling the number of entries in each level in the hierarchy before
an update is cascaded. The parameters are easily tunable to achieve optimal
performance for a variety of applications. Hierarchical arrays achieve over
40,000 updates per second in a single instance. Scaling to 34,000 instances of
hierarchical D4M associative arrays on 1,100 server nodes on the MIT SuperCloud
achieved a sustained update rate of 1,900,000,000 updates per second. This
capability allows the MIT SuperCloud to analyze extremely large streaming
network data sets.Comment: 6 pages; 6 figures; accepted to IEEE High Performance Extreme
Computing (HPEC) Conference 2019. arXiv admin note: text overlap with
arXiv:1807.05308, arXiv:1902.0084
Mainstream parallel array programming on cell
We present the E] compiler and runtime library for the âFâ subset of
the Fortran 95 programming language. âFâ provides first-class support for arrays,
allowing E] to implicitly evaluate array expressions in parallel using the SPU coprocessors
of the Cell Broadband Engine. We present performance results from
four benchmarks that all demonstrate absolute speedups over equivalent âCâ or
Fortran versions running on the PPU host processor. A significant benefit of this
straightforward approach is that a serial implementation of any code is always
available, providing code longevity, and a familiar development paradigm
- âŠ