5,250 research outputs found
Evaluating Multicore Algorithms on the Unified Memory Model
One of the challenges to achieving good performance on multicore architectures is the effective utilization of the underlying memory hierarchy. While this is an issue for single-core architectures, it is a critical problem for multicore chips. In this paper, we formulate the unified multicore model (UMM) to help understand the fundamental limits on cache performance on these architectures. The UMM seamlessly handles different types of multiple-core processors with varying degrees of cache sharing at different levels. We demonstrate that our model can be used to study a variety of multicore architectures on a variety of applications. In particular, we use it to analyze an option pricing problem using the trinomial model and develop an algorithm for it that has near-optimal memory traffic between cache levels. We have implemented the algorithm on a two Quad-Core Intel Xeon 5310 1.6 GHz processors (8 cores). It achieves a peak performance of 19.5 GFLOPs, which is 38% of the theoretical peak of the multicore system. We demonstrate that our algorithm outperforms compiler-optimized and auto-parallelized code by a factor of up to 7.5
Reducing Timing Interferences in Real-Time Applications Running on Multicore Architectures
We introduce a unified wcet analysis and scheduling framework for real-time applications deployed on multicore architectures. Our method does not follow a particular programming model, meaning that any piece of existing code (in particular legacy) can be re-used, and aims at reducing automatically the worst-case number of timing interferences between tasks. Our method is based on the notion of Time Interest Points (tips), which are instructions that can generate and/or suffer from timing interferences. We show how such points can be extracted from the binary code of applications and selected prior to performing the wcet analysis. We then represent real-time tasks as sequences of time intervals separated by tips, and schedule those tasks so that the overall makespan (including the potential timing penalties incurred by interferences) is minimized. This scheduling phase is performed using an Integer Linear Programming (ilp) solver. Preliminary results on state-of-the-art benchmarks show promising results and pave the way for future extensions of the model and optimizations
Cache-aware static scheduling for hard real-time multicore systems based on communication affinities
The growing need for continuous processing capabilities has led to the
development of multicore systems with a complex cache hierarchy. Such multicore
systems are generally designed for improving the performance in average case,
while hard real-time systems must consider worst-case scenarios. An open
challenge is therefore to efficiently schedule hard real-time tasks on a
multicore architecture. In this work, we propose a mathematical formulation for
computing a static scheduling that minimize L1 data cache misses between hard
real-time tasks on a multicore architecture using communication affinities
On the tailoring of CAST-32A certification guidance to real COTS multicore architectures
The use of Commercial Off-The-Shelf (COTS) multicores in real-time industry is on the rise due to multicores' potential performance increase and energy reduction. Yet, the unpredictable impact on timing of contention in shared hardware resources challenges certification. Furthermore, most safety certification standards target single-core architectures and do not provide explicit guidance for multicore processors. Recently, however, CAST-32A has been presented providing guidance for software planning, development and verification in multicores. In this paper, from a theoretical level, we provide a detailed review of CAST-32A objectives and the difficulty of reaching them under current COTS multicore design trends; at experimental level, we assess the difficulties of the application of CAST-32A to a real multicore processor, the NXP P4080.This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant
TIN2015-65316-P and the HiPEAC Network of Excellence.
Jaume Abella has been partially supported by the MINECO under Ramon y Cajal grant RYC-2013-14717.Peer ReviewedPostprint (author's final draft
Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips
The trend in industry is towards heterogeneous multicore processors (HMCs),
including chips with CPUs and massively-threaded throughput-oriented processors
(MTTOPs) such as GPUs. Although current homogeneous chips tightly couple the
cores with cache-coherent shared virtual memory (CCSVM), this is not the
communication paradigm used by any current HMC. In this paper, we present a
CCSVM design for a CPU/MTTOP chip, as well as an extension of the pthreads
programming model, called xthreads, for programming this HMC. Our goal is to
evaluate the potential performance benefits of tightly coupling heterogeneous
cores with CCSVM
Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code
This paper introduces Tiramisu, a polyhedral framework designed to generate
high performance code for multiple platforms including multicores, GPUs, and
distributed machines. Tiramisu introduces a scheduling language with novel
extensions to explicitly manage the complexities that arise when targeting
these systems. The framework is designed for the areas of image processing,
stencils, linear algebra and deep learning. Tiramisu has two main features: it
relies on a flexible representation based on the polyhedral model and it has a
rich scheduling language allowing fine-grained control of optimizations.
Tiramisu uses a four-level intermediate representation that allows full
separation between the algorithms, loop transformations, data layouts, and
communication. This separation simplifies targeting multiple hardware
architectures with the same algorithm. We evaluate Tiramisu by writing a set of
image processing, deep learning, and linear algebra benchmarks and compare them
with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu
matches or outperforms existing compilers and libraries on different hardware
architectures, including multicore CPUs, GPUs, and distributed machines.Comment: arXiv admin note: substantial text overlap with arXiv:1803.0041
- …