A survey of emerging architectural techniques for improving cache energy consumption
The search continues for another groundbreaking technique to reduce the ever-widening gap between CPU performance and storage performance. Fabrication technologies and changes in chip design have produced encouraging gains in CPU performance, but storage has not kept pace, and overall system performance suffers materially as a result. Much research effort has gone into techniques that improve the energy efficiency of cache architectures. This work surveys energy-saving techniques, grouped by whether they save dynamic energy, leakage energy, or both. Its aim is to compile a quick reference guide to energy-saving techniques from 2013 to 2016 for engineers, researchers and students.
Evaluating kernels on Xeon Phi to accelerate Gysela application
This work describes the challenges presented by porting parts of the Gysela
code to the Intel Xeon Phi coprocessor, as well as techniques used for
optimization, vectorization and tuning that can be applied to other
applications. We evaluate the performance of some generic micro-benchmarks on Phi
versus Intel Sandy Bridge. Several interpolation kernels useful for the Gysela
application are analyzed and their performance is shown. Some memory-bound and
compute-bound kernels are accelerated by a factor of 2 on the Phi device compared
to the Sandy Bridge architecture. Nevertheless, it is hard, if not impossible, to reach a
large fraction of the peak performance on the Phi device, especially for
real-life applications such as Gysela. A collateral benefit of this optimization and
tuning work is that the execution time of Gysela (using 4D advections) has
decreased on a standard architecture such as Intel Sandy Bridge. Comment: submitted to ESAIM proceedings for CEMRACS 2014 summer school version
reviewe
Performance Debugging and Tuning using an Instruction-Set Simulator
Instruction-set simulators allow programmers a detailed level of insight into,
and control over, the execution of a program, including parallel programs and
operating systems. In principle, instruction set simulation can model any
target computer and gather any statistic. Furthermore, such simulators are
usually portable, independent of compiler tools, and deterministic, allowing
bugs to be recreated or measurements repeated. Though often viewed as being
too slow for use as a general programming tool, in the last several years
their performance has improved considerably.
We describe SimICS, an instruction set simulator of SPARC-based
multiprocessors developed at SICS, in its rôle as a general programming tool.
We discuss some of the benefits of using a tool such as SimICS to support
various tasks in software engineering, including debugging, testing, analysis,
and performance tuning. We present in some detail two test cases, where we've
used SimICS to support analysis and performance tuning of two applications,
Penny and EQNTOTT. This work resulted in improved parallelism in, and
understanding of, Penny, as well as a performance improvement for EQNTOTT of
over an order of magnitude. We also present some early work on analyzing SPARC/Linux,
demonstrating the ability of tools like SimICS to analyze operating systems.
Exploiting cache locality at run-time
With the increasing gap between the speeds of the processor and memory system, memory access has become a major performance bottleneck in modern computer systems. Recently, Symmetric Multi-Processor (SMP) systems have emerged as a major class of high-performance platforms. Improving the memory performance of parallel applications with dynamic memory-access patterns on SMP systems is a hard problem. The solution to this problem is critical to the successful use of SMP systems because dynamic memory-access patterns occur in many real-world applications. This dissertation is aimed at solving this problem. Based on a rigorous analysis of cache-locality optimization, we propose a memory-layout oriented run-time technique to exploit the cache locality of parallel loops. Our technique has been implemented in a run-time system. Using simulation and measurement, we have shown that our run-time approach achieves performance comparable to compiler optimizations for regular applications, whose load balance and cache locality can be well optimized by tiling and other program transformations. However, our approach was shown to significantly improve memory performance for applications with dynamic memory-access patterns. Such applications are usually hard to optimize with static compiler optimizations. Several contributions are made in this dissertation. We present models to characterize the complexity and present a solution framework for optimizing cache locality. We present an effective estimation technique for memory-access patterns to support efficient locality optimizations and information integration. We present a memory-layout oriented run-time technique for locality optimization. We present efficient scheduling algorithms to trade off locality and load imbalance. We provide a detailed performance evaluation of the run-time technique.
Developing a compiler for the XeonPhi (TR-2014-341)
The XeonPhi is a highly parallel x86 architecture
chip made by Intel. It has a number of novel features which make it
a particularly challenging target for the compiler writer. This paper
describes the techniques used to port the Glasgow Vector Pascal Compiler (VPC)
to this architecture and assesses its performance by comparing the XeonPhi with
3 other machines running the same algorithms.
The Cost of Application-Class Processing: Energy and Performance Analysis of a Linux-Ready 1.7-GHz 64-Bit RISC-V Core in 22-nm FDSOI Technology
The open-source RISC-V instruction set architecture (ISA) is gaining traction, both in industry and academia. The ISA is designed to scale from microcontrollers to server-class processors. Furthermore, openness promotes the availability of various open-source and commercial implementations. Our main contribution in this paper is a thorough power, performance, and efficiency analysis of the RISC-V ISA targeting baseline "application class" functionality, i.e., supporting the Linux OS and its application environment based on our open-source single-issue in-order implementation of the 64-bit ISA variant (RV64GC) called Ariane. Our analysis is based on a detailed power and efficiency analysis of the RISC-V ISA extracted from silicon measurements and calibrated simulation of an Ariane instance (RV64IMC) taped out in GlobalFoundries 22FDX technology. Ariane runs at up to 1.7 GHz and achieves up to 40 Gop/sW energy efficiency, which is superior to similar cores presented in the literature. We provide insight into the interplay between functionality required for application-class execution (e.g., virtual memory, caches, and multiple modes of privileged operation) and energy cost. We also compare Ariane with RISCY, a simpler and slower microcontroller-class core. Our analysis confirms that supporting application-class execution implies a nonnegligible energy-efficiency loss and that compute performance is more cost-effectively boosted by instruction extensions (e.g., packed SIMD) than by high-frequency operation.
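As a quick aid for reading the efficiency figure quoted for Ariane, the sketch below inverts an efficiency given in giga-operations per second per watt into energy per operation. The function name is hypothetical, and what counts as an "operation" is whatever the paper's own metric defines; this is only unit arithmetic, not a result from the paper.

```python
def picojoules_per_op(gops_per_watt: float) -> float:
    """Convert an efficiency in Gop/s/W into energy per operation in pJ.

    Hypothetical helper for illustration only.
    """
    if gops_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    # 1 Gop/s/W equals 1e9 operations per joule, i.e. 1 nJ per operation,
    # and 1 nJ is 1000 pJ.
    return 1000.0 / gops_per_watt


# The peak efficiency quoted in the abstract.
print(picojoules_per_op(40.0))  # prints 25.0
```

Under this reading, the quoted peak of 40 Gop/sW corresponds to roughly 25 pJ per operation.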
A Comparative Xeon and CBE Performance Analysis
The Cell Broadband Engine is a high performance multicore processor with superb performance on certain types of problems. However, it does not perform as well running other algorithms, particularly those with heavy branching. The Intel Xeon processor is a high performance superscalar processor. It utilizes a high clock speed and deep pipelines to help it achieve superior performance. But deep pipelines can perform poorly with frequent memory accesses. This paper is a study and attempt at quantifying the types of programmatic structures that are more suitable to a particular architecture. It focuses on the issues of pipelines, memory access and branching on these two microprocessor architectures
RowCore: A Processing-Near-Memory Architecture for Big Data Machine Learning
The technology-push of die stacking and application-pull of
Big Data machine learning (BDML) have created a unique
opportunity for processing-near-memory (PNM). This paper
makes four contributions: (1) While previous PNM work
explores general MapReduce workloads, we identify three
workload characteristics: (a) irregular-and-compute-light (i.e.,
perform only a few operations per input word which include
data-dependent branches and indirect memory accesses); (b)
compact (i.e., the computation has a small intermediate live
data and uses only a small amount of contiguous input data);
and (c) memory-row-dense (i.e., process the input data without
skipping over many bytes). We show that BDMLs have
or can be transformed to have these characteristics which,
except for irregularity, are necessary for bandwidth- and energy-efficient
PNM, irrespective of the architecture. (2) Based on
these characteristics, we propose RowCore, a row-oriented
PNM architecture, which (pre)fetches and operates on entire
memory rows to exploit BDMLs’ row-density. Instead
of this row-centric access and compute-schedule, traditional
architectures opportunistically improve row locality while
fetching and operating on cache blocks. (3) RowCore employs
well-known MIMD execution to handle BDMLs’ irregularity,
and sequential prefetch of input data to hide memory
latency. In RowCore, however, one corelet prefetches
a row for all the corelets which may stray far from each
other due to their MIMD execution. Consequently, a leading
corelet may prematurely evict the prefetched data before
a lagging corelet has consumed the data. RowCore employs
novel cross-corelet flow-control to prevent such eviction. (4)
RowCore further exploits its flow-controlled prefetch for frequency
scaling based on novel coarse-grain compute-memory
rate-matching which decreases (increases) the processor clock
speed when the prefetch buffers are empty (full). Using simulations,
we show that RowCore improves performance and
energy, by 135% and 20% over a GPGPU with prefetch,
and by 35% and 34% over a multicore with prefetch, when
all three architectures use the same resources (i.e., number
of cores, and on-processor-die memory) and identical die-stacking
(i.e., GPGPUs/multicores/RowCore and DRAM).
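The coarse-grain compute-memory rate-matching described in contribution (4) can be sketched as a simple control rule: slow the clock when the prefetch buffers run empty, speed it up when they are full. The sketch below is only an illustration of that rule under stated assumptions; the function name, the step size, the frequency bounds and the exact empty/full thresholds are all hypothetical and are not taken from the paper.

```python
def next_clock_mhz(buffered_rows: int, buffer_capacity: int, clock_mhz: int,
                   step_mhz: int = 100, min_mhz: int = 400,
                   max_mhz: int = 1600) -> int:
    """Pick the next clock frequency from prefetch-buffer occupancy.

    All parameters and default values are assumptions for illustration.
    """
    if buffered_rows == 0:
        # Buffers empty: compute is outrunning memory, so lower the clock.
        return max(min_mhz, clock_mhz - step_mhz)
    if buffered_rows >= buffer_capacity:
        # Buffers full: memory is outrunning compute, so raise the clock.
        return min(max_mhz, clock_mhz + step_mhz)
    # Otherwise compute and memory rates are roughly matched; hold steady.
    return clock_mhz


print(next_clock_mhz(0, 8, 1000))  # prints 900
print(next_clock_mhz(8, 8, 1000))  # prints 1100
print(next_clock_mhz(4, 8, 1000))  # prints 1000
```

A real design would presumably evaluate this decision at coarse intervals rather than per access, in keeping with the "coarse-grain" wording of the abstract.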