Hardware-only stream prediction + cache prefetching + dynamic access ordering
The speed gap between processors and the memory system is becoming the performance bottleneck for many applications, and computations with strided access patterns are among those that suffer most. The vectors used in such applications lack temporal and often spatial locality, and are usually too large to cache. In spite of their poor cache behavior, these access patterns have the advantage of being predictable, which can be exploited to improve the efficiency of the memory subsystem. As a promising technique for relieving the memory system bottleneck, prefetching has been studied in its various forms, and so has dynamic memory scheduling. This study builds on these results, combining a stride-based reference prediction table, a mechanism that prefetches L2 cache lines, and a memory controller that dynamically schedules accesses to a Direct Rambus memory subsystem. We find that such a system delivers impressive speedups for scientific applications with regular access patterns (reducing execution time by almost a factor of two) without negatively affecting the performance of non-streaming programs.
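As a rough illustration of the prediction side of such a design, the sketch below models a stride-based reference prediction table that confirms a stride after seeing it twice for the same load PC and then emits prefetch addresses a few cache lines ahead. The table size, confirmation policy, and prefetch degree are illustrative assumptions, not the paper's hardware parameters.

```python
# Sketch of a stride-based reference prediction table (RPT).
# Assumptions (not from the paper): 64-entry table indexed by load PC,
# 64-byte cache lines, prefetch issued only after the same stride is seen twice.

class RPTEntry:
    def __init__(self, last_addr):
        self.last_addr = last_addr
        self.stride = 0
        self.confirmed = False

class StridePrefetcher:
    def __init__(self, entries=64, line_size=64, prefetch_degree=2):
        self.entries = entries
        self.line_size = line_size
        self.degree = prefetch_degree
        self.table = {}

    def access(self, pc, addr):
        """Record a load at (pc, addr); return cache-line addresses to prefetch."""
        key = pc % self.entries
        entry = self.table.get(key)
        if entry is None:
            self.table[key] = RPTEntry(addr)
            return []
        stride = addr - entry.last_addr
        if stride != 0 and stride == entry.stride:
            entry.confirmed = True          # same stride seen twice: stream confirmed
        else:
            entry.confirmed = False
            entry.stride = stride
        entry.last_addr = addr
        if not entry.confirmed:
            return []
        # Prefetch the next few cache lines along the detected stride.
        return [((addr + (i + 1) * entry.stride) // self.line_size) * self.line_size
                for i in range(self.degree)]

# Example: a vector sweep with a 128-byte stride becomes predictable after two misses.
pf = StridePrefetcher()
for i in range(6):
    lines = pf.access(pc=0x400123, addr=0x1000 + 128 * i)
    print(hex(0x1000 + 128 * i), [hex(a) for a in lines])
```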
Characterizing and Subsetting Big Data Workloads
Big data benchmark suites must include a diversity of data and workloads to
be useful in fairly evaluating big data systems and architectures. However,
using truly comprehensive benchmarks poses great challenges for the
architecture community. First, we need to thoroughly understand the behaviors
of a variety of workloads. Second, our usual simulation-based research methods
become prohibitively expensive for big data. As big data is an emerging field,
more and more software stacks are being proposed to facilitate the development
of big data applications, which aggravates these challenges. In this paper, we
first use Principal Component Analysis (PCA) to identify the most important
characteristics from 45 metrics to characterize big data workloads from
BigDataBench, a comprehensive big data benchmark suite. Second, we apply a
clustering technique to the principal components obtained from the PCA to
investigate the similarity among big data workloads, and we verify the
importance of including different software stacks for big data benchmarking.
Third, we select seven representative big data workloads by removing redundant
ones and release the BigDataBench simulation version, which is publicly
available from http://prof.ict.ac.cn/BigDataBench/simulatorversion/.
Comment: 11 pages, 6 figures, 2014 IEEE International Symposium on Workload Characterization
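The subsetting flow described above can be pictured with a short scikit-learn sketch: standardize the metric matrix, keep the leading principal components, cluster workloads in PC space, and take the workload nearest each centroid as the representative. The metric values, variance threshold, and cluster count below are placeholders, not the paper's measured data.

```python
# Sketch of the PCA + clustering subsetting flow (illustrative values only).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Placeholder: rows = workloads, columns = the 45 microarchitectural metrics.
metrics = rng.random((30, 45))

# 1. Standardize, then keep enough principal components to explain ~90% variance.
X = StandardScaler().fit_transform(metrics)
pcs = PCA(n_components=0.90).fit_transform(X)

# 2. Cluster workloads in PC space; k=7 mirrors the seven representatives chosen.
k = 7
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pcs)

# 3. Per cluster, pick the workload closest to the centroid as the representative.
representatives = []
for c in range(k):
    members = np.flatnonzero(km.labels_ == c)
    dists = np.linalg.norm(pcs[members] - km.cluster_centers_[c], axis=1)
    representatives.append(int(members[np.argmin(dists)]))
print("representative workload indices:", sorted(representatives))
```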
Heated aquatic microcosms for climate change experiments
Ponds and shallow lakes are likely to be strongly affected by climate change, and by increases in environmental temperature in particular. Hydrological regimes and nutrient cycling may be altered, plant and animal communities may undergo changes in both composition and dynamics, and long-term, difficult-to-reverse switches between alternative stable equilibria may occur. A thorough understanding of the potential effects of increased temperature on ponds and shallow lakes is desirable because these ecosystems are of immense importance throughout the world as sources of drinking water, and for their amenity and conservation value. This understanding can only come through experimental studies in which the effects of different temperature regimes are compared. This paper reports design details and operating characteristics of a recently constructed experimental facility consisting of 48 aquatic microcosms which mimic the pond and shallow lake environment. Thirty-two of the microcosms can be heated and regulated to simulate climate change scenarios, including those predicted for the UK. The authors also summarise the current and future experimental uses of the microcosms.
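A minimal sketch of the kind of tracking control such a facility implies, where a heated microcosm follows an unheated reference plus a fixed warming offset. The offset, dead band, and sensor/heater interfaces are hypothetical and not taken from the paper.

```python
# Sketch: hold a heated microcosm at (ambient reference + scenario offset) degrees C.
# The 3 C offset, 0.2 C hysteresis band, and I/O stubs are assumptions.

SCENARIO_OFFSET_C = 3.0   # e.g. a warming scenario of +3 C above the reference pond
HYSTERESIS_C = 0.2        # dead band to avoid rapid heater switching

def control_step(reference_temp_c, heated_temp_c, heater_on):
    """Return the new heater state for one control step."""
    target = reference_temp_c + SCENARIO_OFFSET_C
    if heated_temp_c < target - HYSTERESIS_C:
        return True
    if heated_temp_c > target + HYSTERESIS_C:
        return False
    return heater_on  # inside the dead band: keep the current state

# Example: reference pond at 15.0 C, heated tank currently at 17.5 C -> keep heating.
print(control_step(reference_temp_c=15.0, heated_temp_c=17.5, heater_on=True))
```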
An approach to resource-aware coscheduling for CMPs
We develop real-time scheduling techniques for improving performance and energy for multiprogrammed workloads that scale nonuniformly with increasing thread counts. Multithreaded programs generally deliver higher throughput than single-threaded programs on chip multiprocessors, but performance gains from increasing threads decrease when there is contention for shared resources. We use analytic metrics to derive local search heuristics for creating efficient multiprogrammed, multithreaded workload schedules. Programs are allocated fewer cores than requested, and scheduled to space-share the CMP to improve global throughput. Our holistic approach attempts to co-schedule programs that complement each other with respect to shared resource consumption. We find that application co-scheduling for performance and energy in a resource-aware manner achieves better results than solely targeting total throughput or concurrently co-scheduling all programs. Our schedulers improve overall energy delay (E*D) by a factor of 1.5 over time-multiplexed gang scheduling.
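To make the co-scheduling idea concrete, the sketch below pairs programs so that co-runners' estimated shared-resource pressure complements rather than compounds. The pressure values and the exhaustive pairing search are illustrative stand-ins for the paper's analytic metrics and local-search heuristics.

```python
# Sketch: pair programs onto a space-shared CMP so that co-runners' estimated
# shared-cache/bandwidth pressure (normalized to 0..1) complements each other.
# The pressure values and the pairwise cost model are illustrative assumptions.

programs = {
    "lu": 0.9, "fft": 0.8, "blackscholes": 0.2, "swaptions": 0.3,
}

def pair_cost(a, b):
    """Penalize pairs whose combined pressure oversubscribes the shared resources."""
    return max(0.0, programs[a] + programs[b] - 1.0)

def best_pairing(names):
    """Exhaustive pairing search (fine for a handful of programs)."""
    if not names:
        return [], 0.0
    best, best_cost = None, float("inf")
    first = names[0]
    for partner in names[1:]:
        rest = [n for n in names if n not in (first, partner)]
        tail_pairs, tail_cost = best_pairing(rest)
        cost = pair_cost(first, partner) + tail_cost
        if cost < best_cost:
            best, best_cost = [(first, partner)] + tail_pairs, cost
    return best, best_cost

# Heavy programs get paired with light ones rather than with each other.
pairs, cost = best_pairing(list(programs))
print(pairs, "total contention cost:", round(cost, 2))
```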
Main memory in HPC: do we need more, or could we live with less?
An important aspect of High-Performance Computing (HPC) system design is the choice of main memory capacity. This choice becomes increasingly important now that 3D-stacked memories are entering the market. Compared with conventional Dual In-line Memory Modules (DIMMs), 3D memory chiplets provide better performance and energy efficiency but lower memory capacities. Therefore, the adoption of 3D-stacked memories in the HPC domain depends on whether we can find use cases that require much less memory than is available now.
This study analyzes the memory capacity requirements of important HPC benchmarks and applications. We find that the High-Performance Conjugate Gradients (HPCG) benchmark could be an important success story for 3D-stacked memories in HPC, but High-Performance Linpack (HPL) is likely to be constrained by 3D memory capacity. The study also emphasizes that the analysis of memory footprints of production HPC applications is complex and that it requires an understanding of application scalability and target category, i.e., whether the users target capability or capacity computing. The results show that most of the HPC applications under study have per-core memory footprints in the range of hundreds of megabytes, but we also detect applications and use cases that require gigabytes per core. Overall, the study identifies the HPC applications and use cases with memory footprints that could be provided by 3D-stacked memory chiplets, making a first step toward adoption of this novel technology in the HPC domain.
This work was supported by the Collaboration Agreement between Samsung Electronics Co., Ltd. and BSC, by the Spanish Government through the Severo Ochoa programme (SEV-2015-0493), by the Spanish Ministry of Science and Technology through the TIN2015-65316-P project, and by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). This work has also received funding from the European Union's Horizon 2020 research and innovation programme under the ExaNoDe project (grant agreement No 671578). Darko Zivanovic holds the Severo Ochoa grant (SVP-2014-068501) of the Ministry of Economy and Competitiveness of Spain. The authors thank Harald Servat from BSC and Vladimir Marjanović from High Performance Computing Center Stuttgart for their technical support.
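A back-of-the-envelope check in the spirit of this analysis: given a per-core footprint and an assumed 3D-stacked node capacity, does an application fit? The capacity, core count, and footprint figures below are placeholders rather than the study's measurements.

```python
# Sketch: check whether per-core memory footprints fit in a node's 3D-stacked
# memory.  The capacity, core count, and footprints below are placeholders.

NODE_3D_CAPACITY_GB = 32.0   # assumed 3D-stacked capacity per node
CORES_PER_NODE = 64          # assumed cores per node

def fits_in_3d_memory(per_core_footprint_mb):
    """Return (fits, per-core budget in MB) for the assumed node configuration."""
    budget_mb = NODE_3D_CAPACITY_GB * 1024.0 / CORES_PER_NODE
    return per_core_footprint_mb <= budget_mb, budget_mb

for app, footprint_mb in [("hpcg-like run", 300), ("hpl-like run", 2048)]:
    ok, budget = fits_in_3d_memory(footprint_mb)
    print(f"{app}: {footprint_mb} MB/core vs {budget:.0f} MB/core budget -> "
          f"{'fits' if ok else 'exceeds 3D capacity'}")
```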
Code density concerns for new architectures
Reducing a program's instruction count can improve cache behavior and bandwidth utilization, lower power consumption, and increase overall performance. Nonetheless, code density is an often overlooked feature in studying processor architectures. We hand-optimize an assembly language embedded benchmark for size on 21 different instruction set architectures, finding up to a factor of three difference in code sizes from ISA alone. We find that the architectural features that contribute most heavily to code density are instruction length, number of registers, availability of a zero register, bit-width, hardware divide units, number of instruction operands, and the availability of unaligned loads and stores. We extend our results to investigate operating system, compiler, and system library effects on code density. We find that the executable starting address, executable format, and system call interface all affect program size. While ISA effects are important, the efficiency of the entire system stack must be taken into account when developing a new dense instruction set architecture.
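One simple way to reproduce this kind of comparison is to read the .text section size of the same benchmark built for several targets. The sketch below assumes GNU binutils' `size` is installed; the binary names are hypothetical cross-compiled builds, not artifacts from the paper.

```python
# Sketch: compare .text (code) section sizes of the same program built for
# different targets.  Binary paths are placeholders; assumes GNU binutils `size`.
import subprocess

binaries = ["bench.x86_64", "bench.armv7", "bench.riscv64"]  # hypothetical builds

def text_size(path):
    """Parse `size -A` (SysV format) output and return the .text size in bytes."""
    out = subprocess.run(["size", "-A", path], check=True,
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0] == ".text":
            return int(fields[1])
    raise ValueError(f"no .text section reported for {path}")

sizes = {b: text_size(b) for b in binaries}
smallest = min(sizes, key=sizes.get)
for b, s in sizes.items():
    print(f"{b}: {s} bytes ({s / sizes[smallest]:.2f}x vs {smallest})")
```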