1,738 research outputs found
HALLS: An Energy-Efficient Highly Adaptable Last Level STT-RAM Cache for Multicore Systems
Spin-Transfer Torque RAM (STT-RAM) is widely considered a promising
alternative to SRAM in the memory hierarchy due to STT-RAM's non-volatility,
low leakage power, high density, and fast read speed. The STT-RAM's small
feature size is particularly desirable for the last-level cache (LLC), which
typically consumes a large area of silicon die. However, long write latency and
high write energy still remain challenges of implementing STT-RAMs in the CPU
cache. An increasingly popular method for addressing this challenge involves
trading off the non-volatility for reduced write speed and write energy by
relaxing the STT-RAM's data retention time. However, in order to maximize
energy saving potential, the cache configurations, including STT-RAM's
retention time, must be dynamically adapted to executing applications' variable
memory needs. In this paper, we propose a highly adaptable last level STT-RAM
cache (HALLS) that allows the LLC configurations and retention time to be
adapted to applications' runtime execution requirements. We also propose
low-overhead runtime tuning algorithms to dynamically determine the best
(lowest energy) cache configurations and retention times for executing
applications. Compared to prior work, HALLS reduced the average energy
consumption by 60.57% in a quad-core system, while introducing marginal latency
overhead.Comment: To Appear on IEEE Transactions on Computers (TC
Crystal gazer : profile-driven write-rationing garbage collection for hybrid memories
Non-volatile memories (NVM) offer greater capacity than DRAM but suffer from high latency and low write endurance. Hybrid memories combine DRAM and NVM to form scalable memory systems with the promise of high capacity, low energy consumption, and high endurance. Automatically managing hybrid NVM-DRAM memories to achieve their promise without changing user applications or their programming models remains an open question. This paper uses garbage collection in managed languages to exploit NVM capacity while preventing NVM wear out in hybrid memories with no changes to the programming model. We introduce profile-driven write-rationing garbage collection. Allocation sites that produce frequently written objects are predicted based on previous program executions. Objects are initially allocated in a DRAM nursery space. The collector copies surviving nursery objects from highly written sites to a mature DRAM space and read-mostly objects to a mature NVM space.Write-intensity prediction for 15 Java benchmarks accurately places objects in the correct space, eliminating expensive object monitoring from prior write-rationing garbage collectors. Furthermore, our technique exposes a Pareto tradeoff between DRAM usage and NVM lifetime, unlike prior work. Experimental results on NUMA hardware that emulates hybrid NVM-DRAM memory demonstrates that profile-driven write-rationing garbage collection reduces the number of writes to NVM compared to prior work to extend its lifetime, maximizes the use of NVM for its capacity, and achieves good performance
Many-Task Computing and Blue Waters
This report discusses many-task computing (MTC) generically and in the
context of the proposed Blue Waters systems, which is planned to be the largest
NSF-funded supercomputer when it begins production use in 2012. The aim of this
report is to inform the BW project about MTC, including understanding aspects
of MTC applications that can be used to characterize the domain and
understanding the implications of these aspects to middleware and policies.
Many MTC applications do not neatly fit the stereotypes of high-performance
computing (HPC) or high-throughput computing (HTC) applications. Like HTC
applications, by definition MTC applications are structured as graphs of
discrete tasks, with explicit input and output dependencies forming the graph
edges. However, MTC applications have significant features that distinguish
them from typical HTC applications. In particular, different engineering
constraints for hardware and software must be met in order to support these
applications. HTC applications have traditionally run on platforms such as
grids and clusters, through either workflow systems or parallel programming
systems. MTC applications, in contrast, will often demand a short time to
solution, may be communication intensive or data intensive, and may comprise
very short tasks. Therefore, hardware and software for MTC must be engineered
to support the additional communication and I/O and must minimize task dispatch
overheads. The hardware of large-scale HPC systems, with its high degree of
parallelism and support for intensive communication, is well suited for MTC
applications. However, HPC systems often lack a dynamic resource-provisioning
feature, are not ideal for task communication via the file system, and have an
I/O system that is not optimized for MTC-style applications. Hence, additional
software support is likely to be required to gain full benefit from the HPC
hardware
Using Class-Level Static Properties to Predict Object Lifetimes
Today, most modern programming languages such as C # or Java use an automatic memory management system also known as a Garbage Collector (GC). Over the course of program execution, new objects are allocated in memory, and some older objects become unreachable (die). In order for the program to keep running, it becomes necessary to free the memory of dead objects; this task is performed periodically by the GC.
Research has shown that most objects die young and as a result, generational collectors have become very popular over the years. Yet, these algorithms are not good at handling long-lived objects. Typically, long-lived objects would first be allocated in the nursery space and be promoted (copied) to an older generation after surviving a garbage collection, hence wasting precious time.
By allocating long-lived and immortal objects directly into infrequently or never collected regions, pretenuring can reduce garbage collection costs significantly. Current state of the art methodology to predict object lifetime involves off-line profiling combined with a simple, heuristic classification. Profiling is slow (can take days), requires gathering gigabytes of data that need to be analysed (can take hours), and needs to be repeated for every previously unseen program.
This thesis explores the space of lifetime predictions and shows how object lifetimes can be predicted accurately and quickly using simple program characteristics gathered within minutes. Following an innovative methodology introduced in this thesis, object lifetime predictions are fed into a specifically modified Java virtual machine. Performance tests show gains in GC times of as much as 77% for the “SPEC jvm98” benchmarks, against a generational copying collector
Energy-Efficient GPU Clusters Scheduling for Deep Learning
Training deep neural networks (DNNs) is a major workload in datacenters
today, resulting in a tremendously fast growth of energy consumption. It is
important to reduce the energy consumption while completing the DL training
jobs early in data centers. In this paper, we propose PowerFlow, a GPU clusters
scheduler that reduces the average Job Completion Time (JCT) under an energy
budget. We first present performance models for DL training jobs to predict the
throughput and energy consumption performance with different configurations.
Based on the performance models, PowerFlow dynamically allocates GPUs and
adjusts the GPU-level or job-level configurations of DL training jobs.
PowerFlow applies network packing and buddy allocation to job placement, thus
avoiding extra energy consumed by cluster fragmentations. Evaluation results
show that under the same energy consumption, PowerFlow improves the average JCT
by 1.57 - 3.39 x at most, compared to competitive baselines
- …