408 research outputs found
TRANSFERRING PERFORMANCE GAIN FROM SOFTWARE PREFETCHING TO ENERGY REDUCTION
Performance-enhancement techniques improve CPU speed, but at
higher cost to other valuable system resources such as power and
energy. We study this trade-off using software prefetching as the
system performance-enhancement technique. We first demonstrate
software prefetching provides an average 36% performance boost
with 8% more energy consumption and 69% higher power on six
memory-intensive benchmarks. However, when we combine prefetching
with a (unrealistic) static voltage scaling technique, the performance
gain afforded by prefetching can be traded off for savings
in power/energy consumption. In particular, we observe a 48% energy
saving when we slow down the system with prefetching so
as to match the performance of the system without prefetching.
This suggests a promising approach to build low power systems
by transforming traditional performance-enhancement techniques
into low power methods. We thus propose a real time dynamic
voltage scaling (DVS) algorithm that monitors a system’s performance
and adapts the voltage level accordingly while maintaining
the observed system performance. Our dynamic DVS algorithm
achieves a 38% energy saving without any performance loss on our
benchmark suite
A Survey of Techniques For Improving Energy Efficiency in Embedded Computing Systems
Recent technological advances have greatly improved the performance and
features of embedded systems. With the number of just mobile devices now
reaching nearly equal to the population of earth, embedded systems have truly
become ubiquitous. These trends, however, have also made the task of managing
their power consumption extremely challenging. In recent years, several
techniques have been proposed to address this issue. In this paper, we survey
the techniques for managing power consumption of embedded systems. We discuss
the need of power management and provide a classification of the techniques on
several important parameters to highlight their similarities and differences.
This paper is intended to help the researchers and application-developers in
gaining insights into the working of power management techniques and designing
even more efficient high-performance embedded systems of tomorrow
Understanding and Improving the Latency of DRAM-Based Memory Systems
Over the past two decades, the storage capacity and access bandwidth of main
memory have improved tremendously, by 128x and 20x, respectively. These
improvements are mainly due to the continuous technology scaling of DRAM
(dynamic random-access memory), which has been used as the physical substrate
for main memory. In stark contrast with capacity and bandwidth, DRAM latency
has remained almost constant, reducing by only 1.3x in the same time frame.
Therefore, long DRAM latency continues to be a critical performance bottleneck
in modern systems. Increasing core counts, and the emergence of increasingly
more data-intensive and latency-critical applications further stress the
importance of providing low-latency memory access.
In this dissertation, we identify three main problems that contribute
significantly to long latency of DRAM accesses. To address these problems, we
present a series of new techniques. Our new techniques significantly improve
both system performance and energy efficiency. We also examine the critical
relationship between supply voltage and latency in modern DRAM chips and
develop new mechanisms that exploit this voltage-latency trade-off to improve
energy efficiency.
The key conclusion of this dissertation is that augmenting DRAM architecture
with simple and low-cost features, and developing a better understanding of
manufactured DRAM chips together lead to significant memory latency reduction
as well as energy efficiency improvement. We hope and believe that the proposed
architectural techniques and the detailed experimental data and observations on
real commodity DRAM chips presented in this dissertation will enable
development of other new mechanisms to improve the performance, energy
efficiency, or reliability of future memory systems.Comment: PhD Dissertatio
Memory Management for Emerging Memory Technologies
The Memory Wall, or the gap between CPU speed and main memory latency, is ever increasing. The latency of Dynamic Random-Access Memory (DRAM) is now of the order of hundreds of CPU cycles. Additionally, the DRAM main memory is experiencing power, performance and capacity constraints that limit process technology scaling. On the other hand, the workloads running on such systems are themselves changing due to virtualization and cloud computing demanding more performance of the data centers. Not only do these workloads have larger working set sizes, but they are also changing the way memory gets used, resulting in higher sharing and increased bandwidth demands. New Non-Volatile Memory technologies (NVM) are emerging as an answer to the current main memory issues.
This thesis looks at memory management issues as the emerging memory technologies get integrated into the memory hierarchy. We consider the problems at various levels in the memory hierarchy, including sharing of CPU LLC, traffic management to future non-volatile memories behind the LLC, and extending main memory through the employment of NVM.
The first solution we propose is “Adaptive Replacement and Insertion" (ARI), an adaptive approach to last-level CPU cache management, optimizing the cache miss rate and writeback rate simultaneously. Our specific focus is to reduce writebacks as much as possible while maintaining or improving miss rate relative to conventional LRU replacement policy, with minimal hardware overhead. ARI reduces writebacks on benchmarks from SPEC2006 suite on average by 32.9% while also decreasing misses on average by 4.7%. In a PCM based memory system, this decreases energy consumption by 23% compared to LRU and provides a 49% lifetime improvement beyond what is possible with randomized wear-leveling.
Our second proposal is “Variable-Timeslice Thread Scheduling" (VATS), an OS kernel-level approach to CPU cache sharing. With modern, large, last-level caches (LLC), the time to fill the LLC is greater than the OS scheduling window. As a result, when a thread aggressively thrashes the LLC by replacing much of the data in it, another thread may not be able to recover its working set before being rescheduled. We isolate the threads in time by increasing their allotted time quanta, and allowing larger periods of time between interfering threads. Our approach, compared to conventional scheduling, mitigates up to 100% of the performance loss caused by CPU LLC interference. The system throughput is boosted by up to 15%.
As an unconventional approach to utilizing emerging memory technologies, we present a Ternary Content-Addressable Memory (TCAM) design with Flash transistors. TCAM is successfully used in network routing but can also be utilized in the OS Virtual Memory applications. Based on our layout and circuit simulation experiments, we conclude that our FTCAM block achieves an area improvement of 7.9× and a power improvement of 1.64× compared to a CMOS approach.
In order to lower the cost of Main Memory in systems with huge memory demand, it is becoming practical to extend the DRAM in the system with the less-expensive NVMe Flash, for a much lower system cost. However, given the relatively high Flash devices access latency, naively using them as main memory leads to serious performance degradation. We propose OSVPP, a software-only, OS swap-based page prefetching scheme for managing such hybrid DRAM + NVM systems. We show that it is possible to gain about 50% of the lost performance due to swapping into the NVM and thus enable the utilization of such hybrid systems for memory-hungry applications, lowering the memory cost while keeping the performance comparable to the DRAM-only system
Memory Management for Emerging Memory Technologies
The Memory Wall, or the gap between CPU speed and main memory latency, is ever increasing. The latency of Dynamic Random-Access Memory (DRAM) is now of the order of hundreds of CPU cycles. Additionally, the DRAM main memory is experiencing power, performance and capacity constraints that limit process technology scaling. On the other hand, the workloads running on such systems are themselves changing due to virtualization and cloud computing demanding more performance of the data centers. Not only do these workloads have larger working set sizes, but they are also changing the way memory gets used, resulting in higher sharing and increased bandwidth demands. New Non-Volatile Memory technologies (NVM) are emerging as an answer to the current main memory issues.
This thesis looks at memory management issues as the emerging memory technologies get integrated into the memory hierarchy. We consider the problems at various levels in the memory hierarchy, including sharing of CPU LLC, traffic management to future non-volatile memories behind the LLC, and extending main memory through the employment of NVM.
The first solution we propose is “Adaptive Replacement and Insertion" (ARI), an adaptive approach to last-level CPU cache management, optimizing the cache miss rate and writeback rate simultaneously. Our specific focus is to reduce writebacks as much as possible while maintaining or improving miss rate relative to conventional LRU replacement policy, with minimal hardware overhead. ARI reduces writebacks on benchmarks from SPEC2006 suite on average by 32.9% while also decreasing misses on average by 4.7%. In a PCM based memory system, this decreases energy consumption by 23% compared to LRU and provides a 49% lifetime improvement beyond what is possible with randomized wear-leveling.
Our second proposal is “Variable-Timeslice Thread Scheduling" (VATS), an OS kernel-level approach to CPU cache sharing. With modern, large, last-level caches (LLC), the time to fill the LLC is greater than the OS scheduling window. As a result, when a thread aggressively thrashes the LLC by replacing much of the data in it, another thread may not be able to recover its working set before being rescheduled. We isolate the threads in time by increasing their allotted time quanta, and allowing larger periods of time between interfering threads. Our approach, compared to conventional scheduling, mitigates up to 100% of the performance loss caused by CPU LLC interference. The system throughput is boosted by up to 15%.
As an unconventional approach to utilizing emerging memory technologies, we present a Ternary Content-Addressable Memory (TCAM) design with Flash transistors. TCAM is successfully used in network routing but can also be utilized in the OS Virtual Memory applications. Based on our layout and circuit simulation experiments, we conclude that our FTCAM block achieves an area improvement of 7.9× and a power improvement of 1.64× compared to a CMOS approach.
In order to lower the cost of Main Memory in systems with huge memory demand, it is becoming practical to extend the DRAM in the system with the less-expensive NVMe Flash, for a much lower system cost. However, given the relatively high Flash devices access latency, naively using them as main memory leads to serious performance degradation. We propose OSVPP, a software-only, OS swap-based page prefetching scheme for managing such hybrid DRAM + NVM systems. We show that it is possible to gain about 50% of the lost performance due to swapping into the NVM and thus enable the utilization of such hybrid systems for memory-hungry applications, lowering the memory cost while keeping the performance comparable to the DRAM-only system
CATA: Criticality aware task acceleration for multicore processors
Managing criticality in task-based programming models opens a wide range of performance and power optimization opportunities in future manycore systems. Criticality aware task schedulers can benefit from these opportunities by scheduling tasks to the most appropriate cores. However, these schedulers may suffer from priority inversion and static binding problems that limit their expected improvements. Based on the observation that task criticality information can be exploited to drive hardware reconfigurations, we propose a Criticality Aware Task Acceleration (CATA) mechanism that dynamically adapts the computational power of a task depending on its criticality. As a result, CATA achieves significant improvements over a baseline static scheduler, reaching average improvements up to 18.4% in execution time and 30.1% in Energy-Delay Product (EDP) on a simulated 32-core system. The cost of reconfiguring hardware by means of a software-only solution rises with the number of cores due to lock contention and reconfiguration overhead. Therefore, novel architectural support is proposed to eliminate these overheads on future manycore systems. This architectural support minimally extends hardware structures already present in current processors, which allows further improvements in performance with negligible overhead. As a consequence, average improvements of up to 20.4% in execution time and 34.0% in EDP are obtained, outperforming state-of-the-art acceleration proposals not aware of task criticality.This work has been supported by the Spanish Government (grant SEV2015-0493, SEV-2011-00067 of the Severo Ochoa
Program), by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316, TIN2012-34557, TIN2013-46957-C2-2-P), by Generalitat de Catalunya (contracts 2014-SGR-
1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project receives funding from the
EU’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no 610402 and from the EU’s H2020 Framework
Programme (H2020/2014-2020) under grant agreement no 671697. M. Moret´o has been partially supported by the Ministry of Economy and Competitiveness under Juan de
la Cierva postdoctoral fellowship number JCI-2012-15047.
M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the
Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243). E. Castillo has been partially supported by the Spanish Ministry of Education, Culture and Sports under grant FPU2012/2254.Peer ReviewedPostprint (author's final draft
Improving GPGPU Energy-Efficiency through Concurrent Kernel Execution and DVFS
Master'sMASTER OF SCIENC
Scalable Hierarchical Instruction Cache for Ultra-Low-Power Processors Clusters
High Performance and Energy Efficiency are critical requirements for Internet
of Things (IoT) end-nodes. Exploiting tightly-coupled clusters of programmable
processors (CMPs) has recently emerged as a suitable solution to address this
challenge. One of the main bottlenecks limiting the performance and energy
efficiency of these systems is the instruction cache architecture due to its
criticality in terms of timing (i.e., maximum operating frequency), bandwidth,
and power. We propose a hierarchical instruction cache tailored to
ultra-low-power tightly-coupled processor clusters where a relatively large
cache (L1.5) is shared by L1 private caches through a two-cycle latency
interconnect. To address the performance loss caused by the L1 capacity misses,
we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to
L1.5. We optimize the core instruction fetch (IF) stage by removing the
critical core-to-L1 combinational path. We present a detailed comparison of
instruction cache architectures' performance and energy efficiency for parallel
ultra-low-power (ULP) clusters. Focusing on the implementation, our two-level
instruction cache provides better scalability than existing shared caches,
delivering up to 20\% higher operating frequency. On average, the proposed
two-level cache improves maximum performance by up to 17\% compared to the
state-of-the-art while delivering similar energy efficiency for most relevant
applications.Comment: 14 page
- …