Search CORE

254 research outputs found

Energy Saving Techniques for Phase Change Memory (PCM)

Author: Mittal Sparsh
Publication venue
Publication date: 15/09/2013
Field of study

In recent years, the energy consumption of computing systems has increased and a large fraction of this energy is consumed in main memory. Towards this, researchers have proposed use of non-volatile memory, such as phase change memory (PCM), which has low read latency and power; and nearly zero leakage power. However, the write latency and power of PCM are very high and this, along with limited write endurance of PCM present significant challenges in enabling wide-spread adoption of PCM. To address this, several architecture-level techniques have been proposed. In this report, we review several techniques to manage power consumption of PCM. We also classify these techniques based on their characteristics to provide insights into them. The aim of this work is encourage researchers to propose even better techniques for improving energy efficiency of PCM based main memory.Comment: Survey, phase change RAM (PCRAM

arXiv.org e-Print Archive

CiteSeerX

Improving Phase Change Memory Performance with Data Content Aware Access

Author: Ahn S. J.
Alshboul M.
Awad A.
Awad A.
Bock S.
Bock S.
Bondurant D.
Boroumand A.
Burr G. W.
Chen J.
Chhabra S.
Dogan H.
Du Y.
Ferreira A. P.
Frigo P.
Gueron S.
Guerra J.
Ham T. J.
Hashemi M.
Hsieh K.
Hwang W.
Jia Y.
Jiang L.
Joo Y.
Kang U.
Karlsson M.
Kim J.
Kim Y.
Kim Y.
Lalam A.
Lam C. H.
Lee J. I.
Mallik A.
Marathe V. J.
Meza J.
Morikawa T.
Mutlu O.
Mutlu O.
Pourshirazi B.
Qureshi M. K.
Qureshi M. K.
Saileshwar G.
Seong N. H.
Seshadri V.
Stuecheli J.
Villa C.
Wang Y.
Wang Z.
Wuttig M.
Yamada N.
Yang J.
Yue J.
Zhang L.
Zhou M.
Zhou M.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 10/05/2020
Field of study

A prominent characteristic of write operation in Phase-Change Memory (PCM) is that its latency and energy are sensitive to the data to be written as well as the content that is overwritten. We observe that overwriting unknown memory content can incur significantly higher latency and energy compared to overwriting known all-zeros or all-ones content. This is because all-zeros or all-ones content is overwritten by programming the PCM cells only in one direction, i.e., using either SET or RESET operations, not both. In this paper, we propose data content aware PCM writes (DATACON), a new mechanism that reduces the latency and energy of PCM writes by redirecting these requests to overwrite memory locations containing all-zeros or all-ones. DATACON operates in three steps. First, it estimates how much a PCM write access would benefit from overwriting known content (e.g., all-zeros, or all-ones) by comprehensively considering the number of set bits in the data to be written, and the energy-latency trade-offs for SET and RESET operations in PCM. Second, it translates the write address to a physical address within memory that contains the best type of content to overwrite, and records this translation in a table for future accesses. We exploit data access locality in workloads to minimize the address translation overhead. Third, it re-initializes unused memory locations with known all-zeros or all-ones content in a manner that does not interfere with regular read and write accesses. DATACON overwrites unknown content only when it is absolutely necessary to do so. We evaluate DATACON with workloads from state-of-the-art machine learning applications, SPEC CPU2017, and NAS Parallel Benchmarks. Results demonstrate that DATACON significantly improves system performance and memory system energy consumption compared to the best of performance-oriented state-of-the-art techniques.Comment: 18 pages, 21 figures, accepted at ACM SIGPLAN International Symposium on Memory Management (ISMM

arXiv.org e-Print Archive

Crossref

Fast Parallel Algorithms for Basic Problems

Author: Wen Zhaofang
Publication venue: ODU Digital Commons
Publication date: 01/07/1991
Field of study

Parallel processing is one of the most active research areas these days. We are interested in one aspect of parallel processing, i.e. the design and analysis of parallel algorithms. Here, we focus on non-numerical parallel algorithms for basic combinatorial problems, such as data structures, selection, searching, merging and sorting. The purposes of studying these types of problems are to obtain basic building blocks which will be useful in solving complex problems, and to develop fundamental algorithmic techniques. In this thesis, we study the following problems: priority queues, multiple search and multiple selection, and reconstruction of a binary tree from its traversals. The research on priority queue was motivated by its various applications. The purpose of studying multiple search and multiple selection is to explore the relationships between four of the most fundamental problems in algorithm design, that is, selection, searching, merging and sorting; while our parallel solutions can be used as subroutines in algorithms for other problems. The research on the last problem, reconstruction of a binary tree from its traversals, was stimulated by a challenge proposed in a recent paper by Berkman et al. ( Highly Parallelizable Problems, STOC 89) to design doubly logarithmic time optimal parallel algorithms because a remarkably small number of such parallel algorithms exist

Old Dominion University

Task Oriented Programming for the RC64 Manycore DSP

Author: Aviely Peleg
Ginosar Ran
Publication venue: DigitalCommons@USU
Publication date: 10/08/2022
Field of study

RC64 is a rad-hard manycore DSP combining 64 VLIW/SIMD DSP cores, lock-free shared memory, a hardware scheduler and a task-based programming model. The hardware scheduler enables fast scheduling and allocation of fine grain tasks to all cores. Parallel programming is based on Tasks

DigitalCommons@USU

Parallelization methodology for video coding - an implementation on the TMS320C80

Author: Cheung PYS
Leung KK
Yung NHC
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2000
Field of study

This paper presents a parallelization methodology for video coding based on the philosophy of hiding as much communications by computation as possible. It models the task/data size, processor cache capacity, and communication contention, through a systematic decomposition and scheduling approach. With the aid of Petri-nets and task graphs for representation and analysis, it employs a triple buffering scheme to enable the functions of frame capture, management, and coding to be performed in parallel. The theoretical speedup analysis indicates that this method offers excellent communication hiding, resulting in system efficiency well above 90%. To prove its practicality, a H.261 video encoder has been implemented on a TMS320C80 system using the method. Its performance was measured, from which the speedup and efficiency figures were calculated. The only difference detected between the theoretical and measured data is the program control overhead that has not been accounted for in the theoretical model. Even with this, the measured speedup of the H.261 is 3.67 and 3.76 on four parallel processors (PPs) for QCIF and 352 × 240 video, respectively, which correspond to frame rate of 30.7 and 9.25 frames per second, and system efficiency of 91.8% and 94%, respectively. This method is particularly efficient for platforms with small number of parallel processors.published_or_final_versio

HKU Scholars Hub

Wear Leveling Revisited

Author: Onodera Taku
Shibuya Tetsuo
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 01/01/2020
Field of study

Wear leveling - a technology designed to balance the write counts among memory cells regardless of the requested accesses - is vital in prolonging the lifetime of certain computer memory devices, especially the type of next-generation non-volatile memory, known as phase change memory (PCM). Although researchers have been working extensively on wear leveling, almost all existing studies mainly focus on the practical aspects and lack rigorous mathematical analyses. The lack of theory is particularly problematic for security-critical applications. We address this issue by revisiting wear leveling from a theoretical perspective. First, we completely determine the problem parameter regime for which Security Refresh - one of the most well-known existing wear leveling schemes for PCM - works effectively by providing a positive result and a matching negative result. In particular, Security Refresh is not competitive for the practically relevant regime of large-scale memory. Then, we propose a novel scheme that achieves better lifetime, time/space overhead, and wear-free space for the relevant regime not covered by Security Refresh. Unlike existing studies, we give rigorous theoretical lifetime analyses, which is necessary to assess and control the security risk.Peer reviewe

Dagstuhl Research Online Publication Server

Helsingin yliopiston digitaalinen arkisto

Parallel dynamic lowest common ancestors

Author: A. Moitra
Alfred V. V. Aho
D. Harel
D. Maier
J. JáJá
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Crossref

Fast parallel permutation algorithms

Author: Hagerup Torben
Keller Jörg
Publication venue: Fakultät 6 - Naturwissenschaftlich-Technische Fakultät I. Fachrichtung 6.2 - Informatik
Publication date: 01/01/1994
Field of study

We investigate the problem of permuting n data items on an EREW PRAM with p processors using little additional storage. We present a simple algorithm with run time O((n/p)log n) and an improved algorithm with run time O(n/p+log nloglog(n/p)). Both algorithms require n additional global bits and O(1) local storage per processor. If prefix summation is supported at the instruction level, the run time of the improved algorithm is O(n/p). The algorithms can be used to rehash the address space of a PRAM emulation

Parallel Implementation of the Accelerated Integer GCD Algorithm

Author: WEBER KENNETH
Publication venue: Published by Elsevier Ltd.
Publication date: 30/04/1996
Field of study

AbstractThe accelerated integer greatest common divisor (GCD) algorithm has been shown to be one of the most efficient in practice. This paper describes a parallel implementation of the accelerated algorithm for the Sequent Balance, a shared-memory multiprocessor. For input of roughly 10 000 digits, it displays speed-ups of 1.6, 2.5, 3.4 and 4.0 using 2, 4, 8 and 16 processors, respectively

Elsevier - Publisher Connector