
    Fault- and Yield-Aware On-Chip Memory Design and Management

    Ever-decreasing device sizes cause more frequent hard faults, which have become a serious burden to processor design and yield management. This problem is particularly pronounced in on-chip memory, which consumes up to 70% of a processor's total chip area. Traditional circuit-level techniques, such as redundancy and error correction codes, become less effective in error-prevalent environments because of their large area overhead. In this work, we suggest an architectural solution for building reliable on-chip memory in future processor environments. Our approach has two parts: a design framework and architectural techniques for on-chip memory structures. The design framework provides important architectural evaluation metrics such as yield, area, and performance based on low-level defect and process variation parameters, so processor architects can quickly evaluate their designs in terms of yield, area, and performance. With the framework, we develop architectural yield enhancement solutions for on-chip memory structures including the L1 cache, L2 cache, and directory memory. Our proposed solutions greatly improve yield with negligible area and performance overhead. Furthermore, we develop a decoupled yield model of compute cores and L2 caches in CMPs, which shows that there will be many more L2 caches than compute cores in a chip. We propose efficient utilization techniques for these excess caches, and evaluation results show that excess caches significantly improve the overall performance of CMPs.
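    As a rough illustration of the kind of yield estimate such a framework might produce, the sketch below computes the yield of a memory array from a per-cell defect density using a simple Poisson fault model with spare rows. The array geometry, defect density, and redundancy count are illustrative assumptions, not parameters taken from this work.

        import math

        def row_fault_prob(cells_per_row: int, defect_density_per_cell: float) -> float:
            """Probability that a row contains at least one faulty cell (Poisson model)."""
            return 1.0 - math.exp(-defect_density_per_cell * cells_per_row)

        def array_yield(rows: int, cells_per_row: int,
                        defect_density_per_cell: float, spare_rows: int) -> float:
            """Yield of a memory array that can repair up to spare_rows faulty rows."""
            p = row_fault_prob(cells_per_row, defect_density_per_cell)
            # The array is usable if the number of faulty rows does not exceed the spares.
            return sum(math.comb(rows, k) * (p ** k) * ((1 - p) ** (rows - k))
                       for k in range(spare_rows + 1))

        if __name__ == "__main__":
            # Illustrative numbers only: a 512 x 512 cell array with 2 spare rows.
            y = array_yield(rows=512, cells_per_row=512,
                            defect_density_per_cell=1e-6, spare_rows=2)
            print(f"estimated array yield: {y:.4f}")

    A fuller framework would combine such per-structure yield estimates with area and performance models across caches and directory memory.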

    Resource-aware scheduling for 2D/3D multi-/many-core processor-memory systems

    This dissertation addresses the complexities of 2D/3D multi-/many-core processor-memory systems, focusing on two key areas: enhancing timing predictability in real-time multi-core processors and optimizing performance within thermal constraints. The integration of an increasing number of transistors into compact chip designs, while boosting computational capacity, presents challenges in resource contention and thermal management. The first part of the thesis improves timing predictability. We enhance shared cache interference analysis for set-associative caches, advancing the calculation of Worst-Case Execution Time (WCET). This development enables accurate assessment of cache interference and the effectiveness of partitioned schedulers in real-world scenarios. We introduce TCPS, a novel task and cache-aware partitioned scheduler that optimizes cache partitioning based on task-specific WCET sensitivity, leading to improved schedulability and predictability. Our research explores various cache and scheduling configurations, providing insights into their performance trade-offs. The second part focuses on thermal management in 2D/3D many-core systems. Recognizing the limitations of Dynamic Voltage and Frequency Scaling (DVFS) in S-NUCA many-core processors, we propose synchronous thread migrations as a thermal management strategy. This approach culminates in the HotPotato scheduler, which balances performance and thermal safety. We also introduce 3D-TTP, a transient temperature-aware power budgeting strategy for 3D-stacked systems, reducing the need for Dynamic Thermal Management (DTM) activation. Finally, we present 3QUTM, a novel method for 3D-stacked systems that combines core DVFS and memory bank Low Power Modes with a learning algorithm, optimizing response times within thermal limits. This research contributes significantly to enhancing performance and thermal management in advanced processor-memory systems
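    To make the idea of WCET-sensitivity-aware cache partitioning concrete, the sketch below greedily hands out LLC ways to whichever task's estimated WCET benefits most from an extra way. The per-task WCET curves, the way granularity, and the greedy policy are assumptions for illustration; this is not the TCPS algorithm itself.

        from typing import Dict, List

        def greedy_way_allocation(wcet: Dict[str, List[float]], total_ways: int) -> Dict[str, int]:
            """Hand out cache ways one at a time to the task whose WCET
            drops the most from receiving an extra way.

            wcet[task][w] is the task's estimated WCET when it owns w ways
            (index 0 = no dedicated ways); these curves would come from static
            WCET analysis and are plain inputs here.
            """
            alloc = {task: 0 for task in wcet}

            def benefit(task: str) -> float:
                w, curve = alloc[task], wcet[task]
                return curve[w] - curve[w + 1] if w + 1 < len(curve) else 0.0

            for _ in range(total_ways):
                best = max(alloc, key=benefit)
                if benefit(best) <= 0.0:
                    break  # no task gains anything from further ways
                alloc[best] += 1
            return alloc

        # Illustrative WCET curves (milliseconds) for three tasks on an 8-way LLC.
        curves = {
            "task_a": [10.0, 8.0, 7.0, 6.8, 6.7, 6.7, 6.7, 6.7, 6.7],
            "task_b": [5.0, 4.9, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8],
            "task_c": [20.0, 15.0, 12.0, 10.0, 9.0, 8.5, 8.2, 8.0, 7.9],
        }
        print(greedy_way_allocation(curves, total_ways=8))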

    Efficient concurrent data structure access parallelism techniques for increasing scalability

    Multi-core processors have revolutionised the way data structures are designed by bringing parallelism to mainstream computing. Key to exploiting the hardware parallelism available in multi-core processors are concurrent data structures. However, some concurrent data structure abstractions are inherently sequential and incapable of harnessing the parallel performance of multi-core processors. Designing and implementing concurrent data structures that do harness hardware parallelism is challenging because of the requirements of correctness, efficiency, and practicability under various application constraints. In this thesis, our research contribution is towards improving concurrent data structure access parallelism to increase data structure performance. We propose new design frameworks that improve the access parallelism of existing concurrent data structure designs, as well as new concurrent data structure designs with significant performance improvements. To give insight into the interplay between hardware and concurrent data structure access parallelism, we also give a detailed analysis and model performance scalability with varying parallelism.

    In the first part of the thesis, we focus on data structure semantic relaxation. By relaxing the semantics of a data structure, a bigger design space is unveiled that allows weaker synchronization and more useful parallelism. Investigating new data structure designs capable of trading semantics for better performance in a monotonic way is a major challenge in the area, and we address this challenge algorithmically. We present an efficient, lock-free, concurrent data structure design framework for out-of-order semantic relaxation, introducing a new two-dimensional algorithmic design that uses multiple instances of a given data structure to improve access parallelism (a toy illustration of the multi-instance idea follows this abstract).

    In the second part of the thesis, we propose an efficient priority queue that improves access parallelism by reducing the number of synchronization points per operation. Priority queues are fundamental abstract data types, often used to manage limited resources in parallel systems. Typical parallel priority queue implementations are based on heaps or skip lists, and in recent literature skip lists have been shown to be the most efficient design choice for implementing priority queues. Though numerous intricate implementations of skip-list-based queues have been proposed, their performance is constrained by the high number of global atomic updates per operation and by high memory consumption, both of which are proportional to the number of sub-lists in the queue. We therefore propose an alternative approach for designing lock-free linearizable priority queues that significantly improves memory efficiency and throughput by reducing the number of global atomic updates and the memory consumption compared to skip-list-based queues. To achieve this, our new design combines two structures, a search tree and a linked list, forming what we call a Tree Search List Queue (TSLQueue).

    Subsequently, we analyse and introduce a model for lock-free concurrent data structure access parallelism. The major impediment to scaling concurrent data structures is memory contention when accessing shared data structure access points, which leads to thread serialisation and hinders parallelism. Aiming to address this challenge, a significant amount of work in the literature has proposed multi-access techniques that improve concurrent data structure parallelism. However, there is little work on analysing and modelling the execution behaviour of concurrent multi-access data structures, especially in a shared-memory setting. In this part of the thesis, we analyse and model the general execution behaviour of concurrent multi-access data structures in the shared-memory setting. We study the behaviour of the two popular random access patterns, shared (remote) and exclusive (local) access, and of the two atomic primitives most commonly used to design lock-free data structures, Compare-and-Swap and Fetch-and-Add.
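    As a toy illustration of trading strict ordering for parallelism by spreading operations over multiple instances of a structure, the sketch below load-balances enqueues and dequeues across several sub-queues. It uses coarse per-sub-queue locks rather than the lock-free CAS/FAA designs studied in the thesis, and all names are illustrative.

        import random
        import threading
        from collections import deque
        from typing import Any, Optional

        class RelaxedQueue:
            """k-out-of-order relaxed FIFO: items may be dequeued slightly out of
            global order because each operation targets one of `width` sub-queues."""

            def __init__(self, width: int = 4):
                self.subqueues = [deque() for _ in range(width)]
                self.locks = [threading.Lock() for _ in range(width)]

            def enqueue(self, item: Any) -> None:
                i = random.randrange(len(self.subqueues))   # pick a random sub-queue
                with self.locks[i]:
                    self.subqueues[i].append(item)

            def dequeue(self) -> Optional[Any]:
                start = random.randrange(len(self.subqueues))
                for k in range(len(self.subqueues)):        # probe sub-queues in turn
                    i = (start + k) % len(self.subqueues)
                    with self.locks[i]:
                        if self.subqueues[i]:
                            return self.subqueues[i].popleft()
                return None                                 # all sub-queues empty

        q = RelaxedQueue(width=4)
        for n in range(10):
            q.enqueue(n)
        print([q.dequeue() for _ in range(10)])   # roughly, but not exactly, FIFO order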

    Staged reads: mitigating the impact of DRAM writes on DRAM reads

    Main memory latencies have always been a concern for system performance. Given that reads are on the critical path for CPU progress, reads must be prioritized over writes. However, writes must eventually be processed, and they often delay pending reads. In fact, a single channel in the main memory system offers almost no parallelism between reads and writes, because a single off-chip memory bus is shared by reads and writes and the direction of the bus has to be explicitly turned around when switching from writes to reads. This is an expensive operation, and its cost is amortized by carrying out a burst of writes or reads every time the bus direction is switched. As a result, no reads can be processed while a memory channel is busy servicing writes. This paper proposes a novel mechanism to boost read-write parallelism and perform useful components of read operations even when the memory system is busy performing writes. If some of the banks are busy servicing writes, we start issuing reads to the other, idle banks. The results of these reads are stored in a few registers near the memory chip's I/O pads and are returned immediately after the bus turnaround. The process is referred to as a Staged Read because it decouples a single read operation into two stages, with the first stage performed in parallel with writes. This innovation can also be viewed as a form of prefetch that is internal to a memory chip. The proposed technique works best when there is bank imbalance in the write stream, so we also introduce a write scheduling algorithm that artificially creates bank imbalance and allows useful read operations to be performed during the write drain. Across a suite of memory-intensive workloads, we show that Staged Reads can boost throughput by up to 33% (average 7%) with an average DRAM access latency improvement of 17%, while incurring a very small cost (0.25%) in terms of memory chip area. The throughput improvements are even greater for write-intensive workloads (average 11%) or future systems (average 12%).
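    The following toy scheduler sketch shows the core idea of overlapping the first stage of reads with a write drain when their banks are idle; the bank bookkeeping, timing, and queue handling are simplified assumptions rather than the paper's actual DRAM model.

        from collections import namedtuple

        Request = namedtuple("Request", "kind bank")   # kind is "R" (read) or "W" (write)

        def drain_writes_with_staged_reads(write_queue, read_queue):
            """During a write drain, start reads on banks that no pending write uses,
            park their data in staged-read registers, and return it after turnaround."""
            write_banks = {w.bank for w in write_queue}
            staged, deferred = [], []
            for r in read_queue:
                # Stage 1: the array access can proceed in parallel with writes
                # as long as the read's bank is not busy with a write.
                (staged if r.bank not in write_banks else deferred).append(r)

            log = [f"write to bank {w.bank}" for w in write_queue]
            log += [f"staged read (array access) on idle bank {r.bank}" for r in staged]
            log.append("-- bus turnaround --")
            # Stage 2: staged data streams out right after turnaround,
            # then the remaining reads are serviced normally.
            log += [f"return staged data from bank {r.bank}" for r in staged]
            log += [f"normal read on bank {r.bank}" for r in deferred]
            return log

        writes = [Request("W", b) for b in (0, 0, 1, 1)]
        reads = [Request("R", b) for b in (2, 5, 1, 0)]
        print("\n".join(drain_writes_with_staged_reads(writes, reads)))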

    Auto-tuning similarity search algorithms on multi-core architectures

    In recent times, large high-dimensional datasets have become ubiquitous. Video and image repositories, financial data, and sensor data are just a few examples of such datasets in practice. Many applications that use such datasets require the retrieval of data items similar to a given query item, or the nearest neighbors (NN or k-NN) of a given item. Another common query is the retrieval of multiple sets of nearest neighbors, i.e., multi-k-NN, for different query items on the same data. With commodity multi-core CPUs becoming more widespread at lower cost, developing parallel algorithms for these search problems has become increasingly important. While the core nearest neighbor search problem is relatively easy to parallelize, it is challenging to tune it for optimality. This is because the various performance-specific algorithmic parameters, or "tuning knobs", are inter-related and also depend on the data and query workloads. In this paper, we present (1) a detailed study of the various tuning knobs and their contributions to increasing query throughput for parallelized versions of the two most common classes of high-dimensional multi-NN search algorithms, linear scan and tree traversal, and (2) an offline auto-tuner for setting these knobs by iteratively measuring actual query execution times for a given workload and dataset. We show experimentally that our auto-tuner reaches near-optimal performance and significantly outperforms un-tuned versions of parallel multi-NN algorithms for real video repository data on a variety of multi-core platforms.
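    A minimal offline auto-tuner in the spirit described above might look like the following exhaustive sweep, which times each knob configuration on a sample query batch and keeps the fastest one. The knob names, candidate values, and the stand-in query function are assumptions for illustration only, not the paper's tuner.

        import itertools
        import time
        from typing import Callable, Dict, Iterable, Tuple

        def _timed(fn: Callable[[Dict[str, int]], None], cfg: Dict[str, int]) -> float:
            start = time.perf_counter()
            fn(cfg)
            return time.perf_counter() - start

        def autotune(run_queries: Callable[[Dict[str, int]], None],
                     knobs: Dict[str, Iterable[int]],
                     repeats: int = 3) -> Tuple[Dict[str, int], float]:
            """Try every combination of knob values, time the query batch,
            and return the configuration with the lowest measured time."""
            best_cfg, best_time = None, float("inf")
            names = list(knobs)
            for values in itertools.product(*(knobs[n] for n in names)):
                cfg = dict(zip(names, values))
                elapsed = min(_timed(run_queries, cfg) for _ in range(repeats))
                if elapsed < best_time:
                    best_cfg, best_time = cfg, elapsed
            return best_cfg, best_time

        # Stand-in for a batch of multi-k-NN queries; a real tuner would run the
        # actual parallel search code with these knob settings.
        def fake_knn_batch(cfg: Dict[str, int]) -> None:
            work = 100_000 // cfg["threads"] + cfg["leaf_size"]
            sum(i * i for i in range(work * cfg["batch"]))

        knob_space = {"threads": [1, 2, 4, 8], "batch": [1, 4, 16], "leaf_size": [32, 128]}
        print(autotune(fake_knn_batch, knob_space))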

    A survey of cross-layer power-reliability tradeoffs in multi and many core systems-on-chip

    As systems-on-chip increase in complexity, the underlying technology presents us with significant challenges due to increased power consumption as well as decreased reliability. Today, designers must consider building systems that achieve the requisite functionality and performance using components that may be unreliable. In order to do so, it is crucial to understand the close interplay between the different layers of a system: technology, platform, and application. This will enable the most general tradeoff exploration, reaping the most benefits in power, performance, and reliability. This paper surveys various cross-layer techniques and approaches to power, performance, and reliability tradeoffs at the technology, circuit, architecture, and application layers.

    Exploiting Adaptive Techniques to Improve Processor Energy Efficiency

    Rapid device miniaturization keeps inducing challenges in building energy-efficient microprocessors. As transistor sizes continuously decrease, more uncertainties emerge in their operation. At the same time, integrating more and more transistors on a single chip accentuates the need to lower the supply voltage. This dissertation investigates one of the primary device uncertainties, timing errors, as a microprocessor performance bottleneck in the near-threshold computing (NTC) era. It then proposes various innovative techniques to exploit these opportunities and maintain processor energy efficiency in the context of the emerging challenges. Evaluated with a cross-layer methodology, the proposed approaches achieve substantial improvements in processor energy efficiency compared to other state-of-the-art techniques.

    Doctor of Philosophy

    The internet-based information infrastructure that has powered the growth of modern personal/mobile computing is composed of powerful, warehouse-scale computers, or datacenters. These heavily subscribed datacenters perform data-processing jobs under intense quality-of-service guarantees. Further, high-performance compute platforms are being used to model and analyze increasingly complex scientific problems and natural phenomena. To ensure that the high-performance needs of these machines are met, it is necessary to increase the efficiency of the memory system that supplies data to the processing cores. Many of the microarchitectural innovations that were designed to scale the memory wall (e.g., out-of-order instruction execution, on-chip caches) are being rendered less effective due to several emerging trends (e.g., increased emphasis on energy consumption, limited access locality). This motivates the optimization of the main memory system itself. The key to an efficient main memory system is the memory controller; in particular, the scheduling algorithm in the memory controller greatly influences its performance. This dissertation explores this hypothesis in several contexts. It develops tools to better understand memory scheduling and develops scheduling innovations for CPUs and GPUs. We propose novel memory scheduling techniques that are strongly aware of the access patterns of the clients as well as the microarchitecture of the memory device. Based on these, we present (i) a Dynamic Random Access Memory (DRAM) chip microarchitecture optimized for reducing write-induced slowdown, (ii) a memory scheduling algorithm that exploits these features, (iii) several memory scheduling algorithms to reduce the memory-related stall experienced by irregular General Purpose Graphics Processing Unit (GPGPU) applications, and (iv) the Utah Simulated Memory Module (USIMM), a detailed, validated simulator for DRAM main memory that we use for analyzing and proposing scheduling algorithms.
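    As a small example of the kind of decision such a memory controller makes, the sketch below implements a first-ready, first-come-first-served (FR-FCFS)-style pick that prefers row-buffer hits, then reads over writes, then older requests. This is a generic textbook policy shown for illustration, not one of the dissertation's proposed schedulers or USIMM code.

        from dataclasses import dataclass
        from typing import Dict, List, Optional

        @dataclass
        class MemRequest:
            arrival: int      # cycle the request entered the queue
            bank: int
            row: int
            is_read: bool

        def pick_next(queue: List[MemRequest],
                      open_rows: Dict[int, Optional[int]]) -> Optional[MemRequest]:
            """FR-FCFS-style arbitration: among queued requests, prefer
            (1) row-buffer hits over misses, (2) reads over writes,
            (3) older requests over younger ones."""
            if not queue:
                return None

            def rank(req: MemRequest):
                row_hit = open_rows.get(req.bank) == req.row
                return (not row_hit, not req.is_read, req.arrival)

            best = min(queue, key=rank)
            queue.remove(best)
            open_rows[best.bank] = best.row   # issuing the request opens its row
            return best

        # Illustrative queue: bank 0 has row 7 open, bank 1 has row 3 open.
        open_rows = {0: 7, 1: 3}
        q = [MemRequest(10, 0, 5, True), MemRequest(12, 1, 3, True),
             MemRequest(11, 0, 7, False)]
        while (req := pick_next(q, open_rows)) is not None:
            kind = "read" if req.is_read else "write"
            print(f"issue {kind} to bank {req.bank}, row {req.row} (arrived {req.arrival})")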

    High-performance and hardware-aware computing: proceedings of the second International Workshop on New Frontiers in High-performance and Hardware-aware Computing (HipHaC'11), San Antonio, Texas, USA, February 2011 (in conjunction with HPCA-17)

    High-performance system architectures are increasingly exploiting heterogeneity. The HipHaC workshop aims at combining new aspects of parallel, heterogeneous, and reconfigurable microprocessor technologies with concepts of high-performance computing and, particularly, numerical solution methods. Compute- and memory-intensive applications can only benefit from the full hardware potential if all features on all levels are taken into account in a holistic approach

    Last-Level Cache Partitioning through Memory Virtual Channels

    Doctoral dissertation -- Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2023. Advisor: Jangwoo Kim. Ensuring fairness or providing isolation between multiple workloads with distinct characteristics that are collocated on a single, shared-memory system is a challenge. Recent multicore processors provide last-level cache (LLC) hardware partitioning to support such isolation, with the cache partitioning often specified by the user. While more LLC capacity often results in higher performance, in this dissertation we identify that a workload allocated more LLC capacity can suffer worse performance in real-machine experiments, which we refer to as MiW (more is worse). Through various controlled experiments, we identify that the other workload, with less LLC capacity, incurs more frequent LLC misses; it stresses the main memory system shared by both workloads and degrades the performance of the former workload even when LLC partitioning is used (a balloon effect). To resolve this problem, we propose virtualizing the data path of the main memory controllers and dedicating memory virtual channels (mVCs) to each group of applications, grouped as for LLC partitioning. mVCs can further fine-tune the performance of groups by differentiating buffer sizes among mVCs. They can reduce total system cost by allowing latency-critical and throughput-oriented workloads to execute together on shared machines whose performance criteria could otherwise be met only on dedicated machines. Experiments on a simulated chip multiprocessor show that our proposals effectively eliminate the MiW phenomenon, hence providing additional opportunities for workload consolidation in a datacenter. Our case study demonstrates a potential saving in machine count of 21.8% with mVC, in a scenario that would otherwise violate a service level objective (SLO).
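    To illustrate what dedicating a memory virtual channel per LLC-partition group could look like at the controller, the sketch below keeps a separate, size-limited request buffer per group and arbitrates between them round-robin, so a miss-heavy group cannot flood a shared queue. The buffer sizes, arbitration policy, and class names are illustrative assumptions, not the dissertation's hardware design.

        from collections import deque
        from typing import Dict, Optional

        class MemoryVirtualChannels:
            """One request buffer per application group: a miss-heavy group can
            only occupy its own buffer slots instead of a shared controller queue."""

            def __init__(self, buffer_sizes: Dict[str, int]):
                self.buffers = {g: deque() for g in buffer_sizes}
                self.capacity = dict(buffer_sizes)
                self._order = list(buffer_sizes)      # round-robin order over groups
                self._next = 0

            def enqueue(self, group: str, request: str) -> bool:
                buf = self.buffers[group]
                if len(buf) >= self.capacity[group]:
                    return False      # this group's channel is full; other groups unaffected
                buf.append(request)
                return True

            def issue(self) -> Optional[str]:
                """Pick the next request, cycling over groups so no group starves."""
                for _ in range(len(self._order)):
                    group = self._order[self._next]
                    self._next = (self._next + 1) % len(self._order)
                    if self.buffers[group]:
                        return self.buffers[group].popleft()
                return None

        # The latency-critical group gets a larger dedicated buffer than the batch group.
        mvc = MemoryVirtualChannels({"latency_critical": 8, "batch": 4})
        for i in range(6):
            mvc.enqueue("batch", f"batch-req-{i}")    # only 4 of these are accepted
        mvc.enqueue("latency_critical", "lc-req-0")
        print([mvc.issue() for _ in range(4)])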