Search CORE

990 research outputs found

Optimisation of computational fluid dynamics applications on multicore and manycore architectures

Author: Hadade Ioan
Publication venue: Mechanical Engineering, Imperial College London
Publication date: 01/05/2018
Field of study

This thesis presents a number of optimisations used for mapping the underlying computational patterns of finite volume CFD applications onto the architectural features of modern multicore and manycore processors. Their effectiveness and impact is demonstrated in a block-structured and an unstructured code of representative size to industrial applications and across a variety of processor architectures that make up contemporary high-performance computing systems. The importance of vectorization and the ways through which this can be achieved is demonstrated in both structured and unstructured solvers together with the impact that the underlying data layout can have on performance. The utility of auto-tuning for ensuring performance portability across multiple architectures is demonstrated and used for selecting optimal parameters such as prefetch distances for software prefetching or tile sizes for strip mining/loop tiling. On the manycore architectures, running more than one thread per physical core is found to be crucial for good performance on processors with in-order core designs but not required on out-of-order architectures. For architectures with high-bandwidth memory packages, their exploitation, whether explicitly or implicitly, is shown to be imperative for best performance. The implementation of all of these optimisations led to application speed-ups ranging between 2.7X and 3X on the multicore CPUs and 5.7X to 24X on the manycore processors.Open Acces

Spiral - Imperial College Digital Repository

Exploiting memory allocations in clusterized many-core architectures

Author: Abdoulaye Gamatie (7214183)
Anastasiia Butko (7214180)
Gilles Sassatelli (7214186)
Ost Copello-Ost (4910230)
Rafael Garibotti (7214177)
Ricardo Reis (181591)
Publication venue
Publication date: 22/01/2019
Field of study

Power-efficient architectures have become the most important feature required for future embedded systems. Modern designs, like those released on mobile devices, reveal that clusterization is the way to improve energy efficiency. However, such architectures are still limited by the memory subsystem (i.e., memory latency problems). This work investigates an alternative approach that exploits on-chip data locality to a large extent, through distributed shared memory systems that permit efficient reuse of on-chip mapped data in clusterized many-core architectures. First, this work reviews the current literature on memory allocations and explore the limitations of cluster-based many-core architectures. Then, several memory allocations are introduced and benchmarked scalability, performance and energy-wise, compared to the conventional centralized shared memory solution to reveal which memory allocation is the most appropriate for future mobile architectures. Our results show that distributed shared memory allocations bring performance gains and opportunities to reduce energy consumption

Loughborough University Institutional Repository

Medium access control in wireless network-on-chip: a context analysis

Author: Abadal Cavallé Sergi
Alarcón Cot Eduardo José
Cabellos Aparicio Alberto
Mestres Sugrañes Albert
Torrellas Josep
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2018
Field of study

© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Wireless on-chip communication is a promising candidate to address the performance and efficiency issues that arise when scaling current NoC techniques to manycore processors. A WNoC can serve global and broadcast traffic with ultra-low latency even in thousand-core chips, thus acting as a natural complement to conventional and throughput-oriented wireline NoCs. However, the development of MAC strategies needed to efficiently share the wireless medium among the increasing number of cores remains a considerable challenge given the singularities of the environment and the novelty of the research area. In this position article, we present a context analysis describing the physical constraints, performance objectives, and traffic characteristics of the on-chip communication paradigm. We summarize the main differences with respect to traditional wireless scenarios, and then discuss their implications on the design of MAC protocols for manycore WNoC, with the ultimate goal of kickstarting this arguably unexplored research area.Peer ReviewedPostprint (author's final draft

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

STT-RAM을 이용한 에너지 효율적인 캐시 설계 기술

Author: 김남형
Publication venue: 서울대학교 대학원
Publication date: 01/02/2019
Field of study

학위논문 (박사)-- 서울대학교 대학원 : 공과대학 전기·컴퓨터공학부, 2019. 2. 최기영.지난 수십 년간 '메모리 벽' 문제를 해결하기 위해 온 칩 캐시의 크기는 꾸준히 증가해왔다. 하지만 지금까지 캐시에 주로 사용되어 온 메모리 기술인 SRAM은 낮은 집적도와 높은 대기 전력 소모로 인해 큰 캐시를 구성하는 데에는 적합하지 않다. 이러한 SRAM의 단점을 보완하기 위해 더 높은 집적도와 낮은 대기 전력을 소모하는 새로운 메모리 기술인 STT-RAM으로 SRAM을 대체하는 것이 제안되었다. 하지만 STT-RAM은 데이터를 쓸 때 많은 에너지와 시간을 소비하기 때문에 단순히 SRAM을 STT-RAM으로 대체하는 것은 오히려 캐시 에너지 소비를 증가시킨다. 이러한 문제를 해결하기 위해 본 논문에서는 STT-RAM을 이용한 에너지 효율적인 캐시 설계 기술들을 제안한다. 첫 번째, 배타적 캐시 계층 구조에서 STT-RAM을 활용하는 방법을 제안하였다. 배타적 캐시 계층 구조는 계층 간에 중복된 데이터가 없기 때문에 포함적 캐시 계층 구조와 비교하여 더 큰 유효 용량을 갖지만, 배타적 캐시 계층 구조에서는 상위 레벨 캐시에서 내보내진 모든 데이터를 하위 레벨 캐시에 써야 하므로 더 많은 양의 데이터를 쓰게 된다. 이러한 배타적 캐시 계층 구조의 특성은 쓰기 특성이 단점인 STT-RAM을 함께 활용하는 것을 어렵게 한다. 이를 해결하기 위해 본 논문에서는 재사용 거리 예측을 기반으로 하는 SRAM/STT-RAM 하이브리드 캐시 구조를 설계하였다. 두 번째, 비휘발성 STT-RAM을 이용해 캐시를 설계할 때 고려해야 할 점들에 대해 분석하였다. STT-RAM의 비효율적인 쓰기 동작을 줄이기 위해 다양한 해결법들이 제안되었다. 그중 한 가지는 STT-RAM 소자가 데이터를 유지하는 시간을 줄여 (휘발성 STT-RAM) 쓰기 특성을 향상하는 방법이다. STT-RAM에 저장된 데이터를 잃는 것은 확률적으로 발생하기 때문에 저장된 데이터를 안정적으로 유지하기 위해서는 오류 정정 부호(ECC)를 이용해 주기적으로 오류를 정정해주어야 한다. 본 논문에서는 STT-RAM 모델을 이용하여 휘발성 STT-RAM 설계 요소들에 대해 분석하였고 실험을 통해 해당 설계 요소들이 캐시 에너지와 성능에 주는 영향을 보여주었다. 마지막으로, 매니코어 시스템에서의 분산 하이브리드 캐시 구조를 설계하였다. 단순히 기존의 하이브리드 캐시와 분산캐시를 결합하면 하이브리드 캐시의 효율성에 큰 영향을 주는 SRAM 활용도가 낮아진다. 따라서 기존의 하이브리드 캐시 구조에서의 에너지 감소를 기대할 수 없다. 본 논문에서는 분산 하이브리드 캐시 구조에서 SRAM 활용도를 높일 수 있는 두 가지 최적화 기술인 뱅크-내부 최적화와 뱅크간 최적화 기술을 제안하였다. 뱅크-내부 최적화는 highly-associative 캐시를 활용하여 뱅크 내부에서 쓰기 동작이 많은 데이터를 분산시키는 것이고 뱅크간 최적화는 서로 다른 캐시 뱅크에 쓰기 동작이 많은 데이터를 고르게 분산시키는 최적화 방법이다.Over the last decade, the capacity of on-chip cache is continuously increased to mitigate the memory wall problem. However, SRAM, which is a dominant memory technology for caches, is not suitable for such a large cache because of its low density and large static power. One way to mitigate these downsides of the SRAM cache is replacing SRAM with a more efficient memory technology. Spin-Transfer Torque RAM (STT-RAM), one of the emerging memory technology, is a promising candidate for the alternative of SRAM. As a substitute of SRAM, STT-RAM can compensate drawbacks of SRAM with its non-volatility and small cell size. However, STT-RAM has poor write characteristics such as high write energy and long write latency and thus simply replacing SRAM to STT-RAM increases cache energy. To overcome those poor write characteristics of STT-RAM, this dissertation explores three different design techniques for energy-efficient cache using STT-RAM. The first part of the dissertation focuses on combining STT-RAM with exclusive cache hierarchy. Exclusive caches are known to provide higher effective cache capacity than inclusive caches by removing duplicated copies of cache blocks across hierarchies. However, in exclusive cache hierarchies, every block evicted from the upper-level cache is written back to the last-level cache regardless of its dirtiness thereby incurring extra write overhead. This makes it challenging to use STT-RAM for exclusive last-level caches due to its high write energy and long write latency. To mitigate this problem, we design an SRAM/STT-RAM hybrid cache architecture based on reuse distance prediction. The second part of the dissertation explores trade-offs in the design of volatile STT-RAM cache. Due to the inefficient write operation of STT-RAM, various solutions have been proposed to tackle this inefficiency. One of the proposed solutions is redesigning STT-RAM cell for better write characteristics at the cost of shortened retention time (i.e., volatile STT-RAM). Since the retention failure of STT-RAM has a stochastic property, an extra overhead of periodic scrubbing with error correcting code (ECC) is required to tolerate the failure. With an analysis based on analytic STT-RAM model, we have conducted extensive experiments on various volatile STT-RAM cache design parameters including scrubbing period, ECC strength, and target failure rate. The experimental results show the impact of the parameter variations on last-level cache energy and performance and provide a guideline for designing a volatile STT-RAM with ECC and scrubbing. The last part of the dissertation proposes Benzene, an energy-efficient distributed SRAM/STT-RAM hybrid cache architecture for manycore systems running multiple applications. It is based on the observation that a naive application of hybrid cache techniques to distributed caches in a manycore architecture suffers from limited energy reduction due to uneven utilization of scarce SRAM. We propose two-level optimization techniques: intra-bank and inter-bank. Intra-bank optimization leverages highly-associative cache design, achieving more uniform distribution of writes within a bank. Inter-bank optimization evenly balances the amount of write-intensive data across the banks.Abstract i Contents iii List of Figures vii List of Tables xi Chapter 1 Introduction 1 1.1 Exclusive Last-Level Hybrid Cache 2 1.2 Designing Volatile STT-RAM Cache 4 1.3 Distributed Hybrid Cache 5 Chapter 2 Background 9 2.1 STT-RAM 9 2.1.1 Thermal Stability 10 2.1.2 Read and Write Operation of STT-RAM 11 2.1.3 Failures of STT-RAM 11 2.1.4 Volatile STT-RAM 13 2.1.5 Related Work 14 2.2 Exclusive Last-Level Hybrid Cache 18 2.2.1 Cache Hierarchies 18 2.2.2 Related Work 19 2.3 Distributed Hybrid Cache 21 2.3.1 Prediction Hybrid Cache 21 2.3.2 Distributed Cache Partitioning 22 2.3.3 Related Work 23 Chapter 3 Exclusive Last-Level Hybrid Cache 27 3.1 Motivation 27 3.1.1 Exclusive Cache Hierarchy 27 3.1.2 Reuse Distance 29 3.2 Architecture 30 3.2.1 Reuse Distance Predictor 30 3.2.2 Hybrid Cache Architecture 32 3.3 Evaluation 34 3.3.1 Methodology 34 3.3.2 LLC Energy Consumption 35 3.3.3 Main Memory Energy Consumption 38 3.3.4 Performance 39 3.3.5 Area Overhead 39 3.4 Summary 39 Chapter 4 Designing Volatile STT-RAM Cache 41 4.1 Analysis 41 4.1.1 Retention Failure of a Volatile STT-RAM Cell 41 4.1.2 Memory Array Design 43 4.2 Evaluation 45 4.2.1 Methodology 45 4.2.2 Last-Level Cache Energy 46 4.2.3 Performance 51 4.3 Summary 52 Chapter 5 Distributed Hybrid Cache 55 5.1 Motivation 55 5.2 Architecture 58 5.2.1 Intra-Bank Optimization 59 5.2.2 Inter-Bank Optimization 63 5.2.3 Other Optimizations 67 5.3 Evaluation Methodology 69 5.4 Evaluation Results 73 5.4.1 Energy Consumption and Performance 73 5.4.2 Analysis of Intra-bank Optimization 76 5.4.3 Analysis of Inter-bank Optimization 78 5.4.4 Impact of Inter-Bank Optimization on Network Energy 79 5.4.5 Sensitivity Analysis 80 5.4.6 Implementation Overhead 81 5.5 Summary 82 Chapter 6 Conculsion 85 Bibliography 88 초록 101Docto

SNU Open Repository and Archive

A Survey on Hardware-aware and Heterogeneous Computing on Multicore Processors and Accelerators

Author: Buchty Rainer
Heuveline Vincent
Karl Wolfgang
Weiß Jan-Philipp
Publication venue: Karlsruher Institut für Technologie
Publication date: 01/01/2009
Field of study

KITopen

Many-core and heterogeneous architectures: programming models and compilation toolchains

Author
Publication venue: Politecnico di Torino
Publication date: 30/10/2020
Field of study

1noL'abstract è presente nell'allegato / the abstract is in the attachmentopen677. INGEGNERIA INFORMATInopartially_openembargoed_20211002Barchi, Francesc

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Exceeding Conservative Limits: A Consolidated Analysis on Modern Hardware Margins

Author: Chatzidimitriou Athanasios
Gizopoulos Dimitris
Kestelman Adrian Cristal
Leng Jingwen
Papadimitriou George
Reddi Vijay Janapa
Salami Behzad
Unsal Osman S.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/06/2020
Field of study

Modern large-scale computing systems (data centers, supercomputers, cloud and edge setups and high-end cyber-physical systems) employ heterogeneous architectures that consist of multicore CPUs, general-purpose many-core GPUs, and programmable FPGAs. The effective utilization of these architectures poses several challenges, among which a primary one is power consumption. Voltage reduction is one of the most efficient methods to reduce power consumption of a chip. With the galloping adoption of hardware accelerators (i.e., GPUs and FPGAs) in large datacenters and other large-scale computing infrastructures, a comprehensive evaluation of the safe voltage reduction levels for each different chip can be employed for efficient reduction of the total power. We present a survey of recent studies in voltage margins reduction at the system level for modern CPUs, GPUs and FPGAs. The pessimistic voltage guardbands inserted by the silicon vendors can be exploited in all devices for significant power savings. On average, voltage reduction can reach 12% in multicore CPUs, 20% in manycore GPUs and 39% in FPGAs.Comment: Accepted for publication in IEEE Transactions on Device and Materials Reliabilit

arXiv.org e-Print Archive

UPCommons. Portal del coneixement obert de la UPC