55 research outputs found

    Parallel HEVC Decoding on Multi- and Many-core Architectures : A Power and Performance Analysis

    The Joint Collaborative Team on Video Coding is developing a new standard, High Efficiency Video Coding (HEVC), that aims to reduce the bitrate of H.264/AVC by another 50%. To meet the computational demands of the new standard, in particular at high resolutions and low power budgets, exploiting parallelism is no longer an option but a requirement. HEVC therefore includes several coding tools that allow each picture to be divided into partitions that can be processed in parallel without degrading quality or bitrate. In this paper we adapt one of these approaches, Wavefront Parallel Processing (WPP), and show how it can be implemented on multi- and many-core processors. Our approach, named Overlapped Wavefront (OWF), processes several partitions as well as several pictures in parallel, which has the advantage that the amount of (thread-level) parallelism stays constant during execution. In addition, performance and power results are provided for three platforms: a server Intel CPU with 8 cores, a laptop Intel CPU with 4 cores, and a TILE-Gx36 from Tilera with 36 cores. The results show that our parallel HEVC decoder achieves an average frame rate of 116 fps for 4k resolution on a standard multicore CPU. They also demonstrate that exploiting more parallelism by increasing the number of cores can substantially improve energy efficiency, measured in Joules per frame.
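    A rough illustration of the wavefront dependency that WPP exposes (a sketch only, not the authors' OWF decoder): a coding tree unit (CTU) at (row, col) may start once its left neighbour and the top-right neighbour in the row above are done, so consecutive rows can be decoded by threads that trail each other by two CTUs. The picture size, decode_ctu, and the per-row progress counters below are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical stand-in for the actual CTU decoding work.
static void decode_ctu(int row, int col) {
    std::printf("decoded CTU (%d,%d)\n", row, col);
}

int main() {
    const int rows = 8, cols = 16;  // illustrative picture size in CTUs
    // progress[r] = number of CTUs already decoded in row r.
    std::vector<std::atomic<int>> progress(rows);
    for (auto& p : progress) p.store(0);

    std::vector<std::thread> workers;
    for (int r = 0; r < rows; ++r) {
        workers.emplace_back([&, r] {
            for (int c = 0; c < cols; ++c) {
                // Wavefront dependency: CTU (r, c) needs its left neighbour
                // (implicit, same thread) and the top-right neighbour, i.e.
                // row r-1 must have decoded at least c+2 CTUs.
                if (r > 0) {
                    while (progress[r - 1].load(std::memory_order_acquire) <
                           std::min(c + 2, cols)) {
                        std::this_thread::yield();
                    }
                }
                decode_ctu(r, c);
                progress[r].store(c + 1, std::memory_order_release);
            }
        });
    }
    for (auto& t : workers) t.join();
}
```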

    The ROSACE Case Study: From Simulink Specification to Multi/Many-Core Execution

    This paper presents a complete case study, named ROSACE (Research Open-Source Avionics and Control Engineering), that goes from a baseline flight controller, developed in MATLAB/Simulink, to a multi-periodic controller executing on a multi/many-core target. The interactions between control and computer engineers during the development steps are highlighted, in particular by investigating several multi-periodic configurations. We deduced ways to improve the discussion between engineers in order to ease integration on the target. The whole case study is made available to the community under an open-source license.
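    To make the notion of a multi-periodic configuration concrete, here is a minimal cyclic-executive sketch in C++ with two rates; the 4:1 rate ratio and the task names are purely illustrative and are not the periods used in ROSACE.

```cpp
#include <cstdio>

// Hypothetical controller steps; in a real design these would be the
// Simulink-derived functions (filters, outer-loop controllers, ...).
void fast_task() { /* e.g. inner-loop filtering at the base rate */ }
void slow_task() { /* e.g. outer-loop altitude control at 1/4 the rate */ }

int main() {
    // One hyperperiod of a 4:1 multi-periodic configuration: the fast task
    // runs every base tick, the slow task every fourth tick.
    const int hyperperiod_ticks = 8;
    for (int tick = 0; tick < hyperperiod_ticks; ++tick) {
        fast_task();
        if (tick % 4 == 0) slow_task();
        std::printf("tick %d: fast%s\n", tick, tick % 4 == 0 ? " + slow" : "");
        // A real implementation would block here until the next period,
        // e.g. with clock_nanosleep on an absolute deadline.
    }
}
```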

    High-Performance Communication Primitives and Data Structures on Message-Passing Manycores:Broadcast and Map

    The constant increase in single-core frequency reached a plateau in recent years because the heat produced inside the chip can no longer be removed by existing cooling technologies. An alternative way to harvest more computational power per die is to fabricate more cores on a single chip. Manycore chips with more than a thousand cores are therefore expected by the end of the decade. These environments provide a high level of parallel processing power while their energy consumption is considerably lower than that of their multi-chip counterparts. Although shared-memory programming is the classical paradigm for programming these environments, there are numerous claims that, taking the full life cycle of software into account, the message-passing programming model has numerous advantages. The direct architectural consequence of applying a message-passing programming model is to support message passing between processing entities directly in hardware, and manycore architectures with such support are becoming more and more visible. These platforms can be seen in two ways: (i) as a High Performance Computing (HPC) cluster programmed by highly trained scientists using Message Passing Interface (MPI) libraries; or (ii) as a mainstream computing platform requiring a global operating system to abstract away the architectural complexities from the ordinary programmer. In the first view, the performance of communication primitives is an important bottleneck for MPI applications. In the second view, kernel data structures have been shown to be a limiting factor. In this thesis, (i) we review existing state-of-the-art techniques for circumventing these bottlenecks, and (ii) we study, in two separate chapters, a high-performance broadcast communication primitive and a map data structure on modern manycore architectures with hardware support for message passing. In one chapter, we study how to use the hardware features to implement an efficient broadcast primitive. We consider the Intel Single-chip Cloud Computer (SCC) as our target platform, which offers the ability to move data between on-chip Message Passing Buffers (MPBs) using Remote Memory Access (RMA). We propose OC-Bcast (On-Chip Broadcast), a pipelined k-ary tree algorithm tailored to exploit the parallelism provided by on-chip RMA. Experimental results show that OC-Bcast attains considerably better latency and throughput than state-of-the-art solutions. This improvement highlights the benefit of exploiting the hardware features of the target platform: our broadcast algorithm takes direct advantage of RMA, unlike other broadcast solutions, which are based on a higher-level send/receive interface. In the other chapter, we study the implementation of high-throughput concurrent maps in message-passing manycores. Partitioning and replication are the two approaches to achieving high throughput in a message-passing system. This chapter presents and compares different strongly consistent map algorithms based on partitioning and replication. To assess the performance of these algorithms independently of architecture-specific features, we propose a communication model of message-passing manycores to express the throughput of each algorithm. The model is validated through experiments on a 36-core TILE-Gx8036 processor. Evaluations show that replication outperforms partitioning only in a narrow domain.
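    To make the tree structure behind OC-Bcast concrete, the sketch below (illustrative C++, not the thesis code) only computes which cores each core forwards the message to in a k-ary broadcast tree rooted at core 0; the pipelining over MPBs and RMA that gives OC-Bcast its performance is not shown, and the children() helper and the 3-ary/36-core parameters are assumptions.

```cpp
#include <cstdio>
#include <vector>

// For a k-ary broadcast tree rooted at core 0, node i forwards the
// message to children k*i + 1 ... k*i + k (when those ids exist),
// so the broadcast completes in O(log_k n) forwarding steps.
std::vector<int> children(int node, int k, int num_cores) {
    std::vector<int> c;
    for (int j = 1; j <= k; ++j) {
        int child = k * node + j;
        if (child < num_cores) c.push_back(child);
    }
    return c;
}

int main() {
    const int k = 3, num_cores = 36;  // e.g. a 36-core chip, 3-ary tree
    for (int node = 0; node < num_cores; ++node) {
        std::printf("core %2d ->", node);
        for (int ch : children(node, k, num_cores)) std::printf(" %d", ch);
        std::printf("\n");
    }
}
```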

    Hardware/Software Co-design for Multicore Architectures

    Transferred from Doria.

    Scalable and Distributed Resource Management for Many-Core Systems

    Many-core systems present researchers with important new challenges, including the handling of highly dynamic and hard-to-predict computational loads. The large number of applications and cores causes scalability problems for centrally acting heuristics, which must always maintain a global view of the entire system. Resource management itself can thus become a bottleneck that limits the achievable performance of the system. The focus of this work is to make resource management scalable.

    On the Energy Efficiency and Performance of Irregular Application Executions on Multicore, NUMA and Manycore Platforms

    Until the last decade, the performance of HPC architectures was quantified almost exclusively by their processing power. However, energy efficiency is now considered as important as raw performance and has become a critical aspect of the development of scalable systems. These strict energy constraints guided the development of a new class of so-called lightweight manycore processors. This study evaluates the computing and energy performance of two well-known irregular NP-hard problems, the Traveling Salesman Problem (TSP) and K-Means clustering, and of a numerical seismic wave propagation simulation kernel, Ondes3D, on multicore, NUMA, and manycore platforms. First, we concentrate on the nontrivial task of adapting these applications to a manycore, specifically the novel MPPA-256 manycore processor. Then, we analyze their performance and energy consumption on these different machines. Our results show that applications able to fully use the resources of a manycore can achieve better performance and may consume from 3.8x to 13x less energy compared to low-power and general-purpose multicore processors, respectively.

    Multiprocessor System-on-Chips based Wireless Sensor Network Energy Optimization

    Wireless Sensor Networks (WSNs) are an integral part of the Internet of Things (IoT), used to monitor physical or environmental conditions without human intervention. In WSNs, one of the major challenges is reducing energy consumption at both the sensor-node and network levels. High energy consumption not only increases the carbon footprint but also limits the lifetime (LT) of the network. Network-on-Chip (NoC) based Multiprocessor Systems-on-Chip (MPSoCs) are becoming the de facto computing platform for computationally intensive real-time applications in IoT due to their high performance and exceptional quality of service. In this thesis, a task-scheduling problem is investigated on MPSoC architectures for tasks with precedence and deadline constraints, in order to minimize processing energy consumption while guaranteeing the timing constraints. Moreover, energy-aware node clustering is also performed to reduce the transmission energy consumption of the sensor nodes. Three distinct energy-optimization problems are investigated, as follows. First, contention-aware, energy-efficient static scheduling on NoC-based heterogeneous MPSoCs is performed for real-time tasks with individual deadlines and precedence constraints. An offline, meta-heuristic-based, contention-aware, energy-efficient task scheduler is developed that performs task ordering, mapping, and voltage assignment in an integrated manner. Compared to state-of-the-art schedulers, our proposed algorithm significantly improves energy efficiency. Second, energy-aware scheduling is investigated for a set of tasks with precedence constraints on Voltage-Frequency-Island (VFI) based heterogeneous NoC-MPSoCs. A novel population-based algorithm called ARSH-FATI is developed that can dynamically switch between explorative and exploitative search modes at run-time. ARSH-FATI's performance is superior to that of existing task schedulers developed for homogeneous VFI-NoC-MPSoCs. Third, the transmission energy consumption of the sensor nodes in the WSN is reduced by developing an ARSH-FATI-based Cluster Head Selection (ARSH-FATI-CHS) algorithm integrated with a heuristic called Novel Ranked Based Clustering (NRC). During cluster formation, parameters such as residual energy, distance, and workload on the CHs are considered to improve the LT of the network. The results show that ARSH-FATI-CHS outperforms other state-of-the-art clustering algorithms in terms of LT. University of Derby, Derby, U
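    The cluster-formation criteria named above (residual energy, distance, and workload on candidate cluster heads) can be illustrated with a toy weighted score; this C++ sketch is an assumption for illustration only and is not the ARSH-FATI-CHS/NRC formulation.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical node descriptor and weights.
struct Node {
    double residual_energy;  // J remaining in the battery
    double dist_to_sink;     // metres to the sink/base station
    double workload;         // traffic currently routed through the node
};

// Higher score = better cluster-head candidate: prefer high residual
// energy, short distance to the sink, and light current workload.
double ch_score(const Node& n, double w_e, double w_d, double w_l) {
    return w_e * n.residual_energy - w_d * n.dist_to_sink - w_l * n.workload;
}

int main() {
    std::vector<Node> nodes = {
        {5.0, 120.0, 0.4}, {3.5, 40.0, 0.1}, {4.2, 80.0, 0.9}
    };
    int best = 0;
    for (int i = 1; i < (int)nodes.size(); ++i)
        if (ch_score(nodes[i], 1.0, 0.01, 1.0) >
            ch_score(nodes[best], 1.0, 0.01, 1.0))
            best = i;
    std::printf("elected cluster head: node %d\n", best);
}
```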

    SnuMAP: An Application Trace Profiler for Many-core Systems

    Master's thesis, Graduate School of Seoul National University, Department of Computer Science and Engineering (College of Engineering), August 2018. Advisor: Bernhard Egger. In this thesis, we propose SnuMAP, an open-source modular trace profiler for many-core systems. SnuMAP provides per-application and whole-system views of multiple data points of interest: core allocation, power, and CPU and memory utilization. Additionally, SnuMAP is lightweight, requires no source-code instrumentation, and does not degrade the performance of the target parallel application. Instead, it gathers valuable information and insights for application developers and many-core resource managers. This type of tool continues to gain importance as today's many-core systems co-schedule multiple parallel workloads to increase system utilization. We have put SnuMAP to use in numerous research projects and present in this thesis a snapshot of essential findings enabled by the visualizations and data SnuMAP can provide. This project started as a simple open-source profiler and has grown into the complex analysis tool it is today.
    More information is available at http://csap.snu.ac.kr/software/snumap.
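    SnuMAP's whole-system view includes power; on recent Intel machines such data is commonly obtained from RAPL energy counters. The sketch below is a minimal, hypothetical example (not SnuMAP's implementation) that reads the Linux powercap package-energy counter twice to estimate average package power; the sysfs path and counter availability vary between machines.

```cpp
#include <chrono>
#include <cstdio>
#include <fstream>
#include <thread>

// Read the cumulative package-energy counter (microjoules) exposed by the
// Linux powercap/RAPL driver; returns -1 if it cannot be read.
static long long read_energy_uj() {
    std::ifstream f("/sys/class/powercap/intel-rapl:0/energy_uj");
    long long uj = -1;
    f >> uj;
    return uj;
}

int main() {
    long long before = read_energy_uj();
    std::this_thread::sleep_for(std::chrono::seconds(1));
    long long after = read_energy_uj();
    if (before < 0 || after < 0) {
        std::fprintf(stderr, "RAPL counter not available on this system\n");
        return 1;
    }
    // The counter wraps around periodically; a real tool would handle that.
    double watts = (after - before) / 1e6;  // microjoules over 1 s -> watts
    std::printf("average package power: %.2f W\n", watts);
    return 0;
}
```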

    Efficient Multiprogramming for Multicores with SCAF

    As hardware becomes increasingly parallel and the availability of scalable parallel software improves, the problem of managing multiple multithreaded applications (processes) becomes important. Malleable processes, which can vary the number of threads used as they run, enable sophisticated and flexible resource management. Although many existing applications parallelized for SMPs with parallel runtimes are in fact already malleable, deployed run-time environments provide neither an interface nor a strategy for intelligently allocating hardware threads or even preventing oversubscription. Prior research methods either depend on profiling applications ahead of time in order to make good allocation decisions, or do not account for process efficiency at all, leading to poor performance. None of these prior methods have been adopted widely in practice. This paper presents the Scheduling and Allocation with Feedback (SCAF) system: a drop-in runtime solution that supports existing malleable applications in making intelligent allocation decisions based on observed efficiency, without any changes to semantics, program modification, offline profiling, or even recompilation. Our existing implementation can control most unmodified OpenMP applications. Other malleable threading libraries can also easily be supported with small modifications, without requiring application modification or recompilation. In this work, we present the SCAF daemon and a SCAF-aware port of the GNU OpenMP runtime. We present a new technique for estimating process efficiency purely at runtime using available hardware counters, and demonstrate its effectiveness in aiding allocation decisions. We evaluated SCAF using the NAS NPB parallel benchmarks on five commodity parallel platforms, enumerating architectural features and their effects on our scheme. We measured the benefit of SCAF in terms of improvement in the sum of speedups (a common metric for multiprogrammed environments) when running all benchmark pairs concurrently, compared to equipartitioning, the best existing competing scheme in the literature. If the sum of speedups with SCAF is within 5% of equipartitioning (i.e., the improvement factor in the sum of speedups is between 0.95X and 1.05X), we deem SCAF to break even; less than 0.95X is considered a slowdown, and greater than 1.05X an improvement. We found that SCAF improves on equipartitioning on 4 out of 5 machines, breaking even or improving in 80-89% of pairs and showing a mean improvement of 1.11-1.22X for the benchmark pairs on which it improves, depending on the machine. Since we are not aware of any widely available tool for equipartitioning, we also compare SCAF against multiprogramming using unmodified OpenMP, which is the only environment available to end-users today. SCAF improves on or breaks even with the unmodified OpenMP runtimes on all 5 machines in 72-100% of pairs, with a mean improvement of 1.27-1.7X, depending on the machine.
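    As a toy illustration of feedback-driven allocation (not SCAF's actual algorithm), the sketch below splits a machine's hardware threads in proportion to each process's observed efficiency, so that processes which scale poorly are not handed threads they cannot use; the Proc structure, the efficiency numbers, and the allocate() policy are assumptions.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical per-process feedback: parallel efficiency observed at
// runtime (useful work per hardware thread) at the current allocation.
struct Proc {
    const char* name;
    double efficiency;  // in (0, 1]
};

// Toy policy: split the hardware threads in proportion to observed
// efficiency. A real allocator would redistribute rounding leftovers
// and re-evaluate periodically as efficiencies change.
std::vector<int> allocate(const std::vector<Proc>& procs, int hw_threads) {
    double total = 0;
    for (const auto& p : procs) total += p.efficiency;
    std::vector<int> alloc;
    for (const auto& p : procs) {
        int n = static_cast<int>(hw_threads * p.efficiency / total);
        alloc.push_back(n > 0 ? n : 1);  // every process keeps >= 1 thread
    }
    return alloc;
}

int main() {
    std::vector<Proc> procs = {{"lu.C", 0.9}, {"ep.C", 0.5}};
    auto alloc = allocate(procs, 32);
    for (size_t i = 0; i < procs.size(); ++i)
        std::printf("%s -> %d threads\n", procs[i].name, alloc[i]);
}
```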