55 research outputs found

    Parallel HEVC Decoding on Multi- and Many-core Architectures : A Power and Performance Analysis

    The Joint Collaborative Team on Video Coding is developing a new standard, High Efficiency Video Coding (HEVC), that aims to reduce the bitrate of H.264/AVC by another 50%. To meet the computational demands of the new standard, in particular at high resolutions and low power budgets, exploiting parallelism is no longer an option but a requirement. HEVC therefore includes several coding tools that allow each picture to be divided into partitions that can be processed in parallel without degrading quality or bitrate. In this paper we adapt one of these approaches, Wavefront Parallel Processing (WPP), and show how it can be implemented on multi- and many-core processors. Our approach, named Overlapped Wavefront (OWF), processes several partitions as well as several pictures in parallel, which has the advantage that the amount of (thread-level) parallelism stays constant during execution. In addition, performance and power results are provided for three platforms: a server Intel CPU with 8 cores, a laptop Intel CPU with 4 cores, and a TILE-Gx36 from Tilera with 36 cores. The results show that our parallel HEVC decoder achieves an average frame rate of 116 fps for 4k resolution on a standard multicore CPU. They also demonstrate that exploiting more parallelism by increasing the number of cores can substantially improve energy efficiency, measured in Joules per frame.
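    A rough illustration of the wavefront dependency that WPP exposes (a sketch only, not the authors' OWF decoder): a coding tree unit (CTU) at (row, col) may start once its left neighbour and the top-right neighbour in the row above are done, so consecutive rows can be decoded by threads that trail each other by two CTUs. The picture size, decode_ctu, and the per-row progress counters below are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical stand-in for the actual CTU decoding work.
static void decode_ctu(int row, int col) {
    std::printf("decoded CTU (%d,%d)\n", row, col);
}

int main() {
    const int rows = 8, cols = 16;  // illustrative picture size in CTUs
    // progress[r] = number of CTUs already decoded in row r.
    std::vector<std::atomic<int>> progress(rows);
    for (auto& p : progress) p.store(0);

    std::vector<std::thread> workers;
    for (int r = 0; r < rows; ++r) {
        workers.emplace_back([&, r] {
            for (int c = 0; c < cols; ++c) {
                // Wavefront dependency: CTU (r, c) needs its left neighbour
                // (implicit, same thread) and the top-right neighbour, i.e.
                // row r-1 must have decoded at least c+2 CTUs.
                if (r > 0) {
                    while (progress[r - 1].load(std::memory_order_acquire) <
                           std::min(c + 2, cols)) {
                        std::this_thread::yield();
                    }
                }
                decode_ctu(r, c);
                progress[r].store(c + 1, std::memory_order_release);
            }
        });
    }
    for (auto& t : workers) t.join();
}
```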

    The ROSACE Case Study: From Simulink Specification to Multi/Many-Core Execution

    This paper presents a complete case study, named ROSACE (Research Open-Source Avionics and Control Engineering), that goes from a baseline flight controller, developed in MATLAB/Simulink, to a multi-periodic controller executing on a multi/many-core target. The interactions between control and computer engineers during the development steps are highlighted, in particular by investigating several multi-periodic configurations. We deduced ways to improve the discussion between engineers in order to ease integration on the target. The whole case study is made available to the community under an open-source license.
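    To make the notion of a multi-periodic configuration concrete, here is a minimal cyclic-executive sketch in C++ with two rates; the 4:1 rate ratio and the task names are purely illustrative and are not the periods used in ROSACE.

```cpp
#include <cstdio>

// Hypothetical controller steps; in a real design these would be the
// Simulink-derived functions (filters, outer-loop controllers, ...).
void fast_task() { /* e.g. inner-loop filtering at the base rate */ }
void slow_task() { /* e.g. outer-loop altitude control at 1/4 the rate */ }

int main() {
    // One hyperperiod of a 4:1 multi-periodic configuration: the fast task
    // runs every base tick, the slow task every fourth tick.
    const int hyperperiod_ticks = 8;
    for (int tick = 0; tick < hyperperiod_ticks; ++tick) {
        fast_task();
        if (tick % 4 == 0) slow_task();
        std::printf("tick %d: fast%s\n", tick, tick % 4 == 0 ? " + slow" : "");
        // A real implementation would block here until the next period,
        // e.g. with clock_nanosleep on an absolute deadline.
    }
}
```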

    High-Performance Communication Primitives and Data Structures on Message-Passing Manycores:Broadcast and Map

    The constant increase in single-core frequency reached a plateau in recent years because the heat produced inside the chip can no longer be removed by existing cooling technologies. An alternative way to harvest more computational power per die is to fabricate more cores on a single chip. Manycore chips with more than a thousand cores are therefore expected by the end of the decade. These environments provide a high level of parallel processing power while their energy consumption is considerably lower than that of their multi-chip counterparts. Although shared-memory programming is the classical paradigm for programming these environments, there are numerous claims that, taking the full life cycle of software into account, the message-passing programming model has numerous advantages. The direct architectural consequence of applying a message-passing programming model is to support message passing between processing entities directly in hardware, and manycore architectures with such support are becoming more and more visible. These platforms can be seen in two ways: (i) as a High Performance Computing (HPC) cluster programmed by highly trained scientists using Message Passing Interface (MPI) libraries; or (ii) as a mainstream computing platform requiring a global operating system to abstract away the architectural complexities from the ordinary programmer. In the first view, the performance of communication primitives is an important bottleneck for MPI applications. In the second view, kernel data structures have been shown to be a limiting factor. In this thesis, (i) we review existing state-of-the-art techniques for circumventing these bottlenecks, and (ii) we study, in two separate chapters, a high-performance broadcast communication primitive and a map data structure on modern manycore architectures with hardware support for message passing. In one chapter, we study how to use the hardware features to implement an efficient broadcast primitive. We consider the Intel Single-chip Cloud Computer (SCC) as our target platform, which offers the ability to move data between on-chip Message Passing Buffers (MPBs) using Remote Memory Access (RMA). We propose OC-Bcast (On-Chip Broadcast), a pipelined k-ary tree algorithm tailored to exploit the parallelism provided by on-chip RMA. Experimental results show that OC-Bcast attains considerably better latency and throughput than state-of-the-art solutions. This improvement highlights the benefit of exploiting the hardware features of the target platform: our broadcast algorithm takes direct advantage of RMA, unlike other broadcast solutions, which are based on a higher-level send/receive interface. In the other chapter, we study the implementation of high-throughput concurrent maps in message-passing manycores. Partitioning and replication are the two approaches to achieving high throughput in a message-passing system. This chapter presents and compares different strongly consistent map algorithms based on partitioning and replication. To assess the performance of these algorithms independently of architecture-specific features, we propose a communication model of message-passing manycores to express the throughput of each algorithm. The model is validated through experiments on a 36-core TILE-Gx8036 processor. Evaluations show that replication outperforms partitioning only in a narrow domain.
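    To make the tree structure behind OC-Bcast concrete, the sketch below (illustrative C++, not the thesis code) only computes which cores each core forwards the message to in a k-ary broadcast tree rooted at core 0; the pipelining over MPBs and RMA that gives OC-Bcast its performance is not shown, and the children() helper and the 3-ary/36-core parameters are assumptions.

```cpp
#include <cstdio>
#include <vector>

// For a k-ary broadcast tree rooted at core 0, node i forwards the
// message to children k*i + 1 ... k*i + k (when those ids exist),
// so the broadcast completes in O(log_k n) forwarding steps.
std::vector<int> children(int node, int k, int num_cores) {
    std::vector<int> c;
    for (int j = 1; j <= k; ++j) {
        int child = k * node + j;
        if (child < num_cores) c.push_back(child);
    }
    return c;
}

int main() {
    const int k = 3, num_cores = 36;  // e.g. a 36-core chip, 3-ary tree
    for (int node = 0; node < num_cores; ++node) {
        std::printf("core %2d ->", node);
        for (int ch : children(node, k, num_cores)) std::printf(" %d", ch);
        std::printf("\n");
    }
}
```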

    Hardware/Software Co-design for Multicore Architectures

    Transferred from Doria.

    Scalable and Distributed Resource Management for Many-Core Systems

    Many-core systems present researchers with important new challenges, including the handling of highly dynamic and hard-to-predict computational loads. The large number of applications and cores causes scalability problems for centrally acting heuristics, which must always maintain a global view of the entire system. Resource management itself can thus become a bottleneck that limits the achievable performance of the system. The focus of this work is to make resource management scalable.

    On the Energy Efficiency and Performance of Irregular Application Executions on Multicore, NUMA and Manycore Platforms

    Until the last decade, the performance of HPC architectures was quantified almost exclusively by their processing power. However, energy efficiency is now considered as important as raw performance and has become a critical aspect of the development of scalable systems. These strict energy constraints guided the development of a new class of so-called lightweight manycore processors. This study evaluates the computing and energy performance of two well-known irregular NP-hard problems, the Traveling Salesman Problem (TSP) and K-Means clustering, and of a numerical seismic wave propagation simulation kernel, Ondes3D, on multicore, NUMA, and manycore platforms. First, we concentrate on the nontrivial task of adapting these applications to a manycore, specifically the novel MPPA-256 manycore processor. Then, we analyze their performance and energy consumption on these different machines. Our results show that applications able to fully use the resources of a manycore can achieve better performance and may consume from 3.8x to 13x less energy compared to low-power and general-purpose multicore processors, respectively.

    Multiprocessor System-on-Chips based Wireless Sensor Network Energy Optimization

    Wireless Sensor Networks (WSNs) are an integral part of the Internet of Things (IoT), used to monitor physical or environmental conditions without human intervention. In WSNs, one of the major challenges is reducing energy consumption at both the sensor-node and network levels. High energy consumption not only increases the carbon footprint but also limits the lifetime (LT) of the network. Network-on-Chip (NoC) based Multiprocessor Systems-on-Chip (MPSoCs) are becoming the de facto computing platform for computationally intensive real-time applications in IoT due to their high performance and exceptional quality of service. In this thesis, a task-scheduling problem is investigated on MPSoC architectures for tasks with precedence and deadline constraints, in order to minimize processing energy consumption while guaranteeing the timing constraints. Moreover, energy-aware node clustering is also performed to reduce the transmission energy consumption of the sensor nodes. Three distinct energy-optimization problems are investigated, as follows. First, contention-aware, energy-efficient static scheduling on NoC-based heterogeneous MPSoCs is performed for real-time tasks with individual deadlines and precedence constraints. An offline, meta-heuristic-based, contention-aware, energy-efficient task scheduler is developed that performs task ordering, mapping, and voltage assignment in an integrated manner. Compared to state-of-the-art schedulers, our proposed algorithm significantly improves energy efficiency. Second, energy-aware scheduling is investigated for a set of tasks with precedence constraints on Voltage-Frequency-Island (VFI) based heterogeneous NoC-MPSoCs. A novel population-based algorithm called ARSH-FATI is developed that can dynamically switch between explorative and exploitative search modes at run-time. ARSH-FATI's performance is superior to that of existing task schedulers developed for homogeneous VFI-NoC-MPSoCs. Third, the transmission energy consumption of the sensor nodes in the WSN is reduced by developing an ARSH-FATI-based Cluster Head Selection (ARSH-FATI-CHS) algorithm integrated with a heuristic called Novel Ranked Based Clustering (NRC). During cluster formation, parameters such as residual energy, distance, and workload on the CHs are considered to improve the LT of the network. The results show that ARSH-FATI-CHS outperforms other state-of-the-art clustering algorithms in terms of LT. University of Derby, Derby, U
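    The cluster-formation criteria named above (residual energy, distance, and workload on candidate cluster heads) can be illustrated with a toy weighted score; this C++ sketch is an assumption for illustration only and is not the ARSH-FATI-CHS/NRC formulation.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical node descriptor and weights.
struct Node {
    double residual_energy;  // J remaining in the battery
    double dist_to_sink;     // metres to the sink/base station
    double workload;         // traffic currently routed through the node
};

// Higher score = better cluster-head candidate: prefer high residual
// energy, short distance to the sink, and light current workload.
double ch_score(const Node& n, double w_e, double w_d, double w_l) {
    return w_e * n.residual_energy - w_d * n.dist_to_sink - w_l * n.workload;
}

int main() {
    std::vector<Node> nodes = {
        {5.0, 120.0, 0.4}, {3.5, 40.0, 0.1}, {4.2, 80.0, 0.9}
    };
    int best = 0;
    for (int i = 1; i < (int)nodes.size(); ++i)
        if (ch_score(nodes[i], 1.0, 0.01, 1.0) >
            ch_score(nodes[best], 1.0, 0.01, 1.0))
            best = i;
    std::printf("elected cluster head: node %d\n", best);
}
```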

    SnuMAP: An Application Trace Profiler for Many-core Systems

    Master's thesis, Graduate School of Seoul National University, Department of Computer Science and Engineering (College of Engineering), August 2018. Advisor: Bernhard Egger. In this thesis, we propose SnuMAP, an open-source modular trace profiler for many-core systems. SnuMAP provides per-application and whole-system views of multiple data points of interest: core allocation, power, and CPU and memory utilization. Additionally, SnuMAP is lightweight, requires no source-code instrumentation, and does not degrade the performance of the target parallel application. Instead, it gathers valuable information and insights for application developers and many-core resource managers. This type of tool continues to gain importance as today's many-core systems co-schedule multiple parallel workloads to increase system utilization. We have put SnuMAP to use in numerous research projects and present in this thesis a snapshot of essential findings enabled by the visualizations and data SnuMAP can provide. This project started as a simple open-source profiler and has grown into the complex analysis tool it is today.
    More information is available at http://csap.snu.ac.kr/software/snumap.
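    SnuMAP's whole-system view includes power; on recent Intel machines such data is commonly obtained from RAPL energy counters. The sketch below is a minimal, hypothetical example (not SnuMAP's implementation) that reads the Linux powercap package-energy counter twice to estimate average package power; the sysfs path and counter availability vary between machines.

```cpp
#include <chrono>
#include <cstdio>
#include <fstream>
#include <thread>

// Read the cumulative package-energy counter (microjoules) exposed by the
// Linux powercap/RAPL driver; returns -1 if it cannot be read.
static long long read_energy_uj() {
    std::ifstream f("/sys/class/powercap/intel-rapl:0/energy_uj");
    long long uj = -1;
    f >> uj;
    return uj;
}

int main() {
    long long before = read_energy_uj();
    std::this_thread::sleep_for(std::chrono::seconds(1));
    long long after = read_energy_uj();
    if (before < 0 || after < 0) {
        std::fprintf(stderr, "RAPL counter not available on this system\n");
        return 1;
    }
    // The counter wraps around periodically; a real tool would handle that.
    double watts = (after - before) / 1e6;  // microjoules over 1 s -> watts
    std::printf("average package power: %.2f W\n", watts);
    return 0;
}
```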

    Efficient Multiprogramming for Multicores with SCAF

    As hardware becomes increasingly parallel and the availability of scalable parallel software improves, the problem of managing multiple multithreaded applications (processes) becomes important. Malleable processes, which can vary the number of threads used as they run, enable sophisticated and flexible resource management. Although many existing applications parallelized for SMPs with parallel runtimes are in fact already malleable, deployed run-time environments provide neither an interface nor a strategy for intelligently allocating hardware threads or even preventing oversubscription. Prior research methods either depend on profiling applications ahead of time in order to make good allocation decisions, or do not account for process efficiency at all, leading to poor performance. None of these prior methods have been adopted widely in practice. This paper presents the Scheduling and Allocation with Feedback (SCAF) system: a drop-in runtime solution that supports existing malleable applications in making intelligent allocation decisions based on observed efficiency, without any changes to semantics, program modification, offline profiling, or even recompilation. Our existing implementation can control most unmodified OpenMP applications. Other malleable threading libraries can also easily be supported with small modifications, without requiring application modification or recompilation. In this work, we present the SCAF daemon and a SCAF-aware port of the GNU OpenMP runtime. We present a new technique for estimating process efficiency purely at runtime using available hardware counters, and demonstrate its effectiveness in aiding allocation decisions. We evaluated SCAF using the NAS NPB parallel benchmarks on five commodity parallel platforms, enumerating architectural features and their effects on our scheme. We measured the benefit of SCAF in terms of improvement in the sum of speedups (a common metric for multiprogrammed environments) when running all benchmark pairs concurrently, compared to equipartitioning, the best existing competing scheme in the literature. If the sum of speedups with SCAF is within 5% of equipartitioning (i.e., the improvement factor in the sum of speedups is between 0.95X and 1.05X), we deem SCAF to break even; less than 0.95X is considered a slowdown, and greater than 1.05X an improvement. We found that SCAF improves on equipartitioning on 4 out of 5 machines, breaking even or improving in 80-89% of pairs and showing a mean improvement of 1.11-1.22X for the benchmark pairs on which it improves, depending on the machine. Since we are not aware of any widely available tool for equipartitioning, we also compare SCAF against multiprogramming using unmodified OpenMP, which is the only environment available to end-users today. SCAF improves on or breaks even with the unmodified OpenMP runtimes on all 5 machines in 72-100% of pairs, with a mean improvement of 1.27-1.7X, depending on the machine.
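    As a toy illustration of feedback-driven allocation (not SCAF's actual algorithm), the sketch below splits a machine's hardware threads in proportion to each process's observed efficiency, so that processes which scale poorly are not handed threads they cannot use; the Proc structure, the efficiency numbers, and the allocate() policy are assumptions.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical per-process feedback: parallel efficiency observed at
// runtime (useful work per hardware thread) at the current allocation.
struct Proc {
    const char* name;
    double efficiency;  // in (0, 1]
};

// Toy policy: split the hardware threads in proportion to observed
// efficiency. A real allocator would redistribute rounding leftovers
// and re-evaluate periodically as efficiencies change.
std::vector<int> allocate(const std::vector<Proc>& procs, int hw_threads) {
    double total = 0;
    for (const auto& p : procs) total += p.efficiency;
    std::vector<int> alloc;
    for (const auto& p : procs) {
        int n = static_cast<int>(hw_threads * p.efficiency / total);
        alloc.push_back(n > 0 ? n : 1);  // every process keeps >= 1 thread
    }
    return alloc;
}

int main() {
    std::vector<Proc> procs = {{"lu.C", 0.9}, {"ep.C", 0.5}};
    auto alloc = allocate(procs, 32);
    for (size_t i = 0; i < procs.size(); ++i)
        std::printf("%s -> %d threads\n", procs[i].name, alloc[i]);
}
```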