1,183 research outputs found

    MULTI-SCALE SCHEDULING TECHNIQUES FOR SIGNAL PROCESSING SYSTEMS

    A variety of hardware platforms for signal processing has emerged, from distributed systems such as Wireless Sensor Networks (WSNs) to parallel systems such as Multicore Programmable Digital Signal Processors (PDSPs), Multicore General Purpose Processors (GPPs), and Graphics Processing Units (GPUs) to heterogeneous combinations of parallel and distributed devices. When a signal processing application is implemented on one of those platforms, the performance critically depends on the scheduling techniques, which in general allocate computation and communication resources for competing processing tasks in the application to optimize performance metrics such as power consumption, throughput, latency, and accuracy. Signal processing systems implemented on such platforms typically involve multiple levels of processing and communication hierarchy, such as network-level, chip-level, and processor-level in a structural context, and application-level, subsystem-level, component-level, and operation- or instruction-level in a behavioral context. In this thesis, we target scheduling issues that carefully address and integrate scheduling considerations at different levels of these structural and behavioral hierarchies. The core contributions of the thesis include the following. Considering both the network-level and chip-level, we have proposed an adaptive scheduling algorithm for wireless sensor networks (WSNs) designed for event detection. Our algorithm exploits discrepancies among the detection accuracy of individual sensors, which are derived from a collaborative training process, to allow each sensor to operate in a more energy efficient manner while the network satisfies given constraints on overall detection accuracy. Considering the chip-level and processor-level, we incorporated both temperature and process variations to develop new scheduling methods for throughput maximization on multicore processors. 
In particular, we studied how to process a large number of threads with high speed and without violating a given maximum temperature constraint. We targeted our methods to multicore processors in which the cores may operate at different frequencies and different levels of leakage. We develop speed selection and thread assignment schedulers based on the notion of a core's steady state temperature. Considering the application-level, component-level and operation-level, we developed a new dataflow based design flow within the targeted dataflow interchange format (TDIF) design tool. Our new multiprocessor system-on-chip (MPSoC)-oriented design flow, called TDIF-PPG, is geared towards analysis and mapping of embedded DSP applications on MPSoCs. An important feature of TDIF-PPG is its capability to integrate graph level parallelism and actor level parallelism into the application mapping process. Here, graph level parallelism is exposed by the dataflow graph application representation in TDIF, and actor level parallelism is modeled by a novel model for multiprocessor dataflow graph implementation that we call the Parallel Processing Group (PPG) model. Building on the contribution above, we formulated a new type of parallel task scheduling problem called Parallel Actor Scheduling (PAS) for chip-level MPSoC mapping of DSP systems that are represented as synchronous dataflow (SDF) graphs. In contrast to traditional SDF-based scheduling techniques, which focus on exploiting graph level (inter-actor) parallelism, the PAS problem targets the integrated exploitation of both intra- and inter-actor parallelism for platforms in which individual actors can be parallelized across multiple processing units. We address a special case of the PAS problem in which all of the actors in the DSP application or subsystem being optimized can be parallelized. 
For this special case, we develop and experimentally evaluate a two-phase scheduling framework with three workflows: particle swarm optimization with a mixed integer programming formulation, particle swarm optimization with a simulated annealing engine, and particle swarm optimization with a fast heuristic based on list scheduling. Then, we extend our scheduling framework to support the general PAS problem, which also considers actors that cannot be parallelized.
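
The abstract names list scheduling as the basis of its fast heuristic but gives no details. As a rough illustration only, the following sketches classic list scheduling of a task graph on identical processors, which exploits inter-actor parallelism alone. It is not the thesis's PAS heuristic: the function name `list_schedule`, the longest-path priority rule, and the input format are all assumptions made for this sketch.

```python
def list_schedule(durations, deps, num_procs):
    """Classic list scheduling sketch (hypothetical helper, not the thesis's method).

    durations: {task: positive execution time}
    deps:      {task: set of predecessor tasks}
    Returns    {task: (processor index, start time, finish time)}
    """
    # Build successor lists from the predecessor sets.
    succs = {t: [] for t in durations}
    for task, preds in deps.items():
        for p in preds:
            succs[p].append(task)

    # Priority = longest path from the task to any sink ("bottom level").
    level = {}

    def bottom_level(t):
        if t not in level:
            level[t] = durations[t] + max(
                (bottom_level(s) for s in succs[t]), default=0
            )
        return level[t]

    # With positive durations, descending bottom level is a topological order.
    order = sorted(durations, key=lambda t: -bottom_level(t))

    proc_free = [0.0] * num_procs  # time at which each processor becomes idle
    finish = {}
    schedule = {}
    for t in order:
        # A task is ready once all of its predecessors have finished.
        ready = max((finish[p] for p in deps.get(t, ())), default=0.0)
        # Choose the processor that allows the earliest start.
        proc = min(range(num_procs), key=lambda i: max(proc_free[i], ready))
        start = max(proc_free[proc], ready)
        finish[t] = start + durations[t]
        proc_free[proc] = finish[t]
        schedule[t] = (proc, start, finish[t])
    return schedule


# Small diamond-shaped graph: A feeds B and C, which both feed D.
durations = {"A": 2, "B": 3, "C": 1, "D": 2}
deps = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
schedule = list_schedule(durations, deps, num_procs=2)
makespan = max(end for _, _, end in schedule.values())
```

A PAS-style scheduler would additionally decide how many processing units each parallelizable actor receives, which this sketch does not attempt.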

    Dynamic Energy Management for Chip Multi-processors under Performance Constraints

    We introduce a novel algorithm for dynamic energy management (DEM) under performance constraints in chip multi-processors (CMPs). Using the novel concept of delayed instructions count, performance loss estimates are calculated at the end of each control period for each core. In addition, a Kalman filtering based approach is employed to predict the workload in the next control period, for which voltage-frequency pairs must be selected. This selection is done with a novel dynamic voltage and frequency scaling (DVFS) algorithm whose objective is to reduce energy consumption without degrading performance beyond the user-set threshold. Using our customized Sniper based CMP system simulation framework, we demonstrate the effectiveness of the proposed algorithm on a variety of benchmarks for 16-core and 64-core network-on-chip based CMP architectures. Simulation results show consistent energy savings across the board. We present our work as an investigation of the tradeoff between the energy reduction achievable via DVFS and the performance penalty threshold when predictions are made using a Kalman filter.
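
To make the predict-then-select loop concrete, here is a minimal sketch under stated assumptions. It uses a scalar Kalman filter with a random-walk workload model and picks the lowest voltage-frequency pair whose capacity covers the predicted workload within a tolerated slowdown. The class `ScalarKalman`, the table `VF_PAIRS`, the function `select_vf`, the noise variances, and the slowdown rule are all hypothetical; the abstract does not specify the paper's filter model or its DVFS rule.

```python
from dataclasses import dataclass


@dataclass
class ScalarKalman:
    """1-D Kalman filter with a random-walk state model (an assumption)."""
    x: float = 0.0   # current workload estimate
    p: float = 1.0   # variance of the estimate
    q: float = 0.01  # process noise variance (illustrative value)
    r: float = 0.1   # measurement noise variance (illustrative value)

    def step(self, z: float) -> float:
        """Fold in measurement z; return the prediction for the next period."""
        # Predict: a random walk keeps the mean and inflates the variance.
        p_pred = self.p + self.q
        # Update: blend prediction and measurement using the Kalman gain.
        gain = p_pred / (p_pred + self.r)
        self.x = self.x + gain * (z - self.x)
        self.p = (1.0 - gain) * p_pred
        # Under a random walk, the next-period prediction equals the estimate.
        return self.x


# Hypothetical VF table: (voltage in V, frequency in GHz), ascending.
VF_PAIRS = [(0.8, 1.0), (0.9, 1.5), (1.0, 2.0), (1.1, 2.5)]


def select_vf(predicted_cycles: float, period_s: float, max_loss: float):
    """Return the lowest VF pair whose estimated slowdown stays within max_loss."""
    for voltage, freq_ghz in VF_PAIRS:
        capacity = freq_ghz * 1e9 * period_s  # cycles available in one period
        if predicted_cycles <= capacity * (1.0 + max_loss):
            return voltage, freq_ghz
    return VF_PAIRS[-1]  # nothing fits: fall back to the fastest pair


# One core, 1 ms control period, 5% tolerated performance loss.
kf = ScalarKalman()
for measured_cycles in [1.4e6] * 50:
    predicted = kf.step(measured_cycles)
chosen = select_vf(predicted, period_s=1e-3, max_loss=0.05)
```

In a CMP setting, one filter instance per core would be stepped at the end of each control period, and the chosen pair applied for the next period.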

    Investigation of LSTM Based Prediction for Dynamic Energy Management in Chip Multiprocessors

    In this paper, we investigate the effectiveness of using long short-term memory (LSTM) instead of Kalman filtering to do prediction for the purpose of constructing dynamic energy management (DEM) algorithms in chip multi-processors (CMPs). Either of the two prediction methods is employed to estimate the workload in the next control period for each of the processor cores. These estimates are then used to select voltage-frequency (VF) pairs for each core of the CMP during the next control period as part of a dynamic voltage and frequency scaling (DVFS) technique. The objective of the DVFS technique is to reduce energy consumption under performance constraints that are set by the user. We conduct our investigation using a custom Sniper system simulation framework. Simulation results for 16 and 64 core network-on-chip based CMP architectures and using several benchmarks demonstrate that the LSTM is slightly better than Kalman filtering

    A Power-Efficient Methodology for Mapping Applications on Multi-Processor System-on-Chip Architectures

    This work introduces an application mapping methodology and case study for multi-processor on-chip architectures. Starting from the description of an application in standard sequential code (e.g. in C), first the application is profiled, parallelized when possible, then its components are moved to hardware implementation when necessary to satisfy performance and power constraints. After mapping, with the use of hardware objects to handle concurrency, the application power consumption can be further optimized by a task-based scheduler for the remaining software part, without the need for operating system support. The key contributions of this work are: a methodology for high-level hardware/software partitioning that allows the designer to use the same code for both hardware and software models for simulation, providing nevertheless preliminary estimations for timing and power consumption; and a task-based scheduling algorithm that does not require operating system support. The methodology has been applied to the co-exploration of an industrial case study: an MPEG4 VGA real-time encoder

    Towards Optimal Application Mapping for Energy-Efficient Many-Core Platforms

    Transferred from Doria

    On-Chip Network Design: Mapping, Management, and Routing

    Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, 2016. 2. Kiyoung Choi.

    For decades, advances in semiconductor technology have led us to the era of many-core systems. Today's desktop computers already have multi-core processors, and chips with more than a hundred cores are commercially available. As a communication medium for such a large number of cores, network-on-chip (NoC) has emerged and is now used by many researchers and companies. Adopting NoC for a many-core system raises many problems, and this thesis tries to solve some of them. The second chapter of this thesis is on mapping and scheduling of tasks on NoC-based CMP architectures. Although many papers on NoC mapping have been published, our work shows that selecting the communication type, shared memory or message passing, can help improve performance and energy efficiency. Additionally, our framework supports scheduling applications containing backward dependencies with the help of modified modulo scheduling. Evolving SoCs through 3D stacking raises a number of new problems, and the thermal problem caused by increased power density is one of them. In the third chapter of this thesis, we try to mitigate the hotspot problem using DVFS techniques. Assuming that all routers as well as cores can control their voltage and frequency individually, we find voltage-frequency pairs for all cores and routers that yield the best performance within the given thermal constraint. The fourth and fifth chapters of this thesis address a different aspect. In 3D stacking, inter-layer interconnections are implemented using through-silicon vias (TSVs). TSVs usually take much more area than normal wires. Furthermore, they consume silicon area as well as metal area. For this reason, designers would want to limit the number of TSVs used in their network. To limit the TSV count, there are two options: the first is to reduce the width of each vertical link, and the other is to use fewer vertical links, which results in a partially connected network. We present one routing methodology for each case. For the network with reduced-bandwidth vertical links, we propose using deflection routing to mitigate the long latency of vertical links. By balancing the vertical traffic properly, the algorithm provides improved latency. Also, a large reduction in area and energy can be obtained by removing router buffers. For partially connected networks, we introduce a set of routing rules for selecting the vertical links. At the expense of some routing freedom, the proposed algorithm removes the virtual channel requirement for avoiding deadlock. As a result, either performance can be improved or energy consumption reduced, at the designer's choice.

    Contents:
    Chapter 1 Introduction: 1.1 Task Mapping and Scheduling; 1.2 Thermal Management; 1.3 Routing for 3D Networks
    Chapter 2 Mapping and Scheduling: 2.1 Introduction; 2.2 Motivation; 2.3 Background; 2.4 Related Work; 2.5 Platform Description (2.5.1 Architecture Description, 2.5.2 Energy Model, 2.5.3 Communication Delay Model); 2.6 Problem Formulation; 2.7 Proposed Solution (2.7.1 Task and Communication Mapping, 2.7.2 Communication Type Optimization, 2.7.3 Design Space Pruning via Pre-evaluation, 2.7.4 Scheduling); 2.8 Experimental Results (2.8.1 Experiments with Coarse-grained Iterative Modulo Scheduling, 2.8.2 Comparison with Different Mapping Algorithms, 2.8.3 Experiments with Overall Algorithms, 2.8.4 Experiments with Various Local Memory Sizes, 2.8.5 Experiments with Various Placements of Shared Memory)
    Chapter 3 Thermal Management: 3.1 Introduction; 3.2 Background (3.2.1 Thermal Modeling, 3.2.2 Heterogeneity in Thermal Propagation); 3.3 Motivation and Problem Definition; 3.4 Related Work; 3.5 Orchestrated Voltage-Frequency Assignment (3.5.1 Individual PI Control Method, 3.5.2 PI Controlled Weighted-Power Budgeting, 3.5.3 Performance/Power Estimation, 3.5.4 Frequency Assignment, 3.5.5 Algorithm Overview, 3.5.6 Stability Conditions for PI Controller); 3.6 Experimental Result (3.6.1 Experimental Setup, 3.6.2 Overall Algorithm Performance, 3.6.3 Accuracy of the Estimation Model, 3.6.4 Performance of the Frequency Assignment Algorithm)
    Chapter 4 Routing for Limited Bandwidth 3D NoC: 4.1 Introduction; 4.2 Motivation; 4.3 Background; 4.4 Related Work; 4.5 3D Deflection Routing (4.5.1 Serialized TSV Model, 4.5.2 TSV Link Injection/ejection Scheme, 4.5.3 Deadlock Avoidance, 4.5.4 Livelock Avoidance, 4.5.5 Router Architecture: Putting It All Together, 4.5.6 System Level Consideration); 4.6 Experimental Results (4.6.1 Experimental Setup, 4.6.2 Results on Synthetic Traffic Patterns, 4.6.3 Results on Realistic Traffic Patterns, 4.6.4 Results on Real Application Benchmarks, 4.6.5 Fairness Issue, 4.6.6 Area Cost Comparison)
    Chapter 5 Routing for Partially Connected 3D NoC: 5.1 Introduction; 5.2 Background; 5.3 Related Work; 5.4 Proposed Algorithm (5.4.1 Preliminary, 5.4.2 Routing Algorithm for 3-D Stacked Meshes with Regular Partial Vertical Connections, 5.4.3 Routing Algorithm for 3-D Stacked Meshes with Irregular Partial Vertical Connections, 5.4.4 Extension to Heterogeneous Mesh Layers); 5.5 Experimental Results (5.5.1 Experimental Setup, 5.5.2 Experiments on Synthetic Traffics, 5.5.3 Experiments on Application Benchmarks, 5.5.4 Comparison with Reduced Bandwidth Mesh)
    Chapter 6 Conclusion
    Bibliography
    Abstract (in Korean)
    Docto