3,576 research outputs found

    Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review

    Technological evolution has significantly increased the number of transistors for a given die area and raised switching speeds from a few MHz to the GHz range. This decline in feature size, coupled with the boost in performance, demands shrinking of the supply voltage and effective power dissipation in chips with millions of transistors. It has triggered a substantial amount of research into power reduction techniques for almost every aspect of the chip, and particularly for the processor cores contained in it. This paper presents an overview of techniques for achieving power efficiency mainly at the processor core level, but also visits related domains such as buses and memories. Various processor parameters and features, such as supply voltage, clock frequency, caches and pipelining, can be optimized to reduce the power consumption of the processor. This paper discusses various ways in which these parameters can be optimized. Emerging power-efficient processor architectures are also overviewed, and research activities are discussed that should help the reader identify how these factors in a processor contribute to power consumption. Some of these concepts are already established, whereas others are still active research areas. © 2009 ACADEMY PUBLISHER
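The supply-voltage and clock-frequency knobs the survey discusses act on the standard dynamic-power relation P = α·C·V²·f. A minimal sketch with illustrative numbers (not figures from the paper) showing why scaling voltage and frequency together yields a cubic power reduction:

```python
def dynamic_power(alpha, c_eff, vdd, freq):
    """Dynamic switching power: P = alpha * C_eff * Vdd^2 * f."""
    return alpha * c_eff * vdd ** 2 * freq

# Hypothetical core: activity factor 0.2, 1 nF effective switched
# capacitance, 1.0 V supply, 1 GHz clock.
base = dynamic_power(0.2, 1e-9, 1.0, 1e9)

# DVFS: dropping to 0.8 V and 0.8 GHz scales power by 0.8^3 = 0.512.
scaled = dynamic_power(0.2, 1e-9, 0.8, 0.8e9)
ratio = scaled / base  # about 0.512
```

Frequency alone scales power linearly; it is the V² term that makes combined voltage/frequency scaling so effective.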

    EARLY PERFORMANCE PREDICTION METHODOLOGY FOR MANY-CORES ON CHIP BASED APPLICATIONS

    Modern high-performance computing applications such as personal computing, gaming, and numerical simulations require application-specific integrated circuits (ASICs) that comprise many cores. Performance for these applications depends mainly on the latency of the interconnects that transfer data between the cores, which implement applications by distributing tasks. Time-to-market is a critical consideration when designing ASICs for these applications. Therefore, to reduce design cycle time, predicting system performance accurately at an early stage of design is essential. With process technology in the nanometer era, physical phenomena such as crosstalk and reflection on the propagating signal have a direct impact on performance. Incorporating these effects provides a better performance estimate at an early stage. This work presents a methodology for better performance prediction at an early stage of design, achieved by mapping a system specification to a circuit-level netlist description. At the system level, SystemVerilog descriptions are employed to simplify the description and enable efficient simulation. For modeling system performance at this abstraction, queueing-theory-based bounded queue models are applied. At the circuit level, behavioral Input/Output Buffer Information Specification (IBIS) models can be used for analyzing the effects of these physical phenomena on on-chip signal integrity and hence on performance. For behavioral circuit-level performance simulation with IBIS models, a netlist consisting of interacting cores and a communication link must be described. Two new netlists, IBIS-ISS and IBIS-AMI-ISS, are introduced for this purpose. The cores are represented by a macromodel automatically generated from IBIS models by a developed tool. The generated IBIS models are employed in the new netlists. The early performance prediction methodology maps a system specification to an instance of these netlists to provide a better performance estimate at an early stage of design. The methodology is scalable in nanometer process technology and can be reused in different designs.
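The bounded-queue modeling mentioned above can be illustrated with the textbook M/M/1/K formulas, which give blocking probability and mean delay for a communication link with a finite buffer. This is a generic sketch of the queueing-theory idea, not the thesis's actual SystemVerilog models; the arrival and service rates are invented:

```python
def mm1k_metrics(lam, mu, k):
    """Blocking probability, mean occupancy, and mean delay (Little's
    law) for an M/M/1/K bounded queue modeling a communication link."""
    rho = lam / mu
    # Steady-state probabilities p_n = p0 * rho^n for n = 0..K.
    weights = [rho ** n for n in range(k + 1)]
    p0 = 1.0 / sum(weights)
    probs = [p0 * x for x in weights]
    p_block = probs[k]                  # arrivals dropped when buffer is full
    mean_q = sum(n * pn for n, pn in enumerate(probs))
    lam_eff = lam * (1.0 - p_block)     # accepted traffic only
    mean_delay = mean_q / lam_eff       # Little's law: W = L / lambda_eff
    return p_block, mean_q, mean_delay

# Hypothetical link: offered load 0.8 packets/cycle, service rate 1.0,
# buffer bound K = 8.
pb, occupancy, delay = mm1k_metrics(0.8, 1.0, 8)
```

The bounded buffer is what distinguishes this from a plain M/M/1 model: it yields a finite blocking probability, which maps naturally onto back-pressure and dropped transfers on an on-chip link.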

    Circuit design and analysis for on-FPGA communication systems

    On-chip communication has emerged as a prominently important subject in Very-Large-Scale-Integration (VLSI) design, as the trend of technology scaling favours logic more than interconnects. Interconnects often dictate the system performance, and, therefore, research into new methodologies and system architectures that deliver high-performance communication services across the chip is mandatory. The interconnect challenge is exacerbated in the Field-Programmable Gate Array (FPGA), a type of ASIC whose hardware can be programmed post-fabrication. Communication across an FPGA will deteriorate as a result of interconnect scaling. The programmable fabrics, switches and the specific routing architecture also introduce additional latency and bandwidth degradation, further hindering intra-chip communication performance. Past research efforts mainly focused on optimizing logic elements and functional units in FPGAs; communication over the programmable interconnect received little attention and is inadequately understood. This thesis is among the first to research on-chip communication systems built on top of programmable fabrics, and it proposes methodologies to maximize interconnect throughput performance. There are three major contributions in this thesis: (i) an analysis of on-chip interconnect fringing, which degrades the bandwidth of communication channels due to routing congestion in reconfigurable architectures; (ii) a new analogue wave signalling scheme that significantly improves interconnect throughput by exploiting the fundamental electrical characteristics of the reconfigurable interconnect structures, and which can potentially mitigate the interconnect scaling challenges; and (iii) a novel Dynamic Programming (DP)-network that provides adaptive routing in network-on-chip (NoC) systems. The DP-network architecture performs runtime optimization for route planning and dynamic routing, which effectively utilizes the in-silicon bandwidth. This thesis explores a new horizon in reconfigurable system design, in which new methodologies and concepts are proposed to enhance on-FPGA communication throughput performance, which is of vital importance in new technology processes.
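The dynamic-programming flavour of the DP-network can be sketched as a Bellman-style value iteration over a mesh: each node repeatedly relaxes its cost-to-destination from its neighbours' values until no update fires, and the minimizing neighbour at each node is its next hop. This is a generic illustration of DP-based route planning on a toy 3x3 mesh with unit link costs, not the thesis's hardware architecture:

```python
def dp_route_costs(width, height, dest, link_cost):
    """Cost-to-destination table for a 2D mesh computed by Bellman
    relaxation (value iteration), the core idea behind DP routing."""
    INF = float("inf")
    cost = {(x, y): INF for x in range(width) for y in range(height)}
    cost[dest] = 0.0
    changed = True
    while changed:  # converges on a finite mesh with positive costs
        changed = False
        for x in range(width):
            for y in range(height):
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if 0 <= nx < width and 0 <= ny < height:
                        c = link_cost((x, y), (nx, ny)) + cost[(nx, ny)]
                        if c < cost[(x, y)]:
                            cost[(x, y)] = c
                            changed = True
    return cost

# Toy 3x3 mesh, unit-cost links, destination switch at (2, 2):
costs = dp_route_costs(3, 3, (2, 2), lambda a, b: 1.0)
# With unit costs the table reduces to Manhattan distance.
```

In a hardware DP-network the same relaxation runs distributed and continuously, so link costs can reflect live congestion and the next-hop choices adapt at runtime.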

    Timing Analysis of Interconnects and Prediction of Design Rule Violations for Deep Sub-Micron Circuit Design

    Ph.D. dissertation -- Seoul National University, Department of Electrical and Computer Engineering, February 2021. Advisor: Taewhan Kim.
    Timing analysis and clearing design rule violations are essential steps before mask tape-out for semiconductor chip fabrication. However, they keep getting harder in deep sub-micron circuits because the variations of transistors and interconnects have been increasing and design rules have become more complex. This dissertation addresses two problems, timing analysis and design rule violations, in synthesizing deep sub-micron circuits.
    Firstly, timing analysis at process corners cannot capture post-Si performance accurately because the slowest path at a process corner is not always the slowest one in the post-Si instances. In addition, the proportion of interconnect delay in the critical path on a chip is increasing and exceeds 20% in sub-10nm technologies, which means that to capture post-Si performance accurately, the representative critical path circuit should reflect not only FEOL (front-end-of-line) but also BEOL (back-end-of-line) variations. Since the number of BEOL metal layers exceeds ten, and the layers exhibit variation in resistance and capacitance intermixed with resistance variation on the vias between them, a very high-dimensional design space exploration is necessary to synthesize a representative critical path circuit that can provide an accurate performance prediction. To cope with this, I propose a BEOL-aware methodology for synthesizing a representative critical path circuit, which is able to incrementally explore, starting from an initial path circuit on the post-Si target circuit, routing patterns (i.e., BEOL reconfiguring) as well as gate resizing on the path circuit. Precisely, the synthesis framework for critical path circuits integrates a set of novel techniques: (1) extracting and classifying BEOL configurations to lighten design space complexity, (2) formulating BEOL random variables for fast and accurate timing analysis, and (3) exploring alternative (ring oscillator) circuit structures to extend the applicability of this work. Secondly, the complexity of design rules has been increasing, resulting in more design rule violations during routing. In addition, the size of standard cells keeps decreasing, which makes routing harder. In the conventional P&R flow, the routability of a pre-routed layout is predicted from the routing congestion obtained by global routing, and placement is then optimized not to cause design rule violations. But this turned out to be inaccurate in advanced technology nodes, so it is necessary to predict routability with more features. I propose a methodology for predicting the hotspots of design rule violations (DRVs) using machine learning with placement-related features and the conventional routing congestion, and for perturbing placed cells to reduce the number of DRVs. Precisely, the hotspots are predicted by a pre-trained binary classification model, and placement perturbation is performed by global optimization methods to minimize the number of DRVs predicted by a pre-trained regression model. To do this, the framework comprises three techniques: (1) dividing the circuit layout into multiple rectangular grids and extracting features such as pin density, cell density, global routing results (demand, capacity and overflow), and more in the placement phase; (2) predicting whether each grid has DRVs using a binary classification model; and (3) perturbing the placed standard cells in the hotspots to minimize the number of DRVs predicted by a regression model.
    Contents:
    1 Introduction
     1.1 Representative Critical Path Circuit
     1.2 Prediction of Design Rule Violations and Placement Perturbation
     1.3 Contributions of This Dissertation
    2 Methodology for Synthesizing Representative Critical Path Circuits reflecting BEOL Timing Variation
     2.1 Motivation
     2.2 Definitions and Overall Flow
     2.3 Techniques for BEOL-Aware RCP Generation
      2.3.1 Clustering BEOL Configurations
      2.3.2 Formulating Statistical BEOL Random Variables
      2.3.3 Delay Modeling
      2.3.4 Exploring Ring Oscillator Circuit Structures
     2.4 Experimental Results
     2.5 Further Study on Variations
    3 Methodology for Reducing Routing Failures through Enhanced Prediction on Design Rule Violations in Placement
     3.1 Motivation
     3.2 Overall Flow
     3.3 Techniques for Reducing Routing Failures
      3.3.1 Binary Classification
      3.3.2 Regression
      3.3.3 Optimization
      3.3.4 Placement Perturbation
     3.4 Experiments
      3.4.1 Experiment Setup
      3.4.2 Hotspot Prediction
      3.4.3 Regression
      3.4.4 Placement Perturbation
    4 Conclusions
     4.1 Synthesis of Representative Critical Path Circuits reflecting BEOL Timing Variation
     4.2 Reduction of Routing Failures through Enhanced Prediction on Design Rule Violations in Placement
    Abstract (In Korean)
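The grid-based hotspot prediction in the second part can be sketched as follows. The layout, features, and threshold are invented for illustration, and a fixed linear score stands in for the dissertation's pre-trained binary classification model:

```python
def extract_features(cells, pins, overflow, n):
    """Per-grid features used for DRV hotspot prediction: cell count,
    pin count, and global-routing overflow (toy versions of the
    placement-phase features described in the abstract)."""
    grids = [(x, y) for x in range(n) for y in range(n)]
    return {g: (cells.get(g, 0), pins.get(g, 0), overflow.get(g, 0.0))
            for g in grids}

def predict_hotspots(feats, weights=(0.02, 0.05, 1.0), bias=-1.0):
    """Stand-in linear classifier: a grid is flagged as a DRV hotspot
    when its weighted feature score exceeds 0. The weights here are
    made up; the real model is trained on routed layouts."""
    hot = set()
    for g, (cell_cnt, pin_cnt, ovf) in feats.items():
        score = weights[0] * cell_cnt + weights[1] * pin_cnt \
            + weights[2] * ovf + bias
        if score > 0:
            hot.add(g)
    return hot

# Toy 4x4 layout with one congested grid at (1, 2):
cells = {(1, 2): 40}
pins = {(1, 2): 15}
overflow = {(1, 2): 0.6}
hot = predict_hotspots(extract_features(cells, pins, overflow, 4))
# Only the congested grid crosses the decision threshold.
```

The placement-perturbation step would then move cells out of the flagged grids and re-score, iterating under a global optimizer until the predicted DRV count stops improving.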

    Robust and Traffic Aware Medium Access Control Mechanisms for Energy-Efficient mm-Wave Wireless Network-on-Chip Architectures

    To cater to performance/watt needs, processors with multiple processing cores on the same chip have become the de-facto design choice. In such multicore systems, the Network-on-Chip (NoC) serves as a communication infrastructure for data transfer among the cores on the chip. However, conventional metallic-interconnect-based NoCs are constrained by their long multi-hop latencies and high power consumption, limiting the performance gain in these systems. Among the different alternatives, owing to its CMOS compatibility and energy efficiency, a low-latency wireless interconnect operating in the millimeter-wave (mm-wave) band is the nearest-term solution to this multi-hop communication problem. This has led to the recent exploration of mm-wave wireless technologies in wireless NoC architectures (WiNoCs). To realize the mm-wave wireless interconnect in a WiNoC, a wireless interface (WI) equipped with an on-chip antenna and a transceiver circuit operating in the 60 GHz frequency range is integrated into the ports of some NoC switches. The WIs are also equipped with a medium access control (MAC) mechanism that ensures collision-free and energy-efficient communication among the WIs located at different parts of the chip. However, due to shrinking feature sizes and complex integration in CMOS technology, high-density chips like multicore systems are prone to manufacturing defects and dynamic faults during chip operation. Such failures can result in permanently broken wireless links or cause the MAC to malfunction in a WiNoC. Consequently, the energy-efficient communication through the wireless medium will be compromised. Furthermore, the energy efficiency of wireless channel access also depends on the traffic patterns of the applications running on the multicore system. Due to the bursty and self-similar nature of NoC traffic patterns, the traffic demand of the WIs can vary both spatially and temporally. Ineffective management of such traffic variation among the WIs limits the performance and energy benefits of the novel mm-wave interconnect technology. Hence, to utilize the full potential of this technology in WiNoCs, the design of a simple, fair, robust, and efficient MAC is of paramount importance. The main goal of this dissertation is to propose design principles for robust and traffic-aware MAC mechanisms that provide high-bandwidth, low-latency, and energy-efficient data communication in mm-wave WiNoCs. The proposed solution has two parts. In the first part, we propose a cross-layer design methodology for a robust WiNoC architecture that can minimize the effect of permanent failures of the wireless links and recover from transient failures caused by single-event upsets (SEUs). Then, in the second part, we present a traffic-aware MAC mechanism that can adjust the transmission slots of the WIs based on their traffic demand. The proposed MAC is also robust against failure of the wireless access mechanism. Finally, as a future research direction, this idea of traffic awareness is extended throughout the whole NoC by enabling adaptiveness in both the wired and wireless interconnection fabric.
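The demand-proportional slot adjustment described above can be sketched as a superframe allocation. The demand figures, the superframe length, and the minimum-one-slot (starvation-freedom) guarantee are illustrative assumptions, not the dissertation's exact MAC policy:

```python
def allocate_slots(demands, total_slots):
    """Split a MAC superframe's transmission slots among wireless
    interfaces (WIs) in proportion to traffic demand, guaranteeing
    each WI at least one slot. Assumes total_slots >= number of WIs."""
    total = sum(demands.values())
    slots = {wi: max(1, round(total_slots * d / total))
             for wi, d in demands.items()}
    # Rounding can over- or under-shoot; adjust the highest-demand
    # WIs first until the allocation sums exactly to the superframe.
    order = sorted(demands, key=demands.get, reverse=True)
    i = 0
    while sum(slots.values()) != total_slots:
        wi = order[i % len(order)]
        if sum(slots.values()) > total_slots and slots[wi] > 1:
            slots[wi] -= 1
        elif sum(slots.values()) < total_slots:
            slots[wi] += 1
        i += 1
    return slots

# Four WIs with uneven, bursty demand sharing a 16-slot superframe:
alloc = allocate_slots({"WI0": 50, "WI1": 10, "WI2": 30, "WI3": 10}, 16)
```

Re-running this allocation every superframe with freshly measured demand is what makes the scheme traffic-aware, while the floor of one slot keeps lightly loaded WIs reachable for control traffic.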

    SYSTEM-ON-A-CHIP (SOC)-BASED HARDWARE ACCELERATION FOR HUMAN ACTION RECOGNITION WITH CORE COMPONENTS

    Today, the implementation of machine vision algorithms on embedded platforms or in portable systems is growing rapidly due to the demand for machine vision in daily human life. Among the applications of machine vision, human action and activity recognition has become an active research area, and market demand for integrated smart security systems is growing rapidly. Among the available approaches, embedded vision is in the top tier; however, current embedded platforms may not be able to fully exploit the potential performance of machine vision algorithms, especially in terms of low power consumption. Complex algorithms can impose immense computation and communication demands, especially action recognition algorithms, which require various stages of preprocessing, processing and machine learning blocks that need to operate concurrently. The market demands embedded platforms that operate with a power consumption of only a few watts. Attempts have been made to improve the performance of traditional embedded approaches by adding more powerful processors; this solution may solve the computation problem but increases the power consumption. System-on-a-chip field-programmable gate arrays (SoC-FPGAs) have emerged as a major architectural approach for improving power efficiency while increasing computational performance. In a SoC-FPGA, an embedded processor and an FPGA serving as an accelerator are fabricated in the same die to simultaneously improve power consumption and performance. Still, current SoC-FPGA-based vision implementations either shy away from supporting complex and adaptive vision algorithms or operate at very limited resolutions due to the immense communication and computation demands. The aim of this research is to develop a SoC-based hardware acceleration workflow for the realization of advanced vision algorithms. Hardware acceleration can improve performance for highly complex mathematical calculations or repeated functions. The performance of a SoC system can thus be improved by using a hardware acceleration method to accelerate the element that incurs the highest performance overhead. The outcome of this research could be used for the implementation of various vision algorithms, such as face recognition, object detection or object tracking, on embedded platforms. The contributions of SoC-based hardware acceleration for hardware-software codesign platforms include the following: (1) development of frameworks for complex human action recognition in both 2D and 3D; (2) realization of a framework with four main implemented IPs, namely, foreground and background subtraction (foreground probability), human detection, 2D/3D point-of-interest detection and feature extraction, and OS-ELM as a machine learning algorithm for action identification; (3) use of an FPGA-based hardware acceleration method to resolve system bottlenecks and improve system performance; and (4) measurement and analysis of system specifications, such as the acceleration factor, power consumption, and resource utilization. Experimental results show that the proposed SoC-based hardware acceleration approach provides better performance in terms of the acceleration factor, resource utilization and power consumption than all recent works. In addition, a comparison of the accuracy of the framework running on the proposed embedded platform (SoC-FPGA) with the accuracy of other PC-based frameworks shows that the proposed approach outperforms most other approaches.
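The payoff from accelerating "the element that incurs the highest performance overhead" follows Amdahl's law; a quick sketch with invented profiling numbers (not measurements from this work):

```python
def amdahl_speedup(accel_fraction, accel_factor):
    """Overall speedup when a fraction of total runtime is offloaded
    to hardware that runs it accel_factor times faster."""
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_factor)

# Suppose profiling shows feature extraction takes 80% of runtime and
# an FPGA IP runs it 10x faster (hypothetical figures):
big_win = amdahl_speedup(0.8, 10.0)    # about 3.57x overall

# Accelerating a 20% component by the same 10x yields only about 1.22x:
small_win = amdahl_speedup(0.2, 10.0)
```

This is why the workflow targets the dominant bottleneck first: the same silicon effort applied to a minor component barely moves end-to-end performance.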

    An Artificial Neural Networks based Temperature Prediction Framework for Network-on-Chip based Multicore Platform

    Continuous improvement in silicon process technologies has made possible the integration of hundreds of cores on a single chip. However, power and heat have become dominant constraints in designing these massive multicore chips, causing issues with reliability, timing variations and reduced chip lifetime. Dynamic Thermal Management (DTM) is a solution to avoid high temperatures on the die. Typical DTM schemes only address core-level thermal issues. However, the Network-on-Chip (NoC) paradigm, which has emerged as an enabling methodology for integrating hundreds to thousands of cores on the same die, can contribute significantly to the thermal issues. Moreover, typical DTM is triggered reactively based on temperature measurements from on-chip thermal sensors, requiring long reaction times, whereas a predictive DTM method estimates future temperature in advance, eliminating the chance of temperature overshoot. Artificial Neural Networks (ANNs) have been used in various domains for modeling and prediction with high accuracy due to their ability to learn and adapt. This thesis concentrates on designing an ANN prediction engine to predict the thermal profile of the cores and the Network-on-Chip elements of the chip. This thermal profile of the chip is then used by the predictive DTM, which combines both core-level and network-level DTM techniques. An on-chip wireless interconnect, which has recently been envisioned to enable energy-efficient data exchange between cores in a multicore environment, is used to provide a broadcast-capable medium to efficiently distribute the thermal control messages that trigger and manage the DTM schemes.
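As a toy stand-in for the thesis's ANN prediction engine, a single linear neuron trained by gradient descent can learn to extrapolate a core's next temperature sample from recent temperature deltas. The synthetic ramp trace and the one-step delta window are assumptions for illustration, not the thesis's actual network or data:

```python
def train_delta_predictor(temps, lr=0.1, epochs=200):
    """Fit next_delta ~ w * last_delta + b by gradient descent on a
    temperature trace; a one-neuron stand-in for the ANN engine."""
    deltas = [b - a for a, b in zip(temps, temps[1:])]
    w = bias = 0.0
    for _ in range(epochs):
        for x, y in zip(deltas, deltas[1:]):
            err = (w * x + bias) - y
            w -= lr * err * x       # squared-error gradient step
            bias -= lr * err
    return w, bias

def predict_next(temps, w, bias):
    """Forecast the next sample so DTM can act before an overshoot."""
    last_delta = temps[-1] - temps[-2]
    return temps[-1] + (w * last_delta + bias)

# Synthetic core trace: temperature ramping 0.5 degC per sample.
trace = [50.0 + 0.5 * n for n in range(40)]
w, bias = train_delta_predictor(trace)
forecast = predict_next(trace, w, bias)  # true next sample is 70.0
```

The real engine predicts whole thermal profiles across cores and NoC elements; the point here is only the predictive loop, forecast first, throttle second, that removes the reactive DTM's reaction lag.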