167 research outputs found

    Weighted Round Robin Configuration for Worst-Case Delay Optimization in Network-on-Chip

    We propose an approach for computing the end-to-end delay bounds of individual variable bit-rate flows in a FIFO multiplexer with aggregate scheduling under the Weighted Round Robin (WRR) policy. To this end, we use network calculus to derive per-flow end-to-end equivalent service curves, which are then used to compute Least Upper Delay Bounds (LUDBs) of individual flows. Since real-time applications require guaranteed services with low delay bounds, we optimize the weights in the WRR policy to minimize LUDBs while satisfying performance constraints. We formulate two constrained delay optimization problems, namely Minimize-Delay and Multi-objective optimization. The Multi-objective optimization takes both the total delay bounds and their variance as minimization objectives. Both optimization problems are solved using a genetic algorithm. A Video Object Plane Decoder (VOPD) case study shows a 15.4% reduction in total worst-case delays and a 40.3% reduction in the variance of delays compared with the round robin policy. The optimization algorithm has low run-time complexity, enabling quick exploration of large design spaces. We conclude that an appropriate weight allocation can be a valuable instrument for delay optimization in on-chip network designs.

    A case study of hardware and software synthesis in ForSyDe

    Analysis of Worst-Case Delay Bounds for On-Chip Packet-Switching Networks

    Enabling Energy-Efficient Inference for Self-Attention Mechanisms in Neural Networks

    The study of specialized accelerators tailored for neural networks has become a promising topic in recent years. Existing neural network accelerators are usually designed for convolutional neural networks (CNNs) or recurrent neural networks (RNNs); however, less attention has been paid to attention mechanisms, an emerging neural network primitive with the ability to identify the relations within input entities. Self-attention-oriented models such as the Transformer have achieved great performance on natural language processing, computer vision, and machine translation. However, the self-attention mechanism has an intrinsically expensive computational workload, which increases quadratically with the number of input entities. Therefore, in this work, we propose a software-hardware co-design solution for energy-efficient self-attention inference. A prediction-based approximate self-attention mechanism is introduced to substantially reduce the runtime as well as the power consumption, and a specialized hardware architecture is then designed to further increase the speedup. The design is implemented on a Xilinx XC7Z035 FPGA, and the results show that the energy efficiency is improved by 5.7x with less than 1% accuracy loss.

    Least Upper Delay Bound for VBR Flows in Networks-on-Chip with Virtual Channels

    Real-time applications such as multimedia and gaming require stringent performance guarantees, usually enforced by a tight upper bound on the maximum end-to-end delay. For FIFO multiplexed on-chip packet switched networks we consider worst-case delay bounds for Variable Bit-Rate (VBR) flows with aggregate scheduling, which schedules multiple flows as an aggregate flow. VBR flows are characterized by a maximum transfer size (L), peak rate (p), burstiness (σ), and average sustainable rate (ρ). Based on network calculus, we present and prove theorems to derive per-flow end-to-end Equivalent Service Curves (ESC), which are in turn used for computing Least Upper Delay Bounds (LUDBs) of individual flows. In a realistic case study we find that the end-to-end delay bound is up to 46.9% more accurate than the case without considering the traffic peak behavior. Likewise, results also show similar improvements for synthetic traffic patterns. The proposed methodology is implemented in C++ and has low run-time complexity, enabling quick evaluation for large and complex SoCs.

    Seismic structure characteristics of the 18 December 2023 M6.2 Jishishan earthquake, Gansu Province

    On 18 December 2023 (Beijing time), an M6.2 earthquake struck Jishishan in Gansu Province. A thorough analysis of the earthquake structure and characteristics was conducted by combining information on regional seismic tectonics, geology, seismic source mechanism, seismic intensity, and aftershock relocation. The earthquake was a reverse fault event trending north-northwest within the Xining-Lanzhou fault block. The structure controlling the earthquake is the Lajishan reverse fault zone, which is closest to the epicentre. The fault zone is situated at the intersection of the northwest-trending Riyueshan dextral strike-slip fault and the east-west-trending Western Qinling North Rim sinistral strike-slip fault. It strikes northwest to north-northwest overall and comprises two branch fault zones with opposite trends in the southern and northern rims. The earthquake's epicentre location, aftershock distribution, and intensity distribution data suggest that the specific fault responsible for this earthquake is the east branch fault of the southern section of the reverse fault zone at the northern rim of Lajishan. This is consistent with the hanging-wall effect characteristic of reverse fault-type earthquakes. Further detailed field investigations are required to determine the deformation of the earth's surface. The Jishishan earthquake is thought to have been caused by reverse fault activity at the intersection of the Riyueshan strike-slip fault and the northern edge of the Western Qinling fault. This activity was triggered by eastward lateral slip along the original left-lateral strike-slip fault of the Xining-Lanzhou fault block on the northeastern margin of the Tibetan Plateau, under the northeast-directed compressional tectonic stress field resulting from the ongoing continent-continent collision between the Indian and Eurasian plates. This earthquake suggests that the compressional tectonic system at the eastern margin of the Tibetan Plateau remains the primary structure controlling strong seismic activity in China in recent years. Further attention should be paid to the risk of strong earthquakes within the fault block.

    Analytical approaches for performance evaluation of networks-on-chip

    This tutorial reviews four popular mathematical formalisms (dataflow analysis, schedulability analysis, network calculus, and queueing theory) and how they have been applied to the analysis of Network-on-Chip (NoC) performance. We review the basic concepts and results of each formalism and provide examples of how they have been used in on-chip communication performance analysis. The tutorial also discusses the respective strengths and weaknesses of each formalism, their suitability for a specific purpose, and the attempts that have been made to bridge these analytical approaches. Finally, we conclude the tutorial by discussing open research issues.

    Least Upper Delay Bound for VBR Flows in Networks-on-Chip with Virtual Channels

    Real-time applications such as multimedia and gaming require stringent performance guarantees, usually enforced by a tight upper bound on the maximum end-to-end delay. For FIFO multiplexed on-chip packet switched networks we consider worst-case delay bounds for Variable Bit-Rate (VBR) flows with aggregate scheduling, which schedules multiple flows as an aggregate flow. VBR flows are characterized by a maximum transfer size, peak rate, burstiness, and average sustainable rate. Based on network calculus, we present and prove theorems to derive per-flow end-to-end Equivalent Service Curves (ESC), which are in turn used for computing Least Upper Delay Bounds (LUDBs) of individual flows. In a realistic case study we find that the end-to-end delay bound is up to 46.9% more accurate than the case without considering the traffic peak behavior. Likewise, results also show similar improvements for synthetic traffic patterns. The proposed methodology is implemented in C++ and has low run-time complexity, enabling quick evaluation for large and complex SoCs.