10 research outputs found

    MaxEVA: Maximizing the Efficiency of Matrix Multiplication on Versal AI Engine

    The increasing computational and memory requirements of Deep Learning (DL) workloads have led to outstanding innovations in hardware architectures. An archetype of such architectures is the novel Versal AI Engine (AIE) by AMD/Xilinx. The AIE comprises multiple programmable processors optimized for vector-based algorithms. An AIE array consisting of 400 processor cores operating at 1.25 GHz is able to deliver a peak throughput of 8 TFLOPs for 32-bit floating-point (fp32) and 128 TOPs for 8-bit integer (int8) precision. In this work, we propose MaxEVA: a novel framework to efficiently map Matrix Multiplication (MatMul) workloads on Versal AIE devices. Our framework maximizes the performance and energy efficiency of MatMul applications by efficiently exploiting features of the AIE architecture and resolving performance bottlenecks from multiple angles. When demonstrated on the VC1902 device of the VCK190 board, MaxEVA accomplishes up to 5.44 TFLOPs and 77.01 TOPs throughput for fp32 and int8 precisions, respectively. In terms of energy efficiency, MaxEVA attains up to 124.16 GFLOPs/W for fp32 and 1.16 TOPs/W for int8. Our proposed method substantially outperforms the state-of-the-art approach, exhibiting up to 2.19x throughput gain and 20.4% higher energy efficiency. The MaxEVA framework provides notable insights to fill the knowledge gap in effectively designing MatMul-based DL workloads on the new Versal AIE devices. Comment: Accepted as a full paper at FPT 202
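    The quoted peak figures can be reproduced from the per-core MAC rate; the sketch below is a back-of-the-envelope check in Python, where the assumed MACs per core per cycle (8 for fp32, 128 for int8) are inferred from the stated peaks rather than taken from the paper.

        # Back-of-the-envelope peak throughput for the AIE array described above.
        # The per-core MAC rates are assumptions inferred from the quoted peaks.
        CORES = 400
        FREQ_HZ = 1.25e9
        MACS_PER_CORE_CYCLE = {"fp32": 8, "int8": 128}  # each MAC counted as 2 ops

        for precision, macs in MACS_PER_CORE_CYCLE.items():
            peak = CORES * FREQ_HZ * macs * 2 / 1e12
            print(f"{precision}: {peak:.0f} peak T(FL)OPs")
        # fp32 -> 8 TFLOPs, int8 -> 128 TOPs, matching the figures above; MaxEVA's
        # measured 5.44 TFLOPs and 77.01 TOPs are roughly 68% and 60% of peak.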

    Estimating the Power Consumption of Heterogeneous Devices when performing AI Inference

    Modern-day life is driven by electronic devices connected to the internet. The emerging research field of the Internet-of-Things (IoT) has grown in popularity alongside the steady increase in the number of connected devices. Since many of these devices are utilised to perform computer vision (CV) tasks, it is essential to understand their power consumption relative to their performance. We report the power consumption profile and analysis of the NVIDIA Jetson Nano board while performing object classification. We present an extensive analysis of energy consumption per frame and throughput in frames per second using YOLOv5 models. The results show that YOLOv5n outperforms the other YOLOv5 variants in terms of throughput (i.e. 12.34 fps) and energy consumption per frame (i.e. 0.154 mWh/frame).
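    As a quick consistency check on those two figures, the per-frame energy and frame rate imply an average board power; the arithmetic below is only a sketch and assumes both numbers describe the same sustained run.

        # Implied average power from the reported per-frame energy and frame rate
        # (illustrative arithmetic; assumes both figures come from the same run).
        energy_mwh_per_frame = 0.154   # mWh per frame (YOLOv5n)
        throughput_fps = 12.34         # frames per second

        energy_j_per_frame = energy_mwh_per_frame * 3.6     # 1 mWh = 3.6 J
        avg_power_w = energy_j_per_frame * throughput_fps   # J/frame * frames/s = W
        print(f"{energy_j_per_frame:.2f} J/frame, ~{avg_power_w:.1f} W average draw")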

    Hardware Trojan Detection and Mitigation in NoC using Key authentication and Obfuscation Techniques

    Today's Multiprocessor Systems-on-Chip (MPSoCs) contain many cores and integrated circuits. To meet current communication requirements, a Network-on-Chip (NoC) is used to obtain high throughput and low latency. NoC is a communication architecture used between processor cores to transfer data from source to destination through several nodes. Since the NoC handles on-chip interconnection for data transmission, it is an attractive target for data leakage and other security attacks. One such attack occurs when a third-party vendor introduces Hardware Trojans (HTs) into the routers of the NoC architecture. This can cause packets to traverse wrong paths, leak or extract information, and cause Denial-of-Service (DoS), degrading system performance. In this paper, a novel HT detection and mitigation approach using obfuscation and key-based authentication techniques is proposed. The proposed technique prevents any illegal transitions between routers, thereby protecting data from malicious activities such as packet misrouting and information leakage. The proposed technique is evaluated on a 4x4 NoC architecture under synthetic traffic patterns and benchmarks, and the hardware model is synthesized in the Cadence tool flow with 90nm technology. The introduced Hardware Trojan affects 8% of the packets passing through the infected router. Experimental results demonstrate that the proposed technique shields the 10-15% of packets that would otherwise be infected from the HT effect. Our proposed work has negligible power and area overheads of 8.6% and 2%, respectively.
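    The abstract does not detail the keying scheme, but the core idea of key-based authentication between routers can be illustrated with a minimal sketch; the keyed-MAC construction and all names below are hypothetical and are not taken from the paper.

        # Hypothetical sketch of key-based packet authentication between NoC routers;
        # the paper's actual obfuscation and key scheme is not specified in the abstract.
        import hmac, hashlib

        SHARED_KEY = b"per-link-secret"   # key provisioned per router link

        def tag_packet(payload: bytes, src: int, dst: int) -> bytes:
            """Sending router attaches a short MAC over the payload and routing fields."""
            msg = bytes([src, dst]) + payload
            return hmac.new(SHARED_KEY, msg, hashlib.sha256).digest()[:4]

        def accept_packet(payload: bytes, src: int, dst: int, tag: bytes) -> bool:
            """Receiving router drops any packet whose tag does not verify,
            blocking misrouted or Trojan-injected traffic."""
            return hmac.compare_digest(tag_packet(payload, src, dst), tag)

        tag = tag_packet(b"flit-data", src=2, dst=7)
        assert accept_packet(b"flit-data", 2, 7, tag)       # legitimate packet passes
        assert not accept_packet(b"flit-data", 2, 5, tag)   # misrouted packet rejected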

    Worst-Case Latency Analysis for the Versal Network-on-Chip

    The recent line of Versal FPGA devices from Xilinx Inc. includes a hard Network-on-Chip (NoC) embedded in the programmable logic, designed to be a high-performance system-level interconnect. While the target markets for Versal devices include applications with real-time constraints, such as automotive driver assist, the associated development tools only provide figures for "structural latencies" of data packets, which assume that the network is otherwise idle. In a realistic setting, this information is not enough to ensure deadlines are met, as different packets can contend for NoC switch outputs, which causes packet contents to be buffered while in transit, increasing their latency. In this work, we develop an approach for calculating upper bounds on such worst-case latencies (WCLs), assuming a model where system tasks release packets into the NoC periodically. In order to develop an accurate model for latencies in the network, we review the architecture and operation of the Versal NoC. We focus on a formal description of the NPS switches that compose the NoC from a flit arbitration perspective, based on a study of the available cycle-accurate switch simulation code. Working with the presented model, we propose an adaptation of an existing approach for WCL analysis in NoCs, Recursive Calculus (RC), in order to apply it to the arbitration policy implemented in the Versal NoC. To evaluate the proposed approach, we implement a simulation experiment for the Versal NoC, with custom endpoints that allow for injecting packets programmatically and measuring their latencies over the NoC. We simulate both a single NPS module and a complete NoC routing periodic workloads, in order to compare with the values given by the WCL analysis and identify sources of pessimism.
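    The core Recursive Calculus idea can be sketched briefly: a packet's worst-case latency is its structural (idle-network) latency plus, recursively, the worst-case delay of every packet it may contend with at a shared switch output. The sketch below is a deliberate simplification with invented numbers; it ignores the Versal NPS arbitration details the paper actually models.

        # Simplified Recursive-Calculus-style bound (illustrative only; not the
        # adapted analysis from the paper, and the workload numbers are made up).
        def wcl(pkt, flows, structural, contend, _seen=frozenset()):
            """Upper-bound latency of pkt: its structural latency plus the
            worst-case delay of every flow it can contend with downstream."""
            bound = structural[pkt]
            for other in flows:
                if other != pkt and other not in _seen and contend(pkt, other):
                    bound += wcl(other, flows, structural, contend, _seen | {pkt})
            return bound

        flows = ["A", "B", "C"]
        structural = {"A": 12, "B": 9, "C": 15}        # idle-network latencies (cycles)
        pairs = {("A", "B"), ("B", "A"), ("B", "C"), ("C", "B")}
        contend = lambda p, q: (p, q) in pairs
        print(wcl("A", flows, structural, contend))    # 12 + (9 + 15) = 36 cycles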

    6G White Paper on Edge Intelligence

    In this white paper we provide a vision for 6G Edge Intelligence. Moving towards 5G and the future 6G networks beyond, intelligent solutions utilizing data-driven machine learning and artificial intelligence become crucial for several real-world applications including, but not limited to, more efficient manufacturing, novel personal smart device environments and experiences, urban computing and autonomous traffic settings. We present edge computing, along with other 6G enablers, as a key component to establish the future 2030 intelligent Internet technologies, as shown in this series of 6G White Papers. In this white paper, we focus on the domains of edge computing infrastructure and platforms, data and edge network management, software development for the edge, and real-time and distributed training of ML/AI algorithms, along with security, privacy, pricing, and end-user aspects. We discuss the key enablers and challenges and identify the key research questions for the development of Intelligent Edge services. As the main outcome of this white paper, we envision a transition from the Internet of Things to the Intelligent Internet of Intelligent Things and provide a roadmap for the development of the 6G Intelligent Edge.

    HopliteBuf FPGA Network-on-Chip: Architecture and Analysis

    We can prove occupancy bounds for the stall-free FIFOs used in deflection-free, low-cost, and high-speed FPGA overlay Networks-on-Chip (NoCs). In our work, we build on top of the HopliteRT livelock-free overlay NoC, with an FPGA-friendly 2D unidirectional torus topology, to propose the novel HopliteBuf NoC. In our new NoC, we strategically introduce stall-free FIFOs in the network and support these FIFOs with static analysis based on network calculus to compute FIFO occupancy, latency, and bandwidth bounds. The microarchitecture of HopliteBuf combines the performance benefits of conventional buffered NoCs (high throughput, low latency) with the cost advantages of deflection-routed NoCs (low FPGA area, high clock frequencies). Specifically, we look at two design variants of the HopliteBuf NoC: (1) a single corner-turn FIFO (W to S), and (2) dual corner-turn FIFOs (W to S+N). The single corner-turn (W to S) design is simpler and only introduces a buffering requirement for packets changing dimension from the X ring to the downhill Y ring (or West to South). The dual corner-turn variant requires two FIFOs for turning packets going downhill (W to S) as well as uphill (W to N). The dual corner-turn design overcomes the mathematical analysis challenges associated with single corner-turn designs for communication workloads with cyclic dependencies between flow traversal paths, at the expense of a small increase in resource cost. Essentially, we resolve an analysis challenge with extra hardware resources. Across a range of 100 synthetically-generated workloads on a 5 x 5 NoC, HopliteBuf outperforms HopliteRT by 1.2-2x in terms of latency, 10% in terms of injection rate, and 30-60% in terms of flowset feasibility. These advantages come at the cost of a 3-4x higher FPGA resource requirement for buffers and muxes. Our analysis also delivers latency bounds that are not only better than HopliteRT's in absolute terms but also tighter by 2-3x, allowing us to provision less hardware to meet our specifications.
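    The static analysis mentioned above rests on standard network-calculus bounds: for a corner-turn FIFO whose arrivals fit a token-bucket curve and whose outgoing ring behaves as a rate-latency server, the classic backlog bound dictates the required FIFO depth. The sketch below applies that generic textbook bound with invented parameters, not the paper's exact derivation.

        # Generic network-calculus backlog bound for a corner-turn FIFO (sketch;
        # parameter values are illustrative, not taken from the HopliteBuf analysis).
        def fifo_depth_bound(sigma, rho, R, T):
            """Max backlog for token-bucket arrivals (burst sigma flits, long-run
            rate rho flits/cycle) served by a rate-latency server (rate R
            flits/cycle after a latency of T cycles). Needs rho <= R to be finite."""
            assert rho <= R, "unschedulable: arrival rate exceeds service rate"
            return sigma + rho * T   # classic backlog (vertical deviation) bound

        # Example: bursts of up to 4 turning flits at a long-run rate of 0.25
        # flits/cycle, onto a ring serving 1 flit/cycle after a 6-cycle latency.
        print(fifo_depth_bound(sigma=4, rho=0.25, R=1.0, T=6))   # -> 5.5 flits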

    Rethinking FPGA Architectures for Deep Neural Network applications

    The prominence of machine learning-powered solutions has instituted an unprecedented trend of integration into virtually all applications, with a broad range of deployment constraints from tiny embedded systems to large-scale warehouse computing machines. While recent research confirms the advantages of using contemporary FPGAs to deploy or accelerate machine learning applications, especially where latency and energy consumption are strictly limited, their architectures, optimised before the machine learning era, remain a barrier to overall efficiency and performance. Realizing this shortcoming, this thesis presents an architectural study aimed at solutions that unlock hidden potential in FPGA technology, primarily for machine learning algorithms. In particular, it shows how slight alterations to state-of-the-art architectures could significantly enhance FPGAs toward becoming more machine learning-friendly while maintaining near-promised performance for the rest of the applications. Finally, it presents a novel systematic approach to deriving new block architectures, guided by design limitations and machine learning algorithm characteristics, through benchmarking. First, through three modifications to Xilinx DSP48E2 blocks, an enhanced digital signal processing (DSP) block for important computations in embedded deep neural network (DNN) accelerators is described. Then, two tiers of modifications to the FPGA logic cell architecture are explained that deliver a variety of performance and utilisation benefits with only minor area overheads. Lastly, with the goal of exploring this new design space in a methodical manner, a problem formulation involving computing nested loops over multiply-accumulate (MAC) operations is first proposed. A quantitative methodology for deriving efficient coarse-grained compute block architectures from benchmarks is then suggested, together with a family of new embedded blocks, called MLBlocks.
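    The problem formulation mentioned above, computing nested loops over MAC operations, reduces to the familiar triple loop; the minimal sketch below is generic (loop bounds and data layout are placeholders), not the thesis's exact formulation.

        # Generic nested-loop MAC kernel underlying the block-architecture search
        # (sketch; loop bounds M, N, K and the data layout are placeholders).
        def matmul_mac(A, B, M, N, K):
            """C[m][n] = sum_k A[m][k] * B[k][n] -- the kernel that candidate
            embedded blocks such as MLBlocks are sized and benchmarked against."""
            C = [[0] * N for _ in range(M)]
            for m in range(M):
                for n in range(N):
                    acc = 0
                    for k in range(K):
                        acc += A[m][k] * B[k][n]   # one multiply-accumulate
                    C[m][n] = acc
            return C

        A = [[1, 2], [3, 4]]
        B = [[5, 6], [7, 8]]
        print(matmul_mac(A, B, M=2, N=2, K=2))   # [[19, 22], [43, 50]]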