Search CORE

127 research outputs found

Recommended from our members

Branch prediction apparatus, systems, and methods

Author: John Lizy K.
Li Tao
Publication venue: United States Patent and Trademark Office
Publication date: 18/10/2011
Field of study

An apparatus and a system, as well as a method and article, may operate to predict a branch within a first operating context, such as a user context, using a first strategy; and to predict a branch within a second operating context, such as an operating system context, using a second strategy. In some embodiments, apparatus and systems may comprise one or more first storage locations to store branch history information associated with a first operating context, and one ore more second storage locations to store branch history information associated with a second operating context.Board of Regents, University of Texas Syste

Texas ScholarWorks

Recommended from our members

Apparatus and method for accelerating java translation

Author: Ciji Isen
Hyo-Jung Song
Lizy K. John
Publication venue: United States Patent and Trademark Office
Publication date: 17/05/2012
Field of study

An apparatus and method for accelerating Java translation are provided. The apparatus includes a lookup table which stores an lookup table having arrangements of bytecodes and native codes corresponding to the bytecodes, a decoder which generates pointer to the native code corresponding to the feed bytecode in the lookup table, a parameterized bytecode processing unit which detects parameterized bytecode among the feed bytecode, and generating pointer to native code required for constant embedding in the lookup table, a constant embedding unit which embeds constants into the native code with the pointer generated by the parameterized bytecode processing unit, and a native code buffer which stores the native code generated by the decoder or the constant embedding unit.Board of Regents, University of Texas Syste

Texas ScholarWorks

HLSDataset: Open-Source Dataset for ML-Assisted FPGA Design using High Level Synthesis

Author: Arora Aman
John Lizy K.
Li Ruihao
Wei Zhigang
Publication venue
Publication date: 21/08/2023
Field of study

Machine Learning (ML) has been widely adopted in design exploration using high level synthesis (HLS) to give a better and faster performance, and resource and power estimation at very early stages for FPGA-based design. To perform prediction accurately, high-quality and large-volume datasets are required for training ML models.This paper presents a dataset for ML-assisted FPGA design using HLS, called HLSDataset. The dataset is generated from widely used HLS C benchmarks including Polybench, Machsuite, CHStone and Rossetta. The Verilog samples are generated with a variety of directives including loop unroll, loop pipeline and array partition to make sure optimized and realistic designs are covered. The total number of generated Verilog samples is nearly 9,000 per FPGA type. To demonstrate the effectiveness of our dataset, we undertake case studies to perform power estimation and resource usage estimation with ML models trained with our dataset. All the codes and dataset are public at the github repo.We believe that HLSDataset can save valuable time for researchers by avoiding the tedious process of running tools, scripting and parsing files to generate the dataset, and enable them to spend more time where it counts, that is, in training ML models.Comment: 8 pages, 5 figure

arXiv.org e-Print Archive

PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation

Author: Arora Aman
John Lizy K.
Ma Siyuan
Nowatzki Tony
Weng Jian
Publication venue
Publication date: 19/11/2023
Field of study

Bit-serial Processing-In-Memory (PIM) is an attractive paradigm for accelerator architectures, for parallel workloads such as Deep Learning (DL), because of its capability to achieve massive data parallelism at a low area overhead and provide orders-of-magnitude data movement savings by moving computational resources closer to the data. While many PIM architectures have been proposed, improvements are needed in communicating intermediate results to consumer kernels, for communication between tiles at scale, for reduction operations, and for efficiently performing bit-serial operations with constants. We present PIMSAB, a scalable architecture that provides spatially aware communication network for efficient intra-tile and inter-tile data movement and provides efficient computation support for generally inefficient bit-serial compute patterns. Our architecture consists of a massive hierarchical array of compute-enabled SRAMs (CRAMs) and is codesigned with a compiler to achieve high utilization. The key novelties of our architecture are: (1) providing efficient support for spatially-aware communication by providing local H-tree network for reductions, by adding explicit hardware for shuffling operands, and by deploying systolic broadcasting, and (2) taking advantage of the divisible nature of bit-serial computations through adaptive precision, bit-slicing and efficient handling of constant operations. When compared against a similarly provisioned modern Tensor Core GPU (NVIDIA A100), across common DL kernels and an end-to-end DL network (Resnet18), PIMSAB outperforms the GPU by 3x, and reduces energy by 4.2x. We compare PIMSAB with similarly provisioned state-of-the-art SRAM PIM (Duality Cache) and DRAM PIM (SIMDRAM) and observe a speedup of 3.7x and 3.88x respectively.Comment: Aman Arora and Jian Weng are co-first authors with equal contributio

arXiv.org e-Print Archive

ULEEN: A Novel Architecture for Ultra Low-Energy Edge Neural Networks

Author: Arora Aman
Bacellar Alan T. L.
Breternitz Jr. Mauricio
de Araujo Leandro S.
Dutra Diego L. C.
Franca Felipe M. G.
John Lizy K.
Katopodis Rafael F.
Lima Priscila M. V.
Miranda Igor D. S.
Susskind Zachary
Villon Luis A. Q.
Publication venue
Publication date: 20/04/2023
Field of study

The deployment of AI models on low-power, real-time edge devices requires accelerators for which energy, latency, and area are all first-order concerns. There are many approaches to enabling deep neural networks (DNNs) in this domain, including pruning, quantization, compression, and binary neural networks (BNNs), but with the emergence of the "extreme edge", there is now a demand for even more efficient models. In order to meet the constraints of ultra-low-energy devices, we propose ULEEN, a model architecture based on weightless neural networks. Weightless neural networks (WNNs) are a class of neural model which use table lookups, not arithmetic, to perform computation. The elimination of energy-intensive arithmetic operations makes WNNs theoretically well suited for edge inference; however, they have historically suffered from poor accuracy and excessive memory usage. ULEEN incorporates algorithmic improvements and a novel training strategy inspired by BNNs to make significant strides in improving accuracy and reducing model size. We compare FPGA and ASIC implementations of an inference accelerator for ULEEN against edge-optimized DNN and BNN devices. On a Xilinx Zynq Z-7045 FPGA, we demonstrate classification on the MNIST dataset at 14.3 million inferences per second (13 million inferences/Joule) with 0.21

\mu

s latency and 96.2% accuracy, while Xilinx FINN achieves 12.3 million inferences per second (1.69 million inferences/Joule) with 0.31

\mu

s latency and 95.83% accuracy. In a 45nm ASIC, we achieve 5.1 million inferences/Joule and 38.5 million inferences/second at 98.46% accuracy, while a quantized Bit Fusion model achieves 9230 inferences/Joule and 19,100 inferences/second at 99.35% accuracy. In our search for ever more efficient edge devices, ULEEN shows that WNNs are deserving of consideration.Comment: 14 pages, 14 figures Portions of this article draw heavily from arXiv:2203.01479, most notably sections 5E and 5F.

arXiv.org e-Print Archive