922 research outputs found
From FPGA to ASIC: A RISC-V processor experience
This work document a correct design flow using these tools in the Lagarto RISC- V Processor and the RTL design considerations that must be taken into account, to move from a design for FPGA to design for ASIC
Energy challenges for ICT
The energy consumption from the expanding use of information and communications technology (ICT) is unsustainable with present drivers, and it will impact heavily on the future climate change. However, ICT devices have the potential to contribute signi - cantly to the reduction of CO2 emission and enhance resource e ciency in other sectors, e.g., transportation (through intelligent transportation and advanced driver assistance systems and self-driving vehicles), heating (through smart building control), and manu- facturing (through digital automation based on smart autonomous sensors). To address the energy sustainability of ICT and capture the full potential of ICT in resource e - ciency, a multidisciplinary ICT-energy community needs to be brought together cover- ing devices, microarchitectures, ultra large-scale integration (ULSI), high-performance computing (HPC), energy harvesting, energy storage, system design, embedded sys- tems, e cient electronics, static analysis, and computation. In this chapter, we introduce challenges and opportunities in this emerging eld and a common framework to strive towards energy-sustainable ICT
Field-based branch prediction for packet processing engines
Network processors have exploited many aspects of architecture design, such as employing multi-core, multi-threading and hardware accelerator, to support both the ever-increasing line rates and the higher complexity of network applications. Micro-architectural techniques like superscalar, deep pipeline and speculative execution provide an excellent method of improving performance without limiting either the scalability or flexibility, provided that the branch penalty is well controlled. However, it is difficult for traditional branch predictor to keep increasing the accuracy by using larger tables, due to the fewer variations in branch patterns of packet processing. To improve the prediction efficiency, we propose a flow-based prediction mechanism which caches the branch histories of packets with similar header fields, since they normally undergo the same execution path. For packets that cannot find a matching entry in the history table, a fallback gshare predictor is used to provide branch direction. Simulation results show that the our scheme achieves an average hit rate in excess of 97.5% on a selected set of network applications and real-life packet traces, with a similar chip area to the existing branch prediction architectures used in modern microprocessors
SpecLLM: Exploring Generation and Review of VLSI Design Specification with Large Language Model
The development of architecture specifications is an initial and fundamental
stage of the integrated circuit (IC) design process. Traditionally,
architecture specifications are crafted by experienced chip architects, a
process that is not only time-consuming but also error-prone. Mistakes in these
specifications may significantly affect subsequent stages of chip design.
Despite the presence of advanced electronic design automation (EDA) tools,
effective solutions to these specification-related challenges remain scarce.
Since writing architecture specifications is naturally a natural language
processing (NLP) task, this paper pioneers the automation of architecture
specification development with the advanced capabilities of large language
models (LLMs). Leveraging our definition and dataset, we explore the
application of LLMs in two key aspects of architecture specification
development: (1) Generating architecture specifications, which includes both
writing specifications from scratch and converting RTL code into detailed
specifications. (2) Reviewing existing architecture specifications. We got
promising results indicating that LLMs may revolutionize how these critical
specification documents are developed in IC design nowadays. By reducing the
effort required, LLMs open up new possibilities for efficiency and accuracy in
this crucial aspect of chip design
An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor
Modern system-on-chips augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and thus have evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a customised VLIW chip multiprocessor (CMP) architecture, known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework on a highly precise cycle-accurate simulator. This is achieved through the execution of 12 benchmarks across 240 different machine configurations, as well as further results utilising an incomplete development branch of the compiler. It is shown that the problems generally scale well with the LE1 architecture, up to eight cores, when the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (x9 for the bitonic sort benchmark with 8 dual-issue cores) with further improvements from compiler optimisations (x14 for bitonic with the same configuration
Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models
A Retrieval-Augmented Language Model (RALM) augments a generative language
model by retrieving context-specific knowledge from an external database. This
strategy facilitates impressive text generation quality even with smaller
models, thus reducing orders of magnitude of computational demands. However,
RALMs introduce unique system design challenges due to (a) the diverse workload
characteristics between LM inference and retrieval and (b) the various system
requirements and bottlenecks for different RALM configurations such as model
sizes, database sizes, and retrieval frequencies. We propose Chameleon, a
heterogeneous accelerator system that integrates both LM and retrieval
accelerators in a disaggregated architecture. The heterogeneity ensures
efficient acceleration of both LM inference and retrieval, while the
accelerator disaggregation enables the system to independently scale both types
of accelerators to fulfill diverse RALM requirements. Our Chameleon prototype
implements retrieval accelerators on FPGAs and assigns LM inference to GPUs,
with a CPU server orchestrating these accelerators over the network. Compared
to CPU-based and CPU-GPU vector search systems, Chameleon achieves up to 23.72x
speedup and 26.2x energy efficiency. Evaluated on various RALMs, Chameleon
exhibits up to 2.16x reduction in latency and 3.18x speedup in throughput
compared to the hybrid CPU-GPU architecture. These promising results pave the
way for bringing accelerator heterogeneity and disaggregation into future RALM
systems
Recommended from our members
Energy-efficient mobile Web computing
Next-generation Web services will be primarily accessed through mobile devices. However, mobile devices are low-performance and stringently energy-constrained. In my dissertation, I propose the design of a high-performance and energy-efficient mobile Web computing substrate. It is a hardware/software co-designed system that delivers satisfactory user quality-of-service (QoS) experience on a mobile energy budget. The key insight is that the traditional interfaces between different Web stacks need to be enhanced with new abstractions that express user QoS experience and that expose architectural-level complexities. On the basis of the enhanced interfaces, I propose synergistic cross-layer optimizations across the processor architecture, Web runtime, programming language, and application layers to maximize the whole system efficiency. The contributions made in this dissertation will likely have a long-term impact because the target application domain, the Web, is becoming a universal mobile development platform, and because our solutions target the fundamental computation layers of the Web domain.Electrical and Computer Engineerin
- …