52 research outputs found

    Benchmark methodologies for the optimized physical synthesis of RISC-V microprocessors

    Get PDF
    As technology continues to advance and chip sizes shrink, the complexity and design time required for integrated circuits have significantly increased. To address these challenges, Electronic Design Automation (EDA) tools have been introduced to streamline the design flow. These tools offer various methodologies and options to optimize power, performance, and chip area. However, selecting the most suitable methods from these options can be challenging, as they may lead to trade-offs among power, performance, and area. While architectural and Register Transfer Level (RTL) optimizations have been extensively studied in existing literature, the impact of optimization methods available in EDA tools on performance has not been thoroughly researched. This thesis aims to optimize a semiconductor processor through EDA tools within the physical synthesis domain to achieve increased performance while maintaining a balance between power efficiency and area utilization. By leveraging floorplanning tools and carefully selecting technology libraries and optimization options, the CV32E40P open-source processor is subjected to various floorplans to analyze their impact on chip performance. The employed techniques, including multibit components prefer option, multiplexer tree prefer option, identification and exclusion of problematic cells, and placement blockages, lead to significant improvements in cell density, congestion mitigation, and timing. The optimized synthesis results demonstrate a 71\% enhancement in chip design performance without a substantial increase in area, showcasing the effectiveness of these techniques in improving large-scale integrated circuits' performance, efficiency, and manufacturability. By exploring and implementing the available options in EDA tools, this study demonstrates how the processor's performance can be significantly improved while maintaining a balanced and efficient chip design. The findings contribute valuable insights to the field of electronic design automation, offering guidance to designers in selecting suitable methodologies for optimizing processors and other integrated circuits

    Variability-Aware Circuit Performance Optimisation Through Digital Reconfiguration

    Get PDF
    This thesis proposes optimisation methods for improving the performance of circuits imple- mented on a custom reconfigurable hardware platform with knowledge of intrinsic variations, through the use of digital reconfiguration. With the continuing trend of transistor shrinking, stochastic variations become first order effects, posing a significant challenge for device reliability. Traditional device models tend to be too conservative, as the margins are greatly increased to account for these variations. Variation-aware optimisation methods are then required to reduce the performance spread caused by these substrate variations. The Programmable Analogue and Digital Array (PAnDA) is a reconfigurable hardware plat- form which combines the traditional architecture of a Field Programmable Gate Array (FPGA) with the concept of configurable transistor widths, and is used in this thesis as a platform on which variability-aware circuits can be implemented. A model of the PAnDA architecture is designed to allow for rapid prototyping of devices, making the study of the effects of intrinsic variability on circuit performance – which re- quires expensive statistical simulations – feasible. This is achieved by means of importing statistically-enhanced transistor performance data from RandomSPICE simulations into a model of the PAnDA architecture implemented in hardware. Digital reconfiguration is then used to explore the hardware resources available for performance optimisation. A bio-inspired optimisation algorithm is used to explore the large solution space more efficiently. Results from test circuits suggest that variation-aware optimisation can provide a significant reduction in the spread of the distribution of performance across various instances of circuits, as well as an increase in performance for each. Even if transistor geometry flexibility is not available, as is the case of traditional architectures, it is still possible to make use of the substrate variations to reduce spread and increase performance by means of function relocation

    Social Insect-Inspired Adaptive Hardware

    Get PDF
    Modern VLSI transistor densities allow large systems to be implemented within a single chip. As technologies get smaller, fundamental limits of silicon devices are reached resulting in lower design yields and post-deployment failures. Many-core systems provide a platform for leveraging the computing resource on offer by deep sub-micron technologies and also offer high-level capabilities for mitigating the issues with small feature sizes. However, designing for many-core systems that can adapt to in-field failures and operation variability requires an extremely large multi-objective optimisation space. When a many-core reaches the size supported by the densities of modern technologies (thousands of processing cores), finding design solutions in this problem space becomes extremely difficult. Many biological systems show properties that are adaptive and scalable. This thesis proposes a self-optimising and adaptive, yet scalable, design approach for many-core based on the emergent behaviours of social-insect colonies. In these colonies there are many thousands of individuals with low intelligence who contribute, without any centralised control, to complete a wide range of tasks to build and maintain the colony. The experiments presented translate biological models of social-insect intelligence into simple embedded intelligence circuits. These circuits sense low-level system events and use this manage the parameters of the many-core's Network-on-Chip (NoC) during runtime. Centurion, a 128-node many-core, was created to investigate these models at large scale in hardware. The results show that, by monitoring a small number of signals within each NoC router, task allocation emerges from the social-insect intelligence models that can self-configure to support representative applications. It is demonstrated that emergent task allocation supports fault tolerance with no extra hardware overhead. The response-threshold decision making circuitry uses a negligible amount of hardware resources relative to the size of the many-core and is an ideal technology for implementing embedded intelligence for system runtime management of large-complexity single-chip systems

    A novel deep submicron bulk planar sizing strategy for low energy subthreshold standard cell libraries

    Get PDF
    Engineering andPhysical Science ResearchCouncil (EPSRC) and Arm Ltd for providing funding in the form of grants and studentshipsThis work investigates bulk planar deep submicron semiconductor physics in an attempt to improve standard cell libraries aimed at operation in the subthreshold regime and in Ultra Wide Dynamic Voltage Scaling schemes. The current state of research in the field is examined, with particular emphasis on how subthreshold physical effects degrade robustness, variability and performance. How prevalent these physical effects are in a commercial 65nm library is then investigated by extensive modeling of a BSIM4.5 compact model. Three distinct sizing strategies emerge, cells of each strategy are laid out and post-layout parasitically extracted models simulated to determine the advantages/disadvantages of each. Full custom ring oscillators are designed and manufactured. Measured results reveal a close correlation with the simulated results, with frequency improvements of up to 2.75X/2.43X obs erved for RVT/LVT devices respectively. The experiment provides the first silicon evidence of the improvement capability of the Inverse Narrow Width Effect over a wide supply voltage range, as well as a mechanism of additional temperature stability in the subthreshold regime. A novel sizing strategy is proposed and pursued to determine whether it is able to produce a superior complex circuit design using a commercial digital synthesis flow. Two 128 bit AES cores are synthesized from the novel sizing strategy and compared against a third AES core synthesized from a state-of-the-art subthreshold standard cell library used by ARM. Results show improvements in energy-per-cycle of up to 27.3% and frequency improvements of up to 10.25X. The novel subthreshold sizing strategy proves superior over a temperature range of 0 °C to 85 °C with a nominal (20 °C) improvement in energy-per-cycle of 24% and frequency improvement of 8.65X. A comparison to prior art is then performed. Valid cases are presented where the proposed sizing strategy would be a candidate to produce superior subthreshold circuits

    Abusing Hardware Race Conditions for High Throughput Energy Efficient Computation

    Get PDF
    We propose a novel computing approach, called “Race Logic”, which utilizes a new data representation to accelerate a broad class of optimization problems, such as those solved by dynamic programming algorithms. The core idea of Race Logic is to deliberately engineer race conditions in a circuit to perform useful computation. In Race Logic, information, instead of being represented as logic levels (as is done in conventional logic), is represented as a timing delay. Computations can then be performed by observing the relative propagation times of signals injected into a configurable circuit (i.e. the outcome of races through the circuit).In this dissertation I will introduce Race Based computation and talk about multiple VLSI implementations. We first begin by considering a synchronous approach, which uses simple clocked delay elements. Though this synchronous implementation outperforms highly optimized conventional implementations of the well-studied, DNA sequence alignment problem, its third order energy scaling with problem size and limited dynamic range of timing delays are its major pitfalls. Next, in the search for energy efficiency, we study asynchronous designs in order to understand the performance trade-offs and applicability of this new architecture. Finally, I will present the results of a prototype asynchronous Race Logic chip and demonstrate that Race-Based computations can align up to 10 million 50 symbol long DNA sequences per second, about 2-3 orders of magnitude faster than the state of the art general purpose computing systems

    Multi-objective Digital VLSI Design Optimisation

    Get PDF
    Modern VLSI design's complexity and density has been exponentially increasing over the past 50 years and recently reached a stage within its development, allowing heterogeneous, many-core systems and numerous functions to be integrated into a tiny silicon die. These advancements have revealed intrinsic physical limits of process technologies in advanced silicon technology nodes. Designers and EDA vendors have to handle these challenges which may otherwise result in inferior design quality, even failures, and lower design yields under time-to-market pressure. Multiple or many design objectives and constraints are emerging during the design process and often need to be dealt with simultaneously. Multi-objective evolutionary algorithms show flexible capabilities in maintaining multiple variable components and factors in uncertain environments. The VLSI design process involves a large number of available parameters both from designs and EDA tools. This provides many potential optimisation avenues where evolutionary algorithms can excel. This PhD work investigates the application of evolutionary techniques for digital VLSI design optimisation. Automated multi-objective optimisation frameworks, compatible with industrial design flows and foundry technologies, are proposed to improve solution performance, expand feasible design space, and handle complex physical floorplan constraints through tuning designs at gate-level. Methodologies for enriching standard cell libraries regarding drive strength are also introduced to cooperate with multi-objective optimisation frameworks, e.g., subsequent hill-climbing, providing a richer pool of solutions optimised for different trade-offs. The experiments of this thesis demonstrate that multi-objective evolutionary algorithms, derived from biological inspirations, can assist the digital VLSI design process, in an industrial design context, to more efficiently search for well-balanced trade-off solutions as well as optimised design space coverage. The expanded drive granularity of standard cells can push the performance of silicon technologies with offering improved solutions regarding critical objectives. The achieved optimisation results can better deliver trade-off solutions regarding power, performance and area metrics than using standard EDA tools alone. This has been not only shown for a single circuit solution but also covered the entire standard-tool-produced design space

    Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

    Get PDF
    abstract: General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions. Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse, or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%. Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications. Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPG- PUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.Dissertation/ThesisDoctoral Dissertation Computer Science 201

    THEORY, DESIGN, AND SIMULATION OF LINA: A PATH FORWARD FOR QCA-TYPE NANOELECTRONICS

    Get PDF
    The past 50 years have seen exponential advances in digital integrated circuit technologies which has facilitated an explosion of uses and functionality. Although this rate (generally referred to as "Moore's Law") cannot be sustained indefinitely, significant advances will remain possible even after current technologies reach fundamental limits. However if these further advances are to be realized, nanoelectronics designs must be developed that provide significant improvements over, the currently-utilized, complementary metal-oxide semiconductor (CMOS) transistor based integrated circuits. One promising nanoelectronics paradigm to fulfill this function is Quantum-dot Cellular Automata (QCA). QCA provides the possibility of THz switching, molecular scaling, and provides particular applicability for advanced logical constructs such as reversible logic and systolic arrays within the paradigm. These attributes make QCA an exciting prospect; however, current fabrication technology does not exist which allows for the fabrication of reliable electronic QCA circuits which operate at room-temperature. Furthermore, a plausible path to fabrication of circuitry on the very large scale integration (VLSI) level with QCA does not currently exist. This has caused doubts to the viability of the paradigm and questions to its future as a suitable nanoelectronic replacement to CMOS. In order to resolve these issues, research was conducted into a new design which could utilize key attributes of QCA while also providing a means for near-term fabrication of reliable room-temperature circuits and a path forward for VLSI circuits.The result of this research, presented in this dissertation, is the Lattice-based Integrated-signal Nanocellular Automata (LINA) nanoelectronics paradigm. LINA designs are based on QCA and provide the same basic functionality as traditional QCA. LINA also retains the key attributes of THz switching, scalability to the molecular level, and ability to utilize advanced logical constructs which are crucial to the QCA proposals. However, LINA designs also provide significant improvements over traditional QCA. For example, the continuous correction of faults, due to LINA's integrated-signal approach, provides reliability improvements to enable room-temperature operation with cells which are potentially up to 20nm and fault tolerance to layout, patterning, stray-charge, and stuck-at-faults. In terms of fabrication, LINA's lattice-based structure allows precise relative placement through the use of self-assembly techniques seen in current nanoparticle research. LINA also allows for large enough wire and logic structures to enable use of widely available photo-lithographical patterning technologies. These aspects of the LINA designs, along with power, timing, and clocking results, have been verified through the use of new and/or modified simulation tools specifically developed for this purpose. To summarize, the LINA designs and results, presented in this dissertation, provide a path to realization of QCA-type VLSI nanoelectronic circuitry. Furthermore, they offer a renewed viability of the paradigm to replace CMOS and advance computing technologies beyond the next decade

    Crosstalk computing: circuit techniques, implementation and potential applications

    Get PDF
    Title from PDF of title [age viewed January 32, 2022Dissertation advisor: Mostafizur RahmanVitaIncludes bibliographical references (page 117-136)Thesis (Ph.D.)--School of Computing and Engineering. University of Missouri--Kansas City, 2020This work presents a radically new computing concept for digital Integrated Circuits (ICs), called Crosstalk Computing. The conventional CMOS scaling trend is facing device scaling limitations and interconnect bottleneck. The other primary concern of miniaturization of ICs is the signal-integrity issue due to Crosstalk, which is the unwanted interference of signals between neighboring metal lines. The Crosstalk is becoming inexorable with advancing technology nodes. Traditional computing circuits always tries to reduce this Crosstalk by applying various circuit and layout techniques. In contrast, this research develops novel circuit techniques that can leverage this detrimental effect and convert it astutely to a useful feature. The Crosstalk is engineered into a logic computation principle by leveraging deterministic signal interference for innovative circuit implementation. This research work presents a comprehensive circuit framework for Crosstalk Computing and derives all the key circuit elements that can enable this computing model. Along with regular digital logic circuits, it also presents a novel Polymorphic circuit approach unique to Crosstalk Computing. In Polymorphic circuits, the functionality of a circuit can be altered using a control variable. Owing to the multi-functional embodiment in polymorphic-circuits, they find many useful applications such as reconfigurable system design, resource sharing, hardware security, and fault-tolerant circuit design, etc. This dissertation shows a comprehensive list of polymorphic logic gate implementations, which were not reported previously in any other work. It also performs a comparison study between Crosstalk polymorphic circuits and existing polymorphic approaches, which are either inefficient due to custom non-linear circuit styles or propose exotic devices. The ability to design a wide range of polymorphic logic circuits (basic and complex logics) compact in design and minimal in transistor count is unique to Crosstalk Computing, which leads to benefits in the circuit density, power, and performance. The circuit simulation and characterization results show a 6x improvement in transistor count, 2x improvement in switching energy, and 1.5x improvement in performance compared to counterpart implementation in CMOS circuit style. Nevertheless, the Crosstalk circuits also face issues while cascading the circuits; this research analyzes all the problems and develops auxiliary circuit techniques to fix the problems. Moreover, it shows a module-level cascaded polymorphic circuit example, which also employs the auxiliary circuit techniques developed. For the very first time, it implements a proof-of-concept prototype Chip for Crosstalk Computing at TSMC 65nm technology and demonstrates experimental evidence for runtime reconfiguration of the polymorphic circuit. The dissertation also explores the application potentials for Crosstalk Computing circuits. Finally, the future work section discusses the Electronic Design Automation (EDA) challenges and proposes an appropriate design flow; besides, it also discusses ideas for the efficient implementation of Crosstalk Computing structures. Thus, further research and development to realize efficient Crosstalk Computing structures can leverage the comprehensive circuit framework developed in this research and offer transformative benefits for the semiconductor industry.Introduction and Motivation -- More Moore and Relevant Beyond CMOS Research Directions -- Crosstalk Computing -- Crosstalk Circuits Based on Perception Model -- Crosstalk Circuit Types -- Cascading Circuit Issues and Sollutions -- Existing Polymorphic Circuit Approaches -- Crosstalk Polymorphic Circuits -- Comparison and Benchmarking of Crosstalk Gates -- Practical Realization of Crosstalk Gates -- Poential Applications -- Conclusion and Future Wor
    • …
    corecore