Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques
The rapid growth of demanding applications in domains such as multimedia processing and machine learning has marked a new era for edge and cloud computing. These applications involve massive data and compute-intensive tasks, and typical computing paradigms in embedded systems and data centers are therefore stressed to meet the worldwide demand for high performance. Concurrently, the semiconductor landscape of the last 15 years has established power as a first-class design concern. As a result, the computing-systems community is forced to find alternative design approaches that facilitate high-performance and/or power-efficient computing. Among the examined solutions, Approximate Computing has attracted ever-increasing interest, with research works applying approximations across the entire traditional computing stack, i.e., at the software, hardware, and architectural levels. Over the last decade, a plethora of approximation techniques has emerged in software (programs, frameworks, compilers, runtimes, languages), hardware (circuits, accelerators), and architectures (processors, memories). The current article is Part I of our comprehensive survey on Approximate Computing: it reviews its motivation, terminology, and principles, and it classifies and presents the technical details of state-of-the-art software and hardware approximation techniques.
Comment: Under Review at ACM Computing Surveys
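A classic example of the software-level approximation techniques in the survey's scope is loop perforation: skipping a fraction of loop iterations to trade accuracy for speed. A minimal illustrative sketch (the function and parameter names are ours, not from the survey):

```python
def mean_perforated(values, skip=2):
    """Approximate the mean via loop perforation: sample only every
    `skip`-th element, trading accuracy for reduced work."""
    sampled = values[::skip]
    return sum(sampled) / len(sampled)

data = list(range(1, 101))               # exact mean is 50.5
approx = mean_perforated(data, skip=4)   # touches 25 of 100 elements
```

With `skip=4` the loop does a quarter of the work and returns 49.0 instead of 50.5, illustrating the accuracy/effort trade-off that approximate-computing techniques exploit.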
Serial Biasing Technique for Rapid Single Flux Quantum Circuits
Superconductor electronics based on the Single Flux Quantum (SFQ) technology are considered a strong contender for the “beyond CMOS” future of digital circuits because of the high speed and low power dissipation associated with them. In fact, digital operations beyond tens of GHz have been routinely demonstrated in SFQ technology. These circuits have widespread applications such as high-speed analog-to-digital conversion, digital signal processing, high-speed computing, and emerging topics such as control circuitry for superconducting quantum computing.
Rapid Single Flux Quantum (RSFQ) circuits have emerged as a promising candidate within SFQ technology, with information encoded in picosecond-wide, millivolt voltage pulses. As with any integrated circuit technology, scalability of RSFQ circuits is essential to realizing their applications. These circuits, based on the Josephson junction, require a DC bias current for correct operation. The DC bias current requirement increases with circuit complexity, and this has multiple implications for circuit operation. Large currents produce magnetic fields that can interfere with logic operation. Furthermore, the heat load delivered to the superconducting chip also increases with current, which could result in the circuit becoming “normal” rather than superconducting. These problems make reduction of the bias current necessary.
Serial Biasing (SB) is a bias current reduction technique that has been proposed in the past. In this technique, a digital circuit is partitioned into multiple identical islands, and bias current is provided to each island in a serial manner. While this scheme is promising, it poses multiple challenges, such as designing a driver-receiver pair circuit with robust and wide operating bias margins, managing current on the floating islands, etc.
This thesis investigates SB in a systematic manner, focusing on the design and measurement of the fundamental components of this technique, with an emphasis on reliability and scalability. It presents circuit techniques that achieve high-speed, serially biased RSFQ circuits with robust operating margins, along with experimental evidence to support the ideas. It develops a framework for serial biasing that could be used by electronic design tools to automate the design and synthesis of complex RSFQ circuits. It also investigates Passive Transmission Lines (PTLs) for use as passive interconnects between library cells in a complex design, reducing the DC bias current required by the active circuitry.
Improving low latency applications for reconfigurable devices
This thesis seeks to improve low latency application performance via architectural improvements in reconfigurable devices. This is achieved by improving resource utilisation and access, and by exploiting the different environments within which reconfigurable devices are deployed.
Our first contribution leverages devices deployed at the network level to enable the low latency processing of financial market data feeds. Financial exchanges transmit messages via two identical data feeds to reduce the chance of message loss. We present an approach to arbitrate these redundant feeds at the network level using a Field-Programmable Gate Array (FPGA). With support for any messaging protocol, we evaluate our design using the NASDAQ TotalView-ITCH, OPRA, and ARCA data feed protocols, and provide two simultaneous outputs: one prioritising low latency, and one prioritising high reliability with three dynamically configurable windowing methods.
Our second contribution is a new ring-based architecture for low latency, parallel access to FPGA memory. Traditional FPGA memory is formed by grouping block memories (BRAMs) together and accessing them as a single device. Our architecture accesses these BRAMs independently and in parallel. Targeting memory-based computing, which stores pre-computed function results in memory, our architecture benefits low-latency applications that rely on highly complex functions, iterative computation, or many parallel accesses to a shared resource. We assess square root, power, trigonometric, and hyperbolic functions within the FPGA, and provide a tool to convert Python functions to our new architecture.
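The memory-based computing idea above replaces runtime arithmetic with a lookup into a table of pre-computed results. A minimal software analogue, assuming an illustrative 10-bit fixed-point input format (the table size and names are our own choices, not the thesis's design):

```python
import math

BITS = 10                  # table index width: 2**10 = 1024 entries
SCALE = (2 ** BITS) - 1

# Pre-compute sqrt(x) for x in [0, 1] at 10-bit input resolution,
# mimicking a result table held in BRAM on the FPGA.
SQRT_TABLE = [math.sqrt(i / SCALE) for i in range(2 ** BITS)]

def sqrt_lut(x):
    """Approximate sqrt for x in [0, 1] with a single table lookup,
    replacing an iterative square-root computation."""
    return SQRT_TABLE[round(x * SCALE)]
```

Each call costs one memory access regardless of how expensive the underlying function is, which is why the approach suits highly complex functions and many parallel accesses.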
Our third contribution extends the ring-based architecture to support any FPGA processing element. We unify E heterogeneous processing elements within compute pools, with each element implementing the same function, and the pool serving D parallel function calls. Our implementation-agnostic approach supports processing elements with different latencies, implementations, and pipeline lengths, as well as non-deterministic latencies. Compute pools evenly balance access to processing elements across the entire application, and are evaluated by implementing eight different neural network activation functions within an FPGA.
A Phase Change Memory and DRAM Based Framework For Energy-Efficient and High-Speed In-Memory Stochastic Computing
Convolutional Neural Networks (CNNs) have proven to be highly effective in various fields related to Artificial Intelligence (AI) and Machine Learning (ML). However, the significant computational and memory requirements of CNNs make their processing highly compute- and memory-intensive. In particular, the multiply-accumulate (MAC) operation, a fundamental building block of CNNs, requires an enormous number of arithmetic operations. As input dataset sizes increase, the traditional processor-centric von Neumann computing architecture becomes ill-suited for CNN-based applications. This results in exponentially higher latency and energy costs, making the processing of CNNs highly challenging.
To overcome these challenges, researchers have explored the Processing-In-Memory (PIM) technique, which involves placing the processing unit inside or near the memory unit. This approach reduces data migration length and utilizes the internal memory bandwidth at the memory chip level. However, developing a reliable PIM-based system with minimal hardware modifications and design complexity remains a significant challenge.
The proposed solution in the report suggests utilizing different memory technologies, such as Dynamic RAM (DRAM) and phase change memory (PCM), with stochastic arithmetic and minimal add-on logic. Stochastic computing is a technique that uses random bitstreams to perform arithmetic operations instead of the traditional binary representation. This technique reduces the hardware requirements for CNNs' arithmetic operations, making it possible to implement them with minimal add-on logic.
The report details the workflow for performing the arithmetic operations used by CNNs, including MAC, activation, and floating-point functions. The proposed solution includes designs for a scalable Stochastic Number Generator (SNG), a DRAM-based CNN accelerator, a non-volatile memory (NVM) class PCRAM-based CNN accelerator, and DRAM-based stochastic-to-binary (StoB) conversion for in-situ deep learning. These designs utilize stochastic computing to reduce the hardware requirements of CNNs' arithmetic operations and enable energy- and time-efficient processing of CNNs.
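In the standard unipolar stochastic-computing encoding the abstract refers to, a value p in [0, 1] is represented by a random bitstream whose fraction of 1s equals p, so multiplication of two independent streams reduces to a bitwise AND — a single gate instead of a full binary multiplier. A small sketch of the encoding (the stream length here is an illustrative choice, not from the report):

```python
import random

def to_stream(p, n, rng):
    """Encode p in [0, 1] as an n-bit stochastic stream: P(bit = 1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def from_stream(bits):
    """Decode: the fraction of 1s estimates the encoded value."""
    return sum(bits) / len(bits)

rng = random.Random(0)
n = 4096
a = to_stream(0.5, n, rng)
b = to_stream(0.25, n, rng)
prod = [x & y for x, y in zip(a, b)]   # bitwise AND multiplies the values
# from_stream(prod) ~ 0.5 * 0.25 = 0.125, up to stochastic noise
```

The price of the tiny arithmetic logic is precision that only improves with stream length, which is why the report pairs the technique with dedicated stochastic number generators and stochastic-to-binary converters.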
The report also identifies future research directions for the proposed designs, including in-situ PCRAM-based SNG, ODIN (A Bit-Parallel Stochastic Arithmetic Based Accelerator for In-Situ Neural Network Processing in Phase Change RAM), ATRIA (Bit-Parallel Stochastic Arithmetic Based Accelerator for In-DRAM CNN Processing), and AGNI (In-Situ, Iso-Latency Stochastic-to-Binary Number Conversion for In-DRAM Deep Learning), and presents initial findings for these ideas.
In summary, the proposed solution in the report offers a comprehensive approach to address the challenges of processing CNNs, and the proposed designs have the potential to improve the energy and time efficiency of CNNs significantly. Using stochastic computing and different memory technologies enables the development of reliable PIM-based systems with minimal hardware modifications and design complexity, providing a promising path for the future of CNN-based applications.
Undergraduate and Graduate Course Descriptions, 2023 Spring
Wright State University undergraduate and graduate course descriptions from Spring 2023.
Engineering metabolic time-sharing in a clonal Escherichia coli population
The “division of labour” strategy is common among microbial communities, as dividing burdensome tasks between members of a community alleviates the strain placed on individual cells. Exploiting this phenomenon in heterogeneous microbial co-cultures for industrial synthesis of valuable compounds is limited by inefficiencies in nutrient exchange and conflicting growth requirements. Here, we demonstrate a synthetic gene circuit that enables cells of an isogenic Escherichia coli population to carry out “metabolic time-sharing” by shifting between alternate metabolic states via temporal changes in gene expression. Further, we review techniques for monitoring such dynamic processes at the single-cell level, and discuss their current applications in bacterial studies. To validate that our circuit may be used to induce cooperative behaviours in microbial populations, we adapted this circuit to engineer cells that oscillate between distinct amino acid auxotrophy phenotypes, driven by the periodic silencing of key biosynthetic genes. Culturing a clonal time-sharing population with unsynchronized oscillators permits reciprocal amino acid cross-feeding, ultimately ensuring population viability. Through comparative growth experiments, we found that the fitness of our time-sharing population was comparable to that of a heterogeneous co-culture composed of E. coli auxotrophs similarly capable of cross-feeding amino acids. Although future studies would be needed to confirm this, our preliminary results suggest that metabolic time-sharing may be a viable alternative to synthetic heterogeneous co-cultures. As it may enable an entire complex biosynthetic pathway to be engineered into a single host with reduced metabolic burden, the metabolic time-sharing strategy demonstrated here could potentially be implemented for microbial bioproduction, among other widespread applications.
Improved precision scaling for simulating coupled quantum-classical dynamics
We present a super-polynomial improvement in the precision scaling of quantum
simulations for coupled classical-quantum systems in this paper. Such systems
are found, for example, in molecular dynamics simulations within the
Born-Oppenheimer approximation. By employing a framework based on the
Koopman-von Neumann formalism, we express the Liouville equation of motion as
unitary dynamics and utilize phase kickback from a dynamical quantum simulation
to calculate the quantum forces acting on classical particles. This approach
allows us to simulate the dynamics of these particles without the overheads
associated with measuring gradients and solving the equations of motion on a
classical computer, resulting in a super-polynomial advantage at the price of
increased space complexity. We demonstrate that these simulations can be
performed in both microcanonical and canonical ensembles, enabling the
estimation of thermodynamic properties from the prepared probability density.
Comment: 19 + 51 pages
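The Koopman-von Neumann construction mentioned above rests on a standard fact (stated here in its textbook form, not taken from the paper): the classical Liouville equation is generated by a self-adjoint operator on phase space, so it can be rewritten as Schrödinger-like unitary dynamics,

```latex
i\,\frac{\partial \psi(x,p,t)}{\partial t} = \hat{L}\,\psi(x,p,t),
\qquad
\hat{L} = -i \sum_j \left(
  \frac{\partial H}{\partial p_j}\,\frac{\partial}{\partial x_j}
  - \frac{\partial H}{\partial x_j}\,\frac{\partial}{\partial p_j}
\right),
```

where $H(x,p)$ is the classical Hamiltonian. Because $\hat{L}$ is self-adjoint on $L^2$ of phase space, the evolution $e^{-i\hat{L}t}$ is unitary, which is what allows quantum simulation algorithms to be applied to the classical degrees of freedom.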
Analysing and Reducing Costs of Deep Learning Compiler Auto-tuning
Deep Learning (DL) is significantly impacting many industries, including automotive, retail, and medicine, enabling autonomous driving, recommender systems, and genomics modelling, amongst other applications. At the same time, demand for complex and fast DL models is continually growing. The most capable models tend to exhibit the highest operational costs, primarily due to their large computational resource footprint and the inefficient utilisation of computational resources by DL systems. In an attempt to tackle these problems, DL compilers and auto-tuners emerged, automating the traditionally manual task of DL model performance optimisation. While auto-tuning improves model inference speed, it is a costly process, which limits its wider adoption within DL deployment pipelines.
The high operational costs associated with DL auto-tuning have multiple causes. During operation, DL auto-tuners explore large search spaces consisting of billions of tensor programs to propose potential candidates that improve DL model inference latency. Subsequently, DL auto-tuners measure candidate performance in isolation on the target device, which constitutes the majority of auto-tuning compute time. Suboptimal candidate proposals, combined with their serial measurement on an isolated target device, lead to prolonged optimisation time and reduced resource availability, ultimately reducing the cost-efficiency of the process.
In this thesis, we investigate the reasons behind prolonged DL auto-tuning and quantify their impact on optimisation costs, revealing directions for improved DL auto-tuner design. Based on these insights, we propose two complementary systems: Trimmer and DOPpler. Trimmer improves tensor program search efficacy by filtering out poorly performing candidates, and controls end-to-end auto-tuning using cost objectives, monitoring optimisation cost. Simultaneously, DOPpler breaks long-held assumptions about serial candidate measurement by successfully parallelising measurements intra-device, with minimal penalty to optimisation quality. Through extensive experimental evaluation of both systems, we demonstrate that they significantly improve the cost-efficiency of auto-tuning (by up to 50.5%) across a plethora of tensor operators, DL models, auto-tuners, and target devices.
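The candidate-filtering idea described above can be pictured as a cheap cost model screening proposals before the expensive on-device measurements. Everything in this sketch — the function names, the toy cost model, the keep-fraction split — is our own illustration of the general pattern, not Trimmer's actual implementation:

```python
def tune(candidates, predict_cost, measure, keep_fraction=0.1):
    """Rank candidate tensor programs by a cheap cost-model prediction,
    then run the expensive on-device measurement only on the most
    promising fraction, filtering out poorly performing candidates."""
    ranked = sorted(candidates, key=predict_cost)
    k = max(1, int(len(ranked) * keep_fraction))
    measured = [(measure(c), c) for c in ranked[:k]]  # expensive step
    return min(measured)  # (best measured latency, best candidate)

# Toy stand-ins: "candidates" are tile sizes; the cost model is a
# heuristic that only roughly agrees with the true latency.
cands = [1, 2, 4, 8, 16, 32, 64]
best_latency, best = tune(
    cands,
    predict_cost=lambda t: abs(t - 16),   # cheap, imperfect estimate
    measure=lambda t: abs(t - 8) + 1.0,   # "real" device latency
    keep_fraction=0.5,
)
```

Only half of the candidates reach the measurement step, which is where the cost savings come from; the risk, which such systems must manage, is that an imperfect cost model filters out the true optimum.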
Shedding in the Timber Rattlesnake: Natural Patterns, Endocrinological Underpinnings, Temporal and Energetic Effort, and Integration as a Reptilian Life History Trait
The semi-frequent replacement of the epidermis (ecdysis) is a characteristic trait of reptiles. Whereas all reptiles regularly engage in some degree of skin shedding, skin morphology in snakes necessitates the synchronous replacement of the entire epidermis and facilitates the subsequent removal of the old layer as a single sheet. To date, this ubiquitous process has garnered little attention from researchers because snakes shed with unpredictable timing and frequency and are exceedingly cryptic during ecdytic cycles, which has previously impeded detailed physiological or ecological investigations of the process in the clade. Because of this lack of study, ecdysis is often viewed as a maintenance function, occurring whenever change in body size necessitates the generation of a new epidermal layer. However, recent observations that skin shedding plays a role in conspecific sexual signaling in some snakes suggest that the predominant view of ecdysis as a growth function may be overly simplistic. By studying population-scale patterns of shed, I was able to describe the timing and frequency of ecdysis in a population of Timber Rattlesnakes, solving a long-standing problem in the continued study of ecdysis: predicting the occurrence of shed events. Coupling my knowledge of patterns of shed timing with novel methodologies for inducing shed, I was able to induce ecdytic cycles in a laboratory setting and herein provide the first measurements of the metabolic effort and duration of shedding in any reptile. I integrated data on the frequency, duration, and metabolic effort of shed into an individual-based computer model of the Timber Rattlesnake to address larger questions about the selective pressures that may shape patterns of shed in snakes. I found that Timber Rattlesnakes shed infrequently (1-2 times per year) and often in close proximity to the mating season regardless of sex.
However, the physiological conditions that best correlated with shed frequency differed between males (body condition) and females (reproductive condition). Each shed event required a significant metabolic (3% of the total annual energy budget) and temporal (~28 days at 25 °C, with ~50% of that including some degree of visual limitation from occluded spectacles) investment. In my computer simulations, I found that time spent in shed limited lifetime energy budgets (decreasing lifetime resource acquisition via foraging) and that the energetic effort of ecdysis may serve to limit shed frequency in low-resource environments. In my observations of patterns of shed in the wild and through simulations of expected female fecundity under alternate shed frequencies, I found evidence that ecdysis may play a vital role in the reproductive biology of rattlesnakes. Ecdysis is a vital and omnipresent feature of reptilian biology. My data are the first to demonstrate that the frequency of the process is constrained in a population. I provide evidence for the role of growth and body condition, time-energy budgets, environmental conditions, and reproductive events in dictating patterns of shed. As such, patterns of shed may be population specific and serve as an indicator of the important environmental and biophysical forces which shape life histories across populations and species.