126 research outputs found

    Design Techniques for Energy-Quality Scalable Digital Systems

    Energy efficiency is one of the key design goals in modern computing. Increasingly complex tasks are being executed in mobile devices and Internet of Things end-nodes, which are expected to operate for long time intervals, on the order of months or years, within the limited energy budgets provided by small form-factor batteries. Fortunately, many such tasks are error resilient, meaning that they can tolerate some relaxation in the accuracy, precision or reliability of internal operations without a significant impact on the overall output quality. The error resilience of an application may derive from a number of factors. The processing of analog sensor inputs measuring quantities from the physical world may not always require maximum precision, as the amount of information that can be extracted is limited by the presence of external noise. Outputs destined for human consumption may also contain small or occasional errors, thanks to the limited capabilities of our vision and hearing systems. Finally, some computational patterns commonly found in domains such as statistics, machine learning and operational research naturally tend to reduce or eliminate errors. Energy-Quality (EQ) scalable digital systems systematically trade off the quality of computations against energy efficiency, by relaxing the precision, the accuracy, or the reliability of internal software and hardware components in exchange for energy reductions. This design paradigm is believed to offer one of the most promising solutions to the pressing need for low-energy computing. Despite these high expectations, the current state of the art in EQ scalable design suffers from important shortcomings. First, the great majority of techniques proposed in the literature focus only on processing hardware and software components. Nonetheless, for many real devices, processing contributes only a small portion of the total energy consumption, which is dominated by other components (e.g. 
I/O, memory or data transfers). Second, in order to fulfill its promises and become widespread in commercial devices, EQ scalable design needs to reach industrial maturity. This involves moving from purely academic research based on high-level models and theoretical assumptions to engineered flows compatible with existing industry standards. Third, the time-varying nature of error tolerance, both among different applications and within a single task, should become more central in the proposed design methods. This means designing “dynamic” systems, in which the precision or reliability of operations (and consequently their energy consumption) can be tuned at runtime, rather than “static” solutions, in which the output quality is fixed at design time. This thesis introduces several new EQ scalable design techniques for digital systems that take the previous observations into account. Besides processing, the proposed methods apply the principles of EQ scalable design also to interconnects and peripherals, which are often relevant contributors to the total energy in sensor nodes and mobile systems, respectively. Regardless of the target component, the presented techniques pay special attention to the accurate evaluation of the benefits and overheads deriving from EQ scalability, using industrial-level models, and to the integration with existing standard tools and protocols. Moreover, all the works presented in this thesis allow the dynamic reconfiguration of output quality and energy consumption. More specifically, the contribution of this thesis is divided into three parts. In the first body of work, the design of EQ scalable modules for processing hardware data paths is considered. Three design flows are presented, targeting different technologies and exploiting different ways to achieve EQ scalability, i.e. timing-induced errors and precision reduction. 
These works are inspired by previous approaches from the literature, namely Reduced-Precision Redundancy and Dynamic Accuracy Scaling, which are rethought to make them compatible with standard Electronic Design Automation (EDA) tools and flows, providing solutions to overcome their main limitations. The second part of the thesis investigates the application of EQ scalable design to serial interconnects, which are the de facto standard for data exchanges between processing hardware and sensors. In this context, two novel bus encodings are proposed, called Approximate Differential Encoding and Serial-T0, that exploit the statistical characteristics of data produced by sensors to reduce the energy consumption on the bus at the cost of controlled data approximations. The two techniques achieve different results for data of different origins, but share the common features of allowing runtime reconfiguration of the allowed error and of being compatible with standard serial bus protocols. Finally, the last part of the manuscript is devoted to the application of EQ scalable design principles to displays, which are often among the most energy-hungry components in mobile systems. The two proposals in this context leverage the emissive nature of Organic Light-Emitting Diode (OLED) displays to save energy by altering the displayed image, thus inducing an output quality reduction that depends on the amount of such alteration. The first technique implements an image-adaptive form of brightness scaling, whose outputs are optimized for the balance between power consumption and similarity to the input. The second approach achieves concurrent power reduction and image enhancement by means of an adaptive polynomial transformation. Both solutions focus on minimizing the overheads associated with a real-time implementation of the transformations in software or hardware, so that these do not offset the savings in the display. 
For each of these three topics, results show that the aforementioned goal of building EQ scalable systems that are compatible with existing best practices and mature enough to be integrated into commercial devices can be effectively achieved. Moreover, they also show that very simple and similar principles can be applied to design EQ scalable versions of different system components (processing, peripherals and I/O), and to equip these components with knobs for the runtime reconfiguration of the energy versus quality tradeoff.
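The precision-reduction knob described above can be illustrated with a toy model (not taken from the thesis): quantizing sensor samples to a runtime-selectable bit width trades a controlled increase in error for cheaper storage and arithmetic. The `quantize` helper and its parameters are hypothetical.

```python
import numpy as np

def quantize(samples, bits):
    """Quantize samples in [0, 1) to the given bit width.

    Fewer bits means cheaper arithmetic, storage and transfers
    (less energy) at the cost of larger quantization error."""
    levels = 2 ** bits
    return np.floor(samples * levels) / levels

rng = np.random.default_rng(0)
x = rng.random(10_000)

# The bit width is the runtime "energy versus quality" knob.
for bits in (12, 8, 4):
    err = np.abs(x - quantize(x, bits)).mean()
    print(f"{bits:2d} bits -> mean abs error ~ {err:.5f}")
```

Dialing `bits` down at runtime is the kind of dynamic reconfiguration the thesis argues for, as opposed to fixing the precision at design time.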

    Bio-inspired learning and hardware acceleration with emerging memories

    Machine Learning has permeated many aspects of engineering, ranging from Internet of Things (IoT) applications to big data analytics. While the computing resources available to implement these algorithms have become more powerful, both in terms of the complexity of problems that can be solved and the overall computing speed, the huge energy costs involved remain a significant challenge. The human brain, which has evolved over millions of years, is widely accepted as the most efficient control and cognitive processing platform. Neuro-biological studies have established that information processing in the human brain relies on impulse-like signals emitted by neurons, called action potentials. Motivated by these facts, Spiking Neural Networks (SNNs), a bio-plausible variant of neural networks, have been proposed as an alternative computing paradigm in which the timing of spikes generated by artificial neurons is central to learning and inference. This dissertation demonstrates the computational power of SNNs using conventional CMOS and emerging nanoscale hardware platforms. The first half of this dissertation presents an SNN architecture which is trained using a supervised spike-based learning algorithm for the handwritten digit classification problem. This network achieves an accuracy of 98.17% on the MNIST test data-set, with about 4X fewer parameters compared to state-of-the-art neural networks achieving over 99% accuracy. In addition, a scheme for parallelizing and speeding up the SNN simulation on a GPU platform is presented. The second half of this dissertation presents an optimal hardware design for accelerating SNN inference and training with SRAM (Static Random Access Memory) and nanoscale non-volatile memory (NVM) crossbar arrays. Three prominent NVM devices are studied for realizing hardware accelerators for SNNs: Phase Change Memory (PCM), Spin Transfer Torque RAM (STT-RAM) and Resistive RAM (RRAM). 
The analysis shows that a spike-based inference engine with crossbar arrays of STT-RAM bit-cells is 2X and 5X more efficient compared to PCM and RRAM memories, respectively. Furthermore, the STT-RAM design has nearly 6X higher throughput per unit Watt per unit area than that of an equivalent SRAM-based design. A hardware accelerator with on-chip learning on an STT-RAM memory array is also designed, requiring 16 bits of floating-point synaptic weight precision to reach the baseline SNN algorithmic performance on the MNIST dataset. The complete design with the STT-RAM crossbar array achieves nearly 20X higher throughput per unit Watt per unit mm^2 than an equivalent design with SRAM memory. In summary, this work demonstrates the potential of spike-based neuromorphic computing algorithms and their efficient realization in hardware based on conventional CMOS as well as emerging technologies. The schemes presented here can be further extended to design spike-based systems that can be ubiquitously deployed for energy- and memory-constrained edge computing applications.
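The spike-timing computation at the heart of SNNs can be sketched with a minimal leaky integrate-and-fire neuron. This is an illustrative model with made-up parameters (`v_thresh`, `leak`), not the dissertation's network.

```python
def lif_spikes(input_current, v_thresh=1.0, leak=0.9, v_reset=0.0):
    """Leaky integrate-and-fire neuron: the membrane potential leaks
    each step, integrates the input, and a spike is emitted (and the
    potential reset) when it crosses the threshold."""
    v, spikes = 0.0, []
    for i in input_current:
        v = leak * v + i
        if v >= v_thresh:
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
    return spikes

# Constant drive: the inter-spike interval encodes the input strength.
print(lif_spikes([0.3] * 10))  # → [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
```

In a crossbar realization, the weighted input sum would come from analog current summation along the array columns, with the neuron circuit at the periphery implementing exactly this accumulate-and-fire behaviour.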

    Deployment of Deep Neural Networks on Dedicated Hardware Accelerators

    Deep Neural Networks (DNNs) have established themselves as powerful tools for a wide range of complex tasks, for example computer vision or natural language processing. DNNs are notoriously demanding on compute resources and, as a result, dedicated hardware accelerators are developed for a broad range of use cases. Different accelerators provide solutions ranging from hyperscale cloud environments for the training of DNNs to inference devices in embedded systems. They implement intrinsics for complex operations directly in hardware. A common example are intrinsics for matrix multiplication. However, there exists a gap between the ecosystems of applications for deep learning practitioners and hardware accelerators. How DNNs can efficiently utilize the specialized hardware intrinsics is still mainly defined by human hardware and software experts. Methods to automatically utilize hardware intrinsics in DNN operators are a subject of active research. Existing literature often works with transformation-driven approaches, which aim to establish a sequence of program rewrites and data-layout transformations such that the hardware intrinsic can be used to compute the operator. However, the complexity of this task has not yet been explored, especially for less frequently used operators like Capsule Routing. Not only is the implementation of DNN operators with intrinsics challenging; their optimization on the target device is also difficult. Hardware-in-the-loop tools are often used for this problem. They use latency measurements of implementation candidates to find the fastest one. However, specialized accelerators can have memory and programming limitations, so that not every arithmetically correct implementation is a valid program for the accelerator. These invalid implementations can lead to unnecessarily long optimization times. This work investigates the complexity of transformation-driven processes to automatically embed hardware intrinsics into DNN operators. 
It is explored with a custom, graph-based intermediate representation (IR). While operators like Fully Connected Layers can be handled with reasonable effort, increasing operator complexity or advanced data-layout transformations can lead to scaling issues. Building on these insights, this work proposes a novel method to embed hardware intrinsics into DNN operators. It is based on a dataflow analysis. The dataflow embedding method allows the exploration of how intrinsics and operators match without explicit transformations. From the results it can derive the data layout and program structure necessary to compute the operator with the intrinsic. A prototype implementation for a dedicated hardware accelerator demonstrates state-of-the-art performance for a wide range of convolutions, while being agnostic to the data layout. For some operators in the benchmark, the presented method can also generate alternative implementation strategies to improve hardware utilization, resulting in a geo-mean speed-up of 2.813× while reducing the memory footprint. Lastly, by curating the initial set of possible implementations for the hardware-in-the-loop optimization, the median time-to-solution is reduced by a factor of 2.40. At the same time, the possibility of prolonged searches due to a bad initial set of implementations is reduced, improving the optimization's robustness by a factor of 2.35.
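As a concrete example of a transformation-driven embedding, a 2D convolution can be lowered onto a matrix-multiplication intrinsic via the classic im2col data-layout transformation. This is a generic hand-written sketch of that well-known rewrite, not the dataflow method proposed in the work; shapes and the `im2col` helper are illustrative.

```python
import numpy as np

def im2col(x, kh, kw):
    """Lower a 2D convolution input into a matrix so the convolution
    becomes a single matrix multiplication (the hardware 'intrinsic')."""
    h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((oh * ow, kh * kw))
    for r in range(oh):
        for c in range(ow):
            # Each row of 'cols' is one flattened kh x kw input patch.
            cols[r * ow + c] = x[r:r + kh, c:c + kw].ravel()
    return cols

x = np.arange(16.0).reshape(4, 4)
k = np.ones((3, 3))
# Convolution expressed as one matmul; each entry is a 3x3 window sum.
out = (im2col(x, 3, 3) @ k.ravel()).reshape(2, 2)
```

The cost of such rewrites is exactly the point the thesis makes: the data-layout change (here, duplicating every input patch into a row) must be found automatically and can dominate the engineering effort for less regular operators.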

    Wavelet transform methods for identifying onset of SEMG activity

    Quantifying improvements in motor control is predicated on the accurate identification of the onset of surface electromyographic (sEMG) activity. Applying methods from wavelet theory developed in the past decade to digitized signals, a robust algorithm has been designed for use with sEMG collected during reaching tasks executed with the less-affected arm of stroke patients. The method applied both Discretized Continuous Wavelet Transforms (CWT) and Discrete Wavelet Transforms (DWT), for event detection and no-lag filtering, respectively. Input parameters were extracted from the assessed signals. The onset times found in the sEMG signals using the wavelet method were compared with physiological instants of motion onset, determined from video data. Robustness was evaluated by considering the response of the onset time to variations of the input parameter values. The wavelet method found physiologically relevant onset times in all signals, averaging 147 ms prior to motion onset, compared to predicted onset latencies of 90-110 ms. Latency exhibited slight dependence on subject, but not on other variables.
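A much-simplified sketch of such a wavelet-based onset detector follows: a single-scale Haar detail envelope is thresholded against a quiet baseline segment. This is a stand-in for the paper's multi-scale CWT/DWT combination; the `onset_index` function, its parameters and the synthetic signal are all illustrative assumptions.

```python
import numpy as np

def onset_index(signal, scale=8, k=5.0, baseline=100):
    """Flag onset as the first sample where a smoothed single-scale
    Haar wavelet detail envelope exceeds k times its mean over a
    quiet baseline segment at the start of the recording."""
    # Haar-like detail: difference of adjacent block means.
    kernel = np.concatenate([np.ones(scale), -np.ones(scale)]) / scale
    detail = np.abs(np.convolve(signal, kernel, mode="same"))
    # Smooth the rectified detail so isolated noise spikes cannot trigger.
    env = np.convolve(detail, np.ones(2 * scale) / (2 * scale), mode="same")
    thresh = k * env[:baseline].mean()
    above = np.nonzero(env > thresh)[0]
    return int(above[0]) if above.size else None

rng = np.random.default_rng(1)
quiet = 0.05 * rng.standard_normal(1000)   # baseline noise
burst = np.where(np.arange(1000) >= 400, 1.0, 0.0) * rng.standard_normal(1000)
print(onset_index(quiet + burst))  # close to the true onset at sample 400
```

The robustness question studied in the paper corresponds here to how the detected index shifts as `scale` and `k` vary.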

    Modelling and Simulation for Power Distribution Grids of 3D Tiled Computing Arrays

    This thesis presents modelling and simulation developments for power distribution grids of 3D tiled computing arrays (TCAs), a novel paradigm for HPC systems, and tests the feasibility of such systems for HPC domains. The exploration of a complex power grid such as those found in the TCA concept requires detailed simulations of systems with hundreds and possibly thousands of modular nodes, each contributing to the collective behaviour of the system. Power, voltage, and current behaviours are critically important observations. To facilitate this investigation, and to test the hypothesis, which asks whether such systems can scale feasibly, a bespoke simulation platform has been developed and, importantly, validated against hardware prototypes of small systems. A number of systems are simulated, including systems consisting of arrays of ’balls’. Balls are collections of modular tiles that form a ball-like modular unit, and can then themselves be tiled into large-scale systems. Evaluations typically involved simulation of cubic arrays of sizes ranging from 2x2x2 balls up to 10x10x10. Larger systems require extended simulation times; therefore, models are developed to extrapolate system behaviours for higher-order systems and to gauge the ultimate scalability of such TCA systems. It is found that systems of 40x40x40 balls are quite feasible with appropriate configurations. Data connectivity is explored to a lesser degree, but comparisons were made between TCA systems and well-known comparable HPC systems. It is concluded that TCA systems can be built with comparable data flow and scalability, and that the electrical and engineering challenges associated with the novelty of 3D tiled systems can be met with practical solutions.
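The core of such a power-grid simulation is nodal analysis: solve G·v = i for the node voltages of a resistive supply mesh with distributed current sinks. The sketch below is a generic 2D toy version under assumed values (resistance `r`, per-node draw `i_node`, one supply pin), not the thesis's 3D platform.

```python
import numpy as np

def grid_voltage_drop(n, r=0.05, i_node=0.2, v_supply=1.0):
    """IR-drop in an n x n resistive power grid: each node draws
    i_node amps, neighbours are linked by resistance r, and node
    (0, 0) is tied to the supply. Builds the conductance matrix,
    solves G @ v = i, and returns the worst-case node voltage."""
    idx = lambda x, y: x * n + y
    g = np.zeros((n * n, n * n))
    for x in range(n):
        for y in range(n):
            for dx, dy in ((1, 0), (0, 1)):   # east and south neighbours
                if x + dx < n and y + dy < n:
                    a, b = idx(x, y), idx(x + dx, y + dy)
                    g[a, a] += 1 / r; g[b, b] += 1 / r
                    g[a, b] -= 1 / r; g[b, a] -= 1 / r
    i = np.full(n * n, -i_node)               # every node sinks current
    # Pin node (0, 0) to the supply rail.
    g[0, :] = 0.0; g[0, 0] = 1.0; i[0] = v_supply
    v = np.linalg.solve(g, i)
    return v.min()

print(f"worst-case voltage in a 4x4 grid: {grid_voltage_drop(4):.3f} V")
```

The scaling problem the thesis tackles is visible even here: the dense system grows as the square of the node count, which is why extrapolation models are needed for 40x40x40-ball arrays.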

    Efficient Algorithms for Coastal Geographic Problems

    The increasing performance of computers has made it possible to solve algorithmically problems for which manual and possibly inaccurate methods have previously been used. Nevertheless, one must still pay attention to the performance of an algorithm if huge datasets are used or if the problem is computationally difficult. Two geographic problems are studied in the articles included in this thesis. In the first problem the goal is to determine distances from points, called study points, to shorelines in predefined directions. Together with other information, mainly related to wind, these distances can be used to estimate wave exposure at different areas. In the second problem the input consists of a set of sites where water quality observations have been made and of the results of the measurements at the different sites. The goal is to select a subset of the observational sites in such a manner that water quality is still measured with sufficient accuracy when monitoring at the other sites is stopped to reduce economic cost. Most of the thesis concentrates on the first problem, known as the fetch length problem. The main challenge is that the two-dimensional map is represented as a set of polygons with millions of vertices in total, and the distances may also be computed for millions of study points in several directions. Efficient algorithms are developed for the problem, one of them approximate and the others exact except for rounding errors. The solutions also differ in that three of them are targeted for serial operation or for a small number of CPU cores, whereas one, together with its further developments, is suitable also for parallel machines such as GPUs. In the water quality problem, the given set of stations has a large number of possible subsets. In addition, the task involves time-consuming operations such as linear regression, which further limits how many subsets can be examined. The solution therefore uses heuristics, which do not necessarily produce an optimal result.
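The geometric core of the fetch length problem is a ray-versus-segment query: from a study point, cast a ray in a fixed bearing and find the nearest shoreline crossing. A naive reference version (the `fetch_length` helper and the toy shoreline are illustrative, and the thesis's algorithms avoid exactly this brute-force scan over all segments) can be written as:

```python
import math

def fetch_length(px, py, angle, shoreline):
    """Distance from study point (px, py) along a bearing to the
    nearest shoreline segment; 'shoreline' is a list of segments
    ((x1, y1), (x2, y2)). Returns inf if the ray hits nothing."""
    dx, dy = math.cos(angle), math.sin(angle)
    best = math.inf
    for (x1, y1), (x2, y2) in shoreline:
        ex, ey = x2 - x1, y2 - y1
        denom = dx * ey - dy * ex
        if abs(denom) < 1e-12:            # ray parallel to segment
            continue
        # Solve p + t*d = s1 + u*e for t (along ray) and u (along segment).
        t = ((x1 - px) * ey - (y1 - py) * ex) / denom
        u = ((x1 - px) * dy - (y1 - py) * dx) / denom
        if t > 0 and 0 <= u <= 1:
            best = min(best, t)
    return best

# Unit-height square "island" 10 units east of the study point.
square = [((10, -1), (10, 1)), ((10, 1), (12, 1)),
          ((12, 1), (12, -1)), ((12, -1), (10, -1))]
print(fetch_length(0, 0, 0.0, square))  # → 10.0
```

With millions of vertices and millions of study points, this O(points × directions × segments) loop is exactly what the thesis's exact and approximate algorithms are designed to beat.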

    Tools and Algorithms for the Construction and Analysis of Systems

    This open access book constitutes the proceedings of the 28th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2022, which was held during April 2-7, 2022, in Munich, Germany, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022. The 46 full papers and 4 short papers presented in this volume were carefully reviewed and selected from 159 submissions. The proceedings also contain 16 tool papers of the affiliated competition SV-COMP and 1 paper consisting of the competition report. TACAS is a forum for researchers, developers, and users interested in rigorously based tools and algorithms for the construction and analysis of systems. The conference aims to bridge the gaps between different communities with this common interest and to support them in their quest to improve the utility, reliability, flexibility, and efficiency of tools and algorithms for building computer-controlled systems.

    A Proposal for a Three Detector Short-Baseline Neutrino Oscillation Program in the Fermilab Booster Neutrino Beam

    A Short-Baseline Neutrino (SBN) physics program of three LAr-TPC detectors located along the Booster Neutrino Beam (BNB) at Fermilab is presented. This new SBN Program will deliver a rich and compelling physics opportunity, including the ability to resolve a class of experimental anomalies in neutrino physics and to perform the most sensitive search to date for sterile neutrinos at the eV mass-scale through both appearance and disappearance oscillation channels. Using data sets of 6.6e20 protons on target (P.O.T.) in the LAr1-ND and ICARUS T600 detectors plus 13.2e20 P.O.T. in the MicroBooNE detector, we estimate that a search for muon neutrino to electron neutrino appearance can be performed with ~5 sigma sensitivity for the LSND allowed (99% C.L.) parameter region. In this proposal for the SBN Program, we describe the physics analysis, the conceptual design of the LAr1-ND detector, the design and refurbishment of the T600 detector, the necessary infrastructure required to execute the program, and a possible reconfiguration of the BNB target and horn system to improve its performance for oscillation searches. (209 pages, 129 figures)
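The appearance sensitivity quoted above rests on the standard two-flavor oscillation approximation, P = sin²(2θ) · sin²(1.27 Δm² L/E) with Δm² in eV², L in km and E in GeV. The sketch below evaluates it at illustrative eV-scale parameters and SBN-like baselines (roughly 0.1-0.6 km at ~0.7 GeV); the `p_appearance` helper and the chosen mixing values are assumptions, not numbers from the proposal.

```python
import math

def p_appearance(sin2_2theta, dm2_ev2, l_km, e_gev):
    """Two-flavor nu_mu -> nu_e appearance probability:
    P = sin^2(2*theta) * sin^2(1.27 * dm^2 * L / E)."""
    return sin2_2theta * math.sin(1.27 * dm2_ev2 * l_km / e_gev) ** 2

# Illustrative sterile-neutrino point evaluated at three baselines.
for l in (0.11, 0.47, 0.60):
    print(f"L = {l:.2f} km: P ~ {p_appearance(0.003, 1.2, l, 0.7):.5f}")
```

The multi-detector layout exploits exactly this L dependence: a near detector pins down the unoscillated spectrum while the farther detectors sit where an eV-scale Δm² would produce a measurable excess.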
