258 research outputs found

    Low Power Circuits for Smart Flexible ECG Sensors

    Cardiovascular diseases (CVDs) are the world's leading cause of death. In-home heart condition monitoring effectively reduces the hospitalization rate of CVD patients. Flexible electrocardiogram (ECG) sensors provide an affordable, convenient, and comfortable in-home monitoring solution. The three critical building blocks of the ECG sensor, i.e., the analog front-end (AFE), the QRS detector, and the cardiac arrhythmia classifier (CAC), are studied in this research. A fully differential difference amplifier (FDDA) based AFE that employs a DC-coupled input stage increases the input impedance and improves the CMRR. A parasitic-capacitor reuse technique is proposed to improve the noise/area efficiency and CMRR. An on-body DC bias scheme is introduced to deal with the input DC offset. Implemented in a 0.35 μm CMOS process with an area of 0.405 mm², the proposed AFE consumes 0.9 μW at 1.8 V and shows an excellent noise efficiency factor of 2.55 and a CMRR of 76 dB. Experiments show that the proposed AFE not only picks up a clean ECG signal with electrodes placed as close as 2 cm apart under both resting and walking conditions, but also obtains the distinct α-wave after eye blinks from an EEG recording. A personalized QRS detection algorithm is proposed that achieves an average positive prediction rate of 99.39% and a sensitivity of 99.21%. The user-specific template avoids the complicated models and parameters used in existing algorithms while covering most situations encountered in practical applications. Detection is based on the correlation coefficient between the user-specific template and the ECG segment under detection. The proposed one-target clustering reduces the number of required iterations. A continuous-in-time, discrete-in-amplitude (CTDA) artificial neural network (ANN) based CAC is proposed for the smart ECG sensor. The proposed CAC achieves over 98% classification accuracy for the four types of beats defined by the AAMI (Association for the Advancement of Medical Instrumentation). The CTDA scheme significantly reduces the number of input samples and simplifies the sample representation to one bit. Thus, the number of arithmetic operations and the ANN structure are greatly simplified. The proposed CAC is verified on FPGA and implemented in a 0.18 μm CMOS process. Simulation results show it can operate at clock frequencies from 10 kHz to 50 MHz. The average power for a patient with a 75 bpm heart rate is 13.34 μW.
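
    The QRS detection step described above lends itself to a short illustration. The following is a minimal Python sketch of correlation-based template matching, not the thesis implementation: the window length, the threshold value, and the omission of one-target clustering and refractory-period handling are all simplifying assumptions.

```python
import numpy as np

def detect_qrs(ecg, template, threshold=0.9):
    """Flag candidate QRS complexes where the Pearson correlation between a
    user-specific template and the ECG segment under detection is high.

    Sketch only: the actual algorithm also builds the template with
    one-target clustering and suppresses duplicate detections."""
    n = len(template)
    t = (template - template.mean()) / template.std()
    hits = []
    for i in range(len(ecg) - n):
        seg = ecg[i:i + n]
        s = (seg - seg.mean()) / (seg.std() + 1e-12)
        r = np.dot(t, s) / n          # Pearson correlation coefficient
        if r > threshold:
            hits.append(i)            # start index of a matching segment
    return hits
```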

    Improving the Hardware Performance of Arithmetic Circuits using Approximate Computing

    An application that can produce a useful result despite some level of computational error is said to be error resilient. Approximate computing can be applied to error-resilient applications by intentionally introducing error into the computation in order to improve performance, and it has been shown that approximation is especially well suited to arithmetic computing hardware. In this thesis, novel approximate arithmetic architectures are proposed for three different operations, namely multiplication, division, and the multiply-accumulate (MAC) operation. For all designs, accuracy is evaluated in terms of mean relative error distance (MRED) and normalized mean error distance (NMED), while hardware performance is reported in terms of critical path delay, area, and power consumption. Three approximate Booth multipliers (ABM-M1, ABM-M2, ABM-M3) are designed in which two novel inexact partial product generators are used to reduce the dimensions of the partial product matrix. The proposed multipliers are compared to other state-of-the-art designs in terms of both accuracy and hardware performance, and are found to reduce power consumption by up to 56% compared to the exact multiplier. The function of the multipliers is verified in several image processing applications. Two approximate restoring dividers (AXRD-M1, AXRD-M2) are proposed along with a novel inexact restoring divider cell. In the first divider, the conventional cells are replaced with the proposed inexact cells in several columns. The second divider computes only a subset of the trial subtractions, after which the divisor and partial remainder are rounded and encoded so that they may be used to estimate the remaining quotient bits. The proposed dividers are evaluated for accuracy and hardware performance alongside several benchmark designs, and their function is verified using change detection and foreground extraction applications. An approximate MAC unit is presented in which the multiplication is implemented using a modified version of ABM-M3. The delay is reduced by using a fused architecture in which the addend is accumulated as part of the multiplier's partial product compression. The accuracy and hardware savings of the MAC unit are measured against several works from the literature, and the design is utilized in a number of convolution operations.
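
    For reference, the two accuracy metrics named above can be computed as in the Python sketch below, using the usual definitions (error distance relative to the exact result for MRED, and normalized by the largest exact output for NMED). The toy operand-truncation multiplier is a stand-in for illustration only, not one of the proposed designs.

```python
import numpy as np

def error_metrics(exact, approx):
    """MRED: mean of |error| / |exact| over non-zero exact results.
    NMED: mean |error| normalized by the maximum exact output."""
    exact = np.asarray(exact, dtype=float)
    approx = np.asarray(approx, dtype=float)
    ed = np.abs(exact - approx)                       # error distance
    nz = exact != 0
    mred = np.mean(ed[nz] / np.abs(exact[nz]))
    nmed = np.mean(ed) / np.max(np.abs(exact))
    return mred, nmed

# Toy approximate multiplier: truncate the 4 LSBs of each operand.
rng = np.random.default_rng(0)
a = rng.integers(1, 256, 10_000)
b = rng.integers(1, 256, 10_000)
print(error_metrics(a * b, (a & ~0xF) * (b & ~0xF)))
```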

    The S2 VLBI Correlator: A Correlator for Space VLBI and Geodetic Signal Processing

    We describe the design of a correlator system for ground- and space-based VLBI. The correlator contains unique signal processing functions: flexible LO frequency switching for bandwidth synthesis; 1 ms dump intervals; multi-rate digital signal-processing techniques to allow correlation of signals at different sample rates; and a digital filter for very high resolution cross-power spectra. It also includes autocorrelation, tone extraction, pulsar gating, and signal-statistics accumulation. Comment: 44 pages, 13 figures.
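
    As a rough, behavioural illustration of the core operation (not the S2 hardware, which may use a different correlator architecture and adds LO switching, multi-rate resampling, gating, and tone extraction), the Python sketch below accumulates cross-power spectra from two digitized station signals over 1 ms dump intervals; the sample rate and FFT length are assumed values.

```python
import numpy as np

def cross_power_spectra(x, y, fs=16e6, nfft=1024, dump_ms=1.0):
    """Accumulate X * conj(Y) spectra over successive dump intervals."""
    per_dump = int(fs * dump_ms / 1000)
    dumps = []
    for start in range(0, len(x) - per_dump + 1, per_dump):
        offsets = range(start, start + per_dump - nfft + 1, nfft)
        acc = np.zeros(nfft, dtype=complex)
        for s in offsets:
            acc += np.fft.fft(x[s:s + nfft]) * np.conj(np.fft.fft(y[s:s + nfft]))
        dumps.append(acc / max(len(offsets), 1))
    return dumps

rng = np.random.default_rng(1)
sig = rng.standard_normal(64_000)
spectra = cross_power_spectra(sig, np.roll(sig, 3))   # 4 dumps of 1 ms each
```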

    Flexible Multiple-Precision Fused Arithmetic Units for Efficient Deep Learning Computation

    Deep learning has achieved great success in recent years. In many application fields, such as computer vision, biomedical analysis, and natural language processing, deep learning can achieve performance that is even better than human level. However, behind this superior performance is the expensive hardware cost required to implement deep learning operations, which are both computation intensive and memory intensive. Many research works in the literature have focused on improving the efficiency of deep learning operations. In this thesis, special focus is put on improving deep learning computation, and several efficient arithmetic unit architectures are proposed and optimized for it. The contents of this thesis can be divided into three parts: (1) the optimization of general-purpose arithmetic units for deep learning computation; (2) the design of deep-learning-specific arithmetic units; and (3) the optimization of deep learning computation using a 3D memory architecture. Deep learning models are usually trained on graphics processing units (GPUs) and the computations are done with single-precision floating-point numbers. However, recent works have shown that deep learning computation can be accomplished with low-precision numbers. Half-precision numbers are becoming more and more popular in deep learning computation due to their lower hardware cost compared to single-precision numbers. Conventional floating-point arithmetic units support single precision and beyond to achieve better precision; for deep learning computation, however, the computations are intensive and low-precision computation is desired to achieve better throughput. As the popularity of half precision rises, half-precision operations also need to be supported. Moreover, deep learning computation contains many dot-product operations, and therefore the support of mixed-precision dot-product operations can be explored in a multiple-precision architecture. In this thesis, a multiple-precision fused multiply-add (FMA) architecture is proposed. It supports half-, single-, double-, and quadruple-precision FMA operations, as well as 2-term mixed-precision dot-product operations. Compared to the conventional multiple-precision FMA architecture, the newly added half-precision support and mixed-precision dot product bring only minor resource overhead. The proposed FMA can be used as a general-purpose arithmetic unit. Due to its support for parallel half-precision computations and mixed-precision dot-product computations, it is especially suitable for deep learning computation. For deep-learning-specific computation units, more optimizations can be performed. First, a merged fixed-point and floating-point multiply-accumulate (MAC) unit is proposed. As deep learning computation can be accomplished with low-precision number formats, the support of high-precision floating-point operations can be eliminated. In this design, the half-precision floating-point format is supported to provide the large dynamic range needed to handle the small gradients encountered in deep learning training. For deep learning inference, 8-bit fixed-point 2-term dot-product computation is supported. Second, a flexible multiple-precision MAC unit architecture is proposed that supports both fixed-point and floating-point operations. For the floating-point format, the proposed unit supports one 16-bit MAC operation or the sum of two 8-bit multiplications plus a 16-bit addend. To make the proposed MAC unit more versatile, the bit-widths of the exponent and mantissa can be flexibly exchanged, and by setting the exponent bit-width to zero, the unit also supports fixed-point operations. For the fixed-point format, the proposed unit supports one 16-bit MAC or the sum of two 8-bit multiplications plus a 16-bit addend; it can be further divided to support the sum of four 4-bit multiplications plus a 16-bit addend. At the lowest precision, the proposed MAC unit supports the accumulation of eight 1-bit logic AND operations to enable binary neural networks. Finally, a MAC architecture based on the posit format, a promising numerical format for deep learning computation, is proposed to facilitate the use of posits in deep learning computation. In addition to the above-mentioned arithmetic units, an improved hybrid memory cube (HMC) architecture is proposed for weight-sharing deep neural network processing. By modifying the HMC instruction set and logic layer, the major part of the deep learning computation can be accomplished inside memory. The proposed design reduces the memory bandwidth requirements and thus the energy consumed by memory data transfer.
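
    To make one of the precision modes concrete, here is a behavioural Python sketch (not the proposed hardware, which fuses the operation with a single rounding) of a 2-term mixed-precision dot product: two half-precision products accumulated onto a single-precision addend. The half-in, single-out precision combination is an assumption made for the example.

```python
import numpy as np

def dot2_mixed(a, b, c, d, addend):
    """Behavioural model of addend + a*b + c*d with half-precision (float16)
    inputs and single-precision (float32) accumulation."""
    a, b, c, d = (np.float16(v) for v in (a, b, c, d))
    acc = np.float32(addend)
    acc = acc + np.float32(a) * np.float32(b)   # widen each product to float32
    acc = acc + np.float32(c) * np.float32(d)
    return acc

print(dot2_mixed(1.5, 2.25, 0.125, -4.0, 10.0))   # -> 12.875
```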

    An instruction systolic array architecture for multiple neural network types

    Modern electronic systems, especially sensor and imaging systems, are beginning to incorporate their own neural network subsystems. For these neural systems to learn in real time, they must be implemented using VLSI technology, with as much of the learning process as possible incorporated on-chip. The majority of current VLSI implementations implement a series of neural processing cells, which can be connected together in an arbitrary fashion. Many do not perform the entire neural learning process on-chip, instead relying on other external systems to carry out part of the computational requirements of the algorithm. The work presented here utilises two-dimensional instruction systolic arrays in an attempt to define a general neural architecture which is closer to the biological basis of neural networks - it is the synapses themselves, rather than the neurons, that have dedicated processing units. A unified architecture is described which can be programmed at the microcode level in order to facilitate the processing of multiple neural network types. An essential part of neural network processing is the neuron activation function, which can range from a sequential algorithm to a discrete mathematical expression. The architecture presented can easily carry out the sequential functions, and introduces a fast method of mathematical approximation for the more complex functions. This can be evaluated on-chip, thus implementing the entire neural process within a single system. VHDL circuit descriptions for the chip have been generated, and the systolic processing algorithms and associated microcode instruction set for three different neural paradigms have been designed. A software simulator of the architecture has been written, giving results for several common applications in the field.
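
    The fast method of mathematical approximation is not spelled out in the abstract; as an illustration of the general idea, the Python sketch below evaluates the sigmoid activation with a commonly used piecewise-linear scheme, so that each evaluation needs only one multiply and one add, the kind of operation that maps well onto simple on-chip processing elements. The breakpoints and slopes are not taken from the thesis.

```python
def sigmoid_pwl(x):
    """Piecewise-linear approximation of the logistic sigmoid (illustrative)."""
    if x < -5.0:
        return 0.0
    if x < -2.375:
        return 0.03125 * x + 0.15625
    if x < -1.0:
        return 0.125 * x + 0.375
    if x < 1.0:
        return 0.25 * x + 0.5
    if x < 2.375:
        return 0.125 * x + 0.625
    if x < 5.0:
        return 0.03125 * x + 0.84375
    return 1.0

print(sigmoid_pwl(0.0), sigmoid_pwl(2.0))   # 0.5, 0.875 (exact: 0.5, ~0.881)
```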

    Improving Compute & Data Efficiency of Flexible Architectures


    Flexible Computing Systems For AI Acceleration At The Extreme Edge Of The IoT

    Embedding intelligence in extreme edge devices allows distilling raw data acquired from sensors into actionable information, directly on IoT end-nodes. This computing paradigm, in which end-nodes no longer depend entirely on the Cloud, offers undeniable benefits and is driving a large research area (TinyML) to deploy leading Machine Learning (ML) algorithms on microcontroller-class devices. To fit the limited memory storage capability of these tiny platforms, full-precision Deep Neural Networks (DNNs) are compressed by representing their data down to byte and sub-byte integer formats, yielding Quantized Neural Networks (QNNs). However, the current generation of microcontroller systems can barely cope with the computing requirements of QNNs. This thesis tackles the challenge from many perspectives, presenting solutions at both the software and hardware levels, exploiting parallelism, heterogeneity, and software programmability to guarantee high flexibility and high energy-performance proportionality. The first contribution, PULP-NN, is an optimized software computing library for QNN inference on parallel ultra-low-power (PULP) clusters of RISC-V processors, showing one order of magnitude improvement in performance and energy efficiency compared to current State-of-the-Art (SoA) STM32 microcontroller units (MCUs) based on ARM Cortex-M cores. The second contribution is XpulpNN, a set of RISC-V domain-specific instruction set architecture (ISA) extensions for sub-byte integer arithmetic computation. The solution, including the ISA extensions and the micro-architecture to support them, achieves energy efficiency comparable with dedicated DNN accelerators and surpasses the efficiency of SoA ARM Cortex-M based MCUs, such as the low-end STM32L4 and the high-end STM32H7 devices, by up to three orders of magnitude. To overcome the Von Neumann bottleneck while guaranteeing the highest flexibility, the final contribution integrates an Analog In-Memory Computing accelerator into the PULP cluster, creating a fully programmable heterogeneous fabric that demonstrates end-to-end inference of SoA MobileNetV2 models, with two orders of magnitude performance improvement over current SoA analog/digital solutions.
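
    As a behavioural illustration of sub-byte QNN arithmetic (not the PULP-NN kernels or the XpulpNN instructions themselves), the Python sketch below packs signed 4-bit weights two per byte and computes a dot product with 8-bit activations in a wide accumulator; the packing layout and function names are assumptions made for the example.

```python
import numpy as np

def pack_int4(weights):
    """Pack signed 4-bit weights two per byte, low nibble first."""
    w = (np.asarray(weights, dtype=np.int8) & 0xF).astype(np.uint8)
    return w[0::2] | (w[1::2] << 4)

def dot_u8_i4(activations, packed_weights):
    """Dot product of 8-bit activations with packed signed 4-bit weights.
    Accumulation is done in a wide integer, as a QNN kernel would do in a
    32-bit register."""
    acc = 0
    for i, byte in enumerate(packed_weights):
        b = int(byte)
        for k, nib in enumerate((b & 0xF, b >> 4)):
            w = nib - 16 if nib >= 8 else nib          # sign-extend 4-bit value
            acc += int(activations[2 * i + k]) * w
    return acc

acts = np.array([3, 1, 4, 1, 5, 9], dtype=np.uint8)
wts = [1, -2, 3, -4, 5, -6]                            # signed 4-bit weights
print(dot_u8_i4(acts, pack_int4(wts)))                 # -> -20
```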

    A mixed-charge cluster facilitates glutathione transferase dimerisation

    Student Number: 0213014A - MSc dissertation - School of Molecular and Cell Biology - Faculty of Science.
    Cytosolic glutathione transferases (GSTs) are obligate, stable homo- and heterodimers comprising two GST subunits. Interactions across the subunit interface play an important role in stabilising the subunit tertiary structure and in maintaining the dimeric structure required for activity. The crystal structure of a rat Mu class GST consisting of two type one subunits (rGST M1-1) reveals a lock-and-key motif and a mixed-charge cluster at the subunit interface. Previous investigations revealed that the lock-and-key motif was not essential for dimerisation. It was therefore postulated that the mixed-charge cluster at the dimer interface is primarily responsible for subunit association. Statistical analyses of individual rGST M1-1 chains did not predict the presence of any charge clusters. This suggests that the mixed-charge cluster forms only upon dimerisation and reinforces the probability that quaternary structure stabilisation is a major role of the mixed-charge cluster. Arginine 81 (Arg-81), a structurally conserved residue in the GST family involved in the mixed-charge cluster, was mutated to alanine. Phenylalanine 56 (Phe-56), the ‘key’ residue in the lock-and-key motif, was mutated to serine. These changes were engineered to disrupt the mixed-charge cluster and the lock-and-key motif situated at the dimer interface of rGST M1-1. Sizing of the mutant GST by gel filtration chromatography showed that these engineered mutations resulted in a stable monomeric protein (F56S/R81A rGST M1). F56S/R81A rGST M1 displayed almost no catalytic activity, suggesting perturbations of the active site or substrate binding sites. Structural investigations of the monomer by far- and near-UV circular dichroism revealed a secondary structural content similar to that of the wild type. However, the tryptophan fluorescence properties suggested that the tryptophans were situated in more hydrophilic environments than in the wild type. ANS binding studies indicated a large increase in the accessible hydrophobic surface area of the monomer. Urea-induced equilibrium unfolding of F56S/R81A rGST M1 follows a cooperative two-state unfolding model. The unfolding data indicate decreased conformational stability and a large increase in the solvent-exposed surface area of the monomer. In conclusion, the mixed-charge cluster at the dimer interface of rGST M1-1 is essential for the association of the monomers into dimers, which subsequently contributes to the catalytic activity of the dimer and the stability of the individual rGST M1-1 subunits.

    Disentangling the Role of CHOP in Mitochondrial Dysfunction

    Maintenance of mitochondrial homeostasis is essential for a broad spectrum of signalling, metabolic and energetic processes. Consequently, mitochondrial dysfunction is linked to the development of a wide range of myopathies and many common diseases, including type 2 diabetes, Parkinson's and Alzheimer's diseases. In response to disturbed mitochondrial proteostasis, an organelle-specific stress response is initiated, which results in an adaptive transcriptional response partially sharing the signature of the integrated stress response (ISR). However, the exact sequence of events in the signalling cascade resulting in the activation of a nuclear response remains elusive. CHOP was one of the first transcription factors (TFs) proposed to play a role in the response to impaired mitochondrial proteostasis, although, due to the lack of a functional DNA-binding domain, CHOP needs to form heterodimers with other TFs in order to activate or suppress its respective target genes. The present study aims to investigate the molecular aspects and in vivo functions of CHOP in a murine model of mitochondrial dysfunction. To this end, DARS2/CHOP double-deficient mice, from here on referred to as double knock-out (DKO) mice, were generated. Disruption of mitochondrial translation by heart- and skeletal-muscle-specific knock-out of the mitochondrial aspartyl-tRNA synthetase Dars2 (DARS2 KO) results in severe mitochondrial dysfunction and causes the death of the animals at approximately seven weeks of age. Additional deletion of Chop further reduces the lifespan to less than three weeks, suggesting an essential role of the TF within the initiated stress-signalling pathway. Our data indicate that CHOP's impact arises from the regulation of another TF: ATF4. The analysis of transcriptomic data uncovered excessive transcriptional activation of ATF4 targets in DKO mice. These massive changes were further confirmed at the protein level and coincide with the rapid deterioration of the animals' health status. Co-immunoprecipitation experiments revealed the TF C/EBPβ as the most abundant CHOP interactor in hearts of DARS2 KO mice. Further experiments in cell culture showed that mitochondrial dysfunction normally triggers CHOP and C/EBPβ protein expression. C/EBPβ has three isoforms: LAP*, LAP and LIP. In comparison to the isoforms LAP* and LAP, LIP exhibited a disproportionate increase under conditions of mitochondrial dysfunction. Our experiments confirmed opposite effects of LAP and LIP on Atf4 transcription: whereas LAP promoted Atf4 transcription, LIP acted as a transcriptional repressor of Atf4. Notably, CHOP deficiency in the context of mitochondrial dysfunction resulted in an impaired response of LIP in particular, which failed to increase at the protein level as observed under wild-type-like conditions. Hence, we propose that the impaired accumulation of LIP upon mitochondrial dysfunction in a CHOP-deficient background results in a loss of negative regulation of Atf4 transcription in DKO animals. As a result, ATF4 is excessively active and causes anabolic overstress in the mice. We propose to complement the current ISR model with a supplementary CHOP- and LIP-driven regulatory layer contributing to the transcriptional control of Atf4 in the context of mitochondrial dysfunction.

    Dynamically reconfigurable bio-inspired hardware

    During the last several years, reconfigurable computing devices have experienced an impressive development in their resource availability, speed, and configurability. Currently, commercial FPGAs offer the possibility of self-reconfiguring by partially modifying their configuration bitstream, providing high architectural flexibility while guaranteeing high performance. These configurability features have received special interest from computer architects: one can find several reconfigurable coprocessor architectures for cryptographic algorithms, image processing, automotive applications, and different general-purpose functions. On the other hand, there is bio-inspired hardware, a large research field taking inspiration from living beings in order to design hardware systems, which includes diverse topics: evolvable hardware, neural hardware, cellular automata, and fuzzy hardware, among others. Living beings are well known for their high adaptability to environmental changes, featuring very flexible adaptations at several levels. Bio-inspired hardware systems require such flexibility to be provided by the hardware platform on which the system is implemented. In general, bio-inspired hardware has been implemented on both custom and commercial hardware platforms. The custom platforms are specifically designed to support bio-inspired hardware systems, typically featuring special cellular architectures and enhanced reconfigurability capabilities, such as partial and dynamic reconfigurability. These aspects are highly valued for providing the performance and the high architectural flexibility required by bio-inspired systems. However, the limited availability and the very high cost of such custom devices make them accessible to only a few research groups. Even though some commercial FPGAs provide enhanced reconfigurability features such as partial and dynamic reconfiguration, their utilization is still in its early stages and not well supported by FPGA vendors, making them difficult to include in existing bio-inspired systems. In this thesis, I present a set of architectures, techniques, and methodologies for benefiting from the configurability advantages of current commercial FPGAs in the design of bio-inspired hardware systems. Among the presented architectures are neural networks, spiking neuron models, fuzzy systems, cellular automata, and random Boolean networks. For these architectures, I propose several techniques for parametric and topological adaptation, such as Hebbian learning, evolutionary and co-evolutionary algorithms, and particle swarm optimization. Finally, as a case study I consider the implementation of bio-inspired hardware systems on two platforms: YaMoR (Yet another Modular Robot) and ROPES (Reconfigurable Object for Pervasive Systems), the development of both platforms having been co-supervised in the framework of this thesis.
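
    Of the adaptation techniques listed, Hebbian learning is the simplest to show in a few lines. The Python sketch below applies the textbook rule Δw = η·x·y to the weights of one linear layer; it is a generic illustration, not the hardware implementation developed in the thesis.

```python
import numpy as np

def hebbian_update(weights, x, y, lr=0.01):
    """Basic Hebbian rule: strengthen w_ij in proportion to the co-activation
    of presynaptic input x_j and postsynaptic output y_i."""
    return weights + lr * np.outer(y, x)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(2, 3))   # 2 neurons, 3 inputs
x = np.array([1.0, 0.0, 1.0])
y = w @ x                                # postsynaptic activity (linear neurons)
w = hebbian_update(w, x, y)
```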