9,283 research outputs found

    An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics

    Full text link
    Near-sensor data analytics is a promising direction for IoT endpoints, as it minimizes energy spent on communication and reduces network load - but it also poses security concerns, as valuable data is stored or sent over the network at various stages of the analytics pipeline. Using encryption to protect sensitive data at the boundary of the on-chip analytics engine is a way to address data security issues. To cope with the combined workload of analytics and encryption in a tight power envelope, we propose Fulmine, a System-on-Chip based on a tightly-coupled multi-core cluster augmented with specialized blocks for compute-intensive data processing and encryption functions, supporting software programmability for regular computing tasks. The Fulmine SoC, fabricated in 65nm technology, consumes less than 20mW on average at 0.8V achieving an efficiency of up to 70pJ/B in encryption, 50pJ/px in convolution, or up to 25MIPS/mW in software. As a strong argument for real-life flexible application of our platform, we show experimental results for three secure analytics use cases: secure autonomous aerial surveillance with a state-of-the-art deep CNN consuming 3.16pJ per equivalent RISC op; local CNN-based face detection with secured remote recognition in 5.74pJ/op; and seizure detection with encrypted data collection from EEG within 12.7pJ/op.Comment: 15 pages, 12 figures, accepted for publication to the IEEE Transactions on Circuits and Systems - I: Regular Paper

    Design of multimedia processor based on metric computation

    Get PDF
    Media-processing applications, such as signal processing, 2D and 3D graphics rendering, and image compression, are the dominant workloads in many embedded systems today. The real-time constraints of those media applications have taxing demands on today's processor performances with low cost, low power and reduced design delay. To satisfy those challenges, a fast and efficient strategy consists in upgrading a low cost general purpose processor core. This approach is based on the personalization of a general RISC processor core according the target multimedia application requirements. Thus, if the extra cost is justified, the general purpose processor GPP core can be enforced with instruction level coprocessors, coarse grain dedicated hardware, ad hoc memories or new GPP cores. In this way the final design solution is tailored to the application requirements. The proposed approach is based on three main steps: the first one is the analysis of the targeted application using efficient metrics. The second step is the selection of the appropriate architecture template according to the first step results and recommendations. The third step is the architecture generation. This approach is experimented using various image and video algorithms showing its feasibility

    NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps

    Get PDF
    Convolutional neural networks (CNNs) have become the dominant neural network architecture for solving many state-of-the-art (SOA) visual processing tasks. Even though Graphical Processing Units (GPUs) are most often used in training and deploying CNNs, their power efficiency is less than 10 GOp/s/W for single-frame runtime inference. We propose a flexible and efficient CNN accelerator architecture called NullHop that implements SOA CNNs useful for low-power and low-latency application scenarios. NullHop exploits the sparsity of neuron activations in CNNs to accelerate the computation and reduce memory requirements. The flexible architecture allows high utilization of available computing resources across kernel sizes ranging from 1x1 to 7x7. NullHop can process up to 128 input and 128 output feature maps per layer in a single pass. We implemented the proposed architecture on a Xilinx Zynq FPGA platform and present results showing how our implementation reduces external memory transfers and compute time in five different CNNs ranging from small ones up to the widely known large VGG16 and VGG19 CNNs. Post-synthesis simulations using Mentor Modelsim in a 28nm process with a clock frequency of 500 MHz show that the VGG19 network achieves over 450 GOp/s. By exploiting sparsity, NullHop achieves an efficiency of 368%, maintains over 98% utilization of the MAC units, and achieves a power efficiency of over 3TOp/s/W in a core area of 6.3mm2^2. As further proof of NullHop's usability, we interfaced its FPGA implementation with a neuromorphic event camera for real time interactive demonstrations

    Approximate computing: An integrated cross-layer framework

    Get PDF
    A new design approach, called approximate computing (AxC), leverages the flexibility provided by intrinsic application resilience to realize hardware or software implementations that are more efficient in energy or performance. Approximate computing techniques forsake exact (numerical or Boolean) equivalence in the execution of some of the application’s computations, while ensuring that the output quality is acceptable. While early efforts in approximate computing have demonstrated great potential, they consist of ad hoc techniques applied to a very narrow set of applications, leaving in question the applicability of approximate computing in a broader context. The primary objective of this thesis is to develop an integrated cross-layer approach to approximate computing, and to thereby establish its applicability to a broader range of applications. The proposed framework comprises of three key components: (i) At the circuit level, systematic approaches to design approximate circuits, or circuits that realize a slightly modified function with improved efficiency, (ii) At the architecture level, utilize approximate circuits to build programmable approximate processors, and (iii) At the software level, methods to apply approximate computing to machine learning classifiers, which represent an important class of applications that are being utilized across the computing spectrum. Towards this end, the thesis extends the state-of-the-art in approximate computing in the following important directions. Synthesis of Approximate Circuits: First, the thesis proposes a rigorous framework for the automatic synthesis of approximate circuits , which are the hardware building blocks of approximate computing platforms. Designing approximate circuits involves making judicious changes to the function implemented by the circuit such that its hardware complexity is lowered without violating the specified quality constraint. Inspired by classical approaches to Boolean optimization in logic synthesis, the thesis proposes two synthesis tools called SALSA and SASIMI that are general, i.e., applicable to any given circuit and quality specification. The framework is further extended to automatically design quality configurable circuits , which are approximate circuits with the capability to reconfigure their quality at runtime. Over a wide range of arithmetic circuits, complex modules and complete datapaths, the circuits synthesized using the proposed framework demonstrate significant benefits in area and energy. Programmable AxC Processors: Next, the thesis extends approximate computing to the realm of programmable processors by introducing the concept of quality programmable processors (QPPs). A key principle of QPPs is that the notion of quality is explicitly codified in their HW/SW interface i.e., the instruction set. Instructions in the ISA are extended with quality fields, enabling software to specify the accuracy level that must be met during their execution. The micro-architecture is designed with hardware mechanisms to understand these quality specifications and translate them into energy savings. As a first embodiment of QPPs, the thesis presents a quality programmable 1D/2D vector processor QP-Vec, which contains a 3-tiered hierarchy of processing elements. Based on an implementation of QP-Vec with 289 processing elements, energy benefits up to 2.5X are demonstrated across a wide range of applications. Software and Algorithms for AxC: Finally, the thesis addresses the problem of applying approximate computing to an important class of applications viz. machine learning classifiers such as deep learning networks. To this end, the thesis proposes two approaches—AxNN and scalable effort classifiers. Both approaches leverage domain- specific insights to transform a given application to an energy-efficient approximate version that meets a specified application output quality. In the context of deep learning networks, AxNN adapts backpropagation to identify neurons that contribute less significantly to the network’s accuracy, approximating these neurons (e.g., by using lower precision), and incrementally re-training the network to mitigate the impact of approximations on output quality. On the other hand, scalable effort classifiers leverage the heterogeneity in the inherent classification difficulty of inputs to dynamically modulate the effort expended by machine learning classifiers. This is achieved by building a chain of classifiers of progressively growing complexity (and accuracy) such that the number of stages used for classification scale with input difficulty. Scalable effort classifiers yield substantial energy benefits as a majority of the inputs require very low effort in real-world datasets. In summary, the concepts and techniques presented in this thesis broaden the applicability of approximate computing, thus taking a significant step towards bringing approximate computing to the mainstream. (Abstract shortened by ProQuest.
    • …
    corecore