207 research outputs found

    Design Methodology for Face Detection Acceleration

    Get PDF
    A design methodology to accelerate the face detection for embedded systems is described, starting from high level (algorithm optimization) and ending with low level (software and hardware codesign) by addressing the issues and the design decisions made at each level based on the performance measurements and system limitations. The implemented embedded face detection system consumes very little power compared with the traditional PC software implementations while maintaining the same detection accuracy. The proposed face detection acceleration methodology is suitable for real time applications.Ministerio español de Ciencia y Tecnología TEC2011-24319Junta de Andalucía FEDER P08-TIC-0367

    PoET-BiN: Power Efficient Tiny Binary Neurons

    Get PDF
    RÉSUMÉ Le succĂšs des rĂ©seaux de neurones dans la classification des images a inspirĂ© diverses implĂ©mentations matĂ©rielles sur des systĂšmes embarquĂ©s telles que des FPGAs, des processeurs embarquĂ©s et des unitĂ©s de traitement graphiques. Ces systĂšmes sont souvent limitĂ©s en termes de puissance. Toutefois, les rĂ©seaux de neurones consomment Ă©normĂ©ment Ă  travers les opĂ©rations de multiplication/accumulation et des accĂšs mĂ©moire pour la rĂ©cupĂ©ration des poids. La quantification et l’élagage ont Ă©tĂ© proposĂ©s pour rĂ©soudre ce problĂšme. Bien que efficaces, ces techniques ne prennent pas en compte l’architecture sous-jacente du matĂ©riel utilisĂ©. Dans ce travail, nous proposons une implĂ©mentation Ă©conome en Ă©nergie, basĂ©e sur une table de vĂ©ritĂ©, d’un neurone binaire sur des systĂšmes embarquĂ©s Ă  ressources limitĂ©es. Une approche d’arbre de dĂ©cision modifiĂ©e constitue le fondement de la mise en Ɠuvre proposĂ©e dans le domaine binaire. Un accĂšs de LUT consomme beaucoup moins d’énergie que l’opĂ©ration Ă©quivalente de multiplication/accumulation qu’il remplace. De plus, l’algorithme modifiĂ© de l’arbre de dĂ©cision Ă©limine le besoin d’accĂ©der Ă  la mĂ©moire. Nous avons utilisĂ© les neurones binaires proposĂ©s pour mettre en Ɠuvre la couche de classification de rĂ©seaux utilisĂ©s pour la rĂ©solution des jeux de donnĂ©es MNIST, SVHN et CIFAR-10, avec des rĂ©sultats presque Ă  la pointe de la technologie. La rĂ©duction de puissance pour la couche de classification atteint trois ordres de grandeur pour l’ensemble de donnĂ©es MNIST et cinq ordres de grandeur pour les ensembles de donnĂ©es SVHN et CIFAR-10.----------ABSTRACT The success of neural networks in image classification has inspired various hardware implementations on embedded platforms such as Field Programmable Gate Arrays, embedded processors and Graphical Processing Units. These embedded platforms are constrained in terms of power, which is mainly consumed by the Multiply Accumulate operations and the memory accesses for weight fetching. Quantization and pruning have been proposed to ad-dress this issue. Though effective, these techniques do not take into account the underlying architecture of the embedded hardware. In this work, we propose PoET-BiN, a Look-Up Table based power efficient implementation on resource constrained embedded devices. A modified Decision Tree approach forms the backbone of the proposed implementation in the binary domain. A LUT access consumes far less power than the equivalent Multiply Accumulate operation it replaces, and the modified Decision Tree algorithm eliminates the need for memory accesses. We applied the PoET-BiN architecture to implement the classification layers of networks trained on MNIST, SVHN and CIFAR-10 datasets, with near state-of-the art results. The energy reduction for the classifier portion reaches up to six orders of magnitude compared to a floating point implementations and up to three orders of magnitude when compared to recent binary quantized neural networks

    Entwicklung von Objekterkennungssystemen fĂŒr SoC-FPGA Plattformen

    Get PDF
    Object detection is a crucial and challenging task in the field of embedded vision and robotics. Over the last 15 years, various object detection algorithms/systems (e.g., face detection, traffic sign detection) are proposed by different researchers and companies. Most of these research are either focused on improving the detection performance or devoted to boosting the detection speed, which reveals two typically employed criteria while evaluating an object detection algorithm/system: detection accuracy and execution speed. Considering these two factors and the application of object detection to the domains such as service robots and Advanced Driving Assistance System (ADAS), FPGA is a promising platform to achieve accurate object detection in real-time. Therefore, FPGA-based robust object detection systems are designed in this work. The main work of this thesis can be divided into two parts: promising algorithm obtaining and hardware design on SoC-FPGA. Firstly, representative object detection algorithms are selected, implemented, and evaluated. Thereafter, a generalized object detection framework is created. With this framework, pedestrian detection, traffic sign detection, and head detection algorithms are realized and tested. The experiments verify that promising detection results can be obtained by employing the generalized object detection framework. For the work of hardware design on FPGA, the platform of the object detection system, which consists of stereo OV7670 cameras, Xilinx Zedboard, and a monitor that can visualize the detection results, is created. After that, IP cores that correspond to each block of the framework are designed. Configurable parameters are provided by each IP core so that the IPs, especially feature calculation IP and feature scaler IP, can be correctly instanced according to the fast feature pyramid theory. Finally, by employing the designed IP cores, pedestrian detection system, traffic sign detection system, and head detection system are designed and evaluated. The on-board testing results show that real-time object (e.g., pedestrian, traffic sign, head) detection with promising accuracy can all be achieved. In addition, with the generalized object detection framework and the designed IP-toolbox, the object detection system that targets any instance of objects can be designed and implemented rapidly.Objekterkennung ist eine essenzielle und herausfordernde Aufgabe in den Forschungsgebieten der Embedded Vision und der Robotik. In den letzten 15 Jahren wurden verschiedene Objekterkennungsalgorithmen und Systeme (z.B. Gesichtserkennung) aus der Forschung sowie der Industrie prĂ€sentiert. Der Forschungsschwerpunkt liegt dabei typischerweise entweder bei der Verbesserung der ErkennungsqualitĂ€t oder bei der Beschleunigung der Erkennungsgeschwindigkeit. Hieraus leiten sich direkt die Kriterien fĂŒr den Einsatz von Objekterkennungsalgorithmen und Systemen ab: Erkennungsgenauigkeit und die Erkennungslatenz. Unter BerĂŒcksichtigung dieser Kriterien und dem konkreten Einsatzgebiet der Objekterkennung fĂŒr Serviceroboter und ADAS haben sich FPGA Plattformen als vielversprechende Kandidaten fĂŒr die Implementierung von hoch genauen Objekterkennungsalgorithmen mit strikten Echtzeitanforderungen herausgestellt. Auf Bases dessen wurden im Rahmen dieser Arbeit eine robuste Objekterkennung entwickelt. Der Fokus dieser Arbeit ist geteilt in zwei Aspekte: Identifikation und Analyse geeigneter Objekterkennungsalgorithmen und deren Implementierung in einer FPGA basierten SoC-Plattform. Zu Beginn wurden eine Reihe von reprĂ€sentativen Algorithmen ausgewĂ€hlt, in einer Testumgebung implementiert und evaluiert. Darauf aufbauend wurde ein Entwicklungs-Frameworks fĂŒr die Erkennung von Passanten, Verkehrszeichen sowie Kopfdetektion entwickelt und analysiert. FĂŒr die Entwicklung einer FPGA basierten Plattform wurde ein Objekterkennungssystem erstellt. Eine Sammlung aus IP-Cores, die das Entwicklungs-Framework bilden, wurden fĂŒr die FPGA Plattform implementiert. Die IP-Cores bieten eine Reihe von Konfigurationsparametern fĂŒr den flexiblen Einsatz von neuen Komponenten, im Besonderen IP-Cores fĂŒr die Berechnung und Skalierung von Erkennungseigenschaften, basierend auf der Fast Feature Pyramid Theorie. Abschliessend wurden die entwickelten IP-Cores zur Erkennung von Passanten, Verkehrszeichen sowie die Kopfdetektion integriert und gemeinsam evaluiert. Die gesammelten Ergebnisse aus verschiedenen Testscenarios der entwickelten FPGA-Plattform zeigen, dass die Objekterkennung in Echtzeit mit vielversprechender Genauigkeit erreichbar ist. DarĂŒber hinaus können mit dem generalisierten Objekterkennungsrahmen und der entwickelten IP-Toolbox schnell und flexibel belibige Objekterkennungssysteme entwickelt und implementiert werden

    Face detection hardware accelerator using C-based high-level synthesis

    Get PDF
    Research has shown that Field Programmable Gate Array (FPGA) based implementation of image processing system results in high computational speed and energy efficiency. However, FPGA design has relatively long development time compared to alternative implementation platforms, such as those based on Central Processing Unit, Graphical Processing Unit or Digital Signal Processor. Designing digital hardware at a higher level of abstraction is an effective way to shorten the development time. High-level synthesis (HLS) raises the abstraction level for designing digital circuit and translates a C-based description of the desired design into Hardware Descriptive Language. However, C-based HLS techniques are still lacking some maturity. In particular, existing works on applying C-based HLS to design hardware that accelerates window-based image processing algorithms are generally done in a trial and error manner, and usually results in non-optimal designs. Hence, there is a need for an effective procedure in applying C-based HLS that can lead to an optimized accelerator design. Therefore, the key contribution of this research is to present a systematic C-based HLS technique to be used in the design of hardware that accelerates image processing algorithm. The proposed C-based HLS design procedure is illustrated with a case study of the Sobel filter. The effectiveness of the proposed design technique is demonstrated by the case study of a Viola- Jones face detection accelerator targeted for implementation in FPGA. The proposed face detection hardware applies a pipelined architecture with task-level parallelism that allows concurrent execution on every sub-module. Experimental results show that the resulting accelerator module achieves a speed performance improvement of up to 12 times when compared to that of existing works. Tested on CMU+MIT database, the proposed accelerator achieves high detection accuracy of 88% and 46 false positives. Experimental results also show that the proposed design achieves up to 61 frames per second detection speed. This work demonstrates that the proposed Cbased HLS design methodology is effective for image processing hardware accelerator development

    A Future for Integrated Diagnostic Helping

    Get PDF
    International audienceMedical systems used for exploration or diagnostic helping impose high applicative constraints such as real time image acquisition and displaying. A large part of computing requirement of these systems is devoted to image processing. This chapter provides clues to transfer consumers computing architecture approaches to the benefit of medical applications. The goal is to obtain fully integrated devices from diagnostic helping to autonomous lab on chip while taking into account medical domain specific constraints.This expertise is structured as follows: the first part analyzes vision based medical applications in order to extract essentials processing blocks and to show the similarities between consumer’s and medical vision based applications. The second part is devoted to the determination of elementary operators which are mostly needed in both domains. Computing capacities that are required by these operators and applications are compared to the state-of-the-art architectures in order to define an efficient algorithm-architecture adequation. Finally this part demonstrates that it's possible to use highly constrained computing architectures designed for consumers handled devices in application to medical domain. This is based on the example of a high definition (HD) video processing architecture designed to be integrated into smart phone or highly embedded components. This expertise paves the way for the industrialisation of intergraded autonomous diagnostichelping devices, by showing the feasibility of such systems. Their future use would also free the medical staff from many logistical constraints due the deployment of today’s cumbersome systems

    Machine Learning for Microcontroller-Class Hardware -- A Review

    Full text link
    The advancements in machine learning opened a new opportunity to bring intelligence to the low-end Internet-of-Things nodes such as microcontrollers. Conventional machine learning deployment has high memory and compute footprint hindering their direct deployment on ultra resource-constrained microcontrollers. This paper highlights the unique requirements of enabling onboard machine learning for microcontroller class devices. Researchers use a specialized model development workflow for resource-limited applications to ensure the compute and latency budget is within the device limits while still maintaining the desired performance. We characterize a closed-loop widely applicable workflow of machine learning model development for microcontroller class devices and show that several classes of applications adopt a specific instance of it. We present both qualitative and numerical insights into different stages of model development by showcasing several use cases. Finally, we identify the open research challenges and unsolved questions demanding careful considerations moving forward.Comment: Accepted for publication at IEEE Sensors Journa

    Doctor of Philosophy

    Get PDF
    dissertationThe embedded system space is characterized by a rapid evolution in the complexity and functionality of applications. In addition, the short time-to-market nature of the business motivates the use of programmable devices capable of meeting the conflicting constraints of low-energy, high-performance, and short design times. The keys to achieving these conflicting constraints are specialization and maximally extracting available application parallelism. General purpose processors are flexible but are either too power hungry or lack the necessary performance. Application-specific integrated circuits (ASICS) efficiently meet the performance and power needs but are inflexible. Programmable domain-specific architectures (DSAs) are an attractive middle ground, but their design requires significant time, resources, and expertise in a variety of specialties, which range from application algorithms to architecture and ultimately, circuit design. This dissertation presents CoGenE, a design framework that automates the design of energy-performance-optimal DSAs for embedded systems. For a given application domain and a user-chosen initial architectural specification, CoGenE consists of a a Compiler to generate execution binary, a simulator Generator to collect performance/energy statistics, and an Explorer that modifies the current architecture to improve energy-performance-area characteristics. The above process repeats automatically until the user-specified constraints are achieved. This removes or alleviates the time needed to understand the application, manually design the DSA, and generate object code for the DSA. Thus, CoGenE is a new design methodology that represents a significant improvement in performance, energy dissipation, design time, and resources. This dissertation employs the face recognition domain to showcase a flexible architectural design methodology that creates "ASIC-like" DSAs. The DSAs are instruction set architecture (ISA)-independent and achieve good energy-performance characteristics by coscheduling the often conflicting constraints of data access, data movement, and computation through a flexible interconnect. This represents a significant increase in programming complexity and code generation time. To address this problem, the CoGenE compiler employs integer linear programming (ILP)-based 'interconnect-aware' scheduling techniques for automatic code generation. The CoGenE explorer employs an iterative technique to search the complete design space and select a set of energy-performance-optimal candidates. When compared to manual designs, results demonstrate that CoGenE produces superior designs for three application domains: face recognition, speech recognition and wireless telephony. While CoGenE is well suited to applications that exhibit a streaming behavior, multithreaded applications like ray tracing present a different but important challenge. To demonstrate its generality, CoGenE is evaluated in designing a novel multicore N-wide SIMD architecture, known as StreamRay, for the ray tracing domain. CoGenE is used to synthesize the SIMD execution cores, the compiler that generates the application binary, and the interconnection subsystem. Further, separating address and data computations in space reduces data movement and contention for resources, thereby significantly improving performance compared to existing ray tracing approaches

    A multi-layer approach to designing secure systems: from circuit to software

    Get PDF
    In the last few years, security has become one of the key challenges in computing systems. Failures in the secure operations of these systems have led to massive information leaks and cyber-attacks. Case in point, the identity leaks from Equifax in 2016, Spectre and Meltdown attacks to Intel and AMD processors in 2017, Cyber-attacks on Facebook in 2018. These recent attacks have shown that the intruders attack different layers of the systems, from low-level hardware to software as a service(SaaS). To protect the systems, the defense mechanisms should confront the attacks in the different layers of the systems. In this work, we propose four security mechanisms for computing systems: (i ) using backside imaging to detect Hardware Trojans (HTs) in Application Specific Integrated Circuits (ASICs) chips, (ii ) developing energy-efficient reconfigurable cryptographic engines, (iii) examining the feasibility of malware detection using Hardware Performance Counters (HPC). Most of the threat models assume that the root of trust is the hardware running beneath the software stack. However, attackers can insert malicious hardware blocks, i.e. HTs, into the Integrated Circuits (ICs) that provide back-doors to the attackers or leak confidential information. HTs inserted during fabrication are extremely hard to detect since their overheads in performance and power are below the variations in the performance and power caused by manufacturing. In our work, we have developed an optical method that identifies modified or replaced gates in the ICs. We use the near-infrared light to image the ICs because silicon is transparent to near-infrared light and metal reflects infrared light. We leverage the near-infrared imaging to identify the locations of each gate, based on the signatures of metal structures reflected by the lowest metal layer. By comparing the imaged results to the pre-fabrication design, we can identify any modifications, shifts or replacements in the circuits to detect HTs. With the trust of the silicon, the computing system must use secure communication channels for its applications. The low-energy cost devices, such as the Internet of Things (IoT), leverage strong cryptographic algorithms (e.g. AES, RSA, and SHA) during communications. The cryptographic operations cause the IoT devices a significant amount of power. As a result, the power budget limits their applications. To mitigate the high power consumption, modern processors embed these cryptographic operations into hardware primitives. This also improves system performance. The hardware unit embedded into the processor provides high energy-efficiency, low energy cost. However, hardware implementations limit flexibility. The longevity of theIoTs can exceed the lifetime of the cryptographic algorithms. The replacement of the IoT devices is costly and sometimes prohibitive, e.g., monitors in nuclear reactors.In order to reconfigure cryptographic algorithms into hardware, we have developed a system with a reconfigurable encryption engine on the Zedboard platform. The hardware implementation of the engine ensures fast, energy-efficient cryptographic operations. With reliable hardware and secure communication channels in place, the computing systems should detect any malicious behaviors in the processes. We have explored the use of the Hardware Performance Counters (HPCs) in malware detection. HPCs are hardware units that count micro-architectural events, such as cache hits/misses and floating point operations. Anti-virus software is commonly used to detect malware but it also introduces performance overhead. To reduce anti-virus performance overhead, many researchers propose to use HPCs with machine learning models in malware detection. However, it is counter-intuitive that the high-level program behaviors can manifest themselves in low-level statics. We perform experiments using 2 ∌ 3 × larger program counts than the previous works and perform a rigorous analysis to determine whether HPCs can be used to detect malware. Our results show that the False Discovery Rate of malware detection can reach 20%. If we deploy this detection system on a fresh installed Windows 7 systems, among 1,323 binaries, 198 binaries would be flagged as malware
    • 

    corecore