299 research outputs found

    ParaFPGA15: exploring threads and trends in programmable hardware

    The symposium ParaFPGA focuses on parallel techniques using FPGAs as accelerators in high-performance computing. The green-computing appeal of low power consumption at high performance was long tempered by lengthy design cycles and difficult programmability. In recent years, however, FPGAs have become new contenders as versatile compute accelerators because of growing market interest, extended application domains, and maturing high-level synthesis tools. The keynote paper highlights historical and modern approaches to high-level FPGA programming, and the contributions cover applications such as NP-complete satisfiability problems and convex hull image processing, as well as performance evaluation, partial reconfiguration, and systematic design exploration.

    FPGA Acceleration of Communication-Bound Streaming Applications: Architecture Modeling and a 3D Image Compositing Case Study

    Reconfigurable computers usually provide a limited number of different memory resources, such as host memory, external memory, and on-chip memory, with different capacities and communication characteristics. A key challenge for achieving high performance with reconfigurable accelerators is the efficient utilization of the available memory resources, and detailed knowledge of the memories' parameters is essential for generating an optimized communication layout. In this paper, we discuss a benchmarking environment for generating such a characterization. The environment is built on IMORC, our architectural template and on-chip network for creating reconfigurable accelerators. We provide a characterization of the memory resources available on the XtremeData XD1000 reconfigurable computer. Based on this data, we present as a case study the implementation of a 3D image compositing accelerator that is able to double the frame rate of a parallel renderer.
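    As an illustration of the characterization idea described above, the sketch below times transfers of different block sizes and reports the effective bandwidth of each. It is a minimal host-side stand-in, not the IMORC benchmarking environment: the transfer() function and the block sizes are placeholders for real host-memory, external-memory, and on-chip transfers.

```python
# Minimal sketch of memory-resource characterization: measure effective
# bandwidth per block size, then use the numbers to plan a communication
# layout. NOT the IMORC environment; transfer() just copies host memory
# as a stand-in for a real DMA or memory transfer.
import time

def transfer(buf: bytearray) -> bytes:
    return bytes(buf)  # placeholder for a real transfer to/from a memory resource

def characterize(block_sizes, repeats=5):
    results = {}
    for size in block_sizes:
        buf = bytearray(size)
        start = time.perf_counter()
        for _ in range(repeats):
            transfer(buf)
        elapsed = time.perf_counter() - start
        results[size] = (size * repeats) / elapsed / 1e6  # MB/s
    return results

if __name__ == "__main__":
    for size, bw in characterize([4 << 10, 64 << 10, 1 << 20]).items():
        print(f"block {size:>8} B: {bw:8.1f} MB/s")
```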

    An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics

    Near-sensor data analytics is a promising direction for IoT endpoints, as it minimizes energy spent on communication and reduces network load - but it also poses security concerns, as valuable data is stored or sent over the network at various stages of the analytics pipeline. Using encryption to protect sensitive data at the boundary of the on-chip analytics engine is a way to address data security issues. To cope with the combined workload of analytics and encryption in a tight power envelope, we propose Fulmine, a System-on-Chip based on a tightly-coupled multi-core cluster augmented with specialized blocks for compute-intensive data processing and encryption functions, supporting software programmability for regular computing tasks. The Fulmine SoC, fabricated in 65 nm technology, consumes less than 20 mW on average at 0.8 V, achieving an efficiency of up to 70 pJ/B in encryption, 50 pJ/px in convolution, or up to 25 MIPS/mW in software. As a strong argument for real-life flexible application of our platform, we show experimental results for three secure analytics use cases: secure autonomous aerial surveillance with a state-of-the-art deep CNN consuming 3.16 pJ per equivalent RISC op; local CNN-based face detection with secured remote recognition in 5.74 pJ/op; and seizure detection with encrypted data collection from EEG within 12.7 pJ/op.
    Comment: 15 pages, 12 figures, accepted for publication in the IEEE Transactions on Circuits and Systems I: Regular Papers.
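    To put the quoted efficiency figures in perspective, the back-of-envelope sketch below converts them into rough throughput numbers (about 500 MIPS and 40 pJ/op in software at 20 mW, plus the corresponding encryption and convolution rates). These are illustrative conversions of the abstract's numbers only; the paper's measurement conditions may differ.

```python
# Back-of-envelope figures derived from the efficiency numbers quoted in the
# abstract (illustrative only; measurement conditions in the paper may differ).
power_mw = 20.0          # average power at 0.8 V
mips_per_mw = 25.0       # software efficiency
pj_per_byte_enc = 70.0   # encryption efficiency
pj_per_px_conv = 50.0    # convolution efficiency

throughput_mips = power_mw * mips_per_mw      # 20 mW * 25 MIPS/mW = 500 MIPS
energy_per_op_pj = 1e3 / mips_per_mw          # 1 mW / 25 MIPS = 40 pJ per op
enc_rate_mb_s = power_mw * 1e-3 / (pj_per_byte_enc * 1e-12) / 1e6
conv_rate_mpx_s = power_mw * 1e-3 / (pj_per_px_conv * 1e-12) / 1e6

print(f"{throughput_mips:.0f} MIPS, {energy_per_op_pj:.0f} pJ/op in software")
print(f"~{enc_rate_mb_s:.0f} MB/s encryption, ~{conv_rate_mpx_s:.0f} Mpx/s convolution at 20 mW")
```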

    Real-Time UAV Pose Estimation and Tracking Using FPGA Accelerated April Tag

    April Tags and other passive fiducial markers are widely used to determine localization using a monocular camera. These fiducial systems rely on specialized algorithms that detect markers and calculate their orientation and distance in three-dimensional (3-D) space. The video and image processing steps performed to use these fiducial systems dominate the computation time of the algorithms, and low latency is a key requirement for their real-time application. The drawback of performing the video and image processing in software is the difficulty of parallelizing the same operations effectively. Specialized hardware instantiations of the same algorithms can parallelize them efficiently and operate on the image in a streaming fashion. Compared to graphics processing units (GPUs), which also perform well in this field, field programmable gate arrays (FPGAs) operate with less power, making them preferable under tight power constraints. This research describes such an optimization of the April Tag algorithm on an unmanned aerial vehicle with an embedded platform to perform real-time pose estimation, tracking, and localization in global positioning system (GPS)-denied environments at 30 frames per second (FPS) by converting the initial embedded C/C++ solution into a heterogeneous one through hardware acceleration. It compares the size, accuracy, and speed of the April Tag algorithm's various implementations. The initial solution operated at around 2 FPS, while the final solution, a novel heterogeneous algorithm on the Fusion 2 Zynq 7020 system on chip (SoC), operated at around 43 FPS using hardware acceleration. The research proposes a pipeline that breaks the algorithm into distinct steps, portions of which can be improved by algorithms optimized to run on an FPGA. Additional steps were taken to further reduce the hardware algorithm's resource utilization. Each step in the software was compared against its hardware counterpart using utilization and timing as benchmarks.
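    The sketch below illustrates the kind of stage-by-stage streaming decomposition such a pipeline implies, with generic AprilTag-style stages (grayscale, threshold, segmentation, quad detection, decoding) expressed as lazy Python generators. The stage names and bodies are placeholders, not the thesis's actual hardware blocks.

```python
# Sketch of a streaming decomposition into distinct pipeline stages. Each
# stage consumes frames lazily, mirroring how an FPGA pipeline operates on
# pixels as they stream in. Stage bodies are placeholders.
def grayscale(frames):
    for f in frames:
        yield f  # placeholder: RGB -> grayscale conversion

def threshold(frames):
    for f in frames:
        yield f  # placeholder: adaptive thresholding

def segment(frames):
    for f in frames:
        yield []  # placeholder: connected components / line segments

def detect_quads(segments):
    for s in segments:
        yield []  # placeholder: quad fitting

def decode_tags(quads):
    for q in quads:
        yield []  # placeholder: tag decoding and pose estimation

def pipeline(frames):
    return decode_tags(detect_quads(segment(threshold(grayscale(frames)))))

for tags in pipeline(range(3)):  # range() stands in for a camera frame source
    print(tags)
```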

    A Model-based Design Framework for Application-specific Heterogeneous Systems

    The increasing heterogeneity of computing systems enables higher performance and power efficiency. However, these improvements come at the cost of increasing the overall complexity of designing such systems. These complexities include constructing implementations for various types of processors, setting up and configuring communication protocols, and efficiently scheduling the computational work. The process for developing such systems is iterative and time consuming, with no well-defined performance goal. Current performance estimation approaches rely on source code implementations that require experienced developers and time to produce. We present a framework to aid in the design of heterogeneous systems and the performance tuning of applications. Our framework supports system construction: integrating custom hardware accelerators with existing cores into processors, integrating processors into cohesive systems, and mapping computations to processors to achieve overall application performance and efficient hardware usage. It also facilitates effective design space exploration using processor models (for both existing and future processors) that do not require source code implementations to estimate performance. We evaluate our framework using a variety of applications, implementing them in systems ranging from low-power embedded systems-on-chip (SoC) to high-performance systems built from commercial off-the-shelf (COTS) components. We show how the design process is improved, reducing the number of design iterations and the amount of unnecessary source code development, ultimately leading to higher-performing, more efficient systems.
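    A minimal sketch of what source-free performance estimation can look like is given below: each processor model is reduced to a compute throughput and a memory bandwidth, a task to operation and byte counts, and the mapping step picks the processor with the lowest roofline-style latency estimate. The processor names and numbers are hypothetical and far simpler than the framework's actual models.

```python
# Minimal sketch of source-free performance estimation and mapping.
# Processor models and the task characterization below are hypothetical.
processors = {
    "embedded_core": {"gops": 2.0, "gbps": 3.2},
    "accelerator":   {"gops": 50.0, "gbps": 12.8},
}

def estimate_latency(task, proc):
    compute = task["ops"] / (proc["gops"] * 1e9)     # compute-bound time
    memory = task["bytes"] / (proc["gbps"] * 1e9)    # memory-bound time
    return max(compute, memory)                      # assume perfect overlap

task = {"ops": 4e9, "bytes": 1e9}
best = min(processors, key=lambda p: estimate_latency(task, processors[p]))
for name, proc in processors.items():
    print(f"{name}: {estimate_latency(task, proc) * 1e3:.1f} ms")
print("best mapping:", best)
```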

    FuncTeller: How Well Does eFPGA Hide Functionality?

    Hardware intellectual property (IP) piracy is an emerging threat to the global supply chain. Correspondingly, various countermeasures aim to protect hardware IPs, such as logic locking, camouflaging, and split manufacturing. However, these countermeasures cannot always guarantee IP security. A malicious attacker can access the layout/netlist of the hardware IP protected by these countermeasures and further retrieve the design. To eliminate/bypass these vulnerabilities, a recent approach redacts the design's IP to an embedded field-programmable gate array (eFPGA), disabling the attacker's access to the layout/netlist. eFPGAs can be programmed with arbitrary functionality; without the bitstream, the attacker cannot recover the functionality of the protected IP. Consequently, state-of-the-art attacks are inapplicable for pirating the redacted hardware IP. In this paper, we challenge the assumed security of eFPGA-based redaction. We present an attack to retrieve the hardware IP with only black-box access to a programmed eFPGA. We observe the effect of modern electronic design automation (EDA) tools on practical hardware circuits and leverage this observation to guide our attack. Our proposed method, FuncTeller, selects minterms to query, recovering the circuit function within a reasonable time. We demonstrate the effectiveness and efficiency of FuncTeller on multiple circuits, including academic benchmark circuits, the Stanford MIPS processor, the IBEX processor, the Common Evaluation Platform GPS, and Cybersecurity Awareness Worldwide competition circuits. Our results show that FuncTeller achieves an average accuracy greater than 85% across the tested circuits when retrieving the design's functionality.
    Comment: To be published in the proceedings of the 32nd USENIX Security Symposium, 2023.
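    The toy sketch below illustrates the general black-box setting: query an oracle (standing in for a programmed eFPGA) on selected input minterms and assemble a partial truth table. It deliberately uses a random query budget; FuncTeller's actual contribution, the EDA-guided choice of which minterms to query, is not reproduced here.

```python
# Toy illustration of black-box functional recovery from a programmed eFPGA.
# The oracle is a hidden Python function standing in for the hardware; the
# minterm-selection heuristic that makes FuncTeller practical is NOT shown.
import random

def oracle(x: int) -> int:
    # stand-in for the black box: a hidden 6-input Boolean function
    return int(bin(x).count("1") % 2 == 0 and x & 0b1)

def sample_minterms(n_inputs, budget, seed=0):
    rng = random.Random(seed)
    return rng.sample(range(2 ** n_inputs), budget)

def recover_partial_truth_table(n_inputs, budget):
    # query the black box on the chosen minterms only
    return {m: oracle(m) for m in sample_minterms(n_inputs, budget)}

table = recover_partial_truth_table(n_inputs=6, budget=16)
onset = sorted(m for m, v in table.items() if v)
print(f"queried {len(table)} minterms, observed on-set: {onset}")
```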

    New Design Techniques for Dynamic Reconfigurable Architectures

    The abstract is available in the attachment.

    Data processing in high-performance computing systems

    The paper integrates the results of a large group of authors working in different areas that are important in the scope of big data, including but not limited to: an overview of basic solutions for the development of data centers; storage and processing; decomposition of a problem into sub-problems of lower complexity (such as applying divide-and-conquer algorithms); models and methods allowing broad parallelism to be realized; alternative techniques for potential acceleration; programming languages; and practical applications.
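    As a concrete example of the divide-and-conquer decomposition mentioned above, the sketch below splits a large input into chunks, solves the chunks in parallel worker processes, and combines the partial results. The summing-of-squares workload is only a placeholder for a real big-data kernel.

```python
# Minimal divide-and-conquer sketch: split the problem, solve sub-problems
# in parallel processes, then combine the partial results.
from concurrent.futures import ProcessPoolExecutor

def solve_chunk(chunk):
    return sum(x * x for x in chunk)       # placeholder sub-problem

def parallel_sum_squares(data, workers=4):
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]   # divide
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(solve_chunk, chunks)                     # conquer
    return sum(partials)                                             # combine

if __name__ == "__main__":
    print(parallel_sum_squares(list(range(1_000_000))))
```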

    ASlib: A Benchmark Library for Algorithm Selection

    The task of algorithm selection involves choosing an algorithm from a set of algorithms on a per-instance basis in order to exploit the varying performance of algorithms over a set of instances. The algorithm selection problem is attracting increasing attention from researchers and practitioners in AI. Years of fruitful applications in a number of domains have resulted in a large amount of data, but the community lacks a standard format or repository for this data. This situation makes it difficult to share and compare different approaches effectively, as is done in other, more established fields. It also unnecessarily hinders new researchers who want to work in this area. To address this problem, we introduce a standardized format for representing algorithm selection scenarios and a repository that contains a growing number of data sets from the literature. Our format has been designed to be able to express a wide variety of different scenarios. Demonstrating the breadth and power of our platform, we describe a set of example experiments that build and evaluate algorithm selection models through a common interface. The results demonstrate the potential of algorithm selection to achieve significant performance improvements across a broad range of problems and algorithms.
    Comment: Accepted for publication in the Artificial Intelligence Journal.
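    The sketch below shows the per-instance selection task that ASlib provides benchmark data for: given instance features and observed per-algorithm runtimes on training instances, choose an algorithm for a new instance via nearest-neighbour lookup. It does not parse the ASlib format, and the features, solver names, and runtimes are made up for illustration.

```python
# Minimal per-instance algorithm selector: pick the algorithm with the lowest
# runtime among the k nearest training instances. Training data is made up.
import math

train = [
    # (instance features, {algorithm: observed runtime in seconds})
    ((0.2, 5.0), {"solverA": 1.2, "solverB": 9.0}),
    ((0.8, 1.0), {"solverA": 7.5, "solverB": 0.9}),
    ((0.3, 4.0), {"solverA": 1.0, "solverB": 6.0}),
]

def select(features, k=1):
    neighbours = sorted(train, key=lambda t: math.dist(t[0], features))[:k]
    algorithms = neighbours[0][1].keys()
    mean_runtime = {a: sum(n[1][a] for n in neighbours) / k for a in algorithms}
    return min(mean_runtime, key=mean_runtime.get)

print(select((0.25, 4.5)))  # -> "solverA" on this toy data
```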

    Evaluation of Edge AI Co-Processing Methods for Space Applications

    The spread of SmallSats in recent years offers several new services and opens the way to new technologies that improve existing ones. However, the communication link to Earth is often a bottleneck for data processing, due to the amount of collected data and the limited bandwidth. One way to face this challenge is edge computing, which is expected to discard useless data and speed up transmission; research has therefore moved towards the study of COTS architectures for use in space, often organized in co-processing setups. This thesis considers AI as the application use case and two devices in a controller-accelerator configuration. It investigates the performance of co-processing methods such as simple parallel, horizontal partitioning, and vertical partitioning across a set of tasks and different pre-trained models. The experiments cover only the simple parallel and horizontal partitioning modes, comparing latency and accuracy against single-device runs on both devices. Evaluated task by task, image classification benefits most from horizontal partitioning, with a clear accuracy improvement, as does semantic segmentation, which shows nearly stable accuracy and potentially higher throughput with smaller model input sizes. Object detection, on the other hand, shows a drop in performance, especially in accuracy, which could likely be improved with models developed specifically for the chosen hardware. The project shows that co-processing methods are worth investigating and can improve system outcomes for some of the analyzed tasks, motivating future work.
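    One common reading of horizontal partitioning, splitting the input batch across the two devices while each runs the full model, is sketched below; the thesis's exact scheme and hardware may differ. The two inference functions are timing placeholders for the controller and the accelerator.

```python
# Sketch of data-parallel ("horizontal") co-processing: split a batch of
# frames between controller and accelerator, run both halves concurrently,
# then merge the outputs. Inference functions are placeholders.
from concurrent.futures import ThreadPoolExecutor
import time

def infer_on_controller(frames):
    time.sleep(0.02 * len(frames))           # stand-in for slower inference
    return [f"ctrl:{f}" for f in frames]

def infer_on_accelerator(frames):
    time.sleep(0.005 * len(frames))          # stand-in for faster inference
    return [f"accel:{f}" for f in frames]

def horizontal_partition(frames, accel_share=0.8):
    split = int(len(frames) * accel_share)   # give the accelerator more work
    with ThreadPoolExecutor(max_workers=2) as pool:
        fast = pool.submit(infer_on_accelerator, frames[:split])
        slow = pool.submit(infer_on_controller, frames[split:])
        return fast.result() + slow.result()

print(horizontal_partition(list(range(10))))
```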