    ACCL+: an FPGA-Based Collective Engine for Distributed Applications

    FPGAs are increasingly prevalent in cloud deployments, serving as Smart NICs or network-attached accelerators. Despite their potential, developing distributed FPGA-accelerated applications remains cumbersome due to the lack of appropriate infrastructure and communication abstractions. To facilitate the development of distributed applications with FPGAs, in this paper we propose ACCL+, an open-source, versatile FPGA-based collective communication library. Portable across different platforms and supporting UDP, TCP, and RDMA, ACCL+ empowers FPGA applications to initiate direct FPGA-to-FPGA collective communication. Additionally, it can serve as a collective offload engine for CPU applications, freeing the CPU from networking tasks. It is user-extensible, allowing new collectives to be implemented and deployed without having to re-synthesize the FPGA circuit. We evaluated ACCL+ on an FPGA cluster with 100 Gb/s networking, comparing its performance against software MPI over RDMA. The results demonstrate ACCL+'s significant advantages for FPGA-based distributed applications and highly competitive performance for CPU applications. We showcase ACCL+'s dual role with two use cases: seamlessly integrating as a collective offload engine to distribute CPU-based vector-matrix multiplication, and serving as a crucial and efficient component in designing fully FPGA-based distributed deep-learning recommendation inference.

    Embedding Clock Buffer IP Core in FPGA Emulation

    Department of Computer Science

    Embedded Electronic Systems for Electronic Skin Applications

    Advances in sensor devices are providing new solutions for many applications, including prosthetics and robotics. Endowing upper-limb prostheses with tactile sensors (electronic/sensitive skin) can provide tactile sensory feedback to amputees. To this end, the prosthetic device is equipped with a tactile sensing system that lets the user receive tactile feedback about objects and contact surfaces. Such a system must be embedded in wearable sensors that cover wide areas of the prosthesis. However, embedding the sensing system involves a set of challenges in terms of power consumption, data processing, real-time response, and design scalability (an e-skin may include a large number of tactile sensors). The tactile sensing system consists of: (i) a tactile sensor array, (ii) an interface electronic circuit, (iii) an embedded processing unit, and (iv) a communication interface to transmit tactile data. The objective of the thesis is to develop an efficient embedded tactile sensing system targeting e-skin applications (e.g. prosthetics) by: 1) developing a low-power, miniaturized interface electronics circuit operating in real time; 2) proposing an efficient algorithm for embedded tactile data processing, which affects system latency and power consumption; 3) implementing an efficient communication channel/interface suitable for the large amount of data generated by a large number of sensors. Most of the interface electronics for tactile sensing systems proposed in the literature are composed of signal conditioning and commercial data acquisition (DAQ) devices. However, these devices are bulky (PC-based) and thus unsuitable for portable prosthetics in terms of size, power consumption, and scalability. Regarding tactile data processing, some works have exploited machine learning methods to extract meaningful information from tactile data. 
However, embedding these algorithms poses challenges because 1) the large amount of data to be processed significantly affects real-time functionality, and 2) the complex processing tasks impose a burden in terms of power consumption. Moreover, the literature lacks studies addressing data transfer in tactile sensing systems, so dealing with a large number of sensors poses challenges for communication bandwidth and reliability. Therefore, this thesis pursues three approaches: 1) Developing a low-power, miniaturized Interface Electronics (IE) system capable of interfacing with and acquiring signals from a large number of tactile sensors in real time. We developed a portable IE system based on a low-power ARM microcontroller and a DDC232 A/D converter that handles an array of 32 tactile sensors. When touch is applied to the sensors, the IE acquires and pre-processes the sensor signals at low power consumption, achieving a battery lifetime of about 22 hours. We then assessed the functionality of the IE by carrying out electrical and electromechanical characterization experiments to monitor the response of the interface electronics with PVDF-based piezoelectric sensors. The results of these tests validate the correct functionality of the proposed system. In addition, we implemented filtering methods on the IE that reduced the effect of noise in the system. Furthermore, we evaluated the proposed IE by integrating it into a tactile sensory feedback system, showing effective delivery of tactile data to the user. The proposed system outperforms similar state-of-the-art solutions, handling a higher number of input channels while maintaining real-time functionality. 2) Optimizing and implementing a tensorial-based machine learning algorithm for touch modality classification on an embedded Zynq System-on-Chip (SoC). 
The algorithm is based on a Support Vector Machine classifier that discriminates between three input touch modality classes: “brushing”, “rolling” and “sliding”. We introduced an efficient algorithm that minimizes the hardware implementation complexity in terms of the number of operations and memory storage, which directly affect latency and power consumption. With respect to the original algorithm, the proposed approach, implemented on the Zynq SoC, reduced the number of operations per inference from 545 M-ops to 18 M-ops and the memory storage from 52.2 KB to 1.7 KB. Moreover, the proposed method speeds up the inference time by a factor of 43 7 at a cost of only 2% loss in accuracy, enabling the algorithm to run on an embedded processing unit and to extract tactile information in real time. 3) Implementing a robust and efficient data transfer channel to transfer aggregated data at a high data rate and low power consumption. In this approach, we proposed and demonstrated a tactile sensory feedback system based on an optical communication link for prosthetic applications. The optical link features low power and wide transmission bandwidth, which makes the feedback system suitable for a large number of tactile sensors. The low-power transmission is due to the employed UWB-based optical modulation. We implemented a system prototype consisting of digital transmitter and receiver boards and acquisition circuits to interface 32 piezoelectric sensors. We then evaluated the system performance by measuring, processing, and transmitting data from the 32 piezoelectric sensors at a 100 Mbps data rate through the optical link, at a communication energy consumption of 50 pJ/bit. Experimental results validated the functionality and demonstrated the real-time operation of the proposed sensory feedback system.
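
The figures quoted in this abstract can be combined into a few derived quantities. The sketch below is only illustrative arithmetic on the reported numbers (545 M-ops to 18 M-ops, 52.2 KB to 1.7 KB, 100 Mbps at 50 pJ/bit). The function names are hypothetical, and the link-power estimate assumes continuous transmission at the full data rate, which the abstract does not state.

```python
# Figures quoted in the abstract.
OPS_BEFORE_M = 545.0       # operations per inference, original algorithm (M-ops)
OPS_AFTER_M = 18.0         # operations per inference, proposed approach (M-ops)
MEM_BEFORE_KB = 52.2       # memory storage, original algorithm (KB)
MEM_AFTER_KB = 1.7         # memory storage, proposed approach (KB)
DATA_RATE_BPS = 100e6      # optical link data rate (bit/s)
ENERGY_PER_BIT_J = 50e-12  # communication energy (J/bit)


def reduction_factor(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before` (hypothetical helper)."""
    return before / after


def link_power_watts(rate_bps: float, energy_per_bit_j: float) -> float:
    """Average communication power, ASSUMING the link transmits continuously."""
    return rate_bps * energy_per_bit_j


ops_factor = reduction_factor(OPS_BEFORE_M, OPS_AFTER_M)
mem_factor = reduction_factor(MEM_BEFORE_KB, MEM_AFTER_KB)
power_mw = link_power_watts(DATA_RATE_BPS, ENERGY_PER_BIT_J) * 1e3

print(f"operations reduced by about {ops_factor:.1f}x")
print(f"memory reduced by about {mem_factor:.1f}x")
print(f"link power at full rate: about {power_mw:.1f} mW")
```

Under these assumptions, both the operation count and the memory footprint shrink by roughly a factor of 30, and the optical link would draw on the order of 5 mW for communication when running at the full 100 Mbps.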

    Proceedings of the 5th International Workshop on Reconfigurable Communication-centric Systems on Chip 2010 - ReCoSoC'10 - May 17-19, 2010 Karlsruhe, Germany. (KIT Scientific Reports ; 7551)

    ReCoSoC is intended to be a periodic annual meeting to expose and discuss gathered expertise as well as state-of-the-art research around SoC-related topics through plenary invited papers and posters. The workshop aims to provide a prospective view of tomorrow's challenges in the multibillion-transistor era, taking into account the emerging techniques and architectures exploring the synergy between flexible on-chip communication and system reconfigurability.

    Reconfigurable Architectures and Systems for IoT Applications

    The Internet of Things (IoT) has become a popular topic in industry in recent years. It describes an ecosystem of internet-connected devices, or things, that enrich everyday life by improving our productivity and efficiency. The primary components of the IoT ecosystem are hardware, software and services. While the software and services of an IoT system focus on data collection and processing to make decisions, the underlying hardware is responsible for sensing the information, preprocessing it, and transmitting it to the servers. Since the IoT ecosystem is still in its infancy, there is a great need for rapid prototyping platforms that would help accelerate the hardware design process. However, depending on the target IoT application, different sensors are required to sense signals such as heart rate, temperature, pressure, acceleration, etc., and there is a great need for reconfigurable platforms that can prototype different sensor interfacing circuits. This thesis primarily focuses on two important hardware aspects of an IoT system: (a) an FPAA-based reconfigurable sensing front-end system and (b) an FPGA-based reconfigurable processing system. To enable reconfiguration capability for any sensor type, Programmable ANalog Device Array (PANDA), a transistor-level analog reconfigurable platform, is proposed. CAD tools required for implementation of front-end circuits on the platform are also developed. To demonstrate the capability of the platform on silicon, a small-scale array of 24×25 PANDA cells is fabricated in 65 nm technology. Several analog circuit building blocks including amplifiers, bias circuits and filters are prototyped on the platform, which demonstrates the effectiveness of the platform for rapid prototyping of IoT sensor interfaces. IoT systems typically use machine learning algorithms that run on the servers to process the data in order to make decisions. 
Recently, embedded processors have been used to preprocess the data at the energy-constrained sensor node or at the IoT gateway, which saves considerable transmission energy and bandwidth. Using conventional CPU-based systems to implement the machine learning algorithms is not energy-efficient. Hence, an FPGA-based hardware accelerator is proposed, and an optimization methodology is developed to maximize the throughput of any convolutional neural network (CNN) based machine learning algorithm on a resource-constrained FPGA.
    Dissertation/Thesis; Doctoral Dissertation, Electrical Engineering, 201

    REDUCING POWER DURING MANUFACTURING TEST USING DIFFERENT ARCHITECTURES

    Power during manufacturing test can be several times higher than power consumption in functional mode. Excessive power during test can cause IR drop, overheating, and early aging of the chips. In this dissertation, three different architectures are introduced to reduce test power in general cases as well as in certain scenarios, including field test. In the first architecture, scan chains are divided into several segments. Every segment needs a control bit that enables capture in that segment when new faults are detectable on it for a given pattern. Otherwise, the segment should be disabled to reduce capture power. We group the control bits together into one or more control chains. To address the extra pin(s) required to shift data into the control chain(s) and the significant post-processing in the first architecture, we explored a second architecture. The second architecture stitches the control bits into the chains they control as EECBs (embedded enable capture bits) in between the segments. This allows an ATPG software tool to automatically generate the appropriate EECB values for each pattern to maintain the fault coverage. This also works in the presence of an on-chip decompressor. The last architecture focuses primarily on the self-test of a device in a 3D stacked IC when an existing FPGA in the stack can be programmed as a tester. We show that the energy expended during test is significantly less than would be required using low-power patterns fed by an on-chip decompressor for the same very short scan chains.

    The PANOPTIC Camera: A Plenoptic Sensor with Real-Time Omnidirectional Capability

    A new biologically inspired vision sensor made of one hundred "eyes" is presented, which is suitable for real-time acquisition and processing of 3-D image sequences. This device, named the Panoptic camera, consists of a layered arrangement of approximately 100 classical CMOS imagers, distributed over a hemisphere 13 cm in diameter. The Panoptic camera is a polydioptric system in which all imagers have their own vision of the world, each with a distinct focal point, which is a specific feature of the Panoptic system. With specific signal processing, this enables 3-D information recording such as omnidirectional stereoscopy or depth estimation. The algorithms dictating the image reconstruction of an omnidirectional observer located at any point inside the hemisphere are presented. A hardware architecture capable of handling these algorithms, with the flexibility to support additional image processing in real time, has been developed as a two-layer system based on FPGAs. The details of the hardware architecture, its internal blocks, the mapping of the algorithms onto these blocks, and the device calibration procedure are presented, along with imaging results.

    Configurable data center switch architectures

    In this thesis, we explore alternative architectures for implementing configurable Data Center Switches along with the advantages that such switches can provide. Our first contribution centers around determining switch architectures that can be implemented on a Field Programmable Gate Array (FPGA) to provide configurable switching protocols. In the process, we identify a gap in the availability of frameworks to realistically evaluate the performance of switch architectures in data centers and contribute a simulation framework that relies on realistic data center traffic patterns. Our framework is then used to evaluate the performance of existing as well as newly proposed FPGA-amenable switch designs. Through collaborative work with Meng and Papaphilippou, we establish that only small-to-medium-range switches can be implemented on today's FPGAs. Our second contribution is a novel switch architecture that integrates a custom in-network hardware accelerator with a generic switch to accelerate Deep Neural Network training applications in data centers. Our proposed accelerator architecture is prototyped on an FPGA, and a scalability study is conducted to demonstrate the trade-offs of an FPGA implementation when compared to an ASIC implementation. In addition to the hardware prototype, we contribute a lightweight load-balancing and congestion control protocol that leverages the unique communication patterns of ML data-parallel jobs to enable fair sharing of network resources across different jobs. Our large-scale simulations demonstrate the ability of our novel switch architecture and lightweight congestion control protocol to both accelerate the training time of machine learning jobs by up to 1.34x and benefit other latency-sensitive applications by reducing their 99th-percentile completion time by up to 4.5x. 
As for our final contribution, we identify the main requirements of in-network applications and propose a Network-on-Chip (NoC)-based architecture for supporting a heterogeneous set of applications. Observing the lack of tools to support such research, we provide a tool that can be used to evaluate NoC-based switch architectures.
    Open Access