672 research outputs found

    Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques

    Full text link
    The rapid growth of demanding applications in domains applying multimedia processing and machine learning has marked a new era for edge and cloud computing. These applications involve massive data and compute-intensive tasks, and thus, typical computing paradigms in embedded systems and data centers are stressed to meet the worldwide demand for high performance. Concurrently, the landscape of the semiconductor field in the last 15 years has constituted power as a first-class design concern. As a result, the community of computing systems is forced to find alternative design approaches to facilitate high-performance and/or power-efficient computing. Among the examined solutions, Approximate Computing has attracted an ever-increasing interest, with research works applying approximations across the entire traditional computing stack, i.e., at software, hardware, and architectural levels. Over the last decade, there is a plethora of approximation techniques in software (programs, frameworks, compilers, runtimes, languages), hardware (circuits, accelerators), and architectures (processors, memories). The current article is Part I of our comprehensive survey on Approximate Computing, and it reviews its motivation, terminology and principles, as well it classifies and presents the technical details of the state-of-the-art software and hardware approximation techniques.Comment: Under Review at ACM Computing Survey

    Optical Synchronization of Time-of-Flight Cameras

    Get PDF
    Time-of-Flight (ToF)-Kameras erzeugen Tiefenbilder (3D-Bilder), indem sie Infrarotlicht aussenden und die Zeit messen, bis die Reflexion des Lichtes wieder empfangen wird. Durch den Einsatz mehrerer ToF-Kameras können ihre vergleichsweise geringere Auflösungen überwunden, das Sichtfeld vergrößert und Verdeckungen reduziert werden. Der gleichzeitige Betrieb birgt jedoch die Möglichkeit von Störungen, die zu fehlerhaften Tiefenmessungen führen. Das Problem der gegenseitigen Störungen tritt nicht nur bei Mehrkamerasystemen auf, sondern auch wenn mehrere unabhängige ToF-Kameras eingesetzt werden. In dieser Arbeit wird eine neue optische Synchronisation vorgestellt, die keine zusätzliche Hardware oder Infrastruktur erfordert, um ein Zeitmultiplexverfahren (engl. Time-Division Multiple Access, TDMA) für die Anwendung mit ToF-Kameras zu nutzen, um so die Störungen zu vermeiden. Dies ermöglicht es einer Kamera, den Aufnahmeprozess anderer ToF-Kameras zu erkennen und ihre Aufnahmezeiten schnell zu synchronisieren, um störungsfrei zu arbeiten. Anstatt Kabel zur Synchronisation zu benötigen, wird nur die vorhandene Hardware genutzt, um eine optische Synchronisation zu erreichen. Dazu wird die Firmware der Kamera um das Synchronisationsverfahren erweitert. Die optische Synchronisation wurde konzipiert, implementiert und in einem Versuchsaufbau mit drei ToF-Kameras verifiziert. Die Messungen zeigen die Wirksamkeit der vorgeschlagenen optischen Synchronisation. Während der Experimente wurde die Bildrate durch das zusätzliche Synchronisationsverfahren lediglich um etwa 1 Prozent reduziert.Time-of-Flight (ToF) cameras produce depth images (three-dimensional images) by measuring the time between the emission of infrared light and the reception of its reflection. A setup of multiple ToF cameras may be used to overcome their comparatively low resolution, increase the field of view, and reduce occlusion. However, the simultaneous operation of multiple ToF cameras introduces the possibility of interference resulting in erroneous depth measurements. The problem of interference is not only related to a collaborative multicamera setup but also to multiple ToF cameras operating independently. In this work, a new optical synchronization for ToF cameras is presented, requiring no additional hardware or infrastructure to utilize a time-division multiple access (TDMA) scheme to mitigate interference. It effectively enables a camera to sense the acquisition process of other ToF cameras and rapidly synchronizes its acquisition times to operate without interference. Instead of requiring cables to synchronize, only the existing hardware is utilized to enable an optical synchronization. To achieve this, the camera’s firmware is extended with the synchronization procedure. The optical synchronization has been conceptualized, implemented, and verified with an experimental setup deploying three ToF cameras. The measurements show the efficacy of the proposed optical synchronization. During the experiments, the frame rate was reduced by only about 1% due to the synchronization procedure

    LIPIcs, Volume 261, ICALP 2023, Complete Volume

    Get PDF
    LIPIcs, Volume 261, ICALP 2023, Complete Volum

    Adaptive Intelligent Systems for Extreme Environments

    Get PDF
    As embedded processors become powerful, a growing number of embedded systems equipped with artificial intelligence (AI) algorithms have been used in radiation environments to perform routine tasks to reduce radiation risk for human workers. On the one hand, because of the low price, commercial-off-the-shelf devices and components are becoming increasingly popular to make such tasks more affordable. Meanwhile, it also presents new challenges to improve radiation tolerance, the capability to conduct multiple AI tasks and deliver the power efficiency of the embedded systems in harsh environments. There are three aspects of research work that have been completed in this thesis: 1) a fast simulation method for analysis of single event effect (SEE) in integrated circuits, 2) a self-refresh scheme to detect and correct bit-flips in random access memory (RAM), and 3) a hardware AI system with dynamic hardware accelerators and AI models for increasing flexibility and efficiency. The variances of the physical parameters in practical implementation, such as the nature of the particle, linear energy transfer and circuit characteristics, may have a large impact on the final simulation accuracy, which will significantly increase the complexity and cost in the workflow of the transistor level simulation for large-scale circuits. It makes it difficult to conduct SEE simulations for large-scale circuits. Therefore, in the first research work, a new SEE simulation scheme is proposed, to offer a fast and cost-efficient method to evaluate and compare the performance of large-scale circuits which subject to the effects of radiation particles. The advantages of transistor and hardware description language (HDL) simulations are combined here to produce accurate SEE digital error models for rapid error analysis in large-scale circuits. Under the proposed scheme, time-consuming back-end steps are skipped. The SEE analysis for large-scale circuits can be completed in just few hours. In high-radiation environments, bit-flips in RAMs can not only occur but may also be accumulated. However, the typical error mitigation methods can not handle high error rates with low hardware costs. In the second work, an adaptive scheme combined with correcting codes and refreshing techniques is proposed, to correct errors and mitigate error accumulation in extreme radiation environments. This scheme is proposed to continuously refresh the data in RAMs so that errors can not be accumulated. Furthermore, because the proposed design can share the same ports with the user module without changing the timing sequence, it thus can be easily applied to the system where the hardware modules are designed with fixed reading and writing latency. It is a challenge to implement intelligent systems with constrained hardware resources. In the third work, an adaptive hardware resource management system for multiple AI tasks in harsh environments was designed. Inspired by the “refreshing” concept in the second work, we utilise a key feature of FPGAs, partial reconfiguration, to improve the reliability and efficiency of the AI system. More importantly, this feature provides the capability to manage the hardware resources for deep learning acceleration. In the proposed design, the on-chip hardware resources are dynamically managed to improve the flexibility, performance and power efficiency of deep learning inference systems. The deep learning units provided by Xilinx are used to perform multiple AI tasks simultaneously, and the experiments show significant improvements in power efficiency for a wide range of scenarios with different workloads. To further improve the performance of the system, the concept of reconfiguration was further extended. As a result, an adaptive DL software framework was designed. This framework can provide a significant level of adaptability support for various deep learning algorithms on an FPGA-based edge computing platform. To meet the specific accuracy and latency requirements derived from the running applications and operating environments, the platform may dynamically update hardware and software (e.g., processing pipelines) to achieve better cost, power, and processing efficiency compared to the static system

    An efficient asynchronous spatial division multiplexing router for network-on-chip on the hardware platform

    Get PDF
    The quasi-delay-insensitive (QDI) based asynchronous network-on-chip (ANoC) has several advantages over clock-based synchronous network-on-chips (NoCs). The asynchronous router uses a virtual channel (VC) as a primary flow-control mechanism however, the spatial division multiplexing (SDM) based mechanism performs better over input traffics over VC. This manuscript uses an asynchronous spatial division multiplexing (ASDM) based router for NoC architecture on a field-programmable gate array (FPGA) platform. The ASDM router is configurable to different bandwidths and VCs. The ASDM router mainly contains input-output (I/O) buffers, a switching allocator, and a crossbar unit. The 4-phase 1-of-4 dual-rail protocol is used to construct the I/O buffers. The performance of the ASDM router is analyzed in terms of lower urinary tract symptoms (LUTs) (chip area), delay, latency, and throughput parameters. The work is implemented using Verilog-HDL with Xilinx ISE 14.7 on artix-7 FPGA. The ASDM router achieves % chip area and obtains 0.8 ns of latency with a throughput of 800 Mfps. The proposed router is compared with existing asynchronous approaches with improved latency and throughput metrics

    Low Power Memory/Memristor Devices and Systems

    Get PDF
    This reprint focusses on achieving low-power computation using memristive devices. The topic was designed as a convenient reference point: it contains a mix of techniques starting from the fundamental manufacturing of memristive devices all the way to applications such as physically unclonable functions, and also covers perspectives on, e.g., in-memory computing, which is inextricably linked with emerging memory devices such as memristors. Finally, the reprint contains a few articles representing how other communities (from typical CMOS design to photonics) are fighting on their own fronts in the quest towards low-power computation, as a comparison with the memristor literature. We hope that readers will enjoy discovering the articles within

    Decentralized Ultra-Reliable Low-Latency Communications through Concurrent Cooperative Transmission

    Get PDF
    Emerging cyber-physical systems demand for communication technologies that enable seamless interactions between humans and physical objects in a shared environment. This thesis proposes decentralized URLLC (dURLLC) as a new communication paradigm that allows the nodes in a wireless multi-hop network (WMN) to disseminate data quickly, reliably and without using a centralized infrastructure. To enable the dURLLC paradigm, this thesis explores the practical feasibility of concurrent cooperative transmission (CCT) with orthogonal frequency-division multiplexing (OFDM). CCT allows for an efficient utilization of the medium by leveraging interference instead of trying to avoid collisions. CCT-based network flooding disseminates data in a WMN through a reception-triggered low-level medium access control (MAC). OFDM provides high data rates by using a large bandwidth, resulting in a short transmission duration for a given amount of data. This thesis explores CCT-based network flooding with the OFDM-based IEEE 802.11 Non-HT and HT physical layers (PHYs) to enable interactions with commercial devices. An analysis of CCT with the IEEE 802.11 Non-HT PHY investigates the combined effects of the phase offset (PO), the carrier frequency offset (CFO) and the time offset (TO) between concurrent transmitters, as well as the elapsed time. The analytical results of the decodability of a CCT are validated in simulations and in testbed experiments with Wireless Open Access Research Platform (WARP) v3 software-defined radios (SDRs). CCT with coherent interference (CI) is the primary approach of this thesis. Two prototypes for CCT with CI are presented that feature mechanisms for precise synchronization in time and frequency. One prototype is based on the WARP v3 and its IEEE 802.11 reference design, whereas the other prototype is created through firmware modifications of the Asus RT-AC86U wireless router. Both prototypes are employed in testbed experiments in which two groups of nodes generate successive CCTs in a ping-pong fashion to emulate flooding processes with a very large number of hops. The nodes stay synchronized in experiments with 10 000 successive CCTs for various modulation and coding scheme (MCS) indices and MAC service data unit (MSDU) sizes. The URLLC requirement of delivering a 32-byte MSDU with a reliability of 99.999 % and with a latency of 1 ms is assessed in experiments with 1 000 000 CCTs, while the reliability is approximated by means of the frame reception rate (FRR). An FRR of at least 99.999 % is achieved at PHY data rates of up to 48 Mbit/s under line-of-sight (LOS) conditions and at PHY data rates of up to 12 Mbit/s under non-line-of-sight (NLOS) conditions on a 20 MHz wide channel, while the latency per hop is 48.2 µs and 80.2 µs, respectively. With four multiple input multiple output (MIMO) spatial streams on a 40 MHz wide channel, a LOS receiver achieves an FRR of 99.5 % at a PHY data rate of 324 Mbit/s. For CCT with incoherent interference, this thesis proposes equalization with time-variant zero-forcing (TVZF) and presents a TVZF receiver for the IEEE 802.11 Non-HT PHY, achieving an FRR of up to 92 % for CCTs from three unsyntonized commercial devices. As CCT-based network flooding allows for an implicit time synchronization of all nodes, a reception-triggered low-level MAC and a reservation-based high-level MAC may in combination support various applications and scenarios under the dURLLC paradigm

    Novel Architectures for Offloading and Accelerating Computations in Artificial Intelligence and Big Data

    Get PDF
    Due to the end of Moore's Law and Dennard Scaling, performance gains in general-purpose architectures have significantly slowed in recent years. While raising the number of cores has been a viable approach for further performance increases, Amdahl's Law and its implications on parallelization also limit further performance gains. Consequently, research has shifted towards different approaches, including domain-specific custom architectures tailored to specific workloads. This has led to a new golden age for computer architecture, as noted in the Turing Award Lecture by Hennessy and Patterson, which has spawned several new architectures and architectural advances specifically targeted at highly current workloads, including Machine Learning. This thesis introduces a hierarchy of architectural improvements ranging from minor incremental changes, such as High-Bandwidth Memory, to more complex architectural extensions that offload workloads from the general-purpose CPU towards more specialized accelerators. Finally, we introduce novel architectural paradigms, namely Near-Data or In-Network Processing, as the most complex architectural improvements. This cumulative dissertation then investigates several architectural improvements to accelerate Sum-Product Networks, a novel Machine Learning approach from the class of Probabilistic Graphical Models. Furthermore, we use these improvements as case studies to discuss the impact of novel architectures, showing that minor and major architectural changes can significantly increase performance in Machine Learning applications. In addition, this thesis presents recent works on Near-Data Processing, which introduces Smart Storage Devices as a novel architectural paradigm that is especially interesting in the context of Big Data. We discuss how Near-Data Processing can be applied to improve performance in different database settings by offloading database operations to smart storage devices. Offloading data-reductive operations, such as selections, reduces the amount of data transferred, thus improving performance and alleviating bandwidth-related bottlenecks. Using Near-Data Processing as a use-case, we also discuss how Machine Learning approaches, like Sum-Product Networks, can improve novel architectures. Specifically, we introduce an approach for offloading Cardinality Estimation using Sum-Product Networks that could enable more intelligent decision-making in smart storage devices. Overall, we show that Machine Learning can benefit from developing novel architectures while also showing that Machine Learning can be applied to improve the applications of novel architectures

    Evaluation of Open-Source EDA Tool “EDA Playground”

    Get PDF
    With the advancement of Information Technology, the design, verification, and manufacturing of Integrated circuits have been challenging and time consuming. Unlike the software domain, Electronic Design Automation (EDA) tools are mostly commercially available, and access is limited to the students. An open-source EDA tool might help the students to initialize the learning process. This thesis showcases an open-source EDA platform, EDA Playground, where users can practice their hardware description language (HDL) codes, create a testbench to simulate their designs and synthesize their code. The thesis shows how EDA Playground provides its users with the ability to write code in various HDLs, enabling them to evaluate their designs using a range of both commercial and freely available simulators. Additionally, it also shows how the platform helps in identifying and resolving design failures through the utilization of waveform viewing tool, EPwave, developed my EDA Playground and logs. It is also highlighted how users have the ability to employ commercial synthesizers in order to combine their codes, thereby facilitating the assessment of device utilization and circuit diagram. Another notable objective of the thesis is to highlight the application of EDA Playground to the incorporate of UVM 1.2. A step-by-step UVM testbench of a simple SystemVerilog adder was developed and simulated as a part of the thesis. Prospective users have the opportunity to gain knowledge about this methodology by accessing educational resources, which encompass various tools and examples provided for their advantage. The thesis provides an extensive array of use cases that showcase the varied functionalities provided by EDA Playground. This thesis extensively employs and evaluates the diverse resources offered on EDA Playground to determine their usefulness

    QuePaxa: Escaping the tyranny of timeouts in consensus

    Get PDF
    Leader-based consensus algorithms are fast and efficient under normal conditions, but lack robustness to adverse conditions due to their reliance on timeouts for liveness. We present QuePaxa, the first protocol offering state-of-the-art normal-case efficiency without depending on timeouts. QuePaxa uses a novel randomized asynchronous consensus core to tolerate adverse conditions such as denial-of-service (DoS) attacks, while a one-round-trip fast path preserves the normal-case efficiency of Multi-Paxos or Raft. By allowing simultaneous proposers without destructive interference, and using short hedging delays instead of conservative timeouts to limit redundant effort, QuePaxa permits rapid recovery after leader failure without risking costly view changes due to false timeouts. By treating leader choice and hedging delay as a multi-armed-bandit optimization, QuePaxa achieves responsiveness to prevalent conditions, and can choose the best leader even if the current one has not failed. Experiments with a prototype confirm that QuePaxa achieves normal-case LAN and WAN performance of 584k and 250k cmd/sec in throughput, respectively, comparable to Multi-Paxos. Under conditions such as DoS attacks, misconfigurations, or slow leaders that severely impact existing protocols, we find that QuePaxa remains live with median latency under 380ms in WAN experiments
    • …
    corecore