Comprehensive Evaluation of OpenCL-based Convolutional Neural Network Accelerators in Xilinx and Altera FPGAs
Deep learning has significantly advanced the state of the art in artificial intelligence, gaining wide popularity from both industry and academia. Special interest is around Convolutional Neural Networks (CNN), which take inspiration from the hierarchical structure of the visual cortex, to form deep layers of convolutional operations, along with fully connected classifiers. Hardware implementations of these deep CNN architectures are challenged by memory bottlenecks, since the many convolution and fully-connected layers demand a large amount of communication for parallel computation. Multi-core CPU based solutions have demonstrated their inadequacy for this problem due to the memory wall and low parallelism. Many-core GPU architectures show superior performance but they consume high power and also have memory constraints due to inconsistencies between cache and main memory. FPGA design solutions are also actively being explored, which allow implementing the memory hierarchy using embedded BlockRAM. This boosts the parallel use of shared memory elements between multiple processing units, avoiding data replication and inconsistencies. This makes FPGAs potentially powerful solutions for real-time classification with CNNs. Both Altera and Xilinx have adopted the OpenCL co-design framework from GPUs for FPGA designs as a pseudo-automatic development solution. In this paper, a comprehensive evaluation and comparison of the Altera and Xilinx OpenCL frameworks for a 5-layer deep CNN is presented. Hardware resources, temporal performance and the OpenCL architecture for CNNs are discussed. Xilinx demonstrates faster synthesis, better FPGA resource utilization and more compact boards. Altera provides multi-platform tools, a mature design community and better execution times.
Comprehensive Evaluation of OpenCL-Based CNN Implementations for FPGAs
Deep learning has significantly advanced the state of the
art in artificial intelligence, gaining wide popularity from both industry
and academia. Special interest is around Convolutional Neural Networks
(CNN), which take inspiration from the hierarchical structure
of the visual cortex, to form deep layers of convolutional operations,
along with fully connected classifiers. Hardware implementations of these
deep CNN architectures are challenged by memory bottlenecks, since the
many convolution and fully-connected layers demand a large amount of
communication for parallel computation. Multi-core CPU
based solutions have demonstrated their inadequacy for this problem
due to the memory wall and low parallelism. Many-core GPU architectures
show superior performance but they consume high power and also
have memory constraints due to inconsistencies between cache and main
memory. OpenCL is commonly used to describe these architectures for
their execution on GPGPUs or FPGAs. FPGA design solutions are also
actively being explored, which allow implementing the memory hierarchy
using embedded parallel BlockRAMs. This boosts the parallel use
of shared memory elements between multiple processing units, avoiding
data replication and inconsistencies. This makes FPGAs potentially
powerful solutions for real-time classification with CNNs. In this
paper, the OpenCL co-design frameworks that Altera and Xilinx have adopted
as pseudo-automatic development solutions are evaluated. A comprehensive
evaluation and comparison for a 5-layer deep CNN is presented.
Hardware resources, temporal performance and the OpenCL architecture
for CNNs are discussed. Xilinx demonstrates faster synthesis, better
FPGA resource utilization and more compact boards. Altera provides
multi-platform tools, a mature design community and better execution
times.
Ministerio de Economía y Competitividad TEC2016-77785-
Learning Concise Models from Long Execution Traces
Abstract models of system-level behaviour have applications in design
exploration, analysis, testing and verification. We describe a new algorithm
for automatically extracting useful models, as automata, from execution traces
of a HW/SW system driven by software exercising a use-case of interest. Our
algorithm leverages modern program synthesis techniques to generate predicates
on automaton edges, succinctly describing system behaviour. It employs trace
segmentation to tackle complexity for long traces. We learn concise models
capturing transaction-level, system-wide behaviour--experimentally
demonstrating the approach using traces from a variety of sources, including
the x86 QEMU virtual platform and the Real-Time Linux kernel.
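
The role of trace segmentation can be illustrated with a minimal sketch. It is not the algorithm from the abstract above: it learns no edge predicates and uses no program synthesis. It only assumes that a trace is a list of event names, splits it into overlapping windows, and merges the transitions seen in each window into one transition graph. The names `segment` and `learn_edges` are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def segment(trace: Sequence[str], size: int) -> List[Sequence[str]]:
    """Split a trace into windows of `size` events.

    Consecutive windows overlap by one event, so no transition between
    neighbouring events is lost at a window boundary.
    """
    if size < 2:
        raise ValueError("size must be at least 2")
    step = size - 1
    return [trace[i:i + size] for i in range(0, max(len(trace) - 1, 1), step)]


def learn_edges(trace: Sequence[str], size: int) -> Dict[str, Set[str]]:
    """Merge the transitions observed in every window into one graph.

    Assumption for illustration: each distinct event name is one state.
    """
    edges: Dict[str, Set[str]] = {}
    for window in segment(trace, size):
        for src, dst in zip(window, window[1:]):
            edges.setdefault(src, set()).add(dst)
    return edges


trace = ["open", "read", "read", "close", "open", "read", "close"]
model = learn_edges(trace, size=3)
```

In this toy setting the windowed result equals the result of processing the whole trace at once, while each step only handles `size` events.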
Classification of Existing Virtualization Methods Used in Telecommunication Networks
This article studies existing methods for virtualizing different
resources. The positive and negative aspects of each method are
analyzed, and the prospects of each approach are noted. An attempt is
also made to classify virtualization methods according to application domain,
which allows us to identify the weaknesses of each method that need to be
improved.
Comment: 4 pages, 3 figures
Efficient hardware debugging using parameterized FPGA reconfiguration
Functional errors and bugs inadvertently introduced at the RTL stage of the design process are responsible for the largest fraction of silicon IC re-spins. Thus, comprehensive functional verification is key to reducing development costs and delivering a product on time. The increasing demands for verification led to an increase in FPGA-based tools that perform emulation. These tools can run at much higher operating frequencies and achieve higher coverage than simulation. However, an important pitfall of the FPGA tools is that they suffer from limited internal signal observability, as only a small and preselected set of signals is guided towards (embedded) trace buffers and observed. This paper proposes a dynamically reconfigurable network of multiplexers that significantly enhances the visibility of internal signals. It allows the designer to dynamically change the small set of internal signals to be observed, virtually enlarging the set of observed signals significantly. These multiplexers occupy minimal space, as they are implemented by the FPGA’s routing infrastructure.
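
The multiplexer idea can be pictured with a small software model. This is a sketch under stated assumptions, not the paper's implementation: it assumes each trace-buffer input is fed by one multiplexer over a fixed group of internal signals, and that reconfiguring means changing the select values. The class `MuxNetwork` and its methods are hypothetical names.

```python
from typing import Dict, List, Sequence


class MuxNetwork:
    """Toy model: one multiplexer per trace-buffer input."""

    def __init__(self, groups: Sequence[Sequence[str]]) -> None:
        # groups[i] lists the internal signals wired to multiplexer i.
        self.groups: List[List[str]] = [list(g) for g in groups]
        # select[i] is the index of the signal multiplexer i passes on.
        self.select: List[int] = [0] * len(self.groups)

    def reconfigure(self, wanted: Sequence[str]) -> None:
        """Change the select values so the wanted signals become visible.

        If two wanted signals share a multiplexer, the later one wins.
        """
        for name in wanted:
            for i, group in enumerate(self.groups):
                if name in group:
                    self.select[i] = group.index(name)
                    break
            else:
                raise KeyError(name)

    def observe(self, signals: Dict[str, int]) -> Dict[str, int]:
        """Return only the signal values that reach the trace buffer."""
        return {g[s]: signals[g[s]] for g, s in zip(self.groups, self.select)}


net = MuxNetwork([["a", "b"], ["c", "d"]])
values = {"a": 1, "b": 2, "c": 3, "d": 4}
first = net.observe(values)   # signals a and c are visible
net.reconfigure(["b", "d"])
second = net.observe(values)  # signals b and d are visible
```

Only two signals are observed at any time, but all four can be reached across reconfigurations, which is the sense in which the observed set is virtually enlarged.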
Automated Dynamic Firmware Analysis at Scale: A Case Study on Embedded Web Interfaces
Embedded devices are becoming more widespread, interconnected, and
web-enabled than ever. However, recent studies showed that these devices are
far from being secure. Moreover, many embedded systems rely on web interfaces
for user interaction or administration. Unfortunately, web security is known to
be difficult, and therefore the web interfaces of embedded systems represent a
considerable attack surface.
In this paper, we present the first fully automated framework that applies
dynamic firmware analysis techniques to achieve, in a scalable manner,
automated vulnerability discovery within embedded firmware images. We apply our
framework to study the security of embedded web interfaces running in
Commercial Off-The-Shelf (COTS) embedded devices, such as routers, DSL/cable
modems, VoIP phones, IP/CCTV cameras. We introduce a methodology and implement
a scalable framework for discovery of vulnerabilities in embedded web
interfaces regardless of the vendor, device, or architecture. To achieve this
goal, our framework performs full system emulation to achieve the execution of
firmware images in a software-only environment, i.e., without involving any
physical embedded devices. Then, we analyze the web interfaces within the
firmware using both static and dynamic tools. We also present some interesting
case-studies, and discuss the main challenges associated with the dynamic
analysis of firmware images and their web interfaces and network services. The
observations we make in this paper shed light on an important aspect of
embedded devices which was not previously studied at a large scale.
We validate our framework by testing it on 1925 firmware images from 54
different vendors. We discover important vulnerabilities in 185 firmware
images, affecting nearly a quarter of vendors in our dataset. These
experimental results demonstrate the effectiveness of our approach.
Fast, Accurate and Detailed NoC Simulations
Network-on-Chip (NoC) architectures have a wide variety of parameters that can be adapted to the designer's requirements. Fast exploration of this parameter space is only possible at a high level, and several methods have been proposed. Cycle and bit accurate simulation is necessary when the actual router's RTL description needs to be evaluated and verified. However, extensive simulation of the NoC architecture with cycle and bit accuracy is prohibitively time consuming. In this paper we describe a method to simulate large parallel homogeneous and heterogeneous networks-on-chip on a single FPGA. The method is especially suitable for parallel systems where lengthy cycle and bit accurate simulations are required. As a case study, we use a NoC that was modelled and simulated in SystemC. We simulate the same NoC on the described FPGA simulator. This enables us to observe the NoC behavior under a large variety of traffic patterns. Compared with the SystemC simulation we achieved a speed-up of 80-300, without compromising the cycle and bit level accuracy.
MGSim - Simulation tools for multi-core processor architectures
MGSim is an open source discrete event simulator for on-chip hardware
components, developed at the University of Amsterdam. It is intended to be a
research and teaching vehicle to study the fine-grained hardware/software
interactions on many-core and hardware multithreaded processors. It includes
support for core models with different instruction sets, a configurable
multi-core interconnect, multiple configurable cache and memory models, a
dedicated I/O subsystem, and comprehensive monitoring and interaction
facilities. The default model configuration shipped with MGSim implements
Microgrids, a many-core architecture with hardware concurrency management.
MGSim is furthermore written mostly in C++ and uses object classes to represent
chip components. It is optimized for architecture models that can be described
as process networks.
Comment: 33 pages, 22 figures, 4 listings, 2 tables