36,782 research outputs found
Acceleration of stereo-matching on multi-core CPU and GPU
This paper presents an accelerated version of a
dense stereo-correspondence algorithm for two different parallelism
enabled architectures, multi-core CPU and GPU. The
algorithm is part of the vision system developed for a binocular
robot-head in the context of the CloPeMa 1 research project.
This research project focuses on the conception of a new clothes
folding robot with real-time and high resolution requirements
for the vision system. The performance analysis shows that
the parallelised stereo-matching algorithm has been significantly
accelerated, maintaining 12x and 176x speed-up respectively
for multi-core CPU and GPU, compared with non-SIMD singlethread
CPU. To analyse the origin of the speed-up and gain
deeper understanding about the choice of the optimal hardware,
the algorithm was broken into key sub-tasks and the performance
was tested for four different hardware architectures
Hyperdrive: A Multi-Chip Systolically Scalable Binary-Weight CNN Inference Engine
Deep neural networks have achieved impressive results in computer vision and
machine learning. Unfortunately, state-of-the-art networks are extremely
compute and memory intensive which makes them unsuitable for mW-devices such as
IoT end-nodes. Aggressive quantization of these networks dramatically reduces
the computation and memory footprint. Binary-weight neural networks (BWNs)
follow this trend, pushing weight quantization to the limit. Hardware
accelerators for BWNs presented up to now have focused on core efficiency,
disregarding I/O bandwidth and system-level efficiency that are crucial for
deployment of accelerators in ultra-low power devices. We present Hyperdrive: a
BWN accelerator dramatically reducing the I/O bandwidth exploiting a novel
binary-weight streaming approach, which can be used for arbitrarily sized
convolutional neural network architecture and input resolution by exploiting
the natural scalability of the compute units both at chip-level and
system-level by arranging Hyperdrive chips systolically in a 2D mesh while
processing the entire feature map together in parallel. Hyperdrive achieves 4.3
TOp/s/W system-level efficiency (i.e., including I/Os)---3.1x higher than
state-of-the-art BWN accelerators, even if its core uses resource-intensive
FP16 arithmetic for increased robustness
ACE16K: A 128Ă128 focal plane analog processor with digital I/O
This paper presents a new generation 128Ă128 focal-plane analog programmable array processor (FPAPAP), from a system level perspective, which has been manufactured in a 0.35 ÎŒm standard digital 1P-5M CMOS technology. The chip has been designed to achieve the high-speed and moderate-accuracy (8b) requirements of most real time early-vision processing applications. It is easily embedded in conventional digital hosting systems: external data interchange and control are completely digital. The chip contains close to four millions transistors, 90% of them working in analog mode, and exhibits a relatively low power consumption-<4 W, i.e. less than 1 ÎŒW per transistor. Computing vs. power peak values are in the order of 1 TeraOPS/W, while maintained VGA processing throughputs of 100 frames/s are possible with about 10-20 basic image processing tasks on each frame
Digital implementation of the cellular sensor-computers
Two different kinds of cellular sensor-processor architectures are used nowadays in various
applications. The first is the traditional sensor-processor architecture, where the sensor and the
processor arrays are mapped into each other. The second is the foveal architecture, in which a
small active fovea is navigating in a large sensor array. This second architecture is introduced
and compared here. Both of these architectures can be implemented with analog and digital
processor arrays. The efficiency of the different implementation types, depending on the used
CMOS technology, is analyzed. It turned out, that the finer the technology is, the better to use
digital implementation rather than analog
A scalable multi-core architecture with heterogeneous memory structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs)
Neuromorphic computing systems comprise networks of neurons that use
asynchronous events for both computation and communication. This type of
representation offers several advantages in terms of bandwidth and power
consumption in neuromorphic electronic systems. However, managing the traffic
of asynchronous events in large scale systems is a daunting task, both in terms
of circuit complexity and memory requirements. Here we present a novel routing
methodology that employs both hierarchical and mesh routing strategies and
combines heterogeneous memory structures for minimizing both memory
requirements and latency, while maximizing programming flexibility to support a
wide range of event-based neural network architectures, through parameter
configuration. We validated the proposed scheme in a prototype multi-core
neuromorphic processor chip that employs hybrid analog/digital circuits for
emulating synapse and neuron dynamics together with asynchronous digital
circuits for managing the address-event traffic. We present a theoretical
analysis of the proposed connectivity scheme, describe the methods and circuits
used to implement such scheme, and characterize the prototype chip. Finally, we
demonstrate the use of the neuromorphic processor with a convolutional neural
network for the real-time classification of visual symbols being flashed to a
dynamic vision sensor (DVS) at high speed.Comment: 17 pages, 14 figure
Locust-inspired vision system on chip architecture for collision detection in automotive applications
This paper describes a programmable digital computing architecture dedicated to process information in accordance to the organization and operating principles of the four-layer neuron structure encountered at the visual system of Locusts. This architecture takes advantage of the natural collision detection skills of locusts and is capable of processing images and ascertaining collision threats in real-time automotive scenarios. In addition to the Locust features, the architecture embeds a Topological Feature Estimator module to identify and classify objects in collision course.European Commission IST2001 - 38097Ministerio de Ciencia y TecnologĂa TIC2003 - 09817- C02 - 0
In the quest of vision-sensors-on-chip: Pre-processing sensors for data reduction
This paper shows that the implementation of vision systems benefits from the usage of sensing front-end chips with embedded pre-processing capabilities - called CVIS. Such embedded pre-processors reduce the number of data to be delivered for ulterior processing. This strategy, which is also adopted by natural vision systems, relaxes system-level requirements regarding data storage and communications and enables highly compact and fast vision systems. The paper includes several proof-o-concept CVIS chips with embedded pre-processing and illustrate their potential advantages. © 2017, Society for Imaging Science and Technology.Office of Naval Research (USA) N00014-14-1-0355Ministerio de EconomĂa y Competitiviad TEC2015-66878-C3-1-R, TEC2015-66878-C3-3-RJunta de AndalucĂa 2012 TIC 233
CMOS Vision Sensors: Embedding Computer Vision at Imaging Front-Ends
CMOS Image Sensors (CIS) are key for imaging technol-ogies. These chips are conceived for capturing opticalscenes focused on their surface, and for delivering elec-trical images, commonly in digital format. CISs may incor-porate intelligence; however, their smartness basicallyconcerns calibration, error correction and other similartasks. The term CVISs (CMOS VIsion Sensors) definesother class of sensor front-ends which are aimed at per-forming vision tasks right at the focal plane. They havebeen running under names such as computational imagesensors, vision sensors and silicon retinas, among others. CVIS and CISs are similar regarding physical imple-mentation. However, while inputs of both CIS and CVISare images captured by photo-sensors placed at thefocal-plane, CVISs primary outputs may not be imagesbut either image features or even decisions based on thespatial-temporal analysis of the scenes. We may hencestate that CVISs are more âintelligentâ than CISs as theyfocus on information instead of on raw data. Actually,CVIS architectures capable of extracting and interpretingthe information contained in images, and prompting reac-tion commands thereof, have been explored for years inacademia, and industrial applications are recently ramp-ing up.One of the challenges of CVISs architects is incorporat-ing computer vision concepts into the design flow. Theendeavor is ambitious because imaging and computervision communities are rather disjoint groups talking dif-ferent languages. The Cellular Nonlinear Network Univer-sal Machine (CNNUM) paradigm, proposed by Profs.Chua and Roska, defined an adequate framework forsuch conciliation as it is particularly well suited for hard-ware-software co-design [1]-[4]. This paper overviewsCVISs chips that were conceived and prototyped at IMSEVision Lab over the past twenty years. Some of them fitthe CNNUM paradigm while others are tangential to it. Allthem employ per-pixel mixed-signal processing circuitryto achieve sensor-processing concurrency in the quest offast operation with reduced energy budget.Junta de AndalucĂa TIC 2012-2338Ministerio de EconomĂa y Competitividad TEC 2015-66878-C3-1-R y TEC 2015-66878-C3-3-
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201
- âŠ