CUDA Based Performance Evaluation of the Computational Efficiency of the DCT Image Compression Technique on Both the CPU and GPU
Recent advances in computing, such as massively parallel GPUs (Graphics
Processing Units), coupled with the need to store and deliver large quantities
of digital data, especially images, have brought a number of challenges for
computer scientists, the research community and other stakeholders. These
challenges, such as the prohibitively large cost of manipulating digital data,
have been the focus of the research community in recent years
and have led to the investigation of image compression techniques that can
achieve excellent results. One such technique is the Discrete Cosine Transform,
which helps separate an image into parts of differing frequencies and has the
advantage of excellent energy compaction.
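
The energy-compaction property mentioned above can be illustrated with a minimal sketch. It uses a direct O(N^2) orthonormal 1D DCT-II, not the CORDIC-based Loeffler algorithm or the CUDA implementation the paper studies; the function name `dct_ii` and the sample signal are assumptions chosen for illustration.

```python
import math


def dct_ii(signal):
    """Orthonormal 1D DCT-II computed directly from its definition (O(N^2)).

    Illustrative only: this is the textbook formula, not a fast algorithm.
    """
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(signal)
        )
        # Orthonormal scaling, so the transform preserves signal energy.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


# A smooth (here: constant) signal puts all of its energy into coefficient 0.
flat = [4.0, 4.0, 4.0, 4.0]
flat_coeffs = dct_ii(flat)

# A slowly varying ramp concentrates most energy in the first few coefficients.
ramp = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ramp_coeffs = dct_ii(ramp)
ramp_energy = sum(x * x for x in ramp)
low_freq_share = sum(c * c for c in ramp_coeffs[:2]) / ramp_energy
```

For smooth inputs, `low_freq_share` is close to 1, which is why discarding or coarsely quantizing high-frequency coefficients loses little information.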
This paper investigates the use of the Compute Unified Device Architecture
(CUDA) programming model to implement the CORDIC-based Loeffler DCT
algorithm for efficient image compression. The computational efficiency is
analyzed and evaluated on both the CPU and the GPU. The PSNR (Peak Signal to
Noise Ratio) is used to evaluate image reconstruction quality.
The results are presented and discussed.
Comment: 15 Pages, 11 Figures, 4 Tables, Advanced Computing: An International
Journal (ACIJ), Three Author Pictures (with little Bio for each) at last pag
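
The abstract above uses PSNR to judge reconstruction quality. A minimal sketch of the standard definition follows, assuming 8-bit grayscale images stored as nested lists; the function name `psnr`, the `max_value` default, and the sample pixel values are assumptions, not taken from the paper.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images.

    PSNR = 10 * log10(max_value^2 / MSE), with MSE the mean squared error.
    """
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of rows")
    squared_error = 0.0
    count = 0
    for row_a, row_b in zip(original, reconstructed):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same length")
        for a, b in zip(row_a, row_b):
            squared_error += (a - b) ** 2
            count += 1
    if count == 0:
        raise ValueError("images must not be empty")
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 2x2 example: two pixels differ slightly after compression.
source_block = [[52, 55], [61, 59]]
decoded_block = [[50, 55], [61, 60]]
quality_db = psnr(source_block, decoded_block)
```

Higher values indicate a reconstruction closer to the original; identical images give an infinite PSNR.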
A mixed signal architecture for convolutional neural networks
Deep neural network (DNN) accelerators with improved energy and delay are
desirable for meeting the requirements of hardware targeted for IoT and edge
computing systems. Convolutional neural networks (CoNNs) belong to one of the
most popular types of DNN architectures. This paper presents the design and
evaluation of an accelerator for CoNNs. The system-level architecture is based
on mixed-signal, cellular neural networks (CeNNs). Specifically, we present (i)
the implementation of different layers, including convolution, ReLU, and
pooling, in a CoNN using CeNN, (ii) modified CoNN structures with CeNN-friendly
layers to reduce computational overheads typically associated with a CoNN,
(iii) a mixed-signal CeNN architecture that performs CoNN computations in the
analog and mixed signal domain, and (iv) design space exploration that
identifies what CeNN-based algorithm and architectural features fare best
compared to existing algorithms and architectures when evaluated over common
datasets -- MNIST and CIFAR-10. Notably, the proposed approach can lead to
8.7x improvements in energy-delay product (EDP) per digit classification
for the MNIST dataset at iso-accuracy when compared with the state-of-the-art
DNN engine, while our approach could offer 4.3x improvements in EDP when
compared to other network implementations for the CIFAR-10 dataset.
Comment: 25 page
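
The energy-delay product (EDP) used as the figure of merit above is the product of the energy and the time spent per operation, and an improvement factor is the ratio of the baseline EDP to the proposed EDP. A minimal sketch follows; the function names and all numeric values are hypothetical placeholders, not measurements from the paper.

```python
def energy_delay_product(energy_joules, delay_seconds):
    """EDP: energy per operation multiplied by the time the operation takes."""
    return energy_joules * delay_seconds


def edp_improvement(baseline, proposed):
    """Ratio baseline EDP / proposed EDP; values above 1 favor the proposal.

    Each argument is an (energy_joules, delay_seconds) pair.
    """
    return energy_delay_product(*baseline) / energy_delay_product(*proposed)


# Hypothetical numbers only: halving both energy and delay gives a 4x gain.
baseline_design = (2.0e-6, 3.0e-3)
proposed_design = (1.0e-6, 1.5e-3)
improvement = edp_improvement(baseline_design, proposed_design)
```

Because EDP multiplies the two quantities, a design can improve it by saving energy, by running faster, or by both.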
Enhanced computation method of topological smoothing on shared memory parallel machines
To prepare images for better segmentation, we need preprocessing
applications, such as smoothing, to reduce noise. In this paper, we present an
enhanced computation method for smoothing 2D objects in the binary case. Unlike
existing approaches, the proposed method provides parallel computation and better
memory management, while preserving the topology (number of connected
components) of the original image by using homotopic transformations defined in
the framework of digital topology. We introduce an adapted parallelization
strategy, called split, distribute and merge (SDM), which allows
efficient parallelization of a large class of topological operators. To achieve
a good speedup and better memory allocation, we pay particular attention to task
scheduling and management. During the smoothing process, the distributed work is
done by a variable number of threads. Tests on a 2D grayscale image (512*512),
using a shared memory parallel machine (SMPM) with 8 CPU cores (2 Xeon E5405
running at a frequency of 2 GHz), showed an enhancement of 5.2 with a cache
success rate of 70%
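
If the reported enhancement of 5.2 is read as a speedup over the sequential version (an assumption; the abstract does not define the term), the corresponding parallel efficiency on 8 cores can be worked out as in the sketch below. The function names are illustrative.

```python
def speedup(sequential_seconds, parallel_seconds):
    """Classic speedup: sequential run time divided by parallel run time."""
    return sequential_seconds / parallel_seconds


def parallel_efficiency(speedup_factor, core_count):
    """Fraction of ideal linear scaling achieved: speedup / number of cores."""
    return speedup_factor / core_count


# Figures from the abstract: 5.2 on 8 cores, interpreted as a speedup.
efficiency = parallel_efficiency(5.2, 8)
```

Under that reading, the efficiency is 5.2 / 8 = 0.65, that is, 65% of ideal linear scaling.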
INsight: A Neuromorphic Computing System for Evaluation of Large Neural Networks
Deep neural networks have demonstrated impressive results in various
cognitive tasks such as object detection and image classification. In order to
execute large networks, Von Neumann computers store the large number of weight
parameters in external memories, and processing elements are time-shared,
which leads to power-hungry I/O operations and processing bottlenecks. This
paper describes a neuromorphic computing system that is designed from the
ground up for the energy-efficient evaluation of large-scale neural networks.
The computing system consists of a non-conventional compiler, a neuromorphic
architecture, and a space-efficient microarchitecture that leverages existing
integrated circuit design methodologies. The compiler factorizes a trained,
feedforward network into a sparsely connected network, compresses the weights
linearly, and generates a time delay neural network reducing the number of
connections. The connections and units in the simplified network are mapped to
silicon synapses and neurons. We demonstrate an implementation of the
neuromorphic computing system based on a field-programmable gate array that
performs the MNIST hand-written digit classification with 97.64% accuracy
Proposal For Neuromorphic Hardware Using Spin Devices
We present a design scheme for ultra-low-power neuromorphic hardware using
emerging spin devices. We propose device models for a 'neuron', based on lateral
spin valves and domain wall magnets, that can operate at an ultra-low terminal
voltage of ~20 mV, resulting in small computation energy. Magnetic tunnel
junctions are employed for interfacing the spin neurons with charge-based
devices like CMOS, for large-scale networks. A device-circuit
co-simulation framework is used for simulating such hybrid designs, in order to
evaluate system-level performance. We present the design of different classes
of neuromorphic architectures using the proposed scheme that can be suitable
for different applications such as analog data sensing, data conversion,
cognitive computing, associative memory, programmable logic, and analog and
digital signal processing. We show that the spin-based neuromorphic designs can
achieve 15X-300X lower computation energy for these applications, as compared
to state-of-the-art CMOS designs
High Performance Reconfigurable Computing Systems
The rapid progress and advancement in electronic chips technology provide a
variety of new implementation options for system engineers. The choice varies
between the flexible programs running on a general-purpose processor (GPP) and
the fixed hardware implementation using an application specific integrated
circuit (ASIC). Many other implementation options exist, for instance, a
system with a RISC processor and a DSP core. Other options include graphics
processors and microcontrollers. Specialist processors certainly improve
performance over general-purpose ones, but this comes at the cost of
flexibility. Combining the flexibility of GPPs and the high performance of
ASICs leads to the introduction of reconfigurable computing (RC) as a new
implementation option with a balance between versatility and speed. The focus
of this chapter is on introducing reconfigurable computers as modern
supercomputing architectures. The chapter also investigates the main reasons behind
the current advancement in the development of RC-systems. Furthermore, a
technical survey of various RC-systems is included laying common grounds for
comparisons. In addition, this chapter presents case studies implemented
under the MorphoSys RC-system. The selected case studies belong to different
areas of application, such as computer graphics and information coding.
Parallel versions of the studied algorithms are developed to match the
topologies supported by the MorphoSys. Performance evaluation and results
analyses are included for implementations with different characteristics.
Comment: 53 pages, 14 tables, 15 figure
Resource Efficient LDPC Decoders for Multimedia Communication
Achieving high image quality is an important aspect in an increasing number
of wireless multimedia applications. These applications require resource
efficient error correction hardware to detect and correct errors introduced by
the communication channel. This paper presents an innovative flexible
architecture for error correction using Low-Density Parity-Check (LDPC) codes.
The proposed partially-parallel decoder architecture utilizes a novel code
construction technique based on multi-level Hierarchical Quasi-Cyclic (HQC)
matrix with innovative layering of random sub-matrices. Simulation of a
high-level MATLAB model shows that the proposed HQC matrices have bit error
rate (BER) performance close to that of unstructured random matrices. The
proposed decoder has been implemented on FPGA. It is very resource efficient
and provides very high throughput compared to other decoders reported to date.
Performance evaluation of the decoder has been carried out by transmitting JPEG
images over an AWGN channel and comparing the quality of the reconstructed
images with those from other decoders.
Comment: 10 pages, 12 figures, 4 tables, submitted to Journa
Comparative Performance Analysis of Intel Xeon Phi, GPU, and CPU
We investigate and characterize the performance of an important class of
operations on GPUs and Many Integrated Core (MIC) architectures. Our work is
motivated by applications that analyze low-dimensional spatial datasets
captured by high resolution sensors, such as image datasets obtained from whole
slide tissue specimens using microscopy image scanners. We identify the data
access and computation patterns of operations in object segmentation and
feature computation categories. We systematically implement and evaluate the
performance of these core operations on modern CPUs, GPUs, and MIC systems for
a microscopy image analysis application. Our results show that (1) the data
access pattern and parallelization strategy employed by the operations strongly
affect their performance, and the performance on a MIC of operations that
perform regular data access is comparable to or sometimes better than that on a
GPU; (2) GPUs are significantly more efficient than MICs for operations and
algorithms that irregularly access data, which is a result of the low
performance of the latter when it comes to random data access; (3) adequate
coordinated execution on MICs and CPUs using a performance-aware task
scheduling strategy improves performance by about 1.29x over a
first-come-first-served strategy. The example application attained an
efficiency of 84% in an execution with 192 nodes (3072 CPU cores and 192 MICs).
Comment: 11 pages, 2 figure
Memristor nanodevice for unconventional computing: review and applications
A memristor is a two-terminal nanodevice whose properties attract a wide
community of researchers from various domains such as physics, chemistry,
electronics, computer science and neuroscience. The simple structure for
manufacturing, small scalability, nonvolatility and potential for use in
low-power platforms are outstanding characteristics of this emerging nanodevice.
In this report, we briefly review the memristor literature, from the
mathematical model to the physical realization. We discuss different classes of
memristors based on the material used for their manufacturing. The potential
applications of memristors are presented, and a wide range of applications is
explained and classified
A Survey of Neuromorphic Computing and Neural Networks in Hardware
Neuromorphic computing has come to refer to a variety of brain-inspired
computers, devices, and models that contrast with the pervasive von Neumann computer
architecture. This biologically inspired approach has created highly connected
synthetic neurons and synapses that can be used to model neuroscience theories
as well as solve challenging machine learning problems. The promise of the
technology is to create a brain-like ability to learn and adapt, but the
technical challenges are significant, starting with an accurate neuroscience
model of how the brain works, to finding materials and engineering
breakthroughs to build devices to support these models, to creating a
programming framework so the systems can learn, to creating applications with
brain-like capabilities. In this work, we provide a comprehensive survey of the
research and motivations for neuromorphic computing over its history. We begin
with a 35-year review of the motivations and drivers of neuromorphic computing,
then look at the major research areas of the field, which we define as
neuro-inspired models, algorithms and learning approaches, hardware and
devices, supporting systems, and finally applications. We conclude with a broad
discussion on the major research topics that need to be addressed in the coming
years to see the promise of neuromorphic computing fulfilled. The goals of this
work are to provide an exhaustive review of the research conducted in
neuromorphic computing since the inception of the term, and to motivate further
work by illuminating gaps in the field where new research is needed