5,589 research outputs found
The Design of a System Architecture for Mobile Multimedia Computers
This chapter discusses the system architecture of a portable computer, called Mobile Digital Companion, which provides support for handling multimedia applications energy efficiently. Because battery life is limited and battery weight is an important factor for the size and the weight of the Mobile Digital Companion, energy management plays a crucial role in the architecture. As the Companion must remain usable in a variety of environments, it has to be flexible and adaptable to various operating conditions. The Mobile Digital Companion has an unconventional architecture that saves energy by using system decomposition at different levels of the architecture and exploits locality of reference with dedicated, optimised modules. The approach is based on dedicated functionality and the extensive use of energy reduction techniques at all levels of system design. The system has an architecture with a general-purpose processor accompanied by a set of heterogeneous autonomous programmable modules, each providing an energy efficient implementation of dedicated tasks. A reconfigurable internal communication network switch exploits locality of reference and eliminates wasteful data copies
Force user's manual: A portable, parallel FORTRAN
The use of Force, a parallel, portable FORTRAN on shared memory parallel computers is described. Force simplifies writing code for parallel computers and, once the parallel code is written, it is easily ported to computers on which Force is installed. Although Force is nearly the same for all computers, specific details are included for the Cray-2, Cray-YMP, Convex 220, Flex/32, Encore, Sequent, Alliant computers on which it is installed
A C++-embedded Domain-Specific Language for programming the MORA soft processor array
MORA is a novel platform for high-level FPGA programming of streaming vector and matrix operations, aimed at multimedia applications. It consists of soft array of pipelined low-complexity SIMD processors-in-memory (PIM). We present a Domain-Specific Language (DSL) for high-level programming of the MORA soft processor array. The DSL is embedded in C++, providing designers with a familiar language framework and the ability to compile designs using a standard compiler for functional testing before generating the FPGA bitstream using the MORA toolchain. The paper discusses the MORA-C++ DSL and the compilation route into the assembly for the MORA machine and provides examples to illustrate the programming model and performance
Memory and information processing in neuromorphic systems
A striking difference between brain-inspired neuromorphic processors and
current von Neumann processors architectures is the way in which memory and
processing is organized. As Information and Communication Technologies continue
to address the need for increased computational power through the increase of
cores within a digital processor, neuromorphic engineers and scientists can
complement this need by building processor architectures where memory is
distributed with the processing. In this paper we present a survey of
brain-inspired processor architectures that support models of cortical networks
and deep neural networks. These architectures range from serial clocked
implementations of multi-neuron systems to massively parallel asynchronous ones
and from purely digital systems to mixed analog/digital systems which implement
more biological-like models of neurons and synapses together with a suite of
adaptation and learning mechanisms analogous to the ones found in biological
nervous systems. We describe the advantages of the different approaches being
pursued and present the challenges that need to be addressed for building
artificial neural processing systems that can display the richness of behaviors
seen in biological systems.Comment: Submitted to Proceedings of IEEE, review of recently proposed
neuromorphic computing platforms and system
A Scalable Correlator Architecture Based on Modular FPGA Hardware, Reuseable Gateware, and Data Packetization
A new generation of radio telescopes is achieving unprecedented levels of
sensitivity and resolution, as well as increased agility and field-of-view, by
employing high-performance digital signal processing hardware to phase and
correlate large numbers of antennas. The computational demands of these imaging
systems scale in proportion to BMN^2, where B is the signal bandwidth, M is the
number of independent beams, and N is the number of antennas. The
specifications of many new arrays lead to demands in excess of tens of PetaOps
per second.
To meet this challenge, we have developed a general purpose correlator
architecture using standard 10-Gbit Ethernet switches to pass data between
flexible hardware modules containing Field Programmable Gate Array (FPGA)
chips. These chips are programmed using open-source signal processing libraries
we have developed to be flexible, scalable, and chip-independent. This work
reduces the time and cost of implementing a wide range of signal processing
systems, with correlators foremost among them,and facilitates upgrading to new
generations of processing technology. We present several correlator
deployments, including a 16-antenna, 200-MHz bandwidth, 4-bit, full Stokes
parameter application deployed on the Precision Array for Probing the Epoch of
Reionization.Comment: Accepted to Publications of the Astronomy Society of the Pacific. 31
pages. v2: corrected typo, v3: corrected Fig. 1
OpenCL Actors - Adding Data Parallelism to Actor-based Programming with CAF
The actor model of computation has been designed for a seamless support of
concurrency and distribution. However, it remains unspecific about data
parallel program flows, while available processing power of modern many core
hardware such as graphics processing units (GPUs) or coprocessors increases the
relevance of data parallelism for general-purpose computation.
In this work, we introduce OpenCL-enabled actors to the C++ Actor Framework
(CAF). This offers a high level interface for accessing any OpenCL device
without leaving the actor paradigm. The new type of actor is integrated into
the runtime environment of CAF and gives rise to transparent message passing in
distributed systems on heterogeneous hardware. Following the actor logic in
CAF, OpenCL kernels can be composed while encapsulated in C++ actors, hence
operate in a multi-stage fashion on data resident at the GPU. Developers are
thus enabled to build complex data parallel programs from primitives without
leaving the actor paradigm, nor sacrificing performance. Our evaluations on
commodity GPUs, an Nvidia TESLA, and an Intel PHI reveal the expected linear
scaling behavior when offloading larger workloads. For sub-second duties, the
efficiency of offloading was found to largely differ between devices. Moreover,
our findings indicate a negligible overhead over programming with the native
OpenCL API.Comment: 28 page
Shared versus distributed memory multiprocessors
The question of whether multiprocessors should have shared or distributed memory has attracted a great deal of attention. Some researchers argue strongly for building distributed memory machines, while others argue just as strongly for programming shared memory multiprocessors. A great deal of research is underway on both types of parallel systems. Special emphasis is placed on systems with a very large number of processors for computation intensive tasks and considers research and implementation trends. It appears that the two types of systems will likely converge to a common form for large scale multiprocessors
Force user's manual, revised
A methodology for writing parallel programs for shared memory multiprocessors has been formalized as an extension to the Fortran language and implemented as a macro preprocessor. The extended language is known as the Force, and this manual describes how to write Force programs and execute them on the Flexible Computer Corporation Flex/32, the Encore Multimax and the Sequent Balance computers. The parallel extension macros are described in detail, but knowledge of Fortran is assumed
A Streaming Multi-GPU Implementation of Image Simulation Algorithms for Scanning Transmission Electron Microscopy
Simulation of atomic resolution image formation in scanning transmission
electron microscopy can require significant computation times using traditional
methods. A recently developed method, termed plane-wave reciprocal-space
interpolated scattering matrix (PRISM), demonstrates potential for significant
acceleration of such simulations with negligible loss of accuracy. Here we
present a software package called Prismatic for parallelized simulation of
image formation in scanning transmission electron microscopy (STEM) using both
the PRISM and multislice methods. By distributing the workload between multiple
CUDA-enabled GPUs and multicore processors, accelerations as high as 1000x for
PRISM and 30x for multislice are achieved relative to traditional multislice
implementations using a single 4-GPU machine. We demonstrate a potentially
important application of Prismatic, using it to compute images for atomic
electron tomography at sufficient speeds to include in the reconstruction
pipeline. Prismatic is freely available both as an open-source CUDA/C++ package
with a graphical user interface and as a Python package, PyPrismatic
- âŚ