Parallel accelerated cyclic reduction preconditioner for three-dimensional elliptic PDEs with variable coefficients
We present a robust and scalable preconditioner for the solution of
large-scale linear systems that arise from the discretization of elliptic PDEs
amenable to rank compression. The preconditioner is based on hierarchical
low-rank approximations and the cyclic reduction method. The setup and
application phases of the preconditioner achieve log-linear complexity in
memory footprint and number of operations, and numerical experiments exhibit
good weak and strong scalability at large processor counts in a distributed
memory environment. Numerical experiments with linear systems that feature
symmetry and nonsymmetry, definiteness and indefiniteness, constant and
variable coefficients demonstrate the preconditioner's applicability and
robustness. Furthermore, it is possible to control the number of iterations via
the accuracy threshold of the hierarchical matrix approximations and their
arithmetic operations, and the tuning of the admissibility condition parameter.
Together, these parameters allow for optimization of the memory requirements
and performance of the preconditioner.
Comment: 24 pages, Elsevier Journal of Computational and Applied Mathematics,
Dec 201
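
To make the cyclic reduction idea named in this abstract concrete, here is a minimal sketch for a scalar tridiagonal system. It is a simplified illustration only: the paper applies cyclic reduction to block systems and accelerates the block operations with hierarchical low-rank matrices, none of which is reproduced here. The function name `cyclic_reduction` and the coefficient layout (`a` sub-diagonal, `b` diagonal, `c` super-diagonal, with `a[0] == 0` and `c[-1] == 0`) are assumptions made for this example.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] by cyclic reduction.

    Assumes a[0] == 0 and c[-1] == 0, and that no pivot b[i] becomes zero
    (true, for example, for diagonally dominant systems).
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Reduction: eliminate the even-indexed unknowns from every odd equation,
    # leaving a tridiagonal system of roughly half the size.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        has_next = i + 1 < n
        alpha = a[i] / b[i - 1]
        gamma = c[i] / b[i + 1] if has_next else 0.0
        ra.append(-alpha * a[i - 1])
        rb.append(b[i] - alpha * c[i - 1] - (gamma * a[i + 1] if has_next else 0.0))
        rc.append(-gamma * c[i + 1] if has_next else 0.0)
        rd.append(d[i] - alpha * d[i - 1] - (gamma * d[i + 1] if has_next else 0.0))

    # Recurse on the reduced system to obtain the odd-indexed unknowns.
    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)

    # Back-substitution: recover each even-indexed unknown from its own equation.
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x


# Usage: 1D Poisson matrix tridiag(-1, 2, -1) with a right-hand side
# built from the known solution x = [1, 2, ..., 7].
n = 7
a = [0.0] + [-1.0] * (n - 1)
b = [2.0] * n
c = [-1.0] * (n - 1) + [0.0]
d = [0.0] * (n - 1) + [8.0]
print(cyclic_reduction(a, b, c, d))
```

Each reduction level halves the number of unknowns, and the eliminations within a level are independent of one another, which is what makes the method attractive for parallel execution.
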
Motion estimation and CABAC VLSI co-processors for real-time high-quality H.264/AVC video coding
Real-time, high-quality video coding is attracting wide interest in the research and industrial communities for a range of applications. H.264/AVC, a recent standard for high-performance video coding, can be exploited in several scenarios, including digital video broadcasting, high-definition TV and DVD-based systems, which must sustain bit rates of up to tens of Mbits/s. To that end, this paper proposes optimized architectures for the most critical H.264/AVC tasks: motion estimation and context-adaptive binary arithmetic coding (CABAC). Post-synthesis results on sub-micron CMOS standard-cell technologies show that the proposed architectures can process 720 × 480 video sequences at 30 frames/s in real time and sustain more than 50 Mbits/s. The achieved circuit complexity and power consumption budgets are suitable for integration in complex VLSI multimedia systems based either on an AHB bus-centric on-chip communication system or on novel Network-on-Chip (NoC) infrastructures for MPSoC (Multi-Processor System on Chip
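
As a rough illustration of the workload implied by the figures in this abstract, the sketch below converts the stated resolution and frame rate into a macroblock rate. It relies on the fact that H.264/AVC macroblocks cover 16 × 16 luma samples; the function name `macroblock_rate` and the bits-per-macroblock estimate are assumptions made for this example and do not come from the paper.

```python
MB_SIZE = 16  # H.264/AVC macroblocks cover 16x16 luma samples


def macroblock_rate(width, height, fps):
    """Return (macroblocks per frame, macroblocks per second)."""
    # Ceiling division so that partial macroblocks at the edges are counted.
    mbs_wide = -(-width // MB_SIZE)
    mbs_high = -(-height // MB_SIZE)
    per_frame = mbs_wide * mbs_high
    return per_frame, per_frame * fps


per_frame, per_second = macroblock_rate(720, 480, 30)
print(per_frame, per_second)  # 1350 40500

# Average bit budget per macroblock at a 50 Mbit/s stream (illustrative only).
print(f"{50e6 / per_second:.0f} bits per macroblock")
```
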
Anatomy of quantum chaotic eigenstates
The eigenfunctions of quantized chaotic systems cannot be described by
explicit formulas, even approximate ones. This survey summarizes (selected)
analytical approaches used to describe these eigenstates, in the semiclassical
limit. The levels of description are macroscopic (one wants to understand the
quantum averages of smooth observables), and microscopic (one wants
information on maxima of eigenfunctions, "scars" of periodic orbits, structure
of the nodal sets and domains, local correlations), and often focus on
statistical results. Various models of "random wavefunctions" have been
introduced to understand these statistical properties, with usually good
agreement with the numerical data. We also discuss some specific systems (like
arithmetic ones) which depart from these random models.
Comment: Corrected typos, added a few references and updated some result
An Application-Specific VLIW Processor with Vector Instruction Set for CNN Acceleration
In recent years, neural networks have surpassed classical algorithms in areas
such as object recognition, e.g. in the well-known ImageNet challenge. As a
result, great effort is being put into developing fast and efficient
accelerators, especially for Convolutional Neural Networks (CNNs). In this work
we present ConvAix, a fully C-programmable processor, which -- contrary to many
existing architectures -- does not rely on a hard-wired array of
multiply-and-accumulate (MAC) units. Instead it maps computations onto
independent vector lanes making use of a carefully designed vector instruction
set. The presented processor is targeted towards latency-sensitive applications
and is capable of executing up to 192 MAC operations per cycle. ConvAix
operates at a target clock frequency of 400 MHz in 28nm CMOS, thereby offering
state-of-the-art performance with proper flexibility within its target domain.
Simulation results for several 2D convolutional layers from well-known CNNs
(AlexNet, VGG-16) show an average ALU utilization of 72.5% using vector
instructions with 16 bit fixed-point arithmetic. Compared to other well-known
designs which are less flexible, ConvAix offers competitive energy efficiency
of up to 497 GOP/s/W while even surpassing them in terms of area efficiency and
processing speed.
Comment: Accepted for publication in the proceedings of the 2019 IEEE
International Symposium on Circuits and Systems (ISCAS
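
The throughput figures quoted in this abstract can be related with a short back-of-the-envelope calculation. The sketch below assumes that one MAC counts as two operations (a multiply and an add), a common convention that the abstract itself does not state; the function name `peak_gops` is likewise hypothetical. The utilization-scaled figure is indicative only, since 72.5% is an average over the simulated layers.

```python
def peak_gops(macs_per_cycle, clock_hz, ops_per_mac=2):
    """Peak throughput in GOP/s; ops_per_mac=2 is an assumed convention."""
    return macs_per_cycle * clock_hz * ops_per_mac / 1e9


peak = peak_gops(192, 400e6)   # 192 MACs per cycle at 400 MHz
sustained = peak * 0.725       # scaled by the reported average ALU utilization

print(f"peak: {peak:.1f} GOP/s")            # peak: 153.6 GOP/s
print(f"sustained: {sustained:.1f} GOP/s")  # sustained: 111.4 GOP/s
```
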