10,974 research outputs found
Motion estimation and CABAC VLSI co-processors for real-time high-quality H.264/AVC video coding
Real-time and high-quality video coding is gaining a wide interest in the research and industrial community for different applications. H.264/AVC, a recent standard for high performance video coding, can be successfully exploited in several scenarios including digital video broadcasting, high-definition TV and DVD-based systems, which require to sustain up to tens of Mbits/s. To that purpose this paper proposes optimized architectures for H.264/AVC most critical tasks, Motion estimation and context adaptive binary arithmetic coding. Post synthesis results on sub-micron CMOS standard-cells technologies show that the proposed architectures can actually process in real-time 720 Ă 480 video sequences at 30 frames/s and grant more than 50 Mbits/s. The achieved circuit complexity and power consumption budgets are suitable for their integration in complex VLSI multimedia systems based either on AHB bus centric on-chip communication system or on novel Network-on-Chip (NoC) infrastructures for MPSoC (Multi-Processor System on Chip
Design of multimedia processor based on metric computation
Media-processing applications, such as signal processing, 2D and 3D graphics
rendering, and image compression, are the dominant workloads in many embedded
systems today. The real-time constraints of those media applications have
taxing demands on today's processor performances with low cost, low power and
reduced design delay. To satisfy those challenges, a fast and efficient
strategy consists in upgrading a low cost general purpose processor core. This
approach is based on the personalization of a general RISC processor core
according the target multimedia application requirements. Thus, if the extra
cost is justified, the general purpose processor GPP core can be enforced with
instruction level coprocessors, coarse grain dedicated hardware, ad hoc
memories or new GPP cores. In this way the final design solution is tailored to
the application requirements. The proposed approach is based on three main
steps: the first one is the analysis of the targeted application using
efficient metrics. The second step is the selection of the appropriate
architecture template according to the first step results and recommendations.
The third step is the architecture generation. This approach is experimented
using various image and video algorithms showing its feasibility
Energy-efficient acceleration of MPEG-4 compression tools
We propose novel hardware accelerator architectures for the most computationally demanding algorithms of the MPEG-4 video compression standard-motion estimation, binary motion estimation (for shape coding), and the forward/inverse discrete cosine transforms (incorporating shape adaptive modes). These accelerators have been designed using general low-energy design philosophies at the algorithmic/architectural abstraction levels. The themes of these philosophies are avoiding waste and trading area/performance for power and energy gains. Each core has been synthesised targeting TSMC 0.09
ÎŒm TCBN90LP technology, and the experimental results presented in this paper show that the proposed cores improve upon the prior art
Efficient hardware architectures for MPEG-4 core profile
Efficient hardware acceleration architectures are proposed for the most demandingMPEG-4 core profile algorithms, namely; texture motion estimation (TME), binary motion estimation (BME)and the shape adaptive discrete cosine transform (SA-DCT). The proposed ME designs may also be used for H.264, since both architectures can handle variable block sizes. Both ME architectures employ early termination techniques that reduce latency and save needless memory accesses and power consumption. They also use a pixel subsampling technique to facilitate parallelism,
while balancing the computational load. The BME datapath also saves operations by using Run Length Coded (RLC) pixel addressing. The SA-DCT module has a re-configuring multiplier-less serial datapath using adders and multiplexers only to improve area and power. The SA-DCT packing steps are done using a minimal switching addressing scheme with guarded evaluation. All three modules have been synthesised targeting the WildCard-II FPGA benchmarking platform adopted by the MPEG-4 Part9 reference hardware group
Video Processing Acceleration using Reconfigurable Logic and Graphics Processors
A vexing question is `which architecture will prevail as the core feature of the next state of
the art video processing system?' This thesis examines the substitutive and collaborative
use of the two alternatives of the reconfigurable logic and graphics processor architectures.
A structured approach to executing architecture comparison is presented - this includes a
proposed `Three Axes of Algorithm Characterisation' scheme and a formulation of perfor-
mance drivers. The approach is an appealing platform for clearly defining the problem,
assumptions and results of a comparison. In this work it is used to resolve the advanta-
geous factors of the graphics processor and reconfigurable logic for video processing, and
the conditions determining which one is superior. The comparison results prompt the
exploration of the customisable options for the graphics processor architecture. To clearly
define the architectural design space, the graphics processor is first identifed as part of
a wider scope of homogeneous multi-processing element (HoMPE) architectures. A novel
exploration tool is described which is suited to the investigation of the customisable op-
tions of HoMPE architectures. The tool adopts a systematic exploration approach and a
high-level parameterisable system model, and is used to explore pre- and post-fabrication
customisable options for the graphics processor. A positive result of the exploration is the
proposal of a reconfigurable engine for data access (REDA) to optimise graphics processor
performance for video processing-specific memory access patterns. REDA demonstrates
the viability of the use of reconfigurable logic as collaborative `glue logic' in the graphics
processor architecture
Resource Optimized Quantum Architectures for Surface Code Implementations of Magic-State Distillation
Quantum computers capable of solving classically intractable problems are
under construction, and intermediate-scale devices are approaching completion.
Current efforts to design large-scale devices require allocating immense
resources to error correction, with the majority dedicated to the production of
high-fidelity ancillary states known as magic-states. Leading techniques focus
on dedicating a large, contiguous region of the processor as a single
"magic-state distillation factory" responsible for meeting the magic-state
demands of applications. In this work we design and analyze a set of optimized
factory architectural layouts that divide a single factory into spatially
distributed factories located throughout the processor. We find that
distributed factory architectures minimize the space-time volume overhead
imposed by distillation. Additionally, we find that the number of distributed
components in each optimal configuration is sensitive to application
characteristics and underlying physical device error rates. More specifically,
we find that the rate at which T-gates are demanded by an application has a
significant impact on the optimal distillation architecture. We develop an
optimization procedure that discovers the optimal number of factory
distillation rounds and number of output magic states per factory, as well as
an overall system architecture that interacts with the factories. This yields
between a 10x and 20x resource reduction compared to commonly accepted single
factory designs. Performance is analyzed across representative application
classes such as quantum simulation and quantum chemistry.Comment: 16 pages, 14 figure
Exploring Processor and Memory Architectures for Multimedia
Multimedia has become one of the cornerstones of our 21st century society and, when combined with mobility, has enabled a tremendous evolution of our society. However, joining these two concepts introduces many technical challenges. These range from having sufficient performance for handling multimedia content to having the battery stamina for acceptable mobile usage. When taking a projection of where we are heading, we see these issues becoming ever more challenging by increased mobility as well as advancements in multimedia content, such as introduction of stereoscopic 3D and augmented reality. The increased performance needs for handling multimedia come not only from an ongoing step-up in resolution going from QVGA (320x240) to Full HD (1920x1080) a 27x increase in less than half a decade. On top of this, there is also codec evolution (MPEG-2 to H.264 AVC) that adds to the computational load increase. To meet these performance challenges there has been processing and memory architecture advances (SIMD, out-of-order superscalarity, multicore processing and heterogeneous multilevel memories) in the mobile domain, in conjunction with ever increasing operating frequencies (200MHz to 2GHz) and on-chip memory sizes (128KB to 2-3MB). At the same time there is an increase in requirements for mobility, placing higher demands on battery-powered systems despite the steady increase in battery capacity (500 to 2000mAh). This leaves negative net result in-terms of battery capacity versus performance advances. In order to make optimal use of these architectural advances and to meet the power limitations in mobile systems, there is a need for taking an overall approach on how to best utilize these systems. The right trade-off between performance and power is crucial. On top of these constraints, the flexibility aspects of the system need to be addressed. All this makes it very important to reach the right architectural balance in the system. The first goal for this thesis is to examine multimedia applications and propose a flexible solution that can meet the architectural requirements in a mobile system. Secondly, propose an automated methodology of optimally mapping multimedia data and instructions to a heterogeneous multilevel memory subsystem. The proposed methodology uses constraint programming for solving a multidimensional optimization problem. Results from this work indicate that using todayâs most advanced mobile processor technology together with a multi-level heterogeneous on-chip memory subsystem can meet the performance requirements for handling multimedia. By utilizing the automated optimal memory mapping method presented in this thesis lower total power consumption can be achieved, whilst performance for multimedia applications is improved, by employing enhanced memory management. This is achieved through reduced external accesses and better reuse of memory objects. This automatic method shows high accuracy, up to 90%, for predicting multimedia memory accesses for a given architecture
A Low Power Architectural Framework for Automated Surveillance System with Low Bit Rate Transmission
Abstract The changed security scenario of the modern time has necessitated increased and sophisticated vigilance of the countries' borders. The technological challenges involved in accomplishing such feat of automated security system are many and require research at the components-and-algorithms as well as the architectural levels. This paper proposes an architectural framework for automated video surveillance comprising a network of sensors and closed circuit television cameras as well as proposing algorithmic/component research of software and hardware for the core functioning of the framework, such as: communication protocols, object detection, data-integration, object identification, object tracking, video compression, threat identification, and alarm generation. In this paper, we are addressing some general topological and routing features that would be adopted in our system. There are two types of data with regard to data communication â video stream and object detection. The network is broken down into several disjoint, almost equal zones. A zone have one or more one cluster. A zone manager is chosen among the cluster heads depending on their relative residual energies. There are several levels of control that could be implemented with this arrangement with localized decision made, to get distributed effect at all levels. A cell tracks each target in its zone. If the target moves out of the range of a cell, the cell manager will send the target description to estimated next cell. The next cell starts tracking the target. If the estimated cell is wrongly chosen, corrections will be made by the cluster heads to get the new target-tracking. We also propose bitrate reduction algorithms to accommodate the limited bandwidth. One of the main feature of this paper is introducing a Low-Power Low-Bit rate video compression algorithm to accommodate the low power requirements at sensor nodes, and the low bit rate requirement for the communication protocol. We proposed two algorithms the ALBR and LPHSME. ALBR is addressing low bit rate required for sensors network with limited bandwidth which achieves a reduction in Average number of bits per Iframe by approximately 60% in case of low motion video sequences and 53% in case of fast motion video sequences . LPHSME addresses low power requirements of multi sensor network that has limited power batteries. The performance of the proposed LPHSME algorithm versus full search and three step search indicates a reduction in motion estimation time by approximately 89% in case of low motion video sequences (e.g., Claire ) and 84% in case of fast motion video sequences. The reduced complexity of LPHSME results in low power requirements
Performance Analysis of a Novel GPU Computation-to-core Mapping Scheme for Robust Facet Image Modeling
Though the GPGPU concept is well-known
in image processing, much more work remains to be done
to fully exploit GPUs as an alternative computation
engine. This paper investigates the computation-to-core
mapping strategies to probe the efficiency and scalability
of the robust facet image modeling algorithm on GPUs.
Our fine-grained computation-to-core mapping scheme
shows a significant performance gain over the standard
pixel-wise mapping scheme. With in-depth performance
comparisons across the two different mapping schemes,
we analyze the impact of the level of parallelism on
the GPU computation and suggest two principles for
optimizing future image processing applications on the
GPU platform
- âŠ