Exploring Processor and Memory Architectures for Multimedia
Multimedia has become one of the cornerstones of 21st-century society and, combined with mobility, has driven a tremendous evolution of that society. However, joining these two concepts introduces many technical challenges, ranging from sufficient performance for handling multimedia content to the battery stamina needed for acceptable mobile usage. Projecting where we are heading, these issues become ever more challenging with increased mobility and with advances in multimedia content, such as the introduction of stereoscopic 3D and augmented reality. The increased performance needs for handling multimedia come partly from an ongoing step-up in resolution, from QVGA (320x240) to Full HD (1920x1080), a 27x increase in pixel count in less than half a decade. On top of this, codec evolution (MPEG-2 to H.264 AVC) adds to the computational load. To meet these performance challenges, there have been advances in processing and memory architecture (SIMD, out-of-order superscalarity, multicore processing and heterogeneous multilevel memories) in the mobile domain, in conjunction with ever-increasing operating frequencies (200 MHz to 2 GHz) and on-chip memory sizes (128 KB to 2-3 MB). At the same time, requirements for mobility are increasing, placing higher demands on battery-powered systems despite the steady increase in battery capacity (500 to 2000 mAh). This leaves a negative net result in terms of battery capacity versus performance advances. To make optimal use of these architectural advances and to meet the power limitations of mobile systems, an overall approach is needed to how these systems are best utilized. The right trade-off between performance and power is crucial. On top of these constraints, the flexibility of the system needs to be addressed. All this makes it very important to reach the right architectural balance in the system.
The first goal of this thesis is to examine multimedia applications and propose a flexible solution that can meet the architectural requirements of a mobile system. The second is to propose an automated methodology for optimally mapping multimedia data and instructions to a heterogeneous multilevel memory subsystem. The proposed methodology uses constraint programming to solve a multidimensional optimization problem. Results from this work indicate that today’s most advanced mobile processor technology, together with a multilevel heterogeneous on-chip memory subsystem, can meet the performance requirements for handling multimedia. By using the automated optimal memory mapping method presented in this thesis, lower total power consumption can be achieved while performance for multimedia applications is improved through enhanced memory management. This is achieved through fewer external accesses and better reuse of memory objects. The automatic method shows high accuracy, up to 90%, in predicting multimedia memory accesses for a given architecture.
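The 27x resolution step quoted in this abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the names `QVGA`, `FULL_HD` and `pixel_count` are hypothetical helpers introduced here, and the figure is read as a ratio of pixel counts per frame, which is an assumption about what the abstract intends.

```python
# Resolutions quoted in the abstract, as (width, height) in pixels.
QVGA = (320, 240)
FULL_HD = (1920, 1080)


def pixel_count(resolution):
    """Return the number of pixels in one frame of the given resolution."""
    width, height = resolution
    return width * height


# Ratio of pixels per frame between Full HD and QVGA.
ratio = pixel_count(FULL_HD) / pixel_count(QVGA)
print(f"QVGA pixels:    {pixel_count(QVGA)}")
print(f"Full HD pixels: {pixel_count(FULL_HD)}")
print(f"Increase:       {ratio:.0f}x")
```

Under that reading, the ratio covers the per-frame pixel load only; codec complexity and frame rate would add to it separately.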
VLSI design and FPGA-based prototyping of a buffered serial port for audio applications
The present semiconductor market is very competitive: on one
side, consumers ask for ever-increasing performance and new
possibilities; on the other, companies have to offer low prices in
order to be successful. As far as performance is concerned, just think
of the wide range of mobile applications, such as PDAs, cellular
phones, and laptops: quality of service, battery life
and computational power are always taken into account when buying
new devices. On the other side, due to competition, costs have
to be very low; this means that both recurring and non-recurring
engineering costs have to be kept under control.
Time is another important concern: it is usually true that the
earlier a product is presented to the market, the wider share of
the market it will gain. This leads modern semiconductor companies
to look for viable ways to design improved products in a short
time. Because of the complexity of new electronic systems,
this is not an easy task to accomplish; even though electronic
design automation (EDA) tools have greatly improved in recent
years, a gap still exists between the rate at which foundries can
produce chips and the rate at which these chips can be designed.
A very common approach to deal with complexity and performance
requirements is to integrate as many functions as possible on a
single chip (System-On-Chip); this allows higher clock frequency
and lower costs. In connection with this, design reuse has also
spread through a large part of the semiconductor world. This means
using in your system modules that others have already designed and
tested. This allows you to skip some steps in the design flow (at
least for those modules) and save a significant amount of time.
The work of my thesis lies within this framework. It was developed at
StarCore, a company headquartered in Austin, Texas. StarCore
designs and licences Digital Signal Processors as intellectual
property; it is one of the companies that offer their
products for use in other electronic systems, sparing licensees
the time of designing them by themselves.
A Digital Signal Processor is a special kind of processor,
designed to execute computation-intensive applications: encoding and
decoding of information, voice synthesis and recognition,
compression and decompression of data, and Fourier transforms are just
some examples. In many systems, thanks to its programmability and
its limited cost, it is the most suitable solution. For example, most
mobile phones employ a DSP to perform baseband
operations on the signal.
In this kind of system, it is important that very few cycles are
spent on anything other than signal processing, such as dealing with
peripherals. In the case of an audio signal, it is important that
the audio port demands as few cycles as possible. For this
reason, my activity at StarCore was to design and develop an audio
port controller that minimizes the cycles demanded of the
processor when the algorithm being run is frame-based.
For this purpose I designed hardware to be mapped into an FPGA
and wrote some software for the DSP; I worked mainly with the
Development Board, used to prototype applications based on the
StarCore processor.
A configurable vector processor for accelerating speech coding algorithms
The growing demand for voice-over-packet (VoIP) services and multimedia-rich
applications has made the efficient, real-time implementation of
low-bit-rate speech coders on embedded VLSI platforms increasingly important. Such speech coders are
designed to substantially reduce bandwidth requirements, thus enabling dense multichannel
gateways in a small form factor. This, however, comes at a high computational cost,
which mandates the use of very high-performance embedded processors.
This thesis investigates the potential acceleration of two major ITU-T speech coding
algorithms, namely G.729A and G.723.1, through their efficient implementation on a
configurable extensible vector embedded CPU architecture. New scalar and vector ISAs
were introduced which resulted in up to 80% reduction in the dynamic instruction count
of both workloads. These instructions were subsequently encapsulated into a parametric,
hybrid SISD (scalar processor)–SIMD (vector) processor. This work presents the research
and implementation of the vector datapath of this vector coprocessor which is tightly-coupled
to a Sparc-V8 compliant CPU, the optimization and simulation methodologies
employed and the use of Electronic System Level (ESL) techniques to rapidly design
SIMD datapaths.
Performance impact of unaligned memory operations in SIMD extensions for video CODEC applications
Although SIMD extensions are a cost-effective way to exploit the data-level parallelism present in most media applications, we show that they have a very limited memory architecture with weak support for unaligned memory accesses. In video codecs and other applications, accessing unaligned positions without efficient architectural support incurs a large performance penalty and in some cases makes vectorization counter-productive. In this paper we analyze the performance impact of extending the Altivec SIMD ISA with unaligned memory operations. Results show that for several kernels in the H.264/AVC media codec, unaligned access support provides a speedup of up to 3.8x compared to the plain SIMD version, translating into an average of 1.2x in the entire application. In addition to providing a significant performance advantage, the use of unaligned memory instructions makes programming SIMD code much easier, both for the manual developer and for the auto-vectorizing compiler.
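The gap between the kernel-level speedup (up to 3.8) and the application-level speedup (1.2 on average) quoted in this abstract is the usual effect of accelerating only part of a program. The sketch below is a back-of-envelope illustration using Amdahl's law, not the paper's own analysis: it assumes, hypothetically, that a single fraction of the run time is sped up uniformly by 3.8, which the abstract does not state. The function names are introduced here for illustration only.

```python
def overall_speedup(fraction, kernel_speedup):
    """Amdahl's law: whole-program speedup when `fraction` of the
    original run time is accelerated by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def accelerated_fraction(overall, kernel_speedup):
    """Invert Amdahl's law: the fraction of run time that must be
    accelerated by `kernel_speedup` to reach `overall`."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / kernel_speedup)


# Figures quoted in the abstract, used under the uniform-speedup assumption.
f = accelerated_fraction(overall=1.2, kernel_speedup=3.8)
print(f"Implied accelerated fraction of run time: {f:.1%}")
print(f"Check, overall speedup: {overall_speedup(f, 3.8):.2f}")
```

Under these assumptions, a bit under a quarter of the original run time would be spent in the accelerated kernels; the real distribution across kernels is not given in the abstract.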
Enhancing a Neurosurgical Imaging System with a PC-based Video Processing Solution
This work presents a PC-based prototype video processing application developed for use with a specific neurosurgical imaging device, the OPMI® Pentero™ operating microscope, in the Department of Neurosurgery of Helsinki University Central Hospital at Töölö, Helsinki. The motivation for implementing the software was the lack of some clinically important features in the imaging system provided by the microscope.
The imaging system is used as an online diagnostic aid during surgery. The microscope has two internal video cameras; one for regular white light imaging and one for near-infrared fluorescence imaging, used for indocyanine green videoangiography. The footage of the microscope’s current imaging mode is accessed via the composite auxiliary output of the device. The microscope also has an external high resolution white light video camera, accessed via a composite output of a separate video hub.
The PC was chosen as the video processing platform for its unparalleled combination of prototyping and high-throughput video processing capabilities. A thorough analysis of the platform and of efficient video processing methods was conducted in the thesis, and the results were used in the design of the imaging station. The features found feasible during the project were incorporated into a video processing application running on the Ubuntu GNU/Linux distribution. The clinical usefulness of the implemented features was ensured beforehand by consulting the neurosurgeons using the original system.
The most significant shortcomings of the original imaging system were remedied in this work. The key features of the developed application include live streaming, simultaneous streaming and recording, and playback of up to two video streams. The playback mode provides full media player controls, with frame-by-frame precision rewinding, in an intuitive and responsive interface. A single view and a side-by-side comparison mode are provided for the streams. The former gives more detail, while the latter can be used, for example, for before-after and anatomic-angiographic comparisons.
Meta-Programming and Policy-Based Design as a Technique of Architecting Modular and Efficient DSP Algorithm Implementations
The meta-programming paradigm and policy-based design are lesser-known programming techniques in the Digital Signal Processing (DSP) community, which is used to coding in pure C or assembly language. Major software components, such as the C++ STL, have proven the usefulness of these paradigms in delivering the top performance of highly optimised native code, along with the abstraction and modularity necessary in complex software projects. This paper describes the composition of DSP code using these techniques, taking as an example an implementation of the Feedback Delay Network (FDN) artificial reverberation algorithm. The proposed approach proved to be practical, especially for prototyping computationally intensive algorithms. To provide further performance insight, we discuss the techniques in the context of other optimisation methods, such as the use of Single Instruction Multiple Data (SIMD) instruction sets and the exploitation of superscalar architecture capabilities.
A study of digital signal processors for the development of real-time applications
Digital signal processing is the engineering science devoted to the
analysis of real-world signals, such as audio, voice and video, among others,
using mathematical techniques to enhance, modify and extract information from
those signals. Digital signal processing has brought many changes
to different fields: communications, medicine, radar and sonar, and
high-quality music reproduction, among others.
Digital signal processors, or DSPs, are microprocessors specifically
designed for digital signal processing. Some of their basic
characteristics, such as arithmetic format, speed, memory organization
or internal architecture, determine whether or not they are suitable
for a particular application.
Given the boom in the use of DSPs in a large number of industrial
applications, in research and in consumer electronics, this work aims
to provide a general study of the characteristics, architectures,
selection criteria, differences from other types of processors
that are sometimes used in signal processing tasks and, in short,
everything related to the implementation of DSPs in the development of
real-time applications.
Media gateway using a GPU
Master's degree in Computer and Telematics Engineering
KAVUAKA: a low-power application-specific processor architecture for digital hearing aids
The power consumption of digital hearing aids is very restricted due to their small physical size, and the available hardware resources for signal processing are limited. However, there is a demand for more processing performance to make future hearing aids more useful and smarter. Future hearing aids should be able to detect, localize, and recognize target speakers in complex acoustic environments to further improve speech intelligibility for the individual hearing aid user. Computationally intensive algorithms are required for this task. To maintain acceptable battery life, the hearing aid processing architecture must be highly optimized for extremely low power consumption and high processing performance. The integration of application-specific instruction-set processors (ASIPs) into hearing aids enables a wide range of architectural customizations to meet the stringent power consumption and performance requirements. In this thesis, the application-specific hearing aid processor KAVUAKA is presented, which is customized and optimized with state-of-the-art hearing aid algorithms such as speaker localization, noise reduction, beamforming algorithms, and speech recognition. Specialized and application-specific instructions are designed and added to the baseline instruction set architecture (ISA). Among the major contributions are a multiply-accumulate (MAC) unit for real- and complex-valued numbers, architectures for power reduction during register accesses, co-processors and a low-latency audio interface. With the proposed MAC architecture, the KAVUAKA processor requires 16% fewer cycles for the computation of a 128-point fast Fourier transform (FFT) compared to related programmable digital signal processors. The power consumption during register file accesses is decreased by 6% to 17% with isolation and by-pass techniques.
The hardware-induced audio latency is 34% lower compared to related audio interfaces for a frame size of 64 samples. The final hearing aid system-on-chip (SoC) with four KAVUAKA processor cores and ten co-processors is integrated as an application-specific integrated circuit (ASIC) using a 40 nm low-power technology. The die size is 3.6 mm². Each of the processors and co-processors contains individual customizations and hardware features, with datapath widths varying between 24 and 64 bits. The core area of the 64-bit processor configuration is 0.134 mm². The processors are organized in two clusters that share memory, an audio interface, co-processors and serial interfaces. The average power consumption at a clock speed of 10 MHz is 2.4 mW for the SoC and 0.6 mW for the 64-bit processor. Case studies with four reference hearing aid algorithms are used to present and evaluate the proposed hardware architectures and optimizations. The program code for each processor and co-processor is generated and optimized with evolutionary algorithms for operation merging, instruction scheduling and register allocation. The KAVUAKA processor architecture is compared to related processor architectures in terms of processing performance, average power consumption, and silicon area requirements.