440 research outputs found
Neighbor cache prefetching for multimedia image and video processing
Cache performance is strongly influenced by the type of locality embodied in programs. In particular, multimedia programs handling images and videos are characterized by a bidimensional spatial locality, which is not adequately exploited by standard caches. In this paper we propose novel cache prefetching techniques for image data, called neighbor prefetching, able to improve exploitation of bidimensional spatial locality. A performance comparison is provided against other assessed prefetching techniques on a multimedia workload (with MPEG-2 and MPEG-4 decoding, image processing, and visual object segmentation), including a detailed evaluation of both the miss rate and the memory access time. Results prove that neighbor prefetching achieves a significant reduction in the time due to delayed memory cycles (more than 97% on MPEG-4 with respect to 75% of the second performing technique). This reduction leads to a substantial speedup on the overall memory access time (up to 140% for MPEG-4). Performance has been measured with the PRIMA trace-driven simulator, specifically devised to support cache prefetching
Improving data prefetching efficacy in multimedia applications
The workload of multimedia applications has a strong impact on cache memory performance, since the locality of memory references embedded in multimedia programs differs from that of traditional programs. In many cases, standard cache memory organization achieves poorer performance when used for multimedia. A widely-explored approach to improve cache performance is hardware prefetching, which allows the pre-loading of data in the cache before they are referenced. However, existing hardware prefetching approaches are unable to exploit the potential improvement in performance, since they are not tailored to multimedia locality. In this paper we propose novel effective approaches to hardware prefetching to be used in image processing programs for multimedia. Experimental results are reported for a suite of multimedia image processing programs including MPEG-2 decoding and encoding, convolution, thresholding, and edge chain coding
Recommended from our members
Energy-aware embedded media processing: customizable memory subsystems and energy management policies
textThe design of energy-efficient data memory architectures for embedded
system platforms has received considerable attention in recent years. In
this dissertation we propose a special-purpose data memory subsystem, called
Xtream-Fit, targeted to streaming media applications executing on both generic
uniprocessor embedded platforms and powerful SMT-based multi-threading
platforms. We empirically demonstrate that Xtream-Fit achieves high energydelay
efficiency across a wide range of media devices, from systems running a
single media application to systems concurrently executing multiple media applications
under synchronization constraints. Xtream-Fit’s energy efficiency
is predicated on a novel task-based execution model that exposes/enhances
opportunities for efficient prefetching, and aggressive dynamic energy conservation
techniques targeting on-chip and off-chip memory components. A key
novelty of Xtream-Fit is that it exposes a single customization parameter, thus
enabling a very simple and yet effective design space exploration methodology
to find the best memory configuration for the target application(s). Extensive
experimental results show that Xtream-Fit reduces energy-delay product
substantially – by 32% to 69% – as compared to ‘standard’ general-purpose
memory subsystems enhanced with state of the art cache decay and SDRAM
power mode control policies.Electrical and Computer Engineerin
A Survey of Techniques For Improving Energy Efficiency in Embedded Computing Systems
Recent technological advances have greatly improved the performance and
features of embedded systems. With the number of just mobile devices now
reaching nearly equal to the population of earth, embedded systems have truly
become ubiquitous. These trends, however, have also made the task of managing
their power consumption extremely challenging. In recent years, several
techniques have been proposed to address this issue. In this paper, we survey
the techniques for managing power consumption of embedded systems. We discuss
the need of power management and provide a classification of the techniques on
several important parameters to highlight their similarities and differences.
This paper is intended to help the researchers and application-developers in
gaining insights into the working of power management techniques and designing
even more efficient high-performance embedded systems of tomorrow
A C++-embedded Domain-Specific Language for programming the MORA soft processor array
MORA is a novel platform for high-level FPGA programming of streaming vector and matrix operations, aimed at multimedia applications. It consists of soft array of pipelined low-complexity SIMD processors-in-memory (PIM). We present a Domain-Specific Language (DSL) for high-level programming of the MORA soft processor array. The DSL is embedded in C++, providing designers with a familiar language framework and the ability to compile designs using a standard compiler for functional testing before generating the FPGA bitstream using the MORA toolchain. The paper discusses the MORA-C++ DSL and the compilation route into the assembly for the MORA machine and provides examples to illustrate the programming model and performance
Software caching techniques and hardware optimizations for on-chip local memories
Despite the fact that the most viable L1 memories in processors are caches,
on-chip local memories have been a great topic of consideration lately. Local
memories are an interesting design option due to their many benefits: less
area occupancy, reduced energy consumption and fast and constant access time.
These benefits are especially interesting for the design of modern multicore processors
since power and latency are important assets in computer architecture
today. Also, local memories do not generate coherency traffic which is important
for the scalability of the multicore systems.
Unfortunately, local memories have not been well accepted in modern processors
yet, mainly due to their poor programmability. Systems with on-chip local
memories do not have hardware support for transparent data transfers between
local and global memories, and thus ease of programming is one of the main
impediments for the broad acceptance of those systems. This thesis addresses
software and hardware optimizations regarding the programmability, and the
usage of the on-chip local memories in the context of both single-core and multicore
systems.
Software optimizations are related to the software caching techniques. Software
cache is a robust approach to provide the user with a transparent view
of the memory architecture; but this software approach can suffer from poor
performance. In this thesis, we start optimizing traditional software cache by
proposing a hierarchical, hybrid software-cache architecture. Afterwards, we develop
few optimizations in order to speedup our hybrid software cache as much
as possible. As the result of the software optimizations we obtain that our hybrid
software cache performs from 4 to 10 times faster than traditional software
cache on a set of NAS parallel benchmarks.
We do not stop with software caching. We cover some other aspects of the
architectures with on-chip local memories, such as the quality of the generated
code and its correspondence with the quality of the buffer management in local
memories, in order to improve performance of these architectures. Therefore,
we run our research till we reach the limit in software and start proposing optimizations
on the hardware level. Two hardware proposals are presented in this
thesis. One is about relaxing alignment constraints imposed in the architectures
with on-chip local memories and the other proposal is about accelerating the
management of local memories by providing hardware support for the majority
of actions performed in our software cache.Malgrat les memòries cau encara son el component basic pel disseny del subsistema de memòria, les memòries locals han esdevingut una alternativa degut a les seves característiques pel que fa a l’ocupació d’àrea, el seu consum energètic i el seu rendiment amb un temps d’accés ràpid i constant. Aquestes característiques son d’especial interès quan les properes arquitectures multi-nucli estan limitades pel consum de potencia i la latència del subsistema de memòria.Les memòries locals pateixen de limitacions respecte la complexitat en la seva programació, fet que dificulta la seva introducció en arquitectures multi-nucli, tot i els avantatges esmentats anteriorment. Aquesta tesi presenta un seguit de solucions basades en programari i maquinari específicament dissenyat per resoldre aquestes limitacions.Les optimitzacions del programari estan basades amb tècniques d'emmagatzematge de memòria cau suportades per llibreries especifiques. La memòria cau per programari és un sòlid mètode per proporcionar a l'usuari una visió transparent de l'arquitectura, però aquest enfocament pot patir d'un rendiment deficient. En aquesta tesi, es proposa una estructura jeràrquica i híbrida. Posteriorment, desenvolupem optimitzacions per tal d'accelerar l’execució del programari que suporta el disseny de la memòria cau. Com a resultat de les optimitzacions realitzades, obtenim que el nostre disseny híbrid es comporta de 4 a 10 vegades més ràpid que una implementació tradicional de memòria cau sobre un conjunt d’aplicacions de referencia, com son els “NAS parallel benchmarks”.El treball de tesi inclou altres aspectes de les arquitectures amb memòries locals, com ara la qualitat del codi generat i la seva correspondència amb la qualitat de la gestió de memòria intermèdia en les memòries locals, per tal de millorar el rendiment d'aquestes arquitectures. La tesi desenvolupa propostes basades estrictament en el disseny de nou maquinari per tal de millorar el rendiment de les memòries locals quan ja no es possible realitzar mes optimitzacions en el programari. En particular, la tesi presenta dues propostes de maquinari: una relaxa les restriccions imposades per les memòries locals respecte l’alineament de dades, l’altra introdueix maquinari específic per accelerar les operacions mes usuals sobre les memòries locals
Feasibility analysis of correlation based prefetching using digital signal processing
As the gap between processor performance and memory performance continues to broaden with time, techniques to hide memory latency such as correlation based prefetching become exceedingly important. When a memory reference issued by the processor misses the level one cache, the request propagates down the memory hierarchy until it finally finds the requested datum. With each layer traversed, the latency grows exponentially. Prefetching is a technique used to hide this latency by attempting to predict which memory references will be requested in the near future, and then load them into cache before they are needed. This work investigates the use of digital signal processing techniques in designing an effective prefetch algorithm. The algorithm proposed in this work uses the Kalman Filter as the basic digital signal processing block. The sequence of memory address references with respect to time is interpreted as a digital signal. By applying Kalman filtering techniques, a robust prediction algorithm is presented to predict future miss references based on the pattern of previous miss references. The algorithm was simulated using 40 benchmark programs from the Olden, MediaBench, and SPEC benchmark suites for the Alpha 21264 and the PISA (a MIPS-like ISA) instruction set architectures. A main difference between these two ISAs is that the Alpha 21264 ISA contains software prefetch instructions, and the PISA instruction set architecture does not. The simulations place a prefetcher unit between the level one data cache and the level two unified cache. SimpleScalar simulation results for a broad set of benchmark programs using 32 Kalman filter blocks show an average of 6.5% speedup for the Alpha 21264 ISA, and an average of 5.6% speedup for the PISA instruction set architecture for those benchmark programs which have a potential speedup from prefetching greater than 10%
- …