212 research outputs found
Super-scalar RAM-CPU cache compression
High-performance data-intensive query processing tasks like OLAP, data mining or scientific data analysis can be severely I/O bound, even when high-e
Super-Scalar RAM-CPU Cache Compression
High-performance data-intensive query processing tasks like OLAP, data mining or scientific data analysis can be severely I/O bound, even whe
The Fifth NASA Symposium on VLSI Design
The fifth annual NASA Symposium on VLSI Design had 13 sessions including Radiation Effects, Architectures, Mixed Signal, Design Techniques, Fault Testing, Synthesis, Signal Processing, and other Featured Presentations. The symposium provides insights into developments in VLSI and digital systems which can be used to increase data systems performance. The presentations share insights into next generation advances that will serve as a basis for future VLSI design
Access pattern-based code compression for memory-constrained systems
As compared to a large spectrum of performance optimizations, relatively less effort has been dedicated to optimize other aspects of embedded applications such as memory space requirements, power, real-time predictability, and reliability. In particular, many modern embedded systems operate under tight memory space constraints. One way of addressing this constraint is to compress executable code and data as much as possible. While researchers on code compression have studied efficient hardware and software based code compression strategies, many of these techniques do not take application behavior into account; that is, the same compression/decompression strategy is used irrespective of the application being optimized. This article presents an application-sensitive code compression strategy based on control flow graph (CFG) representation of the embedded program. The idea is to start with a memory image wherein all basic blocks of the application are compressed, and decompress only the blocks that are predicted to be needed in the near future. When the current access to a basic block is over, our approach also decides the point at which the block could be compressed. We propose and evaluate several compression and decompression strategies that try to reduce memory requirements without excessively increasing the original instruction cycle counts. Some of our strategies make use of profile data, whereas others are fully automatic. Our experimental evaluation using seven applications from the MediaBench suite and three large embedded applications reveals that the proposed code compression strategy is very successful in practice. Our results also indicate that working at a basic block granularity, as opposed to a procedure granularity, is important for maximizing memory space savings. © 2008 ACM
Trends in hardware architecture for mobile devices
In the last ten years, two main factors have fueled the steady growth in sales
of mobile computing and communication devices: a) the reduction of the
footprint of the devices themselves, such as cellular handsets and small
computers; and b) the success in developing low-power hardware which allows
the devices to operate autonomously for hours or even days. In this review, I
show that the first generation of mobile devices was DSP centric – that is,
its architecture was based in fast processing of digitized signals using low-
power, yet numerically powerful DSPs. However, the next generation of mobile
devices will be built around DSPs and low power microprocessor cores for
general processing applications. Mobile devices will become data-centric. The
main challenge for designers of such hybrid architectures is to increase the
computational performance of the computing unit, while keeping power constant,
or even reducing it. This report shows that low-power mobile hardware
architectures design goes hand in hand with advances in compiling techniques.
We look at the synergy between hardware and software, and show that a good
balance between both can lead to innovative lowpower processor architectures
Dynamically Reconfigurable Systolic Array Accelerators: A Case Study with Extended Kalman Filter and Discrete Wavelet Transform Algorithms
Field programmable grid arrays (FPGA) are increasingly being adopted as the primary on-board computing system for autonomous deep space vehicles. There is a need to support several complex applications for navigation and image processing in a rapidly responsive on-board FPGA-based computer. This requires exploring and combining several design concepts such as systolic arrays, hardware-software partitioning, and partial dynamic reconfiguration. A microprocessor/co-processor design that can accelerate two single precision oating-point algorithms, extended Kalman lter and a discrete wavelet transform, is presented. This research makes three key contributions. (i) A polymorphic systolic array framework comprising of recofigurable partial region-based sockets to accelerate algorithms amenable to being mapped onto linear systolic arrays. When implemented on a low end Xilinx Virtex4 SX35 FPGA the design provides a speedup of at least 4.18x and 6.61x over a state of the art microprocessor used in spacecraft systems for the extended Kalman lter and discrete wavelet transform algorithms, respectively. (ii) Switchboxes to enable communication between static and partial reconfigurable regions and a simple protocol to enable schedule changes when a socket\u27s contents are dynamically reconfigured to alter the concurrency of the participating systolic arrays. (iii) A hybrid partial dynamic reconfiguration method that combines Xilinx early access partial reconfiguration, on-chip bitstream decompression, and bitstream relocation to enable fast scaling of systolic arrays on the PolySAF. This technique provided a 2.7x improvement in reconfiguration time compared to an o-chip partial reconfiguration technique that used a Flash card on the FPGA board, and a 44% improvement in BRAM usage compared to not using compression
Super-Scalar RAM-CPU Cache Compression
textabstractHigh-performance data-intensive query processing tasks like OLAP, data mining or scientific data analysis can be severely I/O bound, even when high-end RAID storage systems are used. Compression can alleviate this bottleneck only if encoding and decoding speeds significantly exceed RAID I/O bandwidth. For this purpose, we propose three new versatile compression schemes (PDICT, PFOR, and PFOR-DELTA) that are specifically designed to extract maximum IPC from modern CPUs. We compare these algorithms with compression techniques used in (commercial) database and information retrieval systems. Our experiments on the MonetDB/X100 database system, using both DSM and PAX disk storage, show that these techniques strongly accelerate TPC-H performance to the point that the I/O bottleneck is eliminated
Gbit/second lossless data compression hardware
This thesis investigates how to improve the performance of lossless data compression hardware
as a tool to reduce the cost per bit stored in a computer system or transmitted over a
communication network.
Lossless data compression allows the exact reconstruction of the original data after
decompression. Its deployment in some high-bandwidth applications has been hampered due to
performance limitations in the compressing hardware that needs to match the performance of the
original system to avoid becoming a bottleneck. Advancing the area of lossless data compression
hardware, hence, offers a valid motivation with the potential of doubling the performance of the
system that incorporates it with minimum investment.
This work starts by presenting an analysis of current compression methods with the objective of
identifying the factors that limit performance and also the factors that increase it. [Continues.
- …