Space-Efficient Predictive Block Management
With growing disk and storage capacities, merely tracking metadata for every block in a system becomes a daunting task in itself. In previous work, we demonstrated a system-software effort in predictive data grouping for reducing power and latency on hard disks. The structures used, similar to prior efforts in prefetching and prefetch caching, track access-successor information at the block level, keeping a fixed number of immediate successors per block. While providing powerful predictive expansion capabilities and requiring less metadata than many previous strategies, how much data is actually required remains a growing concern. In this paper, we present SESH, a Space Efficient Storage of Heredity, a novel method of storing equivalent information. It exploits the high degree of block-level predictability observed in a number of workload trace sets to reduce overall metadata storage by up to 99% without any loss of information. As a result, we are able to provide a predictive tool that is adaptive, accurate, and robust in the face of workload noise, for a tiny fraction of the metadata cost previously anticipated; in some cases, the required size drops from 12 gigabytes to less than 150 megabytes.
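The per-block successor tracking described above can be sketched in miniature: each block keeps a small, fixed-size ordered set of the blocks observed immediately after it. This is a toy illustration under invented names, not the SESH encoding itself:

```python
from collections import OrderedDict

class SuccessorTable:
    """Per-block table keeping a fixed number of immediate successors."""
    def __init__(self, k=2):
        self.k = k            # successors kept per block
        self.table = {}       # block id -> OrderedDict of successor ids
        self.last = None      # previously accessed block

    def access(self, block):
        if self.last is not None:
            succs = self.table.setdefault(self.last, OrderedDict())
            succs[block] = None
            succs.move_to_end(block)       # refresh on re-observation
            while len(succs) > self.k:
                succs.popitem(last=False)  # evict the stalest successor
        self.last = block

    def predict(self, block):
        return list(self.table.get(block, ()))

t = SuccessorTable(k=2)
for b in [1, 2, 3, 1, 2, 4]:
    t.access(b)
print(t.predict(2))  # → [3, 4]
```

Keeping only k successors per block is what bounds the metadata; the eviction policy shown (oldest-observed first) is one plausible choice among several.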
Compiler-driven data layout transformations for network applications
This work approaches the little-studied topic of compiler optimisations directed at
network applications.
It starts by investigating if there exist any fundamental differences between application
domains that justify the development and tuning of domain-specific compiler optimisations.
It shows an automated approach that is capable of identifying domain-specific
workload characterisations and presenting them in a readily interpretable format based
on decision trees. The generated workload profiles summarise key resource utilisation
issues and enable compiler engineers to address the highlighted bottlenecks.
By applying this methodology to a data-intensive network infrastructure application, it
shows that data organisation is the key obstacle to overcome in order to achieve high
performance.
It therefore proposes and evaluates three specialised data transformations (structure
splitting, array regrouping, and software caching) against the industrial EEMBC networking
benchmarks and real-world data sets. It demonstrates, on the one hand, that
speedups of up to 2.62x can be achieved, but, on the other, that no single solution performs
equally well across different network traffic scenarios.
Hence, to address this issue, an adaptive software caching scheme for high-frequency
route lookup operations is introduced and its effectiveness evaluated, once more
against the EEMBC networking benchmarks and real-world data sets, achieving speedups
of up to 3.30x and 2.27x. The results clearly demonstrate that adaptive data organisation
schemes are necessary to ensure optimal performance under varying network loads.
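The software-caching idea for route lookups can be sketched as a small direct-mapped cache placed in front of a slower lookup routine. This is only a hedged illustration in Python, with invented names, rather than the thesis's actual scheme:

```python
class RouteCache:
    """Direct-mapped software cache in front of a slow route-lookup function."""
    def __init__(self, slow_lookup, size=256):
        self.slow = slow_lookup
        self.size = size
        self.keys = [None] * size
        self.vals = [None] * size
        self.hits = self.misses = 0

    def lookup(self, dst):
        slot = dst % self.size          # dst as an integer (e.g. packed IPv4)
        if self.keys[slot] == dst:
            self.hits += 1              # fast path: served from the cache
            return self.vals[slot]
        self.misses += 1                # slow path: fill the slot
        self.keys[slot] = dst
        self.vals[slot] = self.slow(dst)
        return self.vals[slot]

table = {1: "eth0", 2: "eth1"}          # toy routing table
cache = RouteCache(table.get, size=8)
for dst in [1, 1, 2, 1]:
    cache.lookup(dst)
print(cache.hits, cache.misses)  # → 2 2
```

An adaptive variant might monitor the hit counters and resize the cache or switch replacement policy as traffic characteristics change, which is the kind of adaptation the evaluated scheme targets.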
Finally, this research addresses another issue introduced by data transformations such
as array regrouping and software caching: the need for static analysis to enable
efficient resource allocation. This thesis proposes a static code analyser that allows the
automatic resource analysis of source code containing list and tree structures. The tool
applies a combination of amortised analysis and separation-logic methodology to real
code and is able to evaluate the type and resource usage of existing data structures, which
can be used to compute global resource-consumption values for full data-intensive
network applications.
TraNCE: Transforming Nested Collections Efficiently
Nested relational query languages have long been seen as an attractive tool for scenarios involving large hierarchical datasets. There has been a resurgence of interest in nested relational languages. One driver has been the affinity of these languages for large-scale processing platforms such as Spark and Flink.
This demonstration gives a tour of TraNCE, a new system for processing nested data on top of distributed processing systems. The core innovation of the system is a compiler that processes nested relational queries in a series of transformations; these include variants of two prior techniques, shredding and unnesting, as well as a materialization transformation that customizes the way levels of the nested output are generated. The TraNCE platform builds on these techniques by adding components for users to create and visualize queries, as well as data exploration and notebook execution targets, to facilitate the construction of large-scale data science applications. The demonstration will both showcase the system from the viewpoint of usability for data scientists and illustrate the data management techniques employed.
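The shredding transformation mentioned above can be illustrated in miniature: replace each nested collection with a label and emit its inner items as a separate flat relation keyed by that label. This is a one-level sketch under invented names, not TraNCE's actual representation:

```python
def shred(records):
    """One level of shredding: inner lists become labels plus a child relation."""
    top, child = [], []
    for i, rec in enumerate(records):
        flat = dict(rec)
        for field, val in rec.items():
            if isinstance(val, list):
                label = (i, field)              # unique label for this inner bag
                flat[field] = label
                child.extend({"label": label, "value": v} for v in val)
        top.append(flat)
    return top, child

top, child = shred([{"name": "a", "orders": [10, 20]},
                    {"name": "b", "orders": []}])
print(top[0]["orders"], len(child))  # → (0, 'orders') 2
```

The payoff of this flattening is that both output relations are ordinary flat tables, which distributed engines such as Spark or Flink can partition and join without materializing deep nesting.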
Optimizing Local Memory Allocation and Assignment Through a Decoupled Approach
Software-controlled local memories (LMs) are widely used to provide fast, scalable, power-efficient and predictable access to critical data. While many studies have addressed LM management, keeping hot data in the LM remains a major challenge. This paper revisits LM management of arrays in light of recent progress in register allocation, supporting multiple live-range splitting schemes through a generic integer linear program. These schemes differ in the granularity of their decision points. The model can also be extended to address fragmentation, assigning live ranges to precise offsets. We show that the links between LM management and register allocation have been underexploited, leaving many fundamental questions open and effective applications to be explored.
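The core decision the paper encodes as an integer linear program can be stated in toy form as a 0/1 knapsack: choose which arrays to place in local memory so that total access savings are maximized under a size budget. The brute-force sketch below, with hypothetical sizes and gains, conveys only the decision variables, not the paper's actual ILP or its live-range splitting:

```python
from itertools import combinations

def best_lm_allocation(arrays, capacity):
    """arrays: name -> (size, gain). Exhaustively pick the best subset
    that fits in local memory; an ILP solver replaces this in practice."""
    names = list(arrays)
    best, best_gain = set(), 0
    for r in range(len(names) + 1):
        for pick in combinations(names, r):
            size = sum(arrays[a][0] for a in pick)
            gain = sum(arrays[a][1] for a in pick)
            if size <= capacity and gain > best_gain:
                best, best_gain = set(pick), gain
    return sorted(best), best_gain

arrays = {"A": (4, 10), "B": (3, 7), "C": (5, 11)}
print(best_lm_allocation(arrays, capacity=7))  # → (['A', 'B'], 17)
```

Splitting live ranges refines this model: an array need not occupy local memory for its whole lifetime, which is exactly where the analogy to register allocation becomes productive.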
EFFECTIVE GROUPING FOR ENERGY AND PERFORMANCE: CONSTRUCTION OF ADAPTIVE, SUSTAINABLE, AND MAINTAINABLE DATA STORAGE
The performance gap between processors and storage systems has been increasingly critical over the years. Yet the performance disparity remains, and further, storage energy consumption is rapidly becoming a new critical problem. While smarter caching and predictive techniques do much to alleviate this disparity, the problem persists, and data storage remains a growing contributor to latency and energy consumption.

Attempts have been made at data layout maintenance, or intelligent physical placement of data, yet in practice, basic heuristics remain predominant. Problems that early studies sought to solve via layout strategies were proven to be NP-Hard, and data layout maintenance today remains more art than science. With unknown potential and a domain inherently full of uncertainty, layout maintenance persists as an area largely untapped by modern systems. But uncertainty in workloads does not imply randomness; access patterns have exhibited repeatable, stable behavior. Predictive information can be gathered, analyzed, and exploited to improve data layouts. Our goal is a dynamic, robust, sustainable predictive engine, aimed at improving existing layouts by replicating data at the storage device level.

We present a comprehensive discussion of the design and construction of such a predictive engine, including workload evaluation, where we present and evaluate classical workloads as well as our own highly detailed traces collected over an extended period. We demonstrate significant gains through an initial static grouping mechanism, compare against an optimal grouping method of our own construction, and further show significant improvement over competing techniques. We also explore and illustrate the challenges faced when moving from static to dynamic (i.e. online) grouping, and provide motivation and solutions for addressing these challenges. These challenges include metadata storage, appropriate predictive collocation, online performance, and physical placement.

We reduced the metadata needed by several orders of magnitude, reducing the required volume from more than 14% of total storage down to less than 12%. We also demonstrate how our collocation strategies outperform competing techniques. Finally, we present our complete model and evaluate a prototype implementation against real hardware. This model was demonstrated to be capable of reducing device-level accesses by up to 65%.
Face authentication on mobile devices: optimization techniques and applications.
Pun Kwok Ho. Thesis (M.Phil.)--Chinese University of Hong Kong, 2005. Includes bibliographical references (leaves 106-111). Abstracts in English and Chinese.

Chapter 1. Introduction
  1.1 Background
    1.1.1 Introduction to Biometrics
    1.1.2 Face Recognition in General
    1.1.3 Typical Face Recognition Systems
    1.1.4 Face Database and Evaluation Protocol
    1.1.5 Evaluation Metrics
    1.1.6 Characteristics of Mobile Devices
  1.2 Motivation and Objectives
  1.3 Major Contributions
    1.3.1 Optimization Framework
    1.3.2 Real Time Principal Component Analysis
    1.3.3 Real Time Elastic Bunch Graph Matching
  1.4 Thesis Organization
Chapter 2. Related Work
  2.1 Face Recognition for Desktop Computers
    2.1.1 Global Feature Based Systems
    2.1.2 Local Feature Based Systems
    2.1.3 Commercial Systems
  2.2 Biometrics on Mobile Devices
Chapter 3. Optimization Framework
  3.1 Introduction
  3.2 Levels of Optimization
    3.2.1 Algorithm Level
    3.2.2 Code Level
    3.2.3 Instruction Level
    3.2.4 Architecture Level
  3.3 General Optimization Workflow
  3.4 Summary
Chapter 4. Real Time Principal Component Analysis
  4.1 Introduction
  4.2 System Overview
    4.2.1 Image Preprocessing
    4.2.2 PCA Subspace Training
    4.2.3 PCA Subspace Projection
    4.2.4 Template Matching
  4.3 Optimization using Fixed-point Arithmetic
    4.3.1 Profiling Analysis
    4.3.2 Fixed-point Representation
    4.3.3 Range Estimation
    4.3.4 Code Conversion
  4.4 Experiments and Discussions
    4.4.1 Experiment Setup
    4.4.2 Execution Time
    4.4.3 Space Requirement
    4.4.4 Verification Accuracy
Chapter 5. Real Time Elastic Bunch Graph Matching
  5.1 Introduction
  5.2 System Overview
    5.2.1 Image Preprocessing
    5.2.2 Landmark Localization
    5.2.3 Feature Extraction
    5.2.4 Template Matching
  5.3 Optimization Overview
    5.3.1 Computation Optimization
    5.3.2 Memory Optimization
  5.4 Optimization Strategies
    5.4.1 Fixed-point Arithmetic
    5.4.2 Gabor Masks and Bunch Graphs Precomputation
    5.4.3 Improving Array Access Efficiency using 1D Array
    5.4.4 Efficient Gabor Filter Selection
    5.4.5 Fine Tuning System Cache Policy
    5.4.6 Reducing Redundant Memory Access by Loop Merging
    5.4.7 Maximizing Cache Reuse by Array Merging
    5.4.8 Optimization of Trigonometric Functions using Table Lookup
  5.5 Summary
Chapter 6. Conclusions
Chapter 7. Bibliography
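The fixed-point conversion at the heart of Chapters 4 and 5 can be illustrated with a Q16.16 representation: scale reals by 2^16, compute with integers, and shift after multiplication. The format chosen here is a hypothetical example; the thesis derives per-variable ranges before fixing a format:

```python
FRAC_BITS = 16                      # Q16.16: 16 integer, 16 fractional bits

def to_fixed(x):
    """Convert a float to fixed point by scaling and rounding."""
    return int(round(x * (1 << FRAC_BITS)))

def to_float(x):
    """Convert back for inspection; embedded code would stay in integers."""
    return x / (1 << FRAC_BITS)

def fixed_mul(a, b):
    # the raw product carries 2*FRAC_BITS fractional bits; shift back down
    return (a * b) >> FRAC_BITS

a, b = to_fixed(1.5), to_fixed(2.25)
print(to_float(fixed_mul(a, b)))  # → 3.375
```

On mobile processors of that era, which often lacked floating-point hardware, replacing float operations with such integer arithmetic is the main source of the speedups the thesis reports.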
NASA Tech Briefs, September 2008
Topics covered include: Nanotip Carpets as Antireflection Surfaces; Nano-Engineered Catalysts for Direct Methanol Fuel Cells; Capillography of Mats of Nanofibers; Directed Growth of Carbon Nanotubes Across Gaps; High-Voltage, Asymmetric-Waveform Generator; Magic-T Junction Using Microstrip/Slotline Transitions; On-Wafer Measurement of a Silicon-Based CMOS VCO at 324 GHz; Group-III Nitride Field Emitters; HEMT Amplifiers and Equipment for their On-Wafer Testing; Thermal Spray Formation of Polymer Coatings; Improved Gas Filling and Sealing of an HC-PCF; Making More-Complex Molecules Using Superthermal Atom/Molecule Collisions; Nematic Cells for Digital Light Deflection; Improved Silica Aerogel Composite Materials; Microgravity, Mesh-Crawling Legged Robots; Advanced Active-Magnetic-Bearing Thrust- Measurement System; Thermally Actuated Hydraulic Pumps; A New, Highly Improved Two-Cycle Engine; Flexible Structural-Health-Monitoring Sheets; Alignment Pins for Assembling and Disassembling Structures; Purifying Nucleic Acids from Samples of Extremely Low Biomass; Adjustable-Viewing-Angle Endoscopic Tool for Skull Base and Brain Surgery; UV-Resistant Non-Spore-Forming Bacteria From Spacecraft-Assembly Facilities; Hard-X-Ray/Soft-Gamma-Ray Imaging Sensor Assembly for Astronomy; Simplified Modeling of Oxidation of Hydrocarbons; Near-Field Spectroscopy with Nanoparticles Deposited by AFM; Light Collimator and Monitor for a Spectroradiometer; Hyperspectral Fluorescence and Reflectance Imaging Instrument; Improving the Optical Quality Factor of the WGM Resonator; Ultra-Stable Beacon Source for Laboratory Testing of Optical Tracking; Transmissive Diffractive Optical Element Solar Concentrators; Delaying Trains of Short Light Pulses in WGM Resonators; Toward Better Modeling of Supercritical Turbulent Mixing; JPEG 2000 Encoding with Perceptual Distortion Control; Intelligent Integrated Health Management for a System of Systems; Delay Banking for Managing Air Traffic; and Spline-Based Smoothing of Airfoil Curvatures.
Memory optimisation in the design flow of multiprocessor systems-on-chip for multimedia applications
ABSTRACT
Multiprocessor systems-on-chip (MPSoC) are one of the main drivers of the
industrial semiconductor revolution. MPSoCs are gaining popularity in the field of embedded
systems. Owing to their great ability to parallelize at a very high integration level, they are
good candidates for systems and applications such as multimedia. Memory is becoming a key
factor for significant improvements in these applications (i.e. power, performance and area).
With the emergence of more embedded multimedia applications in the industry, this issue
becomes increasingly vital. The large amount of data manipulated by these applications requires
large computational and memory capacity. Lately, new programming models have been introduced.
These programming models offer a higher level of programming to answer the increasing needs of
MPSoCs. This leads to the need for new optimization and mapping approaches suitable for
embedded systems and their programming models.
The overall objective of this research is to find solutions to the challenges of the system-level
design of applications such as multimedia. This entails the development of new approaches and
new optimization techniques. The specific objective of this research is to introduce the concept
of memory optimization into the system-level design flow and to study its impact on the different
programming models used for MPSoC design. In other words, it is the unification of the
compilation and system-level design domains.
The contribution of this research is to propose new approaches to memory-optimization
techniques for MPSoC design under different programming models. This thesis concerns the
integration of memory optimization, across varying programming-model types, into the MPSoC
design flow. Our research was done in collaboration with STMicroelectronics.
Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU
Enhancing the match between software executions and hardware features is key to computing efficiency. The match is a continuously evolving and challenging problem. This dissertation focuses on the development of programming system support for exploiting two key features of modern hardware development: the massive parallelism of emerging computational accelerators such as Graphic Processing Units (GPU), and the non-uniformity of cache sharing in modern multicore processors. They are respectively driven by the important role of accelerators in today's general-purpose computing and the ultimate importance of memory performance. This dissertation particularly concentrates on optimizing control flows and memory references, at both compilation and execution time, to tap into the full potential of pure software solutions in taking advantage of the two key hardware features.

Conditional branches cause divergences in program control flows, which may result in serious performance degradation on massively data-parallel GPU architectures with Single Instruction Multiple Data (SIMD) parallelism. On such an architecture, control divergence may force computing units to stay idle for a substantial time, throttling system throughput by orders of magnitude. This dissertation provides an extensive exploration of the solution to this problem and presents program-level transformations based upon two fundamental techniques: thread relocation and data relocation. These two optimizations provide fundamental support for swapping jobs among threads so that the control flow paths of threads converge within every SIMD thread group.

In memory performance, this dissertation concentrates on two aspects: the influence of non-uniform sharing on multithreading applications, and the optimization of irregular memory references on GPUs. In shared-cache multicore chips, interactions among threads are complicated due to the interplay of cache contention and synergistic prefetching. This dissertation presents the first systematic study on the influence of non-uniform shared cache on contemporary parallel programs, reveals the mismatch between the software development and underlying cache-sharing hierarchies, and further demonstrates it by proposing and applying cache-sharing-aware data transformations that bring significant performance improvement. For the second aspect, the efficiency of GPU accelerators is sensitive to irregular memory references, whose access patterns remain unknown until execution time (e.g., A[P[i]]). The root causes of the irregular memory reference problem are similar to those of the control flow problem, but in a more general and complex form. I developed a framework, named G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. It treats both types of irregularities at the same time in a holistic fashion, maximizing whole-program performance by resolving conflicts among optimizations.
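The effect of thread relocation on SIMD divergence can be seen with a toy model: count warp-sized groups whose threads disagree on a branch predicate, then reorder work so same-branch items land in the same warp. This Python sketch only models the idea; the dissertation's transformations operate on real GPU kernels:

```python
def divergent_warps(flags, warp=4):
    """Count warp-sized groups whose threads take different branch directions."""
    return sum(len(set(flags[i:i + warp])) > 1
               for i in range(0, len(flags), warp))

flags = [0, 1, 0, 1, 1, 0, 1, 0]            # per-thread branch outcomes
remap = sorted(range(len(flags)), key=lambda i: flags[i])  # thread relocation
regrouped = [flags[i] for i in remap]
print(divergent_warps(flags), divergent_warps(regrouped))  # → 2 0
```

Data relocation plays the dual role: instead of moving threads to data, the data itself is reordered so that consecutive threads touch consecutive, same-branch elements.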
Customising compilers for customisable processors
The automatic generation of instruction set extensions to provide application-specific acceleration
for embedded processors has been a productive area of research in recent years. There
have been incremental improvements in the quality of the algorithms that discover and select
which instructions to add to a processor. The use of automatic algorithms, however, result in
instructions which are radically different from those found in conventional, human-designed,
RISC or CISC ISAs. This has resulted in a gap between the hardware’s capabilities and the
compiler’s ability to exploit them.
This thesis proposes and investigates the use of a high-level compiler pass that uses graph-subgraph
isomorphism checking to exploit these complex instructions. Operating in a separate
pass permits techniques to be applied that are uniquely suited for mapping complex instructions,
but unsuitable for conventional instruction selection. The existing, mature, compiler
back-end can then handle the remainder of the compilation. With this method, the high-level
pass was able to use 1965 different automatically produced instructions to obtain an initial average
speed-up of 1.11x over 179 benchmarks evaluated on a hardware-verified cycle-accurate
simulator.
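The high-level mapping pass described above matches complex instruction patterns against program graphs. As a minimal stand-in, the sketch below matches tree patterns with wildcards over tuple expressions; true graph-subgraph isomorphism checking over DAGs is considerably harder, and the `mac` pattern is an invented example, not one of the 1965 generated instructions:

```python
def match(pattern, expr, env=None):
    """Match a tree pattern against an expression; '?x' names are wildcards."""
    if env is None:
        env = {}
    if isinstance(pattern, str) and pattern.startswith("?"):
        env[pattern] = expr                    # bind wildcard to subtree
        return env
    if not isinstance(pattern, tuple) or not isinstance(expr, tuple):
        return env if pattern == expr else None
    if len(pattern) != len(expr) or pattern[0] != expr[0]:
        return None
    for p, e in zip(pattern[1:], expr[1:]):
        if match(p, e, env) is None:
            return None
    return env

mac = ("add", ("mul", "?a", "?b"), "?c")       # multiply-accumulate pattern
expr = ("add", ("mul", "x", "y"), "z")
print(match(mac, expr))  # → {'?a': 'x', '?b': 'y', '?c': 'z'}
```

Running such matching as a separate high-level pass, as the thesis proposes, leaves unmatched regions of the program untouched for the mature back-end to compile conventionally.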
This result was improved following an investigation of how the produced instructions were
being used by the compiler. It was established that the models the automatic tools were using to
develop instructions did not take account of how well the compiler could realistically use them.
Adding additional parameters to the search heuristic to account for compiler issues increased
the speed-up from 1.11x to 1.24x. An alternative approach using a re-designed hardware interface
was also investigated and this achieved a speed-up of 1.26x while reducing hardware and
compiler complexity.
A complementary, high-level, method of exploiting dual memory banks was created to increase
memory bandwidth to accommodate the increased data-processing bandwidth provided
by extension instructions. Finally, the compiler was considered for use in a non-conventional
role where, rather than generating code, it is used to apply source-level transformations prior to
the generation of extension instructions, thus affecting the shape of the instructions that are
generated.