
    Space-Efficient Predictive Block Management

    With growing disk and storage capacities, tracking all blocks in a system requires a daunting amount of metadata. In previous work, we demonstrated a system-software effort in the area of predictive data grouping for reducing power and latency on hard disks. The structures used, very similar to prior efforts in prefetching and prefetch caching, track access successor information at the block level, keeping a fixed number of immediate successors per block. While providing powerful predictive expansion capabilities and requiring less metadata than many previous strategies, the concern remains of just how much data is actually required. In this paper, we present a novel method of storing equivalent information: SESH, a Space Efficient Storage of Heredity. This method exploits the high degree of block-level predictability observed in a number of workload trace sets to reduce overall metadata storage by up to 99% without any loss of information. As a result, we are able to provide a predictive tool that is adaptive, accurate, and robust in the face of workload noise, for a tiny fraction of the metadata cost previously anticipated; in some cases, the required size drops from 12 gigabytes to less than 150 megabytes.
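    The successor-tracking structure the abstract builds on can be pictured with a minimal sketch. Below is an illustrative Python version of per-block successor metadata with a fixed number of immediate successors per block; the class name, table size, and eviction policy are assumptions for illustration, not the paper's SESH encoding.

    ```python
    from collections import OrderedDict

    MAX_SUCCESSORS = 4  # fixed number of immediate successors kept per block (assumed)

    class SuccessorTable:
        """Tracks, for each block, which blocks tend to be accessed next."""

        def __init__(self):
            self.table = {}         # block id -> OrderedDict of successor -> count
            self.prev_block = None

        def record_access(self, block):
            if self.prev_block is not None:
                succs = self.table.setdefault(self.prev_block, OrderedDict())
                if block in succs:
                    succs[block] += 1
                elif len(succs) < MAX_SUCCESSORS:
                    succs[block] = 1
                else:
                    # evict the least recently reinforced successor (assumed policy)
                    succs.popitem(last=False)
                    succs[block] = 1
                succs.move_to_end(block)  # mark as most recently reinforced
            self.prev_block = block

        def predict(self, block):
            """Return tracked successors of `block`, strongest first."""
            succs = self.table.get(block, {})
            return sorted(succs, key=succs.get, reverse=True)
    ```

    Feeding a block-access trace through record_access builds exactly the kind of per-block metadata whose footprint the paper worries about; SESH's contribution is storing equivalent information far more compactly by exploiting how predictable those successor relationships turn out to be.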

    Compiler-driven data layout transformations for network applications

    This work approaches the little-studied topic of compiler optimisations for network applications. It starts by investigating whether there are fundamental differences between application domains that justify the development and tuning of domain-specific compiler optimisations. It presents an automated approach that identifies domain-specific workload characterisations and presents them in a readily interpretable format based on decision trees. The generated workload profiles summarise key resource utilisation issues and enable compiler engineers to address the highlighted bottlenecks. Applying this methodology to a data-intensive network infrastructure application shows that data organisation is the key obstacle to overcome in order to achieve high performance. The thesis therefore proposes and evaluates three specialised data transformations (structure splitting, array regrouping, and software caching) against the industrial EEMBC networking benchmarks and real-world data sets. It demonstrates, on the one hand, that speedups of up to 2.62x can be achieved, but on the other, that no single solution performs equally well across different network traffic scenarios. To address this issue, an adaptive software caching scheme for high-frequency route lookup operations is introduced and evaluated against the same EEMBC networking benchmarks and real-world data sets, achieving speedups of up to 3.30x and 2.27x respectively. The results clearly demonstrate that adaptive data organisation schemes are necessary to ensure optimal performance under varying network loads. Finally, this research addresses an issue introduced by data transformations such as array regrouping and software caching: the need for static analysis to allow efficient resource allocation. The thesis proposes a static code analyser for the automatic resource analysis of source code containing list and tree structures. The tool applies a combination of amortised analysis and separation logic to real code and can evaluate the type and resource usage of existing data structures, which can be used to compute global resource consumption values for full data-intensive network applications.
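    Of the three transformations, the software-caching scheme is the one the thesis ultimately makes adaptive. A toy Python sketch of the basic (non-adaptive) idea for high-frequency route lookups follows; the capacity, LRU policy, and names are placeholder assumptions, not the thesis's design.

    ```python
    from collections import OrderedDict

    class RouteCache:
        """Small LRU software cache in front of an expensive route-table lookup."""

        def __init__(self, lookup_fn, capacity=256):
            self.lookup_fn = lookup_fn    # slow, authoritative lookup (assumed)
            self.capacity = capacity
            self.cache = OrderedDict()    # destination -> next hop

        def lookup(self, dest):
            if dest in self.cache:
                self.cache.move_to_end(dest)    # hit: mark as recently used
                return self.cache[dest]
            next_hop = self.lookup_fn(dest)     # miss: fall through to full lookup
            self.cache[dest] = next_hop
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used entry
            return next_hop
    ```

    A fixed configuration like this is precisely what performs unevenly across traffic scenarios; the adaptive scheme in the thesis tunes the caching behaviour to the observed workload.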

    TraNCE: Transforming Nested Collections Efficiently

    Nested relational query languages have long been seen as an attractive tool for scenarios involving large hierarchical datasets, and there has been a resurgence of interest in them; one driver has been the affinity of these languages for large-scale processing platforms such as Spark and Flink. This demonstration gives a tour of TraNCE, a new system for processing nested data on top of distributed processing systems. The core innovation of the system is a compiler that processes nested relational queries in a series of transformations; these include variants of two prior techniques, shredding and unnesting, as well as a materialization transformation that customizes the way levels of the nested output are generated. The TraNCE platform builds on these techniques by adding components for users to create and visualize queries, as well as data exploration and notebook execution targets, to facilitate the construction of large-scale data science applications. The demonstration will both showcase the system from the viewpoint of usability by data scientists and illustrate the data management techniques employed.
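    The shredding transformation mentioned above can be illustrated with a small sketch: one level of nesting is flattened into separate top-level and child relations connected by labels. This generic Python version is only meant to convey the idea, not TraNCE's actual representation.

    ```python
    # One nested record per customer, each with a nested bag of orders.
    nested = [
        {"name": "alice", "orders": [{"item": "book"}, {"item": "pen"}]},
        {"name": "bob",   "orders": [{"item": "mug"}]},
    ]

    def shred(records):
        """Flatten one nesting level into two flat relations linked by labels."""
        top, children = [], []
        for label, rec in enumerate(records):
            top.append({"name": rec["name"], "orders_label": label})
            for child in rec["orders"]:
                children.append({"label": label, **child})
        return top, children

    top, children = shred(nested)
    # top      -> [{'name': 'alice', 'orders_label': 0},
    #              {'name': 'bob',   'orders_label': 1}]
    # children -> [{'label': 0, 'item': 'book'}, {'label': 0, 'item': 'pen'},
    #              {'label': 1, 'item': 'mug'}]
    ```

    Flat relations like these map naturally onto Spark or Flink datasets; unnesting works in the opposite direction, and the materialization transformation controls how the levels of the nested output are rebuilt.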

    Optimizing Local Memory Allocation and Assignment Through a Decoupled Approach

    Software-controlled local memories (LMs) are widely used to provide fast, scalable, power-efficient and predictable access to critical data. While many studies have addressed LM management, keeping hot data in the LM continues to be a major headache. This paper revisits LM management of arrays in light of recent progress in register allocation, supporting multiple live-range splitting schemes through a generic integer linear program. These schemes differ in the granularity of their decision points. The model can also be extended to address fragmentation by assigning live ranges to precise offsets. We show that the links between LM management and register allocation have been underexploited, leaving many fundamental questions open and effective applications to be explored.
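    The allocation question behind the paper's integer linear program can be pictured with a deliberately tiny sketch: choose which arrays occupy a size-limited local memory so as to maximize benefit. The brute-force search below is a stand-in for the generic ILP (and ignores live-range splitting and offsets entirely); the capacity, sizes, and benefit numbers are made up for illustration.

    ```python
    from itertools import combinations

    LM_SIZE = 8  # local memory capacity in KB (assumed)

    # (array, size in KB, benefit): benefit models accesses made fast by the LM
    candidates = [("A", 4, 10), ("B", 4, 7), ("C", 6, 9), ("D", 2, 3)]

    def best_allocation(cands, capacity):
        """Exhaustively pick the subset of arrays that fits and maximizes benefit."""
        best, best_benefit = [], 0
        for r in range(len(cands) + 1):
            for subset in combinations(cands, r):
                size = sum(s for _, s, _ in subset)
                benefit = sum(b for _, _, b in subset)
                if size <= capacity and benefit > best_benefit:
                    best = [name for name, _, _ in subset]
                    best_benefit = benefit
        return best, best_benefit

    print(best_allocation(candidates, LM_SIZE))  # (['A', 'B'], 17)
    ```

    The paper's model generalizes this across decision points, letting a live range be split (moved in and out of the LM at chosen program points) and, in the extended formulation, assigned to precise offsets to control fragmentation.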

    Effective Grouping for Energy and Performance: Construction of Adaptive, Sustainable, and Maintainable Data Storage

    The performance gap between processors and storage systems has grown increasingly critical over the years. Yet the performance disparity remains, and further, storage energy consumption is rapidly becoming a new critical problem. While smarter caching and predictive techniques do much to alleviate this disparity, the problem persists, and data storage remains a growing contributor to latency and energy consumption. Attempts have been made at data layout maintenance, or intelligent physical placement of data, yet in practice, basic heuristics remain predominant. Problems that early studies sought to solve via layout strategies were proven to be NP-hard, and data layout maintenance today remains more art than science. With unknown potential and a domain inherently full of uncertainty, layout maintenance persists as an area largely untapped by modern systems. But uncertainty in workloads does not imply randomness; access patterns have exhibited repeatable, stable behavior. Predictive information can be gathered, analyzed, and exploited to improve data layouts. Our goal is a dynamic, robust, sustainable predictive engine, aimed at improving existing layouts by replicating data at the storage device level. We present a comprehensive discussion of the design and construction of such a predictive engine, including workload evaluation, where we present and evaluate classical workloads as well as our own highly detailed traces collected over an extended period. We demonstrate significant gains through an initial static grouping mechanism, compare against an optimal grouping method of our own construction, and further show significant improvement over competing techniques. We also explore and illustrate the challenges faced when moving from static to dynamic (i.e. online) grouping, and provide motivation and solutions for addressing these challenges. These challenges include metadata storage, appropriate predictive collocation, online performance, and physical placement. We reduced the metadata needed by several orders of magnitude, reducing the required volume from more than 14% of total storage down to less than 12%. We also demonstrate how our collocation strategies outperform competing techniques. Finally, we present our complete model and evaluate a prototype implementation against real hardware. This model was demonstrated to be capable of reducing device-level accesses by up to 65%.
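    The static grouping idea can be sketched in a few lines: count which blocks are accessed close together in a trace, then greedily collocate the strongest pairs. The window, group size, and greedy policy below are illustrative assumptions; the dissertation's grouping and collocation strategies are considerably more refined.

    ```python
    from collections import Counter

    WINDOW = 2  # co-access window, in trace positions (assumed)

    def coaccess_counts(trace):
        """Count how often each pair of blocks appears within WINDOW accesses."""
        counts = Counter()
        for i, a in enumerate(trace):
            for b in trace[i + 1 : i + 1 + WINDOW]:
                if a != b:
                    counts[frozenset((a, b))] += 1
        return counts

    def greedy_groups(trace, group_size):
        """Greedily merge the most frequently co-accessed blocks into groups."""
        groups = {b: {b} for b in set(trace)}
        for pair, _ in coaccess_counts(trace).most_common():
            a, b = tuple(pair)
            ga, gb = groups[a], groups[b]
            if ga is not gb and len(ga) + len(gb) <= group_size:
                ga |= gb                 # merge the two groups
                for blk in gb:
                    groups[blk] = ga
        return {frozenset(g) for g in groups.values()}

    trace = [1, 2, 1, 2, 3, 4, 3, 4, 1, 2]
    # -> {frozenset({1, 2}), frozenset({3, 4})} (set order may vary)
    print(greedy_groups(trace, group_size=2))
    ```

    Groups chosen this way can then be replicated and collocated at the device level, which is where reductions in device-level accesses of the kind reported above come from.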

    Face authentication on mobile devices: optimization techniques and applications.

    Pun Kwok Ho. Thesis (M.Phil.), Chinese University of Hong Kong, 2005. Includes bibliographical references (leaves 106-111). Abstracts in English and Chinese. Contents:
    1. Introduction: background (introduction to biometrics; face recognition in general; typical face recognition systems; face database and evaluation protocol; evaluation metrics; characteristics of mobile devices); motivation and objectives; major contributions (optimization framework; real-time principal component analysis; real-time elastic bunch graph matching); thesis organization
    2. Related Work: face recognition for desktop computers (global feature based systems; local feature based systems; commercial systems); biometrics on mobile devices
    3. Optimization Framework: levels of optimization (algorithm, code, instruction, and architecture levels); general optimization workflow
    4. Real Time Principal Component Analysis: system overview (image preprocessing; PCA subspace training; PCA subspace projection; template matching); optimization using fixed-point arithmetic (profiling analysis; fixed-point representation; range estimation; code conversion); experiments and discussions (experiment setup; execution time; space requirement; verification accuracy)
    5. Real Time Elastic Bunch Graph Matching: system overview (image preprocessing; landmark localization; feature extraction; template matching); optimization overview (computation optimization; memory optimization); optimization strategies (fixed-point arithmetic; precomputation of Gabor masks and bunch graphs; improving array access efficiency using 1D arrays; efficient Gabor filter selection; fine-tuning the system cache policy; reducing redundant memory access by loop merging; maximizing cache reuse by array merging; optimization of trigonometric functions using table lookup)
    6. Conclusions
    7. Bibliography
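    A small sketch of the fixed-point conversion that both optimized systems rely on, assuming a Q16.16-style representation (the thesis derives the actual formats from its range-estimation step):

    ```python
    FRAC_BITS = 16                    # fractional bits; Q16.16 format (assumed)
    SCALE = 1 << FRAC_BITS

    def to_fixed(x: float) -> int:
        """Quantize a float into fixed-point representation."""
        return int(round(x * SCALE))

    def fixed_mul(a: int, b: int) -> int:
        """Multiply two fixed-point values, rescaling the double-width product."""
        return (a * b) >> FRAC_BITS

    def to_float(x: int) -> float:
        return x / SCALE

    a, b = to_fixed(1.5), to_fixed(-2.25)
    print(to_float(fixed_mul(a, b)))  # -3.375, computed with integer ops only
    ```

    On handheld processors without fast floating-point hardware, replacing float operations in the PCA projection and Gabor filtering with integer arithmetic of this kind is central to the execution-time savings the thesis reports.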

    NASA Tech Briefs, September 2008

    Topics covered include: Nanotip Carpets as Antireflection Surfaces; Nano-Engineered Catalysts for Direct Methanol Fuel Cells; Capillography of Mats of Nanofibers; Directed Growth of Carbon Nanotubes Across Gaps; High-Voltage, Asymmetric-Waveform Generator; Magic-T Junction Using Microstrip/Slotline Transitions; On-Wafer Measurement of a Silicon-Based CMOS VCO at 324 GHz; Group-III Nitride Field Emitters; HEMT Amplifiers and Equipment for their On-Wafer Testing; Thermal Spray Formation of Polymer Coatings; Improved Gas Filling and Sealing of an HC-PCF; Making More-Complex Molecules Using Superthermal Atom/Molecule Collisions; Nematic Cells for Digital Light Deflection; Improved Silica Aerogel Composite Materials; Microgravity, Mesh-Crawling Legged Robots; Advanced Active-Magnetic-Bearing Thrust-Measurement System; Thermally Actuated Hydraulic Pumps; A New, Highly Improved Two-Cycle Engine; Flexible Structural-Health-Monitoring Sheets; Alignment Pins for Assembling and Disassembling Structures; Purifying Nucleic Acids from Samples of Extremely Low Biomass; Adjustable-Viewing-Angle Endoscopic Tool for Skull Base and Brain Surgery; UV-Resistant Non-Spore-Forming Bacteria From Spacecraft-Assembly Facilities; Hard-X-Ray/Soft-Gamma-Ray Imaging Sensor Assembly for Astronomy; Simplified Modeling of Oxidation of Hydrocarbons; Near-Field Spectroscopy with Nanoparticles Deposited by AFM; Light Collimator and Monitor for a Spectroradiometer; Hyperspectral Fluorescence and Reflectance Imaging Instrument; Improving the Optical Quality Factor of the WGM Resonator; Ultra-Stable Beacon Source for Laboratory Testing of Optical Tracking; Transmissive Diffractive Optical Element Solar Concentrators; Delaying Trains of Short Light Pulses in WGM Resonators; Toward Better Modeling of Supercritical Turbulent Mixing; JPEG 2000 Encoding with Perceptual Distortion Control; Intelligent Integrated Health Management for a System of Systems; Delay Banking for Managing Air Traffic; and Spline-Based Smoothing of Airfoil Curvatures.

    Optimisation des mémoires dans le flot de conception des systèmes multiprocesseurs sur puces pour des applications de type multimédia

    Multiprocessor systems-on-chip (MPSoCs) are among the main drivers of the semiconductor industry's revolution and are gaining popularity in the field of embedded systems. Their great capacity for parallelization at a very high level of integration makes them good candidates for systems and applications such as multimedia, whose performance depends on power consumption, computing capacity, and design area. Memory is the key factor for substantial performance improvements in these applications. With the arrival of embedded multimedia applications in industry, the problem of performance gains has become vital: the mass of data processed by these applications requires large computing and memory capacity. Lately, new programming models have appeared, offering higher-level programming to answer the growing needs of MPSoCs, hence the need for new optimization and mapping approaches suited to embedded systems and their programming models. System-level design of MPSoC architectures for multimedia applications is a real technical challenge, and the overall objective of this thesis is to meet it. More specifically, this thesis introduces the concept of memory optimization into the system-level design flow and studies its impact on the different programming models used in MPSoC design; in other words, it unifies the compilation domain with system-level design for a better overall design. The contribution of this thesis is to propose new approaches to memory optimization techniques for MPSoC design under different programming models, integrating these techniques into the MPSoC design flow for different types of programming model. This work was carried out in collaboration with STMicroelectronics.

    Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU

    Enhancing the match between software executions and hardware features is key to computing efficiency. The match is a continuously evolving and challenging problem. This dissertation focuses on the development of programming system support for exploiting two key features of modern hardware development: the massive parallelism of emerging computational accelerators such as Graphic Processing Units (GPU), and the non-uniformity of cache sharing in modern multicore processors. They are respectively driven by the important role of accelerators in today's general-purpose computing and the ultimate importance of memory performance. This dissertation particularly concentrates on optimizing control flows and memory references, at both compilation and execution time, to tap into the full potential of pure software solutions in taking advantage of the two key hardware features. Conditional branches cause divergences in program control flows, which may result in serious performance degradation on massively data-parallel GPU architectures with Single Instruction Multiple Data (SIMD) parallelism. On such an architecture, control divergence may force computing units to stay idle for a substantial time, throttling system throughput by orders of magnitude. This dissertation provides an extensive exploration of the solution to this problem and presents program-level transformations based upon two fundamental techniques: thread relocation and data relocation. These two optimizations provide fundamental support for swapping jobs among threads so that the control flow paths of threads converge within every SIMD thread group. In memory performance, this dissertation concentrates on two aspects: the influence of non-uniform sharing on multithreading applications, and the optimization of irregular memory references on GPUs. In shared-cache multicore chips, interactions among threads are complicated due to the interplay of cache contention and synergistic prefetching. This dissertation presents the first systematic study of the influence of non-uniform shared cache on contemporary parallel programs, reveals the mismatch between software development and the underlying cache-sharing hierarchies, and demonstrates it by proposing and applying cache-sharing-aware data transformations that bring significant performance improvement. For the second aspect, the efficiency of GPU accelerators is sensitive to irregular memory references, that is, memory references whose access patterns remain unknown until execution time (e.g., A[P[i]]). The root causes of the irregular memory reference problem are similar to those of the control flow problem, but in a more general and complex form. I developed a framework, named G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. It treats both types of irregularities at the same time in a holistic fashion, maximizing whole-program performance by resolving conflicts among optimizations.
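    The data-relocation technique can be illustrated with a toy example: reorder elements so that threads grouped into the same SIMD unit all take the same branch. NumPy stands in for a GPU here, and the warp size and reordering policy are illustrative; G-Streamline performs such remappings dynamically, which this static sketch does not capture.

    ```python
    import numpy as np

    WARP = 4  # threads per SIMD group (illustrative)

    def relocate_for_convergence(data, predicate):
        """Partition elements by branch outcome so WARP-sized chunks are uniform."""
        taken = predicate(data)
        order = np.argsort(~taken, kind="stable")  # branch-takers first
        return data[order], order

    data = np.array([5, -3, 8, -1, -7, 2, -4, 6])
    reordered, order = relocate_for_convergence(data, lambda x: x > 0)
    # reordered -> [5 8 2 6 -3 -1 -7 -4]: the first WARP threads all take the
    # branch and the next WARP all skip it, so neither group serializes both paths.
    ```

    Thread relocation achieves the same convergence by swapping jobs among threads instead of moving the data itself.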

    Customising compilers for customisable processors

    The automatic generation of instruction set extensions to provide application-specific acceleration for embedded processors has been a productive area of research in recent years. There have been incremental improvements in the quality of the algorithms that discover and select which instructions to add to a processor. The use of automatic algorithms, however, results in instructions which are radically different from those found in conventional, human-designed RISC or CISC ISAs. This has resulted in a gap between the hardware’s capabilities and the compiler’s ability to exploit them. This thesis proposes and investigates the use of a high-level compiler pass that uses graph-subgraph isomorphism checking to exploit these complex instructions. Operating in a separate pass permits techniques to be applied that are uniquely suited to mapping complex instructions but unsuitable for conventional instruction selection; the existing, mature compiler back-end can then handle the remainder of the compilation. With this method, the high-level pass was able to use 1965 different automatically produced instructions to obtain an initial average speed-up of 1.11x over 179 benchmarks evaluated on a hardware-verified cycle-accurate simulator. This result was improved following an investigation of how the produced instructions were being used by the compiler. It was established that the models the automatic tools used to develop instructions did not take into account how well the compiler could realistically use them. Adding parameters to the search heuristic to account for compiler issues increased the speed-up from 1.11x to 1.24x. An alternative approach using a re-designed hardware interface was also investigated and achieved a speed-up of 1.26x while reducing both hardware and compiler complexity. A complementary high-level method of exploiting dual memory banks was created to increase memory bandwidth to match the increased data-processing bandwidth provided by extension instructions. Finally, the compiler was considered for a non-conventional role in which, rather than generating code, it applies source-level transformations prior to the generation of extension instructions, thereby affecting the shape of the instructions that are generated.
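    The high-level mapping pass can be pictured with a toy pattern match: find a fragment of the program's expression graph that matches an extension instruction's dataflow graph and collapse it into one operation. The multiply-accumulate pattern and tuple encoding below are illustrative; real ISE patterns, and proper graph-subgraph isomorphism checking, are far richer.

    ```python
    # Expression trees as nested tuples: (op, operand, operand) or leaf names.
    expr = ("add", ("mul", "a", "b"), "c")

    def match_mac(node):
        """Match the dataflow pattern add(mul(x, y), z) of a fused MAC instruction."""
        if isinstance(node, tuple) and node[0] == "add":
            _, lhs, rhs = node
            if isinstance(lhs, tuple) and lhs[0] == "mul":
                return ("mac", lhs[1], lhs[2], rhs)   # one custom instruction
            if isinstance(rhs, tuple) and rhs[0] == "mul":
                return ("mac", rhs[1], rhs[2], lhs)
        return None

    def rewrite(node):
        """Bottom-up rewrite: map matched subgraphs onto the extension instruction."""
        if not isinstance(node, tuple):
            return node
        node = (node[0],) + tuple(rewrite(child) for child in node[1:])
        return match_mac(node) or node

    print(rewrite(expr))  # ('mac', 'a', 'b', 'c') instead of a separate mul and add
    ```

    Anything the matcher leaves untouched falls through to the ordinary back-end instruction selector, which is the division of labour the thesis argues for.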