33 research outputs found

    A configurable vector processor for accelerating speech coding algorithms

    Get PDF
    The growing demand for voice-over-packer (VoIP) services and multimedia-rich applications has made increasingly important the efficient, real-time implementation of low-bit rates speech coders on embedded VLSI platforms. Such speech coders are designed to substantially reduce the bandwidth requirements thus enabling dense multichannel gateways in small form factor. This however comes at a high computational cost which mandates the use of very high performance embedded processors. This thesis investigates the potential acceleration of two major ITU-T speech coding algorithms, namely G.729A and G.723.1, through their efficient implementation on a configurable extensible vector embedded CPU architecture. New scalar and vector ISAs were introduced which resulted in up to 80% reduction in the dynamic instruction count of both workloads. These instructions were subsequently encapsulated into a parametric, hybrid SISD (scalar processor)–SIMD (vector) processor. This work presents the research and implementation of the vector datapath of this vector coprocessor which is tightly-coupled to a Sparc-V8 compliant CPU, the optimization and simulation methodologies employed and the use of Electronic System Level (ESL) techniques to rapidly design SIMD datapaths

    Specialization and reconfiguration of lightweight mobile processors for data-parallel applications

    Get PDF
    The worldwide utilization of mobile devices makes the segment of low power mobile processors leading in the entire computer industry. Customers demand low-cost, high-performance and energy-efficient mobile devices, which execute sophisticated mobile applications such as multimedia and 3D games. State-of-the-art mobile devices already utilize chip multiprocessors (CMP) with dedicated accelerators that exploit data-level parallelism (DLP) in these applications. Such heterogeneous system design enable the mobile processors to deliver the desired performance and efficiency. The heterogeneity however increases the processors complexity and manufacturing cost when adding extra special-purpose hardware for the accelerators. In this thesis, we propose new hardware techniques that leverage the available resources of a mobile CMP to achieve cost-effective acceleration of DLP workloads. Our techniques are inspired by classic vector architectures and the latest reconfigurable architectures, which both achieve high power efficiency when running DLP workloads. The high requirement of additional resources for these two architectures limits their applicability beyond high-performance computers. To achieve their advantages in mobile devices, we propose techniques that: 1) specialize the lightweight mobile cores for classic vector execution of DLP workloads; 2) dynamically tune the number of cores for the specialized execution; and 3) reconfigure a bulk of the existing general purpose execution resources into a compute hardware accelerator. Specialization enables one or more cores to process configurable large vector operands with new special purpose vector instructions. Reconfiguration goes one step further and allow the compute hardware in mobile cores to dynamically implement the entire functionality of diverse compute algorithms. The proposed specialization and reconfiguration techniques are applicable to a diverse range of general purpose processors available in mobile devices nowadays. However, we chose to implement and evaluate them on a lightweight processor based on the Explicit Data Graph Execution architecture, which we find promising for the research of low-power processors. The implemented techniques improve the mobile processor performance and the efficiency on its existing general purpose resources. The processor with enabled specialization/reconfiguration techniques efficiently exploits DLP without the extra cost of special-purpose accelerators.La utilización de dispositivos móviles a nivel mundial hace que el segmento de procesadores móviles de bajo consumo lidere la industria de computación. Los clientes piden dispositivos móviles de bajo coste, alto rendimiento y bajo consumo, que ejecuten aplicaciones móviles sofisticadas, tales como multimedia y juegos 3D.Los dispositivos móviles más avanzados utilizan chips con multiprocesadores (CMP) con aceleradores dedicados que explotan el paralelismo a nivel de datos (DLP) en estas aplicaciones. Tal diseño de sistemas heterogéneos permite a los procesadores móviles ofrecer el rendimiento y la eficiencia deseada. La heterogeneidad sin embargo aumenta la complejidad y el coste de fabricación de los procesadores al agregar hardware de propósito específico adicional para implementar los aceleradores. En esta tesis se proponen nuevas técnicas de hardware que aprovechan los recursos disponibles en un CMP móvil para lograr una aceleración con bajo coste de las aplicaciones con DLP. Nuestras técnicas están inspiradas por los procesadores vectoriales clásicos y por las recientes arquitecturas reconfigurables, pues ambas logran alta eficiencia en potencia al ejecutar cargas de trabajo DLP. Pero la alta exigencia de recursos adicionales que estas dos arquitecturas necesitan, limita sus aplicabilidad más allá de las computadoras de alto rendimiento. Para lograr sus ventajas en dispositivos móviles, en esta tesis se proponen técnicas que: 1) especializan núcleos móviles ligeros para la ejecución vectorial clásica de cargas de trabajo DLP; 2) ajustan dinámicamente el número de núcleos de ejecución especializada; y 3) reconfiguran en bloque los recursos existentes de ejecución de propósito general en un acelerador hardware de computación. La especialización permite a uno o más núcleos procesar cantidades configurables de operandos vectoriales largos con nuevas instrucciones vectoriales. La reconfiguración da un paso más y permite que el hardware de cómputo en los núcleos móviles ejecute dinámicamente toda la funcionalidad de diversos algoritmos informáticos. Las técnicas de especialización y reconfiguración propuestas son aplicables a diversos procesadores de propósito general disponibles en los dispositivos móviles de hoy en día. Sin embargo, en esta tesis se ha optado por implementarlas y evaluarlas en un procesador ligero basado en la arquitectura "Explicit Data Graph Execution", que encontramos prometedora para la investigación de procesadores de baja potencia. Las técnicas aplicadas mejoraran el rendimiento del procesador móvil y la eficiencia energética de sus recursos para propósito general ya existentes. El procesador con técnicas de especialización/reconfiguración habilitadas explota eficientemente el DLP sin el coste adicional de los aceleradores de propósito especial

    Numerical aerodynamic simulation facility feasibility study

    Get PDF
    There were three major issues examined in the feasibility study. First, the ability of the proposed system architecture to support the anticipated workload was evaluated. Second, the throughput of the computational engine (the flow model processor) was studied using real application programs. Third, the availability reliability, and maintainability of the system were modeled. The evaluations were based on the baseline systems. The results show that the implementation of the Numerical Aerodynamic Simulation Facility, in the form considered, would indeed be a feasible project with an acceptable level of risk. The technology required (both hardware and software) either already exists or, in the case of a few parts, is expected to be announced this year. Facets of the work described include the hardware configuration, software, user language, and fault tolerance

    Resiliency Mechanisms for In-Memory Column Stores

    Get PDF
    The key objective of database systems is to reliably manage data, while high query throughput and low query latency are core requirements. To date, database research activities mostly concentrated on the second part. However, due to the constant shrinking of transistor feature sizes, integrated circuits become more and more unreliable and transient hardware errors in the form of multi-bit flips become more and more prominent. In a more recent study (2013), in a large high-performance cluster with around 8500 nodes, a failure rate of 40 FIT per DRAM device was measured. For their system, this means that every 10 hours there occurs a single- or multi-bit flip, which is unacceptably high for enterprise and HPC scenarios. Causes can be cosmic rays, heat, or electrical crosstalk, with the latter being exploited actively through the RowHammer attack. It was shown that memory cells are more prone to bit flips than logic gates and several surveys found multi-bit flip events in main memory modules of today's data centers. Due to the shift towards in-memory data management systems, where all business related data and query intermediate results are kept solely in fast main memory, such systems are in great danger to deliver corrupt results to their users. Hardware techniques can not be scaled to compensate the exponentially increasing error rates. In other domains, there is an increasing interest in software-based solutions to this problem, but these proposed methods come along with huge runtime and/or storage overheads. These are unacceptable for in-memory data management systems. In this thesis, we investigate how to integrate bit flip detection mechanisms into in-memory data management systems. To achieve this goal, we first build an understanding of bit flip detection techniques and select two error codes, AN codes and XOR checksums, suitable to the requirements of in-memory data management systems. The most important requirement is effectiveness of the codes to detect bit flips. We meet this goal through AN codes, which exhibit better and adaptable error detection capabilities than those found in today's hardware. The second most important goal is efficiency in terms of coding latency. We meet this by introducing a fundamental performance improvements to AN codes, and by vectorizing both chosen codes' operations. We integrate bit flip detection mechanisms into the lowest storage layer and the query processing layer in such a way that the remaining data management system and the user can stay oblivious of any error detection. This includes both base columns and pointer-heavy index structures such as the ubiquitous B-Tree. Additionally, our approach allows adaptable, on-the-fly bit flip detection during query processing, with only very little impact on query latency. AN coding allows to recode intermediate results with virtually no performance penalty. We support our claims by providing exhaustive runtime and throughput measurements throughout the whole thesis and with an end-to-end evaluation using the Star Schema Benchmark. To the best of our knowledge, we are the first to present such holistic and fast bit flip detection in a large software infrastructure such as in-memory data management systems. Finally, most of the source code fragments used to obtain the results in this thesis are open source and freely available.:1 INTRODUCTION 1.1 Contributions of this Thesis 1.2 Outline 2 PROBLEM DESCRIPTION AND RELATED WORK 2.1 Reliable Data Management on Reliable Hardware 2.2 The Shift Towards Unreliable Hardware 2.3 Hardware-Based Mitigation of Bit Flips 2.4 Data Management System Requirements 2.5 Software-Based Techniques For Handling Bit Flips 2.5.1 Operating System-Level Techniques 2.5.2 Compiler-Level Techniques 2.5.3 Application-Level Techniques 2.6 Summary and Conclusions 3 ANALYSIS OF CODING TECHNIQUES 3.1 Selection of Error Codes 3.1.1 Hamming Coding 3.1.2 XOR Checksums 3.1.3 AN Coding 3.1.4 Summary and Conclusions 3.2 Probabilities of Silent Data Corruption 3.2.1 Probabilities of Hamming Codes 3.2.2 Probabilities of XOR Checksums 3.2.3 Probabilities of AN Codes 3.2.4 Concrete Error Models 3.2.5 Summary and Conclusions 3.3 Throughput Considerations 3.3.1 Test Systems Descriptions 3.3.2 Vectorizing Hamming Coding 3.3.3 Vectorizing XOR Checksums 3.3.4 Vectorizing AN Coding 3.3.5 Summary and Conclusions 3.4 Comparison of Error Codes 3.4.1 Effectiveness 3.4.2 Efficiency 3.4.3 Runtime Adaptability 3.5 Performance Optimizations for AN Coding 3.5.1 The Modular Multiplicative Inverse 3.5.2 Faster Softening 3.5.3 Faster Error Detection 3.5.4 Comparison to Original AN Coding 3.5.5 The Multiplicative Inverse Anomaly 3.6 Summary 4 BIT FLIP DETECTING STORAGE 4.1 Column Store Architecture 4.1.1 Logical Data Types 4.1.2 Storage Model 4.1.3 Data Representation 4.1.4 Data Layout 4.1.5 Tree Index Structures 4.1.6 Summary 4.2 Hardened Data Storage 4.2.1 Hardened Physical Data Types 4.2.2 Hardened Lightweight Compression 4.2.3 Hardened Data Layout 4.2.4 UDI Operations 4.2.5 Summary and Conclusions 4.3 Hardened Tree Index Structures 4.3.1 B-Tree Verification Techniques 4.3.2 Justification For Further Techniques 4.3.3 The Error Detecting B-Tree 4.4 Summary 5 BIT FLIP DETECTING QUERY PROCESSING 5.1 Column Store Query Processing 5.2 Bit Flip Detection Opportunities 5.2.1 Early Onetime Detection 5.2.2 Late Onetime Detection 5.2.3 Continuous Detection 5.2.4 Miscellaneous Processing Aspects 5.2.5 Summary and Conclusions 5.3 Hardened Intermediate Results 5.3.1 Materialization of Hardened Intermediates 5.3.2 Hardened Bitmaps 5.4 Summary 6 END-TO-END EVALUATION 6.1 Prototype Implementation 6.1.1 AHEAD Architecture 6.1.2 Diversity of Physical Operators 6.1.3 One Concrete Operator Realization 6.1.4 Summary and Conclusions 6.2 Performance of Individual Operators 6.2.1 Selection on One Predicate 6.2.2 Selection on Two Predicates 6.2.3 Join Operators 6.2.4 Grouping and Aggregation 6.2.5 Delta Operator 6.2.6 Summary and Conclusions 6.3 Star Schema Benchmark Queries 6.3.1 Query Runtimes 6.3.2 Improvements Through Vectorization 6.3.3 Storage Overhead 6.3.4 Summary and Conclusions 6.4 Error Detecting B-Tree 6.4.1 Single Key Lookup 6.4.2 Key Value-Pair Insertion 6.5 Summary 7 SUMMARY AND CONCLUSIONS 7.1 Future Work A APPENDIX A.1 List of Golden As A.2 More on Hamming Coding A.2.1 Code examples A.2.2 Vectorization BIBLIOGRAPHY LIST OF FIGURES LIST OF TABLES LIST OF LISTINGS LIST OF ACRONYMS LIST OF SYMBOLS LIST OF DEFINITION

    Automatic Performance Optimization of Stencil Codes

    Get PDF
    A widely used class of codes are stencil codes. Their general structure is very simple: data points in a large grid are repeatedly recomputed from neighboring values. This predefined neighborhood is the so-called stencil. Despite their very simple structure, stencil codes are hard to optimize since only few computations are performed while a comparatively large number of values have to be accessed, i.e., stencil codes usually have a very low computational intensity. Moreover, the set of optimizations and their parameters also depend on the hardware on which the code is executed. To cut a long story short, current production compilers are not able to fully optimize this class of codes and optimizing each application by hand is not practical. As a remedy, we propose a set of optimizations and describe how they can be applied automatically by a code generator for the domain of stencil codes. A combination of a space and time tiling is able to increase the data locality, which significantly reduces the memory-bandwidth requirements: a standard three-dimensional 7-point Jacobi stencil can be accelerated by a factor of 3. This optimization can target basically any stencil code, while others are more specialized. E.g., support for arbitrary linear data layout transformations is especially beneficial for colored kernels, such as a Red-Black Gauss-Seidel smoother. On the one hand, an optimized data layout for such kernels reduces the bandwidth requirements while, on the other hand, it simplifies an explicit vectorization. Other noticeable optimizations described in detail are redundancy elimination techniques to eliminate common subexpressions both in a sequence of statements and across loop boundaries, arithmetic simplifications and normalizations, and the vectorization mentioned previously. In combination, these optimizations are able to increase the performance not only of the model problem given by Poisson’s equation, but also of real-world applications: an optical flow simulation and the simulation of a non-isothermal and non-Newtonian fluid flow
    corecore