2,992 research outputs found

    Runtime Optimizations for Prediction with Tree-Based Models

    Full text link
    Tree-based models have proven to be an effective solution for web ranking as well as other problems in diverse domains. This paper focuses on optimizing the runtime performance of applying such models to make predictions, given an already-trained model. Although exceedingly simple conceptually, most implementations of tree-based models do not efficiently utilize modern superscalar processor architectures. By laying out data structures in memory in a more cache-conscious fashion, removing branches from the execution flow using a technique called predication, and micro-batching predictions using a technique called vectorization, we are able to better exploit modern processor architectures and significantly improve the speed of tree-based models over hard-coded if-else blocks. Our work contributes to the exploration of architecture-conscious runtime implementations of machine learning algorithms

    PERFORMANCE EVALUATION OF MEMORY AND COMPUTATIONALLY BOUND CHEMISTRY APPLICATIONS ON STREAMING GPGPUS AND MULTI-CORE X86 CPUS

    Get PDF
    In recent years, multi-core processors have come to dominate the field in desktop and high performance computing. Graphics processors traditionally used in CAD, video games, and other 3-d applications, have become more programmable and are now suitable for general purpose computing. This thesis explores multi-core processors and GPU performance and limitations in two computational chemistry applications: a memory bound component of ab-initio modeling and a computationally bound Monte Carlo simulation. For the applications presented in this thesis, exploiting multiple processors is done using a variety of tools and languages including OpenMP and MKL. Brook+ and the Compute Abstraction Layer streaming environments are used to accelerate applications on AMD GPUs. This thesis gives qualitative assertions about these languages and tools regarding ease of use and optimization in addition to quantitative analyses of performance. GPUs can yield modest performance improvements with little effort in some applications and even larger speedups with simple optimizations

    A Detailed Analysis of Contemporary ARM and x86 Architectures

    Get PDF
    RISC vs. CISC wars raged in the 1980s when chip area and processor design complexity were the primary constraints and desktops and servers exclusively dominated the computing landscape. Today, energy and power are the primary design constraints and the computing landscape is significantly different: growth in tablets and smartphones running ARM (a RISC ISA) is surpassing that of desktops and laptops running x86 (a CISC ISA). Further, the traditionally low-power ARM ISA is entering the high-performance server market, while the traditionally high-performance x86 ISA is entering the mobile low-power device market. Thus, the question of whether ISA plays an intrinsic role in performance or energy efficiency is becoming important, and we seek to answer this question through a detailed measurement based study on real hardware running real applications. We analyze measurements on the ARM Cortex-A8 and Cortex-A9 and Intel Atom and Sandybridge i7 microprocessors over workloads spanning mobile, desktop, and server computing. Our methodical investigation demonstrates the role of ISA in modern microprocessors? performance and energy efficiency. We find that ARM and x86 processors are simply engineering design points optimized for different levels of performance, and there is nothing fundamentally more energy efficient in one ISA class or the other. The ISA being RISC or CISC seems irrelevant

    On the co-design of scientific applications and long vector architectures

    Get PDF
    The landscape of High Performance Computing (HPC) system architectures keeps expanding with new technologies and increased complexity. To improve the efficiency of next-generation compute devices, architects are looking for solutions beyond the commodity CPU approach. In 2021, the five most powerful supercomputers in the world use either GP-GPU (General-purpose computing on graphics processing units) accelerators or a customized CPU specially designed to target HPC applications. This trend is only expected to grow in the next years motivated by the compute demands of science and industry. As architectures evolve, the ecosystem of tools and applications must follow. The choices in the number of cores in a socket, the floating point-units per core and the bandwidth through the memory hierarchy among others, have a large impact in the power consumption and compute capabilities of the devices. To balance CPU and accelerators, designers require accurate tools for analyzing and predicting the impact of new architectural features on the performance of complex scientific applications at scale. In such a large design space, capturing and modeling with simulators the complex interactions between the system software and hardware components is a defying challenge. Moreover, applications must be able to exploit those designs with aggressive compute capabilities and memory bandwidth configurations. Algorithms and data structures will need to be redesigned accordingly to expose a high degree of data-level parallelism allowing them to scale in large systems. Therefore, next-generation computing devices will be the result of a co-design effort in hardware and applications supported by advanced simulation tools. In this thesis, we focus our work on the co-design of scientific applications and long vector architectures. We significantly extend a multi-scale simulation toolchain enabling accurate performance and power estimations of large-scale HPC systems. Through simulation, we explore the large design space in current HPC trends over a wide range of applications. We extract speedup and energy consumption figures analyzing the trade-offs and optimal configurations for each of the applications. We describe in detail the optimization process of two challenging applications on real vector accelerators, achieving outstanding operation performance and full memory bandwidth utilization. Overall, we provide evidence-based architectural and programming recommendations that will serve as hardware and software co-design guidelines for the next generation of specialized compute devices.El panorama de las arquitecturas de los sistemas para la Computación de Alto Rendimiento (HPC, de sus siglas en inglés) sigue expandiéndose con nuevas tecnologías y complejidad adicional. Para mejorar la eficiencia de la próxima generación de dispositivos de computación, los arquitectos están buscando soluciones más allá de las CPUs. En 2021, los cinco supercomputadores más potentes del mundo utilizan aceleradores gráficos aplicados a propósito general (GP-GPU, de sus siglas en inglés) o CPUs diseñadas especialmente para aplicaciones HPC. En los próximos años, se espera que esta tendencia siga creciendo motivada por las demandas de más potencia de computación de la ciencia y la industria. A medida que las arquitecturas evolucionan, el ecosistema de herramientas y aplicaciones les debe seguir. Las decisiones eligiendo el número de núcleos por zócalo, las unidades de coma flotante por núcleo y el ancho de banda a través de la jerarquía de memoría entre otros, tienen un gran impacto en el consumo de energía y las capacidades de cómputo de los dispositivos. Para equilibrar las CPUs y los aceleradores, los diseñadores deben utilizar herramientas precisas para analizar y predecir el impacto de nuevas características de la arquitectura en el rendimiento de complejas aplicaciones científicas a gran escala. Dado semejante espacio de diseño, capturar y modelar con simuladores las complejas interacciones entre el software de sistema y los componentes de hardware es un reto desafiante. Además, las aplicaciones deben ser capaces de explotar tales diseños con agresivas capacidades de cómputo y ancho de banda de memoria. Los algoritmos y estructuras de datos deberán ser rediseñadas para exponer un alto grado de paralelismo de datos permitiendo así escalarlos en grandes sistemas. Por lo tanto, la siguiente generación de dispósitivos de cálculo será el resultado de un esfuerzo de codiseño tanto en hardware como en aplicaciones y soportado por avanzadas herramientas de simulación. En esta tesis, centramos nuestro trabajo en el codiseño de aplicaciones científicas y arquitecturas vectoriales largas. Extendemos significativamente una serie de herramientas para la simulación multiescala permitiendo así obtener estimaciones de rendimiento y potencia de sistemas HPC de gran escala. A través de simulaciones, exploramos el gran espacio de diseño de las tendencias actuales en HPC sobre un amplio rango de aplicaciones. Extraemos datos sobre la mejora y el consumo energético analizando las contrapartidas y las configuraciones óptimas para cada una de las aplicaciones. Describimos en detalle el proceso de optimización de dos aplicaciones en aceleradores vectoriales, obteniendo un rendimiento extraordinario a nivel de operaciones y completa utilización del ancho de memoria disponible. Con todo, ofrecemos recomendaciones empíricas a nivel de arquitectura y programación que servirán como instrucciones para diseñar mejor hardware y software para la siguiente generación de dispositivos de cálculo especializados.Postprint (published version

    Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential

    Get PDF
    Emerging computer architectures will feature drastically decreased flops/byte (ratio of peak processing rate to memory bandwidth) as highlighted by recent studies on Exascale architectural trends. Further, flops are getting cheaper while the energy cost of data movement is increasingly dominant. The understanding and characterization of data locality properties of computations is critical in order to guide efforts to enhance data locality. Reuse distance analysis of memory address traces is a valuable tool to perform data locality characterization of programs. A single reuse distance analysis can be used to estimate the number of cache misses in a fully associative LRU cache of any size, thereby providing estimates on the minimum bandwidth requirements at different levels of the memory hierarchy to avoid being bandwidth bound. However, such an analysis only holds for the particular execution order that produced the trace. It cannot estimate potential improvement in data locality through dependence preserving transformations that change the execution schedule of the operations in the computation. In this article, we develop a novel dynamic analysis approach to characterize the inherent locality properties of a computation and thereby assess the potential for data locality enhancement via dependence preserving transformations. The execution trace of a code is analyzed to extract a computational directed acyclic graph (CDAG) of the data dependences. The CDAG is then partitioned into convex subsets, and the convex partitioning is used to reorder the operations in the execution trace to enhance data locality. The approach enables us to go beyond reuse distance analysis of a single specific order of execution of the operations of a computation in characterization of its data locality properties. It can serve a valuable role in identifying promising code regions for manual transformation, as well as assessing the effectiveness of compiler transformations for data locality enhancement. We demonstrate the effectiveness of the approach using a number of benchmarks, including case studies where the potential shown by the analysis is exploited to achieve lower data movement costs and better performance.Comment: Transaction on Architecture and Code Optimization (2014

    A low-power, high-performance speech recognition accelerator

    Get PDF
    © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at high energy cost, not being affordable for the tiny power-budgeted mobile devices. Hardware acceleration reduces energy-consumption of ASR systems, while delivering high-performance. In this paper, we present an accelerator for largevocabulary, speaker-independent, continuous speech-recognition. It focuses on the Viterbi search algorithm representing the main bottleneck in an ASR system. The proposed design consists of innovative techniques to improve the memory subsystem, since memory is the main bottleneck for performance and power in these accelerators' design. It includes a prefetching scheme tailored to the needs of ASR systems that hides main memory latency for a large fraction of the memory accesses, negligibly impacting area. Additionally, we introduce a novel bandwidth-saving technique that removes off-chip memory accesses by 20 percent. Finally, we present a power saving technique that significantly reduces the leakage power of the accelerators scratchpad memories, providing between 8.5 and 29.2 percent reduction in entire power dissipation. Overall, the proposed design outperforms implementations running on the CPU by orders of magnitude, and achieves speedups between 1.7x and 5.9x for different speech decoders over a highly optimized CUDA implementation running on Geforce-GTX-980 GPU, while reducing the energy by 123-454x.Peer ReviewedPostprint (author's final draft

    Software Grand Exposure: SGX Cache Attacks Are Practical

    Full text link
    Side-channel information leakage is a known limitation of SGX. Researchers have demonstrated that secret-dependent information can be extracted from enclave execution through page-fault access patterns. Consequently, various recent research efforts are actively seeking countermeasures to SGX side-channel attacks. It is widely assumed that SGX may be vulnerable to other side channels, such as cache access pattern monitoring, as well. However, prior to our work, the practicality and the extent of such information leakage was not studied. In this paper we demonstrate that cache-based attacks are indeed a serious threat to the confidentiality of SGX-protected programs. Our goal was to design an attack that is hard to mitigate using known defenses, and therefore we mount our attack without interrupting enclave execution. This approach has major technical challenges, since the existing cache monitoring techniques experience significant noise if the victim process is not interrupted. We designed and implemented novel attack techniques to reduce this noise by leveraging the capabilities of the privileged adversary. Our attacks are able to recover confidential information from SGX enclaves, which we illustrate in two example cases: extraction of an entire RSA-2048 key during RSA decryption, and detection of specific human genome sequences during genomic indexing. We show that our attacks are more effective than previous cache attacks and harder to mitigate than previous SGX side-channel attacks
    corecore