348 research outputs found

    Instruction cache optimizations for embedded systems

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    Balancing Design Options with Sherpa

    Get PDF
    Application specific processors offer the potential of rapidly designed logic specifically constructed to meet the performance and area demands of the task at hand. Recently, there have been several major projects that attempt to automate the process of transforming a predetermined processor configuration into a low level description for fabrication. These projects either leave the specification of the processor to the designer, which can be a significant engineering burden, or handle it in a fully automated fashion, which completely removes the designer from the loop. In this paper we introduce a technique for guiding the design and optimization of application specific processors. The goal of the Sherpa design framework is to automate certain design tasks and provide early feedback to help the designer navigate their way through the architecture design space. Our approach is to decompose the overall problem of choosing an optimal architecture into a set of sub-problems that are, to the first order, independent. For each subproblem, we create a model that relates performance to area. From this, we build a constraint system that can be solved using integer-linear programming techniques, and arrive at an ideal parameter selection for all architectural components. Our approach only takes a few minutes to explore the design space allowing the designer or compiler to see the potential benefits of optimizations rapidly. We show that the expected performance using our model correlates strongly to detailed pipeline simulations, and present results showing design tradeoffs for several different benchmarks

    Partial aggregation for collective communication in distributed memory machines

    Get PDF
    High Performance Computing (HPC) systems interconnect a large number of Processing Elements (PEs) in high-bandwidth networks to simulate complex scientific problems. The increasing scale of HPC systems poses great challenges on algorithm designers. As the average distance between PEs increases, data movement across hierarchical memory subsystems introduces high latency. Minimizing latency is particularly challenging in collective communications, where many PEs may interact in complex communication patterns. Although collective communications can be optimized for network-level parallelism, occasional synchronization delays due to dependencies in the communication pattern degrade application performance. To reduce the performance impact of communication and synchronization costs, parallel algorithms are designed with sophisticated latency hiding techniques. The principle is to interleave computation with asynchronous communication, which increases the overall occupancy of compute cores. However, collective communication primitives abstract parallelism which limits the integration of latency hiding techniques. Approaches to work around these limitations either modify the algorithmic structure of application codes, or replace collective primitives with verbose low-level communication calls. While these approaches give fine-grained control for latency hiding, implementing collective communication algorithms is challenging and requires expertise knowledge about HPC network topologies. A collective communication pattern is commonly described as a Directed Acyclic Graph (DAG) where a set of PEs, represented as vertices, resolve data dependencies through communication along the edges. Our approach improves latency hiding in collective communication through partial aggregation. Based on mathematical rules of binary operations and homomorphism, we expose data parallelism in a respective DAG to overlap computation with communication. The proposed concepts are implemented and evaluated with a subset of collective primitives in the Message Passing Interface (MPI), an established communication standard in scientific computing. An experimental analysis with communication-bound microbenchmarks shows considerable performance benefits for the evaluated collective primitives. A detailed case study with a large-scale distributed sort algorithm demonstrates, how partial aggregation significantly improves performance in data-intensive scenarios. Besides better latency hiding capabilities with collective communication primitives, our approach enables further optimizations of their implementations within MPI libraries. The vast amount of asynchronous programming models, which are actively studied in the HPC community, benefit from partial aggregation in collective communication patterns. Future work can utilize partial aggregation to improve the interaction of MPI collectives with acclerator architectures, and to design more efficient communication algorithms

    Design and simulation of a wafer scale integration magnetoresistive memory architecture

    Get PDF
    Magnetoresistive memory has been identified as a viable candidate for water scale integration. Water scale integration requires an architecture which tolerates defects with redundant elements. A wafer scale architecture with multiple levels of redundancy is designed for magnetoresistive memory. Yield models which consider multiple types of defects and defect clustering are used to calculate module and wafer level yields. The yields of various redundancy combinations are simulated to optimize the architecture for current semiconductor manufacturing processes. The optimized redundancy has six spares at the 16 Kbit module level and two spares at the 1 Mbit module level. This architecture yields a capacity of over 250 megabytes for all the statistical processing distributions considered in the simulation. The areal cost of the redundancy is less than 10 percent

    What broke where for distributed and parallel applications — a whodunit story

    Get PDF
    Detection, diagnosis and mitigation of performance problems in today\u27s large-scale distributed and parallel systems is a difficult task. These large distributed and parallel systems are composed of various complex software and hardware components. When the system experiences some performance or correctness problem, developers struggle to understand the root cause of the problem and fix in a timely manner. In my thesis, I address these three components of the performance problems in computer systems. First, we focus on diagnosing performance problems in large-scale parallel applications running on supercomputers. We developed techniques to localize the performance problem for root-cause analysis. Parallel applications, most of which are complex scientific simulations running in supercomputers, can create up to millions of parallel tasks that run on different machines and communicate using the message passing paradigm. We developed a highly scalable and accurate automated debugging tool called PRODOMETER, which uses sophisticated algorithms to first, create a logical progress dependency graph of the tasks to highlight how the problem spread through the system manifesting as a system-wide performance issue. Second, uses this logical progress dependence graph to identify the task where the problem originated. Finally, PRODOMETER pinpoints the code region corresponding to the origin of the bug. Second, we developed a tool-chain that can detect performance anomaly using machine-learning techniques and can achieve very low false positive rate. Our input-aware performance anomaly detection system consists of a scalable data collection framework to collect performance related metrics from different granularity of code regions, an offline model creation and prediction-error characterization technique, and a threshold based anomaly-detection-engine for production runs. Our system requires few training runs and can handle unknown inputs and parameter combinations by dynamically calibrating the anomaly detection threshold according to the characteristics of the input data and the characteristics of the prediction-error of the models. Third, we developed performance problem mitigation scheme for erasure-coded distributed storage systems. Repair operations of the failed blocks in erasure-coded distributed storage system take really long time in networked constrained data-centers. The reason being, during the repair operation for erasure-coded distributed storage, a lot of data from multiple nodes are gathered into a single node and then a mathematical operation is performed to reconstruct the missing part. This process severely congests the links toward the destination where newly recreated data is to be hosted. We proposed a novel distributed repair technique, called Partial-Parallel-Repair (PPR) that performs this reconstruction in parallel on multiple nodes and eliminates network bottlenecks, and as a result, greatly speeds up the repair process. Fourth, we study how for a class of applications, performance can be improved (or performance problems can be mitigated) by selectively approximating some of the computations. For many applications, the main computation happens inside a loop that can be logically divided into a few temporal segments, we call phases. We found that while approximating the initial phases might severely degrade the quality of the results, approximating the computation for the later phases have very small impact on the final quality of the result. Based on this observation, we developed an optimization framework that for a given budget of quality-loss, would find the best approximation settings for each phase in the execution

    Design and Code Optimization for Systems with Next-generation Racetrack Memories

    Get PDF
    With the rise of computationally expensive application domains such as machine learning, genomics, and fluids simulation, the quest for performance and energy-efficient computing has gained unprecedented momentum. The significant increase in computing and memory devices in modern systems has resulted in an unsustainable surge in energy consumption, a substantial portion of which is attributed to the memory system. The scaling of conventional memory technologies and their suitability for the next-generation system is also questionable. This has led to the emergence and rise of nonvolatile memory ( NVM ) technologies. Today, in different development stages, several NVM technologies are competing for their rapid access to the market. Racetrack memory ( RTM ) is one such nonvolatile memory technology that promises SRAM -comparable latency, reduced energy consumption, and unprecedented density compared to other technologies. However, racetrack memory ( RTM ) is sequential in nature, i.e., data in an RTM cell needs to be shifted to an access port before it can be accessed. These shift operations incur performance and energy penalties. An ideal RTM , requiring at most one shift per access, can easily outperform SRAM . However, in the worst-cast shifting scenario, RTM can be an order of magnitude slower than SRAM . This thesis presents an overview of the RTM device physics, its evolution, strengths and challenges, and its application in the memory subsystem. We develop tools that allow the programmability and modeling of RTM -based systems. For shifts minimization, we propose a set of techniques including optimal, near-optimal, and evolutionary algorithms for efficient scalar and instruction placement in RTMs . For array accesses, we explore schedule and layout transformations that eliminate the longer overhead shifts in RTMs . We present an automatic compilation framework that analyzes static control flow programs and transforms the loop traversal order and memory layout to maximize accesses to consecutive RTM locations and minimize shifts. We develop a simulation framework called RTSim that models various RTM parameters and enables accurate architectural level simulation. Finally, to demonstrate the RTM potential in non-Von-Neumann in-memory computing paradigms, we exploit its device attributes to implement logic and arithmetic operations. As a concrete use-case, we implement an entire hyperdimensional computing framework in RTM to accelerate the language recognition problem. Our evaluation shows considerable performance and energy improvements compared to conventional Von-Neumann models and state-of-the-art accelerators

    Systematic analysis of the cache behavior of irregular codes

    Get PDF
    [Resumen] El rendimiento de las jerarquías de memoria, en las cuales la caché juega un papel fundamental, es crítico en los computadores de proposito general actuales y en los sistemas embebidos, debido al creciente problema del cuello de botella del sistema de memoria. Desafortunadamente, el comportamiento de la caché es muy inestable y difícil de predecir. Esto es especialmente cierto en presencia de patrones de acceso irregulares, los cuales exhiben poca localidad. Tales patrones son muy comunes por ejemplo en aplicaciones en las cuales algunas referencias están afectadas por sentencias condicionales o en las que el almacenamiento comprimido de matrices dispersas da lugar a la aparición de indirecciones. SIn embargo, el comportamiento caché en presencia de patrones de acceso irregulares no ha sido estudiado ampliamente. En esta tesis presentamos extensiones de una técnica de modelado analítico sistemático basadas en PMEs (Ecuaciones probabilísticas de fallos) que permiten el análisis automático del comportamiento caché para códigos que incluyen sentencias condicionales cuyo valor de verdad puede no ser determinable en tiempo de compilación y códigos con referencias irregulares debidas a indirecciones, respectivamente. El modelo genera predicciones muy precisar a pesar de la irregularidad y tiene un bajo coste computacional siendo el primer modelo que reune estas dos características capaz de analizar automáticamente esta clase de códigos. Estas propiedades convierten al modelo en adecuado para servir de guía en optimizaciones del compilador. La extensión del modelo para códigos irregulares con indirecciones ha sido integrada en el compilador XARK, un compilador orientado al reconocimiento automático de kernels en aplicaciones científicas. Mostramos como explotar las potentes capacidades de extracción de información de este compilador para permitir el modelado automático de códigos científicos basados en bucles

    Assorted algorithms and protocols for secure computation

    Get PDF
    corecore