
    Enhancing the performance of Decoupled Software Pipeline through Backward Slicing

    The rapidly increasing number of cores available in multicore processors does not necessarily lead directly to a commensurate increase in performance: programs written in conventional languages, such as C, need careful restructuring, preferably automatically, before the benefits can be observed in improved run-times. Even then, much depends upon the intrinsic capacity of the original program for concurrent execution. The subject of this paper is the performance gains from the combined effect of the complementary techniques of the Decoupled Software Pipeline (DSWP) and (backward) slicing. DSWP extracts thread-level parallelism from the body of a loop by breaking it into stages which are then executed pipeline-style: in effect cutting across the control chain. Slicing, on the other hand, cuts the program along the control chain, teasing out finer threads that depend on different variables (or locations). The main contribution of this paper is to demonstrate that the application of DSWP followed by slicing offers notable improvements over DSWP alone, especially when there is a loop-carried dependence that prevents the application of the simpler DOALL optimization. Experimental results show an improvement of a factor of about 1.6 for DSWP + slicing over DSWP alone and a factor of about 2.4 for DSWP + slicing over the original sequential code.
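
    As an illustration of the pipelining idea described above, the following is a minimal sketch of how a loop with a loop-carried dependence (here, pointer chasing over a linked list) can be split into a traversal stage and a compute stage that communicate through a queue. The workload, names, and use of Python threads are illustrative assumptions, not the paper's implementation (which targets C programs), and Python's GIL means the sketch shows structure rather than speedup.

```python
# Minimal DSWP-style sketch: the loop-carried dependence (pointer chasing)
# is isolated in a "traverse" stage; the independent work runs in a
# "compute" stage; the stages execute pipeline-style via a queue.
import queue
import threading

class Node:
    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt

def expensive(v):            # stand-in for the independent loop body
    return sum(i * v for i in range(1000))

def traverse(head, q):       # stage 1: carries the loop dependence
    node = head
    while node is not None:
        q.put(node.value)
        node = node.next
    q.put(None)              # end-of-stream marker

def compute(q, result):      # stage 2: consumes values produced by stage 1
    total = 0
    while True:
        v = q.get()
        if v is None:
            break
        total += expensive(v)
    result.append(total)

# Build a small list and run the two stages concurrently.
head = None
for v in range(100, 0, -1):
    head = Node(v, head)

q, result = queue.Queue(maxsize=64), []
t1 = threading.Thread(target=traverse, args=(head, q))
t2 = threading.Thread(target=compute, args=(q, result))
t1.start(); t2.start(); t1.join(); t2.join()
print(result[0])
```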

    DeSC: Decoupled Supply-Compute Communication Management for Heterogeneous Architectures

    Today's computers employ significant heterogeneity to meet performance targets at manageable power. In adopting increased compute specialization, however, the relative amount of time spent on memory or communication latency has increased. System and software optimizations for memory and communication often come at the costs of increased complexity and reduced portability. We propose Decoupled Supply-Compute (DeSC) as a way to attack memory bottlenecks automatically, while maintaining good portability and low complexity. Drawing from Decoupled Access Execute (DAE) approaches, our work updates and expands on these techniques with increased specialization and automatic compiler support. Across the evaluated workloads, DeSC offers an average of 2.04x speedup over baseline (on homogeneous CMPs) and 1.56x speedup when a DeSC data supplier feeds data to a hardware accelerator. Achieving performance very close to what a perfect cache hierarchy would offer, DeSC offers the performance gains of specialized communication acceleration while maintaining useful generality across platforms.
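
    To make the decoupled supply-compute idea concrete, here is a small sketch of the decoupled access-execute pattern DeSC draws on: a supplier performs address generation and the memory accesses (a gather through an index array) and streams operands ahead to a compute consumer. This is a toy illustration under assumed names and workload, not the paper's compiler or hardware mechanism.

```python
# Hedged sketch of a decoupled access/execute split: the supplier does only
# loads (the "access" slice), the consumer does only arithmetic (the
# "execute" slice), with a queue decoupling the two.
import queue
import threading

def supplier(indices, table, q):
    # "Access" slice: address generation and loads, no computation on values.
    for i in indices:
        q.put(table[i])
    q.put(None)

def consumer(q, out):
    # "Execute" slice: computes on whatever the supplier has staged.
    acc = 0
    while (v := q.get()) is not None:
        acc += v * v
    out.append(acc)

table = list(range(10_000))
indices = [(7 * i) % len(table) for i in range(5_000)]   # irregular accesses
q, out = queue.Queue(maxsize=256), []
threads = [threading.Thread(target=supplier, args=(indices, table, q)),
           threading.Thread(target=consumer, args=(q, out))]
for t in threads: t.start()
for t in threads: t.join()
print(out[0])
```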


    Architectural Enhancements for Data Transport in Datacenter Systems

    Datacenter systems run myriad applications, which frequently communicate with each other and/or Input/Output (I/O) devices—including network adapters, storage devices, and accelerators. Due to the growing speed of I/O devices and the emergence of microservice-based programming models, I/O software stacks have become a critical factor in end-to-end communication performance. As such, I/O software stacks have been evolving rapidly in recent years. Datacenters rely on fast, efficient "Software Data Planes", which orchestrate data transfer between applications and I/O devices. The goal of this dissertation is to enhance the performance, efficiency, and scalability of software data planes by diagnosing their existing issues and addressing them through hardware-software solutions.

In the first step, I characterize challenges of modern software data planes, which bypass the operating system kernel to avoid associated overheads. Since traditional interrupts and system calls cannot be delivered to user code without kernel assistance, kernel-bypass data planes use spinning cores on I/O queues to identify work/data arrival. Spin-polling obviously wastes CPU cycles on checking empty queues; however, I show that it entails even more drawbacks: (1) Full-tilt spinning cores perform more (useless) polling work when there is less work pending in the queues. (2) Spin-polling scales poorly with the number of polled queues due to processor cache capacity constraints, especially when traffic is unbalanced. (3) Spin-polling also scales poorly with the number of cores due to the overhead of polling and operation rate limits. (4) Whereas shared queues can mitigate load imbalance and head-of-line blocking, synchronization overheads of spinning on them limit their potential benefits.

Next, I propose a notification accelerator, dubbed HyperPlane, which replaces spin-polling in software data planes. Design principles of HyperPlane are: (1) not iterating on empty I/O queues to find work/data in ready ones, (2) blocking/halting when all queues are empty rather than spinning fruitlessly, and (3) allowing multiple cores to efficiently monitor a shared set of queues. These principles lead to queue scalability, work proportionality, and the theoretical merits of shared queues. HyperPlane is realized with a programming-model front-end and a hardware microarchitecture back-end. Evaluation of HyperPlane shows its significant advantage in terms of throughput, average/tail latency, and energy efficiency over a state-of-the-art spin-polling-based software data plane, with very small power and area overheads.

Finally, I focus on the data transfer aspect of software data planes. Cache misses incurred by accessing I/O data are a major bottleneck in software data planes. Despite considerable efforts put into delivering I/O data directly to the last-level cache, some access latency is still exposed. Cores cannot prefetch such data to nearer caches in today's systems because of the complex access pattern of data buffers and the lack of an appropriate notification mechanism that can trigger the prefetch operations. As such, I propose HyperData, a data transfer accelerator based on targeted prefetching. HyperData prefetches exact (rather than predicted) data buffers (or a required subset, to avoid cache pollution) to the L1 cache of the consumer core at the right time. Prefetching can be done for both core-peripheral and core-core communications. HyperData's prefetcher is programmable and supports various queue formats, namely direct (regular), indirect (Virtio), and multi-consumer queues. I show that, with a minor overhead, HyperData effectively hides data access latency in software data planes, thereby improving both application- and system-level performance and efficiency.

PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/169826/1/hosseing_1.pd
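
    The contrast between spin-polling and notification-based waiting that motivates HyperPlane can be sketched as follows. The queue count, the use of a condition variable, and the toy producer are assumptions made for illustration; the dissertation's mechanism is a hardware accelerator with a programming-model front-end, not a software condition variable.

```python
# Illustrative contrast between spin-polling a set of I/O queues and a
# blocking, notification-based consumer. Toy code only.
import collections
import threading

queues = [collections.deque() for _ in range(8)]

def spin_poll_once():
    # Kernel-bypass style: scan every queue even when all are empty,
    # burning cycles proportional to the number of queues.
    for q in queues:
        if q:
            return q.popleft()
    return None          # nothing found; a spinning core would just retry

print(spin_poll_once())  # None: all queues are empty, yet work was done

# Notification-based alternative: producers signal a shared condition,
# consumers sleep instead of spinning on empty queues.
cv = threading.Condition()

def produce(i, item):
    with cv:
        queues[i].append(item)
        cv.notify()

def consume_blocking():
    with cv:
        while True:
            for q in queues:
                if q:
                    return q.popleft()
            cv.wait()    # halt until a producer signals new work

threading.Timer(0.1, produce, args=(3, "pkt")).start()
print(consume_blocking())   # wakes only when work actually arrives
```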

    Functional requirements document for the Earth Observing System Data and Information System (EOSDIS) Scientific Computing Facilities (SCF) of the NASA/MSFC Earth Science and Applications Division, 1992

    Five scientists at MSFC/ESAD have EOS SCF investigator status. Each SCF has unique tasks which require the establishment of a computing facility dedicated to accomplishing those tasks. An SCF Working Group was established at ESAD with the charter of defining the computing requirements of the individual SCFs and recommending options for meeting these requirements. The primary goal of the working group was to determine which computing needs can be satisfied using either shared resources or separate but compatible resources, and which needs require unique individual resources. The requirements investigated included CPU-intensive vector and scalar processing, visualization, data storage, connectivity, and I/O peripherals. A review of computer industry directions and a market survey of computing hardware provided information regarding important industry standards and candidate computing platforms. It was determined that the total SCF computing requirements might be most effectively met using a hierarchy consisting of shared and individual resources. This hierarchy is composed of five major system types: (1) a supercomputer class vector processor; (2) a high-end scalar multiprocessor workstation; (3) a file server; (4) a few medium- to high-end visualization workstations; and (5) several low- to medium-range personal graphics workstations. Specific recommendations for meeting the needs of each of these types are presented.

    Timing model derivation: static analysis of hardware description languages

    Safety-critical hard real-time systems are subject to strict timing constraints. In order to derive guarantees on the timing behavior, the worst-case execution time (WCET) of each task comprising the system has to be known. The aiT tool has been developed for computing safe upper bounds on the WCET of a task. Its computation is mainly based on abstract interpretation of timing models of the processor and its periphery. These models are currently hand-crafted by human experts, which is a time-consuming and error-prone process. Modern processors are automatically synthesized from formal hardware specifications. Besides the processor's functional behavior, timing aspects are also included in these descriptions. This thesis describes a methodology to derive sound timing models from hardware specifications. To ease the process of timing model derivation, the methodology is embedded into a sound framework. A key part of this framework is static analysis of hardware specifications. The thesis presents an analysis framework that is built on the theory of abstract interpretation, allowing the use of classical program analyses on hardware description languages. Its suitability for automating parts of the derivation methodology is shown by different analyses. Practical experiments demonstrate the applicability of the approach to derive timing models. The soundness of the analyses and of their results is also proved.
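
    As a toy illustration of the kind of analysis the thesis builds on, the sketch below runs an interval-domain abstract interpretation over a tiny, made-up specification language and derives a safe range for a latency value. The mini-language, variable names, and cycle numbers are assumptions for illustration only; the actual framework targets full hardware description languages.

```python
# Toy abstract interpretation with an interval domain: abstractly execute a
# small straight-line "specification" and report sound value ranges.
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: int
    hi: int
    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)
    def join(self, other):            # least upper bound of two abstract values
        return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

def analyze(env, program):
    """Abstractly execute a list of statements over interval values."""
    for stmt in program:
        kind = stmt[0]
        if kind == "assign":            # ("assign", var, const)
            _, var, c = stmt
            env[var] = Interval(c, c)
        elif kind == "add":             # ("add", dst, a, b)
            _, dst, a, b = stmt
            env[dst] = env[a] + env[b]
        elif kind == "branch_join":     # ("branch_join", var, then_c, else_c)
            _, var, t, e = stmt         # both outcomes possible -> join them
            env[var] = Interval(t, t).join(Interval(e, e))
    return env

# Example: a memory access costs 1 cycle on a hit and 10 on a miss; the
# analysis soundly reports the whole range without knowing which occurs.
env = analyze({}, [
    ("assign", "base", 2),
    ("branch_join", "mem_latency", 1, 10),
    ("add", "total", "base", "mem_latency"),
])
print(env["total"])   # Interval(lo=3, hi=12): a safe bound on the cycle count
```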

    Identifying, Quantifying, Extracting and Enhancing Implicit Parallelism

    The shift of the microprocessor industry towards multicore architectures has placed a huge burden on programmers by requiring explicit parallelization for performance. Implicit Parallelization is an alternative that could ease the burden on programmers by parallelizing applications "under the covers" while maintaining sequential semantics externally. This thesis develops a novel approach for thinking about parallelism, by casting the problem of parallelization in terms of instruction criticality. Using this approach, parallelism in a program region is readily identified when certain conditions about fetch-criticality are satisfied by the region. The thesis formalizes this approach by developing a criticality-driven model of task-based parallelization. The model can accurately predict the parallelism that would be exposed by potential task choices by capturing a wide set of sources of parallelism as well as costs of parallelization. The criticality-driven model enables the development of two key components for Implicit Parallelization: a task selection policy and a bottleneck analysis tool. The task selection policy can partition a single-threaded program into tasks that will profitably execute concurrently on a multicore architecture in spite of the costs associated with enforcing data dependences and with task-related actions. The bottleneck analysis tool gives feedback to programmers about data dependences that limit parallelism. In particular, there are several "accidental dependences" that can be easily removed with large improvements in parallelism. These tools combine into a systematic methodology for performance tuning in Implicit Parallelization. Finally, armed with the criticality-driven model, the thesis revisits several architectural design decisions and finds several encouraging ways forward to increase the scope of Implicit Parallelization. (Unpublished; not peer reviewed.)
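
    The dependence-graph reasoning underlying criticality-based models can be illustrated with a small sketch: estimate available parallelism as total work divided by the critical-path length of a dependence DAG. The DAG and costs below are invented for illustration; the thesis's model additionally accounts for task selection choices and parallelization costs.

```python
# Toy critical-path analysis over a dependence DAG: parallelism is bounded
# by total work divided by the longest dependence chain.
import functools

# Dependence DAG: node -> (cost in cycles, list of predecessor nodes).
dag = {
    "load_a":  (4, []),
    "load_b":  (4, []),
    "mul":     (3, ["load_a", "load_b"]),
    "add":     (1, ["mul"]),
    "store":   (4, ["add"]),
    "load_c":  (4, []),          # independent work that can overlap
    "inc_c":   (1, ["load_c"]),
}

@functools.lru_cache(maxsize=None)
def finish_time(node):
    """Earliest completion time of `node` assuming unlimited parallelism."""
    cost, preds = dag[node]
    return cost + max((finish_time(p) for p in preds), default=0)

total_work = sum(cost for cost, _ in dag.values())
critical_path = max(finish_time(n) for n in dag)
print(f"work={total_work}, critical path={critical_path}, "
      f"ideal parallelism={total_work / critical_path:.2f}")
```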

    Doctor of Philosophy

    A modern software system is a composition of parts that are themselves highly complex: operating systems, middleware, libraries, servers, and so on. In principle, compositionality of interfaces means that we can understand any given module independently of the internal workings of other parts. In practice, however, abstractions are leaky, and with every generation, modern software systems grow in complexity. Traditional ways of understanding failures, explaining anomalous executions, and analyzing performance are reaching their limits in the face of emergent behavior, unrepeatability, cross-component execution, software aging, and adversarial changes to the system at run time. Deterministic systems analysis has the potential to change the way we analyze and debug software systems. Recorded once, the execution of the system becomes an independent artifact, which can be analyzed offline. The availability of the complete system state, the guaranteed behavior of re-execution, and the absence of limitations on the run-time complexity of analysis collectively enable the deep, iterative, and automatic exploration of the dynamic properties of the system. This work creates a foundation for making deterministic replay a ubiquitous system analysis tool. It defines design and engineering principles for building fast and practical replay machines capable of capturing complete execution of the entire operating system with an overhead of a few percent, on a realistic workload, and with minimal installation costs. To enable an intuitive interface for constructing replay analysis tools, this work implements a powerful virtual machine introspection layer that enables an analysis algorithm to be programmed against the state of the recorded system through familiar terms of source-level variable and type names. To support performance analysis, the replay engine provides a faithful performance model of the original execution during replay.
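
    The record/replay principle at the core of this work can be illustrated at a conceptual level: during recording, the results of nondeterministic inputs are logged; during replay, the same values are fed back so the execution repeats exactly and can be analyzed offline. The sketch below is a toy with assumed nondeterministic sources (a clock and a random seed), not the dissertation's whole-OS replay machine.

```python
# Concept-level record/replay: log nondeterministic inputs while recording,
# replay them in order so the run is repeatable.
import json
import random
import time

def run(mode, log):
    def nondet(name, produce):
        if mode == "record":
            value = produce()
            log.append({"src": name, "value": value})
        else:                              # replay: consume the log in order
            entry = log.pop(0)
            assert entry["src"] == name, "divergence from recorded execution"
            value = entry["value"]
        return value

    seed = nondet("rng", lambda: random.randrange(1_000_000))
    stamp = nondet("clock", lambda: time.time())
    rng = random.Random(seed)
    return sum(rng.randrange(100) for _ in range(10)), stamp

log = []
result_rec = run("record", log)
saved = json.loads(json.dumps(log))        # persist and restore the log
result_rep = run("replay", saved)
print(result_rec == result_rep)            # True: replay reproduces the run
```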

    Network coding switch

    Master's thesis, Engenharia Informática (Arquitetura, Sistemas e Redes de Computadores), Universidade de Lisboa, Faculdade de Ciências, 2019.

Internet traffic is growing at a high rate. Bottlenecks are therefore an increasingly common occurrence, resulting in delays in transporting information and in inefficiencies. This problem stems in part from the traditional store-and-forward paradigm: when a packet arrives at a network node, it is held in a queue while awaiting a forwarding decision. Under heavy traffic, packet queues grow and delays increase (as do packet losses). The concept of Network Coding seeks to offer an alternative paradigm. The fundamental idea is that, in addition to storing and forwarding packets, nodes are given the ability to combine them. With this technique it is possible to increase data transfer rates as well as the resilience of the network. To better understand the concept, consider an example: nodes A and B communicate through an access point S in a wireless environment. Under the traditional model, the transmissions needed for A to send message a to B, and for B to send message b to A, are: (1) A sends a to S; (2) B sends b to S; (3) S broadcasts a to both nodes; (4) S broadcasts b to both nodes. Four transmissions are needed in total. By applying network coding we can save on the number of transmissions as follows: (1) A sends a to S; (2) B sends b to S; (3) S combines the two messages by XORing them and sends the result, a ⊕ b, to A and B.

The example above is, however, a base case of Linear Network Coding (LNC). This coding technique gives each network node the ability to generate new packets as linear combinations of previously received packets, multiplying them by coefficients chosen from a given finite field, most commonly of size 2^8. In the XOR example above, in which two packets were combined, the finite field size was 2, which is thus a particular case. LNC, however, requires the coefficients used in the linear combinations to be defined and computed a priori by all network nodes through an algorithm and shared information, a limitation that introduces a cost. Random Linear Network Coding (RLNC), a variant of LNC, overcomes this limitation thanks to its random nature: the coefficients used in the linear combinations are generated at random from a given finite field. This guarantees, with a certain probability (provided the finite field is sufficiently large), that the generated linear combinations are independent of one another. To further increase this probability, RLNC also introduces the ability to recode packets, that is, to code packets that have already been coded by another node in the network. When the destination node has received enough linearly independent coded packets, it can decode them by solving the linear combinations. To do so, the destination must know the coefficients used in the combinations; in RLNC, therefore, the coefficients are typically appended to the packet header after coding so that they travel with the packet to the destination. Both the coding and decoding operations introduce computational complexity proportional to the amount of data being transmitted. Generation-based RLNC addresses this problem by dividing large amounts of data into smaller blocks, called generations, so that coding and decoding are applied per generation rather than over the whole data.

There is a large body of theoretical work on Network Coding, as well as implementations at the application layer. However, no concrete work has aimed at designing and implementing a Network Coding solution directly in the network data plane, because switches have traditionally been specialized, fixed-function hardware that does not allow packets to be combined. Recently, however, programmable switches have been developed that remove this restriction. Unlike traditional switches, which are closed devices following a set of protocols defined by the manufacturer, these switches allow the operator to define exactly how packets are processed. A high-level language for programming them, called P4, has also been developed. In summary, a limitation of all existing network coding solutions is that they are software implementations, a consequence of the inflexibility of traditional hardware data planes (switches and routers), which do not allow packets to be combined. In this dissertation we begin to attack this problem by exploiting the new programmable hardware switches, designing and implementing a switch that performs Random Linear Network Coding using the most recent version of the P4 switch programming language (specifically, P4_16). The evaluation of our solution offers good prospects for deploying these network coding techniques in hardware, but also reveals some of the challenges that remain open for future work.
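
    The coding operations described above can be sketched in a few lines: random linear combinations over GF(2^8) on the sender side and Gaussian-elimination decoding on the receiver side, applied to one small generation. The symbol sizes and helper names are illustrative assumptions; the dissertation's contribution is performing such (re)coding inside a programmable switch in P4_16, not in software as shown here.

```python
# Minimal RLNC sketch over GF(2^8): source symbols are combined with random
# coefficients; the receiver decodes by Gaussian elimination once it holds
# enough linearly independent coded packets.
import random

POLY = 0x11B                                   # reduction polynomial for GF(2^8)

def gf_mul(a, b):
    """Carry-less multiply in GF(2^8), reducing modulo POLY."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return p

def gf_inv(a):
    """Multiplicative inverse: a^254 == a^-1 in GF(2^8)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def encode(symbols):
    """Produce one coded packet: random coefficients plus the combination."""
    coeffs = [random.randrange(256) for _ in symbols]
    combo = [0] * len(symbols[0])
    for c, sym in zip(coeffs, symbols):
        combo = [b ^ gf_mul(c, s) for b, s in zip(combo, sym)]
    return coeffs, combo

def decode(packets, k):
    """Gauss-Jordan elimination on the augmented matrix [coefficients | payload]."""
    rows = [list(c) + list(p) for c, p in packets]
    for col in range(k):
        pivot = next((r for r in range(col, len(rows)) if rows[r][col]), None)
        if pivot is None:
            raise ValueError("need more coded packets: coefficients not independent")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = gf_inv(rows[col][col])
        rows[col] = [gf_mul(inv, x) for x in rows[col]]
        for r in range(len(rows)):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [x ^ gf_mul(f, y) for x, y in zip(rows[r], rows[col])]
    return [bytes(row[k:]) for row in rows[:k]]

# One "generation" of k = 3 source symbols, 4 bytes each; a fourth coded
# packet adds a little redundancy against unlucky (dependent) coefficients.
generation = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
coded = [encode(generation) for _ in range(4)]
print(decode(coded, k=3) == generation)        # True with high probability
```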