973 research outputs found

    Field-based branch prediction for packet processing engines

    Get PDF
    Network processors have exploited many aspects of architecture design, such as employing multi-core, multi-threading and hardware accelerator, to support both the ever-increasing line rates and the higher complexity of network applications. Micro-architectural techniques like superscalar, deep pipeline and speculative execution provide an excellent method of improving performance without limiting either the scalability or flexibility, provided that the branch penalty is well controlled. However, it is difficult for traditional branch predictor to keep increasing the accuracy by using larger tables, due to the fewer variations in branch patterns of packet processing. To improve the prediction efficiency, we propose a flow-based prediction mechanism which caches the branch histories of packets with similar header fields, since they normally undergo the same execution path. For packets that cannot find a matching entry in the history table, a fallback gshare predictor is used to provide branch direction. Simulation results show that the our scheme achieves an average hit rate in excess of 97.5% on a selected set of network applications and real-life packet traces, with a similar chip area to the existing branch prediction architectures used in modern microprocessors

    Measuring and Managing Answer Quality for Online Data-Intensive Services

    Full text link
    Online data-intensive services parallelize query execution across distributed software components. Interactive response time is a priority, so online query executions return answers without waiting for slow running components to finish. However, data from these slow components could lead to better answers. We propose Ubora, an approach to measure the effect of slow running components on the quality of answers. Ubora randomly samples online queries and executes them twice. The first execution elides data from slow components and provides fast online answers; the second execution waits for all components to complete. Ubora uses memoization to speed up mature executions by replaying network messages exchanged between components. Our systems-level implementation works for a wide range of platforms, including Hadoop/Yarn, Apache Lucene, the EasyRec Recommendation Engine, and the OpenEphyra question answering system. Ubora computes answer quality much faster than competing approaches that do not use memoization. With Ubora, we show that answer quality can and should be used to guide online admission control. Our adaptive controller processed 37% more queries than a competing controller guided by the rate of timeouts.Comment: Technical Repor

    Fluid Stochastic Petri Nets: From Fluid Atoms in ILP Processor Pipelines to Fluid Atoms in P2P Streaming Networks

    Get PDF
    © 2012 Mitrevski and Kotevski, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Fluid Stochastic Petri Nets: From Fluid Atoms in ILP Processor Pipelines to Fluid Atoms in P2P Streaming Networ

    POSE: getting over grainsize in parallel discrete event simulation

    Full text link
    Parallel discrete event simulations (PDES) encom-pass a broad range of analytical simulations. Their utility lies in their ability to model a system and pro-vide information about its behavior in a timely manner. Current PDES methods provide limited performance im-provements over sequential simulation. Many logical models for applications have fine granularity making them challenging to parallelize. In POSE, we exam-ine the overhead required for optimistically synchroniz-ing events. We have designed an object model based on the concept of virtualization and new adaptive op-timistic methods to improve the performance of fine-grained PDES applications. These novel approaches exploit the speculative nature of optimistic protocols to improve single-processor parallel over sequential per-formance and achieve scalability for previously hard-to-parallelize fine-grained simulations.1 1

    Empirical and Statistical Application Modeling Using on -Chip Performance Monitors.

    Get PDF
    To analyze the performance of applications and architectures, both programmers and architects desire formal methods to explain anomalous behavior. To this end, we present various methods that utilize non-intrusive, performance-monitoring hardware only recently available on microprocessors to provide further explanations of observed behavior. All the methods attempt to characterize and explain the instruction-level parallelism achieved by codes on different architectures. We also present a prototype tool automating the analysis process to exploit the advantages of the empirical and statistical methods proposed. The empirical, statistical and hybrid methods are discussed and explained with case study results provided. The given methods further the wealth of tools available to programmer\u27s and architects for generally understanding the performance of scientific applications. Specifically, the models and tools presented provide new methods for evaluating and categorizing application performance. The empirical memory model serves to quantify the hierarchical memory performance of applications by inferring the incurred latencies of codes after the effect of latency hiding techniques are realized. The instruction-level model and its extensions model on-chip performance analytically giving insight into inherent performance bottlenecks in superscalar architectures. The statistical model and its hybrid extension provide other methods of categorizing codes via their statistical variations. The PTERA performance tool automates the use of performance counters for use by these methods across platforms making the modeling process easier still. These unique methods provide alternatives to performance modeling and categorizing not available previously in an attempt to utilize the inherent modeling capabilities of performance monitors on commodity processors for scientific applications

    A Survey of Techniques for Architecting TLBs

    Get PDF
    “Translation lookaside buffer” (TLB) caches virtual to physical address translation information and is used in systems ranging from embedded devices to high-end servers. Since TLB is accessed very frequently and a TLB miss is extremely costly, prudent management of TLB is important for improving performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects and system engineers
    corecore