
    A case for merging the ILP and DLP paradigms

    The goal of this paper is to show that instruction-level parallelism (ILP) and data-level parallelism (DLP) can be merged in a single architecture to execute vectorizable code at a performance level that cannot be achieved using either paradigm on its own. We show that the combination of the two techniques yields very high performance at low cost and low complexity, and that this architecture can reach a performance equivalent to a superscalar processor sustaining 10 instructions per cycle. The machine exploiting both types of parallelism improves upon the ILP-only machine by factors of 1.5-1.8. We also present a study on the scalability of both paradigms and show that, when we increase resources to reach a 16-issue machine, the advantage of the ILP+DLP machine over the ILP-only machine grows to 2.0-3.45. While the peak IPC achieved by the ILP machine is 4, the ILP+DLP machine exceeds 10 instructions per cycle.
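
    The intuition behind the combined machine's advantage on vectorizable code can be sketched with a toy Amdahl-style model (my simplification, not the paper's simulation; the issue width, vector length, and vectorizable fractions below are illustrative): vector lanes multiply throughput on the vectorizable fraction, while issue width alone bounds the rest.

    ```python
    def ops_per_cycle(ipc, vlen, vec_fraction):
        """Toy Amdahl-style throughput model: a `vec_fraction` of the
        element operations is vectorizable, and each vector instruction
        covers `vlen` elements; the rest issues as scalar work."""
        scalar_time = (1 - vec_fraction) / ipc
        vector_time = vec_fraction / (ipc * vlen)
        return 1 / (scalar_time + vector_time)

    # ILP-only machine: vlen = 1.  ILP+DLP: same issue width plus 8 lanes.
    for f in (0.7, 0.9):
        ilp_only = ops_per_cycle(ipc=4, vlen=1, vec_fraction=f)
        ilp_dlp = ops_per_cycle(ipc=4, vlen=8, vec_fraction=f)
        print(f"{f:.0%} vectorizable -> {ilp_dlp / ilp_only:.1f}x over ILP-only")
    ```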

    DeLTA: GPU Performance Model for Deep Learning Applications with In-depth Memory System Traffic Analysis

    Training convolutional neural networks (CNNs) requires intense compute throughput and high memory bandwidth. In particular, convolution layers account for the majority of the execution time of CNN training, and GPUs are commonly used to accelerate these workloads. Optimizing GPU designs for efficient CNN training requires accurately modeling how performance improves as computing and memory resources are increased. We present DeLTA, the first analytical model that accurately estimates the traffic at each level of the GPU memory hierarchy while accounting for the complex reuse patterns of a parallel convolution algorithm. We demonstrate that our model is both accurate and robust across different CNNs and GPU architectures, and we show how it can be used to balance the scaling of different GPU resources for efficient CNN performance improvement.
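
    The flavor of such traffic accounting can be shown with a minimal sketch (mine, not DeLTA's model; the layer shape, the perfect-reuse assumption, and 4-byte elements are illustrative): the compulsory DRAM traffic of a convolution layer moves each tensor exactly once, and dividing the layer's flops by that traffic gives its best-case arithmetic intensity.

    ```python
    def conv_dram_traffic(n, c, h, w, k, r, s, bytes_per_elem=4):
        """Compulsory DRAM traffic (bytes) for one convolution layer,
        assuming perfect on-chip reuse: every tensor crosses the DRAM
        interface exactly once ('same' padding, stride 1)."""
        inputs = n * c * h * w       # activations read
        weights = k * c * r * s      # filter weights read
        outputs = n * k * h * w      # activations written
        return (inputs + weights + outputs) * bytes_per_elem

    def conv_flops(n, c, h, w, k, r, s):
        """Floating-point operations for the same layer (2 per MAC)."""
        return 2 * n * k * h * w * c * r * s

    # Hypothetical ResNet-style layer: batch 32, 256 channels in and out,
    # 14x14 feature maps, 3x3 filters.
    shape = dict(n=32, c=256, h=14, w=14, k=256, r=3, s=3)
    traffic = conv_dram_traffic(**shape)
    print(f"minimum DRAM traffic: {traffic / 2**20:.1f} MiB")
    print(f"arithmetic intensity: {conv_flops(**shape) / traffic:.0f} flop/byte")
    ```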

    MLPerf Inference Benchmark

    Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that incorporate existing models span at least three orders of magnitude in power consumption and five orders of magnitude in performance; they range from embedded devices to data-center solutions. Fueling the hardware are a dozen or more software frameworks and libraries. The myriad combinations of ML hardware and ML software make assessing ML-system performance in an architecture-neutral, representative, and reproducible manner challenging. There is a clear need for industry-wide standard ML benchmarking and evaluation criteria. MLPerf Inference answers that call. In this paper, we present our benchmarking method for evaluating ML inference systems. Driven by more than 30 organizations as well as more than 200 ML engineers and practitioners, MLPerf prescribes a set of rules and best practices to ensure comparability across systems with wildly differing architectures. The first call for submissions garnered more than 600 reproducible inference-performance measurements from 14 organizations, representing over 30 systems that showcase a wide range of capabilities. The submissions attest to the benchmark's flexibility and adaptability.
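
    As a flavor of what one such measurement looks like, here is a minimal single-stream-style harness (a generic sketch, not the MLPerf LoadGen library; the placeholder model, query count, and percentile target are mine): issue queries one at a time and report a tail-latency percentile.

    ```python
    import time

    def run_single_stream(infer, queries, warmup=10):
        """Issue queries back-to-back, one at a time, timing each;
        returns per-query latencies in milliseconds, sorted."""
        it = iter(queries)
        for _ in range(warmup):           # untimed warm-up queries
            infer(next(it))
        latencies = []
        for q in it:
            t0 = time.perf_counter()
            infer(q)
            latencies.append((time.perf_counter() - t0) * 1e3)
        return sorted(latencies)

    def fake_model(x):                    # placeholder for a real model
        time.sleep(0.002)                 # pretend inference takes ~2 ms

    lat = run_single_stream(fake_model, [None] * 110)
    p90 = lat[int(0.9 * len(lat)) - 1]    # crude tail-latency estimate
    print(f"p90 latency: {p90:.2f} ms over {len(lat)} timed queries")
    ```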

    QoS Routing with worst-case delay constraints: models, algorithms and performance analysis

    In a network where weighted fair-queueing schedulers are used at each link, a flow is guaranteed an end-to-end worst-case delay that depends on the rate reserved for it at each link it traverses. It is therefore possible to compute resource-constrained paths that meet target delay constraints while optimizing key performance metrics (e.g., minimizing the overall reserved rate, or maximizing the remaining capacity at bottleneck links). Despite the large body of literature on weighted fair-queueing schedulers since the mid-1990s, this has so far been done only for a single type of scheduler, probably because the complexity of solving the problem in general appeared forbidding. In this paper, we formulate and solve the optimal path computation and resource allocation problem for a broad category of weighted fair-queueing schedulers, from those emulating a Generalized Processor Sharing fluid server to variants of Deficit Round Robin. We classify schedulers according to their latency expressions and show that a significant divide exists between those where routing a new flow affects the performance of existing flows and those for which it does not. For the former, explicit admission-control constraints are required to ensure that existing flows still meet their deadlines afterwards. Despite this major difference, and the differences among categories of schedulers, the problem can always be formulated as a Mixed-Integer Second-Order Cone Problem (MI-SOCP) and solved to optimality in split-second times, even in fairly large networks.
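
    For context, the classic latency-rate form of these delay bounds can be sketched as follows (a network-calculus-style bound under my assumed token-bucket and link parameters, not the paper's MI-SOCP formulation): a WFQ/PGPS hop behaves as a latency-rate server with latency L_i/R + L_max/C, and the end-to-end worst-case delay of a (sigma, rho)-shaped flow is at most sigma/R plus the sum of per-hop latencies.

    ```python
    def wfq_latency(flow_rate, flow_max_pkt, link_rate, link_max_pkt):
        """A WFQ/PGPS hop modeled as a latency-rate server:
        theta = L_i / R + L_max / C (sizes in bits, rates in bit/s)."""
        return flow_max_pkt / flow_rate + link_max_pkt / link_rate

    def e2e_delay_bound(sigma, rate, thetas):
        """Worst-case delay of a (sigma, rho) token-bucket flow crossing
        a chain of latency-rate servers, each reserving `rate` for it
        (rho <= rate):  D <= sigma / rate + sum(theta_i)."""
        return sigma / rate + sum(thetas)

    MTU = 1500 * 8                   # 1500-byte packets, in bits
    R = 2e6                          # 2 Mbit/s reserved at every hop
    links = [1e9, 1e9, 100e6]        # link capacities along the path
    thetas = [wfq_latency(R, MTU, C, MTU) for C in links]
    print(f"delay bound: {e2e_delay_bound(10 * MTU, R, thetas) * 1e3:.2f} ms")
    ```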