A case for merging the ILP and DLP paradigms
The goal of this paper is to show that instruction-level parallelism (ILP) and data-level parallelism (DLP) can be merged in a single architecture to execute vectorizable code at a performance level that cannot be achieved using either paradigm on its own. We will show that the combination of the two techniques yields very high performance at low cost and low complexity. We will show that this architecture can reach a performance equivalent to a superscalar processor that sustains 10 instructions per cycle. We will see that the machine exploiting both types of parallelism improves upon the ILP-only machine by factors of 1.5-1.8. We also present a study on the scalability of both paradigms and show that, when we increase resources to reach a 16-issue machine, the advantage of the ILP+DLP machine over the ILP-only machine increases up to 2.0-3.45. While the peak achieved IPC for the ILP machine is 4, the ILP+DLP machine exceeds 10 instructions per cycle. Peer Reviewed. Postprint (published version
DeLTA: GPU Performance Model for Deep Learning Applications with In-depth Memory System Traffic Analysis
Training convolutional neural networks (CNNs) requires intense compute
throughput and high memory bandwidth. In particular, convolution layers account
for the majority of the execution time of CNN training, and GPUs are commonly
used to accelerate these layer workloads. GPU design optimization for efficient
CNN training acceleration requires the accurate modeling of how their
performance improves when computing and memory resources are increased. We
present DeLTA, the first analytical model that accurately estimates the traffic
at each GPU memory hierarchy level, while accounting for the complex reuse
patterns of a parallel convolution algorithm. We demonstrate that our model is
both accurate and robust for different CNNs and GPU architectures. We then show
how this model can be used to carefully balance the scaling of different GPU
resources for efficient CNN performance improvement.
MLPerf Inference Benchmark
Machine-learning (ML) hardware and software system demand is burgeoning.
Driven by ML applications, the number of different ML inference systems has
exploded. Over 100 organizations are building ML inference chips, and the
systems that incorporate existing models span at least three orders of
magnitude in power consumption and five orders of magnitude in performance;
they range from embedded devices to data-center solutions. Fueling the hardware
are a dozen or more software frameworks and libraries. The myriad combinations
of ML hardware and ML software make assessing ML-system performance in an
architecture-neutral, representative, and reproducible manner challenging.
There is a clear need for industry-wide standard ML benchmarking and evaluation
criteria. MLPerf Inference answers that call. In this paper, we present our
benchmarking method for evaluating ML inference systems. Driven by more than 30
organizations as well as more than 200 ML engineers and practitioners, MLPerf
prescribes a set of rules and best practices to ensure comparability across
systems with wildly differing architectures. The first call for submissions
garnered more than 600 reproducible inference-performance measurements from 14
organizations, representing over 30 systems that showcase a wide range of
capabilities. The submissions attest to the benchmark's flexibility and
adaptability. Comment: ISCA 202
QoS Routing with worst-case delay constraints: models, algorithms and performance analysis
In a network where weighted fair-queueing schedulers are used at each link, a flow is guaranteed an end-to-end worst-case delay that depends on the rate reserved for it at each link it traverses. Therefore, it is possible to compute resource-constrained paths that meet target delay constraints, and optimize some key performance metrics (e.g., minimize the overall reserved rate, maximize the remaining capacity at bottleneck links, etc.). Despite the large amount of literature that has appeared on weighted fair-queueing schedulers since the mid '90s, this has so far been done only for a single type of scheduler, probably because the complexity of solving the problem in general appeared forbidding. In this paper, we formulate and solve the optimal path computation and resource allocation problem for a broad category of weighted fair-queueing schedulers, from those emulating a Generalized Processor Sharing fluid server to variants of Deficit Round Robin. We classify schedulers according to their latency expressions, and show that a significant divide exists between those where routing a new flow affects the performance of existing flows, and those for which this does not happen. For the former, explicit admission control constraints are required to ensure that existing flows still meet their deadlines afterwards. However, despite this major difference and the differences among categories of schedulers, the problem can always be formulated as a Mixed-Integer Second-Order Cone problem (MI-SOCP), and be solved to optimality in split-second times even in fairly large networks.
Study of high voltage solar array configurations with integrated power control electronics
Solar array electrical configurations for voltage regulatio