57 research outputs found
RETROSPECTIVE: Corona: System Implications of Emerging Nanophotonic Technology
The 2008 Corona effort was inspired by a pressing need for more of
everything, as demanded by the salient problems of the day. Dennard scaling was
no longer in effect. A lot of computer architecture research was in the
doldrums. Papers often showed incremental subsystem performance improvements,
but at incommensurate cost and complexity. The many-core era was moving
rapidly, and the approach with many simpler cores was at odds with the better
and more complex subsystem publications of the day. Core counts were doubling
every 18 months, while per-pin bandwidth was expected to double, at best, over
the next decade. Memory bandwidth and capacity had to increase to keep pace
with ever more powerful multi-core processors. With increasing core counts per
die, inter-core communication bandwidth and latency became more important. At
the same time, the area and power of electrical networks-on-chip were
increasingly problematic: To be reliably received, any signal that traverses a
wire spanning a full reticle-sized die would need significant equalization,
re-timing, and multiple clock cycles. This additional time, area, and power was
the crux of the concern, and things looked to get worse in the future.
Silicon nanophotonics was of particular interest and seemed to be improving
rapidly. This led us to consider taking advantage of 3D packaging, where one
die in the 3D stack would be a photonic network layer. Our focus was on a
system that could be built about a decade out. Thus, we tried to predict how
the technologies and the system performance requirements would converge in
about 2018. Corona was the result of this exercise; now, 15 years later, it's
interesting to look back at the effort.
Comment: 2 pages. Proceedings of ISCA-50: 50 years of the International
Symposia on Computer Architecture (selected papers) June 17-21 Orlando,
Florid
TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings
In response to innovations in machine learning (ML) models, production
workloads changed radically and rapidly. TPU v4 is the fifth Google domain
specific architecture (DSA) and its third supercomputer for such ML models.
Optical circuit switches (OCSes) dynamically reconfigure its interconnect
topology to improve scale, availability, utilization, modularity, deployment,
security, power, and performance; users can pick a twisted 3D torus topology if
desired. Much cheaper, lower power, and faster than Infiniband, OCSes and
underlying optical components are <5% of system cost and <3% of system power.
Each TPU v4 includes SparseCores, dataflow processors that accelerate models
that rely on embeddings by 5x-7x yet use only 5% of die area and power.
Deployed since 2020, TPU v4 outperforms TPU v3 by 2.1x and improves
performance/Watt by 2.7x. The TPU v4 supercomputer is 4x larger at 4096 chips
and thus ~10x faster overall, which along with OCS flexibility helps large
language models. For similar sized systems, it is ~4.3x-4.5x faster than the
Graphcore IPU Bow and is 1.2x-1.7x faster and uses 1.3x-1.9x less power than
the Nvidia A100. TPU v4s inside the energy-optimized warehouse scale computers
of Google Cloud use ~3x less energy and produce ~20x less CO2e than
contemporary DSAs in a typical on-premise data center.
Comment: 15 pages; 16 figures; to be published at ISCA 2023 (the International
Symposium on Computer Architecture
In-Datacenter Performance Analysis of a Tensor Processing Unit
Many architects believe that major improvements in cost-energy-performance
must now come from domain-specific hardware. This paper evaluates a custom
ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since
2015 that accelerates the inference phase of neural networks (NN). The heart of
the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak
throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed
on-chip memory. The TPU's deterministic execution model is a better match to
the 99th-percentile response-time requirement of our NN applications than are
the time-varying optimizations of CPUs and GPUs (caches, out-of-order
execution, multithreading, multiprocessing, prefetching, ...) that help average
throughput more than guaranteed latency. The lack of such features helps
explain why, despite having myriad MACs and a big memory, the TPU is relatively
small and low power. We compare the TPU to a server-class Intel Haswell CPU and
an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.
Our workload, written in the high-level TensorFlow framework, uses production
NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters'
NN inference demand. Despite low utilization for some applications, the TPU is
on average about 15X - 30X faster than its contemporary GPU or CPU, with
TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the
TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and
200X the CPU.
Comment: 17 pages, 11 figures, 8 tables. To appear at the 44th International
Symposium on Computer Architecture (ISCA), Toronto, Canada, June 24-28, 201
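As a rough sanity check on the 92 TOPS figure quoted in the abstract above, the peak rate can be reproduced from the MAC count. This is a sketch under stated assumptions: each MAC is counted as two operations (a multiply and an add), and the 700 MHz clock rate is not given in the abstract and is supplied here as an assumption. The names `MACS`, `OPS_PER_MAC`, `CLOCK_HZ`, and `peak_tops` are illustrative only.

```python
# Back-of-the-envelope peak throughput for the matrix multiply unit.
MACS = 65_536        # 8-bit MAC units, from the abstract
OPS_PER_MAC = 2      # assumption: one multiply plus one add per MAC per cycle
CLOCK_HZ = 700e6     # assumption: clock rate is not stated in the abstract


def peak_tops(macs=MACS, ops_per_mac=OPS_PER_MAC, clock_hz=CLOCK_HZ):
    """Return peak throughput in tera-operations per second."""
    return macs * ops_per_mac * clock_hz / 1e12


if __name__ == "__main__":
    print(f"{peak_tops():.1f} TOPS")  # prints "91.8 TOPS", i.e. about 92
```

Under these assumptions the result rounds to the 92 TOPS stated in the abstract; a different clock rate or operation-counting convention would change the number proportionally.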
Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers
Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. St..
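The victim-caching idea described above can be illustrated with a small simulation. This is a minimal sketch, not the paper's simulator: the class name `DirectMappedWithVictim`, the use of line addresses instead of byte addresses, and the LRU replacement policy inside the victim cache are all assumptions made for illustration.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Direct-mapped cache backed by a small fully-associative victim cache."""

    def __init__(self, num_sets, victim_entries):
        self.num_sets = num_sets
        self.lines = [None] * num_sets      # one resident line address per set
        self.victims = OrderedDict()        # assumed LRU order, oldest first
        self.victim_entries = victim_entries

    def access(self, line_addr):
        """Return 'hit', 'victim_hit', or 'miss' for a line address."""
        index = line_addr % self.num_sets
        resident = self.lines[index]
        if resident == line_addr:
            return "hit"
        if line_addr in self.victims:
            # Requested line moves back into the main cache (short penalty).
            del self.victims[line_addr]
            outcome = "victim_hit"
        else:
            outcome = "miss"                # full refill from the next level
        self.lines[index] = line_addr
        # The displaced line (the victim) is kept in the small cache.
        if resident is not None and self.victim_entries > 0:
            self.victims[resident] = True
            if len(self.victims) > self.victim_entries:
                self.victims.popitem(last=False)  # evict the oldest victim
        return outcome


if __name__ == "__main__":
    # Lines 0 and 4 conflict in a 4-set direct-mapped cache.
    with_victim = DirectMappedWithVictim(num_sets=4, victim_entries=1)
    print([with_victim.access(a) for a in (0, 4, 0, 4)])
    without = DirectMappedWithVictim(num_sets=4, victim_entries=0)
    print([without.access(a) for a in (0, 4, 0, 4)])
```

In this toy trace, two lines that alternate in the same set miss every time without the victim cache, while a single victim entry turns the repeated conflicts into short-penalty victim hits.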
Method and apparatus for compositing colors of images with memory constraints for storing pixel data
A method and an apparatus determine a color for pixels in a graphics system in which images are defined by pixels. Multiple fragments of an image may be visible in any given pixel. Each visible fragment has a fragment value that includes the color of that fragment. For such given pixel, up to a predetermined number of the fragment values are stored. When a new fragment is visible in the given pixel, one of the fragment values is discarded to determine which fragment values are stored and subsequently used to generate the color of the pixel. The discarded fragment value may be the new fragment value or one of the stored fragment values. Various strategies can be used to determine which fragment value is discarded. One such scheme selects the stored fragment value with the greatest Z-depth. Another scheme selects the stored fragment value that produces the smallest color difference from the new fragment value. Still another scheme selects the new fragment value when one of the fragments is in front of the new fragment and the stored fragment value of that fragment produces the smallest color difference from the new fragment value.
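Two of the discard strategies named in the abstract above can be sketched as follows. This is an illustrative sketch only, not the patented method: the `Fragment` type, the function names, the sum-of-absolute-differences color metric, and the convention that a larger `z` means farther from the viewer are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    color: tuple   # (r, g, b); hypothetical representation
    z: float       # assumption: larger z is farther from the viewer


def color_difference(a, b):
    """Assumed metric: sum of absolute per-channel differences."""
    return sum(abs(x - y) for x, y in zip(a.color, b.color))


def discard_greatest_z(stored, new):
    """Index of the stored fragment with the greatest Z-depth."""
    return max(range(len(stored)), key=lambda i: stored[i].z)


def discard_closest_color(stored, new):
    """Index of the stored fragment whose color is closest to the new one."""
    return min(range(len(stored)),
               key=lambda i: color_difference(stored[i], new))


def add_fragment(stored, new, capacity, pick_victim):
    """Return the per-pixel fragment list after the new fragment arrives."""
    if len(stored) < capacity:
        return stored + [new]
    victim = pick_victim(stored, new)
    return [f for i, f in enumerate(stored) if i != victim] + [new]


if __name__ == "__main__":
    red = Fragment((255, 0, 0), 1.0)
    green = Fragment((0, 255, 0), 5.0)
    new = Fragment((250, 0, 0), 2.0)
    print(add_fragment([red, green], new, 2, discard_greatest_z))
    print(add_fragment([red, green], new, 2, discard_closest_color))
```

With a capacity of two, the Z-depth policy drops the farthest stored fragment (green here), while the color-difference policy drops the stored fragment most similar to the newcomer (red here). The third scheme in the abstract, which may discard the new fragment itself, is not covered by this sketch.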
Reducing Compulsory and Capacity Misses
This paper investigates several methods for reducing cache miss rates. Longer cache lines can be advantageously used to decrease cache miss rates when used in conjunction with miss caches. Prefetch techniques can also be used to reduce cache miss rates. However, stream buffers are better than either of these two approaches. They are shown to have lower miss rates than an optimal line size for each program, and have better or near equal performance to traditional prefetch techniques even when single instruction-issue latency is assumed for prefetches. Stream buffers in conjunction with victim caches can often provide a reduction in miss rate equivalent to a doubling or quadrupling of cache size. In some cases the reduction in miss rate provided by stream buffers and victim caches is larger than that of any size cache. Finally, the potential for compiler optimizations to increase the performance of stream buffers is investigated. This tech note is a copy of a paper that was submitted to ..
Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines
Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining