Topology-aware Quality-of-Service Support in Highly Integrated Chip Multiprocessors
Current design complexity trends, poor wire scalability, and power limitations argue in favor of highly modular on-chip systems. Today’s state-of-the-art CMPs already feature up to a hundred discrete cores. With increasing levels of integration, CMPs with hundreds of cores, cache tiles, and specialized accelerators are anticipated in the near future. Meanwhile, server consolidation and cloud computing paradigms have emerged as profit vehicles for exploiting abundant resources of chip multiprocessors. As multiple, potentially malevolent, users begin to share virtualized resources of a single chip, CMP-level quality-of-service (QOS) support becomes necessary to provide performance isolation, service guarantees, and security. This work takes a topology-aware approach to on-chip QOS. We propose to segregate shared resources, such as memory controllers and accelerators, into dedicated islands (shared regions) of the chip with full hardware QOS support. We rely on a richly connected Multidrop Express Channel (MECS) topology to connect individual nodes to shared regions, foregoing QOS support in much of the substrate and eliminating its respective overheads. We evaluate several topologies for the QOS-enabled shared regions, focusing on the interaction between network-on-chip (NOC) and QOS metrics. We explore a new topology called Destination Partitioned Subnets (DPS), which uses a light-weight dedicated network for each destination node. On synthetic workloads, DPS nearly matches or outperforms other topologies with comparable bisection bandwidth in terms of performance, area overhead, energy efficiency, fairness, and preemption resilience.
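A minimal sketch of the Destination Partitioned Subnets idea described above, assuming hypothetical node names and a simple FIFO model (not the paper's implementation): each destination in the QOS-enabled shared region owns a light-weight subnet, so traffic bound for destination d is injected into subnet d and never contends with traffic headed to other destinations.

# Hypothetical sketch of Destination Partitioned Subnets (DPS):
# one light-weight subnet per destination node in the shared region.
# Node names and the queue model are illustrative only.

from collections import deque

class DPSRegion:
    def __init__(self, destinations):
        # one dedicated FIFO "subnet" per destination node
        self.subnets = {d: deque() for d in destinations}

    def inject(self, src, dst, flit):
        # routing decision is trivial: pick the subnet owned by dst
        self.subnets[dst].append((src, flit))

    def drain(self, dst):
        # destination dst services only its own subnet, so traffic to
        # other destinations can never block or preempt it
        while self.subnets[dst]:
            yield self.subnets[dst].popleft()

region = DPSRegion(destinations=["MC0", "MC1", "ACC0"])
region.inject("core7", "MC0", "read req A")
region.inject("core3", "MC1", "read req B")
print(list(region.drain("MC0")))   # only MC0's traffic appears here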
Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks
It is expected that future on-chip networks for many-core processors will impose huge overheads in terms of energy, delay, complexity, verification effort, and area. There is a common belief that the bandwidth necessary for future applications can only be provided by employing packet-switched networks with complex routers and a scalable directory-based coherence protocol. We posit that such a scheme is likely overkill in a well-designed system, in addition to being expensive in terms of power because of the large number of power-hungry routers. We show that bus-based networks with snooping protocols can significantly lower energy consumption and simplify network/protocol design and verification, with no loss in performance. We achieve these characteristics by dividing the chip into multiple segments, each having its own broadcast bus, with these buses further connected by a central bus. This helps eliminate expensive routers, but suffers from the energy overhead of long wires. We propose the use of multiple Bloom filters to effectively track data presence in the cache and restrict bus broadcasts to a subset of segments, significantly reducing energy consumption. We further show that the use of OS page coloring helps maximize locality and improves the effectiveness of the Bloom filters. We also employ low-swing wiring to further reduce the energy overheads of the links. Performance can also be improved at relatively low cost by utilizing more of the abundant metal budget on-chip and employing multiple address-interleaved buses rather than multiple routers. Thus, with the combination of all the above innovations, we extend the scalability of buses and believe that buses can be a viable and attractive option for future on-chip networks. We show energy reductions of up to 31X on average compared to many state-of-the-art packet-switched networks.
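A minimal sketch, assuming one Bloom filter over cache-block addresses per bus segment (filter size, hash construction, and segment count are illustrative assumptions, not the paper's design), of how snoop broadcasts could be restricted to the subset of segments that may hold the block:

# Illustrative per-segment Bloom filters used to prune snoop broadcasts.
import hashlib

class SegmentBloom:
    def __init__(self, num_bits=4096, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, block_addr):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{block_addr}".encode()).digest()
            yield int.from_bytes(h[:4], "little") % self.num_bits

    def insert(self, block_addr):       # called when a block is cached in this segment
        for p in self._positions(block_addr):
            self.bits |= 1 << p

    def may_contain(self, block_addr):  # no false negatives, occasional false positives
        return all((self.bits >> p) & 1 for p in self._positions(block_addr))

segments = [SegmentBloom() for _ in range(4)]
segments[2].insert(0x804000)            # block cached only in segment 2

# Broadcast only to segments whose filter might hold the block.
targets = [i for i, s in enumerate(segments) if s.may_contain(0x804000)]
print(targets)                          # typically [2]; false positives add extra segments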
XoMA: Exclusive on-chip memory architecture for energy-efficient deep learning acceleration
State-of-the-art deep neural networks (DNNs) require hundreds of millions of multiply-accumulate (MAC) computations to perform inference, e.g. in image-recognition tasks. To improve performance and energy efficiency, deep learning accelerators have been proposed, realized both on FPGAs and as custom ASICs. Generally, such accelerators comprise many parallel processing elements, capable of executing large numbers of concurrent MAC operations. From the energy perspective, however, most of the consumption arises from memory accesses, both to off-chip external memory and to on-chip buffers. In this paper, we propose an on-chip DNN co-processor architecture in which minimizing memory accesses is the primary design objective. To the maximum possible extent, off-chip memory accesses are eliminated, providing the lowest possible energy consumption for inference. Compared to a state-of-the-art ASIC, our architecture requires 36% fewer external memory accesses and 53% less energy consumption for low-latency image classification.
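To make the "hundreds of millions of MACs" figure concrete, a back-of-the-envelope count for a single convolutional layer is sketched below; the layer shape is an assumption chosen for illustration and is not taken from the paper.

# Rough MAC count for one convolutional layer (illustrative shape):
# output height x output width x output channels x kernel area x input channels.
H_out, W_out = 56, 56          # output feature-map size (assumed)
C_in, C_out = 64, 128          # input / output channels (assumed)
K = 3                          # 3x3 kernel

macs = H_out * W_out * C_out * K * K * C_in
print(f"{macs:,} MACs")        # 231,211,008 -> roughly 0.23 billion MACs for one layer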
HyComp: A Hybrid Cache Compression Method for Selection of Data-Type-Specific Compression Methods
Proposed cache compression schemes make design-time assumptions about value locality to reduce decompression latency. For example, some schemes assume that common values are spatially close, whereas other schemes assume that null blocks are common. Most schemes, however, assume that value locality is best exploited by fixed-size data types (e.g., 32-bit integers). This assumption falls short when other data types, such as floating-point numbers, are common. This paper makes two contributions. First, HyComp - a hybrid cache compression scheme - selects the best-performing compression scheme based on heuristics that predict data types. The data types considered are pointers, integers, floating-point numbers, and the special (and trivial) case of null blocks. Second, this paper contributes a compression method that exploits value locality in data types with predefined semantic value fields, e.g., the exponent and mantissa fields of floating-point numbers. We show that HyComp, augmented with the proposed floating-point-number compression method, offers superior performance in comparison with prior art.
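A minimal sketch of the semantic-field idea, using IEEE 754 double-precision field widths; the block contents and the per-field packing are illustrative assumptions rather than HyComp's actual compression format. Splitting each value into sign, exponent, and mantissa streams exposes the locality that lives in each field separately (here, the exponents of nearby values are identical).

# Illustrative decomposition of a cache block of doubles into semantic fields.
import struct

def split_fields(values):
    signs, exponents, mantissas = [], [], []
    for v in values:
        bits = struct.unpack("<Q", struct.pack("<d", v))[0]
        signs.append(bits >> 63)                  # 1-bit sign
        exponents.append((bits >> 52) & 0x7FF)    # 11-bit exponent
        mantissas.append(bits & ((1 << 52) - 1))  # 52-bit mantissa
    return signs, exponents, mantissas

block = [3.14, 3.15, 3.16, 3.17]                  # one hypothetical 32-byte block
s, e, m = split_fields(block)
print(e)                                          # identical exponents: [1024, 1024, 1024, 1024]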