
    Topology-aware Quality-of-Service Support in Highly Integrated Chip Multiprocessors

    Current design complexity trends, poor wire scalability, and power limitations argue in favor of highly modular on-chip systems. Today’s state-of-the-art CMPs already feature up to a hundred discrete cores. With increasing levels of integration, CMPs with hundreds of cores, cache tiles, and specialized accelerators are anticipated in the near future. Meanwhile, server consolidation and cloud computing paradigms have emerged as profit vehicles for exploiting the abundant resources of chip multiprocessors. As multiple, potentially malevolent, users begin to share the virtualized resources of a single chip, CMP-level quality-of-service (QOS) support becomes necessary to provide performance isolation, service guarantees, and security. This work takes a topology-aware approach to on-chip QOS. We propose to segregate shared resources, such as memory controllers and accelerators, into dedicated islands (shared regions) of the chip with full hardware QOS support. We rely on a richly connected Multidrop Express Channel (MECS) topology to connect individual nodes to shared regions, forgoing QOS support in much of the substrate and eliminating its respective overheads. We evaluate several topologies for the QOS-enabled shared regions, focusing on the interaction between network-on-chip (NOC) and QOS metrics. We explore a new topology called Destination Partitioned Subnets (DPS), which uses a light-weight dedicated network for each destination node. On synthetic workloads, DPS nearly matches or outperforms other topologies with comparable bisection bandwidth in terms of performance, area overhead, energy efficiency, fairness, and preemption resilience.
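
    The core idea behind DPS can be pictured with a small model. The sketch below is a hypothetical Python illustration, not the paper's hardware design: every destination node in a QOS-enabled shared region owns its own light-weight subnet, so traffic bound for different destinations never contends for the same links. The node names and the per-source round-robin arbiter are placeholder assumptions standing in for the region's hardware QOS arbitration.

        from collections import deque
        from itertools import cycle

        class DPSRegion:
            """Toy model of Destination Partitioned Subnets: one dedicated,
            light-weight subnet (here, a set of FIFOs) per destination node."""

            def __init__(self, sources, destinations):
                self.sources = list(sources)
                # Per destination: one FIFO per source plus a round-robin pointer.
                self.queues = {d: {s: deque() for s in self.sources} for d in destinations}
                self.rr = {d: cycle(self.sources) for d in destinations}

            def inject(self, src, dst, packet):
                # A packet only ever enters the subnet owned by its destination,
                # so flows aimed at different destinations cannot preempt each other.
                self.queues[dst][src].append(packet)

            def deliver_one(self, dst):
                # Round-robin over sources: a simple stand-in for hardware QOS
                # arbitration inside the shared region.
                for _ in range(len(self.sources)):
                    src = next(self.rr[dst])
                    if self.queues[dst][src]:
                        return src, self.queues[dst][src].popleft()
                return None

        region = DPSRegion(sources=["core0", "core1"], destinations=["mem_ctrl0", "accel0"])
        region.inject("core1", "mem_ctrl0", "read 0x40")
        print(region.deliver_one("mem_ctrl0"))  # ('core1', 'read 0x40')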

    Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks

    It is expected that future on-chip networks for many-core processors will impose huge overheads in terms of energy, delay, complexity, verification effort, and area. There is a common belief that the bandwidth necessary for future applications can only be provided by employing packet-switched networks with complex routers and a scalable directory-based coherence protocol. We posit that such a scheme is likely overkill in a well-designed system, in addition to being expensive in terms of power because of its large number of power-hungry routers. We show that bus-based networks with snooping protocols can significantly lower energy consumption and simplify network/protocol design and verification, with no loss in performance. We achieve these characteristics by dividing the chip into multiple segments, each having its own broadcast bus, with these buses further connected by a central bus. This helps eliminate expensive routers, but suffers from the energy overhead of long wires. We propose the use of multiple Bloom filters to effectively track data presence in the cache and restrict bus broadcasts to a subset of segments, significantly reducing energy consumption. We further show that the use of OS page coloring helps maximize locality and improves the effectiveness of the Bloom filters. We also employ low-swing wiring to further reduce the energy overheads of the links. Performance can also be improved at relatively low cost by utilizing more of the abundant metal budget on-chip and employing multiple address-interleaved buses rather than multiple routers. Thus, with the combination of all the above innovations, we extend the scalability of buses and believe that buses can be a viable and attractive option for future on-chip networks. We show energy reductions of up to 31X on average compared to many state-of-the-art packet-switched networks.
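
    As a rough illustration of the broadcast-filtering idea, the sketch below models one Bloom filter per bus segment: each filter records the block addresses cached within its segment, and a request is broadcast only on the buses whose filter reports a possible hit. The filter size, number of hash functions, and hashing scheme are assumptions made for the example, not the parameters used in the paper.

        import hashlib

        class SegmentBloomFilter:
            """Tracks which block addresses may be cached in one chip segment."""

            def __init__(self, bits=4096, hashes=2):
                self.bits = bits
                self.hashes = hashes
                self.array = [False] * bits

            def _positions(self, block_addr):
                for i in range(self.hashes):
                    digest = hashlib.blake2b(f"{block_addr}:{i}".encode(), digest_size=4)
                    yield int.from_bytes(digest.digest(), "little") % self.bits

            def insert(self, block_addr):
                # Called when a line is allocated in a cache within this segment.
                for p in self._positions(block_addr):
                    self.array[p] = True

            def may_contain(self, block_addr):
                # No false negatives: if this returns False, the segment definitely
                # does not cache the block and its bus need not see the broadcast.
                return all(self.array[p] for p in self._positions(block_addr))

        def segments_to_snoop(filters, block_addr):
            """Restrict a snoop broadcast to segments whose filter may hold the block."""
            return [seg for seg, f in filters.items() if f.may_contain(block_addr)]

        filters = {seg: SegmentBloomFilter() for seg in range(4)}
        filters[2].insert(0x1000)
        print(segments_to_snoop(filters, 0x1000))  # [2], plus any false positives
        print(segments_to_snoop(filters, 0x2000))  # very likely []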

    EECache


    Using Silent Writes in Low-Power Traffic-Aware ECC


    XoMA: Exclusive on-chip memory architecture for energy-efficient deep learning acceleration

    State-of-the-art deep neural networks (DNNs) require hundreds of millions of multiply-accumulate (MAC) computations to perform inference, e.g., in image-recognition tasks. To improve performance and energy efficiency, deep learning accelerators have been proposed, realized both on FPGAs and as custom ASICs. Generally, such accelerators comprise many parallel processing elements capable of executing large numbers of concurrent MAC operations. From the energy perspective, however, most of the consumption arises from memory accesses, both to off-chip external memory and to on-chip buffers. In this paper, we propose an on-chip DNN co-processor architecture where minimizing memory accesses is the primary design objective. To the maximum possible extent, off-chip memory accesses are eliminated, providing the lowest possible energy consumption for inference. Compared to a state-of-the-art ASIC, our architecture requires 36% fewer external memory accesses and consumes 53% less energy for low-latency image classification.
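
    A back-of-the-envelope model makes the motivation concrete. The sketch below uses made-up layer sizes and a deliberately simplified cost model, not the paper's architecture or results: it counts external-memory traffic for a small network when intermediate activations either spill off chip or stay resident in on-chip buffers. In the on-chip case, only the weights and the first and last feature maps ever cross the chip boundary.

        def external_bytes(layers, keep_activations_on_chip, bytes_per_value=1):
            """Toy count of off-chip traffic for a chain of layers.

            Each layer is (input_activations, weights, output_activations),
            given as value counts."""
            traffic = 0
            last = len(layers) - 1
            for i, (act_in, weights, act_out) in enumerate(layers):
                traffic += weights * bytes_per_value         # weights fetched from DRAM
                if i == 0 or not keep_activations_on_chip:
                    traffic += act_in * bytes_per_value      # read input activations off chip
                if i == last or not keep_activations_on_chip:
                    traffic += act_out * bytes_per_value     # write output activations off chip
            return traffic

        # Illustrative (input, weights, output) value counts for three layers.
        layers = [(150_000, 25_000, 190_000), (190_000, 300_000, 140_000), (140_000, 660_000, 40_000)]
        print(external_bytes(layers, keep_activations_on_chip=False))  # every feature map spills off chip
        print(external_bytes(layers, keep_activations_on_chip=True))   # only weights + first/last maps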

    HyComp: A Hybrid Cache Compression Method for Selection of Data-Type-Specific Compression Methods

    Proposed cache compression schemes make design-time assumptions about value locality in order to reduce decompression latency. For example, some schemes assume that common values are spatially close, whereas other schemes assume that null blocks are common. Most schemes, however, assume that value locality is best exploited by fixed-size data types (e.g., 32-bit integers). This assumption falls short when other data types, such as floating-point numbers, are common. This paper makes two contributions. First, HyComp, a hybrid cache compression scheme, selects the best-performing compression method based on heuristics that predict data types; the data types considered are pointers, integers, floating-point numbers, and the special (and trivial) case of null blocks. Second, the paper contributes a compression method that exploits value locality in data types with predefined semantic value fields, e.g., the exponent and mantissa fields of floating-point numbers. We show that HyComp, augmented with the proposed floating-point-number compression method, offers superior performance compared with prior art.
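
    The selection step can be pictured as a small dispatcher. In the sketch below, the block classifiers and thresholds are illustrative guesses rather than the published heuristics, and the per-type compressors are stubs; the point is only the structure: inspect a cache block, predict its dominant data type, then hand it to the compression method expected to perform best for that type.

        import struct

        BLOCK_SIZE = 64  # bytes per cache block (assumed)

        def looks_null(block):
            return all(b == 0 for b in block)

        def looks_like_pointers(block):
            # Pointers in the same block tend to share their high-order bytes;
            # with little-endian 64-bit words those bytes sit at the end of each word.
            words = [block[i:i + 8] for i in range(0, BLOCK_SIZE, 8)]
            return len({w[4:] for w in words}) <= 2

        def looks_like_floats(block):
            # Doubles with plausible exponents decode to "ordinary" magnitudes.
            values = struct.unpack(f"<{BLOCK_SIZE // 8}d", block)
            return all(v == 0.0 or 1e-30 < abs(v) < 1e30 for v in values)

        def compress(block):
            # Dispatch order matters: the cheapest, most certain checks come first.
            if looks_null(block):
                return ("null", b"")          # trivial case: store only a flag
            if looks_like_pointers(block):
                return ("pointer", block)     # stub for a pointer-oriented compressor
            if looks_like_floats(block):
                return ("float", block)       # stub for the semantic-field FP compressor
            return ("integer", block)         # default fixed-size-integer compressor

        print(compress(bytes(BLOCK_SIZE)))    # ('null', b'')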