Search CORE

116 research outputs found

Cache-coherent distributed shared memory: perspectives on its development and future challenges

Author: A. Gupta
J. Hennessy
M. Heinrich
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date
Field of study

Simulation Of Multi-core Systems And Interconnections And Evaluation Of Fat-Mesh Networks

Author: Zhang Yu
Publication venue
Publication date: 28/01/2009
Field of study

Simulators are very important in computer architecture research as they enable the exploration of new architectures to obtain detailed performance evaluation without building costly physical hardware. Simulation is even more critical to study future many-core architectures as it provides the opportunity to assess currently non-existing computer systems. In this thesis, a multiprocessor simulator is presented based on a cycle accurate architecture simulator called SESC. The shared L2 cache system is extended into a distributed shared cache (DSC) with a directory-based cache coherency protocol. A mesh network module is extended and integrated into SESC to replace the bus for scalable inter-processor communication. While these efforts complete an extended multiprocessor simulation infrastructure, two interconnection enhancements are proposed and evaluated. A novel non-uniform fat-mesh network structure similar to the idea of fat-tree is proposed. This non-uniform mesh network takes advantage of the average traffic pattern, typically all-to-all in DSC, to dedicate additional links for connections with heavy traffic (e.g., near the center) and fewer links for lighter traffic (e.g., near the periphery). Two fat-mesh schemes are implemented based on different routing algorithms. Analytical fat-mesh models are constructed by presenting the expressions for the traffic requirements of personalized all-to-all traffic. Performance improvements over the uniform mesh are demonstrated in the results from the simulator. A hybrid network consisting of one packet switching plane and multiple circuit switching planes is constructed as the second enhancement. The circuit switching planes provide fast paths between neighbors with heavy communication traffic. A compiler technique that abstracts the symbolic expressions of benchmarks' communication patterns can be used to help facilitate the circuit establishment

D-Scholarship@Pitt

Parallel Programming Using Shared Objects and Broadcasting

Author: Bal H.E.
Kaashoek M.F.
Tanenbaum A.S.
Publication venue
Publication date: 01/01/1992
Field of study

The two major design approaches taken to build distributed and parallel computer systems, multiprocessing and multicomputing, are discussed. A model that combines the best properties of both multiprocessor and multicomputer systems, easy-to-build hardware, and a conceptually simple programming model is presented. Using this model, a programmer defines and invokes operations on shared objects, the runtime system handles reads and writes on these objects, and the reliable broadcast layer implements indivisible updates to objects using the sequencing protocol. The resulting system is easy to program, easy to build, and has acceptable performance on problems with a moderate grain size in which reads are much more common than writes. Orca, a procedural language whose sequential constructs are roughly similar to languages like C or Modula 2 but which also supports parallel processes and shared objects and has been used to develop applications for the prototype system, is described

VU Research Portal

Simulation models of shared-memory multiprocessor systems

Author: Coe Paul.
Publication venue: The University of Edinburgh
Publication date: 01/01/2000
Field of study

Edinburgh Research Archive

Distributed Shared Memory: A Survey of Issues and Algorithms

Author: Lo Virginia M.
Nitzberg Bill
Publication venue: University of Oregon
Publication date: 07/01/1991
Field of study

26 pagesA distributed shared memory (DSM) is an implementation of the shared memory abstraction on a multicomputer architecture which has no physically shared memory. Shared memory is important (as a programming model) not only because of the vast number of existing applications which use it, but also because it is a more appropriate paradigm for certain algorithms. The DSM concept was demonstrated to be viable by Li, in IVY. Recently, there has been a surge of new projects which implement DSM in a variety of software and hardware environments. This paper gives an integrated overview of distributed shared memory. We discuss theoretical lower bounds on the performance of DSM systems, design choices such as structure and granularity, access, coherence semantics, scalability, and heterogeneity, and open problems in DSM. In addition, we describe algorithms used to implement and improve efficiency: reducing thrashing, eliminating false sharing, matching the coherence protocol to the type of sharing, and relaxing the semantics of the memory coherence provided. A spectrum of current DSM systems are used as illustrative examples

University of Oregon Scholars' Bank

Architectural support for enhancing security in clusters

Author: Lee Man Hee
Publication venue
Publication date: 15/05/2009
Field of study

Cluster computing has emerged as a common approach for providing more comput- ing and data resources in industry as well as in academia. However, since cluster computer developers have paid more attention to performance and cost e±ciency than to security, numerous security loopholes in cluster servers come to the forefront. Clusters usually rely on ¯rewalls for their security, but the ¯rewalls cannot prevent all security attacks; therefore, cluster systems should be designed to be robust to security attacks intrinsically. In this research, we propose architectural supports for enhancing security of clus- ter systems with marginal performance overhead. This research proceeds in a bottom- up fashion starting from enforcing each cluster component's security to building an integrated secure cluster. First, we propose secure cluster interconnects providing con- ¯dentiality, authentication, and availability. Second, a security accelerating network interface card architecture is proposed to enable low performance overhead encryption and authentication. Third, to enhance security in an individual cluster node, we pro- pose a secure design for shared-memory multiprocessors (SMP) architecture, which is deployed in many clusters. The secure SMP architecture will provide con¯dential communication between processors. This will remove the vulnerability of eavesdrop- ping attacks in a cluster node. Finally, to put all proposed schemes together, we propose a security/performance trade-o® model which can precisely predict performance of an integrated secure cluster

Texas A&M Repository

A low-cost high-speed twin-prefetching DSP-based shared-memory system for real-time image processing applications

Author: Christou Charalambos Stephanou
Publication venue: Digital Commons @ NJIT
Publication date: 31/05/1998
Field of study

This dissertation introduces, investigates, and evaluates a low-cost high-speed twin-prefetching DSP-based bus-interconnected shared-memory system for real-time image processing applications. The proposed architecture can effectively support 32 DSPs in contrast to a maximum of 4 DSPs supported by existing DSP-based bus- interconnected systems. This significant enhancement is achieved by introducing two small programmable fast memories (Twins) between the processor and the shared bus interconnect. While one memory is transferring data from/to the shared memory, the other is supplying the core processor with data. The elimination of the traditional direct linkage of the shared bus and processor data bus makes feasible the utilization of a wider shared bus i.e., shared bus width becomes independent of the data bus width of the processors. The fast prefetching memories and the wider shared bus provide additional bus bandwidth into the system, which eliminates large memory latencies; such memory latencies constitute the major drawback for the performance of shared-memory multiprocessors. Furthermore, in contrast to existing DSP-based uniprocessor or multiprocessor systems the proposed architecture does not require all data to be placed on on-chip or off-chip expensive fast memory in order to reach or maintain peak performance. Further, it can maintain peak performance regardless of whether the processed image is small or large. The performance of the proposed architecture has been extensively investigated executing computationally intensive applications such as real-time high-resolution image processing. The effect of a wide variety of hardware design parameters on performance has been examined. More specifically tables and graphs comprehensively analyze the performance of 1, 2, 4, 8, 16, 32 and 64 DSP-based systems, for a wide variety of shared data interconnect widths such as 32, 64, 128, 256 and 512. In addition, the effect of the wide variance of temporal and spatial locality (present in different applications) on the multiprocessor\u27s execution time is investigated and analyzed. Finally, the prefetching cache-size was varied from a few kilobytes to 4 Mbytes and the corresponding effect on the execution time was investigated. Our performance analysis has clearly showed that the execution time converges to a shallow minimum i.e., it is not sensitive to the size of the prefetching cache. The significance of this observation is that near optimum performance can be achieved with a small (16 to 300 Kbytes) amount of prefetching cache

Digital Commons @ New Jersey Institute of Technology (NJIT)

Interconnection networks: dimensions in design

Author
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1996
Field of study

Crossref

Heracles: A Tool for Fast RTL-Based Design Space Exploration of Multicore Processors

Author: Devadas Srinivas
Kinsy Michel A.
Pellauer Michael
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2013
Field of study

This paper presents Heracles, an open-source, functional, parameterized, synthesizable multicore system toolkit. Such a multi/many-core design platform is a powerful and versatile research and teaching tool for architectural exploration and hardware-software co-design. The Heracles toolkit comprises the soft hardware (HDL) modules, application compiler, and graphical user interface. It is designed with a high degree of modularity to support fast exploration of future multicore processors of di erent topologies, routing schemes, processing elements (cores), and memory system organizations. It is a component-based framework with parameterized interfaces and strong emphasis on module reusability. The compiler toolchain is used to map C or C++ based applications onto the processing units. The GUI allows the user to quickly con gure and launch a system instance for easy factorial development and evaluation. Hardware modules are implemented in synthesizable Verilog and are FPGA platform independent. The Heracles tool is freely available under the open-source MIT license at: http://projects.csail.mit.edu/heracle

DSpace@MIT

Crossref