Search CORE

252 research outputs found

Doctor of Philosophy

Author: Parker Michael Allen
Publication venue: University of Utah
Publication date: 01/08/2013
Field of study

dissertationHigh-performance supercomputers on the Top500 list are commonly designed around commodity CPUs. Most of the codes executed on these machines are message-passing codes using the message-passing toolkit (MPI). Thus it makes sense to look at these machines from a holistic systems architecture perspective and consider optimizations to commodity processors that make them more efficient in message-passing architectures. Described herein is a new User-Level Notification (ULN) architecture that significantly improves message-passing performance. The architecture integrates a simultaneous multithreaded (SMT) processor with a user-level network interface (NI) that can directly control the execution scheduling of threads on the processor. By allowing the network interface to control the execution of message handling code at the user level, the operating system (OS) related overhead for handling interrupts and user code dispatch related to notifications is eliminated. By using an SMT processor, message handling can be performed in one thread concurrent to user computation in other threads, thus most of the overhead of executing message handlers can be hidden. This dissertation presents measurements showing the OS overheads related to message-passing are significant in modern architectures and describes a new architecture that significantly reduces these overheads. On a communication-intensive real-world application, the ULN architecture provides a 50.9% performance improvement over a more traditional OS-based NIC and a 5.29-31.9% improvement over a best-of-class user-level NIC due to the user-level notifications

The University of Utah: J. Willard Marriott Digital Library

Coherent network interfaces for fine-grain communication

Author: Falsafi Babak
Hill Mark D.
Mukherjee Shubhendu S.
Wood David A.
Publication venue
Publication date: 06/04/2009
Field of study

Using coherence can improve performance by facilitating burst transfers of whole cache blocks and reducing control overheads. This paper describes an attempt to explore network interfaces that use coherence, i.e., coherent network interfaces (CNIs), to improve communication performance. First, it reports on the development and optimization of two mechanisms that CNIs use to communicate with processors. A taxonomy and comparison of four CNIs with a more conventional NI are then presented

Infoscience - École polytechnique fédérale de Lausanne

Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving Parallel Performance on Clusters of Workstations

Author: Dimitrov Rossen Petkov
Publication venue: Scholars Junction
Publication date: 12/05/2001
Field of study

This study considers software techniques for improving performance on clusters of workstations and approaches for designing message-passing middleware that facilitate scalable, parallel processing. Early binding and overlapping of communication and computation are identified as fundamental approaches for improving parallel performance and scalability on clusters. Currently, cluster computers using the Message-Passing Interface for interprocess communication are the predominant choice for building high-performance computing facilities, which makes the findings of this work relevant to a wide audience from the areas of high-performance computing and parallel processing. The performance-enhancing techniques studied in this work are presently underutilized in practice because of the lack of adequate support by existing message-passing libraries and are also rarely considered by parallel algorithm designers. Furthermore, commonly accepted methods for performance analysis and evaluation of parallel systems omit these techniques and focus primarily on more obvious communication characteristics such as latency and bandwidth. This study provides a theoretical framework for describing early binding and overlapping of communication and computation in models for parallel programming. This framework defines four new performance metrics that facilitate new approaches for performance analysis of parallel systems and algorithms. This dissertation provides experimental data that validate the correctness and accuracy of the performance analysis based on the new framework. The theoretical results of this performance analysis can be used by designers of parallel system and application software for assessing the quality of their implementations and for predicting the effective performance benefits of early binding and overlapping. This work presents MPI/Pro, a new MPI implementation that is specifically optimized for clusters of workstations interconnected with high-speed networks. This MPI implementation emphasizes features such as persistent communication, asynchronous processing, low processor overhead, and independent message progress. These features are identified as critical for delivering maximum performance to applications. The experimental section of this dissertation demonstrates the capability of MPI/Pro to facilitate software techniques that result in significant application performance improvements. Specific demonstrations with Virtual Interface Architecture and TCP/IP over Ethernet are offered

Scholars Junction - Mississippi State University Institutional Repository

Mechanisms for efficient, protected messaging

Author: Lee Whay Sing, 1967-
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/1999
Field of study

Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999.Includes bibliographical references (p. 143-149).by Whay Sing Lee.Ph.D

CiteSeerX

DSpace@MIT

An efficient virtual network interface in the FUGU scalable workstation dc by Kenneth Martin Mackenzie.

Author: Mackenzie Kenneth M. (Kenneth Martin), 1965-
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/1998
Field of study

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.Includes bibliographical references (p. 123-129).Ph.D

DSpace@MIT

Communication Architectures for Parallel-Programming Systems

Author: Bhoedjang R.A.F.
Publication venue
Publication date: 01/01/2000
Field of study

Bal, H.E. [Promotor]Tanenbaum, A.S. [Promotor

VU Research Portal

Performance Evaluation of Specialized Hardware for Fast Global Operations on Distributed Memory Multicomputers

Author: Sankaran Rajesh Madukkarumukumana
Publication venue: PDXScholar
Publication date: 27/10/1995
Field of study

Workstation cluster multicomputers are increasingly being applied for solving scientific problems that require massive computing power. Parallel Virtual Machine (PVM) is a popular message-passing model used to program these clusters. One of the major performance limiting factors for cluster multicomputers is their inefficiency in performing parallel program operations involving collective communications. These operations include synchronization, global reduction, broadcast/multicast operations and orderly access to shared global variables. Hall has demonstrated that a .secondary network with wide tree topology and centralized coordination processors (COP) could improve the performance of global operations on a variety of distributed architectures [Hall94a]. My hypothesis was that the efficiency of many PVM applications on workstation clusters could be significantly improved by utilizing a COP system for collective communication operations. To test my hypothesis, I interfaced COP system with PVM. The interface software includes a virtual memory-mapped secondary network interface driver, and a function library which allows to use COP system in place of PVM function calls in application programs. My implementation makes it possible to easily port any existing PVM applications to perform fast global operations using the COP system. To evaluate the performance improvements of using a COP system, I measured cost of various PVM global functions, derived the cost of equivalent COP library global functions, and compared the results. To analyze the cost of global operations on overall execution time of applications, I instrumented a complex molecular dynamics PVM application and performed measurements. The measurements were performed for a sample cluster size of 5 and for message sizes up to 16 kilobytes. The comparison of PVM and COP system global operation performance clearly demonstrates that the COP system can speed up a variety of global operations involving small-to-medium sized messages by factors of 5-25. Analysis of the example application for a sample cluster size of 5 show that speedup provided by my global function libraries and the COP system reduces overall execution time for this and similar applications by above 1.5 times. Additionally, the performance improvement seen by applications increases as the cluster size increases, thus providing a scalable solution for performing global operations

PDXScholar (Portland State University)

Integrated System Architectures for High-Performance Internet Servers

Author: Binkert Nathan Lorenzo
Publication venue
Publication date: 01/01/2006
Field of study

Ph.D.Computer Science and EngineeringUniversity of Michiganhttp://deepblue.lib.umich.edu/bitstream/2027.42/90845/1/binkert-thesis.pd

CiteSeerX

Deep Blue Documents at the University of Michigan

Recommended from our members

Survey on System I/O Hardware Transactions and Impact on Latency, Throughput, and Other Factors

Author: Larsen Steen
Lee Ben
Publication venue: 'Elsevier BV'
Publication date
Field of study

Computer system I/O has evolved with processor and memory technologies in terms of reducing latency, increasing bandwidth and other factors. As requirements increase for I/O, such as networking, storage, and video, descriptor-based DMA transactions have become more important in high performance systems to move data between I/O adapters and system memory buffers. DMA transactions are done with hardware engines below the software protocol abstraction layers in all systems other than rudimentary embedded controllers. CPUs can switch to other tasks by offloading hardware DMA transfers to the I/O adapters. Each I/O interface has one or more separately instantiated descriptor-based DMA engines optimized for a given I/O port. I/O transactions are optimized by accelerator functions to reduce latency, improve throughput and reduce CPU overhead. This chapter surveys the current state of high-performance I/O architecture advances and explores benefits and limitations. With the proliferation of CPU multi-cores within a system, multi-GB/s ports, and on-die integration of system functions, changes beyond the techniques surveyed may be needed for optimal I/O architecture performance.This is an author's peer-reviewed final manuscript, as accepted by the publisher. The published article/chapter is copyrighted by Elsevier and can be found at: http://www.elsevier.com/books/advances-in-computers/hurson/978-0-12-420232-0.Keywords: memory, controllers, processors, DMA, input/output, latency, power, throughpu

ScholarsArchive@OSU