Optimizing Communication for Massively Parallel Processing

Author: Kumar Sameer
Publication venue
Publication date: 01/04/2005
Field of study

The current trends in high performance computing show that large machines with tens of thousands of processors will soon be readily available. The IBM Bluegene-L machine with 128k processors (which is currently being deployed) is an important step in this direction. In this scenario, it is going to be a significant burden for the programmer to manually scale his applications. This task of scaling involves addressing issues like load-imbalance and communication overhead. In this thesis, we explore several communication optimizations to help parallel applications scale easily on a large number of processors. We also present automatic runtime techniques to relieve the programmer from the burden of optimizing communication in his applications. This thesis explores processor virtualization to improve communication performance in applications. With processor virtualization, the computation is mapped to virtual processors (VPs). After one VP has finished computation and is waiting for responses to its messages, another VP can compute, thus overlapping communication with computation. This overlap is only effective if the processor overhead of the communication operation is a small fraction of the total communication time. Fortunately, with network interfaces having co-processors, this happens to be true and processor virtualization has a natural advantage on such interconnects. The communication optimizations we present in this thesis are motivated by applications such as NAMD (a classical molecular dynamics application) and CPAIMD (a quantum chemistry application). Applications like NAMD and CPAIMD consume a fair share of the time available on supercomputers, so improving their performance would be of great value. We have successfully scaled NAMD to 1TF of peak performance on 3000 processors of PSC Lemieux, using the techniques presented in this thesis. We study both point-to-point communication and collective communication (specifically all-to-all communication).
On a large number of processors, all-to-all communication can take several milliseconds to finish. With synchronous collectives defined in MPI, the processor idles while the collective messages are in flight. Therefore, we demonstrate an asynchronous collective communication framework that lets the CPU compute while the all-to-all messages are in flight. We also show that the best strategy for all-to-all communication depends on the message size, number of processors and other dynamic parameters. This suggests that these parameters can be observed at runtime and used to choose the optimal strategy for all-to-all communication. In this thesis, we demonstrate adaptive strategy switching for all-to-all communication. The communication optimization framework presented in this thesis has been designed to optimize communication in the context of processor virtualization and dynamic migrating objects. We present the streaming strategy to optimize fine grained object-to-object communication. In this thesis, we motivate the need for hardware collectives, as processor based collectives can be delayed by intermediate processors that are busy with computation. We explore a next generation interconnect that supports collectives in the switching hardware. We show the performance gains of hardware collectives through synthetic benchmarks.
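
The claim that the best all-to-all strategy depends on message size and processor count can be illustrated with a small cost-model sketch. This is not the thesis's framework: the two strategies (direct sends versus a two-phase combining scheme on a virtual square mesh), the textbook latency/bandwidth model, and the names `direct_cost`, `mesh_cost`, `choose_strategy`, `alpha` and `beta` are all assumptions made here for illustration.

```python
import math


def direct_cost(p: int, m: float, alpha: float, beta: float) -> float:
    """Estimated time when each processor sends p - 1 separate messages of m bytes.

    alpha is an assumed per-message overhead in seconds and beta an assumed
    per-byte transfer time in seconds.
    """
    return (p - 1) * (alpha + m * beta)


def mesh_cost(p: int, m: float, alpha: float, beta: float) -> float:
    """Estimated time for a two-phase combining scheme on a sqrt(p) x sqrt(p) mesh.

    Each phase sends sqrt(p) - 1 combined messages of sqrt(p) * m bytes, so
    fewer messages are sent but every byte travels twice.
    """
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("this sketch assumes p is a perfect square")
    return 2 * (side - 1) * (alpha + side * m * beta)


def choose_strategy(p: int, m: float, alpha: float, beta: float) -> str:
    """Pick the strategy with the lower estimated cost for the observed parameters."""
    if mesh_cost(p, m, alpha, beta) < direct_cost(p, m, alpha, beta):
        return "mesh"
    return "direct"
```

Under this model, small messages favour the combining scheme because the per-message overhead dominates, while large messages favour direct sends because combining moves every byte twice. A runtime that observes `m` and `p` could call `choose_strategy` before each collective, which is the kind of adaptive switching the abstract describes.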

Illinois Digital Environment for Access to Learning and Scholarship Repository

Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving Parallel Performance on Clusters of Workstations

Author: Dimitrov Rossen Petkov
Publication venue: Scholars Junction
Publication date: 12/05/2001
Field of study

This study considers software techniques for improving performance on clusters of workstations and approaches for designing message-passing middleware that facilitate scalable, parallel processing. Early binding and overlapping of communication and computation are identified as fundamental approaches for improving parallel performance and scalability on clusters. Currently, cluster computers using the Message-Passing Interface for interprocess communication are the predominant choice for building high-performance computing facilities, which makes the findings of this work relevant to a wide audience from the areas of high-performance computing and parallel processing. The performance-enhancing techniques studied in this work are presently underutilized in practice because of the lack of adequate support by existing message-passing libraries and are also rarely considered by parallel algorithm designers. Furthermore, commonly accepted methods for performance analysis and evaluation of parallel systems omit these techniques and focus primarily on more obvious communication characteristics such as latency and bandwidth. This study provides a theoretical framework for describing early binding and overlapping of communication and computation in models for parallel programming. This framework defines four new performance metrics that facilitate new approaches for performance analysis of parallel systems and algorithms. This dissertation provides experimental data that validate the correctness and accuracy of the performance analysis based on the new framework. The theoretical results of this performance analysis can be used by designers of parallel system and application software for assessing the quality of their implementations and for predicting the effective performance benefits of early binding and overlapping. This work presents MPI/Pro, a new MPI implementation that is specifically optimized for clusters of workstations interconnected with high-speed networks. 
This MPI implementation emphasizes features such as persistent communication, asynchronous processing, low processor overhead, and independent message progress. These features are identified as critical for delivering maximum performance to applications. The experimental section of this dissertation demonstrates the capability of MPI/Pro to facilitate software techniques that result in significant application performance improvements. Specific demonstrations with Virtual Interface Architecture and TCP/IP over Ethernet are offered

Scholars Junction - Mississippi State University Institutional Repository

Mechanisms for efficient, protected messaging

Author: Lee Whay Sing, 1967-
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/1999
Field of study

Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999. Includes bibliographical references (p. 143-149). by Whay Sing Lee. Ph.D.

CiteSeerX

DSpace@MIT

A Study of Client-based Caching for Parallel I/O

Author: Settlemyer Bradley
Publication venue: Clemson University Libraries
Publication date: 01/08/2009
Field of study

The trend in parallel computing toward large-scale cluster computers running thousands of cooperating processes per application has led to an I/O bottleneck that has only gotten more severe as the number of processing cores per CPU has increased. Current parallel file systems are able to provide high bandwidth file access for large contiguous file region accesses; however, applications repeatedly accessing small file regions on unaligned file region boundaries continue to experience poor I/O throughput due to the high overhead associated with accessing parallel file system data. In this dissertation we demonstrate how client-side file data caching can improve parallel file system throughput for applications performing frequent small and unaligned file I/O. We explore the impacts of cache page size and cache capacity using the popular FLASH I/O benchmark and explore a novel cache sharing approach that leverages the trend toward multi-core processors. We also explore a technique we call progressive page caching that represents cache data using dynamic data structures rather than fixed-size pages of file data. Finally, we explore a cache aggregation scheme that leverages the high-level file I/O interfaces provided by the PVFS file system to provide further performance enhancements. In summary, our results indicate that a correctly configured middleware-based file data cache can dramatically improve the performance of I/O workloads dominated by small unaligned file accesses. Further, we demonstrate that a well designed cache can offer stable performance even when the selected cache page granularity is not well matched to the provided workload. Finally, we have shown that high-level file system interfaces can significantly accelerate application performance, and interfaces beyond those currently envisioned by the MPI-IO standard could provide further performance benefits.
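
How a client-side cache with fixed-size pages absorbs small, unaligned reads can be sketched as follows. This is a minimal illustration and not the dissertation's implementation: the class `PageCache`, its fields, and the use of an in-memory `bytes` object in place of a parallel file system are hypothetical, and the sketch has no capacity limit, eviction, write path, or cache sharing.

```python
class PageCache:
    """Hypothetical read-only client cache that serves byte ranges from fixed-size pages."""

    def __init__(self, backing: bytes, page_size: int = 4096):
        self.backing = backing      # stands in for data held by the file system
        self.page_size = page_size  # cache page granularity in bytes
        self.pages = {}             # page index -> cached page contents
        self.backing_reads = 0      # number of simulated file-system requests

    def _page(self, index: int) -> bytes:
        """Return one page, fetching it from the backing store only on a miss."""
        if index not in self.pages:
            start = index * self.page_size
            self.pages[index] = self.backing[start:start + self.page_size]
            self.backing_reads += 1
        return self.pages[index]

    def read(self, offset: int, length: int) -> bytes:
        """Serve an arbitrary, possibly unaligned, byte range from aligned pages."""
        if length <= 0:
            return b""
        first = offset // self.page_size
        last = (offset + length - 1) // self.page_size
        data = b"".join(self._page(i) for i in range(first, last + 1))
        skip = offset - first * self.page_size
        return data[skip:skip + length]
```

A read that straddles a page boundary costs two aligned requests the first time, and later small reads that fall inside those pages cost none. The choice of `page_size` trades the amount of unneeded data fetched per miss against the number of requests, which is the page-size sensitivity the abstract mentions.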

Clemson University: TigerPrints

Acceleration of the hardware-software interface of a communication device for parallel systems

Author: Nüßle Mondrian Benediktus
Publication venue: Universität Mannheim
Publication date: 01/01/2008
Field of study

During the last decades the ever growing need for computational power fostered the development of parallel computer architectures. Applications need to be parallelized and optimized to be able to exploit modern system architectures. Today, scalability of applications is more and more limited both by development resources, as programming of complex parallel applications becomes increasingly demanding, and by the fundamental scalability issues introduced by the cost of communication in distributed memory systems. Lowering the latency of communication is mandatory to increase scalability and serves as an enabling technology for programming of distributed memory systems at a higher abstraction layer using higher degrees of compiler driven automation. At the same time it can increase performance of such systems in general. In this work, the software/hardware interface and the network interface controller functions of the EXTOLL network architecture, which is specifically designed to satisfy the needs of low-latency networking for high-performance computing, are presented. Several new architectural contributions are made in this thesis, namely a new efficient method for virtual-to-physical address translation named ATU and a novel method to issue operations to a virtual device in an optimal way which has been termed Transactional I/O. This new method needs changes in the architecture of the host CPU the device is connected to. Two additional methods that emulate most of the characteristics of Transactional I/O are developed and employed in the development of the EXTOLL hardware to facilitate usage together with contemporary CPUs. These new methods heavily leverage properties of the HyperTransport interface used to connect the device to the CPU. Finally, this thesis also introduces an optimized remote-memory-access architecture for efficient split-phase transactions and atomic operations.
The complete architecture has been prototyped using FPGA technology enabling a more precise analysis and verification than is possible using simulation alone. The resulting design utilizes 95 % of a 90 nm FPGA device and reaches speeds of 200 MHz and 156 MHz in the different clock domains of the design. The EXTOLL software stack is developed and a performance evaluation of the software using the EXTOLL hardware is performed. The performance evaluation shows an excellent start-up latency value of 1.3 μs, which competes with the most advanced networks available, in spite of the technological performance handicap encountered by FPGA technology. The resulting network is, to the best of the knowledge of the author, the fastest FPGA-based interconnection network for commodity processors ever built

MAnnheim DOCument Server

SoC-Based In-Storage Processing: Bringing Flexibility and Efficiency to Near-Data Processing

Author: Torabzadehkashi Mahdi
Publication venue: eScholarship, University of California
Publication date: 01/01/2019
Field of study

Data are among the most valuable assets in the modern world, and they have caused a revolutionary stage in human life. Nowadays, companies make knowledge-based decisions by analyzing a huge volume of data, super-scale data centers are used to process customers’ data to suggest products to them, government services rely on the data people provide to them, and there are many similar cases wherein data are used as an important asset. Data are originally stored in storage systems. To process data, application servers need to fetch the data from storage units, which imposes the cost of moving the data to the system. This cost has a direct relationship to the distance of the processing engines from the data, and this is the key motivation for the emergence of distributed processing platforms such as Hadoop, which bring the process closer to the data. In-storage processing (ISP) pushes the “bring the process to data” paradigm to its ultimate boundaries by utilizing processing engines inside the storage units to process data. The architecture of modern solid-state drives (SSDs) provides a suitable environment for implementing such technology. Thus, this dissertation focuses on SSD architectures that are able to run user applications in-place, which are called computational storage devices (CSDs). In this dissertation, we propose CSD architectures and investigate the benefits of deploying CSDs for running different applications. This research uses a practical approach that includes building fully functional prototypes of the proposed CSD architectures, developing storage systems equipped with the CSDs, and running different benchmarks to investigate the benefits of deploying the CSDs in the systems.
This research proposes two different CSD architectures, namely CompStor and Catalina. These are the first CSDs to be equipped with a dedicated ISP engine for running user applications in-place that includes a quad-core ARM Cortex-A53 processor together with FPGA- and application-specific integrated circuit (ASIC) based accelerators. The proposed architectures run a full-fledged operating system inside, which provides a flexible environment for running a wide range of user applications in-place. The system-on-chip (SOC) based architecture of Catalina CSD, together with a software stack developed for seamless deployment of the CSD, makes it a platform for the implementation of different ISP concepts and ideas. To the best of our knowledge, Catalina is the only ISP platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and message passing interface (MPI) based applications in-place without any modifications to the underlying distributed processing framework. We performed extensive experimental tests using several datasets on both CompStor and Catalina CSDs. The experimental results show up to 2.2x and 4.3x improvements in performance and energy consumption, respectively, for running Hadoop MapReduce benchmarks using Catalina CSDs and up to 5.4x and 8.9x improvements for running 1-, 2-, and 3-dimensional DFT algorithms due to the Neon SIMD engines inside Catalina. Additionally, using FPGA-based accelerators, Catalina CSDs can improve the performance and energy consumption of a highly demanding image similarity search application up to 11x and 7x, respectively.

eScholarship - University of California

CMS The TriDAS Project: Technical Design Report, Volume 2: Data Acquisition and High-Level Trigger

Author: Cittolin Sergio
Rácz Attila
Sphicas Paris
Publication venue: Union of Concerned Scientists
Publication date: 01/01/2002
Field of study

CERN Document Server

Swarming Reconnaissance Using Unmanned Aerial Vehicles in a Parallel Discrete Event Simulation

Author: Corner Joshua J.
Publication venue: AFIT Scholar
Publication date: 01/03/2004
Field of study

Current military affairs indicate that future military warfare requires safer, more accurate, and more fault-tolerant weapons systems. Unmanned Aerial Vehicles (UAV) are one answer to this military requirement. Technology in the UAV arena is moving toward smaller and more capable systems and is becoming available at a fraction of the cost. Exploiting the advances in these miniaturized flying vehicles is the aim of this research. How are the UAVs employed for the future military? The concept of operations for a micro-UAV system is adopted from nature from the appearance of flocking birds, movement of a school of fish, and swarming bees among others. All of these natural phenomena have a common thread: a global action resulting from many small individual actions. This emergent behavior is the aggregate result of many simple interactions occurring within the flock, school, or swarm. In a similar manner, a more robust weapon system uses emergent behavior resulting in no weakest link because the system itself is made up of simple interactions by hundreds or thousands of homogeneous UAVs. The global system in this research is referred to as a swarm. Losing one or a few individual unmanned vehicles would not dramatically impact the swarm's ability to complete the mission or cause harm to any human operator. Swarming reconnaissance is the emergent behavior of swarms to perform a reconnaissance operation. The design of a reconnaissance swarming mission is studied in depth. A taxonomy of passive reconnaissance applications is developed to address feasibility. Evaluation of algorithms for swarm movement, communication, sensor input/analysis, targeting, and network topology results in priorities of each model's desired features. After a thorough selection process of available implementations, a subset of those models is integrated and built upon, resulting in a simulation that explores the innovations of swarming UAVs.

AFIT Scholar (Air Force Institute of Technology)

Fault-tolerant routing in SCI networks

Author: Stensland Håkon Kvale
Publication venue
Publication date: 01/01/2006
Field of study

Fault-tolerant routing has been a hot topic in the academic community for quite some time now, and several different approaches have been suggested. In the interconnect industry, however, fault-tolerant routing has not been implemented to the same extent. In this thesis we have adapted and implemented a local fault-tolerant routing approach in SCI interconnect technology produced by Dolphin Interconnect Solutions. The existing technology used in SCI is based on a static reconfiguration approach, where the traffic is disabled while the new routing is calculated by a central front-end and distributed out to the nodes. Our algorithm builds upon the principle of enabling the nodes to make routing decisions from the information that is available to them locally, and having the rest of the nodes in the cluster be prepared for this unexpected traffic. The algorithm has been tested on real hardware, and we have shown that it can handle several levels of traffic in the network. The test has also shown that our method gives the same performance both before and after the error occurs if the packets have the same conditions, such as competing traffic and link length. Our routing algorithm is currently integrated as a part of the Dolphin Interconnect Solutions driver in the last official release.

NORA - Norwegian Open Research Archives

Deployment of Stream Control Transmission Protocol (SCTP) to Maintain the Applications of Data Centers

Author: Abdelfattah Eman
Almajadub Fatma
Razaque Abdul
Publication venue
Publication date: 11/11/2013
Field of study

With the deployment of real-time applications in data centers, alternatives to the standard TCP protocol have come into demand for several data center applications. Several alternatives to TCP have been proposed, but SCTP has an edge due to several well-built characteristics that make it capable of working efficiently. In this paper, we compare the features of SCTP in data centers, such as multi-streaming and multi-homing, with the features of TCP. Our objective is to introduce the internal problems of data centers. A robust transport protocol reduces these problems to some extent. Focusing on the problems of data centers, we also examine the weaknesses of the widely deployed standard TCP, and evaluate the performance of SCTP in the context of faster communication for data centers. We also identify some weaknesses and shortcomings of SCTP in data centers and propose some ways to avoid them while maintaining SCTP's native features. To validate the strengths and weaknesses of TCP and SCTP, we use ns2 for simulation in the context of a data center. On the basis of our findings, we highlight the major strengths of SCTP. At the end, we implement finer grain TCP locking mechanisms for larger messages. http://arxiv.org/abs/1311.263

arXiv.org e-Print Archive

UB ScholarWorks