    Cluster Computing in the Classroom: Topics, Guidelines, and Experiences

    As research on cluster computing progresses, more and more universities have begun to offer courses that cover it. A wide variety of content can be taught in these courses, which makes selecting appropriate course material difficult. The selection is complicated by the fact that some cluster computing content is also covered by other courses such as operating systems, networking, or computer architecture. In addition, the background of students enrolled in cluster computing courses varies. These aspects of cluster computing make the development of good course material difficult. Drawing on our experience teaching cluster computing at several universities in the USA and Australia and conducting tutorials at many international conferences, we present prospective topics in cluster computing along with a wide variety of information sources (books, software, and materials on the web) from which instructors can choose. The course material described includes system architecture, parallel programming, algorithms, and applications. Instructors are advised to choose selected units in each of the topical areas and develop their own syllabus to meet course objectives. For example, a full course on system architecture can be taught to core computer science students. Alternatively, a course on parallel programming could cover system architecture briefly and then devote the majority of its time to programming methods. Other combinations are also possible. We share our experiences in teaching cluster computing and the topics we have chosen depending on course objectives.

    Incompressible flow simulation using S.I.M.P.L.E method on parallel computer

    A parallel computer is a combination of several processors intended to increase the capability of a computer system in executing a program. In this project, the parallel computer system used is known as a cluster parallel computer system. The advantage of a cluster parallel computer system is that each machine can run on its own as a serial computer when it is not operating as part of the parallel computer. Software that can serve as the operating system for such a cluster includes UNIX, Windows NT, or Linux. This project focuses on using a cluster parallel computer system with the PVM software to solve the Navier-Stokes equations in a simulation of two-dimensional incompressible flow in a rectangular cavity. The methods used are based on the SIMPLE algorithm and on a SIMPLE algorithm modified using the domain decomposition method and the functional decomposition method. The accuracy of both methods was compared with benchmark results for the two-dimensional flow problem in a rectangular cavity. The performance of both methods in terms of execution time, speedup, and efficiency was also obtained, and the parallel computer was found to give better performance in solving the Navier-Stokes problem. With the domain decomposition method, execution time was reduced by 70%, while with the functional decomposition method, execution time was reduced by 25% compared with using a serial computer.
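
    The reported reductions in execution time can be converted into speedup figures. The following is a minimal sketch, assuming the garbled first figure in the abstract reads 70% and that "reduction" means (T_serial - T_parallel) / T_serial. The function names and the processor count used for the efficiency example are hypothetical; the abstract does not state how many processors were used.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Speedup T_serial / T_parallel for a fractional cut in execution time."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def efficiency(speedup: float, processors: int) -> float:
    """Parallel efficiency: speedup divided by the number of processors."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return speedup / processors


if __name__ == "__main__":
    # Reductions as read from the abstract (the 70% reading is an assumption).
    methods = [("domain decomposition", 0.70), ("functional decomposition", 0.25)]
    hypothetical_processors = 4  # not given in the abstract
    for name, reduction in methods:
        s = speedup_from_reduction(reduction)
        e = efficiency(s, hypothetical_processors)
        print(f"{name}: speedup {s:.2f}, efficiency {e:.2f} on {hypothetical_processors} processors")
```

    Under these assumptions, a 70% reduction corresponds to a speedup of about 3.33 and a 25% reduction to about 1.33.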

    Integrating Algorithmic and Systemic Load Balancing Strategies in Parallel Scientific Applications

    Load imbalance is a major source of performance degradation in parallel scientific applications. Load balancing increases the efficient use of existing resources and improves the performance of parallel applications running in distributed environments. At a coarse level of granularity, advances in runtime systems for parallel programs have been proposed to control available resources as efficiently as possible by utilizing idle resources and using task migration. At a finer level of granularity, advances in algorithmic strategies for dynamically balancing computational loads by data redistribution have been proposed to respond to variations in processor performance during the execution of a given parallel application. Algorithmic and systemic load balancing strategies have complementary sets of advantages. An integration of these two techniques is possible, and it should result in a system that delivers advantages over each technique used in isolation. This thesis presents the design and implementation of a system that combines an algorithmic, fine-grained, data-parallel load balancing strategy called Fractiling with a systemic, coarse-grained, task-parallel load balancing system called Hector. It also reports experimental results of running N-body simulations under this integrated system. The experimental results indicate that a distributed runtime environment that combines both algorithmic and systemic load balancing strategies can provide performance advantages with little overhead, underscoring the importance of this approach in large, complex scientific applications.
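
    To illustrate why dynamic redistribution of work helps when processors run at different speeds, the sketch below compares a static equal split with a factoring-style schedule in which idle workers pull progressively smaller chunks. It is not the thesis's Fractiling or Hector implementation, neither of which is specified in the abstract; the chunk rule, function names, worker speeds, and problem size are all assumptions chosen for illustration.

```python
import heapq
import math


def factoring_chunks(total_items: int, workers: int) -> list[int]:
    """Chunk sizes in hand-out order: each batch takes roughly half of the
    remaining items and splits that half evenly among the workers."""
    if total_items < 0 or workers < 1:
        raise ValueError("need total_items >= 0 and workers >= 1")
    chunks = []
    remaining = total_items
    while remaining > 0:
        size = max(1, math.ceil(remaining / (2 * workers)))
        for _ in range(workers):
            if remaining == 0:
                break
            take = min(size, remaining)
            chunks.append(take)
            remaining -= take
    return chunks


def makespan(chunks: list[int], speeds: list[float]) -> float:
    """Simulate self-scheduling: the worker that becomes idle first takes the
    next chunk. Returns the time at which the last worker finishes."""
    idle_at = [(0.0, w) for w in range(len(speeds))]
    heapq.heapify(idle_at)
    finish = 0.0
    for chunk in chunks:
        t, w = heapq.heappop(idle_at)
        t += chunk / speeds[w]  # time to process this chunk on worker w
        finish = max(finish, t)
        heapq.heappush(idle_at, (t, w))
    return finish


if __name__ == "__main__":
    speeds = [1.0, 1.0, 1.0, 0.5]  # hypothetical: one worker at half speed
    static = [25, 25, 25, 25]      # one equal chunk per worker
    dynamic = factoring_chunks(100, len(speeds))
    print("static split makespan: ", makespan(static, speeds))
    print("factoring-style makespan:", makespan(dynamic, speeds))
```

    In this toy setting the static split is held back by the slow worker, while the shrinking chunks let faster workers absorb the leftover work near the end of the run.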

    Engineering the performance of parallel applications

    Research summary, January 1989 - June 1990

    The Research Institute for Advanced Computer Science (RIACS) was established at NASA ARC in June of 1983. RIACS is privately operated by the Universities Space Research Association (USRA), a consortium of 62 universities with graduate programs in the aerospace sciences, under a Cooperative Agreement with NASA. RIACS serves as the representative of the USRA universities at ARC. This document reports our activities and accomplishments for the period 1 Jan. 1989 - 30 Jun. 1990. The following topics are covered: learning systems, networked systems, and parallel systems

    An Application Perspective on High-Performance Computing and Communications

    We review possible and probable industrial applications of HPCC, focusing on the software and hardware issues. Thirty-three separate categories are illustrated by detailed descriptions of five areas: computational chemistry; Monte Carlo methods from physics to economics; manufacturing and computational fluid dynamics; command and control or crisis management; and multimedia services to client computers and set-top boxes. The hardware varies from tightly coupled parallel supercomputers to heterogeneous distributed systems. The software models span HPF and data parallelism to distributed information systems and object/data flow parallelism on the Web. We find that in each case it is reasonably clear that HPCC works in principle, and we postulate that this knowledge can be used in a new generation of software infrastructure based on the WebWindows approach, which is discussed in an accompanying paper.

    Performance Evaluation of Specialized Hardware for Fast Global Operations on Distributed Memory Multicomputers

    Workstation cluster multicomputers are increasingly being applied to scientific problems that require massive computing power. Parallel Virtual Machine (PVM) is a popular message-passing model used to program these clusters. One of the major performance-limiting factors for cluster multicomputers is their inefficiency in performing parallel program operations that involve collective communication. These operations include synchronization, global reduction, broadcast/multicast operations, and orderly access to shared global variables. Hall has demonstrated that a secondary network with a wide tree topology and centralized coordination processors (COP) could improve the performance of global operations on a variety of distributed architectures [Hall94a]. My hypothesis was that the efficiency of many PVM applications on workstation clusters could be significantly improved by using a COP system for collective communication operations. To test my hypothesis, I interfaced the COP system with PVM. The interface software includes a virtual memory-mapped secondary network interface driver and a function library that allows the COP system to be used in place of PVM function calls in application programs. My implementation makes it easy to port any existing PVM application to perform fast global operations using the COP system. To evaluate the performance improvements of using a COP system, I measured the cost of various PVM global functions, derived the cost of the equivalent COP library global functions, and compared the results. To analyze the effect of the cost of global operations on the overall execution time of applications, I instrumented a complex molecular dynamics PVM application and performed measurements. The measurements were performed for a sample cluster size of 5 and for message sizes up to 16 kilobytes.
    The comparison of PVM and COP system global operation performance clearly demonstrates that the COP system can speed up a variety of global operations involving small-to-medium-sized messages by factors of 5-25. Analysis of the example application for a sample cluster size of 5 shows that the speedup provided by my global function libraries and the COP system reduces overall execution time for this and similar applications by a factor of more than 1.5. Additionally, the performance improvement seen by applications increases as the cluster size increases, thus providing a scalable solution for performing global operations.
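
    The relation between the speedup of global operations alone and the overall speedup of an application can be sketched with an Amdahl-style estimate. The abstract does not report what fraction of execution time the application spent in global operations, so the fractions computed below are back-of-the-envelope values implied by the quoted factors (5-25 for global operations, 1.5 overall), not measurements from the thesis. The function names are hypothetical.

```python
def overall_speedup(global_fraction: float, global_speedup: float) -> float:
    """Amdahl-style estimate: only the share of time spent in global
    operations is accelerated; the rest of the run is unchanged."""
    if not 0.0 <= global_fraction <= 1.0 or global_speedup <= 0.0:
        raise ValueError("need 0 <= global_fraction <= 1 and global_speedup > 0")
    return 1.0 / ((1.0 - global_fraction) + global_fraction / global_speedup)


def required_fraction(target_overall: float, global_speedup: float) -> float:
    """Share of original run time that global operations must occupy for the
    given global-operation speedup to yield the target overall speedup."""
    if target_overall < 1.0 or global_speedup <= 1.0:
        raise ValueError("need target_overall >= 1 and global_speedup > 1")
    return (1.0 - 1.0 / target_overall) / (1.0 - 1.0 / global_speedup)


if __name__ == "__main__":
    for s in (5.0, 25.0):  # range of global-operation speedups quoted above
        f = required_fraction(1.5, s)
        print(f"global ops {s:.0f}x faster: about {f:.0%} of run time "
              f"in global operations gives {overall_speedup(f, s):.2f}x overall")
```

    Under this simple model, an overall factor of 1.5 would correspond to global operations accounting for roughly a third to two fifths of the original execution time.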