4 research outputs found
Numerics of High Performance Computers and Benchmark Evaluation of Distributed Memory Computers
The internal representation of numerical data, their speed of manipulation to generate the desired result through efficient utilisation of central processing unit, memory, and communication links are essential steps of all high performance scientific computations. Machine parameters, in particular, reveal accuracy and error bounds of computation, required for performance tuning of codes. This paper reports diagnosis of machine parameters, measurement of computing power of several workstations, serial and parallel computers, and a component-wise test procedure for distributed memory computers. Hierarchical memory structure is illustrated by block copying and unrolling techniques. Locality of reference for cache reuse of data is amply demonstrated by fast Fourier transform codes. Cache and register-blocking technique results in their optimum utilisation with consequent gain in throughput during vector-matrix operations. Implementation of these memory management techniques reduces cache inefficiency loss, which is known to be proportional to the number of processors. Of the two Linux clusters-ANUP16, HPC22 and HPC64, it has been found from the measurement of intrinsic parameters and from application benchmark of multi-block Euler code test run that ANUP16 is suitable for problems that exhibit fine-grained parallelism. The delivered performance of ANUP16 is of immense utility for developing high-end PC clusters like HPC64 and customised parallel computers with added advantage of speed and high degree of parallelism
Design and analysis of a 3-dimensional cluster multicomputer architecture using optical interconnection for petaFLOP computing
In this dissertation, the design and analyses of an extremely scalable distributed
multicomputer architecture, using optical interconnects, that has the potential to
deliver in the order of petaFLOP performance is presented in detail. The design
takes advantage of optical technologies, harnessing the features inherent in optics,
to produce a 3D stack that implements efficiently a large, fully connected system of
nodes forming a true 3D architecture. To adopt optics in large-scale multiprocessor
cluster systems, efficient routing and scheduling techniques are needed. To this
end, novel self-routing strategies for all-optical packet switched networks and on-line
scheduling methods that can result in collision free communication and achieve real
time operation in high-speed multiprocessor systems are proposed. The system is designed
to allow failed/faulty nodes to stay in place without appreciable performance
degradation. The approach is to develop a dynamic communication environment that
will be able to effectively adapt and evolve with a high density of missing units or
nodes. A joint CPU/bandwidth controller that maximizes the resource allocation in
this dynamic computing environment is introduced with an objective to optimize the
distributed cluster architecture, preventing performance/system degradation in the
presence of failed/faulty nodes. A thorough analysis, feasibility study and description of the characteristics of a 3-Dimensional multicomputer system capable of achieving
100 teraFLOP performance is discussed in detail. Included in this dissertation is
throughput analysis of the routing schemes, using methods from discrete-time queuing
systems and computer simulation results for the different proposed algorithms. A
prototype of the 3D architecture proposed is built and a test bed developed to obtain
experimental results to further prove the feasibility of the design, validate initial assumptions,
algorithms, simulations and the optimized distributed resource allocation
scheme. Finally, as a prelude to further research, an efficient data routing strategy
for highly scalable distributed mobile multiprocessor networks is introduced
Recommended from our members
Allocation of SISAL program graphs to processors using BLAS
There are a number of well known techniques for extracting parallelism from a given program. They range from hardware implementations, building restructuring compilers or reorganizing of programs so as to specify all the available parallelism. The success rate of any of the known techniques is rather poor over all types of programs. This has pushed the research community to explore new languages and design different architectures to exploit program parallelism. The principles of dataflow architectures have addressed the problem of exploiting parallelism in systems by executing dataflow graphs. These graphs or programs represent data dependencies among instructions and execution of the graph proceeds in a data-driven manner. That is, an instruction is executed as soon as all its operands are available, without waiting for any program counter to sequence its execution, as is the case in conventional von Neumann architectures. In this thesis, data flow graphs are generated during the intermediate compilation of a functional language called SISAL (Streams and Iterations in a Single Assignment Language). The Intermediate Form (IFl) is a graphical language consisting of multiple acyclic function graphs that represent a given program. Each graph consists of a sequence of nodes and edges. The nodes specify the operation and the edges indicate the dependencies between the nodes. The graphs are further connected to each other by means of implicit dependencies. The Automator package developed in this project, preprocesses these multiple IF1 graphs and translates them into a single connected graph. It converts all implicit dependencies into actual ones. Additionally, complex language constructs like For All, loops and if-then-else are treated in special ways together with their nested levels by the Automator. There is virtually no limit to the number of nested levels that can be translated by this package. The Automator's prime contribution is in translating real programs written in SISAL into a specified format required by an allocation algorithm called the Balanced Layered Allocation Scheme (BLAS). BLAS partitions a connected graph into independent tasks and assigns them to processors in a multicomputer system. The problem of program allocation lies in maximizing parallelism while minimizing interprocessor communication costs. Hence, allocation is based on the best choice of communication to execution ratio for each task. BLAS utilizes heuristic rules to find a balance between computation and communication costs in the target system. Here the target architecture is a simulated nCUBE 3E computer, having a hypercube topology. Simulations show that, BLAS is effective in reducing the overall execution time of a program by considering the communication costs on the execution times. The results will help in understanding the effects in packing nodes (grain-packing), routing issues in the network and in general, the allocation problem to any processor in a network. In addition, tasks have also been assigned to adjacent processors only, instead of any processor on the hypercube network. The adjacent allocation to processors helps to determine trade-offs required between achieved speed-ups and the time it takes to completely allocate large graphs on compilation