
    Performance analysis of direct N-body algorithms for astrophysical simulations on distributed systems

    We discuss the performance of direct summation codes used in the simulation of astrophysical stellar systems on highly distributed architectures. These codes compute the gravitational interaction among stars in an exact way and have an O(N^2) scaling with the number of particles. They can be applied to a variety of astrophysical problems, like the evolution of star clusters, the dynamics of black holes, the formation of planetary systems, and cosmological simulations. The simulation of realistic star clusters with sufficiently high accuracy cannot be performed on a single workstation but may be possible on parallel computers or grids. We have implemented two parallel schemes for a direct N-body code and we study their performance on general-purpose parallel computers and large computational grids. We present the results of timing analyses conducted on the different architectures and compare them with the predictions from theoretical models. We conclude that the simulation of star clusters with up to a million particles will be possible on large distributed computers in the next decade. Simulating entire galaxies, however, will in addition require new hybrid methods to speed up the calculation. Comment: 22 pages, 8 figures, accepted for publication in Parallel Computing.
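
    To make the O(N^2) scaling concrete, the following sketch shows the direct pairwise summation such codes are built around. It is a minimal illustration in Python with NumPy; the softening parameter and the units (G = 1) are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def direct_accelerations(pos, mass, eps=1e-3):
    """Direct O(N^2) summation: every star feels every other star.

    pos  : (N, 3) array of positions
    mass : (N,) array of masses
    eps  : softening length to avoid singularities as r -> 0 (illustrative)
    """
    n = len(mass)
    acc = np.zeros_like(pos)
    for i in range(n):
        dr = pos - pos[i]                      # vectors from star i to all stars
        r2 = (dr ** 2).sum(axis=1) + eps ** 2  # softened squared distances
        r2[i] = 1.0                            # dummy value; self-term masked below
        inv_r3 = r2 ** -1.5
        inv_r3[i] = 0.0                        # no self-interaction
        acc[i] = (mass[:, None] * dr * inv_r3[:, None]).sum(axis=0)
    return acc  # accelerations in units where G = 1
```

    A parallel scheme of the kind studied in the paper would typically assign a block of the outer-loop indices to each node and exchange particle positions every timestep.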

    DiFX: A software correlator for very long baseline interferometry using multi-processor computing environments

    We describe the development of an FX-style correlator for Very Long Baseline Interferometry (VLBI), implemented in software and intended to run in multi-processor computing environments, such as large clusters of commodity machines (Beowulf clusters) or computers specifically designed for high performance computing, such as multi-processor shared-memory machines. We outline the scientific and practical benefits for VLBI correlation, these chiefly being due to the inherent flexibility of software and the fact that the highly parallel and scalable nature of the correlation task is well suited to a multi-processor computing environment. We suggest scientific applications where such an approach to VLBI correlation is most suited and will give the best returns. We report detailed results from the Distributed FX (DiFX) software correlator, running on the Swinburne supercomputer (a Beowulf cluster of approximately 300 commodity processors), including measures of the performance of the system. For example, to correlate all Stokes products for a 10-antenna array, with an aggregate bandwidth of 64 MHz per station and using typical time and frequency resolution, presently requires of order 100 desktop-class compute nodes. Due to the effect of Moore's Law on commodity computing performance, the total number and cost of compute nodes required to meet a given correlation task continue to decrease rapidly with time. We show detailed comparisons between DiFX and two existing hardware-based correlators: the Australian Long Baseline Array (LBA) S2 correlator, and the NRAO Very Long Baseline Array (VLBA) correlator. In both cases, excellent agreement was found between the correlators. Finally, we describe plans for the future operation of DiFX on the Swinburne supercomputer, for both astrophysical and geodetic science. Comment: 41 pages, 10 figures, accepted for publication in PASP.
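
    The FX designation refers to Fourier transforming (F) each station's voltage stream and then cross-multiplying (X) the spectra of station pairs. The sketch below illustrates only that basic structure for two stations using NumPy; the segment length, channel count and accumulation scheme are simplifying assumptions, not DiFX internals.

```python
import numpy as np

def fx_correlate(v1, v2, nchan=64):
    """Toy FX correlation of two real-sampled voltage streams.

    F step: FFT each stream in segments of 2*nchan samples.
    X step: multiply one spectrum by the conjugate of the other and
            accumulate, yielding a time-averaged cross-power spectrum.
    """
    seg = 2 * nchan
    nseg = min(len(v1), len(v2)) // seg
    vis = np.zeros(nchan + 1, dtype=complex)
    for k in range(nseg):
        s1 = np.fft.rfft(v1[k * seg:(k + 1) * seg])
        s2 = np.fft.rfft(v2[k * seg:(k + 1) * seg])
        vis += s1 * np.conj(s2)   # X step: cross-power for this segment
    return vis / nseg             # average over segments (visibility spectrum)
```

    Because each time segment can be correlated independently, the work farms out naturally across cluster nodes, which is the parallelism the paper exploits.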

    Assessing the Utility of a Personal Desktop Cluster

    The computer workstation, introduced by Sun Microsystems in 1982, was the tool of choice for scientists and engineers as an interactive computing environment for the development of scientific codes. However, by the mid-1990s, the performance of workstations began to lag behind high-end commodity PCs. This, coupled with the disappearance of BSD-based operating systems in workstations and the emergence of Linux as an open-source operating system for PCs, arguably led to the demise of the workstation as we knew it. Around the same time, computational scientists started to leverage PCs running Linux to create a commodity-based (Beowulf) cluster that provided dedicated compute cycles, i.e., supercomputing for the rest of us, as a cost-effective alternative to large supercomputers, i.e., supercomputing for the few. However, as the cluster movement has matured, with respect to cluster hardware and open-source software, these clusters have become much more like their large-scale supercomputing brethren — a shared datacenter resource that resides in a machine room. Consequently, these observations, coupled with the ever-increasing performance gap between the PC and the cluster supercomputer, motivate a personal desktop cluster workstation — a turnkey solution that provides an interactive and parallel computing environment with the approximate form factor of a Sun SPARCstation 1 “pizza box” workstation. In this paper, we present the hardware and software architecture of such a solution as well as its prowess as a developmental platform for parallel codes. In short, imagine a 12-node personal desktop cluster that achieves 14 Gflops on Linpack yet sips only 150-180 watts of power, resulting in a performance-power ratio that is over 300% better than our test SMP platform.
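
    The quoted performance-power ratio follows directly from the stated figures; a small worked calculation, assuming only the 14 Gflops Linpack number and the 150-180 W power envelope from the abstract:

```python
# Performance-power ratio of the 12-node desktop cluster (figures from the abstract).
linpack_gflops = 14.0
for watts in (150.0, 180.0):
    print(f"{watts:.0f} W -> {1000 * linpack_gflops / watts:.1f} Mflops/W")
# 150 W -> 93.3 Mflops/W
# 180 W -> 77.8 Mflops/W
```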

    Investigating data throughput and partial dynamic reconfiguration in a commodity FPGA cluster framework

    There are many computational kernels where parallelism can be exploited in application-specific hardware, yielding significant speedup over a general-purpose processor based solution. Commodity cluster computing technologies have been combined with FPGA coprocessors, resulting in even greater performance capability through the exploitation of multiple levels of parallelism. One particularly economic solution, both in terms of cost and power consumption, is to cluster hybrid FPGAs with commodity network interconnects. Hybrid FPGAs combine embedded microprocessors with reconfigurable hardware resources on a single chip, offering lower power consumption and cost compared to a traditional I/O bus FPGA coprocessor solution. While there is a lot of promise in using commodity hybrid FPGAs in a cluster configuration, the design flow and performance characteristics of such systems are currently a limiting factor on the range of applications that could benefit from such a system. The contribution of this thesis is a framework for clustering commodity FPGAs which integrates high-speed DMA data transfers with a flexible FPGA resource sharing scheme enabled through partial reconfiguration. The framework includes an embedded Linux operating system with a custom device driver to manage data transfers and hardware reconfiguration. User-space tools for cluster computing, including ssh and MPI, are deployed, allowing tasks to be split among nodes in the cluster. Performance analysis is performed with a homogeneous cluster composed of four Virtex-5 FXT based FPGA boards. The results demonstrate advantages over previous work in terms of data throughput and reconfiguration, and motivate future research efforts.
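
    The cluster tooling mentioned above (ssh plus MPI) supports the usual single-program, multiple-data pattern for splitting tasks among nodes. A minimal sketch with mpi4py follows; the work items and the squaring kernel are hypothetical placeholders for the DMA transfer and FPGA kernel invocation described in the thesis.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Hypothetical work items; on the FPGA cluster each item would trigger a
# DMA transfer into the reconfigurable fabric plus a hardware kernel run.
work = list(range(100))
my_items = work[rank::size]            # round-robin split across nodes

local = [x * x for x in my_items]      # placeholder for the hardware kernel
results = comm.gather(local, root=0)   # collect partial results at rank 0

if rank == 0:
    flat = [y for part in results for y in part]
    print(f"{len(flat)} results gathered from {size} ranks")
```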

    Feasibility of a low-cost computing cluster in comparison to a high-performance computing cluster: A developing country perspective

    In recent years, many organisations have used high-performance computing clusters to perform, within a few days, complex simulations and calculations that would otherwise have taken years, even lifetimes, on a single computer. However, these high-performance computing clusters can be very expensive to purchase and maintain. For developing countries, these factors are viewed as barriers that will slow their quest to develop the computing platforms needed to solve complex, real-world problems. From previous studies, it was unclear whether an off-the-shelf personal computer (single computer) and low-cost computing clusters are feasible alternatives to high-performance computing clusters for smaller scientific problems. The aim of this study was to investigate this gap in the literature since, to our knowledge, this kind of study has not been conducted before. The study used the High Performance Linpack (HPL) benchmark to collect quantitative data comparing the time-to-complete, operational costs and computational efficiency of a single computer, a low-cost computing cluster and a high-performance cluster. The benchmark used the HPL main algorithm, and matrix sizes for the n x n dense linear system ranged from 10 000 to 60 000. The costs of the low-cost computing cluster were kept to a minimum (USD 4000.00) and the cluster was constructed from locally available computer hardware components. For the cases we studied, we found that a low-cost computing cluster is a viable alternative to a high-performance cluster if the environment requires that costs be kept to a minimum. We concluded that, for smaller scientific problems, both the single computer and the low-cost computing cluster were better alternatives to a high-performance cluster. However, for large scientific problems, where performance and time matter more than cost, a high-performance cluster is still the best solution, offering the best efficiency in both theoretical energy consumption and computation.
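
    For context, the HPL figure of merit follows from the standard operation count for solving a dense n x n system by LU factorisation, about (2/3)n^3 + 2n^2 floating-point operations. A hedged single-node sketch of that bookkeeping, with NumPy's solver standing in for an actual HPL run:

```python
import time
import numpy as np

def dense_solve_gflops(n):
    """Time a dense n x n solve and report an HPL-style Gflop/s figure."""
    a = np.random.rand(n, n)
    b = np.random.rand(n)
    t0 = time.perf_counter()
    np.linalg.solve(a, b)                         # LU factorise and solve
    elapsed = time.perf_counter() - t0
    flops = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2   # standard HPL operation count
    return flops / elapsed / 1e9

print(f"{dense_solve_gflops(5000):.1f} Gflop/s")  # single-computer baseline
```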

    Performance Portability Through Semi-explicit Placement in Distributed Erlang

    We consider the problem of adapting distributed Erlang applications to large or heterogeneous architectures to achieve good performance in a portable way. In many architectures, and especially large ones, the communication latency between pairs of virtual machines (nodes) is no longer uniform. We propose two language-level methods that enable programs to adapt automatically to heterogeneity and non-uniform communication latencies; both provide information enabling a program to identify an appropriate node when spawning a process. First, we provide a means of recording node attributes describing the hardware and software capabilities of nodes, together with mechanisms that allow an application to examine the attributes of remote nodes. Second, we provide an abstraction of communication distances that enables an application to select nodes so as to facilitate efficient communication. We have developed open-source libraries that implement these ideas. We show that the use of attributes for node selection can lead to significant performance improvements when different components of the application have different processing requirements. We also report a detailed empirical investigation of non-uniform communication times on several representative architectures, and show that our abstract model provides a good description of the hierarchy of communication times.
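
    The mechanisms themselves are distributed Erlang libraries; the Python sketch below only illustrates the selection logic they enable, with invented attribute and distance tables (smaller distance meaning cheaper communication).

```python
# Illustrative only: pick a spawn target by capability, breaking ties by
# communication distance, in the spirit of semi-explicit placement.
node_attrs = {
    "node_a": {"cores": 64, "has_gpu": True},
    "node_b": {"cores": 16, "has_gpu": False},
    "node_c": {"cores": 64, "has_gpu": True},
}
distance_from_here = {"node_a": 3, "node_b": 1, "node_c": 2}

def pick_node(required):
    """Return the closest node whose attributes satisfy `required`."""
    ok = [node for node, attrs in node_attrs.items()
          if all(attrs.get(k) == v for k, v in required.items())]
    return min(ok, key=distance_from_here.__getitem__) if ok else None

print(pick_node({"has_gpu": True}))  # -> node_c, the closest GPU-capable node
```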

    Investigation of the Effects of Image Signal-to-Noise Ratio on TSPO PET Quantification of Neuroinflammation

    Neuroinflammation may be imaged using positron emission tomography (PET) and the tracer [11C]-PK11195. Accurate and precise quantification of 18 kilodalton Translocator Protein (TSPO) binding parameters in the brain has proven difficult with this tracer, due to an unfavourable combination of low target concentration in tissue, low brain uptake of the tracer and relatively high non-specific binding, all of which leads to higher levels of relative image noise. To address these limitations, research into new radioligands for the TSPO, with higher brain uptake and lower non-specific binding relative to [11C]-PK11195, is being conducted world-wide. However, factors other than radioligand properties are known to influence signal-to-noise ratio in quantitative PET studies, including the scanner sensitivity, image reconstruction algorithms and data analysis methodology. The aim of this thesis was to investigate and validate computational tools for predicting image noise in dynamic TSPO PET studies, and to employ those tools to investigate the factors that affect image SNR and reliability of TSPO quantification in the human brain. The feasibility of performing multiple (n≥40) independent Monte Carlo simulations for each dynamic [11C]-PK11195 frame, with realistic modelling of the radioactivity source, attenuation and PET tomograph geometries, was investigated. A Beowulf-type high performance computer cluster, constructed from commodity components, was found to be well suited to this task. Timing tests on a single desktop computer system indicated that a computer cluster capable of simulating an hour-long dynamic [11C]-PK11195 PET scan, with 40 independent repeats, and with a total simulation time of less than 6 weeks, could be constructed for less than 10,000 Australian dollars. A computer cluster containing 44 computing cores was therefore assembled, and a peak simulation rate of 2.84×10^5 photon pairs per second was achieved using the GEANT4 Application for Tomographic Emission (GATE) Monte Carlo simulation software. A simulated PET tomograph was developed in GATE that closely modelled the performance characteristics of several real-world clinical PET systems in terms of spatial resolution, sensitivity, scatter fraction and counting rate performance. The simulated PET system was validated using adaptations of the National Electrical Manufacturers Association (NEMA) quality assurance procedures within GATE. Image noise in dynamic TSPO PET scans was estimated by performing n=40 independent Monte Carlo simulations of an hour-long [11C]-PK11195 scan, and of an hour-long dynamic scan for a hypothetical TSPO ligand with double the brain activity concentration of [11C]-PK11195. From these data an analytical noise model was developed that allowed image noise to be predicted for any combination of brain tissue activity concentration and scan duration. The noise model was validated for the purpose of determining the precision of kinetic parameter estimates for TSPO PET. An investigation was made into the effects of activity concentration in tissue, radionuclide half-life, injected dose and compartmental model complexity on the reproducibility of kinetic parameters. Injecting 555 MBq of carbon-11 labelled TSPO tracer produced similar binding parameter precision to 185 MBq of fluorine-18, and a moderate (20%) reduction in precision was observed for the reduced carbon-11 dose of 370 MBq.
Results indicated that a factor of 2 increase in frame count level (relative to [11C]-PK11195, due for example to higher ligand uptake, injected dose or absolute scanner sensitivity) is required to obtain reliable binding parameter estimates for small regions of interest when fitting a two-tissue compartment, four-parameter compartmental model. However, compartmental model complexity had a similarly large effect: reducing model complexity from the two-tissue compartment, four-parameter model to a one-tissue compartment, two-parameter model produced a 78% reduction in the coefficient of variation of the binding parameter estimates at each tissue activity level and region size studied. In summary, this thesis describes the development and validation of Monte Carlo methods for estimating image noise in dynamic TSPO PET scans, and analytical methods for predicting relative image noise for a wide range of tissue activity concentrations and acquisition durations. The findings suggest that a broader consideration of the kinetic properties of novel TSPO radioligands, with a view to selecting ligands that are amenable to analysis with a simple one-tissue compartment model, is at least as important as efforts directed towards reducing image noise, such as higher brain uptake, in the search for the next generation of TSPO PET tracers.
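
The one-tissue compartment, two-parameter model referred to above describes the tissue time-activity curve as C_T(t) = K1 · C_p(t) convolved with exp(-k2·t). A minimal simulate-and-fit sketch follows; the plasma input function, parameter values and noise level are invented for illustration, not taken from the thesis.

```python
import numpy as np
from scipy.optimize import curve_fit

t = np.linspace(0, 60, 121)             # minutes, 0.5-minute grid
cp = 10.0 * t * np.exp(-t / 4.0)        # invented plasma input function

def one_tissue(t, k1, k2):
    """C_T(t) = K1 * Cp(t) convolved with exp(-k2 t), on a discrete grid."""
    dt = t[1] - t[0]
    return k1 * np.convolve(cp, np.exp(-k2 * t))[:len(t)] * dt

rng = np.random.default_rng(0)
ct = one_tissue(t, 0.1, 0.05)
noisy = ct + rng.normal(0.0, 0.05 * ct.max(), len(t))  # crude stand-in for image noise

(k1_hat, k2_hat), _ = curve_fit(one_tissue, t, noisy, p0=(0.05, 0.1))
print(f"K1 = {k1_hat:.3f} /min, k2 = {k2_hat:.3f} /min")
```

Repeating such fits over many noise realisations, with the noise level taken from a model like the one developed in the thesis, yields the coefficient of variation of the parameter estimates.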

    Simulation of networks of spiking neurons: A review of tools and strategies

    We review different aspects of the simulation of spiking neural networks. We start by reviewing the different types of simulation strategies and algorithms that are currently implemented. We next review the precision of those simulation strategies, in particular in cases where plasticity depends on the exact timing of spikes. We then give an overview of the simulators and simulation environments presently available (restricted to those that are freely available, open source and documented). For each simulation tool, its advantages and pitfalls are reviewed, with the aim of allowing the reader to identify which simulator is appropriate for a given task. Finally, we provide a series of benchmark simulations of different types of networks of spiking neurons, including Hodgkin-Huxley-type and integrate-and-fire models, interacting through current-based or conductance-based synapses, using clock-driven or event-driven integration strategies. The same set of models is implemented on the different simulators, and the codes are made available. The ultimate goal of this review is to provide a resource that facilitates identifying the appropriate integration strategy and simulation tool for a given modeling problem related to spiking neural networks. Comment: 49 pages, 24 figures, 1 table; review article, Journal of Computational Neuroscience, in press (2007).
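
    As a concrete instance of the clock-driven strategy benchmarked in the review, here is a minimal current-based leaky integrate-and-fire network with forward-Euler integration; all parameter values are generic textbook choices, not taken from the benchmark suite.

```python
import numpy as np

# Clock-driven LIF network: n neurons, sparse random current-based synapses.
n, dt, t_end = 100, 0.1, 100.0           # neurons, step (ms), duration (ms)
tau, v_th, v_reset = 10.0, -50.0, -65.0  # membrane constants (ms, mV, mV)
rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.5, (n, n)) * (rng.random((n, n)) < 0.1)  # 10% connectivity

v = np.full(n, v_reset)
spikes = []
for step in range(int(t_end / dt)):
    drive = 15.0 * rng.normal(1.8, 0.5, n)     # noisy external input (mV)
    v += dt / tau * (-(v - v_reset) + drive)   # forward-Euler leak + input
    fired = v >= v_th
    if fired.any():
        spikes.append((step * dt, np.where(fired)[0]))
        v += w[:, fired].sum(axis=1)           # instantaneous synaptic kicks
        v[fired] = v_reset                     # reset neurons that spiked
print(f"{sum(len(ids) for _, ids in spikes)} spikes in {t_end:.0f} ms")
```

    An event-driven simulator would instead advance the state only between spikes, using the exact solution of the membrane equation; the review compares both strategies across simulators.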

    Optimisation of John the Ripper in a clustered Linux environment

    To aid system administrators in enforcing strict password policies, password cracking tools such as Cisilia (C.I.S.I.ar, 2003) and John the Ripper (Solar Designer, 2002) have been employed as software utilities to look for weak passwords. John the Ripper (JtR) attempts to crack passwords using a dictionary, brute-force or other mode of attack. The computational intensity of cracking passwords has led to the utilisation of parallel-processing environments to increase the speed of the password-cracking task. Parallel-processing environments can consist of either single systems with multiple processors, or a collection of separate computers working together as a single, logical computer system; both of these configurations allow operations to run concurrently. This study aims to optimise and compare the execution of JtR on a pair of Beowulf clusters, which are collections of computers configured to run in a parallel manner. Each of the clusters runs the Rocks cluster distribution, a Linux RedHat based cluster toolkit. An implementation of the Message Passing Interface (MPI), MPICH, is used for inter-node communication, allowing the password cracker to run in parallel. Experiments were performed to test the reliability of cracking a single set of password samples on both a 32-bit and a 64-bit Beowulf cluster, comprised of Intel Pentium and AMD64 Opteron processors respectively. These experiments were also used to test the effectiveness of the brute-force attack against the dictionary attack of JtR. The results of this thesis may assist organisations in enforcing strong password policies on user accounts through the use of computer clusters, and in examining the possibility of using JtR as a tool to reliably measure password strength.
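
    The work division described, splitting one cracking task across cluster nodes with MPI, can be sketched as below. This illustrates only a static split of a dictionary attack, using Python's hashlib rather than JtR's cracking engine; the wordlist path and target hash are placeholders.

```python
import hashlib
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

target = hashlib.md5(b"hunter2").hexdigest()   # placeholder target hash

# Static split: each rank tests every size-th candidate from the wordlist.
with open("wordlist.txt", "rb") as f:          # placeholder wordlist path
    for i, word in enumerate(f):
        if i % size != rank:
            continue
        if hashlib.md5(word.strip()).hexdigest() == target:
            print(f"rank {rank}: cracked -> {word.strip().decode()}")
            comm.Abort()                       # halt all ranks once found
```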