
    Monte Carlo Simulation With The GATE Software Using Grid Computing

    Monte Carlo simulations that need many replicates to reach good statistics can easily be executed in parallel using the "Multiple Replications In Parallel" (MRIP) approach. However, several precautions must be taken when generating the parallel streams of pseudo-random numbers. In this paper, we present the distribution of Monte Carlo simulations performed with the GATE software using local clusters and grid computing. We obtained very convincing results with this large medical application: thanks to the EGEE Grid (Enabling Grids for E-sciencE), we achieved in one week computations that would have taken more than three years on a single computer. This work was made possible by a generic object-oriented toolbox called DistMe, which we designed to automate this kind of parallelization for Monte Carlo simulations. The toolbox, written in Java, is freely available on SourceForge and helped ensure a rigorous distribution of pseudo-random number streams. It is based on a documented XML format for random number generator statuses.
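    The precaution the abstract highlights, giving every replicate its own non-overlapping pseudo-random stream, can be illustrated with a minimal Java sketch. It uses the JDK's SplittableRandom rather than DistMe's XML-based generator statuses, so it shows the MRIP idea only, not the toolbox's actual mechanism; the replicate count, seed, and toy pi estimate are illustrative assumptions.

```java
import java.util.SplittableRandom;
import java.util.stream.IntStream;

// Sketch of "Multiple Replications In Parallel": each replicate gets
// its own independent pseudo-random stream, so no two replicates ever
// share or overlap a sequence of random numbers.
public class MripSketch {
    public static void main(String[] args) {
        int replicates = 8;              // number of parallel replications
        long samplesPerRun = 1_000_000;  // samples per replicate
        SplittableRandom root = new SplittableRandom(42L);

        // split() yields statistically independent child generators.
        SplittableRandom[] streams = new SplittableRandom[replicates];
        for (int i = 0; i < replicates; i++) {
            streams[i] = root.split();
        }

        // Toy Monte Carlo estimate of pi, one stream per replicate.
        double mean = IntStream.range(0, replicates).parallel()
            .mapToDouble(i -> {
                SplittableRandom rng = streams[i];
                long hits = 0;
                for (long n = 0; n < samplesPerRun; n++) {
                    double x = rng.nextDouble(), y = rng.nextDouble();
                    if (x * x + y * y <= 1.0) hits++;
                }
                return 4.0 * hits / samplesPerRun;
            })
            .average().orElse(Double.NaN);

        System.out.printf("pi estimate over %d replicates: %f%n",
                          replicates, mean);
    }
}
```

    On a grid, each replicate would run as a separate job; the essential point, which DistMe automates, is that the stream assignment happens once, up front, from a single documented generator state.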

    Connected component identification and cluster update on GPU

    Cluster identification tasks occur in many contexts in physics and engineering, for instance in cluster algorithms for simulating spin models, percolation simulations, segmentation problems in image processing, and network analysis. Graphics processing units (GPUs) have been shown to deliver speedups of two to three orders of magnitude over serial CPU codes for local, naturally parallel problems such as single-spin-flip update simulations of spin models; the situation is considerably more complicated for the non-local problem of cluster or connected component identification. I discuss the suitability of different approaches to parallelizing cluster labeling and cluster update algorithms on GPUs and compare their performance to serial implementations.
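    For readers unfamiliar with cluster labeling, the following minimal Java sketch shows the standard serial union-find approach to connected component identification on a lattice, the kind of CPU baseline the GPU variants are measured against. It is not the paper's GPU kernel; the lattice size and 4-neighbor connectivity are assumptions for illustration.

```java
import java.util.Arrays;

// Serial union-find labeling of connected clusters of occupied sites
// on an L x L lattice with 4-neighbor connectivity.
public class ClusterLabeling {
    private final int[] parent;

    public ClusterLabeling(int n) {
        parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;  // each site is its own root
    }

    // Find the root of i, halving paths along the way.
    int find(int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    }

    // Merge the clusters containing a and b.
    void union(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra != rb) parent[ra] = rb;
    }

    public static void main(String[] args) {
        int L = 4;
        boolean[] occupied = {          // toy 4x4 configuration, row-major
            true,  true,  false, true,
            false, true,  false, true,
            true,  false, false, true,
            true,  true,  false, false
        };
        ClusterLabeling uf = new ClusterLabeling(L * L);
        for (int y = 0; y < L; y++)
            for (int x = 0; x < L; x++) {
                int i = y * L + x;
                if (!occupied[i]) continue;
                if (x + 1 < L && occupied[i + 1]) uf.union(i, i + 1);  // right
                if (y + 1 < L && occupied[i + L]) uf.union(i, i + L);  // down
            }
        int[] label = new int[L * L];
        for (int i = 0; i < L * L; i++)
            label[i] = occupied[i] ? uf.find(i) : -1;  // -1 marks empty sites
        System.out.println(Arrays.toString(label));
    }
}
```

    The non-locality the abstract refers to is visible here: the final label of a site can depend on sites arbitrarily far away, which is what makes a direct GPU port of this loop nontrivial.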

    Suitability of the Spark framework for data classification

    The goal of this thesis is to show the suitability of the Spark framework for different types of classification algorithms, and to show how exactly to adapt algorithms from MapReduce to Spark. To this end, three algorithms were implemented: the parallel k-nearest neighbors algorithm, the parallel naïve Bayesian algorithm, and the Clara algorithm. To show the various approaches, the algorithms were implemented in two frameworks, Hadoop and Spark. Tests were run with the same input data and parameters for both frameworks, and varied parameters were used to verify the correctness of the implementations. Charts and tables were generated for each algorithm separately, including parallel speedup charts showing how well the algorithm implementations distribute work across the worker nodes. The results show that Spark handles simple algorithms such as k-nearest neighbors well, but the difference from the Hadoop results is not very large. The naïve Bayesian algorithm proved to be a special case among the simple algorithms: with very fast algorithms, the Spark framework spends more time on data distribution and configuration than on the data processing itself. The Clara algorithm results show that Spark handles more complex algorithms noticeably better than Hadoop.
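    As a concrete illustration of the kind of adaptation the thesis describes, here is a hedged sketch of parallel k-nearest-neighbors classification in Spark's Java API: score every training point against a query in parallel, keep the k closest, and vote. The record layout, class names, and toy data are illustrative assumptions, not the thesis's actual implementation.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.io.Serializable;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SparkKnnSketch {

    // Assumed record layout: a class label plus a small numeric vector.
    public static class LabeledPoint implements Serializable {
        final String label;
        final double[] x;
        LabeledPoint(String label, double... x) { this.label = label; this.x = x; }
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("knn-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<LabeledPoint> train = sc.parallelize(List.of(
                new LabeledPoint("a", 0.0, 0.1),
                new LabeledPoint("a", 0.2, 0.0),
                new LabeledPoint("b", 1.0, 1.1),
                new LabeledPoint("b", 0.9, 1.0)));

            final double[] query = {0.1, 0.05};
            final int k = 3;

            // Distances are computed in parallel across partitions;
            // takeOrdered merges only the k best candidates per partition.
            List<Tuple2<Double, String>> nearest = train
                .map(p -> new Tuple2<>(dist(p.x, query), p.label))
                .takeOrdered(k, (Comparator<Tuple2<Double, String>> & Serializable)
                                (t1, t2) -> Double.compare(t1._1(), t2._1()));

            // Majority vote among the k nearest neighbors.
            Map<String, Integer> votes = new HashMap<>();
            for (Tuple2<Double, String> t : nearest) votes.merge(t._2(), 1, Integer::sum);
            String predicted = java.util.Collections.max(
                votes.entrySet(), Map.Entry.comparingByValue()).getKey();

            System.out.println("predicted label: " + predicted);
        }
    }
}
```

    A sketch like this also makes the thesis's naïve Bayes observation plausible: for an algorithm this cheap per record, job setup and data distribution can dominate the actual processing time.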

    Feature selection in high-dimensional dataset using MapReduce

    This paper describes a distributed MapReduce implementation of the minimum Redundancy Maximum Relevance (mRMR) algorithm, a popular feature selection method in bioinformatics and network inference problems. The proposed approach handles both tall/narrow and wide/short datasets. We further provide an open-source implementation based on Hadoop/Spark and illustrate its scalability on datasets involving millions of observations or features.
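    To make the criterion being distributed concrete, here is a hedged local Java sketch of the greedy mRMR step: at each iteration, pick the feature that maximizes relevance to the target minus mean redundancy with the features already selected. Absolute Pearson correlation stands in for mutual information to keep the example short; the real algorithm uses MI estimates, and the paper's contribution is computing the per-feature scores with MapReduce/Spark, which this sketch does not show.

```java
import java.util.ArrayList;
import java.util.List;

public class MrmrSketch {

    // Absolute Pearson correlation, standing in for mutual information.
    static double corr(double[] a, double[] b) {
        int n = a.length;
        double ma = 0, mb = 0;
        for (int i = 0; i < n; i++) { ma += a[i]; mb += b[i]; }
        ma /= n; mb /= n;
        double sab = 0, saa = 0, sbb = 0;
        for (int i = 0; i < n; i++) {
            sab += (a[i] - ma) * (b[i] - mb);
            saa += (a[i] - ma) * (a[i] - ma);
            sbb += (b[i] - mb) * (b[i] - mb);
        }
        return Math.abs(sab / Math.sqrt(saa * sbb));
    }

    // features[j] is the column of feature j; y is the target; k features kept.
    static List<Integer> select(double[][] features, double[] y, int k) {
        List<Integer> chosen = new ArrayList<>();
        boolean[] used = new boolean[features.length];
        for (int step = 0; step < k; step++) {
            int best = -1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int j = 0; j < features.length; j++) {
                if (used[j]) continue;
                double relevance = corr(features[j], y);
                double redundancy = 0;
                for (int s : chosen) redundancy += corr(features[j], features[s]);
                if (!chosen.isEmpty()) redundancy /= chosen.size();
                double score = relevance - redundancy;   // the mRMR criterion
                if (score > bestScore) { bestScore = score; best = j; }
            }
            used[best] = true;
            chosen.add(best);
        }
        return chosen;
    }

    public static void main(String[] args) {
        double[][] f = {
            {1.0, 2.0, 3.0, 4.0, 5.0},       // strongly related to y
            {1.1, 2.1, 3.2, 4.1, 5.2},       // redundant with feature 0
            {5.0, 1.0, 4.0, 2.0, 3.0}        // weakly related to y
        };
        double[] y = {2, 4, 6, 8, 10};
        System.out.println("selected features: " + select(f, y, 2));
    }
}
```

    The inner loop over candidate features is embarrassingly parallel, which is what makes the method a natural fit for MapReduce-style distribution over both tall/narrow and wide/short data.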