Monte Carlo Simulation With The GATE Software Using Grid Computing
Monte Carlo simulations that need many replicates to obtain good statistical results can easily be executed in parallel using the "Multiple Replications In Parallel" approach. However, several precautions must be taken when generating the parallel streams of pseudo-random numbers. In this paper, we present the distribution of Monte Carlo simulations performed with the GATE software using local clusters and grid computing. We obtained very convincing results with this large medical application thanks to the EGEE Grid (Enabling Grids for E-sciencE), achieving in one week computations that would have taken more than 3 years of processing on a single computer. This work was achieved with a generic object-oriented toolbox called DistMe, which we designed to automate this kind of parallelization for Monte Carlo simulations. This toolbox, written in Java, is freely available on SourceForge and helped ensure a rigorous distribution of pseudo-random number streams. It is based on a documented XML format for pseudo-random number generator statuses.
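The "Multiple Replications In Parallel" idea can be sketched in a few lines: each replicate consumes a statistically independent pseudo-random stream, and the per-replicate results are averaged. This is an illustration only, assuming NumPy's `SeedSequence.spawn` as the stream-partitioning mechanism (the paper itself relies on the DistMe toolbox and its XML-serialized generator statuses):

```python
import numpy as np

def replicate(seed_seq, n_samples=100_000):
    """One Monte Carlo replicate: estimate pi by rejection sampling."""
    rng = np.random.default_rng(seed_seq)
    pts = rng.random((n_samples, 2))
    return 4.0 * np.mean((pts ** 2).sum(axis=1) <= 1.0)

def mrip_pi(n_replicates=8, master_seed=12345):
    # Spawn independent child streams from one master seed; in a grid
    # setting each child would be shipped to a separate job, whereas
    # here the replicates simply run in a loop.
    children = np.random.SeedSequence(master_seed).spawn(n_replicates)
    return float(np.mean([replicate(c) for c in children]))
```

The key property is that the child streams are guaranteed non-overlapping by construction, which is exactly the precaution the abstract warns about.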
Connected component identification and cluster update on GPU
Cluster identification tasks occur in a multitude of contexts in physics and
engineering such as, for instance, cluster algorithms for simulating spin
models, percolation simulations, segmentation problems in image processing, or
network analysis. While it has been shown that graphics processing units (GPUs)
can result in speedups of two to three orders of magnitude as compared to
serial codes on CPUs for the case of local and thus naturally parallelized
problems such as single-spin flip update simulations of spin models, the
situation is considerably more complicated for the non-local problem of cluster
or connected component identification. I discuss the suitability of different
approaches of parallelization of cluster labeling and cluster update algorithms
for calculations on GPU and compare their performance to that of serial
implementations.
Comment: 15 pages, 14 figures, one table, submitted to PR
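As a point of reference for the GPU discussion, the serial baseline for connected component identification can be sketched as a union-find pass over a lattice, in the spirit of Hoshen-Kopelman labeling. This is an illustrative sketch, not the paper's code; the grid layout and function names are assumptions:

```python
def label_clusters(grid):
    """Label connected components of occupied (truthy) cells in a 2D
    grid with 4-neighbor connectivity; empty cells get label 0."""
    rows, cols = len(grid), len(grid[0])
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Single raster sweep: merge each occupied cell with its already
    # visited north and west neighbors.
    for i in range(rows):
        for j in range(cols):
            if not grid[i][j]:
                continue
            parent[(i, j)] = (i, j)
            if i > 0 and grid[i - 1][j]:
                union((i - 1, j), (i, j))
            if j > 0 and grid[i][j - 1]:
                union((i, j - 1), (i, j))

    # Compress each tree root to a compact integer label.
    labels, next_label = {}, 1
    out = [[0] * cols for _ in range(rows)]
    for cell in parent:
        root = find(cell)
        if root not in labels:
            labels[root] = next_label
            next_label += 1
        out[cell[0]][cell[1]] = labels[root]
    return out
```

The non-local merging step (`union`) is precisely what makes a GPU port harder than local single-spin updates: labels must propagate across the whole lattice, not just between neighbors.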
Suitability of the Spark framework for data classification
The goal of this thesis is to show the suitability of the Spark framework for different types of classification algorithms and to show exactly how to port algorithms from MapReduce to Spark. To fulfill this goal, three algorithms were implemented: a parallel k-nearest neighbors algorithm, a parallel naïve Bayes algorithm, and the Clara algorithm. To compare the approaches, the algorithms were implemented in two frameworks, Hadoop and Spark. To obtain the results, tests were run with the same input data and parameters on both frameworks, and the parameters were varied during testing to demonstrate the correctness of the implementations.
As a result, charts and tables were generated for each algorithm separately. In addition, parallel speedup charts were generated to show how well the algorithm implementations distribute work between the worker nodes. The results show that Spark handles simple algorithms, such as k-nearest neighbors, well, but the difference from the Hadoop results is not very large. The naïve Bayes algorithm turned out to be a special case among the simple algorithms: its results show that for very fast algorithms the Spark framework spends more time on data distribution and configuration than on the data processing itself. The Clara algorithm results show that the Spark framework handles more complex algorithms noticeably better than Hadoop.
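The map/reduce decomposition behind a parallel k-nearest neighbors classifier can be sketched in plain Python, with ordinary functions standing in for the Hadoop/Spark stages. The data layout, names, and distance metric are illustrative assumptions, not the thesis code:

```python
import heapq
import math
from collections import Counter

def mapper(partition, query, k):
    """Map stage: emit the k nearest (distance, label) pairs found in
    one data partition."""
    dists = [(math.dist(x, query), label) for x, label in partition]
    return heapq.nsmallest(k, dists)

def reducer(partials, k):
    """Reduce stage: merge the per-partition candidates, keep the
    global top k, and vote by majority label."""
    top_k = heapq.nsmallest(k, (d for part in partials for d in part))
    return Counter(label for _, label in top_k).most_common(1)[0][0]

def knn_classify(partitions, query, k=3):
    # In Hadoop/Spark the mappers run on separate workers; here the
    # list comprehension plays that role sequentially.
    return reducer([mapper(p, query, k) for p in partitions], k)
```

Only k candidates per partition cross the shuffle boundary, which is why the algorithm distributes well; the thesis's observation about naïve Bayes follows from the opposite balance, where the per-record work is too cheap to amortize that distribution overhead.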
Feature selection in high-dimensional dataset using MapReduce
This paper describes a distributed MapReduce implementation of the minimum
Redundancy Maximum Relevance algorithm, a popular feature selection method in
bioinformatics and network inference problems. The proposed approach handles
both tall/narrow and wide/short datasets. We further provide an open source
implementation based on Hadoop/Spark, and illustrate its scalability on
datasets involving millions of observations or features.
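The greedy selection loop of minimum Redundancy Maximum Relevance can be sketched as follows on discrete data; it is these pairwise mutual-information computations that the MapReduce implementation distributes across workers. This is an illustrative sketch with assumed names, not the paper's Hadoop/Spark code:

```python
import math
from collections import Counter

def mutual_info(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def mrmr(features, target, n_select):
    """Greedy mRMR: at each step pick the feature maximizing relevance
    MI(f; target) minus its mean redundancy MI(f; s) over the features
    already selected. `features` maps name -> discrete sequence."""
    relevance = {f: mutual_info(v, target) for f, v in features.items()}
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < n_select:
        def score(f):
            redundancy = sum(mutual_info(features[f], features[s])
                             for s in selected) / len(selected)
            return relevance[f] - redundancy
        candidates = [f for f in features if f not in selected]
        selected.append(max(candidates, key=score))
    return selected
```

Each iteration needs MI between every remaining feature and both the target and the selected set, which is embarrassingly parallel over features (the tall/narrow case) or over observations (the wide/short case).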