7 research outputs found

    Implementing VCODE with static processes

    No full text
    The NESL parallel functional language developed at CMU supports a combination of data- and control-parallelism through so-called nested parallelism. The designers of NESL defined the portable intermediate language VCODE, into which NESL is compiled. VCODE realises nested parallelism through data structures called segmented vectors, akin to lists of lists. Arbitrary trees of calls to parallel procedures are thus implemented by VCODE primitives on segmented vectors. The simplicity of those primitives enhances the portability of NESL, but their efficiency depends strongly on how algorithms act on the layout, shape and size of the segmented vectors. CMU's current implementation uses algorithms that are oblivious to the number of processors and hence to the exact data layout. Our experiment improves performance by programming VCODE primitives with explicit static processes. 1 Introduction The experiment described here identifies a weakness in the implementation of the nested-parallel programming lan..
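    The segmented-vector idea can be illustrated with a minimal sketch: a nested list is stored as one flat value vector plus a segment descriptor, and a "segmented" primitive operates on the flat data in a single pass. The function names are illustrative only, not the actual VCODE primitives.

    ```python
    # Sketch of a segmented-vector representation of a nested list:
    # flat values + segment lengths, as used (in spirit) by VCODE.

    def to_segmented(nested):
        """Flatten a list of lists into (values, segment_lengths)."""
        values = [x for seg in nested for x in seg]
        lengths = [len(seg) for seg in nested]
        return values, lengths

    def segmented_sum(values, lengths):
        """Per-segment sum over the flat representation: one linear
        pass with no pointer chasing, which is what keeps such
        primitives simple and portable."""
        sums, i = [], 0
        for n in lengths:
            sums.append(sum(values[i:i + n]))
            i += n
        return sums

    values, lengths = to_segmented([[1, 2, 3], [], [4, 5]])
    # values -> [1, 2, 3, 4, 5], lengths -> [3, 0, 2]
    per_segment = segmented_sum(values, lengths)
    # per_segment -> [6, 0, 9]
    ```

    How efficiently such a primitive runs on a real machine depends on how the flat data is laid out across processors, which is exactly the sensitivity the abstract describes.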

    Handling limits of high degree vertices in graph processing using MapReduce and Pregel

    No full text
    Although Pregel scales better than MapReduce in graph processing by reducing per-iteration disk I/O, while offering an easy programming model through its "think like a vertex" approach, large-scale graph processing remains challenging in the presence of high-degree vertices: communication and load imbalance among processing nodes can have disastrous effects on performance. In this paper, we introduce a scalable MapReduce graph partitioning approach for high-degree vertices, using a master/slave partitioning that balances communication and computation among processing nodes during all stages of graph processing. A cost analysis and performance tests of this partitioning show the effectiveness and scalability of the approach in large-scale systems.
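    The master/slave idea can be sketched as follows: a vertex whose degree exceeds a threshold has its adjacency list split across slave copies, each holding a bounded slice of edges, while the master keeps only links to its slaves and aggregates their results. All names here are hypothetical, chosen to illustrate the partitioning pattern rather than the paper's actual implementation.

    ```python
    # Sketch: split a high-degree vertex into a master plus slave
    # copies so no single worker holds all of its edges.

    def split_high_degree(vertex, neighbours, threshold):
        """Return (master, slaves); each slave holds at most
        `threshold` of the original edges."""
        if len(neighbours) <= threshold:
            return (vertex, neighbours), []
        slaves = []
        for i in range(0, len(neighbours), threshold):
            slave_id = f"{vertex}#slave{i // threshold}"
            slaves.append((slave_id, neighbours[i:i + threshold]))
        # The master keeps only edges to its slaves and would
        # aggregate the partial results they compute.
        master = (vertex, [sid for sid, _ in slaves])
        return master, slaves

    master, slaves = split_high_degree("v", ["a", "b", "c", "d", "e"], 2)
    # master -> ("v", ["v#slave0", "v#slave1", "v#slave2"])
    # slaves -> [("v#slave0", ["a", "b"]), ("v#slave1", ["c", "d"]),
    #            ("v#slave2", ["e"])]
    ```

    Because every slave's edge list is bounded by the threshold, both the communication volume and the computation attached to the original high-degree vertex are spread evenly across processing nodes.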

    Scalability and Optimisation of GroupBy-Joins in MapReduce

    No full text
    For over a decade, MapReduce has been the leading programming model for parallel processing of massive volumes of data, driven by the development of frameworks such as Spark, Pig and Hive that facilitate data analysis on large-scale systems. However, these frameworks remain vulnerable to communication costs, data skew and task imbalance, which can have a devastating effect on their performance and scalability, particularly when treating GroupBy-Join queries over large datasets. In this paper, we present a new GroupBy-Join algorithm that considerably reduces communication costs while avoiding the effects of data skew. A cost analysis of this algorithm shows that our approach is insensitive to data skew and ensures perfect load-balancing properties during all stages of GroupBy-Join computation, even for highly skewed data. These results have been confirmed by a series of experiments.
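    One way communication costs drop in a GroupBy-Join is map-side pre-aggregation: rows sharing the same (join key, group key) pair are collapsed locally before the shuffle, so only one partial aggregate per pair crosses the network. The sketch below illustrates that general combiner idea in plain Python; it is not the paper's algorithm, and all names are illustrative.

    ```python
    # Sketch: combiner-style pre-aggregation before a GroupBy-Join.
    from collections import defaultdict

    def local_pre_aggregate(partition):
        """Map-side combine: collapse rows sharing (join_key, group_key).
        Only this dictionary would be shuffled, not the raw rows."""
        partial = defaultdict(int)
        for join_key, group_key, value in partition:
            partial[(join_key, group_key)] += value
        return partial

    def groupby_join(r_partitions, s):
        """Join pre-aggregated R against a small relation S,
        then finish the grouping on group_key."""
        s_index = dict(s)  # join_key -> payload
        result = defaultdict(int)
        for part in r_partitions:
            for (jk, gk), v in local_pre_aggregate(part).items():
                if jk in s_index:
                    result[gk] += v
        return dict(result)

    r = [[("k1", "g1", 1), ("k1", "g1", 2)],
         [("k1", "g2", 5), ("k2", "g1", 7)]]
    s = [("k1", "payload")]
    # groupby_join(r, s) -> {"g1": 3, "g2": 5}
    ```

    Note that plain hash-based pre-aggregation is itself sensitive to skew in the key distribution; handling that robustly is precisely what the paper's cost analysis addresses.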

    A Scalable MapReduce Similarity Join Algorithm Using LSH

    No full text
    Similarity joins are recognised to be among the most useful data processing and analysis operations. A similarity join retrieves all data pairs whose distances are smaller than a predefined threshold λ. In this paper, we introduce the MRS-join algorithm to perform similarity joins on large trajectory datasets. The MapReduce model and a randomised LSH (Locality-Sensitive Hashing) key-redistribution approach are used to balance load among processing nodes, while restricting communication and computation to almost only the relevant data by using distributed histograms. A cost analysis of the MRS-join algorithm shows that our approach is insensitive to data skew and guarantees perfect load-balancing properties, in large-scale systems, during all stages of similarity join computation. These results have been confirmed by a series of experiments using the Fréchet distance on large trajectory datasets from real-world and synthetic benchmarks.
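    The LSH step can be illustrated on 1-D points with a simple bucketing hash: points whose keys collide land in the same reducer bucket, so exact distances are computed only within a bucket and its neighbour, never across the whole dataset. This is a toy stand-in for the trajectory setting and for MRS-join's actual hash family; the function names are illustrative.

    ```python
    # Sketch: LSH-style bucketing for a similarity join on 1-D points.
    import itertools

    def lsh_key(x, width):
        """Bucketing hash: nearby points share a key."""
        return int(x // width)

    def similarity_join(points, threshold):
        """All pairs at distance <= threshold, via bucket-local checks."""
        buckets = {}
        for p in points:
            buckets.setdefault(lsh_key(p, threshold), []).append(p)
        pairs = set()
        for key, bucket in buckets.items():
            # Probe the next bucket too, so pairs straddling a
            # bucket boundary are not missed.
            candidates = bucket + buckets.get(key + 1, [])
            for a, b in itertools.combinations(candidates, 2):
                if abs(a - b) <= threshold:
                    pairs.add(tuple(sorted((a, b))))
        return pairs

    # similarity_join([0.1, 0.2, 1.5, 3.0], 0.5) -> {(0.1, 0.2)}
    ```

    In a MapReduce setting each bucket key becomes a reducer key, which is why the distribution of keys (the histograms in the abstract) determines load balance.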

    Personalized Environment for Querying Semantic Knowledge Graphs: a MapReduce Solution

    No full text
    Querying according to a personalised context is an increasingly required feature of semantic graph databases. We define contexts using constraints imposed on queries rather than on data sources. No correction is attempted on an inconsistent database, yet answers are guaranteed to be valid. Data confidence according to provenance is also taken into account. Since constraint validation and query evaluation are two independent modules, our approach can be tested with different query evaluators. This paper focuses on a MapReduce query environment.