
    Robust Group Linkage

    We study the problem of group linkage: linking records that refer to entities in the same group. Applications for group linkage include finding businesses in the same chain, finding conference attendees from the same affiliation, finding players from the same team, etc. Group linkage faces challenges not present for traditional record linkage. First, although different members in the same group can share some similar global values of an attribute, they represent different entities and so can also have distinct local values for the same or different attributes, requiring a high tolerance for value diversity. Second, groups can be huge (with tens of thousands of records), requiring high scalability even after using good blocking strategies. We present a two-stage algorithm: the first stage identifies cores containing records that are very likely to belong to the same group, while being robust to possible erroneous values; the second stage collects strong evidence from the cores and leverages it for merging more records into the same group, while being tolerant to differences in local values of an attribute. Experimental results show the high effectiveness and efficiency of our algorithm on various real-world data sets.

    Feature selection in high-dimensional dataset using MapReduce

    This paper describes a distributed MapReduce implementation of the minimum Redundancy Maximum Relevance algorithm, a popular feature selection method in bioinformatics and network inference problems. The proposed approach handles both tall/narrow and wide/short datasets. We further provide an open source implementation based on Hadoop/Spark, and illustrate its scalability on datasets involving millions of observations or features
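The greedy selection rule behind minimum Redundancy Maximum Relevance can be illustrated on a single machine. The sketch below is not the paper's MapReduce implementation: it assumes discrete features, uses empirical mutual information, and scores each candidate as relevance minus mean redundancy (the difference form of the criterion). The names `mutual_information`, `mrmr`, and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr(features, target, k):
    """Greedily pick up to k feature names.

    Each step selects the unselected feature that maximizes
    relevance (MI with the target) minus mean redundancy
    (average MI with the features already selected).
    """
    relevance = {
        name: mutual_information(col, target) for name, col in features.items()
    }
    selected = []
    while len(selected) < min(k, len(features)):
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        candidates = [name for name in features if name not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Toy data (assumed): the target equals 2*a + b, and a_copy duplicates a.
features = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "a_copy": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [0, 1, 0, 1, 0, 1, 0, 1],
}
target = [0, 1, 2, 3, 0, 1, 2, 3]

chosen = mrmr(features, target, 2)
print(chosen)
```

On this toy data the duplicate column is passed over in favor of `b`, because its redundancy with the already selected `a` cancels its relevance. In a distributed setting the pairwise mutual-information computations are the natural unit to parallelize, though how the paper partitions them is not stated in the abstract.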

    goSLP: Globally Optimized Superword Level Parallelism Framework

    Modern microprocessors are equipped with single instruction multiple data (SIMD) or vector instruction sets which allow compilers to exploit superword level parallelism (SLP), a type of fine-grained parallelism. Current SLP auto-vectorization techniques use heuristics to discover vectorization opportunities in high-level language code. These heuristics are fragile, local and typically only present one vectorization strategy that is either accepted or rejected by a cost model. We present goSLP, a novel SLP auto-vectorization framework which solves the statement packing problem in a pairwise optimal manner. Using an integer linear programming (ILP) solver, goSLP searches the entire space of statement packing opportunities for a whole function at a time, while limiting total compilation time to a few minutes. Furthermore, goSLP optimally solves the vector permutation selection problem using dynamic programming. We implemented goSLP in the LLVM compiler infrastructure, achieving a geometric mean speedup of 7.58% on SPEC2017fp, 2.42% on SPEC2006fp and 4.07% on NAS benchmarks compared to LLVM's existing SLP auto-vectorizer. Comment: Published at OOPSLA 201

    Patterns-based Evaluation of Open Source BPM Systems: The Cases of jBPM, OpenWFE, and Enhydra Shark

    In keeping with the proliferation of free software development initiatives and the increased interest in the business process management domain, many open source workflow and business process management systems have appeared during the last few years and are now under active development. This upsurge gives rise to two important questions: what are the capabilities of these systems, and how do they compare to each other and to their closed source counterparts? In other words, what is the state of the art in the area? To gain an insight into the area, we have conducted an in-depth analysis of three of the major open source workflow management systems - jBPM, OpenWFE and Enhydra Shark, the results of which are reported here. This analysis is based on the workflow patterns framework and provides a continuation of the series of evaluations performed using the same framework on closed source systems, business process modeling languages and web-service composition standards. The results from evaluations of the three open source systems are compared with each other and also with the results from evaluations of three representative closed source systems - Staffware, WebSphere MQ and Oracle BPEL PM, documented in earlier works. The overall conclusion is that open source systems are targeted more toward developers than business analysts. They generally provide less support for the patterns than closed source systems, particularly with respect to the resource perspective, which describes the various ways in which work is distributed amongst business users and managed through to completion.