Search CORE

10,276 research outputs found

Robust Group Linkage

Author: Dong Xin Luna
Guo Songtao
Li Pei
Maurino Andrea
Srivastava Divesh
Publication venue
Publication date: 02/03/2015
Field of study

We study the problem of group linkage: linking records that refer to entities in the same group. Applications for group linkage include finding businesses in the same chain, finding conference attendees from the same affiliation, finding players from the same team, etc. Group linkage faces challenges not present for traditional record linkage. First, although different members in the same group can share some similar global values of an attribute, they represent different entities so can also have distinct local values for the same or different attributes, requiring a high tolerance for value diversity. Second, groups can be huge (with tens of thousands of records), requiring high scalability even after using good blocking strategies. We present a two-stage algorithm: the first stage identifies cores containing records that are very likely to belong to the same group, while being robust to possible erroneous values; the second stage collects strong evidence from the cores and leverages it for merging more records into the same group, while being tolerant to differences in local values of an attribute. Experimental results show the high effectiveness and efficiency of our algorithm on various real-world data sets

arXiv.org e-Print Archive

CiteSeerX

Feature selection in high-dimensional dataset using MapReduce

Author: Bontempi Gianluca
Borgne Yann-Aël Le
Reggiani Claudio
Publication venue
Publication date: 07/09/2017
Field of study

This paper describes a distributed MapReduce implementation of the minimum Redundancy Maximum Relevance algorithm, a popular feature selection method in bioinformatics and network inference problems. The proposed approach handles both tall/narrow and wide/short datasets. We further provide an open source implementation based on Hadoop/Spark, and illustrate its scalability on datasets involving millions of observations or features

arXiv.org e-Print Archive

DI-fusion

goSLP: Globally Optimized Superword Level Parallelism Framework

Author: Amarasinghe Saman
Mendis Charith
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 30/10/2018
Field of study

Modern microprocessors are equipped with single instruction multiple data (SIMD) or vector instruction sets which allow compilers to exploit superword level parallelism (SLP), a type of fine-grained parallelism. Current SLP auto-vectorization techniques use heuristics to discover vectorization opportunities in high-level language code. These heuristics are fragile, local and typically only present one vectorization strategy that is either accepted or rejected by a cost model. We present goSLP, a novel SLP auto-vectorization framework which solves the statement packing problem in a pairwise optimal manner. Using an integer linear programming (ILP) solver, goSLP searches the entire space of statement packing opportunities for a whole function at a time, while limiting total compilation time to a few minutes. Furthermore, goSLP optimally solves the vector permutation selection problem using dynamic programming. We implemented goSLP in the LLVM compiler infrastructure, achieving a geometric mean speedup of 7.58% on SPEC2017fp, 2.42% on SPEC2006fp and 4.07% on NAS benchmarks compared to LLVM's existing SLP auto-vectorizer.Comment: Published at OOPSLA 201

arXiv.org e-Print Archive

DSpace@MIT

Patterns-based Evaluation of Open Source BPM Systems: The Cases of jBPM, OpenWFE, and Enhydra Shark

Author: Arthur H.M. ter Hofstede
Birger Andersson
Bunge
Cumberlidge
Green
Guarino
Jablonski
Kiepuszewski
Krogstie
Krogstie
Lindland
Nick Russell
Opdahl
Petia Wohed
Puhlmann
Rosemann
Russell
Russell
van der Aalst
van der Aalst
Wahl
Wand
Wil M.P. van der Aalst
Wohed
Wohed
Wohed
Wyssusek
Publication venue
Publication date: 01/08/2008
Field of study

In keeping with the proliferation of free software development initiatives and the increased interest in the business process management domain, many open source workflow and business process management systems have appeared during the last few years and are now under active development. This upsurge gives rise to two important questions: what are the capabilities of these systems? and how do they compare to each other and to their closed source counterparts? i.e. in other words what is the state-of-the-art in the area?. To gain an insight into the area, we have conducted an in-depth analysis of three of the major open source workflow management systems - jBPM, OpenWFE and Enhydra Shark, the results of which are reported here. This analysis is based on the workflow patterns framework and provides a continuation of the series of evaluations performed using the same framework on closed source systems, business process modeling languages and web-service composition standards. The results from evaluations of the three open source systems are compared with each other and also with the results from evaluations of three representative closed source systems - Staffware, WebSphere MQ and Oracle BPEL PM, documented in earlier works. The overall conclusion is that open source systems are targeted more toward developers rather than business analysts. They generally provide less support for the patterns than closed source systems, particularly with respect to the resource perspective which describes the various ways in which work is distributed amongst business users and managed through to completion

Queensland University of Technology ePrints Archive