10,276 research outputs found
Robust Group Linkage
We study the problem of group linkage: linking records that refer to entities
in the same group. Applications for group linkage include finding businesses in
the same chain, finding conference attendees from the same affiliation, finding
players from the same team, etc. Group linkage faces challenges not present for
traditional record linkage. First, although different members in the same group
can share some similar global values of an attribute, they represent different
entities so can also have distinct local values for the same or different
attributes, requiring a high tolerance for value diversity. Second, groups can
be huge (with tens of thousands of records), requiring high scalability even
after using good blocking strategies.
We present a two-stage algorithm: the first stage identifies cores containing
records that are very likely to belong to the same group, while being robust to
possible erroneous values; the second stage collects strong evidence from the
cores and leverages it for merging more records into the same group, while
being tolerant to differences in local values of an attribute. Experimental
results show the high effectiveness and efficiency of our algorithm on various
real-world data sets
Feature selection in high-dimensional dataset using MapReduce
This paper describes a distributed MapReduce implementation of the minimum
Redundancy Maximum Relevance algorithm, a popular feature selection method in
bioinformatics and network inference problems. The proposed approach handles
both tall/narrow and wide/short datasets. We further provide an open source
implementation based on Hadoop/Spark, and illustrate its scalability on
datasets involving millions of observations or features
goSLP: Globally Optimized Superword Level Parallelism Framework
Modern microprocessors are equipped with single instruction multiple data
(SIMD) or vector instruction sets which allow compilers to exploit superword
level parallelism (SLP), a type of fine-grained parallelism. Current SLP
auto-vectorization techniques use heuristics to discover vectorization
opportunities in high-level language code. These heuristics are fragile, local
and typically only present one vectorization strategy that is either accepted
or rejected by a cost model. We present goSLP, a novel SLP auto-vectorization
framework which solves the statement packing problem in a pairwise optimal
manner. Using an integer linear programming (ILP) solver, goSLP searches the
entire space of statement packing opportunities for a whole function at a time,
while limiting total compilation time to a few minutes. Furthermore, goSLP
optimally solves the vector permutation selection problem using dynamic
programming. We implemented goSLP in the LLVM compiler infrastructure,
achieving a geometric mean speedup of 7.58% on SPEC2017fp, 2.42% on SPEC2006fp
and 4.07% on NAS benchmarks compared to LLVM's existing SLP auto-vectorizer.Comment: Published at OOPSLA 201
Patterns-based Evaluation of Open Source BPM Systems: The Cases of jBPM, OpenWFE, and Enhydra Shark
In keeping with the proliferation of free software development initiatives and the increased interest in the business process management domain, many open source workflow and business process management systems have appeared during the last few years and are now under active development. This upsurge gives rise to two important questions: what are the capabilities of these systems? and how do they compare to each other and to their closed source counterparts? i.e. in other words what is the state-of-the-art in the area?. To gain an insight into the area, we have conducted an in-depth analysis of three of the major open source workflow management systems - jBPM, OpenWFE and Enhydra Shark, the results of which are reported here. This analysis is based on the workflow patterns framework and provides a continuation of the series of evaluations performed using the same framework on closed source systems, business process modeling languages and web-service composition standards. The results from evaluations of the three open source systems are compared with each other and also with the results from evaluations of three representative closed source systems - Staffware, WebSphere MQ and Oracle BPEL PM, documented in earlier works. The overall conclusion is that open source systems are targeted more toward developers rather than business analysts. They generally provide less support for the patterns than closed source systems, particularly with respect to the resource perspective which describes the various ways in which work is distributed amongst business users and managed through to completion
- …