4 research outputs found
r-HUMO: A Risk-Aware Human-Machine Cooperation Framework for Entity Resolution with Quality Guarantees
Even though many approaches have been proposed for entity resolution (ER), it
remains very challenging to find one with quality guarantees. To this end, we
propose a risk-aware HUman-Machine cOoperation framework for ER, denoted by
r-HUMO. Built on the existing HUMO framework, r-HUMO similarly enforces both
precision and recall levels by partitioning an ER workload between the human
and the machine. However, r-HUMO is the first solution to optimize the process
of human workload selection from a risk perspective. It iteratively selects
human workload based on real-time risk analysis on human-labeled results as
well as prespecified machine metrics. In this paper, we first introduce the
r-HUMO framework and then present the risk analysis technique to prioritize the
instances for manual labeling. Finally, we empirically evaluate r-HUMO's
performance on real data. Our extensive experiments show that r-HUMO is
effective in enforcing quality guarantees, and compared with the
state-of-the-art alternatives, it can achieve better quality control with
reduced human cost.
Comment: 12 pages, 7 figures. arXiv admin note: text overlap with
arXiv:1710.0020
Adaptive Deep Learning for Entity Resolution by Risk Analysis
The state-of-the-art performance on entity resolution (ER) has been achieved
by deep learning. However, deep models are usually trained on large quantities
of accurately labeled training data, and cannot be easily tuned towards a
target workload. Unfortunately, in real scenarios, there may not be sufficient
labeled training data, and even worse, their distribution is usually more or
less different from the target workload even when they come from the same
domain.
To alleviate these limitations, this paper proposes a novel risk-based
approach to tune a deep model towards a target workload by its particular
characteristics. Built on the recent advances in risk analysis for ER, the
proposed approach first trains a deep model on labeled training data, and then
fine-tunes it by minimizing its estimated misprediction risk on unlabeled
target data. Our theoretical analysis shows that risk-based adaptive training
can correct the label status of a mispredicted instance with a fairly good
chance. We have also empirically validated the efficacy of the proposed
approach on real benchmark data by a comparative study. Our extensive
experiments show that it can considerably improve the performance of deep
models. Furthermore, in the scenario of distribution misalignment, it can
similarly outperform the state-of-the-art alternative of transfer learning by
considerable margins. Using ER as a test case, we demonstrate that risk-based
adaptive training is a promising approach potentially applicable to various
challenging classification tasks.
Comment: 31 pages, 5 figures, 4 tables
Enabling Quality Control for Entity Resolution: A Human and Machine Cooperation Framework
Even though many machine algorithms have been proposed for entity resolution,
it remains very challenging to find a solution with quality guarantees. In this
paper, we propose a novel HUman and Machine cOoperation (HUMO) framework for
entity resolution (ER), which divides an ER workload between the machine and
the human. HUMO enables a mechanism for quality control that can flexibly
enforce both precision and recall levels. We introduce the optimization problem
of HUMO, minimizing human cost given a quality requirement, and then present
three optimization approaches: a conservative baseline one purely based on the
monotonicity assumption of precision, a more aggressive one based on sampling
and a hybrid one that can take advantage of the strengths of both previous
approaches. Finally, we demonstrate by extensive experiments on real and
synthetic datasets that HUMO can achieve high-quality results with reasonable
return on investment (ROI) in terms of human cost, and it performs considerably
better than the state-of-the-art alternatives in quality control.
Comment: 12 pages, 11 figures. Camera-ready version of the paper submitted to
ICDE 2018. In Proceedings of the 34th IEEE International Conference on Data
Engineering (ICDE 2018)
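
The idea of dividing an ER workload between the machine and the human can be illustrated with a minimal sketch. It is not the HUMO optimization itself: the function names, the two similarity thresholds `lo` and `hi`, and the assumption that human labels are always correct are all hypothetical choices made for illustration. Pairs scoring below `lo` are labeled unmatched by the machine, pairs above `hi` are labeled matched, and the pairs in between go to the human.

```python
from typing import List, Tuple


def partition_workload(
    scores: List[float], lo: float, hi: float
) -> Tuple[List[int], List[int], List[int]]:
    """Split pair indices into machine-unmatch, human, and machine-match zones."""
    machine_unmatch = [i for i, s in enumerate(scores) if s < lo]
    human = [i for i, s in enumerate(scores) if lo <= s <= hi]
    machine_match = [i for i, s in enumerate(scores) if s > hi]
    return machine_unmatch, human, machine_match


def resolve(
    scores: List[float], truth: List[bool], lo: float, hi: float
) -> Tuple[float, float, int]:
    """Return (precision, recall, human_cost) for the given thresholds.

    Assumption: the human labels every pair in the middle zone correctly.
    """
    _, human, machine_match = partition_workload(scores, lo, hi)
    # Predicted matches: everything the machine accepts plus true matches
    # confirmed by the human.
    predicted = set(machine_match) | {i for i in human if truth[i]}
    actual = {i for i, t in enumerate(truth) if t}
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(actual) if actual else 1.0
    return precision, recall, len(human)


if __name__ == "__main__":
    scores = [0.1, 0.2, 0.45, 0.55, 0.6, 0.9, 0.95]
    truth = [False, False, True, False, True, True, True]
    print(resolve(scores, truth, 0.4, 0.7))  # wider human zone, higher cost
    print(resolve(scores, truth, 0.5, 0.5))  # no human zone, zero cost
```

Widening the human zone raises human cost and, under the perfect-labeling assumption, can only improve precision and recall; the optimization problem described in the abstract is to find the cheapest zone that still meets the quality requirement.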
Active Deep Learning on Entity Resolution by Risk Sampling
While the state-of-the-art performance on entity resolution (ER) has been
achieved by deep learning, its effectiveness depends on large quantities of
accurately labeled training data. To alleviate the data labeling burden, Active
Learning (AL) presents itself as a feasible solution that focuses on data
deemed useful for model training. Building upon the recent advances in risk
analysis for ER, which can provide a more refined estimate on label
misprediction risk than the simpler classifier outputs, we propose a novel AL
approach of risk sampling for ER. Risk sampling leverages misprediction risk
estimation for active instance selection. Based on the core-set
characterization for AL, we theoretically derive an optimization model which
aims to minimize core-set loss with non-uniform Lipschitz continuity. Since the
defined weighted K-medoids problem is NP-hard, we then present an efficient
heuristic algorithm. Finally, we empirically verify the efficacy of the
proposed approach on real data by a comparative study. Our extensive
experiments have shown that it outperforms the existing alternatives by
considerable margins. Using ER as a test case, we demonstrate that risk
sampling is a promising approach potentially applicable to other challenging
classification tasks.
Comment: 13 pages, 6 figures
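
To make the weighted K-medoids formulation more concrete, the sketch below shows a simple greedy heuristic for it. This is not the algorithm from the paper, which the abstract does not specify. The function name, the Euclidean distance, and the use of per-instance weights as stand-ins for misprediction risk are assumptions for illustration. Each round adds the candidate that most reduces the weighted sum of distances from every instance to its nearest selected medoid.

```python
from typing import List, Sequence


def weighted_kmedoids_greedy(
    points: Sequence[Sequence[float]], weights: Sequence[float], k: int
) -> List[int]:
    """Greedily pick up to k medoid indices minimizing
    sum_i weights[i] * min_j dist(points[i], points[medoid_j])."""

    def dist(a: Sequence[float], b: Sequence[float]) -> float:
        # Euclidean distance (an assumed choice of metric).
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n = len(points)
    # Distance from each point to its nearest medoid chosen so far.
    nearest = [float("inf")] * n
    medoids: List[int] = []
    for _ in range(min(k, n)):
        best, best_cost = -1, float("inf")
        for c in range(n):
            if c in medoids:
                continue
            # Weighted cost if candidate c were added to the medoid set.
            cost = sum(
                w * min(nearest[i], dist(points[i], points[c]))
                for i, w in enumerate(weights)
            )
            if cost < best_cost:
                best, best_cost = c, cost
        medoids.append(best)
        nearest = [min(nearest[i], dist(points[i], points[best])) for i in range(n)]
    return medoids


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (2.0,)]
    print(weighted_kmedoids_greedy(pts, [1, 1, 1], 1))   # uniform weights
    print(weighted_kmedoids_greedy(pts, [1, 1, 10], 1))  # heavy weight on last point
```

With uniform weights the central point is chosen; a large weight on one instance pulls the medoid towards it, which is the effect a risk-weighted selection is meant to have on which instances get labeled.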