22,315 research outputs found
Entity Resolution with Active Learning
Entity Resolution refers to the process of identifying records which represent the same real-world entity from one or more datasets. In the big data era, large numbers of entities need to be resolved, which leads to several key challenges, especially for learning-based ER approaches: (1) With the number of records increasing, the computational complexity of the algorithm grows exponentially. (2) Quite a number of samples are necessary for training, but only a limited number of labels are available, especially when the training samples are highly imbalanced. Blocking technique helps to improve the time efficiency by grouping potentially matched records into the same block. Thus to address the above two challenges, in this thesis, we first introduce a novel blocking scheme learning approach based on active learning techniques. With a limited label budget, our approach can learn a blocking scheme to generate high quality blocks. Two strategies called active sampling and active branching are proposed to select samples and generate blocking schemes efficiently. Additionally, a skyblocking framework is proposed as an extension, which aims to learn scheme skylines. In this framework, each blocking scheme is mapped as a point to a multi-dimensional scheme space where each block-ing measure represents one dimension. A scheme skyline contains blocking schemes that are not dominated by any other blocking schemes in the scheme space. We develop three scheme skyline learning algorithms for efficiently learning scheme skylines under a given number of blocking measures and within a label budget limit.
While blocks are well established, we further develop the Learning To Sample approach to deal with the second challenge, i.e. training a learning-based active learning model with as mall number of labeled samples. This approach has two key components: a sampling model and a boosting model, which can mutually learn from each other in iterations to improve the performance of each other. Within this framework, the sampling model incorporates uncertainty sampling and diversity sampling into a unified process for optimization, enabling us to actively select the most representative and informative samples based on an optimized integration of uncertainty and diversity. On the contrary of training with a limited number of samples, a powerful machine learning model may be overfitting by remembering all the sample features. Inspired by recent advances of generative adversarial network (GAN), in this paper, we propose a novel deep learning method, called ERGAN, to address the challenge. ERGAN consists of two key components: a label generator and a discriminator which are optimized alternatively through adversarial learning. To alleviate the issues of overfitting and highly imbalanced distribution, we design two novel modules for diversity and propagation, which can greatly improve the model generalization power. We theoretically prove that ERGAN can overcome the model collapse and convergence problems in the original GAN. We also conduct extensive experiments to empirically verify the labeling and learning efficiency of ERGAN
MinoanER: Schema-Agnostic, Non-Iterative, Massively Parallel Resolution of Web Entities
Entity Resolution (ER) aims to identify different descriptions in various
Knowledge Bases (KBs) that refer to the same entity. ER is challenged by the
Variety, Volume and Veracity of entity descriptions published in the Web of
Data. To address them, we propose the MinoanER framework that simultaneously
fulfills full automation, support of highly heterogeneous entities, and massive
parallelization of the ER process. MinoanER leverages a token-based similarity
of entities to define a new metric that derives the similarity of neighboring
entities from the most important relations, as they are indicated only by
statistics. A composite blocking method is employed to capture different
sources of matching evidence from the content, neighbors, or names of entities.
The search space of candidate pairs for comparison is compactly abstracted by a
novel disjunctive blocking graph and processed by a non-iterative, massively
parallel matching algorithm that consists of four generic, schema-agnostic
matching rules that are quite robust with respect to their internal
configuration. We demonstrate that the effectiveness of MinoanER is comparable
to existing ER tools over real KBs exhibiting low Variety, but it outperforms
them significantly when matching KBs with high Variety.Comment: Presented at EDBT 2001
- âŠ