    Entity Data Management in OKKAM

    A Holistic Solution for Duplicate Entity Identification in Deep Web Data Integration

    The proliferation of the deep Web offers users a great opportunity to search for high-quality information on the Web. As a necessary step in deep Web data integration, duplicate entity identification aims to discover duplicate records across the integrated Web databases for use in further applications (e.g., price-comparison services). However, most existing work addresses this problem only between two data sources, which is not practical for deep Web data integration systems: a duplicate entity matcher trained over two specific Web databases cannot be applied to other Web databases. In addition, the cost of preparing the training set for n Web databases is C(n, 2) times higher than that for two Web databases. In this paper, we propose a holistic solution to address the new challenges posed by the deep Web, whose goal is to build one duplicate entity matcher over multiple Web databases. Extensive experiments on two domains show that the proposed solution is highly effective for deep Web data integration.

    Training Selection for Tuning Entity Matching

    Entity matching is a crucial and difficult task for data integration. An effective solution strategy typically has to combine several techniques and find suitable settings for critical configuration parameters such as similarity thresholds. Supervised (training-based) approaches promise to reduce the manual work of determining (learning) effective entity matching strategies. However, they depend critically on the selection of training data, a difficult problem that has so far mostly been handled manually by human experts. In this paper we propose a training-based framework called STEM for entity matching and present several generic methods for automatically selecting training data to combine and configure several matching techniques. We evaluate the proposed methods on different match tasks and on small and medium-sized training sets.
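
As a rough illustration of what combining several techniques and tuning a similarity threshold can look like, the sketch below averages two standard string similarities and picks a threshold from labeled pairs. It is not the STEM framework: the function names, the equal weights, and the candidate thresholds are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def edit_ratio(a: str, b: str) -> float:
    """Character-level similarity ratio from difflib (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a: str, b: str, threshold: float = 0.6) -> bool:
    """Hypothetical combined matcher: mean of both scores against a threshold."""
    score = 0.5 * jaccard(a, b) + 0.5 * edit_ratio(a, b)
    return score >= threshold


def tune_threshold(labeled_pairs, candidates=(0.3, 0.4, 0.5, 0.6, 0.7, 0.8)):
    """Return the candidate threshold with the most correct decisions.

    labeled_pairs is a list of (string_a, string_b, is_duplicate) triples;
    ties go to the first (lowest) candidate.
    """
    def correct(t):
        return sum(is_match(a, b, t) == label for a, b, label in labeled_pairs)
    return max(candidates, key=correct)
```

The quality of the chosen threshold depends entirely on which labeled pairs are supplied, which is the training selection problem the abstract refers to.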

    A new semantic similarity join method using diffusion maps and long string table attributes

    With the rapid increase in distributed data sources, information integration requires combining the information that refers to the same entity across different sources. However, there are no global conventions that control the format of the data, and it is impractical to impose such conventions. The data may also contain spelling errors, as in most cases it is entered manually. For these reasons, data integration needs to find and join similar records rather than only exactly matching ones. Most previous work has concentrated on similarity joins in which the join attribute is a short string, such as a person's name or an address. However, most databases also contain long string attributes, such as product descriptions and paper abstracts, and to our knowledge no work has been done in this direction. Long string attributes are promising because they contain much more information than short string attributes, which could improve similarity join performance. In addition, most of the literature does not consider semantic similarity during the similarity join process.
    To address these issues, 1) we showed that long attributes outperform short attributes in similarity join accuracy, with a comparable running time, under both supervised and unsupervised learning scenarios; 2) we found the best semantic similarity method for joining long attributes in both supervised and unsupervised learning scenarios; 3) we proposed efficient semantic similarity join methods using long attributes under both supervised and unsupervised learning scenarios; 4) we proposed privacy-preserving similarity join protocols that support the use of long attributes to increase similarity join accuracy under both supervised and unsupervised learning scenarios; 5) we studied the effect of multi-label supervised learning on similarity join performance; 6) we found an efficient similarity join method for expandable databases.
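
To make the idea of a similarity join on a long string attribute concrete, here is a minimal sketch that joins two record lists when their descriptions are similar enough. It uses plain bag-of-words cosine similarity, not the diffusion-map or semantic methods the abstract describes; the `similarity_join` name, the `key` parameter, and the 0.5 threshold are assumptions for illustration.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def similarity_join(left, right, key, threshold=0.5):
    """Return (i, j, score) for each record pair whose `key` texts are similar.

    This is a naive nested-loop join: it compares every left record with
    every right record, so it is only suitable for small inputs.
    """
    matches = []
    for i, l_rec in enumerate(left):
        for j, r_rec in enumerate(right):
            score = cosine(l_rec[key], r_rec[key])
            if score >= threshold:
                matches.append((i, j, score))
    return matches
```

A call such as `similarity_join(products_a, products_b, key="desc")` would pair records whose descriptions share most of their words even when the wording order differs.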

    An efficient robust hyperheuristic clustering algorithm

    Recent research on clustering problems shows that most approaches rely on meta-heuristics and hybrid meta-heuristics to improve solutions. A hyperheuristic is a set of heuristics, meta-heuristics, and high-level search strategies that operates on the space of heuristics rather than on the space of solutions. Hyperheuristic techniques have been used to develop approaches that are more general than optimization search methods and traditional techniques. In the last few years, most studies have focused on hyperheuristic algorithms that find generalized solutions, yet robust and efficient solutions are still much needed. The main idea of this research is to develop techniques that provide an appropriate level of efficiency and performance in finding a class of basic-level heuristics across different types of combinatorial optimization problems. Clustering is an unsupervised method in data mining and pattern recognition; however, most clustering algorithms are unstable and very sensitive to their input parameters. This study proposes an efficient and robust hyperheuristic clustering algorithm that finds approximate solutions, and attempts to generalize the algorithm to different clustering problem domains. The proposed algorithm uses a hyperheuristic method to minimize the dissimilarity of all points in a cluster from the cluster's gravity center, subject to the capacity constraint of each cluster. The hyperheuristic algorithm draws on a pool of heuristic techniques. Mapping between solution spaces is a powerful and prevalent technique in optimization domains. Most existing algorithms work directly on the solution space, which in some cases is very difficult, and sometimes impossible, because of the dynamic behavior of the data and the algorithm.
    By mapping the heuristic space into the solution space, it becomes possible to make decisions more easily when solving clustering problems. The proposed hyperheuristic clustering algorithm comprises four major components: selection, decision, admission, and a hybrid metaheuristic algorithm. Intensive experiments show that the proposed algorithm produces robust and efficient clustering results.
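
The objective mentioned in the abstract, dissimilarity of points from their cluster's gravity center under a capacity constraint, can be sketched as follows. This covers only the objective and the feasibility check, not the hyperheuristic search itself; Euclidean distance, per-point demands, a single shared capacity, and all function names are assumptions for the example.

```python
import math


def centroid(points):
    """Gravity center (coordinate-wise mean) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def total_dissimilarity(clusters):
    """Sum over clusters of Euclidean distances from each point to its centroid."""
    total = 0.0
    for points in clusters:
        if not points:
            continue  # an empty cluster contributes nothing
        center = centroid(points)
        total += sum(math.dist(p, center) for p in points)
    return total


def within_capacity(cluster_demands, capacity):
    """True if the summed demand of every cluster stays within the shared capacity."""
    return all(sum(demands) <= capacity for demands in cluster_demands)
```

Any search procedure, hyperheuristic or otherwise, would then look for an assignment that keeps `within_capacity` true while driving `total_dissimilarity` down.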

    Proceedings of the International Workshop on Quality in Databases and Management of Uncertain Data (QDBMUD2008)
