8 research outputs found

    Advances in knowledge discovery and data mining Part II

    Get PDF
    19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II</p

    Educational Technology and Related Education Conferences for June to December 2015

    Get PDF
    The 33rd edition of the conference list covers selected events that primarily focus on the use of technology in educational settings and on teaching, learning, and educational administration. Only listings until December 2015 are complete as dates, locations, or Internet addresses (URLs) were not available for a number of events held from January 2016 onward. In order to protect the privacy of individuals, only URLs are used in the listing as this enables readers of the list to obtain event information without submitting their e-mail addresses to anyone. A significant challenge during the assembly of this list is incomplete or conflicting information on websites and the lack of a link between conference websites from one year to the next

    Indexing techniques for real-time entity resolution

    No full text
    Entity resolution (ER), which is the process of identifying records in one or several data set(s) that refer to the same real-world entity, is an important task in improving data quality and in data integration. In general, unique entity identifiers are not available in real-world data sets. Therefore, identifying attributes such as names and addresses are required to perform the ER process using approximate matching techniques. Since many services in both the private and public sectors are moving on-line, organizations increasingly require to perform real-time ER (with sub-second response times) on query records that need to be matched with existing data sets. Indexing is a major step in the ER process which aims to group similar records together using a blocking key criterion to reduce the search space. Most existing indexing techniques that are currently used with ER are static and can only be employed off-line with batch processing algorithms. A major aspect of achieving ER in real-time is to develop novel efficient and effective dynamic indexing techniques that allow dynamic updates and facilitate real-time matching. In this thesis, we focus on the indexing step in the context of real-time ER. We propose three dynamic indexing techniques and a blocking key learning algorithm to be used with real-time ER. The first index (named DySimII) is a blocking-based technique that is updated whenever a new query record arrives. We reduce the size of DySimII by proposing a frequency-filtered alteration that only indexes the most frequent attribute values. The second index (named DySNI) is a tree-based dynamic indexing technique that is tailored for real-time ER. DySNI is based on the sorted neighborhood method that is commonly used in ER. We investigate several static and adaptive window approaches when retrieving candidate records. The third index (named F-DySNI) is a multi-tree technique that uses multiple distinct trees in the index data structure where each tree has a unique sorting key. The aim of F-DySNI is to reduce the effects of errors and variations at the beginning of attribute values that are used as sorting keys on matching quality. Finally, we propose an unsupervised learning algorithm that automatically generates optimal blocking keys for building indexes that are adequate for real-time ER. We experimentally evaluate the proposed approaches using various real-world data sets with millions of records and synthetic data sets with different data characteristics. The results show that, for the growing sizes of our indexing solutions, no appreciable increase occurs in both record insertion and query times. DySNI is the fastest amongst the proposed solutions, while F-DySNI achieves better matching quality. Compared to an existing indexing baseline, our proposed techniques achieve better query times and matching quality. Moreover, our blocking key learning algorithm achieves an average query time that is around two orders of magnitude faster than an existing learning baseline while maintaining similar matching quality. Our proposed solutions are therefore shown to be suitable for real-time ER
    corecore