51 research outputs found

Approximate string matching methods for duplicate detection and clustering tasks

Author: Rudniy Oleksandr
Publication venue: Digital Commons @ NJIT
Publication date: 31/01/2012

Approximate string matching methods are used by a vast number of duplicate detection and clustering applications in various knowledge domains. The application area is expected to grow due to the recent significant increase in the amount of digital data and knowledge sources. Despite the large number of existing string similarity metrics, there is a need for more precise approximate string matching methods to improve the efficiency of computer-driven data processing, thus decreasing labor-intensive human involvement. This work introduces a family of novel string similarity methods, which outperform a number of effective, well-known, and widely used string similarity functions. The new algorithms are designed to overcome the most common problem of the existing methods, which is the lack of context sensitivity. In this evaluation, the Longest Approximately Common Prefix (LACP) method achieved the highest values of average precision and maximum F1 on three out of four medical informatics datasets used. Within the set of evaluated algorithms, the LACP also demonstrated the lowest execution time, owing to its linear computational complexity. An online interactive spell checker of biomedical terms was developed based on the LACP method. The main goal of the spell checker was to evaluate whether the LACP method makes it possible to estimate the similarity of the resulting sets at a glance. The Shortest Path Edit Distance (SPED) outperformed all evaluated similarity functions and gained the highest possible values of the average precision and maximum F1 measures on the bioinformatics datasets. The SPED design was inspired by the preceding work on the Markov Random Field Edit Distance (MRFED). The SPED eliminates two shortcomings of the MRFED, namely prolonged execution time and moderate performance. Four modifications of the Histogram Difference (HD) method demonstrated the best performance on the majority of the life and social sciences data sources used in the experiments.
The modifications of the HD algorithm were achieved using several re-scorers: HD with Normalized Smith-Waterman Re-scorer, HD with TFIDF and Jaccard re-scorers, HD with the Longest Common Prefix and TFIDF re-scorers, and HD with the Unweighted Longest Common Prefix Re-scorer. Another contribution of this dissertation is an extensive evaluation of string similarity methods for duplicate detection and clustering tasks in the life and social sciences, bioinformatics, and medical informatics domains. The experimental results are illustrated with precision-recall charts and a number of tables presenting the average precision, maximum F1, and execution time.

Digital Commons @ New Jersey Institute of Technology (NJIT)

State-of-the-art of related technologies to Alfanet

Authors: Arana Cristina, Ayala Antonio, Barrera Carmen, Boticario Jesús, Brouns Francis, De Croock Marcel, Gaudioso Elena, Hernández Félix, Mofers Frans, Santos Olga, Trueba Irma, Van Rosmalen Peter, Van Veen Maarten
Publication date: 25/10/2002

Open University of the Netherlands Research Portal

Semantic discovery and reuse of business process patterns

Authors: Aldin L, de Cesare S, Lycett M
Publication venue: Athens University of Economics and Business
Publication date: 01/01/2009

Patterns currently play an important role in modern information systems (IS) development, although their use has mainly been restricted to the design and implementation phases of the development lifecycle. Given the increasing significance of business modelling in IS development, patterns have the potential to provide a viable solution for promoting reusability of recurrent generalized models in the very early stages of development. As a statement of research-in-progress, this paper focuses on business process patterns and proposes an initial methodological framework for the discovery and reuse of business process patterns within the IS development lifecycle. The framework borrows ideas from the domain engineering literature and proposes the use of semantics to drive both the discovery of patterns and their reuse.

Brunel University Research Archive

AIS Electronic Library (AISeL)

Acta Cybernetica : Volume 25. Number 2.

Publication venue: University of Szeged
Publication date: 01/01/2021

University of Szeged