467 research outputs found
Reasoning about Social Semantic Web Applications using String Similarity and Frame Logic
Social semantic Web (Web 3.0) applications have recently gained major attention from academia and industry. Such applications try to take advantage of user-supplied metadata, using ideas from the semantic Web initiative, in order to provide better services. An open problem is the formalization of such metadata, due to its complex and often inconsistent nature. A possible solution to these inconsistencies is the use of string similarity metrics, which are explained and analyzed here. A study of their performance and applicability in a frame logic environment is conducted for the case of agents reasoning about multiple domains in TaOPis, a social semantic Web application for self-organizing communities. Results show that the NYSIIS metric yields surprisingly good results on Croatian words and phrases
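Phonetic metrics such as NYSIIS map words to codes so that similar-sounding words share the same key, which makes them robust to the vowel-level inconsistencies the abstract mentions. The full NYSIIS rule set is lengthy, so the sketch below uses a deliberately simplified phonetic key (not the real NYSIIS rules) purely to illustrate the bucketing idea; the function name and rules are illustrative assumptions.

```python
def toy_phonetic_key(word: str) -> str:
    """Simplified phonetic key (NOT real NYSIIS): keep the first letter,
    drop all later vowels, and collapse repeated consonants."""
    word = word.lower()
    if not word:
        return ""
    key = word[0]
    for ch in word[1:]:
        if ch in "aeiou":
            continue            # later vowels carry little phonetic signal
        if key[-1] == ch:
            continue            # collapse doubled letters
        key += ch
    return key

# Spelling variants that differ only in vowels or doubling share a key:
print(toy_phonetic_key("Marija"))  # -> "mrj"
print(toy_phonetic_key("Mareja"))  # -> "mrj"
```

Two strings are then considered similar when their keys match, turning fuzzy comparison into an exact lookup that a frame logic engine can handle efficiently.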
Enhanced Levenshtein Edit Distance Method functioning as a String-to-String Similarity Measure
Levenshtein distance is a minimum edit distance method, usually used in spell-checking applications for generating candidates. The method computes the number of edit operations required to transform one string into another, and it recognizes three types of edit operations: deletion, insertion, and substitution of a single letter. Damerau modified the Levenshtein method to consider a fourth type of edit operation, the transposition of two adjacent letters, in addition to the three original types. However, the modification adds to the quadratic time complexity of the original method. In this paper, we propose a modification of the original Levenshtein method that considers the same four edit types using a very small number of matching operations, resulting in a shorter execution time. A similarity measure is also derived that exploits the distance returned by any edit distance method to quantify the similarity between two given strings
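The four edit operations described above correspond to the restricted (optimal string alignment) variant of Damerau–Levenshtein. The sketch below is a generic textbook implementation, not the paper's optimized method, and the derived score 1 − d/max(|a|,|b|) is one common way, assumed here, of turning the distance into a similarity measure.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Optimal-string-alignment distance: insertion, deletion,
    substitution, and transposition of two adjacent letters."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

def similarity(a: str, b: str) -> float:
    """Turn the distance into a 0..1 similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - damerau_levenshtein(a, b) / max(len(a), len(b))

print(damerau_levenshtein("hte", "the"))   # 1: one transposition
print(round(similarity("hte", "the"), 2))  # 0.67
```

Note that plain Levenshtein scores "hte" vs "the" as 2 (two substitutions), while the transposition rule counts it as a single edit, which is exactly why Damerau's extension matters for typing errors.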
Automatic Spelling Corrector to improve Unified Registry analysis for Brazilian social development: adjustment of an automatic spelling corrector system applied to Brazilian low-income families' data
Dissertation presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. This dissertation aims to develop a solution for correcting spelling errors in the neighborhood names entered for Brazilian low-income families. The Brazilian government uses data from the Unified Registry (Cadastro Único) to diagnose the basic social rights of low-income families and to shape public policies based on the real needs of Brazilian society. The best solution found was an adjustment of a string correction method, the Automatic Spelling Correction (ASC) system, together with an automatic dictionary creator, applied to the CECAD registry family status data to correct wrongly typed neighborhood names. The research describes the algorithm's process with an explanation of the main mathematical concepts
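The dissertation's ASC system and its generated dictionary are not reproduced in the abstract. As a rough stand-in for the dictionary-lookup step, Python's standard difflib can pick the closest valid entry for a mistyped name; the neighborhood names below are sample data, not the CECAD registry.

```python
import difflib

# Hypothetical dictionary of valid neighborhood names (sample data only).
neighborhoods = ["Copacabana", "Ipanema", "Leblon", "Botafogo"]

def correct(name: str, dictionary: list[str]) -> str:
    """Return the closest dictionary entry, or the input unchanged
    if nothing is similar enough to be a plausible correction."""
    matches = difflib.get_close_matches(name, dictionary, n=1, cutoff=0.6)
    return matches[0] if matches else name

print(correct("Copacabanna", neighborhoods))  # -> "Copacabana"
print(correct("Ipanma", neighborhoods))       # -> "Ipanema"
```

The `cutoff` threshold is the key design choice: too low and unrelated names get "corrected", too high and genuine typos pass through unchanged.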
Damerau Levenshtein Distance for Indonesian Spelling Correction
Word correction is used to find incorrectly spelled words in writing. Levenshtein distance is one algorithm for correcting typing errors: it calculates the difference between two strings using insertion, deletion, and substitution operations. However, this algorithm has a disadvantage: it cannot handle two switched letters within the same word. The algorithm that solves this issue is Damerau-Levenshtein. This research analyses the Damerau-Levenshtein algorithm as applied to correcting Indonesian spelling. The dataset consists of two fairy tale stories with a total of 1266 words and 100 typing errors. Between the two algorithms, accuracy reaches 73% with Levenshtein distance and 75% with Damerau-Levenshtein
Typo handling in searching of Quran verse based on phonetic similarities
The Quran search system was built to make it easier for Indonesians to find a verse from text written according to Indonesian pronunciation; this is a solution for users who have difficulty writing or typing Arabic characters. A Quran search system based on phonetic similarity can make it easier for Indonesian Muslims to find a particular verse. Lafzi was one of the systems developed for this kind of search, and it was later extended under the name Lafzi+. The Lafzi+ system can handle searches with typo queries, but it covers only a few variations of typing error types. In this research, Lafzi++, an improvement over the previous development, handles more typographical error types by applying typo correction: the autocomplete method corrects faulty queries and Damerau-Levenshtein distance calculates the edit distance, so that the system can provide query suggestions when a user mistypes a search, whether by substitution, insertion, deletion, or transposition. Users can also search easily because Latin characters matching Indonesian pronunciation are used. The evaluation results show that the system improves on its predecessor: the accuracy on every tested query surpasses that of the previous system, with a highest recall of 96.20% and a highest Mean Average Precision (MAP) of 90.69%
Context-sensitive Spelling Correction Using Google Web 1T 5-Gram Information
In computing, spell checking is the process of detecting, and sometimes providing suggestions for, incorrectly spelled words in a text. Basically, a spell checker is a computer program that uses a dictionary of words to perform spell checking. The bigger the dictionary, the higher the error detection rate. Because spell checkers are based on regular dictionaries, they suffer from a data sparseness problem: they cannot capture a large vocabulary of words including proper names, domain-specific terms, technical jargon, special acronyms, and terminologies. As a result, they exhibit low error detection rates and often fail to catch major errors in the text. This paper proposes a new context-sensitive spelling correction method for detecting and correcting non-word and real-word errors in digital text documents. The approach hinges on data statistics from the Google Web 1T 5-gram data set, which consists of a large volume of n-gram word sequences extracted from the World Wide Web. Fundamentally, the proposed method comprises an error detector that detects misspellings, a candidate spellings generator based on a character 2-gram model that generates correction suggestions, and an error corrector that performs contextual error correction. Experiments conducted on a set of text documents from different domains and containing misspellings showed an outstanding spelling error correction rate and a drastic reduction of both non-word and real-word errors. In a further study, the proposed algorithm is to be parallelized so as to lower the computational cost of the error detection and correction processes.
Comment: LACSC - Lebanese Association for Computational Sciences - http://www.lacsc.or
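The Google Web 1T counts themselves are not redistributable here, so the sketch below mimics the detect/generate/rank pipeline at toy scale: flag words missing from a dictionary, generate edit-distance-1 candidates, then rank them by word-bigram counts from a stand-in corpus. All words and counts are illustrative assumptions, not the paper's data.

```python
from collections import Counter

# Stand-in corpus replacing the Google Web 1T counts (illustrative only).
corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))  # (prev_word, word) -> count
dictionary = set(corpus)

def edits1(word: str) -> set[str]:
    """All strings one edit away (deletion, substitution, insertion)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    subs = {l + c + r[1:] for l, r in splits if r for c in letters}
    inserts = {l + c + r for l, r in splits for c in letters}
    return deletes | subs | inserts

def correct_in_context(prev: str, word: str) -> str:
    """Pick the dictionary candidate that best fits after `prev`,
    scored by how often the (prev, candidate) bigram was observed."""
    candidates = (edits1(word) | {word}) & dictionary
    if not candidates:
        return word  # error detected but no plausible correction
    return max(candidates, key=lambda w: bigrams.get((prev, w), 0))

print(correct_in_context("the", "cst"))  # -> "cat"
```

The same context scoring is what lets the full method catch real-word errors: a valid word that never co-occurs with its neighbors scores lower than an edit-distance-1 alternative that does.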
A Comparative Study for String Metrics and the Feasibility of Joining them as Combined Text Similarity Measures
This paper introduces an optimized Damerau–Levenshtein and Dice-coefficient measure using enumeration operations (ODADNEN) that provides a fast string similarity measure while maintaining accuracy. Searching for specific words within a large text is a hard job that takes a lot of time and effort, and the string similarity measure plays a critical role in many searching problems. In this paper, different experiments were conducted to handle some spelling mistakes, and an enhanced algorithm for string similarity assessment was proposed. This algorithm combines a set of well-known algorithms with some improvements (e.g. the Dice coefficient was modified to deal with numbers instead of characters under certain conditions). These algorithms were adopted after a number of experimental tests checking their suitability. The ODADNEN algorithm was tested using real data, and its performance was compared with the original similarity measures. The results indicated that the most convincing measure is the proposed hybrid, which uses Damerau–Levenshtein and the Dice distance based on n-grams of each word; it also requires less processing time than the standard algorithms. Furthermore, it provides efficient results for assessing the similarity between two words without needing to restrict the word length
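The abstract does not spell out ODADNEN's exact combination rule, but its generic building block is the Dice coefficient over character bigrams, optionally blended with an edit-based similarity. The sketch below shows that building block; the 50/50 weighting and the use of difflib's ratio as a stand-in for normalized Damerau–Levenshtein are assumptions, not the paper's formula.

```python
from difflib import SequenceMatcher

def char_bigrams(s: str) -> list[str]:
    """Character 2-grams of a string."""
    return [s[i:i + 2] for i in range(len(s) - 1)]

def dice(a: str, b: str) -> float:
    """Dice coefficient over character bigrams: 2|A∩B| / (|A|+|B|)."""
    ba, bb = char_bigrams(a), char_bigrams(b)
    if not ba or not bb:
        return 1.0 if a == b else 0.0
    overlap = sum(min(ba.count(g), bb.count(g)) for g in set(ba))
    return 2 * overlap / (len(ba) + len(bb))

def combined(a: str, b: str) -> float:
    """Unweighted blend of the n-gram view and the edit-operation view
    of similarity (the 0.5/0.5 weights are an assumption)."""
    return 0.5 * dice(a, b) + 0.5 * SequenceMatcher(None, a, b).ratio()

print(round(dice("night", "nacht"), 2))  # -> 0.25 (only "ht" is shared)
```

Blending the two views is what makes such hybrids attractive: n-gram overlap is cheap and order-tolerant, while edit similarity penalizes rearrangements, so their combination discriminates better than either alone.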