10,774 research outputs found
A comparison of standard spell checking algorithms and a novel binary neural approach
In this paper, we propose a simple, flexible, and efficient hybrid spell checking methodology based upon phonetic matching, supervised learning, and associative matching in the AURA neural system. We integrate Hamming Distance and n-gram algorithms that have high recall for typing errors and a phonetic spell-checking algorithm in a single novel architecture. Our approach is suitable for any spell checking application though aimed toward isolated word error correction, particularly spell checking user queries in a search engine. We use a novel scoring scheme to integrate the retrieved words from each spelling approach and calculate an overall score for each matched word. From the overall scores, we can rank the possible matches. In this paper, we evaluate our approach against several benchmark spellchecking algorithms for recall accuracy. Our proposed hybrid methodology has the highest recall rate of the techniques evaluated. The method has a high recall rate and low-computational cost
Earth Observations Division version of the Laboratory for Applications of Remote Sensing system (EOD-LARSYS) user guide for the IBM 370/148. Volume 2: User's reference manual
There are no author-identified significant results in this report
Simple, compact and robust approximate string dictionary
This paper is concerned with practical implementations of approximate string
dictionaries that allow edit errors. In this problem, we have as input a
dictionary of strings of total length over an alphabet of size
. Given a bound and a pattern of length , a query has to
return all the strings of the dictionary which are at edit distance at most
from , where the edit distance between two strings and is defined as
the minimum-cost sequence of edit operations that transform into . The
cost of a sequence of operations is defined as the sum of the costs of the
operations involved in the sequence. In this paper, we assume that each of
these operations has unit cost and consider only three operations: deletion of
one character, insertion of one character and substitution of a character by
another. We present a practical implementation of the data structure we
recently proposed and which works only for one error. We extend the scheme to
. Our implementation has many desirable properties: it has a very
fast and space-efficient building algorithm. The dictionary data structure is
compact and has fast and robust query time. Finally our data structure is
simple to implement as it only uses basic techniques from the literature,
mainly hashing (linear probing and hash signatures) and succinct data
structures (bitvectors supporting rank queries).Comment: Accepted to a journal (19 pages, 2 figures
Handling Massive N-Gram Datasets Efficiently
This paper deals with the two fundamental problems concerning the handling of
large n-gram language models: indexing, that is compressing the n-gram strings
and associated satellite data without compromising their retrieval speed; and
estimation, that is computing the probability distribution of the strings from
a large textual source. Regarding the problem of indexing, we describe
compressed, exact and lossless data structures that achieve, at the same time,
high space reductions and no time degradation with respect to state-of-the-art
solutions and related software packages. In particular, we present a compressed
trie data structure in which each word following a context of fixed length k,
i.e., its preceding k words, is encoded as an integer whose value is
proportional to the number of words that follow such context. Since the number
of words following a given context is typically very small in natural
languages, we lower the space of representation to compression levels that were
never achieved before. Despite the significant savings in space, our technique
introduces a negligible penalty at query time. Regarding the problem of
estimation, we present a novel algorithm for estimating modified Kneser-Ney
language models, that have emerged as the de-facto choice for language modeling
in both academia and industry, thanks to their relatively low perplexity
performance. Estimating such models from large textual sources poses the
challenge of devising algorithms that make a parsimonious use of the disk. The
state-of-the-art algorithm uses three sorting steps in external memory: we show
an improved construction that requires only one sorting step thanks to
exploiting the properties of the extracted n-gram strings. With an extensive
experimental analysis performed on billions of n-grams, we show an average
improvement of 4.5X on the total running time of the state-of-the-art approach.Comment: Published in ACM Transactions on Information Systems (TOIS), February
2019, Article No: 2
Earth Observations Division version of the Laboratory for Applications of Remote Sensing System (EOD-LARSYS) user guide for the IBM 370/148. Volume 2: User reference manual
This document presents instructions for analysts who use the EOD-LARSYS as programmed on the Purdue University IBM 370/148 (recently replaced by the IBM 3031) computer. It presents sample applications, control cards, and error messages for all processors in the system and gives detailed descriptions of the mathematical procedures and information needed to execute the system and obtain the desired output. EOD-LARSYS is the JSC version of an integrated batch system for analysis of multispectral scanner imagery data. The data included is designed for use with the as built documentation (volume 3) and the program listings (volume 4). The system is operational from remote terminals at Johnson Space Center under the virtual machine/conversational monitor system environment
An improved Levenshtein algorithm for spelling correction word candidate list generation
Candidates’ list generation in spelling correction is a process of finding words from a lexicon that should be close to the incorrect word. The most widely used algorithm for generating candidates’ list for incorrect words is based on Levenshtein distance. However, this algorithm takes too much time when there is a large number of spelling errors. The reason is that calculating Levenshtein algorithm includes operations that create an array and fill the cells of this array by comparing the characters of an incorrect word with the characters of a word from a lexicon. Since most lexicons contain millions of words, then these operations will be repeated millions of times for each incorrect word to generate its candidates list. This dissertation improved Levenshtein algorithm by designing an operational technique that has been included in this algorithm. The proposed operational technique enhances Levenshtein algorithm in terms of the processing time of its executing without affecting its accuracy. It reduces the operations required to measure cells’ values in the first row, first column, second row, second column, third row, and third column in Levenshtein array. The improved Levenshtein algorithm was evaluated against the original algorithm. Experimental results show that the proposed algorithm outperforms Levenshtein algorithm in terms of the processing time by 36.45% while the accuracy of both algorithms is still the same
- …