Context-sensitive Spelling Correction Using Google Web 1T 5-Gram Information
In computing, spell checking is the process of detecting, and sometimes
providing suggestions for, incorrectly spelled words in a text. At its core, a
spell checker is a computer program that uses a dictionary of words to perform
spell checking; the bigger the dictionary, the higher the error detection rate.
Because spell checkers are based on regular dictionaries, they suffer from a
data-sparseness problem: they cannot capture the large vocabulary of proper
names, domain-specific terms, technical jargon, special acronyms, and other
terminology. As a result, they exhibit low error detection rates and often fail
to catch major errors in the text. This paper proposes a new context-sensitive
spelling correction method for detecting and correcting non-word and real-word
errors in digital text documents. The approach hinges on statistics from the
Google Web 1T 5-gram data set, which consists of a large volume of n-gram word
sequences extracted from the World Wide Web. Fundamentally, the proposed method
comprises an error detector that detects misspellings, a candidate-spelling
generator based on a character 2-gram model that generates correction
suggestions, and an error corrector that performs contextual error correction.
Experiments conducted on a set of text documents from different domains, each
containing misspellings, showed an outstanding spelling error correction rate
and a drastic reduction of both non-word and real-word errors. In future work,
the proposed algorithm is to be parallelized so as to lower the computational
cost of the error detection and correction processes.
Comment: LACSC - Lebanese Association for Computational Sciences -
http://www.lacsc.or
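
The contextual-correction step lends itself to a short illustration. Below is a
minimal sketch, assuming a toy in-memory table of word-trigram counts standing
in for the Google Web 1T 5-gram data set; the names NGRAM_COUNTS and DICTIONARY
and the 0.4 similarity cutoff are illustrative assumptions, not the authors'
implementation.

# Toy sketch: candidate spellings from character-2-gram overlap,
# disambiguated by word-context frequency (stand-in for Web 1T counts).

NGRAM_COUNTS = {  # hypothetical counts; the real data set holds billions
    ("a", "piece", "of"): 120000,
    ("a", "peace", "of"): 300,
}
DICTIONARY = {"a", "piece", "peace", "of"}

def char_bigrams(word):
    padded = "#" + word + "#"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def candidates(word, dictionary, cutoff=0.4):
    # Rank dictionary words by character-2-gram Dice similarity.
    wb = char_bigrams(word)
    scored = []
    for cand in dictionary:
        cb = char_bigrams(cand)
        dice = 2 * len(wb & cb) / (len(wb) + len(cb))
        if dice > cutoff:
            scored.append((dice, cand))
    return [cand for _, cand in sorted(scored, reverse=True)]

def correct(tokens, i, dictionary):
    # Keep the candidate whose surrounding trigram is most frequent.
    best, best_count = tokens[i], -1
    for cand in candidates(tokens[i], dictionary):
        count = NGRAM_COUNTS.get((tokens[i - 1], cand, tokens[i + 1]), 0)
        if count > best_count:
            best, best_count = cand, count
    return best

print(correct(["a", "peace", "of"], 1, DICTIONARY))  # -> "piece"

A real-word error like "peace" here passes a dictionary check, so only the
context counts reveal the mistake; this is the distinction between non-word and
real-word correction drawn in the abstract.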
An Experimental Evaluation of Nearest Neighbour Time Series Classification
Data mining research into time series classification (TSC) has focussed on alternative distance measures for nearest neighbour classifiers. It is standard practice to use 1-NN with Euclidean or dynamic time warping (DTW) distance as a straw man for comparison. As part of a wider investigation into elastic distance measures for TSC [lines14elastic], we perform a series of experiments to test whether this standard practice is valid. Specifically, we compare 1-NN classifiers with Euclidean and DTW distance to standard classifiers, examine whether the performance of 1-NN Euclidean approaches that of 1-NN DTW as the number of cases increases, assess whether there is any benefit of setting k for k-NN through cross validation, ask whether it is worth setting the warping window for DTW through cross validation, and finally ask whether it is better to use a window or a weighting for DTW. Based on experiments on 77 problems, we conclude that 1-NN with Euclidean distance is fairly easy to beat, but 1-NN with DTW is not, if the window size is set through cross validation.
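
Since the conclusion turns on 1-NN DTW with a tuned window, a compact sketch
may help; this is a textbook formulation with a Sakoe-Chiba band, not the
authors' code, and the function names are illustrative.

import numpy as np

def dtw_distance(a, b, window=None):
    # Dynamic time warping with an optional Sakoe-Chiba warping window.
    n, m = len(a), len(b)
    w = max(window if window is not None else max(n, m), abs(n - m))
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

def one_nn_dtw(train_X, train_y, query, window=None):
    # Classify a query series by its single nearest training series.
    dists = [dtw_distance(x, query, window) for x in train_X]
    return train_y[int(np.argmin(dists))]

The window parameter would then be chosen by cross validation on the training
data, which is the setting under which 1-NN DTW is reported to be hard to beat.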
Learning Character Strings via Mastermind Queries, with a Case Study Involving mtDNA
We study the degree to which a character string, X, leaks details about
itself any time it engages in comparison protocols with strings provided by a
querier, Bob, even if those protocols are cryptographically guaranteed to
produce no additional information other than the scores that assess the degree
to which X matches strings offered by Bob. We show that such scenarios allow
Bob to play variants of the game of Mastermind with X so as to learn the
complete identity of X. We show that there are a number of efficient
implementations for Bob to employ in these Mastermind attacks, depending on the
knowledge he has about the structure of X, which show how quickly he can
determine X. Indeed, we show that Bob can discover X using a number of
rounds of test comparisons that is much smaller than the length of X, under
reasonable assumptions regarding the types of scores that are returned by the
cryptographic protocols and whether he can use knowledge about the distribution
that X comes from. We also provide the results of a case study we performed
on a database of mitochondrial DNA, showing the vulnerability of existing
real-world DNA data to the Mastermind attack.
Comment: Full version of related paper appearing in IEEE Symposium on Security
and Privacy 2009, "The Mastermind Attack on Genomic Data." This version
corrects the proofs of what are now Theorems 2 and 4
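
To convey the flavour of the attack, here is a toy sketch under strong
simplifying assumptions: the comparison protocol returns only the number of
positions at which a guess agrees with the hidden string, and the attacker
spends on the order of length-times-alphabet queries, far more than the bounds
proved in the paper. The oracle and strategy are illustrative only.

ALPHABET = "ACGT"  # e.g., DNA symbols, as in the mtDNA case study

def make_oracle(secret):
    # Stand-in for a privacy-preserving comparison that leaks only a score.
    def score(guess):
        return sum(g == s for g, s in zip(guess, secret))
    return score

def mastermind_attack(score, length):
    base_char = ALPHABET[0]
    base = score(base_char * length)  # matches contributed by base_char alone
    recovered = []
    for i in range(length):
        found = base_char
        for c in ALPHABET[1:]:
            probe = [base_char] * length
            probe[i] = c
            # The score rises by exactly 1 iff position i holds symbol c.
            if score("".join(probe)) == base + 1:
                found = c
                break
        recovered.append(found)
    return "".join(recovered)

oracle = make_oracle("GATTACA")
print(mastermind_attack(oracle, 7))  # -> "GATTACA"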
Provably Good Solutions to the Knapsack Problem via Neural Networks of Bounded Size
The development of a satisfying and rigorous mathematical understanding of
the performance of neural networks is a major challenge in artificial
intelligence. Against this background, we study the expressive power of neural
networks through the example of the classical NP-hard Knapsack Problem. Our
main contribution is a class of recurrent neural networks (RNNs) with rectified
linear units that are iteratively applied to each item of a Knapsack instance
and thereby compute optimal or provably good solution values. We show that an
RNN of depth four and width depending quadratically on the profit of an optimum
Knapsack solution is sufficient to find optimum Knapsack solutions. We also
prove the following tradeoff between the size of an RNN and the quality of the
computed Knapsack solution: for Knapsack instances consisting of $n$ items, an
RNN of depth five and width $w$ computes a solution of value at least
$1 - \mathcal{O}(n^2/\sqrt{w})$ times the optimum solution value. Our results
build upon a classical dynamic programming formulation of the Knapsack Problem
as well as a careful rounding of profit values that are also at the core of the
well-known fully polynomial-time approximation scheme for the Knapsack Problem.
A carefully conducted computational study qualitatively supports our
theoretical size bounds. Finally, we point out that our results can be
generalized to many other combinatorial optimization problems that admit
dynamic programming solution methods, such as various Shortest Path Problems,
the Longest Common Subsequence Problem, and the Traveling Salesperson Problem.
Comment: A short version of this paper appears in the proceedings of AAAI 202
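
The profit-indexed dynamic program the construction builds on is worth seeing
concretely. The sketch below is the textbook formulation (minimum weight per
achievable profit), not the authors' RNN; processing items one at a time
mirrors the iterative application of the network.

import math

def knapsack_by_profit(items, capacity):
    # dp[p] = minimum total weight needed to reach profit exactly p.
    p_max = sum(profit for profit, _ in items)
    dp = [math.inf] * (p_max + 1)
    dp[0] = 0
    for profit, weight in items:
        # Iterate downwards so each item is used at most once (0/1 knapsack).
        for p in range(p_max, profit - 1, -1):
            dp[p] = min(dp[p], dp[p - profit] + weight)
    # Largest profit whose minimum weight still fits the capacity.
    return max(p for p in range(p_max + 1) if dp[p] <= capacity)

items = [(60, 10), (100, 20), (120, 30)]  # (profit, weight) pairs
print(knapsack_by_profit(items, 50))  # -> 220

The fully polynomial-time approximation scheme mentioned above arises from
rounding the profits down to a coarse grid before running this recursion, the
same rounding that drives the width-versus-quality tradeoff.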
Mining Heterogeneous Multivariate Time-Series for Learning Meaningful Patterns: Application to Home Health Telecare
In recent years, time-series mining has become a challenging issue for
researchers. An important application lies in monitoring, which requires
analyzing large sets of time series to learn usual patterns. Any deviation from
this learned profile is then considered an unexpected situation. Moreover,
complex applications may involve the temporal study of several heterogeneous
parameters. In this paper, we propose a method for mining heterogeneous
multivariate time series to learn meaningful patterns. The proposed approach
allows for mixed time series -- containing both pattern and non-pattern data --
as well as for imprecise matches, outliers, and stretching and global
translation of pattern instances in time. We present the early results of our
approach in the context of monitoring the health status of a person at home.
The purpose is to build a behavioural profile of a person by analyzing the time
variations of several quantitative or qualitative parameters recorded through a
set of sensors installed in the home.
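
As a rough illustration of the monitoring loop described above (learn a usual
profile, then flag deviations), here is a deliberately simplified sketch that
summarizes each parameter by a mean and spread; the parameter names, the
z-score rule, and the threshold are assumptions for illustration, much cruder
than the pattern-based profiles the paper proposes.

import numpy as np

def learn_profile(history):
    # history: parameter name -> list of daily summary values.
    return {name: (np.mean(vals), np.std(vals))
            for name, vals in history.items()}

def deviations(profile, today, z_threshold=3.0):
    # Flag parameters whose value today departs from the learned profile.
    flagged = []
    for name, value in today.items():
        mean, std = profile[name]
        if std > 0 and abs(value - mean) / std > z_threshold:
            flagged.append(name)
    return flagged

history = {
    "night_activity_min": [12, 9, 14, 11, 10],  # hypothetical sensor data
    "fridge_openings": [8, 7, 9, 8, 8],
}
profile = learn_profile(history)
print(deviations(profile, {"night_activity_min": 55, "fridge_openings": 8}))
# -> ['night_activity_min']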