Automated Detection of Usage Errors in non-native English Writing
We investigate the use of a novelty detection algorithm for identifying inappropriate word combinations in a raw English corpus. We employ an unsupervised detection algorithm based on one-class support vector machines (OC-SVMs) and extract sentences containing word sequences whose frequency of appearance is significantly low in native English writing. Combined with n-gram language models and document categorization techniques, the OC-SVM classifier assigns given sentences to one of two groups: sentences containing errors and sentences without errors. Accuracies are 79.30% with the bigram model, 86.63% with the trigram model, and 34.34% with the four-gram model.
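The core idea of flagging word sequences that are rare in native writing can be illustrated with a minimal sketch. This is not the OC-SVM classifier the abstract describes: it swaps in a plain n-gram count threshold, and the function names, the toy reference sentences, and the `min_count` parameter are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_counts(reference_sentences, n):
    """Count n-grams in a reference corpus (here a toy stand-in for native text)."""
    counts = Counter()
    for sentence in reference_sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts


def rare_ngrams(sentence, counts, n, min_count=1):
    """Return the n-grams of `sentence` seen fewer than `min_count` times
    in the reference corpus. An empty result means nothing was flagged."""
    tokens = sentence.lower().split()
    return [g for g in ngrams(tokens, n) if counts[g] < min_count]


# Hypothetical three-sentence "native" corpus, for illustration only.
reference = ["i have a book", "she has a book", "i have a pen"]
bigram_counts = build_counts(reference, 2)

flagged = rare_ngrams("i has a book", bigram_counts, 2)
clean = rare_ngrams("she has a book", bigram_counts, 2)
```

Here `flagged` holds the single unseen bigram `("i", "has")`, while `clean` is empty. A real system would need a large native corpus and, as in the abstract, a learned decision boundary rather than a fixed count threshold.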
Detecting word substitutions in text
Searching for words on a watchlist is one way in which large-scale surveillance of communication can be done, for example in intelligence and counterterrorism settings. One obvious defense is to replace words that might attract attention to a message with other, more innocuous, words. For example, the sentence "the attack will be tomorrow" might be altered to "the complex will be tomorrow", since 'complex' is a word whose frequency is close to that of 'attack'. Such substitutions are readily detectable by humans since they do not make sense. We address the problem of detecting such substitutions automatically, by looking for discrepancies between words and their contexts, and using only syntactic information. We define a set of measures, each of which is quite weak, but which together produce per-sentence detection rates around 90% with false positive rates around 10%. Rules for combining per-sentence detection into per-message detection can reduce the false positive and false negative rates for messages to practical levels. We test the approach using sentences from the Enron email and Brown corpora, representing informal and formal text respectively.
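To see how combining per-sentence decisions might lower message-level error rates, consider a worked calculation. The abstract does not state the combination rules, so the "flag the message if at least k of its n sentences are flagged" rule below is an assumption, as are the independence of sentence-level decisions and the simplification that every sentence of a suspicious message contains a substitution. The function names are hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability of at least k successes in n independent trials
    that each succeed with probability p (binomial upper tail)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def message_rates(n, k, sentence_tpr=0.90, sentence_fpr=0.10):
    """Message-level (detection rate, false positive rate) under the assumed
    k-of-n rule, using the approximate per-sentence rates from the abstract."""
    return prob_at_least(k, n, sentence_tpr), prob_at_least(k, n, sentence_fpr)


# Assumed scenario: 5-sentence messages, flagged when 3 or more sentences are flagged.
tpr, fpr = message_rates(n=5, k=3)
```

Under these assumptions the message-level detection rate is about 0.991 and the false positive rate about 0.009, compared with 0.90 and 0.10 per sentence. Real messages will violate the independence assumption to some degree, so these figures are illustrative rather than predictions of the paper's results.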