866 research outputs found
Memory-Based Lexical Acquisition and Processing
Current approaches to computational lexicology in language technology are
knowledge-based (competence-oriented) and try to abstract away from specific
formalisms, domains, and applications. This results in severe complexity,
acquisition and reusability bottlenecks. As an alternative, we propose a
particular performance-oriented approach to Natural Language Processing based
on automatic memory-based learning of linguistic (lexical) tasks. The
consequences of the approach for computational lexicology are discussed, and
the application of the approach on a number of lexical acquisition and
disambiguation tasks in phonology, morphology and syntax is described.Comment: 18 page
Nearest Neighbor-Based Indonesian G2P Conversion
Grapheme-to-phoneme conversion (G2P), also known as letter-to-sound conversion, is an important module in both speech synthesis and speech recognition. The methods of G2P give varying accuracies for different languages although they are designed to be language independent. This paper discusses a new model based on pseudo nearest neighbor rule (PNNR) for Indonesian G2P. In this model, partial orthogonal binary code for graphemes, contextual weighting, and neighborhood weighting are introduced. Testing to 9,604 unseen words shows that the model parameters are easy to be tuned to reach high accuracy. Testing to 123 sentences containing homographs shows that the model could disambiguate homographs if it uses long graphemic context. Compare to information gain tree, PNNR gives slightly higher phoneme error rate, but it could disambiguate homographs
Use of ensemble based on GA for imbalance problem
In real-world applications, it has been observed that class imbalance (significant differences in class prior probabilities) may produce an important deterioration of the classifier performance, in particular with patterns belonging to the less represented classes. One method to tackle this problem consists to resample the original training set, either by over-sampling the minority class and/or under-sampling the majority class. In this paper, we propose two ensemble models (using a modular neural network and the nearest neighbor rule) trained on datasets under-sampled with genetic algorithms. Experiments with real datasets demonstrate the effectiveness of the methodology here propose
Forgetting Exceptions is Harmful in Language Learning
We show that in language learning, contrary to received wisdom, keeping
exceptional training instances in memory can be beneficial for generalization
accuracy. We investigate this phenomenon empirically on a selection of
benchmark natural language processing tasks: grapheme-to-phoneme conversion,
part-of-speech tagging, prepositional-phrase attachment, and base noun phrase
chunking. In a first series of experiments we combine memory-based learning
with training set editing techniques, in which instances are edited based on
their typicality and class prediction strength. Results show that editing
exceptional instances (with low typicality or low class prediction strength)
tends to harm generalization accuracy. In a second series of experiments we
compare memory-based learning and decision-tree learning methods on the same
selection of tasks, and find that decision-tree learning often performs worse
than memory-based learning. Moreover, the decrease in performance can be linked
to the degree of abstraction from exceptions (i.e., pruning or eagerness). We
provide explanations for both results in terms of the properties of the natural
language processing tasks and the learning algorithms.Comment: 31 pages, 7 figures, 10 tables. uses 11pt, fullname, a4wide tex
styles. Pre-print version of article to appear in Machine Learning 11:1-3,
Special Issue on Natural Language Learning. Figures on page 22 slightly
compressed to avoid page overloa
Applying Supervised Learning Algorithms and a New Feature Selection Method to Predict Coronary Artery Disease
From a fresh data science perspective, this thesis discusses the prediction
of coronary artery disease based on genetic variations at the DNA base pair
level, called Single-Nucleotide Polymorphisms (SNPs), collected from the
Ontario Heart Genomics Study (OHGS).
First, the thesis explains two commonly used supervised learning algorithms,
the k-Nearest Neighbour (k-NN) and Random Forest classifiers, and includes a
complete proof that the k-NN classifier is universally consistent in any finite
dimensional normed vector space. Second, the thesis introduces two
dimensionality reduction steps, Random Projections, a known feature extraction
technique based on the Johnson-Lindenstrauss lemma, and a new method termed
Mass Transportation Distance (MTD) Feature Selection for discrete domains.
Then, this thesis compares the performance of Random Projections with the k-NN
classifier against MTD Feature Selection and Random Forest, for predicting
artery disease based on accuracy, the F-Measure, and area under the Receiver
Operating Characteristic (ROC) curve.
The comparative results demonstrate that MTD Feature Selection with Random
Forest is vastly superior to Random Projections and k-NN. The Random Forest
classifier is able to obtain an accuracy of 0.6660 and an area under the ROC
curve of 0.8562 on the OHGS genetic dataset, when 3335 SNPs are selected by MTD
Feature Selection for classification. This area is considerably better than the
previous high score of 0.608 obtained by Davies et al. in 2010 on the same
dataset.Comment: This is a Master of Science in Mathematics thesis under the
supervision of Dr. Vladimir Pestov and Dr. George Wells submitted on January
31, 2014 at the University of Ottawa; 102 pages and 15 figure
- …