A Winnow-Based Approach to Context-Sensitive Spelling Correction

Author: Golding Andrew R.
Roth Dan
Publication date: 31/10/1998

A large class of machine-learning problems in natural language require the characterization of linguistic context. Two characteristic properties of such problems are that their feature space is of very high dimensionality, and their target concepts refer to only a small subset of the features in the space. Under such conditions, multiplicative weight-update algorithms such as Winnow have been shown to have exceptionally good theoretical properties. We present an algorithm combining variants of Winnow and weighted-majority voting, and apply it to a problem in the aforementioned class: context-sensitive spelling correction. This is the task of fixing spelling errors that happen to result in valid words, such as substituting "to" for "too", "casual" for "causal", etc. We evaluate our algorithm, WinSpell, by comparing it against BaySpell, a statistics-based method representing the state of the art for this task. We find: (1) When run with a full (unpruned) set of features, WinSpell achieves accuracies significantly higher than BaySpell was able to achieve in either the pruned or unpruned condition; (2) When compared with other systems in the literature, WinSpell exhibits the highest performance; (3) The primary reason that WinSpell outperforms BaySpell is that WinSpell learns a better linear separator; (4) When run on a test set drawn from a different corpus than the training set was drawn from, WinSpell is better able than BaySpell to adapt, using a strategy we will present that combines supervised learning on the training set with unsupervised learning on the (noisy) test set. Comment: To appear in Machine Learning, Special Issue on Natural Language Learning, 1999. 25 pages.
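The multiplicative weight update mentioned in the abstract can be illustrated with a minimal sketch of textbook Winnow. This is not WinSpell itself, which the abstract describes as combining variants of Winnow with weighted-majority voting. The function names, the promotion factor `alpha`, the threshold choice, and the toy data are all assumptions made for illustration.

```python
# Textbook Winnow sketch (assumed names and parameters; not the WinSpell algorithm).
# Each example is (active_feature_indices, label); inactive features are implicit zeros,
# which suits sparse, very high-dimensional feature spaces.

def winnow_train(examples, n_features, alpha=2.0, epochs=5):
    weights = [1.0] * n_features      # all weights start at 1
    threshold = float(n_features)     # a common textbook choice of threshold
    for _ in range(epochs):
        for active, label in examples:
            score = sum(weights[i] for i in active)
            predicted = score >= threshold
            if predicted == label:
                continue              # mistake-driven: update only on errors
            # Promote active weights on a missed positive, demote on a false positive.
            factor = alpha if label else 1.0 / alpha
            for i in active:
                weights[i] *= factor
    return weights, threshold

def winnow_predict(weights, threshold, active):
    return sum(weights[i] for i in active) >= threshold

# Toy data: the hidden target is "feature 0 OR feature 1" among 8 features.
examples = [
    ((0, 3), True),
    ((1, 4), True),
    ((2, 3, 4), False),
    ((5, 6), False),
]
weights, threshold = winnow_train(examples, n_features=8)
print(winnow_predict(weights, threshold, (0,)))       # relevant feature alone
print(winnow_predict(weights, threshold, (5, 6, 7)))  # irrelevant features only
```

Only the weights of active features change on a mistake, so the cost of an update scales with the number of active features rather than with the full dimensionality.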

arXiv.org e-Print Archive

CiteSeerX

On PAC-Bayesian Bounds for Random Forests

Author: Igel Christian
Lorenzen Stephan Sloth
Seldin Yevgeny
Publication date: 01/01/2019

Existing guarantees in terms of rigorous upper bounds on the generalization error for the original random forest algorithm, one of the most frequently used machine learning methods, are unsatisfying. We discuss and evaluate various PAC-Bayesian approaches to derive such bounds. The bounds do not require additional hold-out data, because the out-of-bag samples from the bagging in the training process can be exploited. A random forest predicts by taking a majority vote of an ensemble of decision trees. The first approach is to bound the error of the vote by twice the error of the corresponding Gibbs classifier (classifying with a single member of the ensemble selected at random). However, this approach does not take into account the effect of averaging out of errors of individual classifiers when taking the majority vote. This effect provides a significant boost in performance when the errors are independent or negatively correlated, but when the correlations are strong the advantage from taking the majority vote is small. The second approach based on PAC-Bayesian C-bounds takes dependencies between ensemble members into account, but it requires estimating correlations between the errors of the individual classifiers. When the correlations are high or the estimation is poor, the bounds degrade. In our experiments, we compute generalization bounds for random forests on various benchmark data sets. Because the individual decision trees already perform well, their predictions are highly correlated and the C-bounds do not lead to satisfactory results. For the same reason, the bounds based on the analysis of Gibbs classifiers are typically superior and often reasonably tight. Bounds based on a validation set coming at the cost of a smaller training set gave better performance guarantees, but worse performance in most experiments.
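The factor-of-two relation between the majority vote and the Gibbs classifier can be checked empirically on toy data. The sketch below is only an illustration under assumptions: it uses a uniformly weighted ensemble, hand-made predictions, and a hypothetical helper name, and it compares empirical errors rather than computing any PAC-Bayesian bound from the paper. Ties are counted as vote errors so that the inequality holds for any number of voters.

```python
# Empirical check that majority-vote error <= 2 * Gibbs error (toy data, assumed names).
# The vote errs on a point only if at least half the voters err there, so by
# Markov's inequality its error rate is at most twice the average voter error.

def gibbs_and_vote_error(predictions, labels):
    n_voters = len(predictions)
    n_points = len(labels)
    # Gibbs error: expected error of one voter drawn uniformly at random.
    total_wrong = sum(p != y for voter in predictions for p, y in zip(voter, labels))
    gibbs = total_wrong / (n_voters * n_points)
    # Majority-vote error, counting ties as errors (the conservative convention).
    vote_wrong = 0
    for j, y in enumerate(labels):
        wrong_here = sum(voter[j] != y for voter in predictions)
        if 2 * wrong_here >= n_voters:
            vote_wrong += 1
    return gibbs, vote_wrong / n_points

labels = [1, 0, 1, 1, 0, 0]
predictions = [
    [1, 0, 1, 0, 0, 0],  # voter 1: one mistake
    [1, 0, 0, 0, 0, 0],  # voter 2: two mistakes
    [1, 1, 1, 1, 0, 0],  # voter 3: one mistake
]
gibbs, vote = gibbs_and_vote_error(predictions, labels)
print(gibbs, vote, vote <= 2 * gibbs)
```

When voters make their mistakes on different points, as here, the vote error falls well below twice the Gibbs error. That gap is the averaging effect the first approach ignores.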

arXiv.org e-Print Archive

Copenhagen University Research Information System

Photometric Catalogue of Quasars and Other Point Sources in the Sloan Digital Sky Survey

Author: Abazajian
Adelman-McCarthy
Ajit Kembhavi
Andrei
Aslan
Bianchi
Brand
Cappelluti
Cirasuolo
Covey
Croom
Curran
Ellison
Fomalont
Fukugita
Goderya
Gould
Gunn
Haakonsen
Healey
Heller
Kelly
Koo
Kuraszkiewicz
Lupton
Maddox
Massaro
Niemack
Ninan Sajeeth Philip
Odewahn
Oguri
Oyaizu
Padmanabhan
Philip
Richards
Rita Sinha
Schneider
Sheelu Abraham
Sinha
Skiff
Souchay
Stoughton
Suchkov
Véron-Cetty
Watson
XMM-Newton Survey Science Centre
Yogesh G. Wadadekar
York
Young
Zhang
Publication venue: 'Wiley'
Publication date: 25/08/2011

We present a catalogue of about 6 million unresolved photometric detections in the Sloan Digital Sky Survey Seventh Data Release, classifying them into stars, galaxies and quasars. We use a machine learning classifier trained on a subset of spectroscopically confirmed objects from 14th to 22nd magnitude in the SDSS $i$-band. Our catalogue consists of 2,430,625 quasars, 3,544,036 stars and 63,586 unresolved galaxies from 14th to 24th magnitude in the SDSS $i$-band. Our algorithm recovers 99.96% of spectroscopically confirmed quasars and 99.51% of stars to $i \sim 21.3$ in the colour window that we study. The level of contamination due to data artefacts for objects beyond $i = 21.3$ is highly uncertain, and all mentions of completeness and contamination in the paper are valid only for objects brighter than this magnitude. However, a comparison of the predicted number of quasars with the theoretical number counts shows reasonable agreement. Comment: 16 pages, Ref. No. MN-10-2382-MJ.R2, accepted for publication in MNRAS Main Journal, April 201

arXiv.org e-Print Archive

Crossref