Token and Type Constraints for Cross-Lingual Part-of-Speech Tagging
We consider the construction of part-of-speech taggers for resource-poor languages. Recently, manually constructed tag dictionaries from Wiktionary and dictionaries projected via bitext have been used as type constraints to overcome the scarcity of annotated data in this setting. In this paper, we show that additional token constraints can be projected from a resource-rich source language to a resource-poor target language via word-aligned bitext. We present several models to this end; in particular a partially observed conditional random field model, where coupled token and type constraints provide a partial signal for training. Averaged across eight previously studied Indo-European languages, our model achieves a 25% relative error reduction over the prior state of the art. We further present successful results on seven additional languages from different families, empirically demonstrating the applicability of coupled token and type constraints across a diverse set of languages.
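The coupling of token and type constraints described above can be sketched as a lattice-pruning step: each token's candidate tag set is the intersection of its dictionary (type) entry with any tag projected onto it from the source side. The dictionary entries, tags, and helper names below are illustrative assumptions, not the paper's actual data or model.

```python
# Hypothetical sketch: coupling type constraints (a Wiktionary-style tag
# dictionary) with token constraints (tags projected via word-aligned
# bitext) to prune the tag lattice before CRF training.

# Type constraints: word types mapped to their licensed tag sets
# (toy Spanish-like example, purely illustrative).
TYPE_DICT = {
    "la": {"DET", "PRON"},
    "casa": {"NOUN"},
    "es": {"VERB"},
}
ALL_TAGS = {"DET", "PRON", "NOUN", "VERB", "ADJ"}

def allowed_tags(token, projected_tag=None):
    """Intersect token-level projections with type-level dictionary entries.

    projected_tag is a tag projected from the source language via
    word alignment, or None if the token is unaligned.
    """
    type_set = TYPE_DICT.get(token, ALL_TAGS)
    if projected_tag is not None and projected_tag in type_set:
        # Token and type constraints agree: keep only the projected tag.
        return {projected_tag}
    # Projection missing or inconsistent with the dictionary:
    # fall back to the type constraint alone.
    return type_set

sentence = ["la", "casa", "es"]
projections = ["DET", None, "VERB"]
lattice = [allowed_tags(w, p) for w, p in zip(sentence, projections)]
print(lattice)  # one pruned tag set per token
```

In a partially observed CRF, tags outside each pruned set would be treated as disallowed, so the constrained lattice provides the partial training signal.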
A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets
The term "outlier" can generally be defined as an observation that is significantly different from
the other values in a data set. Outliers may be instances of error or may indicate events of
interest. The task of outlier detection aims at identifying such outliers in order to improve the
analysis of data and further discover interesting and useful knowledge about unusual events within
numerous application domains. In this paper, we report on contemporary unsupervised outlier
detection techniques for multiple types of data sets and provide a comprehensive taxonomy
framework and two decision trees to select the most suitable technique based on the type of data
set. Furthermore, we highlight the advantages, disadvantages and performance issues of each class
of outlier detection techniques under this taxonomy framework.
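One of the simplest classes in such a taxonomy is statistical detection on univariate numeric data, which the z-score rule illustrates. This is a minimal sketch of one representative technique, not the paper's framework; the threshold and data are assumptions (note that with small samples an extreme value inflates the standard deviation, so a threshold below the classic 3.0 is used here).

```python
# Minimal unsupervised outlier detection via z-scores: flag values whose
# distance from the mean, in standard deviations, exceeds a threshold.
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

data = [10, 11, 9, 10, 12, 11, 10, 95]
print(zscore_outliers(data))  # the extreme value is flagged
```

A decision tree of the kind the paper proposes would route a user away from this technique for, e.g., high-dimensional or categorical data, where distance- or density-based methods are more appropriate.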
Automated supervised classification of variable stars I. Methodology
The fast classification of new variable stars is an important step in making
them available for further research. Selection of science targets from large
databases is much more efficient if they have been classified first. Defining
the classes in terms of physical parameters is also important to get an
unbiased statistical view on the variability mechanisms and the borders of
instability strips. Our goal is twofold: provide an overview of the stellar
variability classes that are presently known, in terms of some relevant stellar
parameters; use the class descriptions obtained as the basis for an automated
`supervised classification' of large databases. Such automated classification
will compare and assign new objects to a set of pre-defined variability
training classes. For every variability class, a literature search was
performed to find as many well-known member stars as possible, or a
considerable subset if too many were present. Next, we searched on-line and
private databases for their light curves in the visible band and performed
period analysis and harmonic fitting. The derived light curve parameters are
used to describe the classes and define the training classifiers. We compared
the performance of different classifiers in terms of percentage of correct
identification, of confusion among classes and of computation time. We describe
how well the classes can be separated using the proposed set of parameters and
how future improvements can be made, based on new large databases such as the
light curves to be assembled by the CoRoT and Kepler space missions.
Comment: This paper has been accepted for publication in Astronomy and Astrophysics (reference AA/2007/7638). Number of pages: 27. Number of figures: 1.