1,928 research outputs found
Automated Morphological Segmentation and Evaluation
In this paper we introduce (i) a new method for morphological segmentation of part of speech labelled German words and (ii) some measures related to the MDL principle for evaluation of morphological segmentations. The segmentation algorithm is capable to discover hierarchical structure and to retrieve new morphemes. It achieved 75 % recall and 99 % precision. Regarding MDL based evaluation, a linear combination of vocabulary size and size of reduced deterministic finite state automata matching exactly the segmentation output turned out to be an appropriate measure to rank segmentation models according to their quality
A Survey and Classification of Methods for (Mostly) Unsupervised Learning
Proceedings of the 16th Nordic Conference
of Computational Linguistics NODALIDA-2007.
Editors: Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit.
University of Tartu, Tartu, 2007.
ISBN 978-9985-4-0513-0 (online)
ISBN 978-9985-4-0514-7 (CD-ROM)
pp. 292-296
Spectral learning of transducers over continuous sequences
In this paper we present a spectral algorithm for learning weighted nite state transducers (WFSTs) over paired input-output sequences, where the input is continuous and the output discrete. WFSTs are an important tool for modeling paired input-output sequences and have numerous applications in
real-world problems. Recently, Balle et al (2011) proposed a spectral method for learning WFSTs that overcomes some of the well known limitations of gradient-based or EM optimizations which can be computationally expensive and su er from local optima issues. Their algorithm can model distributions where both inputs and outputs are sequences from a discrete alphabet.
However, many real world problems require modeling paired sequences where the inputs are not discrete but continuos sequences. Modelling continuous sequences with spectral methods has been studied in the context of HMMs (Song et al 2010), where a spectral algorithm for this case was derived. In this
paper we follow that line of work and propose a spectral learning algorithm
for modelling paired input-output sequences where the inputs are continuous and the outputs are discrete. Our approach is based on generalizing the class of weighted nite state transducers over discrete input-output sequences to a class where transitions are linear combinations of elementary transitions and the weights of this linear combinations are determined by dynamic features of the continuous input sequence.
At its core, the algorithm is simple and scalable to large data sets. We present experiments on a real task that validate the eff ectiveness of the proposed approach.Postprint (published version
Local String Transduction as Sequence Labeling
[EN]We show that the general problem of string transduction can be reduced to the problem of sequence labeling. While character deletion and insertions are allowed in string transduction, they do not exist in sequence labeling. We show how to overcome this difference. Our approach can be used with any sequence labeling algorithm and it works best for problems in which string transduction imposes a strong notion of locality (no long range dependencies). We experiment with spelling correction for social media, OCR correction, and morphological inflection, and we see that it behaves better than seq2seq models and yields state-of-the-art results in several cases.Peer reviewe
Morphological analysis for the Maltese language : the challenges of a hybrid system
Maltese is a morphologically rich language with a hybrid morphological system which features both concatenative and non-concatenative processes. This paper analyses the impact of this hybridity on the performance of machine learning techniques for morphological labelling and clustering. In particular, we analyse a dataset of morphologically related word clusters to evaluate the difference in results for concatenative and non-concatenative clusters. We also describe research carried out in morphological labelling, with a particular focus on the verb category. Two evaluations were carried out, one using an unseen dataset, and another one using a gold standard dataset which was manually labelled. The gold standard dataset was split into concatenative and non-concatenative to analyse the difference in results between the two morphological systems.non peer-reviewe
Probabilistic Modelling of Morphologically Rich Languages
This thesis investigates how the sub-structure of words can be accounted for
in probabilistic models of language. Such models play an important role in
natural language processing tasks such as translation or speech recognition,
but often rely on the simplistic assumption that words are opaque symbols. This
assumption does not fit morphologically complex language well, where words can
have rich internal structure and sub-word elements are shared across distinct
word forms.
Our approach is to encode basic notions of morphology into the assumptions of
three different types of language models, with the intention that leveraging
shared sub-word structure can improve model performance and help overcome data
sparsity that arises from morphological processes.
In the context of n-gram language modelling, we formulate a new Bayesian
model that relies on the decomposition of compound words to attain better
smoothing, and we develop a new distributed language model that learns vector
representations of morphemes and leverages them to link together
morphologically related words. In both cases, we show that accounting for word
sub-structure improves the models' intrinsic performance and provides benefits
when applied to other tasks, including machine translation.
We then shift the focus beyond the modelling of word sequences and consider
models that automatically learn what the sub-word elements of a given language
are, given an unannotated list of words. We formulate a novel model that can
learn discontiguous morphemes in addition to the more conventional contiguous
morphemes that most previous models are limited to. This approach is
demonstrated on Semitic languages, and we find that modelling discontiguous
sub-word structures leads to improvements in the task of segmenting words into
their contiguous morphemes.Comment: DPhil thesis, University of Oxford, submitted and accepted 2014.
http://ora.ox.ac.uk/objects/uuid:8df7324f-d3b8-47a1-8b0b-3a6feb5f45c
- …