Disambiguoiva morfologinen jäsennys probabilistisilla sekvenssimalleilla (Disambiguating morphological tagging with probabilistic sequence models)
A morphological tagger is a computer program that provides complete morphological descriptions of sentences. Morphological taggers find applications in many NLP fields. For example, they can be used as a pre-processing step for syntactic parsers, in information retrieval and machine translation. The task of morphological tagging is closely related to POS tagging but morphological taggers provide more fine-grained morphological information than POS taggers. Therefore, they are often applied to morphologically complex languages, which extensively utilize inflection, derivation and compounding for encoding structural and semantic information. This thesis presents work on data-driven morphological tagging for Finnish and other morphologically complex languages.
There exists a very limited amount of previous work on data-driven morphological tagging for Finnish because of the lack of freely available manually prepared morphologically tagged corpora. The work presented in this thesis is made possible by the recently published Finnish dependency treebanks FinnTreeBank and Turku Dependency Treebank. Additionally, the Finnish open-source morphological analyzer OMorFi is extensively utilized in the experiments presented in the thesis.
The thesis presents methods for improving tagging accuracy, estimation speed and tagging speed in the presence of large structured morphological label sets that are typical for morphologically complex languages. More specifically, it presents a novel formulation of generative morphological taggers using weighted finite-state machines and applies finite-state taggers to context-sensitive spelling correction of Finnish. The thesis also explores discriminative morphological tagging. It presents structured sub-label dependencies that can be used for improving tagging accuracy. Additionally, the thesis presents a cascaded variant of the averaged perceptron tagger. In the presence of large label sets, a cascaded design results in a substantial reduction in estimation time compared to a standard perceptron tagger. Moreover, the thesis explores pruning strategies for perceptron taggers. Finally, the thesis presents the FinnPos toolkit for morphological tagging. FinnPos is an open-source state-of-the-art averaged perceptron tagger implemented by the author.
A disambiguating morphological tagger is a program that produces unambiguous morphological descriptions for the words of a sentence. Such taggers can be used in many areas of language processing, for example as a pre-processing step for a syntactic parser or a machine translation system. As a language technology task, disambiguating morphological tagging resembles traditional part-of-speech tagging, but it produces more fine-grained morphological information than a traditional part-of-speech tagger. For this reason, disambiguating morphological taggers are mainly used in language technology for morphologically complex languages such as Finnish. Such languages make extensive use of word-formation processes such as inflection, derivation and compounding. The research presented in this thesis concerns disambiguating morphological tagging of morphologically rich languages using machine learning methods.
Although disambiguating morphological tagging of Finnish has been studied before (e.g. using the Constraint Grammar formalism), machine learning methods have scarcely been applied to it. This is because the high-quality morphologically annotated corpora needed for training a tagger have not been openly available. The research presented in this thesis makes use of the recently published dependency-parsed Finnish corpora FinnTreeBank and Turku Dependency Treebank. In addition, the research makes use of OMorFi, an open morphological analyzer for Finnish.
The thesis presents methods for improving tagging accuracy and for increasing both the training speed and the tagging speed of a tagger. It presents a new way of building generative taggers using weighted finite-state machines and applies such taggers to context-sensitive spelling correction of Finnish. The thesis also deals with discriminative tagging models. It presents ways of using parts of morphological analyses to improve tagging accuracy. It also presents a cascade model that considerably shortens the training time of a tagger. The thesis further presents ways of reducing the size of tagger models. Finally, it presents FinnPos, an open-source toolkit implemented by the author for training disambiguating morphological taggers.
Proceedings
Proceedings of the NODALIDA 2011 Workshop
Constraint Grammar Applications.
Editors: Eckhard Bick, Kristin Hagen, Kaili Müürisep, Trond Trosterud.
NEALT Proceedings Series, Vol. 14 (2011), vi+69 pp.
© 2011 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/19231
FinnPos : an open-source morphological tagging and lemmatization toolkit for Finnish
This paper describes FinnPos, an open-source morphological tagging and lemmatization toolkit for Finnish. The morphological tagging model is based on the averaged structured perceptron classifier. Given training data, new taggers are estimated in a computationally efficient manner using a combination of beam search and model cascade. The lemmatization is performed employing a combination of a rule-based morphological analyzer, OMorFi, and a data-driven lemmatization model. The toolkit is readily applicable for tagging and lemmatization of running text with models learned from the recently published Finnish Turku Dependency Treebank and FinnTreeBank. Empirical evaluation on these corpora shows that FinnPos performs favorably compared to reference systems in terms of tagging and lemmatization accuracy. In addition, we demonstrate that our system is highly competitive with regard to computational efficiency of learning new models and assigning analyses to novel sentences. Peer reviewed
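The averaged perceptron idea mentioned in the abstract can be illustrated with a minimal sketch. This is not FinnPos code: it is a simplified per-token classifier with no beam search, no model cascade and no structured labels, and the feature template, function names and toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def features(words, i):
    """Hypothetical feature template: word form, 3-letter suffix, previous word."""
    word = words[i]
    prev = words[i - 1] if i > 0 else "<s>"
    return ["w=" + word.lower(), "suf3=" + word[-3:].lower(), "prev=" + prev.lower()]


def score(weights, feats, tag):
    return sum(weights[(f, tag)] for f in feats)


def predict(weights, feats, tags):
    # Highest-scoring tag; ties are broken alphabetically for determinism.
    return max(sorted(tags), key=lambda t: score(weights, feats, t))


def train(sentences, epochs=5):
    """Train on [(words, tags), ...]; return averaged weights and the tag set."""
    tags = {t for _, gold in sentences for t in gold}
    weights = defaultdict(float)
    totals = defaultdict(float)
    steps = 0
    for _ in range(epochs):
        for words, gold in sentences:
            for i, gold_tag in enumerate(gold):
                feats = features(words, i)
                guess = predict(weights, feats, tags)
                if guess != gold_tag:
                    # Standard perceptron update: reward gold, penalise the guess.
                    for f in feats:
                        weights[(f, gold_tag)] += 1.0
                        weights[(f, guess)] -= 1.0
                # Naive averaging: add the current weights to the running totals.
                for key, value in weights.items():
                    totals[key] += value
                steps += 1
    averaged = defaultdict(float, {k: v / steps for k, v in totals.items()})
    return averaged, tags


def tag(weights, tags, words):
    return [predict(weights, features(words, i), tags) for i in range(len(words))]


if __name__ == "__main__":
    # Toy training data, invented for this sketch.
    data = [
        (["the", "dog", "barks"], ["DET", "NOUN", "VERB"]),
        (["the", "cat", "sleeps"], ["DET", "NOUN", "VERB"]),
    ]
    model, tagset = train(data)
    print(tag(model, tagset, ["the", "dog", "sleeps"]))
```

Averaging the weight vector over all training steps, rather than keeping only the final weights, is what distinguishes the averaged perceptron from the plain one. Production implementations usually compute the average lazily instead of summing every weight at every step as done here.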
Data sparsity in highly inflected languages: the case of morphosyntactic tagging in Polish
In morphologically complex languages, many high-level tasks in natural language processing rely on accurate morphosyntactic analyses of the input. However, in light of the risk of error propagation in present-day pipeline architectures for basic linguistic pre-processing, the state of the art for morphosyntactic tagging is still not satisfactory. The main obstacle here is data sparsity inherent to natural language in general and highly inflected languages in particular.
In this work, we investigate whether semi-supervised systems may alleviate the data sparsity problem. Our approach uses word clusters obtained from large amounts of unlabelled text in an unsupervised manner in order to provide a supervised probabilistic tagger with morphologically informed features. Our evaluations on a number of datasets for the Polish language suggest that this simple technique improves tagging accuracy, especially with regard to out-of-vocabulary words. This may prove useful to increase cross-domain performance of taggers, and to alleviate the dependency on large amounts of supervised training data, which is especially important from the perspective of less-resourced languages.
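The following sketch shows one way cluster identities could be turned into tagger features. The abstract does not say which clustering method or feature template was used, so the hierarchical bit-string cluster IDs, the prefix lengths, the function name and the toy cluster table are all assumptions.

```python
def cluster_features(word, clusters, prefix_lengths=(2, 4)):
    """Return the word-form feature plus cluster-based features for one token.

    `clusters` maps lower-cased word forms to a cluster ID, assumed here to be
    a bit string from a hierarchical clustering of unlabelled text.
    """
    key = word.lower()
    feats = ["w=" + key]
    path = clusters.get(key)
    if path is None:
        # The word was not seen in the unlabelled text either.
        feats.append("cluster=UNK")
    else:
        feats.append("cluster=" + path)
        # Prefixes of the bit string give coarser cluster memberships.
        for n in prefix_lengths:
            feats.append(f"cluster[:{n}]=" + path[:n])
    return feats


if __name__ == "__main__":
    # Invented cluster table: two Polish forms of "dom" in neighbouring clusters.
    toy_clusters = {"domu": "0110", "domem": "0111"}
    print(cluster_features("Domem", toy_clusters))
    print(cluster_features("xyz", toy_clusters))
```

A word form missing from the tagged training data can still receive informative features this way, provided it occurred in the unlabelled text used for clustering, which is consistent with the reported gains on out-of-vocabulary words.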
Empirical Methods for Development and Expanding of the Bulgarian WordNet
This paper treats some basic points of the automated creation of a Bulgarian WordNet, an analogue of the Princeton WordNet. The computer tools used, the results obtained and their evaluation are discussed. A side effect of the proposed approach is that it yields patterns for the Bulgarian syntactic analyzer.
Nahuatl contemporary writing: studying convergence in the absence of a written norm
Language revitalisation (LR) is influenced by a concern with authenticity, in which written conventions are largely seen as the result of careful designs based on authentic spoken usage. This dissertation proposes to see written conventions as the result of an authentication process carried out by a community of practice of writers, and to explore the points of convergence in this practice as a set of examples which could eventually become shared conventions in most varieties of a linguistic continuum. I focus on the revitalisation of Nahuatl, a linguistic continuum spoken in Mexico. The study of convergence in the written practice of Nahuatl must be carried out in a context of ideological, linguistic, and orthographical heterogeneity. I have tested a methodology to extensively investigate points of convergence between eight contemporary Nahuatl texts from eight contemporary varieties, comparing them with Classical Nahuatl (CN), an old Nahuatl variety codified in prescriptive sources. I have attempted to locate commonalities in these texts by identifying nuclear clauses (NCs): morphosyntactic structures which are a common feature across the Nahuatl continuum. I have used a Finite State (FS) model of CN to attempt a morphological analysis of the word types found in the contemporary texts. The word types that could be plausibly analysed as CN NCs by our FS model were proposed as plausible points of convergence between the texts and CN. Using a force atlas diagram, each text was represented as a node in a network, with the distance between them being proportional to the number of plausible points of convergence between them. The findings are that the ambiguity of the analyses proposed by the developing FS model is currently a pitfall of our approach, but that the plausible points of convergence could be used to locate texts occupying a ‘central’ position in an expanding network.
16th International NooJ 2022 Conference: Book of Abstracts
Book of abstracts presented at the "16th International NooJ 2022 Conference", held in hybrid format at the ECU (Espacio Cultural Universitario, UNR) in Rosario, Santa Fe, Argentina, on 14 and 15 June 2022.
Affiliation: Reyes, Silvia Susana. Universidad Nacional de Rosario. Facultad de Humanidades y Artes; Argentina
Probabilistic Modelling of Morphologically Rich Languages
This thesis investigates how the sub-structure of words can be accounted for
in probabilistic models of language. Such models play an important role in
natural language processing tasks such as translation or speech recognition,
but often rely on the simplistic assumption that words are opaque symbols. This
assumption does not fit morphologically complex languages well, where words can
have rich internal structure and sub-word elements are shared across distinct
word forms.
Our approach is to encode basic notions of morphology into the assumptions of
three different types of language models, with the intention that leveraging
shared sub-word structure can improve model performance and help overcome data
sparsity that arises from morphological processes.
In the context of n-gram language modelling, we formulate a new Bayesian
model that relies on the decomposition of compound words to attain better
smoothing, and we develop a new distributed language model that learns vector
representations of morphemes and leverages them to link together
morphologically related words. In both cases, we show that accounting for word
sub-structure improves the models' intrinsic performance and provides benefits
when applied to other tasks, including machine translation.
We then shift the focus beyond the modelling of word sequences and consider
models that automatically learn what the sub-word elements of a given language
are, given an unannotated list of words. We formulate a novel model that can
learn discontiguous morphemes in addition to the more conventional contiguous
morphemes that most previous models are limited to. This approach is
demonstrated on Semitic languages, and we find that modelling discontiguous
sub-word structures leads to improvements in the task of segmenting words into
their contiguous morphemes.
Comment: DPhil thesis, University of Oxford, submitted and accepted 2014.
http://ora.ox.ac.uk/objects/uuid:8df7324f-d3b8-47a1-8b0b-3a6feb5f45c