414 research outputs found

    Automatic Question Generation Using Semantic Role Labeling for Morphologically Rich Languages

    Get PDF
    In this paper, a novel approach to automatic question generation (AQG) using semantic role labeling (SRL) for morphologically rich languages is presented. A model for AQG is developed for our native speaking language, Croatian. Croatian language is a highly inflected language that belongs to Balto-Slavic family of languages. Globally this article can be divided into two stages. In the first stage we present a novel approach to SRL of texts written in Croatian language that uses Conditional Random Fields (CRF). SRL traditionally consists of predicate disambiguation, argument identification and argument classification. After these steps most approaches use beam search to find optimal sequence of arguments based on given predicate. We propose the architecture for predicate identification and argument classification in which finding the best sequence of arguments is handled by Viterbi decoding. We enrich SRL features with custom attributes that are custom made for this language. Our SRL system achieves F1 score of 78% in argument classification step on Croatian hr 500k corpus. In the second stage the proposed SRL model is used to develop AQG system for question generation from texts written in Croatian language. We proposed custom templates for AQG that were used to generate a total of 628 questions which were evaluated by experts scoring every question on a Likert scale. Expert evaluation of the system showed that our AQG achieved good results. The evaluation showed that 68% of the generated questions could be used for educational purposes. With these results the proposed AQG system could be used for possible implementation inside educational systems such as Intelligent Tutoring Systems

    Disambiguoiva morfologinen jäsennys probabilistisilla sekvenssimalleilla

    Get PDF
    A morphological tagger is a computer program that provides complete morphological descriptions of sentences. Morphological taggers find applications in many NLP fields. For example, they can be used as a pre-processing step for syntactic parsers, in information retrieval and machine translation. The task of morphological tagging is closely related to POS tagging but morphological taggers provide more fine-grained morphological information than POS taggers. Therefore, they are often applied to morphologically complex languages, which extensively utilize inflection, derivation and compounding for encoding structural and semantic information. This thesis presents work on data-driven morphological tagging for Finnish and other morphologically complex languages. There exists a very limited amount of previous work on data-driven morphological tagging for Finnish because of the lack of freely available manually prepared morphologically tagged corpora. The work presented in this thesis is made possible by the recently published Finnish dependency treebanks FinnTreeBank and Turku Dependency Treebank. Additionally, the Finnish open-source morphological analyzer OMorFi is extensively utilized in the experiments presented in the thesis. The thesis presents methods for improving tagging accuracy, estimation speed and tagging speed in presence of large structured morphological label sets that are typical for morphologically complex languages. More specifically, it presents a novel formulation of generative morphological taggers using weighted finite-state machines and applies finite-state taggers to context sensitive spelling correction of Finnish. The thesis also explores discriminative morphological tagging. It presents structured sub-label dependencies that can be used for improving tagging accuracy. Additionally, the thesis presents a cascaded variant of the averaged perceptron tagger. In presence of large label sets, a cascaded design results in substantial reduction of estimation speed compared to a standard perceptron tagger. Moreover, the thesis explores pruning strategies for perceptron taggers. Finally, the thesis presents the FinnPos toolkit for morphological tagging. FinnPos is an open-source state-of-the-art averaged perceptron tagger implemented by the author.Disambiguoiva morfologinen jäsennin on ohjelma, joka tuottaa yksikäsitteisiä morfologisia kuvauksia virkkeen sanoille. Tällaisia jäsentimiä voidaan hyödyntää monilla kielenkäsittelyn osa-alueilla, esimerkiksi syntaktisen jäsentimen tai konekäännösjärjestelmän esikäsittelyvaiheena. Kieliteknologisena tehtävänä disambiguoiva morfologinen jäsennys muistuttaa perinteistä sanaluokkajäsennystä, mutta se tuottaa hienojakoisempaa morfologista informaatiota kuin perinteinen sanaluokkajäsennin. Tämän takia disambiguoivia morfologisia jäsentimiä hyödynnetäänkin pääsääntöisesti morfologisesti monimutkaisten kielten, kuten suomen kielen, kieliteknologiassa. Tällaisissa kielissä käytetään paljon sananmuodostuskeinoja kuten taivutusta, johtamista ja yhdyssananmuodostusta. Väitöskirjan esittelemä tutkimus liittyy morfologisesti rikkaiden kielten disambiguoivaan morfologiseen jäsentämiseen koneoppimismenetelmin. Vaikka suomen disambiguoivaa morfologista jäsentämistä on tutkittu aiemmin (esim. Constraint Grammar -formalismin avulla), koneoppimismenetelmiä ei ole aiemmin juurikaan sovellettu. Tämä johtuu siitä että jäsentimen oppimiseen tarvittavia korkealuokkaisia morfologisesti annotoituja korpuksia ei ole ollut avoimesti saatavilla. Tässä väitöskirjassa esitelty tutkimus hyödyntää vastikään julkaistuja suomen kielen dependenssijäsennettyjä FinnTreeBank ja Turku Dependency Treebank korpuksia. Lisäksi tutkimus hyödyntää suomen kielen avointa morfologista OMorFi-jäsennintä. Väitöskirja esittelee menetelmiä jäsennystarkkuuden parantamiseen ja jäsentimen opetusnopeuden sekä jäsennysnopeuden kasvattamiseen. Väitöskirja esittää uuden tavan rakentaa generatiivisia jäsentimiä hyödyntäen painollisia äärellistilaisia koneita ja soveltaa tällaisia jäsentimiä suomen kielen kontekstisensitiiviseen oikeinkirjoituksentarkistukseen. Lisäksi väitöskirja käsittelee diskriminatiivisia jäsennysmalleja. Se esittelee tapoja hyödyntää morfologisten analyysien osia jäsennystarkkuuden parantamiseen. Lisäksi se esittää kaskadimallin, jonka avulla jäsentimen opetusaika lyhenee huomattavasi. Väitöskirja esittää myös tapoja jäsenninmallien pienentämiseen. Lopuksi esitellään FinnPos, joka on kirjoittaman toteuttama avoimen lähdekoodin työkalu disambiguoivien morfologisten jäsentimien opettamiseen

    Feature decay algorithms for fast deployment of accurate statistical machine translation systems

    Get PDF
    We use feature decay algorithms (FDA) for fast deployment of accurate statistical machine translation systems taking only about half a day for each translation direction. We develop parallel FDA for solving computational scalability problems caused by the abundance of training data for SMT models and LM models and still achieve SMT performance that is on par with using all of the training data or better. Parallel FDA runs separate FDA models on randomized subsets of the training data and combines the instance selections later. Parallel FDA can also be used for selecting the LM corpus based on the training set selected by parallel FDA. The high quality of the selected training data allows us to obtain very accurate translation outputs close to the top performing SMT systems. The relevancy of the selected LM corpus can reach up to 86% reduction in the number of OOV tokens and up to 74% reduction in the perplexity. We perform SMT experiments in all language pairs in the WMT13 translation task and obtain SMT performance close to the top systems using significantly less resources for training and development

    General methods for fine-grained morphological and syntactic disambiguation

    Get PDF
    We present methods for improved handling of morphologically rich languages (MRLS) where we define MRLS as languages that are morphologically more complex than English. Standard algorithms for language modeling, tagging and parsing have problems with the productive nature of such languages. Consider for example the possible forms of a typical English verb like work that generally has four four different forms: work, works, working and worked. Its Spanish counterpart trabajar has 6 different forms in present tense: trabajo, trabajas, trabaja, trabajamos, trabajáis and trabajan and more than 50 different forms when including the different tenses, moods (indicative, subjunctive and imperative) and participles. Such a high number of forms leads to sparsity issues: In a recent Wikipedia dump of more than 400 million tokens we find that 20 of these forms occur only twice or less and that 10 forms do not occur at all. This means that even if we only need unlabeled data to estimate a model and even when looking at a relatively common and frequent verb, we do not have enough data to make reasonable estimates for some of its forms. However, if we decompose an unseen form such as trabajaréis `you will work', we find that it is trabajar in future tense and second person plural. This allows us to make the predictions that are needed to decide on the grammaticality (language modeling) or syntax (tagging and parsing) of a sentence. In the first part of this thesis, we develop a morphological language model. A language model estimates the grammaticality and coherence of a sentence. Most language models used today are word-based n-gram models, which means that they estimate the transitional probability of a word following a history, the sequence of the (n - 1) preceding words. The probabilities are estimated from the frequencies of the history and the history followed by the target word in a huge text corpus. If either of the sequences is unseen, the length of the history has to be reduced. This leads to a less accurate estimate as less context is taken into account. Our morphological language model estimates an additional probability from the morphological classes of the words. These classes are built automatically by extracting morphological features from the word forms. To this end, we use unsupervised segmentation algorithms to find the suffixes of word forms. Such an algorithm might for example segment trabajaréis into trabaja and réis and we can then estimate the properties of trabajaréis from other word forms with the same or similar morphological properties. The data-driven nature of the segmentation algorithms allows them to not only find inflectional suffixes (such as -réis), but also more derivational phenomena such as the head nouns of compounds or even endings such as -tec, which identify technology oriented companies such as Vortec, Memotec and Portec and would not be regarded as a morphological suffix by traditional linguistics. Additionally, we extract shape features such as if a form contains digits or capital characters. This is important because many rare or unseen forms are proper names or numbers and often do not have meaningful suffixes. Our class-based morphological model is then interpolated with a word-based model to combine the generalization capabilities of the first and the high accuracy in case of sufficient data of the second. We evaluate our model across 21 European languages and find improvements between 3% and 11% in perplexity, a standard language modeling evaluation measure. Improvements are highest for languages with more productive and complex morphology such as Finnish and Estonian, but also visible for languages with a relatively simple morphology such as English and Dutch. We conclude that a morphological component yields consistent improvements for all the tested languages and argue that it should be part of every language model. Dependency trees represent the syntactic structure of a sentence by attaching each word to its syntactic head, the word it is directly modifying. Dependency parsing is usually tackled using heavily lexicalized (word-based) models and a thorough morphological preprocessing is important for optimal performance, especially for MRLS. We investigate if the lack of morphological features can be compensated by features induced using hidden Markov models with latent annotations (HMM-LAs) and find this to be the case for German. HMM-LAs were proposed as a method to increase part-of-speech tagging accuracy. The model splits the observed part-of-speech tags (such as verb and noun) into subtags. An expectation maximization algorithm is then used to fit the subtags to different roles. A verb tag for example might be split into an auxiliary verb and a full verb subtag. Such a split is usually beneficial because these two verb classes have different contexts. That is, a full verb might follow an auxiliary verb, but usually not another full verb. For German and English, we find that our model leads to consistent improvements over a parser not using subtag features. Looking at the labeled attachment score (LAS), the number of words correctly attached to their head, we observe an improvement from 90.34 to 90.75 for English and from 87.92 to 88.24 for German. For German, we additionally find that our model achieves almost the same performance (88.24) as a model using tags annotated by a supervised morphological tagger (LAS of 88.35). We also find that the German latent tags correlate with morphology. Articles for example are split by their grammatical case. We also investigate the part-of-speech tagging accuracies of models using the traditional treebank tagset and models using induced tagsets of the same size and find that the latter outperform the former, but are in turn outperformed by a discriminative tagger. Furthermore, we present a method for fast and accurate morphological tagging. While part-of-speech tagging annotates tokens in context with their respective word categories, morphological tagging produces a complete annotation containing all the relevant inflectional features such as case, gender and tense. A complete reading is represented as a single tag. As a reading might consist of several morphological features the resulting tagset usually contains hundreds or even thousands of tags. This is an issue for many decoding algorithms such as Viterbi which have runtimes depending quadratically on the number of tags. In the case of morphological tagging, the problem can be avoided by using a morphological analyzer. A morphological analyzer is a manually created finite-state transducer that produces the possible morphological readings of a word form. This analyzer can be used to prune the tagging lattice and to allow for the application of standard sequence labeling algorithms. The downside of this approach is that such an analyzer is not available for every language or might not have the coverage required for the task. Additionally, the output tags of some analyzers are not compatible with the annotations of the treebanks, which might require some manual mapping of the different annotations or even to reduce the complexity of the annotation. To avoid this problem we propose to use the posterior probabilities of a conditional random field (CRF) lattice to prune the space of possible taggings. At the zero-order level the posterior probabilities of a token can be calculated independently from the other tokens of a sentence. The necessary computations can thus be performed in linear time. The features available to the model at this time are similar to the features used by a morphological analyzer (essentially the word form and features based on it), but also include the immediate lexical context. As the ambiguity of word types varies substantially, we just fix the average number of readings after pruning by dynamically estimating a probability threshold. Once we obtain the pruned lattice, we can add tag transitions and convert it into a first-order lattice. The quadratic forward-backward computations are now executed on the remaining plausible readings and thus efficient. We can now continue pruning and extending the lattice order at a relatively low additional runtime cost (depending on the pruning thresholds). The training of the model can be implemented efficiently by applying stochastic gradient descent (SGD). The CRF gradient can be calculated from a lattice of any order as long as the correct reading is still in the lattice. During training, we thus run the lattice pruning until we either reach the maximal order or until the correct reading is pruned. If the reading is pruned we perform the gradient update with the highest order lattice still containing the reading. This approach is similar to early updating in the structured perceptron literature and forces the model to learn how to keep the correct readings in the lower order lattices. In practice, we observe a high number of lower updates during the first training epoch and almost exclusively higher order updates during later epochs. We evaluate our CRF tagger on six languages with different morphological properties. We find that for languages with a high word form ambiguity such as German, the pruning results in a moderate drop in tagging accuracy while for languages with less ambiguity such as Spanish and Hungarian the loss due to pruning is negligible. However, our pruning strategy allows us to train higher order models (order > 1), which give substantial improvements for all languages and also outperform unpruned first-order models. That is, the model might lose some of the correct readings during pruning, but is also able to solve more of the harder cases that require more context. We also find our model to substantially and significantly outperform a number of frequently used taggers such as Morfette and SVMTool. Based on our morphological tagger we develop a simple method to increase the performance of a state-of-the-art constituency parser. A constituency tree describes the syntactic properties of a sentence by assigning spans of text to a hierarchical bracket structure. developed a language-independent approach for the automatic annotation of accurate and compact grammars. Their implementation -- known as the Berkeley parser -- gives state-of-the-art results for many languages such as English and German. For some MRLS such as Basque and Korean, however, the parser gives unsatisfactory results because of its simple unknown word model. This model maps unknown words to a small number of signatures (similar to our morphological classes). These signatures do not seem expressive enough for many of the subtle distinctions made during parsing. We propose to replace rare words by the morphological reading generated by our tagger instead. The motivation is twofold. First, our tagger has access to a number of lexical and sublexical features not available during parsing. Second, we expect the morphological readings to contain most of the information required to make the correct parsing decision even though we know that things such as the correct attachment of prepositional phrases might require some notion of lexical semantics. In experiments on the SPMRL 2013 dataset of nine MRLS we find our method to give improvements for all languages except French for which we observe a minor drop in the Parseval score of 0.06. For Hebrew, Hungarian and Basque we find substantial absolute improvements of 5.65, 11.87 and 15.16, respectively. We also performed an extensive evaluation on the utility of word representations for morphological tagging. Our goal was to reduce the drop in performance that is caused when a model trained on a specific domain is applied to some other domain. This problem is usually addressed by domain adaption (DA). DA adapts a model towards a specific domain using a small amount of labeled or a huge amount of unlabeled data from that domain. However, this procedure requires us to train a model for every target domain. Instead we are trying to build a robust system that is trained on domain-specific labeled and domain-independent or general unlabeled data. We believe word representations to be key in the development of such models because they allow us to leverage unlabeled data efficiently. We compare data-driven representations to manually created morphological analyzers. We understand data-driven representations as models that cluster word forms or map them to a vectorial representation. Examples heavily used in the literature include Brown clusters, Singular Value Decompositions of count vectors and neural-network-based embeddings. We create a test suite of six languages consisting of in-domain and out-of-domain test sets. To this end we converted annotations for Spanish and Czech and annotated the German part of the Smultron treebank with a morphological layer. In our experiments on these data sets we find Brown clusters to outperform the other data-driven representations. Regarding the comparison with morphological analyzers, we find Brown clusters to give slightly better performance in part-of-speech tagging, but to be substantially outperformed in morphological tagging

    Towards a machine-learning architecture for lexical functional grammar parsing

    Get PDF
    Data-driven grammar induction aims at producing wide-coverage grammars of human languages. Initial efforts in this field produced relatively shallow linguistic representations such as phrase-structure trees, which only encode constituent structure. Recent work on inducing deep grammars from treebanks addresses this shortcoming by also recovering non-local dependencies and grammatical relations. My aim is to investigate the issues arising when adapting an existing Lexical Functional Grammar (LFG) induction method to a new language and treebank, and find solutions which will generalize robustly across multiple languages. The research hypothesis is that by exploiting machine-learning algorithms to learn morphological features, lemmatization classes and grammatical functions from treebanks we can reduce the amount of manual specification and improve robustness, accuracy and domain- and language -independence for LFG parsing systems. Function labels can often be relatively straightforwardly mapped to LFG grammatical functions. Learning them reliably permits grammar induction to depend less on language-specific LFG annotation rules. I therefore propose ways to improve acquisition of function labels from treebanks and translate those improvements into better-quality f-structure parsing. In a lexicalized grammatical formalism such as LFG a large amount of syntactically relevant information comes from lexical entries. It is, therefore, important to be able to perform morphological analysis in an accurate and robust way for morphologically rich languages. I propose a fully data-driven supervised method to simultaneously lemmatize and morphologically analyze text and obtain competitive or improved results on a range of typologically diverse languages

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    Get PDF
    Translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system mostly depends on parallel data and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data but using external morphological resources. A set of new phrase associations is added to translation and reordering models; each of them corresponds to a morphological variation of the source/target/both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations and results showed improvements of the performance in terms of automatic scores (BLEU and Meteor) and reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.JRC.G.2-Global security and crisis managemen

    Deep learning in medical document classification

    Get PDF
    Text-based data is produced at an ever growing rate each year which has in turn increased the need for automatic text processing. Thus, to keep up with the amount of data, automatic natural language processing techniques have also been increasingly researched and developed, especially in the last decade or so. This has lead to substantial improvements in various natural language processing tasks such as classification, translation and information retrieval. A major breakthrough has been the utilization of deep neural networks and massive amounts of data to train them. Using such methods in areas where time is valuable, such as the medical field, could provide considerable value. In this thesis, an overview is given of natural language processing w.r.t deep learning and text classification. Additionally, a dataset of medical reports in Finnish was preprocessed and used to train and evaluate a number of text classifiers for diagnosis code prediction in order to define the feasibility of such methods for medical text classification. The chosen methods include deep learning -based FinBERT, ULMFiT and ELECTRA, and a simpler linear baseline classifier, fastText. The results show that with a limited dataset, linear methods like fastText work surprisingly well. Deep learning -based methods, on the other hand, seem work reasonably well, and show a lot of potential especially in utilizing larger amounts of training data. In order to define the full potential of such methods, further investigation is required with different datasets and classification tasks.Tekstipohjaista tietoa tuotetaan vuosi vuodelta enemmän mikä puolestaan on lisännyt tarvetta automaattiselle tekstinkäsittelylle. Täten myös automaattisia tekniikoita luonnollisen kielen käsittelyyn on enenevissä määrin tutkittu ja kehitetty, erityisesti viimeisen vuosikymmenen aikana. Tämä on johtanut huomattaviin parannuksiin erilaisissa luonnollisen kielen käsittelytehtävissä. Suuri läpimurto on ollut valtavilla tietomäärillä koulutettujen syvien neuroverkkojen käyttäminen. Tällaisten menetelmien käyttö alueilla joilla aika on arvokasta, kuten lääketiede, voisi tarjota huomattavaa lisäarvoa. Tämä tutkielma antaa yleiskuvan luonnollisen kielen käsittelystä keskittyen syväoppimiseen ja tekstinluokitteluun. Lisäksi erilaisten syväoppivien menetelmien käytettävyyttä arvioitiin kouluttamalla tekstiluokittelijoita ennustamaan suomenkielisten lääketieteellisten dokumenttien diagnoosikoodeja. Valittuihin menetelmiin kuuluvat syväoppimiseen perustuvat FinBERT, ULMFiT ja ELECTRA, sekä yksinkertaisempi lineaarinen luokittelija fastText. Tulokset osoittavat, että rajallisella aineistolla lineaariset menetelmät, kuten fastText, toimivat yllättävän hyvin. Syväoppimiselle perustuvat menetelmät taasen vaikuttavat toimivan kohtuullisen hyvin, vaikkakin niiden aito potentiaali pitäisi todentaa käyttäen suurempia datajoukkoja. Täten jatkotutkimusta syväoppiviin menetelmiin liittyen tarvitaan
    corecore