695 research outputs found

    A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena

    Word reordering is one of the most difficult aspects of statistical machine translation (SMT), and an important factor in its quality and efficiency. Despite the vast amount of research published to date, the community's interest in this problem has not decreased, and no single method appears to be strongly dominant across language pairs. Instead, the choice of the optimal approach for a new translation task still seems to be driven mostly by empirical trials. To orient the reader in this vast and complex research area, we present a comprehensive survey of word reordering viewed as a statistical modeling challenge and as a natural language phenomenon. The survey describes in detail how word reordering is modeled within different string-based and tree-based SMT frameworks and as a stand-alone task, including systematic overviews of the literature on advanced reordering modeling. We then question why some approaches are more successful than others in different language pairs. We argue that, besides measuring the amount of reordering, it is important to understand which kinds of reordering occur in a given language pair. To this end, we conduct a qualitative analysis of word reordering phenomena in a diverse sample of language pairs, based on a large collection of linguistic knowledge. Empirical results in the SMT literature are shown to support the hypothesis that a few linguistic facts can be very useful to anticipate the reordering characteristics of a language pair and to select the SMT framework that best suits them. (Comment: 44 pages; to appear in Computational Linguistics.)
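    To make the abstract's point about "measuring the amount of reordering" concrete: one widely used metric is the Kendall tau distance over word-alignment links, i.e. the fraction of aligned word pairs whose relative order is swapped between source and target. The sketch below is only one such metric, under an assumed alignment format (a list of (source index, target index) pairs); the survey itself covers a range of metrics and notations.

```python
from itertools import combinations

def kendall_tau_distance(alignment):
    """Fraction of aligned word pairs whose relative order differs
    between the source and target sides (0 = monotone, 1 = inverted)."""
    # Order the alignment links by source position, then look at the
    # sequence of target positions they map to.
    targets = [t for _, t in sorted(alignment)]
    n = len(targets)
    if n < 2:
        return 0.0
    # A pair is discordant when its source-side order is reversed on the target side.
    discordant = sum(1 for i, j in combinations(range(n), 2)
                     if targets[i] > targets[j])
    return discordant / (n * (n - 1) / 2)

print(kendall_tau_distance([(0, 0), (1, 1), (2, 2)]))  # 0.0, no reordering
print(kendall_tau_distance([(0, 2), (1, 1), (2, 0)]))  # 1.0, full inversion
```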

    Dual contextual module for neural machine translation


    Neural network based factoid question answering and question generation in Finnish (Neuroverkkopohjainen faktoidikysymyksiin vastaaminen ja kysymysten generointi suomen kielellä)

    Automatic question answering and question generation are two closely related natural language processing tasks. Both have been studied for decades, and both have a wide range of uses. Systems that can answer questions posed in natural language help with all kinds of information needs, while automatic question generation can be used, for example, to automatically create reading comprehension exercises and to improve the interactivity of virtual assistants. The best results in both question answering and question generation are currently obtained by utilizing pre-trained neural language models based on the transformer architecture. Such models are typically first pre-trained on raw language data and then fine-tuned for various tasks using task-specific annotated datasets. So far, hardly any models that can answer or generate questions in Finnish have been reported. In order to create them using modern transformer-based methods, both a pre-trained language model and a sufficiently large dataset suitable for fine-tuning on question answering or question generation are required. Although some suitable models pre-trained on Finnish or multilingual data are already openly available, the main bottleneck is the lack of annotated data needed for fine-tuning. In this thesis, I create the first transformer-based neural network models for Finnish question answering and question generation. I present a method for creating a dataset suitable for fine-tuning pre-trained models for both tasks. The dataset creation is based on machine translation of the existing English SQuAD dataset and automatic post-translation normalization. Using the created dataset, I fine-tune several pre-trained models to answer and generate questions in Finnish and compare their performance. I use monolingual Finnish BERT and GPT-2 models as well as a multilingual BERT model. The results show that the transformer architecture is well suited to Finnish question answering and question generation, and that the synthetically generated dataset is a useful resource for fine-tuning pre-trained models. The best results in both tasks are obtained by fine-tuned BERT models that have been pre-trained only on Finnish data. The fine-tuned multilingual BERT models come close, whereas the fine-tuned GPT-2 models perform clearly worse. The data developed for this thesis will be released to the research community to support future research on question answering and generation, and the models will be released as benchmarks.
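    As an illustration of the fine-tuning setup the abstract describes, the sketch below shows one training step of extractive question answering with Hugging Face Transformers. The checkpoint name, the toy example, and the answer-span indices are illustrative assumptions, not details taken from the thesis.

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Assumed checkpoint: a monolingual Finnish BERT of the kind the thesis describes.
model_name = "TurkuNLP/bert-base-finnish-cased-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "Milloin Suomi itsenäistyi?"     # "When did Finland become independent?"
context = "Suomi itsenäistyi vuonna 1917."  # "Finland became independent in 1917."
inputs = tokenizer(question, context, return_tensors="pt")

# In SQuAD-style data, the labels are the start and end token indices of the
# answer span; the indices here are toy values for illustration only.
outputs = model(**inputs,
                start_positions=torch.tensor([9]),
                end_positions=torch.tensor([10]))
outputs.loss.backward()  # real fine-tuning loops over the dataset with an optimizer
```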

    Semantic Parsing in Limited Resource Conditions

    This thesis explores challenges in semantic parsing, specifically focusing on scenarios with limited data and computational resources. It offers solutions using techniques like automatic data curation, knowledge transfer, active learning, and continual learning. For tasks with no parallel training data, the thesis proposes generating synthetic training examples from structured database schemas. When there is abundant data in a source domain but limited parallel data in a target domain, knowledge from the source is leveraged to improve parsing in the target domain. For multilingual situations with limited data in the target languages, the thesis introduces a method to adapt parsers using a limited human translation budget. Active learning is applied to select source-language samples for manual translation, maximizing parser performance in the target language. In addition, an alternative method is proposed that utilizes machine translation services, supplemented by human-translated data, to train a more effective parser. When computational resources are limited, a continual learning approach is introduced to minimize training time and computational memory. This maintains the parser's efficiency on previously learned tasks while adapting it to new tasks, mitigating the problem of catastrophic forgetting. Overall, the thesis provides a comprehensive set of methods to improve semantic parsing in resource-constrained conditions. (Comment: PhD thesis, year of award 2023, 172 pages.)
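    One step in the pipeline above, selecting source-language samples for manual translation under a budget, can be illustrated with a minimal uncertainty-based (least-confidence) selection routine. This is a representative active learning strategy, not the thesis's exact criterion, and all names below are hypothetical.

```python
def select_for_translation(sentences, confidence, budget):
    """Return the `budget` sentences the parser is least confident about.

    sentences:  source-language strings in the unlabeled pool
    confidence: function mapping a sentence to the parser's confidence
                in its top-scoring parse (e.g. its model probability)
    budget:     how many sentences we can afford to have translated
    """
    return sorted(sentences, key=confidence)[:budget]

# Toy usage with a stand-in confidence table.
pool = ["book a flight to Oslo", "what is the tallest mountain", "play jazz"]
scores = {"book a flight to Oslo": 0.9,
          "what is the tallest mountain": 0.3,
          "play jazz": 0.6}
print(select_for_translation(pool, scores.get, budget=2))
# ['what is the tallest mountain', 'play jazz']
```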

    DeepCause: Hypothesis Extraction from Information Systems Papers with Deep Learning for Theory Ontology Learning

    This paper applies different deep learning architectures for sequence labelling to extract causes, effects, moderators, and mediators from the hypotheses of information systems papers for theory ontology learning. We compared a variety of recurrent neural network (RNN) architectures, including long short-term memory (LSTM), bidirectional LSTM (BiLSTM), simple RNNs, and gated recurrent units (GRU). We analyzed GloVe word embeddings, character-level vector representations of words, and part-of-speech (POS) tags. Furthermore, we evaluated various hyperparameters and architectures to achieve the highest performance scores. The prototype was evaluated on hypotheses from the AIS basket of eight. The F1 result for the sequence labelling task of causal variables at the chunk level was 80%, with a precision of 80% and a recall of 80%.
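    For readers unfamiliar with the model family being compared, the sketch below shows the core of a BiLSTM sequence labeller in PyTorch with a BIO-style tag set for cause and effect spans. All dimensions and the tag set are illustrative assumptions; the paper's models additionally use GloVe embeddings, character-level representations, and POS tags, which are omitted here.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # 2 * hidden_dim: forward and backward states are concatenated.
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        embedded = self.embed(token_ids)    # (batch, seq, embed_dim)
        hidden, _ = self.lstm(embedded)     # (batch, seq, 2 * hidden_dim)
        return self.out(hidden)             # per-token scores over the tags

# Toy BIO tag set for labelling causal variables.
tags = ["O", "B-CAUSE", "I-CAUSE", "B-EFFECT", "I-EFFECT"]
model = BiLSTMTagger(vocab_size=10_000, embed_dim=100,
                     hidden_dim=128, num_tags=len(tags))
scores = model(torch.randint(0, 10_000, (1, 12)))  # one 12-token sentence
print(scores.shape)  # torch.Size([1, 12, 5])
```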