A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor in its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.Comment: 44 pages, to appear in Computational Linguistic
Neural network-based factoid question answering and question generation in Finnish
Automatic question answering and question generation are two closely related natural language processing tasks. Both have been studied for decades, and both have a wide range of uses. While systems that can answer questions formed in natural language can help with all kinds of information needs, automatic question generation can be used, for example, to automatically create reading comprehension tasks and to improve the interactivity of virtual assistants. Currently, the best results in both question answering and question generation are obtained by utilizing pre-trained neural language models based on the transformer architecture. Such models are typically first pre-trained with raw language data and then fine-tuned for various tasks using task-specific annotated datasets.
So far, no models that can answer or generate questions purely in Finnish have been reported. In order to create them using modern transformer-based methods, both a pre-trained language model and a sufficiently large dataset suitable for fine-tuning on question answering or question generation are required. Although some suitable models pre-trained with Finnish or multilingual data are already available, the main bottleneck is the lack of annotated data needed for fine-tuning.
In this thesis, I create the first transformer-based neural network models for Finnish question answering and question generation. I present a method for creating a dataset for fine-tuning pre-trained models for the two tasks. The dataset creation is based on automatic translation of an existing dataset (SQuAD) and automatic normalization of the translated data. Using the created dataset, I fine-tune several pre-trained models to answer and generate questions in Finnish and evaluate their performance. I use monolingual BERT and GPT-2 models as well as a multilingual BERT model.
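One practical issue in translating a SQuAD-style dataset is that the character offsets of answer spans no longer hold after machine translation, so some normalization must re-anchor each answer in its translated context. The thesis does not spell out its exact procedure; the sketch below shows one plausible realignment step, with all names and data invented for illustration.

```python
# Illustrative sketch only: re-anchor a SQuAD-style answer span after the
# context and answer have been machine-translated separately. Examples whose
# answer no longer appears verbatim in the translated context are dropped.
# (This is an assumed normalization step, not the thesis's exact method.)

def realign_answer(translated_context: str, translated_answer: str):
    """Return a SQuAD-style answer dict, or None if the answer is lost."""
    start = translated_context.find(translated_answer)
    if start == -1:
        return None  # answer not recoverable verbatim; discard the example
    return {"text": translated_answer, "answer_start": start}

example = realign_answer(
    "Helsinki on Suomen pÀÀkaupunki.",  # translated context
    "Helsinki",                          # translated answer
)
```

Filtering out unrecoverable spans in this way trades dataset size for label quality, which matters when the fine-tuning data is synthetic to begin with.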
The results show that the transformer architecture is well suited also for Finnish question answering and question generation. They also indicate that the synthetically generated dataset can be a useful fine-tuning resource for these tasks. The best results in both tasks are obtained by fine-tuned BERT models which have been pre-trained with only Finnish data. The fine-tuned multilingual BERT models come in close, whereas fine-tuned GPT-2 models are generally found to underperform.
The data developed for this thesis will be released to the research community to support future research on question answering and generation, and the models will be released as benchmarks.
Domain adaptation for neural machine translation
The development of deep learning techniques has allowed Neural Machine Translation (NMT) models to become extremely powerful, given sufficient training data and training time. However, such translation models struggle when translating text of a specific domain. A domain may consist of text on a well-defined topic, text of unknown provenance with an identifiable vocabulary distribution, or language with some other stylometric feature. While NMT models can achieve good translation performance on domain-specific data via simple tuning on a representative training corpus, such data-centric approaches have negative side-effects. These include over-fitting, brittleness, and 'catastrophic forgetting' of previous training examples.
In this thesis we instead explore more robust approaches to domain adaptation for NMT. We consider the case where a system is adapted to a specified domain of interest, but may also need to accommodate new language, or domain-mismatched sentences. We explore techniques relating to data selection and curriculum, model parameter adaptation procedure, and inference procedure. We show that iterative fine-tuning can achieve strong performance over multiple related domains, and that Elastic Weight Consolidation can be used to mitigate catastrophic forgetting in NMT domain adaptation across multiple sequential domains. We develop a robust variant of Minimum Risk Training which allows more beneficial use of small, highly domain-specific tuning sets than simple cross-entropy fine-tuning, and can mitigate exposure bias resulting from domain over-fitting. We extend Bayesian Interpolation inference schemes to Neural Machine Translation, allowing adaptive weighting of NMT ensembles to translate text from an unknown domain.
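The Elastic Weight Consolidation mentioned above penalizes movement of parameters that were important for the previous domain, weighted by an estimate of the diagonal Fisher information. The sketch below gives the standard EWC objective in minimal form; it is the generic formulation, not the thesis's NMT-specific implementation.

```python
# Minimal sketch of the Elastic Weight Consolidation (EWC) penalty:
#   L(theta) = L_task(theta) + (lambda/2) * sum_i F_i * (theta_i - theta*_i)^2
# where theta* are the parameters learned on the old domain and F_i is the
# diagonal Fisher information estimated there. Parameters with high F_i are
# anchored more strongly, mitigating catastrophic forgetting.

def ewc_penalty(theta, theta_star, fisher, lam):
    """Quadratic penalty pulling theta toward theta_star, scaled by Fisher."""
    return 0.5 * lam * sum(
        f * (t - ts) ** 2 for t, ts, f in zip(theta, theta_star, fisher)
    )

def total_loss(task_loss, theta, theta_star, fisher, lam):
    """New-domain task loss plus the EWC regularizer."""
    return task_loss + ewc_penalty(theta, theta_star, fisher, lam)
```

With lambda set to zero this reduces to plain fine-tuning; larger values trade new-domain fit for retention of the old domain.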
Finally we demonstrate the benefit of multi-domain adaptation approaches for other lines of NMT research. We show that NMT systems using multiple forms of data representation can benefit from multi-domain inference approaches. We also demonstrate a series of domain adaptation approaches to mitigating the effects of gender bias in machine translation.
Semantic Parsing in Limited Resource Conditions
This thesis explores challenges in semantic parsing, specifically focusing on
scenarios with limited data and computational resources. It offers solutions
using techniques like automatic data curation, knowledge transfer, active
learning, and continual learning.
For tasks with no parallel training data, the thesis proposes generating
synthetic training examples from structured database schemas. When there is
abundant data in a source domain but limited parallel data in a target domain,
knowledge from the source is leveraged to improve parsing in the target domain.
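Generating synthetic examples from structured database schemas can be illustrated with a toy template-based generator. The templates and schema below are invented for illustration; the thesis's actual generation procedure is not reproduced here.

```python
# Toy sketch of template-based synthetic data generation from a database
# schema, in the spirit of the approach described above: each (table, column)
# pair yields one synthetic (question, SQL) training example.
# (Templates and schema are hypothetical, for illustration only.)

def generate_examples(schema):
    """Yield (question, SQL) pairs derived from table and column names."""
    for table, columns in schema.items():
        for col in columns:
            yield (
                f"What is the {col} of each {table}?",
                f"SELECT {col} FROM {table}",
            )

schema = {"employee": ["name", "salary"]}
pairs = list(generate_examples(schema))
```

Even such crude pairs can bootstrap a parser when no parallel data exists, since the schema itself supplies the target-side vocabulary.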
For multilingual situations with limited data in the target languages, the
thesis introduces a method to adapt parsers using a limited human translation
budget. Active learning is applied to select source-language samples for manual
translation, maximizing parser performance in the target language. In addition,
an alternative method is also proposed to utilize machine translation services,
supplemented by human-translated data, to train a more effective parser.
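The active-learning step above can be sketched with a simple least-confidence acquisition function: spend the human-translation budget on the source-language sentences the current parser is least sure about. Least-confidence is one common criterion; the thesis's exact acquisition function is not specified here, and the data below is invented.

```python
# Sketch of uncertainty-based active learning: rank unlabeled source-language
# sentences by the current parser's confidence and send the least-confident
# ones for human translation, up to a fixed budget.
# (Least-confidence is an assumed criterion, for illustration.)

def select_for_translation(samples, confidence, budget):
    """Return the `budget` samples with the lowest model confidence."""
    ranked = sorted(samples, key=confidence)
    return ranked[:budget]

sents = ["show flights", "cheapest fare to Boston", "book a table"]
conf = {"show flights": 0.95, "cheapest fare to Boston": 0.40,
        "book a table": 0.70}
picked = select_for_translation(sents, conf.get, budget=2)
```

Spending the budget where the model is weakest tends to yield more parser improvement per translated sentence than random selection.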
When computational resources are limited, a continual learning approach is
introduced to minimize training time and computational memory. This maintains
the parser's efficiency in previously learned tasks while adapting it to new
tasks, mitigating the problem of catastrophic forgetting.
Overall, the thesis provides a comprehensive set of methods to improve
semantic parsing in resource-constrained conditions.
Comment: PhD thesis, year of award 2023, 172 pages
DeepCause: Hypothesis Extraction from Information Systems Papers with Deep Learning for Theory Ontology Learning
This paper applies different deep learning architectures for sequence labelling to extract causes, effects, moderators, and mediators from hypotheses of information systems papers for theory ontology learning. We compared a variety of recurrent neural network (RNN) architectures, such as long short-term memory (LSTM), bidirectional LSTM (BiLSTM), simple RNNs, and gated recurrent units (GRU). We analyzed GloVe word embeddings, character-level vector representations of words, and part-of-speech (POS) tags. Furthermore, we evaluated various hyperparameters and architectures to achieve the highest performance scores. The prototype was evaluated on hypotheses from the AIS basket of eight. The F1 result for the sequence labelling task of causal variables on a chunk level was 80%, with a precision of 80% and a recall of 80%.
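The reported scores are consistent: F1 is the harmonic mean of precision and recall, so equal precision and recall of 0.80 yield an F1 of 0.80. A minimal check:

```python
# F1 is the harmonic mean of precision and recall; with precision and recall
# both at 0.80, as reported above, F1 is also 0.80.

def f1_score(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)
```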