Cross-Lingual Induction and Transfer of Verb Classes Based on Word Vector Space Specialisation

Author: Korhonen Anna
Mrkšić Nikola
Vulić Ivan
Publication date: 01/01/2017

Existing approaches to automatic VerbNet-style verb classification are heavily dependent on feature engineering and therefore limited to languages with mature NLP pipelines. In this work, we propose a novel cross-lingual transfer method for inducing VerbNets for multiple languages. To the best of our knowledge, this is the first study which demonstrates how the architectures for learning word embeddings can be applied to this challenging syntactic-semantic task. Our method uses cross-lingual translation pairs to tie each of the six target languages into a bilingual vector space with English, jointly specialising the representations to encode the relational information from English VerbNet. A standard clustering algorithm is then run on top of the VerbNet-specialised representations, using vector dimensions as features for learning verb classes. Our results show that the proposed cross-lingual transfer approach sets new state-of-the-art verb classification performance across all six target languages explored in this work. Comment: EMNLP 2017 (long paper)

arXiv.org e-Print Archive
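
The final step described in the abstract above, clustering verbs with vector dimensions as features, can be illustrated with a minimal sketch. It assumes k-means as the "standard clustering algorithm" (the abstract does not name one) and uses made-up two-dimensional vectors in place of the VerbNet-specialised representations; the function name `kmeans` and all data are hypothetical, and the cross-lingual specialisation itself is not shown.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; the first k vectors serve as initial centroids (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy, invented "verb vectors"; real inputs would be specialised embeddings.
verbs = {
    "run": (0.9, 0.1),
    "give": (0.1, 0.9),
    "walk": (0.8, 0.2),
    "donate": (0.2, 0.8),
}
names = list(verbs)
labels = kmeans([verbs[name] for name in names], k=2)

# Group verbs by induced class label.
classes = {}
for name, label in zip(names, labels):
    classes.setdefault(label, []).append(name)
print(classes)
```

In this toy setting each vector dimension acts directly as a clustering feature, which mirrors the abstract's description only at the level of the last pipeline stage.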

Natural language processing

Author: Adams
Amsler
Bangalore
Barker
Benoît
Bian
Bondale
Carrick
Ceric
Chandrasekar
Chang
Charniak
Chen
Chowdhury
Chowdhury
Costantino
Cowie
Craven
Craven
Craven
Dogru
Evans
Feldman
Fernandez
Gaizauskas
Glasgow
Haas
Hayes
Hayes
Hedlund
Herath
Ide
Isahara
Jelinek
Jeong
Jurafsky
Kazakov
Kehler
Khoo
Kim
King
Lange
Lee
Lehmam
Lehtokangas
Lewis
Liddy
Liddy
Lovis
Ma
Magnini
Mani
Manning
Marquez
Martinez
Martinez
McMurchie
Meyer
Mihalcea
Mock
Moens
Morin
Narita
Nerbonne
Oard
Ogura
Oudet
Owei
Paris
Pasero
Pedersen
Perez-Carballo
Petreley
Pirkola
Poesio
Rosenfield
Roux
Say
Scarlett
Schenker
Silber
Smeaton
Smeaton
Smith
Sokol
Song
Sparck Jones
Staab
Stock
Tolle
Trybula
Tsuda
Vickery
Waldrop
Warner
Weigard
Wilks
Wong
Yang
Yang
Zadrozny
Zweigenbaum
Publication venue: Wiley
Publication date: 01/01/2003

Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST Chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems - text summarization, information extraction, information retrieval, etc., including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of www and digital libraries; and (iv) evaluation of NLP systems.

OPUS - University of Technology Sydney

Cross-language evaluation forum : objectives, results, achievements

Author: Braschler Martin
Peters Carol
Publication venue: Springer
Publication date: 01/01/2004

Question Answering and Question Generation for Finnish

Author: Kylliäinen Ilmari
Yangarber Roman
Publication date: 24/11/2022

Recent advances in the field of language modeling have improved the state-of-the-art in question answering (QA) and question generation (QG). However, the development of modern neural models, their benchmarks, and datasets for training them has mainly focused on English. Finnish, like many other languages, faces a shortage of large QA/QG model training resources, which has prevented experimenting with state-of-the-art QA/QG fine-tuning methods. We present the first neural QA and QG models that work with Finnish. To train the models, we automatically translate the SQuAD dataset and then use normalization methods to reduce the amount of problematic data created during the translation. Using the synthetic data, together with the Finnish partition of the TyDi-QA dataset, we fine-tune several transformer-based models to both QA and QG and evaluate their performance. To the best of our knowledge, the resulting dataset is the first large-scale QA/QG resource for Finnish. This paper also sets the initial benchmarks for Finnish-language QA and QG.

arXiv.org e-Print Archive

Neuroverkkopohjainen faktoidikysymyksiin vastaaminen ja kysymysten generointi suomen kielellä

Author: Kylliäinen Ilmari
Publication venue: Helsingfors universitet
Publication date: 01/01/2022

Automatic question answering and question generation are two closely related natural language processing tasks. They both have been studied for decades, and both have a wide range of uses. While systems that can answer questions formed in natural language can help with all kinds of information needs, automatic question generation can be used, for example, to automatically create reading comprehension tasks and improve the interactivity of virtual assistants. These days, the best results in both question answering and question generation are obtained by utilizing pre-trained neural language models based on the transformer architecture. Such models are typically first pre-trained with raw language data and then fine-tuned for various tasks using task-specific annotated datasets. So far, no models that can answer or generate questions purely in Finnish have been reported. In order to create them using modern transformer-based methods, both a pre-trained language model and a sufficiently big dataset suitable for question answering or question generation fine-tuning are required. Although some suitable models that have been pre-trained with Finnish or multilingual data are already available, a big bottleneck is the lack of annotated data needed for fine-tuning the models. In this thesis, I create the first transformer-based neural network models for Finnish question answering and question generation. I present a method for creating a dataset for fine-tuning pre-trained models for the two tasks. The dataset creation is based on automatic translation of an existing dataset (SQuAD) and automatic normalization of the translated data. Using the created dataset, I fine-tune several pre-trained models to answer and generate questions in Finnish and evaluate their performance. I use monolingual BERT and GPT-2 models as well as a multilingual BERT model. The results show that the transformer architecture is also well suited to Finnish question answering and question generation. They also indicate that the synthetically generated dataset can be a useful fine-tuning resource for these tasks. The best results in both tasks are obtained by fine-tuned BERT models which have been pre-trained with only Finnish data. The fine-tuned multilingual BERT models come in close, whereas fine-tuned GPT-2 models are generally found to underperform. The data developed for this thesis will be released to the research community to support future research on question answering and generation, and the models will be released as benchmarks.

Helsingin yliopiston digitaalinen arkisto

Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization

Author: Torres-Moreno Juan-Manuel
Publication date: 14/09/2012

In Automatic Text Summarization, preprocessing is an important phase to reduce the space of textual representation. Classically, stemming and lemmatization have been widely used for normalizing words. However, even using normalization on large texts, the curse of dimensionality can disturb the performance of summarizers. This paper describes a new method for normalization of words to further reduce the space of representation. We propose to reduce each word to its initial letters, as a form of Ultra-stemming. The results show that Ultra-stemming not only preserves the content of summaries produced by this representation, but often the performance of the systems can be dramatically improved. Summaries on trilingual corpora were evaluated automatically with Fresa. Results confirm an increase in performance, regardless of the summarizer system used. Comment: 22 pages, 12 figures, 9 tables

arXiv.org e-Print Archive

CiteSeerX
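
The ultra-stemming idea in the abstract above, reducing each word to its initial letters, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the prefix length parameter `n`, the lowercasing, and the whitespace tokenization are all hypothetical choices, not details taken from the paper.

```python
def ultra_stem(word, n=1):
    """Truncate a word to its first n letters (n is an assumed parameter)."""
    # Lowercasing is an assumption made here so that case variants collapse.
    return word.lower()[:n]


def ultra_stem_text(text, n=1):
    """Apply ultra_stem to every whitespace-separated token in a text."""
    return [ultra_stem(token, n) for token in text.split()]


sentence = "Summarization systems summarize documents"
tokens = ultra_stem_text(sentence, n=2)
print(tokens)

# The vocabulary shrinks because distinct words can share a prefix.
print(len(set(sentence.lower().split())), "->", len(set(tokens)))
```

With `n=2`, "Summarization" and "summarize" map to the same two-letter form, which shows how the representation space is reduced at the cost of merging unrelated words that share a prefix.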

Beyond English text: Multilingual and multimedia information retrieval.

Author: Jones Gareth J.F.
Publication venue: Springer Fachmedien Wiesbaden GmbH
Publication date: 01/01/2005

CiteSeerX