Automatic summarization as means of simplifying texts, an evaluation for Swedish
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011).
Editors: Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa.
NEALT Proceedings Series, Vol. 11 (2011), 198-205.
© 2011 The editors and contributors.
Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt.
Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/16955.
Conference Program
Proceedings of NODALIDA 2011, NEALT Proceedings Series, Vol. 11 (2011), xii-xvii.
Text Summarization
Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks) [2]. A text summarization system that condenses a body of information down to its most important points makes reading and understanding a text easier and faster. With a large volume of text documents, a summary of each document greatly facilitates finding the desired documents and the desired data within them. To address this, the objective of this project is to simplify the texts produced by a previous text summarization system: reducing the number of words per sentence, shortening sentences, eliminating sentences with similar meanings, and producing grammar rules that generate human-like sentences. The waterfall model was chosen as the project development life cycle. Detailed research was conducted during the requirements definition phase, and the system prototype was designed in the system and software design phase. During the development phase, the code is implemented and unit testing is carried out throughout the process. Once every unit has been tested, the units are integrated and the system is tested as a whole. The complete program is put through thorough testing and evaluation to ensure its functionality and efficiency. In conclusion, this project should produce summarized text as its output and meet the project requirements and objectives.
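The pipeline the abstract describes, picking important sentences and eliminating ones with similar meanings, can be illustrated with a small frequency-based sketch. This is a hedged toy example, not the project's actual implementation; the word-frequency scoring and the Jaccard similarity threshold are assumptions made for illustration:

```python
import re
from collections import Counter

def summarize(text, max_sentences=2, similarity_threshold=0.6):
    """Score sentences by average word frequency, then greedily keep the
    top-scoring ones while skipping near-duplicates (high word overlap)."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    freq = Counter(re.findall(r'\w+', text.lower()))

    def score(sentence):
        tokens = re.findall(r'\w+', sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    def jaccard(a, b):
        sa = set(re.findall(r'\w+', a.lower()))
        sb = set(re.findall(r'\w+', b.lower()))
        return len(sa & sb) / max(len(sa | sb), 1)

    summary = []
    for sentence in sorted(sentences, key=score, reverse=True):
        if len(summary) >= max_sentences:
            break
        # Eliminate sentences too similar to ones already chosen.
        if all(jaccard(sentence, kept) < similarity_threshold for kept in summary):
            summary.append(sentence)
    # Restore original document order for readability.
    return [s for s in sentences if s in summary]
```

A real system would add the sentence-shortening and grammar-rule components the abstract mentions; this sketch only covers selection and redundancy elimination.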
Contents
Proceedings of NODALIDA 2011, NEALT Proceedings Series, Vol. 11 (2011), iii-vii.
Sentence Compressor
Nowadays, the internet has become the main source of information. Most people rely on it to find information for research and assignments, searching for the articles, journals, or web pages related to their task. To choose the right materials, they have to go through every article, journal, and web page to find the important points, which is very time-consuming for long articles. This information explosion has led to a constant state of information overload. As a solution, a desktop application named Sentence Compressor was developed to compress long articles. The project aims to shorten long sentences without changing their original meaning. Integer Linear Programming (ILP) techniques are used to solve the sentence compression problem, and the Bilingual Evaluation Understudy (BLEU) metric is used to measure the quality of the produced output. Five articles were randomly selected for the experiment, and the BLEU scores of articles compressed by Sentence Compressor were compared with those of articles compressed by humans. A system performance evaluation was also carried out to measure the usefulness of the application: more than 65% of respondents agreed that Sentence Compressor is useful for information searching.
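BLEU scores a candidate sentence by its clipped n-gram overlap with a reference, discounted by a brevity penalty. The sketch below is a simplified sentence-level variant (single reference, n-grams up to bigrams, no smoothing) written to show the mechanics; it is not the exact scorer used in the project:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions,
    multiplied by a brevity penalty for short candidates."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        precisions.append(overlap / max(sum(cand_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0  # no smoothing: any zero precision zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```

Production evaluations typically use a tested implementation (e.g. NLTK's `sentence_bleu`) with smoothing for short sentences, since unsmoothed BLEU collapses to zero whenever one n-gram order has no overlap.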
Neural network-based factoid question answering and question generation in Finnish
Automatic question answering and question generation are two closely related natural language processing tasks. They both have been studied for decades, and both have a wide range of uses. While systems that can answer questions formed in natural language can help with all kinds of information needs, automatic question generation can be used, for example, to automatically create reading comprehension tasks and improve the interactivity of virtual assistants. These days, the best results in both question answering and question generation are obtained by utilizing pre-trained neural language models based on the transformer architecture. Such models are typically first pre-trained with raw language data and then fine-tuned for various tasks using task-specific annotated datasets.
So far, no models that can answer or generate questions purely in Finnish have been reported. To create them using modern transformer-based methods, both a pre-trained language model and a sufficiently large dataset suitable for fine-tuning on question answering or question generation are required. Although some suitable models pre-trained on Finnish or multilingual data are already available, the main bottleneck is the lack of annotated data needed for fine-tuning the models.
In this thesis, I create the first transformer-based neural network models for Finnish question answering and question generation. I present a method for creating a dataset for fine-tuning pre-trained models for the two tasks. The dataset creation is based on automatic translation of an existing dataset (SQuAD) and automatic normalization of the translated data. Using the created dataset, I fine-tune several pre-trained models to answer and generate questions in Finnish and evaluate their performance. I use monolingual BERT and GPT-2 models as well as a multilingual BERT model.
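SQuAD stores each answer as a character offset into its context paragraph, and machine translation invalidates those offsets, so one plausible normalization step after translating the dataset is re-locating the translated answer string inside the translated context. The helper below is a hypothetical sketch of such a step; the `relocate_answer` name and the 0.8 fuzzy-match cutoff are assumptions for illustration, not the thesis's actual method:

```python
import difflib

def relocate_answer(context, answer):
    """Find the character offset of a (translated) answer string in the
    (translated) context: exact match first, then a fuzzy fallback.
    Returns (start, matched_text), or None if no good match is found."""
    start = context.find(answer)
    if start != -1:
        return start, answer
    # Fuzzy fallback: slide a window of the answer's length over the
    # context and keep the closest case-insensitive match.
    best_ratio, best_match = 0.0, None
    width = len(answer)
    for i in range(len(context) - width + 1):
        window = context[i:i + width]
        ratio = difflib.SequenceMatcher(None, window.lower(), answer.lower()).ratio()
        if ratio > best_ratio:
            best_ratio, best_match = ratio, (i, window)
    # Discard the example entirely if even the best window is a poor match.
    return best_match if best_ratio >= 0.8 else None
```

Examples whose answers cannot be re-located this way would be dropped from the synthetic dataset, which is one reason translation-based dataset creation yields fewer examples than the English original.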
The results show that the transformer architecture is well suited to Finnish question answering and question generation. They also indicate that the synthetically generated dataset can be a useful fine-tuning resource for these tasks. The best results in both tasks are obtained by fine-tuned BERT models that have been pre-trained on Finnish data only. The fine-tuned multilingual BERT models come close behind, whereas the fine-tuned GPT-2 models generally underperform.
The data developed for this thesis will be released to the research community to support future research on question answering and generation, and the models will be released as benchmarks.