262 research outputs found

    Automatic summarization as means of simplifying texts, an evaluation for Swedish

    Proceedings of the 18th Nordic Conference of Computational Linguistics NODALIDA 2011. Editors: Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa. NEALT Proceedings Series, Vol. 11 (2011), 198-205. © 2011 The editors and contributors. Published by Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt. Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/16955

    Conference Program

    Proceedings of the 18th Nordic Conference of Computational Linguistics NODALIDA 2011. Editors: Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa. NEALT Proceedings Series, Vol. 11 (2011), xii-xvii. © 2011 The editors and contributors. Published by Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt. Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/16955

    Text Summarization

    Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks) [2]. A text summarization system that condenses a large body of information to only its most important points makes reading and understanding a text easier and faster. With a large volume of text documents, a summary of each document greatly facilitates finding the desired documents and the desired data within them. This project therefore aims to simplify the texts produced by a previous text summarization system: further reducing the number of words per sentence, shortening sentences, eliminating sentences with similar meanings, and producing grammar rules that generate human-like sentences. The waterfall model was chosen as the project development life cycle. Detailed research was conducted during the requirement definition phase, and the system prototype was designed during the system and software design phase. In the development phase, the coding is implemented and unit testing is carried out throughout. Once all units have been tested, they are integrated and the system is tested as a whole. The complete program is put through thorough testing and evaluation to ensure its functionality and efficiency. In conclusion, this project should produce a summarized text as its output and meet the project requirements and objectives
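    One step the abstract describes, eliminating sentences with similar meanings, can be sketched as follows. This is a minimal illustration, not the project's actual implementation: it approximates sentence similarity with cosine similarity over bag-of-words vectors and keeps a sentence only if it is sufficiently different from every sentence kept so far (the 0.8 threshold is an arbitrary choice for the example).

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def drop_similar(sentences, threshold=0.8):
    # Keep a sentence only if it is not too similar to any already-kept one.
    kept, vectors = [], []
    for s in sentences:
        v = Counter(s.lower().split())
        if all(cosine(v, u) < threshold for u in vectors):
            kept.append(s)
            vectors.append(v)
    return kept

sents = [
    "The cat sat on the mat.",
    "The cat sat on the mat today.",
    "Dogs bark loudly at night.",
]
print(drop_similar(sents))  # the near-duplicate second sentence is dropped
```

A production system would use a stronger similarity signal (e.g. embeddings), but the keep/drop loop has the same shape.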

    Contents

    Proceedings of the 18th Nordic Conference of Computational Linguistics NODALIDA 2011. Editors: Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa. NEALT Proceedings Series, Vol. 11 (2011), iii-vii. © 2011 The editors and contributors. Published by Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt. Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/16955

    Sentence Compressor

    Nowadays, the internet is the main source of information: most people rely on it to find material for research and assignments. To choose the right materials, they have to go through every article, journal, and web page to find the important points, which is very time-consuming for long articles. This information explosion has led to a constant state of information overload. As a solution, a desktop application named Sentence Compressor was developed to compress long articles. The project aims to shorten long sentences without changing their original meaning. Integer Linear Programming (ILP) techniques are used to solve the sentence compression problem, and the Bilingual Evaluation Understudy (BLEU) metric is used to measure the quality of the output. Five randomly selected articles were used in the experiment, comparing the BLEU scores of articles compressed by Sentence Compressor against articles compressed by humans. A system performance evaluation was also carried out to measure the usefulness of the application; more than 65% of respondents agreed that Sentence Compressor is useful in information searching
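    The BLEU evaluation mentioned above can be illustrated with a toy sketch. This is not the project's code: real BLEU combines clipped n-gram precisions up to 4-grams, while this simplified version uses only clipped unigram precision plus the brevity penalty, which is enough to show how a machine-compressed sentence is scored against a human reference.

```python
from collections import Counter
import math

def bleu1(candidate: str, reference: str) -> float:
    cand, ref = candidate.lower().split(), reference.lower().split()
    # Counter intersection gives clipped match counts: a candidate word
    # cannot be credited more times than it occurs in the reference.
    overlap = Counter(cand) & Counter(ref)
    precision = sum(overlap.values()) / len(cand)
    # Brevity penalty discourages gaming precision with very short output.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

# A 3-word compression of a 6-word reference: perfect precision,
# but penalized for brevity (score = exp(-1) ≈ 0.368).
print(bleu1("the cat sat", "the cat sat on the mat"))
```

Libraries such as NLTK's `sentence_bleu` implement the full metric with higher-order n-grams and smoothing.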

    Neural network-based factoid question answering and question generation in Finnish

    Automatic question answering and question generation are two closely related natural language processing tasks. Both have been studied for decades, and both have a wide range of uses. Systems that can answer questions posed in natural language help with all kinds of information needs, while automatic question generation can be used, for example, to automatically create reading comprehension tasks and to improve the interactivity of virtual assistants. These days, the best results in both question answering and question generation are obtained by utilizing pre-trained neural language models based on the transformer architecture. Such models are typically first pre-trained with raw language data and then fine-tuned for various tasks using task-specific annotated datasets. So far, hardly any models that can answer or generate questions in Finnish have been reported. In order to create them using modern transformer-based methods, both a pre-trained language model and a sufficiently big dataset suitable for question answering or question generation fine-tuning are required. Although some monolingual models pre-trained purely on Finnish and multilingual models pre-trained partly on Finnish are already openly available, a big bottleneck is the lack of annotated data needed for fine-tuning the models. In this thesis, I create the first transformer-based neural network models for Finnish question answering and question generation. I present a method for creating a dataset for fine-tuning pre-trained models for the two tasks. The dataset creation is based on automatic translation of the existing English SQuAD dataset and automatic post-translation normalization of the translated data. Using the created dataset, I fine-tune several pre-trained models to answer and generate questions in Finnish and evaluate their performance. I use BERT and GPT-2 models pre-trained purely on Finnish as well as one BERT model pre-trained on multilingual data. The results show that the transformer architecture is well suited also for Finnish question answering and question generation, and that the synthetically generated dataset can be a useful fine-tuning resource for these tasks. The best results in both tasks are obtained by fine-tuned BERT models which have been pre-trained with only Finnish data. The fine-tuned multilingual BERT models come in close, whereas fine-tuned GPT-2 models are generally found to underperform. The data developed for this thesis will be released to the research community to support future research on question answering and generation, and the models will be released as benchmarks
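    One concrete difficulty in translating a SQuAD-style dataset, which the post-translation normalization has to address, is that answer spans are given as character offsets into the original context and no longer match after translation. A simplified sketch of this re-anchoring step (assumed for illustration; not the thesis code) is:

```python
from typing import Optional

def realign(example: dict) -> Optional[dict]:
    """Re-anchor a translated answer string inside the translated context.

    SQuAD answers are character spans; after machine translation the stored
    offset is meaningless, so we search for the translated answer text and
    discard examples where it no longer occurs verbatim.
    """
    context, answer = example["context"], example["answer"]
    start = context.lower().find(answer.lower())
    if start == -1:
        return None  # answer text lost in translation: drop the example
    return {**example,
            "answer_start": start,
            # Take the span from the context so casing matches exactly.
            "answer": context[start:start + len(answer)]}

ex = {"question": "Missä kissa istui?",
      "context": "Kissa istui matolla koko päivän.",
      "answer": "matolla"}
print(realign(ex)["answer_start"])  # → 12
```

Real pipelines use fuzzier matching (normalizing inflection and punctuation) to keep more examples, but the keep-or-drop decision is the same.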