A Lemma Based Evaluator for Semitic Language Text Summarization Systems
Matching texts in highly inflected languages such as Arabic with a simple
stemming strategy is unlikely to perform well. In this paper, we present an
automatic text matching technique for inflectional languages,
using Arabic as the test case. The system is an extension of the ROUGE test in
which texts are matched at the lemma level of each token. The experimental
results show improved detection of similarities between sentences that share
the same semantics but are written in different lexical forms.
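The lemma-level matching described above can be sketched as a small ROUGE-1 recall variant: count unigram overlap after mapping each token to its lemma. The toy lemma table below is a hypothetical stand-in for a real Arabic lemmatizer; only the idea of comparing lemmas rather than surface forms comes from the abstract.

```python
from collections import Counter

def rouge1_recall(candidate, reference, lemmatize=lambda tok: tok):
    """ROUGE-1 recall computed over lemmas rather than surface tokens."""
    cand = Counter(lemmatize(t) for t in candidate.split())
    ref = Counter(lemmatize(t) for t in reference.split())
    overlap = sum((cand & ref).values())  # multiset intersection of unigrams
    return overlap / max(sum(ref.values()), 1)

# Hypothetical toy lemma table standing in for a real lemmatizer.
LEMMAS = {"cats": "cat", "running": "run", "ran": "run"}
lemmatize = lambda t: LEMMAS.get(t, t)
```

With `lemmatize`, "cats ran" and "cat is running" match on the lemmas "cat" and "run", which surface-level ROUGE misses.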
Reference and Document Aware Semantic Evaluation Methods for Korean Language Summarization
Text summarization refers to the process of generating a shorter form of a
text from the source document while preserving salient information. Many existing
works for text summarization are generally evaluated by using recall-oriented
understudy for gisting evaluation (ROUGE) scores. However, as ROUGE scores are
computed based on n-gram overlap, they do not reflect semantic meaning
correspondences between generated and reference summaries. Because Korean is an
agglutinative language that combines various morphemes into a word that expresses
several meanings, ROUGE is not suitable for Korean summarization. In this
paper, we propose evaluation metrics that reflect semantic meanings of a
reference summary and the original document, Reference and Document Aware
Semantic Score (RDASS). We then propose a method for improving the correlation
of the metrics with human judgment. Evaluation results show that the
correlation with human judgment is significantly higher for our evaluation
metrics than for ROUGE scores.
Comment: COLING 202
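RDASS, as described, combines the predicted summary's similarity to the reference summary with its similarity to the source document. A minimal sketch, assuming sentence embeddings are already computed; the `cosine` helper and plain-list vectors are illustrative, not the paper's implementation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rdass(pred, ref, doc):
    """Average the predicted summary's similarity to the reference
    summary embedding and to the source document embedding."""
    return (cosine(pred, ref) + cosine(pred, doc)) / 2.0
```

Because the score depends on embeddings rather than n-gram overlap, two morphologically different Korean summaries with the same meaning can still score highly.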
A survey of methods to ease the development of highly multilingual text mining applications
Multilingual text processing is useful because the information content found
in different languages is complementary, both regarding facts and opinions.
While Information Extraction and other text mining software can, in principle,
be developed for many languages, most text analysis tools have only been
applied to small sets of languages because the development effort per language
is large. Self-training tools obviously alleviate the problem, but even the
effort of providing training data and of manually tuning the results is usually
considerable. In this paper, we gather insights by various multilingual system
developers on how to minimise the effort of developing natural language
processing applications for many languages. We also explain the main guidelines
underlying our own effort to develop complex text mining software for tens of
languages. While these guidelines (most of all: extreme simplicity) can be
very restrictive and limiting, we believe we have shown the feasibility of the
approach through the development of the Europe Media Monitor (EMM) family of
applications (http://emm.newsbrief.eu/overview.html). EMM is a set of complex
media monitoring tools that process and analyse up to 100,000 online news
articles per day in between twenty and fifty languages. We will also touch upon
the kind of language resources that would make it easier for all to develop
highly multilingual text mining applications. We will argue that - to achieve
this - the most needed resources would be freely available, simple, parallel
and uniform multilingual dictionaries, corpora and software tools.
Comment: 22 pages. Published online on 12 October 201
Lessons from the Bible on Modern Topics: Low-Resource Multilingual Topic Model Evaluation
Multilingual topic models enable document analysis across languages through
coherent multilingual summaries of the data. However, there is no standard and
effective metric to evaluate the quality of multilingual topics. We introduce a
new intrinsic evaluation of multilingual topic models that correlates well with
human judgments of multilingual topic coherence as well as performance in
downstream applications. Importantly, we also study evaluation for low-resource
languages. Because standard metrics fail to accurately measure topic quality
when robust external resources are unavailable, we propose an adaptation model
that improves the accuracy and reliability of these metrics in low-resource
settings.
Comment: North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), New Orleans, Louisiana. June 201
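Intrinsic topic-coherence metrics of the kind discussed above are commonly estimated from word co-occurrence statistics, for example via normalized pointwise mutual information (NPMI). A minimal document-level sketch; the counting scheme is a generic illustration, not the paper's adapted metric:

```python
import math

def npmi_coherence(topic_words, docs):
    """Mean NPMI over all word pairs in a topic, estimated from
    document-level co-occurrence counts (docs are sets of words)."""
    def prob(words):
        return sum(all(w in d for w in words) for d in docs) / len(docs)
    scores = []
    for i, w1 in enumerate(topic_words):
        for w2 in topic_words[i + 1:]:
            joint = prob([w1, w2])
            if joint == 0.0:
                scores.append(-1.0)  # convention for never-co-occurring pairs
                continue
            pmi = math.log(joint / (prob([w1]) * prob([w2])))
            scores.append(pmi / -math.log(joint))  # normalise into [-1, 1]
    return sum(scores) / len(scores)
```

Metrics like this need a reference corpus for the counts, which is exactly what becomes unreliable in the low-resource settings the paper targets.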
An Enhanced Latent Semantic Analysis Approach for Arabic Document Summarization
The fast-growing amount of information on the Internet makes research in
automatic document summarization increasingly urgent, as it offers an effective
solution to information overload. Many approaches have been proposed based on different
strategies, such as latent semantic analysis (LSA). However, LSA, when applied
to document summarization, has some limitations which diminish its performance.
In this work, we try to overcome these limitations by applying statistical and
linear algebraic approaches combined with syntactic and semantic processing of
text. First, a part-of-speech tagger is utilized to reduce the dimension of
LSA. Then, the weight of the term in four adjacent sentences is added to the
weighting schemes while calculating the input matrix to take into account the
word order and the syntactic relations. In addition, a new LSA-based sentence
selection algorithm is proposed, in which the term description is combined with
sentence description for each topic which in turn makes the generated summary
more informative and diverse. To ensure the effectiveness of the proposed
LSA-based sentence selection algorithm, extensive experiments on Arabic and
English are conducted. Four datasets are used to evaluate the new model: the Linguistic
Data Consortium (LDC) Arabic Newswire-a corpus, Essex Arabic Summaries Corpus
(EASC), DUC2002, and Multilingual MSS 2015 dataset. Experimental results on the
four datasets show the effectiveness of the proposed model on Arabic and
English datasets. It performs consistently better than the
state-of-the-art methods.
Comment: This is a pre-print of an article published in Arabian Journal for Science and Engineering. The final authenticated version is available online at: https://doi.org/10.1007/s13369-018-3286-
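The classic LSA selection step (Gong and Liu style) picks, for each of the top latent topics, the sentence that weighs most heavily in the corresponding row of V^T from the SVD of the term-sentence matrix. A sketch assuming the SVD has already been computed (in practice via numpy/scipy); the `vt_rows` input format and tie handling are illustrative, not the paper's extended algorithm:

```python
def select_sentences(vt_rows, k):
    """Given rows of V^T (topics x sentences) from an SVD of the
    term-sentence matrix, pick one distinct sentence per top-k topic."""
    chosen = []
    for topic in vt_rows[:k]:
        # Rank sentence indices by absolute topic weight, highest first.
        ranked = sorted(range(len(topic)), key=lambda j: -abs(topic[j]))
        for j in ranked:
            if j not in chosen:  # avoid selecting a sentence twice
                chosen.append(j)
                break
    return chosen
```

The paper's contribution is to augment this baseline, combining term and sentence descriptions per topic, which this sketch does not attempt to reproduce.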
Autoencoder as Assistant Supervisor: Improving Text Representation for Chinese Social Media Text Summarization
Most of the current abstractive text summarization models are based on the
sequence-to-sequence model (Seq2Seq). The source content of social media is
long and noisy, so it is difficult for Seq2Seq to learn an accurate semantic
representation. Compared with the source content, the annotated summary is
short and well written. Moreover, it shares the same meaning as the source
content. In this work, we supervise the learning of the representation of the
source content with that of the summary. In implementation, we regard a summary
autoencoder as an assistant supervisor of Seq2Seq. Following previous work, we
evaluate our model on a popular Chinese social media dataset. Experimental
results show that our model achieves the state-of-the-art performances on the
benchmark dataset.
Comment: accepted by ACL 201
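The assistant-supervisor idea amounts to adding a representation-matching term to the usual Seq2Seq loss, pulling the encoder's representation of the noisy source toward the autoencoder's representation of the clean summary. A schematic sketch with plain-Python vectors; the Euclidean penalty and the `lam` weight are illustrative assumptions, not the paper's exact objective:

```python
import math

def joint_loss(ce_loss, src_repr, summ_repr, lam=0.5):
    """Seq2Seq cross-entropy plus a penalty on the distance between the
    source encoder representation and the summary autoencoder's one."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(src_repr, summ_repr)))
    return ce_loss + lam * dist
```

At test time the autoencoder branch is dropped; it only shapes the representation during training.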
Sentence Compression in Spanish driven by Discourse Segmentation and Language Models
Previous works demonstrated that Automatic Text Summarization (ATS) by
sentence extraction may be improved using sentence compression. In this work
we present a sentence compression approach guided by sentence-level discourse
segmentation and probabilistic language models (LMs). The results presented here
show that the proposed solution is able to generate coherent summaries with
grammatical compressed sentences. The approach is simple enough to be
transposed into other languages.
Comment: 7 pages, 3 table
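LM-guided compression of this kind can be sketched as scoring candidate compressions (for example, obtained by dropping optional discourse segments) with a language model and keeping the most fluent one. The bigram table and length normalisation below are toy assumptions, not the paper's models:

```python
def lm_score(tokens, bigram_logprob, default=-5.0):
    """Sum of bigram log-probabilities, with a floor for unseen bigrams."""
    return sum(bigram_logprob.get((a, b), default)
               for a, b in zip(tokens, tokens[1:]))

def best_compression(candidates, bigram_logprob):
    """Rank candidate compressions by length-normalised LM score, so
    shorter candidates are not favoured merely for having fewer bigrams."""
    return max(candidates,
               key=lambda c: lm_score(c.split(), bigram_logprob) / len(c.split()))
```

In the paper's setting the candidates come from discourse segmentation rather than arbitrary deletions, which keeps the compressed sentences grammatical.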
IntelliCode Compose: Code Generation Using Transformer
In software development through integrated development environments (IDEs),
code completion is one of the most widely used features. Nevertheless, the
majority of integrated development environments only support completion of
methods, APIs, or arguments.
In this paper, we introduce IntelliCode Compose, a general-purpose
multilingual code completion tool which is capable of predicting sequences of
code tokens of arbitrary types, generating up to entire lines of syntactically
correct code. It leverages a state-of-the-art generative transformer model
trained on 1.2 billion lines of source code in Python, C#, JavaScript, and
TypeScript programming languages. IntelliCode Compose is deployed as a
cloud-based web service. It makes use of client-side tree-based caching,
efficient parallel implementation of the beam search decoder, and compute graph
optimizations to meet edit-time completion suggestion requirements in the
Visual Studio Code IDE and Azure Notebook.
Our best model yields an average edit similarity of and a perplexity
of 1.82 for the Python programming language.
Comment: Accepted for publication at ESEC/FSE conferenc
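Edit similarity between a suggested completion and the code the developer actually wrote is typically computed as one minus the Levenshtein distance normalised by the longer string's length. A minimal sketch; the exact tokenisation and normalisation used in the paper may differ:

```python
def edit_similarity(a, b):
    """1 minus the Levenshtein distance between a and b, normalised
    by the length of the longer string (1.0 means identical)."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b), 1)
```

Unlike exact-match accuracy, this metric gives partial credit to suggestions that need only minor edits.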
Automated text summarisation and evidence-based medicine: A survey of two domains
The practice of evidence-based medicine (EBM) urges medical practitioners to
utilise the latest research evidence when making clinical decisions. Because of
the massive and growing volume of published research on various medical topics,
practitioners often find themselves overloaded with information. As such,
natural language processing research has recently begun exploring
techniques for medical domain-specific automated text summarisation
(ATS), targeted at condensing large medical texts.
However, the development of effective summarisation techniques for this task
requires cross-domain knowledge. We present a survey of EBM, the
domain-specific needs for EBM, automated summarisation techniques, and how they
have been applied hitherto. We envision that this survey will serve as a first
resource for the development of future operational text summarisation
techniques for EBM.
Consistency by Agreement in Zero-shot Neural Machine Translation
Generalization and reliability of multilingual translation often highly
depend on the amount of available parallel data for each language pair of
interest. In this paper, we focus on zero-shot generalization, a challenging
setup that tests models on translation directions they have not been optimized
for at training time. To solve the problem, we (i) reformulate multilingual
translation as probabilistic inference, (ii) define the notion of zero-shot
consistency and show why standard training often results in models unsuitable
for zero-shot tasks, and (iii) introduce a consistent agreement-based training
method that encourages the model to produce equivalent translations of parallel
sentences in auxiliary languages. We test our multilingual NMT models on
multiple public zero-shot translation benchmarks (IWSLT17, UN corpus, Europarl)
and show that agreement-based learning often results in 2-3 BLEU zero-shot
improvement over strong baselines without any loss in performance on supervised
translation directions.
Comment: NAACL 2019 (14 pages, 5 figures)
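Agreement-based training of the kind described adds a term penalising disagreement between the model's distributions over equivalent translations of the same content. A schematic sketch using a symmetric KL divergence on toy categorical distributions; the loss form and the `lam` weight are illustrative, not the paper's exact formulation:

```python
import math

def sym_kl(p, q):
    """Symmetric KL divergence between two categorical distributions."""
    kl = lambda x, y: sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * (kl(p, q) + kl(q, p))

def total_loss(supervised_loss, dist_a, dist_b, lam=1.0):
    """Supervised translation loss plus an agreement term penalising
    disagreement between two distributions over the same target."""
    return supervised_loss + lam * sym_kl(dist_a, dist_b)
```

When the two distributions agree, the extra term vanishes and training reduces to the standard supervised objective, which is why supervised directions need not suffer.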