A Lemma Based Evaluator for Semitic Language Text Summarization Systems
Matching texts in highly inflected languages such as Arabic with a simple
stemming strategy is unlikely to perform well. In this paper, we present an
automatic text matching technique for inflectional languages,
using Arabic as the test case. The system is an extension of the ROUGE test in
which texts are matched at the lemma level of each token. The experimental
results show improved detection of similarities between sentences that share
the same semantics but are written in different lexical forms.
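The lemma-level matching described above can be sketched as a small ROUGE-1 recall variant: count unigram overlap after mapping each token to its lemma. The toy lemma table below is a hypothetical stand-in for a real Arabic lemmatizer; only the idea of comparing lemmas rather than surface forms comes from the abstract.

```python
from collections import Counter

def rouge1_recall(candidate, reference, lemmatize=lambda tok: tok):
    """ROUGE-1 recall computed over lemmas rather than surface tokens."""
    cand = Counter(lemmatize(t) for t in candidate.split())
    ref = Counter(lemmatize(t) for t in reference.split())
    overlap = sum((cand & ref).values())  # multiset intersection of unigrams
    return overlap / max(sum(ref.values()), 1)

# Hypothetical toy lemma table standing in for a real lemmatizer.
LEMMAS = {"cats": "cat", "running": "run", "ran": "run"}
lemmatize = lambda t: LEMMAS.get(t, t)
```

With `lemmatize`, "cats ran" and "cat is running" match on the lemmas "cat" and "run", which surface-level ROUGE misses.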
Reference and Document Aware Semantic Evaluation Methods for Korean Language Summarization
Text summarization refers to the process of generating a shorter form of a
text from the source document while preserving salient information. Many existing
works for text summarization are generally evaluated by using recall-oriented
understudy for gisting evaluation (ROUGE) scores. However, as ROUGE scores are
computed based on n-gram overlap, they do not reflect semantic meaning
correspondences between generated and reference summaries. Because Korean is an
agglutinative language that combines various morphemes into a word that expresses
several meanings, ROUGE is not suitable for Korean summarization. In this
paper, we propose evaluation metrics that reflect semantic meanings of a
reference summary and the original document, Reference and Document Aware
Semantic Score (RDASS). We then propose a method for improving the correlation
of the metrics with human judgment. Evaluation results show that the
correlation with human judgment is significantly higher for our evaluation
metrics than for ROUGE scores.
Comment: COLING 202
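RDASS, as described, combines the predicted summary's similarity to the reference summary with its similarity to the source document. A minimal sketch, assuming sentence embeddings are already computed; the `cosine` helper and plain-list vectors are illustrative, not the paper's implementation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rdass(pred, ref, doc):
    """Average the predicted summary's similarity to the reference
    summary embedding and to the source document embedding."""
    return (cosine(pred, ref) + cosine(pred, doc)) / 2.0
```

Because the score depends on embeddings rather than n-gram overlap, two morphologically different Korean summaries with the same meaning can still score highly.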
A survey of methods to ease the development of highly multilingual text mining applications
Multilingual text processing is useful because the information content found
in different languages is complementary, both regarding facts and opinions.
While Information Extraction and other text mining software can, in principle,
be developed for many languages, most text analysis tools have only been
applied to small sets of languages because the development effort per language
is large. Self-training tools obviously alleviate the problem, but even the
effort of providing training data and of manually tuning the results is usually
considerable. In this paper, we gather insights by various multilingual system
developers on how to minimise the effort of developing natural language
processing applications for many languages. We also explain the main guidelines
underlying our own effort to develop complex text mining software for tens of
languages. While these guidelines (most of all: extreme simplicity) can be
very restrictive and limiting, we believe we have shown the feasibility of the
approach through the development of the Europe Media Monitor (EMM) family of
applications (http://emm.newsbrief.eu/overview.html). EMM is a set of complex
media monitoring tools that process and analyse up to 100,000 online news
articles per day in between twenty and fifty languages. We will also touch upon
the kind of language resources that would make it easier for all to develop
highly multilingual text mining applications. We will argue that - to achieve
this - the most needed resources would be freely available, simple, parallel
and uniform multilingual dictionaries, corpora and software tools.
Comment: 22 pages. Published online on 12 October 201
Lessons from the Bible on Modern Topics: Low-Resource Multilingual Topic Model Evaluation
Multilingual topic models enable document analysis across languages through
coherent multilingual summaries of the data. However, there is no standard and
effective metric to evaluate the quality of multilingual topics. We introduce a
new intrinsic evaluation of multilingual topic models that correlates well with
human judgments of multilingual topic coherence as well as performance in
downstream applications. Importantly, we also study evaluation for low-resource
languages. Because standard metrics fail to accurately measure topic quality
when robust external resources are unavailable, we propose an adaptation model
that improves the accuracy and reliability of these metrics in low-resource
settings.
Comment: North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), New Orleans, Louisiana. June 201
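Intrinsic topic-coherence metrics of the kind discussed above are commonly estimated from word co-occurrence statistics, for example via normalized pointwise mutual information (NPMI). A minimal document-level sketch; the counting scheme is a generic illustration, not the paper's adapted metric:

```python
import math

def npmi_coherence(topic_words, docs):
    """Mean NPMI over all word pairs in a topic, estimated from
    document-level co-occurrence counts (docs are sets of words)."""
    def prob(words):
        return sum(all(w in d for w in words) for d in docs) / len(docs)
    scores = []
    for i, w1 in enumerate(topic_words):
        for w2 in topic_words[i + 1:]:
            joint = prob([w1, w2])
            if joint == 0.0:
                scores.append(-1.0)  # convention for never-co-occurring pairs
                continue
            pmi = math.log(joint / (prob([w1]) * prob([w2])))
            scores.append(pmi / -math.log(joint))  # normalise into [-1, 1]
    return sum(scores) / len(scores)
```

Metrics like this need a reference corpus for the counts, which is exactly what becomes unreliable in the low-resource settings the paper targets.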
An Enhanced Latent Semantic Analysis Approach for Arabic Document Summarization
The fast-growing amount of information on the Internet makes research in
automatic document summarization increasingly urgent, as it offers an effective
solution to information overload. Many approaches have been proposed based on different
strategies, such as latent semantic analysis (LSA). However, LSA, when applied
to document summarization, has some limitations which diminish its performance.
In this work, we try to overcome these limitations by applying statistical and
linear algebraic approaches combined with syntactic and semantic processing of
text. First, a part-of-speech tagger is utilized to reduce the dimension of
LSA. Then, the weight of the term in four adjacent sentences is added to the
weighting schemes while calculating the input matrix to take into account the
word order and the syntactic relations. In addition, a new LSA-based sentence
selection algorithm is proposed, in which the term description is combined with
sentence description for each topic which in turn makes the generated summary
more informative and diverse. To ensure the effectiveness of the proposed
LSA-based sentence selection algorithm, extensive experiments on Arabic and
English are conducted. Four datasets are used to evaluate the new model: the Linguistic
Data Consortium (LDC) Arabic Newswire-a corpus, Essex Arabic Summaries Corpus
(EASC), DUC2002, and Multilingual MSS 2015 dataset. Experimental results on the
four datasets show the effectiveness of the proposed model on Arabic and
English datasets. It performs consistently better than the
state-of-the-art methods.
Comment: This is a pre-print of an article published in Arabian Journal for Science and Engineering. The final authenticated version is available online at: https://doi.org/10.1007/s13369-018-3286-
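The classic LSA selection step (Gong and Liu style) picks, for each of the top latent topics, the sentence that weighs most heavily in the corresponding row of V^T from the SVD of the term-sentence matrix. A sketch assuming the SVD has already been computed (in practice via numpy/scipy); the `vt_rows` input format and tie handling are illustrative, not the paper's extended algorithm:

```python
def select_sentences(vt_rows, k):
    """Given rows of V^T (topics x sentences) from an SVD of the
    term-sentence matrix, pick one distinct sentence per top-k topic."""
    chosen = []
    for topic in vt_rows[:k]:
        # Rank sentence indices by absolute topic weight, highest first.
        ranked = sorted(range(len(topic)), key=lambda j: -abs(topic[j]))
        for j in ranked:
            if j not in chosen:  # avoid selecting a sentence twice
                chosen.append(j)
                break
    return chosen
```

The paper's contribution is to augment this baseline, combining term and sentence descriptions per topic, which this sketch does not attempt to reproduce.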
Autoencoder as Assistant Supervisor: Improving Text Representation for Chinese Social Media Text Summarization
Most of the current abstractive text summarization models are based on the
sequence-to-sequence model (Seq2Seq). The source content of social media is
long and noisy, so it is difficult for Seq2Seq to learn an accurate semantic
representation. Compared with the source content, the annotated summary is
short and well written. Moreover, it shares the same meaning as the source
content. In this work, we supervise the learning of the representation of the
source content with that of the summary. In implementation, we regard a summary
autoencoder as an assistant supervisor of Seq2Seq. Following previous work, we
evaluate our model on a popular Chinese social media dataset. Experimental
results show that our model achieves the state-of-the-art performances on the
benchmark dataset.
Comment: accepted by ACL 201
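The assistant-supervisor idea amounts to adding a representation-matching term to the usual Seq2Seq loss, pulling the encoder's representation of the noisy source toward the autoencoder's representation of the clean summary. A schematic sketch with plain-Python vectors; the Euclidean penalty and the `lam` weight are illustrative assumptions, not the paper's exact objective:

```python
import math

def joint_loss(ce_loss, src_repr, summ_repr, lam=0.5):
    """Seq2Seq cross-entropy plus a penalty on the distance between the
    source encoder representation and the summary autoencoder's one."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(src_repr, summ_repr)))
    return ce_loss + lam * dist
```

At test time the autoencoder branch is dropped; it only shapes the representation during training.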
Sentence Compression in Spanish driven by Discourse Segmentation and Language Models
Previous works demonstrated that Automatic Text Summarization (ATS) by
sentence extraction may be improved using sentence compression. In this work
we present a sentence compression approach guided by sentence-level discourse
segmentation and probabilistic language models (LMs). The results presented here
show that the proposed solution is able to generate coherent summaries with
grammatical compressed sentences. The approach is simple enough to be
transposed into other languages.
Comment: 7 pages, 3 table
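LM-guided compression of this kind can be sketched as scoring candidate compressions (for example, obtained by dropping optional discourse segments) with a language model and keeping the most fluent one. The bigram table and length normalisation below are toy assumptions, not the paper's models:

```python
def lm_score(tokens, bigram_logprob, default=-5.0):
    """Sum of bigram log-probabilities, with a floor for unseen bigrams."""
    return sum(bigram_logprob.get((a, b), default)
               for a, b in zip(tokens, tokens[1:]))

def best_compression(candidates, bigram_logprob):
    """Rank candidate compressions by length-normalised LM score, so
    shorter candidates are not favoured merely for having fewer bigrams."""
    return max(candidates,
               key=lambda c: lm_score(c.split(), bigram_logprob) / len(c.split()))
```

In the paper's setting the candidates come from discourse segmentation rather than arbitrary deletions, which keeps the compressed sentences grammatical.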
IntelliCode Compose: Code Generation Using Transformer
In software development through integrated development environments (IDEs),
code completion is one of the most widely used features. Nevertheless, the
majority of integrated development environments only support completion of
methods, APIs, or arguments.
In this paper, we introduce IntelliCode Compose, a general-purpose
multilingual code completion tool which is capable of predicting sequences of
code tokens of arbitrary types, generating up to entire lines of syntactically
correct code. It leverages a state-of-the-art generative transformer model
trained on 1.2 billion lines of source code in Python, C#, JavaScript, and
TypeScript programming languages. IntelliCode Compose is deployed as a
cloud-based web service. It makes use of client-side tree-based caching,
efficient parallel implementation of the beam search decoder, and compute graph
optimizations to meet edit-time completion suggestion requirements in the
Visual Studio Code IDE and Azure Notebook.
Our best model yields an average edit similarity of and a perplexity
of 1.82 for the Python programming language.
Comment: Accepted for publication at ESEC/FSE conferenc
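Edit similarity between a suggested completion and the code the developer actually wrote is typically computed as one minus the Levenshtein distance normalised by the longer string's length. A minimal sketch; the exact tokenisation and normalisation used in the paper may differ:

```python
def edit_similarity(a, b):
    """1 minus the Levenshtein distance between a and b, normalised
    by the length of the longer string (1.0 means identical)."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b), 1)
```

Unlike exact-match accuracy, this metric gives partial credit to suggestions that need only minor edits.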
Automated text summarisation and evidence-based medicine: A survey of two domains
The practice of evidence-based medicine (EBM) urges medical practitioners to
utilise the latest research evidence when making clinical decisions. Because of
the massive and growing volume of published research on various medical topics,
practitioners often find themselves overloaded with information. As such,
natural language processing research has recently begun exploring
techniques for medical domain-specific automated text summarisation
(ATS), targeted at condensing large medical texts.
However, the development of effective summarisation techniques for this task
requires cross-domain knowledge. We present a survey of EBM, the
domain-specific needs for EBM, automated summarisation techniques, and how they
have been applied hitherto. We envision that this survey will serve as a first
resource for the development of future operational text summarisation
techniques for EBM.
Consistency by Agreement in Zero-shot Neural Machine Translation
Generalization and reliability of multilingual translation often highly
depend on the amount of available parallel data for each language pair of
interest. In this paper, we focus on zero-shot generalization, a challenging
setup that tests models on translation directions they have not been optimized
for at training time. To solve the problem, we (i) reformulate multilingual
translation as probabilistic inference, (ii) define the notion of zero-shot
consistency and show why standard training often results in models unsuitable
for zero-shot tasks, and (iii) introduce a consistent agreement-based training
method that encourages the model to produce equivalent translations of parallel
sentences in auxiliary languages. We test our multilingual NMT models on
multiple public zero-shot translation benchmarks (IWSLT17, UN corpus, Europarl)
and show that agreement-based learning often results in 2-3 BLEU zero-shot
improvement over strong baselines without any loss in performance on supervised
translation directions.
Comment: NAACL 2019 (14 pages, 5 figures)
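Agreement-based training of the kind described adds a term penalising disagreement between the model's distributions over equivalent translations of the same content. A schematic sketch using a symmetric KL divergence on toy categorical distributions; the loss form and the `lam` weight are illustrative, not the paper's exact formulation:

```python
import math

def sym_kl(p, q):
    """Symmetric KL divergence between two categorical distributions."""
    kl = lambda x, y: sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * (kl(p, q) + kl(q, p))

def total_loss(supervised_loss, dist_a, dist_b, lam=1.0):
    """Supervised translation loss plus an agreement term penalising
    disagreement between two distributions over the same target."""
    return supervised_loss + lam * sym_kl(dist_a, dist_b)
```

When the two distributions agree, the extra term vanishes and training reduces to the standard supervised objective, which is why supervised directions need not suffer.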