Accuracy-based scoring for phrase-based statistical machine translation
Although the scoring features of state-of-the-art Phrase-Based Statistical Machine Translation (PB-SMT) models are weighted so as to optimise an objective function measuring translation quality, the estimation of the features themselves bears no relation to such quality metrics. In this paper, we introduce a translation quality-based feature into PB-SMT in a bid to improve the translation quality of the system. Our feature is estimated by averaging the edit distance between the phrase pairs involved in the translation of oracle sentences, chosen by automatic evaluation metrics from the N-best outputs of a baseline system, and the phrase pairs occurring elsewhere in the N-best list. Using our method, we report a statistically significant 2.11% relative improvement in BLEU score for the WMT 2009 Spanish-to-English translation task. We also achieve statistically significant improvements over the baseline on many other MT evaluation metrics, together with a substantial increase in speed and reduction in memory use (owing to an 87% reduction in phrase-table size), while maintaining significant gains in translation quality.
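The feature described above rests on averaging edit distances between phrase pairs. A minimal sketch of that computation, assuming a simple aggregation (each oracle phrase matched to its closest phrase in the N-best list; the paper's exact pairing may differ):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[len(b)]

def accuracy_feature(oracle_phrases, nbest_phrases):
    """Hypothetical aggregation: average distance from each oracle
    phrase to its closest phrase in the N-best list."""
    total = sum(min(edit_distance(p, q) for q in nbest_phrases)
                for p in oracle_phrases)
    return total / len(oracle_phrases)
```

Lower values of such a feature would indicate phrase pairs that agree more closely with the oracle translations.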
Summarization of Films and Documentaries Based on Subtitles and Scripts
We assess the performance of generic text summarization algorithms applied to
films and documentaries, using the well-known behavior of summarization of news
articles as reference. We use three datasets: (i) news articles, (ii) film
scripts and subtitles, and (iii) documentary subtitles. Standard ROUGE metrics
are used for comparing generated summaries against news abstracts, plot
summaries, and synopses. We show that the best performing algorithms are LSA,
for news articles and documentaries, and LexRank and Support Sets, for films.
Despite the different nature of films and documentaries, their relative
behavior is in accordance with that obtained for news articles.
Comment: 7 pages, 9 tables, 4 figures, submitted to Pattern Recognition Letters (Elsevier).
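The ROUGE metrics used above compare n-gram overlap between a generated summary and a reference. A minimal sketch of ROUGE-N recall (the library the authors used may compute additional variants such as F-measure):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams that also
    appear in the candidate summary (counts clipped per n-gram)."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(cand[g], c) for g, c in ref.items())
    return overlap / sum(ref.values())
```

For example, a candidate that recovers half of the reference's unigrams scores ROUGE-1 recall of 0.5.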
Combining Thesaurus Knowledge and Probabilistic Topic Models
In this paper we present an approach for introducing thesaurus knowledge into probabilistic topic models. The main idea of the approach is based on the assumption that the frequencies of semantically related words and phrases occurring in the same texts should be enhanced: this boosting leads to a larger contribution of those words to the topics found in those texts. We conducted experiments with several thesauri and found that domain-specific knowledge is useful for improving topic models. If a general thesaurus, such as WordNet, is used, the thesaurus-based improvement of topic models can be achieved by excluding hyponymy relations from the combined topic models.
Comment: Accepted to the AIST-2017 conference (http://aistconf.ru/). The final publication will be available at link.springer.com.
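The core idea, boosting counts of thesaurus-related words that co-occur in the same text before fitting the topic model, can be sketched as follows. The thesaurus pairs and the boost factor here are illustrative stand-ins, not values from the paper:

```python
from collections import Counter

# Toy thesaurus of semantically related word pairs (hypothetical).
THESAURUS = {("car", "vehicle"), ("vehicle", "car")}
BOOST = 2.0  # illustrative enhancement factor

def boosted_counts(tokens):
    """Enhance the count of any word whose thesaurus partner
    occurs in the same text; other counts are left unchanged."""
    counts = Counter(tokens)
    present = set(counts)
    for w in list(counts):
        if any((w, v) in THESAURUS for v in present if v != w):
            counts[w] = int(counts[w] * BOOST)
    return counts
```

The boosted counts would then be fed to the topic model in place of the raw frequencies, increasing the chance that related words land in the same topic.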
The Mechanism of Additive Composition
Additive composition (Foltz et al., 1998; Landauer and Dumais, 1997; Mitchell and Lapata, 2010) is a widely used method for computing the meanings of phrases: it takes the average of the vector representations of the constituent words. In this article, we prove an upper bound on the bias of additive composition, which is the first theoretical analysis of compositional frameworks from a machine learning point of view. The bound is written in terms of collocation strength: we prove that the more exclusively two successive words tend to occur together, the more accurately their additive composition can be guaranteed to approximate the natural phrase vector. Our proof relies on properties of natural language data that are empirically verified, and that can be theoretically derived from the assumption that the data are generated by a Hierarchical Pitman-Yor Process. The theory endorses additive composition as a reasonable operation for calculating the meanings of phrases, and suggests ways to improve additive compositionality, including: transforming the entries of distributional word vectors by a function that meets a specific condition, constructing a novel type of vector representation to make additive composition sensitive to word order, and utilizing singular value decomposition to train word vectors.
Comment: More explanations of the theory and additional experiments added. Accepted by Machine Learning Journal.
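Additive composition as described is simply the componentwise average of the constituent word vectors. A minimal sketch with toy low-dimensional vectors:

```python
def additive_composition(vectors):
    """Compose a phrase vector as the average of its constituent
    word vectors (componentwise mean)."""
    dim = len(vectors[0])
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

# Toy 3-dimensional word vectors (illustrative values only).
machine = [1.0, 0.0, 2.0]
learning = [3.0, 2.0, 0.0]
phrase = additive_composition([machine, learning])
```

The paper's bound then quantifies how far such an averaged vector can deviate from the vector one would learn directly for the phrase.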
Distributional composition using higher-order dependency vectors
This paper concerns how to apply compositional methods to vectors based on grammatical dependency relations. We demonstrate the potential of a novel approach which uses higher-order grammatical dependency relations as features. We apply the approach to adjective-noun compounds, with promising results in the prediction of the vectors for (held-out) observed phrases.
Multi-Level Modeling of Quotation Families Morphogenesis
This paper investigates cultural dynamics in social media by examining the
proliferation and diversification of clearly-cut pieces of content: quoted
texts. In line with the pioneering work of Leskovec et al. and Simmons et al. on meme dynamics, we investigate in depth the transformations that quotations published online undergo during their diffusion. We deliberately set aside the structure of the social network, as well as the dynamical patterns pertaining to the diffusion process, to focus on the way quotations are changed, how often they are modified, and how these changes shape more or less diverse families and sub-families of quotations. Following a biological metaphor, we try to understand in what way mutations can transform quotations at different scales, and how mutation rates depend on various properties of the quotations.
Comment: Published in the Proceedings of the ASE/IEEE 4th Intl. Conf. on Social Computing "SocialCom 2012", Sep. 3-5, 2012, Amsterdam, Netherlands.
The role of handbooks in knowledge creation and diffusion: A case of science and technology studies
Genre is considered to be an important element in scholarly communication and
in the practice of scientific disciplines. However, scientometric studies have
typically focused on a single genre, the journal article. The goal of this
study is to understand the role that handbooks play in knowledge creation and
diffusion and their relationship with the genre of journal articles,
particularly in highly interdisciplinary and emergent social science and
humanities disciplines. To shed light on these questions we focused on
handbooks and journal articles published over the last four decades belonging
to the research area of Science and Technology Studies (STS), broadly defined.
To get a detailed picture we used the full-text of five handbooks (500,000
words) and a well-defined set of 11,700 STS articles. We confirmed the
methodological split of STS into qualitative and quantitative (scientometric)
approaches. Even when the two traditions explore similar topics (e.g., science
and gender) they approach them from different starting points. The change in
cognitive foci in both handbooks and articles partially reflects the changing
trends in STS research, often driven by technology. Using text similarity
measures we found that, in the case of STS, handbooks play no special role in
either focusing the research efforts or marking their decline. In general, they
do not represent the summaries of research directions that have emerged since
the previous edition of the handbook.
Comment: Accepted for publication in the Journal of Informetrics.
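The "text similarity measures" mentioned above are not specified in the abstract; one common choice for comparing full texts is bag-of-words cosine similarity, sketched here as an assumption rather than the study's actual measure:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity between two texts: the dot
    product of term-count vectors, normalised by their lengths."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Applied to handbook chapters and article abstracts, scores near 1 indicate strongly overlapping vocabulary and scores near 0 indicate disjoint topics.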