
    Accuracy-based scoring for phrase-based statistical machine translation

    Although the scoring features of state-of-the-art Phrase-Based Statistical Machine Translation (PB-SMT) models are weighted so as to optimise an objective function measuring translation quality, the estimation of the features themselves bears no relation to such quality metrics. In this paper, we introduce a translation quality-based feature into PB-SMT in a bid to improve the translation quality of the system. Our feature is estimated by averaging the edit distance between the phrase pairs involved in the translation of oracle sentences, chosen by automatic evaluation metrics from the N-best outputs of a baseline system, and the phrase pairs occurring in the N-best list. Using our method, we report a statistically significant 2.11% relative improvement in BLEU score for the WMT 2009 Spanish-to-English translation task. We also report statistically significant improvements over the baseline on many other MT evaluation metrics, as well as a substantial increase in speed and a reduction in memory use (due to an 87% reduction in phrase-table size), while maintaining significant gains in translation quality.
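
    A minimal sketch of the feature this abstract describes, under simplifying assumptions: score a phrase pair from the N-best list by its average word-level edit distance to the phrase pairs extracted from metric-selected oracle translations. Function and variable names are illustrative, not from the paper's implementation.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def accuracy_feature(phrase_pair, oracle_phrase_pairs):
    """Average edit distance from one N-best phrase pair to the oracle pairs.

    phrase_pair: (source_tokens, target_tokens).
    oracle_phrase_pairs: pairs extracted from the metric-chosen oracle
    translations; a lower average distance means closer to oracle quality.
    """
    src, tgt = phrase_pair
    distances = [edit_distance(tgt, o_tgt)
                 for o_src, o_tgt in oracle_phrase_pairs
                 if o_src == src]  # compare pairs sharing the source phrase
    return sum(distances) / len(distances) if distances else None
```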

    Summarization of Films and Documentaries Based on Subtitles and Scripts

    We assess the performance of generic text summarization algorithms applied to films and documentaries, using the well-known behavior of summarization of news articles as reference. We use three datasets: (i) news articles, (ii) film scripts and subtitles, and (iii) documentary subtitles. Standard ROUGE metrics are used for comparing generated summaries against news abstracts, plot summaries, and synopses. We show that the best-performing algorithms are LSA, for news articles and documentaries, and LexRank and Support Sets, for films. Despite the different nature of films and documentaries, their relative behavior is in accordance with that obtained for news articles.
    Comment: 7 pages, 9 tables, 4 figures, submitted to Pattern Recognition Letters (Elsevier).
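
    A minimal sketch of the kind of evaluation named above: ROUGE-N computed as n-gram recall of a generated summary against a reference (news abstract, plot summary, or synopsis). This is a simplified re-implementation for illustration, not the official ROUGE toolkit.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(summary, reference, n=1):
    """Fraction of reference n-grams covered by the generated summary."""
    ref = ngrams(reference, n)
    cand = ngrams(summary, n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

# Example: an extracted summary scored against a synopsis-style reference.
print(rouge_n_recall("the hero saves the city".split(),
                     "the hero finally saves the city".split()))  # ~0.83
```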

    Combining Thesaurus Knowledge and Probabilistic Topic Models

    In this paper, we present an approach for introducing thesaurus knowledge into probabilistic topic models. The main idea is based on the assumption that the frequencies of semantically related words and phrases that co-occur in the same texts should be enhanced, which increases their contribution to the topics found in these texts. We conducted experiments with several thesauri and found that domain-specific knowledge is useful for improving topic models. If a general thesaurus, such as WordNet, is used, the thesaurus-based improvement of topic models can be achieved by excluding hyponymy relations in combined topic models.
    Comment: Accepted to the AIST-2017 conference (http://aistconf.ru/). The final publication will be available at link.springer.com.
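
    A minimal sketch of the core idea, under assumptions: when semantically related words (per a thesaurus) co-occur in a document, boost their counts so they contribute more to the topics inferred for that document. The boost factor and the thesaurus format are illustrative, not from the paper.

```python
from collections import Counter

def boost_related_terms(doc_tokens, related, boost=2.0):
    """Return a weighted bag of words for one document.

    related: dict mapping a word to the set of its thesaurus relatives
    (e.g. synonyms; hyponymy links could be excluded, as suggested above).
    """
    counts = Counter(doc_tokens)
    weights = dict(counts)
    vocab = set(counts)
    for word in counts:
        # If a thesaurus relative of this word also occurs in the document,
        # enhance the word's frequency before topic inference.
        if related.get(word, set()) & vocab:
            weights[word] = counts[word] * boost
    return weights

doc = "car automobile engine road".split()
related = {"car": {"automobile"}, "automobile": {"car"}}
print(boost_related_terms(doc, related))  # car and automobile get boosted
```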

    The Mechanism of Additive Composition

    Additive composition (Foltz et al., 1998; Landauer and Dumais, 1997; Mitchell and Lapata, 2010) is a widely used method for computing meanings of phrases, which takes the average of the vector representations of the constituent words. In this article, we prove an upper bound for the bias of additive composition, which is the first theoretical analysis of compositional frameworks from a machine learning point of view. The bound is written in terms of collocation strength: we prove that the more exclusively two successive words tend to occur together, the more accurately their additive composition approximates the natural phrase vector. Our proof relies on properties of natural language data that are empirically verified, and that can be theoretically derived from the assumption that the data is generated by a Hierarchical Pitman-Yor Process. The theory endorses additive composition as a reasonable operation for calculating meanings of phrases, and suggests ways to improve additive compositionality, including: transforming the entries of distributional word vectors by a function that meets a specific condition, constructing a novel type of vector representation to make additive composition sensitive to word order, and utilizing singular value decomposition to train word vectors.
    Comment: More explanations of the theory and additional experiments added. Accepted by the Machine Learning Journal.
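
    A minimal sketch of additive composition as the abstract defines it: the vector of a phrase is the average of its constituent word vectors. The toy embedding table is an assumption for illustration.

```python
import numpy as np

embeddings = {
    "red":   np.array([0.9, 0.1, 0.0]),
    "apple": np.array([0.2, 0.8, 0.3]),
}

def compose_additive(words, embeddings):
    """Average the constituent word vectors to approximate the phrase vector."""
    return np.mean([embeddings[w] for w in words], axis=0)

phrase_vec = compose_additive(["red", "apple"], embeddings)
print(phrase_vec)  # approximation to the distributional vector of "red apple"
```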

    Distributional composition using higher-order dependency vectors

    This paper concerns how to apply compositional methods to vectors based on grammatical dependency relations. We demonstrate the potential of a novel approach that uses higher-order grammatical dependency relations as features. We apply the approach to adjective-noun compounds, with promising results in predicting the vectors of (held-out) observed phrases.
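
    One plausible reading of "higher-order dependency features", sketched under assumptions (the triple format and path notation are illustrative, not the paper's): instead of counting only a word's direct head, also count length-two dependency paths, so an adjective can inherit context from further up the tree.

```python
from collections import Counter

# (head, relation, dependent) triples for a toy parsed sentence:
# "the big dog eats a bone"
triples = [("eats", "nsubj", "dog"), ("eats", "dobj", "bone"),
           ("dog", "amod", "big")]

def higher_order_features(target, triples):
    """First- and second-order dependency features for one target word."""
    feats = Counter()
    for h, r, d in triples:
        if d == target:
            feats[(r, h)] += 1  # first-order: the word's own head
            # second-order: follow one more arc up from the head
            for h2, r2, d2 in triples:
                if d2 == h:
                    feats[(r + ">" + r2, h2)] += 1
    return feats

print(higher_order_features("big", triples))
# Counter({('amod', 'dog'): 1, ('amod>nsubj', 'eats'): 1})
```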

    Multi-Level Modeling of Quotation Families Morphogenesis

    This paper investigates cultural dynamics in social media by examining the proliferation and diversification of clear-cut pieces of content: quoted texts. In line with the pioneering work of Leskovec et al. and Simmons et al. on meme dynamics, we investigate in depth the transformations that quotations published online undergo during their diffusion. We deliberately put aside the structure of the social network, as well as the dynamical patterns pertaining to the diffusion process, to focus on the way quotations are changed, how often they are modified, and how these changes shape more or less diverse families and sub-families of quotations. Following a biological metaphor, we try to understand in which way mutations can transform quotations at different scales, and how mutation rates depend on various properties of the quotations.
    Comment: Published in the Proceedings of the ASE/IEEE 4th Intl. Conf. on Social Computing "SocialCom 2012", Sep. 3-5, 2012, Amsterdam, NL.
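
    A minimal sketch of the biological metaphor, under assumptions: treat each altered copy of a quotation as a possible mutation of a parent, and group variants into a family when their normalized edit distance stays below a threshold. The threshold and the greedy grouping are illustrative, not the paper's method.

```python
import difflib

def mutation_score(a, b):
    """1 - similarity ratio: 0 for identical quotes, near 1 for unrelated."""
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

def group_into_families(quotes, threshold=0.3):
    """Greedy single-pass clustering of quotation variants."""
    families = []
    for q in quotes:
        for fam in families:
            if mutation_score(q, fam[0]) <= threshold:
                fam.append(q)  # a (possibly mutated) copy of this family
                break
        else:
            families.append([q])  # founds a new quotation family
    return families

quotes = ["yes we can", "yes we can!", "yes we surely can", "carpe diem"]
print(group_into_families(quotes))
# [['yes we can', 'yes we can!', 'yes we surely can'], ['carpe diem']]
```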

    The role of handbooks in knowledge creation and diffusion: A case of science and technology studies

    Genre is considered to be an important element in scholarly communication and in the practice of scientific disciplines. However, scientometric studies have typically focused on a single genre, the journal article. The goal of this study is to understand the role that handbooks play in knowledge creation and diffusion, and their relationship with the genre of journal articles, particularly in highly interdisciplinary and emergent social science and humanities disciplines. To shed light on these questions, we focused on handbooks and journal articles published over the last four decades belonging to the research area of Science and Technology Studies (STS), broadly defined. To get a detailed picture, we used the full text of five handbooks (500,000 words) and a well-defined set of 11,700 STS articles. We confirmed the methodological split of STS into qualitative and quantitative (scientometric) approaches. Even when the two traditions explore similar topics (e.g., science and gender), they approach them from different starting points. The change in cognitive foci in both handbooks and articles partially reflects the changing trends in STS research, often driven by technology. Using text similarity measures, we found that, in the case of STS, handbooks play no special role in either focusing the research efforts or marking their decline. In general, they do not represent summaries of the research directions that have emerged since the previous edition of the handbook.
    Comment: Accepted for publication in the Journal of Informetrics.
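
    A minimal sketch of the kind of text similarity comparison mentioned above: cosine similarity between TF-IDF representations of handbook text and journal articles. The toy texts are assumptions for illustration; the paper's exact measure and preprocessing may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

handbook_chapter = "citation analysis of science and technology studies"
article_abstract = "we perform a citation analysis of STS journal articles"

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform([handbook_chapter, article_abstract])

# A similarity close to 1 would suggest the handbook summarizes the article's
# research direction; the study above found this is generally not the case.
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
```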