
    Friendships, Rivalries, and Trysts: Characterizing Relations between Ideas in Texts

    Understanding how ideas relate to each other is a fundamental question in many domains, ranging from intellectual history to public communication. Because ideas are naturally embedded in texts, we propose the first framework to systematically characterize the relations between ideas based on their occurrence in a corpus of documents, independent of how these ideas are represented. Combining two statistics, cooccurrence within documents and prevalence correlation over time, our approach reveals a number of different ways in which ideas can cooperate and compete. For instance, two ideas can closely track each other's prevalence over time and yet rarely cooccur, almost like a "cold war" scenario. We observe that pairwise cooccurrence and prevalence correlation exhibit different distributions. We further demonstrate that our approach is able to uncover intriguing relations between ideas through in-depth case studies on news articles and research papers. Comment: 11 pages, 9 figures, to appear in Proceedings of ACL 2017; code and data available at https://chenhaot.com/pages/idea-relations.html (fixed a typo).
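
    A minimal sketch of the two statistics the framework combines, cooccurrence within documents and prevalence correlation over time. Idea detection is assumed to happen elsewhere, and the function name and the specific choice of PMI and Pearson correlation are illustrative assumptions, not the authors' released implementation (available at the URL above).

```python
# Illustrative sketch: each document is a (year, set_of_idea_ids) pair; idea
# extraction and labelling are assumed to have been done beforehand.
from collections import Counter
from itertools import combinations
from math import log
from statistics import correlation  # Python 3.10+

def idea_relations(docs):
    """docs: list of (year, set_of_ideas). Returns {(a, b): (pmi, corr)}."""
    n_docs = len(docs)
    years = sorted({year for year, _ in docs})
    idea_count = Counter()
    pair_count = Counter()
    per_year = {year: Counter() for year in years}

    for year, ideas in docs:
        idea_count.update(ideas)
        per_year[year].update(ideas)
        pair_count.update(combinations(sorted(ideas), 2))

    relations = {}
    for (a, b), n_ab in pair_count.items():
        # Cooccurrence statistic: pointwise mutual information within documents.
        pmi = log(n_ab * n_docs / (idea_count[a] * idea_count[b]))
        # Temporal statistic: Pearson correlation of yearly prevalence counts.
        series_a = [per_year[year][a] for year in years]
        series_b = [per_year[year][b] for year in years]
        if len(set(series_a)) > 1 and len(set(series_b)) > 1:
            corr = correlation(series_a, series_b)
        else:
            corr = 0.0  # correlation undefined for constant series; placeholder
        relations[(a, b)] = (pmi, corr)
    return relations
```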

    Inference and Evaluation of the Multinomial Mixture Model for Text Clustering

    In this article, we investigate the use of a probabilistic model for unsupervised clustering in text collections. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification or information extraction. The model considered in this contribution consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. We present and contrast various estimation procedures, which apply in both supervised and unsupervised contexts. In supervised learning, this work suggests a criterion for evaluating the posterior odds of new documents which is more statistically sound than the "naive Bayes" approach. In an unsupervised context, we propose measures to set up a systematic evaluation framework and start by examining the Expectation-Maximization (EM) algorithm as the basic tool for inference. We discuss the importance of initialization and the influence of other features, such as the smoothing strategy or the size of the vocabulary, thereby illustrating the difficulties incurred by the high dimensionality of the parameter space. We also propose a heuristic algorithm based on iterative EM with vocabulary reduction to solve this problem. Using the fact that the latent variables can be analytically integrated out, we finally show that the Gibbs sampling algorithm is tractable and compares favorably to the basic EM approach.
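
    A minimal sketch of EM for a mixture of multinomials over word counts, as described above; the additive smoothing constant, iteration count, and variable names are assumptions chosen for illustration.

```python
# Illustrative sketch: X is a dense document-term count matrix and K is the
# number of themes (mixture components).
import numpy as np

def em_multinomial_mixture(X, K, n_iter=50, alpha=0.01, seed=0):
    """Returns mixture weights pi (K,), word distributions theta (K, V),
    and per-document responsibilities (n, K)."""
    rng = np.random.default_rng(seed)
    n, V = X.shape
    pi = np.full(K, 1.0 / K)                   # mixture weights
    theta = rng.dirichlet(np.ones(V), size=K)  # per-theme word distributions

    for _ in range(n_iter):
        # E-step: responsibilities from per-component log-likelihoods.
        log_post = X @ np.log(theta).T + np.log(pi)        # (n, K)
        log_post -= log_post.max(axis=1, keepdims=True)    # numerical stability
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: reestimate weights and additively smoothed word distributions.
        pi = resp.mean(axis=0)
        counts = resp.T @ X + alpha                        # (K, V)
        theta = counts / counts.sum(axis=1, keepdims=True)
    return pi, theta, resp
```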

    Constructing Datasets for Multi-hop Reading Comprehension Across Documents

    Most Reading Comprehension methods limit themselves to queries which can be answered using a single sentence, paragraph, or document. Enabling models to combine disjoint pieces of textual evidence would extend the scope of machine comprehension methods, but currently there exist no resources to train and test this capability. We propose a novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods. In our task, a model learns to seek and combine evidence, effectively performing multi-hop (alias multi-step) inference. We devise a methodology to produce datasets for this task, given a collection of query-answer pairs and thematically linked documents. Two datasets from different domains are induced, and we identify potential pitfalls and devise circumvention strategies. We evaluate two previously proposed competitive models and find that one can integrate information across documents. However, both models struggle to select relevant information, as providing documents guaranteed to be relevant greatly improves their performance. While the models outperform several strong baselines, their best accuracy reaches 42.9% compared to human performance at 74.0%, leaving ample room for improvement. Comment: This paper directly corresponds to the TACL version (https://transacl.org/ojs/index.php/tacl/article/view/1325) apart from minor changes in wording, additional footnotes, and appendices.
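
    A rough sketch of one way such multi-hop instances could be assembled from query-answer pairs and thematically linked documents: breadth-first search over documents connected by shared entities, keeping the traversed documents as the support set. This is an assumed simplification for illustration, not the paper's exact construction procedure.

```python
# Illustrative sketch: doc_entities maps each document id to the set of
# entities it mentions; start_doc is a document mentioning the query subject.
from collections import deque

def find_support_chain(doc_entities, start_doc, answer):
    """Breadth-first search over documents that share at least one entity,
    returning the hop chain from start_doc to a document containing answer."""
    parents = {start_doc: None}
    queue = deque([start_doc])
    while queue:
        doc = queue.popleft()
        if answer in doc_entities[doc]:
            chain = []
            while doc is not None:       # reconstruct the chain of hops
                chain.append(doc)
                doc = parents[doc]
            return list(reversed(chain))
        for other, entities in doc_entities.items():
            if other not in parents and doc_entities[doc] & entities:
                parents[other] = doc
                queue.append(other)
    return None  # no multi-hop path from the query subject to the answer
```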

    Learning from Noisy Data in Statistical Machine Translation

    In this thesis, methods were developed that reduce the negative effects of noisy data in SMT systems and thereby improve system performance. The problem is addressed at two different stages of the learning process: during preprocessing and during modelling. In preprocessing, two methods are developed that improve the statistical models by raising the quality of the training data. In modelling, several ways of weighting data according to its usefulness are presented. First, the effect of removing false positives from the parallel corpus is shown. A parallel corpus consists of a text in two languages in which each sentence of one language is paired with the corresponding sentence of the other language; it is assumed that both language versions contain the same number of sentences. False positives in this sense are sentence pairs that are paired in the parallel corpus but are not translations of each other. To detect them, a small, error-free parallel corpus (clean corpus) is assumed to be available. Using various lexical features, false positives are reliably filtered out before the modelling phase. An important lexical feature here is the bilingual lexicon derived from the clean corpus; in its extraction, several heuristics are implemented that lead to improved performance. We then consider the problem of extracting the most useful parts of the training data. To this end, the data is ranked by its relevance to the target domain, under the assumption that a good, representative tuning set exists. Since such tuning data is typically of limited size, word similarities are used to extend its coverage. The word similarities used in this step are decisive for the quality of the method, and the thesis therefore presents several automatic methods for deriving such similarities from monolingual and bilingual corpora. Interestingly, this is possible even with limited data, by also drawing on monolingual data, which is available in large quantities, to estimate word similarity. For bilingual data, which is often available only in limited amounts, additional language pairs that share at least one language with the given pair can be exploited. In the modelling step, the problem of noisy data is addressed by weighting the training data according to the quality of the corpus. Statistical significance measures are used to identify less reliable sequences and reduce their weight. As in the previous approaches, word similarities are used to cope with limited data. A further problem arises, however, as soon as absolute counts are replaced with weighted counts; the thesis therefore develops techniques for smoothing the probabilities in this situation. Finally, the size of the training data becomes problematic when working with corpora of considerable volume, which raises two main difficulties: long training times and limited main memory. For the training-time problem, an algorithm is developed that distributes the computationally expensive calculations across several processors with shared memory. For the memory problem, specialised data structures and external-memory algorithms are used, allowing efficient training of extremely large models on hardware with limited memory.
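
    A minimal sketch, assuming a bilingual lexicon extracted from the clean corpus, of the false-positive filtering idea: each sentence pair is scored by how many of its source words have a lexicon translation appearing in the target sentence, and pairs below a threshold are discarded. The scoring function and threshold are illustrative assumptions, not the thesis' exact lexical features.

```python
# Illustrative sketch: lexicon maps a source word to the set of target words
# the clean corpus suggests as translations.
def lexical_coverage(src_tokens, tgt_tokens, lexicon):
    """Fraction of source words with at least one lexicon translation
    occurring in the target sentence."""
    tgt = set(tgt_tokens)
    covered = sum(1 for w in src_tokens if lexicon.get(w, set()) & tgt)
    return covered / max(len(src_tokens), 1)

def filter_false_positives(pairs, lexicon, threshold=0.3):
    """pairs: list of (src_tokens, tgt_tokens); keeps likely true translations."""
    return [(src, tgt) for src, tgt in pairs
            if lexical_coverage(src, tgt, lexicon) >= threshold]
```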

    Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture

    We present the architecture behind Twitter's real-time related query suggestion and spelling correction service. Although these tasks have received much attention in the web search literature, the Twitter context introduces a real-time "twist": after significant breaking news events, we aim to provide relevant results within minutes. This paper provides a case study illustrating the challenges of real-time data processing in the era of "big data". We tell the story of how our system was built twice: our first implementation was built on a typical Hadoop-based analytics stack, but was later replaced because it did not meet the latency requirements necessary to generate meaningful real-time results. The second implementation, which is the system deployed in production, is a custom in-memory processing engine specifically designed for the task. This experience taught us that the current typical usage of Hadoop as a "big data" platform, while great for experimentation, is not well suited to low-latency processing, and points the way to future work on data analytics platforms that can handle "big" as well as "fast" data.
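
    As a rough illustration of the kind of in-memory processing described, the sketch below maintains recency-weighted query cooccurrence counts within user sessions and ranks suggestion candidates by decayed score. The class, half-life parameter, and session input format are assumptions for illustration, not Twitter's production engine.

```python
# Illustrative sketch of a recency-weighted cooccurrence store for related
# query suggestion.
import time
from collections import defaultdict

class RelatedQueryStore:
    def __init__(self, half_life_s=600.0):
        self.half_life_s = half_life_s
        self.score = defaultdict(lambda: defaultdict(float))  # query -> candidate -> score
        self.stamp = defaultdict(lambda: defaultdict(float))  # last-update timestamps

    def _decay(self, query, cand, now):
        # Exponentially decay the stored score toward the current time.
        dt = now - self.stamp[query][cand]
        self.score[query][cand] *= 0.5 ** (dt / self.half_life_s)
        self.stamp[query][cand] = now

    def observe_session(self, queries, now=None):
        """Record pairwise cooccurrence of queries issued in one session."""
        now = time.time() if now is None else now
        for a in queries:
            for b in queries:
                if a != b:
                    self._decay(a, b, now)
                    self.score[a][b] += 1.0

    def suggest(self, query, k=5, now=None):
        """Top-k candidates by decayed cooccurrence score."""
        now = time.time() if now is None else now
        for cand in list(self.score[query]):
            self._decay(query, cand, now)
        ranked = sorted(self.score[query].items(), key=lambda item: -item[1])
        return [cand for cand, _ in ranked[:k]]
```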

    Using distributional similarity to organise biomedical terminology

    We investigate an application of distributional similarity techniques to the problem of structural organisation of biomedical terminology. Our application domain is the relatively small GENIA corpus. Using terms that have been accurately marked-up by hand within the corpus, we consider the problem of automatically determining semantic proximity. Terminological units are defined for our purposes as normalised classes of individual terms. Syntactic analysis of the corpus data is carried out using the Pro3Gres parser and provides the data required to calculate distributional similarity using a variety of different measures. Evaluation is performed against a hand-crafted gold standard for this domain in the form of the GENIA ontology. We show that distributional similarity can be used to predict semantic type with a good degree of accuracy.
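
    A minimal sketch of the distributional-similarity step, assuming each term is represented by counts over syntactic contexts such as (dependency relation, head word) pairs of the kind a parser like Pro3Gres would supply, with proximity measured by cosine similarity. The feature representation and function names are assumptions, not the paper's exact measures.

```python
# Illustrative sketch: a term's contexts are (dependency relation, head word)
# pairs collected from parsed sentences.
from collections import Counter
from math import sqrt

def context_vector(contexts):
    """contexts: iterable of (relation, word) pairs observed for one term."""
    return Counter(contexts)

def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    shared = u.keys() & v.keys()
    dot = sum(u[f] * v[f] for f in shared)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```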

    Notes on cartography and further explanation

    This article addresses one particular aspect of the cartographic enterprise, the cartographic study of the left periphery of the clause, the system of criteria, and the "syntacticisation" of scope-discourse semantics that rich and detailed syntactic maps make possible. I will compare this theoretical option with the conceivable alternative, the "pragmaticization" of a radically impoverished syntax, and will discuss some simple kinds of empirical evidence bearing on the choice between these alternative perspectives. I will then turn to the issue of whether the properties of the functional sequence (ordering, cooccurrence restrictions) are amenable to "further explanations" in terms of more basic principles constraining linguistic computations. I will argue that the search for deeper explanations is an integral part of the cartographic endeavour: it presupposes the establishment of reliable maps, and nourishes the pursuit of further cartographic questions. I will conclude by illustrating the issue of further explanation by comparing certain properties of topicalization in English and Italian, in particular the fact that DP topics are fundamentally unique in English, while they can be freely reiterated in Italian. This pattern can be plausibly traced back to intervention locality, once certain independent properties distinguishing Italian and English topicalization are taken into account.