7 research outputs found

A Bayesian mixture model for term re-occurrence and burstiness

Author: De Roeck Anne
Garthwaite Paul
Sarkar Avik
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2005

This paper proposes a model for term re-occurrence in a text collection based on the gaps between successive occurrences of a term. These gaps are modeled using a mixture of exponential distributions. Parameter estimation is based on a Bayesian framework that allows us to fit a flexible model. The model provides measures of a term's re-occurrence rate and within-document burstiness. The model works for all kinds of terms, be it a rare content word, a medium-frequency term, or a frequent function word. A measure is proposed to account for the term's importance based on its distribution pattern in the corpus.

Open Research Online (The Open University)
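
As a rough illustration of the gap-based view in the abstract above (De Roeck, Garthwaite and Sarkar), the Python sketch below extracts the gaps between successive occurrences of a term and evaluates an exponential mixture on them. The two-component form, the function names, and the parameters `p`, `rate_burst`, and `rate_background` are assumptions made for illustration only; the paper's Bayesian parameter estimation is not reproduced here.

```python
import math


def occurrence_gaps(tokens, term):
    """Distances (in tokens) between successive occurrences of `term`."""
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


def mixture_density(gap, p, rate_burst, rate_background):
    """Density of a two-component exponential mixture (assumed form).

    p is the weight of the short-gap ("bursty") component.
    """
    return (p * rate_burst * math.exp(-rate_burst * gap)
            + (1 - p) * rate_background * math.exp(-rate_background * gap))


def log_likelihood(gaps, p, rate_burst, rate_background):
    """Log-likelihood of the observed gaps under fixed mixture parameters."""
    return sum(math.log(mixture_density(g, p, rate_burst, rate_background))
               for g in gaps)


tokens = "the cat sat on the mat and the dog saw the cat".split()
gaps = occurrence_gaps(tokens, "the")
# Hypothetical parameter values, chosen by hand rather than estimated.
print(gaps, log_likelihood(gaps, p=0.6, rate_burst=0.5, rate_background=0.05))
```

Comparing the log-likelihood across hand-picked parameter settings gives a feel for how a fast-decaying component captures short gaps inside a burst while a slow-decaying component covers long gaps between bursts.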

Terminology mining in social media

Author: Karlgren Jussi
Sahlgren Magnus
Publication date: 01/01/2009

The highly variable and dynamic word usage in social media presents serious challenges for both research and those commercial applications that are geared towards blogs or other user-generated non-editorial texts. This paper discusses and exemplifies a terminology mining approach for dealing with the productive character of the textual environment in social media. We explore the challenges of practically acquiring new terminology, and of modeling similarity and relatedness of terms from observing realistic amounts of data. We also discuss semantic evolution and density, and investigate novel measures for characterizing the preconditions for terminology mining.

RISE – Research Institutes of Sweden

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Swedish Institute of Computer Science Publications Database

Software institutes' Online Digital Archive

Hierarchical topic structuring: from dense segmentation to topically focused fragments via burst analysis

Author: Gravier Guillaume
Simon Anca
Sébillot Pascale
Publication venue: HAL CCSD
Publication date: 01/01/2015

Topic segmentation traditionally relies on lexical cohesion measured through word re-occurrences to output a dense segmentation, either linear or hierarchical. In this paper, a novel organization of the topical structure of textual content is proposed. Rather than searching for topic shifts to yield dense segmentation, we propose an algorithm to extract topically focused fragments organized in a hierarchical manner. This is achieved by leveraging the temporal distribution of word re-occurrences, searching for bursts, to skirt the limits imposed by a global counting of lexical re-occurrences within segments. Comparison to a reference dense segmentation on varied datasets indicates that we can achieve a better topic focus while retrieving all of the important aspects of a text.

HAL-CentraleSupelec

INRIA a CCSD electronic archive server

Hal-Diderot

HAL-Rennes 1

Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words

Author: Eduardo G. Altmann
Janet B. Pierrehumbert
Adilson E. Motter
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 11/11/2009

Background: Zipf's discovery that word frequency distributions obey a power law established parallels between biological and physical processes, and language, laying the groundwork for a complex systems perspective on human communication. More recent research has also identified scaling regularities in the dynamics underlying the successive occurrences of events, suggesting the possibility of similar findings for language as well. Methodology/Principal Findings: By considering frequent words in USENET discussion groups and in disparate databases where the language has different levels of formality, here we show that the distributions of distances between successive occurrences of the same word display bursty deviations from a Poisson process and are well characterized by a stretched exponential (Weibull) scaling. The extent of this deviation depends strongly on semantic type (a measure of the logicality of each word) and less strongly on frequency. We develop a generative model of this behavior that fully determines the dynamics of word usage. Conclusions/Significance: Recurrence patterns of words are well described by a stretched exponential distribution of recurrence times, an empirical scaling that cannot be anticipated from Zipf's law. Because the use of words provides a uniquely precise and powerful lens on human thought and activity, our findings also have implications for other overt manifestations of collective human dynamics.

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central
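
To make the contrast between Poisson-like and bursty recurrence in the abstract above more concrete, the Python sketch below evaluates a stretched exponential (Weibull) survival function and its coefficient of variation for a few shape values. The parameter names `scale` and `beta` and the chosen values are assumptions for illustration; this is not the paper's generative model or its fitting procedure.

```python
import math


def weibull_survival(t, scale, beta):
    """P(recurrence time > t) under a stretched exponential (Weibull) law."""
    return math.exp(-((t / scale) ** beta))


def coefficient_of_variation(beta):
    """Standard deviation divided by mean; does not depend on the scale."""
    first_moment = math.gamma(1 + 1 / beta)
    second_moment = math.gamma(1 + 2 / beta)
    return math.sqrt(second_moment - first_moment ** 2) / first_moment


# beta = 1 is the plain exponential case (Poisson process);
# beta < 1 stretches the tail and makes long lulls more likely.
for beta in (1.0, 0.7, 0.4):
    cv = coefficient_of_variation(beta)
    tail = weibull_survival(5.0, 1.0, beta)
    print(f"beta={beta}: CV={cv:.3f}, P(tau > 5*scale)={tail:.4f}")
```

A coefficient of variation above 1 is one common signal of burstiness: short gaps cluster together and are separated by unusually long lulls, which is the kind of deviation from a Poisson process the abstract describes.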

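Evaluation of a novel hierarchical thematic structuring of texts in the framework of text summarization and anchor detection for video hyperlinking (original title: Évaluation d'une nouvelle structuration thématique hiérarchique des textes dans un cadre de résumé automatique et de détection d'ancres au sein de vidéos)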

Author: Gravier Guillaume
Simon Anca
Sébillot Pascale
Publication venue: HAL CCSD
Publication date: 01/01/2016

This paper investigates the potential of a novel topical structure of text-like data in the context of summarization and anchor detection in video hyperlinking. This structure is produced by an algorithm that exploits temporal distributions of words through word burst analysis to generate a hierarchy of topically focused fragments. The obtained hierarchy aims at filtering out non-critical content, retaining only the salient information at various levels of detail. For the tasks we choose to evaluate the structure on, the loss of important information is highly damaging. We show that the structure can actually improve the results of summarization or at least maintain state-of-the-art results, while for anchor detection it leads us to the best precision in the context of the Search and Anchoring in Video Archives task at MediaEval. The experiments were carried out on written text and on a more challenging corpus containing automatic transcripts of TV shows.

Keywords: burst analysis, hierarchy of topical fragments, text summarization, anchor detection.

Figure 1: Generic representations of (a) a linear topic segmentation and (b) a classic dense hierarchical topic segmentation, versus (c) a hierarchy of topically focused fragments. The dashed vertical lines mark the boundaries of topics and subtopics.

INRIA a CCSD electronic archive server

Hal-Diderot

HAL-Rennes 1

Tune your brown clustering, please

Author: Bøgh K.S.
Chester S.
Derczynski L.
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2015

Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyperparameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, chosen number of classes, and quality of the resulting clusters, which has an impact on any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal.

White Rose Research Online
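
The abstract above describes Brown clustering as based on n-gram mutual information. As a hedged sketch of that quantity only, the Python code below computes the mutual information between adjacent class labels for a given word-to-class assignment, using bigrams. The function name and the toy corpus are assumptions for illustration; the greedy merging procedure of Brown clustering and the paper's utility model are not shown.

```python
import math
from collections import Counter


def class_bigram_mutual_information(tokens, word_to_class):
    """Mutual information (in nats) between adjacent class labels."""
    classes = [word_to_class[tok] for tok in tokens]
    bigrams = Counter(zip(classes, classes[1:]))
    total = sum(bigrams.values())

    # Marginal counts of each class in the left and right bigram slots.
    left, right = Counter(), Counter()
    for (c1, c2), n in bigrams.items():
        left[c1] += n
        right[c2] += n

    # Sum over class pairs of p(c1, c2) * log(p(c1, c2) / (p(c1) * p(c2))).
    return sum((n / total) * math.log(n * total / (left[c1] * right[c2]))
               for (c1, c2), n in bigrams.items())


tokens = "a b a b a b a".split()
separate = class_bigram_mutual_information(tokens, {"a": 0, "b": 1})
merged = class_bigram_mutual_information(tokens, {"a": 0, "b": 0})
print(separate, merged)
```

Merging the two words into one class removes all information that one class label carries about the next, which is the kind of loss a clustering with too few classes would incur on this toy sequence.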