
    Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling

    Fixed-vocabulary language models fail to account for one of the most characteristic statistical facts of natural language: the frequent creation and reuse of new word types. Although character-level language models offer a partial solution in that they can create word types not attested in the training corpus, they do not capture the "bursty" distribution of such words. In this paper, we augment a hierarchical LSTM language model that generates sequences of word tokens character by character with a caching mechanism that learns to reuse previously generated words. To validate our model we construct a new open-vocabulary language modeling corpus (the Multilingual Wikipedia Corpus, MWC) from comparable Wikipedia articles in 7 typologically diverse languages and demonstrate the effectiveness of our model across this range of languages. Comment: ACL 201
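    The caching idea in this abstract can be illustrated with a minimal sketch: interpolating a base model's probability for a word with a unigram cache of recently generated words, so that a word just created gets a boosted probability of being reused. The function name, the interpolation weight `lam`, and the toy distributions are hypothetical; the paper's actual model is a hierarchical character-level LSTM, not shown here.

```python
from collections import Counter

def cache_mixture_prob(word, base_probs, cache, lam=0.2):
    """Interpolate a base model's word probability with a unigram cache.

    base_probs: dict mapping word -> probability under the base model.
    cache: Counter of recently generated words.
    lam: interpolation weight for the cache component (hypothetical value).
    """
    total = sum(cache.values())
    p_cache = cache[word] / total if total else 0.0
    p_base = base_probs.get(word, 0.0)
    return lam * p_cache + (1 - lam) * p_base

# Toy example: "Wikipedia" was recently generated twice, "corpus" once.
cache = Counter(["Wikipedia", "Wikipedia", "corpus"])
base = {"Wikipedia": 0.01, "corpus": 0.02, "the": 0.05}

# A recently seen word gets a boost over its base probability;
# an uncached word is only mildly discounted by the mixture.
boosted = cache_mixture_prob("Wikipedia", base, cache)
plain = cache_mixture_prob("the", base, cache)
```

    The mixture captures burstiness in miniature: once a rare word enters the cache, its reuse probability jumps far above what the base model alone would assign.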

    Document Clustering with Bursty Information

    Nowadays, almost all text corpora, such as blogs, emails and RSS feeds, are collections of text streams. The traditional vector space model (VSM), or bag-of-words representation, cannot capture the temporal aspect of these text streams. So far, only a few bursty features have been proposed to create text representations with temporal modeling for text streams. We propose bursty feature representations that perform better than VSM on various text mining tasks, such as document retrieval, topic modeling and text categorization. For text clustering, we propose a novel framework to generate a bursty distance measure. We evaluated it on the UPGMA, Star and K-Medoids clustering algorithms. The bursty distance measure not only performed equally well across various text collections, but also clustered news articles related to specific events much better than other models.
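    A minimal sketch of what a bursty feature weight could look like, assuming documents have been bucketed into time windows: score a term by how far its count in one window exceeds its mean count across windows. The function and the example counts are hypothetical illustrations, not the paper's actual representation.

```python
def burstiness(term_counts_by_window, term, window):
    """Score how much a term's count in one window exceeds its mean.

    term_counts_by_window: list of dicts, one per time window, term -> count.
    Returns the ratio of the given window's count to the term's mean
    count over all windows (0.0 if the term never occurs).
    """
    counts = [w.get(term, 0) for w in term_counts_by_window]
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 0.0
    return counts[window] / mean

# "earthquake" spikes in window 1 (count 9 vs mean (1 + 9 + 0) / 3).
windows = [{"earthquake": 1}, {"earthquake": 9, "news": 3}, {"news": 2}]
score = burstiness(windows, "earthquake", 1)
```

    A plain bag-of-words weight would treat all three windows identically; the ratio above is the simplest way to let the representation see the spike.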

    Context Modeling for Ranking and Tagging Bursty Features in Text Streams

    Bursty features in text streams are very useful in many text mining applications. Most existing studies detect bursty features based purely on term frequency changes without taking into account the semantic contexts of terms, and as a result the detected bursty features may not always be interesting or easy to interpret. In this paper we propose to model the contexts of bursty features using a language modeling approach. We then propose a novel topic diversity-based metric using the context models to find newsworthy bursty features. We also propose to use the context models to automatically assign meaningful tags to bursty features. Using a large corpus of a stream of news articles, we quantitatively show that the proposed context language models for bursty features can effectively help rank bursty features by newsworthiness and assign meaningful tags to annotate them. © 2010 ACM.
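    One simple stand-in for the context-modeling idea: summarize a bursty feature's context as the distribution of words co-occurring with it, and measure that distribution's entropy. Under one reading of the paper's topic-diversity intuition, a feature whose contexts are focused (low entropy) is easier to interpret and tag than one whose contexts are diffuse. The function and examples are hypothetical sketches, not the paper's metric.

```python
import math
from collections import Counter

def context_entropy(context_words):
    """Entropy of a bursty feature's context-word distribution.

    context_words: list of words co-occurring with the feature.
    Low entropy = focused context; high entropy = diffuse context.
    """
    counts = Counter(context_words)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A feature seen mostly near "election" has a focused context;
# one scattered over unrelated words has a diffuse one.
focused = context_entropy(["election"] * 8 + ["vote"] * 2)
diffuse = context_entropy(["a", "b", "c", "d", "e"])
```

    Ranking features by a statistic of their context distribution, rather than by raw frequency change alone, is the core move the abstract describes.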

    Semantic Visual Localization

    Robust visual localization under a wide range of viewing conditions is a fundamental problem in computer vision. Handling the difficult cases of this problem is not only very challenging but also of high practical relevance, e.g., in the context of life-long localization for augmented reality or autonomous robots. In this paper, we propose a novel approach based on a joint 3D geometric and semantic understanding of the world, enabling it to succeed under conditions where previous approaches failed. Our method leverages a novel generative model for descriptor learning, trained on semantic scene completion as an auxiliary task. The resulting 3D descriptors are robust to missing observations by encoding high-level 3D geometric and semantic information. Experiments on several challenging large-scale localization datasets demonstrate reliable localization under extreme viewpoint, illumination, and geometry changes.

    Hierarchical topic structuring: from dense segmentation to topically focused fragments via burst analysis

    Topic segmentation traditionally relies on lexical cohesion measured through word re-occurrences to output a dense segmentation, either linear or hierarchical. In this paper, a novel organization of the topical structure of textual content is proposed. Rather than searching for topic shifts to yield dense segmentation, we propose an algorithm to extract topically focused fragments organized in a hierarchical manner. This is achieved by leveraging the temporal distribution of word re-occurrences, searching for bursts, to skirt the limits imposed by a global counting of lexical re-occurrences within segments. Comparison to a reference dense segmentation on varied datasets indicates that we can achieve a better topic focus while retrieving all of the important aspects of a text.
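    The burst-finding step can be sketched very simply: given the positions where a word re-occurs in a text, group consecutive occurrences into a burst whenever the gap between them stays below a threshold. The `max_gap` threshold is a hypothetical simplification; the paper's burst analysis is more principled than a fixed cutoff.

```python
def burst_intervals(positions, max_gap):
    """Group a word's occurrence positions into bursts.

    positions: sorted token positions where the word occurs.
    max_gap: maximum gap between consecutive occurrences in one burst.
    Returns (start, end) position pairs, one per burst.
    """
    if not positions:
        return []
    bursts = []
    start = prev = positions[0]
    for pos in positions[1:]:
        if pos - prev > max_gap:
            bursts.append((start, prev))
            start = pos
        prev = pos
    bursts.append((start, prev))
    return bursts

# Occurrences cluster near the start of the text and again around position 100,
# suggesting the word is topically focused in two fragments.
spans = burst_intervals([3, 5, 9, 100, 104], max_gap=20)
```

    Each burst interval is a candidate topically focused fragment for that word, which is exactly the information a global count of re-occurrences throws away.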

    Temporal search in document streams

    In this thesis, we address major challenges in searching temporal document collections. In such collections, documents are created and/or edited over time. Examples of temporal document collections are web archives, news archives, blogs, personal emails and enterprise documents. Unfortunately, traditional IR approaches based on term matching alone can give unsatisfactory results when searching temporal document collections. The reason for this is twofold: the contents of documents are strongly time-dependent, i.e., documents are about events that happened during particular time periods, and a query representing an information need can be time-dependent as well, i.e., a temporal query. On the other hand, time-only-based methods fall short when it comes to reasoning about events in social media. Over the last few years, users have created chronologically ordered documents about topics that draw their attention at an ever-increasing pace. However, with the vast adoption of social media, new types of marketing campaigns have been developed in order to promote content, e.g., brands, products, celebrities, etc.
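    The shortcoming of term matching alone can be made concrete with a minimal scoring sketch: combine a document's textual relevance with its temporal proximity to the query's time of interest, so that two documents with identical term matches are separated by time. The function, the exponential decay form, and the `decay` parameter are hypothetical illustrations, not the thesis's actual ranking model.

```python
import math

def temporal_score(text_score, doc_time, query_time, decay=0.1):
    """Combine textual relevance with temporal proximity.

    text_score: term-matching relevance of the document.
    doc_time, query_time: timestamps on a shared scale (e.g. days).
    decay: exponential decay rate (hypothetical parameter).
    """
    return text_score * math.exp(-decay * abs(doc_time - query_time))

# Two documents match the query terms equally well, but one was
# written near the query's time of interest and one long before.
recent = temporal_score(1.0, doc_time=98, query_time=100)
stale = temporal_score(1.0, doc_time=60, query_time=100)
```

    Pure term matching would rank these documents identically; any reasonable temporal component breaks the tie in favor of the document closer to the query's time.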