5 research outputs found

    Extracting News Events from Microblogs

    The Twitter stream has become a major source of information for many people, but the sheer volume of tweets and the noisy nature of their content have long made harvesting knowledge from Twitter a challenging task for researchers. Aiming to overcome some of the main challenges of extracting hidden information from tweet streams, this work proposes a new approach for real-time detection of news events from the Twitter stream. We divide our approach into three steps. The first step uses a neural network (deep learning) to detect news-relevant tweets in the stream. The second step applies a novel streaming-data clustering algorithm to the detected news tweets to form news events. The third and final step ranks the detected events by the size of the event clusters and the growth speed of their tweet frequencies. We evaluate the proposed system on a large, publicly available corpus of annotated news events from Twitter. As part of the evaluation, we compare our approach with a related state-of-the-art solution. Overall, our experiments and user-based evaluation show that our approach to detecting current (real) news events delivers state-of-the-art performance.
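
    The abstract describes the pipeline only at a high level, so the following is a minimal sketch of how the three steps could fit together. The keyword-based relevance filter, the greedy single-pass clustering, and the size-times-growth ranking score are illustrative assumptions standing in for the paper's actual components, which are not specified here.

    from collections import Counter
    from math import sqrt

    def is_news_relevant(tweet):
        # Stand-in for the neural (deep learning) relevance filter of step 1;
        # a real system would use a trained classifier, not a keyword check.
        return any(w in tweet.lower() for w in ("breaking", "report", "news"))

    def vectorize(text):
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    class Cluster:
        def __init__(self, tweet, t):
            self.centroid = vectorize(tweet)
            self.tweets = [tweet]
            self.first_seen = self.last_seen = t

        def add(self, tweet, t):
            self.centroid += vectorize(tweet)
            self.tweets.append(tweet)
            self.last_seen = t

    def cluster_stream(tweets, threshold=0.5):
        # Step 2: greedy single-pass clustering -- assign each news tweet to the
        # most similar existing cluster, or open a new one.
        clusters = []
        for t, tweet in enumerate(tweets):  # enumeration index as a pseudo-timestamp
            if not is_news_relevant(tweet):  # step 1: filter out non-news tweets
                continue
            vec = vectorize(tweet)
            best = max(clusters, key=lambda c: cosine(vec, c.centroid), default=None)
            if best is not None and cosine(vec, best.centroid) >= threshold:
                best.add(tweet, t)
            else:
                clusters.append(Cluster(tweet, t))
        return clusters

    def rank_events(clusters):
        # Step 3: rank events by cluster size and by how fast tweets accumulated.
        def score(c):
            span = max(c.last_seen - c.first_seen, 1)
            return len(c.tweets) * (len(c.tweets) / span)  # size x growth speed
        return sorted(clusters, key=score, reverse=True)

    The ranking score multiplies cluster size by tweets per unit of time, matching the abstract's two criteria of cluster size and growth speed, though the paper's exact formula may differ.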

    ATM: Adversarial-neural Topic Model

    Topic models are widely used for thematic structure discovery in text, but traditional topic models often require dedicated inference procedures for the specific task at hand. Moreover, they are not designed to generate word-level semantic representations. To address these limitations, this paper proposes a neural topic modeling approach based on Generative Adversarial Nets (GANs), called the Adversarial-neural Topic Model (ATM). To the best of our knowledge, this work is the first attempt to use adversarial training for topic modeling. The proposed ATM models topics with a Dirichlet prior and employs a generator network to capture the semantic patterns among latent topics. The generator can also produce word-level semantic representations. In addition, to illustrate the feasibility of porting ATM to tasks other than topic modeling, we apply ATM to open-domain event extraction. To validate the effectiveness of the proposed ATM, two topic modeling benchmark corpora and an event dataset are employed in the experiments. Our experimental results on the benchmark corpora show that ATM generates more coherent topics (as measured by five topic coherence metrics), outperforming a number of competitive baselines. Moreover, the experiments on the event dataset validate that the proposed approach is able to extract meaningful events from news articles.
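
    Since the abstract specifies only that a generator maps Dirichlet-sampled topic proportions to documents while an adversarially trained counterpart provides the training signal, here is a minimal PyTorch sketch of that setup. The layer sizes, optimizer settings, and standard GAN loss are assumptions for illustration; the paper's actual architecture and training objective may differ.

    import torch
    import torch.nn as nn

    K, V = 20, 2000  # number of topics and vocabulary size (assumed values)

    class Generator(nn.Module):
        # Maps Dirichlet-sampled topic proportions to a word distribution,
        # capturing semantic patterns among latent topics.
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(K, 256), nn.ReLU(),
                nn.Linear(256, V), nn.Softmax(dim=-1),
            )

        def forward(self, theta):
            return self.net(theta)

    class Discriminator(nn.Module):
        # Scores whether a word distribution comes from a real document.
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(V, 256), nn.LeakyReLU(0.2),
                nn.Linear(256, 1),
            )

        def forward(self, x):
            return self.net(x)

    gen, disc = Generator(), Discriminator()
    opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()
    prior = torch.distributions.Dirichlet(torch.full((K,), 0.1))  # Dirichlet prior over topics

    def train_step(real_docs):
        # real_docs: a batch of normalized bag-of-words vectors, shape (B, V).
        B = real_docs.size(0)
        theta = prior.sample((B,))       # topic proportions drawn from the prior
        fake_docs = gen(theta)

        # Discriminator step: real documents -> 1, generated documents -> 0.
        opt_d.zero_grad()
        d_loss = (bce(disc(real_docs), torch.ones(B, 1))
                  + bce(disc(fake_docs.detach()), torch.zeros(B, 1)))
        d_loss.backward()
        opt_d.step()

        # Generator step: try to make generated documents look real.
        opt_g.zero_grad()
        g_loss = bce(disc(fake_docs), torch.ones(B, 1))
        g_loss.backward()
        opt_g.step()

    After training, each row of the generator's final linear layer can be read as a word-level representation, and feeding one-hot topic vectors through the generator yields per-topic word distributions.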

    Topic models for short text data

    Topic models are known to suffer from sparsity when applied to short text data. The problem is caused by the reduced number of observations available for reliable inference (i.e., the words in a document). A popular heuristic for overcoming this problem is to perform some form of document aggregation by context (e.g., author or hashtag) before training. One part of this dissertation is dedicated to explicitly modeling the implicit assumptions of the document-aggregation heuristic and applying it to two well-known model architectures: a mixture and an admixture. Our findings indicate that an admixture model benefits more from aggregation than a mixture model, which rarely improved over its baseline (the standard mixture). We also find that the state of the art on short text data can be surpassed as long as every context is shared by a small number of documents. In the second part of the dissertation, we develop a more general-purpose topic model which can also be used when contextual information is not available. The proposed model is formulated around the observation that a classic topic model like an admixture works well on normal text data because patterns of word co-occurrence arise across the documents; in a short-text dataset, such patterns are much less likely to arise. The model assumes every document is a bag of word co-occurrences, where each co-occurrence belongs to a latent topic. The documents are enhanced a priori with related co-occurrences from the other documents, so that the collection has a greater chance of exhibiting word patterns. The proposed model performs well, surpassing the state of the art and popular topic-model baselines.
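
    As an illustration of the second part's core idea, the sketch below represents each document as a bag of unordered word co-occurrences and enriches it a priori with related co-occurrences from overlapping documents. The similarity measure and borrowing rule are illustrative assumptions; the dissertation's actual model additionally assigns each co-occurrence to a latent topic, which is omitted here.

    from collections import Counter
    from itertools import combinations

    def cooccurrences(doc):
        # Represent a document as a bag of unordered word pairs.
        words = sorted(set(doc.lower().split()))
        return Counter(frozenset(p) for p in combinations(words, 2))

    def enhance(docs, top_k=5):
        # Enrich each short document a priori with co-occurrences borrowed
        # from the documents it overlaps with most, so that word patterns too
        # sparse to recur within one document can recur across the collection.
        bags = [cooccurrences(d) for d in docs]
        enhanced = []
        for i, bag in enumerate(bags):
            vocab = set().union(*bag)  # all words in this document's pairs
            neighbours = sorted(
                (j for j in range(len(bags)) if j != i),
                key=lambda j: sum((bag & bags[j]).values()),  # shared mass
                reverse=True,
            )
            enriched = Counter(bag)
            for j in neighbours[:top_k]:
                for pair, n in bags[j].items():
                    if pair & vocab:  # borrow only pairs touching this document
                        enriched[pair] += n
            enhanced.append(enriched)
        return enhanced

    Each enriched bag would then be fed to a topic model that assigns a latent topic to every co-occurrence pair, as the abstract describes.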

    A Simple Bayesian Modelling Approach to Event Extraction from Twitter
