150 research outputs found

    Detecting Traffic Information From Social Media Texts With Deep Learning Approaches

    Mining traffic-relevant information from social media data has become an emerging topic due to the real-time and ubiquitous nature of social media. In this paper, we focus on a specific problem in social media mining: extracting traffic-relevant microblogs from Sina Weibo, a Chinese microblogging platform. We cast it as a machine learning problem of short text classification. First, we apply the continuous bag-of-words model to learn word embedding representations from a data set of three billion microblogs. Compared to the traditional one-hot vector representation of words, word embeddings can capture semantic similarity between words and have proved effective in natural language processing tasks. Next, we propose using convolutional neural networks (CNNs), long short-term memory (LSTM) models, and their combination, LSTM-CNN, to extract traffic-relevant microblogs with the learned word embeddings as inputs. We compare the proposed methods with competitive approaches, including the support vector machine (SVM) model based on bag-of-n-gram features, the SVM model based on word vector features, and the multi-layer perceptron model based on word vector features. Experiments show the effectiveness of the proposed deep learning approaches.

    Rumor Identification with Maximum Entropy in MicroNet

    Hashtag biased ranking for keyword extraction from microblog posts

    © Springer International Publishing Switzerland 2015. Nowadays, a huge amount of text is being generated for social networking purposes on the Web. Keyword extraction from such text benefits many applications such as advertising, search, and content filtering. Recent studies show that graph-based ranking is more effective than traditional term or document frequency based approaches. However, most work in the literature constructs a word-to-word graph within a document or a collection of documents before applying some form of random walk. Such a graph does not consider the influence of document importance on keyword extraction. Moreover, social text such as a microblog post usually has special social features, such as hashtags, which can help us understand its topic. In this paper, we propose hashtag biased ranking for keyword extraction from a collection of microblog posts. We first build a word-post weighted graph by taking into account the posts themselves. Then, a hashtag biased random walk is applied on this graph, which guides our approach to extract keywords according to the hashtag topic. Last, the final ranking of a word is determined by the stationary probability after a number of iterations. We evaluate our proposed method on a real collection of Chinese microblog posts. Experiments show that our method is more effective than the traditional word-to-word graph-based ranking in terms of precision.
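
    The abstract above does not give the edge weights or the exact form of the bias, so the following is only a minimal sketch of the general idea: a random walk with restart on a bipartite word-post graph. The function name `hashtag_biased_rank`, the co-occurrence-count weights, the damping value, and the choice to restart uniformly at posts containing the hashtag are all assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict


def hashtag_biased_rank(posts, hashtag, damping=0.85, iterations=50):
    """Rank words by a random walk with restart on a word-post graph.

    posts: list of token lists. hashtag: token the restart is biased toward.
    All weighting choices here are illustrative assumptions.
    """
    # Undirected bipartite graph: ("w", word) <-> ("p", post index),
    # weighted by how often the word occurs in the post (assumption).
    weights = defaultdict(dict)
    for i, tokens in enumerate(posts):
        for tok in tokens:
            w, p = ("w", tok), ("p", i)
            weights[w][p] = weights[w].get(p, 0) + 1
            weights[p][w] = weights[p].get(w, 0) + 1

    nodes = list(weights)

    # Restart distribution: uniform over posts containing the hashtag (assumption).
    tagged = [("p", i) for i, tokens in enumerate(posts) if hashtag in tokens]
    if not tagged:
        raise ValueError("no post contains the hashtag")
    restart = {n: 0.0 for n in nodes}
    for n in tagged:
        restart[n] = 1.0 / len(tagged)

    # Power iteration toward the stationary distribution of the biased walk.
    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * restart[n] for n in nodes}
        for n in nodes:
            total = sum(weights[n].values())
            for m, wt in weights[n].items():
                nxt[m] += damping * score[n] * wt / total
        score = nxt

    # Keep word nodes only, and drop the hashtag itself from the ranking.
    words = {n[1]: s for n, s in score.items() if n[0] == "w" and n[1] != hashtag}
    return sorted(words.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    toy_posts = [
        ["#traffic", "jam", "highway"],
        ["#traffic", "jam", "accident"],
        ["#food", "noodles", "jam"],
    ]
    for word, value in hashtag_biased_rank(toy_posts, "#traffic"):
        print(f"{word}\t{value:.4f}")
```

    In this toy input, words that share posts with the chosen hashtag receive more probability mass than words that appear only in unrelated posts, which is the intended effect of biasing the walk.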

    Crowdsourcing High-Quality Parallel Data Extraction from Twitter

    High-quality parallel data is crucial for a range of multilingual applications, from tuning and evaluating machine translation systems to cross-lingual annotation projection. Unfortunately, automatically obtained parallel data (which is available in relative abundance) tends to be quite noisy. To obtain high-quality parallel data, we introduce a crowdsourcing paradigm in which workers with only basic bilingual proficiency identify translations from an automatically extracted corpus of parallel microblog messages. For less than $350, we obtained over 5000 parallel segments in five language pairs. Evaluated against expert annotations, the quality of the crowdsourced corpus is significantly better than existing automatic methods: it achieves performance comparable to expert annotations when used in MERT tuning of a microblog MT system, and training a parallel sentence classifier with it also leads to improved results. The crowdsourced corpora will be made available i

    ADDRESSING INFORMALITY IN PROCESSING CHINESE MICROTEXT

    Ph.D., Doctor of Philosophy

    Mining Event-Oriented Topics in Microblog Stream with Unsupervised Multi-View Hierarchical Embedding

    This article presents an unsupervised multi-view hierarchical embedding (UMHE) framework to sufficiently reveal the intrinsic topical knowledge in social events. Event-oriented topics are highly related to such events, as they can provide explicit descriptions of what has happened in the social community. In many real-world cases, however, it is difficult to include all attributes of microblogs; more often, only textual aspects are available. Traditional topic modelling methods have failed to generate event-oriented topics from the textual aspects alone, since the inherent relations between topics are often overlooked in these methods. Meanwhile, the metrics in the original word vocabulary space might not effectively capture semantic distances. Our UMHE framework overcomes this severe information deficiency and poor feature representation. The UMHE first develops a multi-view Bayesian rose tree to preliminarily generate prior knowledge for latent topics and their relations. With such prior knowledge, we design an unsupervised translation-based hierarchical embedding method to obtain a better representation of these latent topics. By applying self-adaptive spectral clustering on the embedding space and the original space concomitantly, we eventually extract event-oriented topics as word distributions to express social events. Our framework is purely data-driven and unsupervised, without any external knowledge. Experimental results on the TREC Tweets2011 dataset and a Sina Weibo dataset demonstrate that the UMHE framework can not only construct a hierarchical structure with high fitness but also yield topic embeddings with salient semantics; therefore, it can derive event-oriented topics with meaningful descriptions.

    ANALYZING IMAGE TWEETS IN MICROBLOGS

    Ph.D., Doctor of Philosophy