    A history and theory of textual event detection and recognition

    Measuring the semantic uncertainty of news events for evolution potential estimation

    The evolution potential of news events can support the decision making of both corporations and governments. For example, a corporation could manage a public relations crisis in a timely manner if a negative news event about it is known in advance to have large evolution potential. However, existing state-of-the-art methods are mainly based on time-series historical data, which makes them unsuitable for news events with limited historical data and bursty properties. In this article, we propose a purely content-based method to estimate the evolution potential of news events. The proposed method treats a news event at a given time point as a system composed of different keywords, and the uncertainty of this system is defined and measured as the Semantic Uncertainty of the news event. In addition, an uncertainty space is constructed with two extreme states: the most uncertain state and the most certain state. We believe that the Semantic Uncertainty is correlated with the content evolution of news events, so it can be used to estimate their evolution potential. To verify the proposed method, we present detailed experimental setups and results measuring the correlation between the Semantic Uncertainty and the Content Change of news events on collected news event data. The results show that this correlation does exist and is stronger than the correlation between the Content Change and the value produced by the time-series-based method. Therefore, the Semantic Uncertainty can be used to estimate the evolution potential of news events.
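    The abstract does not spell out how the Semantic Uncertainty is computed, but one natural reading of treating an event as "a system composed of different keywords", with a most-uncertain and a most-certain extreme state, is a normalized entropy over the event's keyword distribution. The following is a minimal sketch under that assumption; the function name and example keywords are hypothetical, not taken from the paper.

```python
# Illustrative sketch only: Semantic Uncertainty approximated as the normalized
# Shannon entropy of an event's keyword distribution. The uniform distribution
# plays the role of the "most uncertain" state and a single dominant keyword the
# "most certain" state; the paper's actual definition may differ.
from collections import Counter
import math

def semantic_uncertainty(keywords):
    """Return a value in [0, 1]; higher means the event is more uncertain."""
    counts = Counter(keywords)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    # Entropy of the most uncertain state (uniform over the observed keywords).
    max_entropy = math.log(len(counts)) if len(counts) > 1 else 1.0
    return entropy / max_entropy

# Hypothetical keywords extracted from reports about one event at time t.
event_keywords = ["recall", "battery", "recall", "fire", "apology", "recall"]
print(semantic_uncertainty(event_keywords))  # roughly 0.9: still quite uncertain
```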

    Machine Learning Techniques for Topic Detection and Authorship Attribution in Textual Data

    The unprecedented expansion of user-generated content in recent years demands more effective information filtering in order to extract high-quality information from the huge amount of available data. In this dissertation, we begin with a focus on topic detection from microblog streams, which is the first step toward monitoring and summarizing social data. We then shift our focus to the authorship attribution task, a sub-area of computational stylometry. It is worth noting that determining the style of a document is orthogonal to determining its topic, since the document features that capture style are largely independent of topic. We first present a frequent pattern mining approach for topic detection from microblog streams. This approach uses a Maximal Sequence Mining (MSM) algorithm to extract pattern sequences, where each pattern sequence is an ordered set of terms. We then construct a pattern graph, a directed graph representation of the mined sequences, and apply a community detection algorithm to group the mined patterns into different topic clusters; a sketch of this step follows below. Experiments on Twitter datasets demonstrate that the MSM approach achieves high performance in comparison with state-of-the-art methods. For authorship attribution, whereas previously proposed neural models in the literature focus mainly on lexical features and lack multi-level modeling of writing style, we present a syntactic recurrent neural network that encodes the syntactic patterns of a document in a hierarchical structure. The proposed model learns syntactic representations of sentences from their sequences of part-of-speech tags. Furthermore, we present a style-aware neural model that encodes document information at three stylistic levels (lexical, syntactic, and structural) and evaluate it on the authorship attribution task. Our experimental results on four authorship attribution benchmark datasets reveal the benefits of encoding document information at all three stylistic levels when compared with baseline methods from the literature. We extend this work with a transfer learning approach to measure the impact of lower-level versus higher-level linguistic representations on the task of authorship attribution. Finally, we present a self-supervised framework for learning structural representations of sentences. The self-supervised network is a Siamese network with two components: a lexical sub-network and a syntactic sub-network, which take as input the sequence of words and the sequence of their corresponding structural labels, respectively. The model is trained with a contrastive loss objective, so that each word in a sentence is embedded into a vector representation that mainly carries structural information. The learned structural representations can be concatenated with existing pre-trained word embeddings to create style-aware embeddings that carry both semantic and syntactic information and are well suited to the domain of authorship attribution.
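    As a rough illustration of the pattern-graph step described above (not the dissertation's actual code), the sketch below assumes the MSM algorithm has already produced ordered term sequences and shows the graph construction and clustering; the example patterns and the use of NetworkX's greedy modularity communities are stand-ins of my own.

```python
# Illustrative sketch only: build a directed pattern graph from mined term
# sequences and cluster it into topics with a generic community detection method.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical output of the Maximal Sequence Mining (MSM) step.
mined_patterns = [
    ("earthquake", "tsunami", "warning"),
    ("tsunami", "warning", "coast"),
    ("election", "vote", "result"),
    ("vote", "result", "announced"),
]

# Pattern graph: terms are nodes; consecutive terms in a mined sequence are edges.
pattern_graph = nx.DiGraph()
for pattern in mined_patterns:
    for src, dst in zip(pattern, pattern[1:]):
        pattern_graph.add_edge(src, dst)

# Community detection on the undirected projection groups terms into topic clusters.
communities = greedy_modularity_communities(pattern_graph.to_undirected())
for i, community in enumerate(communities):
    print(f"topic {i}: {sorted(community)}")
```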

    Statistical data mining for Sina Weibo, a Chinese micro-blog: sentiment modelling and randomness reduction for topic modelling

    Before the arrival of modern information and communication technology, it was not easy to capture people’s thoughts and sentiments; however, the development of statistical data mining techniques and the prevalence of mass social media now provide opportunities to capture those trends. Among all types of social media, micro-blogs use a limit of 140 characters to force users to get straight to the point, making posts brief but content-rich resources for investigation. The data mining subject of this thesis is Weibo, the most popular Chinese micro-blog. In the first part of the thesis, we perform various exploratory data mining on Weibo. After a literature review of micro-blogs, the initial steps of data collection and data pre-processing are introduced. This is followed by analysis of posting times, analysis of the relationship between post intensity and share price, term frequency analysis, and cluster analysis. Secondly, we conduct time series modelling on the sentiment of Weibo posts. Considering the properties of Weibo sentiment, we mainly adopt a framework with an ARMA mean and GARCH-type conditional variance to fit the patterns. Other distinct models are also considered for negative sentiment because of its complexity. Model selection and validation are introduced to verify the fitted models. Thirdly, Latent Dirichlet Allocation (LDA) is explained in depth as a way to discover topics from large sets of textual data. The major contribution is a Randomness Reduction Algorithm applied to post-process the output of topic models, filtering out insignificant topics and utilising topic distributions to find the most persistent topics. At the end of this chapter, evidence of the effectiveness of the Randomness Reduction is presented from empirical studies. Topic classification and evolution are also unveiled.
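    The ARMA-mean/GARCH-variance framework mentioned above can be illustrated with a small sketch (my own illustration using the `arch` package, not the thesis's fitted models): it fits an AR(1) mean with GARCH(1, 1) conditional variance to a synthetic daily sentiment series.

```python
# Illustrative sketch only: AR(1) mean with GARCH(1, 1) conditional variance on a
# placeholder series; the real input would be the aggregated daily sentiment of
# Weibo posts.
import numpy as np
import pandas as pd
from arch import arch_model

rng = np.random.default_rng(0)
sentiment = pd.Series(rng.normal(0, 1, 500),
                      index=pd.date_range("2013-01-01", periods=500, freq="D"))

model = arch_model(sentiment, mean="AR", lags=1, vol="GARCH", p=1, q=1)
result = model.fit(disp="off")
print(result.summary())
```

    Note that the `arch` package only supports autoregressive mean specifications, so a full ARMA mean would need a separate ARIMA fit (e.g., via statsmodels) with a GARCH model on its residuals.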

    Modelling input texts: from Tree Kernels to Deep Learning

    One of the core questions in designing modern Natural Language Processing (NLP) systems is how to model input textual data so that the learning algorithm is provided with enough information to estimate accurate decision functions. The mainstream approach is to represent input objects as feature vectors in which each value encodes some of their aspects, e.g., syntax, semantics, etc. Feature-based methods have demonstrated state-of-the-art results on various NLP tasks. However, designing good features is a highly empirical process: it depends greatly on the task and requires a significant amount of domain expertise. Moreover, extracting features for complex NLP tasks often requires expensive pre-processing steps that run a large number of linguistic tools and rely on external knowledge sources that are often unavailable or hard to obtain. Hence, this process is not cheap and often constitutes one of the major challenges when attempting a new task or adapting to a different language or domain. The problem of modelling input objects is even more acute when the input examples are not single objects but pairs of objects, as in various learning-to-rank problems in Information Retrieval and Natural Language Processing. An alternative to feature-based methods is to use kernels, which are essentially non-linear functions mapping input examples into a high-dimensional space, thus allowing decision functions with higher discriminative power to be learned. Kernels implicitly generate a very large number of features by computing the similarity between input examples in that implicit space. A well-designed kernel function can greatly reduce the effort of designing a large set of manual features, often leading to superior results. However, in recent years the use of kernel methods in NLP has been greatly underestimated, primarily for the following reasons: (i) learning with kernels is slow, as it requires optimization in the dual space, leading to quadratic complexity; (ii) applying kernels to input objects encoded with vanilla structures, e.g., those generated by syntactic parsers, often yields only minor improvements over carefully designed feature-based methods. In this thesis, we adopt the kernel learning approach for solving complex NLP tasks and focus primarily on solutions to the aforementioned problems posed by the use of kernels. In particular, we design novel learning algorithms for training Support Vector Machines with structural kernels, e.g., tree kernels, considerably speeding up training over conventional SVM training methods. We show that the training algorithms developed in this thesis make it possible to train tree kernel models on large-scale datasets containing millions of instances, which was not possible before. Next, we focus on the problem of designing the input structures that are fed to tree kernel functions to automatically generate a large set of tree-fragment features. We demonstrate that the previously used plain structures generated by syntactic parsers, e.g., syntactic or dependency trees, are often a poor choice, compromising the expressivity offered by a tree kernel learning framework. We propose several effective design patterns for the input tree structures for NLP tasks ranging from sentiment analysis to answer passage reranking. The central idea is to inject additional task-relevant semantic information directly into the tree nodes and let the expressive kernels generate rich feature spaces.
    For opinion mining tasks, the additional semantic information injected into tree nodes can be word polarity labels, while for the more complex task of modelling text pairs, relational information about overlapping words in a pair appears to significantly improve the accuracy of the resulting models. Finally, we observe that both feature-based and kernel methods typically treat words as atomic units, which makes matching different yet semantically similar words problematic. Conversely, the idea of distributional approaches, which model words as vectors, is much more effective at establishing a semantic match between words and phrases. While tree kernel functions do allow for more flexible matching between phrases and sentences through their syntactic contexts, their representations cannot be tuned on the training set as is possible with distributional approaches. Recently, deep learning approaches have been applied to generalize the distributional word matching problem to matching sentences, taking it one step further by learning the optimal sentence representations for a given task. Deep neural networks have already claimed state-of-the-art performance in many computer vision, speech recognition, and natural language tasks. Following this trend, this thesis also explores the value of deep learning architectures for modelling input texts and text pairs, building on some of the ideas for modelling input objects proposed within the tree kernel learning framework. In particular, we use the idea of relational linking (proposed in the preceding chapters to encode text pairs using linguistic tree structures) to design a state-of-the-art deep learning architecture for modelling text pairs. We compare the proposed deep learning models, which require even less manual intervention in the feature design process, with the previously described tree kernel methods, which already offer a very good trade-off between feature-engineering effort and the expressivity of the resulting representation. Our deep learning models demonstrate state-of-the-art performance on recent benchmarks for Twitter Sentiment Analysis, Answer Sentence Selection, and Microblog Retrieval.
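    The tree kernel machinery discussed above can be made concrete with a small sketch: my own plain-Python rendering of the classic Collins-Duffy subset tree kernel, which counts shared tree fragments between two parse trees, rather than the thesis's optimized SVM training algorithms. The toy trees, decay factor, and tuple encoding are assumptions for illustration.

```python
# Illustrative sketch only: Collins-Duffy subset tree kernel over trees written as
# nested tuples (label, child, child, ...); leaves are plain strings.

def nodes(tree):
    """Yield every internal node of a nested-tuple tree."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from nodes(child)

def production(node):
    """A node's production: its label plus the labels of its children."""
    return (node[0],) + tuple(c[0] if isinstance(c, tuple) else c for c in node[1:])

def _prod(values):
    result = 1.0
    for v in values:
        result *= v
    return result

def delta(n1, n2, decay=0.4):
    """Weighted count of common fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    if all(not isinstance(c, tuple) for c in n1[1:]):  # pre-terminal node
        return decay
    return decay * _prod(1.0 + delta(c1, c2, decay)
                         for c1, c2 in zip(n1[1:], n2[1:]))

def tree_kernel(t1, t2, decay=0.4):
    """Subset tree kernel: sum of delta over all pairs of nodes."""
    return sum(delta(n1, n2, decay) for n1 in nodes(t1) for n2 in nodes(t2))

# Two toy parse trees sharing the fragment (VP (V likes) (NP (N tea))).
t1 = ("S", ("NP", ("N", "Mary")), ("VP", ("V", "likes"), ("NP", ("N", "tea"))))
t2 = ("S", ("NP", ("N", "John")), ("VP", ("V", "likes"), ("NP", ("N", "tea"))))
print(tree_kernel(t1, t2))
```

    Injecting task-specific information, as the thesis proposes, amounts to enriching the node labels (for example with polarity or word-overlap marks) before the kernel is computed, so that only fragments agreeing on that extra information are matched.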