11 research outputs found

    Scalable Text Mining with Sparse Generative Models

    Get PDF
    The information age has brought a deluge of data. Much of this is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a need for scalable text mining methods. This thesis proposes a solution to scalable text mining: generative models combined with sparse computation. A unifying formalization for generative text models is defined, bringing together research traditions that have used formally equivalent models, but ignored parallel developments. This framework allows the use of methods developed in different processing tasks such as retrieval and classification, yielding effective solutions across different text mining tasks. Sparse computation using inverted indices is proposed for inference on probabilistic models. This reduces the computational complexity of the common text mining operations according to sparsity, yielding probabilistic models with the scalability of modern search engines. The proposed combination provides sparse generative models: a solution for text mining that is general, effective, and scalable. Extensive experimentation on text classification and ranked retrieval datasets are conducted, showing that the proposed solution matches or outperforms the leading task-specific methods in effectiveness, with a order of magnitude decrease in classification times for Wikipedia article categorization with a million classes. The developed methods were further applied in two 2014 Kaggle data mining prize competitions with over a hundred competing teams, earning first and second places

    Towards Evaluating Veracity of Textual Statements on the Web

    Get PDF
    The quality of digital information on the web has been disquieting due to the absence of careful checking. Consequently, a large volume of false textual information is being produced and disseminated with misstatements of facts. The potential negative influence on the public, especially in time-sensitive emergencies, is a growing concern. This concern has motivated this thesis to deal with the problem of veracity evaluation. In this thesis, we set out to develop machine learning models for the veracity evaluation of textual claims based on stance and user engagements. Such evaluation is achieved from three aspects: news stance detection engaged user replies in social media and the engagement dynamics. First of all, we study stance detection in the context of online news articles where a claim is predicted to be true if it is supported by the evidential articles. We propose to manifest a hierarchical structure among stance classes: the high-level aims at identifying relatedness, while the low-level aims at classifying, those identified as related, into the other three classes, i.e., agree, disagree, and discuss. This model disentangles the semantic difference of related/unrelated and the other three stances and helps address the class imbalance problem. Beyond news articles, user replies on social media platforms also contain stances and can infer claim veracity. Claims and user replies in social media are usually short and can be ambiguous; to deal with semantic ambiguity, we design a deep latent variable model with a latent distribution to allow multimodal semantic distribution. Also, marginalizing the latent distribution enables the model to be more robust in relatively smalls-sized datasets. Thirdly, we extend the above content-based models by tracking the dynamics of user engagement in misinformation propagation. To capture these dynamics, we formulate user engagements as a dynamic graph and extract its temporal evolution patterns and geometric features based on an attention-modified Temporal Point Process. This allows to forecast the cumulative number of engaged users and can be useful in assessing the threat level of an individual piece of misinformation. The ability to evaluate veracity and forecast the scale growth of engagement networks serves to practically assist the minimization of online false information’s negative impacts

    Explainable Argument Mining

    Get PDF

    Graded Decompositional Semantic Prediction

    Get PDF
    Compared to traditional approaches, decompositional semantic labeling (DSL) is compelling but introduces complexities for data collection, quality assessment, and modeling. To shed light on these issues and lower barriers to the adoption of DSL or related approaches I bring existing models and novel variations into a shared, familiar framework, facilitating empirical investigation

    Extracting Information from Spoken User Input:A Machine Learning Approach

    Get PDF
    We propose a module that performs automatic analysis of user input in spoken dialogue systems using machine learning algorithms. The input to the module is material received from the speech recogniser and the dialogue manager of the spoken dialogue system, the output is a four-level pragmatic-semantic representation of the user utterance. Our investigation shows that when the four interpretation levels are combined in a complex machine learning task, the performance of the module is significantly better than the score of an informed baseline strategy. However, via a systematic, automatised search for the optimal subtask combinations we can gain substantial improvement produced by both classifiers for all four interpretation subtasks. A case study is conducted on dialogues between an automatised, experimental system that gives information on the phone about train connections in the Netherlands, and its users who speak in Dutch. We find that drawing on unsophisticated, potentially noisy features that characterise the dialogue situation, and by performing automatic optimisation of the formulated machine learning task it is possible to extract sophisticated information of practical pragmatic-semantic value from spoken user input with robust performance. This means that our module can with a good score interpret whether the user of the system is giving slot-filling information, and for which query slots (e.g., departure station, departure time, etc.), whether the user gave a positive or a negative answer to the system, or whether the user signals that there are problems in the interaction.

    Guided Probabilistic Topic Models for Agenda-setting and Framing

    Get PDF
    Probabilistic topic models are powerful methods to uncover hidden thematic structures in text by projecting each document into a low dimensional space spanned by a set of topics. Given observed text data, topic models infer these hidden structures and use them for data summarization, exploratory analysis, and predictions, which have been applied to a broad range of disciplines. Politics and political conflicts are often captured in text. Traditional approaches to analyze text in political science and other related fields often require close reading and manual labeling, which is labor-intensive and hinders the use of large-scale collections of text. Recent work, both in computer science and political science, has used automated content analysis methods, especially topic models to substantially reduce the cost of analyzing text at large scale. In this thesis, we follow this approach and develop a series of new probabilistic topic models, guided by additional information associated with the text, to discover and analyze agenda-setting (i.e., what topics people talk about) and framing (i.e., how people talk about those topics), a central research problem in political science, communication, public policy and other related fields. We first focus on study agendas and agenda control behavior in political debates and other conversations. The model we introduce, Speaker Identity for Topic Segmentation (SITS), is able to discover what topics that are talked about during the debates, when these topics change, and a speaker-specific measure of agenda control. To make the analysis process more effective, we build Argviz, an interactive visualization which leverages SITS's outputs to allow users to quickly grasp the conversational topic dynamics, discover when the topic changes and by whom, and interactively visualize the conversation's details on demand. We then analyze policy agendas in a more general setting of political text. We present the Label to Hierarchy (L2H) model to learn a hierarchy of topics from multi-labeled data, in which each document is tagged with multiple labels. The model captures the dependencies among labels using an interpretable tree-structured hierarchy, which helps provide insights about the political attentions that policymakers focus on, and how these policy issues relate to each other. We then go beyond just agenda-setting and expand our focus to framing--the study of how agenda issues are talked about, which can be viewed as second-level agenda-setting. To capture this hierarchical views of agendas and frames, we introduce the Supervised Hierarchical Latent Dirichlet Allocation (SHLDA) model, which jointly captures a collection of documents, each is associated with a continuous response variable such as the ideological position of the document's author on a liberal-conservative spectrum. In the topic hierarchy discovered by SHLDA, higher-level nodes map to more general agenda issues while lower-level nodes map to issue-specific frames. Although qualitative analysis shows that the topic hierarchies learned by SHLDA indeed capture the hierarchical view of agenda-setting and framing motivating the work, interpreting the discovered hierarchy still incurs moderately high cost due to the complex and abstract nature of framing. Motivated by improving the hierarchy, we introduce Hierarchical Ideal Point Topic Model (HIPTM) which jointly models a collection of votes (e.g., congressional roll call votes) and both the text associated with the voters (e.g., members of Congress) and the items (e.g., congressional bills). Customized specifically for capturing the two-level view of agendas and frames, HIPTM learns a two-level hierarchy of topics, in which first-level nodes map to an interpretable policy issue and second-level nodes map to issue-specific frames. In addition, instead of using pre-computed response variable, HIPTM also jointly estimates the ideological positions of voters on multiple interpretable dimensions
    corecore