NEURAL NETWORKS FOR TEXTUAL EMOTION RECOGNITION AND ANALYSIS

Author: Alhuzali Hassan
Publication date: 01/08/2022

The University of Manchester - Institutional Repository

Scalable Text Mining with Sparse Generative Models

Author: Puurula Antti
Publication venue: 'University of Waikato'
Publication date: 22/06/2015

The information age has brought a deluge of data. Much of this is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a need for scalable text mining methods. This thesis proposes a solution to scalable text mining: generative models combined with sparse computation. A unifying formalization for generative text models is defined, bringing together research traditions that have used formally equivalent models, but ignored parallel developments. This framework allows the use of methods developed in different processing tasks such as retrieval and classification, yielding effective solutions across different text mining tasks. Sparse computation using inverted indices is proposed for inference on probabilistic models. This reduces the computational complexity of the common text mining operations according to sparsity, yielding probabilistic models with the scalability of modern search engines. The proposed combination provides sparse generative models: a solution for text mining that is general, effective, and scalable. Extensive experimentation on text classification and ranked retrieval datasets is conducted, showing that the proposed solution matches or outperforms the leading task-specific methods in effectiveness, with an order of magnitude decrease in classification times for Wikipedia article categorization with a million classes. The developed methods were further applied in two 2014 Kaggle data mining prize competitions with over a hundred competing teams, earning first and second places.
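
The idea of running probabilistic inference through an inverted index can be illustrated with a minimal sketch. It is not taken from the thesis: it assumes a plain multinomial naive Bayes classifier with additive smoothing, and the names `train`, `classify`, and `alpha` are hypothetical. Each class starts from a smoothed default score, and only the (word, class) pairs stored in the index contribute corrections, so the work in the inner loop grows with the number of matching postings rather than with vocabulary size times class count.

```python
import math
from collections import Counter, defaultdict


def train(docs, alpha=1.0):
    """Build a naive Bayes model with an inverted index.

    docs is a list of (tokens, label) pairs. alpha is an assumed
    additive smoothing constant, not a value from the thesis.
    """
    class_docs = Counter()          # label -> number of documents
    class_tokens = Counter()        # label -> number of tokens
    counts = defaultdict(Counter)   # label -> word -> count
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        for w in tokens:
            counts[label][w] += 1
            class_tokens[label] += 1
            vocab.add(w)

    v, n = len(vocab), len(docs)
    log_prior = {c: math.log(class_docs[c] / n) for c in class_docs}
    # Log-probability of any word the class has never seen.
    log_default = {
        c: math.log(alpha / (class_tokens[c] + alpha * v)) for c in class_docs
    }
    # Inverted index: word -> [(label, log p(w|c) - log_default[c])].
    index = defaultdict(list)
    for c, word_counts in counts.items():
        denom = class_tokens[c] + alpha * v
        for w, k in word_counts.items():
            delta = math.log((k + alpha) / denom) - log_default[c]
            index[w].append((c, delta))
    return log_prior, log_default, dict(index), vocab


def classify(tokens, model):
    """Score classes by visiting only the postings of the query words."""
    log_prior, log_default, index, vocab = model
    tf = Counter(w for w in tokens if w in vocab)  # ignore unseen words
    length = sum(tf.values())
    # Baseline: every query word treated as unseen by every class.
    # (This step still touches each class once; a fully sparse variant
    # would defer it.)
    scores = {c: log_prior[c] + length * log_default[c] for c in log_prior}
    # Sparse correction: only (word, class) pairs present in the index.
    for w, k in tf.items():
        for c, delta in index.get(w, ()):
            scores[c] += k * delta
    return max(scores, key=scores.get), scores
```

Calling `classify(tokens, train(docs))` returns the best label together with the per-class log scores, which equal the scores a dense naive Bayes computation over the same smoothed counts would produce.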

arXiv.org e-Print Archive

Research Commons@Waikato

Towards Evaluating Veracity of Textual Statements on the Web

Author: Zhang Qiang
Publication venue: UCL (University College London)
Publication date: 28/10/2021

The quality of digital information on the web has been disquieting due to the absence of careful checking. Consequently, a large volume of false textual information is being produced and disseminated with misstatements of facts. The potential negative influence on the public, especially in time-sensitive emergencies, is a growing concern. This concern has motivated this thesis to deal with the problem of veracity evaluation. In this thesis, we set out to develop machine learning models for the veracity evaluation of textual claims based on stance and user engagements. Such evaluation is achieved from three aspects: news stance detection, engaged user replies in social media, and the engagement dynamics. First of all, we study stance detection in the context of online news articles where a claim is predicted to be true if it is supported by the evidential articles. We propose to manifest a hierarchical structure among stance classes: the high level aims at identifying relatedness, while the low level aims at classifying those identified as related into the other three classes, i.e., agree, disagree, and discuss. This model disentangles the semantic difference of related/unrelated and the other three stances and helps address the class imbalance problem. Beyond news articles, user replies on social media platforms also contain stances and can be used to infer claim veracity. Claims and user replies in social media are usually short and can be ambiguous; to deal with semantic ambiguity, we design a deep latent variable model with a latent distribution to allow multimodal semantic distribution. Also, marginalizing the latent distribution enables the model to be more robust on relatively small datasets. Thirdly, we extend the above content-based models by tracking the dynamics of user engagement in misinformation propagation. To capture these dynamics, we formulate user engagements as a dynamic graph and extract its temporal evolution patterns and geometric features based on an attention-modified Temporal Point Process. This allows forecasting the cumulative number of engaged users and can be useful in assessing the threat level of an individual piece of misinformation. The ability to evaluate veracity and forecast the scale growth of engagement networks serves to practically assist the minimization of online false information’s negative impacts.

UCL Discovery

Explainable Argument Mining

Author: Lawrence John
Publication date: 01/01/2021

University of Dundee Online Publications

Graded Decompositional Semantic Prediction

Author: Teichert Adam R
Publication venue: 'The Busan Gyeongnam Mathematical Society'
Publication date: 26/09/2022

Compared to traditional approaches, decompositional semantic labeling (DSL) is compelling but introduces complexities for data collection, quality assessment, and modeling. To shed light on these issues and lower barriers to the adoption of DSL or related approaches, I bring existing models and novel variations into a shared, familiar framework, facilitating empirical investigation.

JScholarship

Extracting Information from Spoken User Input: A Machine Learning Approach

Author: Lendvai P.K.
Publication venue: [n.n.]
Publication date: 01/01/2004

We propose a module that performs automatic analysis of user input in spoken dialogue systems using machine learning algorithms. The input to the module is material received from the speech recogniser and the dialogue manager of the spoken dialogue system, the output is a four-level pragmatic-semantic representation of the user utterance. Our investigation shows that when the four interpretation levels are combined in a complex machine learning task, the performance of the module is significantly better than the score of an informed baseline strategy. However, via a systematic, automatised search for the optimal subtask combinations we can gain substantial improvement produced by both classifiers for all four interpretation subtasks. A case study is conducted on dialogues between an automatised, experimental system that gives information on the phone about train connections in the Netherlands, and its users who speak in Dutch. We find that drawing on unsophisticated, potentially noisy features that characterise the dialogue situation, and by performing automatic optimisation of the formulated machine learning task it is possible to extract sophisticated information of practical pragmatic-semantic value from spoken user input with robust performance. This means that our module can with a good score interpret whether the user of the system is giving slot-filling information, and for which query slots (e.g., departure station, departure time, etc.), whether the user gave a positive or a negative answer to the system, or whether the user signals that there are problems in the interaction.

Tilburg University Repository

Projects in Applied Data Science: Fall 2019

Author: Aedula Rahul
Arenson Alyssa
Gandhi Yash
Hobbs Steven
Jawale Parth
Kjerland-Nicoletti Holden
Miles Israel
Munoz Jacob
Palavalli Karthik
Phillips Caleb
Rawal Srishti
Reddy Thanika
Sharma Lakshya
Sharnez Nimra
Sugar Orgil
Tokumoto Tyler
Umada Tetsumichi
Williams Lindy
Publication date: 19/01/2019

This document contains semester projects for students in CSCI 4381/7000 Data Science Projects. This course explores concepts and techniques for design, formulation and execution of practical, applied data science. Topics covered include experimental design, statistical analysis and predictive modeling, machine learning, data visualization, scientific writing and presentation. During the class, students selected a semester-long project to acquire, analyze, and understand data in support of a research question. In addition to traditional lectures, students read and discussed published papers on data science topics, practiced skills in recitation sessions, and entertained guest lectures from expert data scientists in the field. Outside of these readings and recitations, students were allowed to work on their projects exclusively and were supported with meetings, peer-discussion and copyediting. In terms of the scope of the final product, undergraduate students were asked to perform a research or engineering task of some complexity while graduate students were additionally required to perform a survey of related work, demonstrate some novelty in their approach, and describe the position of their contribution within the broader literature. All students who performed at or above these expectations were offered the opportunity to contribute their paper for publication in this technical summary. The diversity of the papers herein is representative of the diversity of interests of the students in the class. There is no common trend among the papers submitted and each takes a different topic to task. Students made use of open data or worked with organizations to acquire data. Several students pivoted their projects early on due to limitations and difficulties in data access, a real-world challenge in practical data science. The projects herein range from analyzing traffic in cities, restaurant trends and Facebook responses to smartphone accelerometer data, scaling laws in higher education, and bicycle trends in Boulder, Colorado. Analysis approaches are similarly varied: visualization, statistical analysis and modeling, machine learning, reinforcement learning, etc. Most papers can be understood as exploratory data analysis, although some emphasize interactive visualization and others emphasize statistical modeling and prediction aimed at testing a well-defined research question. To inform the style of their approach, students read papers from a broad sampling of original research. They used these readings to build an understanding of approaches to presentation and analysis in the modern scientific literature. One paper was held out from this compendium so that it could be submitted for publication to a peer-reviewed venue. Please direct questions/comments on individual papers to the student authors when contact information has been made available.

CU Scholar Institutional Repository

Guided Probabilistic Topic Models for Agenda-setting and Framing

Author: Nguyen Viet An
Publication date: 01/01/2015

Probabilistic topic models are powerful methods to uncover hidden thematic structures in text by projecting each document into a low dimensional space spanned by a set of topics. Given observed text data, topic models infer these hidden structures and use them for data summarization, exploratory analysis, and predictions, which have been applied to a broad range of disciplines. Politics and political conflicts are often captured in text. Traditional approaches to analyze text in political science and other related fields often require close reading and manual labeling, which is labor-intensive and hinders the use of large-scale collections of text. Recent work, both in computer science and political science, has used automated content analysis methods, especially topic models, to substantially reduce the cost of analyzing text at large scale. In this thesis, we follow this approach and develop a series of new probabilistic topic models, guided by additional information associated with the text, to discover and analyze agenda-setting (i.e., what topics people talk about) and framing (i.e., how people talk about those topics), a central research problem in political science, communication, public policy and other related fields. We first focus on studying agendas and agenda control behavior in political debates and other conversations. The model we introduce, Speaker Identity for Topic Segmentation (SITS), is able to discover what topics are talked about during the debates, when these topics change, and a speaker-specific measure of agenda control. To make the analysis process more effective, we build Argviz, an interactive visualization which leverages SITS's outputs to allow users to quickly grasp the conversational topic dynamics, discover when the topic changes and by whom, and interactively visualize the conversation's details on demand. We then analyze policy agendas in a more general setting of political text. We present the Label to Hierarchy (L2H) model to learn a hierarchy of topics from multi-labeled data, in which each document is tagged with multiple labels. The model captures the dependencies among labels using an interpretable tree-structured hierarchy, which helps provide insights about the political attentions that policymakers focus on, and how these policy issues relate to each other. We then go beyond just agenda-setting and expand our focus to framing, the study of how agenda issues are talked about, which can be viewed as second-level agenda-setting. To capture this hierarchical view of agendas and frames, we introduce the Supervised Hierarchical Latent Dirichlet Allocation (SHLDA) model, which jointly captures a collection of documents, each of which is associated with a continuous response variable such as the ideological position of the document's author on a liberal-conservative spectrum. In the topic hierarchy discovered by SHLDA, higher-level nodes map to more general agenda issues while lower-level nodes map to issue-specific frames. Although qualitative analysis shows that the topic hierarchies learned by SHLDA indeed capture the hierarchical view of agenda-setting and framing motivating the work, interpreting the discovered hierarchy still incurs moderately high cost due to the complex and abstract nature of framing. Motivated by improving the hierarchy, we introduce the Hierarchical Ideal Point Topic Model (HIPTM), which jointly models a collection of votes (e.g., congressional roll call votes) and both the text associated with the voters (e.g., members of Congress) and the items (e.g., congressional bills). Customized specifically for capturing the two-level view of agendas and frames, HIPTM learns a two-level hierarchy of topics, in which first-level nodes map to an interpretable policy issue and second-level nodes map to issue-specific frames. In addition, instead of using a pre-computed response variable, HIPTM also jointly estimates the ideological positions of voters on multiple interpretable dimensions.

Digital Repository at the University of Maryland