Fixed versus Dynamic Co-Occurrence Windows in TextRank Term Weights for Information Retrieval
TextRank is a variant of PageRank, typically used on graphs that represent
documents, where vertices denote terms and edges denote relations between
terms. Quite often the relation between terms is simple term co-occurrence
within a fixed window of k terms. The output of TextRank when applied
iteratively is a score for each vertex, i.e. a term weight, that can be used
for information retrieval (IR) just like conventional term frequency based term
weights. So far, when computing TextRank term weights over co-occurrence
graphs, the window of term co-occurrence is always fixed. This work departs
from this, and considers dynamically adjusted windows of term co-occurrence
that follow the document structure on a sentence and paragraph level. The
resulting TextRank term weights are used in a ranking function that re-ranks
1000 initially returned search results in order to improve the precision of the
ranking. Experiments with two IR collections show that adjusting the vicinity
of term co-occurrence when computing TextRank term weights can lead to gains in
early precision.
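The fixed-window construction described above can be sketched as follows. The window size k, damping factor, and iteration count are illustrative choices, not values taken from the paper:

```python
# Sketch of TextRank term weighting over a fixed co-occurrence window.
from collections import defaultdict

def textrank_term_weights(tokens, k=2, d=0.85, iterations=50):
    # Build an undirected co-occurrence graph: two terms are linked
    # if they co-occur within a window of k tokens.
    neighbours = defaultdict(set)
    for i, term in enumerate(tokens):
        for j in range(i + 1, min(i + k + 1, len(tokens))):
            if tokens[j] != term:
                neighbours[term].add(tokens[j])
                neighbours[tokens[j]].add(term)
    # Iterate the PageRank-style update; the score of a term is fed
    # by its neighbours' scores, normalised by their degree.
    scores = {t: 1.0 for t in neighbours}
    for _ in range(iterations):
        scores = {
            t: (1 - d) + d * sum(scores[u] / len(neighbours[u])
                                 for u in neighbours[t])
            for t in neighbours
        }
    return scores

weights = textrank_term_weights(
    "the cat sat on the mat near the cat".split(), k=2)
# Well-connected terms end up with higher weights than peripheral ones.
```

The dynamic variant studied in the paper would replace the fixed `k` with window boundaries derived from sentence and paragraph breaks.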
Computational acquisition of knowledge in small-data environments: a case study in the field of energetics
The UK's defence industry is accelerating its implementation of artificial intelligence, including
expert systems and natural language processing (NLP) tools designed to supplement human
analysis. This thesis examines the limitations of NLP tools in small-data environments (common
in defence) in the defence-related energetic-materials domain. A literature review identifies
the domain-specific challenges of developing an expert system (specifically an ontology). The
absence of domain resources such as labelled datasets and, most significantly, the preprocessing
of text resources are identified as challenges. To address the latter, a novel general-purpose
preprocessing pipeline specifically tailored for the energetic-materials domain is developed. The
effectiveness of the pipeline is evaluated.
The interface between using NLP tools in data-limited environments to supplement or to
completely replace human analysis is examined in a study of the subjective
concept of importance. A methodology for directly comparing the ability of NLP tools
and experts to identify important points in the text is presented. Results show the participants
of the study exhibit little agreement, even on which points in the text are important. The NLP tools,
the expert (the author of the text being examined) and the participants agree only on general statements.
However, as a group, the participants agreed with the expert. In data-limited environments,
the extractive-summarisation tools examined cannot effectively identify the important points
in a technical document in the way an expert can.
A methodology for the classification of journal articles by the technology readiness level (TRL)
of the described technologies in a data-limited environment is proposed. Techniques to overcome
challenges with using real-world data such as class imbalances are investigated. A methodology
to evaluate the reliability of human annotations is presented. Analysis identifies a lack of
agreement and consistency in the expert evaluation of document TRL.
Research on the Evolution and Dynamics of Issue Attention on China's Climate Change
Tohoku University
Ranking, Labeling, and Summarizing Short Text in Social Media
One of the key features driving the growth and success of the Social Web is large-scale participation through user-contributed content, often through short text in social media. Unlike traditional long-form documents (e.g., Web pages, blog posts), these short text resources are typically quite brief (on the order of 100s of characters), often of a personal nature (reflecting opinions and reactions of users), and generated at an explosive rate. Coupled with this explosion of short text in social media is the need for new methods to organize, monitor, and distill relevant information from these large-scale social systems, even in the face of the inherent "messiness" of short text, considering the wide variability in quality, style, and substance of short text generated by a legion of Social Web participants.
Hence, this dissertation seeks to develop new algorithms and methods to ensure the continued growth of the Social Web by enhancing how users engage with short text in social media. Concretely, this dissertation takes a three-fold approach:
First, this dissertation develops a learning-based algorithm to automatically rank short text comments associated with a Social Web object (e.g., Web document, image, video) based on the expressed preferences of the community itself, so that low-quality short text may be filtered and user attention may be focused on highly-ranked short text.
Second, this dissertation organizes short text through labeling, via a graph-based framework for automatically assigning relevant labels to short text. In this way, meaningful semantic descriptors may be assigned to short text for improved classification, browsing, and visualization.
Third, this dissertation presents a cluster-based summarization approach for extracting high-quality viewpoints expressed in a collection of short text, while maintaining diverse viewpoints. By summarizing short text, users may quickly assess the aggregate viewpoints expressed in a collection of short text, without the need to scan each of possibly thousands of short text items.
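The cluster-based summarization idea can be illustrated with a minimal sketch: group short texts by word overlap (here Jaccard similarity with an assumed threshold), then emit one representative per cluster so that diverse viewpoints are retained. This is an illustration, not the dissertation's actual model:

```python
# Toy cluster-then-select summariser for short text.
def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def cluster_and_summarise(texts, threshold=0.3):
    clusters = []
    for text in texts:
        for cluster in clusters:
            # Single-link: join the first cluster whose seed is similar enough.
            if jaccard(text, cluster[0]) >= threshold:
                cluster.append(text)
                break
        else:
            clusters.append([text])
    # Representative: the member most similar to the rest of its cluster.
    return [max(c, key=lambda t: sum(jaccard(t, o) for o in c))
            for c in clusters]

posts = [
    "the new phone battery is great",
    "battery life on the new phone is great",
    "screen cracked after one day",
]
summary = cluster_and_summarise(posts)
# Two viewpoints survive: one battery post and the screen complaint.
```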
Lexical cohesion analysis for topic segmentation, summarization and keyphrase extraction
When we express an idea or story, it is inevitable to use words that are semantically
related to each other. When this phenomenon is exploited from the aspect
of the words in a language, it is possible to infer the level of semantic relationship
between words by observing their distribution and use in discourse. From the
aspect of discourse, it is possible to model the structure of a document by observing
the changes in lexical cohesion in order to attack high-level natural
language processing tasks. In this research, lexical cohesion is investigated from
both of these aspects by first building methods for measuring semantic relatedness
of word pairs and then using these methods in the tasks of topic segmentation,
summarization and keyphrase extraction.
Measuring semantic relatedness of words requires prior knowledge about the
words. Two different knowledge bases are investigated in this research. The
first knowledge base is a manually built network of semantic relationships, while
the second relies on the distributional patterns in raw text corpora. In order to
discover which method is effective in lexical cohesion analysis, a comprehensive
comparison of state-of-the-art methods in semantic relatedness is made.
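The second, corpus-based kind of knowledge base mentioned above can be sketched with a minimal distributional model: each word is represented by its co-occurrence counts in raw text, and relatedness is the cosine of the two count vectors. The window size and toy corpus are illustrative assumptions, not details from the thesis:

```python
# Distributional semantic relatedness from raw-text co-occurrence.
from collections import Counter, defaultdict
from math import sqrt

def cooccurrence_vectors(sentences, window=2):
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window),
                           min(len(tokens), i + window + 1)):
                if j != i:
                    vectors[w][tokens[j]] += 1
    return vectors

def cosine_relatedness(vectors, w1, w2):
    v1, v2 = vectors[w1], vectors[w2]
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = (sqrt(sum(c * c for c in v1.values()))
            * sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

corpus = [
    "the doctor treated the patient in the clinic",
    "the nurse helped the patient at the clinic",
    "the driver parked the car in the garage",
]
vecs = cooccurrence_vectors(corpus)
# Words sharing contexts ("doctor"/"nurse") come out more related
# than words that do not ("doctor"/"car").
```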
For topic segmentation, different methods using some form of lexical cohesion
are present in the literature. While some of these confine the relationships only
to word repetition or strong semantic relationships such as synonymy, no existing
work uses semantic relatedness measures that can be calculated for any two word
pairs in the vocabulary. Our experiments suggest that topic segmentation performance
improves over methods that use only classical relationships and word repetition.
Furthermore, the experiments compare the performance of different semantic relatedness
methods in a high-level task. The detected topic segments are used in summarization, achieving better results compared to a lexical-chains-based
method that uses WordNet.
Finally, the use of lexical cohesion analysis in keyphrase extraction is investigated.
Previous research shows that keyphrases are useful tools in document
retrieval and navigation. While these point to a relation between keyphrases and
document retrieval performance, no existing work uses this relationship to identify
keyphrases of a given document. We aim to establish a link between the problems
of query performance prediction (QPP) and keyphrase extraction. To this end,
features used in QPP are evaluated in keyphrase extraction using a Naive Bayes
classifier. Our experiments indicate that these features improve the effectiveness
of keyphrase extraction in documents of different length. More importantly,
commonly used features of frequency and first position in text perform poorly
on shorter documents, whereas QPP features are more robust and achieve better
results.
Ercan, Gönenç, Ph.D.
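The link drawn above between query performance prediction (QPP) and keyphrase extraction can be illustrated with one classic pre-retrieval QPP feature: the average inverse document frequency (IDF) of a phrase's terms. Candidates whose terms are rare across the collection are predicted to be more discriminative queries, and hence better keyphrase candidates. The corpus and candidate phrases below are toy assumptions; the thesis combines several such features in a Naive Bayes classifier:

```python
# Scoring keyphrase candidates with a QPP-style average-IDF feature.
from math import log

def avg_idf(phrase, documents):
    n = len(documents)
    total = 0.0
    for term in phrase.split():
        df = sum(1 for doc in documents if term in doc.split())
        total += log(n / (1 + df))  # smoothed IDF
    return total / len(phrase.split())

docs = [
    "topic segmentation with lexical cohesion",
    "keyphrase extraction from scientific text",
    "text summarization with topic models",
    "query performance prediction for retrieval",
]
candidates = ["lexical cohesion", "text", "keyphrase extraction"]
ranked = sorted(candidates, key=lambda p: avg_idf(p, docs), reverse=True)
# Rare, specific phrases rank above common terms such as "text".
```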
Macro-micro approach for mining public sociopolitical opinion from social media
During the past decade, we have witnessed the emergence of social media, which has prominence as a means for the general public to exchange opinions towards a broad range of topics. Furthermore, its social and temporal dimensions make it a rich resource for policy makers and organisations to understand public opinion. In this thesis, we present our research in understanding public opinion on Twitter along three dimensions: sentiment, topics and summary.
In the first line of our work, we study how to classify public sentiment on Twitter. We focus on the task of multi-target-specific sentiment recognition on Twitter, and propose an approach which utilises the syntactic information from the parse tree in conjunction with the left-right context of the target. We show state-of-the-art performance on two datasets, including a multi-target Twitter corpus on UK elections which we make publicly available for the research community. Additionally, we conduct two preliminary studies: cross-domain emotion classification on discourse around arts and cultural experiences, and social spam detection to improve the signal-to-noise ratio of our sentiment corpus.
Our second line of work focuses on automatic topical clustering of tweets. Our aim is to group tweets into a number of clusters, with each cluster representing a meaningful topic, story, event or a reason behind a particular choice of sentiment. We explore various ways of tackling this challenge and propose a two-stage hierarchical topic modelling system that is efficient and effective in achieving our goal.
Lastly, for our third line of work, we study the task of summarising tweets on common topics, with the goal of providing informative summaries of real-world events/stories or explanations underlying the sentiment expressed towards an issue/entity. As most existing tweet summarisation approaches rely on extractive methods, we propose to apply a state-of-the-art neural abstractive summarisation model to tweets. We also tackle the challenge of cross-medium supervised summarisation with no target-medium training resources. To the best of our knowledge, there is no existing work studying neural abstractive summarisation on tweets. In addition, we present a system for providing interactive visualisation of topic-entity sentiments and the corresponding summaries in chronological order.
Throughout the work presented in this thesis, we conduct experiments to evaluate and verify the effectiveness of our proposed models, comparing them to relevant baseline methods. Most of our evaluations are quantitative; however, we do perform qualitative analyses where appropriate. This thesis provides insights and findings that can be used to better understand public opinion in social media.
Context-Aware Message-Level Rumour Detection with Weak Supervision
Social media has become the main source of all sorts of information, beyond being a communication medium. Its intrinsic nature allows a continuous and massive flow of misinformation to make a severe impact worldwide. In particular, rumours emerge unexpectedly and spread quickly, and it is challenging to track down their origins and stop their propagation. One of the most promising solutions is to identify rumour-mongering messages as early as possible, which is commonly referred to as "Early Rumour Detection (ERD)". This dissertation focuses on researching ERD on social media by exploiting weak supervision and contextual information. Weak supervision is a branch of ML where noisy and less precise sources (e.g. data patterns) are leveraged in place of limited high-quality labelled data (Ratner et al., 2017). This is intended to reduce the cost and increase the efficiency of the hand-labelling of large-scale data. This thesis aims to study whether identifying rumours before they go viral is possible, and to develop an architecture for ERD at the individual post level. To this end, it first explores major bottlenecks of current ERD. It also uncovers a research gap between system design and its applications in the real world, which has received less attention from the ERD research community. One bottleneck is limited labelled data; weakly supervised methods to augment limited labelled training data for ERD are introduced. The other bottleneck is enormous amounts of noisy data; a framework unifying burst detection based on temporal signals and burst summarisation is investigated to identify potential rumours (i.e. input to rumour detection models) by filtering out uninformative messages. Finally, a novel method which jointly learns rumour sources and their contexts (i.e. conversational threads) for ERD is proposed. An extensive evaluation setting for ERD systems is also introduced.
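The weak-supervision idea referenced above (Ratner et al., 2017) can be sketched with a few noisy labelling functions that vote on unlabelled posts; the aggregated vote becomes a training label. The keyword patterns and example posts are invented for illustration, not taken from the thesis:

```python
# Toy weak supervision: keyword labelling functions plus majority vote.
ABSTAIN, RUMOUR, NON_RUMOUR = -1, 1, 0

def lf_hedging(post):
    # Hedging phrases often mark unverified claims.
    cues = ("unconfirmed", "allegedly", "reportedly")
    return RUMOUR if any(c in post.lower() for c in cues) else ABSTAIN

def lf_questioning(post):
    # Users questioning veracity is a rumour signal.
    return RUMOUR if "is this true" in post.lower() else ABSTAIN

def lf_official(post):
    # Official sourcing suggests a non-rumour.
    return NON_RUMOUR if "official statement" in post.lower() else ABSTAIN

def weak_label(post, lfs=(lf_hedging, lf_questioning, lf_official)):
    votes = [lf(post) for lf in lfs if lf(post) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)  # majority vote

weak_label("Reportedly the bridge has collapsed, is this true?")
```

In practice a label model would weight the functions by their estimated accuracies rather than taking a simple majority vote.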
Representation Learning for Texts and Graphs: A Unified Perspective on Efficiency, Multimodality, and Adaptability
[...] This thesis is situated between natural language processing and graph representation learning and investigates selected connections. First, we introduce matrix embeddings as an efficient text representation sensitive to word order. [...] Experiments with ten linguistic probing tasks, 11 supervised, and five unsupervised downstream tasks reveal that vector and matrix embeddings have complementary strengths and that a jointly trained hybrid model outperforms both. Second, a popular pretrained language model, BERT, is distilled into matrix embeddings. [...] The results on the GLUE benchmark show that these models are competitive with other recent contextualized language models while being more efficient in time and space. Third, we compare three model types for text classification: bag-of-words, sequence-, and graph-based models. Experiments on five datasets show that, surprisingly, a wide multilayer perceptron on top of a bag-of-words representation is competitive with recent graph-based approaches, questioning the necessity of graphs synthesized from the text. [...] Fourth, we investigate the connection between text and graph data in document-based recommender systems for citations and subject labels. Experiments on six datasets show that the title as side information improves the performance of autoencoder models. [...] We find that the meaning of item co-occurrence is crucial for the choice of input modalities and an appropriate model. Fifth, we introduce a generic framework for lifelong learning on evolving graphs in which new nodes, edges, and classes appear over time. [...] The results show that by reusing previous parameters in incremental training, it is possible to employ smaller history sizes with only a slight decrease in accuracy compared to training with complete history. Moreover, weighting the binary cross-entropy loss function is crucial to mitigate the problem of class imbalance when detecting newly emerging classes. [...]
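The order sensitivity of matrix embeddings mentioned above can be demonstrated with a minimal sketch: each word is a small matrix, and a text is composed by matrix multiplication, unlike an order-insensitive sum of word vectors. The 2x2 matrices are hand-picked toy values (the thesis learns such matrices from data); this only illustrates the property:

```python
# Matrix embeddings: composition by matrix product is order-sensitive.
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

vocab = {
    "dog":   [[1.0, 0.2], [0.0, 1.0]],
    "bites": [[1.0, 0.0], [0.3, 1.0]],
    "man":   [[0.9, 0.1], [0.1, 1.1]],
}

def matrix_embedding(tokens):
    out = [[1.0, 0.0], [0.0, 1.0]]  # identity
    for t in tokens:
        out = matmul(out, vocab[t])  # AB != BA in general
    return out

def summed_embedding(tokens):
    # Order-insensitive baseline: elementwise sum of the word matrices.
    return [[sum(vocab[t][i][j] for t in tokens) for j in range(2)]
            for i in range(2)]

a = matrix_embedding(["dog", "bites", "man"])
b = matrix_embedding(["man", "bites", "dog"])
# a and b differ, while the summed baseline is identical for both orders.
```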