    A Proficient Method For High Eminence And Cohesive Relevant Phrase Mining

    A sentence is an integral unit of semantic nature, context and significance. Visualizing sentences for each topic is an important way to investigate and interpret unstructured corporate texts in subject modeling. Usually the term mining method is twofold: mining phrases and modeling themes. Current methods suffer from order-sensitive and improper segmentation problems in phrase mining, which often lead to phrases of low content. The limitations of sentences, which may undermine continuity, are not entirely taken into account by standard topic models. In addition, current methods are frequently subject to domain terminology loss, as the effect of topical domain dissemination is disregarded. We suggest an effective approach for high-quality and coherent topical sentence mining in this article. A high-quality sentence must meet the requirements for frequency, phrasing, integrity and suitability. In order to increase both phrase consistency and topical cohesion, we combine a quality-assured phrase mining process, a novel subject model that incorporates phrasing restrictions, and a novel text clustering method into an iterative system. Algorithm designs to perform these methods efficiently are also defined.

    Cooperative Semantic Information Processing for Literature-Based Biomedical Knowledge Discovery

    Given that data is increasing exponentially every day, extracting and understanding the information, themes and relationships from large collections of documents is more and more important to researchers in many areas. In this paper, we present a cooperative semantic information processing system to help biomedical researchers understand and discover knowledge in large numbers of titles and abstracts from PubMed query results. Our system is based on a prevalent technique, topic modeling, which is an unsupervised machine learning approach for discovering the set of semantic themes in a large set of documents. In addition, we apply a natural language processing technique to transform the “bag-of-words” assumption of topic models into a “bag-of-important-phrases” assumption and build an interactive visualization tool using a modified, open-source Topic Browser. Finally, we conduct two experiments to evaluate the approach. The first evaluates whether the “bag-of-important-phrases” approach is better at identifying semantic themes than the standard “bag-of-words” approach. This is an empirical study in which human subjects evaluate the quality of the resulting topics using a standard “word intrusion test” to determine whether subjects can identify a word (or phrase) that does not belong in the topic. The second is a qualitative empirical study to evaluate how well the system helps biomedical researchers explore a set of documents to discover previously hidden semantic themes and connections. The methodology for this study has been successfully used to evaluate other knowledge-discovery tools in biomedicine.

    Macro-micro approach for mining public sociopolitical opinion from social media

    During the past decade, we have witnessed the emergence of social media, which has gained prominence as a means for the general public to exchange opinions on a broad range of topics. Furthermore, its social and temporal dimensions make it a rich resource for policy makers and organisations to understand public opinion. In this thesis, we present our research in understanding public opinion on Twitter along three dimensions: sentiment, topics and summary. In the first line of our work, we study how to classify public sentiment on Twitter. We focus on the task of multi-target-specific sentiment recognition on Twitter, and propose an approach which utilises the syntactic information from the parse tree in conjunction with the left-right context of the target. We show state-of-the-art performance on two datasets, including a multi-target Twitter corpus on UK elections which we make publicly available for the research community. Additionally, we conduct two preliminary studies: cross-domain emotion classification on discourse around arts and cultural experiences, and social spam detection to improve the signal-to-noise ratio of our sentiment corpus. Our second line of work focuses on automatic topical clustering of tweets. Our aim is to group tweets into a number of clusters, with each cluster representing a meaningful topic, story, event or a reason behind a particular choice of sentiment. We explore various ways of tackling this challenge and propose a two-stage hierarchical topic modelling system that is efficient and effective in achieving our goal. Lastly, for our third line of work, we study the task of summarising tweets on common topics, with the goal of providing informative summaries for real-world events/stories or explanations underlying the sentiment expressed towards an issue/entity. As most existing tweet summarisation approaches rely on extractive methods, we propose to apply a state-of-the-art neural abstractive summarisation model to tweets. We also tackle the challenge of cross-medium supervised summarisation with no target-medium training resources. To the best of our knowledge, there is no existing work studying neural abstractive summarisation on tweets. In addition, we present a system for providing interactive visualisation of topic-entity sentiments and the corresponding summaries in chronological order. Throughout the work presented in this thesis, we conduct experiments to evaluate and verify the effectiveness of our proposed models, comparing them to relevant baseline methods. Most of our evaluations are quantitative; however, we do perform qualitative analyses where appropriate. This thesis provides insights and findings that can be used for better understanding public opinion in social media.

    HAMNER: Headword Amplified Multi-span Distantly Supervised Method for Domain Specific Named Entity Recognition

    To tackle Named Entity Recognition (NER) tasks, supervised methods need sufficient cleanly annotated data, which is labor- and time-consuming to obtain. In contrast, distantly supervised methods acquire automatically annotated data using dictionaries to alleviate this requirement. Unfortunately, dictionaries hinder the effectiveness of distantly supervised methods for NER due to their limited coverage, especially in specific domains. In this paper, we address the limitations of dictionary usage and mention boundary detection. We generalize the distant supervision by extending the dictionary with headword-based non-exact matching. We apply a function to better weight the matched entity mentions. We propose a span-level model, which classifies all the possible spans and then infers the selected spans with a proposed dynamic programming algorithm. Experiments on all three benchmark datasets demonstrate that our method outperforms previous state-of-the-art distantly supervised methods. Comment: 9 pages, 2 figures
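
The span-selection step mentioned in the abstract above can be illustrated with a generic dynamic program. This is only a sketch under stated assumptions: the abstract does not specify the paper's algorithm, so the code below uses a standard weighted-interval-style recurrence that picks non-overlapping spans with the highest total score. The function name `select_spans`, the score dictionary, and the example scores are all hypothetical.

```python
def select_spans(n_tokens, span_scores):
    """Pick non-overlapping spans with maximum total score.

    span_scores maps (start, end) -> score, with end exclusive and
    0 <= start < end <= n_tokens. Scores stand in for classifier
    confidences (an assumption, not the paper's weighting function).
    """
    # Group candidate spans by their end position.
    by_end = {}
    for (start, end), score in span_scores.items():
        by_end.setdefault(end, []).append((start, score))

    # best[i] = best total score using only tokens [0, i).
    best = [0.0] * (n_tokens + 1)
    # choice[i] = start of the span ending at i in the best solution, if any.
    choice = [None] * (n_tokens + 1)

    for i in range(1, n_tokens + 1):
        best[i] = best[i - 1]  # token i-1 is left outside every span
        for start, score in by_end.get(i, []):
            candidate = best[start] + score
            if candidate > best[i]:
                best[i] = candidate
                choice[i] = start

    # Walk backwards to recover the chosen spans.
    spans = []
    i = n_tokens
    while i > 0:
        if choice[i] is None:
            i -= 1
        else:
            spans.append((choice[i], i))
            i = choice[i]
    spans.reverse()
    return best[n_tokens], spans


# Hypothetical scores for "acute myeloid leukemia in adults".
example_scores = {
    (0, 3): 2.5,  # "acute myeloid leukemia"
    (1, 3): 1.8,  # "myeloid leukemia"
    (2, 3): 1.0,  # "leukemia"
    (0, 1): 0.4,  # "acute"
    (4, 5): 0.3,  # "adults"
}
total, chosen = select_spans(5, example_scores)
print(total, chosen)
```

With these made-up scores the long span (0, 3) wins over the shorter overlapping candidates, and (4, 5) is kept because it does not overlap anything.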

    Constructing and modeling text-rich information networks: a phrase mining-based approach

    A lot of digital ink has been spilled on "big data" over the past few years, which is often characterized by an explosion of information. Most of this surge owes its origin to unstructured data in the wild, such as words, images and video, as compared to the structured information stored in fielded form in databases. The proliferation of text-heavy data is particularly overwhelming, reflected in everyone's daily life in the form of web documents, business reviews, news, social posts, etc. In the meantime, textual data and structured entities often come intertwined, such as authors/posters, document categories and tags, and document-associated geo-locations. Against this background, a core research challenge presents itself: how to turn massive, (semi-)unstructured data into structured knowledge. One promising paradigm studied in this dissertation is to integrate structured and unstructured data, constructing an organized heterogeneous information network, and developing powerful modeling mechanisms on such an organized network. We name it a text-rich information network, since it is an integrated representation of both structured and unstructured textual data. To thoroughly develop the construction and modeling paradigm, this dissertation will focus on forming a scalable data-driven framework and propose a new line of techniques relying on the idea of phrase mining to bridge textual documents and structured entities. We will first introduce the phrase mining method named SegPhrase+ to globally discover semantically meaningful phrases from massive textual data, providing a high-quality dictionary for text structuralization. Clearly distinct from previous works that mostly focused on raw statistics of string matching, SegPhrase+ looks into the phrase context and effectively rectifies raw statistics to significantly boost the performance. Next, a novel algorithm based on latent keyphrases is developed and adopted to largely eliminate irregularities in massive text by providing a consistent and interpretable document representation. As a critical process in constructing the network, it uses the quality phrases generated in the previous step as candidates. From them, a set of keyphrases is extracted to represent a particular document with inferred strength through a statistical model. After this step, documents become more structured and are consistently represented in the form of a bipartite network connecting documents with quality keyphrases. A more heterogeneous text-rich information network can be constructed by incorporating different types of document-associated entities as additional nodes. Lastly, a general and scalable framework, Tensor2vec, is to be added to the traditional data mining mechanism, as the latter cannot readily solve the problem when the organized heterogeneous network has nodes of different types. Tensor2vec is expected to elegantly handle relevance search, entity classification, summarization and recommendation problems by making use of higher-order link information and projecting multi-typed nodes into a shared low-dimensional vectorial space such that node proximity can be easily computed and accurately predicted.

    Unsupervised keyword extraction from microblog posts via hashtags

    © River Publishers. Nowadays, huge amounts of text are being generated for social networking purposes on the Web. Keyword extraction from such texts, like microblog posts, benefits many applications such as advertising, search, and content filtering. Unlike traditional web pages, a microblog post usually has some special social feature like a hashtag that is topical in nature and generated by users. Extracting keywords related to hashtags can reflect the intents of users and thus provides us with a better understanding of post content. In this paper, we propose a novel unsupervised keyword extraction approach for microblog posts by treating hashtags as topical indicators. Our approach consists of two hashtag-enhanced algorithms. One is a topic model algorithm that infers topic distributions biased to hashtags on a collection of microblog posts. The words are ranked by their average topic probabilities. Our topic model algorithm can not only find the topics of a collection, but also extract hashtag-related keywords. The other is a random-walk-based algorithm. It first builds a word-post weighted graph by taking into account the posts themselves. Then, a hashtag-biased random walk is applied on this graph, which guides the algorithm to extract keywords according to hashtag topics. Last, the final ranking score of a word is determined by the stationary probability after a number of iterations. We evaluate our proposed approach on a collection of real Chinese microblog posts. Experiments show that our approach is more effective in terms of precision than traditional approaches that do not consider hashtags. The combination of the two algorithms performs even better than each individual algorithm.
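
The random-walk idea in the abstract above can be sketched as a random walk with restart on a bipartite word-post graph, where the walker jumps back to the hashtag node with some probability. This is a minimal illustration and not the paper's algorithm: the unit edge weights, the damping value 0.85, the fixed iteration count, the toy posts, and the function names are all assumptions made for the example.

```python
def build_word_post_graph(posts):
    """Build an undirected bipartite graph linking each post to its words.

    posts maps post_id -> list of tokens. Every edge gets weight 1.0
    (an assumption; the abstract only says the graph is weighted).
    """
    graph = {}
    for post_id, words in posts.items():
        for word in words:
            graph.setdefault(("post", post_id), {})[("word", word)] = 1.0
            graph.setdefault(("word", word), {})[("post", post_id)] = 1.0
    return graph


def biased_random_walk(graph, restart, damping=0.85, iterations=50):
    """Power iteration for a random walk that restarts at `restart` nodes.

    graph:   {node: {neighbor: weight}}, every node has at least one edge
    restart: {node: probability}, summing to 1
    """
    scores = {node: 1.0 / len(graph) for node in graph}
    for _ in range(iterations):
        # Restart mass goes to the hashtag-biased distribution.
        updated = {n: (1.0 - damping) * restart.get(n, 0.0) for n in graph}
        # The remaining mass follows edges in proportion to their weights.
        for node, edges in graph.items():
            total = sum(edges.values())
            for neighbor, weight in edges.items():
                updated[neighbor] += damping * scores[node] * weight / total
        scores = updated
    return scores


# Toy posts (hypothetical data).
posts = {
    "p1": ["#election", "vote", "poll"],
    "p2": ["#election", "vote"],
    "p3": ["weather", "poll"],
}
graph = build_word_post_graph(posts)
scores = biased_random_walk(graph, {("word", "#election"): 1.0})

# Rank ordinary words (not hashtags, not posts) by their final score.
keywords = sorted(
    (name for kind, name in graph if kind == "word" and not name.startswith("#")),
    key=lambda name: scores[("word", name)],
    reverse=True,
)
print(keywords)
```

In this toy graph, words that share posts with the hashtag accumulate more probability mass than words that are only reachable through unrelated posts, which is the effect the hashtag bias is meant to produce.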

    Topic Modeling for Automatic Analysis of Natural Language: A Case Study in an Italian Customer Support Center

    This paper focuses on the automatic analysis of conversation transcriptions in the call center of a customer care service. The goal is to recognize topics related to problems and complaints discussed in several dialogues between customers and agents. Our study aims to implement a framework able to automatically cluster conversation transcriptions into cohesive and well-separated groups based on the content of the data. The framework can relieve the analyst of selecting proper values for the analysis and the clustering processes. To pursue this goal, we consider a probabilistic model based on latent Dirichlet allocation, which associates transcriptions with a mixture of topics in different proportions. A case study consisting of transcriptions in Italian, collected in a customer support center of an energy supplier, is considered in the paper. A performance comparison of different inference techniques is discussed using the case study. The experimental results demonstrate the approach’s efficacy in clustering Italian conversation transcriptions. It also results in a practical tool to simplify the analytic process and off-load the parameter tuning from the end-user. According to recent works in the literature, this paper may be valuable for introducing latent Dirichlet allocation approaches in topic modeling for the Italian language.

    Automatic extraction of conceptual labels from topic models

    This work outlines a novel system that automatically extracts conceptual labels for statistically obtained topics. By projecting the topic, which is a distribution over all the vocabulary words, onto the WordNet ontology, we succeed in associating concepts with these groups of words. The most important contributions of this paper are the validation of the role of these concepts as topical labels and the determination of correlations that emerge between the utility of these labels and the strength of the relation between the concepts and the topics.