51,830 research outputs found

    A cross-collection mixture model for comparative text mining

    Full text link
    In this paper, we define and study a novel text mining prob-lem, which we refer to as comparative text mining. Given a set of comparable text collections, the task of comparative text mining is to discover any latent common themes across all collections as well as summarize the similarity and differ-ences of these collections along each common theme. This general problem subsumes many interesting applications, in-cluding business intelligence, summarizing reviews of similar products, and comparing different opinions about a common topic. We propose a generative probabilistic mixture model for comparative text mining. The model simultaneously per-forms cross-collection clustering and within-collection clus-tering, and can be applied to an arbitrary set of compara-ble text collections. The model can be estimated efficiently using the Expectation-Maximization (EM) algorithm. We evaluate the model on two different text data sets (i.e., a news article data set and a laptop review data set), and compare it with a baseline clustering method also based on a mixture model. Experiment results show that the model is quite effective in discovering the latent common themes across collections and performs significantly better than our baseline mixture model. 1

    A Semantic Graph-Based Approach for Mining Common Topics From Multiple Asynchronous Text Streams

    Get PDF
    In the age of Web 2.0, a substantial amount of unstructured content are distributed through multiple text streams in an asynchronous fashion, which makes it increasingly difficult to glean and distill useful information. An effective way to explore the information in text streams is topic modelling, which can further facilitate other applications such as search, information browsing, and pattern mining. In this paper, we propose a semantic graph based topic modelling approach for structuring asynchronous text streams. Our model in- tegrates topic mining and time synchronization, two core modules for addressing the problem, into a unified model. Specifically, for handling the lexical gap issues, we use global semantic graphs of each timestamp for capturing the hid- den interaction among entities from all the text streams. For dealing with the sources asynchronism problem, local semantic graphs are employed to discover similar topics of different entities that can be potentially separated by time gaps. Our experiment on two real-world datasets shows that the proposed model significantly outperforms the existing ones

    Comprehensive Review of Opinion Summarization

    Get PDF
    The abundance of opinions on the web has kindled the study of opinion summarization over the last few years. People have introduced various techniques and paradigms to solving this special task. This survey attempts to systematically investigate the different techniques and approaches used in opinion summarization. We provide a multi-perspective classification of the approaches used and highlight some of the key weaknesses of these approaches. This survey also covers evaluation techniques and data sets used in studying the opinion summarization problem. Finally, we provide insights into some of the challenges that are left to be addressed as this will help set the trend for future research in this area.unpublishednot peer reviewe

    Cross-Domain Labeled LDA for Cross-Domain Text Classification

    Full text link
    Cross-domain text classification aims at building a classifier for a target domain which leverages data from both source and target domain. One promising idea is to minimize the feature distribution differences of the two domains. Most existing studies explicitly minimize such differences by an exact alignment mechanism (aligning features by one-to-one feature alignment, projection matrix etc.). Such exact alignment, however, will restrict models' learning ability and will further impair models' performance on classification tasks when the semantic distributions of different domains are very different. To address this problem, we propose a novel group alignment which aligns the semantics at group level. In addition, to help the model learn better semantic groups and semantics within these groups, we also propose a partial supervision for model's learning in source domain. To this end, we embed the group alignment and a partial supervision into a cross-domain topic model, and propose a Cross-Domain Labeled LDA (CDL-LDA). On the standard 20Newsgroup and Reuters dataset, extensive quantitative (classification, perplexity etc.) and qualitative (topic detection) experiments are conducted to show the effectiveness of the proposed group alignment and partial supervision.Comment: ICDM 201

    A Framework for Comparing Groups of Documents

    Full text link
    We present a general framework for comparing multiple groups of documents. A bipartite graph model is proposed where document groups are represented as one node set and the comparison criteria are represented as the other node set. Using this model, we present basic algorithms to extract insights into similarities and differences among the document groups. Finally, we demonstrate the versatility of our framework through an analysis of NSF funding programs for basic research.Comment: 6 pages; 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP '15
    • …
    corecore