6 research outputs found

    Qlusty: Quick and dirty generation of event videos from written media coverage

    Qlusty automatically generates videos describing the coverage of the same event by different news outlets. Through four modules, it identifies events, de-duplicates articles, ranks them according to coverage, and queries for images to generate an overview video. In this manuscript we present our preliminary models, including quantitative evaluations of the former two modules and a qualitative analysis of the latter two. The results show the potential for achieving our main aim: contributing to breaking the information bubble, so common in the current news landscape.
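The four-module pipeline described above can be sketched as a chain of simple functions. All function bodies below are illustrative stand-ins for the authors' actual models (event identification, de-duplication, and ranking here are naive placeholders):

```python
# Minimal sketch of a four-stage pipeline in the spirit of Qlusty:
# identify events, de-duplicate, rank by coverage. (The image-query
# stage is omitted.) Every heuristic here is an assumption.

def deduplicate(articles):
    """Drop articles with identical titles (a crude stand-in)."""
    seen, unique = set(), []
    for a in articles:
        if a["title"] not in seen:
            seen.add(a["title"])
            unique.append(a)
    return unique

def identify_events(articles):
    """Group articles by a naive event key (here: first keyword)."""
    events = {}
    for a in articles:
        key = a["keywords"][0] if a["keywords"] else "misc"
        events.setdefault(key, []).append(a)
    return events

def rank_by_coverage(events):
    """Order events by how many distinct articles covered them."""
    return sorted(events.items(), key=lambda kv: -len(kv[1]))

articles = [
    {"title": "Flood hits city", "keywords": ["flood"]},
    {"title": "Flood hits city", "keywords": ["flood"]},  # duplicate
    {"title": "Rivers overflow downtown", "keywords": ["flood"]},
    {"title": "Election results in", "keywords": ["election"]},
]

ranking = rank_by_coverage(identify_events(deduplicate(articles)))
print([(k, len(v)) for k, v in ranking])  # [('flood', 2), ('election', 1)]
```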

    SOTXTSTREAM: Density-based self-organizing clustering of text streams

    A streaming data clustering algorithm is presented, building upon the density-based self-organizing stream clustering algorithm SOSTREAM. Many density-based clustering algorithms are limited by their inability to identify clusters with heterogeneous density. SOSTREAM addresses this limitation through the use of local (nearest neighbor-based) density determinations. Additionally, many stream clustering algorithms use a two-phase clustering approach. In the first phase, a micro-clustering solution is maintained online, while in the second phase, the micro-clustering solution is clustered offline to produce a macro solution. By performing self-organization techniques on micro-clusters in the online phase, SOSTREAM is able to maintain a macro clustering solution in a single phase. Leveraging concepts from SOSTREAM, a new density-based self-organizing text stream clustering algorithm, SOTXTSTREAM, is presented that addresses several shortcomings of SOSTREAM. Gains in clustering performance of this new algorithm are demonstrated on several real-world text stream datasets.
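The local, nearest-neighbor-based density idea that lets SOSTREAM handle clusters of heterogeneous density can be sketched as follows. This is an illustrative fragment under assumed definitions, not the published algorithm: a point's density is judged relative to its k nearest neighbors instead of a single global radius, so a tight cluster and a loose cluster are each scored in their own context.

```python
# Sketch of local (k-NN based) density estimation. A global radius
# would treat the tight and loose clusters below inconsistently;
# local density scores each point relative to its own neighborhood.
import math

def knn_distances(points, p, k):
    """Distances from p to its k nearest neighbors (excluding p)."""
    d = sorted(math.dist(p, q) for q in points if q != p)
    return d[:k]

def local_density(points, p, k):
    """Inverse of the mean k-NN distance: higher = denser neighborhood."""
    d = knn_distances(points, p, k)
    return 1.0 / (sum(d) / len(d))

tight = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]      # dense cluster
loose = [(10.0, 10.0), (12.0, 10.0), (10.0, 12.0)]  # sparse cluster
pts = tight + loose

# A point in the tight cluster has much higher local density than
# one in the loose cluster, even though both belong to valid clusters.
print(local_density(pts, (0.0, 0.0), 2) > local_density(pts, (10.0, 10.0), 2))  # True
```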

    Exploratory analysis of textual data streams

    In this paper, we address exploratory analysis of textual data streams and we propose a bootstrapping process based on a combination of keyword similarity and clustering techniques to: (i) classify documents into fine-grained similarity clusters, based on keyword commonalities; (ii) aggregate similar clusters into larger document collections sharing a richer, more user-prominent keyword set that we call a topic; (iii) assimilate the newly extracted topics of the current bootstrapping cycle with existing topics resulting from previous bootstrapping cycles, by linking similar topics of different time periods, if any, to highlight topic trends and evolution. An analysis framework is also defined, enabling topic-based exploration of the underlying textual data stream according to a thematic perspective and a temporal perspective. The bootstrapping process is evaluated on a real data stream of about 330,000 newspaper articles about politics published by the New York Times from January 1, 1900 to December 31, 2015.
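Step (i) above, grouping documents into fine-grained clusters by keyword commonality, can be sketched with Jaccard similarity over keyword sets. The threshold and the greedy assignment are assumptions for illustration; the paper's bootstrapping process is considerably richer:

```python
# Minimal sketch of keyword-commonality clustering: each document
# joins the first cluster whose keyword set overlaps enough
# (Jaccard >= threshold); otherwise it starts a new cluster.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_by_keywords(docs, threshold=0.5):
    clusters = []  # each: {"keywords": set, "docs": [...]}
    for doc in docs:
        kw = set(doc["keywords"])
        for c in clusters:
            if jaccard(kw, c["keywords"]) >= threshold:
                c["docs"].append(doc)
                c["keywords"] |= kw  # enrich the shared keyword set
                break
        else:
            clusters.append({"keywords": set(kw), "docs": [doc]})
    return clusters

docs = [
    {"id": 1, "keywords": ["election", "senate", "vote"]},
    {"id": 2, "keywords": ["election", "senate", "ballot"]},
    {"id": 3, "keywords": ["storm", "flood"]},
]
clusters = cluster_by_keywords(docs)
print(len(clusters))  # 2: one political cluster, one weather cluster
```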

    Text mining with word embedding for outlier and sentiment analysis

    Technology today makes it unprecedentedly easy to collect and store massive text data in various domains such as online social networks, medical records and news reports. In contrast to the gigantic volume of text data, human capacity to read and process text data is limited. Hence, there is an emerging demand for automatic text mining tools to analyze massive text data. Word embedding is an emerging text analysis technique that leverages the fine-grained statistics of context information to map each word to a vector in an embedding space that reflects the semantic proximity between words. Embedding techniques not only enrich the statistical signals available to downstream text mining applications, but also make it possible to characterize and represent higher-level objects in the embedding space, such as sentences, documents or topics. This study integrates word embedding techniques into a series of text mining approaches and models. The general idea is to treat a text object such as a document or a sentence as a bag of embedding vectors and characterize their distribution in the embedding space. Specifically, this study focuses on two tasks: outlier analysis and weakly-supervised sentiment analysis. Outlier analysis aims to identify documents that topically deviate from the majority of a given corpus. We develop an unsupervised generative model to identify frequent and representative semantic regions in the word embedding space to represent the given corpus. Then we propose a novel outlierness measure to identify outlier documents. We also study the cost-sensitive scenario of outlier analysis. Sentiment analysis typically identifies the subjective opinion (e.g., positive vs. negative) in a piece of text. Despite being extensively studied as a supervised learning task, we tackle the problem in a weakly-supervised fashion, where users only provide a small set of seed words as guidance. We study how to identify aspects and the corresponding sentiments at both the document and sentence levels.
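The "bag of embedding vectors" idea above can be illustrated with a toy example: represent each document by the mean of its word vectors, then score outlierness as the cosine distance from the corpus centroid. The tiny hand-made embeddings and this particular outlierness measure are assumptions for illustration; the study's generative model is more sophisticated.

```python
# Toy sketch: documents as mean word-embedding vectors, with
# outlierness scored as cosine distance from the corpus centroid.
import math

def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return 1.0 - dot / (nu * nv)

# Hand-made 2-d "embeddings": sports words vs. finance words.
emb = {
    "goal": [1.0, 0.1], "match": [0.9, 0.2], "team": [1.0, 0.0],
    "stock": [0.1, 1.0], "bond": [0.0, 0.9],
}
docs = {
    "d1": ["goal", "match", "team"],
    "d2": ["match", "team", "goal"],
    "d3": ["stock", "bond"],        # topical outlier in this corpus
}

doc_vecs = {d: mean_vector([emb[w] for w in ws]) for d, ws in docs.items()}
centroid = mean_vector(list(doc_vecs.values()))
scores = {d: cosine_distance(v, centroid) for d, v in doc_vecs.items()}
print(max(scores, key=scores.get))  # d3
```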