63,969 research outputs found

    Towards subjectifying text clustering

    Full text link
    Although it is common practice to produce only a single clustering of a dataset, in many cases text documents can be clustered along different dimensions. Unfortunately, not only do traditional text clustering algorithms fail to produce multiple clusterings of a dataset, the only clustering they produce may not be the one that the user desires. In this paper, we propose a simple active clustering algorithm that is capable of producing multiple clusterings of the same data according to user interest. In comparison to previous work on feedback-oriented clustering, the amount of user feedback required by our algorithm is minimal. In fact, the feedback turns out to be as simple as a cursory look at a list of words. Experimental results are very promising: our system is able to generate clusterings along the user-specified dimensions with reasonable accuracies on several challenging text clas-sification tasks, thus providing suggestive evidence that our approach is viable

    ImageSieve: Exploratory search of museum archives with named entity-based faceted browsing

    Get PDF
    Over the last few years, faceted search emerged as an attractive alternative to the traditional "text box" search and has become one of the standard ways of interaction on many e-commerce sites. However, these applications of faceted search are limited to domains where the objects of interests have already been classified along several independent dimensions, such as price, year, or brand. While automatic approaches to generate faceted search interfaces were proposed, it is not yet clear to what extent the automatically-produced interfaces will be useful to real users, and whether their quality can match or surpass their manually-produced predecessors. The goal of this paper is to introduce an exploratory search interface called ImageSieve, which shares many features with traditional faceted browsing, but can function without the use of traditional faceted metadata. ImageSieve uses automatically extracted and classified named entities, which play important roles in many domains (such as news collections, image archives, etc.). We describe one specific application of ImageSieve for image search. Here, named entities extracted from the descriptions of the retrieved images are used to organize a faceted browsing interface, which then helps users to make sense of and further explore the retrieved images. The results of a user study of ImageSieve demonstrate that a faceted search system based on named entities can help users explore large collections and find relevant information more effectively

    XML documents clustering using a tensor space model

    Get PDF
    The traditional Vector Space Model (VSM) is not able to represent both the structure and the content of XML documents. This paper introduces a novel method of representing XML documents in a Tensor Space Model (TSM) and then utilizing it for clustering. Empirical analysis shows that the proposed method is scalable for large-sized datasets; as well, the factorized matrices produced from the proposed method help to improve the quality of clusters through the enriched document representation of both structure and content information

    Self-adaptive GA, quantitative semantic similarity measures and ontology-based text clustering

    Get PDF
    As the common clustering algorithms use vector space model (VSM) to represent document, the conceptual relationships between related terms which do not co-occur literally are ignored. A genetic algorithm-based clustering technique, named GA clustering, in conjunction with ontology is proposed in this article to overcome this problem. In general, the ontology measures can be partitioned into two categories: thesaurus-based methods and corpus-based methods. We take advantage of the hierarchical structure and the broad coverage taxonomy of Wordnet as the thesaurus-based ontology. However, the corpus-based method is rather complicated to handle in practical application. We propose a transformed latent semantic analysis (LSA) model as the corpus-based method in this paper. Moreover, two hybrid strategies, the combinations of the various similarity measures, are implemented in the clustering experiments. The results show that our GA clustering algorithm, in conjunction with the thesaurus-based and the LSA-based method, apparently outperforms that with other similarity measures. Moreover, the superiority of the GA clustering algorithm proposed over the commonly used k-means algorithm and the standard GA is demonstrated by the improvements of the clustering performance

    Patent Analytics Based on Feature Vector Space Model: A Case of IoT

    Full text link
    The number of approved patents worldwide increases rapidly each year, which requires new patent analytics to efficiently mine the valuable information attached to these patents. Vector space model (VSM) represents documents as high-dimensional vectors, where each dimension corresponds to a unique term. While originally proposed for information retrieval systems, VSM has also seen wide applications in patent analytics, and used as a fundamental tool to map patent documents to structured data. However, VSM method suffers from several limitations when applied to patent analysis tasks, such as loss of sentence-level semantics and curse-of-dimensionality problems. In order to address the above limitations, we propose a patent analytics based on feature vector space model (FVSM), where the FVSM is constructed by mapping patent documents to feature vectors extracted by convolutional neural networks (CNN). The applications of FVSM for three typical patent analysis tasks, i.e., patents similarity comparison, patent clustering, and patent map generation are discussed. A case study using patents related to Internet of Things (IoT) technology is illustrated to demonstrate the performance and effectiveness of FVSM. The proposed FVSM can be adopted by other patent analysis studies to replace VSM, based on which various big data learning tasks can be performed

    Exploratory Analysis of Highly Heterogeneous Document Collections

    Full text link
    We present an effective multifaceted system for exploratory analysis of highly heterogeneous document collections. Our system is based on intelligently tagging individual documents in a purely automated fashion and exploiting these tags in a powerful faceted browsing framework. Tagging strategies employed include both unsupervised and supervised approaches based on machine learning and natural language processing. As one of our key tagging strategies, we introduce the KERA algorithm (Keyword Extraction for Reports and Articles). KERA extracts topic-representative terms from individual documents in a purely unsupervised fashion and is revealed to be significantly more effective than state-of-the-art methods. Finally, we evaluate our system in its ability to help users locate documents pertaining to military critical technologies buried deep in a large heterogeneous sea of information.Comment: 9 pages; KDD 2013: 19th ACM SIGKDD Conference on Knowledge Discovery and Data Minin
    • …
    corecore