404 research outputs found

    Distributed Document Clustering and Cluster Summarization in Peer-to-Peer Environments

    Get PDF
    This thesis addresses difficult challenges in distributed document clustering and cluster summarization. Mining large document collections poses many challenges, one of which is the extraction of topics or summaries from documents for the purpose of interpretation of clustering results. Another important challenge, which is caused by new trends in distributed repositories and peer-to-peer computing, is that document data is becoming more distributed. We introduce a solution for interpreting document clusters using keyphrase extraction from multiple documents simultaneously. We also introduce two solutions for the problem of distributed document clustering in peer-to-peer environments, each satisfying a different goal: maximizing local clustering quality through collaboration, and maximizing global clustering quality through cooperation. The keyphrase extraction algorithm efficiently extracts and scores candidate keyphrases from a document cluster. The algorithm is called CorePhrase and is based on modeling document collections as a graph upon which we can leverage graph mining to extract frequent and significant phrases, which are used to label the clusters. Results show that CorePhrase can extract keyphrases relevant to documents in a cluster with very high accuracy. Although this algorithm can be used to summarize centralized clusters, it is specifically employed within distributed clustering to both boost distributed clustering accuracy, and to provide summaries for distributed clusters. The first method for distributed document clustering is called collaborative peer-to-peer document clustering, which models nodes in a peer-to-peer network as collaborative nodes with the goal of improving the quality of individual local clustering solutions. This is achieved through the exchange of local cluster summaries between peers, followed by recommendation of documents to be merged into remote clusters. Results on large sets of distributed document collections show that: (i) such collaboration technique achieves significant improvement in the final clustering of individual nodes; (ii) networks with larger number of nodes generally achieve greater improvements in clustering after collaboration relative to the initial clustering before collaboration, while on the other hand they tend to achieve lower absolute clustering quality than networks with fewer number of nodes; and (iii) as more overlap of the data is introduced across the nodes, collaboration tends to have little effect on improving clustering quality. The second method for distributed document clustering is called hierarchically-distributed document clustering. Unlike the collaborative model, this model aims at producing one clustering solution across the whole network. It specifically addresses scalability of network size, and consequently the distributed clustering complexity, by modeling the distributed clustering problem as a hierarchy of node neighborhoods. Summarization of the global distributed clusters is achieved through a distributed version of the CorePhrase algorithm. Results on large document sets show that: (i) distributed clustering accuracy is not affected by increasing the number of nodes for networks of single level; (ii) we can achieve decent speedup by making the hierarchy taller, but on the expense of clustering quality which degrades as we go up the hierarchy; (iii) in networks that grow arbitrarily, data gets more fragmented across neighborhoods causing poor centroid generation, thus suggesting we should not increase the number of nodes in the network beyond a certain level without increasing the data set size; and (iv) distributed cluster summarization can produce accurate summaries similar to those produced by centralized summarization. The proposed algorithms offer high degree of flexibility, scalability, and interpretability of large distributed document collections. Achieving the same results using current methodologies require centralization of the data first, which is sometimes not feasible

    GDCluster: a general decentralized clustering algorithm

    Get PDF
    In many popular applications like peer-to-peer systems, large amounts of data are distributed among multiple sources. Analysis of this data and identifying clusters is challenging due to processing, storage, and transmission costs. In this paper, we propose GDCluster, a general fully decentralized clustering method, which is capable of clustering dynamic and distributed data sets. Nodes continuously cooperate through decentralized gossip-based communication to maintain summarized views of the data set. We customize GDCluster for execution of the partition-based and density-based clustering methods on the summarized views, and also offer enhancements to the basic algorithm. Coping with dynamic data is made possible by gradually adapting the clustering model. Our experimental evaluations show that GDCluster can discover the clusters efficiently with scalable transmission cost, and also expose its supremacy in comparison to the popular method LSP2P

    Clustering cliques for graph-based summarization of the biomedical research literature

    Get PDF
    BACKGROUND: Graph-based notions are increasingly used in biomedical data mining and knowledge discovery tasks. In this paper, we present a clique-clustering method to automatically summarize graphs of semantic predications produced from PubMed citations (titles and abstracts). RESULTS: SemRep is used to extract semantic predications from the citations returned by a PubMed search. Cliques were identified from frequently occurring predications with highly connected arguments filtered by degree centrality. Themes contained in the summary were identified with a hierarchical clustering algorithm based on common arguments shared among cliques. The validity of the clusters in the summaries produced was compared to the Silhouette-generated baseline for cohesion, separation and overall validity. The theme labels were also compared to a reference standard produced with major MeSH headings. CONCLUSIONS: For 11 topics in the testing data set, the overall validity of clusters from the system summary was 10% better than the baseline (43% versus 33%). While compared to the reference standard from MeSH headings, the results for recall, precision and F-score were 0.64, 0.65, and 0.65 respectively

    Interactive data analysis and its applications on multi-structured datasets

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    Exploring supervised and unsupervised methods to detect topics in biomedical text

    Get PDF
    BACKGROUND: Topic detection is a task that automatically identifies topics (e.g., "biochemistry" and "protein structure") in scientific articles based on information content. Topic detection will benefit many other natural language processing tasks including information retrieval, text summarization and question answering; and is a necessary step towards the building of an information system that provides an efficient way for biologists to seek information from an ocean of literature. RESULTS: We have explored the methods of Topic Spotting, a task of text categorization that applies the supervised machine-learning technique naïve Bayes to assign automatically a document into one or more predefined topics; and Topic Clustering, which apply unsupervised hierarchical clustering algorithms to aggregate documents into clusters such that each cluster represents a topic. We have applied our methods to detect topics of more than fifteen thousand of articles that represent over sixteen thousand entries in the Online Mendelian Inheritance in Man (OMIM) database. We have explored bag of words as the features. Additionally, we have explored semantic features; namely, the Medical Subject Headings (MeSH) that are assigned to the MEDLINE records, and the Unified Medical Language System (UMLS) semantic types that correspond to the MeSH terms, in addition to bag of words, to facilitate the tasks of topic detection. Our results indicate that incorporating the MeSH terms and the UMLS semantic types as additional features enhances the performance of topic detection and the naïve Bayes has the highest accuracy, 66.4%, for predicting the topic of an OMIM article as one of the total twenty-five topics. CONCLUSION: Our results indicate that the supervised topic spotting methods outperformed the unsupervised topic clustering; on the other hand, the unsupervised topic clustering methods have the advantages of being robust and applicable in real world settings

    Adaptive Semantic Indexing of Documents for Locating Relevant Information in P2P Networks

    Get PDF
    Abstract: Locating relevant information i

    CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap

    Get PDF
    After addressing the state-of-the-art during the first year of Chorus and establishing the existing landscape in multimedia search engines, we have identified and analyzed gaps within European research effort during our second year. In this period we focused on three directions, notably technological issues, user-centred issues and use-cases and socio- economic and legal aspects. These were assessed by two central studies: firstly, a concerted vision of functional breakdown of generic multimedia search engine, and secondly, a representative use-cases descriptions with the related discussion on requirement for technological challenges. Both studies have been carried out in cooperation and consultation with the community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our Think-Tank, presentations in international conferences, and surveys addressed to EU projects coordinators as well as National initiatives coordinators. Based on the obtained feedback we identified two types of gaps, namely core technological gaps that involve research challenges, and “enablers”, which are not necessarily technical research challenges, but have impact on innovation progress. New socio-economic trends are presented as well as emerging legal challenges

    Efficient Algorithms to Compute Hierarchical Summaries from Big Data Streams

    Full text link
    Many data stream applications have hierarchical data; containing time, geographic locations, product information, clickstreams, server logs, IP addresses. A hierarchical summary of such volumous data offers multiple advantages including compactness, quick understanding, and abstraction. The goal of this thesis is to design algorithmic approaches for summarizing hierarchical data streams. First, this thesis provides a theoretical analysis of the benchmark hierarchical heavy hitters' algorithms and uncovers their shortcomings such as requiring high theoretical memory, updates and coverage problem. To address these shortcomings, this thesis proposes efficient algorithms which offer deterministic estimation accuracy using O(η/ε) worst-case memory and O(η) worst-case time complexity per item, where ε ∈ [0,1] is a user defined parameter and η is a small constant derived from the data. The proposed hierarchical heavy hitters' algorithms are shown to have improved significantly over existing algorithms both theoretically as well as empirically. Next, this thesis introduces a new concept called hierarchically correlated heavy hitters, which is different from existing hierarchical summarization techniques. The thesis provides a formal definition of the proposed concept and compares it with existing hierarchical summarization approaches both at definition level and empirically. It also proposes an efficient hierarchy-aware algorithm for computing hierarchically correlated heavy hitters. The proposed algorithm offers deterministic estimation accuracy using O(η / (ε_p * ε_s )) worst-case memory and O(η) worst-case time complexity per item, where η is as defined previously, and ε_p ∈ [0,1], ε_s ∈ [0,1] are other user defined parameters. Finally, the thesis proposes a special hierarchical data structure and algorithm to summarize spatiotemporal data. It can be used to extract interesting and useful patterns from high-speed spatiotemporal data streams at multiple spatial and temporal granularities. Theoretical and empirical analysis are provided, which show that the proposed data structure is very efficient concerning data storage and response to queries. It updates a single item in O(1) time and responds to a point query in O(1) time. Importantly, the memory requirement of the proposed data structure is independent of the size of the data and only depends on user-supplied parameters ψ ⃗ and φ ⃗. In summary, this thesis provides a general framework consisting of a set of algorithms and data structures to compute hierarchical summaries of the big data streams. All of the proposed algorithms exploit a lattice structure built from the hierarchical attributes of the data to compute different hierarchical summaries, which can be used to address various data analytic issues in many emerging applications
    corecore