
    Message Passing in Semantic Peer-to-Peer Overlay Networks

    Peer-to-Peer (P2P) systems rely on machine-to-machine ad-hoc communication to offer services to a community. Contrary to the classical client-server architecture, P2P systems consider all peers, i.e., all nodes participating in the network, as equal. Hence, peers can simultaneously act as clients consuming resources from the system and as servers providing resources to the community. P2P applications function on top of existing routing infrastructures, typically the IP network, and organize peers into logical, decentralized structures called overlay networks. In this column, we discuss exploratory research related to data management in P2P overlay networks. First, we discuss the notions of unstructured and structured P2P overlay networks. Then, we discuss data management in such networks by introducing an additional layer to handle semantic heterogeneity and data integration. Finally, we present a method based on sum-product message passing to detect inconsistent information in this setting.
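    The inconsistency-detection idea can be illustrated with a minimal sum-product computation on a two-variable factor graph (a toy sketch under simplifying assumptions, not the method from the column; all potentials and numbers below are made up): when two peers assert conflicting values for attributes that a mapping says should agree, the resulting marginals flatten, which can serve as an inconsistency signal.

```python
def sum_product_pair(phi1, phi2, psi):
    """Sum-product on a two-binary-variable factor graph.

    phi1, phi2: unary potentials (soft evidence from two peers).
    psi: pairwise potential encoding that the variables should agree.
    Returns the normalized marginal of the second variable.
    """
    # Message from variable 1 to variable 2 through the pairwise factor.
    m12 = [sum(phi1[i] * psi[i][j] for i in range(2)) for j in range(2)]
    belief = [phi2[j] * m12[j] for j in range(2)]
    z = sum(belief)
    return [b / z for b in belief]

AGREE = [[0.9, 0.1], [0.1, 0.9]]  # factor rewarding agreement

# Two peers assert the same value: the marginal is sharply peaked.
consistent = sum_product_pair([0.9, 0.1], [0.9, 0.1], AGREE)
# Two peers contradict each other: the marginal flattens,
# flagging a potential inconsistency in the shared data.
conflicting = sum_product_pair([0.9, 0.1], [0.1, 0.9], AGREE)
```

A thresholded marginal (e.g., "no value reaches 0.9") then marks the pair of assertions as suspect.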

    PicShark: Mitigating Metadata Scarcity Through Large-Scale P2P Collaboration

    With the commoditization of digital devices, personal information and media sharing is becoming a key application on the pervasive Web. In such a context, data annotation rather than data production is the main bottleneck. Metadata scarcity represents a major obstacle preventing efficient information processing in large and heterogeneous communities. However, social communities also open the door to new possibilities for addressing local metadata scarcity by taking advantage of global collections of resources. We propose to tackle the lack of metadata in large-scale distributed systems through a collaborative process leveraging both content and metadata. We develop a community-based and self-organizing system called PicShark in which information entropy, in terms of missing metadata, is gradually alleviated through decentralized instance and schema matching. Our approach focuses on semi-structured metadata and confines computationally expensive operations to the edge of the network, while keeping distributed operations as simple as possible to ensure scalability. PicShark builds on structured Peer-to-Peer networks for distributed look-up operations, but extends the application of self-organization principles to the propagation of metadata and the creation of schema mappings. We demonstrate the practical applicability of our method in an image-sharing scenario and provide experimental evidence illustrating the validity of our approach.
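    A toy sketch of the instance-based schema matching idea (hypothetical field names and values; this is not PicShark's actual algorithm): two metadata schemas can be aligned by comparing the sets of instance values observed under each field.

```python
def instance_match(src, dst, threshold=0.5):
    """Align fields of two schemas by the Jaccard overlap of their
    observed instance values.

    src, dst: dicts mapping field names to sets of observed values.
    Returns a dict mapping each src field to its best dst match,
    if the overlap clears the threshold.
    """
    mappings = {}
    for fa, va in src.items():
        best, score = None, 0.0
        for fb, vb in dst.items():
            union = va | vb
            jaccard = len(va & vb) / len(union) if union else 0.0
            if jaccard > score:
                best, score = fb, jaccard
        if best is not None and score >= threshold:
            mappings[fa] = best
    return mappings

# Hypothetical metadata from two photo-sharing peers.
src = {"camera": {"nikon", "canon"}}
dst = {"device": {"nikon", "canon", "sony"}, "city": {"geneva"}}
```

Here "camera" maps to "device" because the two fields share most of their values, while "city" shares none.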

    Efficient Versioning for Scientific Array Databases

    In this paper, we describe a versioned database storage manager we are developing for the SciDB scientific database. The system is designed to efficiently store and retrieve array-oriented data, exposing a "no-overwrite" storage model in which each update creates a new "version" of an array. This makes it possible to compare versions produced at different times or by different algorithms, and to create complex chains and trees of versions. We present algorithms to efficiently encode these versions, minimizing storage space while still providing efficient access to the data. Additionally, we present an optimal algorithm that, given a long sequence of versions, determines which versions to encode in terms of each other (using delta compression) to minimize total storage space or query execution cost. We compare the performance of these algorithms on real-world data sets from the National Oceanic and Atmospheric Administration (NOAA), OpenStreetMap, and several other sources. We show that our algorithms provide better performance than existing version control systems not optimized for array data, both in terms of storage size and access time, and that our delta-compression algorithms are able to substantially reduce the total storage space when versions exist with a high degree of similarity. (National Science Foundation (U.S.) Grants IIS/III-1111371 and SI2-1047955)
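    The flavor of the encoding decision can be sketched with a small dynamic program. This is a deliberate simplification (a linear version sequence, deltas only against the immediate predecessor, and a cap on consecutive deltas to bound reconstruction cost); the paper's algorithm is more general.

```python
import math

def min_storage(full, delta, max_chain):
    """For each version, choose full storage (full[i] bytes) or a delta
    against its predecessor (delta[i] bytes), allowing at most max_chain
    consecutive deltas so that reconstruction stays cheap.
    Returns the minimum total storage."""
    # dp maps current delta-run length -> best total cost so far.
    dp = {0: full[0]}  # the first version must be materialized
    for i in range(1, len(full)):
        ndp = {0: min(dp.values()) + full[i]}  # materialize: run resets
        for run, cost in dp.items():
            if run + 1 <= max_chain:  # delta-encode: run grows
                ndp[run + 1] = min(ndp.get(run + 1, math.inf),
                                   cost + delta[i])
        dp = ndp
    return min(dp.values())
```

For example, with three 10-byte versions and 2-byte deltas, allowing chains of length 2 yields 14 bytes of total storage, while capping chains at length 1 forces an extra materialization and costs 22 bytes.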

    Analyzing Large-Scale Public Campaigns on Twitter

    Social media has become an important instrument for running various types of public campaigns and mobilizing people. Yet, the dynamics of public campaigns on social networking platforms still remain largely unexplored. In this paper, we present an in-depth analysis of over one hundred large-scale campaigns on social media platforms covering more than six years. In particular, we focus on campaigns related to climate change on Twitter, which promote online activism to encourage, educate, and motivate people to react to the various issues raised by climate change. We propose a generic framework based on crowdsourcing to identify both the type of a given campaign and the various actions undertaken throughout its lifespan: official meetings, physical actions, calls for action, publications on climate-related research, etc. We study whether the type of a campaign is correlated with the actions undertaken and how these actions influence the flow of the campaign. Leveraging more than one hundred different campaigns, we build a model capable of accurately predicting the presence of individual actions in tweets. Finally, we explore the influence of active users on the overall campaign flow.

    Hippocampus: answering memory queries using transactive search

    Memory queries denote queries where the user is trying to recall something from his or her past personal experiences. Neither Web search nor structured queries can effectively answer this type of query, even when supported by human computation solutions. In this paper, we propose a new approach to answering memory queries that we call Transactive Search: the user-requested memory is reconstructed from a group of people by exchanging pieces of personal memories in order to reassemble the overall memory, which is stored in a distributed fashion among the members of the group. We experimentally compare our proposed approach against a set of advanced search techniques, including the use of machine learning methods over the Web of Data, online social networks, and human computation techniques. Experimental results show that Transactive Search significantly outperforms existing search approaches for memory queries in terms of effectiveness.

    ScienceWISE: Topic Modeling over Scientific Literature Networks

    We provide an up-to-date view of the knowledge management system ScienceWISE (SW) and address issues related to the automatic assignment of articles to research topics. So far, SW has proven to be an effective platform for managing large volumes of technical articles by means of ontological concept-based browsing. However, as the publication of research articles accelerates, the expressivity and richness of the SW ontology turn into a double-edged sword: a more fine-grained characterization of articles is possible, but at the cost of introducing more spurious relations among them. In this context, the challenge of continuously recommending relevant articles to users lies in tackling a network partitioning problem, where nodes represent articles and co-occurring concepts create edges between them. In this paper, we discuss the three research directions we have taken to solve this issue: i) the identification of generic concepts to reinforce inter-article similarities; ii) the adoption of a bipartite network representation to improve scalability; and iii) the design of a clustering algorithm to identify concepts for cross-disciplinary articles and obtain fine-grained topics for all articles.
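    The first two directions can be illustrated together in a toy sketch (hypothetical article ids and concepts; not the ScienceWISE implementation): keep the article-concept network bipartite, and downweight generic concepts by an inverse-document-frequency term so that a concept shared by every article stops creating spurious edges.

```python
import math
from collections import defaultdict

def concept_weighted_similarity(articles):
    """articles maps article ids to sets of ontology concepts.
    Returns a similarity function over article pairs in which a
    concept occurring in every article (a 'generic' concept)
    contributes nothing."""
    n = len(articles)
    df = defaultdict(int)
    for concepts in articles.values():
        for c in concepts:
            df[c] += 1
    # Inverse document frequency: 0 for concepts present everywhere.
    idf = {c: math.log(n / d) for c, d in df.items()}

    def sim(a, b):
        return sum(idf[c] for c in articles[a] & articles[b])

    return sim

# Hypothetical example: "model" occurs everywhere and is generic.
papers = {
    "p1": {"model", "dark matter"},
    "p2": {"model", "dark matter"},
    "p3": {"model", "graphene"},
}
sim = concept_weighted_similarity(papers)
```

Here p1 and p3 share only the generic concept "model", so their similarity is zero, while p1 and p2 remain linked through "dark matter".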

    Distributed Caching for Processing Raw Arrays

    As applications continue to generate multi-dimensional data at exponentially increasing rates, fast analytics to extract meaningful results is becoming extremely important. The database community has developed array databases that alleviate this problem through a series of techniques. In-situ mechanisms provide direct access to raw data in the original format, without loading and partitioning. Parallel processing scales to the largest datasets. In-memory caching reduces latency when the same data are accessed across a workload of queries. However, we are not aware of any work on distributed caching of multi-dimensional raw arrays. In this paper, we introduce a distributed framework for cost-based caching of multi-dimensional arrays in native format. Given a set of files that contain portions of an array and an online query workload, the framework computes an effective caching plan in two stages. First, the plan identifies the cells to be cached locally from each of the input files by continuously refining an evolving R-tree index. In the second stage, an optimal assignment of cells to nodes that collocates dependent cells is determined in order to minimize the overall data transfer. We design cache eviction and placement heuristics that take the historical query workload into account. A thorough experimental evaluation over two real datasets in three file formats confirms the superiority (by as much as two orders of magnitude) of the proposed framework over existing techniques in terms of cache overhead and workload execution time.
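    A minimal cost-based eviction policy in this spirit (a toy sketch with made-up chunk ids; the paper's heuristics also account for the historical query workload and cell co-location across nodes): under memory pressure, evict the cached chunk whose benefit per byte, estimated as hits times refetch cost divided by size, is lowest.

```python
class CostBasedCache:
    """Toy cache over raw array chunks. On pressure it evicts the chunk
    with the lowest benefit density: hits * refetch_cost / size."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.entries = {}  # chunk id -> [size, refetch_cost, hits]

    def access(self, cid, size, refetch_cost):
        """Record an access. Returns True on a hit, False on a miss."""
        if cid in self.entries:
            self.entries[cid][2] += 1
            return True
        # Evict lowest-benefit chunks until the new chunk fits.
        while self.used + size > self.capacity and self.entries:
            victim = min(
                self.entries,
                key=lambda k: (self.entries[k][2] * self.entries[k][1]
                               / self.entries[k][0]),
            )
            self.used -= self.entries.pop(victim)[0]
        if size <= self.capacity:
            self.entries[cid] = [size, refetch_cost, 1]
            self.used += size
        return False
```

With this policy, a chunk that is cheap to refetch from its raw file is evicted before an equally sized, equally popular chunk that would be expensive to reload.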