229 research outputs found

    Random walk with restart over dynamic graphs

    Random Walk with Restart (RWR) is an appealing measure of proximity between nodes based on graph structures. Since real graphs are often large and subject to minor changes, it is prohibitively expensive to recompute proximities from scratch. Previous methods use LU decomposition and degree reordering heuristics, entailing O(|V|^3) time and O(|V|^2) memory to compute all (|V|^2) pairs of node proximities in a static graph. In this paper, a dynamic scheme to assess RWR proximities is proposed: (1) For unit update, we characterize the changes to all-pairs proximities as the outer product of two vectors. We notice that the multiplication of an RWR matrix and its transition matrix, unlike traditional matrix multiplications, is commutative. This can greatly reduce the computation of all-pairs proximities from O(|V|^3) to O(|Δ|) time for each update without loss of accuracy, where |Δ| (<< |V|^2) is the number of affected proximities. (2) To avoid O(|V|^2) memory for all pairs of outputs, we also devise efficient partitioning techniques for our dynamic model, which can compute all pairs of proximities segment-wise within O(l|V|) memory, where l is a user-controlled trade-off between memory and I/O costs. (3) For bulk update, we also devise aggregation and hashing methods, which can further discard many unnecessary updates and handle chunks of unit updates simultaneously. Our experimental results on various datasets demonstrate that our methods can be 1-2 orders of magnitude faster than other competitors while securing scalability and exactness.
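
The commutativity observation in the abstract above can be checked numerically on a toy graph. The following is a minimal sketch, not the paper's algorithm: it assumes the common formulation P = (1 - c) * (I - c*W)^(-1), where W is the column-normalised transition matrix and c is the probability of continuing the walk (so 1 - c is the restart probability). The function names, the value of c, and the example graph are all hypothetical.

```python
def mat_mul(a, b):
    """Plain dense matrix product of two lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def transition_matrix(n, edges):
    """Column-normalised adjacency: W[v][u] = 1 / outdeg(u) for each edge u -> v."""
    out_degree = [0] * n
    for u, _ in edges:
        out_degree[u] += 1
    w = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        w[v][u] += 1.0 / out_degree[u]
    return w


def rwr_all_pairs(w, c=0.8, iters=200):
    """Iterate P <- c*W*P + (1-c)*I, which converges to (1-c)*(I - c*W)^(-1)."""
    n = len(w)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    p = [row[:] for row in identity]
    for _ in range(iters):
        wp = mat_mul(w, p)
        p = [[c * wp[i][j] + (1 - c) * identity[i][j] for j in range(n)]
             for i in range(n)]
    return p


if __name__ == "__main__":
    w = transition_matrix(3, [(0, 1), (1, 2), (2, 0), (0, 2)])
    p = rwr_all_pairs(w)
    wp, pw = mat_mul(w, p), mat_mul(p, w)
    gap = max(abs(wp[i][j] - pw[i][j]) for i in range(3) for j in range(3))
    print("max |WP - PW| =", gap)
```

Under this formulation P is a power series in W, which is why the two products agree up to rounding error. This from-scratch iteration is the static baseline; the incremental outer-product update described in the abstract is not reproduced here.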

    Clustering semantics for hypermedia presentation

    Semantic annotations of media repositories make relationships among the stored media and relevant concepts explicit. However, these relationships and the media they join are not directly presentable as hypermedia. Previous work shows how clustering over the annotations in the repositories can determine hypermedia presentation structure. Here we explore the application of different clustering techniques to generating hypermedia interfaces to media archives. This paper also describes the effect of each type of clustering on the end user's experience. We then generalize and unify these techniques with the use of proximity measures in further improving generated presentation structure.

    Content warehouses

    Nowadays, content management systems are an established technology. Based on experience from several application scenarios, we discuss the points of contact between content management systems and other disciplines of information systems engineering, such as data warehouses, data mining, and data integration. We derive a system architecture called "content warehouse" that integrates these technologies and defines a more general and more sophisticated view of content management. As a case study, we show a system for the collection, maintenance, and evaluation of biological content such as survey data or multimedia resources.

    Keeping the data lake in form: proximity mining for pre-filtering schema matching

    Data Lakes (DLs) are large repositories of raw datasets from disparate sources. As more datasets are ingested into a DL, there is an increasing need for efficient techniques to profile them and to detect the relationships among their schemata, commonly known as holistic schema matching. Schema matching detects similarity between the information stored in the datasets to support information discovery and retrieval. Currently, this is computationally expensive given the volume of state-of-the-art DLs. To handle this challenge, we propose a novel early-pruning approach to improve efficiency, where we collect different types of content metadata and schema metadata about the datasets, and then use this metadata in early-pruning steps to pre-filter the schema matching comparisons. This involves computing proximities between datasets based on their metadata, discovering their relationships based on overall proximities, and proposing similar dataset pairs for schema matching. We improve the effectiveness of this task by introducing a supervised mining approach for detecting similar datasets, which are proposed for further schema matching. We conduct extensive experiments on a real-world DL, which demonstrate the success of our approach in detecting similar datasets for schema matching, with recall rates of more than 85% and efficiency improvements above 70%. We empirically show the computational cost savings in space and time of our approach in comparison to instance-based schema matching techniques. This research was partially funded by the European Commission through the Erasmus Mundus Joint Doctorate (IT4BI-DC). Peer Reviewed. Postprint (author's final draft).
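
The pre-filtering idea in the abstract above (score dataset pairs cheaply from metadata, then pass only the promising pairs to full schema matching) can be illustrated with a small sketch. This is a deliberately simple stand-in, not the paper's supervised proximity model: it assumes the only metadata available is each dataset's set of attribute names and uses Jaccard similarity as the proximity. The function names, the threshold, and the example schemas are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections of attribute names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def prefilter_pairs(schemas, threshold=0.5):
    """Return (name1, name2, score) for dataset pairs worth full schema matching.

    `schemas` maps a dataset name to its list of attribute names. Pairs whose
    metadata proximity falls below `threshold` are pruned early.
    """
    candidates = []
    for (n1, attrs1), (n2, attrs2) in combinations(sorted(schemas.items()), 2):
        score = jaccard(attrs1, attrs2)
        if score >= threshold:
            candidates.append((n1, n2, score))
    return candidates


if __name__ == "__main__":
    schemas = {
        "customers": ["id", "name", "email"],
        "clients": ["id", "name", "phone"],
        "sensors": ["ts", "value"],
    }
    for pair in prefilter_pairs(schemas):
        print(pair)
```

In this toy example only one of the three possible pairs survives the filter, so the expensive matching step would run once instead of three times. A learned proximity, as the abstract describes, would replace `jaccard` with a model trained on several metadata features.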

    A Framework for Top-K Queries over Weighted RDF Graphs

    The Resource Description Framework (RDF) is a specification that aims to support the conceptual modeling of metadata or information about resources in the form of a directed graph composed of triples of knowledge (facts). RDF also provides mechanisms to encode meta-information (such as source, trust, and certainty) about facts already existing in a knowledge base through a process called reification. In this thesis, an extension to the current RDF specification is proposed in order to enhance RDF triples with an application-specific weight (cost). Unlike reification, this extension treats these additional weights as first-class knowledge attributes in the RDF model, which can be leveraged by the underlying query engine. Additionally, current RDF query languages, such as SPARQL, have limited expressive power, which limits the capabilities of applications that use them. Moreover, even in the presence of language extensions, current RDF stores cannot provide methods and tools to process extended queries efficiently and effectively. To overcome these limitations, a set of novel primitives for the SPARQL language is proposed to express Top-k queries using traditional query patterns as well as novel predicates inspired by those from the XPath language. An extended query processor engine is also developed to support efficient ranked path search, join, and indexing. In addition, several query optimization strategies are proposed, which employ heuristics, advanced indexing tools, and two graph metrics: proximity and sub-result inter-arrival time. These strategies aim to find join orders that reduce the total query execution time while avoiding worst-case pattern combinations. Finally, extensive experimental evaluation shows that using these two metrics in query optimization has a significant impact on the performance and efficiency of Top-k queries. Further experiments also show that proximity and inter-arrival time have an even greater, although sometimes undesirable, impact when combined through aggregation functions. Based on these results, a hybrid algorithm is proposed which acknowledges that proximity is more important than inter-arrival time, due to its more complete nature, and performs a fine-grained combination of both metrics by analyzing the differences between their individual scores and performing the aggregation only if these differences are negligible. Dissertation/Thesis. M.S. Computer Science 201

    Learning Vertex Representations for Bipartite Networks

    Recent years have witnessed a widespread increase of interest in network representation learning (NRL). By far, most research efforts have focused on NRL for homogeneous networks like social networks, where vertices are of the same type, or heterogeneous networks like knowledge graphs, where vertices (and/or edges) are of different types. There has been relatively little research dedicated to NRL for bipartite networks. Arguably, generic network embedding methods like node2vec and LINE can also be applied to learn vertex embeddings for bipartite networks by ignoring the vertex type information. However, these methods are suboptimal in doing so, since real-world bipartite networks concern the relationship between two types of entities, which usually exhibit different properties and patterns from other types of network data. For example, E-Commerce recommender systems need to capture the collaborative filtering patterns between customers and products, and search engines need to consider the matching signals between queries and webpages.

    Strategies for image visualisation and browsing

    PhD. The exploration of large information spaces has remained a challenging task even though the proliferation of database management systems and the state-of-the-art retrieval algorithms is becoming pervasive. Significant research attention in the multimedia domain is focused on finding automatic algorithms for organising digital image collections into meaningful structures and providing high-semantic image indices. On the other hand, utilisation of graphical and interactive methods from information visualisation domain, provide promising direction for creating efficient user-oriented systems for image management. Methods such as exploratory browsing and query, as well as intuitive visual overviews of image collection, can assist the users in finding patterns and developing the understanding of structures and content in complex image data-sets. The focus of the thesis is combining the features of automatic data processing algorithms with information visualisation. The first part of this thesis focuses on the layout method for displaying the collection of images indexed by low-level visual descriptors. The proposed solution generates graphical overview of the data-set as a combination of similarity based visualisation and random layout approach. Second part of the thesis deals with problem of visualisation and exploration for hierarchical organisation of images. Due to the absence of the semantic information, images are considered the only source of high-level information. The content preview and display of hierarchical structure are combined in order to support image retrieval. In addition to this, novel exploration and navigation methods are proposed to enable the user to find the way through database structure and retrieve the content. On the other hand, semantic information is available in cases where automatic or semi-automatic image classifiers are employed. The automatic annotation of image items provides what is referred to as higher-level information. This type of information is a cornerstone of multi-concept visualisation framework which is developed as a third part of this thesis. This solution enables dynamic generation of user-queries by combining semantic concepts, supported by content overview and information filtering. Comparative analysis and user tests, performed for the evaluation of the proposed solutions, focus on the ways information visualisation affects the image content exploration and retrieval; how efficient and comfortable are the users when using different interaction methods and the ways users seek for information through different types of database organisation.