    Incremental Lossless Graph Summarization

    Given a fully dynamic graph, represented as a stream of edge insertions and deletions, how can we obtain and incrementally update a lossless summary of its current snapshot? As large-scale graphs are prevalent, concisely representing them is inevitable for efficient storage and analysis. Lossless graph summarization is an effective graph-compression technique with many desirable properties. It aims to compactly represent the input graph as (a) a summary graph consisting of supernodes (i.e., sets of nodes) and superedges (i.e., edges between supernodes), which provide a rough description, and (b) edge corrections, which fix errors induced by the rough description. While a number of batch algorithms, suited for static graphs, have been developed for rapid and compact graph summarization, they are highly inefficient in terms of time and space for dynamic graphs, which are common in practice. In this work, we propose MoSSo, the first incremental algorithm for lossless summarization of fully dynamic graphs. In response to each change in the input graph, MoSSo updates the output representation by repeatedly moving nodes among supernodes. MoSSo chooses the nodes to be moved and their destinations carefully but rapidly, based on several novel ideas. Through extensive experiments on 10 real graphs, we show that MoSSo is (a) Fast and 'any time': processing each change in near-constant time (less than 0.1 millisecond), up to 7 orders of magnitude faster than running state-of-the-art batch methods, (b) Scalable: summarizing graphs with hundreds of millions of edges, requiring sub-linear memory during the process, and (c) Effective: achieving compression ratios comparable even to those of state-of-the-art batch methods. Comment: to appear at the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '20).
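
    The representation described above (supernodes, superedges, and edge corrections) can be illustrated with a minimal decoding sketch. It is not MoSSo itself, which concerns how the summary is updated incrementally. The function name restore_edges, the toy graph, and the split of corrections into "additions" and "deletions" are assumptions made for illustration only.

```python
from itertools import combinations, product


def restore_edges(supernodes, superedges, additions, deletions):
    """Rebuild the original undirected edge set from a lossless summary."""
    edges = set()
    for a, b in superedges:
        if a == b:
            # A self-loop superedge stands for every pair inside the supernode.
            pairs = combinations(sorted(supernodes[a]), 2)
        else:
            # A superedge stands for every pair across the two supernodes.
            pairs = product(supernodes[a], supernodes[b])
        edges.update(frozenset(p) for p in pairs)
    # Corrections fix the errors of the rough description.
    edges -= {frozenset(e) for e in deletions}  # edges implied but absent
    edges |= {frozenset(e) for e in additions}  # edges present but not implied
    return edges


# Hypothetical toy graph: nodes 1, 2, 3 all link to 4 and 5, except the
# pair (3, 5), and there is one extra edge (1, 2).
supernodes = {"A": {1, 2, 3}, "B": {4, 5}}
superedges = [("A", "B")]
additions = [(1, 2)]
deletions = [(3, 5)]

restored = restore_edges(supernodes, superedges, additions, deletions)
original = {frozenset(e) for e in [(1, 4), (1, 5), (2, 4), (2, 5), (3, 4), (1, 2)]}
print(restored == original)  # prints True
```

    In this toy case the summary stores one superedge and two corrections in place of six edges, and decoding recovers the input exactly, which is what makes the summary lossless.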

    Instance-Based Lossless Summarization of Knowledge Graph With Optimized Triples and Corrections (IBA-OTC)

    Knowledge graph (KG) summarization facilitates efficient information retrieval when exploring complex structural data. Fast information retrieval requires processing redundant data, yet the summary graph must still keep the information complete. Summarization also reduces computation time during data retrieval, storage space, and the cost of in-memory visualization, while preserving structure after summarization. State-of-the-art approaches summarize a given KG by preserving its structure at the cost of information loss. Conversely, approaches that do not preserve the underlying structure compromise the summarization ratio by focusing only on the compression of specific regions. As a result, these approaches either fail to preserve the original facts or wrongly predict inferred information. To solve these problems, we present a novel framework for generating a lossless summary that preserves the structure through super signatures and their corresponding corrections. The proposed approach summarizes only the naturally overlapping instances while maintaining their information and preserving the underlying Resource Description Framework (RDF) graph. The resultant summary is composed of triples with positive, negative, and star corrections that are optimized by selectively calling two novel functions, namely merge and disperse. To evaluate the effectiveness of the proposed approach, we perform experiments on nine publicly available real-world knowledge graphs and obtain a better summarization ratio than state-of-the-art approaches by a margin of 10% to 30%, while achieving completeness, correctness, and compactness. In this way, the retrieval of common events and groups by queries is accelerated in the resultant graph.

    Structural Summarization of Semantic Graphs Using Quotients

    Graph summarization is the process of computing a compact version of an input graph while preserving chosen features of its structure. We consider semantic graphs where the features include edge labels and label sets associated with a vertex. Graph summaries are typically much smaller than the original graph. Applications that depend on the preserved features can perform their tasks on the summary, but much faster or with less memory overhead, while producing the same outcome as if they were applied on the original graph. In this survey, we focus on structural summaries based on quotients that organize vertices in equivalence classes of shared features. Structural summaries are particularly popular for semantic graphs and have the advantage of defining a precise graph-based output. We consider approaches and algorithms for both static and temporal graphs. A common example of quotient-based structural summaries is bisimulation, and we discuss this in detail. While there exist other surveys on graph summarization, to the best of our knowledge, we are the first to bring in a focused discussion on quotients, bisimulation, and their relation. Furthermore, structural summarization naturally connects well with formal logic due to the discrete structures considered. We complete the survey with a brief description of approaches beyond structural summaries
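
    The idea of a quotient summary, in which vertices sharing the same features fall into one equivalence class, can be sketched as follows. This is a simplified one-step feature equivalence (vertex label set plus outgoing edge labels), not full bisimulation, and the function name quotient_summary and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def quotient_summary(vertex_labels, edges):
    """Group vertices by (label set, outgoing edge labels) and lift edges to the groups."""
    out_labels = defaultdict(set)
    for src, label, _dst in edges:
        out_labels[src].add(label)

    def key(v):
        # Equivalence key: the features this summary chooses to preserve.
        return (frozenset(vertex_labels.get(v, ())), frozenset(out_labels[v]))

    vertices = set(vertex_labels)
    for src, _label, dst in edges:
        vertices.update((src, dst))
    class_of = {v: key(v) for v in vertices}

    # An edge between two vertices becomes an edge between their classes.
    summary_edges = {(class_of[s], label, class_of[d]) for s, label, d in edges}
    return class_of, summary_edges


# Hypothetical semantic graph with labelled vertices and labelled edges.
vertex_labels = {
    "alice": {"Person"},
    "bob": {"Person"},
    "acme": {"Company"},
    "initech": {"Company"},
}
edges = [("alice", "worksAt", "acme"), ("bob", "worksAt", "initech")]

class_of, summary_edges = quotient_summary(vertex_labels, edges)
print(len(set(class_of.values())), len(summary_edges))  # prints 2 1
```

    Here four vertices collapse into two classes and two edges into one summary edge. A task that depends only on the preserved features, such as asking which vertex types have an outgoing worksAt edge, gives the same answer on the summary as on the original graph.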

    Large-scale interactive exploratory visual search

    Large-scale visual search has been one of the challenging issues in the era of big data. It demands techniques that are not only highly effective and efficient but also allow users to conveniently express their information needs and refine their intents. In this thesis, we focus on developing an exploratory framework for large-scale visual search. We also develop a number of enabling techniques, including compact visual content representation for scalable search, near-duplicate video shot detection, and action-based event detection. We propose a novel scheme for extremely low bit rate visual search, which sends compressed visual words consisting of a vocabulary tree histogram and descriptor orientations rather than descriptors. Compact representation of video data is achieved by identifying the keyframes of a video, which can also help users comprehend visual content efficiently. We propose a novel Bag-of-Importance model for static video summarization. Near-duplicate detection is one of the key issues for large-scale visual search, since there exist a large number of nearly identical images and videos. We propose an improved near-duplicate video shot detection approach for more effective shot representation. Event detection has been one of the solutions for bridging the semantic gap in visual search. We focus in particular on human-action-centred event detection and propose an enhanced sparse coding scheme to model human actions. Our proposed approach is able to significantly reduce computational cost while achieving recognition accuracy highly comparable to that of state-of-the-art methods. Finally, we propose an integrated solution for addressing the prime challenges raised by large-scale interactive visual search. The proposed system is also one of the first attempts at exploratory visual search. It provides users with more robust results to satisfy their exploring experiences.

    Resource discovery for arbitrary-scale systems

    Doctorate in Informatics. Large-scale distributed computing technologies such as Cloud, Grid, Cluster and HPC supercomputers are progressing along with the revolutionary emergence of many-core designs (e.g. GPUs, CPUs on a single die, supercomputers on chip, etc.) and significant advances in networking and interconnect solutions. In future, computing nodes with thousands of cores may be connected together to form a single transparent computing unit which hides from applications the complexity and distributed nature of these many-core systems. In order to benefit efficiently from all the potential resources in such large-scale many-core-enabled computing environments, resource discovery is the vital building block for maximally exploiting the capabilities of all distributed heterogeneous resources by precisely recognizing and locating those resources in the system. Efficient and scalable resource discovery is challenging for such future systems, where the resources and the underlying computation and communication infrastructures are highly dynamic, hierarchical and heterogeneous. In this thesis, we investigate the problem of resource discovery with respect to the general requirements of arbitrary-scale future many-core-enabled computing environments. The main contribution of this thesis is Hybrid Adaptive Resource Discovery (HARD), a novel, efficient and highly scalable resource-discovery approach built upon a virtual hierarchical overlay based on self-organization and self-adaptation of processing resources in the system, where the computing resources are organized into distributed hierarchies according to a proposed hierarchical multi-layered resource description model. Operationally, each layer consists of a peer-to-peer architecture of modules that, by interacting with each other, provide a global view of resource availability in a large, dynamic and heterogeneous distributed environment. The proposed resource discovery model provides the adaptability and flexibility to perform complex querying by supporting a set of significant querying features (such as multi-dimensional, range and aggregate querying) together with exact and partial matching, for both static and dynamic object contents. Simulations show that HARD can be applied to arbitrary scales of dynamicity, both in terms of complexity and of scale, positioning this proposal as a suitable architecture for future many-core systems. We also propose a novel resource management scheme for future systems which can efficiently utilize distributed resources in a fully decentralized fashion. Moreover, leveraging discovery components (RR-RPs) enables our resource management platform to dynamically find and allocate available resources that guarantee the requested QoS parameters.

    GraphMineSuite: Enabling High-Performance and Programmable Graph Mining Algorithms with Set Algebra

    We propose GraphMineSuite (GMS): the first benchmarking suite for graph mining that facilitates evaluating and constructing high-performance graph mining algorithms. First, GMS comes with a benchmark specification based on extensive literature review, prescribing representative problems, algorithms, and datasets. Second, GMS offers a carefully designed software platform for seamless testing of different fine-grained elements of graph mining algorithms, such as graph representations or algorithm subroutines. The platform includes parallel implementations of more than 40 considered baselines, and it facilitates developing complex and fast mining algorithms. High modularity is possible by harnessing set algebra operations such as set intersection and difference, which enables breaking complex graph mining algorithms into simple building blocks that can be separately experimented with. GMS is supported with a broad concurrency analysis for portability in performance insights, and a novel performance metric to assess the throughput of graph mining algorithms, enabling more insightful evaluation. As use cases, we harness GMS to rapidly redesign and accelerate state-of-the-art baselines of core graph mining problems: degeneracy reordering (by up to >2x), maximal clique listing (by up to >9x), k-clique listing (by 1.1x), and subgraph isomorphism (by up to 2.5x), also obtaining better theoretical performance bounds
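
    To illustrate how set algebra can serve as a building block for graph mining, the following minimal sketch counts triangles using only set intersection. It does not use GMS or its API. The function name count_triangles and the toy adjacency data are assumptions made for illustration.

```python
def count_triangles(adjacency):
    """Count triangles in an undirected graph using set intersections."""
    # Keep only neighbours with a larger id, so each triangle u < v < w
    # is found exactly once, at its smallest vertex u.
    higher = {u: {v for v in nbrs if v > u} for u, nbrs in adjacency.items()}
    total = 0
    for u, nbrs in higher.items():
        for v in nbrs:
            # Common higher neighbours of u and v close a triangle.
            total += len(nbrs & higher.get(v, set()))
    return total


# Hypothetical toy graph with triangles (0, 1, 2) and (0, 2, 3).
adjacency = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2},
}
print(count_triangles(adjacency))  # prints 2
```

    Because the only graph-specific operation is the intersection, the set representation (for example hash sets, sorted arrays, or bitmaps) could be swapped without touching the algorithm, which is the kind of modularity the abstract attributes to set algebra.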