31,545 research outputs found

    Contextual Information Retrieval based on Algorithmic Information Theory and Statistical Outlier Detection

    Full text link
    The main contribution of this paper is to design an Information Retrieval (IR) technique based on Algorithmic Information Theory (using the Normalized Compression Distance- NCD), statistical techniques (outliers), and novel organization of data base structure. The paper shows how they can be integrated to retrieve information from generic databases using long (text-based) queries. Two important problems are analyzed in the paper. On the one hand, how to detect "false positives" when the distance among the documents is very low and there is actual similarity. On the other hand, we propose a way to structure a document database which similarities distance estimation depends on the length of the selected text. Finally, the experimental evaluations that have been carried out to study previous problems are shown.Comment: Submitted to 2008 IEEE Information Theory Workshop (6 pages, 6 figures

    PF-OLA: A High-Performance Framework for Parallel On-Line Aggregation

    Full text link
    Online aggregation provides estimates to the final result of a computation during the actual processing. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution. This allows for the interactive data exploration of the largest datasets. In this paper we introduce the first framework for parallel online aggregation in which the estimation virtually does not incur any overhead on top of the actual execution. We define a generic interface to express any estimation model that abstracts completely the execution details. We design a novel estimator specifically targeted at parallel online aggregation. When executed by the framework over a massive 8TB8\text{TB} TPC-H instance, the estimator provides accurate confidence bounds early in the execution even when the cardinality of the final result is seven orders of magnitude smaller than the dataset size and without incurring overhead.Comment: 36 page

    A Selectivity based approach to Continuous Pattern Detection in Streaming Graphs

    Full text link
    Cyber security is one of the most significant technical challenges in current times. Detecting adversarial activities, prevention of theft of intellectual properties and customer data is a high priority for corporations and government agencies around the world. Cyber defenders need to analyze massive-scale, high-resolution network flows to identify, categorize, and mitigate attacks involving networks spanning institutional and national boundaries. Many of the cyber attacks can be described as subgraph patterns, with prominent examples being insider infiltrations (path queries), denial of service (parallel paths) and malicious spreads (tree queries). This motivates us to explore subgraph matching on streaming graphs in a continuous setting. The novelty of our work lies in using the subgraph distributional statistics collected from the streaming graph to determine the query processing strategy. We introduce a "Lazy Search" algorithm where the search strategy is decided on a vertex-to-vertex basis depending on the likelihood of a match in the vertex neighborhood. We also propose a metric named "Relative Selectivity" that is used to select between different query processing strategies. Our experiments performed on real online news, network traffic stream and a synthetic social network benchmark demonstrate 10-100x speedups over selectivity agnostic approaches.Comment: in 18th International Conference on Extending Database Technology (EDBT) (2015

    Fast Search for Dynamic Multi-Relational Graphs

    Full text link
    Acting on time-critical events by processing ever growing social media or news streams is a major technical challenge. Many of these data sources can be modeled as multi-relational graphs. Continuous queries or techniques to search for rare events that typically arise in monitoring applications have been studied extensively for relational databases. This work is dedicated to answer the question that emerges naturally: how can we efficiently execute a continuous query on a dynamic graph? This paper presents an exact subgraph search algorithm that exploits the temporal characteristics of representative queries for online news or social media monitoring. The algorithm is based on a novel data structure called the Subgraph Join Tree (SJ-Tree) that leverages the structural and semantic characteristics of the underlying multi-relational graph. The paper concludes with extensive experimentation on several real-world datasets that demonstrates the validity of this approach.Comment: SIGMOD Workshop on Dynamic Networks Management and Mining (DyNetMM), 201
    • …
    corecore