1,716 research outputs found

    FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

    Full text link
    We present FLASH (\textbf{F}ast \textbf{L}SH \textbf{A}lgorithm for \textbf{S}imilarity search accelerated with \textbf{H}PC), a similarity search system for ultra-high dimensional datasets on a single machine, that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging a LSH style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count based estimations, we reduce the computational and parallelization costs of similarity search, while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URL, click-through prediction, social networks, etc. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail on the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset, using brute-force (n2Dn^2D), will require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results

    Dura

    Get PDF
    The reactive event processing language, that is developed in the context of this project, has been called DEAL in previous documents. When we chose this name for our language it has not been used by other authors working in the same research area (complex event processing). However, in the meantime it appears in publications of other authors and because we have not used the name in publications yet we cannot claim that we were the first to use it. In order to avoid ambiguities and name conflicts in future publications we decided to rename our language to Dura which stands for “Declarative uniform reactive event processing language”. Therefore the title of this deliverable has been updated to “Dura – Concepts and Examples”

    Indexing the Earth Mover's Distance Using Normal Distributions

    Full text link
    Querying uncertain data sets (represented as probability distributions) presents many challenges due to the large amount of data involved and the difficulties comparing uncertainty between distributions. The Earth Mover's Distance (EMD) has increasingly been employed to compare uncertain data due to its ability to effectively capture the differences between two distributions. Computing the EMD entails finding a solution to the transportation problem, which is computationally intensive. In this paper, we propose a new lower bound to the EMD and an index structure to significantly improve the performance of EMD based K-nearest neighbor (K-NN) queries on uncertain databases. We propose a new lower bound to the EMD that approximates the EMD on a projection vector. Each distribution is projected onto a vector and approximated by a normal distribution, as well as an accompanying error term. We then represent each normal as a point in a Hough transformed space. We then use the concept of stochastic dominance to implement an efficient index structure in the transformed space. We show that our method significantly decreases K-NN query time on uncertain databases. The index structure also scales well with database cardinality. It is well suited for heterogeneous data sets, helping to keep EMD based queries tractable as uncertain data sets become larger and more complex.Comment: VLDB201

    Secure Time-Aware Provenance for Distributed Systems

    Get PDF
    Operators of distributed systems often find themselves needing to answer forensic questions, to perform a variety of managerial tasks including fault detection, system debugging, accountability enforcement, and attack analysis. In this dissertation, we present Secure Time-Aware Provenance (STAP), a novel approach that provides the fundamental functionality required to answer such forensic questions – the capability to “explain” the existence (or change) of a certain distributed system state at a given time in a potentially adversarial environment. This dissertation makes the following contributions. First, we propose the STAP model, to explicitly represent time and state changes. The STAP model allows consistent and complete explanations of system state (and changes) in dynamic environments. Second, we show that it is both possible and practical to efficiently and scalably maintain and query provenance in a distributed fashion, where provenance maintenance and querying are modeled as recursive continuous queries over distributed relations. Third, we present security extensions that allow operators to reliably query provenance information in adversarial environments. Our extensions incorporate tamper-evident properties that guarantee eventual detection of compromised nodes that lie or falsely implicate correct nodes. Finally, the proposed research results in a proof-of-concept prototype, which includes a declarative query language for specifying a range of useful provenance queries, an interactive exploration tool, and a distributed provenance engine for operators to conduct analysis of their distributed systems. We discuss the applicability of this tool in several use cases, including Internet routing, overlay routing, and cloud data processing

    Probabilistic Inductive Querying Using ProbLog

    Get PDF
    We study how probabilistic reasoning and inductive querying can be combined within ProbLog, a recent probabilistic extension of Prolog. ProbLog can be regarded as a database system that supports both probabilistic and inductive reasoning through a variety of querying mechanisms. After a short introduction to ProbLog, we provide a survey of the different types of inductive queries that ProbLog supports, and show how it can be applied to the mining of large biological networks.Peer reviewe

    Knowledge Representation Concepts for Automated SLA Management

    Full text link
    Outsourcing of complex IT infrastructure to IT service providers has increased substantially during the past years. IT service providers must be able to fulfil their service-quality commitments based upon predefined Service Level Agreements (SLAs) with the service customer. They need to manage, execute and maintain thousands of SLAs for different customers and different types of services, which needs new levels of flexibility and automation not available with the current technology. The complexity of contractual logic in SLAs requires new forms of knowledge representation to automatically draw inferences and execute contractual agreements. A logic-based approach provides several advantages including automated rule chaining allowing for compact knowledge representation as well as flexibility to adapt to rapidly changing business requirements. We suggest adequate logical formalisms for representation and enforcement of SLA rules and describe a proof-of-concept implementation. The article describes selected formalisms of the ContractLog KR and their adequacy for automated SLA management and presents results of experiments to demonstrate flexibility and scalability of the approach.Comment: Paschke, A. and Bichler, M.: Knowledge Representation Concepts for Automated SLA Management, Int. Journal of Decision Support Systems (DSS), submitted 19th March 200
    corecore