10 research outputs found

    Oort: User-Centric Cloud Storage with Global Queries

    Get PDF
    In principle, the web should provide the perfect stage for user-generated content, allowing users to share their data seamlessly with other users across services and applications. In practice, the web fragments a user's data over many sites, each exposing only limited APIs for sharing. This paper describes Oort, a new cloud storage system that organizes data primarily by user rather than by application or web site. Oort allows users to choose which web software to use with their data and which other users to share it with, while giving applications powerful tools to query that data. Users rent space from providers that cooperate to provide a global, federated, general-purpose storage system. To support large-scale, multi-user applications such as Twitter and e-mail, Oort provides global queries that find and combine data from relevant users across all providers. Oort makes global query execution efficient by recognizing and merging similar queries issued by many users' application instances, largely eliminating the per-user factor in the global complexity of queries. Our evaluation predicts that an Oort implementation could handle traffic similar to that seen by Twitter using a hundred cooperating Oort servers, and that applications with other sharing patterns, like e-mail, can also be executed efficiently

    Efficient Probabilistic Subsumption Checking for Content-Based Publish/Subscribe Systems

    Get PDF
    Abstract. Efficient subsumption checking, deciding whether a subscription or publication is covered by a set of previously defined subscriptions, is of paramount importance for publish/subscribe systems. It provides the core system functionality—matching of publications to subscriber needs expressed as subscriptions—and additionally, reduces the overall system load and generated traffic since the covered subscriptions are not propagated in distributed environments. As the subsumption problem was shown previously to be co-NP complete and existing solutions typically apply pairwise comparisons to detect the subsumption relationship, we propose a ‘Monte Carlo type ’ probabilistic algorithm for the general subsumption problem. It determines whether a publication/subscription is covered by a disjunction of subscriptions in O(k md), wherek is the number of subscriptions, m is the number of distinct attributes in subscriptions, and d is the number of tests performed to answer a subsumption question. The probability of error is problem-specific and typically very small, and sets an upper bound on d. Our experimental results show significant gains in term of subscription set reduction which has favorable impact on the overall system performance as it reduces the total computational costs and networking traffic. Furthermore, the expected theoretical bounds underestimate algorithm performance because it performs much better in practice due to introduced optimizations, and is adequate for fast forwarding of subscriptions in case of high subscription rate.

    Top-k/w publish/subscribe: finding k most relevant publications in sliding time window w

    Get PDF
    Existing content-based publish/subscribe systems are designed assuming that all matching publications are equally relevant to a subscription. As we cannot know in advance the distribution of publication content, the following two unwanted situations are highly possible: a subscriber either receives too many or only few publications. In this paper we present a new publish/subscribe model which is based on the sliding window computation model. Our model assumes that publications have different relevance to a subscription. In the model, a subscriber receives k most relevant publications published within a time window w, where k and w are parameters defined per each subscription. As a consequence, the arrival rate of incoming relevant publications per subscription is predefined, and does not depend on the publication rate. Since all relevant objects (i.e. publications in our case) cannot be kept in main memory, existing solutions immediately discard less relevant objects, and store only a small representative set for subsequent delivery. In this paper we develop a probabilistic criterion to decide upon the arrival of a new object whether it may become the top-k object at some future point in time and should thus be stored in a special publications queue. We show that by accepting typically very small probability of error, the queue length is reasonably small and does not significantly depend on publication rate. Furthermore, we experimentally evaluate our approach to demonstrate its applicability in practice

    Fast Probabilistic Subsumption Checking for Publish/Subscribe Systems

    Get PDF
    Efficient subsumption checking, deciding whether a subscription or publication is subsumed (covered) by a set of previously defined subscriptions, is of paramount importance for publish/subscribe systems. It provides the core system functionality, and additionally, reduces the overall system load and generated traffic in distributed environments. As the deterministic solution was shown previously to be co-NP complete and existing solutions typically employ costly pairwise comparisons to detect the subsumption relationship, we propose a probabilistic algorithm for the general subsumption problem. It efficiently determines whether a publication/subscription is covered by a disjunction of subscriptions in O(k m d)O(k~m~d), where kk is the number of subscriptions, mm is the number of distinct attributes in subscriptions, and dd is the number of tests performed to answer a subsumption question. The probability of error is problem specific and typically very small, and determines an upper bound on dd in polynomial time prior to the algorithm execution. Our experimental results demonstrate the algorithm performs even better in practice due to introduced optimizations, and is adequate for fast forwarding of publications/subscriptions, especially in resource scarce environments, e.g. sensor networks

    S3-TM: scalable streaming short text matching

    Get PDF
    Micro-blogging services have become major venues for information creation, as well as channels of information dissemination. Accordingly, monitoring them for relevant information is a critical capability. This is typically achieved by registering content-based subscriptions with the micro-blogging service. Such subscriptions are long-running queries that are evaluated against the stream of posts. Given the popularity and scale of micro-blogging services like Twitter and Weibo, building a scalable infrastructure to evaluate these subscriptions is a challenge. To address this challenge, we present the S3-TM system for streaming short text matching. S3-TM is organized as a stream processing application, in the form of a data parallel flow graph designed to be run on a data center environment. It takes advantage of the structure of the publications (posts) and subscriptions to perform the matching in a scalable manner, without broadcasting publications or subscriptions to all of the matcher instances. The basic design of S3^33-TM uses a scoped multicast for publications and scoped anycast for subscriptions. To further improve throughput, we introduce publication routing algorithms that aim at minimizing the scope of the multicasts. First set of algorithms we develop are based on partitioning the word co-occurrence frequency graph, with the aim of routing posts that include commonly co-occurring words to a small set of matchers. While effective, these algorithms fell short in balancing the load. To address this, we develop the SALB algorithm, which provides better load balance by modeling the load more accurately using the word-to-post bipartite graph. We also develop a subscription placement algorithm, called LASP, to group together similar subscriptions, in order to minimize the subscription matching cost. Furthermore, to achieve good scalability for increasing number of nodes, we introduce techniques to handle workload skew. Finally, we introduce load shedding techniques for handling unexpected load spikes with small impact on the accuracy. Our experimental results show that S3-TM is scalable. Furthermore, the SALB algorithm provides more than 2.5× throughput compared to the baseline multicast and outperforms the graph partitioning-based approaches. © 2015, Springer-Verlag Berlin Heidelberg

    Design and Implementation of a Middleware for Uniform, Federated and Dynamic Event Processing

    Get PDF
    In recent years, real-time processing of massive event streams has become an important topic in the area of data analytics. It will become even more important in the future due to cheap sensors, a growing amount of devices and their ubiquitous inter-connection also known as the Internet of Things (IoT). Academia, industry and the open source community have developed several event processing (EP) systems that allow users to define, manage and execute continuous queries over event streams. They achieve a significantly better performance than the traditional store-then-process'' approach in which events are first stored and indexed in a database. Because EP systems have different roots and because of the lack of standardization, the system landscape became highly heterogenous. Today's EP systems differ in APIs, execution behaviors and query languages. This thesis presents the design and implementation of a novel middleware that abstracts from different EP systems and provides a uniform API, execution behavior and query language to users and developers. As a consequence, the presented middleware overcomes the problem of vendor lock-in and different EP systems are enabled to cooperate with each other. In practice, event streams differ dramatically in volume and velocity. We show therefore how the middleware can connect to not only different EP systems, but also database systems and a native implementation. Emerging applications such as the IoT raise novel challenges and require EP to be more dynamic. We present extensions to the middleware that enable self-adaptivity which is needed in context-sensitive applications and those that deal with constantly varying sets of event producers and consumers. Lastly, we extend the middleware to fully support the processing of events containing spatial data and to be able to run distributed in the form of a federation of heterogenous EP systems

    Community-Based Intrusion Detection

    Get PDF
    Today, virtually every company world-wide is connected to the Internet. This wide-spread connectivity has given rise to sophisticated, targeted, Internet-based attacks. For example, between 2012 and 2013 security researchers counted an average of about 74 targeted attacks per day. These attacks are motivated by economical, financial, or political interests and commonly referred to as “Advanced Persistent Threat (APT)” attacks. Unfortunately, many of these attacks are successful and the adversaries manage to steal important data or disrupt vital services. Victims are preferably companies from vital industries, such as banks, defense contractors, or power plants. Given that these industries are well-protected, often employing a team of security specialists, the question is: How can these attacks be so successful? Researchers have identified several properties of APT attacks which make them so efficient. First, they are adaptable. This means that they can change the way they attack and the tools they use for this purpose at any given moment in time. Second, they conceal their actions and communication by using encryption, for example. This renders many defense systems useless as they assume complete access to the actual communication content. Third, their actions are stealthy — either by keeping communication to the bare minimum or by mimicking legitimate users. This makes them “fly below the radar” of defense systems which check for anomalous communication. And finally, with the goal to increase their impact or monetisation prospects, their attacks are targeted against several companies from the same industry. Since months can pass between the first attack, its detection, and comprehensive analysis, it is often too late to deploy appropriate counter-measures at businesses peers. Instead, it is much more likely that they have already been attacked successfully. This thesis tries to answer the question whether the last property (industry-wide attacks) can be used to detect such attacks. It presents the design, implementation and evaluation of a community-based intrusion detection system, capable of protecting businesses at industry-scale. The contributions of this thesis are as follows. First, it presents a novel algorithm for community detection which can detect an industry (e.g., energy, financial, or defense industries) in Internet communication. Second, it demonstrates the design, implementation, and evaluation of a distributed graph mining engine that is able to scale with the throughput of the input data while maintaining an end-to-end latency for updates in the range of a few milliseconds. Third, it illustrates the usage of this engine to detect APT attacks against industries by analyzing IP flow information from an Internet service provider. Finally, it introduces a detection algorithm- and input-agnostic intrusion detection engine which supports not only intrusion detection on IP flow but any other intrusion detection algorithm and data-source as well
    corecore