
    A New Storm Topology for Synopsis Management in the Processing Architecture

    Get PDF
    The Processing Architecture based on Measurement Metadata (PAbMM) is a data stream management system specialized in measurement and evaluation (M&E) projects, which incorporates predictive and detective behavior on data streams. It uses a case-based organizational memory to recommend courses of action for each situation that is detected online and was previously modeled in the project definition. This work describes the Storm topology associated with online processing in PAbMM. Additionally, a new synopsis strategy for monitoring the entities under analysis is presented, and a new schema for training the online classifiers is introduced. The new schema makes it possible to give the classifiers the problem characterization, the proposed solution, and the associated indicator value (target class). A practical case associated with the weather radar of the Experimental Agricultural Station (EAS) INTA Anguil (Province of La Pampa, Argentina) is shown, illustrating the advantages of this Storm topology and of the new training data set schema. Sociedad Argentina de Informática e Investigación Operativa (SADIO)
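    As a rough illustration of the kind of topology the paper describes, the sketch below wires a spout of (entity, value) measurements to a bolt that maintains a per-entity synopsis, using Apache Storm 2.x. All class, field, and stream names are hypothetical stand-ins rather than PAbMM's actual components, and the synopsis is reduced to a running mean for brevity.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SynopsisTopologySketch {

    /** Hypothetical source of (entity, value) measurements, e.g. radar readings. */
    public static class MeasureSpout extends BaseRichSpout {
        private SpoutOutputCollector out;
        private final Random rnd = new Random();

        @Override
        public void open(Map<String, Object> conf, TopologyContext ctx, SpoutOutputCollector out) {
            this.out = out;
        }

        @Override
        public void nextTuple() {
            out.emit(new Values("entity-" + rnd.nextInt(3), rnd.nextGaussian()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("entity", "value"));
        }
    }

    /** Keeps a tiny synopsis (count, running mean) per monitored entity. */
    public static class SynopsisBolt extends BaseBasicBolt {
        private final Map<String, double[]> synopsis = new HashMap<>();

        @Override
        public void execute(Tuple t, BasicOutputCollector out) {
            String entity = t.getStringByField("entity");
            double v = t.getDoubleByField("value");
            double[] s = synopsis.computeIfAbsent(entity, k -> new double[2]);
            s[0]++;                          // count
            s[1] += (v - s[1]) / s[0];       // incremental mean
            out.emit(new Values(entity, s[1]));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("entity", "mean"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder b = new TopologyBuilder();
        b.setSpout("measures", new MeasureSpout());
        // fieldsGrouping keeps all tuples of one entity on one bolt instance,
        // so the per-entity synopsis state stays local.
        b.setBolt("synopsis", new SynopsisBolt(), 4)
         .fieldsGrouping("measures", new Fields("entity"));
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("synopsis-sketch", new Config(), b.createTopology());
            Thread.sleep(10_000);            // let the sketch run briefly
        }
    }
}
```

    The fields grouping on the entity id is the usual Storm idiom for keeping per-key state local to one bolt instance; a real synopsis would track more than a mean.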

    Distributed Processing and Analytics of IoT data in Edge Cloud

    Get PDF
    Sensors of different kinds connect to the IoT network and generate a large number of data streams. We explore the possibility of performing stream processing at the network edge and propose an architecture to do so. This thesis work is based on a prototype solution developed by Nokia. The system operates close to the data sources and retrieves data based on requests made by applications through the system. Processing data close to where it is generated can save bandwidth and assist in decision making. This work proposes a processing component operating at the far edge. The applicability of the prototype solution extended with the proposed processing component is illustrated in three use cases: analysis performed on values of Key Performance Indicators, data streams generated by air quality sensors called Sensordrones, and recognizing car license plates with a deep learning application.
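    To make the bandwidth argument concrete, here is a minimal, self-contained sketch in plain Java (not the Nokia prototype's API; all names are illustrative) of a far-edge component that aggregates raw readings locally and forwards only one summary value per window.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Far-edge aggregator: N raw readings in, 1 summary tuple forwarded upstream. */
public class EdgeAggregator {
    private final int windowSize;
    private final Consumer<Double> upstream;   // e.g., a cloud-bound publisher
    private final List<Double> window = new ArrayList<>();

    public EdgeAggregator(int windowSize, Consumer<Double> upstream) {
        this.windowSize = windowSize;
        this.upstream = upstream;
    }

    /** Called for every raw sensor reading arriving at the edge node. */
    public void onReading(double value) {
        window.add(value);
        if (window.size() == windowSize) {
            double mean = window.stream()
                    .mapToDouble(Double::doubleValue).average().orElse(0);
            upstream.accept(mean);             // the only upstream transmission
            window.clear();
        }
    }

    public static void main(String[] args) {
        EdgeAggregator agg = new EdgeAggregator(60, mean ->
                System.out.println("forwarding 1-minute mean: " + mean));
        for (int i = 0; i < 180; i++) agg.onReading(Math.random());
    }
}
```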

    Workload Management for Data-Intensive Services

    Get PDF
    Data-intensive web services are typically composed of three tiers: i) a display tier that interacts with users and serves rich content to them, ii) a storage tier that stores the user-generated or machine-generated data used to create this content, and iii) an analytics tier that runs data analysis tasks in order to create and optimize new content. Each tier has different workloads and requirements, resulting in a diverse set of systems being used in modern data-intensive web services.

    Servers are provisioned dynamically in the display tier to ensure that interactive client requests are served according to latency and throughput requirements. The challenge is not only deciding automatically how many servers to provision but also when to provision them, while ensuring stable system performance and high resource utilization. To address these challenges, we have developed a new control policy for provisioning resources dynamically in coarse-grained units (e.g., adding or removing servers or virtual machines in cloud platforms). Our new policy, called proportional thresholding, converts a user-specified performance target value into a target range in order to account for the relative effect of provisioning a server on the overall workload performance.

    The storage tier is similar to the display tier in some respects, but poses the additional challenge of needing to redistribute stored data when storage nodes are added or removed. Thus, there will be some delay before the effects of changing a resource allocation appear. Moreover, redistributing data can interfere with the current workload because it uses resources that could otherwise be used for processing requests. We have developed a system, called Elastore, that addresses the new challenges found in the storage tier. Elastore not only coordinates resource allocation and data redistribution to preserve stability during dynamic resource provisioning, but also finds the best tradeoff between workload interference and data redistribution time.

    The workload in the analytics tier consists of data-parallel workflows that can either be run in a batch fashion or continuously as new data becomes available. Each workflow is composed of smaller units that have producer-consumer relationships based on data. These workflows are often generated from declarative specifications in languages like SQL, so there is a need for a cost-based optimizer that can generate an efficient execution plan for a given workflow. Building a cost-based optimizer for data-parallel workflows poses a number of challenges, including characterizing the large execution plan space, developing cost models to estimate execution costs, and efficiently searching for the best execution plan. We have built two cost-based optimizers: Stubby for batch data-parallel workflows running on MapReduce systems, and Cyclops for continuous data-parallel workflows where the choice of execution system is made part of the execution plan space.

    We have conducted a comprehensive evaluation that shows the effectiveness of each tier's automated workload management solution. Dissertation
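    The proportional-thresholding idea lends itself to a small sketch: the control loop below widens a fixed latency target into a range whose width depends on the current allocation, since one server is a large fraction of a small cluster and a small fraction of a large one. The numbers and method names are illustrative assumptions, not the dissertation's code.

```java
/** One-variable control loop illustrating proportional thresholding. */
public class ProportionalThresholding {
    private int servers = 2;
    private final double targetLatencyMs = 100.0;

    /** One control-loop step given the currently measured latency. */
    public void step(double measuredLatencyMs) {
        // Range width shrinks as the cluster grows: one server out of 20
        // changes capacity by ~5%, one out of 2 changes it by ~50%.
        double halfWidth = targetLatencyMs / servers;
        double high = targetLatencyMs + halfWidth;
        double low  = Math.max(0, targetLatencyMs - halfWidth);

        if (measuredLatencyMs > high) {
            servers++;                       // provision one more server
        } else if (measuredLatencyMs < low && servers > 1) {
            servers--;                       // release an under-used server
        }                                    // inside the range: do nothing
        System.out.printf("latency=%.0fms range=[%.0f,%.0f] servers=%d%n",
                measuredLatencyMs, low, high, servers);
    }

    public static void main(String[] args) {
        ProportionalThresholding ctl = new ProportionalThresholding();
        for (double l : new double[]{180, 160, 140, 90, 60, 55}) ctl.step(l);
    }
}
```

    Acting on a range rather than a point target avoids oscillation: a fixed target of exactly 100 ms would trigger a provisioning action on nearly every measurement.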

    A Survey on the Evolution of Stream Processing Systems

    Full text link
    Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state management, fault tolerance, high availability, load management, elasticity, and reconfiguration. We review noteworthy past research findings, outline the similarities and differences between early ('00-'10) and modern ('11-'18) streaming systems, and discuss recent trends and open problems. Comment: 34 pages, 15 figures, 5 tables

    The design and implementation of fuzzy query processing on sensor networks

    Get PDF
    Sensor nodes and Wireless Sensor Networks (WSN) enable observation of the physical world at unprecedented levels of granularity. A growing number of environmental monitoring applications are being designed to leverage the data collection features of WSN, increasing the need for efficient data management techniques and for comparative analysis of various data management techniques. My research leverages aspects of fuzzy databases, specifically fuzzy data representation and fuzzy or flexible queries, to improve upon the efficiency of existing data management techniques by exploiting the inherent uncertainty of the data collected by WSN. Herein I present my research contributions. I provide a classification of WSN middleware to illustrate the varying approaches to data management for WSN. I identify a need to better handle the uncertainty inherent in data collected from physical environments, and to take advantage of the imprecision of that data to increase the efficiency of WSN by requiring that less information be transmitted to adequately answer queries posed by WSN monitoring applications. In this dissertation, I present a novel approach to querying WSN in which semantic knowledge about sensor attributes is represented as fuzzy terms. I present an enhanced simulation environment that supports more flexible and realistic analysis by using cellular automata models to separately model the deployed WSN and the underlying physical environment. Simulation experiments are used to evaluate my fuzzy query approach for environmental monitoring applications. My analysis shows that using fuzzy queries improves upon other data management techniques by reducing the amount of data that needs to be collected to accurately satisfy application requests. This reduction in data transmission results in increased battery life within sensors, an important measure of cost and performance for WSN applications.
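    As a concrete illustration of fuzzy data representation, the sketch below models a linguistic term such as "hot" as a trapezoidal membership function and suppresses transmissions whose membership degree falls below the query threshold. The term boundaries and threshold are illustrative assumptions, not values from the dissertation.

```java
/** A linguistic term as a trapezoidal membership function over raw values. */
public class FuzzyTerm {
    private final double a, b, c, d;   // trapezoid corners: rises a..b, falls c..d

    public FuzzyTerm(double a, double b, double c, double d) {
        this.a = a; this.b = b; this.c = c; this.d = d;
    }

    /** Membership degree in [0,1] for a raw reading x. */
    public double membership(double x) {
        if (x <= a || x >= d) return 0;
        if (x >= b && x <= c) return 1;
        return x < b ? (x - a) / (b - a) : (d - x) / (d - c);
    }

    public static void main(String[] args) {
        FuzzyTerm hot = new FuzzyTerm(25, 30, 40, 45);   // degrees Celsius
        double threshold = 0.7;                          // query: "temperature is hot"
        for (double reading : new double[]{24, 28, 33, 44}) {
            double mu = hot.membership(reading);
            // Transmitting only matching readings is what saves energy.
            System.out.printf("%.0f C -> mu=%.2f %s%n", reading, mu,
                    mu >= threshold ? "TRANSMIT" : "suppress");
        }
    }
}
```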

    A catalog of stream processing optimizations

    Get PDF
    Various research communities have independently arrived at stream processing as a programming model for efficient and parallel computing. These communities include digital signal processing, databases, operating systems, and complex event processing. Since each community faces applications with challenging performance requirements, each of them has developed some of the same optimizations, but often with conflicting terminology and unstated assumptions. This article presents a survey of optimizations for stream processing. It is aimed both at users who need to understand and guide the system's optimizer and at implementers who need to make engineering tradeoffs. To consolidate terminology, this article is organized as a catalog, in a style similar to catalogs of design patterns or refactorings. To make assumptions explicit and to help understand tradeoffs, each optimization is presented with its safety constraints (when does it preserve correctness?) and a profitability experiment (when does it improve performance?). We hope that this survey will help future streaming system builders stand on the shoulders of giants from not just their own community. © 2014 ACM
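    In the catalog's spirit of pairing each optimization with safety constraints and profitability, here is a toy instance of one well-known entry, operator fusion: two stateless per-tuple operators are combined into one pipeline stage, eliminating the intermediate queue hand-off. This is a generic illustration, not code from the article.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.function.DoubleUnaryOperator;

/**
 * Operator fusion sketch. Safety: both operators are stateless per-tuple
 * functions, so applying them in one step preserves the output. Profitability:
 * the fused form skips the queue hand-off between pipeline stages.
 */
public class FusionSketch {
    public static void main(String[] args) throws Exception {
        DoubleUnaryOperator scale = v -> v * 1.8;
        DoubleUnaryOperator shift = v -> v + 32;

        // Unfused: each tuple crosses a queue between the two operators.
        ArrayBlockingQueue<Double> queue = new ArrayBlockingQueue<>(16);
        queue.put(scale.applyAsDouble(100.0));
        double unfused = shift.applyAsDouble(queue.take());

        // Fused: one pipeline stage, no intermediate hand-off.
        DoubleUnaryOperator fused = scale.andThen(shift);
        double fusedResult = fused.applyAsDouble(100.0);

        System.out.println(unfused + " == " + fusedResult);   // 212.0 == 212.0
    }
}
```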

    Distributed Contextual Anomaly Detection from Big Event Streams

    Get PDF
    The age of big digital data has arrived: the volume of data generated through Internet of Things (IoT) and Internet of Everything (IoE) objects grows by the millisecond. Most of today's available data are generated as streams by applications including sensor networks, bioinformatics, smart airports, smart highway traffic, smart home applications, e-commerce online shopping, and social media. In this context, processing and mining such high-volume data streams has become a priority research concern and a challenging task. On the one hand, processing high volumes of streaming data with low-latency response is critical in most real-time applications, before important information is missed or disregarded. On the other hand, detecting events in data streams is a new research challenge, since traditional anomaly detection methods mainly focus on: a) limited data sizes, b) centralised detection with limited computing resources, and c) point or collective anomaly types rather than the contextual behaviour of the data. Thus, detecting contextual events in high-volume data streams is one of the research concerns addressed in this thesis. As IoT data streams scale up to high volumes, existing data processing structures and anomaly detection methods become impractical because of the space and time complexity of the existing processing models and learning algorithms. This thesis proposes a novel distributed anomaly detection method and algorithm to detect contextual behaviours in sequences of bounded streams. The proposed solution first captures event streams and partitions them over several windows to control the high arrival rate, and then applies a parallel and distributed algorithm to detect contextual anomalous events. The experimental results are evaluated with respect to the algorithm's performance, low-latency processing response, and the accuracy of detecting contextual anomalous behaviour in the event streams. Finally, to address the scalability of contextual event detection, appropriate computational metrics are proposed to measure and evaluate the processing latency of the distributed method. The achieved results show that distributed detection is effective at learning from high volumes of streams in real time.
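    A minimal sketch of the contextual idea: each reading is judged against running statistics of its own context (e.g., a sensor id plus a time-of-day bucket) rather than a global model, so a value that is globally unremarkable can still be flagged. The context keys and the 3-sigma rule are assumptions for illustration; the thesis's distributed, windowed algorithm is not reproduced here, though one detector instance per partition would parallelize naturally.

```java
import java.util.HashMap;
import java.util.Map;

/** Flags values that deviate from the statistics of their own context. */
public class ContextualDetector {
    // Per-context running statistics: count, mean, M2 (Welford's method).
    private final Map<String, double[]> stats = new HashMap<>();

    /** Returns true if the value is anomalous within its context. */
    public boolean observe(String context, double value) {
        double[] s = stats.computeIfAbsent(context, k -> new double[3]);
        boolean anomalous = false;
        if (s[0] > 10) {                                 // warm-up before judging
            double std = Math.sqrt(s[2] / (s[0] - 1));
            anomalous = Math.abs(value - s[1]) > 3 * std;
        }
        s[0]++;                                          // update running stats
        double delta = value - s[1];
        s[1] += delta / s[0];
        s[2] += delta * (value - s[1]);
        return anomalous;
    }

    public static void main(String[] args) {
        ContextualDetector d = new ContextualDetector();
        for (int i = 0; i < 50; i++) d.observe("highway-A:rush-hour", 80 + Math.random());
        // 85 is plausible globally but anomalous for this context's tight band.
        System.out.println(d.observe("highway-A:rush-hour", 85));   // true
    }
}
```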

    Graph Processing in Main-Memory Column Stores

    Get PDF
    More and more, novel and traditional business applications leverage the advantages of a graph data model, such as its schema flexibility and its explicit representation of relationships between entities. As a consequence, companies are confronted with the challenge of storing, manipulating, and querying terabytes of graph data for enterprise-critical applications. Although these business applications operate on graph-structured data, they still require direct access to the relational data and typically rely on an RDBMS to keep a single source of truth and access. Existing solutions that perform graph operations on business-critical data either use a combination of SQL and application logic or employ a graph data management system. For the first approach, relying solely on SQL results in poor execution performance, caused by the functional mismatch between typical graph operations and the relational algebra. Worse, graph algorithms expose a tremendous variety in structure and functionality, caused by their often domain-specific implementations, and therefore can hardly be integrated into a database management system other than with custom coding. Since the majority of these enterprise-critical applications run exclusively on relational DBMSs, employing a specialized system for storing and processing graph data is typically not sensible: besides the maintenance overhead of keeping the systems in sync, combining graph and relational operations is hard to realize, as it requires data transfer across system boundaries. Traversal operations are a basic ingredient of graph queries and algorithms, and a fundamental component of any database management system that aims at storing, manipulating, and querying graph data. Well-established graph traversal algorithms are standalone implementations relying on optimized data structures. Integrating graph traversals as an operator into a database management system requires tight integration into the existing database environment and the development of new components, such as a graph topology-aware optimizer with accompanying graph statistics, graph-specific secondary index structures to speed up traversals, and an accompanying graph query language.

    In this thesis, we introduce and describe GRAPHITE, a hybrid graph-relational data management system. GRAPHITE is a performance-oriented graph data management system built into an RDBMS, allowing graph data and relational data to be processed seamlessly in the same system. We propose a columnar storage representation for graph data to leverage the existing, mature data management and query processing infrastructure of relational database management systems. At the core of GRAPHITE we propose an execution engine based solely on set operations and graph traversals. Our design is driven by the observation that different graph topologies impose different algorithmic requirements on the design of a graph traversal operator. We derive two graph traversal implementations targeting the most common graph topologies and demonstrate how graph-specific statistics can be leveraged to select the optimal physical traversal operator. To accelerate graph traversals, we devise a set of graph-specific, updateable secondary index structures that improve the performance of vertex neighborhood expansion. Finally, we introduce a domain-specific language with an intuitive programming model to extend graph traversals with custom application logic at runtime. We use the LLVM compiler framework to generate efficient code that tightly integrates the user-specified application logic with our highly optimized built-in graph traversal operators. Our experimental evaluation shows that GRAPHITE can outperform native graph management systems by several orders of magnitude while providing all the features of an RDBMS, such as transaction support, backup and recovery, and security and user management, effectively providing a promising alternative to specialized graph management systems that lack many of these features and require expensive data replication and maintenance processes.
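    To picture the columnar storage argument, the sketch below stores edges in two sorted columns plus a per-vertex offset array (a CSR-style layout) and runs a breadth-first traversal as scans over contiguous column ranges. It is a generic illustration of the layout, not GRAPHITE's traversal operator.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

/** BFS over a CSR-style columnar edge representation. */
public class ColumnarTraversal {
    public static void main(String[] args) {
        // Edge list (0->1, 0->2, 1->3, 2->3, 3->4) stored column-wise.
        int[] offsets = {0, 2, 3, 4, 5, 5};   // offsets[v]..offsets[v+1] = v's edges
        int[] targets = {1, 2, 3, 3, 4};      // target column, grouped by source

        int[] dist = new int[offsets.length - 1];
        Arrays.fill(dist, -1);
        ArrayDeque<Integer> frontier = new ArrayDeque<>();
        dist[0] = 0;
        frontier.add(0);

        while (!frontier.isEmpty()) {
            int v = frontier.poll();
            // Neighborhood expansion = one contiguous scan of the target column.
            for (int i = offsets[v]; i < offsets[v + 1]; i++) {
                int w = targets[i];
                if (dist[w] == -1) {          // not yet visited
                    dist[w] = dist[v] + 1;
                    frontier.add(w);
                }
            }
        }
        System.out.println(Arrays.toString(dist));   // [0, 1, 1, 2, 3]
    }
}
```

    Because a vertex's outgoing edges occupy a contiguous range of the target column, neighborhood expansion becomes a cache-friendly scan, which is the property a column store can exploit.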