342 research outputs found

    S-Store: Streaming Meets Transaction Processing

    Get PDF
    Stream processing addresses the needs of real-time applications. Transaction processing addresses the coordination and safety of short atomic computations. Heretofore, these two modes of operation existed in separate, stove-piped systems. In this work, we attempt to fuse the two computational paradigms in a single system called S-Store. In this way, S-Store can simultaneously accommodate OLTP and streaming applications. We present a simple transaction model for streams that integrates seamlessly with a traditional OLTP system. We chose to build S-Store as an extension of H-Store, an open-source, in-memory, distributed OLTP database system. By implementing S-Store in this way, we can make use of the transaction processing facilities that H-Store already supports, and we can concentrate on the additional implementation features that are needed to support streaming. Similar implementations could be done using other main-memory OLTP platforms. We show that we can actually achieve higher throughput for streaming workloads in S-Store than an equivalent deployment in H-Store alone. We also show how this can be achieved within H-Store with the addition of a modest amount of new functionality. Furthermore, we compare S-Store to two state-of-the-art streaming systems, Spark Streaming and Storm, and show how S-Store matches and sometimes exceeds their performance while providing stronger transactional guarantees

    Efficiently correlating complex events over live and archived data streams

    Get PDF
    Correlating complex events over live and archived data streams, which we call Pattern Correlation Queries (PCQs), provides many benefits for domains which need real-time forecasting of events or identification of causal dependencies, while handling data at high rates and in massive amounts, like in financial or medical settings. Existing work has focused either on complex event processing over a single type of stream source (i.e., either live or archived), or on simple stream correlation queries (e.g., live events trigerring a database lookup). In this paper, we specifically focus on recency-based PCQs and provide clear, useful, and optimizable semantics for them. PCQs raise a number of challenges in optimizing data management and query processing, which we address in the setting of the DejaVu complex event processing system. More specifically, we propose three complementary optimizations including recent in-put buffering, query result caching, and join source ordering. Fur-thermore, we capture the relevant query processing tradeoffs in a cost model. An extensive performance study on synthetic and real-life data sets not only validates this cost model, but also shows that our optimizations are very effective, achieving more than two orders magnitude throughput improvement and much better scala-bility compared to a conventional approach

    Explainable and Resource-Efficient Stream Processing Through Provenance and Scheduling

    Get PDF
    In our era of big data, information is captured at unprecedented volumes and velocities, with technologies such as Cyber-Physical Systems making quick decisions based on the processing of streaming, unbounded datasets. In such scenarios, it can be beneficial to process the data in an online manner, using the stream processing paradigm implemented by Stream Processing Engines (SPEs). While SPEs enable high-throughput, low-latency analysis, they are faced with challenges connected to evolving deployment scenarios, like the increasing use of heterogeneous, resource-constrained edge devices together with cloud resources and the increasing user expectations for usability, control, and resource-efficiency, on par with features provided by traditional databases.This thesis tackles open challenges regarding making stream processing more user-friendly, customizable, and resource-efficient. The first part outlines our work, providing high-level background information, descriptions of the research problems, and our contributions. The second part presents our three state-of-the-art frameworks for explainable data streaming using data provenance, which can help users of streaming queries to identify important data points, explain unexpected behaviors, and aid query understanding and debugging. (A) GeneaLog provides backward provenance allowing users to identify the inputs that contributed to the generation of each output of a streaming query. (B) Ananke is the first framework to provide a duplicate-free graph of live forward provenance, enabling easy bidirectional tracing of input-output relationships in streaming queries and identifying data points that have finished contributing to results. (C) Erebus is the first framework that allows users to define expectations about the results of a streaming query, validating whether these expectations are met or providing explanations in the form of why-not provenance otherwise. The third part presents techniques for execution efficiency through custom scheduling, introducing our state-of-the-art scheduling frameworks that control resource allocation and achieve user-defined performance goals. (D) Haren is an SPE-agnostic user-level scheduler that can efficiently enforce user-defined scheduling policies. (E) Lachesis is a standalone scheduling middleware that requires no changes to SPEs but, instead, directly guides the scheduling decisions of the underlying Operating System. Our extensive evaluations using real-world SPEs and workloads show that our work significantly improves over the state-of-the-art while introducing only small performance overheads

    WikiSensing: A collaborative sensor management system with trust assessment for big data

    Get PDF
    Big Data for sensor networks and collaborative systems have become ever more important in the digital economy and is a focal point of technological interest while posing many noteworthy challenges. This research addresses some of the challenges in the areas of online collaboration and Big Data for sensor networks. This research demonstrates WikiSensing (www.wikisensing.org), a high performance, heterogeneous, collaborative data cloud for managing and analysis of real-time sensor data. The system is based on the Big Data architecture with comprehensive functionalities for smart city sensor data integration and analysis. The system is fully functional and served as the main data management platform for the 2013 UPLondon Hackathon. This system is unique as it introduced a novel methodology that incorporates online collaboration with sensor data. While there are other platforms available for sensor data management WikiSensing is one of the first platforms that enable online collaboration by providing services to store and query dynamic sensor information without any restriction of the type and format of sensor data. An emerging challenge of collaborative sensor systems is modelling and assessing the trustworthiness of sensors and their measurements. This is with direct relevance to WikiSensing as an open collaborative sensor data management system. Thus if the trustworthiness of the sensor data can be accurately assessed, WikiSensing will be more than just a collaborative data management system for sensor but also a platform that provides information to the users on the validity of its data. Hence this research presents a new generic framework for capturing and analysing sensor trustworthiness considering the different forms of evidence available to the user. It uses an extensible set of metrics that can represent such evidence and use Bayesian analysis to develop a trust classification model. Based on this work there are several publications and others are at the final stage of submission. Further improvement is also planned to make the platform serve as a cloud service accessible to any online user to build up a community of collaborators for smart city research.Open Acces

    From Social Data Mining to Forecasting Socio-Economic Crisis

    Full text link
    Socio-economic data mining has a great potential in terms of gaining a better understanding of problems that our economy and society are facing, such as financial instability, shortages of resources, or conflicts. Without large-scale data mining, progress in these areas seems hard or impossible. Therefore, a suitable, distributed data mining infrastructure and research centers should be built in Europe. It also appears appropriate to build a network of Crisis Observatories. They can be imagined as laboratories devoted to the gathering and processing of enormous volumes of data on both natural systems such as the Earth and its ecosystem, as well as on human techno-socio-economic systems, so as to gain early warnings of impending events. Reality mining provides the chance to adapt more quickly and more accurately to changing situations. Further opportunities arise by individually customized services, which however should be provided in a privacy-respecting way. This requires the development of novel ICT (such as a self- organizing Web), but most likely new legal regulations and suitable institutions as well. As long as such regulations are lacking on a world-wide scale, it is in the public interest that scientists explore what can be done with the huge data available. Big data do have the potential to change or even threaten democratic societies. The same applies to sudden and large-scale failures of ICT systems. Therefore, dealing with data must be done with a large degree of responsibility and care. Self-interests of individuals, companies or institutions have limits, where the public interest is affected, and public interest is not a sufficient justification to violate human rights of individuals. Privacy is a high good, as confidentiality is, and damaging it would have serious side effects for society.Comment: 65 pages, 1 figure, Visioneer White Paper, see http://www.visioneer.ethz.c

    Big Data

    Get PDF
    Η εργασία στοχεύει στην ανάλυση της αγοράς των μεγάλων δεδομένων, Περιλαμβάνονται οι πάροχοι μαζί με κάποιες ενδιαφέρουσες περιπτώσεις χρήσης.Nowadays, term big data, draws a lot of attention, both for Business and person perspective. For decades, companies have been making business decisions through its Business Intelligence department, based on transactional data which were basically stored in relational databases. However, regulatory compliance, increased competition, and other pressures have created an insatiable need for companies to accumulate and analyze large, fast-growing quantities of data that was beyond the critical data

    A Heartbeat Mechanism and its Application in Gigascope.

    Get PDF
    Abstract Data stream management systems often rely on ordering properties of tuple attributes in order to implement non-blocking operators. However, query operators that work with multiple streams, such as stream merge or join, can often still block if one of the input stream is very slow or bursty. In principle, punctuation and heartbeat mechanisms have been proposed to unblock streaming operators. In practice, it is a challenge to incorporate such mechanisms into a highperformance stream management system that is operational in an industrial application. In this paper, we introduce a system for punctuation-carrying heartbeat generation that we developed for Gigascope, a high-performance streaming database for network monitoring, that is operationally used within AT&T's IP backbone. We show how heartbeats can be regularly generated by low-level nodes in query execution plans and propagated upward unblocking all streaming operators on its way. Additionally, our heartbeat mechanism can be used for other applications in distributed settings such as detecting node failures, performance monitoring, and query optimization. A performance evaluation using live data feeds shows that our system is capable of working at multiple Gigabit line speeds in a live, industrial deployment and can significantly decrease the query memory utilization

    Big Data Now, 2015 Edition

    Get PDF
    Now in its fifth year, O’Reilly’s annual Big Data Now report recaps the trends, tools, applications, and forecasts we’ve talked about over the past year. For 2015, we’ve included a collection of blog posts, authored by leading thinkers and experts in the field, that reflect a unique set of themes we’ve identified as gaining significant attention and traction. Our list of 2015 topics include: Data-driven cultures Data science Data pipelines Big data architecture and infrastructure The Internet of Things and real time Applications of big data Security, ethics, and governance Is your organization on the right track? Get a hold of this free report now and stay in tune with the latest significant developments in big data

    Processing Exact Results for Queries over Data Streams

    Get PDF
    In a growing number of information-processing applications, such as network-traffic monitoring, sensor networks, financial analysis, data mining for e-commerce, etc., data takes the form of continuous data streams rather than traditional stored databases/relational tuples. These applications have some common features like the need for real time analysis, huge volumes of data, and unpredictable and bursty arrivals of stream elements. In all of these applications, it is infeasible to process queries over data streams by loading the data into a traditional database management system (DBMS) or into main memory. Such an approach does not scale with high stream rates. As a consequence, systems that can manage streaming data have gained tremendous importance. The need to process a large number of continuous queries over bursty, high volume online data streams, potentially in real time, makes it imperative to design algorithms that should use limited resources. This dissertation focuses on processing exact results for join queries over high speed data streams using limited resources, and proposes several novel techniques for processing join queries incorporating secondary storages and non-dedicated computers. Existing approaches for stream joins either, (a) deal with memory limitations by shedding loads, and therefore can not produce exact or highly accurate results for the stream joins over data streams with time varying arrivals of stream tuples, or (b) suffer from large I/O-overheads due to random disk accesses. The proposed techniques exploit the high bandwidth of a disk subsystem by rendering the data access pattern largely sequential, eliminating small, random disk accesses. This dissertation proposes an I/O-efficient algorithm to process hybrid join queries, that join a fast, time varying or bursty data stream and a persistent disk relation. Such a hybrid join is the crux of a number of common transformations in an active data warehouse. Experimental results demonstrate that the proposed scheme reduces the response time in output results by exploiting spatio-temporal locality within the input stream, and minimizes disk overhead through disk-I/O amortization. The dissertation also proposes an algorithm to parallelize a stream join operator over a shared-nothing system. The proposed algorithm distributes the processing loads across a number of independent, non-dedicated nodes, based on a fixed or predefined communication pattern; dynamically maintains the degree of declustering in order to minimize communication and processing overheads; and presents mechanisms for reducing storage and communication overheads while scaling over a large number of nodes. We present experimental results showing the efficacy of the proposed algorithms
    corecore