    Quality of Service Aware Data Stream Processing for Highly Dynamic and Scalable Applications

    Huge volumes of georeferenced data streams arrive daily at data stream management systems deployed to serve highly scalable and dynamic applications. There are innumerable ways in which those loads can be exploited to gain deep insights in various domains. Decision makers require interactive visualization of such data in the form of maps and dashboards for decision making and strategic planning. Data streams normally exhibit fluctuating arrival rates and skewness, the two predominant factors that affect the overall quality of service. Data stream management systems must therefore be attuned to those factors, as well as to the spatial shape of the data, which may amplify their negative impact. Current systems do not natively support services with quality guarantees for dynamic scenarios, leaving the handling of those logistics to the user, which is challenging and cumbersome. Three workloads are predominant for any data stream: batch processing, scalable storage, and stream processing. In this thesis, we have designed a quality-of-service-aware system, SpatialDSMS, which comprises several subsystems that cover those workloads and any mixed workload that results from combining them. Most importantly, we have natively incorporated quality-of-service optimizations for processing avalanches of georeferenced data streams in highly dynamic application scenarios. This has been achieved transparently on top of the codebases of emerging de facto standard, best-in-class systems, thus relieving users in the presentation layer from having to reason about those services. Instead, users express their queries with quality goals, and our system optimizer compiles them into query plans with an embedded quality guarantee, leaving the logistics to the underlying layers.
We have developed standard-compliant prototypes for all the subsystems that constitute SpatialDSMS.

    MASTAQ: A Middleware Architecture for Sensor Applications with Statistical Quality Constraints

    We present the design goals and functional components of MASTAQ, a data management middleware for pervasive applications that utilize sensor data. MASTAQ allows applications to specify their quality-of-information (QoI) preferences (in terms of statistical metrics over the data) independent of the underlying network topology. It then achieves energy efficiency by adaptively activating and querying only the subset of sensor nodes needed to meet the target QoI bounds. We also present a closed-loop feedback mechanism based on broadcasting of activation probabilities, which allows MASTAQ to activate the appropriate number of sensors without requiring any inter-sensor coordination or knowledge of the actual deployment.
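
    The closed-loop idea can be illustrated with a minimal sketch. It is not MASTAQ's actual controller: the proportional update rule, the fixed target count of active sensors (standing in for whatever the QoI bound requires), and the names `update_probability`, `sensor_is_active`, and `run_rounds` are all assumptions made for illustration.

```python
import random


def update_probability(p: float, active: int, target: int) -> float:
    """Scale p by target/observed and clamp to [0, 1] (assumed control rule)."""
    scaled = p * target / max(active, 1)
    return min(1.0, max(0.0, scaled))


def sensor_is_active(p: float, rng: random.Random) -> bool:
    """Each sensor decides locally, using only the broadcast probability."""
    return rng.random() < p


def run_rounds(num_sensors: int, target: int, rounds: int, seed: int = 0) -> float:
    """Simulate the feedback loop and return the last broadcast probability."""
    rng = random.Random(seed)
    p = 1.0  # start with every sensor active
    for _ in range(rounds):
        # The base station only observes how many sensors responded.
        active = sum(sensor_is_active(p, rng) for _ in range(num_sensors))
        p = update_probability(p, active, target)
    return p


final_p = run_rounds(num_sensors=1000, target=50, rounds=20)
```

    The base station never needs to know how many sensors are deployed or where they are; it only counts responses and rebroadcasts a corrected probability.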

    Adaptive Filters for Continuous Queries over Distributed Data Streams

    We consider an environment where distributed data sources continuously stream updates to a centralized processor that monitors continuous queries over the distributed data. Significant communication overhead is incurred in the presence of rapid update streams, and we propose a new technique for reducing the overhead. Users register continuous queries with precision requirements at the central stream processor, which installs filters at remote data sources. The filters adapt to changing conditions to minimize stream rates while guaranteeing that all continuous queries still receive the updates necessary to provide answers of adequate precision at all times. Our approach enables applications to trade precision for communication overhead at a fine granularity by individually adjusting the precision constraints of continuous queries over streams in a multi-query workload
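
    A minimal sketch of a source-side filter follows, assuming the filter is a numeric interval of a given width that is recentered whenever a value escapes it. The class name `BoundFilter`, the recentering rule, and the sample readings are illustrative assumptions; the adaptive adjustment of widths across sources described in the abstract is not shown.

```python
from dataclasses import dataclass


@dataclass
class BoundFilter:
    """Source-side filter: suppress updates that stay inside [low, high]."""
    width: float              # bound width assigned by the processor (assumed)
    low: float = 0.0
    high: float = 0.0
    initialized: bool = False

    def offer(self, value: float) -> bool:
        """Return True if the update must be sent to the central processor."""
        if self.initialized and self.low <= value <= self.high:
            return False  # the installed bound still covers the value
        # Recenter the bound on the new value and report it.
        self.low = value - self.width / 2
        self.high = value + self.width / 2
        self.initialized = True
        return True


def count_sent(values, width):
    """Count how many updates a source would transmit for a given width."""
    f = BoundFilter(width=width)
    return sum(f.offer(v) for v in values)


readings = [10.0, 10.4, 10.9, 11.2, 12.5, 12.6, 9.0]
narrow_sent = count_sent(readings, width=1.0)  # tighter precision, more messages
wide_sent = count_sent(readings, width=4.0)    # looser precision, fewer messages
```

    On this toy trace the narrow filter transmits 4 of the 7 readings and the wide filter transmits 3, which shows the precision-for-communication trade in miniature.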

    The Homeostasis Protocol: Avoiding Transaction Coordination Through Program Analysis

    Datastores today rely on distribution and replication to achieve improved performance and fault-tolerance. But correctness of many applications depends on strong consistency properties - something that can impose substantial overheads, since it requires coordinating the behavior of multiple nodes. This paper describes a new approach to achieving strong consistency in distributed systems while minimizing communication between nodes. The key insight is to allow the state of the system to be inconsistent during execution, as long as this inconsistency is bounded and does not affect transaction correctness. In contrast to previous work, our approach uses program analysis to extract semantic information about permissible levels of inconsistency and is fully automated. We then employ a novel homeostasis protocol to allow sites to operate independently, without communicating, as long as any inconsistency is governed by appropriate treaties between the nodes. We discuss mechanisms for optimizing treaties based on workload characteristics to minimize communication, as well as a prototype implementation and experiments that demonstrate the benefits of our approach on common transactional benchmarks
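
    The notion of a treaty can be illustrated with a hand-written example. This sketch assumes a single global invariant (total stock never goes negative) and a treaty that gives each site a local allowance; in the paper treaties are derived automatically by program analysis, which is not attempted here. The names `Site`, `renegotiate`, and `place_order` are hypothetical.

```python
class Site:
    def __init__(self, name: str, allowance: int):
        self.name = name
        self.allowance = allowance  # units this site may sell without coordinating

    def try_sell(self) -> bool:
        """Sell one unit locally if the treaty (allowance > 0) still holds."""
        if self.allowance > 0:
            self.allowance -= 1
            return True
        return False


def renegotiate(sites, requester) -> int:
    """Coordination round: pool remaining allowances, requester served first."""
    remaining = sum(s.allowance for s in sites)
    ordered = [requester] + [s for s in sites if s is not requester]
    share, extra = divmod(remaining, len(ordered))
    for i, s in enumerate(ordered):
        s.allowance = share + (1 if i < extra else 0)
    return remaining


def place_order(site, sites, stats) -> None:
    """Try locally first; communicate only when the local treaty is violated."""
    if not site.try_sell():
        stats["syncs"] += 1
        if renegotiate(sites, site) == 0 or not site.try_sell():
            stats["rejected"] += 1
            return
    stats["sold"] += 1


a, b = Site("a", 5), Site("b", 5)
stats = {"sold": 0, "rejected": 0, "syncs": 0}
for site in [a] * 8 + [b] * 3:
    place_order(site, [a, b], stats)
```

    Eleven orders against ten units of stock result in ten sales, one rejection, and only two coordination rounds; every other order is handled without communication.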

    Optimizing Notifications of Subscription-Based Forecast Queries

    Integrating sophisticated statistical methods into database management systems is gaining more and more attention in research and industry. One important statistical method is time series forecasting, which is crucial for decision management in many domains. In this context, previous work addressed the processing of ad-hoc and recurring forecast queries. In contrast, we focus on subscription-based forecast queries that arise when an application (subscriber) continuously requires forecast values for further processing. Forecast queries exhibit the unique characteristic that the underlying forecast model is updated with each new actual value, so better forecast values might become available. However, (re-)sending new forecast values to the subscriber for every new value is infeasible because this can cause significant overhead at the subscriber side. The subscriber therefore wishes to be notified only when forecast values have changed in a way that is relevant to the application. In this paper, we reduce the costs of the subscriber by optimizing the notifications sent to the subscriber, i.e., by balancing the number of notifications and the notification length. We introduce a generic cost model to capture arbitrary subscriber cost functions and discuss different optimization approaches that reduce the subscriber costs while keeping the deviations of forecast values within given constraints. Our experimental evaluation on real datasets shows the validity of our approach with low computational costs.

    Quality and context-aware smart health care: Evaluating the cost-quality dynamics

    Many emerging pervasive health-care applications require the determination of a variety of context attributes of an individual's activities and medical parameters and her surrounding environment. Context is a high-level representation of an entity's state, which captures activities, relationships, capabilities, etc. In practice, high-level context measures are often difficult to sense from a single data source and must instead be inferred using multiple sensors embedded in the environment. A key challenge in deploying context-driven health-care applications involves energy-efficient determination or inference of high-level context information from low-level sensor data streams. Because this abstraction has the potential to reduce the quality of the context information, it is also necessary to model the tradeoff between the cost of sensor data collection and the quality of the inferred context. This article describes a model of context inference in pervasive computing, the associated research challenges, and the significant practical impact of intelligent use of such context in pervasive health-care environments.

    Analyzing the Impact of RDF Graph Structure on Dataset Search: A Case Study with ACORDAR

    RDF plays a central role in the era of the Semantic Web, enabling a precise, structured representation of datasets and their relationships. The complex nature of RDF graph structures significantly influences the retrieval of datasets, offering a blend of both challenges and possibilities. Delving deeply into the ACORDAR case study, the work shows how graph structures influence dataset retrieval and the organization of data. Furthermore, it introduces serialization methods within RDF, emphasizing the importance of URIs and the advanced capabilities of SPARQL. In addressing the reproducibility of ACORDAR, the research underscores the significance of metadata in dataset search. Finally, it outlines avenues for future research on improving dataset search by enriching the analysis with information drawn from graph structures and by harnessing emerging technologies from the Semantic Web era.

    Trade-off among timeliness, messages and accuracy for large-scale information management

    The increasing amount of data and the number of nodes in large-scale environments require new techniques for information management. Examples of such environments are the decentralized infrastructures of Computational Grid and Computational Cloud applications. These large-scale applications need different kinds of aggregated information, such as resource monitoring, resource discovery, or economic information. The challenge of providing timely and accurate information in large-scale environments arises from the distribution of the information. Delays in distributed information systems are caused by long transmission times due to the distribution, by churn, and by failures. A problem of large applications such as peer-to-peer (P2P) systems is the increasing retrieval time of the information due to the decentralization of the data and the proneness to failure. However, many applications need timely information provision, and some users and applications accept inaccurate results if the information is delivered in time. Another problem is the increasing network consumption when the application scales to millions of users and data. Using approximation techniques reduces the retrieval time and the network consumption, but it also decreases the accuracy of the results. Thus, the remaining problem is to offer a trade-off that resolves the conflicting requirements of fast information retrieval, accurate results, and low messaging cost. Our goal is a self-adaptive decision mechanism that offers a trade-off among the retrieval time, the network consumption, and the accuracy of the result. Self-adaptation enables distributed software to modify its behavior based on changes in the operating environment. In large-scale information systems that use hierarchical data aggregation, we apply self-adaptation to control the approximation used for information retrieval and thereby reduce the network consumption and the retrieval time.
The hypothesis of the thesis is that approximation techniques can reduce the retrieval time and the network consumption while guaranteeing the accuracy of the results and respecting user-defined priorities. First, the presented research addresses the trade-off among timely information retrieval, accurate results, and low messaging cost by proposing a summarization algorithm for resource discovery in P2P content networks. After identifying how summarization can improve the discovery process, we propose an algorithm that uses a precision-recall metric to compare the accuracy and to offer a user-driven trade-off. Second, we propose an algorithm that applies self-adaptive decision making on each node. Each node decides whether to prune the query and return the result or to continue the query. Pruning reduces the retrieval time and the network consumption at the cost of lower accuracy compared with continuing the query. The algorithm uses an analytic hierarchy process to assess the user's priorities and to propose a trade-off that satisfies the accuracy requirements with a low message cost and a short delay. A quantitative analysis evaluates the presented algorithms with a simulator, which is fed with real data of a network topology and the nodes' attributes. Using a simulator instead of the prototype allows the evaluation at a large scale of several thousand nodes. The algorithm for content summarization is evaluated with half a million resources and with different query types. The self-adaptive algorithm is evaluated with a simulator of several thousand nodes that are created from real data. A qualitative analysis addresses the integration of the simulator's components into existing market frameworks for Computational Grid and Cloud applications; the implemented functionality of the simulator, such as the aggregation process and the query language, is thereby verified through integration into prototypes. The proposed content summarization algorithm reduces the information retrieval time from a logarithmic increase to a constant factor.
Furthermore, the message size is reduced significantly by applying the summarization technique. For the user, a precision-recall metric allows defining the relation between the retrieval time and the accuracy. The self-adaptive algorithm reduces the number of messages needed from an exponential increase to a constant factor. At the same time, the retrieval time is reduced to a constant factor under an increasing number of nodes. Finally, the algorithm delivers the data with the required accuracy, adjusting the depth of the query according to the network conditions. The introduced algorithms are promising for information aggregation in future large-scale information management systems.

    CAPS: Energy-Efficient Processing of Continuous Aggregate Queries in Sensor Networks

    In this paper, we design and evaluate an energy-efficient data retrieval architecture for continuous aggregate queries in wireless sensor networks. We show how the modification of precision in one sensor affects the sample-reporting frequency of other sensors, and how the precisions of a group of sensors may be collectively modified to achieve the target Quality of Information (QoI) with higher energy efficiency. The proposed Collective Adaptive Precision Setting (CAPS) architecture is then extended to exploit the observed temporal correlation among successive sensor samples for even greater energy efficiency. Detailed simulations with synthetic and real data traces demonstrate how the combination of weak consistency semantics and temporal correlation can dramatically lower the energy consumption in practical sensor environments.