100 research outputs found

    Multi-Tenant Geo-Distributed Data Analytics

    Get PDF
    University of Minnesota Ph.D. dissertation. July 2019. Major: Computer Science. Advisors: Abhishek Chandra, Jon Weissman. 1 computer file (PDF); x, 132 pages.Geo-distributed data analytics has gained much interest in recent years due to the need for extracting insights from geo-distributed data. Traditionally, data analytics has been done within a cluster/data center environment. However, analyzing geo-distributed data using existing cluster-based systems typically cannot satisfy the timeliness requirement of most applications and result in wasteful resource consumption due to the fundamental differences of the environments, especially due to the scarce, highly heterogeneous, and dynamic nature of the wide-area resources: compute power and network bandwidth. This thesis addresses the challenges faced by geo-distributed data analytics systems in ensuring high-performance and reliable execution of multiple data analytics applications/queries. Specifically, the focus is on sharing resources across multiple users, applications, and computing frameworks. Sharing resources is attractive as it increases resource utilization and reduces operational cost. However, ensuring high-performance execution of multiple applications in a shared environment is challenging as they may compete for the same resources, especially in a wide-area environment with scarce resources. Furthermore, dynamics such as workload variation, resource variation, stragglers, and failures are inevitable in large-scale distributed systems. These can cause large resource perturbation that significantly affect the performance of query executions. This thesis makes the following contributions. First, we present a resource sharing technique across multiple geo-distributed data analytics frameworks. The main challenge here is how to elastically partition resources while allowing high locality scheduling to each individual framework, which is critical to the execution performance of geo-distributed analytics queries. We then address the problem of how to identify and exploit common executions across multiple queries to mitigate wasteful resource consumption. We demonstrate that traditional multi-query optimization may degrade the overall query execution performance due to its lack of support for network awareness. Finally, we highlight the importance of adaptability in ensuring reliable query execution in the presence of dynamics, both for single and multiple query executions. We propose a systematic approach that can selectively determine which queries to adapt and how to adapt them based on the types of queries, dynamics, and optimization goals

    Semantic model-driven framework for validating quality requirements of Internet of Things streaming data

    Get PDF
    The rise of Internet of Things has provided platforms mostly enhanced by real-time data-driven services for reactive services and Smart Cities innovations. However, IoT streaming data are known to be compromised by quality problems, thereby influencing the performance and accuracy of IoT-based reactive services or Smart applications. This research investigates the suitability of the semantic approach for the run-time validation of IoT streaming data for quality problems. To realise this aim, Semantic IoT Streaming Data Validation with its framework (SISDaV) is proposed. The novel approach involves technologies for semantic query and reasoning with semantic rules defined on an established relationship with external data sources with consideration for specific run-time events that can influence the quality of streams. The work specifically targets quality issues relating to inconsistency, plausibility, and incompleteness in IoT streaming data. In particular, the investigation covers various RDF stream processing and rule-based reasoning techniques and effects of RDF Serialised formats on the reasoning process. The contributions of the work include the hierarchy of IoT data stream quality problem, lightweight evolving Smart Space and Sensor Measurement Ontology, generic time-aware validation rules and, SISDaV framework- a unified semantic rule-based validation system for RDF-based IoT streaming data that combines the popular RDF stream processing the system with generic enhanced time-aware rules. The semantic validation process ensures the conformance of the raw streaming data value produced by the IoT node(s) with IoT streaming data quality requirements and the expected value. This is facilitated through a set of generic continuous validation rules, which has been realised by extending the popular Jena rule syntax with a time element. The comparative evaluation of SISDaV is based on its effectiveness and efficiency based on the expressivity of the different serialised RDF data formats. The results are interpreted with relevant statistical estimations and performance metrics. The results from the evaluation approve of the feasibility of the framework in terms of containing the semantic validation process within the interval between reads of sensor nodes as well as provision of additional requirements that can enhance IoT streaming data processing systems which are currently missing in most related state-of-art RDF stream processing systems. Furthermore, the approach can satisfy the main research objectives as identified by the study

    Efficient data reconfiguration for today's cloud systems

    Get PDF
    Performance of big data systems largely relies on efficient data reconfiguration techniques. Data reconfiguration operations deal with changing configuration parameters that affect data layout in a system. They could be user-initiated like changing shard key, block size in NoSQL databases, or system-initiated like changing replication in distributed interactive analytics engine. Current data reconfiguration schemes are heuristics at best and often do not scale well as data volume grows. As a result, system performance suffers. In this thesis, we show that {\it data reconfiguration mechanisms can be done in the background by using new optimal or near-optimal algorithms coupling them with performant system designs}. We explore four different data reconfiguration operations affecting three popular types of systems -- storage, real-time analytics and batch analytics. In NoSQL databases (storage), we explore new strategies for changing table-level configuration and for compaction as they improve read/write latencies. In distributed interactive analytics engines, a good replication algorithm can save costs by judiciously using memory that is sufficient to provide the highest throughput and low latency for queries. Finally, in batch processing systems, we explore prefetching and caching strategies that can improve the number of production jobs meeting their SLOs. All these operations happen in the background without affecting the fast path. Our contributions in each of the problems are two-fold -- 1) we model the problem and design algorithms inspired from well-known theoretical abstractions, 2) we design and build a system on top of popular open source systems used in companies today. Finally, using real-life workloads, we evaluate the efficacy of our solutions. Morphus and Parqua provide several 9s of availability while changing table level configuration parameters in databases. By halving memory usage in distributed interactive analytics engine, Getafix reduces cost of deploying the system by 10 million dollars annually and improves query throughput. We are the first to model the problem of compaction and provide formal bounds on their runtime. Finally, NetCachier helps 30\% more production jobs to meet their SLOs compared to existing state-of-the-art

    Handling Emergent Conflicts in Adaptable Rule-based Sensor Networks

    Get PDF
    This thesis presents a study into conflicts that emerge amongst sensor device rules when such devices are formed into networks. It describes conflicting patterns of communication and computation that can disturb the monitoring of subjects, and lower the quality of service. Such conflicts can negatively affect the lifetimes of the devices and cause incorrect information to be reported. A novel approach to detecting and resolving conflicts is presented. The approach is considered within the context of home-based psychiatric Ambulatory Assessment (AA). Rules are considered that can be used to control the behaviours of devices in a sensor network for AA. The research provides examples of rule conflict that can be found for AA sensor networks. Sensor networks and AA are active areas of research and many questions remain open regarding collaboration amongst collections of heterogeneous devices to collect data, process information in-network, and report personalised findings. This thesis presents an investigation into reliable rule-based service provisioning for a variety of stakeholders, including care providers, patients and technicians. It contributes a collection of rules for controlling AA sensor networks. This research makes a number of contributions to the field of rule-based sensor networks, including areas of knowledge representation, heterogeneous device support, system personalisation, and in particular, system reliability. This thesis provides evidence to support the conclusion that conflicts can be detected and resolved in adaptable rule-based sensor networks

    On Improving Efficiency of Data-Intensive Applications in Geo-Distributed Environments

    Get PDF
    Distributed systems are pervasively demanded and adopted in nowadays for processing data-intensive workloads since they greatly accelerate large-scale data processing with scalable parallelism and improved data locality. Traditional distributed systems initially targeted computing clusters but have since evolved to data centers with multiple clusters. These systems are mostly built on top of homogeneous, tightly integrated resources connected in high-speed local-area networks (LANs), and typically require data to be ingested to a central data center for processing. Today, with enormous volumes of data continuously generated from geographically distributed locations, direct adoption of such systems is prohibitively inefficient due to the limited system scalability and high cost for centralizing the geo-distributed data over the wide-area networks (WANs). More commonly, it becomes a trend to build geo-distributed systems wherein data processing jobs are performed on top of geo-distributed, heterogeneous resources in proximity to the data at vastly distributed geo-locations. However, critical challenges and mechanisms for efficient execution of data-intensive applications in such geo-distributed environments are unclear by far. The goal of this dissertation is to identify such challenges and mechanisms, by extensively using the research principles and methodology of conventional distributed systems to investigate the geo-distributed environment, and by developing new techniques to tackle these challenges and run data-intensive applications with efficiency at scale. The contributions of this dissertation are threefold. Firstly, the dissertation shows that the high level of resource heterogeneity exhibited in the geo-distributed environment undermines the scalability of geo-distributed systems. Virtualization-based resource abstraction mechanisms have been introduced to abstract the hardware, network, and OS resources throughout the system, to mitigate the underlying resource heterogeneity and enhance the system scalability. Secondly, the dissertation reveals the overwhelming performance and monetary cost incurred by indulgent data sharing over the WANs in geo-distributed systems. Network optimization approaches, including linear- programming-based global optimization, greedy bin-packing heuristics, and TCP enhancement, are developed to optimize the network resource utilization and circumvent unnecessary expenses imposed on data sharing in WANs. Lastly, the dissertation highlights the importance of data locality for data-intensive applications running in the geo-distributed environment. Novel data caching and locality-aware scheduling techniques are devised to improve the data locality.Doctor of Philosoph

    A Globally Distributed System for Job, Data, and Information Handling for High Energy Physics

    Full text link

    EXPRESS: Resource-oriented and RESTful Semantic Web services

    No full text
    This thesis investigates an approach that simplifies the development of Semantic Web services (SWS) by removing the need for additional semantic descriptions.The most actively researched approaches to Semantic Web services introduce explicit semantic descriptions of services that are in addition to the existing semantic descriptions of the service domains. This increases their complexity and design overhead. The need for semantically describing the services in such approaches stems from their foundations in service-oriented computing, i.e. the extension of already existing service descriptions. This thesis demonstrates that adopting a resource-oriented approach based on REST will, in contrast to service-oriented approaches, eliminate the need for explicit semantic service descriptions and service vocabularies. This reduces the development efforts while retaining the significant functional capabilities.The approach proposed in this thesis, called EXPRESS (Expressing RESTful Semantic Services), utilises the similarities between REST and the Semantic Web, such as resource realisation, self-describing representations, and uniform interfaces. The semantics of a service is elicited from a resource’s semantic description in the domain ontology and the semantics of the uniform interface, hence eliminating the need for additional semantic descriptions. Moreover, stub-generation is a by-product of the mapping between entities in the domain ontology and resources.EXPRESS was developed to test the feasibility of eliminating explicit service descriptions and service vocabularies or ontologies, to explore the restrictions placed on domain ontologies as a result, to investigate the impact on the semantic quality of the description, and explore the benefits and costs to developers. To achieve this, an online demonstrator that allows users to generate stubs has been developed. In addition, a matchmaking experiment was conducted to show that the descriptions of the services are comparable to OWL-S in terms of their ability to be discovered, while improving the efficiency of discovery. Finally, an expert review was undertaken which provided evidence of EXPRESS’s simplicity and practicality when developing SWS from scratch

    Scheme for identifying and describing behavioral innovations embodied in computer programs

    Get PDF
    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1996.Includes bibliographical references (p. 177-180).by Sean Sang-Chul Pak.M.Eng
    • …
    corecore