138 research outputs found

    Towards a Cloud Native Big Data Platform using MiCADO

    In the big data era, creating self-managing, scalable platforms for running big data applications is a fundamental task. Such self-managing and self-healing platforms must react properly to hardware (e.g., cluster node) and software (e.g., big data tool) failures, as well as dynamically resize the allocated resources according to overload and underload situations and scaling policies. The distributed and stateful nature of big data platforms (e.g., Hadoop-based clusters) makes managing these platforms a challenging task. This paper aims to design and implement a scalable cloud native Hadoop-based big data platform using MiCADO, an open-source, highly customisable multi-cloud orchestration and auto-scaling framework for Docker containers, orchestrated by Kubernetes. The proposed MiCADO-based big data platform automates deployment and enables automatic horizontal scaling (in and out) of the underlying cloud infrastructure. The empirical evaluation demonstrates how easily, efficiently, and quickly Hadoop clusters of different sizes can be deployed and undeployed. Additionally, it shows how the platform can be scaled automatically based on user-defined policies (such as CPU-based scaling).
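    The abstract does not give MiCADO's policy syntax, but the decision logic behind a CPU-based horizontal scaling rule can be sketched in a few lines of Python. The names and thresholds below (desired_node_count, scale_out_above, scale_in_below) are hypothetical illustrations of such a rule, not MiCADO's actual policy schema:

        def desired_node_count(cpu_utilisation: float, current_nodes: int,
                               min_nodes: int = 1, max_nodes: int = 10,
                               scale_out_above: float = 0.8,
                               scale_in_below: float = 0.3) -> int:
            """Node count a CPU-based policy would request next.

            Scales out when average CPU utilisation exceeds the upper
            threshold, scales in when it drops below the lower one, and
            always stays within the [min_nodes, max_nodes] bounds.
            """
            if cpu_utilisation > scale_out_above:
                return min(current_nodes + 1, max_nodes)
            if cpu_utilisation < scale_in_below:
                return max(current_nodes - 1, min_nodes)
            return current_nodes

    Keeping a gap between the two thresholds prevents the cluster from oscillating between scale-in and scale-out decisions on small utilisation changes.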

    Enabling autoscaling for in-memory storage in cluster computing framework

    IoT-enabled devices and observational instruments continuously generate voluminous data, and a large portion of these datasets is delivered with associated geospatial locations. The increased volume of geospatial data, alongside emerging geospatial services, poses computational challenges for large-scale geospatial analytics. We have designed and implemented STRETCH, an in-memory distributed geospatial storage system that preserves spatial proximity and enables proactive autoscaling for frequently accessed data. STRETCH stores data with a delayed data dispersion scheme that incrementally adds data nodes to the storage system. We have devised an autoscaling feature that proactively repartitions data to alleviate computational hotspots before they occur. We compared the performance of STRETCH with Apache Ignite, and the results show that STRETCH provides up to 3 times the throughput when the system encounters hotspots. STRETCH is built on Apache Spark and Ignite and interacts with them at runtime.
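    The core of proactive autoscaling is acting on predicted rather than observed load. The class below is a minimal sketch of that idea; the class name, the linear extrapolation, and the threshold are illustrative assumptions, not STRETCH's actual design:

        class HotspotPredictor:
            """Flags partitions to repartition before they saturate by
            extrapolating each partition's access rate one interval ahead."""

            def __init__(self, hot_threshold: int):
                self.hot_threshold = hot_threshold   # accesses/interval a node can serve
                self.history: dict[str, list[int]] = {}

            def record(self, partition: str, accesses: int) -> None:
                """Log the access count observed for a partition this interval."""
                self.history.setdefault(partition, []).append(accesses)

            def partitions_to_split(self) -> list[str]:
                """Return partitions whose predicted next-interval rate is too hot."""
                hot = []
                for partition, counts in self.history.items():
                    if len(counts) < 2:
                        continue
                    # Extend the last observed trend one interval forward.
                    predicted = counts[-1] + (counts[-1] - counts[-2])
                    if predicted > self.hot_threshold:
                        hot.append(partition)
                return hot

    Repartitioning when the predicted rate crosses the threshold gives the system one interval of lead time to move data before the hotspot actually materialises.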

    Dynamic Load Balancing and Autoscaling in Distributed Stream Processing Systems

    In the big data world, Hadoop and other batch-processing tools are widely used to analyze data and obtain results within minutes. However, minutes of latency still cannot satisfy the growing need for real-time decisions in many fields, such as live stock and trading feeds in financial services, telecommunications, sensor networks, and online advertisement. Distributed stream processing (DSP) systems aim to process, analyze, and make decisions on the fly over immense quantities of data streams generated dynamically at high rates. As the rates of data streams may vary over time, DSP systems require an architecture that is elastic enough to handle dynamic load. Although many dynamic load balancing and autoscaling techniques for general pull-based distributed systems have been well studied, these solutions cannot be directly applied to DSP systems, because DSP systems are push-based and process data streams with different types of operators, each running on a cluster node. One research problem is to allocate data processing operators on cluster nodes and balance the workload dynamically. Since the data volume and rate can be unpredictable, a static mapping between operators and cluster resources often results in an unbalanced operator load distribution. Furthermore, making a DSP system scalable requires autoscaling at runtime, in which case operators need to be relocated among newly provisioned nodes. The contribution of this thesis is threefold. First, we propose a load-adaptive software layer between a DSP engine and the cluster. The architecture allows an operator to be transferred dynamically to different cluster nodes at runtime while keeping the process transparent to developers. Second, we propose an optimization method that combines the correlation of node resource utilization with cluster capacity to balance load dynamically. Lastly, we design an autoscaling mechanism and algorithm to detect overload and provision nodes at runtime. We implement our design on S4, an open-source DSP engine first developed by Yahoo!. The implementation is evaluated with a top-N topic list application on Twitter streams, using clusters on Amazon Web Services. The results demonstrate a 75.79% improvement in stream processing throughput and a 294.47% improvement in cluster resource utilization.
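    The correlation-based placement idea can be sketched concretely: when relocating an operator, prefer a destination node whose utilisation history is least correlated with the operator's load, so their peaks are unlikely to coincide. The function below is a hypothetical reading of the abstract, not the thesis's implementation; pick_destination, the capacity cutoff, and the equal-length sample series are all assumptions:

        from statistics import StatisticsError, correlation  # Python 3.10+

        def pick_destination(operator_load: list[float],
                             node_utilisation: dict[str, list[float]],
                             capacity: float = 0.8) -> str | None:
            """Choose a node for a relocated operator.

            Skips nodes already running near capacity, then picks the node
            whose utilisation series has the lowest Pearson correlation with
            the operator's load series.
            """
            best_node, best_corr = None, float("inf")
            for node, series in node_utilisation.items():
                if max(series) >= capacity:       # no headroom on this node
                    continue
                try:
                    c = correlation(operator_load, series)
                except StatisticsError:           # constant series: no signal
                    continue
                if c < best_corr:
                    best_node, best_corr = node, c
            return best_node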

    Academic Cloud Computing Research: Five Pitfalls and Five Opportunities

    This discussion paper argues that five fundamental pitfalls restrict academics from conducting cloud computing research at the infrastructure level, which is currently where the vast majority of academic research lies. Instead, academics should be conducting higher-risk research in order to gain understanding and open up entirely new areas. We call for a renewed mindset and argue that academic research should focus less on physical infrastructure and embrace the abstractions provided by clouds through five opportunities: user-driven research, new programming models, PaaS environments, and improved tools to support both elasticity and large-scale debugging. The objective of this paper is to foster discussion and to define a roadmap forward that will allow academia to make a longer-term impact on the cloud computing community. (Accepted and presented at the 6th USENIX Workshop on Hot Topics in Cloud Computing, HotCloud'14.)