Distributed service orchestration : eventually consistent cloud operation and integration
Both researchers and industry players face the same obstacles when entering the big data field. Deploying and testing distributed data technologies requires a large up-front investment of both time and knowledge. Existing cloud automation solutions are not well suited for managing complex distributed data solutions. This paper proposes a distributed service orchestration architecture to better handle the complex orchestration logic needed in these cases. A novel service-engine based approach is proposed to cope with the versatility of the individual components. A hybrid integration approach bridges the gap between cloud modeling languages, automation artifacts, image-based schedulers and PaaS solutions. This approach is integrated in the distributed data experimentation platform Tengu, making it more flexible and robust.
Automated and Portable Hadoop Cluster Orchestration on Clouds with Occopus for Big Data Applications
LDM: Lineage-Aware Data Management in Multi-tier Storage Systems
We design and develop LDM, a novel data management solution that caters to the needs of applications exhibiting the lineage property, i.e. those in which current writes are future reads. In this class of applications, slow writes significantly hurt the overall performance of jobs, because current writes determine the fate of subsequent reads. We believe that in a large scale shared production cluster, the issues associated with data management can be mitigated at a much higher layer in the hierarchy of the I/O path, even before data access requests are made. Contrary to current data management solutions, which are mostly reactive and/or based on heuristics, LDM is both deterministic and proactive. We develop block-graphs, which enable LDM to capture the complete time-based data-task dependency associations and use them to perform life-cycle management through tiering of data blocks. LDM amalgamates information from the entire data center ecosystem, from the application code to file system mappings, the compute and storage device topology, etc., to make oracle-like deterministic data management decisions. In trace-driven experiments, LDM achieves a 29–52% reduction in overall data center workload execution time. Moreover, deploying LDM with extensive pre-processing creates efficient data consumption pipelines, which also significantly reduces write and read delays.
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several follow-up works since its introduction. This article
provides a comprehensive survey of a family of approaches and mechanisms for
large scale data processing that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both research and industrial communities. We also cover a set of
introduced systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.
Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors
VM-MAD: a cloud/cluster software for service-oriented academic environments
The availability of powerful computing hardware in IaaS clouds makes cloud
computing attractive also for computational workloads that were up to now
almost exclusively run on HPC clusters.
In this paper we present the VM-MAD Orchestrator software: an open source
framework for cloudbursting Linux-based HPC clusters into IaaS clouds as well as
computational grids. The Orchestrator is completely modular, allowing flexible
configurations of cloudbursting policies. It can be used with any batch system
or cloud infrastructure, dynamically extending the cluster when needed. A
distinctive feature of our framework is that the policies can be tested and
tuned in a simulation mode based on historical or synthetic cluster accounting
data.
In the paper we also describe how the VM-MAD Orchestrator was used in a
production environment at the FGCZ to speed up the analysis of mass
spectrometry-based protein data by cloudbursting to Amazon EC2. The
advantages of this hybrid system are shown with a large evaluation run using
about one hundred large EC2 nodes.
Comment: 16 pages, 5 figures. Accepted at the International Supercomputing
Conference ISC13, June 17--20 Leipzig, Germany
Towards a Cloud Native Big Data Platform using MiCADO
In the big data era, creating self-managing scalable platforms for running big data applications is a fundamental task. Such self-managing and self-healing platforms must react properly to hardware failures (e.g., cluster nodes) and software failures (e.g., big data tools), and must dynamically resize the allocated resources based on overload and underload situations and scaling policies. The distributed and stateful nature of big data platforms (e.g., Hadoop-based clusters) makes managing them a challenging task. This paper aims to design and implement a scalable cloud native Hadoop-based big data platform using MiCADO, an open-source and highly customisable multi-cloud orchestration and auto-scaling framework for Docker containers, orchestrated by Kubernetes. The proposed MiCADO-based big data platform automates the deployment and enables automatic horizontal scaling (in and out) of the underlying cloud infrastructure. The empirical evaluation of the MiCADO-based big data platform demonstrates how easy, efficient, and fast it is to deploy and undeploy Hadoop clusters of different sizes. Additionally, it shows how the platform can be scaled automatically based on user-defined policies (such as CPU-based scaling).