
    Big Data and Large-scale Data Analytics: Efficiency of Sustainable Scalability and Security of Centralized Clouds and Edge Deployment Architectures

    One of the significant shifts in next-generation computing technologies will certainly be the development of Big Data (BD) deployment architectures. Apache Hadoop, the BD landmark, has evolved into a widely deployed BD operating system. Its new features include a federation structure and many associated frameworks, which give Hadoop 3.x the maturity to serve different markets. This dissertation addresses two leading issues in exploiting BD and large-scale data analytics on the Hadoop platform: (i) scalability, which directly affects system performance and overall throughput, addressed using portable Docker containers; and (ii) security, which spreads the adoption of data protection practices among practitioners, addressed using access controls. An Enhanced MapReduce Environment (EME), an OPportunistic and Elastic Resource Allocation (OPERA) scheduler, a BD Federation Access Broker (BDFAB), and a Secure Intelligent Transportation System (SITS) with a multi-tier architecture for streaming data to the cloud are the main contributions of this thesis.
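    As an illustrative aside, scaling a containerized Hadoop deployment of the kind this dissertation targets can be driven programmatically. Below is a minimal sketch using the Docker SDK for Python; the image name, network name, and scaling logic are assumptions for illustration, not details from the thesis.

```python
import docker

# Minimal sketch: elastically add Hadoop worker containers.
# Assumptions: a Docker daemon is reachable, an "apache/hadoop:3" image
# is available, and a user-defined "hadoop-net" network already exists.
client = docker.from_env()

def scale_workers(count):
    """Start `count` additional worker containers on the cluster network."""
    workers = []
    for i in range(count):
        c = client.containers.run(
            "apache/hadoop:3",          # assumed image name
            name=f"hadoop-worker-{i}",  # illustrative naming scheme
            network="hadoop-net",       # assumed pre-created network
            detach=True,
        )
        workers.append(c)
    return workers

if __name__ == "__main__":
    for w in scale_workers(3):
        print(w.name, w.status)
```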

    Performance modelling and optimization for video-analytic algorithms in a cloud-like environment using machine learning

    CCTV cameras produce a large amount of video surveillance data per day, and analysing it requires significant computing resources that often need to be scalable. The emergence of the Hadoop distributed processing framework has had a significant impact on various data-intensive applications, as distributed processing increases the processing capability of the applications it serves. Hadoop is an open source implementation of the MapReduce programming model. It automates the creation of tasks for each function, distributes data, parallelizes execution, and handles machine failures, relieving users of the complexity of managing the underlying processing so they can focus on building their applications. In practical deployments, the challenge of a Hadoop-based architecture is that it requires several scalable machines for effective processing, which adds hardware investment cost to the infrastructure. A cloud infrastructure offers scalable and elastic utilization of resources, where users can scale the number of Virtual Machines (VMs) up or down as required; however, a user such as a CCTV system operator intending to use a public cloud would wish to know what cloud resources (i.e., how many VMs) to deploy so that processing completes in the fastest (or within a known time constraint) and most cost-effective manner. Often such resources must also satisfy practical, procedural, and legal requirements. The capability to model a distributed processing architecture in which resource requirements can be effectively and optimally predicted would therefore be a useful tool. The literature offers no clear and comprehensive modelling framework that provides proactive resource allocation mechanisms to satisfy a user's target requirements, especially for a processing-intensive application such as video analytics. In this thesis, aiming to close this research gap, the research first examines the current legal practices and requirements for implementing a video surveillance system within a distributed processing and data storage environment, since the legal validity of data gathered or processed within such a system is vital to the system's applicability in these domains. The thesis then presents a comprehensive framework for the performance modelling and optimization of resource allocation when deploying a scalable distributed video analytic application in a Hadoop-based framework running on a virtualized cluster of machines. The proposed modelling framework investigates several machine learning algorithms, namely decision trees (M5P, RepTree), Linear Regression, the Multi-Layer Perceptron (MLP), and the Ensemble Classifier Bagging model, to model and predict the execution time of video analytic jobs based on infrastructure-level as well as job-level parameters. Further, to allocate resources under constraints and obtain optimal performance in terms of job execution time, a Genetic Algorithm (GA) based optimization technique is proposed. Experimental results demonstrate the proposed framework's capability to predict the job execution time of a given video analytic task from infrastructure and input data related parameters, and its ability to determine the minimum job execution time given constraints on these parameters.
Given the above, the thesis contributes to the state of the art in distributed video analytics design, implementation, performance analysis, and optimisation.
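    To make the prediction step concrete, here is a minimal sketch of the general approach on synthetic data. The thesis uses Weka-style learners (M5P, RepTree); scikit-learn's DecisionTreeRegressor is used here as a rough analogue, and the feature names and data-generating formula are illustrative assumptions, not taken from the thesis.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Illustrative features: [num_VMs, cores_per_VM, input_video_minutes],
# target: job execution time in seconds. All values are synthetic.
rng = np.random.default_rng(0)
X = rng.uniform([2, 1, 10], [16, 8, 120], size=(200, 3))
y = 50 * X[:, 2] / (X[:, 0] * X[:, 1]) + rng.normal(0, 5, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeRegressor(max_depth=5).fit(X_train, y_train)

# Held-out accuracy, then a prediction for 8 VMs x 4 cores on 60 min of video.
print("R^2 on test set:", model.score(X_test, y_test))
print("predicted execution time (s):", model.predict([[8, 4, 60]]))
```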

    Toward High-Performance Computing and Big Data Analytics Convergence: The Case of Spark-DIY

    Convergence between high-performance computing (HPC) and big data analytics (BDA) is currently an established research area that has spawned new opportunities for unifying the platform layer and data abstractions in these ecosystems. This work presents an architectural model that enables the interoperability of established BDA and HPC execution models, reflecting the key design features that interest both the HPC and BDA communities, and including an abstract data collection and operational model that generates a unified interface for hybrid applications. This architecture can be implemented in different ways depending on the process- and data-centric platforms of choice and the mechanisms put in place to effectively meet the requirements of the architecture. The Spark-DIY platform is introduced in the paper as a prototype implementation of the proposed architecture. It preserves the interfaces and execution environment of the popular BDA platform Apache Spark, making it compatible with any Spark-based application and tool, while providing efficient communication and kernel execution via DIY, a powerful communication pattern library built on top of MPI. Spark-DIY is then analyzed in terms of performance by building a representative use case from the hydrogeology domain, EnKF-HGS. This application is a clear example of how current HPC simulations are evolving toward hybrid HPC-BDA applications, integrating HPC simulations within a BDA environment. This work was supported in part by the Spanish Ministry of Economy, Industry and Competitiveness under Grant TIN2016-79637-P (Toward Unification of HPC and Big Data Paradigms), in part by the Spanish Ministry of Education under Grant FPU15/00422 (Training Program for Academic and Teaching Staff), in part by the Advanced Scientific Computing Research program, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357, and in part by the DOE under Agreement DE-DC000122495, Program Manager Laura Biven.
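    Since Spark-DIY preserves Spark's interfaces, a standard Spark application should need no code changes to run on it. The sketch below is an ordinary PySpark job exercising only the stock Spark API; it makes no assumptions about Spark-DIY internals, which is precisely the compatibility claim.

```python
from pyspark.sql import SparkSession

# A plain Spark job: nothing here is Spark-DIY specific. Spark-DIY keeps
# the Spark API while swapping the communication/kernel-execution layer
# (DIY over MPI) underneath.
spark = SparkSession.builder.appName("spark-diy-compatible").getOrCreate()

data = spark.sparkContext.parallelize(range(1_000_000))
# Simple numeric kernel expressed as map + reduce: sum of squares.
result = data.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(result)

spark.stop()
```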

    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)

    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016), Timisoara, Romania, February 8-11, 2016. The PhD Symposium was a very good opportunity for young researchers to share information and knowledge, to present their current research, and to discuss topics with other students in order to look for synergies and common research topics. The idea was very successful, and the assessment made by the PhD students was very good. The symposium also helped to achieve one of the major goals of the NESUS Action: to establish an open European research network targeting sustainable solutions for ultrascale computing, aiming at cross-fertilization among HPC, large-scale distributed systems, and big data management, as well as training; the network helps bring together disparate researchers working across different areas and provides a meeting ground to exchange ideas, identify synergies, and pursue common activities in research topics such as sustainable software solutions (applications and system software stack), data management, energy efficiency, and resilience. European Cooperation in Science and Technology (COST).

    Cloud computing resource scheduling and a survey of its evolutionary approaches

    A disruptive technology fundamentally transforming the way computing services are delivered, cloud computing offers information and communication technology users a new dimension of convenience, providing resources as services via the Internet. Because the cloud provides a finite pool of virtualized, on-demand resources, scheduling them optimally has become an essential and rewarding topic, where a trend of using Evolutionary Computation (EC) algorithms is emerging rapidly. Through analyzing the cloud computing architecture, this survey first presents a taxonomy of cloud resource scheduling at two levels. It then paints a landscape of the scheduling problem and its solutions. Following the taxonomy, a comprehensive survey of state-of-the-art approaches is presented systematically. Looking forward, challenges and potential future research directions are investigated, including real-time scheduling, adaptive dynamic scheduling, large-scale scheduling, multiobjective scheduling, and distributed and parallel scheduling. At the dawn of Industry 4.0, cloud computing scheduling for cyber-physical integration in the presence of big data is also discussed. Research in this area is only in its infancy, but with the rapid fusion of information and data technology, more exciting and agenda-setting topics are likely to emerge on the horizon.
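    As a toy illustration of the EC approach the survey covers, the sketch below evolves a task-to-VM assignment that minimizes makespan. It is a generic genetic algorithm over invented task costs and VM speeds, not an algorithm from any particular surveyed paper.

```python
import random

random.seed(0)
TASK_COST = [4, 7, 2, 9, 5, 3, 8, 6]   # synthetic task runtimes
VM_SPEED = [1.0, 1.5, 2.0]             # synthetic VM speed factors

def makespan(assign):
    """Finish time of the busiest VM under a task-to-VM assignment."""
    load = [0.0] * len(VM_SPEED)
    for task, vm in enumerate(assign):
        load[vm] += TASK_COST[task] / VM_SPEED[vm]
    return max(load)

def evolve(pop_size=30, generations=100, mutation=0.1):
    pop = [[random.randrange(len(VM_SPEED)) for _ in TASK_COST]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=makespan)               # fitness = low makespan
        survivors = pop[:pop_size // 2]      # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, len(TASK_COST))  # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < mutation:             # point mutation
                child[random.randrange(len(child))] = \
                    random.randrange(len(VM_SPEED))
            children.append(child)
        pop = survivors + children
    return min(pop, key=makespan)

best = evolve()
print(best, makespan(best))
```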

    Adaptive Speculation for Efficient Internetware Application Execution in Clouds

    Modern Cloud computing systems are massive in scale, featuring environments that can execute highly dynamic Internetware applications with huge numbers of interacting tasks. This has led to a substantial challenge: the straggler problem, whereby a small subset of slow tasks significantly impedes parallel job completion. This problem results in longer service responses, degraded system performance, and late timing failures that can easily threaten Quality of Service (QoS) compliance. Speculative execution (or speculation) is the prominent method deployed in Clouds to tolerate stragglers by creating task replicas at runtime. The method detects stragglers by using a predefined threshold to calculate the difference between individual task progress and the average task progression within a job. However, such a static threshold debilitates speculation effectiveness, as it fails to capture the intrinsic diversity of timing constraints in Internetware applications, as well as dynamic environmental factors such as resource utilization. By considering such characteristics, different levels of strictness for replica creation can be imposed to adaptively achieve specified levels of QoS for different applications. In this paper we present an algorithm that improves the execution efficiency of Internetware applications by dynamically calculating the straggler threshold, considering key parameters including job QoS timing constraints, task execution progress, and optimal system resource utilization. We implement this dynamic straggler threshold in the YARN architecture to evaluate its effectiveness against existing state-of-the-art solutions. Results demonstrate that the proposed approach can reduce parallel job response times by up to 20% compared to the static threshold, and achieves a higher speculation success rate of up to 66.67%, against 16.67% for the static method.
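    For intuition, here is a minimal sketch of a progress-based straggler detector with a dynamically adjusted threshold. The weighting formula and parameter names are invented for illustration and are not the paper's actual algorithm.

```python
def dynamic_threshold(base, qos_slack, cluster_util,
                      slack_weight=0.5, util_weight=0.3):
    """Relax the straggler threshold when QoS slack is large or the
    cluster is busy (replicas are expensive); tighten it otherwise.
    All inputs are assumed normalized to [0, 1]."""
    # Illustrative formula, not the paper's.
    return base * (1 + slack_weight * qos_slack + util_weight * cluster_util)

def find_stragglers(progress_scores, threshold):
    """Flag tasks whose progress lags the job average by more than threshold."""
    avg = sum(progress_scores) / len(progress_scores)
    return [i for i, p in enumerate(progress_scores) if avg - p > threshold]

progress = [0.9, 0.85, 0.88, 0.4, 0.91]      # per-task progress in [0, 1]
thr = dynamic_threshold(base=0.2, qos_slack=0.1, cluster_util=0.7)
print(find_stragglers(progress, thr))         # -> [3]
```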

    Virtual Machine Flow Analysis Using Host Kernel Tracing

    Cloud computing has gained popularity as it offers services at lower cost with a pay-per-use model, unlimited storage through distributed storage systems, and flexible computational power through direct hardware access. Virtualization technology allows a physical server to be shared between several isolated virtualized environments by deploying a hypervisor layer on top of the hardware. As a result, each isolated environment can run with its own OS and applications without mutual interference. With the growth of cloud usage and the use of virtualization, performance understanding and debugging are becoming a serious challenge for cloud providers. Offering good QoS and high availability are expected to be salient features of cloud computing. Nonetheless, the possible reasons behind performance degradation in VMs are numerous: (a) heavy load of an application inside the VM; (b) contention with other applications inside the same VM; (c) contention with other co-located VMs; (d) cloud platform failures. The first two cases can be managed by the VM owner, while the other cases need to be solved by the infrastructure provider. One key requirement for such a complex environment, with different virtualization layers, is a precise, low-overhead analysis tool. In this thesis, we present a host-based, precise method to recover the execution flow of virtualized environments, regardless of the level of nested virtualization. To avoid security issues, ease deployment, and reduce execution overhead, our method limits its data collection to the hypervisor level. To analyse the behavior of each VM, we use a lightweight tracing tool called the Linux Trace Toolkit Next Generation (LTTng) [1]. LTTng is optimised for high-throughput tracing with low overhead, thanks to the lock-free synchronization mechanisms used to update the trace buffer content.
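    As a rough illustration of host-level data collection of this kind, the sketch below drives the standard lttng command-line client from Python to record scheduler and KVM-related kernel tracepoints on the host. The session name is illustrative, and the exact tracepoints available depend on the kernel; this is not the thesis's tooling, only a plausible starting point.

```python
import subprocess

def run(cmd):
    """Run an lttng CLI command and fail loudly if it errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Create a kernel tracing session on the host (hypervisor level only).
run(["lttng", "create", "vm-flow"])          # session name is illustrative
# Record scheduler switches and KVM entry/exit kernel tracepoints.
run(["lttng", "enable-event", "--kernel", "sched_switch"])
run(["lttng", "enable-event", "--kernel", "kvm_entry,kvm_exit"])
run(["lttng", "start"])

input("Tracing... press Enter to stop.")

run(["lttng", "stop"])
run(["lttng", "destroy"])    # the recorded trace stays on disk for analysis
```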

    Algorithms for advance bandwidth reservation in media production networks

    Media production generally requires many geographically distributed actors (e.g., production houses, broadcasters, advertisers) to exchange huge amounts of raw video and audio data. Traditional distribution techniques, such as dedicated point-to-point optical links, are highly inefficient in terms of installation time and cost. To improve efficiency, shared media production networks that connect all involved actors over a large geographical area are currently being deployed. The traffic in such networks is often predictable, as the timing and bandwidth requirements of data transfers are generally known hours or even days in advance. As such, the use of advance bandwidth reservation (AR) can greatly increase resource utilization and cost efficiency. In this paper, we propose an Integer Linear Programming formulation of the bandwidth scheduling problem that takes into account the specific characteristics of media production networks. Two novel optimization algorithms based on this model are thoroughly evaluated and compared by means of in-depth simulation results.
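    To give a flavour of such a formulation, here is a deliberately tiny ILP sketch using the PuLP library: transfers are assigned bandwidth in time slots on a single shared link so that every transfer completes before its deadline. The slot granularity, capacities, demands, and deadlines are invented, and the paper's real model is far richer (topology, routing, per-link capacities).

```python
import pulp

SLOTS = range(6)                      # coarse time slots (invented)
CAPACITY = 10                         # link capacity per slot (invented units)
# (demand in capacity-units, deadline slot) per transfer -- synthetic.
TRANSFERS = {"t1": (12, 3), "t2": (8, 5), "t3": (10, 5)}

prob = pulp.LpProblem("advance_reservation", pulp.LpMinimize)
# x[t][s] = bandwidth of transfer t reserved in slot s (integer units).
x = {t: [pulp.LpVariable(f"x_{t}_{s}", 0, CAPACITY, cat="Integer")
         for s in SLOTS] for t in TRANSFERS}

for t, (demand, deadline) in TRANSFERS.items():
    # Each transfer must move all of its data by its deadline...
    prob += pulp.lpSum(x[t][s] for s in SLOTS if s <= deadline) >= demand
    # ...and gets no allocation after the deadline.
    for s in SLOTS:
        if s > deadline:
            prob += x[t][s] == 0

for s in SLOTS:
    # The shared link's per-slot capacity cannot be exceeded.
    prob += pulp.lpSum(x[t][s] for t in TRANSFERS) <= CAPACITY

# Minimize total reserved bandwidth (a stand-in objective).
prob += pulp.lpSum(x[t][s] for t in TRANSFERS for s in SLOTS)
prob.solve(pulp.PULP_CBC_CMD(msg=False))

for t in TRANSFERS:
    print(t, [int(v.value()) for v in x[t]])
```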

    Virtual Cluster Management for Analysis of Geographically Distributed and Immovable Data

    Thesis (Ph.D.) - Indiana University, Informatics and Computing, 2015. Scenarios exist in the era of Big Data where computational analysis needs to utilize widely distributed and remote compute clusters, especially when the data sources are sensitive or extremely large, and thus unable to move. A large dataset in Malaysia could be ecologically sensitive, for instance, and unable to be moved outside the country's boundaries. Controlling an analysis experiment in this virtual cluster setting can be difficult on multiple levels: with setup and control, with managing the behavior of the virtual cluster, and with interoperability issues across the compute clusters. Further, datasets can be distributed among clusters, or even across data centers, so it becomes critical to utilize data locality information to optimize the performance of data-intensive jobs. Finally, datasets are increasingly sensitive and tied to certain administrative boundaries, though once the data has been processed, the aggregated or statistical results can be shared across those boundaries. This dissertation addresses the management and control of a widely distributed virtual cluster holding sensitive or otherwise immovable data sets through a controller. The Virtual Cluster Controller (VCC) gives control back to the researcher. It creates virtual clusters across multiple cloud platforms. In recognition of sensitive data, it can establish a single network overlay over widely distributed clusters. We define a novel class of data, notably immovable data that we call "pinned data", where the data is treated as a first-class citizen instead of being moved to where it is needed. We draw from our earlier work with a hierarchical data processing model, Hierarchical MapReduce (HMR), to process geographically distributed data, some of which are pinned data. The applications implemented in HMR use an extended MapReduce model where computations are expressed as three functions: Map, Reduce, and GlobalReduce. Further, by facilitating information sharing among resources, applications, and data, the overall performance is improved. Experimental results show that the overhead of VCC is minimal, and that HMR outperforms the traditional MapReduce model when processing a particular class of applications. The evaluations also show that information sharing between resources and applications through the VCC shortens the hierarchical data processing time, while satisfying the constraints on the pinned data.
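    The three-function model is straightforward to sketch. Below is a minimal, single-process Python illustration of the Map / Reduce / GlobalReduce pattern, with two in-memory "clusters" standing in for geographically separated sites holding pinned data; the word-count workload and site names are invented for illustration, not taken from the dissertation.

```python
from collections import defaultdict

# Two "clusters", each holding pinned data that must be processed in place.
CLUSTER_DATA = {
    "site-a": ["big data systems", "data locality matters"],
    "site-b": ["pinned data stays put", "big clusters share results"],
}

def map_fn(line):
    """Map: emit (word, 1) pairs for one record."""
    return [(w, 1) for w in line.split()]

def reduce_fn(pairs):
    """Local Reduce: aggregate counts within one cluster."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def global_reduce_fn(partials):
    """GlobalReduce: merge the small, shareable per-cluster aggregates."""
    total = defaultdict(int)
    for partial in partials:
        for word, n in partial.items():
            total[word] += n
    return dict(total)

# Map and Reduce run where the data lives; only aggregates cross boundaries.
partials = [reduce_fn([kv for line in lines for kv in map_fn(line)])
            for lines in CLUSTER_DATA.values()]
print(global_reduce_fn(partials))
```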