
    BDEv 3.0: energy efficiency and microarchitectural characterization of Big Data processing frameworks

    This is a post-peer-review, pre-copyedit version of an article published in Future Generation Computer Systems. The final authenticated version is available online at: https://doi.org/10.1016/j.future.2018.04.030
    [Abstract] As the size of Big Data workloads keeps increasing, the evaluation of distributed frameworks becomes a crucial task in order to identify potential performance bottlenecks that may delay the processing of large datasets. While most existing works focus only on execution time and resource utilization, analyzing other important metrics is key to fully understanding the behavior of these frameworks. For example, microarchitecture-level events can bring meaningful insights to characterize the interaction between frameworks and hardware. Moreover, energy consumption is also gaining increasing attention as systems scale to thousands of cores. This work discusses the current state of the art in evaluating distributed processing frameworks, while extending our Big Data Evaluator tool (BDEv) to extract energy efficiency and microarchitecture-level metrics from the execution of representative Big Data workloads. An experimental evaluation using BDEv demonstrates its usefulness for extracting meaningful information from popular frameworks such as Hadoop, Spark and Flink.
    Ministerio de Economía, Industria y Competitividad; TIN2016-75845-P. Ministerio de Educación; FPU14/02805. Ministerio de Educación; FPU15/0338
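The abstract above describes collecting microarchitecture-level events and energy readings alongside workload execution. A minimal sketch of that idea, assuming a Linux host with `perf stat` available; the event names (including the RAPL `power/energy-pkg/` energy event) are illustrative choices, not BDEv's actual interface:

```python
# Hypothetical sketch: wrap a workload command with `perf stat` to
# collect hardware counters and package energy, as a tool like BDEv
# might do. Event names and the workload command are assumptions.
def perf_command(workload_cmd, events):
    """Build a `perf stat` invocation requesting the given events."""
    return ["perf", "stat", "-e", ",".join(events), "--"] + workload_cmd

def ipc(instructions, cycles):
    """Instructions per cycle, a basic microarchitectural metric."""
    return instructions / cycles

cmd = perf_command(
    ["spark-submit", "wordcount.py"],  # hypothetical workload
    ["cycles", "instructions", "cache-misses", "power/energy-pkg/"],
)
```

Parsing the counter values from `perf stat` output would then yield derived metrics such as IPC or joules per processed record.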

    Performance modelling, analysis and prediction of Spark jobs in Hadoop cluster : a thesis by publications presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical & Computational Sciences, Massey University, Auckland, New Zealand

    Big Data frameworks have received tremendous attention from industry and academic research over the past decade. Distributed computing frameworks such as Hadoop MapReduce and Spark offer an efficient solution for analysing large-scale datasets on Hadoop clusters. Spark has been established as one of the most popular large-scale data processing engines because of its speed, low-latency in-memory computation, and advanced analytics. Spark's computational performance heavily depends on the selection of suitable parameters, and the configuration of these parameters is a challenging task. Although Spark has default parameters and can deploy applications without much effort, a significant drawback of default parameter selection is that it is not always the best for cluster performance. A major limitation of existing Spark performance prediction models is that they require either large input data or time-consuming system configuration. Therefore, an analytical model could be a better solution for performance prediction and for establishing appropriate job configurations. This thesis proposes two distinct parallelisation models for performance prediction: the 2D-Plate model and the Fully-Connected Node model. Both models were constructed based on serial boundaries for a certain arrangement of executors and size of the data. In order to evaluate cluster performance, various HiBench workloads were used, and the workloads' empirical data were fitted to the models for performance prediction analysis. The developed models were benchmarked against existing models such as Amdahl's law, Gustafson's law, ERNEST, and machine learning. Our experimental results show that the two proposed models can quickly and accurately predict performance in terms of runtime, and they can outperform the accuracy of machine learning models when extrapolating predictions.
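The thesis benchmarks its models against classical scaling laws. As a worked illustration of those baselines (not of the proposed 2D-Plate or Fully-Connected Node models, whose formulas are not given here), Amdahl's and Gustafson's laws can be computed directly:

```python
def amdahl_speedup(serial_fraction, n):
    """Amdahl's law: speedup on n workers when a fixed fraction
    of the work is inherently serial (fixed problem size)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

def gustafson_speedup(serial_fraction, n):
    """Gustafson's law: scaled speedup when the problem size
    grows with the number of workers."""
    return n - serial_fraction * (n - 1)

# With 10% serial work on 8 executors, Amdahl caps the speedup well
# below 8, while Gustafson's scaled speedup stays close to linear.
a = amdahl_speedup(0.1, 8)
g = gustafson_speedup(0.1, 8)
```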

    A New Framework for the Analysis of Large Scale Multi-Rate Power Data

    A new framework for the analysis of large-scale, multi-rate power data is introduced. The system comprises high-rate power grid data acquisition devices and software modules for big data management and large-scale time series analysis. The power grid modeling and simulation modules enable running power flow simulations. Visualization methods support data exploration for captured, simulated and analyzed energy data. A remote software control module for the proposed tools is provided.
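Analyzing multi-rate data typically requires bringing series recorded at different sampling rates onto a common time base. A minimal sketch of one standard technique, block averaging a high-rate series down to a lower rate; the function and rates are illustrative assumptions, not part of the framework described above:

```python
def block_average(samples, factor):
    """Downsample by averaging non-overlapping blocks of `factor`
    consecutive samples; a trailing partial block is dropped."""
    return [sum(samples[i:i + factor]) / factor
            for i in range(0, len(samples) - factor + 1, factor)]

# E.g. align a 10 kHz voltage series with a 5 kHz series (factor 2).
aligned = block_average([1.0, 2.0, 3.0, 4.0, 5.0], 2)
```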

    An information security model based on trustworthiness for enhancing security in on-line collaborative learning

    This thesis' main goal is to incorporate information security properties and services into online collaborative learning using a functional approach based on trustworthiness assessment and prediction. As a result, this thesis aims to design an innovative security solution, based on methodological approaches, to provide e-learning designers and managers with guidelines for incorporating security into online collaborative learning. These guidelines include all processes involved in e-learning design and management, such as security analysis, learning activity design, detection of anomalous actions, trustworthiness data processing, and so on. The subject of this research is multidisciplinary in nature, with the different disciplines comprising it being closely related. The most significant ones are online collaborative learning, information security, learning management systems (LMS), and trustworthiness assessment and prediction models. Against this backdrop, the problem of securing collaborative online learning activities is tackled by a hybrid model based on functional and technological solutions, namely, trustworthiness modelling and information security technologies.
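The abstract centres on trustworthiness assessment and prediction without specifying a formula. A minimal sketch of one common approach, an exponentially weighted trust score in which recent learner actions count more than old ones; the scoring scheme, decay factor and neutral prior are all assumptions for illustration, not the thesis' model:

```python
def trust_score(events, decay=0.9):
    """Exponentially weighted trust score in [0, 1].

    `events` lists outcomes oldest-first: 1.0 for a positive action,
    0.0 for an anomalous one. With no history, return a neutral 0.5.
    """
    score, weight, w = 0.0, 0.0, 1.0
    for outcome in reversed(events):  # most recent event first
        score += w * outcome
        weight += w
        w *= decay
    return score / weight if weight else 0.5
```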

    Ignis: An efficient and scalable multi-language Big Data framework

    Most of the relevant Big Data processing frameworks (e.g., Apache Hadoop, Apache Spark) only support JVM (Java Virtual Machine) languages by default. In order to support non-JVM languages, subprocesses are created and connected to the framework using system pipes. With this technique, it becomes impossible to manage data at thread level, and performance suffers significantly. To address this problem we introduce Ignis, a new Big Data framework built on an elegant way to create multi-language executors managed through an RPC system. As a consequence, the new system is able to natively execute applications implemented in non-JVM languages. In addition, Ignis allows users to combine in the same application the benefits of implementing each computational task in the best-suited programming language without additional overhead. The system runs completely inside Docker containers, isolating the execution environment from the physical machine. A comparison with Apache Spark shows the advantages of our proposal in terms of performance and scalability. This work has been supported by MICINN, Spain (RTI2018-093336-B-C21), Xunta de Galicia, Spain (ED431G/08 and ED431C-2018/19) and the European Regional Development Fund (ERDF)
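The pipe-based technique that Ignis avoids can be sketched concretely: records are serialized to a non-JVM worker over stdin and results read back over stdout, paying (de)serialization on every record. This mirrors the Hadoop Streaming style of integration, not Ignis' RPC mechanism:

```python
import subprocess
import sys

# Ship text records to a separate worker process through a pipe and
# collect its output; every record crosses a process boundary.
worker = subprocess.run(
    [sys.executable, "-c",
     "import sys; [print(l.strip().upper()) for l in sys.stdin]"],
    input="alpha\nbeta\n", capture_output=True, text=True)
lines = worker.stdout.splitlines()
```

An RPC-managed executor, by contrast, keeps workers long-lived and addressable, which is what lets a framework manage data placement and avoid per-record pipe overhead.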

    Data processing of high-rate low-voltage distribution grid recordings for smart grid monitoring and analysis

    Power networks will change from a rigid hierarchic architecture to dynamic interconnected smart grids. In traditional power grids, the frequency is the controlled quantity used to maintain the balance between supply and load power, and high rotating-mass inertia ensures stability. In the future, system stability will have to rely more on real-time measurements and sophisticated control, especially when integrating fluctuating renewable power sources or high-load consumers like electric vehicles into the low-voltage distribution grid. In the present contribution, we describe a data processing network for the in-house developed low-voltage, high-rate measurement devices called electrical data recorders (EDRs). These capture units are capable of sending the full high-rate acquisition data for permanent storage in a large-scale database. The EDR network is specifically designed for reliable and secured transport of large data, live performance monitoring, and deep data mining. We integrate dedicated interfaces for statistical evaluation, big data queries, comparative analysis, and data integrity tests in order to provide a wide range of useful post-processing methods for smart grid analysis. We implemented the developed EDR network architecture for high-rate measurement data processing and management at different locations in the power grid of our Institute. The system has run stably and collected data successfully for several years. The results of the implemented evaluation functionalities show the feasibility of the implemented methods for signal processing, in view of enhanced smart grid operation. © 2015, Maaß et al.; licensee Springer
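Since grid frequency is the controlled quantity described above, a basic post-processing check on recorded data is flagging samples that deviate from the nominal frequency. A minimal sketch under assumed values; the 50 Hz nominal and the tolerance are illustrative, not the EDR system's actual thresholds:

```python
NOMINAL_HZ = 50.0  # assumed European nominal grid frequency

def flag_deviations(freqs, tolerance=0.2):
    """Return indices of frequency samples deviating from nominal
    by more than `tolerance` Hz."""
    return [i for i, f in enumerate(freqs)
            if abs(f - NOMINAL_HZ) > tolerance]

flags = flag_deviations([50.0, 50.1, 49.7, 50.3])
```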

    Performance Evaluation of Data-Intensive Computing Applications on a Public IaaS Cloud

    [Abstract] The advent of cloud computing technologies, which dynamically provide on-demand access to computational resources over the Internet, is offering new possibilities to many scientists and researchers. Nowadays, Infrastructure as a Service (IaaS) cloud providers can offset the increasing processing requirements of data-intensive computing applications, becoming an emerging alternative to traditional servers and clusters. In this paper, a comprehensive study of the leading public IaaS cloud platform, Amazon EC2, has been conducted in order to assess its suitability for data-intensive computing. One of the key contributions of this work is the analysis of the storage-optimized family of EC2 instances. Furthermore, this study presents a detailed analysis of both performance and cost metrics. More specifically, multiple experiments have been carried out to analyze the full I/O software stack, ranging from the low-level storage devices and cluster file systems up to real-world applications using representative data-intensive parallel codes and MapReduce-based workloads. The analysis of the experimental results has shown that data-intensive applications can benefit from tailored EC2-based virtual clusters, enabling users to obtain the highest performance and cost-effectiveness in the cloud.
    Ministerio de Economía y Competitividad; TIN2013-42148-P. Galicia. Consellería de Cultura, Educación e Ordenación Universitaria; GRC2013/055. Ministerio de Educación y Ciencia; AP2010-434
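A cost-metric analysis like the one described typically has to account for the billing granularity of the provider. A minimal sketch assuming the classic per-hour EC2 billing model of that era; the prices and granularity are illustrative assumptions, not figures from the paper:

```python
import math

def job_cost(price_per_hour, runtime_seconds, granularity_s=3600):
    """Cost of one run when usage is billed in fixed increments
    (e.g. classic EC2 per-hour billing rounds runtime up)."""
    billed_s = math.ceil(runtime_seconds / granularity_s) * granularity_s
    return price_per_hour * billed_s / 3600.0

# A 90-minute job on a hypothetical $1.20/h instance is billed 2 hours.
cost = job_cost(1.20, 5400)
```

Dividing achieved throughput by this cost gives the cost-effectiveness figure used to compare instance families.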

    Data-Driven Anomaly Detection in Industrial Networks

    Since the conception of the first Programmable Logic Controllers (PLCs) in the 1960s, Industrial Control Systems (ICSs) have evolved vastly. From primitive isolated setups, ICSs have become increasingly interconnected, slowly forming the complex networked environments, collectively known as Industrial Networks (INs), that we know today. Since ICSs are responsible for a wide range of physical processes, including those belonging to Critical Infrastructures (CIs), securing INs is vital for the well-being of modern societies. Out of the many research advances in the field, Anomaly Detection Systems (ADSs) play a prominent role. These systems monitor IN and/or ICS behavior to detect abnormal events, known or unknown. However, as the complexity of INs has increased, monitoring them in search of anomalous trends has effectively become a Big Data problem. In other words, IN data has become too complex to process by traditional means, due to its large scale, diversity and generation speed. Nevertheless, ADSs designed for INs have not evolved at the same pace, and recent proposals are not designed to handle this data complexity, as they do not scale well or do not leverage the majority of the data types created in INs. This thesis aims to fill that gap by presenting two main contributions: (i) a visual flow monitoring system and (ii) a multivariate ADS that is able to tackle data heterogeneity and to scale efficiently. For the flow monitor, we propose a system that, based on current flow data, builds security visualizations depicting network behavior while highlighting anomalies. For the multivariate ADS, we analyze the performance of Multivariate Statistical Process Control (MSPC) for detecting and diagnosing anomalies, and later we present a Big Data, MSPC-inspired ADS that monitors field and network data to detect anomalies. The approaches are experimentally validated by building INs in test environments and analyzing the data created by them. Motivated by this need to conduct IN security research in a rigorous and reproducible environment, we also propose the design of a testbed that serves this purpose.
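The MSPC approach mentioned above is classically built around Hotelling's T² statistic: a new observation is scored by its covariance-weighted distance from the baseline mean, and large values signal anomalies. A minimal two-variable sketch of that statistic (the thesis' actual ADS is a scaled-up, Big Data variant not reproduced here):

```python
def hotelling_t2(baseline, sample):
    """Hotelling's T-squared of a 2-variable `sample` against a
    `baseline` list of (x, y) rows: (s - mu)' S^-1 (s - mu)."""
    n = len(baseline)
    mx = sum(r[0] for r in baseline) / n
    my = sum(r[1] for r in baseline) / n
    # Sample covariance terms (denominator n - 1).
    sxx = sum((r[0] - mx) ** 2 for r in baseline) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in baseline) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in baseline) / (n - 1)
    det = sxx * syy - sxy * sxy  # determinant of the 2x2 covariance
    dx, dy = sample[0] - mx, sample[1] - my
    # Quadratic form with the explicit 2x2 inverse folded in.
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

base = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
t2_center = hotelling_t2(base, (0.5, 0.5))   # at the mean
t2_outlier = hotelling_t2(base, (1.5, 0.5))  # off-center sample
```

In practice the T² values are compared against a control limit derived from an F-distribution; samples above it are diagnosed further.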

    Personalized large scale classification of public tenders on Hadoop

    This project was completed as part of an innovation partnership between Fujitsu Canada and Université Laval. The needs and objectives of the project were centered on a business problem defined jointly with Fujitsu. Our project aimed to classify a corpus of electronic public tenders based on state-of-the-art Hadoop big data technology. The objective was to identify with high recall public tenders relevant to the IT services business of Fujitsu Canada. A small-scale prototype based on the BNS (Bi-Normal Separation) algorithm was empirically shown to classify the public tender corpus with high recall (93%). The prototype was then re-implemented on a full-scale Hadoop cluster using Apache Pig for the data preparation pipeline and Apache Mahout for classification. Our experiments show that the large-scale system not only maintains high recall (91%) on the classification task, but readily takes advantage of the massive scalability gains made possible by Hadoop's distributed architecture.
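The BNS (Bi-Normal Separation) feature-scoring metric used above has a well-known closed form: the absolute difference of the inverse normal CDF applied to a term's true-positive and false-positive rates. A minimal sketch using the Python standard library; the clipping bounds are a common practical assumption to keep the inverse CDF finite:

```python
from statistics import NormalDist

def bns(tp, fp, pos, neg):
    """Bi-Normal Separation feature score:
    |F^-1(tp/pos) - F^-1(fp/neg)|, with rates clipped away from
    0 and 1 (clipping bounds are an assumed implementation detail)."""
    clip = lambda r: min(max(r, 0.0005), 0.9995)
    inv = NormalDist().inv_cdf
    return abs(inv(clip(tp / pos)) - inv(clip(fp / neg)))

# A term appearing mostly in relevant tenders scores higher than an
# undiscriminating one.
strong = bns(80, 10, 100, 100)
weak = bns(60, 30, 100, 100)
```

Terms are ranked by this score to select features before training the classifier.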

    Application of a Smart City Model to a Traditional University Campus with a Big Data Architecture: A Sustainable Smart Campus

    Currently, the integration of technologies such as the Internet of Things and big data seeks to cover the needs of an increasingly demanding society that consumes more resources. The massification of these technologies fosters the transformation of cities into smart cities. Smart cities improve the comfort of people in areas such as security, mobility, energy consumption and so forth. However, this transformation requires a high investment in both socioeconomic and technical resources. To make the most of these resources, it is important to build prototypes capable of simulating urban environments, whose results can set the standard for implementation in real environments. The search for an environment that represents the socioeconomic organization of a city led us to consider universities as a perfect environment for small-scale testing. The proposal integrates these technologies in a traditional university campus, mainly through the acquisition of data via the Internet of Things, the centralization of data in proprietary infrastructure and the use of big data for the management and analysis of data. The mechanisms of distributed and multilevel analysis proposed here could be a powerful starting point for finding a reliable and efficient solution for the implementation of an intelligent environment based on sustainability.