48 research outputs found

    Technical Report: A Trace-Based Performance Study of Autoscaling Workloads of Workflows in Datacenters

    To improve customer experience, datacenter operators offer support for simplifying application and resource management. For example, running workloads of workflows on behalf of customers is desirable, but requires increasingly sophisticated autoscaling policies, that is, policies that dynamically provision resources for the customer. Although selecting and tuning autoscaling policies is a challenging task for datacenter operators, relatively few studies have so far investigated the performance of autoscaling for workloads of workflows. Complementing previous knowledge, in this work we propose the first comprehensive performance study in the field. Using trace-based simulation, we compare state-of-the-art autoscaling policies across multiple application domains, workload arrival patterns (e.g., burstiness), and system utilization levels. We further investigate the interplay between autoscaling and regular allocation policies, and the complexity cost of autoscaling. Our quantitative study focuses not only on traditional performance metrics and on state-of-the-art elasticity metrics, but also on time- and memory-related autoscaling-complexity metrics. Our main results give strong quantitative evidence about previously unreported operational behavior, for example, that autoscaling policies perform differently across application domains, and by how much they differ.
    Comment: Technical Report for the CCGrid 2018 submission "A Trace-Based Performance Study of Autoscaling Workloads of Workflows in Datacenters".
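
    The report describes its methodology only in prose; as a rough, hypothetical illustration of what trace-driven evaluation of an autoscaling policy involves, the sketch below replays a synthetic arrival trace against a simple threshold policy. The trace, thresholds, and reported metrics are invented placeholders, not the policies, traces, or metrics studied in the report.

```python
# Minimal sketch of trace-driven autoscaling evaluation (illustrative only).
# The arrival trace, thresholds, and capacity are invented placeholders.

def simulate(arrivals, scale_up_at=4, scale_down_at=1, capacity_per_node=1):
    nodes, queue, node_hours, finished = 1, 0, 0, 0
    for tasks_arriving in arrivals:          # one entry per time step of the trace
        queue += tasks_arriving
        load = queue / nodes                 # queued work per provisioned node
        if load > scale_up_at:
            nodes += 1                       # provision one extra node
        elif load < scale_down_at and nodes > 1:
            nodes -= 1                       # release an idle node
        served = min(queue, nodes * capacity_per_node)
        queue -= served
        finished += served
        node_hours += nodes                  # crude cost/elasticity proxy
    return {"finished": finished, "backlog": queue, "node_hours": node_hours}

# Example: a bursty synthetic trace.
print(simulate([0, 8, 8, 0, 0, 1, 12, 0, 0, 0]))
```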

    Active Data: A Data-Centric Approach to Data Life-Cycle Management

    Data-intensive science offers new opportunities for innovation and discovery, provided that large datasets can be handled efficiently. Data management for data-intensive science applications is challenging: it requires support for complex data life cycles, coordination across multiple sites, fault tolerance, and scalability to tens of sites and petabytes of data. In this paper, we argue that data management for data-intensive science applications requires a fundamentally different approach than the current ad-hoc, task-centric one. We propose Active Data, a novel paradigm for data life cycle management. Active Data follows two principles: it is data-centric and event-driven. We report on the Active Data programming model and its preliminary implementation, and discuss the benefits and limitations of the approach on recognized, challenging data-intensive science use cases.
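
    To make the data-centric principle concrete, the toy sketch below models the life cycle of a single data item as explicit states and allowed transitions. It is only an illustration under invented states and operations; the paper's formal life-cycle model is more expressive and is not reproduced here.

```python
# Toy data-centric life-cycle model (illustrative, not the Active Data formalism).
# States and allowed operations are invented placeholders for a single data item.

LIFE_CYCLE = {
    "created":     {"replicate": "replicated", "delete": "deleted"},
    "replicated":  {"transfer": "transferred", "delete": "deleted"},
    "transferred": {"archive": "archived", "delete": "deleted"},
    "archived":    {"delete": "deleted"},
    "deleted":     {},
}

class DataItem:
    def __init__(self, uid):
        self.uid, self.state, self.history = uid, "created", ["created"]

    def apply(self, operation):
        # Only operations allowed by the life-cycle model may fire.
        nxt = LIFE_CYCLE[self.state].get(operation)
        if nxt is None:
            raise ValueError(f"{operation!r} not allowed in state {self.state!r}")
        self.state = nxt
        self.history.append(nxt)
        return self

item = DataItem("dataset-42").apply("replicate").apply("transfer")
print(item.state, item.history)   # transferred ['created', 'replicated', 'transferred']
```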

    Active Data: A Programming Model to Manage Data Life Cycle Across Heterogeneous Systems and Infrastructures

    The Big Data challenge lies in managing, storing, analyzing, and visualizing huge and ever-growing data sets to extract sense and knowledge. As the volume of data grows exponentially, managing these data becomes proportionally more complex. A key point is to handle the complexity of the data life cycle, i.e., the various operations performed on data: transfer, archiving, replication, deletion, etc. Indeed, data-intensive applications span a large variety of devices and e-infrastructures, which implies that many systems are involved in data management and processing. We propose Active Data, a programming model to automate and improve the expressiveness of data management applications. We first define the concept of data life cycle and introduce a formal model that exposes data life cycles across heterogeneous systems and infrastructures. The Active Data programming model allows code execution at each stage of the data life cycle: routines provided by programmers are executed when a set of events (creation, replication, transfer, deletion) happens to any data item. We implement and evaluate the model with four use cases: a storage cache to Amazon S3, a cooperative sensor network, an incremental implementation of the MapReduce programming model, and automated data provenance tracking across heterogeneous systems. Altogether, these scenarios illustrate the adequacy of the model for programming applications that manage distributed and dynamic data sets. We also show that applications that do not leverage the data life cycle can still benefit from Active Data to improve their performance.
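
    The sketch below illustrates the event-driven idea described in the abstract: programmer-provided routines run when life-cycle events occur on data. The publish/subscribe API, event names, and handlers are invented placeholders, not the actual Active Data interface.

```python
# Sketch of an event-driven life-cycle programming style (illustrative only).
from collections import defaultdict

handlers = defaultdict(list)

def on(event):
    """Register a routine to run when a given life-cycle event occurs."""
    def register(fn):
        handlers[event].append(fn)
        return fn
    return register

def publish(event, data_id, **info):
    for fn in handlers[event]:
        fn(data_id, **info)

@on("creation")
def track_provenance(data_id, **info):
    print(f"provenance: {data_id} created at {info.get('site')}")

@on("replication")
def warm_cache(data_id, **info):
    print(f"cache: prefetching replica of {data_id} from {info.get('target')}")

# Systems involved in data management publish events as they act on data.
publish("creation", "blob-17", site="site-A")
publish("replication", "blob-17", target="site-B")
```

    One appeal of this style is decoupling: the systems that perform transfers, replications, or deletions only emit events, while the management logic reacting to those events lives in one place.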

    Performance Analysis Tool for HPC and Big Data Applications on Scientific Clusters

    Big data is prevalent in HPC. Many HPC projects rely on complex workflows to analyze terabytes or petabytes of data. These workflows often run over thousands of CPU cores and perform simultaneous data accesses, data movements, and computation. Analyzing the performance of such executions is challenging: it involves terabytes or petabytes of workflow and measurement data, collected from complex workflows spanning a large number of nodes and many parallel tasks. To help identify performance bottlenecks and debug performance issues in large-scale scientific applications and scientific clusters, we have developed a performance analysis framework built on state-of-the-art open-source big data processing tools. Our tool can ingest system logs and application performance measurements to extract key performance features, and apply statistical and data mining methods to the performance data. It uses an efficient data processing engine to allow users to interactively analyze large amounts of different types of logs and measurements. To illustrate the functionality of the framework, we conduct case studies on workflows from an astronomy project known as the Palomar Transient Factory (PTF) and on job logs from a genome analysis scientific cluster.
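
    The paper's framework targets distributed big-data engines at scale; as a much smaller illustration of the kind of feature extraction it performs, the standalone sketch below parses an invented per-task log and computes simple duration and I/O-rate features. The log format, task names, and columns are assumptions for illustration only.

```python
# Tiny sketch of log-based performance feature extraction (invented log format).
import io
import pandas as pd

raw = io.StringIO(
    "task,host,start,end,bytes_read\n"
    "ptf_photometry,node01,100.0,160.0,2.1e9\n"
    "ptf_photometry,node02,100.0,310.0,2.0e9\n"
    "ptf_subtraction,node03,200.0,260.0,0.9e9\n"
)
df = pd.read_csv(raw)
df["duration_s"] = df["end"] - df["start"]
df["read_mb_per_s"] = df["bytes_read"] / df["duration_s"] / 1e6

# Simple per-task summary to spot stragglers and I/O bottlenecks.
summary = df.groupby("task")[["duration_s", "read_mb_per_s"]].agg(["mean", "max"])
print(summary)
```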

    Autonomous Incident Response

    Master's project report, Information Security (Segurança Informática), 2022, Universidade de Lisboa, Faculdade de Ciências.
    Information security is a must-have for any organization that wants to stay relevant and grow; it plays an important role as a business enabler, be it from a regulatory or a reputation perspective. Having the people, processes, and technology to resolve the ever-growing number of security incidents as fast as possible and with the least amount of impact is a challenge for small and large companies alike. To address this challenge, companies started investing in Security Orchestration, Automation, and Response (SOAR) [39, 68, 70]. Security orchestration is the planning, integration, cooperation, and coordination of the activities of security tools and experts to produce and automate the actions required in response to any security incident across multiple technology paradigms [40]. In other words, a SOAR translates the manual procedures followed by security analysts into automated actions, making the process faster and more scalable while saving on the human-resources budget. This project proposes a low-cost, cloud-native SOAR platform based on serverless computing and presents the underlying details of its design. The performance of the proposed solution was evaluated on 364 real-world incidents related to 11 use cases in a large multinational enterprise. The results show that the solution decreases the duration of the tasks by an average of 98.81% while having an operating expense of less than $65/month. Prior to the SOAR, it took the analyst 75.84 hours to perform the manual tasks related to the 11 use cases, and an estimated additional 450 hours of analyst time would have been needed to run the Update threat intelligence database use case. With the SOAR, the same tasks ran automatically in 31.2 minutes, and the Update threat intelligence database use case ran 9,000 times in 5.3 hours.
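
    Using the abstract's own figures, 9,000 automated runs in 5.3 hours is roughly 2 seconds per run, versus about 3 minutes per run implied by the 450 estimated manual hours. As a hypothetical illustration of a serverless playbook step (not the thesis's actual platform or integrations), the sketch below triages one alert with a Lambda-style handler; the event shape, indicator set, and actions are invented placeholders.

```python
# Hypothetical serverless playbook step (Lambda-style handler signature).
# Indicator set, event fields, and response actions are invented placeholders.

KNOWN_BAD = {"198.51.100.7", "203.0.113.99"}   # stand-in threat-intel database

def handler(event, context=None):
    """Triage one alert: enrich with threat intel, then pick a response action."""
    indicator = event.get("source_ip", "")
    malicious = indicator in KNOWN_BAD
    action = "isolate_host" if malicious else "close_as_benign"
    return {
        "alert_id": event.get("alert_id"),
        "indicator": indicator,
        "malicious": malicious,
        "action": action,
    }

print(handler({"alert_id": "INC-001", "source_ip": "198.51.100.7"}))
```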

    Enhancing cyber assets visibility for effective attack surface management : Cyber Asset Attack Surface Management based on Knowledge Graph

    The contemporary digital landscape is filled with challenges, chief among them the management and security of cyber assets, including ever-growing shadow IT. The evolving technology landscape has produced an expansive ecosystem of solutions, making it challenging to select and deploy compatible tools in a structured manner. This thesis explores the critical role of Cyber Asset Attack Surface Management (CAASM) technologies in managing cyber attack surfaces, focusing on the open-source CAASM tool Starbase, by JupiterOne. It starts by underlining the importance of understanding the cyber assets that need defending, and acknowledges the Cyber Defense Matrix as a methodical and flexible approach to understanding and addressing cyber security challenges. A comprehensive analysis of market trends and business needs validated the necessity of asset security management tools as fundamental components of firms' security journeys. CAASM was selected as a promising solution among various tools due to its capabilities, ease of use, and seamless integration with cloud environments via APIs, addressing shadow IT challenges. A practical use case integrating Starbase with GitHub was developed to demonstrate CAASM's usability and flexibility in managing cyber assets in organizations of varying sizes. The use case enhanced the knowledge graph's aesthetics and usability using Neo4j Desktop and Neo4j Bloom, making it accessible and insightful even for non-technical users. The thesis concludes with practical guidelines, in the appendices and on GitHub, for reproducing the use case.
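
    Since the asset knowledge graph is stored in Neo4j, a typical attack-surface question can be asked with a Cypher query. The sketch below uses the official Neo4j Python driver; the connection details and the node labels, relationship, and properties (github_user, github_repo, OWNS, private) are assumptions for illustration and may not match Starbase's actual schema.

```python
# Sketch of querying a cyber-asset knowledge graph via the Neo4j Python driver.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH (u:github_user)-[:OWNS]->(r:github_repo)
WHERE r.private = false
RETURN u.login AS owner, r.name AS repo
"""

with driver.session() as session:
    for record in session.run(QUERY):
        # Each row surfaces a publicly visible repository and its owner,
        # one simple attack-surface question the graph can answer.
        print(record["owner"], record["repo"])

driver.close()
```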

    Self-Adaptation in Industry: A Survey

    Computing systems form the backbone of many areas of our society, from manufacturing to traffic control, healthcare, and financial systems. When software plays a vital role in their design, construction, and operation, these systems are referred to as software-intensive systems. Self-adaptation equips a software-intensive system with a feedback loop that either automates tasks that would otherwise need to be performed by human operators or deals with uncertain conditions. Such feedback loops have found their way into a variety of practical applications; typical examples are an elastic cloud that adapts computing resources and automated server management that responds quickly to business needs. To gain insight into the motivations for applying self-adaptation in practice, the problems solved using self-adaptation and how these problems are solved, and the difficulties and risks that industry faces in adopting self-adaptation, we performed a large-scale survey. We received 184 valid responses from practitioners spread over 21 countries. Based on the analysis of the survey data, we provide an empirically grounded overview of the state of the practice in the application of self-adaptation. From that, we derive insights for researchers to check their current research against industrial needs, and for practitioners to compare their current practice in applying self-adaptation. These insights also point to opportunities for applying self-adaptation in practice and pave the way for future industry-research collaborations.
    Comment: 43 pages
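
    For readers unfamiliar with the feedback loops the survey refers to, the sketch below shows a minimal monitor-analyze-plan-execute cycle around an elastic pool of servers. The managed system, metric, and thresholds are invented placeholders, not any system from the survey.

```python
# Minimal sketch of a self-adaptation feedback loop (monitor/analyze/plan/execute).
import random

class ManagedSystem:
    def __init__(self):
        self.servers = 2
    def utilization(self):
        # Stand-in for a real metric source (e.g., load reported by monitoring).
        return random.uniform(0.2, 1.0)

def feedback_loop(system, steps=5, high=0.8, low=0.3):
    for _ in range(steps):
        load = system.utilization()                    # Monitor
        overloaded = load > high                       # Analyze
        underloaded = load < low and system.servers > 1
        plan = 1 if overloaded else (-1 if underloaded else 0)   # Plan
        system.servers += plan                         # Execute
        print(f"load={load:.2f} servers={system.servers}")

feedback_loop(ManagedSystem())
```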