23 research outputs found

    What broke where for distributed and parallel applications — a whodunit story

    Detection, diagnosis, and mitigation of performance problems in today's large-scale distributed and parallel systems are difficult tasks. These systems are composed of many complex software and hardware components, and when a performance or correctness problem occurs, developers struggle to understand its root cause and to fix it in a timely manner. In my thesis, I address these three aspects of performance problems in computer systems.

    First, we focus on diagnosing performance problems in large-scale parallel applications running on supercomputers, and we developed techniques to localize a problem for root-cause analysis. Parallel applications, most of which are complex scientific simulations, can create up to millions of parallel tasks that run on different machines and communicate using the message passing paradigm. We developed PRODOMETER, a highly scalable and accurate automated debugging tool that first builds a logical progress dependency graph of the tasks to show how the problem spread through the system and manifested as a system-wide performance issue, then uses this graph to identify the task where the problem originated, and finally pinpoints the code region corresponding to the origin of the bug.

    Second, we developed a tool-chain that detects performance anomalies using machine-learning techniques while achieving a very low false-positive rate. Our input-aware performance anomaly detection system consists of a scalable data-collection framework that gathers performance-related metrics from code regions at different granularities, an offline model-creation and prediction-error characterization technique, and a threshold-based anomaly-detection engine for production runs. The system requires few training runs and handles unknown inputs and parameter combinations by dynamically calibrating the anomaly-detection threshold according to the characteristics of the input data and of the models' prediction error.

    Third, we developed a performance-problem mitigation scheme for erasure-coded distributed storage systems. Repairing failed blocks in such systems takes a very long time in network-constrained data centers, because the repair operation gathers a large amount of data from multiple nodes into a single node and then performs a mathematical operation to reconstruct the missing part, severely congesting the links toward the destination where the newly recreated data is to be hosted. We proposed a novel distributed repair technique, called Partial-Parallel-Repair (PPR), that performs this reconstruction in parallel on multiple nodes, eliminating the network bottleneck and, as a result, greatly speeding up the repair process.

    Fourth, we study how, for a class of applications, performance can be improved (or performance problems mitigated) by selectively approximating some of the computations. For many applications, the main computation happens inside a loop that can be logically divided into a few temporal segments, which we call phases. We found that while approximating the initial phases might severely degrade the quality of the results, approximating the computation in later phases has very little impact on the final quality of the result. Based on this observation, we developed an optimization framework that, for a given quality-loss budget, finds the best approximation setting for each phase of the execution.
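    As an illustration of the repair idea only (not PPR's actual protocol), here is a minimal sketch assuming a toy XOR-parity code instead of a production erasure code: partial combinations are computed pairwise along a reduction tree, so the repair traffic is spread over several helper nodes instead of being funnelled into one.

```python
# Toy sketch of Partial-Parallel-Repair-style reconstruction.
# Real erasure codes combine blocks with Galois-field arithmetic; plain XOR parity
# is used here so the sketch stays self-contained. Names and tree shape are illustrative.
from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def centralized_repair(surviving_blocks):
    """Baseline: every block travels to one repair node, congesting its incoming links."""
    return reduce(xor_blocks, surviving_blocks)

def partial_parallel_repair(surviving_blocks):
    """Combine blocks pairwise in rounds; each round's partial results can be computed
    on different helper nodes, so no single link carries all of the repair traffic."""
    blocks = list(surviving_blocks)
    while len(blocks) > 1:
        paired = []
        for i in range(0, len(blocks) - 1, 2):
            paired.append(xor_blocks(blocks[i], blocks[i + 1]))  # done on a helper node
        if len(blocks) % 2:
            paired.append(blocks[-1])
        blocks = paired
    return blocks[0]

# Both strategies reconstruct the same block; they differ in where the work and traffic go.
data_blocks = [bytes([i] * 8) for i in range(1, 5)]
assert centralized_repair(data_blocks) == partial_parallel_repair(data_blocks)
```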

    Enabling Distributed Applications Optimization in Cloud Environment

    The past few years have seen dramatic growth in the popularity of public clouds, such as Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Container-as-a-Service (CaaS). In both commercial and scientific fields, quick environment setup and application deployment have become mandatory requirements. As a result, more and more organizations choose cloud environments instead of setting up the environment themselves from scratch. Cloud computing resources such as server engines, orchestration, and the underlying server resources are delivered to users as a service by a cloud provider. Most applications that run in public clouds are distributed applications, also called multi-tier applications, which require a set of servers (a service ensemble) that cooperate and communicate to jointly provide a certain service or accomplish a task. However, only a few research efforts have been conducted on providing an overall solution for optimizing distributed applications in the public cloud. In this dissertation, we present three systems that enable distributed applications optimization: (1) the first part introduces DocMan, a toolset for detecting containerized applications' dependencies in CaaS clouds; (2) the second part introduces a system to deal with hot/cold blocks in distributed applications; (3) the third part introduces FP4S, a novel fragment-based parallel state recovery mechanism that can handle many simultaneous failures for a large number of concurrently running stream applications.
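    The abstract does not spell out how hot and cold blocks are identified, so the following is only a minimal sketch, under assumed semantics, of classifying blocks by access frequency; the threshold and block identifiers are illustrative, not the dissertation's actual policy.

```python
# Hypothetical hot/cold block classification by access frequency (illustrative only).
from collections import Counter

def classify_blocks(access_log, hot_fraction=0.2):
    """Mark the most frequently accessed blocks as hot and the rest as cold."""
    counts = Counter(access_log)                    # block_id -> access count
    ranked = [block for block, _ in counts.most_common()]
    n_hot = max(1, int(len(ranked) * hot_fraction))
    hot = set(ranked[:n_hot])
    return {block: ("hot" if block in hot else "cold") for block in counts}

# Example trace: block "a" dominates the accesses and is classified as hot.
print(classify_blocks(["a", "a", "a", "b", "c", "a", "b"]))
```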

    Secure and efficient storage of multimedia content in public cloud environments using joint compression and encryption

    Cloud Computing is a paradigm that still has many unexplored areas, ranging from the technological component to the definition of new business models, but it is revolutionizing the way we design, implement, and manage the entire information technology infrastructure. Infrastructure as a Service is the delivery of computing infrastructure, typically a virtual data center, along with a set of APIs that allow applications to control, in an automatic way, the resources they wish to use. The choice of service provider and how it applies its business model may lead to higher or lower costs in operating and maintaining applications with those providers. In this context, this work set out to carry out a literature review on Cloud Computing and on the secure storage and transmission of multimedia content, using lossless compression, in public cloud environments, and to implement such a system by building an application that manages data in public cloud environments (Dropbox and MEO Cloud). An application that meets the stated objectives was built during this dissertation. The system provides the user with a wide range of data-management functions in public cloud environments; the user only has to log in with his/her credentials. After login, an access token is generated through the OAuth 1.0 authorization protocol; this token is generated only with the consent of the user and allows the application to access the user's data/files without having to use the credentials. With this token the application can operate and unlock the full potential of its functions. The application also offers compression and encryption functions so that the user can make the most of his/her cloud storage system securely. The compression function uses the LZMA algorithm; the user only has to choose the files to be compressed. For encryption, the AES (Advanced Encryption Standard) algorithm is used with a 128-bit symmetric key defined by the user. We built the research in two distinct and complementary parts: the first part consists of the theoretical foundation, and the second part is the development of the computer application in which the data is managed, compressed, stored, and transmitted in various cloud computing environments. The theoretical framework is organized into two chapters, Chapter 2 - Background on Cloud Storage and Chapter 3 - Data Compression. Through the theoretical foundation we sought to demonstrate the relevance of the research, convey some of the pertinent theories, and introduce, whenever possible, existing research in the area. The second part of the work was devoted to the development of the application in the cloud environment. We showed how we built the application and presented its features, advantages, and data safety standards. Finally, we reflect on the results, according to the theoretical framework established in the first part and the development of the platform. We believe the work obtained is positive and fits the goals we set ourselves to achieve.
    This research has some limitations: we believe the time available for its completion was scarce, and the implementation of the platform could benefit from additional features. In future research it would be appropriate to continue the project by expanding the capabilities of the application, testing its operation with other users, and performing comparative tests.
    Fundação para a Ciência e a Tecnologia (FCT)
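    A minimal sketch of the joint compression-and-encryption step described above, assuming Python's standard lzma module and the third-party cryptography package; OAuth, key derivation, and the upload to Dropbox/MEO Cloud are omitted, and the function names are illustrative rather than the application's actual API.

```python
# Hypothetical compress-then-encrypt pipeline: LZMA compression followed by
# AES-128-CBC encryption with a user-supplied 16-byte key (illustrative sketch).
import lzma
import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def compress_then_encrypt(data: bytes, key: bytes) -> bytes:
    compressed = lzma.compress(data)
    iv = os.urandom(16)                            # random IV, prepended to the output
    padder = padding.PKCS7(128).padder()           # AES block size is 128 bits
    padded = padder.update(compressed) + padder.finalize()
    encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + encryptor.update(padded) + encryptor.finalize()

def decrypt_then_decompress(blob: bytes, key: bytes) -> bytes:
    iv, ciphertext = blob[:16], blob[16:]
    decryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = decryptor.update(ciphertext) + decryptor.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    compressed = unpadder.update(padded) + unpadder.finalize()
    return lzma.decompress(compressed)

# Round trip with a 128-bit (16-byte) symmetric key chosen by the user.
key = b"0123456789abcdef"
assert decrypt_then_decompress(compress_then_encrypt(b"some multimedia bytes", key), key) == b"some multimedia bytes"
```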

    Data Replication and Its Alignment with Fault Management in the Cloud Environment

    Nowadays, exponential data growth has become one of the major challenges worldwide. It can cause a series of negative impacts such as network overloading, high system complexity, and inadequate data security. Cloud computing offers a novel paradigm to alleviate massive data-processing challenges with its on-demand services and distributed architecture. Data replication has been proposed to strategically distribute the data access load across multiple cloud data centres by creating multiple copies of the data at multiple cloud data centres. A replica-enabled cloud environment not only achieves lower response times, higher data availability, and a more balanced resource load, but also protects the cloud environment against upcoming faults. A reactive fault tolerance strategy is also required to handle faults once they have occurred. As a result, data replication strategies should be aligned with reactive fault tolerance strategies to achieve a complete management chain in the cloud environment. In this thesis, a data replication and fault management framework is proposed to establish decentralised, overarching management of the cloud environment. Three data replication strategies are first proposed based on this framework. A replica creation strategy is proposed to reduce the total cost by jointly considering data dependency and access frequency in the replica-creation decision-making process. In addition, a cloud-map-oriented and cost-efficiency-driven replica creation strategy is proposed to achieve the optimal cost reduction per replica in the cloud environment. The local and remote data relationships are further analysed by introducing two novel data dependency types, Within-DataCentre Data Dependency and Between-DataCentre Data Dependency, according to data location. Furthermore, a network-performance-based replica selection strategy is proposed to avoid potential network overloading problems and, at the same time, to increase the number of concurrently running instances.
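    The thesis's exact cost model is not reproduced here, so the following is only a minimal sketch of a dependency- and frequency-aware replica-creation decision in that spirit; the linear cost model, weights, and field names are illustrative assumptions.

```python
# Hypothetical replica-creation check: replicate a block at a data centre only if the
# expected saving from serving accesses locally outweighs the cost of creating the copy.
from dataclasses import dataclass

@dataclass
class BlockStats:
    access_freq: float            # accesses per hour from the candidate data centre
    remote_access_cost: float     # cost per access served from a remote data centre
    replication_cost: float       # one-off cost of creating and storing the replica
    dependent_blocks_local: int   # dependent blocks already placed at that data centre

def should_replicate(b: BlockStats, horizon_hours: float = 24.0,
                     dependency_weight: float = 0.1) -> bool:
    saving = b.access_freq * horizon_hours * b.remote_access_cost
    # Dependency-aware bonus: co-locating with related blocks avoids future transfers.
    saving *= 1.0 + dependency_weight * b.dependent_blocks_local
    return saving > b.replication_cost

print(should_replicate(BlockStats(access_freq=12, remote_access_cost=0.02,
                                  replication_cost=3.0, dependent_blocks_local=4)))
```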

    Scaling and Resilience in Numerical Algorithms for Exascale Computing

    The first Petascale supercomputer, the IBM Roadrunner, went online in 2008. Ten years later, the community is looking ahead to a new generation of Exascale machines. During the decade that has passed, several hundred Petascale-capable machines have been installed worldwide, yet despite the abundance of machines, applications that scale to their full size remain rare. Large clusters now routinely have 50,000+ cores, and some have several million. This extreme level of parallelism, which allows a theoretical compute capacity in excess of a million billion operations per second, turns out to be difficult to exploit in many applications of practical interest: processors often end up spending more time waiting for synchronization, communication, and other coordinating operations to complete than actually computing. Component reliability is another challenge facing HPC developers. If even a single processor fails, among many thousands, the user is forced to restart a traditional application, wasting valuable compute time. These issues collectively manifest themselves as low parallel efficiency, resulting in wasted energy and computational resources. Future performance improvements are expected to continue to come in large part from increased parallelism. One may therefore speculate that the difficulties currently faced when scaling applications to Petascale machines will progressively worsen, making it difficult for scientists to harness the full potential of Exascale computing. The thesis comprises two parts, each consisting of several chapters that discuss modifications of numerical algorithms to make them better suited for future Exascale machines. In the first part, the use of Parareal parallel-in-time integration techniques for scalable numerical solution of partial differential equations is considered. We propose a new adaptive scheduler that optimizes parallel efficiency by minimizing the time-subdomain length without making the communication of time-subdomains too costly. In conjunction with an appropriate preconditioner, we demonstrate that it is possible to obtain time-parallel speedup on the nonlinear shallow water equation beyond what is possible using conventional spatial domain-decomposition techniques alone. The part concludes with the proposal of a new method for constructing parallel-in-time integration schemes better suited to convection-dominated problems. In the second part, new ways of mitigating the impact of hardware failures are developed and presented. The topic is introduced with the creation of a new fault-tolerant variant of Parareal. In the following chapter, a C++ library for multi-level checkpointing is presented. The library uses lightweight in-memory checkpoints, protected through the use of erasure codes, to mitigate the impact of failures by decreasing the overhead of checkpointing and minimizing the compute work lost. Erasure codes have the unfortunate property that if more data blocks are lost than parity blocks were created, the data is effectively considered unrecoverable. The final chapter contains a preliminary study on partial information recovery for incomplete checksums. Under the assumption that some meta-knowledge exists about the structure of the encoded data, we show that the lost data may be recovered, at least partially. This result is of interest not only in HPC but also in data centers, where erasure codes are widely used to protect data efficiently.
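    For readers unfamiliar with Parareal, a minimal serial sketch of the iteration U_{n+1}^{k+1} = G(U_n^{k+1}) + F(U_n^k) - G(U_n^k) is shown below; the fine solves over each time-subdomain are the part that runs in parallel on a real machine, and the propagators, step counts, and test problem are illustrative choices, not the thesis's adaptive scheduler or preconditioner.

```python
# Minimal Parareal sketch for a scalar ODE du/dt = f(t, u) (illustrative only).
import numpy as np

def parareal(f, u0, t0, t1, n_slices=10, n_iter=5, fine_steps=100):
    dt = (t1 - t0) / n_slices

    def coarse(u, t):                 # cheap propagator G: one forward-Euler step per slice
        return u + dt * f(t, u)

    def fine(u, t):                   # accurate propagator F: many small Euler steps per slice
        h = dt / fine_steps
        for i in range(fine_steps):
            u = u + h * f(t + i * h, u)
        return u

    # Initial guess from the coarse propagator alone.
    U = np.empty(n_slices + 1)
    U[0] = u0
    for n in range(n_slices):
        U[n + 1] = coarse(U[n], t0 + n * dt)

    for _ in range(n_iter):
        F_vals = [fine(U[n], t0 + n * dt) for n in range(n_slices)]    # parallel across slices
        G_old = [coarse(U[n], t0 + n * dt) for n in range(n_slices)]
        U_new = U.copy()
        for n in range(n_slices):                                      # cheap sequential sweep
            U_new[n + 1] = coarse(U_new[n], t0 + n * dt) + F_vals[n] - G_old[n]
        U = U_new
    return U

# Example: du/dt = -u with u(0) = 1; the last entry approximates exp(-2).
print(parareal(lambda t, u: -u, u0=1.0, t0=0.0, t1=2.0)[-1])
```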

    Security Risk Management for the Internet of Things

    In recent years, the rising complexity of Internet of Things (IoT) systems has increased their potential vulnerabilities and introduced new cybersecurity challenges. In this context, state-of-the-art methods and technologies for security risk assessment have prominent limitations when it comes to large-scale, cyber-physical, and interconnected IoT systems. Risk assessments for modern IoT systems must be frequent, dynamic, and driven by knowledge about both cyber and physical assets. Furthermore, they should be more proactive, more automated, and able to leverage information shared across IoT value chains. This book introduces a set of novel risk assessment techniques and their role in the IoT security risk management process. Specifically, it presents architectures and platforms for end-to-end security, including their implementation based on the edge/fog computing paradigm. It also highlights machine learning techniques that boost the automation and proactiveness of IoT security risk assessments. Furthermore, blockchain solutions for open and transparent sharing of IoT security information across the supply chain are introduced. Frameworks for privacy awareness, along with technical measures that enable privacy risk assessment and boost GDPR compliance, are also presented. Likewise, the book illustrates novel solutions for security certification of IoT systems, along with techniques for IoT security interoperability. In the coming years, IoT security will be a challenging yet very exciting journey for IoT stakeholders, including security experts, consultants, security research organizations, and IoT solution providers. The book provides knowledge and insights about where we stand on this journey. It also attempts to develop a vision for the future and to help readers start their IoT security efforts on the right foot.

    Understanding Quantum Technologies 2022

    Understanding Quantum Technologies 2022 is a Creative Commons ebook that provides a unique 360-degree overview of quantum technologies, from science and technology to geopolitical and societal issues. It covers quantum physics history, quantum physics 101, gate-based quantum computing, quantum computing engineering (including quantum error correction and quantum computing energetics), quantum computing hardware (all qubit types, including quantum annealing and quantum simulation paradigms; history, science, research, implementation, and vendors), quantum enabling technologies (cryogenics, control electronics, photonics, component fabs, raw materials), quantum computing algorithms, software development tools and use cases, unconventional computing (potential alternatives to quantum and classical computing), quantum telecommunications and cryptography, quantum sensing, quantum technologies around the world, the societal impact of quantum technologies, and even quantum fake sciences. The main audience is computer science engineers, developers, and IT specialists, as well as quantum scientists and students who want to acquire a global view of how quantum technologies work, particularly quantum computing. This version is an extensive update to the 2021 edition published in October 2021. Comment: 1132 pages, 920 figures, Letter format.

    Proceedings of the 21st Conference on Formal Methods in Computer-Aided Design – FMCAD 2021

    The Conference on Formal Methods in Computer-Aided Design (FMCAD) is an annual conference on the theory and applications of formal methods in hardware and system verification. FMCAD provides a leading forum for researchers in academia and industry to present and discuss groundbreaking methods, technologies, theoretical results, and tools for reasoning formally about computing systems. FMCAD covers formal aspects of computer-aided system design, including verification, specification, synthesis, and testing.