23 research outputs found
What broke where for distributed and parallel applications — a whodunit story
Detection, diagnosis and mitigation of performance problems in today\u27s large-scale distributed and parallel systems is a difficult task. These large distributed and parallel systems are composed of various complex software and hardware components. When the system experiences some performance or correctness problem, developers struggle to understand the root cause of the problem and fix in a timely manner. In my thesis, I address these three components of the performance problems in computer systems. First, we focus on diagnosing performance problems in large-scale parallel applications running on supercomputers. We developed techniques to localize the performance problem for root-cause analysis. Parallel applications, most of which are complex scientific simulations running in supercomputers, can create up to millions of parallel tasks that run on different machines and communicate using the message passing paradigm. We developed a highly scalable and accurate automated debugging tool called PRODOMETER, which uses sophisticated algorithms to first, create a logical progress dependency graph of the tasks to highlight how the problem spread through the system manifesting as a system-wide performance issue. Second, uses this logical progress dependence graph to identify the task where the problem originated. Finally, PRODOMETER pinpoints the code region corresponding to the origin of the bug. Second, we developed a tool-chain that can detect performance anomaly using machine-learning techniques and can achieve very low false positive rate. Our input-aware performance anomaly detection system consists of a scalable data collection framework to collect performance related metrics from different granularity of code regions, an offline model creation and prediction-error characterization technique, and a threshold based anomaly-detection-engine for production runs. Our system requires few training runs and can handle unknown inputs and parameter combinations by dynamically calibrating the anomaly detection threshold according to the characteristics of the input data and the characteristics of the prediction-error of the models. Third, we developed performance problem mitigation scheme for erasure-coded distributed storage systems. Repair operations of the failed blocks in erasure-coded distributed storage system take really long time in networked constrained data-centers. The reason being, during the repair operation for erasure-coded distributed storage, a lot of data from multiple nodes are gathered into a single node and then a mathematical operation is performed to reconstruct the missing part. This process severely congests the links toward the destination where newly recreated data is to be hosted. We proposed a novel distributed repair technique, called Partial-Parallel-Repair (PPR) that performs this reconstruction in parallel on multiple nodes and eliminates network bottlenecks, and as a result, greatly speeds up the repair process. Fourth, we study how for a class of applications, performance can be improved (or performance problems can be mitigated) by selectively approximating some of the computations. For many applications, the main computation happens inside a loop that can be logically divided into a few temporal segments, we call phases. We found that while approximating the initial phases might severely degrade the quality of the results, approximating the computation for the later phases have very small impact on the final quality of the result. Based on this observation, we developed an optimization framework that for a given budget of quality-loss, would find the best approximation settings for each phase in the execution
Enabling Distributed Applications Optimization in Cloud Environment
The past few years have seen dramatic growth in the popularity of public clouds, such as Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Container-as-a-Service (CaaS). In both commercial and scientific fields, quick environment setup and application deployment become a mandatory requirement. As a result, more and more organizations choose cloud environments instead of setting up the environment by themselves from scratch. The cloud computing resources such as server engines, orchestration, and the underlying server resources are served to the users as a service from a cloud provider. Most of the applications that run in public clouds are the distributed applications, also called multi-tier applications, which require a set of servers, a service ensemble, that cooperate and communicate to jointly provide a certain service or accomplish a task. Moreover, a few research efforts are conducting in providing an overall solution for distributed applications optimization in the public cloud.
In this dissertation, we present three systems that enable distributed applications optimization: (1) the first part introduces DocMan, a toolset for detecting containerized application’s dependencies in CaaS clouds, (2) the second part introduces a system to deal with hot/cold blocks in distributed applications, (3) the third part introduces a system named FP4S, a novel fragment-based parallel state recovery mechanism that can handle many simultaneous failures for a large number of concurrently running stream applications
Secure and efficient storage of multimedia: content in public cloud environments using joint compression and encryption
The Cloud Computing is a paradigm still with many unexplored areas ranging from the
technological component to the de nition of new business models, but that is revolutionizing the way we design, implement and manage the entire infrastructure of information technology.
The Infrastructure as a Service is the delivery of computing infrastructure, typically a virtual data center, along with a set of APIs that allow applications, in an automatic way, can control the resources they wish to use. The choice of the service provider and how it applies to their business model may lead to higher or lower cost in the operation and maintenance of applications near the suppliers.
In this sense, this work proposed to carry out a literature review on the topic of Cloud
Computing, secure storage and transmission of multimedia content, using lossless compression, in public cloud environments, and implement this system by building an application that manages data in public cloud environments (dropbox and meocloud).
An application was built during this dissertation that meets the objectives set. This system provides the user a wide range of functions of data management in public cloud environments, for that the user only have to login to the system with his/her credentials, after performing the login, through the Oauth 1.0 protocol (authorization protocol) is generated an access token, this token is generated only with the consent of the user and allows the application to get access to data/user les without having to use credentials. With this token the framework can now operate and unlock the full potential of its functions. With this application
is also available to the user functions of compression and encryption so that user can make the most of his/her cloud storage system securely. The compression function works using the compression algorithm LZMA being only necessary for the user to choose the les to be compressed.
Relatively to encryption it will be used the encryption algorithm AES (Advanced Encryption Standard) that works with a 128 bit symmetric key de ned by user.
We build the research into two distinct and complementary parts: The rst part consists
of the theoretical foundation and the second part is the development of computer application where the data is managed, compressed, stored, transmitted in various environments of cloud computing. The theoretical framework is organized into two chapters, chapter 2 - Background
on Cloud Storage and chapter 3 - Data compression.
Sought through theoretical foundation demonstrate the relevance of the research, convey some of the pertinent theories and input whenever possible, research in the area. The second part of the work was devoted to the development of the application in cloud environment.
We showed how we generated the application, presented the features, advantages, and
safety standards for the data. Finally, we re ect on the results, according to the theoretical
framework made in the rst part and platform development.
We think that the work obtained is positive and that ts the goals we set ourselves
to achieve. This research has some limitations, we believe that the time for completion was scarce and the implementation of the platform could bene t from the implementation of other features.In future research it would be appropriate to continue the project expanding the capabilities
of the application, test the operation with other users and make comparative tests.A Computação em nuvem é um paradigma ainda com muitas áreas por explorar que
vão desde a componente tecnológica à definição de novos modelos de negócio, mas que está
a revolucionar a forma como projetamos, implementamos e gerimos toda a infraestrutura da
tecnologia da informação.
A Infraestrutura como Serviço representa a disponibilização da infraestrutura computacional,
tipicamente um datacenter virtual, juntamente com um conjunto de APls que permitirá
que aplicações, de forma automática, possam controlar os recursos que pretendem utilizar_ A
escolha do fornecedor de serviços e a forma como este aplica o seu modelo de negócio poderão
determinar um maior ou menor custo na operacionalização e manutenção das aplicações junto
dos fornecedores.
Neste sentido, esta dissertação propôs· se efetuar uma revisão bibliográfica sobre a
temática da Computação em nuvem, a transmissão e o armazenamento seguro de conteúdos
multimédia, utilizando a compressão sem perdas, em ambientes em nuvem públicos, e implementar
um sistema deste tipo através da construção de uma aplicação que faz a gestão dos
dados em ambientes de nuvem pública (dropbox e meocloud).
Foi construída uma aplicação no decorrer desta dissertação que vai de encontro aos objectivos
definidos. Este sistema fornece ao utilizador uma variada gama de funções de gestão
de dados em ambientes de nuvem pública, para isso o utilizador tem apenas que realizar o login
no sistema com as suas credenciais, após a realização de login, através do protocolo Oauth 1.0
(protocolo de autorização) é gerado um token de acesso, este token só é gerado com o consentimento
do utilizador e permite que a aplicação tenha acesso aos dados / ficheiros do utilizador
~em que seja necessário utilizar as credenciais. Com este token a aplicação pode agora operar e
disponibilizar todo o potencial das suas funções. Com esta aplicação é também disponibilizado
ao utilizador funções de compressão e encriptação de modo a que possa usufruir ao máximo
do seu sistema de armazenamento cloud com segurança. A função de compressão funciona
utilizando o algoritmo de compressão LZMA sendo apenas necessário que o utilizador escolha os
ficheiros a comprimir. Relativamente à cifragem utilizamos o algoritmo AES (Advanced Encryption
Standard) que funciona com uma chave simétrica de 128bits definida pelo utilizador.
Alicerçámos a investigação em duas partes distintas e complementares: a primeira parte
é composta pela fundamentação teórica e a segunda parte consiste no desenvolvimento da aplicação
informática em que os dados são geridos, comprimidos, armazenados, transmitidos em
vários ambientes de computação em nuvem. A fundamentação teórica encontra-se organizada
em dois capítulos, o capítulo 2 - "Background on Cloud Storage" e o capítulo 3 "Data Compression",
Procurámos, através da fundamentação teórica, demonstrar a pertinência da investigação. transmitir algumas das teorias pertinentes e introduzir, sempre que possível, investigações
existentes na área. A segunda parte do trabalho foi dedicada ao desenvolvimento da
aplicação em ambiente "cloud". Evidenciámos o modo como gerámos a aplicação, apresentámos
as funcionalidades, as vantagens. Por fim, refletimos sobre os resultados , de acordo com o
enquadramento teórico efetuado na primeira parte e o desenvolvimento da plataforma.
Pensamos que o trabalho obtido é positivo e que se enquadra nos objetivos que nos propusemos
atingir. Este trabalho de investigação apresenta algumas limitações, consideramos que
o tempo para a sua execução foi escasso e a implementação da plataforma poderia beneficiar
com a implementação de outras funcionalidades. Em investigações futuras seria pertinente dar continuidade ao projeto ampliando as potencialidades da aplicação, testar o funcionamento
com outros utilizadores e efetuar testes comparativos.Fundação para a Ciência e a Tecnologia (FCT
Recommended from our members
Optimising data centre operation by removing the transport bottleneck
Data centres lie at the heart of almost every service on the Internet. Data centres are used to provide search results, to power social media, to store and index email, to host “cloud” applications, for online retail and to provide a myriad of other web services. Consequently the more efficient they can be made the better for all of us. The power of modern data centres is in combining commodity off-the-shelf server hardware and network equipment to provide what Google’s Barrosso and Ho ̈lzle describe as “warehouse scale” computers.
Data centres rely on TCP, a transport protocol that was originally designed for use in the Internet. Like other such protocols, TCP has been optimised to maximise throughput, usually by filling up queues at the bottleneck. However, for most applications within a data centre network latency is more critical than throughput. Consequently the choice of transport protocol becomes a bottleneck for performance. My thesis is that the solution to this is to move away from the use of one-size-fits-all transport protocols towards ones that have been designed to reduce latency across the data centre and which can dynamically respond to the needs of the applications.
This dissertation focuses on optimising the transport layer in data centre networks. In particular I address the question of whether any single transport mechanism can be flexible enough to cater to the needs of all data centre traffic. I show that one leading protocol (DCTCP) has been heavily optimised for certain network conditions. I then explore approaches that seek to minimise latency for applications that care about it while still allowing throughput-intensive applications to receive a good level of service. My key contributions to this are Silo and Trevi.
Trevi is a novel transport system for storage traffic that utilises fountain coding to max- imise throughput and minimise latency while being agnostic to drop, thus allowing storage traffic to be pushed out of the way when latency sensitive traffic is present in the network. Silo is an admission control system that is designed to give tenants of a multi-tenant data centre guaranteed low latency network performance. Both of these were developed in collaboration with others
Data Replication and Its Alignment with Fault Management in the Cloud Environment
Nowadays, the exponential data growth becomes one of the major challenges all over the world. It may cause a series of negative impacts such as network overloading, high system complexity, and inadequate data security, etc. Cloud computing is developed to construct a novel paradigm to alleviate massive data processing challenges with its on-demand services and distributed architecture. Data replication has been proposed to strategically distribute the data access load to multiple cloud data centres by creating multiple data copies at multiple cloud data centres. A replica-applied cloud environment not only achieves a decrease in response time, an increase in data availability, and more balanced resource load but also protects the cloud environment against the upcoming faults. The reactive fault tolerance strategy is also required to handle the faults when the faults already occurred. As a result, the data replication strategies should be aligned with the reactive fault tolerance strategies to achieve a complete management chain in the cloud environment.
In this thesis, a data replication and fault management framework is proposed to establish a decentralised overarching management to the cloud environment. Three data replication strategies are firstly proposed based on this framework. A replica creation strategy is proposed to reduce the total cost by jointly considering the data dependency and the access frequency in the replica creation decision making process. Besides, a cloud map oriented and cost efficiency driven replica creation strategy is proposed to achieve the optimal cost reduction per replica in the cloud environment. The local data relationship and the remote data relationship are further analysed by creating two novel data dependency types, Within-DataCentre Data Dependency and Between-DataCentre Data Dependency, according to the data location. Furthermore, a network performance based replica selection strategy is proposed to avoid potential network overloading problems and to increase the number of concurrent-running instances at the same time
Scaling and Resilience in Numerical Algorithms for Exascale Computing
The first Petascale supercomputer, the IBM Roadrunner, went online in 2008. Ten years later, the community is now looking ahead to a new generation of Exascale machines. During the decade that has passed, several hundred Petascale capable machines have been installed worldwide, yet despite the abundance of machines, applications that scale to their full size remain rare. Large clusters now routinely have 50.000+ cores, some have several million. This extreme level of parallelism, that has allowed a theoretical compute capacity in excess of a million billion operations per second, turns out to be difficult to use in many applications of practical interest. Processors often end up spending more time waiting for synchronization, communication, and other coordinating operations to complete, rather than actually computing. Component reliability is another challenge facing HPC developers. If even a single processor fail, among many thousands, the user is forced to restart traditional applications, wasting valuable compute time. These issues collectively manifest themselves as low parallel efficiency, resulting in waste of energy and computational resources. Future performance improvements are expected to continue to come in large part due to increased parallelism. One may therefore speculate that the difficulties currently faced, when scaling applications to Petascale machines, will progressively worsen, making it difficult for scientists to harness the full potential of Exascale computing.
The thesis comprises two parts. Each part consists of several chapters discussing modifications of numerical algorithms to make them better suited for future Exascale machines. In the first part, the use of Parareal for Parallel-in-Time integration techniques for scalable numerical solution of partial differential equations is considered. We propose a new adaptive scheduler that optimize the parallel efficiency by minimizing the time-subdomain length without making communication of time-subdomains too costly. In conjunction with an appropriate preconditioner, we demonstrate that it is possible to obtain time-parallel speedup on the nonlinear shallow water equation, beyond what is possible using conventional spatial domain-decomposition techniques alone. The part is concluded with the proposal of a new method for constructing Parallel-in-Time integration schemes better suited for convection dominated problems.
In the second part, new ways of mitigating the impact of hardware failures are developed and presented. The topic is introduced with the creation of a new fault-tolerant variant of Parareal. In the chapter that follows, a C++ Library for multi-level checkpointing is presented. The library uses lightweight in-memory checkpoints, protected trough the use of erasure codes, to mitigate the impact of failures by decreasing the overhead of checkpointing and minimizing the compute work lost. Erasure codes have the unfortunate property that if more data blocks are lost than parity codes created, the data is effectively considered unrecoverable. The final chapter contains a preliminary study on partial information recovery for incomplete checksums. Under the assumption that some meta knowledge exists on the structure of the data encoded, we show that the data lost may be recovered, at least partially. This result is of interest not only in HPC but also in data centers where erasure codes are widely used to protect data efficiently
Security Risk Management for the Internet of Things
In recent years, the rising complexity of Internet of Things (IoT) systems has increased their potential vulnerabilities and introduced new cybersecurity challenges. In this context, state of the art methods and technologies for security risk assessment have prominent limitations when it comes to large scale, cyber-physical and interconnected IoT systems. Risk assessments for modern IoT systems must be frequent, dynamic and driven by knowledge about both cyber and physical assets. Furthermore, they should be more proactive, more automated, and able to leverage information shared across IoT value chains. This book introduces a set of novel risk assessment techniques and their role in the IoT Security risk management process. Specifically, it presents architectures and platforms for end-to-end security, including their implementation based on the edge/fog computing paradigm. It also highlights machine learning techniques that boost the automation and proactiveness of IoT security risk assessments. Furthermore, blockchain solutions for open and transparent sharing of IoT security information across the supply chain are introduced. Frameworks for privacy awareness, along with technical measures that enable privacy risk assessment and boost GDPR compliance are also presented. Likewise, the book illustrates novel solutions for security certification of IoT systems, along with techniques for IoT security interoperability. In the coming years, IoT security will be a challenging, yet very exciting journey for IoT stakeholders, including security experts, consultants, security research organizations and IoT solution providers. The book provides knowledge and insights about where we stand on this journey. It also attempts to develop a vision for the future and to help readers start their IoT Security efforts on the right foot
Understanding Quantum Technologies 2022
Understanding Quantum Technologies 2022 is a creative-commons ebook that
provides a unique 360 degrees overview of quantum technologies from science and
technology to geopolitical and societal issues. It covers quantum physics
history, quantum physics 101, gate-based quantum computing, quantum computing
engineering (including quantum error corrections and quantum computing
energetics), quantum computing hardware (all qubit types, including quantum
annealing and quantum simulation paradigms, history, science, research,
implementation and vendors), quantum enabling technologies (cryogenics, control
electronics, photonics, components fabs, raw materials), quantum computing
algorithms, software development tools and use cases, unconventional computing
(potential alternatives to quantum and classical computing), quantum
telecommunications and cryptography, quantum sensing, quantum technologies
around the world, quantum technologies societal impact and even quantum fake
sciences. The main audience are computer science engineers, developers and IT
specialists as well as quantum scientists and students who want to acquire a
global view of how quantum technologies work, and particularly quantum
computing. This version is an extensive update to the 2021 edition published in
October 2021.Comment: 1132 pages, 920 figures, Letter forma
Proceedings of the 21st Conference on Formal Methods in Computer-Aided Design – FMCAD 2021
The Conference on Formal Methods in Computer-Aided Design (FMCAD) is an annual conference on the theory and applications of formal methods in hardware and system verification. FMCAD provides a leading forum to researchers in academia and industry for presenting and discussing groundbreaking methods, technologies, theoretical results, and tools for reasoning formally about computing systems. FMCAD covers formal aspects of computer-aided system design including verification, specification, synthesis, and testing