
    A shared-disk parallel cluster file system

    Dissertation presented to obtain the degree of Doctor in Informatics at Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia.
    Today, clusters are the de facto cost-effective platform for both high performance computing (HPC) and IT environments. HPC and IT are quite different environments, and their differences include, among others, their choices of file systems and storage: HPC favours parallel file systems geared towards maximum I/O bandwidth, which are not fully POSIX-compliant and were devised to run on top of (fault-prone) partitioned storage; conversely, IT data centres favour both external disk arrays (to provide highly available storage) and POSIX-compliant file systems (either general-purpose or shared-disk cluster file systems, CFSs). These specialised file systems perform very well in their target environments, provided that applications do not require certain secondary features, e.g., file locking on parallel file systems, or high-performance writes over cluster-wide shared files on CFSs. In brief, none of the above approaches solves the problem of providing high levels of reliability and performance to both worlds. Our pCFS proposal contributes to changing this situation: the rationale is to take advantage of the best of both, namely the reliability of cluster file systems and the high performance of parallel file systems. We don't claim to provide the absolute best of each, but we aim at full POSIX compliance, a rich feature set, and levels of reliability and performance good enough for broad usage, e.g., traditional as well as HPC applications, support for clustered DBMS engines that may run over regular files, and video streaming. pCFS' main ideas include:
    · Cooperative caching, a technique that has been used in file systems for distributed disks but, as far as we know, was never used either in SAN-based cluster file systems or in parallel file systems. As a result, pCFS may use all infrastructures (LAN and SAN) to move data.
    · Fine-grain locking, whereby processes running across distinct nodes may define non-overlapping byte-range regions in a file (instead of the whole file) and access them in parallel, reading and writing over those regions at the infrastructure's full speed (provided that no major metadata changes are required).
    A prototype was built on top of GFS (a Red Hat shared-disk CFS): GFS' kernel code was slightly modified, and two kernel modules and a user-level daemon were added. In the prototype, fine-grain locking is fully implemented and a cluster-wide coherent cache is maintained through data (page fragment) movement over the LAN. Our benchmarks for non-overlapping writers over a single file shared among processes running on different nodes show that pCFS' bandwidth is 2 times greater than NFS' while being comparable to that of the Parallel Virtual File System (PVFS), with NFS and PVFS both requiring about 10 times more CPU. pCFS' bandwidth also surpasses GFS' (600 times for small record sizes, e.g., 4 KB, decreasing to 2 times for large record sizes, e.g., 4 MB), at about the same CPU usage.
    Lusitania, Companhia de Seguros S.A, Programa IBM Shared University Research (SUR)
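    The fine-grain locking idea can be pictured with a minimal, single-process sketch: a lock table that grants exclusive locks on half-open byte ranges of one file and refuses only overlapping requests, so writers on disjoint records never block each other. This is not pCFS code; `ByteRangeLockTable`, `try_lock` and `unlock` are hypothetical names, and a real cluster implementation would also need a distributed lock manager, shared (read) locks and cache coherence.

```python
from dataclasses import dataclass, field


@dataclass
class ByteRangeLockTable:
    """Hypothetical per-file table of exclusive locks on half-open byte ranges."""

    # owner (e.g. a node or process id) -> list of (start, end) ranges it holds
    held: dict = field(default_factory=dict)

    @staticmethod
    def _overlaps(a: tuple, b: tuple) -> bool:
        # [a0, a1) and [b0, b1) overlap iff each starts before the other ends
        return a[0] < b[1] and b[0] < a[1]

    def try_lock(self, owner: str, start: int, end: int) -> bool:
        """Grant [start, end) to owner unless another owner holds an overlapping range."""
        if start >= end:
            raise ValueError("empty or inverted byte range")
        wanted = (start, end)
        for other, ranges in self.held.items():
            if other != owner and any(self._overlaps(wanted, r) for r in ranges):
                return False  # conflict: the caller must wait or retry
        self.held.setdefault(owner, []).append(wanted)
        return True

    def unlock(self, owner: str, start: int, end: int) -> None:
        # Raises ValueError if owner does not hold exactly this range.
        self.held.get(owner, []).remove((start, end))


if __name__ == "__main__":
    table = ByteRangeLockTable()
    print(table.try_lock("node-a", 0, 4096))     # first 4 KB record: granted
    print(table.try_lock("node-b", 4096, 8192))  # adjacent record: granted
    print(table.try_lock("node-c", 4000, 4100))  # straddles both: refused
```

Half-open ranges make adjacent records, such as [0, 4096) and [4096, 8192), non-conflicting, which is the case the non-overlapping-writers benchmark exercises.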

    Building Computing-As-A-Service Mobile Cloud System

    The last five years have witnessed the proliferation of smart mobile devices, the explosion of mobile applications, and the rapid adoption of cloud computing in business, governmental and educational IT deployments. There is also a growing trend of combining mobile computing and cloud computing into a new, popular computing paradigm. This thesis envisions a future of mobile computing shaped primarily by the following three trends. First, cloud servers equipped with high-speed multi-core processors have become mainstream; meanwhile, ARM-based servers have recently grown in popularity, and virtualization on ARM systems is attracting wide attention. Second, high-speed Internet access has become pervasive and highly available, so mobile devices can connect to the cloud anytime and anywhere. Third, cloud computing is reshaping the way computing resources are used: the classic pay/scale-as-you-go model allows hardware resources to be optimally allocated and well managed. These three trends lend credence to a new mobile computing model that combines the resource-rich cloud with less powerful mobile devices. In this model, mobile devices run the core virtualization hypervisor with virtualized phone instances, allowing pervasive access to more powerful, highly available virtual phone clones in the cloud. The centralized cloud, with rich computing and memory resources, hosts the virtual phone clones and repeatedly synchronizes data changes with the virtual phone instances running on mobile devices. Users can flexibly isolate different computing environments. In this dissertation, we explored the opportunity of leveraging cloud resources for mobile computing for the purposes of energy saving, performance augmentation, and secure isolation of computing environments.
    We proposed a framework that allows mobile users to seamlessly leverage the cloud to augment the computing capability of their devices, and that also makes it simpler for application developers to run their smartphone applications in the cloud without tedious application partitioning. This framework was built with virtualization on both the server side and mobile devices. It has three building blocks: agile virtual machine deployment, efficient virtual resource management, and seamless mobile augmentation. We presented the design, implementation and evaluation of these three components and demonstrated the feasibility of the proposed mobile cloud model.
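    The abstract does not say how clones and on-device instances are synchronized. As an illustrative assumption only, the sketch below shows generic block-level delta synchronization: the replica reports per-block digests, the sender ships only the blocks whose digests differ, and the replica patches its copy. `BLOCK_SIZE`, `block_digests`, `changed_blocks` and `apply_delta` are hypothetical names.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4096  # hypothetical synchronization granularity in bytes


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """SHA-256 digest of each fixed-size block (the last block may be shorter)."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def changed_blocks(replica_digests: List[str], new: bytes,
                   block_size: int = BLOCK_SIZE) -> Dict[int, bytes]:
    """Blocks of `new` that differ from the replica's copy, as {index: bytes}."""
    delta = {}
    for idx, digest in enumerate(block_digests(new, block_size)):
        if idx >= len(replica_digests) or replica_digests[idx] != digest:
            delta[idx] = new[idx * block_size:(idx + 1) * block_size]
    return delta


def apply_delta(old: bytes, delta: Dict[int, bytes], new_len: int,
                block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the new image on the replica from its old copy plus the delta."""
    buf = bytearray(old[:new_len].ljust(new_len, b"\0"))  # truncate or zero-extend
    for idx, block in delta.items():
        start = idx * block_size
        buf[start:start + len(block)] = block
    return bytes(buf)


if __name__ == "__main__":
    old = bytes(10000)
    new = bytearray(old)
    new[5000] = 1  # a single changed byte in the second block
    delta = changed_blocks(block_digests(old), bytes(new))
    print(sorted(delta))  # only block 1 is sent
```

Only digests travel from replica to sender and only changed blocks travel back, which is what keeps repeated synchronization cheap on a mobile link.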

    Analyzing Metadata Performance in Distributed File Systems

    Distributed file systems are important building blocks in modern computing environments. The challenge of increasing I/O bandwidth to files has been largely resolved by the use of parallel file systems and sufficient hardware. However, determining the best way to manage large amounts of metadata, which contains information about the files and directories stored in a distributed file system, has proved a more difficult challenge. The objective of this thesis is to analyze the role of metadata and to present past and current implementations and access semantics. Understanding how current file system interfaces and functionality developed is key to understanding their performance limitations. Based on this analysis, a distributed metadata benchmark called DMetabench is presented. DMetabench significantly improves on existing benchmarks and can stress metadata operations in a distributed file system in a parallelized manner. Both intra-node and inter-node parallelism, reflecting current trends in computer architecture, can be explicitly tested with DMetabench. This distinction matters because a distributed file system can have different semantics within a client node than between multiple nodes. As measurements in larger distributed environments may exhibit performance artifacts that are difficult to explain by reference to average numbers, DMetabench uses a time-logging technique to record time-related changes in the performance of metadata operations, and also records additional details of the runtime environment for post-benchmark analysis. Using the large production file systems at the Leibniz Supercomputing Center (LRZ) in Munich, the functionality of DMetabench is evaluated by means of measurements on different distributed file systems. The results not only demonstrate the effectiveness of the proposed methods but also provide unique insight into the current state of metadata performance in modern file systems.
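    The time-logging idea can be sketched in a few lines, assuming a single process creating files in a local directory: log the completion time of every metadata operation, then bucket the log into fixed intervals so that stalls and throughput changes show up instead of vanishing into an average. DMetabench itself runs distributed, parallel workloads; `timed_creates` and `ops_per_interval` are hypothetical helpers, not its API.

```python
import os
import tempfile
import time
from collections import Counter
from typing import List


def timed_creates(directory: str, count: int) -> List[float]:
    """Create `count` empty files, logging each completion time in seconds since start."""
    start = time.perf_counter()
    log = []
    for i in range(count):
        with open(os.path.join(directory, f"f{i:06d}"), "w"):
            pass  # creating the file is the metadata operation being timed
        log.append(time.perf_counter() - start)
    return log


def ops_per_interval(log: List[float], interval: float) -> List[int]:
    """Count completed operations per fixed-length interval, including empty ones."""
    if not log:
        return []
    buckets = Counter(int(t // interval) for t in log)
    return [buckets.get(i, 0) for i in range(max(buckets) + 1)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = timed_creates(tmp, 2000)
    counts = ops_per_interval(log, 0.01)  # 10 ms buckets
    print(f"{len(log)} creates; creates per 10 ms: {counts}")
```

Keeping empty intervals as zeros matters: a run of zeros in the middle of the series is exactly the kind of stall an average would hide.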

    A Fully Userspace Remote Storage Access Stack

    As computer networking has evolved and available throughput has increased, the efficiency of the network software stack has become increasingly important. This is because the latency introduced by software has gone from insignificant, compared to historically poor network performance, to the largest component of latency in a modern local-area network. Currently, the vast majority of code that accesses the hardware is part of the kernel, because the kernel is responsible for ensuring that user applications do not interfere with each other when accessing the hardware. Remote Direct Memory Access (RDMA) lets applications perform direct data transfers over the network without context switches into the kernel, relying instead on specialized hardware interfaces to handle virtual address mappings and transport protocols. This more intelligent hardware allows direct control from the userspace application, eliminating the cost of context switches into the kernel, which in turn reduces the overall latency of message transfers. Storage is currently undergoing a similar evolution. For most of the recent history of computing, the most common durable storage medium has been the mechanical hard disk drive, which can only be accessed at block level and has high latency compared to the software drivers used to access the data. However, the introduction of Flash-based solid state disks (SSDs) significantly decreased latency, as no mechanical parts need to move to access the data. Upcoming non-volatile memory solutions reduce this latency even further and even allow byte-level access to the storage medium. Thus, as with networking, software drivers become the bottleneck, and we look for solutions that bypass the kernel to improve the efficiency of direct userspace access to storage. This thesis offers two contributions as part of a solution to these problems.
    The first part introduces urdma, a software RDMA driver that leverages the Data Plane Development Kit (DPDK) to perform network data transfers in userspace without specialized RDMA interface hardware. The second part examines remote locking protocols, which are required for synchronization in distributed storage systems. We define an RDMA locking mechanism called Verbs Offload Locking Technology (VOLT), which allows a remote lock object to be acquired without any CPU usage on the target node. This offloading allows VOLT to be used with disaggregated memory servers that have limited onboard CPU resources, while also lowering the application overhead of remote locking. Finally, we define a bytecode framework that uses extended Berkeley Packet Filter (eBPF) bytecode to extend the capabilities of an RDMA-capable network interface card (NIC) with new operations, and show how it can be used to implement our remote locking operation.
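    The abstract does not detail VOLT's protocol. A common building block for CPU-free remote locking, assumed here for illustration, is the RDMA atomic compare-and-swap, which the target NIC executes on a 64-bit word in registered memory without involving the target CPU. The Python sketch below only simulates that: `RemoteWord` stands in for the remote lock word and the NIC's atomicity, and `acquire`/`release` are hypothetical client routines.

```python
import threading
import time

UNLOCKED = 0  # lock-word value meaning "free"; client ids must be non-zero


class RemoteWord:
    """Simulated 64-bit word in a remote node's registered memory.

    The private mutex only models the atomicity an RDMA NIC guarantees for its
    compare-and-swap; no target-side application code runs when it is accessed."""

    def __init__(self, value: int = UNLOCKED):
        self._value = value
        self._atomic = threading.Lock()

    def compare_and_swap(self, expected: int, new: int) -> int:
        """Set the word to `new` if it equals `expected`; always return the prior value."""
        with self._atomic:
            old = self._value
            if old == expected:
                self._value = new
            return old


def acquire(word: RemoteWord, client_id: int) -> None:
    if client_id == UNLOCKED:
        raise ValueError("client id must be non-zero")
    # Retry the remote CAS until the word goes from UNLOCKED to our id.
    while word.compare_and_swap(UNLOCKED, client_id) != UNLOCKED:
        time.sleep(0)  # yield; a real client would back off between RDMA round trips


def release(word: RemoteWord, client_id: int) -> None:
    # CAS back to UNLOCKED, which also detects releasing a lock we do not hold.
    if word.compare_and_swap(client_id, UNLOCKED) != client_id:
        raise RuntimeError("lock is not held by this client")
```

Storing the holder's id (rather than a plain 1) lets `release` reject a release by a client that does not own the lock, using the same single atomic operation.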

    A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud

    High Performance Computing (HPC) systems have been widely used by scientists and researchers in both industry and university laboratories to solve advanced computational problems. Most such problems are either data-intensive or computation-intensive, and may take hours, days or even weeks to complete. For example, some computations on traditional HPC systems run on 100,000 processors for weeks. Consequently, traditional HPC systems often require huge capital investments, and as a result scientists and researchers sometimes have to wait in long queues to access shared, expensive HPC systems. Cloud computing, on the other hand, offers new computing paradigms, capacity, and flexible solutions for both business and HPC applications. Some of the computation-intensive applications usually executed on traditional HPC systems can now be executed in the cloud, and the cloud pricing model eliminates huge capital investments. However, even for cloud-based HPC systems, fault tolerance is a growing concern. The large number of virtual machines and electronic components, as well as software complexity and overall system reliability, availability and serviceability (RAS), are factors that HPC systems in the cloud must contend with. The reactive fault tolerance approach of checkpoint/restart, commonly used in HPC systems, does not scale well in the cloud due to resource sharing and distributed system networks. Hence, the need for reliable fault-tolerant HPC systems is even greater in a cloud environment. In this thesis we present a proactive fault tolerance approach for HPC systems in the cloud that reduces wall-clock execution time, as well as dollar cost, in the presence of hardware failures. We have developed a generic fault tolerance algorithm for HPC systems in the cloud, and a cost model for executing computation-intensive applications on HPC systems in the cloud.
    Our experimental results, obtained from a real cloud execution environment, show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be considerably reduced compared to the checkpoint and redundancy techniques used in traditional HPC systems.
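    The thesis's generic algorithm and cost model are not given in the abstract. The toy model below, built entirely on stated assumptions (per-started-hour VM billing, half an interval of lost work per failure under checkpoint/restart, and perfectly predicted failures under the proactive approach), only illustrates why avoiding rework can lower both wall-clock time and dollar cost. All function names and parameter values are hypothetical.

```python
import math


def billed_cost(wall_clock_hours: float, vms: int, price_per_vm_hour: float) -> float:
    """Dollar cost, assuming each VM is billed per started hour (an assumption)."""
    return math.ceil(wall_clock_hours) * vms * price_per_vm_hour


def checkpoint_restart_hours(work: float, interval: float, ckpt_overhead: float,
                             failures: int, restart_overhead: float) -> float:
    """Reactive model: a checkpoint every `interval` hours of work; each failure
    costs a restart plus, on average, half an interval of lost work to redo."""
    checkpoints = math.floor(work / interval)
    return work + checkpoints * ckpt_overhead + failures * (restart_overhead + interval / 2)


def proactive_hours(work: float, failures: int, migration_overhead: float) -> float:
    """Proactive model: every failure is predicted in time and the affected VM is
    migrated beforehand, so no work is lost (an optimistic assumption)."""
    return work + failures * migration_overhead


if __name__ == "__main__":
    # Hypothetical job: 100 h of work on 64 VMs at $0.50 per VM-hour, 3 failures.
    reactive = checkpoint_restart_hours(100, 1.0, 0.05, 3, 0.5)
    proactive = proactive_hours(100, 3, 0.1)
    print(f"checkpoint/restart: {reactive:.1f} h, ${billed_cost(reactive, 64, 0.5):.2f}")
    print(f"proactive:          {proactive:.1f} h, ${billed_cost(proactive, 64, 0.5):.2f}")
```

Because every VM is billed for the whole run, each hour of checkpoint overhead or rework is multiplied by the number of VMs, which is why wall-clock savings translate directly into dollar savings in this model.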

    Private cloud computing platforms. Analysis and implementation in a Higher Education Institution

    The constant evolution of the Internet, its increasing use and its growing integration into private and public activities, with a strong impact on their survival, have given rise to an emerging technology. Through cloud computing, it is possible to abstract users from the layers below the business, letting them focus only on what matters most to manage, with the advantage of being able to grow (or shrink) resources as needed. IT resources have evolved considerably over the last decade, raising a whole new awareness of optimization from which the cloud computing paradigm arose. In this regard, after a study of the most common cloud platforms and of the current implementation of the technologies applied at the Institute of Biomedical Sciences of Abel Salazar and the Faculty of Pharmacy of Oporto University, an evolution is proposed in order to address certain requirements in the context of cloud computing.
    Distributions built specifically for deploying private clouds are now available, and their configuration has been greatly simplified. However, for a well-implemented architecture to be viable in terms of hardware, networking, security, efficiency and effectiveness, the required infrastructure must be considered as a whole. An in-depth multidisciplinary study of all the topics surrounding this technology is intrinsically tied to the architecture of a cloud system; without it, the result is a deficient system. A broader view is needed, beyond the required equipment and the software used, that properly weighs implementation costs, including the specialized staff needed in the various areas involved. The construction of a new data center, resulting from the merging of the buildings of the Instituto de Ciências Biomédicas de Abel Salazar and the Faculdade de Farmácia da Universidade do Porto, made it possible to share technological resources. Given the existing infrastructure, which is fully scalable and based on an approach of growth and virtualization, a private cloud implementation is considered, since the existing resources are well suited to this emerging reality. The adopted virtualization technology and the corresponding hardware (storage and processing) were designed for an implementation based on XEN Server; since the server fleet is heterogeneous, and taking into account the philosophies of the available technologies (open and proprietary), an alternative to the existing implementation, based on Microsoft technology, is studied.
    Given the nature of the institution, and depending on the resources required and the approach taken, the development of a private cloud may consider integration with public clouds (for example, Google Apps), and the candidate solutions may be based on open technologies, paid technologies, or both. Ultimately, this work aims to survey the technologies currently in use and to identify potential solutions that, together with the current infrastructure, can provide a private cloud service. The work begins with a concise explanation of the cloud concept, comparing it with other forms of computing, presenting its characteristics, reviewing its history, and explaining its layers, deployment models and architectures. Next, the state-of-the-art chapter covers the main cloud computing platforms, focusing on Microsoft Azure, Google Apps, Cloud Foundry, Delta Cloud and Open Stack. Other emerging platforms are also covered, giving a broader view of the technological solutions currently available. After the state of the art, a particular case study is presented: the implementation of the IT scenario for the new building of two constituent units of the Universidade do Porto, the Instituto de Ciências Biomédicas Abel Salazar and the Faculdade de Farmácia, and its private cloud architecture using shared resources. The case study is followed by a suggested evolution of the implementation that uses cloud computing technologies to meet the necessary requirements and to integrate and streamline the existing infrastructure.

    Virtualization of Micro-architectural Components Using Software Solutions

    Cloud computing has become a dominant computing paradigm in the information technology industry due to its flexibility and efficiency in resource sharing and management. The key technology that enables cloud computing is virtualization. Essential requirements in a virtualized system where several virtual machines (VMs) run on the same physical machine include performance isolation and predictability. To enforce these properties, the virtualization software (called the hypervisor) must find a way to divide the physical resources of the system (e.g., physical memory, processor time) and allocate them to VMs according to the amount of virtual resources defined for each VM. However, modern hardware has complex architectures, and some micro-architectural resources, such as processor caches, memory controllers and interconnects, cannot be divided and allocated to VMs. They are globally shared among all VMs, which compete for their use, leading to contention. As a result, performance isolation and predictability are compromised. In this thesis, we propose software solutions for preventing performance unpredictability due to micro-architectural components. The first contribution, called Kyoto, is a solution to the cache contention issue inspired by the polluter-pays principle. A VM is said to pollute the cache if it provokes significant cache replacements that impact the performance of other VMs. Using the Kyoto system, the provider can therefore encourage cloud users to book pollution permits for their VMs. The second contribution addresses the problem of efficiently virtualizing NUMA machines. The major challenge comes from the fact that the hypervisor regularly reconfigures the placement of a VM over the NUMA topology. However, neither guest operating systems (OSs) nor system runtime libraries (e.g., HotSpot) are designed to consider NUMA topology changes at runtime, which leads to unpredictable performance for end-user applications.
    We present eXtended Para-Virtualization (XPV), a new principle for efficiently virtualizing a NUMA architecture. XPV consists of revisiting the interface between the hypervisor and the guest OS, and between the guest OS and system runtime libraries, so that they can dynamically take NUMA topology changes into account.
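    The abstract describes Kyoto's polluter-pays idea but not its enforcement mechanism. Assuming, for illustration, that pollution is measured per accounting window (for example from last-level-cache miss counters) and that over-polluters are slowed by cutting their CPU time in proportion to the excess, the policy reduces to a few lines. `VMPollution`, `cpu_share` and `next_window_shares` are hypothetical names, and the proportional rule is an assumed policy, not necessarily Kyoto's.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class VMPollution:
    name: str
    permit: float          # booked pollution budget per accounting window (hypothetical unit)
    measured: float = 0.0  # pollution observed in the last window, e.g. from LLC-miss counters


def cpu_share(vm: VMPollution) -> float:
    """Fraction of its CPU time a VM keeps in the next window: VMs within their
    permit run unrestricted; over-polluters are slowed in proportion to the excess."""
    if vm.measured <= vm.permit:
        return 1.0
    return vm.permit / vm.measured


def next_window_shares(vms: List[VMPollution]) -> Dict[str, float]:
    return {vm.name: cpu_share(vm) for vm in vms}


if __name__ == "__main__":
    vms = [VMPollution("quiet", permit=100, measured=50),
           VMPollution("polluter", permit=100, measured=400)]
    print(next_window_shares(vms))
```

Slowing a polluter reduces how fast it can evict other VMs' cache lines, so the booked permit effectively caps the damage it can do to its neighbours.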

    Optimierung des Wirkungsgrades virtueller Infrastrukturen (Optimizing the Efficiency of Virtual Infrastructures)

    Virtualization techniques are enjoying ever-growing popularity in many areas of computer science. Originally rediscovered with the aim of consolidating resources and services, virtualization approaches now serve as the foundation of modern grid and cloud computing infrastructures and are thus also used in high-performance computing. At present, there are no objective and systematic analyses of the efficiency of virtualization approaches, techniques and implementations, even though they are deployed and run in production by many large data centers worldwide. All existing modern host virtualization approaches rely on a software layer that, depending on the type of virtualization, sits between the hardware and the guest operating system or between the host and guest operating systems. An application in a virtual machine therefore depends not only on the performance of the physical system, but also on the technology of the virtualization product in use and on concurrently running virtual machines. Depending on the type of application, it can therefore make sense to choose a different virtualization approach and to pay attention to the type of the concurrent virtual machines, in order to optimize the efficiency of a local system as well as that of the global infrastructure. To achieve this goal, a two-stage approach is taken: first, virtualization approaches are analyzed theoretically and parameters are identified whose influence on efficiency is then quantified empirically in a second step. Carrying out this quantitative analysis requires adapting common performance metrics, such as throughput and response time, to the context of virtualization, because they classically refer to a machine's operating system, whereas a virtual machine architecturally resembles a classic application.
    Measuring this performance metric in virtual environments poses a further challenge, since time measurement inside virtual machines is generally error-prone because of scheduling by the hypervisor, so alternative measurement methods must be devised. Based on the analyses and measurements performed, a guideline is then developed that helps to estimate, qualitatively and quantitatively, the resources required to virtualize an infrastructure, and to distribute the virtual machines across physical systems according to their characteristic resource demands, so that the available physical resources can be used optimally. The work is rounded off by automating this guideline through the development and prototype implementation of a global resource scheduler based on a weighted constraint solver. Although the approach has a theoretically exponential runtime complexity, in practice, thanks to a purpose-built greedy heuristic, it delivers excellent results after an extremely short runtime. The optimized distributions can then be realized with only a few live migrations, since the proximity of a new distribution to the existing one is already taken into account when it is computed.
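    The scheduler itself uses a weighted constraint solver with a greedy heuristic whose details are not given here. As a simpler stand-in, the sketch below uses first-fit-decreasing placement by CPU and memory demand, trying each VM's current host first so that the resulting distribution stays close to the existing one and needs few live migrations. `Host`, `Guest`, `place` and `migrations` are hypothetical names, and the two resource dimensions are an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Host:
    name: str
    cpu: float  # capacity, e.g. cores
    mem: float  # capacity, e.g. GiB


@dataclass
class Guest:
    name: str
    cpu: float  # characteristic demand
    mem: float
    current_host: Optional[str] = None  # None for a VM that is not yet placed


def place(guests: List[Guest], hosts: List[Host]) -> Dict[str, str]:
    """Greedy first-fit-decreasing placement that first tries each VM's current host."""
    free = {h.name: [h.cpu, h.mem] for h in hosts}
    placement: Dict[str, str] = {}
    # Largest demands first, as in first-fit-decreasing bin packing.
    for g in sorted(guests, key=lambda v: (v.cpu, v.mem), reverse=True):
        preferred = [g.current_host] if g.current_host in free else []
        for name in preferred + [h.name for h in hosts]:
            cpu, mem = free[name]
            if g.cpu <= cpu and g.mem <= mem:
                free[name] = [cpu - g.cpu, mem - g.mem]
                placement[g.name] = name
                break
        else:
            raise ValueError(f"no host has room for {g.name}")
    return placement


def migrations(guests: List[Guest], placement: Dict[str, str]) -> List[str]:
    """VMs whose new host differs from their current one (each needs a live migration)."""
    return [g.name for g in guests
            if g.current_host is not None and placement[g.name] != g.current_host]


if __name__ == "__main__":
    hosts = [Host("h1", 8, 16), Host("h2", 8, 16)]
    guests = [Guest("x", 6, 4, "h1"), Guest("y", 4, 4, "h1")]
    plan = place(guests, hosts)
    print(plan, migrations(guests, plan))
```

The stay-put preference is what keeps the new distribution "close" to the old one; unlike a constraint solver, this greedy pass gives no optimality guarantee.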

    Acceleration of the hardware-software interface of a communication device for parallel systems

    During the last decades, the ever-growing need for computational power has fostered the development of parallel computer architectures. Applications need to be parallelized and optimized to exploit modern system architectures. Today, the scalability of applications is increasingly limited both by development resources, as programming complex parallel applications becomes more demanding, and by the fundamental scalability issues introduced by the cost of communication in distributed memory systems. Lowering communication latency is essential to increase scalability, and serves as an enabling technology for programming distributed memory systems at a higher abstraction level with higher degrees of compiler-driven automation. At the same time, it can increase the performance of such systems in general. This work presents the software/hardware interface and the network interface controller functions of the EXTOLL network architecture, which is specifically designed to satisfy the low-latency networking needs of high-performance computing. Several new architectural contributions are made in this thesis, namely a new efficient method for virtual-to-physical address translation named ATU, and a novel method for issuing operations to a virtual device in an optimal way, termed Transactional I/O. This new method requires changes to the architecture of the host CPU to which the device is connected. Two additional methods that emulate most of the characteristics of Transactional I/O are developed and employed in the EXTOLL hardware to enable its use with contemporary CPUs. These new methods heavily leverage properties of the HyperTransport interface used to connect the device to the CPU. Finally, this thesis also introduces an optimized remote-memory-access architecture for efficient split-phase transactions and atomic operations.
    The complete architecture has been prototyped using FPGA technology, enabling more precise analysis and verification than is possible with simulation alone. The resulting design uses 95% of a 90 nm FPGA device and reaches speeds of 200 MHz and 156 MHz in the different clock domains of the design. The EXTOLL software stack is developed, and a performance evaluation of the software on the EXTOLL hardware is performed. The evaluation shows an excellent start-up latency of 1.3 μs, which competes with the most advanced networks available, despite the performance handicap inherent in FPGA technology. To the best of the author's knowledge, the resulting network is the fastest FPGA-based interconnection network for commodity processors ever built.
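    The abstract names ATU but does not describe how it works. For orientation only, the sketch below shows generic virtual-to-physical translation: split the address into page number and offset, look the page up in a page table, and keep recently used entries in a small LRU cache, much as a TLB or IOMMU does. `AddressTranslator`, `PAGE_SIZE` and the cache size are assumptions, not EXTOLL's design.

```python
from collections import OrderedDict
from typing import Dict

PAGE_SIZE = 4096  # hypothetical page size in bytes


class AddressTranslator:
    """Translate virtual addresses through a page table, caching recent
    entries in a small LRU (TLB-like) cache."""

    def __init__(self, page_table: Dict[int, int], cache_entries: int = 4):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.cache = OrderedDict()    # most recently used entry kept at the end
        self.cache_entries = cache_entries
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr: int) -> int:
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.cache:
            self.cache.move_to_end(vpn)  # refresh LRU position
            self.hits += 1
        else:
            self.misses += 1
            if vpn not in self.page_table:
                raise KeyError(f"no mapping for virtual page {vpn:#x}")
            self.cache[vpn] = self.page_table[vpn]
            if len(self.cache) > self.cache_entries:
                self.cache.popitem(last=False)  # evict least recently used
        return self.cache[vpn] * PAGE_SIZE + offset


if __name__ == "__main__":
    atu = AddressTranslator({0: 5, 1: 9})
    for va in (0x10, PAGE_SIZE + 1, 0x20):
        print(hex(va), "->", hex(atu.translate(va)))
    print("hits:", atu.hits, "misses:", atu.misses)
```

A cache hit avoids the page-table lookup entirely, which is the main lever for keeping translation off the critical latency path.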