85 research outputs found

    A shared-disk parallel cluster file system

    Dissertation presented to obtain the degree of Doctor in Informatics from the Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia.
    Today, clusters are the de facto cost-effective platform both for high performance computing (HPC) and for IT environments. HPC and IT are quite different environments, and the differences include, among others, their choices of file systems and storage: HPC favours parallel file systems geared towards maximum I/O bandwidth, but which are not fully POSIX-compliant and were devised to run on top of (fault-prone) partitioned storage; conversely, IT data centres favour both external disk arrays (to provide highly available storage) and POSIX-compliant file systems (either general-purpose or shared-disk cluster file systems, CFSs). These specialised file systems perform very well in their target environments provided that applications do not require certain lateral features that they lack, e.g., file locking on parallel file systems, or high-performance writes over cluster-wide shared files on CFSs. In brief, none of the above approaches solves the problem of providing high levels of reliability and performance to both worlds. Our pCFS proposal makes a contribution to change this situation: the rationale is to take advantage of the best of both – the reliability of cluster file systems and the high performance of parallel file systems. We don’t claim to provide the absolute best of each, but we aim at full POSIX compliance, a rich feature set, and levels of reliability and performance good enough for broad usage – e.g., traditional as well as HPC applications, support of clustered DBMS engines that may run over regular files, and video streaming. pCFS’ main ideas include:
    · Cooperative caching, a technique that has been used in file systems for distributed disks but, as far as we know, was never used either in SAN-based cluster file systems or in parallel file systems. As a result, pCFS may use all infrastructures (LAN and SAN) to move data.
    · Fine-grain locking, whereby processes running across distinct nodes may define non-overlapping byte-range regions in a file (instead of the whole file) and access them in parallel, reading and writing over those regions at the infrastructure’s full speed (provided that no major metadata changes are required).
    A prototype was built on top of GFS (a Red Hat shared-disk CFS): GFS’ kernel code was slightly modified, and two kernel modules and a user-level daemon were added. In the prototype, fine-grain locking is fully implemented and a cluster-wide coherent cache is maintained through data (page fragments) movement over the LAN. Our benchmarks for non-overlapping writers over a single file shared among processes running on different nodes show that pCFS’ bandwidth is 2 times greater than NFS’ while being comparable to that of the Parallel Virtual File System (PVFS), both requiring about 10 times more CPU. pCFS’ bandwidth also surpasses GFS’ (600 times for small record sizes, e.g., 4 KB, decreasing down to 2 times for large record sizes, e.g., 4 MB), at about the same CPU usage.
    Lusitania, Companhia de Seguros S.A; Programa IBM Shared University Research (SUR)
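    The fine-grain locking idea in the abstract above can be illustrated with ordinary POSIX byte-range locks. The following is a minimal sketch, not pCFS code: it assumes a Unix system and a local file, and the names `RECORD_SIZE`, `write_region` and `read_region` are hypothetical helpers introduced only for illustration. Each writer locks just the byte range of its own record, so writers of different records do not block each other.

```python
import fcntl
import os

RECORD_SIZE = 4096  # assumed record size, chosen to match the 4 KB example


def write_region(path, index, payload):
    """Exclusively lock only record `index` and write `payload` into it."""
    if len(payload) > RECORD_SIZE:
        raise ValueError("payload does not fit in one record")
    offset = index * RECORD_SIZE
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Lock RECORD_SIZE bytes starting at `offset`, not the whole file.
        fcntl.lockf(fd, fcntl.LOCK_EX, RECORD_SIZE, offset, os.SEEK_SET)
        try:
            os.pwrite(fd, payload, offset)
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN, RECORD_SIZE, offset, os.SEEK_SET)
    finally:
        os.close(fd)


def read_region(path, index, length):
    """Take a shared lock on record `index` and read `length` bytes from it."""
    offset = index * RECORD_SIZE
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.lockf(fd, fcntl.LOCK_SH, RECORD_SIZE, offset, os.SEEK_SET)
        try:
            return os.pread(fd, length, offset)
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN, RECORD_SIZE, offset, os.SEEK_SET)
    finally:
        os.close(fd)
```

    On a single machine these locks are advisory and coordinate only cooperating processes; providing the same semantics coherently across cluster nodes is the part that a system such as pCFS has to implement itself.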

    MaldOS: a Moderately Abstracted Layer for Developing Operating Systems

    Although few students will take on the challenge of developing software below the operating system level, understanding how an operating system works is essential. In itself, the theory behind operating systems is not particularly complex: concepts such as scheduling, execution levels and semaphores are intuitively understandable. Fully mastering these notions through theoretical study alone, however, is almost impossible: a practical example is needed to assimilate the details. Developing an operating system as an academic project is several orders of magnitude harder than writing software in an existing working environment. The added complexity of the hardware often goes beyond what is expected of students, which makes even the search for an architecture to work on difficult. This study is strongly inspired by earlier solutions to this problem such as uMPS, an emulator for the MIPS processor. By working on a simplified virtualization, students can concentrate on the key concepts of OS development. Although inspired by a real architecture, uMPS remains an abstract environment, and a sense of detachment from reality may arise in the course of the work. This study argues that a similar project can be developed on real hardware without becoming too complicated. The chosen architecture is ARMv8, which is more modern and more widespread than MIPS, in the form of the Raspberry Pi educational board. The result of the work is twofold: on the one hand, a detailed study was carried out on how to develop a minimal operating system on the Raspberry Pi; on the other, an abstraction layer was created that simplifies access to peripherals, allowing users to build a small operating system on top of it. Although the work targets a real device, it remains possible to work on an emulator thanks to Qemu support.

    Energy-Aware Data Management on NUMA Architectures

    The ever-increasing need for more computing and data processing power demands continuous and rapid growth of power-hungry data center capacities all over the world. As a first study in 2008 revealed, the energy consumption of such data centers is becoming a critical problem, since their power consumption is about to double every 5 years. However, a follow-up study released in 2016 points out that this threatening trend was dramatically throttled within the past years, due to the increased energy efficiency actions taken by data center operators. Furthermore, the authors of the study emphasize that making and keeping data centers energy-efficient is a continuous task, because more and more computing power is demanded from the same or an even lower energy budget, and that this threatening energy consumption trend will resume as soon as energy efficiency research efforts and their market adoption are reduced. An important class of applications running in data centers are data management systems, which are a fundamental component of nearly every application stack. While those systems were traditionally designed as disk-based databases that are optimized for keeping disk accesses as low as possible, modern state-of-the-art database systems are main memory-centric and store the entire data pool in main memory, which replaces the disk as the main bottleneck. To scale up such in-memory database systems, non-uniform memory access (NUMA) hardware architectures are employed, which face decreased bandwidth and increased latency when accessing remote memory compared to local memory. In this thesis, we investigate energy awareness aspects of large scale-up NUMA systems in the context of in-memory data management systems.
To do so, we pick up the idea of a fine-grained data-oriented architecture and improve the concept so that it keeps pace with the increased absolute performance numbers of a pure in-memory DBMS and scales up on large NUMA systems. To achieve this goal, we design and build ERIS, the first scale-up in-memory data management system that is designed from scratch to implement a data-oriented architecture. With the help of the ERIS platform, we explore our novel core concept for energy awareness, which is Energy Awareness by Adaptivity. The concept states that software, and especially database systems, have to respond quickly to environmental changes (i.e., workload changes) by adapting themselves to enter a state of low energy consumption. We present the hierarchically organized Energy-Control Loop (ECL), which is a reactive control loop and provides two concrete implementations of our Energy Awareness by Adaptivity concept, namely the hardware-centric Resource Adaptivity and the software-centric Storage Adaptivity. Finally, we give an exhaustive evaluation of the scalability of ERIS as well as of our adaptivity facilities.

    Design and implementation of a distributed data store in Java (Hajautetun tietovaraston suunnittelu ja toteutus Java-kielellä)

    A service creation platform is a development platform used to create customer-specific service applications for mobile network operators. Service applications must support high availability and high performance with a sufficient level of scalability to support future traffic growth. The service creation platform is located in the operator network, and it provides a business logic creation and connectivity framework to enable flexible service creation and execution. Service applications typically connect to various operator business support systems, core messaging components and content provider applications. Service applications almost always need to read and write persistent or transient data related to service execution. Previously, a highly available database was used to provide such storage services for the cluster of service nodes. However, highly available databases are typically either expensive or complex, and they often require additional hardware support to provide the high availability. The target of this master's thesis is to design and implement a distributed data storage component that is optimised for read access. The implementation ensures data persistence and high availability using local file system disks and transaction distribution between the cluster nodes: data is distributed to every node of the system and stored locally on each node. The component is fully integrated into the service creation platform, providing clustered data storage services for the platform itself and for the applications built on top of the platform.

    Management of object-oriented action-based distributed programs

    PhD Thesis. This thesis addresses the problem of managing the runtime behaviour of distributed programs. The thesis of this work is that management is fundamentally an information processing activity and that the object model, as applied to action-based distributed systems and database systems, is an appropriate representation of the management information. In this approach, the basic concepts of classes, objects, relationships, and atomic transition systems are used to form object models of distributed programs. Distributed programs are collections of objects whose methods are structured using atomic actions, i.e., atomic transactions. Object models are formed of two submodels, each representing a fundamental aspect of a distributed program. The structural submodel represents a static perspective of the distributed program, and the control submodel represents a dynamic perspective of it. Structural models represent the program's objects, classes and their relationships. Control models represent the program's object states, events, guards and actions, which together form a transition system. Resolution of queries on the distributed program's object model enables the management system to control certain activities of distributed programs. At a different level of abstraction, the distributed program can be seen as a reactive system where two subprograms interact: an application program and a management program; they interact only through sensors and actuators. Sensors are methods used to probe an object's state and actuators are methods used to change an object's state. The management program is able to prod the application program into action by activating sensors and actuators available at the interface of the application program. Actions are determined by management policies that are encoded in the management program.
This way of structuring the management system encourages a clear modularization of application and management distributed programs, allowing better separation of concerns. Management concerns can be dealt with by the management program, while functional concerns can be assigned to the application program. The object-oriented action-based computational model adopted by the management system provides a natural framework for the implementation of fault-tolerant distributed programs. Object orientation provides modularity and extensibility through object encapsulation. Atomic actions guarantee the consistency of the objects of the distributed program despite concurrency and failures. Replication of the distributed program provides increased fault tolerance by guaranteeing the consistent progress of the computation, even though some of the replicated objects can fail. A prototype management system based on the management theory proposed above has been implemented atop Arjuna, an object-oriented programming system which provides a set of tools for constructing fault-tolerant distributed programs. The management system is composed of two subsystems: Stabilis, a management system for structural information, and Vigil, a management system for control information. Example applications have been implemented to illustrate the use of the management system and to gather experimental evidence in support of the thesis.
    CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil); BROADCAST (Basic Research On Advanced Distributed Computing: from Algorithms to SysTems)

    Towards Scalable OLTP Over Fast Networks

    Online Transaction Processing (OLTP) underpins real-time data processing in many mission-critical applications, from banking to e-commerce. These applications typically issue short-duration, latency-sensitive transactions that demand immediate processing. High-volume applications, such as Alibaba's e-commerce platform, achieve peak transaction rates as high as 70 million transactions per second, exceeding the capacity of a single machine. Instead, distributed OLTP database management systems (DBMS) are deployed across multiple powerful machines. Historically, such distributed OLTP DBMSs have been primarily designed to avoid network communication, a paradigm largely unchanged since the 1980s. However, fast networks challenge the conventional belief that network communication is the main bottleneck. In particular, emerging network technologies, like Remote Direct Memory Access (RDMA), radically alter how data can be accessed over a network. RDMA's primitives allow direct access to the memory of a remote machine within an order of magnitude of local memory access. This development invalidates the notion that network communication is the primary bottleneck. Given that traditional distributed database systems have been designed with the premise that the network is slow, they cannot efficiently exploit these fast network primitives, which requires us to reconsider how we design distributed OLTP systems. This thesis focuses on the challenges RDMA presents and its implications on the design of distributed OLTP systems. First, we examine distributed architectures to understand data access patterns and scalability in modern OLTP systems. Drawing on these insights, we advocate a distributed storage engine optimized for high-speed networks. The storage engine serves as the foundation of a database, ensuring efficient data access through three central components: indexes, synchronization primitives, and buffer management (caching). 
With the introduction of RDMA, the landscape of data access has undergone a significant transformation. This requires a comprehensive redesign of the storage engine components to exploit the potential of RDMA and similar high-speed network technologies. Thus, as the second contribution, we design RDMA-optimized tree-based indexes, which are especially applicable for disaggregated databases that need to access remote data efficiently. We then turn our attention to the unique challenges of RDMA. One-sided RDMA, one of the network primitives introduced by RDMA, offers a performance advantage by enabling remote memory access while bypassing the remote CPU and the operating system. This allows the remote CPU to process transactions uninterrupted, with no requirement to be on hand for network communication. However, because traditional CPU-driven synchronization primitives are bypassed in this mode, specialized one-sided RDMA synchronization primitives are required. We found that existing one-sided RDMA synchronization schemes are unscalable or, even worse, fail to synchronize correctly, leading to hard-to-detect data corruption. As our third contribution, we address this issue by offering guidelines for building scalable and correct one-sided RDMA synchronization primitives. Finally, recognizing that maintaining all data in memory becomes economically unattractive, we propose a distributed buffer manager design that efficiently utilizes cost-effective NVMe flash storage. By leveraging low-latency RDMA messages, our buffer manager provides a transparent memory abstraction, accessing the aggregated DRAM and NVMe storage across nodes. Central to our approach is a distributed caching protocol that dynamically caches data. With this approach, our system can outperform RDMA-enabled in-memory distributed databases while managing larger-than-memory datasets efficiently.

    Benchmarking Hadoop performance on different distributed storage systems

    Distributed storage systems have been in place for years, and have undergone significant changes in architecture to ensure reliable storage of data in a cost-effective manner. With the demand for data increasing, there has been a shift from disk-centric to memory-centric computing: the focus is on keeping data in memory rather than on disk. The primary motivation for this is the increased speed of data processing. This could, however, mean a change in the approach to providing the necessary fault tolerance: instead of data replication, other techniques may be considered. One example of an in-memory distributed storage system is Tachyon. Instead of replicating data files in memory, Tachyon provides fault tolerance by maintaining a record of the operations needed to generate the data files. These operations are replayed if the files are lost. This approach is termed lineage. Tachyon is already deployed by many well-known companies. This thesis work compares the storage performance of Tachyon with that of the on-disk storage systems HDFS and Ceph. After studying the architectures of well-known distributed storage systems, the major contribution of the work is to integrate Tachyon with Ceph as an underlayer storage system, and to understand how this affects its performance and how to tune Tachyon to extract maximum performance from it.
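    The lineage approach described in the abstract above can be illustrated with a small toy model. This is a sketch under stated assumptions, not Tachyon's API: the class `LineageStore` and all of its methods are hypothetical names, source data is assumed to be durable, and the recorded operations are assumed to be deterministic functions. Derived data is kept only in a volatile cache, and a lost entry is rebuilt by replaying the operation that produced it.

```python
class LineageStore:
    """Toy in-memory store that recovers lost data by replaying its lineage."""

    def __init__(self):
        self._sources = {}   # name -> durable input data (assumed never lost)
        self._cache = {}     # name -> derived data held in volatile memory
        self._lineage = {}   # name -> (function, names of its inputs)
        self.recomputations = 0

    def put_source(self, name, value):
        self._sources[name] = value

    def derive(self, name, fn, *inputs):
        """Compute a derived dataset and record how it was produced."""
        self._lineage[name] = (fn, inputs)
        self._cache[name] = fn(*[self.get(i) for i in inputs])

    def lose(self, name):
        """Simulate losing the in-memory copy, e.g. after a node failure."""
        self._cache.pop(name, None)

    def get(self, name):
        if name in self._sources:
            return self._sources[name]
        if name not in self._cache:
            # Replay the recorded operation; lost inputs are rebuilt recursively.
            fn, inputs = self._lineage[name]
            self._cache[name] = fn(*[self.get(i) for i in inputs])
            self.recomputations += 1
        return self._cache[name]


store = LineageStore()
store.put_source("raw", [3, 1, 2])
store.derive("sorted", sorted, "raw")
store.derive("largest", lambda xs: xs[-1], "sorted")
store.lose("sorted")
store.lose("largest")
recovered = store.get("largest")  # rebuilds "sorted" first, then "largest"
```

    The trade-off this makes visible is that no second copy of the derived data is stored, at the price of recomputation time after a loss.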