
    EOS workshop

    In this talk we give a brief overview of the successful migration to the new namespace. Practically all EOS instances at CERN are now on QuarkDB, the new namespace is officially *boring technology*, and *MGM boot time* is a distant memory. We will also discuss future plans and ideas to further improve the scalability and performance of the namespace, in particular with respect to locking, the planned end of support for the in-memory legacy namespace, and other miscellaneous namespace-related news.

    EOS workshop

    The new EOS namespace implementation based on QuarkDB entered production during 2018 with full success. In this presentation, we report on the current status and our experience with running new-namespace instances in production, as well as some preliminary plans for deprecating the old namespace.

    Inverted CERN School of Computing 2017

    In a world where clusters with thousands of nodes are becoming commonplace, we are often faced with the task of having them coordinate and share state. As the number of machines goes up, so does the probability that something goes wrong: a node could temporarily lose connectivity, crash because of some race condition, or have its hard drive fail. What are the challenges when designing fault-tolerant distributed systems, where a cluster is able to survive the loss of individual nodes? In this lecture, we will discuss some basics of this topic (consistency models, the CAP theorem, failure modes, Byzantine faults), detail the Raft consensus algorithm, and showcase an interesting example of a highly resilient distributed system: Bitcoin.
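
    The lecture's treatment of Raft can be made concrete with a small sketch. The following Python snippet is a deliberately simplified illustration of the majority-quorum rule a candidate applies during leader election; it is not a full Raft implementation (no log comparison, randomized timeouts, or persistence), and the class and function names are hypothetical.

```python
# Simplified sketch of Raft-style leader election vote counting.
# Real Raft also compares log completeness, uses randomized election
# timeouts, and persists the current term and vote to stable storage.

class RaftCandidate:
    def __init__(self, node_id, peers, current_term):
        self.node_id = node_id
        self.peers = peers                # the other nodes in the cluster
        self.current_term = current_term  # monotonically increasing term number

    def run_election(self, request_vote):
        """request_vote(peer, term, candidate_id) -> True if the vote is granted."""
        self.current_term += 1            # every election starts a new term
        votes = 1                         # a candidate always votes for itself
        cluster_size = len(self.peers) + 1
        for peer in self.peers:
            if request_vote(peer, self.current_term, self.node_id):
                votes += 1
        # A strict majority is required: this is what lets the cluster keep
        # operating while a minority of nodes is unreachable or crashed.
        return votes > cluster_size // 2


# Example: in a 5-node cluster, a candidate can still win with 2 nodes down.
candidate = RaftCandidate("node1", ["node2", "node3", "node4", "node5"], current_term=7)
reachable = {"node2", "node3"}
print(candidate.run_election(lambda peer, term, cand: peer in reachable))  # True
```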

    A milestone for DPM (Disk Pool Manager)

    The DPM (Disk Pool Manager) system is a multiprotocol, scalable technology for Grid storage that supports about 130 sites for a total of about 90 petabytes online. The system has recently completed the development phase announced in past years, which consolidates its core component (DOME: Disk Operations Management Engine) as a full-featured, high-performance engine that can also be operated with standard Web clients and uses a fully documented REST-based protocol. Together with a general improvement in performance and a comprehensive administration command-line interface, this milestone also brings back features such as automatic disk server status detection and volatile pools for deploying experimental disk caches. In this contribution we also discuss the end of support for the historical DPM components (which also carry a dependency on the Globus toolkit), whose deployment is now linked only to the use of the SRM protocols and which can therefore be uninstalled when a site no longer needs them.
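
    Since the abstract highlights that DOME is driven through a documented REST-based protocol, here is a purely illustrative Python sketch of how an administrative client could query such an endpoint over HTTPS. The host name, URL path, and command name below are hypothetical placeholders, not the actual DOME API; the real command set and headers are defined in the DOME protocol documentation.

```python
# Illustrative only: the endpoint path and command name are hypothetical
# placeholders standing in for a call from the documented DOME REST protocol.
import requests

DOME_HEAD = "https://dpmhead.example.org:1094"      # hypothetical head-node URL
GRID_CERT = ("/etc/grid-security/hostcert.pem",     # client certificate
             "/etc/grid-security/hostkey.pem")      # and private key

resp = requests.get(
    f"{DOME_HEAD}/domehead/command/dome_getspaceinfo",   # hypothetical command path
    cert=GRID_CERT,
    verify="/etc/grid-security/certificates",            # CA bundle for Grid hosts
    timeout=30,
)
resp.raise_for_status()
print(resp.json())   # e.g. per-pool and per-filesystem space accounting
```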

    Scaling the EOS namespace

    EOS is the distributed storage system being developed at CERN with the aim of fulfilling a wide range of data storage needs, ranging from physics data to user home directories. In production since 2011, EOS currently manages around 224 petabytes of disk space and 1.4 billion files across several instances. Even though individual EOS instances routinely manage hundreds of disk servers, users access the contents through a single, unified namespace, which is exposed by the head node (MGM) and contains the metadata of all files stored on that instance. The legacy implementation keeps the entire namespace in memory. Modifications are appended to a persistent, on-disk changelog; this way, the in-memory contents can be reconstructed after every reboot by replaying the changelog. While this solution has proven reliable and effective, we are quickly approaching the limits of its scalability. In this paper, we present our new implementation, which is currently in testing. We have designed and implemented QuarkDB, a highly available, strongly consistent distributed database which exposes a subset of the Redis command set and serves as the namespace storage backend. Using this design, the MGM now acts as a stateless write-through cache, with all metadata persisted in QuarkDB. Scalability is achieved by having multiple MGMs, each assigned to a subtree of the namespace, with clients being automatically redirected to the appropriate one.
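
    Because QuarkDB exposes a subset of the Redis command set, a standard Redis client can talk to it directly. The sketch below illustrates the idea of persisting file metadata as hashes through such a client; the connection parameters and the key/field layout are hypothetical and do not reflect the internal encoding that EOS actually uses for its namespace.

```python
# Illustrative sketch: storing and reading namespace-style metadata over the
# Redis protocol, as the MGM's write-through cache would persist it in QuarkDB.
# Host, port, key names, and fields are hypothetical placeholders.
import redis

# QuarkDB speaks (a subset of) the Redis wire protocol, so the ordinary
# redis-py client can connect to the cluster leader.
qdb = redis.Redis(host="quarkdb-leader.example.org", port=7777)

# Persist some file metadata as a hash (a cache write-through).
qdb.hset("file-md:0001", mapping={
    "name": "higgs-candidates.root",
    "size": 2147483648,
    "ctime": 1546300800,
})

# Read it back, as the MGM would on a cache miss.
print(qdb.hgetall("file-md:0001"))
```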

    Code health in EOS: Improving test infrastructure and overall service quality

    During the last few years, the EOS distributed storage system at CERN has seen a steady increase in use, both in terms of traffic volume and in the sheer amount of stored data. This has brought the unwelcome side effect of stretching the EOS software stack to its design limits, resulting in frequent user-facing issues and occasional downtime of critical services. In this paper, we discuss the challenges of adapting the software to meet the increasing demands while preserving functionality, without breaking existing features or introducing new bugs. We document our efforts in modernizing and stabilizing the codebase through the refactoring of legacy code, the introduction of widespread unit testing, and the use of Kubernetes to build a comprehensive test orchestration framework capable of stressing every aspect of an EOS installation, with the goal of discovering bottlenecks and instabilities before they reach production.
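
    As a sketch of what Kubernetes-driven test orchestration can look like, the snippet below submits a containerized stress run as a Kubernetes Job through the official Python client. The container image, command, and namespace are hypothetical placeholders and do not correspond to the actual EOS test framework.

```python
# Illustrative sketch: launching a containerized stress test as a Kubernetes Job.
# Image, command, and namespace are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()       # or config.load_incluster_config() when run in-cluster
batch_v1 = client.BatchV1Api()

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="eos-namespace-stress"),
    spec=client.V1JobSpec(
        backoff_limit=0,        # a failed stress run should surface, not silently retry
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="stress",
                    image="registry.example.cern.ch/eos/test-client:latest",    # hypothetical
                    command=["run-namespace-stress", "--files", "1000000"],     # hypothetical
                )],
            ),
        ),
    ),
)

batch_v1.create_namespaced_job(namespace="eos-ci", body=job)
```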

    Scaling the EOS namespace – new developments, and performance optimizations

    EOS is the distributed storage solution being developed and deployed at CERN with the primary goal of fulfilling the data needs of the LHC and its various experiments. In production since 2011, EOS currently manages around 256 petabytes of raw disk space and 3.4 billion files across several instances. Nowadays, EOS is increasingly being used as a distributed filesystem and file-sharing platform, which poses scalability challenges for its legacy namespace subsystem, tasked with keeping track of all file and directory metadata on a particular instance. In this paper we discuss these challenges and present our solution, which has recently entered production. We made several architectural improvements to the overall system design, the most important of which was introducing QuarkDB, a highly available datastore capable of serving as the metadata backend for EOS, tailored to the needs of the namespace. We also describe our efforts in providing latency and performance comparable to the legacy in-memory implementation: on the read path through extensive caching and prefetching, and on the write path through latency-hiding techniques involving a persistent, back-pressured local queue that batches writes towards the QuarkDB backend.
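
    The write path described above, a persistent, back-pressured local queue that batches writes, can be illustrated with a small sketch. The version below is in-memory only and flushes to a stand-in callable rather than to QuarkDB; it is meant to show the batching and back-pressure mechanics, not the production implementation.

```python
# Minimal sketch of a back-pressured, batching write-behind queue.
# The production EOS queue is persisted locally and targets QuarkDB;
# here the queue is in-memory and the backend is a stand-in callable.
import queue
import threading
import time

class WriteBehindQueue:
    def __init__(self, flush_batch, max_pending=10000, batch_size=500):
        # The bounded queue provides the back-pressure: producers block once
        # max_pending updates are waiting, instead of growing the backlog forever.
        self._pending = queue.Queue(maxsize=max_pending)
        self._flush_batch = flush_batch     # e.g. one pipelined write to the backend
        self._batch_size = batch_size
        threading.Thread(target=self._flusher, daemon=True).start()

    def put(self, update):
        """Called on the write path; returns as soon as the update is queued."""
        self._pending.put(update)           # blocks only when the queue is full

    def _flusher(self):
        while True:
            batch = [self._pending.get()]   # wait for at least one update
            while len(batch) < self._batch_size:
                try:
                    batch.append(self._pending.get_nowait())
                except queue.Empty:
                    break
            self._flush_batch(batch)        # one round-trip amortized over the batch


# Example backend: report batch sizes instead of writing to QuarkDB.
wbq = WriteBehindQueue(lambda batch: print(f"flushed {len(batch)} updates"))
for i in range(1200):
    wbq.put({"op": "set", "key": f"file:{i}"})
time.sleep(1)   # give the daemon flusher a moment to drain before exiting
```

    Batching amortizes the network round-trip to the backend, while the bounded queue caps the backlog by stalling writers rather than letting it grow without bound.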
