1,939 research outputs found

    Lustre, Hadoop, Accumulo

    Full text link
    Data processing systems impose multiple views on data as it is processed by the system. These views include spreadsheets, databases, matrices, and graphs. There are a wide variety of technologies that can be used to store and process data through these different steps. The Lustre parallel file system, the Hadoop distributed file system, and the Accumulo database are all designed to address the largest and most challenging data storage problems. There have been many ad-hoc comparisons of these technologies. This paper describes the foundational principles of each technology, provides simple models for assessing their capabilities, and compares the various technologies on a hypothetical common cluster. These comparisons indicate that Lustre provides 2x more storage capacity, is less likely to lose data during 3 simultaneous drive failures, and provides higher bandwidth on general purpose workloads. Hadoop can provide 4x greater read bandwidth on special purpose workloads. Accumulo provides 10,000x lower latency on random lookups than either Lustre or Hadoop, but Accumulo's bulk bandwidth is 10x less. Significant recent work has been done to enable mix-and-match solutions that allow Lustre, Hadoop, and Accumulo to be combined in different ways. Comment: 6 pages; accepted to IEEE High Performance Extreme Computing conference, Waltham, MA, 201
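    As a rough illustration of the kind of simple model the abstract refers to, the Python sketch below compares usable capacity and triple-failure loss probability for a hypothetical 100-drive cluster. The RAID6 (8 data + 2 parity) layout for Lustre, the 3x replication factor for Hadoop, the drive count, and the block count are all assumptions chosen for illustration, not the paper's exact parameters.

        # Back-of-the-envelope model (assumed parameters, not the paper's exact model):
        # Lustre on RAID6 (8 data + 2 parity) vs. HDFS with 3x replication.
        from math import comb

        N = 100            # total drives in the hypothetical cluster (assumed)
        GROUP = 10         # RAID6 group size: 8 data + 2 parity (assumed)

        # Usable fraction of raw capacity.
        lustre_capacity = 8 / GROUP    # 0.80
        hdfs_capacity = 1 / 3          # ~0.33
        print(f"capacity ratio: {lustre_capacity / hdfs_capacity:.1f}x")  # ~2.4x

        # Probability that 3 simultaneous drive failures lose data.
        # RAID6 tolerates 2 failures per group, so data is lost only if all
        # 3 failed drives land in the same 10-drive group.
        groups = N // GROUP
        p_lustre = groups * comb(GROUP, 3) / comb(N, 3)
        print(f"Lustre loss probability: {p_lustre:.4f}")   # ~0.0074

        # HDFS loses a block if the 3 failed drives hold all 3 of its replicas.
        # With B randomly placed blocks, a given drive triple destroys a given
        # block with probability 1/C(N,3); the any-loss probability follows.
        B = 10_000_000     # number of blocks (assumed)
        p_block_triple = 1 / comb(N, 3)
        p_hdfs = 1 - (1 - p_block_triple) ** B
        print(f"HDFS loss probability: {p_hdfs:.4f}")       # ~1.0 for large B

    Under these assumptions the model reproduces the abstract's direction of comparison: RAID6 parity yields roughly twice the usable capacity of 3x replication and a far lower chance of loss under three concurrent failures.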

    Database integrated analytics using R: initial experiences with SQL-Server + R

    Get PDF
    Nowadays most data scientists use functional or semi-functional languages like SQL, Scala, or R to process data obtained directly from databases. Such a workflow requires fetching the data, processing it, and storing it again, and it tends to be done outside the DB, in often complex data-flows. Recently, database service providers have decided to integrate “R-as-a-Service” into their DB solutions: the analytics engine is called directly from the SQL query tree, and results are returned as part of the same query. Here we show a first taste of such technology by testing the portability of our ALOJA-ML analytics framework, coded in R, to Microsoft SQL-Server 2016, one of the recently released SQL+R solutions. In this work we discuss some data-flow schemes for porting a local DB + analytics engine architecture towards Big Data, focusing especially on the new DB-integrated analytics approach, and commenting on the first experiences in usability and performance obtained from these new services and capabilities.
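    For readers unfamiliar with the mechanism, the sketch below shows the in-DB pattern the paper evaluates: SQL Server 2016's sp_execute_external_script runs an R script inside the query and returns its output as an ordinary result set. The DSN, table, and column names here are hypothetical, and ALOJA-ML's actual models are more involved than this toy linear fit.

        # Minimal sketch of in-database R via SQL Server 2016 (hypothetical
        # DSN/table/columns). The R code executes server-side; no data is
        # shipped to an external R process.
        import pyodbc

        conn = pyodbc.connect("DSN=sqlserver2016;UID=user;PWD=secret")  # assumed DSN
        cur = conn.cursor()
        cur.execute("""
            EXEC sp_execute_external_script
                @language = N'R',
                @script = N'
                    model <- lm(exe_time ~ maps + iofilebuf, InputDataSet)
                    OutputDataSet <- data.frame(pred = predict(model, InputDataSet))',
                @input_data_1 = N'SELECT exe_time, maps, iofilebuf FROM aloja_runs'
        """)
        for row in cur.fetchall():   # predictions returned by the same query
            print(row[0])

    Because the computation happens where the data lives, the fetch-process-store round trip the authors describe collapses into a single query.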

    Survey and Analysis of Production Distributed Computing Infrastructures

    Full text link
    This report has two objectives. First, we describe a set of the production distributed infrastructures currently available, so that the reader has a basic understanding of them. This includes explaining why each infrastructure was created and made available and how it has succeeded and failed. The set is not complete, but we believe it is representative. Second, we describe the infrastructures in terms of their use, which is a combination of how they were designed to be used and how users have found ways to use them. Applications are often designed and created with specific infrastructures in mind, with both an appreciation of the existing capabilities provided by those infrastructures and an anticipation of their future capabilities. Here, the infrastructures we discuss were often designed and created with specific applications in mind, or at least specific types of applications. The reader should understand how the interplay between the infrastructure providers and the users leads to such usages, which we call usage modalities. These usage modalities are really abstractions that exist between the infrastructures and the applications; they influence the infrastructures by representing the applications, and they influence the applications by representing the infrastructures.

    Efficient I/O Management Schemes for Flash-Based High-Performance Computing Storage Systems

    Get PDF
    Ph.D. dissertation, Department of Electrical and Computer Engineering, College of Engineering, Seoul National University, August 2020. Advisor: Hyeonsang Eom.
    Most I/O traffic in high performance computing (HPC) storage systems is dominated by checkpoints and restarts of HPC applications. For such bursty I/O, new all-flash HPC storage systems that integrate a burst buffer (BB) and a parallel file system (PFS) have been proposed. However, most of the distributed file systems (DFSs) used to configure these storage systems provide a single connection between a compute node and a server node, which prevents users from exploiting the high I/O bandwidth an all-flash server node can deliver. To provide multiple connections, a DFS must be modified to increase the number of sockets, which is an extremely difficult and time-consuming task owing to its complicated structure. Users can instead increase the number of daemons in the DFS to force more connections without modifying it, but because each daemon has a mount point for its connection, compute nodes end up with multiple mount points, and significant effort is required from users to distribute file I/O requests across them. In addition, to avoid access to a PFS composed of low-speed storage devices such as hard disks, dedicated BB allocation has been preferred despite its severe underutilization; this allocation method may be inappropriate for all-flash HPC storage systems, which speed up access to the PFS. To handle these problems, we propose an efficient user-transparent I/O management scheme for all-flash HPC storage systems. The first scheme, I/O transfer management, provides multiple connections between a compute node and a server node without additional effort from DFS developers or users; to do so, we modified the mount procedure and I/O processing procedures in the virtual file system (VFS). In the second scheme, data management between the BB and PFS, a BB over-subscription allocation method is adopted to improve BB utilization. Unfortunately, this allocation method aggravates I/O interference and the demotion overhead from the BB to the PFS, degrading checkpoint and restart performance. To minimize this degradation, we developed an I/O scheduler and a new data management policy based on checkpoint and restart characteristics. To prove the effectiveness of the proposed schemes, we evaluated both the I/O transfer management scheme and the data management scheme between the BB and PFS. The I/O transfer management scheme improves write and read I/O throughput for checkpoint and restart by up to 6x and 3x, respectively, over a DFS on the original kernel. With the data management scheme, BB utilization improves by at least 2.2x, and a more stable and higher checkpoint performance is guaranteed. In addition, we achieved up to a 96.4% hit ratio for restart requests on the BB and up to 3.1x higher restart performance than existing methods.
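    As a rough user-space illustration of the first scheme's core idea (the dissertation implements it inside the kernel VFS, so every name, address, and the toy protocol below are hypothetical), the sketch stripes one logical write round-robin across several sockets to the same server node, so a single client is not limited to one connection's bandwidth and needs no extra mount points.

        # User-space sketch of multi-connection I/O striping (assumed server
        # address, connection count, stripe size, and wire format).
        import itertools
        import socket

        SERVER = ("10.0.0.1", 9000)   # assumed all-flash server node address
        NUM_CONNS = 4                 # connections per compute node (assumed)
        CHUNK = 1 << 20               # 1 MiB stripe unit (assumed)

        conns = [socket.create_connection(SERVER) for _ in range(NUM_CONNS)]
        rr = itertools.cycle(conns)   # round-robin scheduler over connections

        def write_file(path, data):
            """Stripe one logical write across all connections, round-robin."""
            for off in range(0, len(data), CHUNK):
                sock = next(rr)
                header = f"{path}:{off}\n".encode()   # toy protocol header (assumed)
                sock.sendall(header + data[off:off + CHUNK])

    The point of the user-transparent design is precisely that this distribution happens below the file API: applications keep a single mount point while the scheduler fans requests out over the connection pool.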

    Dynamic management of HPC/HTC workloads: reconciling “on-premise” and “cloud computing” models

    Get PDF
    Cloud computing has grown tremendously in the last decade, evolving from a mere technological concept into a full business model. Entities such as companies or research groups that need computational power are beginning to consider a full migration of their systems to the cloud. However, following the trend of full migration may not be the optimal option; not everything is black and white, and the answer could be found somewhere in between. Although the companies that manage the biggest commercial cloud environments, namely Google, Amazon, and Microsoft, are making great efforts to develop and implement the so-called hybrid cloud, most of these efforts focus on software development platforms, as in the case of Azure Stack from Microsoft Azure, which helps develop hybrid applications that can be executed both locally and in the cloud. Meanwhile, the provisioning of execution environments for HPC/HTC applications seems to be relegated to the background. In part, this could be because demand for these environments is currently low. That low demand is motivated by many factors, among which it is worth highlighting the need for highly specialised hardware, the overhead introduced by virtualization, and, last but not least, the economic cost usually associated with this kind of customized infrastructure in contrast with more standard configurations. With these limitations in mind, and given that in most cases complete migration to the cloud is constrained by the previous existence of a local infrastructure that already provides computing and storage resources, this thesis explores an intermediate path between on-premise (local) and cloud computing. Such a solution allows an HPC/HTC user to benefit from the cloud model in a transparent way, keeping the familiar on-premise environment while being able to execute jobs in both paradigms. To achieve this, the Hybrid-Infrastructure-as-a-Service Manager (HIaaS-M) framework is created. This framework joins both computing paradigms by automating the interaction between them in a way that is efficient and completely transparent to the user. It is especially designed to be integrated into already existing (on-premise) infrastructures without changing any existing software components: it runs as a standalone program that communicates with the existing systems, minimizing the impact that changes to an entity's base software and/or infrastructure could cause. This document describes the whole development process of this modular and configurable framework, which integrates a previously existing infrastructure with one created in the cloud through a cloud infrastructure provider, adding the option of executing jobs in practically any cloud environment thanks to the Apache Libcloud library. The document concludes with a proof of concept on a development cluster (called "cluster2") hosted in the 3MARES Data Processing Center at the Science Faculty of the University of Cantabria.
    This deployment in a near-real-life environment made it possible to identify the framework's main advantages, as well as improvements that could be made, which are laid out in a suggested roadmap for future work.
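    As a rough sketch of the burst decision such a framework automates (the provider, credentials, threshold, instance size, and image id below are all assumptions, not HIaaS-M's actual code), Apache Libcloud gives a provider-neutral API for provisioning an extra cloud worker when the on-premise queue saturates.

        # Sketch of a burst-to-cloud decision via Apache Libcloud (assumed
        # provider, credentials, threshold, size id, and AMI id).
        from libcloud.compute.providers import get_driver
        from libcloud.compute.types import Provider

        QUEUE_THRESHOLD = 50   # pending local jobs before bursting (assumed)

        def burst_if_needed(pending_jobs):
            if pending_jobs <= QUEUE_THRESHOLD:
                return None    # the on-premise cluster can absorb the load
            cls = get_driver(Provider.EC2)   # any Libcloud provider works here
            driver = cls("ACCESS_KEY", "SECRET_KEY", region="eu-west-1")
            size = next(s for s in driver.list_sizes() if s.id == "t3.large")
            image = driver.get_image("ami-0123456789abcdef0")  # hypothetical AMI
            # Boot a worker that joins the on-premise batch queue at startup.
            return driver.create_node(name="hiaasm-worker", size=size, image=image)

    Because Libcloud abstracts the provider behind one driver interface, swapping EC2 for another supported cloud is a configuration change rather than a code change, which is what lets the framework stay transparent to the user.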