Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework.
However, little exists in the literature that provides a thorough understanding
of the capabilities which TensorFlow offers for the distributed training of
large ML/DL models that require computation and communication at scale. The
most commonly used distributed training approaches for TF can be categorized as
follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand
Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu
Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this
paper, we provide an in-depth performance characterization and analysis of
these distributed training approaches on various GPU clusters, including the Piz
Daint system (No. 6 on the Top500 list). We perform experiments to gain novel insights along
the following vectors: 1) Application-level scalability of DNN training, 2)
Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used
for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on
these experiments, we present two key insights: 1) Overall, No-gRPC designs
achieve better performance compared to gRPC-based approaches for most
configurations, and 2) The performance of No-gRPC is heavily influenced by the
gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware
MPI Allreduce design that exploits CUDA kernels and pointer caching to perform
large reductions efficiently. Our proposed designs offer 5-17X better
performance than NCCL2 for small and medium messages, and reduce latency by
29% for large messages. The proposed optimizations help Horovod-MPI to achieve
approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs.
Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native
gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint
cluster.
Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer review
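The Allreduce-based gradient aggregation highlighted above can be sketched in plain Python as a ring-allreduce, the communication pattern that Horovod and NCCL build on (a toy single-process model with simulated ranks, not the paper's CUDA-aware MPI design):

```python
def ring_allreduce(buffers):
    """Toy ring-allreduce: every rank ends up with the elementwise sum.

    buffers: one equal-length list per simulated rank; the length must be
    divisible by the number of ranks. Mutates and returns `buffers`.
    """
    p = len(buffers)                 # number of ranks in the ring
    c = len(buffers[0]) // p         # chunk size circulated per step

    def chunk(r, i):
        return buffers[r][i * c:(i + 1) * c]

    def put(r, i, vals):
        buffers[r][i * c:(i + 1) * c] = vals

    # Phase 1: reduce-scatter. After p-1 steps, rank r holds the full sum
    # of chunk (r+1) % p.
    for s in range(p - 1):
        sends = [((r - s) % p, chunk(r, (r - s) % p)) for r in range(p)]
        for r in range(p):           # receive from left neighbour and add
            i, vals = sends[(r - 1) % p]
            put(r, i, [a + b for a, b in zip(chunk(r, i), vals)])
    # Phase 2: allgather. Circulate the finished chunks around the ring.
    for s in range(p - 1):
        sends = [((r + 1 - s) % p, chunk(r, (r + 1 - s) % p)) for r in range(p)]
        for r in range(p):
            i, vals = sends[(r - 1) % p]
            put(r, i, vals)
    return buffers

# Three simulated ranks, each holding a local "gradient" of length 3.
grads = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
assert ring_allreduce(grads) == [[12, 15, 18]] * 3
```

Each rank sends and receives only one chunk per step, which is why bandwidth use stays balanced as the number of GPUs grows.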
Benchmarking Apache Arrow Flight -- A wire-speed protocol for data transfer, querying and microservices
Moving structured data between different big data frameworks and/or data
warehouses/storage systems often causes significant overhead; frequently, more
than 80% of the total data-access time is spent in the
serialization/deserialization step. Columnar data formats are gaining
popularity in both analytics and transactional databases. Apache Arrow, a
unified columnar in-memory data format, promises to provide efficient data
storage, access, manipulation, and transport. In addition, with the
introduction of the Arrow Flight communication capabilities, built on top of
gRPC, Arrow enables high-performance data transfer over TCP networks. Arrow Flight
allows parallel Arrow RecordBatch transfer over networks in a platform and
language-independent way, and offers high performance, parallelism and security
based on open-source standards.
In this paper, we bring together some recently implemented use cases of Arrow
Flight with their benchmarking results. These use cases include bulk Arrow data
transfer, querying subsystems and Flight as a microservice integration into
different frameworks to show the throughput and scalability results of this
protocol. We show that Flight is able to achieve up to 6000 MB/s and 4800 MB/s
throughput for DoGet() and DoPut() operations, respectively. On nodes with
Mellanox ConnectX-3 or Connect-IB interconnects, Flight can utilize up to 95%
of the total available bandwidth. Flight is scalable and can use up to half of
the available system cores efficiently for bidirectional communication. For
query systems like Dremio, Flight is an order of magnitude faster than the
ODBC and turbodbc protocols: an Arrow Flight based implementation on Dremio
performs 20x and 30x better compared to turbodbc and ODBC connections,
respectively.
RDMA mechanisms for columnar data in analytical environments
Integrated master's dissertation in Informatics Engineering
The amount of data in information systems is growing constantly and, as a consequence, the
complexity of analytical processing is greater. There are several storage solutions to persist
this information, with different architectures targeting different use cases. For analytical
processing, storage solutions with a column-oriented format are particularly relevant due
to the convenient placement of the data in persistent storage and the closer mapping to
in-memory processing.
The access to the database is typically remote and has overhead associated, mainly when
it is necessary to obtain the same data multiple times. Thus, it is desirable to have a cache
on the processing side, and there are solutions for this. The problem with the existing solutions is the overhead introduced by network latency and memory copies between logical
layers. Remote Direct Memory Access (RDMA) mechanisms have the potential to help minimize this overhead. Furthermore, this type of mechanism is well suited to large amounts of
data because zero-copy has more impact as the data volume increases. One of the problems
associated with RDMA mechanisms is the complexity of development. This complexity is
induced by a development paradigm that differs from other network communication protocols such as TCP.
Aiming to improve the efficiency of analytical processing, this dissertation presents a distributed cache that takes advantage of RDMA mechanisms to improve analytical processing
performance. The cache abstracts the intricacies of RDMA mechanisms and is developed
as a middleware making it transparent to take advantage of this technology. Moreover, this
technique could be used in other contexts where a distributed cache makes sense, such as
a set of replicated web servers that access the same database.
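The caching idea above can be modelled as a read-through cache in front of a remote column store (a minimal in-process sketch; the RDMA transport and the middleware's real API are assumptions left out here):

```python
class ReadThroughCache:
    """Minimal read-through cache for remote column reads."""

    def __init__(self, fetch_remote):
        self._fetch = fetch_remote   # callable: column name -> column data
        self._store = {}             # local (cache-side) copies
        self.remote_reads = 0        # round trips actually taken

    def get(self, column):
        if column not in self._store:          # miss: one remote round trip
            self._store[column] = self._fetch(column)
            self.remote_reads += 1
        return self._store[column]             # hit: served locally

# Hypothetical remote table standing in for the columnar store.
REMOTE = {"price": [9.5, 3.0, 7.25]}
cache = ReadThroughCache(lambda col: list(REMOTE[col]))

assert cache.get("price") == [9.5, 3.0, 7.25]
assert cache.get("price") == [9.5, 3.0, 7.25]   # second read is local
assert cache.remote_reads == 1
```

The dissertation's contribution is making the miss path (`self._fetch`) an RDMA transfer hidden behind exactly this kind of transparent interface.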
Master of Science thesis
Efficient movement of massive amounts of data over high-speed networks at high throughput is essential for a modern-day in-memory storage system. In response to growing throughput and latency demands at scale, a new class of database systems was developed in recent years. The development of these systems was guided by increased access to high-throughput, low-latency network fabrics and the declining cost of Dynamic Random Access Memory (DRAM). These systems were designed with On-Line Transactional Processing (OLTP) workloads in mind and, as a result, are optimized for fast dispatch and perform well under small request-response scenarios. However, massive server responses, such as those for range queries and data migration for load balancing, pose challenges for this design. This thesis analyzes the effects of large transfers on scale-out systems through the lens of a modern Network Interface Card (NIC). The present-day NIC offers new and exciting opportunities and challenges for large transfers, but using them efficiently requires smart data layout and concurrency control. We evaluated the impact of modern NICs on data-layout design by measuring transmit performance and full-system impact, observing the effects of Direct Memory Access (DMA), Remote Direct Memory Access (RDMA), and caching improvements such as Intel® Data Direct I/O (DDIO). We discovered that the use of techniques such as zero copy yields around 25% savings in CPU cycles and a 50% reduction in memory bandwidth utilization on a server, using a client-assisted design with records that are not updated in place. We also set up experiments that underlined the bottlenecks in the current approach to data migration in RAMCloud, and we propose guidelines for a fast and efficient migration protocol for RAMCloud.
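The client-assisted design with records that are not updated in place can be caricatured as an append-only log: updates append new versions, so a concurrent zero-copy (DMA) read of an older version never observes a partial overwrite (a toy model, not RAMCloud's actual log structure):

```python
class AppendOnlyStore:
    """Toy versioned store: writes append; old versions stay immutable."""

    def __init__(self):
        self._log = []     # append-only record log
        self._index = {}   # key -> offset of the latest version

    def put(self, key, value):
        self._index[key] = len(self._log)   # point at the new version
        self._log.append((key, value))      # never overwrite in place

    def get(self, key):
        return self._log[self._index[key]][1]

    def snapshot_offset(self, key):
        # A NIC could safely DMA from this offset while new versions
        # are appended elsewhere in the log.
        return self._index[key]

s = AppendOnlyStore()
s.put("a", 1)
old = s.snapshot_offset("a")
s.put("a", 2)                    # the update appends a new version
assert s.get("a") == 2
assert s._log[old] == ("a", 1)   # old version still intact for readers
```

Because in-flight transfers never race with overwrites, the server can hand buffers to the NIC without copying them first, which is where the cited CPU and memory-bandwidth savings come from.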
Steroid OpenFlow Service Scalability Analysis
Modern cloud applications are hosted in data centers across vast geographical scopes and exchange large amounts of data continuously. Transmission Control Protocol (TCP) is the most popular protocol for reliable data transfer; however, due to TCP's congestion control mechanism, the maximum achievable throughput across a large bandwidth-delay product (BDP) network is limited. Various solutions exist to enhance data transfer throughput, but they usually require non-trivial, explicit installation and tuning of specialized software on both sides, which limits deployment. A software-defined networking (SDN) based solution, Steroid OpenFlow Service (SOS), was developed that uses multiple parallel TCP connections to transparently enhance network performance across a large BDP network. OpenFlow transparently redirects user traffic to nearby service machines called SOS agents, and these agents use multiple TCP connections to transfer data quickly across the large BDP network. While SOS has shown significant improvements in data transfer throughput, multiple factors affect its performance. This study focuses on SOS scalability analysis, targeting four critical factors: CPU utilization of SOS agents, the sockets used for parallel TCP connections, how OpenFlow is used, and network configurations. Through this study, the SOS agent code was revamped for performance improvements. Experiments were conducted on the National Science Foundation's CloudLab platform to assess the effect of the above-mentioned factors on SOS performance. Results have shown an improvement in throughput per SOS session from 10.96 Gbps to 12.82 Gbps by removing a CPU bottleneck on a 25 Gbps network. SOS deployment over an InfiniBand network has shown a linear increase in throughput to 23.22 Gbps with optimal network configurations. Using OpenFlow to support multiple client connections to the same server has increased throughput from 12.17 Gbps to 17.20 Gbps.
The study showed that, with code-level improvements and optimal network configurations, SOS performance can be improved substantially.
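SOS's central mechanism, spreading one logical transfer over several parallel TCP connections, can be modelled without a network by showing just the chunk scheduling and reassembly logic an agent performs (chunk sizes and function names here are illustrative, not from the SOS code):

```python
def split_round_robin(data, n_conns, chunk=4):
    """Assign fixed-size, sequence-numbered chunks to n_conns parallel
    streams in round-robin order, as a sending agent would."""
    streams = [[] for _ in range(n_conns)]
    for i in range(0, len(data), chunk):
        seq = i // chunk
        streams[seq % n_conns].append((seq, data[i:i + chunk]))
    return streams

def reassemble(streams):
    """Receiving agent: merge sequence-numbered chunks from all
    connections back into the original byte string."""
    chunks = sorted(c for s in streams for c in s)   # order by seq number
    return b"".join(payload for _, payload in chunks)

payload = bytes(range(26))
streams = split_round_robin(payload, n_conns=4)
assert reassemble(streams) == payload
```

Each stream carries only a fraction of the data, so no single TCP congestion window limits the aggregate throughput across a large-BDP path.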
System Support for Modern Memory Technologies (Systemunterstützung für moderne Speichertechnologien)
Trust and scalability are the two significant factors which impede the dissemination of clouds.
The possibility of privileged access to customer data by a cloud provider limits the usage of clouds for processing security-sensitive data.
Low-latency cloud services rely on in-memory computations and are thus limited by several characteristics of Dynamic RAM (DRAM), such as capacity, density, and energy consumption.
Two technological areas address these factors.
Mainstream server platforms, such as Intel Software Guard eXtensions (SGX) and AMD Secure Encrypted Virtualisation (SEV), offer extensions for trusted execution in untrusted environments.
Various technologies of Non-Volatile RAM (NV-RAM) have better capacity and density compared to DRAM and thus can be considered as DRAM alternatives in the future.
However, these technologies and extensions require new programming approaches and system support since they add features to the system architecture: new system components (Intel SGX) and data persistence (NV-RAM).
This thesis is devoted to the programming and architectural aspects of persistent and trusted systems.
For trusted systems, an in-depth analysis of new architectural extensions was performed.
A novel framework named EActors and a database engine named STANlite were developed to effectively use the capabilities of trusted execution.
For persistent systems, an in-depth analysis of prospective memory technologies, their features and the possible impact on system architecture was performed.
A new persistence model, called the hypervisor-based model of persistence, was developed and evaluated by the NV-Hypervisor.
This offers transparent persistence for legacy and proprietary software, and supports virtualisation of persistent memory.
Compass: A Decentralized Scheduler for Latency-Sensitive ML Workflows
We consider ML query processing in distributed systems where GPU-enabled
workers coordinate to execute complex queries: a computing style often seen in
applications that interact with users in support of image processing and
natural language processing. In such systems, coscheduling of GPU memory
management and task placement represents a promising opportunity. We propose
Compass, a novel framework that unifies these functions to reduce job latency
while using resources efficiently, placing tasks where data dependencies will
be satisfied, collocating tasks from the same job (when this will not overload
the host or its GPU), and efficiently managing GPU memory. Comparison with
other state-of-the-art schedulers shows a significant reduction in completion
times while requiring the same amount of resources or even fewer. In one case,
just half the servers were needed to process the same workload.
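Compass's placement policy as summarized above can be caricatured in a few lines: prefer the worker that already holds the task's data dependencies in GPU memory, collocate tasks of the same job, and never overload a host (the capacity model and field names are illustrative, not Compass's actual heuristics):

```python
def place(task, workers):
    """Pick a worker for `task`.

    task: dict with 'job' (job id) and 'deps' (set of object ids).
    workers: list of dicts with 'cache' (object ids resident in GPU
             memory), 'jobs' (job ids running), 'load', 'capacity'.
    Preference order: data locality, then job collocation, then load.
    """
    def score(w):
        if w["load"] >= w["capacity"]:         # never overload a host
            return (float("inf"),)
        locality = -len(task["deps"] & w["cache"])   # more hits = better
        colloc = 0 if task["job"] in w["jobs"] else 1
        return (0, locality, colloc, w["load"])

    best = min(workers, key=score)
    if score(best)[0] == float("inf"):
        raise RuntimeError("no worker has spare capacity")
    best["load"] += 1
    best["jobs"].add(task["job"])
    return best

w0 = {"cache": {"m1"}, "jobs": set(),    "load": 0, "capacity": 2}
w1 = {"cache": set(),  "jobs": {"jobA"}, "load": 2, "capacity": 2}  # full
chosen = place({"job": "jobA", "deps": {"m1"}}, [w0, w1])
assert chosen is w0   # w1 is at capacity; w0 already holds the dependency
```

Unifying the cache-residency signal with placement in one scoring function is the "coscheduling" opportunity the abstract refers to; a decentralized deployment would run this logic per scheduler instance.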
Characterizing Network Requirements for GPU API Remoting in AI Applications
GPU remoting is a promising technique for supporting AI applications.
Networking plays a key role in enabling remoting. However, for efficient
remoting, the network requirements in terms of latency and bandwidth are
unknown. In this paper, we take a GPU-centric approach to derive the minimum
latency and bandwidth requirements for GPU remoting, while ensuring no (or
little) performance degradation for AI applications. Our study, including a
theoretical model, demonstrates that, with careful remoting design, unmodified
AI applications can run in a remoting setup using commodity networking
hardware without any overhead, or even with better performance, and with low
network demands.
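The kind of back-of-envelope requirement derivation suggested above can be sketched as follows: remoting adds no overhead when the per-call network time hides behind the GPU's compute time (a simplified model under assumed parameters, not the paper's exact formulation):

```python
def remoting_overhead(calls_per_s, bytes_per_call, rtt_s, bw_Bps):
    """Fraction of extra wall time introduced by remoting each GPU API
    call, assuming calls are pipelined so only argument transfer plus
    round-trip latency can stall the GPU."""
    net_time_per_call = rtt_s + bytes_per_call / bw_Bps
    gpu_time_per_call = 1.0 / calls_per_s
    # If the network keeps up with the call rate, overhead is zero;
    # otherwise the excess per-call network time is pure slowdown.
    return max(0.0, net_time_per_call / gpu_time_per_call - 1.0)

# Example (assumed numbers): 10k API calls/s, 4 KiB of arguments per
# call, 20 us RTT, and 10 GbE (~1.25 GB/s usable).
ovh = remoting_overhead(10_000, 4096, 20e-6, 1.25e9)
assert ovh == 0.0   # ~23 us of network time fits in the 100 us call budget
```

Inverting the same model yields the minimum latency and bandwidth a given application can tolerate, which is the GPU-centric question the paper asks.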
High Performance Computing using Infiniband-based clusters
The abstract is in the attachment.