High Energy Physics Forum for Computational Excellence: Working Group Reports (I. Applications Software II. Software Libraries and Tools III. Systems)
Computing plays an essential role in all aspects of high energy physics. As
computational technology evolves rapidly in new directions, and data throughput
and volume continue to follow a steep trend-line, it is important for the HEP
community to develop an effective response to a series of expected challenges.
In order to help shape the desired response, the HEP Forum for Computational
Excellence (HEP-FCE) initiated a roadmap planning activity with two key
overlapping drivers -- 1) software effectiveness, and 2) infrastructure and
expertise advancement. The HEP-FCE formed three working groups, 1) Applications
Software, 2) Software Libraries and Tools, and 3) Systems (including systems
software), to provide an overview of the current status of HEP computing and to
present findings and opportunities for the desired HEP computational roadmap.
The final versions of the reports are combined in this document, and are
presented along with introductory material. Comment: 72 pages
FfDL : A Flexible Multi-tenant Deep Learning Platform
Deep learning (DL) is becoming increasingly popular in several application
domains and has made several new application features involving computer
vision, speech recognition and synthesis, self-driving automobiles, drug
design, etc. feasible and accurate. As a result, large scale on-premise and
cloud-hosted deep learning platforms have become essential infrastructure in
many organizations. These systems accept, schedule, manage and execute DL
training jobs at scale.
This paper describes the design, implementation and our experiences with
FfDL, a DL platform used at IBM. We describe how our design balances
dependability with scalability, elasticity, flexibility and efficiency. We
examine FfDL qualitatively through a retrospective look at the lessons learned
from building, operating, and supporting FfDL; and quantitatively through a
detailed empirical evaluation of FfDL, including the overheads introduced by
the platform for various deep learning models, the load and performance
observed in a real case study using FfDL within our organization, the frequency
of various faults observed including unanticipated faults, and experiments
demonstrating the benefits of various scheduling policies. FfDL has been
open-sourced. Comment: MIDDLEWARE 201
Cloud-efficient modelling and simulation of magnetic nano materials
Scientific simulations are rarely attempted in a cloud due to the substantial
performance costs of virtualization. Considerable communication overheads,
intolerable latencies, and inefficient hardware emulation are the main reasons why
this emerging technology has not been fully exploited. On the other hand,
progress in computing infrastructure depends strongly on the development of
prospective storage media, where efficient micromagnetic
simulations play a vital role in future memory design.
This thesis addresses both topics by merging micromagnetic simulations
with the latest OpenStack cloud implementation, providing a time- and
cost-effective alternative to expensive computing centers.
However, many challenges have to be addressed before a high-performance cloud
platform emerges as a solution for problems in micromagnetic research
communities. First, the best solver candidate has to be selected and further
improved, particularly in the parallelization and process communication domain.
Second, a 3-level cloud communication hierarchy needs to be recognized and
each segment adequately addressed. The required steps include breaking VM
isolation to enable the host’s shared memory, tuning and optimizing the cloud
network stack, and efficiently integrating communication hardware.
The project work concludes with practical measurements confirming that the
simulation was successfully implemented in an open-source cloud environment.
The renewed Magpar solver runs for the first time in the OpenStack
cloud, using ivshmem for shared-memory communication. Extensive
measurements also confirmed the effectiveness of our solutions, yielding from sixty
percent to over ten times better results than those achieved in the standard cloud.
Building Distributed Systems with Non-Volatile Main Memories and RDMA Networks
High-performance, byte-addressable non-volatile main memories (NVMMs) allow application developers to combine storage and memory into a single layer. These high-performance storage systems would be especially useful in large-scale data center environments where data is distributed and replicated across multiple servers.
Unfortunately, existing approaches to providing remote storage access rest on the assumption that storage is slow, so the cost of the software and protocols is acceptable. This assumption no longer holds for fast NVMM. As a result, taking full advantage of NVMMs’ potential will require changes in system software and networking protocols. This thesis focuses on accessing remote NVMM efficiently over remote direct memory access (RDMA) networks. RDMA enables a client to directly access memory on a remote machine without involving that machine's CPU.
This thesis first presents Mojim, a system that provides replicated, reliable, and highly available NVMM as an operating system service. Applications can access data in Mojim using normal load and store instructions while controlling when and how updates propagate to replicas using system calls. Our evaluation shows Mojim adds little overhead to the un-replicated system and provides 0.4x to 2.7x the throughput of the un-replicated system.
This thesis then presents Orion, a distributed file system designed from the ground up for NVMM and RDMA networks. Traditional distributed file systems are designed for slower hard drives. These slower media incentivize complex optimizations (e.g., queuing, striping, and batching) around disk accesses. Orion combines file system functions and network operations into a single layer. It provides low-latency metadata accesses and outperforms existing distributed file systems by a large margin.
Finally, an NVMM application can map files backed by an NVMM file system into its address space and access them using CPU instructions. In this case, RDMA and NVMM file systems duplicate effort on permissions, naming, and address translation. We introduce two changes to the existing RDMA protocol: the file memory region (FileMR) and range-based address translation. By eliminating redundant translations, FileMR minimizes the number of translations done at the NIC, reducing the load on the NIC’s translation cache and improving application performance by 1.8x - 2.0x.
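The idea behind range-based address translation can be illustrated with a small sketch. This is not the FileMR implementation; the `Extent` type, the 4 KiB page size, and the example addresses are all assumptions made for illustration. It shows how a sorted list of contiguous ranges can translate a file offset to a physical address with far fewer entries than a per-page table would need.

```python
from bisect import bisect_right
from typing import List, NamedTuple

PAGE_SIZE = 4096  # assumed page size, used only for the comparison


class Extent(NamedTuple):
    """Hypothetical contiguous mapping from a file range to physical memory."""
    file_offset: int  # starting byte offset within the file
    phys_addr: int    # starting physical address of the contiguous run
    length: int       # length of the run in bytes


def translate(extents: List[Extent], offset: int) -> int:
    """Translate a file offset to a physical address.

    `extents` must be sorted by file_offset and must not overlap.
    """
    starts = [e.file_offset for e in extents]
    i = bisect_right(starts, offset) - 1  # last extent starting at or before offset
    if i < 0:
        raise ValueError("offset not mapped")
    e = extents[i]
    if offset >= e.file_offset + e.length:
        raise ValueError("offset not mapped")
    return e.phys_addr + (offset - e.file_offset)


def page_entries(extents: List[Extent]) -> int:
    """Number of entries a per-page translation table would need."""
    return sum((e.length + PAGE_SIZE - 1) // PAGE_SIZE for e in extents)


# Two contiguous runs (illustrative addresses): 8 pages, then 4 pages.
extents = [
    Extent(file_offset=0, phys_addr=0x100000, length=8 * PAGE_SIZE),
    Extent(file_offset=8 * PAGE_SIZE, phys_addr=0x900000, length=4 * PAGE_SIZE),
]
```

In this toy layout the range-based table holds one entry per extent, while `page_entries(extents)` counts one entry per page for the same file, which is the kind of reduction that lowers pressure on a translation cache.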