5,150 research outputs found
Survey and Analysis of Production Distributed Computing Infrastructures
This report has two objectives. First, we describe a set of the production
distributed infrastructures currently available, so that the reader has a basic
understanding of them. This includes explaining why each infrastructure was
created and made available and how it has succeeded and failed. The set is not
complete, but we believe it is representative.
Second, we describe the infrastructures in terms of their use, which is a
combination of how they were designed to be used and how users have found ways
to use them. Applications are often designed and created with specific
infrastructures in mind, with both an appreciation of the existing capabilities
provided by those infrastructures and an anticipation of their future
capabilities. Here, the infrastructures we discuss were often designed and
created with specific applications in mind, or at least specific types of
applications. The reader should understand how the interplay between the
infrastructure providers and the users leads to such usages, which we call
usage modalities. These usage modalities are really abstractions that exist
between the infrastructures and the applications; they influence the
infrastructures by representing the applications, and they influence the ap-
plications by representing the infrastructures
Survey of End-to-End Mobile Network Measurement Testbeds, Tools, and Services
Mobile (cellular) networks enable innovation, but can also stifle it and lead
to user frustration when network performance falls below expectations. As
mobile networks become the predominant method of Internet access, developer,
research, network operator, and regulatory communities have taken an increased
interest in measuring end-to-end mobile network performance to, among other
goals, minimize negative impact on application responsiveness. In this survey
we examine current approaches to end-to-end mobile network performance
measurement, diagnosis, and application prototyping. We compare available tools
and their shortcomings with respect to the needs of researchers, developers,
regulators, and the public. We intend for this survey to provide a
comprehensive view of currently active efforts and some auspicious directions
for future work in mobile network measurement and mobile application
performance evaluation.Comment: Submitted to IEEE Communications Surveys and Tutorials. arXiv does
not format the URL references correctly. For a correctly formatted version of
this paper go to
http://www.cs.montana.edu/mwittie/publications/Goel14Survey.pd
Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework.
However, little exists in the literature that provides a thorough understanding
of the capabilities which TensorFlow offers for the distributed training of
large ML/DL models that need computation and communication at scale. Most
commonly used distributed training approaches for TF can be categorized as
follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand
Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu
Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this
paper, we provide an in-depth performance characterization and analysis of
these distributed training approaches on various GPU clusters including the Piz
Daint system (6 on Top500). We perform experiments to gain novel insights along
the following vectors: 1) Application-level scalability of DNN training, 2)
Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used
for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on
these experiments, we present two key insights: 1) Overall, No-gRPC designs
achieve better performance compared to gRPC-based approaches for most
configurations, and 2) The performance of No-gRPC is heavily influenced by the
gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware
MPI Allreduce design that exploits CUDA kernels and pointer caching to perform
large reductions efficiently. Our proposed designs offer 5-17X better
performance than NCCL2 for small and medium messages, and reduces latency by
29% for large messages. The proposed optimizations help Horovod-MPI to achieve
approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs.
Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native
gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint
cluster.Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer-revie
Crux: Locality-Preserving Distributed Services
Distributed systems achieve scalability by distributing load across many
machines, but wide-area deployments can introduce worst-case response latencies
proportional to the network's diameter. Crux is a general framework to build
locality-preserving distributed systems, by transforming an existing scalable
distributed algorithm A into a new locality-preserving algorithm ALP, which
guarantees for any two clients u and v interacting via ALP that their
interactions exhibit worst-case response latencies proportional to the network
latency between u and v. Crux builds on compact-routing theory, but generalizes
these techniques beyond routing applications. Crux provides weak and strong
consistency flavors, and shows latency improvements for localized interactions
in both cases, specifically up to several orders of magnitude for
weakly-consistent Crux (from roughly 900ms to 1ms). We deployed on PlanetLab
locality-preserving versions of a Memcached distributed cache, a Bamboo
distributed hash table, and a Redis publish/subscribe. Our results indicate
that Crux is effective and applicable to a variety of existing distributed
algorithms.Comment: 11 figure
Reducing Internet Latency : A Survey of Techniques and their Merit
Bob Briscoe, Anna Brunstrom, Andreas Petlund, David Hayes, David Ros, Ing-Jyh Tsang, Stein Gjessing, Gorry Fairhurst, Carsten Griwodz, Michael WelzlPeer reviewedPreprin
A Flexible and Scalable Architecture for Real-Time ANT+ Sensor Data Acquisition and NoSQL Storage
Wireless Personal or Body Area Networks (WPANs or WBANs) are the main mechanisms to develop healthcare systems for an ageing society. Such systems offer monitoring, security, and caring services by measuring physiological body parameters using wearable devices. Wireless sensor networks allow inexpensive, continuous, and real-time updates of the sensor data, to the data repositories via an Internet. A great deal of research is going on with a focus on technical, managerial, economic, and social health issues. The technical obstacles, which we encounter, in general, are better methodologies, architectures, and context data storage. Sensor communication, data processing and interpretation, data interchange format, data transferal, and context data storage are sensitive phases during the whole process of body parameter acquisition until the storage. ANT+ is a proprietary (but open access) low energy protocol, which supports device interoperability by mutually agreeing upon device profile standards. We have implemented a prototype, based upon ANT+ enabled sensors for a real-time scenario. This paper presents a system architecture, with its software organization, for real-time message interpretation, event-driven based real-time bidirectional communication, and schema flexible storage. A computer user uses it to acquire and to transmit the data using a Windows service to the context server
- …