441 research outputs found

    Preparing HPC Applications for the Exascale Era: A Decoupling Strategy

    Full text link
    Production-quality parallel applications are often a mixture of diverse operations, such as computation- and communication-intensive, regular and irregular, tightly coupled and loosely linked operations. In conventional construction of parallel applications, each process performs all the operations, which might result inefficient and seriously limit scalability, especially at large scale. We propose a decoupling strategy to improve the scalability of applications running on large-scale systems. Our strategy separates application operations onto groups of processes and enables a dataflow processing paradigm among the groups. This mechanism is effective in reducing the impact of load imbalance and increases the parallel efficiency by pipelining multiple operations. We provide a proof-of-concept implementation using MPI, the de-facto programming system on current supercomputers. We demonstrate the effectiveness of this strategy by decoupling the reduce, particle communication, halo exchange and I/O operations in a set of scientific and data-analytics applications. A performance evaluation on 8,192 processes of a Cray XC40 supercomputer shows that the proposed approach can achieve up to 4x performance improvement.Comment: The 46th International Conference on Parallel Processing (ICPP-2017

    Network Vortex: Distributed Virtual Memory for Streaming Applications

    Get PDF
    Explosive growth of the Internet, cluster computing, and storage technology has led to generation of enormous volumes of information and the need for scalable data computing. One of the central frameworks for such analysis is MapReduce, which is a programming platform for processing streaming data in external/distributed memory. Despite a significant public effort, open-source implementations of MapReduce (e.g., Hadoop, Spark) are complicated, bulky, and inefficient. To overcome this problem, we explore employing and expanding upon a recent a C/C++ programming abstraction called Vortex that offers a simple interface to the user, zero-copy operation, low RAM consumption, and high data throughput. In particular, this research examines algorithms and techniques for enabling Vortex operation over the network, including both TCP/IP sockets and data-link RDMA (e.g., InfiniBand) interfaces. We developed a new producer-consumer memory stream abstraction presented as a Vortex stream split across two hosts, travelling through a hidden network communication layer to provide the illusion of writing a continuous stream of data directly into a window of memory on a remote machine, thereby enabling the creation of high-performance networking code and size-agnostic data transport under appropriate circumstances written as simply as an in-memory copy operation, overcoming complications normally inherent in the discrete nature of network packet transfer. While the resulting product is highly workable over standard IP-based internet networks, the design limitations of RDMA technology in interfacing with virtual memory prove to make Vortex streams a suboptimal abstraction for this programming platform, as its central appeal of zero-copy network transfers are rendered largely inaccessible. Alternative algorithms to enhance RDMA performance with Vortex are proposed for future study

    Transferring big data across the globe

    Get PDF
    Transmitting data via the Internet is a routine and common task for users today. The amount of data being transmitted by the average user has dramatically increased over the past few years. Transferring a gigabyte of data in an entire day was normal, however users are now transmitting multiple gigabytes in a single hour. With the influx of big data and massive scientific data sets that are measured in tens of petabytes, a user has the propensity to transfer even larger amounts of data. When transferring data sets of this magnitude on public or shared networks, the performance of all workloads in the system will be impacted. This dissertation addresses the issues and challenges inherent with transferring big data over shared networks. A survey of current transfer techniques is provided and these techniques are evaluated in simulated, experimental and live environments. The main contribution of this dissertation is the development of a new, nice model for big data transfers, which is based on a store-and-forward methodology instead of an end-to-end approach. This nice model ensures that big data transfers only occur when there is idle bandwidth that can be repurposed for these large transfers. The nice model improves overall performance and significantly reduces the transmission time for big data transfers. The model allows for efficient transfers regardless of time zone differences or variations in bandwidth between sender and receiver. Nice is the first model that addresses the challenges of transferring big data across the globe

    Fluid Stochastic Petri Nets: From Fluid Atoms in ILP Processor Pipelines to Fluid Atoms in P2P Streaming Networks

    Get PDF
    © 2012 Mitrevski and Kotevski, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Fluid Stochastic Petri Nets: From Fluid Atoms in ILP Processor Pipelines to Fluid Atoms in P2P Streaming Networ

    Distributed Data Summarization in Well-Connected Networks

    Get PDF
    We study distributed algorithms for some fundamental problems in data summarization. Given a communication graph G of n nodes each of which may hold a value initially, we focus on computing sum_{i=1}^N g(f_i), where f_i is the number of occurrences of value i and g is some fixed function. This includes important statistics such as the number of distinct elements, frequency moments, and the empirical entropy of the data. In the CONGEST~ model, a simple adaptation from streaming lower bounds shows that it requires Omega~(D+ n) rounds, where D is the diameter of the graph, to compute some of these statistics exactly. However, these lower bounds do not hold for graphs that are well-connected. We give an algorithm that computes sum_{i=1}^{N} g(f_i) exactly in {tau_{G}} * 2^{O(sqrt{log n})} rounds where {tau_{G}} is the mixing time of G. This also has applications in computing the top k most frequent elements. We demonstrate that there is a high similarity between the GOSSIP~ model and the CONGEST~ model in well-connected graphs. In particular, we show that each round of the GOSSIP~ model can be simulated almost perfectly in O~({tau_{G}}) rounds of the CONGEST~ model. To this end, we develop a new algorithm for the GOSSIP~ model that 1 +/- epsilon approximates the p-th frequency moment F_p = sum_{i=1}^N f_i^p in O~(epsilon^{-2} n^{1-k/p}) roundsfor p >= 2, when the number of distinct elements F_0 is at most O(n^{1/(k-1)}). This result can be translated back to the CONGEST~ model with a factor O~({tau_{G}}) blow-up in the number of rounds

    Mathematical analysis of scheduling policies in peer-to-peer video streaming networks

    Get PDF
    Las redes de pares son comunidades virtuales autogestionadas, desarrolladas en la capa de aplicación sobre la infraestructura de Internet, donde los usuarios (denominados pares) comparten recursos (ancho de banda, memoria, procesamiento) para alcanzar un fin común. La distribución de video representa la aplicación más desafiante, dadas las limitaciones de ancho de banda. Existen básicamente tres servicios de video. El más simple es la descarga, donde un conjunto de servidores posee el contenido original, y los usuarios deben descargar completamente este contenido previo a su reproducción. Un segundo servicio se denomina video bajo demanda, donde los pares se unen a una red virtual siempre que inicien una solicitud de un contenido de video, e inician una descarga progresiva en línea. El último servicio es video en vivo, donde el contenido de video es generado, distribuido y visualizado simultáneamente. En esta tesis se estudian aspectos de diseño para la distribución de video en vivo y bajo demanda. Se presenta un análisis matemático de estabilidad y capacidad de arquitecturas de distribución bajo demanda híbridas, asistidas por pares. Los pares inician descargas concurrentes de múltiples contenidos, y se desconectan cuando lo desean. Se predice la evolución esperada del sistema asumiendo proceso Poisson de arribos y egresos exponenciales, mediante un modelo determinístico de fluidos. Un sub-modelo de descargas secuenciales (no simultáneas) es globalmente y estructuralmente estable, independientemente de los parámetros de la red. Mediante la Ley de Little se determina el tiempo medio de residencia de usuarios en un sistema bajo demanda secuencial estacionario. Se demuestra teóricamente que la filosofía híbrida de cooperación entre pares siempre desempeña mejor que la tecnología pura basada en cliente-servidor

    Datacenter Design for Future Cloud Radio Access Network.

    Full text link
    Cloud radio access network (C-RAN), an emerging cloud service that combines the traditional radio access network (RAN) with cloud computing technology, has been proposed as a solution to handle the growing energy consumption and cost of the traditional RAN. Through aggregating baseband units (BBUs) in a centralized cloud datacenter, C-RAN reduces energy and cost, and improves wireless throughput and quality of service. However, designing a datacenter for C-RAN has not yet been studied. In this dissertation, I investigate how a datacenter for C-RAN BBUs should be built on commodity servers. I first design WiBench, an open-source benchmark suite containing the key signal processing kernels of many mainstream wireless protocols, and study its characteristics. The characterization study shows that there is abundant data level parallelism (DLP) and thread level parallelism (TLP). Based on this result, I then develop high performance software implementations of C-RAN BBU kernels in C++ and CUDA for both CPUs and GPUs. In addition, I generalize the GPU parallelization techniques of the Turbo decoder to the trellis algorithms, an important family of algorithms that are widely used in data compression and channel coding. Then I evaluate the performance of commodity CPU servers and GPU servers. The study shows that the datacenter with GPU servers can meet the LTE standard throughput with 4× to 16× fewer machines than with CPU servers. A further energy and cost analysis show that GPU servers can save on average 13× more energy and 6× more cost. Thus, I propose the C-RAN datacenter be built using GPUs as a server platform. Next I study resource management techniques to handle the temporal and spatial traffic imbalance in a C-RAN datacenter. I propose a “hill-climbing” power management that combines powering-off GPUs and DVFS to match the temporal C-RAN traffic pattern. Under a practical traffic model, this technique saves 40% of the BBU energy in a GPU-based C-RAN datacenter. For spatial traffic imbalance, I propose three workload distribution techniques to improve load balance and throughput. Among all three techniques, pipelining packets has the most throughput improvement at 10% and 16% for balanced and unbalanced loads, respectively.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120825/1/qizheng_1.pd
    corecore