    Integrative Dynamic Reconfiguration in a Parallel Stream Processing Engine

    Load balancing, operator instance collocation and horizontal scaling are critical issues in Parallel Stream Processing Engines for achieving low data processing latency, optimized cluster utilization and minimized communication cost, respectively. In previous work, these issues have typically been tackled separately and independently. We argue that these problems are tightly coupled in the sense that they all need to determine the allocation of workloads and migrate computational state at runtime. Optimizing them independently would result in suboptimal solutions. Therefore, in this paper, we investigate how these three issues can be modeled as one integrated optimization problem. In particular, we first consider jobs where workload allocation has little effect on the communication cost, and model the load-balancing problem as a Mixed-Integer Linear Program. Afterwards, we present an extended solution called ALBIC, which supports general jobs. We implement the proposed techniques on top of Apache Storm, an open-source Parallel Stream Processing Engine. Extensive experimental results on both synthetic and real datasets show that our techniques clearly outperform existing approaches.
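    As a concrete illustration of the Mixed-Integer Linear Program view of load balancing, the sketch below assigns operator instances to workers so that the maximum worker load is minimized. It is a minimal, hypothetical formulation, not the paper's ALBIC model; the instance loads, worker names, and the use of the PuLP solver are assumptions made for the example.

        # Minimal, hypothetical MILP sketch (not the ALBIC formulation):
        # place operator instances on workers so the maximum worker load is minimized.
        import pulp

        loads = {"op1": 4.0, "op2": 2.5, "op3": 3.0, "op4": 1.5}   # assumed per-instance load
        workers = ["w1", "w2"]

        prob = pulp.LpProblem("load_balance", pulp.LpMinimize)

        # x[i, w] = 1 if instance i is placed on worker w
        x = {(i, w): pulp.LpVariable(f"x_{i}_{w}", cat="Binary")
             for i in loads for w in workers}
        max_load = pulp.LpVariable("max_load", lowBound=0)
        prob += max_load                                    # objective: minimize the peak load

        for i in loads:                                     # every instance goes to exactly one worker
            prob += pulp.lpSum(x[i, w] for w in workers) == 1
        for w in workers:                                   # no worker may exceed the peak load
            prob += pulp.lpSum(loads[i] * x[i, w] for i in loads) <= max_load

        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        for (i, w), var in x.items():
            if var.value() == 1:
                print(f"{i} -> {w}")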

    Parallel processing and non-uniform grids in global air quality modeling

    A large-scale global air quality model, which runs efficiently on a single vector processor, is enhanced to make more realistic and longer-term simulations feasible. Two strategies are combined: non-uniform grids and parallel processing. The communication through the hierarchy of non-uniform grids interferes with the inter-processor communication. We discuss load balance in the decomposition of the domain, I/O, and inter-processor communication. A model shows that the communication overhead of both techniques is very low, so that non-uniform grids allow for large speed-ups and a high speed-up can be expected from parallelization. The implementation is in progress, and results of experiments will be reported elsewhere.
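    To make the flavour of such a performance estimate concrete, the sketch below predicts the speed-up on p processors when each step performs 1/p of the computation plus a small fixed communication overhead. The numbers and the functional form are illustrative assumptions, not the authors' actual model.

        # Back-of-the-envelope speed-up estimate; illustrative numbers only.
        def predicted_speedup(p, compute_time=100.0, comm_overhead=0.5):
            # time on p processors = perfectly divided compute + fixed communication cost
            return compute_time / (compute_time / p + comm_overhead)

        for p in (1, 4, 16, 64):
            print(f"{p:3d} processors: speed-up ~ {predicted_speedup(p):.1f}")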

    Distributed Parallel Processing

    This report summarizes work on testing new microcontrollers for image processing and parallel processing, addressing the problem of testing and deploying expensive computer hardware technology in space. The project determined that the I2C communication protocol should be converted to an alternative protocol to maximize data transfer in parallel processing. This report also analyzes the software and hardware components of the Distributed Parallel Processing project. The goal was to research alternative protocols for the recently developed Distributed Parallel Processing with CubeSats project, which aims to develop a suitable microcontroller that performs image processing techniques using parallel processing to overcome the hardware limitations of data computing in space.

    Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving Parallel Performance on Clusters of Workstations

    This study considers software techniques for improving performance on clusters of workstations and approaches for designing message-passing middleware that facilitate scalable, parallel processing. Early binding and overlapping of communication and computation are identified as fundamental approaches for improving parallel performance and scalability on clusters. Currently, cluster computers using the Message-Passing Interface for interprocess communication are the predominant choice for building high-performance computing facilities, which makes the findings of this work relevant to a wide audience from the areas of high-performance computing and parallel processing. The performance-enhancing techniques studied in this work are presently underutilized in practice because of the lack of adequate support by existing message-passing libraries and are also rarely considered by parallel algorithm designers. Furthermore, commonly accepted methods for performance analysis and evaluation of parallel systems omit these techniques and focus primarily on more obvious communication characteristics such as latency and bandwidth. This study provides a theoretical framework for describing early binding and overlapping of communication and computation in models for parallel programming. This framework defines four new performance metrics that facilitate new approaches for performance analysis of parallel systems and algorithms. This dissertation provides experimental data that validate the correctness and accuracy of the performance analysis based on the new framework. The theoretical results of this performance analysis can be used by designers of parallel system and application software for assessing the quality of their implementations and for predicting the effective performance benefits of early binding and overlapping. This work presents MPI/Pro, a new MPI implementation that is specifically optimized for clusters of workstations interconnected with high-speed networks. This MPI implementation emphasizes features such as persistent communication, asynchronous processing, low processor overhead, and independent message progress. These features are identified as critical for delivering maximum performance to applications. The experimental section of this dissertation demonstrates the capability of MPI/Pro to facilitate software techniques that result in significant application performance improvements. Specific demonstrations with Virtual Interface Architecture and TCP/IP over Ethernet are offered.
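    The sketch below illustrates the general idea of early binding and of overlapping communication with computation, using persistent MPI requests via mpi4py as a stand-in. It is a generic ring exchange under assumed buffer sizes and a toy computation, not code from MPI/Pro.

        # Generic sketch of early binding + communication/computation overlap
        # with persistent MPI requests (mpi4py); not taken from MPI/Pro.
        import numpy as np
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
        nxt, prv = (rank + 1) % size, (rank - 1) % size

        send_buf = np.full(1_000_000, rank, dtype=np.float64)
        recv_buf = np.empty_like(send_buf)

        # Early binding: communication arguments are bound once, up front.
        reqs = [comm.Send_init(send_buf, dest=nxt, tag=0),
                comm.Recv_init(recv_buf, source=prv, tag=0)]

        for _ in range(3):                                  # e.g. an iterative solver loop
            for r in reqs:
                r.Start()                                   # kick off the pre-bound transfers
            local = float(np.dot(send_buf, send_buf))       # overlap: compute while messages fly
            MPI.Request.Waitall(reqs)                       # block only when the data is needed
        print(f"rank {rank}: local dot product = {local:.0f}")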

    Distributed data association for multi-target tracking in sensor networks

    Associating sensor measurements with target tracks is a fundamental and challenging problem in multi-target tracking. The problem is even more challenging in the context of sensor networks, since association is coupled across the network, yet centralized data processing is in general infeasible due to power and bandwidth limitations. Hence efficient, distributed solutions are needed. We propose techniques based on graphical models to efficiently solve such data association problems in sensor networks. Our approach scales well with the number of sensor nodes in the network, and it is well-suited for distributed implementation. Distributed inference is realized by a message-passing algorithm which requires iterative, parallel exchange of information among neighboring nodes on the graph. To address trade-offs between inference performance and communication costs, we also propose a communication-sensitive form of message-passing that is capable of achieving near-optimal performance using far less communication. We demonstrate the effectiveness of our approach with experiments on simulated data.
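    The toy example below conveys the flavour of such iterative, parallel message-passing on a graph: each node repeatedly exchanges normalized messages with its neighbors and then forms a local belief. The three-node graph, its potentials, and the fixed iteration count are invented for illustration and are not the paper's association model.

        # Toy loopy belief propagation on a 3-node graph; all numbers are made up.
        import numpy as np

        nodes = [0, 1, 2]
        edges = [(0, 1), (1, 2), (0, 2)]
        unary = {0: np.array([0.7, 0.3]),          # local evidence at each node
                 1: np.array([0.4, 0.6]),
                 2: np.array([0.5, 0.5])}
        pairwise = np.array([[0.9, 0.1],           # compatibility: neighbors tend to agree
                             [0.1, 0.9]])

        directed = edges + [(j, i) for (i, j) in edges]
        msg = {e: np.ones(2) for e in directed}    # messages start uninformative
        nbrs = {n: [j for (i, j) in directed if i == n] for n in nodes}

        for _ in range(10):                        # iterative, parallel message update
            new = {}
            for (i, j) in directed:
                prod = unary[i].copy()
                for k in nbrs[i]:
                    if k != j:                     # combine all incoming messages except j's
                        prod = prod * msg[(k, i)]
                m = pairwise @ prod                # marginalize out node i's state
                new[(i, j)] = m / m.sum()          # normalize for numerical stability
            msg = new

        for n in nodes:                            # final beliefs at each node
            belief = unary[n].copy()
            for k in nbrs[n]:
                belief = belief * msg[(k, n)]
            print(n, belief / belief.sum())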

    Advanced list scheduling heuristic for task scheduling with communication contention for parallel embedded systems

    Modern embedded systems tend to use multiple cores or processors for processing parallel applications. This paper addresses task scheduling with communication contention for parallel embedded systems and proposes three advanced techniques to improve the list scheduling heuristic. Five groups of node levels (two existing groups and three new groups) are first used as node priorities to generate node lists. Then the critical child technique improves the selection of a processor in the scheduling process. Finally, the communication delay technique enlarges the idle time intervals on communication links. We also propose an advanced dynamic list scheduling heuristic that combines the three techniques. Experimental results show that the combined advanced dynamic heuristic effectively shortens the schedule length for most of the randomly generated DAGs in the cases of medium and high communication. Our method accelerates an application by up to 80% in the case of high communication and can also reduce the use of hardware resources.
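    For readers unfamiliar with list scheduling, the sketch below shows the basic pattern such heuristics build on: nodes of a task DAG are ordered by a level-based priority (here the bottom level) and each node is mapped to the processor that finishes it earliest, paying a communication cost only across processors. The tiny DAG, the uniform communication cost, and the priority choice are illustrative assumptions, not the paper's advanced heuristic.

        # Generic list-scheduling sketch, not the paper's advanced heuristic.
        succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}    # toy task DAG
        pred = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
        work = {"A": 2, "B": 3, "C": 4, "D": 2}                      # computation costs
        comm = 5                                                     # uniform edge communication cost
        procs = [0, 1]

        def bottom_level(n):
            # longest path (in computation cost) from n to an exit node
            return work[n] + max((bottom_level(s) for s in succ[n]), default=0)

        order = sorted(work, key=bottom_level, reverse=True)         # node list by priority

        finish, proc_of, free = {}, {}, {p: 0 for p in procs}
        for n in order:
            best = None
            for p in procs:
                # data is ready when all predecessors have finished, plus a
                # communication delay if they ran on a different processor
                ready = max([finish[m] + (0 if proc_of[m] == p else comm)
                             for m in pred[n]] or [0])
                start = max(ready, free[p])
                if best is None or start + work[n] < best[0]:
                    best = (start + work[n], p)
            finish[n], proc_of[n] = best
            free[best[1]] = best[0]

        print("schedule length:", max(finish.values()), "placement:", proc_of)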

    GPU-based Streaming for Parallel Level of Detail on Massive Model Rendering

    Rendering massive 3D models in real time has long been recognized as a very challenging problem because of the limited computational power and memory space available in a workstation. Most existing rendering techniques, especially level of detail (LOD) processing, suffer from their sequential execution nature and do not scale well with the size of the models. We present a GPU-based progressive mesh simplification approach that enables the interactive rendering of large 3D models with hundreds of millions of triangles. Our work contributes to massive-model rendering research in two ways. First, we develop a novel data structure to represent the progressive LOD mesh and design a parallel mesh simplification algorithm tailored to the GPU architecture. Second, we propose a GPU-based streaming approach that adopts a frame-to-frame coherence scheme in order to minimize the high communication cost between the CPU and the GPU. Our results show that the parallel mesh simplification algorithm and the GPU-based streaming approach significantly improve the overall rendering performance.
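    The sketch below shows, on the CPU side only, what a frame-to-frame coherence scheme of this kind can look like: per-patch LOD levels are cached, and only patches whose desired LOD changed since the previous frame are re-sent to the GPU. The patch layout, the distance-based LOD rule, and the upload callback are invented placeholders, not the paper's data structure.

        # Simplified frame-to-frame coherent streaming sketch; placeholders throughout.
        import math

        def desired_lod(distance, max_lod=5):
            # coarser representation (higher LOD index) as the patch moves away
            return min(max_lod, int(math.log2(1.0 + distance)))

        resident_lod = {}                            # patch id -> LOD currently resident on the GPU

        def stream_frame(patch_distances, upload):
            uploaded = 0
            for patch, dist in patch_distances.items():
                lod = desired_lod(dist)
                if resident_lod.get(patch) != lod:   # coherence: skip patches that did not change
                    upload(patch, lod)               # placeholder for the CPU-to-GPU copy
                    resident_lod[patch] = lod
                    uploaded += 1
            return uploaded

        # Two consecutive frames with a slightly moved camera: the second frame
        # re-streams only the patches whose desired LOD actually changed.
        frame1 = {"p0": 1.0, "p1": 8.0, "p2": 30.0}
        frame2 = {"p0": 1.2, "p1": 9.5, "p2": 33.0}
        print(stream_frame(frame1, lambda p, l: None))   # cold start: all patches uploaded
        print(stream_frame(frame2, lambda p, l: None))   # fewer uploads thanks to coherence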

    Feed-forward volume rendering algorithm for moderately parallel MIMD machines

    Algorithms for direct volume rendering on parallel and vector processors are investigated. Volumes are transformed efficiently on parallel processors by dividing the data into slices and beams of voxels. Equal-sized sets of slices along one axis are distributed to processors. Parallelism is achieved at two levels. Because each slice can be transformed independently of the others, processors transform their assigned slices with no communication, thus providing the maximum possible parallelism at the first level. Within each slice, consecutive beams are incrementally transformed using coherency in the transformation computation. Coherency across slices can also be exploited to further enhance performance; it yields the second level of parallelism through the use of vector processing or pipelining. Other ongoing efforts include investigations into image reconstruction techniques, load balancing strategies, and improving performance.
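    A minimal sketch of the first level of parallelism described above: equal-sized sets of slices are distributed to workers that process them independently, with no communication. The volume size, the worker count, and the placeholder per-voxel operation are assumptions, not the paper's renderer.

        # Slice-level parallelism sketch: each worker transforms its slab of slices
        # independently. The per-voxel operation stands in for the actual
        # viewing transformation and compositing.
        import numpy as np
        from concurrent.futures import ProcessPoolExecutor

        def transform_slices(slab):
            # each slice in the slab is processed independently of all other slices
            return np.sqrt(slab)

        if __name__ == "__main__":
            volume = np.random.rand(64, 128, 128)               # slices stacked along axis 0
            n_workers = 4
            slabs = np.array_split(volume, n_workers, axis=0)   # equal-sized sets of slices
            with ProcessPoolExecutor(max_workers=n_workers) as pool:
                parts = list(pool.map(transform_slices, slabs))
            result = np.concatenate(parts, axis=0)
            print(result.shape)                                 # (64, 128, 128)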