Dynamic resource allocation in a hierarchical multiprocessor system: A preliminary study
An integrated system approach to dynamic resource allocation is proposed. Some of the problems in dynamic resource allocation and the relationship of these problems to system structures are examined. A general dynamic resource allocation scheme is presented. A hierarchical system architecture which dynamically maps between processor structure and programs at multiple levels of instantiations is described. Simulation experiments were conducted to study dynamic resource allocation on the proposed system. Preliminary evaluation based on simple dynamic resource allocation algorithms indicates that with the proposed system approach, the complexity of dynamic resource management could be significantly reduced while achieving reasonably effective dynamic resource allocation.
Computing infrastructure issues in distributed communications systems : a survey of operating system transport system architectures
The performance of distributed applications (such as file transfer, remote login, tele-conferencing, full-motion video, and scientific visualization) is influenced by several factors that interact in complex ways. In particular, application performance is significantly affected both by communication infrastructure factors and computing infrastructure factors. Communication infrastructure factors include channel speed, bit-error rate, and congestion at intermediate switching nodes. Computing infrastructure factors include (among other things) both protocol processing activities (such as connection management, flow control, error detection, and retransmission) and general operating system factors (such as memory latency, CPU speed, interrupt and context switching overhead, process architecture, and message buffering). Due to an increase of several orders of magnitude in network channel speed and an increase in application diversity, performance bottlenecks are shifting from the network factors to the transport system factors. This paper defines an abstraction called an "Operating System Transport System Architecture" (OSTSA) that is used to classify the major components and services in the computing infrastructure. End-to-end network protocols such as TCP, TP4, VMTP, XTP, and Delta-t typically run on general-purpose computers, where they utilize various operating system resources such as processors, virtual memory, and network controllers. The OSTSA provides services that integrate these resources to support distributed applications running on local and wide area networks. A taxonomy is presented to evaluate OSTSAs in terms of their support for protocol processing activities. We use this taxonomy to compare and contrast five general-purpose commercial and experimental operating systems: System V UNIX, BSD UNIX, the x-kernel, Choices, and Xinu.
A Fast Causal Profiler for Task Parallel Programs
This paper proposes TASKPROF, a profiler that identifies parallelism
bottlenecks in task parallel programs. It leverages the structure of a task
parallel execution to perform fine-grained attribution of work to various parts
of the program. TASKPROF's use of hardware performance counters to perform
fine-grained measurements minimizes perturbation. TASKPROF's profile execution
runs in parallel using multi-cores. TASKPROF's causal profile enables users to
estimate improvements in parallelism when a region of code is optimized even
when concrete optimizations are not yet known. We have used TASKPROF to isolate
parallelism bottlenecks in twenty-three applications that use the Intel
Threading Building Blocks library. We have designed parallelization techniques
in five applications to increase parallelism by an order of magnitude using
TASKPROF. Our user study indicates that developers are able to isolate
performance bottlenecks with ease using TASKPROF. Comment: 11 pages
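The "what-if" estimate described in this abstract can be illustrated with a small work/span calculation. The sketch below is not TASKPROF's implementation; it assumes a hypothetical series-parallel tree of tasks (`Node`, `region`, and `speedup` are names invented here) and estimates parallelism as total work divided by critical-path length, before and after a region is hypothetically sped up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical node of a series-parallel task tree (not TASKPROF's own format)."""
    kind: str                      # "leaf", "series", or "parallel"
    cost: float = 0.0              # work of a leaf, in arbitrary units
    region: str = ""               # label of the code region a leaf belongs to
    children: List["Node"] = field(default_factory=list)


def work_and_span(node: Node, speedup: Optional[Dict[str, float]] = None) -> Tuple[float, float]:
    """Return (total work, critical-path length), scaling leaves by assumed speedups."""
    speedup = speedup or {}
    if node.kind == "leaf":
        cost = node.cost / speedup.get(node.region, 1.0)
        return cost, cost
    results = [work_and_span(child, speedup) for child in node.children]
    work = sum(w for w, _ in results)
    if node.kind == "series":
        span = sum(s for _, s in results)   # children run one after another
    else:
        span = max(s for _, s in results)   # children may run concurrently
    return work, span


def parallelism(node: Node, speedup: Optional[Dict[str, float]] = None) -> float:
    """Work divided by span: an upper bound on achievable speedup."""
    work, span = work_and_span(node, speedup)
    return work / span


# A serial setup step followed by four independent tasks.
tree = Node("series", children=[
    Node("leaf", cost=10, region="setup"),
    Node("parallel", children=[Node("leaf", cost=10, region="body") for _ in range(4)]),
])
baseline = parallelism(tree)                  # work 50, span 20
what_if = parallelism(tree, {"setup": 10.0})  # pretend "setup" becomes 10x faster
```

In this toy tree the serial `setup` leaf dominates the critical path, so speeding it up raises the estimated parallelism even though no concrete optimization has been written yet.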
Space sharing job scheduling policies for parallel computers
The distinguishing characteristic of space sharing parallel job scheduling policies is that applications are allocated non-overlapping processor subsets. The interference among jobs is reduced, the synchronization delays and message latencies can be predictable, and distinct processors may be allocated to cooperating processes so as to avoid the overhead of context switches associated with traditional time-multiplexing. The processor allocation strategy, the job selection criteria, and workload characteristics are fundamental factors that influence system performance under space sharing. Allocation can be static or dynamic. The processor subset allocated to an application is fixed under static space sharing, whereas it can change during execution under dynamic space sharing. Static allocation can produce more predictable run times, permits a wide range of compiler optimizations (e.g., static data distribution and binding), and avoids the processor releases and reallocations associated with dynamic allocation. Its major problem is that it can induce high processor fragmentation. In this dissertation, alternative static and dynamic space sharing policies that differ in the allocation discipline and the job selection criteria are studied. The results show that significantly superior performance can be achieved under static space sharing if applications can be folded (i.e., allocated fewer processors than they requested). Folding typically increases program efficiency and can reduce processor fragmentation. Policies that increase folding with the system load are proposed and compared to schemes that use unconstrained folding, no folding, and fixed maximum folding factors. The adaptive policies produced higher and more stable system utilization, significantly shorter mean response times, and good fairness curves. However, unconstrained folding resulted in considerably more severe processor fragmentation than no folding.
Its advantage is that it exploits the efficiency improvement that typically results when an application is allocated fewer processors. Consequently, it can produce shorter mean response times than no folding under medium to heavy loads. Also because of this efficiency improvement, dynamic policies that reduce waiting times by executing a large number of jobs simultaneously are more promising than schemes that limit the number of active jobs. However, limiting the number of active applications can be the superior approach when folding does not improve application efficiency.
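The folding variants compared in this abstract can be sketched as a single allocation rule parameterized by a maximum folding factor. This is a minimal illustration, not the dissertation's policies: the function names and, in particular, the rule `1 + queue_length` for growing the folding factor with load are assumptions made here.

```python
import math


def allocate(requested: int, free: int, max_fold: float) -> int:
    """Processors granted to a job, or 0 if it must wait.

    max_fold is the largest allowed ratio requested/allocated:
    1 means no folding, float("inf") means unconstrained folding.
    """
    min_procs = max(1, math.ceil(requested / max_fold))
    if free < min_procs:
        return 0                      # not enough processors even after folding
    return min(requested, free)       # never grant more than was requested


def adaptive_max_fold(queue_length: int) -> int:
    """Hypothetical adaptive rule: allow more folding as the wait queue grows."""
    return 1 + queue_length


# A job asking for 8 processors when only 4 are free:
no_folding = allocate(8, 4, max_fold=1)               # must wait
fixed_fold = allocate(8, 4, max_fold=2)               # folded onto 4 processors
unconstrained = allocate(8, 3, max_fold=float("inf")) # folded onto whatever is free
adaptive = allocate(8, 2, max_fold=adaptive_max_fold(3))
```

Setting `max_fold` to 1 reproduces the no-folding case, while an unbounded value accepts any nonzero number of free processors, which is the behaviour the abstract associates with more severe fragmentation.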
Characterization of vectorization strategies for recursive algorithms
A successful architectural trend in parallelism is the emphasis on data parallelism with SIMD hardware. Since SIMD extensions on commodity processors tend to require relatively little extra hardware, executing a SIMD instruction is essentially free from a power perspective, making vector computation an attractive target for parallelism. SIMD instructions are designed to accelerate the performance of applications such as motion video, real-time physics and graphics. Such applications perform repetitive operations on large arrays of numbers. While the key idea is to combine operations that several sequential instructions would perform on significant portions of data into a single instruction, not every application can be parallelized automatically. Regular applications with dense matrices and arrays are easier to vectorize than irregular applications that involve pointer-based data structures like trees and graphs. Programmers are burdened with the arduous task of manually tuning such applications for better performance. One such class of applications is recursive programs. While they are not traditional serial instruction sequences, they follow a serialized pattern in their control flow graph and exhibit dependencies. They can be visualized as directed tree data structures. Vectorizing recursive applications with SIMD hardware cannot be achieved by using the existing intrinsics directly because of the nature of these algorithms. In this dissertation, we argue that, for an important subset of recursive programs which arise in many domains, there exist general techniques to efficiently vectorize the program to operate on SIMD architecture. Recursive algorithms are very popular in graph problems, tree traversal algorithms, and gaming applications, among others. While multi-core and GPU implementations of such algorithms have been explored, methods to execute them efficiently on vector units like SIMD and AVX have not been explored.
We investigate techniques for work generation and efficient vectorization to enable vectorization in recursion. We further implement a generic tree model that allows us to guarantee lower bounds on its utilization efficiency.
Model-driven Scheduling for Distributed Stream Processing Systems
Distributed Stream Processing frameworks are increasingly used with the
evolution of the Internet of Things (IoT). These frameworks are designed to adapt to
the dynamic input message rate by scaling in/out. Apache Storm, originally
developed by Twitter, is a widely used stream processing engine, while others
include Flink and Spark Streaming. For running streaming applications
successfully, there is a need to know the optimal resource requirement, as
over-estimation of resources adds extra cost. So we need some strategy to come
up with the optimal resource requirement for a given streaming application. In
this article, we propose a model-driven approach for scheduling streaming
applications that effectively utilizes a priori knowledge of the applications
to provide predictable scheduling behavior. Specifically, we use application
performance models to offer reliable estimates of the resource allocation
required. Further, this intuition also drives resource mapping, and helps
narrow the gap between estimated and actual dataflow performance and resource utilization.
Together, this model-driven scheduling approach gives a predictable application
performance and resource utilization behavior for executing a given DSPS
application at a target input stream rate on distributed resources. Comment: 54 pages
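As a rough illustration of how a performance model can turn a target input rate into a resource estimate, the sketch below sizes a linear pipeline of tasks. It is not the article's model: the per-instance peak rates, the selectivities, and the task names are hypothetical values chosen for the example.

```python
import math
from typing import Dict, List, Tuple


def estimate_instances(tasks: List[Tuple[str, float, float]], input_rate: float) -> Dict[str, int]:
    """Estimate how many instances each task needs to sustain input_rate.

    tasks lists (name, peak_rate_per_instance, selectivity) in pipeline order,
    where selectivity is the number of output messages per input message.
    """
    plan = {}
    rate = input_rate
    for name, peak_rate, selectivity in tasks:
        plan[name] = math.ceil(rate / peak_rate)  # enough instances to absorb the rate
        rate *= selectivity                       # rate seen by the next task
    return plan


# Hypothetical three-stage dataflow fed at 1000 messages per second.
pipeline = [("parse", 500, 1.0), ("filter", 200, 0.5), ("store", 100, 1.0)]
plan = estimate_instances(pipeline, input_rate=1000)
```

The point of the sketch is only that, once each task's sustainable rate is known ahead of time, the allocation follows from arithmetic rather than from trial-and-error scaling.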
Parallel programming using functional languages
It has been argued for many years that functional programs are well suited to parallel evaluation. This thesis investigates this claim from a programming perspective; that is, it investigates parallel programming using functional languages. The approach taken has been to determine the minimum programming which is necessary in order to write efficient parallel programs. This has been attempted without the aid of clever compile-time analyses. It is argued that parallel evaluation should be explicitly expressed, by the programmer, in programs. To achieve this, a lazy functional language is extended with parallel and sequential combinators.
The mathematical nature of functional languages means that programs can be formally derived by program transformation. To date, most work on program derivation has concerned sequential programs. In this thesis Squigol has been used to derive three parallel algorithms. Squigol is a functional calculus for program derivation, which is becoming increasingly popular. It is shown that some aspects of Squigol are suitable for parallel program derivation, while other aspects are specifically orientated towards sequential algorithm derivation.
In order to write efficient parallel programs, parallelism must be controlled so as to limit storage usage, the number of tasks and the minimum size of tasks. In particular, over-eager evaluation or generating excessive numbers of tasks can consume too much storage. Also, tasks can be too small to be worth evaluating in parallel. Several programming techniques for parallelism control were tried. These were compared with a run-time system heuristic for parallelism control. It was discovered that the best control was effected by a combination of run-time system and programmer control of parallelism.
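The two controls named above, a cap on the number of tasks and a floor on task size, can be sketched outside a functional language as follows. This is an illustration in Python rather than the thesis's combinators; `chunk_bounds`, `parallel_sum`, and their default parameters are assumptions, and Python threads are used only to show the control logic, not to obtain real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def chunk_bounds(n: int, max_tasks: int, min_size: int) -> List[Tuple[int, int]]:
    """Split range(n) into at most max_tasks chunks of at least min_size items.

    If n is smaller than min_size, a single chunk covers everything.
    """
    tasks = max(1, min(max_tasks, n // min_size))
    return [(i * n // tasks, (i + 1) * n // tasks) for i in range(tasks)]


def parallel_sum(xs: Sequence[int], max_tasks: int = 4, min_size: int = 1000) -> int:
    """Sum xs using one task per chunk, so tiny inputs are not split at all."""
    bounds = chunk_bounds(len(xs), max_tasks, min_size)
    with ThreadPoolExecutor(max_workers=max_tasks) as pool:
        partials = pool.map(lambda b: sum(xs[b[0]:b[1]]), bounds)
        return sum(partials)


total = parallel_sum(list(range(10000)))
```

Here the programmer supplies the granularity limits explicitly; a run-time heuristic would instead pick them from the observed load.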
One of the problems with parallel programming using functional languages is that non-deterministic algorithms cannot be expressed. A bag (multiset) data type is proposed to allow a limited form of non-determinism to be expressed. Bags can be given a non-deterministic parallel implementation. However, provided that the operations used to combine bag elements are associative and commutative, the result of bag operations will be deterministic. The onus is on the programmer to prove this, but usually this is not difficult. Also, bags' insensitivity to ordering means that more transformations are directly applicable than if, say, lists were used instead.
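The determinism condition for bags can be checked on a small example by combining the same elements in every possible order. The sketch below is only an illustration in Python, not the thesis's bag type; `outcomes` is a helper invented here, and it explores sequential orderings only, whereas a parallel implementation may also regroup elements, which is where associativity is needed.

```python
from functools import reduce
from itertools import permutations
import operator


def outcomes(op, bag):
    """Set of results obtained by folding the bag's elements in every order."""
    return {reduce(op, order) for order in permutations(bag)}


bag = [3, 1, 4, 1, 5]
with_addition = outcomes(operator.add, bag)     # associative and commutative
with_subtraction = outcomes(operator.sub, bag)  # neither, so the order matters
```

Addition yields a single outcome regardless of order, while subtraction yields several, which is the kind of non-determinism the programmer's proof obligation rules out.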
It is necessary to be able to reason about and measure the performance of parallel programs. For example, sometimes algorithms which seem intuitively to be good parallel ones are not. For some higher order functions it is possible to devise parameterised formulae describing their performance. This is done for divide and conquer functions, which enables constraints to be formulated which guarantee that they have a good performance. Pipelined parallelism is difficult to analyse. Therefore a formal semantics for calculating the performance of pipelined programs is devised. This is used to analyse the performance of a pipelined Quicksort. By treating the performance semantics as a set of transformation rules, the simulation of parallel programs may be achieved by transforming programs. Some parallel programs perform poorly due to programming errors. A pragmatic method of debugging such programming errors is illustrated by some examples.