1,528 research outputs found
Advanced software techniques for space shuttle data management systems Final report
Airborne/spaceborn computer design and techniques for space shuttle data management system
A batch scheduler with high level components
In this article we present the design choices and the evaluation of a batch
scheduler for large clusters, named OAR. This batch scheduler is based upon an
original design that emphasizes on low software complexity by using high level
tools. The global architecture is built upon the scripting language Perl and
the relational database engine Mysql. The goal of the project OAR is to prove
that it is possible today to build a complex system for ressource management
using such tools without sacrificing efficiency and scalability. Currently, our
system offers most of the important features implemented by other batch
schedulers such as priority scheduling (by queues), reservations, backfilling
and some global computing support. Despite the use of high level tools, our
experiments show that our system has performances close to other systems.
Furthermore, OAR is currently exploited for the management of 700 nodes (a
metropolitan GRID) and has shown good efficiency and robustness
Revisiting Actor Programming in C++
The actor model of computation has gained significant popularity over the
last decade. Its high level of abstraction makes it appealing for concurrent
applications in parallel and distributed systems. However, designing a
real-world actor framework that subsumes full scalability, strong reliability,
and high resource efficiency requires many conceptual and algorithmic additives
to the original model.
In this paper, we report on designing and building CAF, the "C++ Actor
Framework". CAF targets at providing a concurrent and distributed native
environment for scaling up to very large, high-performance applications, and
equally well down to small constrained systems. We present the key
specifications and design concepts---in particular a message-transparent
architecture, type-safe message interfaces, and pattern matching
facilities---that make native actors a viable approach for many robust,
elastic, and highly distributed developments. We demonstrate the feasibility of
CAF in three scenarios: first for elastic, upscaling environments, second for
including heterogeneous hardware like GPGPUs, and third for distributed runtime
systems. Extensive performance evaluations indicate ideal runtime behaviour for
up to 64 cores at very low memory footprint, or in the presence of GPUs. In
these tests, CAF continuously outperforms the competing actor environments
Erlang, Charm++, SalsaLite, Scala, ActorFoundry, and even the OpenMPI.Comment: 33 page
General purpose simulator system study
Modifications to computerized simulator system for space shuttle and space station application
Parallel Real-Time Scheduling for Latency-Critical Applications
In order to provide safety guarantees or quality of service guarantees, many of today\u27s systems consist of latency-critical applications, e.g. applications with timing constraints. The problem of scheduling multiple latency-critical jobs on a multiprocessor or multicore machine has been extensively studied for sequential (non-parallizable) jobs and different system models and different objectives have been considered. However, the computational requirement of a single job is still limited by the capacity of a single core. To provide increasingly complex functionalities of applications and to complete their higher computational demands within the same or even more stringent timing constraints, we must exploit the internal parallelism of jobs, where individual jobs are parallel programs and can potentially utilize more than one core in parallel. However, there is little work considering scheduling multiple parallel jobs that are latency-critical.
This dissertation focuses on developing new scheduling strategies, analysis tools, and practical platform design techniques to enable efficient and scalable parallel real-time scheduling for latency-critical applications on multicore systems. In particular, the research is focused on two types of systems: (1) static real-time systems for tasks with deadlines where the temporal properties of the tasks that need to execute is known a priori and the goal is to guarantee the temporal correctness of the tasks prior to their executions; and (2) online systems for latency-critical jobs where multiple jobs arrive over time and the goal to optimize for a performance objective of jobs during the execution.
For static real-time systems for parallel tasks, several scheduling strategies, including global earliest deadline first, global rate monotonic and a novel federated scheduling, are proposed, analyzed and implemented. These scheduling strategies have the best known theoretical performance for parallel real-time tasks under any global strategy, any fixed priority scheduling and any scheduling strategy, respectively. In addition, federated scheduling is generalized to systems with multiple criticality levels and systems with stochastic tasks. Both numerical and empirical experiments show that federated scheduling and its variations have good schedulability performance and are efficient in practice.
For online systems with multiple latency-critical jobs, different online scheduling strategies are proposed and analyzed for different objectives, including maximizing the number of jobs meeting a target latency, maximizing the profit of jobs, minimizing the maximum latency and minimizing the average latency. For example, a simple First-In-First-Out scheduler is proven to be scalable for minimizing the maximum latency. Based on this theoretical intuition, a more practical work-stealing scheduler is developed, analyzed and implemented. Empirical evaluations indicate that, on both real world and synthetic workloads, this work-stealing implementation performs almost as well as an optimal scheduler
Efficiently and Transparently Maintaining High SIMD Occupancy in the Presence of Wavefront Irregularity
Demand is increasing for high throughput processing of irregular streaming applications; examples of such applications from scientific and engineering domains include biological sequence alignment, network packet filtering, automated face detection, and big graph algorithms. With wide SIMD, lightweight threads, and low-cost thread-context switching, wide-SIMD architectures such as GPUs allow considerable flexibility in the way application work is assigned to threads. However, irregular applications are challenging to map efficiently onto wide SIMD because data-dependent filtering or replication of items creates an unpredictable data wavefront of items ready for further processing. Straightforward implementations of irregular applications on a wide-SIMD architecture are prone to load imbalance and reduced occupancy, while more sophisticated implementations require advanced use of parallel GPU operations to redistribute work efficiently among threads.
This dissertation will present strategies for addressing the performance challenges of wavefront- irregular applications on wide-SIMD architectures. These strategies are embodied in a developer framework called Mercator that (1) allows developers to map irregular applications onto GPUs ac- cording to the streaming paradigm while abstracting from low-level data movement and (2) includes generalized techniques for transparently overcoming the obstacles to high throughput presented by wavefront-irregular applications on a GPU. Mercator forms the centerpiece of this dissertation, and we present its motivation, performance model, implementation, and extensions in this work
Scheduling Irregular Workloads on GPUs
This doctoral research aims at understanding the nature of the overhead for data irregular GPU workloads, proposing a solution, and examining the consequences of the result. We propose a novel, retry-free GPU workload scheduler for irregular workloads. When used in a Breadth First Search (BFS) algorithm, the proposed simple, monolithic concurrent queue scales to within 10% of ideal scalability on AMD’s Fiji GPU with 14,336 active threads. The dissertation presents an important finding that the retry overhead associated with Compare and Swap (CAS) operations is the principle reason why concurrent queues do not scale well as the number of clients increases in a massively multi-threaded environment
- …