Locality-Aware Dynamic Task Graph Scheduling
Dynamic task graph schedulers automatically balance work across processor cores by scheduling tasks among available threads while preserving dependences. In this paper, we design NabbitC, a provably efficient dynamic task graph scheduler that accounts for data locality on NUMA systems. NabbitC allows users to assign a color to each task representing the location (e.g., a processor core) that has the most efficient access to the data needed during that node's execution. NabbitC then automatically adjusts the scheduling so as to preferentially execute each node at the location that matches its color, leading to better locality because the node is likely to make local rather than remote accesses. At the same time, NabbitC tries to optimize load balance and not add too much overhead compared to the vanilla Nabbit scheduler, which does not consider locality. We provide a theoretical analysis showing that NabbitC does not asymptotically impact the scalability of Nabbit. We evaluated the performance of NabbitC on a suite of memory-intensive benchmarks. Our experiments indicate that adding locality awareness yields a considerable performance advantage over the vanilla Nabbit scheduler. In addition, we compared NabbitC to OpenMP programs for both regular and irregular applications. For regular applications, OpenMP achieves perfect locality and perfect load balance statically. For these benchmarks, NabbitC has a small performance penalty compared to OpenMP due to its dynamic scheduling strategy. For irregular applications, where OpenMP cannot achieve locality and load balance simultaneously, we find that NabbitC performs better. Therefore, NabbitC combines the benefits of locality-aware scheduling for regular applications (the forte of static schedulers such as those in OpenMP) with dynamic adaptation to load imbalance (the forte of dynamic schedulers such as Cilk Plus, TBB, and Nabbit).
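The color mechanism described above can be illustrated with a minimal sketch. This is not NabbitC's actual API; the `schedule` function, its greedy placement policy, and all names are hypothetical. Each task carries a color naming its preferred location, and tasks without a matching location fall back to the least-loaded worker.

```python
import collections

def schedule(tasks, colors, workers):
    """Greedy sketch: each task goes to the worker matching its color
    when possible, otherwise to the least-loaded worker."""
    queues = {w: collections.deque() for w in workers}
    for t in tasks:
        preferred = colors.get(t)
        if preferred in queues:
            queues[preferred].append(t)   # locality-preferred placement
        else:
            # fall back to the least-loaded worker to preserve balance
            target = min(queues, key=lambda w: len(queues[w]))
            queues[target].append(t)
    return queues
```

A real scheduler would make this decision dynamically as tasks become ready, and would still allow a starving worker to steal mismatched tasks, which is where the load-balance/locality tension in the abstract comes from.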
Locality-Aware Concurrency Platforms
Modern computing systems from all domains are becoming increasingly parallel. Manufacturers are taking advantage of the increasing number of available transistors by packaging more and more computing resources together on a single chip or within a single system. These platforms generally contain many levels of private and shared caches in addition to physically distributed main memory. Therefore, some memory is more expensive to access than other memory, and high-performance software must treat memory locality as a first-order consideration.
Memory locality is often difficult for application developers to reason about directly, however, since many NUMA effects are invisible to the application programmer and only manifest as low performance. Moreover, on parallel platforms, performance depends on both locality and load balance, and these two metrics are often at odds with each other. Therefore, directly considering locality and load balance at the application level may make the application much more complex to program.
In this work, we develop locality-conscious concurrency platforms for multiple structured parallel programming models, including streaming applications, task graphs, and parallel for loops. In all of this work, the idea is to minimally disrupt the application programming model, so that the application developer is either unaffected or must only provide high-level hints to the runtime system. The runtime system then schedules the application to provide good locality of access while, at the same time, also providing good load balance. In particular, we address cache locality for streaming applications through static partitioning and develop an extensible platform to execute partitioned streaming applications. For task graphs, we extend a task-graph scheduling library to guide scheduling decisions towards better NUMA locality with the help of user-provided locality hints. CilkPlus parallel for loops use a randomized dynamic scheduler to distribute work, which, in many loop-based applications, results in poor locality at all levels of the memory hierarchy. We address this issue with a novel parallel for loop implementation that achieves good cache and NUMA locality while providing support to maintain good load balance dynamically.
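The contrast drawn above between randomized iteration placement and locality-friendly placement can be sketched as a simple blocked partitioning of the iteration space. This is an illustrative sketch with a hypothetical helper name, not the implementation described in this work:

```python
def blocked_ranges(n_iters, n_workers):
    """Split [0, n_iters) into contiguous blocks, one per worker, so that
    repeated loops touch the same data from the same worker (cache and
    NUMA reuse). A randomized work-stealing scheduler would instead
    scatter iterations across workers from run to run."""
    base, extra = divmod(n_iters, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

The difficulty the abstract points to is keeping this static affinity while still letting idle workers take over blocks dynamically when the load is uneven.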
Multi-Path Alpha-Fair Resource Allocation at Scale in Distributed Software Defined Networks
The performance of computer networks relies on how bandwidth is shared among
different flows. Fair resource allocation is a challenging problem particularly
when the flows evolve over time. To address this issue, bandwidth sharing
techniques that quickly react to the traffic fluctuations are of interest,
especially in large scale settings with hundreds of nodes and thousands of
flows. In this context, we propose a distributed algorithm based on the
Alternating Direction Method of Multipliers (ADMM) that tackles the multi-path
fair resource allocation problem in a distributed SDN control architecture. Our
ADMM-based algorithm continuously generates a sequence of resource allocation
solutions converging to the fair allocation while always remaining feasible, a
property that standard primal-dual decomposition methods often lack. Thanks to
the distribution of all compute-intensive operations, we demonstrate that we
can handle large instances at scale.
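The alpha-fair objective the title refers to has a standard closed form in the fairness literature; the following sketch uses conventional notation, not necessarily the paper's:

```python
import math

def alpha_fair_utility(x, alpha):
    """Alpha-fair utility of a positive rate x.
    alpha=0 recovers throughput maximization, alpha=1 proportional
    fairness (log utility), and alpha -> infinity max-min fairness."""
    if alpha == 1:
        return math.log(x)
    return x ** (1 - alpha) / (1 - alpha)
```

Fair allocation then maximizes the sum of these utilities over all flows subject to link capacities; the paper's contribution is solving that optimization distributively with ADMM while keeping every iterate feasible.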
Branch and Bound Based Load Balancing for Parallel Applications
Many parallel applications are highly dynamic in nature. In some, computation and communication patterns change gradually during the run; in others those characteristics change abruptly. Such dynamic applications require an adaptive load balancing strategy. We are exploring an adaptive approach based on multi-partition object-based decomposition, supported by object migration. For many applications, relatively infrequent load balancing is needed. In these cases it becomes economical to spend considerable computation time toward arriving at a nearly optimal mapping of objects to processors. We present an optimal-seeking branch-and-bound strategy that finds nearly optimal solutions to such load balancing problems quickly, and can continuously improve such solutions as time permits.
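The branch-and-bound idea above, applied to the simplest version of the mapping problem (minimize the maximum per-processor load), might look as follows. This is an illustrative toy with hypothetical names, not the paper's actual strategy:

```python
def min_makespan(loads, n_procs):
    """Branch-and-bound sketch: assign each object's load to a processor,
    minimizing the maximum per-processor load (the makespan). Branches
    whose partial makespan already meets or exceeds the best complete
    solution found so far are pruned."""
    best = [float("inf")]

    def branch(i, procs):
        if max(procs) >= best[0]:
            return                      # bound: cannot improve, prune
        if i == len(loads):
            best[0] = max(procs)        # complete assignment, record it
            return
        for p in range(n_procs):
            procs[p] += loads[i]
            branch(i + 1, procs)
            procs[p] -= loads[i]        # backtrack

    branch(0, [0] * n_procs)
    return best[0]
```

The "improve as time permits" behavior in the abstract corresponds to running the search anytime-style: the incumbent `best` is a valid mapping at every point and only gets better.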
Extreme Scale De Novo Metagenome Assembly
Metagenome assembly is the process of transforming a set of short,
overlapping, and potentially erroneous DNA segments from environmental samples
into an accurate representation of the underlying microbiomes' genomes.
State-of-the-art tools require big shared-memory machines and cannot handle
contemporary metagenome datasets that exceed terabytes in size. In this paper,
we introduce the MetaHipMer pipeline, a high-quality and high-performance
metagenome assembler that employs an iterative de Bruijn graph approach.
MetaHipMer leverages a specialized scaffolding algorithm that produces long
scaffolds and accommodates the idiosyncrasies of metagenomes. MetaHipMer is
end-to-end parallelized using the Unified Parallel C language and therefore can
run seamlessly on shared and distributed-memory systems. Experimental results
show that MetaHipMer matches or outperforms the state-of-the-art tools in terms
of accuracy. Moreover, MetaHipMer scales efficiently to large concurrencies and
is able to assemble previously intractable grand challenge metagenomes. We
demonstrate the unprecedented capability of MetaHipMer by computing the first
full assembly of the Twitchell Wetlands dataset, consisting of 7.5 billion
reads (2.6 TBytes in size). Comment: Accepted to SC1
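The de Bruijn graph approach mentioned above rests on a simple core step that is easy to sketch; this illustrates the general technique, not MetaHipMer's UPC implementation:

```python
def debruijn_edges(reads, k):
    """Sketch of the core de Bruijn graph step: every length-k substring
    (k-mer) of a read becomes an edge from its (k-1)-mer prefix to its
    (k-1)-mer suffix. Assemblers then walk unambiguous paths in this
    graph to reconstruct contigs."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    return edges
```

The "iterative" part of the pipeline repeats this with increasing k, and the metagenome-specific difficulty is that reads come from many genomes at wildly different abundances, which is what the specialized scaffolding addresses.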
Scalable Exact Parent Sets Identification in Bayesian Networks Learning with Apache Spark
In machine learning, the parent set identification problem is to find a set
of random variables that best explains a selected variable, given the data and some
predefined scoring function. This problem is a critical component of structure
learning of Bayesian networks and of Markov blanket discovery, and thus has many
practical applications, ranging from fraud detection to clinical decision
support. In this paper, we introduce a new distributed memory approach to the
exact parent set identification problem. To achieve scalability, we derive
theoretical bounds to constrain the search space when the MDL scoring function is
used, and we reorganize the underlying dynamic programming such that the
computational density is increased and fine-grain synchronization is
eliminated. We then design an efficient realization of our approach on the Apache
Spark platform. Through experimental results, we demonstrate that the method
maintains strong scalability on a 500-core standalone Spark cluster, and that it can
be used to efficiently process data sets with 70 variables, far beyond the
reach of currently available solutions.
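For intuition, exact parent set identification can be sketched as exhaustive subset scoring. The real contribution above is making this tractable via theoretical bounds and a reorganized dynamic program, which this toy deliberately omits; `best_parent_set` and its arguments are hypothetical names:

```python
from itertools import combinations

def best_parent_set(target, candidates, score, max_size=3):
    """Exhaustive sketch: score every candidate parent subset up to
    max_size and keep the best. Lower is better, as with MDL-style
    penalized scores."""
    best_set, best_score = (), score(target, ())
    for r in range(1, max_size + 1):
        for parents in combinations(candidates, r):
            s = score(target, parents)
            if s < best_score:
                best_set, best_score = parents, s
    return best_set, best_score
```

The number of subsets grows exponentially in the candidate count, which is exactly why the paper needs search-space bounds and a distributed dynamic program to reach 70 variables.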
Supporting intra-task parallelism in real-time multiprocessor systems
Modern real-time systems increasingly generate heavy, dynamic computational workloads, and it is becoming unrealistic to expect them to be implemented on uniprocessor systems. Indeed, the shift from single-processor to multiprocessor systems can be seen, in the general-purpose domain as well as in embedded systems, as an energy-efficient way to improve application performance.
At the same time, the proliferation of multiprocessor platforms has turned parallel programming into a topic of great interest, with dynamic parallelism rapidly gaining popularity as a programming model. The idea behind this model is to encourage programmers to expose all opportunities for parallelism by simply indicating potentially parallel regions within the application. All such annotations are treated by the system purely as hints, and may be ignored and replaced with equivalent sequential constructs by the language itself. How the computation is actually subdivided and mapped onto the various processors is thus the responsibility of the compiler and the underlying computing system.
By lifting this burden from the programmer, programming complexity is considerably reduced, which usually translates into increased productivity. However, if the underlying scheduling mechanism is not simple and fast enough to keep the overall overhead low, the benefits of generating such fine-grained parallelism will be merely hypothetical.
From this scheduling perspective, algorithms that employ a work-stealing policy are increasingly popular, with proven efficiency in terms of time, space, and communication requirements. However, these algorithms take into account neither timing constraints nor any other form of task prioritisation, which prevents them from being applied directly to real-time systems. Moreover, they are traditionally implemented in the language runtime, creating a two-level scheduling system in which the predictability essential to a real-time system cannot be guaranteed.
This thesis describes how the work-stealing approach can be redesigned to meet real-time requirements while preserving the fundamental principles that have served it so well. In short, the conventional single work queue (deque) is replaced by a queue of deques, ordered by increasing task priority. We then apply the well-known dynamic scheduling algorithm G-EDF on top, blend the rules of both, and our proposal is born: the RTWS scheduling algorithm.
Taking advantage of the modularity offered by the Linux scheduler, RTWS is added as a new scheduling class in order to evaluate in practice whether the proposed algorithm is viable, that is, whether it delivers the desired efficiency and schedulability. Modifying the Linux kernel is a complicated task, owing to the complexity of its internal functions and the strong interdependencies among its subsystems. Nevertheless, one of the goals of this thesis was to make sure that RTWS is more than an interesting concept. A significant part of this document is therefore devoted to discussing the implementation of RTWS and to exposing problematic situations, many of them not considered in theory, such as the mismatch between various synchronisation mechanisms.
Experimental results show that RTWS, compared with other practical work on dynamic scheduling of tasks with timing constraints, significantly reduces scheduling overhead through efficient and scalable control of migrations and context switches (at least up to 8 CPUs), while also achieving good dynamic balancing of the system load, and at low cost. However, the evaluation uncovered a flaw in the RTWS implementation: it gives up stealing work too easily, which causes idle periods on the CPU in question when overall system utilisation is low.
Although this work focused on keeping scheduling cost low and on achieving good data locality, system schedulability was never neglected. In fact, the proposed scheduling algorithm proved quite robust, missing no deadline in the experiments performed. We can therefore state that some priority inversion, caused by the BAS stealing sub-policy, does not compromise the schedulability goals and even helps reduce contention on the data structures. Even so, RTWS also supports a deterministic stealing sub-policy: PAS. The experimental evaluation, however, did not give a clear picture of the relative impact of the two. Overall, we can nonetheless conclude that RTWS is a promising solution for the efficient scheduling of parallel tasks with timing constraints.
Multiple programming models are emerging to address the increased need for dynamic task-level
parallelism in applications for multi-core processors and shared-memory parallel computing, presenting promising solutions from a user-level perspective. Nonetheless, while high-level parallel
languages offer a simple way for application programmers to specify parallelism in a form that
easily scales with problem size, they still leave the actual scheduling of tasks to be performed at
runtime. Therefore, if the underlying system cannot efficiently map those tasks on the available
cores, the benefits will be lost.
This is particularly important in modern real-time systems, as their average workload is rapidly
growing more parallel, complex, and compute-intensive, while still having to meet stringent timing constraints.
However, as real-time scheduling theory has mostly focused on sequential task
models, the shift to parallel task models introduces a completely new dimension to the scheduling
problem.
Within this context, the work presented in this thesis considers how to dynamically schedule
highly heterogeneous parallel applications that require real-time performance guarantees on
multi-core processors. A novel scheduling approach called RTWS is proposed. RTWS combines
the G-EDF scheduler with a priority-aware work-stealing load balancing scheme, enabling parallel
real-time tasks to be executed on more than one processor at a given time instant. Two stealing
sub-policies have arisen from this proposal and their suitability is discussed in detail.
Furthermore, this thesis describes the implementation of a new scheduling class for RTWS in the Linux
kernel and extensively evaluates its feasibility. Experimental results demonstrate
the greater scalability and lower scheduling overhead of the proposed approach compared to an
existing real-time deadline-driven scheduling policy for the Linux kernel, and reveal
better performance for tasks with intra-task parallelism than without,
even for short-lived applications.
We show that busy-aware stealing is robust to small deviations from a strict priority schedule
and conclude that some priority inversion may actually be acceptable, provided it helps reduce
contention, communication, synchronisation, and coordination between parallel threads.
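The priority-ordered queue-of-deques structure described above can be sketched in a few lines. This is a single-threaded illustration with hypothetical names; a real implementation needs per-deque synchronisation and G-EDF's deadline rules on top:

```python
import collections

class PriorityWorkStealingDeque:
    """Sketch: one deque per priority level. Pushes go to the task's
    level; both local pops and steals take from the highest-priority
    non-empty deque, but from opposite ends (pop LIFO, steal FIFO)."""
    def __init__(self):
        self.levels = {}               # priority -> deque of tasks

    def push(self, priority, task):
        self.levels.setdefault(priority, collections.deque()).append(task)

    def _top_level(self):
        nonempty = [p for p, d in self.levels.items() if d]
        return min(nonempty) if nonempty else None  # lower = higher priority

    def pop(self):                     # owner takes the newest top task
        p = self._top_level()
        return self.levels[p].pop() if p is not None else None

    def steal(self):                   # thief takes the oldest top task
        p = self._top_level()
        return self.levels[p].popleft() if p is not None else None
```

Taking from opposite ends preserves the classic work-stealing property that owners reuse their freshest (cache-warm) work while thieves take the oldest, while the per-priority levels keep stealing deadline-aware.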
Engineering MultiQueues: Fast relaxed concurrent priority queues
Priority queues with parallel access are an attractive data structure for applications like prioritized online scheduling, discrete event simulation, or greedy algorithms. However, a classical priority queue constitutes a severe bottleneck in this context, leading to very low throughput. Hence, there has been significant interest in concurrent priority queues with relaxed semantics. We investigate the complementary quality criteria rank error (how close deleted elements are to the global minimum) and delay (for each element x, how many elements with lower priority are deleted before x). In this paper, we introduce MultiQueues as a natural approach to relaxed priority queues based on multiple sequential priority queues. Their naturally high theoretical scalability is further enhanced by using three orthogonal ways of batching operations on the sequential queues. Experiments indicate that MultiQueues offer a very good performance-quality tradeoff and considerably outperform competing approaches in at least one of these aspects.
We employ a seemingly paradoxical technique of "wait-free locking" that might be of more general interest for converting sequential data structures into relaxed concurrent data structures.
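The MultiQueue idea itself is compact enough to sketch: insert into a random sequential queue, and delete-min from the better of two randomly chosen queues. This single-threaded toy (hypothetical names; the real structure uses try-locks per queue and the batching optimizations mentioned above) shows where the relaxation comes from:

```python
import heapq
import random

class MultiQueue:
    """Sketch of a relaxed priority queue built from several sequential
    binary heaps. delete_min samples two heaps and pops from the one
    with the smaller minimum, so the returned element is only
    approximately the global minimum (bounded rank error)."""
    def __init__(self, num_queues, seed=None):
        self.queues = [[] for _ in range(num_queues)]
        self.rng = random.Random(seed)

    def insert(self, item):
        heapq.heappush(self.rng.choice(self.queues), item)

    def delete_min(self):
        a, b = self.rng.choice(self.queues), self.rng.choice(self.queues)
        candidates = [q for q in (a, b) if q]   # skip empty heaps
        if not candidates:
            return None
        best = min(candidates, key=lambda q: q[0])
        return heapq.heappop(best)
```

With p threads one typically uses c*p queues for a small constant c; the two-choice sampling is what keeps the rank error low despite there being no global ordering.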