A domain-specific high-level programming model
Nowadays, computing hardware continues to move toward more parallelism and more heterogeneity in pursuit of greater computing power. From personal computers to supercomputers, several levels of parallelism arise from the interconnection of multi-core processors and many-core accelerators. Computing software needs to adapt to this trend, and programmers can use parallel programming models (PPMs) to tackle this difficult task. The available PPMs are based on tasks, directives, or low-level languages and libraries, and they offer higher or lower levels of abstraction from the architecture through their own syntax. However, to offer an efficient PPM with a higher level of abstraction while preserving performance, one approach is to restrict it to a specific domain and adapt it to a family of applications. In the present study, we propose a high-level PPM specific to digital signal processing applications. It is based on data-flow graph models of computation and a dynamic runtime model of execution (StarPU). We show how the user can easily express a digital signal processing application and take advantage of task, data, and graph parallelism in the implementation, to enhance performance on targeted heterogeneous clusters composed of CPUs and different accelerators (e.g., GPU, Xeon Phi).
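The data-flow graph model of computation described in this abstract can be sketched as follows. This is a minimal illustration only: the `Node` structure, the inline scheduling loop, and the single-input simplification are assumptions, not the paper's API; a real backend such as StarPU would dispatch each ready task to a CPU or accelerator worker rather than run it inline.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <queue>
#include <vector>

// A node of the data-flow graph: a kernel plus its outgoing edges.
struct Node {
    std::function<std::vector<double>(const std::vector<double>&)> kernel;
    std::vector<int> successors;   // data-flow edges to downstream nodes
    int pending_inputs = 0;        // predecessor outputs still awaited
};

// Execute the graph: fire any node whose inputs are all available.
std::vector<double> run_graph(std::vector<Node> g, std::vector<double> input) {
    std::map<int, std::vector<double>> data;  // node id -> its input buffer
    std::queue<int> ready;
    data[0] = std::move(input);               // node 0 is the source
    ready.push(0);
    std::vector<double> result;
    while (!ready.empty()) {
        int id = ready.front(); ready.pop();
        std::vector<double> out = g[id].kernel(data[id]);
        if (g[id].successors.empty()) result = out;  // sink node
        for (int s : g[id].successors) {
            data[s] = out;                    // single-input nodes, for brevity
            if (--g[s].pending_inputs == 0) ready.push(s);
        }
    }
    return result;
}
```

A two-stage pipeline (gain, then offset) is then just two nodes and one edge; the same graph exposes task and data parallelism that a runtime can exploit.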
Leveraging simulation practice in industry through use of desktop grid middleware
This chapter focuses on the collaborative use of computing resources to support decision making in industry. Through the use of middleware for desktop grid computing, the idle CPU cycles available on existing computing resources can be harvested and used for speeding-up the execution of applications that have “non-trivial” processing requirements. This chapter focuses on the desktop grid middleware BOINC and Condor, and discusses the integration of commercial simulation software together with free-to-download grid middleware so as to offer competitive advantage to organizations that opt for this technology. It is expected that the low-intervention integration approach presented in this chapter (meaning no changes to source code required) will appeal to both simulation practitioners (as simulations can be executed faster, which in turn would mean that more replications and optimization is possible in the same amount of time) and the management (as it can potentially increase the return on investment on existing resources)
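The "low-intervention" integration the chapter describes amounts to wrapping the unmodified simulation binary in a middleware job description. A hypothetical HTCondor (Condor) submit file gives the flavour; the file names and argument syntax of the simulation package are assumptions for illustration.

```
# Hypothetical submit description for an off-the-shelf simulation package.
# The binary is shipped untouched to idle desktop machines; each queued job
# runs one replication, distinguished here by the $(Process) job index.
executable            = simulation.exe
arguments             = model.cfg --seed $(Process)
transfer_input_files  = model.cfg
output                = replication_$(Process).out
error                 = replication_$(Process).err
log                   = experiment.log
queue 20
```

Submitting this file with `condor_submit` queues 20 independent replications, which is exactly the "more replications in the same amount of time" benefit claimed above.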
Implementation of the Low-Cost Work Stealing Algorithm for parallel computations
For quite a while, CPU clock speeds have stagnated while the number of cores keeps increasing. Because of this, parallel computing rose as a paradigm for programming multi-core architectures, making it critical to control the costs of communication. Achieving this is hard, creating the need for tools that facilitate the task.
Work Stealing (WSteal) became a popular option for scheduling multithreaded computations. It ensures scalability and can achieve high performance by spreading work across processors. Each processor owns a double-ended queue (deque) where it stores its work. When its deque is empty, the processor becomes a thief, attempting to steal work, at random, from other processors' deques. This strategy was proved to be efficient and is still used in state-of-the-art WSteal algorithms. However, due to the concurrent nature of the deque, local operations require expensive memory fences to ensure correctness. This means that even when a processor is not stealing work from others, it still incurs excessive overhead due to the local accesses to its deque. Moreover, the purely receiver-initiated approach to load balancing, as well as the random choice of victim, makes WSteal unsuitable for scheduling computations with little or unbalanced parallelism.
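The owner/thief protocol described above can be sketched with a per-processor deque. For clarity this sketch serialises every operation with a mutex; it is not the thesis's algorithm. The point of LCWS-style designs is precisely to avoid paying such synchronisation (the memory fences of lock-free deques) on the owner's frequent local operations.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <optional>

// Per-processor work-stealing deque (mutex-based sketch, not lock-free).
// The owner works LIFO at the bottom; thieves steal FIFO from the top.
struct WSDeque {
    std::mutex m;
    std::deque<int> tasks;                 // a task is just an int id here

    void push_bottom(int t) {              // owner: enqueue local work
        std::lock_guard<std::mutex> g(m);
        tasks.push_back(t);
    }
    std::optional<int> pop_bottom() {      // owner: take newest local task
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return std::nullopt;
        int t = tasks.back(); tasks.pop_back();
        return t;
    }
    std::optional<int> steal_top() {       // thief: take oldest task
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return std::nullopt;
        int t = tasks.front(); tasks.pop_front();
        return t;
    }
};
```

The opposite-ends discipline is what keeps owner and thieves mostly out of each other's way; the cost the thesis attacks is the synchronisation that the owner still pays on `push_bottom`/`pop_bottom` even when no thief is active.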
In this thesis, we explore the various limitations of WSteal, in addition to solutions proposed in related work. This is necessary to help decide on possible optimizations for the Low-Cost Work Stealing (LCWS) algorithm, proposed by Paulino and Rito, which we implemented in C++. This algorithm is proven to have exponentially less overhead than state-of-the-art WSteal algorithms. Our implementation is tested against the canonical WSteal and other variants that we implemented, so that we can quantify the gains of the algorithm.
Exploratory study to explore the role of ICT in the process of knowledge management in an Indian business environment
In the 21st century, with the emergence of a digital economy, knowledge and the knowledge-based economy are rapidly growing. Effectively understanding the processes involved in creating, managing and sharing knowledge in the business environment is critical to the success of an organization. This study builds on the authors' previous research on the enablers of knowledge management by identifying the relationship between those enablers and the role played by information and communication technologies (ICT) and ICT infrastructure in a business setting. The paper provides the findings of a survey collected from four major Indian cities (Chennai, Coimbatore, Madurai and Villupuram) regarding views and opinions about the enablers of knowledge management in a business setting. A total of 80 organizations participated in the study, with 100 participants in each city. The results show that ICT and ICT infrastructure can play a critical role in the creating, managing and sharing of knowledge in an Indian business environment.
A visual programming model to implement coarse-grained DSP applications on parallel and heterogeneous clusters
Digital signal processing (DSP) applications are among the biggest consumers of computing. They process large data volumes represented with high accuracy, use complex algorithms, and in most cases must satisfy time constraints. On the other hand, it is necessary today to use parallel and heterogeneous architectures in order to speed up the processing; the best examples are the supercomputers "Tianhe-2" and "Titan" from the top500 ranking. These architectures can contain several connected nodes, where each node includes a number of general-purpose (multi-core) processors and a number of (many-core) accelerators, ultimately offering several levels of parallelism. However, for DSP programmers it is still complicated to exploit all these levels of parallelism to reach good performance for their applications. They have to design their implementation to take advantage of all the heterogeneous computing units, taking into account the architectural specificities of each of them: communication model, memory management, data management, job scheduling, synchronization, etc. In the present work, we characterize DSP applications and, based on their distinctive features, propose a high-level visual programming model and an execution model that simplify their implementation while delivering the desired performance.
SignalPU: A programming model for DSP applications on parallel and heterogeneous clusters
Biomedical imagery, digital communications, acoustic signal processing and many other digital signal processing (DSP) applications are increasingly present in the digital world. They process growing data volumes represented with ever more accuracy, using complex algorithms with time constraints to satisfy. Consequently, they are characterized by a high demand for computing power. To satisfy this need, it is inevitable today to use parallel and heterogeneous architectures in order to speed up the processing; the best examples are supercomputers such as "Tianhe-2" and "Titan" from the top500 ranking. These architectures, with their multi-core nodes supported by many-core accelerators, offer a good response to this problem, but they remain hard to program efficiently because of issues such as synchronization, memory management, and hardware-specific details. In the present work, we propose a high-level programming model to implement digital signal processing applications easily and efficiently on heterogeneous clusters.
Dense matrix computations on NUMA architectures with distance-aware work stealing
We employ the dynamic runtime system OmpSs to decrease the overhead of data motion in the now-ubiquitous non-uniform memory access (NUMA) high-concurrency environment of multicore processors. The dense numerical linear algebra algorithms of Cholesky factorization and symmetric matrix inversion are employed as representative benchmarks. Work stealing occurs within an innovative NUMA-aware scheduling policy to reduce data movement between NUMA nodes. The overall approach achieves separation of concerns by abstracting the complexity of the hardware from the end users, so that high productivity can be achieved. Performance results on a large NUMA system outperform state-of-the-art existing implementations, with up to a twofold speedup for the Cholesky factorization as well as the symmetric matrix inversion, while the OmpSs-enabled code maintains strong similarity to its original sequential version. The authors would like to thank the National Institute for Computational Sciences for granting us access to the Nautilus system. The KAUST authors acknowledge the support of the Extreme Computing Research Center. The BSC-affiliated authors thankfully acknowledge the support of the European Commission through the HiPEAC-3 Network of Excellence (FP7-ICT 287759), the Intel-BSC Exascale Lab and IBM/BSC Exascale Initiative collaboration, the Spanish Ministry of Education (FPU), Computación de Altas Prestaciones VI (TIN2012-34557), the Generalitat de Catalunya (2014-SGR-1051), and grant SEV-2011-00067 of the Severo Ochoa Program.
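The distance-aware ingredient of such a policy can be sketched as a victim-ordering rule: a thief ranks candidate victims by NUMA distance so that work is stolen within its own node first and crosses the interconnect only as a last resort. The distance matrix and the worker-to-node mapping below are illustrative assumptions; the actual policy lives inside the OmpSs runtime.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Order steal candidates for `thief` by NUMA distance: same-node victims
// first, remote nodes last. `worker_node[w]` maps worker w to its NUMA
// node; `node_dist[a][b]` is the (symmetric) distance between nodes.
std::vector<int> victim_order(int thief,
                              const std::vector<int>& worker_node,
                              const std::vector<std::vector<int>>& node_dist) {
    std::vector<int> victims;
    for (int w = 0; w < static_cast<int>(worker_node.size()); ++w)
        if (w != thief) victims.push_back(w);
    std::stable_sort(victims.begin(), victims.end(), [&](int a, int b) {
        return node_dist[worker_node[thief]][worker_node[a]] <
               node_dist[worker_node[thief]][worker_node[b]];
    });
    return victims;
}
```

Preferring near victims keeps a stolen task's data on the NUMA node that already holds it, which is the data-movement reduction the abstract credits for the speedup.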