A domain-specific high-level programming model
Nowadays, computing hardware continues to move toward more parallelism and more heterogeneity in pursuit of greater computing power. From personal computers to supercomputers, several levels of parallelism arise from the interconnection of multi-core processors and many-core accelerators. Computing software needs to adapt to this trend, and programmers can use parallel programming models (PPMs) to tackle this difficult task. The available PPMs are based on tasks, directives, or low-level languages and libraries, and they offer higher or lower levels of abstraction from the architecture through their own syntax. However, to offer an efficient PPM with a higher level of abstraction while preserving performance, one approach is to restrict it to a specific domain and adapt it to a family of applications. In the present study, we propose a high-level PPM specific to digital signal processing applications. It is based on data-flow graph models of computation and a dynamic runtime model of execution (StarPU). We show how the user can easily express a digital signal processing application and take advantage of task, data, and graph parallelism in the implementation, to enhance performance on targeted heterogeneous clusters composed of CPUs and different accelerators (e.g., GPU, Xeon Phi).
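The data-flow graph model of computation described in this abstract can be sketched as follows. This is a minimal illustration only: the `Node` structure, the inline scheduling loop, and the single-input simplification are assumptions, not the paper's API; a real backend such as StarPU would dispatch each ready task to a CPU or accelerator worker rather than run it inline.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <queue>
#include <vector>

// A node of the data-flow graph: a kernel plus its outgoing edges.
struct Node {
    std::function<std::vector<double>(const std::vector<double>&)> kernel;
    std::vector<int> successors;   // data-flow edges to downstream nodes
    int pending_inputs = 0;        // predecessor outputs still awaited
};

// Execute the graph: fire any node whose inputs are all available.
std::vector<double> run_graph(std::vector<Node> g, std::vector<double> input) {
    std::map<int, std::vector<double>> data;  // node id -> its input buffer
    std::queue<int> ready;
    data[0] = std::move(input);               // node 0 is the source
    ready.push(0);
    std::vector<double> result;
    while (!ready.empty()) {
        int id = ready.front(); ready.pop();
        std::vector<double> out = g[id].kernel(data[id]);
        if (g[id].successors.empty()) result = out;  // sink node
        for (int s : g[id].successors) {
            data[s] = out;                    // single-input nodes, for brevity
            if (--g[s].pending_inputs == 0) ready.push(s);
        }
    }
    return result;
}
```

A two-stage pipeline (gain, then offset) is then just two nodes and one edge; the same graph exposes task and data parallelism that a runtime can exploit.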
Leveraging simulation practice in industry through use of desktop grid middleware
This chapter focuses on the collaborative use of computing resources to support decision making in industry. Through the use of middleware for desktop grid computing, the idle CPU cycles available on existing computing resources can be harvested and used for speeding-up the execution of applications that have “non-trivial” processing requirements. This chapter focuses on the desktop grid middleware BOINC and Condor, and discusses the integration of commercial simulation software together with free-to-download grid middleware so as to offer competitive advantage to organizations that opt for this technology. It is expected that the low-intervention integration approach presented in this chapter (meaning no changes to source code required) will appeal to both simulation practitioners (as simulations can be executed faster, which in turn would mean that more replications and optimization is possible in the same amount of time) and the management (as it can potentially increase the return on investment on existing resources)
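The "low-intervention" integration the chapter describes amounts to wrapping the unmodified simulation binary in a middleware job description. A hypothetical HTCondor (Condor) submit file gives the flavour; the file names and argument syntax of the simulation package are assumptions for illustration.

```
# Hypothetical submit description for an off-the-shelf simulation package.
# The binary is shipped untouched to idle desktop machines; each queued job
# runs one replication, distinguished here by the $(Process) job index.
executable            = simulation.exe
arguments             = model.cfg --seed $(Process)
transfer_input_files  = model.cfg
output                = replication_$(Process).out
error                 = replication_$(Process).err
log                   = experiment.log
queue 20
```

Submitting this file with `condor_submit` queues 20 independent replications, which is exactly the "more replications in the same amount of time" benefit claimed above.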
Implementation of the Low-Cost Work Stealing Algorithm for parallel computations
For quite a while, CPU clock speeds have stagnated while the number of cores keeps increasing. Because of this, parallel computing rose as a paradigm for programming multi-core architectures, making it critical to control the costs of communication. Achieving this is hard, creating the need for tools that facilitate the task.
Work Stealing (WSteal) became a popular option for scheduling multithreaded computations. It ensures scalability and can achieve high performance by spreading work across processors. Each processor owns a double-ended queue (deque) where it stores its work. When its deque is empty, the processor becomes a thief, attempting to steal work, at random, from other processors' deques. This strategy was proved to be efficient and is still used in state-of-the-art WSteal algorithms. However, due to the concurrent nature of the deque, local operations require expensive memory fences to ensure correctness. This means that even when a processor is not stealing work from others, it still incurs excessive overhead due to the local accesses to its deque. Moreover, the purely receiver-initiated approach to load balancing, as well as the random choice of victim, makes WSteal unsuitable for scheduling computations with little or unbalanced parallelism.
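The owner/thief protocol described above can be sketched with a per-processor deque. For clarity this sketch serialises every operation with a mutex; it is not the thesis's algorithm. The point of LCWS-style designs is precisely to avoid paying such synchronisation (the memory fences of lock-free deques) on the owner's frequent local operations.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <optional>

// Per-processor work-stealing deque (mutex-based sketch, not lock-free).
// The owner works LIFO at the bottom; thieves steal FIFO from the top.
struct WSDeque {
    std::mutex m;
    std::deque<int> tasks;                 // a task is just an int id here

    void push_bottom(int t) {              // owner: enqueue local work
        std::lock_guard<std::mutex> g(m);
        tasks.push_back(t);
    }
    std::optional<int> pop_bottom() {      // owner: take newest local task
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return std::nullopt;
        int t = tasks.back(); tasks.pop_back();
        return t;
    }
    std::optional<int> steal_top() {       // thief: take oldest task
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return std::nullopt;
        int t = tasks.front(); tasks.pop_front();
        return t;
    }
};
```

The opposite-ends discipline is what keeps owner and thieves mostly out of each other's way; the cost the thesis attacks is the synchronisation that the owner still pays on `push_bottom`/`pop_bottom` even when no thief is active.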
In this thesis, we explore the various limitations of WSteal, in addition to solutions proposed in related work. This is necessary to help decide on possible optimizations for the Low-Cost Work Stealing (LCWS) algorithm, proposed by Paulino and Rito, which we implemented in C++. This algorithm is proven to have exponentially less overhead than state-of-the-art WSteal algorithms. Our implementation is tested against the canonical WSteal and other variants that we implemented, so that we can quantify the gains of the algorithm.
Exploratory study to explore the role of ICT in the process of knowledge management in an Indian business environment
In the 21st century, with the emergence of a digital economy, knowledge and the knowledge-based economy are rapidly growing. Effectively understanding the processes involved in creating, managing and sharing knowledge in the business environment is critical to the success of an organization. This study builds on the authors' previous research on the enablers of knowledge management by identifying the relationship between those enablers and the role played by information and communication technologies (ICT) and ICT infrastructure in a business setting. The paper provides the findings of a survey collected from four major Indian cities (Chennai, Coimbatore, Madurai and Villupuram) regarding views and opinions about the enablers of knowledge management in a business setting. A total of 80 organizations participated in the study, with 100 participants in each city. The results show that ICT and ICT infrastructure can play a critical role in the creating, managing and sharing of knowledge in an Indian business environment.
A visual programming model to implement coarse-grained DSP applications on parallel and heterogeneous clusters
Digital signal processing (DSP) applications are among the biggest consumers of computing. They process large data volumes represented with high accuracy, use complex algorithms, and in most cases must satisfy time constraints. On the other hand, it is necessary today to use parallel and heterogeneous architectures in order to speed up the processing; the best examples are the supercomputers "Tianhe-2" and "Titan" from the top500 ranking. These architectures can contain several connected nodes, where each node includes a number of general-purpose (multi-core) processors and a number of (many-core) accelerators, ultimately offering several levels of parallelism. However, for DSP programmers it is still complicated to exploit all these levels of parallelism to reach good performance for their applications. They have to design their implementation to take advantage of all the heterogeneous computing units, taking into account the architectural specificities of each of them: communication model, memory management, data management, job scheduling, synchronization, etc. In the present work, we characterize DSP applications and, based on their distinctive features, propose a high-level visual programming model and an execution model that simplify their implementation while delivering the desired performance.
SignalPU: A programming model for DSP applications on parallel and heterogeneous clusters
Biomedical imagery, digital communications, acoustic signal processing and many other digital signal processing (DSP) applications are increasingly present in the digital world. They process growing data volumes represented with ever more accuracy, using complex algorithms with time constraints to satisfy. Consequently, they are characterized by a high demand for computing power. To satisfy this need, it is inevitable today to use parallel and heterogeneous architectures in order to speed up the processing; the best examples are supercomputers such as "Tianhe-2" and "Titan" from the top500 ranking. These architectures, with their multi-core nodes supported by many-core accelerators, offer a good response to this problem, but they remain hard to program efficiently because of issues such as synchronization, memory management, and hardware-specific details. In the present work, we propose a high-level programming model to implement digital signal processing applications easily and efficiently on heterogeneous clusters.
Dense matrix computations on NUMA architectures with distance-aware work stealing
We employ the dynamic runtime system OmpSs to decrease the overhead of data motion in the now-ubiquitous non-uniform memory access (NUMA) high-concurrency environment of multicore processors. The dense numerical linear algebra algorithms of Cholesky factorization and symmetric matrix inversion are employed as representative benchmarks. Work stealing occurs within an innovative NUMA-aware scheduling policy to reduce data movement between NUMA nodes. The overall approach achieves separation of concerns by abstracting the complexity of the hardware from the end users, so that high productivity can be achieved. Performance results on a large NUMA system outperform state-of-the-art existing implementations, with up to a twofold speedup for the Cholesky factorization as well as the symmetric matrix inversion, while the OmpSs-enabled code maintains strong similarity to its original sequential version. The authors would like to thank the National Institute for Computational Sciences for granting us access to the Nautilus system. The KAUST authors acknowledge the support of the Extreme Computing Research Center. The BSC-affiliated authors thankfully acknowledge the support of the European Commission through the HiPEAC-3 Network of Excellence (FP7-ICT 287759), the Intel-BSC Exascale Lab and IBM/BSC Exascale Initiative collaboration, the Spanish Ministry of Education (FPU), Computación de Altas Prestaciones VI (TIN2012-34557), the Generalitat de Catalunya (2014-SGR-1051), and grant SEV-2011-00067 of the Severo Ochoa Program.
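The distance-aware ingredient of such a policy can be sketched as a victim-ordering rule: a thief ranks candidate victims by NUMA distance so that work is stolen within its own node first and crosses the interconnect only as a last resort. The distance matrix and the worker-to-node mapping below are illustrative assumptions; the actual policy lives inside the OmpSs runtime.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Order steal candidates for `thief` by NUMA distance: same-node victims
// first, remote nodes last. `worker_node[w]` maps worker w to its NUMA
// node; `node_dist[a][b]` is the (symmetric) distance between nodes.
std::vector<int> victim_order(int thief,
                              const std::vector<int>& worker_node,
                              const std::vector<std::vector<int>>& node_dist) {
    std::vector<int> victims;
    for (int w = 0; w < static_cast<int>(worker_node.size()); ++w)
        if (w != thief) victims.push_back(w);
    std::stable_sort(victims.begin(), victims.end(), [&](int a, int b) {
        return node_dist[worker_node[thief]][worker_node[a]] <
               node_dist[worker_node[thief]][worker_node[b]];
    });
    return victims;
}
```

Preferring near victims keeps a stolen task's data on the NUMA node that already holds it, which is the data-movement reduction the abstract credits for the speedup.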