17 research outputs found

Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators

Author: Carro Luigi
Cela Jose M.
Fernandes Fernando
Fratin Vinicius
Hanzich Mauricio
Lunardi Caio
Navaux Philippe
Oliveira Daniel
Pilla Laercio
Rech Paolo
Publication date: 01/03/2016

In this paper, we evaluate the criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, as far as imprecise computing is concerned, simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications' output by correlating the number of corrupted elements with their spatial locality. We also provide the dataset-wise mean relative error to evaluate the magnitude of radiation-induced errors. We apply the selected metrics to experimental results obtained in various radiation test campaigns totaling more than 400 hours of beam time per device. The amount of data we gathered allows us to evaluate the error criticality of a representative set of algorithms from HPC suites. Additionally, based on the characteristics of the tested algorithms, we draw generic reliability conclusions for broader classes of codes. We show that arithmetic operations are less critical for the K40, while the Xeon Phi is more reliable when executing particle interactions solved through Finite Difference Methods. Finally, iterative stencil operations seem the most reliable on both architectures.

This work was supported by the STIC-AmSud/CAPES scientific cooperation program under the EnergySFE research project grant 99999.007556/2015-02, the EU H2020 Programme, and MCTI/RNP-Brazil under the HPC4E Project, grant agreement n° 689772. Tested K40 boards were donated thanks to Steve Keckler, Timothy Tsai, and Siva Hari from NVIDIA.

Postprint (author's final draft)
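The abstract above names three kinds of metric: the number of corrupted output elements, their spatial locality, and a dataset-wise mean relative error. The following is a minimal sketch of how such metrics could be computed for a 2D output compared against a fault-free ("golden") copy. It is not the authors' tooling. The function names, the locality labels ("single", "line", "scattered"), the `tolerance` parameter, and the choice to average the relative error over every output element are all assumptions made for illustration.

```python
def relative_error(expected, observed):
    """Relative error of one element (absolute error if expected is 0)."""
    diff = abs(observed - expected)
    return diff / abs(expected) if expected != 0 else diff


def error_criticality(golden, observed, tolerance=0.0):
    """Summarize how an observed 2D output differs from the golden output.

    tolerance is a hypothetical knob for imprecise computing: elements whose
    relative error does not exceed it are not counted as corrupted.
    """
    rows, cols = len(golden), len(golden[0])
    corrupted = []      # (row, col) positions of corrupted elements
    error_sum = 0.0     # sum of relative errors over the whole dataset

    for i in range(rows):
        for j in range(cols):
            err = relative_error(golden[i][j], observed[i][j])
            error_sum += err
            if err > tolerance:
                corrupted.append((i, j))

    # Assumed, simplified locality classes (not taken from the paper).
    if not corrupted:
        locality = "none"
    elif len(corrupted) == 1:
        locality = "single"
    elif len({i for i, _ in corrupted}) == 1 or len({j for _, j in corrupted}) == 1:
        locality = "line"       # all corrupted elements share a row or column
    else:
        locality = "scattered"

    return {
        "corrupted": len(corrupted),
        "locality": locality,
        "mean_relative_error": error_sum / (rows * cols),
    }


if __name__ == "__main__":
    golden = [[1.0, 2.0], [4.0, 8.0]]
    faulty = [[1.0, 2.0], [4.0, 10.0]]
    print(error_criticality(golden, faulty))
    # With a tolerance, the same run may count as error-free.
    print(error_criticality(golden, faulty, tolerance=0.3))
```

Under these assumptions, a plain mismatch check would flag both calls identically, whereas the summary distinguishes a small, isolated deviation from a large or widespread one.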

UPCommons. Portal del coneixement obert de la UPC

Exploration of load balancing thresholds to save energy on iterative applications

Author: Castro Márcio Bastos
Mehaut Jean-Francois
Navaux Philippe Olivier Alexandre
Padoin Edson Luiz
Pilla Laercio Lima
Publication date: 01/01/2016

Lume 5.8

Genetic testing for TMEM154 mutations associated with lentivirus susceptibility in sheep

Author: Ahmedn El Beltagy
Anderson M. D.
Antonello Carta
Ben Hayes
Bertrand Servin
Brian Dalrymple
Carole Moreno
Chitko McKown Carol G.
Ciani Elena
Clare Gill
Clawson Michael L.
Cord Drogemuller
Cyril Roberts
Dave Coltman
David Machugh
Denis Larkin
Despoina Miltiadou
Elisha Gootwine
Emma Eythorsdottir
Fabio Pilla
Faruque Mdomar
Frank Nicholas
Georg Erhardt
Georgios Banos
Han Jialin
Harhay Gregory P.
Heaton Michael P
Henner Simianer
Herman Raadsma
Hutton Oddy V.
Ibrahim Cemal
James Kijas
Jillian Maddox
Johannes A. Lenstra
John Mcewan
Jon Slate
Jorge Calvo
Jorn Benenwitz
Josephine Pemberton
Juan Jose Arranz
Juha Kantanen
Kalbfleisch Theodore S.
Kijas James W.
Kimberly Gietzen
Kristen Nowak
Kui Li
Laercio R. Porto Neto
Leymaster Kreg A.
Luis V. Monteagudo Ibáñez
Lutz Bunger
Magali San Cristobal
Mariasilvia D’Andrea
Massoud Malek
Matthew Kent
Michael Heaton
Mikka Tapio
Mohammad Shariflou
Noelle Cockett
Olivier Hanotte
Ottmar Distl
Paul Scheet
Petrik Dustin T.
Pradeepa Silva
Runlin Ma
Russell Mcculloch
Samuel Paiva
Sean Mcwilliam
Selina Vattathil
Simon Boitard
Simpson Barry
Stefan Hiendleder
Steven Bishop
Terry Longhurst
Tiziana Sechi
Varsha Pardeshi
Vicki Whan
Vidya Gupta
William Barendse
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2013

Stefan Hiendleder is a member of the International Sheep Genomics Consortium.

In sheep, small ruminant lentiviruses cause an incurable, progressive, lymphoproliferative disease that affects millions of animals worldwide. Known as ovine progressive pneumonia virus (OPPV) in the U.S., and Visna/Maedi virus (VMV) elsewhere, these viruses reduce an animal's health, productivity, and lifespan. Genetic variation in the ovine transmembrane protein 154 gene (TMEM154) has been previously associated with OPPV infection in U.S. sheep. Sheep with the ancestral TMEM154 haplotype encoding glutamate (E) at position 35, and either form of an N70I variant, were highly susceptible compared to sheep homozygous for the K35 missense mutation. Our current overall aim was to characterize TMEM154 in sheep from around the world to develop an efficient genetic test for reduced susceptibility. The average frequency of TMEM154 E35 among 74 breeds was 0.51 and indicated that highly susceptible alleles were present in most breeds around the world. Analysis of whole genome sequences from an international panel of 75 sheep revealed more than 1,300 previously unreported polymorphisms in a 62 kb region containing TMEM154 and confirmed that the most susceptible haplotypes were distributed worldwide. Novel missense mutations were discovered in the signal peptide (A13V) and the extracellular domains (E31Q, I74F, and I102T) of TMEM154. A matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) assay was developed to detect these and six previously reported missense and two deletion mutations in TMEM154. In blinded trials, the call rate for the eight most common coding polymorphisms was 99.4% for 499 sheep tested, and 96.0% of the animals were assigned paired TMEM154 haplotypes (i.e., diplotypes). The widespread distribution of highly susceptible TMEM154 alleles suggests that genetic testing and selection may improve the health and productivity of infected flocks.

Michael P. Heaton, Theodore S. Kalbfleisch, Dustin T. Petrik, Barry Simpson, James W. Kijas, Michael L. Clawson, Carol G. Chitko-McKown, Gregory P. Harhay, Kreg A. Leymaster, the International Sheep Genomics Consortium

Public Library of Science (PLOS)

CiteSeerX

Crossref

Adelaide Research & Scholarship

Directory of Open Access Journals

Archivio istituzionale della ricerca - Università di Bari

PubMed Central

Bern Open Repository and Information System (BORIS)

White Rose Research Online

University of Queensland eSpace

Análise de desempenho da arquitetura CUDA utilizando os NAS parallel benchmarks [Performance analysis of the CUDA architecture using the NAS Parallel Benchmarks]

Author: Pilla Laercio Lima
Publication date: 01/01/2009

Graphics processors are being used not only for graphics applications but also as parallel accelerators for general-purpose computations (GPGPU). This is due to their low cost and high parallel performance potential, reaching teraflops. CUDA (Compute Unified Device Architecture) is an example of an architecture with these characteristics. Several applications in areas such as fluid dynamics, speech recognition, and sequence alignment have already been ported to CUDA. However, there is no clear definition of which kinds of applications can benefit from the potential performance gains of GPGPUs, since the hardware has a SIMD (Single Instruction, Multiple Data) architecture and there are restrictions on memory access. To study this question, this work presents a performance analysis of the CUDA GPU architecture guided by parallel models and benchmarks. To this end, the EP and FT benchmarks from the NAS Parallel Benchmarks were ported to and optimized for CUDA, keeping the use of double-precision floating-point operations. These two benchmarks belong to the MapReduce and Spectral Methods classes, respectively, of the Dwarf Mine classification. A performance analysis compared the results of the newly implemented versions with the original versions of the code, compiled for sequential execution and for parallel execution with OpenMP. The results showed speedups of up to 21 times for the EP benchmark and almost 3 times for the FT benchmark when comparing the CUDA versions with the OpenMP versions. These results indicate a compatibility between the CUDA architecture and applications belonging to the MapReduce and Spectral Methods classes.

Lume 5.8

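Balanceamento de Carga ciente da topologia de máquina para a portabilidade de desempenho em plataformas de alto desempenho paralelas [Machine-topology-aware load balancing for performance portability on parallel high-performance platforms]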

Author: Pilla Laercio Lima
Publication date: 01/01/2014

This thesis presents our research to provide performance portability and scalability to complex scientific applications running on hierarchical multicore parallel platforms. Performance portability is said to be attained when low core idleness is achieved while mapping a given application to different platforms. It can be affected by performance problems such as load imbalance and costly communications, and by overheads coming from the task mapping algorithm. Load imbalance is a result of irregular and dynamic load behaviors, where the amount of work to be processed varies depending on the task and the step of the simulation. Meanwhile, costly communications are caused by a task distribution that does not take into account the different communication times present in a hierarchical platform. This includes nonuniform and asymmetric communication costs at the memory and network levels. Lastly, task mapping overheads come from the execution time of the task mapping algorithm as it tries to mitigate load imbalance and costly communications, and from the migration of tasks.

Our approach to achieving performance portability is based on the hypothesis that precise machine topology information can help task mapping algorithms in their decisions. In this context, we propose a generic machine topology model of parallel platforms composed of one or more multicore compute nodes. It includes profiled latencies and bandwidths at the memory and network levels, and highlights asymmetries and nonuniformity at both levels. This information is employed by our three proposed topology-aware load balancing algorithms, named NUCOLB, HWTOPOLB, and HIERARCHICALLB. Besides topology information, these algorithms also employ application information gathered at runtime. NUCOLB focuses on the nonuniform aspects of parallel platforms, while HWTOPOLB considers the whole hierarchy in its decisions, and HIERARCHICALLB combines these algorithms hierarchically to reduce its task mapping overhead. These algorithms seek to mitigate load imbalance and costly communications while averting task migration overheads.

Experimental results with the proposed load balancers on different platforms composed of one or more multicore compute nodes showed performance improvements over state-of-the-art load balancing algorithms: NUCOLB presented improvements of up to 19% on one compute node; HWTOPOLB achieved performance improvements of 19% on average; and HIERARCHICALLB outperformed HWTOPOLB by 22% on average on parallel platforms with ten or more compute nodes. These results were achieved by equalizing work among the available resources, reducing the communication costs experienced by applications, and keeping load balancing overheads low. In this sense, our load balancing algorithms provide performance portability to scientific applications while being independent of any specific application or system architecture.
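To make the idea of topology-aware task mapping more concrete, here is a minimal greedy sketch that weighs both core load and communication distance when placing tasks. It is not NUCOLB, HWTOPOLB, or HIERARCHICALLB, whose details are not given in the abstract. The function names, the `alpha` weighting factor, the `distance` matrix standing in for profiled topology costs, and the idleness formula are all assumptions made for illustration.

```python
def greedy_topology_aware_map(task_loads, comm, distance, alpha=1.0):
    """Map tasks to cores, heaviest task first.

    task_loads: list of per-task loads (application information).
    comm:       dict {(task_a, task_b): volume} of communication volumes.
    distance:   square matrix; distance[c1][c2] is a hypothetical cost of
                communicating between cores c1 and c2 (topology information).
    alpha:      hypothetical weight of communication cost relative to load.
    """
    num_cores = len(distance)
    mapping = {}                      # task -> core
    core_load = [0.0] * num_cores

    # Place heavy tasks first so that light ones can fill the gaps.
    order = sorted(range(len(task_loads)), key=lambda t: -task_loads[t])
    for task in order:
        best_core, best_cost = None, None
        for core in range(num_cores):
            # Cost of talking to already-placed tasks from this core.
            comm_cost = 0.0
            for other, other_core in mapping.items():
                volume = comm.get((task, other), 0.0) + comm.get((other, task), 0.0)
                comm_cost += volume * distance[core][other_core]
            cost = core_load[core] + task_loads[task] + alpha * comm_cost
            if best_cost is None or cost < best_cost:
                best_core, best_cost = core, cost
        mapping[task] = best_core
        core_load[best_core] += task_loads[task]
    return mapping, core_load


def core_idleness(core_load):
    """Fraction of core time spent idle, assuming all cores wait for the slowest."""
    busiest = max(core_load)
    if busiest == 0:
        return 0.0
    return 1.0 - sum(core_load) / (len(core_load) * busiest)


if __name__ == "__main__":
    distance = [[0.0, 1.0], [1.0, 0.0]]   # two cores, remote access costs 1
    # Independent tasks: the loads are spread evenly.
    print(greedy_topology_aware_map([4.0, 3.0, 2.0, 1.0], {}, distance))
    # Two tasks that communicate heavily end up on the same core.
    print(greedy_topology_aware_map([1.0, 1.0], {(0, 1): 10.0}, distance))
```

The second call illustrates the trade-off the abstract describes: with a high communication volume, placing both tasks on one core costs less than balancing them across cores, even though it leaves the other core idle.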

Lume 5.8

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

RCAAP - Repositório Científico de Acesso Aberto de Portugal

A sharing-aware memory management unit for online mapping in multi-core architectures

Author: Cruz Eduardo Henrique Molina da
Diener Matthias
Navaux Philippe Olivier Alexandre
Pilla Laercio Lima
Publication date: 01/01/2016

Lume 5.8

Neutron sensitivity and software hardening strategies for matrix multiplication and FFT on graphics processing units

Author: Laercio Pilla
Luigi Carro
Paolo Rech
Philippe Navaux
Silvestri Francesco
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2013

In this paper, we compare the radiation response of GPUs executing matrix multiplication and FFT algorithms. The experimental results demonstrate that, for both algorithms, the output is affected by multiple errors in the majority of cases. The architectural and code analyses highlight that multiple errors are caused by corruption of shared resources or by thread dependencies. The experimental data and analytical studies can be used to evaluate the expected error rate of GPUs in realistic applications and to design specific, optimized software-based hardening procedures.

Archivio istituzionale della ricerca - Università di Padova