GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
While many of the architectural details of future exascale-class high
performance computer systems are still a matter of intense research, there
appears to be a general consensus that they will be strongly heterogeneous,
featuring "standard" as well as "accelerated" resources. Today, such resources
are available as multicore processors, graphics processing units (GPUs), and
other accelerators such as the Intel Xeon Phi. Any software infrastructure that
claims usefulness for such environments must be able to meet their inherent
challenges: massive multi-level parallelism, topology, asynchronicity, and
abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a
collection of building blocks that targets algorithms dealing with sparse
matrix representations on current and future large-scale systems. It implements
the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel
numerical kernels, intelligent resource management, and truly heterogeneous
parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We
describe the details of its design with respect to the challenges posed by
modern heterogeneous supercomputers and recent algorithmic developments.
Implementation details which are indispensable for achieving high efficiency
are pointed out and their necessity is justified by performance measurements or
predictions based on performance models. The library code and several
applications are available as open source. We also provide instructions on how
to make use of GHOST in existing software packages, together with a case study
which demonstrates the applicability and performance of GHOST as a component
within a larger software stack.Comment: 32 pages, 11 figure
Appraisal of Ancient Quarries and WWII Air Raids as Factors of Subsidence in Rome: A Geomatic Approach
Ancient mining and quarrying activities left anthropogenic geomorphologies that have shaped the natural landscape and affected environmental equilibria. The artificial structures and their related effects on the surrounding environment are analyzed here to characterize the quarrying landscape in the southeast area of Rome in terms of its dimensions, typology, state of preservation and interface with the urban environment. The increased occurrence of sinkhole events in urban areas has already been scientifically correlated with ancient cavities under increasing urban pressure. In this scenario, additional interacting anthropogenic factors, such as the aerial bombardments carried out during the Second World War, are considered here. These three factors have been investigated by employing a combined geomatic methodology. Information on air raids has been organized in vector archives. A dataset of historical aerial photographs has been processed into Digital Surface Models and orthomosaics to reconstruct the quarry landscape and its evolution, identify typologies of exploitation and forms of collapse, and corroborate the discussion of the induced historical and recent subsidence phenomena, comparing these outputs with photogrammetric products obtained from recent satellite data. Geological and urbanistic characterization of the study area allowed a better connection between these historical and environmental factors. In light of the information gathered so far, SAR interferometric products allowed a preliminary interpretation of ground instabilities surrounding historical quarries, air raids and recent subsidence events. Various sub-areas of the AOI where the presence of the considered factors coincides with areas of slight subsidence in the SAR velocity maps have been highlighted. Bivariate hotspot analysis substantiated the hypothesis of a spatial correlation between these multiple aspects.
Improving the efficiency of the Energy-Split tool to compute the energy of very large molecular systems
Integrated Master's dissertation in Informatics Engineering
The Energy-Split tool receives as input pieces of a very large molecular system and computes
all intra and inter-molecular energies, separately calculating the energies of each fragment
and then the total energy of the molecule. It takes into account the connectivity information
among atoms in a molecule to compute (i) the energy of all terms involving atoms covalently
bonded, namely bonds, angles, dihedral angles, and improper angles, and (ii) the Coulomb
and Van der Waals energies, which are independent of the atoms' connections and
have to be computed for every atom in the system. The operations required to obtain the
total energy of a large molecule are computationally intensive, requiring an efficient
high-performance computing approach to obtain results in an acceptable time.
The original Energy-Split Tcl code was thoroughly analyzed to be ported to a parallel and
more efficient C++ version. New data structures were defined with data locality features, to
take advantage of the advanced features present in current laptop or server systems. These
include the vector extensions to the scalar processors, an efficient on-chip memory hierarchy,
and the inherent parallelism in multicore devices. To improve on Energy-Split's sequential
variant, a parallel version was developed using auxiliary libraries. Both implementations
were tested on different multicore devices and optimized to take the most advantage of the
features in high performance computing.
Significant results were achieved by applying professional performance engineering approaches,
namely (i) identifying the data values that can be represented as Boolean variables (such as
variables used in auxiliary data structures of the traversal algorithm that computes the
Euclidean distance between atoms), leading to significant performance improvements due to
the reduced memory bottleneck (over 10 times faster), and (ii) using an adequate compressed
format (CSR) to represent and operate on sparse matrices (namely matrices of Euclidean
distances between atom pairs, since all distances beyond the user-defined cut-off distance
are treated as zero, and these are the majority of values).
After the first code optimizations, the performance of the sequential version was improved
by around 100 times when compared to the original version on a dual-socket server. The
parallel version improved up to 24 times, depending on the molecules tested, on the same
server. The overall picture shows that the Energy-Split code is highly scalable, obtaining
better results with larger molecule files, even though the atoms' arrangement influences the
algorithm's performance.
The Energy-Split tool receives as input file the description of fragments of a very large
molecular system, in order to compute the intramolecular energy values. Separately, it also
computes the energy of each fragment and the total energy of a molecule. At the same time,
it takes into account the connectivity information among the atoms of a molecule to compute
(i) the energy involving all covalently bonded atoms, namely bonds, angles, dihedral angles
and improper angles, and (ii) the Coulomb and Van der Waals energies, which are independent
of the atoms' connections and have to be computed for every atom in the system. For each
atom, Energy-Split computes the interaction energy with all other atoms in the system,
considering the partitioning of the molecule into fragments, performed in an open-source
program, Visual Molecular Dynamics.
The operations to compute these energies can be computationally very intensive, making it
necessary to adopt a high-performance computing approach in order to develop more efficient
code. The provided Tcl code was thoroughly analyzed and converted into a parallel and more
efficient C++ version.
At the same time, new data structures were defined that exploit data locality to take
advantage of the vector extensions present in any computer and to exploit the parallelism
inherent to multicore machines. A parallel version of the previously converted code was then
implemented with the help of auxiliary libraries. Both versions were tested in different
multicore environments and optimized so as to take the most advantage of high-performance
computing and obtain the best results.
Performance engineering techniques were applied, namely (i) identifying data that could be
represented in lighter formats such as Boolean variables (for example, variables used in
auxiliary data structures for computing the Euclidean distance between atoms, used in the
molecule traversal algorithm), which led to significant performance improvements (around 10
times) due to the reduced memory overhead, and (ii) using an adequate format for representing
sparse matrices (namely the matrix of the same Euclidean distances from the first point,
since all distances beyond the user-defined cutoff distance are treated as 0, and these
represent the majority of the values).
After the optimizations to the sequential version, it showed an improvement of around 100
times over the original version. The parallel version was improved by up to 24 times,
depending on the molecules in question. Overall, the code is scalable, as it yields better
results as the size of the tested molecules grows, although the arrangement of the atoms was
also found to influence the algorithm's performance.
This work was supported by FCT (Fundação para a Ciência e Tecnologia) within project
RDB-TS: Uma base de dados de reações químicas baseadas em informação de estados de transição
derivados de cálculos quânticos (Ref.ª BI2-2019_NORTE-01-0145-FEDER-031689_UMINHO),
co-funded by the North Portugal Regional Operational Programme, through the European
Regional Development Fund.
Accelerating fluid-solid simulations (Lattice-Boltzmann & Immersed-Boundary) on heterogeneous architectures
We propose a numerical approach based on the Lattice-Boltzmann (LBM) and Immersed Boundary (IB) methods to tackle the problem of the interaction of solids with an incompressible fluid flow, and its implementation on heterogeneous platforms based on data-parallel accelerators such as NVIDIA GPUs and the Intel Xeon Phi. We explain in detail the parallelization of these methods and describe a number of optimizations, mainly focusing on improving memory management and reducing the cost of host-accelerator communication. As previous research has consistently shown, pure LBM simulations are able to achieve good performance results on heterogeneous systems thanks to the high parallel efficiency of this method. Unfortunately, when coupling LBM and IB methods, the overheads of IB degrade the overall performance. As an alternative, we have explored different hybrid implementations that effectively hide such overheads and allow us to exploit both the multi-core and the hardware accelerator in a cooperative way, with excellent performance results.
SusTrainable: Promoting Sustainability as a Fundamental Driver in Software Development Training and Education. 2nd Teacher Training, January 23-27, 2023, Pula, Croatia. Revised lecture notes
This volume exhibits the revised lecture notes of the 2nd teacher training
organized as part of the project Promoting Sustainability as a Fundamental
Driver in Software Development Training and Education, held at the Juraj
Dobrila University of Pula, Croatia, in the week January 23-27, 2023. It is the
Erasmus+ project No. 2020-1-PT01-KA203-078646 - SusTrainable. More details can
be found at the project web site https://sustrainable.github.io/
Among the most important contributions of the project are two summer
schools. The 2nd SusTrainable Summer School (SusTrainable - 23) will be
organized at the University of Coimbra, Portugal, in the week July 10-14, 2023.
The summer school will consist of lectures and practical work for master and
PhD students in computing science and closely related fields. There will be
contributions from Babeş-Bolyai University, Eötvös Loránd
University, Juraj Dobrila University of Pula, Radboud University Nijmegen,
Roskilde University, Technical University of Košice, University of
Amsterdam, University of Coimbra, University of Minho, University of Plovdiv,
University of Porto, University of Rijeka.
To prepare and streamline the summer school, the consortium organized a
teacher training in Pula, Croatia. This was an event of five full days,
organized by Tihana Galinac Grbac and Neven Grbac. The Juraj Dobrila University
of Pula is deeply concerned with sustainability issues; its education,
research and management are conducted with sustainability goals in mind.
The contributions in these proceedings were reviewed and provide a good
overview of the range of topics that will be covered at the summer school. The
papers in the proceedings, as well as the very constructive and cooperative
teacher training, guarantee a summer school of the highest quality, beneficial
for all participants.
Comment: 85 pages, 8 figures, 3 code listings and 1 table; editors: Tihana
Galinac Grbac, Csaba Szabó, João Paulo Fernandes
Towards Cyberbullying-free social media in smart cities: a unified multi-modal approach
Smart cities are shifting the presence of people from the physical world to the cyber world (cyberspace). Along with facilities for societies, the troubles of the physical world, such as bullying, aggression and hate speech, are also taking their presence emphatically in cyberspace. This paper aims to mine social media posts to identify bullying comments containing text as well as images. We propose a unified representation of text and image together, eliminating the need for separate learning modules for image and text. A single-layer Convolutional Neural Network model is used with this unified representation. The major finding of this research is that text represented as an image is a better model to encode the information. We also found that a single-layer Convolutional Neural Network gives better results with the two-dimensional representation. In the current scenario, we have used three layers of text and three layers of a colour image to represent the input, which gives a recall of 74% on the bullying class with one layer of Convolutional Neural Network.
Ministry of Electronics and Information Technology (MeitY), Government of India
A pilgrimage to gravity on GPUs
In this short review we present the developments over the last 5 decades that
have led to the use of Graphics Processing Units (GPUs) for astrophysical
simulations. Since the introduction of NVIDIA's Compute Unified Device
Architecture (CUDA) in 2007 the GPU has become a valuable tool for N-body
simulations and is so popular these days that almost all papers about high
precision N-body simulations use methods that are accelerated by GPUs. With the
GPU hardware becoming more advanced and being used for more advanced algorithms
like gravitational tree-codes, we see a bright future for GPU-like hardware in
computational astrophysics.
Comment: To appear in: European Physical Journal "Special Topics": "Computer Simulations on Graphics Processing Units". 18 pages, 8 figures
Doctor of Philosophy dissertation
Memory access irregularities are a major bottleneck for bandwidth-limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory access. Noncontiguous accesses within a parallel group of threads working in lock step may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which requires runtime information to be evaluated. Compile-time methods for evaluating parallelism, such as static dependence graphs, are not capable of evaluating irregular algorithms. The goals of this dissertation are to study irregularities within the context of unstructured mesh and sparse matrix problems, analyze the impact of vectorization widths on irregularities, and present data-centric methods that improve control flow and memory access irregularity within those contexts. Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching. This novel parallelization scheme offers considerable speedups over standard methods. Vectorization widths can have significant impact on performance in vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation. Significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D discontinuous Galerkin (dG) postprocessing. It is difficult to efficiently perform dynamic updates on traditional sparse matrix formats. Explicitly controlling memory segmentation allows for in-place dynamic updates in sparse matrices.
Dynamically updating the matrix without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows for dynamic streaming updates to a sparse matrix. A new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates is also presented.