31 research outputs found
Computational and Theoretical Issues of Multiparameter Persistent Homology for Data Analysis
The basic goal of topological data analysis is to apply topology-based descriptors
to understand and describe the shape of data. In this context, homology is one of
the most relevant topological descriptors, well-appreciated for its discrete nature,
computability and dimension independence. A further development is provided
by persistent homology, which allows tracking homological features along a one-parameter
increasing sequence of spaces. Multiparameter persistent homology, also
called multipersistent homology, is an extension of the theory of persistent homology
motivated by the need to analyze data naturally described by several parameters,
such as vector-valued functions. Multipersistent homology presents several issues in
terms of feasibility of computations over real-sized data and theoretical challenges
in the evaluation of possible descriptors. The focus of this thesis is on the interplay
between persistent homology theory and discrete Morse theory. Discrete Morse
theory provides methods for reducing the computational cost of homology and persistent
homology by considering the discrete Morse complex generated by the discrete
Morse gradient in place of the original complex. The work of this thesis addresses
the problem of computing multipersistent homology, to make such a tool usable in real
application domains. This requires both computational optimizations towards the
applications to real-world data, and theoretical insights for finding and interpreting
suitable descriptors. Our computational contribution consists in proposing a new
Morse-inspired and fully discrete preprocessing algorithm. We show the feasibility
of our preprocessing over real datasets, and evaluate the impact of the proposed
algorithm as a preprocessing for computing multipersistent homology. A theoretical
contribution of this thesis consists in proposing a new notion of optimality for such
a preprocessing in the multiparameter context. We show that the proposed notion
generalizes an already known optimality notion from the one-parameter case. Under
this definition, we show that the algorithm we propose as a preprocessing is optimal
in low dimensional domains. In the last part of the thesis, we consider preliminary
applications of the proposed algorithm in the context of topology-based multivariate
visualization by tracking critical features generated by a discrete gradient field compatible
with the multiple scalar fields under study. We discuss (dis)similarities of such
critical features with the state-of-the-art techniques in topology-based multivariate
data visualization.
Irregular alignment of arbitrarily long DNA sequences on GPU
The use of Graphics Processing Units (GPUs) to accelerate computational applications is increasingly being adopted because of their affordability, flexibility and performance. However, achieving top performance comes at the price of restricted data-parallelism models. In the case of sequence alignment, most GPU-based approaches focus on accelerating the Smith-Waterman dynamic programming algorithm because of its regularity. Nevertheless, because of its quadratic complexity, it becomes impractical when comparing long sequences, and heuristic methods are therefore required to reduce the search space. We present GPUGECKO, a CUDA implementation of the sequential seed-and-extend sequence-comparison algorithm GECKO. Our proposal includes optimized kernels based on collective operations capable of producing arbitrarily long alignments while dealing with heterogeneous and unpredictable load. Contrary to other state-of-the-art methods, GPUGECKO employs a batching mechanism that prevents memory exhaustion by not requiring all alignments to fit into device memory at once, which makes it possible to run massive comparisons exhaustively with improved sensitivity while also providing up to 6x average speedup w.r.t. the CUDA acceleration of BLASTN. Funding for open access publishing: Universidad Málaga/CBUA. This work has been partially supported by the European project ELIXIR-EXCELERATE (grant no. 676559), the Spanish national project Plataforma de Recursos Biomoleculares y Bioinformáticos (ISCIII-PT13.0001.0012 and ISCIII-PT17.0009.0022), the Fondo Europeo de Desarrollo Regional (UMA18-FEDERJA-156, UMA20-FEDERJA-059), the Junta de Andalucía (P18-FR-3130), the Instituto de Investigación Biomédica de Málaga IBIMA and the University of Málaga.
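As a rough illustration of the seed-and-extend idea named in this abstract, the following is a minimal CPU sketch in Python. It is not GECKO's or GPUGECKO's implementation: the function names, the exact k-mer seeding, the ungapped extension, and the scoring parameters (`match`, `mismatch`, `drop`) are all assumptions chosen for illustration.

```python
from collections import defaultdict


def find_seeds(query, reference, k):
    """Return (query_pos, ref_pos) pairs where a k-mer matches exactly."""
    index = defaultdict(list)
    for j in range(len(reference) - k + 1):
        index[reference[j:j + k]].append(j)
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, reference, i, j, k, match=1, mismatch=-1, drop=3):
    """Ungapped extension of one seed in both directions.

    Extension in a direction stops once the running score falls `drop`
    below the best score seen so far (hypothetical cutoff rule).
    Returns (query_start, ref_start, length, score).
    """
    best = score = k * match
    best_right = 0
    step = 0
    while i + k + step < len(query) and j + k + step < len(reference):
        same = query[i + k + step] == reference[j + k + step]
        score += match if same else mismatch
        step += 1
        if score > best:
            best, best_right = score, step
        elif best - score >= drop:
            break

    score = best
    best_left = 0
    step = 0
    while i - step - 1 >= 0 and j - step - 1 >= 0:
        same = query[i - step - 1] == reference[j - step - 1]
        score += match if same else mismatch
        step += 1
        if score > best:
            best, best_left = score, step
        elif best - score >= drop:
            break

    return (i - best_left, j - best_left, best_left + k + best_right, best)


def seed_and_extend(query, reference, k=4):
    """Seed with exact k-mers, extend each seed, and deduplicate the hits."""
    hits = {extend_seed(query, reference, i, j, k)
            for i, j in find_seeds(query, reference, k)}
    return sorted(hits)
```

Calling `seed_and_extend("ACGTACGT", "TTACGTACGTTT")` returns a list of `(query_start, ref_start, length, score)` tuples. Seeds on the same diagonal extend to the same fragment and are merged by the set. The number of seeds and the length of each extension vary with the input, which gives a sense of the uneven load the abstract mentions.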
A practitioner's guide to data base compression tutorial
Data compression techniques can improve information system performance by reducing the size of a database by as much as ninety percent. This paper is written to provide assistance to practitioners considering the use of data compression for the storage of a commercial database. It reviews a wealth of literature on data compression and presents facts and guidelines which will assist system designers in evaluating the costs and benefits of compression and in selecting techniques appropriate for their needs. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/25353/1/0000800.pd
These rows are made for sorting and that's just what we'll do
Sorting is one of the most well-studied problems in computer science and a vital operation for relational database systems. Despite this, little research has been published on implementing an efficient relational sorting operator. In this work, we explore the design space of sorting in a relational database system. We use micro-benchmarks to explore how to sort relational data efficiently in analytical database systems, taking into account different query execution engines as well as row and columnar data formats. We show that, regardless of architectural differences between query engines, sorting rows is almost always more efficient than sorting columnar data, even if this requires converting the data from columns to rows and back. Sorting rows efficiently is challenging for systems with an interpreted execution engine, as their implementation has to stay generic. We show that these challenges can be overcome with several existing techniques. Based on our findings, we implement a highly optimized row-based sorting approach in the DuckDB open-source in-process analytical database management system, which has a vectorized interpreted query engine. We compare DuckDB with four analytical database systems and find that DuckDB's sort implementation outperforms query engines that sort using a columnar data format.
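To make the "columns to rows and back" step concrete, here is a minimal Python sketch contrasting the two data paths the abstract compares. It shows the data movement only, not DuckDB's implementation and not the performance difference; the function names and the dictionary-of-lists column layout are assumptions made for this example.

```python
def sort_via_rows(columns, key_names):
    """Convert columns to rows, sort the rows, and convert back to columns."""
    names = list(columns)
    rows = list(zip(*(columns[n] for n in names)))       # columns -> rows
    key_idx = [names.index(k) for k in key_names]
    rows.sort(key=lambda row: tuple(row[i] for i in key_idx))
    if not rows:
        return {n: [] for n in names}
    return {n: list(col) for n, col in zip(names, zip(*rows))}  # rows -> columns


def sort_columnar(columns, key_names):
    """Sort a permutation of row indices, then gather every column through it."""
    names = list(columns)
    n_rows = len(columns[names[0]]) if names else 0
    order = sorted(range(n_rows),
                   key=lambda r: tuple(columns[k][r] for k in key_names))
    return {n: [columns[n][r] for r in order] for n in names}
```

Both functions return the same table for the same input. In the first, each comparison reads values stored together in one row tuple. In the second, each comparison reads from a separate list per key column.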
Visual Analysis of Large Particle Data (Visuelle Analyse großer Partikeldaten)
Particle simulations are a proven and widely used numerical method in research and engineering. For example, particle simulations are used to study fuel atomization in aircraft turbines. The formation of the universe is likewise studied by simulating dark matter particles. The volumes of data produced are immense: current simulations contain trillions of particles that move over time and interact with one another. Visualization offers great potential for the exploration, validation and analysis of scientific datasets and of the underlying models. However, the focus is usually on structured data with a regular topology. In contrast, particles move freely through space and time. This perspective is known in physics as the Lagrangian frame of reference. Particles can be converted from the Lagrangian frame into a regular Eulerian frame of reference, such as a uniform grid, but for a large number of particles this involves considerable effort. Moreover, this conversion usually leads to a loss of precision together with increased memory consumption. In this dissertation, I investigate new visualization techniques based specifically on the Lagrangian perspective. These enable an efficient and effective visual analysis of large particle data.
Multidimensional projections for the visual exploration of multimedia data
Multidimensional data analysis is important when dealing with large and complex datasets. Among the options for analyzing this kind of data, visualization techniques can help the user find and understand patterns and trends, and establish new goals. This thesis presents several visualization methods for interactively exploring multidimensional datasets, aimed at users ranging from specialized to casual, that make use of both static and dynamic representations created by multidimensional projections.
Working With Incremental Spatial Data During Parallel (GPU) Computation
Central to many complex systems, spatial actors require an awareness of their local environment to enable behaviours such as communication and navigation. Complex system simulations represent this behaviour with Fixed Radius Near Neighbours (FRNN) search. This algorithm allows actors to store data at spatial locations and then query the data structure to find all data stored within a fixed radius of the search origin.
The work within this thesis answers the question: What techniques can be used for improving the performance of FRNN searches during complex system simulations on Graphics Processing Units (GPUs)?
It is generally agreed that Uniform Spatial Partitioning (USP) is the most suitable data structure for providing FRNN search on GPUs. However, due to the architectural complexities of GPUs, the performance is constrained such that FRNN search remains one of the most expensive common stages between complex systems models.
Existing innovations to USP highlight a need to take advantage of recent GPU advances, reducing the levels of divergence and limiting redundant memory accesses as viable routes to improve the performance of FRNN search. This thesis addresses these with three separate optimisations that can be used simultaneously.
Experiments have assessed the impact of the optimisations on the general case of FRNN search found within complex system simulations and demonstrated their impact in practice when applied to full complex system models. The results show that the performance of the construction and query stages of FRNN search can be improved by over 2x and 1.3x respectively. These improvements allow complex system simulations to be executed faster, enabling increases in scale and model complexity.
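For readers unfamiliar with FRNN search over a uniform spatial partitioning, the construction and query stages described in this abstract can be sketched as follows. This is a minimal serial 2D version in Python, assuming a cell width equal to the search radius. It does not reproduce the thesis's GPU implementation or any of its three optimisations, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def build_grid(points, radius):
    """Construction stage: bin each point index into a cell of width `radius`."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cell = (math.floor(x / radius), math.floor(y / radius))
        grid[cell].append(idx)
    return grid


def frnn_query(points, grid, radius, origin):
    """Query stage: return indices of all points within `radius` of `origin`.

    Because the cell width equals the radius, only the origin's cell and its
    eight neighbours (the Moore neighbourhood) can contain matches.
    """
    ox, oy = origin
    cx, cy = math.floor(ox / radius), math.floor(oy / radius)
    r2 = radius * radius
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for idx in grid.get((cx + dx, cy + dy), ()):
                px, py = points[idx]
                if (px - ox) ** 2 + (py - oy) ** 2 <= r2:
                    found.append(idx)
    return sorted(found)
```

A typical use builds the grid once per simulation step with `build_grid(points, radius)` and then calls `frnn_query` once per actor. The distance test inside the inner loop rejects points that share a neighbouring cell but lie outside the radius.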