29 research outputs found
These rows are made for sorting and that's just what we'll do
Sorting is one of the most well-studied problems in computer science and a vital operation for relational database systems. Despite this, little research has been published on implementing an efficient relational sorting operator. In this work, we explore the design space of sorting in a relational database system. We use micro-benchmarks to explore how to sort relational data efficiently in analytical database systems, taking into account different query execution engines as well as row and columnar data formats. We show that, regardless of architectural differences between query engines, sorting rows is almost always more efficient than sorting columnar data, even if this requires converting the data from columns to rows and back. Sorting rows efficiently is challenging for systems with an interpreted execution engine, as their implementation has to stay generic. We show that these challenges can be overcome with several existing techniques. Based on our findings, we implement a highly optimized row-based sorting approach in the DuckDB open-source in-process analytical database management system, which has a vectorized interpreted query engine. We compare DuckDB with four analytical database systems and find that DuckDB's sort implementation outperforms query engines that sort using a columnar data format.
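As an illustration of the row-versus-column trade-off discussed above, the following pure-Python sketch (our own illustration, not DuckDB's implementation) sorts the same two-column relation once in a row format and once in a columnar format, where the columnar path must compute a sorted permutation and then gather every column through it:

```python
import random
import time

# Toy micro-benchmark: the same relation stored row-wise and column-wise.
N = 50_000
keys = [random.random() for _ in range(N)]
payload = [random.random() for _ in range(N)]

# Row format: each element is one full row; a single sort moves whole rows.
rows = list(zip(keys, payload))
t0 = time.perf_counter()
rows.sort(key=lambda r: r[0])
row_time = time.perf_counter() - t0

# Columnar format: compute a sorted permutation of the key column,
# then gather every column through that permutation.
t0 = time.perf_counter()
perm = sorted(range(N), key=keys.__getitem__)
sorted_keys = [keys[i] for i in perm]
sorted_payload = [payload[i] for i in perm]
col_time = time.perf_counter() - t0

print(f"row sort: {row_time:.3f}s, columnar sort+gather: {col_time:.3f}s")
```

The extra gather step per column is the cost the paper's column-to-row conversion avoids paying once per column.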
Irregular alignment of arbitrarily long DNA sequences on GPU
The use of Graphics Processing Units (GPUs) to accelerate computational applications is increasingly being adopted due to their affordability, flexibility and performance. However, achieving top performance comes at the price of restricted data-parallelism models. In the case of sequence alignment, most GPU-based approaches focus on accelerating the Smith-Waterman dynamic programming algorithm due to its regularity. Nevertheless, because of its quadratic complexity, it becomes impractical when comparing long sequences, and therefore heuristic methods are required to reduce the search space. We present GPUGECKO, a CUDA implementation of the sequential, seed-and-extend sequence-comparison algorithm GECKO. Our proposal includes optimized kernels based on collective operations, capable of producing arbitrarily long alignments while dealing with heterogeneous and unpredictable load. Contrary to other state-of-the-art methods, GPUGECKO employs a batching mechanism that prevents memory exhaustion by not requiring all alignments to fit into device memory at once, thereby enabling massive, exhaustive comparisons with improved sensitivity while also providing up to 6x average speedup w.r.t. the CUDA acceleration of BLASTN. Funding for open access publishing: Universidad Málaga/CBUA. This work has been partially supported by the European project ELIXIR-EXCELERATE (grant no. 676559), the Spanish national project Plataforma de Recursos Biomoleculares y Bioinformáticos (ISCIII-PT13.0001.0012 and ISCIII-PT17.0009.0022), the Fondo Europeo de Desarrollo Regional (UMA18-FEDERJA-156, UMA20-FEDERJA-059), the Junta de Andalucía (P18-FR-3130), the Instituto de Investigación Biomédica de Málaga (IBIMA) and the University of Málaga.
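The seed-and-extend strategy that GECKO and GPUGECKO follow can be illustrated with a minimal serial sketch. The function names, the fixed k-mer length, and the ungapped greedy extension below are our own simplifications for illustration, not GECKO's actual algorithmic details:

```python
K = 4  # seed (k-mer) length, chosen arbitrarily for this toy example

def find_seeds(query, target, k=K):
    """Exact k-mer matches between query and target (the 'seeds')."""
    index = {}
    for i in range(len(target) - k + 1):
        index.setdefault(target[i:i + k], []).append(i)
    seeds = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], []):
            seeds.append((j, i))
    return seeds

def extend(query, target, j, i, k=K):
    """Greedily extend an ungapped alignment left and right of a seed."""
    left = 0
    while j - left - 1 >= 0 and i - left - 1 >= 0 and \
          query[j - left - 1] == target[i - left - 1]:
        left += 1
    right = 0
    while j + k + right < len(query) and i + k + right < len(target) and \
          query[j + k + right] == target[i + k + right]:
        right += 1
    return j - left, i - left, left + k + right  # start_q, start_t, length

q = "ACGTACGGTTA"
t = "TTACGTACGGAA"
hits = [extend(q, t, j, i) for j, i in find_seeds(q, t)]
best = max(hits, key=lambda h: h[2])
print(best)  # longest extended alignment: (0, 2, 8)
```

The GPU challenge the abstract describes comes from the fact that each seed may extend to a very different length, producing the heterogeneous, unpredictable load that GPUGECKO's collective kernels and batching address.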
Visual Analysis of Large Particle Data
Particle simulations are a proven and widely used numerical method in research and engineering. For example, particle simulations are employed to study fuel atomization in aircraft turbines, and the formation of the universe is investigated by simulating dark matter particles. The amounts of data produced are immense: current simulations contain trillions of particles that move and interact with each other over time. Visualization offers great potential for the exploration, validation and analysis of scientific datasets and their underlying models. However, the focus is usually on structured data with a regular topology. In contrast, particles move freely through space and time; this point of view is known in physics as the Lagrangian frame of reference. Although particles can be converted from the Lagrangian frame into a regular Eulerian frame of reference, such as a uniform grid, doing so involves considerable effort for large numbers of particles. Moreover, this conversion usually leads to a loss of precision along with increased memory consumption. In this dissertation, I will investigate new visualization techniques based specifically on the Lagrangian view, enabling efficient and effective visual analysis of large particle data.
PSA 2020
These preprints were automatically compiled into a PDF from the collection of papers deposited in PhilSci-Archive in conjunction with the PSA 2020
The Making of the Humanities, Volume III. The Modern Humanities
This comprehensive history of the humanities focuses on the modern period (1850-2000). The contributors, including Floris Cohen, Lorraine Daston and Ingrid Rowland, survey the rise of the humanities in interaction with the natural and social sciences, offering new perspectives on the interaction between disciplines in Europe and Asia and new insights generated by digital humanities
Working With Incremental Spatial Data During Parallel (GPU) Computation
Central to many complex systems, spatial actors require an awareness of their local environment to enable behaviours such as communication and navigation. Complex system simulations represent this behaviour with Fixed Radius Near Neighbours (FRNN) search. This algorithm allows actors to store data at spatial locations and then query the data structure to find all data stored within a fixed radius of the search origin.
The work within this thesis answers the question: What techniques can be used for improving the performance of FRNN searches during complex system simulations on Graphics Processing Units (GPUs)?
It is generally agreed that Uniform Spatial Partitioning (USP) is the most suitable data structure for providing FRNN search on GPUs. However, due to the architectural complexities of GPUs, performance is constrained such that FRNN search remains one of the most expensive stages common across complex system models.
Existing innovations to USP highlight a need to take advantage of recent GPU advances, identifying reduced divergence and fewer redundant memory accesses as viable routes to improving the performance of FRNN search. This thesis addresses these with three separate optimisations that can be used simultaneously.
Experiments have assessed the impact of these optimisations on the general case of FRNN search found within complex system simulations, and demonstrated their impact in practice when applied to full complex system models. Results show that the performance of the construction and query stages of FRNN search can be improved by over 2x and 1.3x respectively. These improvements allow complex system simulations to execute faster, enabling increases in scale and model complexity.
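The USP-based FRNN search described above can be sketched serially. This toy Python model (names and parameters are our own, not the thesis implementation) shows why, when the cell width equals the search radius, only the 3x3 block of cells around the query origin needs scanning:

```python
from collections import defaultdict
from math import floor, sqrt

RADIUS = 1.0  # search radius; also used as the grid cell width

def build_grid(points, radius=RADIUS):
    """Bin each 2-D point into a uniform grid with cell width = radius."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(floor(x / radius), floor(y / radius))].append(idx)
    return grid

def frnn_query(points, grid, origin, radius=RADIUS):
    """Indices of all points within `radius` of `origin`.
    Any in-range point must lie in one of the 9 neighbouring cells."""
    cx, cy = floor(origin[0] / radius), floor(origin[1] / radius)
    result = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for idx in grid.get((cx + dx, cy + dy), []):
                px, py = points[idx]
                if sqrt((px - origin[0]) ** 2 + (py - origin[1]) ** 2) <= radius:
                    result.append(idx)
    return result

pts = [(0.2, 0.3), (0.9, 0.8), (2.5, 2.5), (1.4, 0.9)]
g = build_grid(pts)
print(sorted(frnn_query(pts, g, (1.0, 1.0))))  # -> [1, 3]
```

On a GPU, the per-cell candidate lists become contiguous arrays, and the divergence and redundant memory accesses in the triple loop above are exactly the costs the thesis's optimisations target.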
Computational and Theoretical Issues of Multiparameter Persistent Homology for Data Analysis
The basic goal of topological data analysis is to apply topology-based descriptors to understand and describe the shape of data. In this context, homology is one of the most relevant topological descriptors, well-appreciated for its discrete nature, computability and dimension independence. A further development is provided by persistent homology, which allows tracking homological features along a one-parameter increasing sequence of spaces. Multiparameter persistent homology, also called multipersistent homology, is an extension of the theory of persistent homology motivated by the need to analyze data naturally described by several parameters, such as vector-valued functions. Multipersistent homology presents several issues in terms of the feasibility of computations over real-sized data, as well as theoretical challenges in the evaluation of possible descriptors.

The focus of this thesis is on the interplay between persistent homology theory and discrete Morse theory. Discrete Morse theory provides methods for reducing the computational cost of homology and persistent homology by considering the discrete Morse complex generated by the discrete Morse gradient in place of the original complex. The work of this thesis addresses the problem of computing multipersistent homology, to make such a tool usable in real application domains. This requires both computational optimizations towards applications to real-world data, and theoretical insights for finding and interpreting suitable descriptors. Our computational contribution consists in proposing a new Morse-inspired and fully discrete preprocessing algorithm. We show the feasibility of our preprocessing over real datasets, and evaluate the impact of the proposed algorithm as a preprocessing step for computing multipersistent homology. A theoretical contribution of this thesis is a new notion of optimality for such a preprocessing in the multiparameter context. We show that the proposed notion generalizes an already known optimality notion from the one-parameter case. Under this definition, we show that the algorithm we propose as a preprocessing is optimal in low-dimensional domains. In the last part of the thesis, we consider preliminary applications of the proposed algorithm in the context of topology-based multivariate visualization, by tracking critical features generated by a discrete gradient field compatible with the multiple scalar fields under study. We discuss (dis)similarities of such critical features with state-of-the-art techniques in topology-based multivariate data visualization.
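As background to the multiparameter setting, the one-parameter theory the thesis generalizes can be illustrated in its simplest instance: 0-dimensional persistent homology of a graph filtration, computed with union-find and the elder rule. This sketch is our own illustration of the baseline theory, not the preprocessing algorithm proposed in the thesis:

```python
def zero_dim_persistence(vertex_values, edges):
    """0-dimensional persistence of a graph filtration.
    vertex_values: birth (filtration) value per vertex.
    edges: (value, u, v) triples, processed in increasing order.
    Returns (birth, death) pairs; the oldest component never dies."""
    parent = list(range(len(vertex_values)))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = []
    for value, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # edge closes a cycle: no 0-dimensional event
        # Elder rule: the component with the younger (larger) birth dies.
        if vertex_values[ru] > vertex_values[rv]:
            ru, rv = rv, ru
        pairs.append((vertex_values[rv], value))
        parent[rv] = ru
    return pairs

# Sublevel-set filtration of a scalar function on the path graph 0-1-2-3:
values = [0.0, 2.0, 1.0, 3.0]
edges = [(2.0, 0, 1), (2.0, 1, 2), (3.0, 2, 3)]
print(zero_dim_persistence(values, edges))
```

In the multiparameter case each vertex carries a vector of values instead of a single number, the filtration is no longer totally ordered, and this simple sweep no longer applies; that is precisely the source of the computational and theoretical issues the thesis addresses.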
Treatment plan robustness in pancreatic patients treated with scanned ion-beam therapy: Inter- and intra-fractional aspects
Pancreatic cancer remains an unsolved oncological challenge; however, radiotherapy with charged particles is considered a promising approach to improve patients' overall survival. These patients might benefit from dose escalation, although uncertainties during beam delivery (intra-fractional) or along the treatment course (inter-fractional) can compromise the accuracy of the treatment.
In this thesis, inter- and intra-fractional anatomy changes are explored in order to identify the potential sources of uncertainty, quantify their effects, and define strategies for their reduction.
Anatomical changes along the course of the treatment were shown to lead to target under-dosages of up to 20% and to an increase in the dose to normal tissues. However, this can be mitigated through the selection of beam arrangements from the patient's posterior side and beam-specific margins. From the results of this work, it was concluded that combining an Internal Target Volume (ITV), obtained by a geometric expansion of 3 mm from the Clinical Target Volume (CTV), with two oblique posterior beams can reduce the mean V95CTV variations to less than 1%. For other beam directions, the calculation of ITVs including the water-equivalent path length (WEPL) suggested the need for an asymmetric CTV expansion in depth and a minimal expansion in the lateral beam direction.
Additionally, weekly monitoring of the patient anatomy using computed tomography (CT) could easily be included in the clinical workflow and would assist in the decision on treatment re-planning when substantial anatomical changes occur. The suggested prediction model was based on the variations of the accumulated WEPL (∆accWEPL) relative to the planning CT, and showed a strong correlation between the ∆accWEPL and the gamma index of the dose distributions. The gamma criterion was selected as the dose distribution quality metric, since it captures dosimetric changes in both the target and normal tissues.
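The gamma index used here combines a dose-difference and a distance-to-agreement criterion into a single pass/fail score per point. A 1-D sketch (the criteria and profiles below are illustrative, not patient data or the thesis's exact implementation) is:

```python
from math import sqrt

def gamma_index(eval_dose, ref_dose, positions, dose_tol, dist_tol):
    """For each evaluation point, the minimum combined dose/distance
    discrepancy against all reference points; gamma <= 1 is a pass."""
    gammas = []
    for x, d in zip(positions, eval_dose):
        best = min(
            sqrt(((d - dr) / dose_tol) ** 2 + ((x - xr) / dist_tol) ** 2)
            for xr, dr in zip(positions, ref_dose)
        )
        gammas.append(best)
    return gammas

# Two dose profiles sampled every 1 mm; a 3%/3 mm criterion.
ref = [0.0, 50.0, 100.0, 100.0, 50.0]
ev  = [0.0, 52.0, 100.0, 98.0, 50.0]
pos = [0.0, 1.0, 2.0, 3.0, 4.0]
g = gamma_index(ev, ref, pos, dose_tol=3.0, dist_tol=3.0)
pass_rate = sum(x <= 1.0 for x in g) / len(g)
print(pass_rate)  # -> 1.0 (all points within 3%/3 mm)
```

Because gamma blends dose error with spatial tolerance, small shifts of steep dose gradients, such as those caused by anatomical changes, do not automatically fail, which is why it suits the re-planning trigger described above.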
Regarding intra-fractional variations, breathing-induced motion, together with a dynamic beam delivery, affects the dose distribution in terms of homogeneity and target coverage. This effect is stronger (∆V95CTV > 10%) for patients with a tumor motion amplitude greater than 5 mm and a highly modulated dose distribution within and between fields. The concept of a modulation index was employed; it showed that different optimisers produce plans with contrasting distributions of the number of particles, resulting in differing robustness against range and positioning uncertainties. It was concluded that, under internal motion, the use of homogeneous plans, multiple beams, and geometric ITVs produced dose distributions exhibiting only a slight mean decrease in dose homogeneity (H_CTV) and V95CTV of 4% and 1%, respectively.
Finally, a first approach to the use of 4D Magnetic Resonance Imaging (MRI) for motion detection was explored. The results revealed cases of non-linear correlation between the breathing signal (diaphragm position) and the pancreas motion, as well as variability of the motion amplitude along the acquisition time and between sessions. This reinforces the need for an alternative to external surrogates when simulating a 4D dose distribution. MRI will make it possible to include baseline drifts, amplitude variations and anatomical alterations in the 4D dose distribution assessment.
In summary, the key to a precise delivery of the treatment is the monitoring of anatomical changes and a prompt reaction to minimise or eliminate potential uncertainties. In the future, it is expected that the methods suggested in this thesis, the experience gained at HIT in treating moving organs, and the developments in treatment planning and delivery will allow us to move towards robust plan optimisation and prediction of changes in the dose distribution, and to enable treatment without constant and complex monitoring of the patient's movement.
Explanatory visualization of multidimensional projections
Insight into large data collections (nowadays known as 'big data') can be gained by depicting them visually and then interactively exploring these visualizations. However, both the number of data points or measurements and the number of dimensions describing each measurement can be very large, like a table with many rows and columns. Visualizing such so-called high-dimensional datasets is very challenging. One way to do this is to create a low-dimensional (two- or three-dimensional) depiction, in which one then searches for interesting data patterns instead of searching for them in the original high-dimensional data. Techniques supporting this scenario, the so-called projections, have several advantages: they are visually scalable, they work robustly with noisy data, and they are fast. Yet the use of projections is severely limited by the fact that they are hard to interpret. We approach this problem by developing several techniques that ease interpretation, such as depicting projection errors and explaining projections in terms of the original high dimensions. Our techniques are easy to learn, fast to compute, and easy to add to any data exploration scenario that uses any projection. We demonstrate our solutions with various applications and data from measurements, scientific simulations, software engineering, and networks.
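One of the interpretation aids mentioned, depicting projection errors, can be illustrated with a simple stress-like measure. This sketch is our own simplification for illustration, not the exact error definition used in the thesis:

```python
from math import dist

def projection_errors(high_dim_points, projected_points):
    """Per point, the average absolute distortion of its distances to all
    other points when going from the original space to the projection."""
    n = len(high_dim_points)
    errors = []
    for i in range(n):
        err = 0.0
        for j in range(n):
            if i != j:
                err += abs(dist(high_dim_points[i], high_dim_points[j])
                           - dist(projected_points[i], projected_points[j]))
        errors.append(err / (n - 1))
    return errors

# Toy example: 4 points in 3-D and a projection that simply drops the z axis.
high = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 5)]
proj = [(0, 0), (1, 0), (0, 1), (0, 0)]
errs = projection_errors(high, proj)
# The point whose z coordinate carried most information gets the largest error.
print(max(range(4), key=errs.__getitem__))  # -> 3
```

Colour-coding each projected point by such an error value is one way to show users where a projection can and cannot be trusted.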