
    Evaluating the Potential of Disaggregated Memory Systems for HPC applications

    Disaggregated memory is a promising approach that addresses the limitations of traditional memory architectures by enabling memory to be decoupled from compute nodes and shared across a data center. Cloud platforms have deployed such systems to improve overall system memory utilization, but performance can vary across workloads. High-performance computing (HPC) is crucial in scientific and engineering applications, where HPC machines also face the issue of underutilized memory. As a result, improving system memory utilization while understanding workload performance is essential for HPC operators. Therefore, learning the potential of a disaggregated memory system before deployment is a critical step. This paper proposes a methodology for exploring the design space of a disaggregated memory system. It incorporates key metrics that affect performance on disaggregated memory systems: memory capacity, local and remote memory access ratio, injection bandwidth, and bisection bandwidth, providing an intuitive approach to guide machine configurations based on technology trends and workload characteristics. We apply our methodology to analyze thirteen diverse workloads, including AI training, data analysis, genomics, protein, fusion, atomic nuclei, and traditional HPC bookends. Our methodology demonstrates the ability to comprehend the potential and pitfalls of a disaggregated memory system and provides motivation for machine configurations. Our results show that eleven of our thirteen applications can leverage injection-bandwidth disaggregated memory without affecting performance, while one pays a rack-level bisection bandwidth penalty and two pay the system-wide bisection bandwidth penalty. In addition, we show that intra-rack memory disaggregation would meet the applications' memory requirements and provide enough remote memory bandwidth.
    Comment: This submission builds on the following conference paper: N. Ding, S. Williams, H.A. Nam, et al., "Methodology for Evaluating the Potential of Disaggregated Memory Systems," 2nd International Workshop on RESource DISaggregation in High-Performance Computing (RESDIS), November 18, 2022. It is now submitted to the CCPE journal for review.
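    A back-of-the-envelope sketch (not from the paper) of the kind of design-space check the methodology describes, combining the four named metrics. All function and parameter names, and the numbers in the example call, are illustrative assumptions:

```python
# Given a workload's memory footprint and its local/remote access split, test
# whether a candidate machine's injection and bisection bandwidths can absorb
# the remote-memory traffic. Units: GiB for capacities, GB/s for bandwidths.
def remote_traffic_fits(footprint_gib, local_gib, remote_fraction,
                        mem_bw_gbs, injection_bw_gbs,
                        rack_bisection_bw_gbs, nodes_per_rack):
    """Return which bandwidth tier (if any) bounds the workload."""
    if footprint_gib <= local_gib:
        return "fits in local memory"
    # Remote demand: share of memory traffic served by disaggregated memory.
    remote_bw_demand = remote_fraction * mem_bw_gbs
    if remote_bw_demand > injection_bw_gbs:
        return "injection-bandwidth limited"
    if remote_bw_demand * nodes_per_rack > rack_bisection_bw_gbs:
        return "rack-bisection limited"
    return "disaggregation is performance-neutral"

print(remote_traffic_fits(512, 256, 0.3, 200, 50, 4096, 64))
# -> "injection-bandwidth limited" for these illustrative numbers
```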

    ACiS: smart switches with application-level acceleration

    Network performance has contributed fundamentally to the growth of supercomputing over the past decades. In parallel, High Performance Computing (HPC) peak performance has depended, first, on ever faster/denser CPUs, and then on increasing density alone. As operating frequency, and now feature size, have levelled off, two new approaches are becoming central to achieving higher net performance: configurability and integration. Configurability enables hardware to map to the application, as well as vice versa. Integration enables system components that have generally been single-function (e.g., a network that transports data) to take on additional functionality (e.g., also operating on that data). More generally, integration enables compute-everywhere: not just in the CPU and accelerator, but also in the network and, more specifically, in the communication switches. In this thesis, we propose four novel methods of enhancing HPC performance through Advanced Computing in the Switch (ACiS). More specifically, we propose various flexible and application-aware accelerators that can be embedded into or attached to existing communication switches to improve the performance and scalability of HPC and Machine Learning (ML) applications. We follow a modular design discipline by introducing composable plugins that successively add ACiS capabilities. In the first work, we propose an inline accelerator within communication switches for user-definable collective operations. MPI collective operations can often be performance killers in HPC applications; we seek to remove this bottleneck by offloading them to reconfigurable hardware within the switch itself. We also introduce a novel mechanism that enables the hardware to support MPI communicators of arbitrary shape and that is scalable to very large systems. In the second work, we propose a look-aside accelerator for communication switches that is capable of processing packets at line rate; functions requiring loops and state are addressed by this method. The proposed in-switch accelerator is based on a RISC-V-compatible Coarse-Grained Reconfigurable Array (CGRA). To facilitate usability, we have developed a framework to compile user-provided C/C++ code into the back-end instructions that configure the accelerator. In the third work, we extend ACiS to support fused collectives and the combining of collectives with map operations. We observe that there is an opportunity to fuse communication (collectives) with computation; since the computation can vary across applications, ACiS support is programmable in this method. In the fourth work, we propose that switches with ACiS support can control and manage the execution of applications, i.e., that the switch be an active device with decision-making capabilities. Switches have a central view of the network; they can collect telemetry information, monitor application behavior, and then use this information for control, decision-making, and coordination of nodes. We evaluate the feasibility of ACiS through extensive RTL-based simulation as well as deployment on an open-access cloud infrastructure. Using this simulation framework, with a Graph Convolutional Network (GCN) application as a case study, an average speedup of 3.4x across five real-world datasets is achieved on 24 nodes compared to a CPU cluster without ACiS capabilities.
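    As a rough illustration of the programming-model shape of a user-definable collective, a hedged sketch using standard MPI (via mpi4py) is shown below. Under ACiS the reduction would be offloaded into the switch fabric rather than executed on the hosts; the custom clamped-sum operator here is purely illustrative:

```python
# Run with e.g.: mpirun -n 4 python this_script.py
from mpi4py import MPI
import numpy as np

def clamped_sum(inbuf, outbuf, datatype):
    # User-defined element-wise reduction: sum, clamped at 1.0.
    a = np.frombuffer(inbuf, dtype=np.float64)
    b = np.frombuffer(outbuf, dtype=np.float64)
    b[:] = np.minimum(a + b, 1.0)

op = MPI.Op.Create(clamped_sum, commute=True)

comm = MPI.COMM_WORLD
x = np.full(4, 0.3)
y = np.empty(4)
comm.Allreduce(x, y, op=op)   # with ACiS, this reduction would run in-switch
if comm.rank == 0:
    print(y)                  # min(0.3 * nprocs, 1.0) per element
op.Free()
```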

    LIPIcs, Volume 261, ICALP 2023, Complete Volume

    LIPIcs, Volume 261, ICALP 2023, Complete Volume

    Modern data analytics in the cloud era

    Cloud computing has been the groundbreaking technology of the last decade.
The ease-of-use of the managed environment, in combination with a nearly infinite amount of resources and a pay-per-use price model, enables fast and cost-efficient project realization for a broad range of users. Cloud computing also changes the way software is designed, deployed, and used. This thesis focuses on database systems deployed in the cloud environment. We identify three major interaction points of the database engine with the environment that show changed requirements compared to traditional on-premise data warehouse solutions. First, software is deployed on elastic resources. Consequently, systems should support elasticity in order to match workload requirements and be cost-effective. We present an elastic scaling mechanism for distributed database engines, combined with a partition manager that provides load balancing while minimizing partition reassignments in the case of elastic scaling. Furthermore, we introduce a buffer pre-heating strategy that mitigates the cold start after scaling and yields an immediate performance benefit from the scaled resources. Second, cloud-based systems are accessible and available from nearly everywhere. Consequently, data is frequently ingested from numerous endpoints, which differs from the bulk loads or ETL pipelines of a traditional data warehouse solution. Many users do not define database constraints in order to avoid transaction aborts due to conflicts or to speed up data ingestion. To mitigate this issue, we introduce the concept of PatchIndexes, which allow the definition of approximate constraints. PatchIndexes maintain exceptions to constraints, make them usable in query optimization and execution, and offer efficient update support. The concept can be applied to arbitrary constraints, and we provide examples of approximate uniqueness and approximate sorting constraints. Moreover, we show how PatchIndexes can be exploited to define advanced constraints like an approximate multi-key partitioning, which offers robust query performance over workloads with different partition key requirements. Third, data-centric workloads have changed over the last decade. Besides traditional SQL workloads for business intelligence, data science workloads are of significant importance nowadays. In these cases the database system often acts only as a data provider, while the computational effort takes place in dedicated data science or machine learning (ML) environments. As this workflow has several drawbacks, we pursue the goal of pushing advanced analytics towards the database engine and introduce the Grizzly framework as a DataFrame-to-SQL transpiler. Based on this, we identify user-defined functions (UDFs) and ML inference as important tasks that would benefit from deeper engine integration, and we investigate and evaluate approaches to push these operations towards the database engine.
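    A minimal sketch of the PatchIndex idea for an approximate uniqueness constraint, assuming a toy in-memory structure (names and structure are illustrative, not the thesis's implementation): inserts that violate the constraint are recorded as exceptions rather than aborted, so queries can treat the column as unique everywhere except the patched rows:

```python
# Toy PatchIndex for an approximate uniqueness constraint.
class PatchIndex:
    def __init__(self):
        self.seen = {}           # value -> row id of first occurrence
        self.exceptions = set()  # row ids violating uniqueness ("patches")

    def insert(self, row_id, value):
        # Record the violation instead of aborting the transaction.
        if value in self.seen:
            self.exceptions.add(row_id)
        else:
            self.seen[value] = row_id

    def clean_and_patch(self, rows):
        """Split a scan: rows where uniqueness holds vs. exception rows."""
        clean = [r for r in rows if r[0] not in self.exceptions]
        patch = [r for r in rows if r[0] in self.exceptions]
        return clean, patch

idx = PatchIndex()
for row_id, value in enumerate(["a", "b", "a", "c"]):
    idx.insert(row_id, value)
print(idx.exceptions)  # {2}: only this row needs special handling in queries
```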

    Measurement of Triple-Differential Z+Jet Cross Sections with the CMS Detector at 13 TeV and Modelling of Large-Scale Distributed Computing Systems

    The achievable precision in calculations of predictions for observables measured at the LHC experiments depends on the amount of invested computing power and on the precision of the input parameters that go into the calculation. Currently, no theory exists that can derive the input parameter values for perturbative calculations from first principles. Instead, they have to be derived from measurements in dedicated analyses that measure observables sensitive to the input parameters with high precision. Such an analysis is presented: it measures the production cross section of oppositely charged muon pairs with an invariant mass close to the mass of the Z boson, produced in association with jets, in a phase space divided into bins of the transverse momentum of the dimuon system $p_T^{\mathrm{Z}}$ and two observables $y^*$ and $y_b$ constructed from the rapidities of the dimuon system and of the jet with the highest momentum. To achieve the highest statistical precision in this triple-differential measurement, the full dataset recorded by the CMS experiment at a center-of-mass energy of $\sqrt{s} = 13\,\mathrm{TeV}$ in the years 2016 to 2018 is combined. The measured cross sections are compared to theoretical predictions approximating full NNLO accuracy in perturbative QCD. Deviations from these predictions are observed, rendering further studies at full NNLO accuracy necessary. To obtain the measured results, large amounts of data are processed and analysed on distributed computing infrastructures. Theoretical calculations pose similar computing demands. Consequently, substantial amounts of storage and processing resources are required by the LHC collaborations. These requirements are met in large part by the resources of the WLCG, a complex federation of globally distributed computing centres. With the upgrade of the LHC and the experiments in the HL-LHC era, the computing demands are expected to increase substantially, and the prevailing computing models need to be updated to cope with the unprecedented demands. For the design of future adaptations of HEP workflow executions on infrastructures, a simulation model is developed and an implementation tested on infrastructure design candidates inspired by a proposal of the German HEP computing community. The presented study of these infrastructure candidates showcases the applicability of the simulation tool in the strategic development of a future computing infrastructure for HEP in the HL-LHC context.
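    For reference, a hedged sketch of how the two angular observables are commonly constructed in triple-differential Z+jet measurements, from the dimuon rapidity $y_{\mathrm{Z}}$ and the leading-jet rapidity $y_{\mathrm{jet}}$ (the thesis may use a different sign or absolute-value convention):

```latex
y_b = \frac{1}{2}\,\bigl| y_{\mathrm{Z}} + y_{\mathrm{jet}} \bigr|,
\qquad
y^{*} = \frac{1}{2}\,\bigl| y_{\mathrm{Z}} - y_{\mathrm{jet}} \bigr|
```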

    Efficient Path Enumeration and Structural Clustering on Massive Graphs

    Graph analysis plays a crucial role in understanding the relationships and structures within complex systems. This thesis focuses on addressing fundamental problems in graph analysis, including hop-constrained s-t simple path (HC-s-t path) enumeration, batch HC-s-t path query processing, and graph structural clustering (SCAN). The objective is to develop efficient and scalable distributed algorithms to tackle these challenges, particularly in the context of billion-scale graphs. We first explore the problem of HC-s-t path enumeration. Existing solutions for this problem often suffer from inefficiency and scalability limitations, especially when dealing with billion-scale graphs. To overcome these drawbacks, we propose a novel hybrid search paradigm specifically tailored for HC-s-t path enumeration. This paradigm combines different search strategies to effectively explore the solution space. Building upon this paradigm, we devise a distributed enumeration algorithm that follows a divide-and-conquer strategy, incorporates fruitless exploration pruning, and optimizes memory consumption. Experimental evaluations on various datasets demonstrate that our algorithm achieves a significant speedup compared to existing solutions, even on datasets where they encounter out-of-memory issues. Secondly, we address the problem of batch HC-s-t path query processing. In real-world scenarios, it is common to issue multiple HC-s-t path queries simultaneously and process them as a batch. However, existing solutions often focus on optimizing the processing performance of individual queries, disregarding the benefits of processing queries concurrently. To bridge this gap, we propose the concept of HC-s path queries, which captures the common computation among different queries. We design a two-phase HC-s path query detection algorithm to identify the shared computation for a given set of HC-s-t path queries. Based on the detected HC-s path queries, we develop an efficient HC-s-t path enumeration algorithm that effectively shares the common computation. Extensive experiments on diverse datasets validate the efficiency and scalability of our algorithm for processing multiple HC-s-t path queries concurrently. Thirdly, we investigate the problem of graph structural clustering (SCAN) in billion-scale graphs. Existing distributed solutions for SCAN often lack efficiency or suffer from high memory consumption, making them impractical for large-scale graphs. To overcome these challenges, we propose a fine-grained clustering framework specifically tailored for SCAN. This framework enables effective identification of cohesive subgroups within a graph. Building upon this framework, we devise a distributed SCAN algorithm that minimizes communication overhead and reduces memory consumption throughout the execution. We also incorporate an effective workload balance mechanism that dynamically adjusts to handle skewed workloads. Experimental evaluations on real-world graphs demonstrate the efficiency and scalability of our proposed algorithm. Overall, this thesis contributes novel distributed algorithms for HC-s-t path enumeration, batch HC-s-t path query processing, and graph structural clustering. The proposed algorithms address the efficiency and scalability challenges in graph analysis, particularly on billion-scale graphs. Extensive experimental evaluations validate the superiority of our algorithms compared to existing solutions, enabling efficient and scalable graph analysis in complex systems.
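    A compact single-machine sketch of HC-s-t path enumeration with the kind of fruitless-exploration pruning the thesis builds on: a vertex is expanded only if its distance to t still fits in the remaining hop budget. This is a simplification for illustration, not the distributed algorithm itself:

```python
from collections import deque

def bfs_dist_to(graph, t):
    # Distances to t along reversed edges, so dist[v] = shortest hops v -> t.
    rev = {}
    for v, nbrs in graph.items():
        for u in nbrs:
            rev.setdefault(u, []).append(v)
    dist, q = {t: 0}, deque([t])
    while q:
        v = q.popleft()
        for u in rev.get(v, []):
            if u not in dist:
                dist[u] = dist[v] + 1
                q.append(u)
    return dist

def hc_st_paths(graph, s, t, k):
    """Enumerate all simple s->t paths with at most k hops."""
    dist = bfs_dist_to(graph, t)
    path, out = [s], []

    def dfs(v, budget):
        if v == t:
            out.append(list(path))
            return
        for u in graph.get(v, []):
            # Prune revisits and branches that cannot reach t within budget.
            if u in path or dist.get(u, k + 1) > budget - 1:
                continue
            path.append(u)
            dfs(u, budget - 1)
            path.pop()

    dfs(s, k)
    return out

g = {0: [1, 2], 1: [2, 3], 2: [3], 3: []}
print(hc_st_paths(g, 0, 3, 3))  # [[0, 1, 2, 3], [0, 1, 3], [0, 2, 3]]
```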

    General Purpose Flow Visualization at the Exascale

    Exascale computing, i.e., supercomputers that can perform 10^18 math operations per second, provides a significant opportunity for improving the computational sciences. That said, these machines can be difficult to use efficiently, due to their massive parallelism, due to the use of accelerators, and due to the diversity of accelerators used. All areas of the computational science stack need to be reconsidered to address these problems. With this dissertation, we consider flow visualization, which is critical for analyzing vector field data from simulations. We specifically consider flow visualization techniques that use particle advection, i.e., tracing particle trajectories, which presents performance and implementation challenges. The dissertation makes four primary contributions. First, it synthesizes previous work on particle advection performance and introduces a high-level analytical cost model. Second, it proposes an approach for performance portability across accelerators. Third, it studies expected speedups based on using accelerators, including the importance of factors such as duration, particle count, data set, and others. Finally, it proposes an exascale-capable particle advection system that addresses diversity in many dimensions, including accelerator type, parallelism approach, analysis use case, underlying vector field, and more.
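    A minimal sketch of the particle advection kernel the dissertation studies: tracing a particle through a steady vector field with fourth-order Runge-Kutta steps. The analytic circular field here is an illustrative stand-in for simulation data that would normally be interpolated from a mesh:

```python
import numpy as np

def velocity(p):
    x, y = p
    return np.array([-y, x])        # circular flow around the origin

def rk4_step(p, h):
    # One fourth-order Runge-Kutta integration step of size h.
    k1 = velocity(p)
    k2 = velocity(p + 0.5 * h * k1)
    k3 = velocity(p + 0.5 * h * k2)
    k4 = velocity(p + h * k3)
    return p + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def advect(seed, h=0.01, steps=1000):
    # Trace one particle trajectory from the seed point.
    traj = [np.asarray(seed, dtype=float)]
    for _ in range(steps):
        traj.append(rk4_step(traj[-1], h))
    return np.stack(traj)

traj = advect([1.0, 0.0])
print(traj[-1])  # the particle stays on the unit circle of this flow
```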

    Multiscale visualization approaches for Volunteered Geographic Information and Location-based Social Media

    Today, “zoomable” maps are a state-of-the-art way to explore the world, available to anyone with Internet access. However, the process of creating such visualizations has been only loosely investigated and documented. Nevertheless, with an increasing amount of available data, interactive maps have become a more integral approach to visualizing and exploring big datasets and user-generated data. OpenStreetMap and online platforms such as Twitter and Flickr offer application programming interfaces (APIs) with geographic information; they are well-known examples of this visualization challenge and frequently serve as data sources. In addition, a growing number of public administrations collect open data and publish their datasets, which makes the task of visualization even more relevant. This dissertation deals with the visualization of user-generated geodata as multiscale maps. The basics of today’s multiscale maps, including their history, technologies, and possibilities, are explored and abstracted. This work introduces two new multiscale-focused visualization approaches for point data from volunteered geographic information (VGI) and location-based social media (LBSM). One contribution of this effort is a visualization methodology for spatially referenced information in the form of point geometries, using nominally scaled data from social media such as Twitter or Flickr. Typical for this data is a high number of social media posts in different categories, where a post corresponds to a point in a specific category. Due to their sheer quantity and similar characteristics, the posts appear generic rather than unique. This type of dataset can be explored using the new method of micro diagrams, which visualizes the dataset at multiple scales and resolutions. The data is aggregated into small grid cells, and the numerical proportions are shown with small diagrams whose colors depict the categories and which can visually merge into heterogeneous areas. The diagram sizes allow the user to estimate the overall number of aggregated points in a grid cell. A different visualization approach, based on a selection method, is proposed for more unique points, considered points of interest (POIs). The goal is to identify locally relevant points in the dataset, i.e., points that are more important than the other points in their neighborhood with respect to a numerical attribute. The measure, derived from topographic isolation and called discrete isolation, is the distance from one point to the nearest point with a higher attribute value. Using this measure, the most essential points can easily be selected by choosing a minimum distance, producing a homogeneous spatial distribution of the selected points within the chosen dataset. The two newly developed approaches are applied to multiscale mapping by constructing example workflows that produce multiscale maps. The publicly available multiscale mapping workflows OpenMapTiles and OpenStreetMap Carto, which use OpenStreetMap data, are systematically explored and analyzed. The result is a general workflow for multiscale map production and a short overview of the toolchain software. In particular, the generalization approaches in the example projects are discussed and classified into cartographic theory on the basis of the literature. The workflow is demonstrated by building a raster tile service for the micro diagrams and a vector tile service for the discrete isolation, both usable with just a web browser.
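    A hedged sketch of the aggregation step behind the micro diagrams (grid size and sample posts are illustrative assumptions): categorized points are binned into regular grid cells, yielding per-cell category counts that drive each small diagram's size and colors:

```python
from collections import Counter, defaultdict

def aggregate(points, cell_size):
    """points: (x, y, category) triples -> {cell: Counter of categories}."""
    cells = defaultdict(Counter)
    for x, y, cat in points:
        cell = (int(x // cell_size), int(y // cell_size))
        cells[cell][cat] += 1
    return cells

posts = [(13.40, 52.52, "photo"), (13.41, 52.52, "tweet"),
         (13.38, 52.51, "photo"), (2.35, 48.86, "tweet")]
for cell, counts in aggregate(posts, 0.05).items():
    total = sum(counts.values())        # drives the diagram's size
    print(cell, total, dict(counts))    # category shares drive its colors
```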
In conclusion, these new approaches for point data from VGI and LBSM allow a better qualitative visualization of geodata. While analyzing vast global datasets remains challenging, exploring and analyzing hidden data patterns is fruitful. Creating this degree of visualization and producing maps at multiple scales is a complicated task; the workflows and tools provided in this thesis will make map production on a worldwide scale easier.
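    A small sketch of the discrete isolation measure described above, in brute-force form for clarity (the data are illustrative): for each point, the distance to the nearest point with a higher attribute value, infinite for the global maximum; a minimum-distance cutoff then selects locally dominant, spatially spread points:

```python
import math

def discrete_isolation(points):
    """points: list of (x, y, value). Returns one isolation value per point."""
    iso = []
    for x1, y1, v1 in points:
        dists = [math.hypot(x1 - x2, y1 - y2)
                 for x2, y2, v2 in points if v2 > v1]
        iso.append(min(dists) if dists else math.inf)
    return iso

pois = [(0, 0, 10), (1, 0, 7), (5, 5, 9), (6, 5, 2)]
iso = discrete_isolation(pois)
# Select spatially spread, locally dominant points via a minimum distance.
selected = [p for p, d in zip(pois, iso) if d >= 2.0]
print(iso)       # [inf, 1.0, 7.07..., 1.0]
print(selected)  # the two local maxima, far apart from each other
```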

    Harnessing the Power of Distributed Computing: Advancements in Scientific Applications, Homomorphic Encryption, and Federated Learning Security

    Data explosion poses many challenges to state-of-the-art systems, applications, and methodologies. It has been reported that 181 zettabytes of data are expected to be generated in 2025, which is over a 150% increase compared to the data expected to be generated in 2023. However, while system manufacturers are consistently developing devices with larger storage spaces and providing alternative storage capacities in the cloud at affordable rates, another key challenge is how to effectively process fractions of this large-scale stored data in time-critical conventional systems. One transformative paradigm revolutionizing the processing and management of such large data is distributed computing, whose application requires deep understanding. This dissertation focuses on exploring the potential impact of applying efficient distributed computing concepts to long-standing challenges in (i) a widely used data-intensive scientific application, (ii) applying homomorphic encryption (HE) to data-intensive workloads found in outsourced databases, and (iii) the security of tokenized incentive mechanisms for federated learning (FL) systems. The first part of the dissertation tackles the microelectrode array (MEA) parameterization problem from an orthogonal viewpoint enlightened by algebraic topology, which allows us to algebraically parametrize MEAs whose structure and intrinsic parallelism are hard to identify otherwise. We implement a new paradigm, namely Parma, to demonstrate the effectiveness of the proposed approach and report how it outperforms the state of the practice in time, scalability, and memory usage. The second part discusses our work on introducing the concept of parallel caching of secure aggregation to mitigate the performance overhead incurred by the HE module in outsourced databases. The key idea of this optimization approach is caching selected radix ciphertexts in parallel without violating the existing security guarantees of the primitive/base HE scheme. A new radix HE algorithm was designed and applied to both batch and incremental HE schemes, and experiments carried out on six workloads show that the proposed caching boosts state-of-the-art HE schemes by orders of magnitude. In the third part, I discuss our work on leveraging the security benefits of blockchains to enhance and protect the fairness and reliability of tokenized incentive mechanisms for FL systems. We designed a blockchain-based auditing protocol to mitigate Gaussian attacks and carried out experiments with multiple FL aggregation algorithms, popular datasets, and a variety of scales to validate its effectiveness.
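    A toy illustration of the radix-ciphertext caching idea (not the dissertation's implementation), using textbook Paillier with insecure, tiny parameters: ciphertexts of the radix powers 2^i are cached once, and fresh ciphertexts are then assembled by homomorphic addition plus one re-randomizing encryption of zero:

```python
import math, random

p, q = 17, 19                          # toy primes; real schemes use >=2048-bit n
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)           # Carmichael's lambda(n)
mu = pow(lam, -1, n)                   # with g = 1 + n, mu = lam^-1 mod n

def enc(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:         # r must be a unit mod n
        r = random.randrange(1, n)
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return ((pow(c, lam, n2) - 1) // n * mu) % n

def hadd(c1, c2):                      # Enc(a) * Enc(b) mod n^2 = Enc(a + b)
    return (c1 * c2) % n2

cache = [enc(1 << i) for i in range(8)]    # one-time cost: Enc(2^i), i = 0..7

def enc_cached(m):
    # Assemble Enc(m) from cached radix ciphertexts according to m's bits;
    # one Enc(0) re-randomizes, the rest are cheap modular multiplications.
    c = enc(0)
    for i in range(8):
        if (m >> i) & 1:
            c = hadd(c, cache[i])
    return c

print(dec(enc_cached(42)))             # -> 42
```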