
    GeoYCSB: A Benchmark Framework for the Performance and Scalability Evaluation of Geospatial NoSQL Databases

    The proliferation of geospatial applications has tremendously increased the variety, velocity, and volume of spatial data that data stores have to manage. Traditional relational databases reveal limitations in handling such big geospatial data, mainly due to their rigid schema requirements and limited scalability. Numerous NoSQL databases have emerged and actively serve as alternative data stores for big spatial data. This study presents a framework, called GeoYCSB, developed for benchmarking NoSQL databases with geospatial workloads. To develop GeoYCSB, we extend YCSB, a de facto benchmark framework for NoSQL systems, by integrating into its design architecture the new components necessary to support geospatial workloads. GeoYCSB supports both microbenchmarks and macrobenchmarks and facilitates the use of real datasets in both. It is extensible to evaluate any NoSQL database that supports spatial queries, using geospatial workloads performed on datasets of any geometric complexity. We use GeoYCSB to benchmark two leading document stores, MongoDB and Couchbase, and present the experimental results and analysis. Finally, we demonstrate the extensibility of GeoYCSB by including a new dataset consisting of complex geometries and using it to benchmark a system with a wide variety of geospatial queries: Apache Accumulo, a wide-column store, with the GeoMesa framework applied on top.
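    To illustrate the kind of spatial query such a workload exercises against a document store, the sketch below builds a MongoDB `$geoIntersects` filter for a GeoJSON polygon. This is not GeoYCSB code; the field name `geometry` and the sample coordinates are hypothetical, while the `$geoIntersects` operator and GeoJSON structure follow MongoDB's query format.

```python
def geo_intersects_query(polygon_coords):
    """Build a MongoDB filter matching documents whose 'geometry' field
    intersects the given GeoJSON polygon (a list of [lon, lat] rings)."""
    return {
        "geometry": {
            "$geoIntersects": {
                "$geometry": {
                    "type": "Polygon",
                    "coordinates": polygon_coords,
                }
            }
        }
    }

# A rectangular query region; with pymongo this filter would be passed to
# a collection's find(), e.g. db.places.find(geo_intersects_query(ring)).
ring = [[[-97.75, 30.25], [-97.70, 30.25], [-97.70, 30.30],
         [-97.75, 30.30], [-97.75, 30.25]]]
query = geo_intersects_query(ring)
```

    Benchmarks like GeoYCSB vary the size and complexity of such query geometries to stress the database's spatial index.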

    TabR: Unlocking the Power of Retrieval-Augmented Tabular Deep Learning

    Deep learning (DL) models for tabular data problems are receiving increasing attention, while algorithms based on gradient-boosted decision trees (GBDT) remain a strong go-to solution. Following recent trends in other domains, such as natural language processing and computer vision, several retrieval-augmented tabular DL models have recently been proposed. For a given target object, a retrieval-based model retrieves other relevant objects, such as the nearest neighbors, from the available (training) data and uses their features or even labels to make a better prediction. However, we show that the existing retrieval-based tabular DL solutions provide only minor, if any, benefits over properly tuned simple retrieval-free baselines. Thus, it remains unclear whether the retrieval-based approach is a worthy direction for tabular DL. In this work, we give a strong positive answer to this question. We start by incrementally augmenting a simple feed-forward architecture with an attention-like retrieval component similar to those of many (tabular) retrieval-based models. Then, we highlight several details of the attention mechanism that turn out to have a massive impact on performance on tabular data problems, but that were not explored in prior work. As a result, we design TabR -- a simple retrieval-based tabular DL model which, on a set of public benchmarks, demonstrates the best average performance among tabular DL models, becomes the new state-of-the-art on several datasets, and even outperforms GBDT models on the recently proposed ``GBDT-friendly'' benchmark (see the first figure).
    Comment: Code: https://github.com/yandex-research/tabular-dl-tab
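    The attention-like retrieval idea the abstract describes can be sketched in a few lines: score training rows by similarity to the target row, softmax the scores, and blend the neighbors' labels. This is a minimal illustration of the general retrieval-augmented scheme, not TabR's actual module; the distance metric, `k`, and `temperature` are illustrative choices.

```python
import numpy as np

def retrieval_prediction(x, X_train, y_train, k=3, temperature=1.0):
    """Predict for feature vector x by attending over its k nearest
    training rows (negative squared L2 distance as the attention logit)."""
    d2 = ((X_train - x) ** 2).sum(axis=1)   # squared distances to all rows
    idx = np.argsort(d2)[:k]                # indices of k nearest neighbors
    logits = -d2[idx] / temperature
    w = np.exp(logits - logits.max())
    w /= w.sum()                            # softmax attention weights
    return float(w @ y_train[idx])          # weighted blend of neighbor labels

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y = np.array([0.0, 1.0, 1.0, 10.0])
pred = retrieval_prediction(np.array([0.1, 0.1]), X, y, k=3)
```

    In a trained model, the raw features would be replaced by learned embeddings and the blending would feed into further layers; the paper's contribution lies in the details of exactly this attention step.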

    Interactive visualizations of unstructured oceanographic data

    The newly founded company Oceanbox is creating a novel oceanographic forecasting system to provide oceanography as a service. These services use mathematical models that generate large hydrodynamic data sets as unstructured triangular grids with high-resolution model areas. Oceanbox makes the model results accessible in a web application. New visualizations are needed to accommodate land-masking and large data volumes. In this thesis, we propose using a k-d tree to spatially partition unstructured triangular grids to provide the look-up times needed for interactive visualizations. A k-d tree, called FsKDTree, is implemented in F#. This thesis also describes the implementation of dynamic tiling map layers to visualize current barbs, scalar fields, and particle streams. The current barb layer queries data from the data server with the help of the k-d tree and displays it in the browser. Scalar fields and particle streams are implemented using WebGL, which enables the rendering of triangular grids. Stream particle visualization effects are implemented as velocity advection computed on the GPU with textures. The new visualizations are used in Oceanbox's production systems, and spatial indexing has been integrated into Oceanbox's archive retrieval system. FsKDTree improves tree creation times by up to 4x and search times by up to 13x compared to the equivalent .NET C# implementation. Finally, current barbs, scalar fields, and particle stream visualizations run at 60 FPS, even for the largest model areas provided by the service.
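    The spatial-partitioning idea behind FsKDTree can be sketched compactly: alternate splitting axes by depth, and prune subtrees that cannot contain a closer point during search. The sketch below is a generic 2-D k-d tree in Python for illustration only; the thesis's implementation is in F# and handles triangular-grid data, not bare points.

```python
def build(points, depth=0):
    """Recursively split points on alternating x/y axes (median split)."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid + 1:], depth + 1),
    }

def nearest(node, target, depth=0, best=None):
    """Return the tree point closest to target (squared-distance metric)."""
    if node is None:
        return best
    d2 = lambda p: (p[0] - target[0]) ** 2 + (p[1] - target[1]) ** 2
    if best is None or d2(node["point"]) < d2(best):
        best = node["point"]
    axis = depth % 2
    diff = target[axis] - node["point"][axis]
    near, far = ("left", "right") if diff < 0 else ("right", "left")
    best = nearest(node[near], target, depth + 1, best)
    if diff * diff < d2(best):   # the far side may still hold a closer point
        best = nearest(node[far], target, depth + 1, best)
    return best

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2)))   # → (8, 1)
```

    The pruning test (`diff * diff < d2(best)`) is what gives k-d trees their sublinear look-up times on average, which is the property the interactive visualizations rely on.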

    Carbon-Free Power

    There is a new world order in electrical energy production. Solar and wind power are established as the low-cost leaders. However, these energy sources are highly variable and electrical power is needed 24/7. Alternative sources must fill the gaps, but only a few are both economical and carbon-free or -neutral. This book presents one alternative: small modular nuclear reactors (SMRs). The authors describe the technology, including its safety and economic aspects, and assess its fit with other carbon-free energy sources, storage solutions, and industrial opportunities. They also explain the challenges with SMRs, including public acceptance. The purpose of the book is to help readers consider these relatively new reactors as part of an appropriate energy mix for the future and, ultimately, to make their own judgments on the merits of the arguments for SMRs.

    A review of technical factors to consider when designing neural networks for semantic segmentation of Earth Observation imagery

    Semantic segmentation (classification) of Earth Observation imagery is a crucial task in remote sensing. This paper presents a comprehensive review of technical factors to consider when designing neural networks for this purpose. The review focuses on Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Generative Adversarial Networks (GANs), and transformer models, discussing prominent design patterns for these ANN families and their implications for semantic segmentation. Common pre-processing techniques for ensuring optimal data preparation are also covered. These include methods for image normalization and chipping, as well as strategies for addressing data imbalance in training samples, and techniques for overcoming limited data, including augmentation techniques, transfer learning, and domain adaptation. By encompassing both the technical aspects of neural network design and the data-related considerations, this review provides researchers and practitioners with a comprehensive and up-to-date understanding of the factors involved in designing effective neural networks for semantic segmentation of Earth Observation imagery.
    Comment: 145 pages with 32 figures
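    Two of the pre-processing steps the review covers, per-band normalization and chipping a large scene into fixed-size tiles, can be sketched as follows. The function names, the 256-pixel chip size, and the edge-discarding policy are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def normalize(scene):
    """Standardize each band of an (H, W, C) scene to zero mean, unit variance."""
    mean = scene.mean(axis=(0, 1), keepdims=True)
    std = scene.std(axis=(0, 1), keepdims=True)
    return (scene - mean) / (std + 1e-8)

def chip(scene, size=256):
    """Cut an (H, W, C) scene into non-overlapping size-by-size chips,
    discarding partial chips at the right/bottom edges."""
    h, w, _ = scene.shape
    chips = []
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            chips.append(scene[i:i + size, j:j + size])
    return np.stack(chips)

scene = np.random.rand(600, 520, 4).astype(np.float32)  # mock 4-band scene
chips = chip(normalize(scene))
print(chips.shape)   # → (4, 256, 256, 4): a 2x2 grid of chips
```

    In practice, overlapping chips or reflection padding are common alternatives to discarding the edges, trading extra computation for full scene coverage.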

    Marine Data Fusion for Analyzing Spatio-Temporal Ocean Region Connectivity

    This thesis develops methods to automate and objectify the connectivity analysis between ocean regions. Existing methods for connectivity analysis often rely on manual integration of expert knowledge, which renders the processing of large amounts of data tedious. This thesis presents a new framework for Data Fusion that provides several approaches for automation and objectification of the entire analysis process. It identifies different complexities of connectivity analysis and shows how the Data Fusion framework can be applied and adapted to them. The framework is used in this thesis to analyze geo-referenced trajectories of fish larvae in the western Mediterranean Sea, to trace the spreading pathways of newly formed water in the subpolar North Atlantic based on their hydrographic properties, and to gauge their temporal change. These examples introduce a new and highly relevant field of application for the established Data Science methods that were used and innovatively combined in the framework. New directions for further development of these methods are opened up which go beyond optimization of existing methods. Marine Science, more precisely Physical Oceanography, benefits from the new possibility of analyzing large amounts of data quickly and objectively for its specific research questions. This thesis is a foray into the new field of Marine Data Science. It practically and theoretically explores the possibilities of combining Data Science and the Marine Sciences to the advantage of both sides. The example of automating and objectifying connectivity analysis between marine regions in this thesis shows the added value of combining Data Science and Marine Science. This thesis also presents initial insights and ideas on how researchers from both disciplines can position themselves to thrive as Marine Data Scientists and simultaneously advance our understanding of the ocean.

    Many or Few Samples? Comparing Transfer, Contrastive and Meta-Learning in Encrypted Traffic Classification

    The popularity of Deep Learning (DL), coupled with the reduction in network traffic visibility due to the increased adoption of HTTPS, QUIC, and DNSSEC, has re-ignited interest in Traffic Classification (TC). However, to reduce the dependency on large task-specific labeled datasets, we need to find better ways to learn representations that are valid across tasks. In this work we investigate this problem by comparing transfer learning, meta-learning, and contrastive learning against reference Machine Learning (ML) tree-based and monolithic DL models (16 methods in total). Using two publicly available datasets, namely MIRAGE19 (40 classes) and AppClassNet (500 classes), we show that (i) with large datasets we can obtain more general representations, (ii) contrastive learning is the best methodology, (iii) meta-learning is the worst, and (iv) while tree-based ML models cannot handle large tasks but fit small tasks well, DL methods, by reusing learned representations, approach the performance of tree-based models also on small tasks.
    Comment: to appear in Traffic Measurements and Analysis (TMA) 202

    Ditransitives in Germanic languages. Synchronic and diachronic aspects

    This volume brings together twelve empirical studies on ditransitive constructions in Germanic languages and their varieties, past and present. Specifically, the volume includes contributions on a wide variety of Germanic languages, including English, Dutch, and German, but also Danish, Swedish, and Norwegian, as well as lesser-studied ones such as Faroese. While the first part of the volume focuses on diachronic aspects, the second part showcases a variety of synchronic aspects relating to ditransitive patterns. Methodologically, the volume covers both experimental and corpus-based studies. Questions addressed by the papers in the volume include, among others, the cross-linguistic pervasiveness and cognitive reality of factors involved in the choice between different ditransitive constructions, and differences and similarities in the diachronic development of ditransitives. The volume’s broad scope and comparative perspective offer comprehensive insights into well-known phenomena and further our understanding of variation across languages of the same family.

    Current Challenges in the Application of Algorithms in Multi-institutional Clinical Settings

    The coronavirus disease pandemic has highlighted the importance of artificial intelligence in multi-institutional clinical settings. Particularly in situations where the healthcare system is overloaded and large volumes of data are generated, artificial intelligence has great potential to provide automated solutions and to unlock the untapped potential of acquired data. This includes the areas of care, logistics, and diagnosis. For example, automated decision support applications could tremendously help physicians in their daily clinical routine. Especially in radiology and oncology, the exponential growth of imaging data, triggered by a rising number of patients, leads to a permanent overload of the healthcare system, making the use of artificial intelligence inevitable. However, the efficient and advantageous application of artificial intelligence in multi-institutional clinical settings faces several challenges, such as accountability and regulation hurdles, implementation challenges, and fairness considerations. This work focuses on the implementation challenges, which include the following questions: How can well-curated and standardized data be ensured, how do algorithms from other domains perform on multi-institutional medical datasets, and how can more robust and generalizable models be trained? Questions of how to interpret results and whether correlations exist between the performance of the models and the characteristics of the underlying data are also part of the work. Therefore, besides presenting a technical solution for manual data annotation and tagging for medical images, a real-world federated learning implementation for image segmentation is introduced. Experiments on a multi-institutional prostate magnetic resonance imaging dataset showcase that models trained by federated learning can achieve performance similar to training on pooled data. Furthermore, Natural Language Processing algorithms for semantic textual similarity, text classification, and text summarization are applied to multi-institutional, structured and free-text oncology reports. The results show that performance gains are achieved by customizing state-of-the-art algorithms to the peculiarities of the medical datasets, such as the occurrence of medications, numbers, or dates. In addition, performance influences are observed depending on the characteristics of the data, such as lexical complexity. The generated results, human baselines, and retrospective human evaluations demonstrate that artificial intelligence algorithms have great potential for use in clinical settings. However, due to the difficulty of processing domain-specific data, there still exists a performance gap between the algorithms and the medical experts. In the future, it is therefore essential to improve the interoperability and standardization of data, as well as to continue working on algorithms that perform well on medical, possibly domain-shifted, data from multiple clinical centers.
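    The aggregation step at the heart of such a federated setup can be sketched as federated averaging (FedAvg): each institution trains locally and only parameter updates, weighted by local dataset size, are combined. The sketch below uses toy parameter vectors for illustration; a real implementation averages neural-network weights, and the client counts here are hypothetical.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Aggregate client parameter vectors, weighted by local dataset size,
    so no institution ever shares its raw (patient) data."""
    total = sum(client_sizes)
    return sum(n / total * p for p, n in zip(client_params, client_sizes))

# Three hypothetical institutions after one round of local training:
params = [np.array([1.0, 2.0]), np.array([3.0, 0.0]), np.array([2.0, 2.0])]
sizes = [100, 300, 100]
global_params = fedavg(params, sizes)
print(global_params)   # → [2.4 0.8]
```

    The aggregated parameters are then redistributed to the clients for the next local training round; repeating this loop is what lets the federated model approach the performance of training on pooled data.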

    20th SC@RUG 2023 proceedings 2022-2023
