    Parallel Algorithms for Summing Floating-Point Numbers

    The problem of exactly summing n floating-point numbers is a fundamental one, with many applications in large-scale simulations and computational geometry. Unfortunately, the round-off error in standard floating-point operations makes this problem very challenging. Moreover, all existing solutions rely on sequential algorithms, which cannot scale to the huge datasets that need to be processed. In this paper, we provide several efficient parallel algorithms for summing n floating-point numbers so as to produce a faithfully rounded floating-point representation of the sum. We present algorithms in the PRAM, external-memory, and MapReduce models, and we also provide an experimental analysis of our MapReduce algorithms, owing to their simplicity and practical efficiency.
    Comment: Conference version appears in SPAA 201
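    A quick way to see why naive summation fails is the sketch below: plain left-to-right accumulation loses the small addends entirely, while Kahan compensated summation and Python's built-in math.fsum recover the true sum. This is only a sequential illustration of the round-off problem; the paper's parallel algorithms are not reproduced here.

        import math

        def naive_sum(xs):
            """Left-to-right accumulation; each add may round away low-order bits."""
            s = 0.0
            for x in xs:
                s += x
            return s

        def kahan_sum(xs):
            """Compensated summation: c carries the rounding error of each
            add and feeds it back into the next one."""
            s, c = 0.0, 0.0
            for x in xs:
                y = x - c           # re-inject the error lost last step
                t = s + y           # low-order bits of y may be lost here
                c = (t - s) - y     # recover exactly what was lost
                s = t
            return s

        xs = [1e16] + [1.0] * 1000 + [-1e16]   # true sum is 1000.0
        print(naive_sum(xs))    # 0.0 -- every 1.0 is absorbed by 1e16
        print(kahan_sum(xs))    # 1000.0
        print(math.fsum(xs))    # 1000.0 (correctly rounded reference)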

    R*-Grove: Balanced Spatial Partitioning for Large-scale Datasets

    The rapid growth of big spatial data has urged the research community to develop several big spatial data systems. Regardless of their architecture, one of the fundamental requirements of all these systems is to spatially partition the data efficiently across machines. The core challenge of big spatial partitioning is to build partitions of high spatial quality while simultaneously taking advantage of distributed processing models by keeping the partitions load-balanced. Previous work on big spatial partitioning reuses existing index search trees as-is, e.g., the R-tree family, STR, Kd-tree, and Quad-tree, by building a temporary tree for a sample of the input and using its leaf nodes as partition boundaries. However, we show in this paper that none of these techniques addresses both challenges completely. This paper proposes a novel partitioning method, termed R*-Grove, which can partition very large spatial datasets into high-quality partitions with excellent load balance and block utilization. This appealing property allows R*-Grove to outperform existing techniques in spatial query processing. R*-Grove can be easily integrated into any big data platform, such as Apache Spark or Apache Hadoop. Our experiments show that R*-Grove outperforms the existing partitioning techniques for big spatial data systems. With all the proposed work publicly available as open source, we envision that R*-Grove will be adopted by the community to better serve big spatial data research.
    Comment: 29 pages, to be published in Frontiers in Big Dat
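    To make the sample-based baseline concrete, here is a minimal Python sketch of the "index a sample, reuse its leaf boundaries" scheme the abstract describes, using Sort-Tile-Recursive (STR) tiling as the stand-in index. This is the family of techniques the paper critiques; R*-Grove's own R*-tree-inspired grouping is more involved and is not shown.

        import math
        import random

        def str_partitions(sample, num_parts):
            """STR-style tiling over a sample of 2-D points: sort by x into
            vertical slabs of equal count, then sort each slab by y into
            tiles; the tiles' bounding boxes become partition boundaries."""
            s = math.ceil(math.sqrt(num_parts))
            pts = sorted(sample, key=lambda p: p[0])
            slab_size = math.ceil(len(pts) / s)
            boxes = []
            for i in range(0, len(pts), slab_size):
                slab = sorted(pts[i:i + slab_size], key=lambda p: p[1])
                tile_size = math.ceil(len(slab) / s)
                for j in range(0, len(slab), tile_size):
                    xs, ys = zip(*slab[j:j + tile_size])
                    boxes.append((min(xs), min(ys), max(xs), max(ys)))
            return boxes

        # Boundaries computed from a 1% sample of a skewed dataset.
        data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100_000)]
        print(str_partitions(random.sample(data, 1_000), 16))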

    Skewness-Based Partitioning in SpatialHadoop

    In recent years, several extensions of the Hadoop system have been proposed for dealing with spatial data. SpatialHadoop belongs to this group of projects and includes MapReduce implementations of spatial operators such as range queries and spatial join. The MapReduce paradigm is based on the fundamental principle that a task can be parallelized by partitioning data into chunks and performing the same operation on each of them (map phase), eventually combining the partial results at the end (reduce phase). Thus, the partitioning technique applied can tremendously affect the performance of a parallel execution, since it is the key to obtaining balanced map tasks and exploiting parallelism as much as possible. When uniformly distributed datasets are considered, this goal can easily be achieved by partitioning the geometries of the input dataset with a regular grid covering the whole reference space; conversely, with skewed datasets, this might not be the right choice, and other techniques have to be applied. For instance, SpatialHadoop can also produce a global index by means of a Quadtree-based or R-tree-based grid, which are in turn more expensive index structures to build. This paper proposes a technique, based on a box-counting function and a heuristic rooted in theoretical properties and experimental observations, for detecting the degree of skewness of an input spatial dataset and then deciding which partitioning technique to apply in order to improve as much as possible the performance of subsequent operations. Experiments on both synthetic and real datasets confirm the effectiveness of the proposed approach.
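    As a rough illustration of the box-counting idea (a minimal sketch assuming points normalized to the unit square; the paper's exact function and decision heuristic differ), the code below computes the box-counting value BC_2(r) for several cell sizes and estimates the slope of log BC_2(r) versus log r: the slope is close to 2 for uniformly distributed 2-D data and drifts toward 0 as the data become more skewed.

        import math
        import random
        from collections import Counter

        def box_count(points, r, q=2):
            """BC_q(r): cover the unit square with cells of side r and sum
            the q-th power of each non-empty cell's point count."""
            cells = Counter((int(x / r), int(y / r)) for x, y in points)
            return sum(c ** q for c in cells.values())

        def skew_exponent(points, sizes=(1/4, 1/8, 1/16, 1/32, 1/64)):
            """Least-squares slope of log BC_2(r) vs log r."""
            xs = [math.log(r) for r in sizes]
            ys = [math.log(box_count(points, r)) for r in sizes]
            mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
            return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                    / sum((x - mx) ** 2 for x in xs))

        uniform = [(random.random(), random.random()) for _ in range(10_000)]
        print(skew_exponent(uniform))   # close to 2 for uniform data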

    Towards a Learned Cost Model for Distributed Spatial Join: Data, Code & Models

    Geospatial data comprise around 60% of all publicly available data. One of the essential and most complex operations that brings together multiple geospatial datasets is the spatial join. Due to its complexity, there are many partitioning techniques and parallel algorithms for the spatial join problem. This leads to a complex query optimization problem: which algorithm should be used for a given pair of input datasets that we want to join? With the rise of machine learning, there is promise in addressing this problem with various learned models. However, one concern is the lack of standard, publicly available data to train and test on, as well as the lack of accessible baseline models. This resource paper helps the research community solve this problem by providing synthetic and real datasets for spatial join, source code for constructing more datasets, and several baseline solutions that researchers can extend and compare against.
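    As a hint of what inputs such baselines might consume, the sketch below builds an illustrative feature vector for a pair of point datasets (cardinalities, MBR areas, and MBR overlap); these particular features are hypothetical stand-ins chosen for the example, not the paper's published feature set.

        import random

        def mbr(points):
            """Minimum bounding rectangle (min_x, min_y, max_x, max_y)."""
            xs, ys = zip(*points)
            return (min(xs), min(ys), max(xs), max(ys))

        def join_features(a, b):
            """Feature vector for a candidate spatial join of datasets a and b."""
            ma, mb = mbr(a), mbr(b)
            area = lambda m: (m[2] - m[0]) * (m[3] - m[1])
            ix = max(0.0, min(ma[2], mb[2]) - max(ma[0], mb[0]))
            iy = max(0.0, min(ma[3], mb[3]) - max(ma[1], mb[1]))
            return [len(a), len(b), area(ma), area(mb), ix * iy]

        a = [(random.random(), random.random()) for _ in range(1_000)]
        b = [(random.gauss(0.5, 0.1), random.gauss(0.5, 0.1)) for _ in range(1_000)]
        print(join_features(a, b))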

    Burnout among surgeons before and during the SARS-CoV-2 pandemic: an international survey

    Background: The SARS-CoV-2 pandemic has had many significant impacts within the surgical realm, and surgeons have been obligated to reconsider almost every aspect of daily clinical practice. Methods: This is a cross-sectional study reported in compliance with the CHERRIES guidelines and conducted through an online platform from June 14th to July 15th, 2020. The primary outcome was the burden of burnout during the pandemic, indicated by the validated Shirom-Melamed Burnout Measure. Results: Nine hundred fifty-four surgeons completed the survey. The median length of practice was 10 years; 78.2% of respondents were male, with a median age of 37 years; 39.5% were consultants, 68.9% were general surgeons, and 55.7% were affiliated with an academic institution. Overall, there was a significant increase in the mean burnout score during the pandemic; longer practice and older age were significantly associated with less burnout. There were significant reductions in the median number of outpatient visits, operated cases, on-call hours, emergency visits, and research work, and 48.2% of respondents felt that the training resources were insufficient. The majority (81.3%) of respondents reported that their hospitals were involved in the management of COVID-19; 66.5% felt their roles had been minimized, 41% were asked to assist in non-surgical medical practices, and 37.6% were directly involved in COVID-19 management. Conclusions: There was significant burnout among trainees. Almost all aspects of clinical and research activity were affected, with a significant reduction in the volume of research, outpatient clinic visits, surgical procedures, on-call hours, and emergency cases, hindering training. Trial registration: The study was registered on clinicaltrials.gov "NCT04433286" on 16/06/2020.

    A Learned Query Optimizer for Spatial Join

    The importance and complexity of spatial join have resulted in many join algorithms, some of which run on big-data platforms such as Hadoop and Spark. This paper proposes the first machine-learning-based query optimizer for the spatial join operation that can accommodate the skewness of the spatial datasets and the complexity of the different algorithms. The main challenge is to develop portable cost models that take into account important input characteristics such as data distribution, spatial partitioning, the logic of the spatial join algorithms, and the relationship between the two datasets. The proposed system defines a set of features, all efficiently computable from the data, that capture the intricate aspects of spatial join. It then uses these features to train three machine learning models that capture several metrics to estimate the cost of four spatial join algorithms according to user requirements. The first model estimates the cardinality of the spatial join; the second predicts the number of rough comparisons for a specific join algorithm; and the third is a classification model that chooses the best join algorithm to run. Experiments on large-scale synthetic and real data show the efficiency of the proposed models over baseline methods.
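    A minimal sketch of the three-model design, using scikit-learn random forests on synthetic stand-in data; the feature layout, labels, and model types here are assumptions for illustration and do not reproduce the paper's trained models.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

        rng = np.random.default_rng(0)
        X = rng.random((200, 5))                 # one feature vector per observed join
        y_card = X[:, 0] * X[:, 1] * 1e6         # stand-in measured result cardinalities
        y_cmp = X[:, 4] * 1e7                    # stand-in rough-comparison counts
        y_alg = rng.integers(0, 4, size=200)     # stand-in best-of-four algorithm labels

        card_model = RandomForestRegressor(random_state=0).fit(X, y_card)   # model 1
        cmp_model = RandomForestRegressor(random_state=0).fit(X, y_cmp)     # model 2
        alg_model = RandomForestClassifier(random_state=0).fit(X, y_alg)    # model 3

        q = rng.random((1, 5))                   # features of an incoming join query
        print(card_model.predict(q), cmp_model.predict(q), alg_model.predict(q))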