191,214 research outputs found

    Scalable big data systems: Architectures and optimizations

    Get PDF
    Big data analytics has become not just a popular buzzword but also a strategic direction in information technology for many enterprises and government organizations. Even though many new computing and storage systems have been developed for big data analytics, scalable big data processing has become more and more challenging as a result of the huge and rapidly growing size of real-world data. Dedicated to the development of architectures and optimization techniques for scaling big data processing systems, especially in the era of cloud computing, this dissertation makes three unique contributions. First, it introduces a suite of graph partitioning algorithms that can run much faster than existing data distribution methods and inherently scale to the growth of big data. The main idea of these approaches is to partition a big graph by preserving the core computational data structure as much as possible to maximize intra-server computation and minimize inter-server communication. In addition, it proposes a distributed iterative graph computation framework that effectively utilizes secondary storage to maximize access locality and speed up distributed iterative graph computations. The framework not only considerably reduces memory requirements for iterative graph algorithms but also significantly improves the performance of iterative graph computations. Last but not the least, it establishes a suite of optimization techniques for scalable spatial data processing along with three orthogonal dimensions: (i) scalable processing of spatial alarms for mobile users traveling on road networks, (ii) scalable location tagging for improving the quality of Twitter data analytics and prediction accuracy, and (iii) lightweight spatial indexing for enhancing the performance of big spatial data queries.Ph.D

    Graph Database Solution for Higher Order Spatial Statistics in the Era of Big Data

    Full text link
    We present an algorithm for the fast computation of the general NN-point spatial correlation functions of any discrete point set embedded within an Euclidean space of Rn\mathbb{R}^n. Utilizing the concepts of kd-trees and graph databases, we describe how to count all possible NN-tuples in binned configurations within a given length scale, e.g. all pairs of points or all triplets of points with side lengths <rmax<r_{max}. Through bench-marking we show the computational advantage of our new graph based algorithm over more traditional methods. We show that all 3-point configurations up to and beyond the Baryon Acoustic Oscillation scale (∼\sim200 Mpc in physical units) can be performed on current SDSS data in reasonable time. Finally we present the first measurements of the 4-point correlation function of ∼\sim0.5 million SDSS galaxies over the redshift range 0.43<z<0.70.43<z<0.7.Comment: 9 pages, 8 figures, submitte

    SVS-JOIN : efficient spatial visual similarity join for geo-multimedia

    Get PDF
    In the big data era, massive amount of multimedia data with geo-tags has been generated and collected by smart devices equipped with mobile communications module and position sensor module. This trend has put forward higher request on large-scale geo-multimedia retrieval. Spatial similarity join is one of the significant problems in the area of spatial database. Previous works focused on spatial textual document search problem, rather than geo-multimedia retrieval. In this paper, we investigate a novel geo-multimedia retrieval paradigm named spatial visual similarity join (SVS-JOIN for short), which aims to search similar geo-image pairs in both aspects of geo-location and visual content. Firstly, the definition of SVS-JOIN is proposed and then we present the geographical similarity and visual similarity measurement. Inspired by the approach for textual similarity join, we develop an algorithm named SVS-JOIN B by combining the PPJOIN algorithm and visual similarity. Besides, an extension of it named SVS-JOIN G is developed, which utilizes spatial grid strategy to improve the search efficiency. To further speed up the search, a novel approach called SVS-JOIN Q is carefully designed, in which a quadtree and a global inverted index are employed. Comprehensive experiments are conducted on two geo-image datasets and the results demonstrate that our solution can address the SVS-JOIN problem effectively and efficiently

    Scalable model selection for spatial additive mixed modeling: application to crime analysis

    Full text link
    A rapid growth in spatial open datasets has led to a huge demand for regression approaches accommodating spatial and non-spatial effects in big data. Regression model selection is particularly important to stably estimate flexible regression models. However, conventional methods can be slow for large samples. Hence, we develop a fast and practical model-selection approach for spatial regression models, focusing on the selection of coefficient types that include constant, spatially varying, and non-spatially varying coefficients. A pre-processing approach, which replaces data matrices with small inner products through dimension reduction dramatically accelerates the computation speed of model selection. Numerical experiments show that our approach selects the model accurately and computationally efficiently, highlighting the importance of model selection in the spatial regression context. Then, the present approach is applied to open data to investigate local factors affecting crime in Japan. The results suggest that our approach is useful not only for selecting factors influencing crime risk but also for predicting crime events. This scalable model selection will be key to appropriately specifying flexible and large-scale spatial regression models in the era of big data. The developed model selection approach was implemented in the R package spmoran

    Understanding and Comparing Scalable Gaussian Process Regression for Big Data

    Full text link
    As a non-parametric Bayesian model which produces informative predictive distribution, Gaussian process (GP) has been widely used in various fields, like regression, classification and optimization. The cubic complexity of standard GP however leads to poor scalability, which poses challenges in the era of big data. Hence, various scalable GPs have been developed in the literature in order to improve the scalability while retaining desirable prediction accuracy. This paper devotes to investigating the methodological characteristics and performance of representative global and local scalable GPs including sparse approximations and local aggregations from four main perspectives: scalability, capability, controllability and robustness. The numerical experiments on two toy examples and five real-world datasets with up to 250K points offer the following findings. In terms of scalability, most of the scalable GPs own a time complexity that is linear to the training size. In terms of capability, the sparse approximations capture the long-term spatial correlations, the local aggregations capture the local patterns but suffer from over-fitting in some scenarios. In terms of controllability, we could improve the performance of sparse approximations by simply increasing the inducing size. But this is not the case for local aggregations. In terms of robustness, local aggregations are robust to various initializations of hyperparameters due to the local attention mechanism. Finally, we highlight that the proper hybrid of global and local scalable GPs may be a promising way to improve both the model capability and scalability for big data.Comment: 25 pages, 15 figures, preprint submitted to KB

    Studying the first galaxies with ALMA

    Full text link
    We discuss observations of the first galaxies, within cosmic reionization, at centimeter and millimeter wavelengths. We present a summary of current observations of the host galaxies of the most distant QSOs (z∼6z \sim 6). These observations reveal the gas, dust, and star formation in the host galaxies on kpc-scales. These data imply an enriched ISM in the QSO host galaxies within 1 Gyr of the big bang, and are consistent with models of coeval supermassive black hole and spheroidal galaxy formation in major mergers at high redshift. Current instruments are limited to studying truly pathologic objects at these redshifts, meaning hyper-luminous infrared galaxies (LFIR∼1013L_{FIR} \sim 10^{13} L⊙_\odot). ALMA will provide the one to two orders of magnitude improvement in millimeter astronomy required to study normal star forming galaxies (ie. Ly-α\alpha emitters) at z∼6z \sim 6. ALMA will reveal, at sub-kpc spatial resolution, the thermal gas and dust -- the fundamental fuel for star formation -- in galaxies into cosmic reionization.Comment: to appear in Science with ALMA: a new era for Astrophysics}, ed. R. Bachiller (Springer: Berlin); 5 pages, 7 figure

    Spatial information and the legibility of urban form: Big data in urban morphology

    Get PDF
    Urban planning and morphology have relied on analytical cartography and visual communication tools for centuries to illustrate spatial patterns, propose designs, compare alternatives, and engage the public. Classic urban form visualizations – from Giambattista Nolli’s ichnographic maps of Rome to Allan Jacobs’s figure-ground diagrams of city streets – have compressed physical urban complexity into easily comprehensible information artifacts. Today we can enhance these traditional workflows through the Smart Cities paradigm of understanding cities via user-generated content and harvested data in an information management context. New spatial technology platforms and big data offer new lenses to understand, evaluate, monitor, and manage urban form and evolution. This paper builds on the theoretical framework of visual cultures in urban planning and morphology to introduce and situate computational data science processes for exploring urban fabric patterns and spatial order. It demonstrates these workflows with OSMnx and data from OpenStreetMap, a collaborative spatial information system and mapping platform, to examine street network patterns, orientations, and configurations in different study sites around the world, considering what these reveal about the urban fabric. The age of ubiquitous urban data and computational toolkits opens up a new era of worldwide urban form analysis from integrated quantitative and qualitative perspectives
    • …
    corecore