197,599 research outputs found
Scalable big data systems: Architectures and optimizations
Big data analytics has become not just a popular buzzword but also a strategic direction in information technology for many enterprises and government organizations. Even though many new computing and storage systems have been developed for big data analytics, scalable big data processing has become more and more challenging as a result of the huge and rapidly growing size of real-world data. Dedicated to the development of architectures and optimization techniques for scaling big data processing systems, especially in the era of cloud computing, this dissertation makes three unique contributions. First, it introduces a suite of graph partitioning algorithms that can run much faster than existing data distribution methods and inherently scale to the growth of big data. The main idea of these approaches is to partition a big graph by preserving the core computational data structure as much as possible to maximize intra-server computation and minimize inter-server communication. Second, it proposes a distributed iterative graph computation framework that effectively utilizes secondary storage to maximize access locality and speed up distributed iterative graph computations. The framework not only considerably reduces memory requirements for iterative graph algorithms but also significantly improves the performance of iterative graph computations. Last but not least, it establishes a suite of optimization techniques for scalable spatial data processing along three orthogonal dimensions: (i) scalable processing of spatial alarms for mobile users traveling on road networks, (ii) scalable location tagging for improving the quality of Twitter data analytics and prediction accuracy, and (iii) lightweight spatial indexing for enhancing the performance of big spatial data queries. Ph.D.
Graph Database Solution for Higher Order Spatial Statistics in the Era of Big Data
We present an algorithm for the fast computation of the general N-point
spatial correlation functions of any discrete point set embedded within a
Euclidean space of . Utilizing the concepts of kd-trees and graph
databases, we describe how to count all possible N-tuples in binned
configurations within a given length scale, e.g. all pairs of points or all
triplets of points with side lengths . Through benchmarking we show
the computational advantage of our new graph-based algorithm over more
traditional methods. We show that all 3-point configurations up to and beyond
the Baryon Acoustic Oscillation scale (200 Mpc in physical units) can be
performed on current SDSS data in reasonable time. Finally we present the first
measurements of the 4-point correlation function of 0.5 million SDSS
galaxies over the redshift range . Comment: 9 pages, 8 figures, submitte
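The pair-counting step described in this abstract can be illustrated with a small sketch. It covers only the 2-point case (pairs), uses a plain in-memory kd-tree, and leaves out the graph database layer entirely, so it is not the authors' algorithm. The function names, the equal-width binning, and the parameters `r_max` and `n_bins` are assumptions made for illustration.

```python
import math


def build_kdtree(points, indices=None, depth=0):
    """Build a kd-tree over point indices, splitting on axes in rotation."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])
    indices.sort(key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return {
        "index": indices[mid],
        "axis": axis,
        "left": build_kdtree(points, indices[:mid], depth + 1),
        "right": build_kdtree(points, indices[mid + 1:], depth + 1),
    }


def neighbours_within(node, points, query, radius, found):
    """Append to `found` every index whose point lies closer than `radius`."""
    if node is None:
        return
    p = points[node["index"]]
    if math.dist(p, query) < radius:
        found.append(node["index"])
    delta = query[node["axis"]] - p[node["axis"]]
    # Descend into a half-space only if the query ball can reach it.
    if delta <= radius:
        neighbours_within(node["left"], points, query, radius, found)
    if delta >= -radius:
        neighbours_within(node["right"], points, query, radius, found)


def binned_pair_counts(points, r_max, n_bins):
    """Count unordered pairs with separation < r_max in equal-width bins."""
    tree = build_kdtree(points)
    counts = [0] * n_bins
    for i, p in enumerate(points):
        found = []
        neighbours_within(tree, points, p, r_max, found)
        for j in found:
            if j > i:  # count each unordered pair once
                d = math.dist(p, points[j])
                counts[min(int(d / r_max * n_bins), n_bins - 1)] += 1
    return counts
```

Calling `binned_pair_counts(points, r_max, n_bins)` on a list of coordinate tuples returns one count per separation bin. Higher-order counts (triplets and beyond) would need additional bookkeeping that this sketch does not attempt.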
SVS-JOIN : efficient spatial visual similarity join for geo-multimedia
In the big data era, massive amounts of geo-tagged multimedia data have been generated and collected by smart devices equipped with mobile communication and position sensor modules. This trend has placed greater demands on large-scale geo-multimedia retrieval. Spatial similarity join is one of the significant problems in the area of spatial databases. Previous work focused on the spatial textual document search problem rather than geo-multimedia retrieval. In this paper, we investigate a novel geo-multimedia retrieval paradigm named spatial visual similarity join (SVS-JOIN for short), which aims to search for geo-image pairs that are similar in both geo-location and visual content. First, we propose the definition of SVS-JOIN and present the geographical similarity and visual similarity measurements. Inspired by the approach for textual similarity join, we develop an algorithm named SVS-JOIN B by combining the PPJOIN algorithm and visual similarity. In addition, we develop an extension named SVS-JOIN G, which utilizes a spatial grid strategy to improve search efficiency. To further speed up the search, we design a novel approach called SVS-JOIN Q, which employs a quadtree and a global inverted index. Comprehensive experiments are conducted on two geo-image datasets, and the results demonstrate that our solution can address the SVS-JOIN problem effectively and efficiently.
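The grid idea mentioned for SVS-JOIN G can be sketched roughly as follows. The abstract does not give the similarity measures, so this sketch assumes a Euclidean distance threshold for location and Jaccard similarity over sets of visual words for content. The function names, the tuple layout `(x, y, visual_words)`, and the thresholds are hypothetical, and the PPJOIN prefix filtering is not reproduced.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed visual-content measure)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def grid_similarity_join(images, max_dist, min_visual_sim):
    """Return index pairs that are close in space and similar in content.

    images: list of (x, y, visual_words) where visual_words is a set.
    """
    # Bucket images into square cells of side max_dist, so any pair within
    # max_dist must fall in the same cell or in adjacent cells.
    grid = defaultdict(list)
    for idx, (x, y, _) in enumerate(images):
        grid[(math.floor(x / max_dist), math.floor(y / max_dist))].append(idx)

    result = []
    for (cx, cy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                candidates = grid.get((cx + dx, cy + dy), ())
                for i in members:
                    for j in candidates:
                        if i >= j:
                            continue  # report each unordered pair once
                        xi, yi, wi = images[i]
                        xj, yj, wj = images[j]
                        if (math.hypot(xi - xj, yi - yj) <= max_dist
                                and jaccard(wi, wj) >= min_visual_sim):
                            result.append((i, j))
    return sorted(result)
```

The grid only prunes candidate pairs by location. Every surviving pair is still checked against both thresholds, so the output matches a nested-loop join under the same assumed measures.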
Scalable model selection for spatial additive mixed modeling: application to crime analysis
A rapid growth in spatial open datasets has led to a huge demand for
regression approaches accommodating spatial and non-spatial effects in big
data. Regression model selection is particularly important to stably estimate
flexible regression models. However, conventional methods can be slow for large
samples. Hence, we develop a fast and practical model-selection approach for
spatial regression models, focusing on the selection of coefficient types that
include constant, spatially varying, and non-spatially varying coefficients. A
pre-processing approach, which replaces data matrices with small inner products
through dimension reduction, dramatically accelerates the computation speed of
model selection. Numerical experiments show that our approach selects the model
accurately and computationally efficiently, highlighting the importance of
model selection in the spatial regression context. Then, the present approach
is applied to open data to investigate local factors affecting crime in Japan.
The results suggest that our approach is useful not only for selecting factors
influencing crime risk but also for predicting crime events. This scalable
model selection will be key to appropriately specifying flexible and
large-scale spatial regression models in the era of big data. The developed
model selection approach was implemented in the R package spmoran.
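The general idea of replacing data matrices with small inner products can be illustrated in a much simpler setting than the spatial additive mixed models of the paper. The sketch below assumes ordinary least squares and best-subset selection by residual sum of squares. It computes X'X, X'y, and y'y once, then scores every candidate model from those small quantities without revisiting the data rows. It does not use the spmoran API, and all function names are hypothetical.

```python
from itertools import combinations


def inner_products(X, y):
    """One pass over the data: return X'X, X'y and y'y."""
    p = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    yty = sum(yi * yi for yi in y)
    return xtx, xty, yty


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def rss_for_subset(xtx, xty, yty, cols):
    """Residual sum of squares for the model using only columns `cols`."""
    A = [[xtx[a][b] for b in cols] for a in cols]
    b = [xty[a] for a in cols]
    beta = solve(A, b)
    # For the least-squares fit, RSS = y'y - beta'(X'y).
    return yty - sum(bi * ci for bi, ci in zip(beta, b))


def best_subset(X, y, size):
    """Pick the column subset of the given size with the smallest RSS."""
    xtx, xty, yty = inner_products(X, y)
    return min(combinations(range(len(xtx)), size),
               key=lambda cols: rss_for_subset(xtx, xty, yty, cols))
```

After the single pass in `inner_products`, the cost of scoring each candidate depends only on the number of columns, not on the sample size. That is the property the abstract credits for the speed-up, shown here under the simplifying assumptions above.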
Understanding and Comparing Scalable Gaussian Process Regression for Big Data
As a non-parametric Bayesian model that produces an informative predictive
distribution, the Gaussian process (GP) has been widely used in various fields,
such as regression, classification and optimization. The cubic complexity of
standard GP however leads to poor scalability, which poses challenges in the
era of big data. Hence, various scalable GPs have been developed in the
literature in order to improve the scalability while retaining desirable
prediction accuracy. This paper is devoted to investigating the methodological
characteristics and performance of representative global and local scalable GPs
including sparse approximations and local aggregations from four main
perspectives: scalability, capability, controllability and robustness. The
numerical experiments on two toy examples and five real-world datasets with up
to 250K points offer the following findings. In terms of scalability, most of
the scalable GPs have a time complexity that is linear in the training size. In
terms of capability, the sparse approximations capture the long-term spatial
correlations, while the local aggregations capture the local patterns but suffer from
over-fitting in some scenarios. In terms of controllability, we could improve
the performance of sparse approximations by simply increasing the inducing
size. But this is not the case for local aggregations. In terms of robustness,
local aggregations are robust to various initializations of hyperparameters due
to the local attention mechanism. Finally, we highlight that the proper hybrid
of global and local scalable GPs may be a promising way to improve both the
model capability and scalability for big data. Comment: 25 pages, 15 figures, preprint submitted to KB
Studying the first galaxies with ALMA
We discuss observations of the first galaxies, within cosmic reionization, at
centimeter and millimeter wavelengths. We present a summary of current
observations of the host galaxies of the most distant QSOs (). These
observations reveal the gas, dust, and star formation in the host galaxies on
kpc-scales. These data imply an enriched ISM in the QSO host galaxies within 1
Gyr of the big bang, and are consistent with models of coeval supermassive
black hole and spheroidal galaxy formation in major mergers at high redshift.
Current instruments are limited to studying truly pathologic objects at these
redshifts, meaning hyper-luminous infrared galaxies (
L). ALMA will provide the one to two orders of magnitude improvement in
millimeter astronomy required to study normal star forming galaxies (i.e.,
Ly-α emitters) at . ALMA will reveal, at sub-kpc spatial
resolution, the thermal gas and dust -- the fundamental fuel for star formation
-- in galaxies into cosmic reionization. Comment: to appear in Science with ALMA: a new era for Astrophysics, ed. R.
Bachiller (Springer: Berlin); 5 pages, 7 figure
Spatial information and the legibility of urban form: Big data in urban morphology
Urban planning and morphology have relied on analytical cartography and visual communication tools for centuries to illustrate spatial patterns, propose designs, compare alternatives, and engage the public. Classic urban form visualizations – from Giambattista Nolli’s ichnographic maps of Rome to Allan Jacobs’s figure-ground diagrams of city streets – have compressed physical urban complexity into easily comprehensible information artifacts. Today we can enhance these traditional workflows through the Smart Cities paradigm of understanding cities via user-generated content and harvested data in an information management context. New spatial technology platforms and big data offer new lenses to understand, evaluate, monitor, and manage urban form and evolution. This paper builds on the theoretical framework of visual cultures in urban planning and morphology to introduce and situate computational data science processes for exploring urban fabric patterns and spatial order. It demonstrates these workflows with OSMnx and data from OpenStreetMap, a collaborative spatial information system and mapping platform, to examine street network patterns, orientations, and configurations in different study sites around the world, considering what these reveal about the urban fabric. The age of ubiquitous urban data and computational toolkits opens up a new era of worldwide urban form analysis from integrated quantitative and qualitative perspectives.