Rule based ETL (RETL) approach for GEO spatial data warehouse
This paper presents the use of Service Oriented
Architecture (SOA) for integrating multi-source
heterogeneous geospatial data in order to support a
geospatial data warehouse. In this study, the Rule Based
ETL (RETL) concept is adapted to extract, transform and load data from a variety of
heterogeneous data sources. The ETL process transforms
data into a schematic format and loads it into the
geospatial data warehouse. By using a rule-based
technique, distributing the ETL pipeline in parallel
makes processing more efficient on large-scale data and overcomes data bottlenecks and
performance overhead. This can ease disaster
management and enable planners to monitor disaster
emergency response in an efficient manner.
Alternating Maximization: Unifying Framework for 8 Sparse PCA Formulations and Efficient Parallel Codes
Given a multivariate data set, sparse principal component analysis (SPCA)
aims to extract several linear combinations of the variables that together
explain the variance in the data as much as possible, while controlling the
number of nonzero loadings in these combinations. In this paper we consider 8
different optimization formulations for computing a single sparse loading
vector; these are obtained by combining the following factors: we employ two
norms for measuring variance (L2, L1) and two sparsity-inducing norms (L0, L1),
which are used in two different ways (constraint, penalty). Three of our
formulations, notably the one with L0 constraint and L1 variance, have not been
considered in the literature. We give a unifying reformulation which we propose
to solve via a natural alternating maximization (AM) method. We show that the AM
method is nontrivially equivalent to GPower (Journée et al.; JMLR
11:517--553, 2010) for all our formulations. Besides this, we provide 24
efficient parallel SPCA implementations: 3 codes (multi-core, GPU and cluster)
for each of the 8 problems. Parallelism in the methods is aimed at i) speeding
up computations (our GPU code can be 100 times faster than an efficient serial
code written in C++), ii) obtaining solutions explaining more variance and iii)
dealing with big data problems (our cluster code is able to solve a 357 GB
problem in about a minute). Comment: 29 pages, 9 tables, 7 figures (the paper is accompanied by a release
of the open-source code '24am')
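The alternating maximization idea in the abstract above can be illustrated for one of the eight formulations, the one with L2 variance and an L0 constraint: maximize ||Ax||_2 subject to ||x||_2 <= 1 and ||x||_0 <= s. The following is a minimal serial sketch in pure Python, not the authors' '24am' code. The function names, the choice of starting point (the unit vector of the largest-norm column) and the stopping tolerance are assumptions made for illustration.

```python
import math

def matvec(A, x):
    """Return A x for a dense matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rmatvec(A, y):
    """Return A^T y."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def normalize(v):
    """Scale v to unit Euclidean norm; a zero vector is returned unchanged."""
    n = math.sqrt(sum(t * t for t in v))
    return [t / n for t in v] if n > 0 else v

def keep_largest(z, s):
    """Zero out all but the s entries of z with largest magnitude."""
    keep = set(sorted(range(len(z)), key=lambda j: abs(z[j]), reverse=True)[:s])
    return [z[j] if j in keep else 0.0 for j in range(len(z))]

def sparse_loading(A, s, iters=100, tol=1e-12):
    """Alternate between y = Ax/||Ax|| and x = T_s(A^T y)/||T_s(A^T y)||."""
    p = len(A[0])
    # Assumed initialization: unit vector on the column with the largest norm.
    col_sq = [sum(row[j] ** 2 for row in A) for j in range(p)]
    j0 = max(range(p), key=col_sq.__getitem__)
    x = [1.0 if j == j0 else 0.0 for j in range(p)]
    for _ in range(iters):
        y = normalize(matvec(A, x))                      # best y for fixed x
        x_new = normalize(keep_largest(rmatvec(A, y), s))  # best s-sparse x for fixed y
        if not any(x_new):                               # degenerate input (A x = 0)
            break
        if max(abs(a - b) for a, b in zip(x, x_new)) < tol:
            x = x_new
            break
        x = x_new
    return x
```

Each half-step solves its subproblem exactly, so the objective does not decrease from one iteration to the next. With `s` equal to the number of variables, the truncation has no effect and the loop reduces to a power iteration on A^T A.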
ArrayBridge: Interweaving declarative array processing with high-performance computing
Scientists are increasingly turning to datacenter-scale computers to produce
and analyze massive arrays. Despite decades of database research that extols
the virtues of declarative query processing, scientists still write, debug and
parallelize imperative HPC kernels even for the most mundane queries. This
impedance mismatch has been partly attributed to the cumbersome data loading
process; in response, the database community has proposed in situ mechanisms to
access data in scientific file formats. Scientists, however, desire more than a
passive access method that reads arrays from files.
This paper describes ArrayBridge, a bi-directional array view mechanism for
scientific file formats, that aims to make declarative array manipulations
interoperable with imperative file-centric analyses. Our prototype
implementation of ArrayBridge uses HDF5 as the underlying array storage library
and seamlessly integrates into the SciDB open-source array database system. In
addition to fast querying over external array objects, ArrayBridge produces
arrays in the HDF5 file format just as easily as it can read from it.
ArrayBridge also supports time travel queries from imperative kernels through
the unmodified HDF5 API, and automatically deduplicates between array versions
for space efficiency. Our extensive performance evaluation in NERSC, a
large-scale scientific computing facility, shows that ArrayBridge exhibits
statistically indistinguishable performance and I/O scalability to the native
SciDB storage engine. Comment: 12 pages, 13 figures
A scalable analysis framework for large-scale RDF data
With the growth of the Semantic Web, the availability of RDF datasets from multiple domains
as Linked Data has taken the corpora of this web to a terabyte-scale, and challenges
modern knowledge storage and discovery techniques. Research and engineering on RDF
data management systems is a very active area with many standalone systems being introduced.
However, as the size of RDF data increases, such single-machine approaches meet
performance bottlenecks, in terms of both data loading and querying, due to the limited
parallelism inherent to symmetric multi-threaded systems and the limited available system
I/O and system memory. Although several approaches for distributed RDF data processing
have been proposed, along with clustered versions of more traditional approaches, their
techniques are limited by the trade-off they exploit between loading complexity and query
efficiency in the presence of big RDF data. This thesis therefore introduces a scalable analysis
framework for processing large-scale RDF data, which focuses on various techniques to
reduce inter-machine communication, computation and load-imbalancing so as to achieve
fast data loading and querying on distributed infrastructures.
The first part of this thesis focuses on the study of RDF store implementation and parallel
hashing on big data processing. (1) A system-level investigation of RDF store implementation
has been conducted on the basis of a comparative analysis of runtime characteristics
of a representative set of RDF stores. The detailed time cost and system consumption are
measured for data loading and querying so as to provide insight into different triple store
implementations as well as an understanding of performance differences between different
platforms. (2) A high-level structured parallel hashing approach over distributed memory is
proposed and theoretically analyzed. The detailed performance of hashing implementations
using different lock-free strategies has been characterized through extensive experiments,
thereby allowing system developers to make a more informed choice for the implementation
of their high-performance analytical data processing systems.
The second part of this thesis proposes three main techniques for fast processing of large
RDF data within the proposed framework. (1) A very efficient parallel dictionary encoding
algorithm, to avoid unnecessary disk-space consumption and reduce computational
complexity of query execution. The presented implementation has achieved notable speedups
compared to the state-of-the-art method and also has achieved excellent scalability. (2) Several
novel parallel join algorithms, to efficiently handle skew over large data during query processing.
The approaches have achieved good load balancing and have been demonstrated
to be faster than the state-of-the-art techniques in both theoretical and experimental comparisons.
(3) A two-tier dynamic indexing approach for processing SPARQL queries has been
devised which keeps loading times low and decreases or in some instances removes inter-machine
data movement for subsequent queries that contain the same graph patterns. The
results demonstrate that this design can load data at least an order of magnitude faster than
a clustered store operating in RAM while remaining within an interactive range for query
processing and even outperforms current systems for various queries.
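To make the dictionary encoding step mentioned in the abstract above concrete, the following is a minimal single-process sketch in Python. It is not the thesis's parallel algorithm. It only shows the basic idea of replacing each RDF term with a compact integer id so that triples occupy less space and can be compared as integers. The function names and the example.org terms are hypothetical.

```python
def encode_triples(triples):
    """Map each distinct RDF term to an integer id, in order of first appearance.

    Returns the encoded triples and the id -> term table needed for decoding.
    """
    term_to_id = {}
    id_to_term = []
    encoded = []
    for triple in triples:
        ids = []
        for term in triple:
            if term not in term_to_id:
                term_to_id[term] = len(id_to_term)  # next unused id
                id_to_term.append(term)
            ids.append(term_to_id[term])
        encoded.append(tuple(ids))
    return encoded, id_to_term

def decode_triples(encoded, id_to_term):
    """Inverse mapping: turn integer triples back into term triples."""
    return [tuple(id_to_term[i] for i in triple) for triple in encoded]

# Hypothetical example data.
triples = [
    ("http://example.org/alice", "http://example.org/knows", "http://example.org/bob"),
    ("http://example.org/bob", "http://example.org/knows", "http://example.org/alice"),
]
encoded, table = encode_triples(triples)
```

A distributed version would additionally have to agree on ids across machines, for instance by partitioning terms by hash so that each term is assigned an id by exactly one node; that coordination is what this sketch leaves out.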