Analytical Comparison of Grid File and K-d-b-tree Structures
Computing and Information Science
A Heterogeneous High Performance Computing Framework For Ill-Structured Spatial Join Processing
Spatial join processing over two large layers of polygonal datasets, frequently employed to detect cross-layer polygon pairs (CPPs) satisfying a join predicate, faces challenges common to ill-structured sparse problems: identifying the few intersecting cross-layer edges out of a quadratic universe of candidates. The algorithmic engineering challenge is compounded by the GPGPU SIMT architecture. Spatial join involves a lightweight filter phase, typically an overlap test over minimum bounding rectangles (MBRs) that discards the majority of CPPs, followed by a refinement phase that rigorously tests the join predicate over the edges of the surviving CPPs. In this dissertation, we develop new techniques, spanning algorithms, data structures, I/O, load balancing, and system implementation, to accelerate two-phase spatial-join processing. We present a new filtering technique, called the Common MBR Filter (CMF), which changes the overall characteristics of spatial join algorithms so that the refinement phase is no longer the computational bottleneck. CMF is based on the insight that intersecting cross-layer edges must lie within the rectangular intersection of the MBRs of a CPP, its common MBR (CMBR). We also address a key limitation of CMF for the class of spatial datasets with either large or dense active CMBRs through an extended CMF, called CMF-grid, which effectively employs both the CMBR and grid techniques by embedding a uniform grid over the CMBR of each CPP, with sizes suitably engineered for different CPPs. To demonstrate the efficiency of CMF-based filters, extensive mathematical and experimental analysis is provided. Two GPU-based spatial join systems are then proposed, based on the two CMF versions, each with four components: 1) a sort-based MBR filter, 2) CMF or CMF-grid, 3) a point-in-polygon test, and 4) an edge-intersection test. The systems show two orders of magnitude speedup over the optimized sequential GEOS C++ library.
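The CMBR insight above can be illustrated with a minimal sketch: only edges whose own bounding box touches the rectangular intersection of the two polygons' MBRs can possibly intersect across layers, so all other edges are pruned before refinement. All function names below are illustrative, not from the dissertation, and the edge test is a deliberately conservative bounding-box check.

```python
def mbr(points):
    """Minimum bounding rectangle of a point list: (xmin, ymin, xmax, ymax)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def mbr_overlap(a, b):
    """Lightweight filter-phase test: do two rectangles overlap?"""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def cmbr(a, b):
    """Rectangular intersection of two overlapping MBRs (the common MBR)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

def cmf_candidate_edges(poly_a, poly_b):
    """Keep only edges of each polygon whose own MBR overlaps the CMBR;
    only these edges can participate in a cross-layer intersection."""
    a, b = mbr(poly_a), mbr(poly_b)
    if not mbr_overlap(a, b):
        return [], []
    c = cmbr(a, b)
    edges = lambda poly: list(zip(poly, poly[1:] + poly[:1]))
    keep_a = [(p, q) for p, q in edges(poly_a) if mbr_overlap(c, mbr([p, q]))]
    keep_b = [(p, q) for p, q in edges(poly_b) if mbr_overlap(c, mbr([p, q]))]
    return keep_a, keep_b
```

For two overlapping squares, most edges fall entirely outside the CMBR and are pruned, which is why the refinement phase stops being the bottleneck.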
Furthermore, we present a distributed system of heterogeneous compute nodes that exploits combined GPU-CPU computing to scale up the computation. A load-balancing model based on Integer Linear Programming (ILP) is formulated for this system, and three heuristic algorithms are provided to approximate the ILP. Finally, we develop the MPI-cuda-GIS system on this heterogeneous computing model by integrating our CUDA-based GPU system into a newly designed distributed framework built on the Message Passing Interface (MPI). Experimental results show good scalability and performance of the MPI-cuda-GIS system.
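The flavor of such ILP-approximating heuristics can be sketched with a standard greedy longest-processing-time rule: assign each work partition to the heterogeneous node that would finish it earliest, given per-node speeds. This is a generic makespan heuristic for illustration only, not one of the dissertation's three algorithms.

```python
import heapq

def greedy_assign(task_costs, node_speeds):
    """Greedy load balancing across heterogeneous nodes: process the
    largest tasks first and give each to the node with the earliest
    current finish time, scaled by that node's relative speed."""
    # min-heap of (current_finish_time, node_id)
    heap = [(0.0, i) for i in range(len(node_speeds))]
    heapq.heapify(heap)
    assignment = [None] * len(task_costs)
    # LPT rule: schedule the most expensive tasks first
    for t in sorted(range(len(task_costs)), key=lambda i: -task_costs[i]):
        finish, node = heapq.heappop(heap)
        assignment[t] = node
        heapq.heappush(heap, (finish + task_costs[t] / node_speeds[node], node))
    return assignment
```

With costs [4, 3, 2, 1] and a 2x-faster second node, the heuristic puts the largest task alone on the slow node and the rest on the fast one.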
Indexing of spatial data in the CSDB Microsoft SQL Server 2000
Two spatial-indexing schemes are implemented in the CSDB Microsoft SQL Server 2000 environment. Experimental evaluation of the implemented methods on window queries is carried out, and the methods are compared with the standard indexing facilities available in this CSDB. For finding the quadrant splitting in the Z- and XZ-indexing methods, a heuristic algorithm is proposed that yields a smaller approximation error than the standard algorithm.
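The core of Z-indexing can be sketched briefly: interleave the bits of a cell's (x, y) coordinates into a single Morton key, so that a conventional 1-D B-tree index (such as the one built into SQL Server) can approximate 2-D locality, with quadrant splits corresponding to key bit prefixes. A minimal sketch, independent of the paper's heuristic splitting algorithm:

```python
def z_order(x, y, bits=16):
    """Interleave the bits of non-negative cell coordinates (x, y)
    into a single Morton (Z-order) key. Consecutive keys trace the
    Z-shaped curve through the quadrants of the grid."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bit goes to even position
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bit goes to odd position
    return key
```

A window query then decomposes the query rectangle into runs of consecutive Z-keys; the quality of that decomposition is exactly what the quadrant-splitting heuristic in the abstract improves.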
Indexing Cached Multidimensional Objects in Large Main Memory Systems
Semantic caches allow queries over large datasets to leverage cached results, either directly or through transformations, using semantic information about the data objects in the cache. As the price of main memory continues to drop and its capacity increases, the size of semantic caches grows proportionately, and it is becoming expensive to compare the semantic information of every data object in the cache against a query predicate. Instead, we propose to build an index over the cached objects. Unlike straightforward linear scanning, indexing cached objects creates additional overhead for cache replacement: since the contents of a semantic cache may change dynamically at a high rate, the cache index must support fast inserts and deletes as well as fast search. In this paper, we show that multidimensional indexing navigates a large semantic cache efficiently in spite of this additional overhead, and overall is considerably less expensive than linear scanning. Prior work has placed little emphasis on the performance of multidimensional index inserts and deletes, as opposed to search performance. We compare several widely used multidimensional indexing structures with our SH-tree on insert, delete, and search operations, and show that, overall, SH-trees perform better for large semantic caches than the widely used indexing techniques.
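The trade-off the abstract describes, index maintenance cost versus avoiding a full scan, can be illustrated with a toy stand-in for a multidimensional index (the SH-tree itself is not reproduced here): a uniform grid over 2-D object descriptors in which insert, delete, and region search each touch only a few cells instead of every cached object.

```python
from collections import defaultdict

class GridCacheIndex:
    """Toy 2-D index over cached objects keyed by a point descriptor.
    Illustrative only: a stand-in for a real multidimensional index
    such as an SH-tree or R-tree."""

    def __init__(self, cell=10.0):
        self.cell = cell
        self.cells = defaultdict(set)  # grid cell -> {(obj_id, x, y)}

    def _key(self, x, y):
        return (int(x // self.cell), int(y // self.cell))

    def insert(self, obj_id, x, y):
        self.cells[self._key(x, y)].add((obj_id, x, y))

    def delete(self, obj_id, x, y):
        # Fast delete matters: semantic caches replace entries at a high rate.
        self.cells[self._key(x, y)].discard((obj_id, x, y))

    def search(self, xmin, ymin, xmax, ymax):
        """Return ids of cached objects inside the query rectangle,
        visiting only the grid cells the rectangle overlaps."""
        hits = []
        for cx in range(int(xmin // self.cell), int(xmax // self.cell) + 1):
            for cy in range(int(ymin // self.cell), int(ymax // self.cell) + 1):
                for obj_id, x, y in self.cells[(cx, cy)]:
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append(obj_id)
        return hits
```

Even this crude structure makes the paper's point: a query probes a handful of cells rather than scanning the whole cache, while insert and delete remain O(1) per object.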
Survey of time series database technology
This report has been prepared by Epimorphics Ltd. as part of the ENTRAIN project (NERC grant number NE/S016244/1), a feasibility project within the “NERC Constructing a Digital Environment Strategic Priorities Fund Programme”. The Centre for Ecology and Hydrology (CEH) is a research organisation focusing on land and freshwater ecosystems and their interaction with the atmosphere. The organisation manages a number of sensor networks to monitor the environment and also handles large databases of third-party data (e.g. river flows measured by the Environment Agency and its equivalents in Scotland and Wales). Data from these networks is stored and made available to users, both internally (through direct query of databases) and externally (via web services). The ENTRAIN project aims to address a number of issues relating to sensor data storage and integration, using a number of hydrological datasets to help define use cases: COSMOS-UK (a network of ~50 sites measuring soil moisture and meteorological variables at 1-30 minute resolutions); the CEH Greenhouse Gas (GHG) network (~15 sites measuring sub-second fluxes of gases and moisture, subsequently processed into 30-minute aggregations); and the Thames Initiative (a database of weekly and hourly water quality samples from sites around the Thames basin). In addition, this report considers the UK National River Flow Archive, a database of daily river flows and catchment rainfall derived by regional environmental agencies from 15-minute measurements of river levels and flows. CEH commissioned this report to survey alternative technologies for storing sensor data that scale better, could manage larger data volumes more easily and less expensively, and might be readily deployed on different infrastructures.
Text Document Classification: An Approach Based on Indexing
ABSTRACT
In this paper we propose a new method of classifying text documents. Unlike conventional vector space models, the proposed method preserves the sequence of term occurrences in a document. The term sequence is effectively preserved with the help of a novel data structure called the ‘Status Matrix’. A corresponding classification technique is proposed for efficient classification of text documents. In addition, to avoid sequential matching during classification, we propose to index the terms in a B-tree, an efficient index scheme. Each term in the B-tree is associated with a list of class labels of those documents that contain the term. To corroborate the efficacy of the proposed representation and the status-matrix-based classification, we have conducted extensive experiments on various datasets.
Original Source URL : http://aircconline.com/ijdkp/V2N1/2112ijdkp04.pdf
For more details : http://airccse.org/journal/ijdkp/vol2.htm
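The term-to-class-labels indexing idea from the abstract above can be sketched minimally. A plain dict stands in for the B-tree, and the Status Matrix sequence handling is omitted; the voting scheme is an illustrative assumption, not the paper's exact classifier.

```python
from collections import defaultdict

def build_term_index(training_docs):
    """Map each term to the set of class labels of the training
    documents containing it, mirroring the term-in-B-tree idea
    (a dict stands in for the actual B-tree)."""
    index = defaultdict(set)
    for label, text in training_docs:
        for term in set(text.lower().split()):
            index[term].add(label)
    return index

def candidate_classes(index, query_text):
    """For a query document, count how many of its terms are
    associated with each class label via the index."""
    votes = defaultdict(int)
    for term in set(query_text.lower().split()):
        for label in index.get(term, ()):
            votes[label] += 1
    return dict(votes)
```

The point of the index is visible even here: classification touches only the query's terms rather than sequentially matching the query against every training document.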
Advance of the Access Methods
The goal of this paper is to outline the advances in access methods over the last ten years, and to review all methods available in the accessible bibliography.
Improving the performance of similarity joins using graphics processing unit
Ankara: The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2012. Thesis (Master's) -- Bilkent University, 2012. Includes bibliographical references.
The similarity join is an important operation in data mining, used in many applications from varying domains. A similarity join operator takes one or two sets of data points and outputs pairs of points whose distance in the data space is within a certain threshold value ε. The baseline nested-loop approach computes the distances between all pairs of objects; for large sets of objects, which yield prohibitively long query times under the nested-loop paradigm, accelerating this operator becomes important. The computing capability of recent GPUs, together with a general-purpose parallel computing architecture (CUDA), has attracted many researchers. With this motivation, we propose two similarity join algorithms for the Graphics Processing Unit (GPU). To exploit the advantages of general-purpose GPU computing, we first propose an improved nested-loop join algorithm (GPU-INLJ) for the specific environment of the GPU. We also present a partitioning-based join algorithm (KMEANS-JOIN) that guarantees each partition can be joined independently without missing any join pair. Our experiments demonstrate massive performance gains and the suitability of our algorithms for large datasets.
Korkmaz, Zeynep. M.S.
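The baseline the thesis sets out to accelerate can be sketched in a few lines. This is the O(|A|·|B|) CPU reference point for a nested-loop similarity join; the GPU-INLJ and KMEANS-JOIN algorithms themselves are not reproduced here.

```python
import math

def nested_loop_similarity_join(points_a, points_b, eps):
    """Baseline nested-loop similarity join: emit all cross-set pairs
    (i, j) whose Euclidean distance is within the threshold eps.
    Every distance is computed, which is exactly the cost that makes
    GPU acceleration and partitioning worthwhile."""
    pairs = []
    for i, p in enumerate(points_a):
        for j, q in enumerate(points_b):
            if math.dist(p, q) <= eps:
                pairs.append((i, j))
    return pairs
```

A partitioning scheme like KMEANS-JOIN works by splitting the inputs so that each partition can run this same kernel independently without losing any qualifying pair.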