121 research outputs found
One stone, two birds: A lightweight multidimensional learned index with cardinality support
Innovative learning based structures have recently been proposed to tackle
index and cardinality estimation tasks, specifically learned indexes and data
driven cardinality estimators. These structures exhibit excellent performance
in capturing data distribution, making them promising for integration into AI
driven database kernels. However, accurate estimation for corner case queries
requires a large number of network parameters, resulting in higher computing
resources on expensive GPUs and more storage overhead. Additionally,
implementing cardinality estimation (CE) and the learned index separately is
redundant, because the distribution of a single table is stored twice. These
issues make designing AI driven database kernels challenging: in real database
scenarios, a compact kernel must process queries within a limited storage and time
budget. Directly integrating these two AI approaches would result in a heavy
and complex kernel due to a large number of network parameters and repeated
storage of data distribution parameters. Our proposed CardIndex structure
effectively kills two birds with one stone. It is a fast multidimensional learned
index that also serves as a lightweight cardinality estimator with parameters
scaled at the KB level. Due to its special structure and small parameter size,
it can obtain both CDF and PDF information for tuples with an incredibly low
latency of 1 to 10 microseconds. For estimation tasks with low selectivity, we
did not increase the model's parameters to obtain fine grained point density.
Instead, we fully utilized our structure's characteristics and proposed a
hybrid estimation algorithm that provides fast and exact results.
Multidimensional Range Queries on Modern Hardware
Range queries over multidimensional data are an important part of database
workloads in many applications. Their execution may be accelerated by using
multidimensional index structures (MDIS), such as kd-trees or R-trees. As for
most index structures, the usefulness of this approach depends on the
selectivity of the queries, and conventional wisdom holds that a simple scan beats
MDIS for queries accessing more than 15%-20% of a dataset. However, this wisdom
is largely based on evaluations that are almost two decades old, performed on
data being held on disks, applying IO-optimized data structures, and using
single-core systems. The question is whether this rule of thumb still holds
when multidimensional range queries (MDRQ) are performed on modern
architectures with large main memories holding all data, multi-core CPUs and
data-parallel instruction sets. In this paper, we study whether and how
much modern hardware influences the performance ratio between index
structures and scans for MDRQ. To this end, we conservatively adapted three
popular MDIS, namely the R*-tree, the kd-tree, and the VA-file, to exploit
features of modern servers and compared their performance to different flavors
of parallel scans using multiple (synthetic and real-world) analytical
workloads over multiple (synthetic and real-world) datasets of varying size,
dimensionality, and skew. We find that all approaches benefit considerably from
using main memory and parallelization, yet to varying degrees. Our evaluation
indicates that, on current machines, scanning should be favored over parallel
versions of classical MDIS even for very selective queries.
The relational XQuery puzzle: a look-back on the pieces found so far
Given the tremendous versatility of relational database implementations toward a wide range of database problems, it seems only natural to consider them as back-ends for XML data processing. Yet, the assumptions behind the language XQuery are considerably different from those in traditional RDBMSs. The underlying data model is a tree, data and results carry an intrinsic order, queries are described using explicit iteration and, after all, problems are anything but regular. Solving the relational XQuery puzzle, therefore, has challenged a number of research groups over the past years. The purpose of this article is to summarize and assess some of the results that have been obtained during this period to solve the puzzle. Our main focus is on the Pathfinder XQuery compiler, a full reference implementation of a purely relational XQuery processor. As we dissect its components, we relate them to other work in the field and also point to open problems and limitations in the context of relational XQuery processing.
Mining a Small Medical Data Set by Integrating the Decision Tree and t-test
Although several researchers have used statistical methods to prove that aspiration followed by the injection of 95% ethanol left in situ (retention) is an effective treatment for ovarian endometriomas, very few discuss the different conditions that could generate different recovery rates for the patients. Therefore, this study adopts the statistical method and decision tree techniques together to analyze the postoperative status of ovarian endometriosis patients under different conditions. Since our collected data set is small, containing only 212 records, we use all of these data as the training data. Therefore, instead of using a resultant tree to generate rules directly, we use the value of each node as a cut point to generate all possible rules from the tree first. Then, using t-test, we verify the rules to discover some useful description rules after all possible rules from the tree have been generated. Experimental results show that our approach can find some new interesting knowledge about recurrent ovarian endometriomas under different conditions.
A Heterogeneous High Performance Computing Framework For Ill-Structured Spatial Join Processing
Spatial join processing over two large layers of polygonal datasets, frequently employed to detect cross-layer polygon pairs (CPPs) satisfying a join predicate, faces challenges common to ill-structured sparse problems, namely, that of identifying the few intersecting cross-layer edges out of the quadratic universe. The algorithmic engineering challenge is compounded by the GPGPU SIMT architecture. Spatial join involves a lightweight filter phase, typically using an overlap test over minimum bounding rectangles (MBRs) to discard the majority of CPPs, followed by a refinement phase that rigorously tests the join predicate over the edges of the surviving CPPs. In this dissertation, we develop new techniques - algorithms, data structures, I/O, load balancing and system implementation - to accelerate the two-phase spatial join processing. We present a new filtering technique, called Common MBR Filter (CMF), which changes the overall characteristic of spatial join algorithms so that the refinement phase is no longer the computational bottleneck. CMF is designed based on the insight that intersecting cross-layer edges must lie within the rectangular intersection of the MBRs of a CPP, its common MBR (CMBR). We also address a key limitation of CMF for the class of spatial datasets with either large or dense active CMBRs with an extended CMF, called CMF-grid, that employs both CMBR and grid techniques by embedding a uniform grid over the CMBR of each CPP, with suitably engineered sizes for different CPPs. To show the efficiency of CMF-based filters, extensive mathematical and experimental analysis is provided. Then, two GPU-based spatial join systems are proposed based on the two CMF versions, each including four components: 1) sort-based MBR filter, 2) CMF/CMF-grid, 3) point-in-polygon test, and 4) edge-intersection test. The systems show two orders of magnitude speedup over the optimized sequential GEOS C++ library.
Furthermore, we present a distributed system of heterogeneous compute nodes to exploit GPU-CPU computing in order to scale up the computation. A load balancing model based on Integer Linear Programming (ILP) is formulated for this system. We also provide three heuristic algorithms to approximate the ILP. Finally, we develop the MPI-cuda-GIS system based on this heterogeneous computing model by integrating our CUDA-based GPU system into a newly designed distributed framework based on the Message Passing Interface (MPI). Experimental results show good scalability and performance of the MPI-cuda-GIS system.
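The common-MBR idea in the abstract above can be illustrated with a minimal sequential sketch. It is not the dissertation's GPU implementation: the names `MBR`, `common_mbr`, and `edges_in_cmbr` are hypothetical, and the edge test used here (keep an edge if its own bounding box overlaps the CMBR) is an assumed conservative filter, not the published algorithm.

```python
from typing import List, NamedTuple, Optional, Tuple

Point = Tuple[float, float]
Edge = Tuple[Point, Point]


class MBR(NamedTuple):
    """Axis-aligned minimum bounding rectangle (hypothetical helper type)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def common_mbr(a: MBR, b: MBR) -> Optional[MBR]:
    """Return the rectangular intersection of two MBRs, or None if disjoint."""
    xmin, ymin = max(a.xmin, b.xmin), max(a.ymin, b.ymin)
    xmax, ymax = min(a.xmax, b.xmax), min(a.ymax, b.ymax)
    if xmin > xmax or ymin > ymax:
        return None  # the polygon pair cannot intersect
    return MBR(xmin, ymin, xmax, ymax)


def edges_in_cmbr(edges: List[Edge], cmbr: MBR) -> List[Edge]:
    """Keep only edges whose bounding box overlaps the common MBR.

    Any edge that crosses an edge of the other layer must touch the CMBR,
    so discarding the rest is safe (assumed conservative filter).
    """
    kept = []
    for (x1, y1), (x2, y2) in edges:
        box = MBR(min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
        if common_mbr(box, cmbr) is not None:
            kept.append(((x1, y1), (x2, y2)))
    return kept


if __name__ == "__main__":
    cmbr = common_mbr(MBR(0, 0, 4, 4), MBR(2, 2, 6, 6))
    edges = [((0, 0), (1, 1)), ((1, 1), (3, 3)), ((5, 5), (6, 6))]
    print(cmbr, edges_in_cmbr(edges, cmbr))
```

Only the surviving edges would then go on to the exact edge-intersection test, which is where a filter of this kind saves work.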
Holographic Geometry of Entanglement Renormalization in Quantum Field Theories
We study a conjectured connection between the AdS/CFT correspondence and a
real-space quantum renormalization group scheme, the multi-scale entanglement
renormalization ansatz (MERA). By making close contact with the holographic
formula of the entanglement entropy, we propose a general definition of the
metric in the MERA in the extra holographic direction, which is formulated
purely in terms of quantum field theoretical data. Using the continuum version
of the MERA (cMERA), we calculate this emergent holographic metric explicitly
for free scalar boson and free fermion theories, and check that the metric so
computed has the properties expected from AdS/CFT. We also discuss the cMERA in
a time-dependent background induced by a quantum quench and estimate its
corresponding metric. Comment: 42 pages, 9 figures, reference added, minor changes
Survey of time series database technology
This report has been prepared by Epimorphics Ltd. as part of the ENTRAIN project (NERC grant number NE/S016244/1), which is a feasibility project within the “NERC Constructing a Digital Environment Strategic Priorities Fund Programme”. The Centre for Ecology and Hydrology (CEH) is a research organisation focusing on land and freshwater ecosystems and their interaction with the atmosphere. The organisation manages a number of sensor networks to monitor the environment, and also handles large databases of third-party data (e.g. river flows measured by the Environment Agency and equivalents in Scotland and Wales). Data from these networks is stored and made available to users, both internally (through direct query of databases) and externally (via web services). The ENTRAIN project aims to address a number of issues in relation to sensor data storage and integration, using a number of hydrological datasets to help define use cases: COSMOS-UK (a network of ~50 sites measuring soil moisture and meteorological variables at 1-30 minute resolutions); the CEH Greenhouse Gas (GHG) network (~15 sites measuring sub-second fluxes of gases and moisture, subsequently processed up to 30-minute aggregations); the Thames Initiative (a database of weekly and hourly water quality samples from sites around the Thames basin). In addition, this report considers the UK National River Flow Archive, a database of daily river flows and catchment rainfall derived by regional environmental agencies from 15-minute measurements of river levels and flows. CEH commissioned this report to survey alternative technologies for storing sensor data that scale better, could manage larger data volumes more easily and less expensively, and that might be readily deployed on different infrastructures.
DISTRIBUTED MULTIDIMENSIONAL INDEXING FOR SCIENTIFIC DATA ANALYSIS APPLICATIONS
Scientific data analysis applications require large scale computing power to
effectively service client queries and also require large storage repositories
for datasets that are generated continually from sensors and simulations.
These scientific datasets are growing in size every day, and are becoming truly
enormous. The goal of this dissertation is to provide efficient multidimensional
indexing techniques that aid in navigating distributed scientific datasets.
In this dissertation, we show significant improvements in accessing
distributed large scientific datasets.
The first approach we took to improve access to subsets of large
multidimensional scientific datasets was data chunking. The contents of
scientific data files typically are a collection of multidimensional arrays,
along with the corresponding metadata. Data chunking groups data elements into
small chunks of a fixed, but data-specific, size to take advantage of
spatio-temporal locality since it is not efficient to index individual data
elements of large scientific datasets.
The second approach was the design of an efficient multidimensional index for
scientific datasets. This work investigates how existing multidimensional
indexing structures perform on chunked scientific datasets, and compares their
performance with that of our own indexing structure, SH-trees. Since R-trees
were introduced, various multidimensional indexing structures have been proposed.
However, relatively few studies have focused on improving
the performance of indexing geographically distributed datasets, especially
across heterogeneous machines. As a third approach, in an attempt to
accelerate indexing performance for distributed datasets, we proposed several
distributed multidimensional indexing schemes: replicated centralized indexing,
hierarchical two level indexing, and decentralized two level indexing.
Our experimental results show that great performance improvements
are gained from distributing the multidimensional index. However, the design
choices for distributed indexing, such as replication, partitioning, and
decentralization, must be carefully considered since they may decrease the overall
performance in certain situations. Therefore, this work provides performance
guidelines to aid in selecting the best distributed multidimensional indexing
scheme for various systems and applications. Finally, we describe how a
distributed multidimensional indexing scheme can be used by a distributed
multiple query optimization middleware as a case-study application to
generate better query plans by leveraging information about the contents of
remote caches.
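The data chunking step described in the abstract above can be sketched as follows. This is a minimal illustration assuming fixed-size, axis-aligned chunks addressed by integer coordinates; the function names `chunk_of` and `chunks_for_range` are hypothetical and do not come from the dissertation.

```python
from itertools import product
from typing import List, Sequence, Tuple


def chunk_of(coord: Sequence[int], chunk_shape: Sequence[int]) -> Tuple[int, ...]:
    """Map an element's array coordinate to the id of the chunk containing it."""
    return tuple(c // s for c, s in zip(coord, chunk_shape))


def chunks_for_range(
    lo: Sequence[int], hi: Sequence[int], chunk_shape: Sequence[int]
) -> List[Tuple[int, ...]]:
    """List the chunk ids touched by the inclusive range query [lo, hi]."""
    per_dim = [
        range(l // s, h // s + 1)  # first to last chunk index along this axis
        for l, h, s in zip(lo, hi, chunk_shape)
    ]
    return list(product(*per_dim))


if __name__ == "__main__":
    shape = (4, 8)  # assumed chunk size: 4 x 8 elements
    print(chunk_of((5, 12), shape))
    print(chunks_for_range((3, 0), (5, 9), shape))
```

Indexing chunk ids instead of individual elements keeps the index small, which is the motivation the abstract gives for chunking.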