Histogram techniques for cost estimation in query optimization.
Yu Xiaohui.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2001.
Includes bibliographical references (leaves 98-115).
Abstracts in English and Chinese.
Contents:
Chapter 1. Introduction
Chapter 2. Related Work
  2.1 Query Optimization
  2.2 Query Rewriting
    2.2.1 Optimizing Multi-Block Queries
    2.2.2 Semantic Query Optimization
    2.2.3 Query Rewriting in Starburst
  2.3 Plan Generation
    2.3.1 Dynamic Programming Approach
    2.3.2 Join Query Processing
    2.3.3 Queries with Aggregates
  2.4 Statistics and Cost Estimation
  2.5 Histogram Techniques
    2.5.1 Definitions
    2.5.2 Trivial Histograms
    2.5.3 Heuristic-based Histograms
    2.5.4 V-Optimal Histograms
    2.5.5 Wavelet-based Histograms
    2.5.6 Multidimensional Histograms
    2.5.7 Global Histograms
Chapter 3. New Histogram Techniques
  3.1 Piecewise Linear Histograms
    3.1.1 Construction
    3.1.2 Usage
    3.1.3 Error Measures
    3.1.4 Experiments
    3.1.5 Conclusion
  3.2 A-Optimal Histograms
    3.2.1 A-Optimal(mean) Histograms
    3.2.2 A-Optimal(median) Histograms
    3.2.3 A-Optimal(median-cf) Histograms
    3.2.4 Experiments
Chapter 4. Global Histograms
  4.1 Wavelet-based Global Histograms
    4.1.1 Wavelet-based Global Histograms I
    4.1.2 Wavelet-based Global Histograms II
  4.2 Piecewise Linear Global Histograms
  4.3 A-Optimal Global Histograms
    4.3.1 Experiments
Chapter 5. Dynamic Maintenance
  5.1 Problem Definition
  5.2 Refining Bucket Coefficients
  5.3 Restructuring
  5.4 Experiments
Chapter 6. Conclusions
Bibliography
Mining Optimized Association Rules for Numeric Attributes
Given a huge database, we address the problem of finding association rules for numeric attributes, such as (Balance ∈ I) ⇒ (CardLoan = yes), which implies that bank customers whose balances fall in a range I are likely to use card loan with a probability greater than p. The above rule is interesting only if the range I has some special feature with respect to the interrelation between Balance and CardLoan. It is required that the number of customers whose balances are contained in I (called the support of I) is sufficient and also that the probability p of the condition CardLoan = yes being met (called the confidence ratio) be much higher than the average probability of the condition over all the data. Our goal is to realize a system that finds such appropriate ranges automatically. We mainly focus on computing two optimized ranges: one that maximizes the support on the condition that the confidence ratio is at least a given threshold value, and another that maximizes the confidence ratio on the condition that the support is at least a given threshold number. Using techniques from computational geometry, we present novel algorithms that compute the optimized ranges in linear time if the data are sorted. Since sorting data with respect to each numeric attribute is expensive in the case of huge databases that occupy much more space than the main memory, we instead apply randomized bucketing as the preprocessing method and thus obtain an efficient rule-finding system. Tests show that our implementation is fast not only in theory but also in practice. The efficiency of our algorithm enables us to compute optimized rules for all combinations of hundreds of numeric and Boolean attributes in a reasonable time.
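To make the two optimized ranges concrete, the following is a minimal brute-force sketch over pre-bucketed data. It is not the linear-time computational-geometry algorithm the abstract refers to: it simply checks every contiguous range of buckets in quadratic time. The function names, the bucket arrays and the sample counts are all assumptions made for illustration.

```python
from itertools import accumulate


def _prefix(values):
    """Prefix sums with a leading zero, so sum(values[s:t]) == p[t] - p[s]."""
    return [0, *accumulate(values)]


def optimized_support_range(support, hits, min_conf):
    """Range of buckets with maximum support whose confidence >= min_conf.

    support[i] = number of tuples in bucket i (hypothetical input layout)
    hits[i]    = tuples in bucket i that satisfy the condition
    Returns (support, first_bucket, last_bucket) or None.
    """
    ps, ph = _prefix(support), _prefix(hits)
    best = None
    for s in range(len(support)):
        for t in range(s + 1, len(support) + 1):
            sup, hit = ps[t] - ps[s], ph[t] - ph[s]
            if sup > 0 and hit >= min_conf * sup:
                if best is None or sup > best[0]:
                    best = (sup, s, t - 1)
    return best


def optimized_confidence_range(support, hits, min_support):
    """Range of buckets with maximum confidence whose support >= min_support.

    Returns (confidence, first_bucket, last_bucket) or None.
    """
    ps, ph = _prefix(support), _prefix(hits)
    best = None
    for s in range(len(support)):
        for t in range(s + 1, len(support) + 1):
            sup, hit = ps[t] - ps[s], ph[t] - ph[s]
            if sup > 0 and sup >= min_support:
                conf = hit / sup
                if best is None or conf > best[0]:
                    best = (conf, s, t - 1)
    return best


# Four balance buckets of 10 customers each; hits = customers with a card loan.
support = [10, 10, 10, 10]
hits = [0, 8, 6, 2]
print(optimized_support_range(support, hits, min_conf=0.5))
print(optimized_confidence_range(support, hits, min_support=20))
```

With these made-up counts, the widest range that still reaches 50% confidence spans buckets 1 to 3, while the most confident range with at least 20 customers spans buckets 1 to 2.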
An Efficient Architecture for Information Retrieval in P2P Context Using Hypergraph
Peer-to-peer (P2P) data-sharing systems now generate a significant portion of
Internet traffic. P2P systems have emerged as an accepted way to share enormous
volumes of data. Needs for widely distributed information systems supporting
virtual organizations have given rise to a new category of P2P systems called
schema-based. In such systems each peer is a database management system in
itself, exposing its own schema. In such a setting, the main objective is the
efficient search across peer databases by processing each incoming query
without overly consuming bandwidth. The usability of these systems depends on
successful techniques to find and retrieve data; however, efficient and
effective routing of content-based queries is an emerging problem in P2P
networks. This work is an attempt to show that using mining algorithms in the
P2P context may significantly improve the efficiency of such methods. Our
proposed method is based on a combination of clustering with hypergraphs. We
use ECCLAT to build an approximate clustering and discover meaningful clusters
with slight overlapping. We use the MTMINER algorithm to extract all minimal
transversals of a hypergraph (the clusters) for query routing. The set of
clusters improves the robustness of the query routing mechanism and the
scalability of the P2P network. We compare the performance of our method with
a baseline method on the query routing problem. Our experimental results show
that the proposed method achieves high levels of performance and scalability
with respect to important criteria such as response time, precision and recall.
Comment: 20 pages, 8 figures
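As an illustration of what a minimal transversal is, here is a small brute-force sketch. It is not MTMINER, whose details the abstract does not give; it enumerates vertex subsets by increasing size and keeps those that intersect every hyperedge and contain no smaller transversal. The function name and the peer labels are hypothetical.

```python
from itertools import combinations


def minimal_transversals(hyperedges):
    """Return all minimal transversals (hitting sets) of a hypergraph.

    hyperedges: iterable of vertex collections, e.g. one per cluster.
    Exponential in the number of vertices, so only suitable for tiny inputs.
    """
    edges = [frozenset(e) for e in hyperedges]
    vertices = sorted(set().union(*edges))
    found = []
    for size in range(1, len(vertices) + 1):
        for candidate in combinations(vertices, size):
            cand = frozenset(candidate)
            # Skip supersets of a transversal already found: not minimal.
            if any(t <= cand for t in found):
                continue
            # A transversal must share at least one vertex with every edge.
            if all(cand & e for e in edges):
                found.append(cand)
    return found


# Three overlapping clusters of peers (hypothetical labels).
clusters = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
for t in minimal_transversals(clusters):
    print(sorted(t))
```

Because candidates are visited in order of size, any transversal that survives the superset check cannot contain a smaller transversal, which is what makes it minimal.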
Translating Temporal SQL to Nested SQL
Sequenced and nonsequenced semantics are the two previously researched semantics for the evaluation of an operation in a temporal database such as a query or data modification. Sequenced semantics evaluates an operation in each time instant using only the data alive at that time. Nonsequenced semantics, in contrast, means that an operation explicitly references and manipulates the timestamps in the data.
In this thesis we propose a new framework that shows both semantics are variants of a general temporal semantics. We present the general semantics and show how additional semantics, such as preceding semantics, can be realized. The semantics are specified using annotations.
The primary contribution of this thesis is the translation from temporal SQL to nested SQL. We focus on SQL's SELECT statement, which is used to query data. Temporal SQL is SQL annotated with temporal semantics. Nested SQL is SQL for non-1NF data, with additional operations, such as COGROUP and FLATTEN, to create and un-nest, respectively, bags of tuples (non-1NF data). This thesis develops a denotational semantics for translating from temporal to nested SQL. We implemented the denotational semantics for an SQLite ANTLR grammar, and the thesis also reports on the implementation.
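The difference between the two previously studied semantics can be sketched in a few lines of Python. This is only an illustration of the definitions given above, not the thesis's translation to nested SQL; the `Row` layout, the half-open `[start, end)` periods and the sample data are assumptions.

```python
from typing import NamedTuple


class Row(NamedTuple):
    name: str
    dept: str
    start: int  # first time instant at which the row is alive (inclusive)
    end: int    # first time instant at which it is no longer alive (exclusive)


def sequenced_count(rows, dept, instants):
    """Sequenced semantics: evaluate COUNT(*) separately at each time
    instant, using only the rows alive at that instant."""
    return {
        t: sum(1 for r in rows if r.dept == dept and r.start <= t < r.end)
        for t in instants
    }


def nonsequenced_long_stays(rows, min_length):
    """Nonsequenced semantics: the query treats the timestamps as ordinary
    data and manipulates them explicitly."""
    return [r.name for r in rows if r.end - r.start >= min_length]


rows = [
    Row("ann", "db", 1, 4),
    Row("bob", "db", 3, 6),
    Row("cy", "ai", 2, 4),
]
print(sequenced_count(rows, "db", range(1, 6)))
print(nonsequenced_long_stays(rows, 3))
```

The sequenced query returns one answer per time instant, whereas the nonsequenced query returns a single answer computed from the period bounds themselves.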
Entropy-based subspace clustering for mining numerical data.
by Cheng, Chun-hung.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1999.
Includes bibliographical references (leaves 72-76).
Abstracts in English and Chinese.
Contents:
Abstract
Acknowledgments
Chapter 1. Introduction
  1.1 Six Tasks of Data Mining
    1.1.1 Classification
    1.1.2 Estimation
    1.1.3 Prediction
    1.1.4 Market Basket Analysis
    1.1.5 Clustering
    1.1.6 Description
  1.2 Problem Description
  1.3 Motivation
  1.4 Terminology
  1.5 Outline of the Thesis
Chapter 2. Survey on Previous Work
  2.1 Data Mining
    2.1.1 Association Rules and its Variations
    2.1.2 Rules Containing Numerical Attributes
  2.2 Clustering
    2.2.1 The CLIQUE Algorithm
Chapter 3. Entropy and Subspace Clustering
  3.1 Criteria of Subspace Clustering
    3.1.1 Criterion of High Density
    3.1.2 Correlation of Dimensions
  3.2 Entropy in a Numerical Database
    3.2.1 Calculation of Entropy
  3.3 Entropy and the Clustering Criteria
    3.3.1 Entropy and the Coverage Criterion
    3.3.2 Entropy and the Density Criterion
    3.3.3 Entropy and Dimensional Correlation
Chapter 4. The ENCLUS Algorithms
  4.1 Framework of the Algorithms
  4.2 Closure Properties
  4.3 Complexity Analysis
  4.4 Mining Significant Subspaces
  4.5 Mining Interesting Subspaces
  4.6 Example
Chapter 5. Experiments
  5.1 Synthetic Data
    5.1.1 Data Generation - Hyper-rectangular Data
    5.1.2 Data Generation - Linearly Dependent Data
    5.1.3 Effect of Changing the Thresholds
    5.1.4 Effectiveness of the Pruning Strategies
    5.1.5 Scalability Test
    5.1.6 Accuracy
  5.2 Real-life Data
    5.2.1 Census Data
    5.2.2 Stock Data
  5.3 Comparison with CLIQUE
    5.3.1 Subspaces with Uniform Projections
  5.4 Problems with Hyper-rectangular Data
Chapter 6. Miscellaneous Enhancements
  6.1 Extra Pruning
  6.2 Multi-resolution Approach
  6.3 Multi-threshold Approach
Chapter 7. Conclusion
Bibliography
Appendix
Chapter A. Differential Entropy vs Discrete Entropy
  A.1 Relation of Differential Entropy to Discrete Entropy
Chapter B. Mining Quantitative Association Rules
  B.1 Approaches
  B.2 Performance
  B.3 Final Remarks
First-Order Rewritability and Complexity of Two-Dimensional Temporal Ontology-Mediated Queries
Aiming at ontology-based data access to temporal data, we design
two-dimensional temporal ontology and query languages by combining logics from
the (extended) DL-Lite family with linear temporal logic LTL over discrete time
(Z,<). Our main concern is first-order rewritability of ontology-mediated
queries (OMQs) that consist of a 2D ontology and a positive temporal instance
query. Our target languages for FO-rewritings are two-sorted FO(<), that is,
first-order logic with sorts for time instants ordered by the built-in
precedence relation < and for the domain of individuals; its extension FOE
with the standard congruence predicates t ≡ 0 (mod n), for any fixed n > 1;
and FO(RPR), which admits relational primitive recursion. In terms of circuit
complexity, FOE- and FO(RPR)-rewritability guarantee answering OMQs in uniform
AC0 and NC1, respectively.
We proceed in three steps. First, we define a hierarchy of 2D DL-Lite/LTL
ontology languages and investigate the FO-rewritability of OMQs with atomic
queries by constructing projections onto 1D LTL OMQs and employing recent
results on the FO-rewritability of propositional LTL OMQs. As the projections
involve deciding consistency of ontologies and data, we also consider the
consistency problem for our languages. While the undecidability of consistency
for 2D ontology languages with expressive Boolean role inclusions might be
expected, we also show that, rather surprisingly, the restriction to Krom and
Horn role inclusions leads to decidability (and ExpSpace-completeness), even if
one admits full Booleans on concepts. As a final step, we lift some of the
rewritability results for atomic OMQs to OMQs with expressive positive temporal
instance queries. The lifting results are based on an in-depth study of the
canonical models and only concern Horn ontologies.
Interactive data mining and visualization on multi-dimensional data.
by Chu, Hong Ki.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1999.
Includes bibliographical references (leaves 75-79).
Abstracts in English and Chinese.
Contents:
Acknowledgments
Abstract
Chapter 1. Introduction
  1.1 Problem Definitions
  1.2 Experimental Setup
  1.3 Outline of the thesis
Chapter 2. Survey on Previous Researches
  2.1 Association rules
  2.2 Clustering
  2.3 Motivation
Chapter 3. IDAN on discovering quantitative association rules
  3.1 Briefing
  3.2 A-Tree
  3.3 Insertion Algorithm
  3.4 Visualizing Association Rules
Chapter 4. IDAN on discovering patterns of clustering
  4.1 Briefing
  4.2 A-Tree
  4.3 Dimensionality Curse
    4.3.1 Discrete Fourier Transform
    4.3.2 Discrete Wavelet Transform
    4.3.3 Singular Value Decomposition
  4.4 IDAN - Algorithm
  4.5 Visualizing clustering patterns
  4.6 Comparison
Chapter 5. Performance Studies
  5.1 Association Rules
  5.2 Clustering
Chapter 6. Survey on data visualization techniques
  6.1 Geometric Projection Techniques
    6.1.1 Scatter-plot Matrix
    6.1.2 Parallel Coordinates
  6.2 Icon-based Techniques
    6.2.1 Chernoff Face
    6.2.2 Stick Figures
  6.3 Pixel-oriented Techniques
  6.4 Hierarchical Techniques
Chapter 7. Conclusion
Bibliography
A Content-Addressable Network for Similarity Search in Metric Spaces
Because of the ongoing digital data explosion, search paradigms more advanced than the traditional exact match are needed for content-based retrieval in huge and ever-growing collections of data produced in application areas such as multimedia, molecular biology, marketing, computer-aided design and purchasing assistance. As an ever wider variety of data types ends up in databases used by people, computer systems must be able to model fundamental human reasoning paradigms, which are naturally based on similarity. The ability to perceive similarities is crucial for recognition, classification, and learning, and it plays an important role in scientific discovery and creativity. Recently, the mathematical notion of metric space has become a useful abstraction of similarity, and many similarity search indexes have been developed.
In this thesis, we accept the metric space similarity paradigm and concentrate on the scalability issues. By exploiting computer networks and applying Peer-to-Peer communication paradigms, we build a structured network of computers able to process similarity queries in parallel. Since no centralized entities are used, such architectures are fully scalable. Specifically, we propose a Peer-to-Peer system for similarity search in metric spaces called Metric Content-Addressable Network (MCAN), which is an extension of the well-known Content-Addressable Network (CAN) used for hash lookup. A prototype implementation of MCAN was tested on real-life datasets of image features, protein symbols, and text, and the observed results are reported. We also compared the performance of MCAN with three other recently proposed distributed data structures for similarity search in metric spaces.
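The following sketch shows why the metric space abstraction is useful for similarity search. It is a centralized, single-pivot range query, not MCAN or CAN: the triangle inequality lets the search discard objects without computing their distance to the query. The choice of edit distance as the metric, the function names and the sample strings are assumptions for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings; it satisfies the metric axioms."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def build_index(objects, pivot, dist):
    """Precompute the distance from every object to one pivot object."""
    return [(obj, dist(obj, pivot)) for obj in objects]


def range_query(index, pivot, query, radius, dist):
    """Return (objects within radius of query, distance computations used)."""
    d_query_pivot = dist(query, pivot)
    hits, computed = [], 0
    for obj, d_obj_pivot in index:
        # Triangle inequality: d(query, obj) >= |d(query, pivot) - d(obj, pivot)|,
        # so the object cannot qualify if this lower bound exceeds the radius.
        if abs(d_query_pivot - d_obj_pivot) > radius:
            continue
        computed += 1
        if dist(query, obj) <= radius:
            hits.append(obj)
    return hits, computed


words = ["cat", "cart", "card", "dog", "database"]
index = build_index(words, pivot="cat", dist=levenshtein)
print(range_query(index, "cat", "car", radius=1, dist=levenshtein))
```

Only the metric axioms are used, so the same filter works for any data type with a valid distance function, which is the property that metric indexes build on.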