
    Histogram techniques for cost estimation in query optimization.

    Yu Xiaohui. Thesis (M.Phil.)--Chinese University of Hong Kong, 2001. Includes bibliographical references (leaves 98-115). Abstracts in English and Chinese.

    Contents:
    Chapter 1. Introduction
    Chapter 2. Related Work
      2.1 Query Optimization
      2.2 Query Rewriting
        2.2.1 Optimizing Multi-Block Queries
        2.2.2 Semantic Query Optimization
        2.2.3 Query Rewriting in Starburst
      2.3 Plan Generation
        2.3.1 Dynamic Programming Approach
        2.3.2 Join Query Processing
        2.3.3 Queries with Aggregates
      2.4 Statistics and Cost Estimation
      2.5 Histogram Techniques
        2.5.1 Definitions
        2.5.2 Trivial Histograms
        2.5.3 Heuristic-based Histograms
        2.5.4 V-Optimal Histograms
        2.5.5 Wavelet-based Histograms
        2.5.6 Multidimensional Histograms
        2.5.7 Global Histograms
    Chapter 3. New Histogram Techniques
      3.1 Piecewise Linear Histograms
        3.1.1 Construction
        3.1.2 Usage
        3.1.3 Error Measures
        3.1.4 Experiments
        3.1.5 Conclusion
      3.2 A-Optimal Histograms
        3.2.1 A-Optimal(mean) Histograms
        3.2.2 A-Optimal(median) Histograms
        3.2.3 A-Optimal(median-cf) Histograms
        3.2.4 Experiments
    Chapter 4. Global Histograms
      4.1 Wavelet-based Global Histograms
        4.1.1 Wavelet-based Global Histograms I
        4.1.2 Wavelet-based Global Histograms II
      4.2 Piecewise Linear Global Histograms
      4.3 A-Optimal Global Histograms
        4.3.1 Experiments
    Chapter 5. Dynamic Maintenance
      5.1 Problem Definition
      5.2 Refining Bucket Coefficients
      5.3 Restructuring
      5.4 Experiments
    Chapter 6. Conclusions
    Bibliography

    Mining Optimized Association Rules for Numeric Attributes

    Given a huge database, we address the problem of finding association rules for numeric attributes, such as (Balance ∈ I) ⇒ (CardLoan = yes), which implies that bank customers whose balances fall in a range I are likely to use card loan with a probability greater than p. The above rule is interesting only if the range I has some special feature with respect to the interrelation between Balance and CardLoan. It is required that the number of customers whose balances are contained in I (called the support of I) is sufficient and also that the probability p of the condition CardLoan = yes being met (called the confidence ratio) be much higher than the average probability of the condition over all the data. Our goal is to realize a system that finds such appropriate ranges automatically. We mainly focus on computing two optimized ranges: one that maximizes the support on the condition that the confidence ratio is at least a given threshold value, and another that maximizes the confidence ratio on the condition that the support is at least a given threshold number. Using techniques from computational geometry, we present novel algorithms that compute the optimized ranges in linear time if the data are sorted. Since sorting data with respect to each numeric attribute is expensive in the case of huge databases that occupy much more space than the main memory, we instead apply randomized bucketing as the preprocessing method and thus obtain an efficient rule-finding system. Tests show that our implementation is fast not only in theory but also in practice. The efficiency of our algorithm enables us to compute optimized rules for all combinations of hundreds of numeric and Boolean attributes in a reasonable time.
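
    The two optimized ranges defined in the abstract above can be illustrated with a small brute-force sketch. This is not the linear-time computational-geometry algorithm the abstract refers to; it only makes the definitions of support and confidence ratio concrete. The bucket layout, the function names, and the sample numbers are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical input: buckets ordered by Balance; each bucket is
# (number of customers, number of those customers with CardLoan = yes).
Bucket = Tuple[int, int]


def optimized_confidence_range(
    buckets: List[Bucket], min_support: int
) -> Optional[Tuple[float, int, int]]:
    """Return (confidence, lo, hi) for the consecutive bucket range with the
    highest confidence among ranges whose support is at least min_support.
    Brute force over all O(n^2) ranges, for illustration only."""
    best = None
    for lo in range(len(buckets)):
        total = hits = 0
        for hi in range(lo, len(buckets)):
            total += buckets[hi][0]
            hits += buckets[hi][1]
            if total > 0 and total >= min_support:
                conf = hits / total
                if best is None or conf > best[0]:
                    best = (conf, lo, hi)
    return best


def optimized_support_range(
    buckets: List[Bucket], min_confidence: float
) -> Optional[Tuple[int, int, int]]:
    """Return (support, lo, hi) for the consecutive bucket range with the
    largest support among ranges whose confidence is at least min_confidence."""
    best = None
    for lo in range(len(buckets)):
        total = hits = 0
        for hi in range(lo, len(buckets)):
            total += buckets[hi][0]
            hits += buckets[hi][1]
            if total > 0 and hits >= min_confidence * total:
                if best is None or total > best[0]:
                    best = (total, lo, hi)
    return best


# Made-up sample data: four balance buckets.
sample = [(100, 5), (80, 40), (60, 45), (120, 10)]
conf_best = optimized_confidence_range(sample, min_support=100)
supp_best = optimized_support_range(sample, min_confidence=0.3)
```

    With this sample, the confidence-optimal range covers buckets 1 to 2 (85 of 140 customers), and the support-optimal range at a 0.3 confidence threshold covers buckets 1 to 3 (260 customers).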

    An Efficient Architecture for Information Retrieval in P2P Context Using Hypergraph

    Peer-to-peer (P2P) data-sharing systems now generate a significant portion of Internet traffic. P2P systems have emerged as an accepted way to share enormous volumes of data. The need for widely distributed information systems supporting virtual organizations has given rise to a new category of P2P systems called schema-based. In such systems each peer is a database management system in itself, exposing its own schema. In such a setting, the main objective is efficient search across peer databases, processing each incoming query without overly consuming bandwidth. The usability of these systems depends on successful techniques to find and retrieve data; however, efficient and effective routing of content-based queries is an emerging problem in P2P networks. This work is an attempt to show that using mining algorithms in the P2P context may significantly improve the efficiency of such methods. Our proposed method is based on a combination of clustering with hypergraphs. We use ECCLAT to build an approximate clustering and discover meaningful clusters with slight overlapping. We use the MTMINER algorithm to extract all minimal transversals of a hypergraph (clusters) for query routing. The set of clusters improves the robustness of the query routing mechanism and the scalability of the P2P network. We compare the performance of our method with a baseline method on the query routing problem. Our experimental results show that our proposed method achieves impressive levels of performance and scalability with respect to important criteria such as response time, precision and recall. Comment: 20 pages, 8 figures
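
    For readers unfamiliar with the term used in the abstract above: a transversal of a hypergraph is a set of vertices that intersects every hyperedge, and it is minimal if no proper subset is also a transversal. The sketch below enumerates minimal transversals by brute force. It is not MTMINER, and how the paper maps clusters and queries onto the hypergraph is not specified in the abstract; the function name and the sample hyperedges are assumptions.

```python
from itertools import combinations
from typing import Hashable, List, Set


def minimal_transversals(edges: List[Set[Hashable]]) -> List[Set[Hashable]]:
    """Enumerate all minimal transversals (minimal hitting sets) of a
    hypergraph given as a list of hyperedges. Exponential brute force,
    suitable only for tiny examples."""
    vertices = sorted(set().union(*edges))
    found: List[Set[Hashable]] = []
    # Candidates are visited by increasing size, so any transversal that
    # does not contain an already-found one must itself be minimal.
    for size in range(1, len(vertices) + 1):
        for candidate in combinations(vertices, size):
            cand = set(candidate)
            if any(smaller <= cand for smaller in found):
                continue  # contains a smaller transversal, so not minimal
            if all(cand & edge for edge in edges):
                found.append(cand)
    return found


# Made-up hypergraph with three hyperedges over vertices a..d.
example_edges = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
transversals = minimal_transversals(example_edges)
```

    For the sample hyperedges, the minimal transversals are {a, c}, {b, c} and {b, d}.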

    Translating Temporal SQL to Nested SQL

    Sequenced and nonsequenced semantics are the two previously researched semantics for the evaluation of an operation in a temporal database, such as a query or data modification. Sequenced semantics evaluates an operation at each time instant using only the data alive at that time. Nonsequenced semantics, in contrast, means that an operation explicitly references and manipulates the timestamps in the data. In this thesis we propose a new framework that shows both semantics are variants of a general temporal semantics. We present the general semantics and show how additional semantics, such as preceding semantics, can be realized. The semantics are specified using annotations. The primary contribution of this thesis is the translation from temporal SQL to nested SQL. We focus on SQL's SELECT statement, which is used to query data. Temporal SQL is SQL annotated with temporal semantics. Nested SQL is SQL for non-1NF data, with additional operations, such as COGROUP and FLATTEN, to create and un-nest, respectively, bags of tuples (non-1NF data). This thesis develops a denotational semantics for translating from temporal to nested SQL. We implemented the denotational semantics for an SQLite ANTLR grammar, and the thesis also reports on the implementation.
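
    The difference between the two semantics described in the abstract above can be sketched in a few lines. This is a toy model, not the thesis's translation to nested SQL: rows carry a half-open validity period [start, end), and the row layout, function names, and sample data are all assumptions.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical row layout: (value, start, end), valid during [start, end).
Row = Tuple[str, int, int]


def snapshot(rows: List[Row], t: int) -> List[str]:
    """Values of the rows that are alive at time instant t."""
    return [value for value, start, end in rows if start <= t < end]


def sequenced(
    rows: List[Row], query: Callable[[List[str]], int], instants: Iterable[int]
) -> Dict[int, int]:
    """Sequenced evaluation: run the same non-temporal query at every
    instant, each time seeing only the data alive at that instant."""
    return {t: query(snapshot(rows, t)) for t in instants}


def nonsequenced_long_lived(rows: List[Row], min_length: int) -> List[str]:
    """Nonsequenced evaluation: the query reads the timestamps explicitly,
    here selecting rows whose validity period lasts at least min_length."""
    return [value for value, start, end in rows if end - start >= min_length]


# Made-up sample data.
employees = [("ann", 1, 4), ("bob", 3, 6)]
headcount = sequenced(employees, len, range(1, 6))
long_lived = nonsequenced_long_lived(employees, 3)
```

    Here the sequenced query yields one count per instant (two employees only at instant 3), while the nonsequenced query treats the period bounds as ordinary data.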

    Querying and mining heterogeneous spatial, social, and temporal data


    Entropy-based subspace clustering for mining numerical data.

    by Cheng, Chun-hung. Thesis (M.Phil.)--Chinese University of Hong Kong, 1999. Includes bibliographical references (leaves 72-76). Abstracts in English and Chinese.

    Contents:
    Abstract
    Acknowledgments
    Chapter 1. Introduction
      1.1 Six Tasks of Data Mining
        1.1.1 Classification
        1.1.2 Estimation
        1.1.3 Prediction
        1.1.4 Market Basket Analysis
        1.1.5 Clustering
        1.1.6 Description
      1.2 Problem Description
      1.3 Motivation
      1.4 Terminology
      1.5 Outline of the Thesis
    Chapter 2. Survey on Previous Work
      2.1 Data Mining
        2.1.1 Association Rules and its Variations
        2.1.2 Rules Containing Numerical Attributes
      2.2 Clustering
        2.2.1 The CLIQUE Algorithm
    Chapter 3. Entropy and Subspace Clustering
      3.1 Criteria of Subspace Clustering
        3.1.1 Criterion of High Density
        3.1.2 Correlation of Dimensions
      3.2 Entropy in a Numerical Database
        3.2.1 Calculation of Entropy
      3.3 Entropy and the Clustering Criteria
        3.3.1 Entropy and the Coverage Criterion
        3.3.2 Entropy and the Density Criterion
        3.3.3 Entropy and Dimensional Correlation
    Chapter 4. The ENCLUS Algorithms
      4.1 Framework of the Algorithms
      4.2 Closure Properties
      4.3 Complexity Analysis
      4.4 Mining Significant Subspaces
      4.5 Mining Interesting Subspaces
      4.6 Example
    Chapter 5. Experiments
      5.1 Synthetic Data
        5.1.1 Data Generation: Hyper-rectangular Data
        5.1.2 Data Generation: Linearly Dependent Data
        5.1.3 Effect of Changing the Thresholds
        5.1.4 Effectiveness of the Pruning Strategies
        5.1.5 Scalability Test
        5.1.6 Accuracy
      5.2 Real-life Data
        5.2.1 Census Data
        5.2.2 Stock Data
      5.3 Comparison with CLIQUE
        5.3.1 Subspaces with Uniform Projections
      5.4 Problems with Hyper-rectangular Data
    Chapter 6. Miscellaneous Enhancements
      6.1 Extra Pruning
      6.2 Multi-resolution Approach
      6.3 Multi-threshold Approach
    Chapter 7. Conclusion
    Bibliography
    Appendix
      A. Differential Entropy vs Discrete Entropy
        A.1 Relation of Differential Entropy to Discrete Entropy
      B. Mining Quantitative Association Rules
        B.1 Approaches
        B.2 Performance
        B.3 Final Remarks

    First-Order Rewritability and Complexity of Two-Dimensional Temporal Ontology-Mediated Queries

    Aiming at ontology-based data access to temporal data, we design two-dimensional temporal ontology and query languages by combining logics from the (extended) DL-Lite family with linear temporal logic LTL over discrete time (Z,<). Our main concern is first-order rewritability of ontology-mediated queries (OMQs) that consist of a 2D ontology and a positive temporal instance query. Our target languages for FO-rewritings are two-sorted FO(<) (first-order logic with sorts for time instants ordered by the built-in precedence relation < and for the domain of individuals), its extension FOE with the standard congruence predicates t ≡ 0 (mod n), for any fixed n > 1, and FO(RPR), which admits relational primitive recursion. In terms of circuit complexity, FOE- and FO(RPR)-rewritability guarantee answering OMQs in uniform AC0 and NC1, respectively. We proceed in three steps. First, we define a hierarchy of 2D DL-Lite/LTL ontology languages and investigate the FO-rewritability of OMQs with atomic queries by constructing projections onto 1D LTL OMQs and employing recent results on the FO-rewritability of propositional LTL OMQs. As the projections involve deciding consistency of ontologies and data, we also consider the consistency problem for our languages. While the undecidability of consistency for 2D ontology languages with expressive Boolean role inclusions might be expected, we also show that, rather surprisingly, the restriction to Krom and Horn role inclusions leads to decidability (and ExpSpace-completeness), even if one admits full Booleans on concepts. As a final step, we lift some of the rewritability results for atomic OMQs to OMQs with expressive positive temporal instance queries. The lifting results are based on an in-depth study of the canonical models and only concern Horn ontologies.

    Interactive data mining and visualization on multi-dimensional data.

    by Chu, Hong Ki. Thesis (M.Phil.)--Chinese University of Hong Kong, 1999. Includes bibliographical references (leaves 75-79). Abstracts in English and Chinese.

    Contents:
    Acknowledgments
    Abstract
    Chapter 1. Introduction
      1.1 Problem Definitions
      1.2 Experimental Setup
      1.3 Outline of the thesis
    Chapter 2. Survey on Previous Researches
      2.1 Association rules
      2.2 Clustering
      2.3 Motivation
    Chapter 3. IDAN on discovering quantitative association rules
      3.1 Briefing
      3.2 A-Tree
      3.3 Insertion Algorithm
      3.4 Visualizing Association Rules
    Chapter 4. IDAN on discovering patterns of clustering
      4.1 Briefing
      4.2 A-Tree
      4.3 Dimensionality Curse
        4.3.1 Discrete Fourier Transform
        4.3.2 Discrete Wavelet Transform
        4.3.3 Singular Value Decomposition
      4.4 IDAN - Algorithm
      4.5 Visualizing clustering patterns
      4.6 Comparison
    Chapter 5. Performance Studies
      5.1 Association Rules
      5.2 Clustering
    Chapter 6. Survey on data visualization techniques
      6.1 Geometric Projection Techniques
        6.1.1 Scatter-plot Matrix
        6.1.2 Parallel Coordinates
      6.2 Icon-based Techniques
        6.2.1 Chernoff Face
        6.2.2 Stick Figures
      6.3 Pixel-oriented Techniques
      6.4 Hierarchical Techniques
    Chapter 7. Conclusion
    Bibliography

    A Content-Addressable Network for Similarity Search in Metric Spaces

    Because of the ongoing digital data explosion, more advanced search paradigms than the traditional exact match are needed for content-based retrieval in huge and ever-growing collections of data produced in application areas such as multimedia, molecular biology, marketing, computer-aided design and purchasing assistance. As the growing variety of data types moves toward databases used by people, computer systems must be able to model fundamental human reasoning paradigms, which are naturally based on similarity. The ability to perceive similarities is crucial for recognition, classification, and learning, and it plays an important role in scientific discovery and creativity. Recently, the mathematical notion of metric space has become a useful abstraction of similarity, and many similarity search indexes have been developed. In this thesis, we accept the metric space similarity paradigm and concentrate on the scalability issues. By exploiting computer networks and applying Peer-to-Peer communication paradigms, we build a structured network of computers able to process similarity queries in parallel. Since no centralized entities are used, such architectures are fully scalable. Specifically, we propose a Peer-to-Peer system for similarity search in metric spaces called Metric Content-Addressable Network (MCAN), which is an extension of the well-known Content-Addressable Network (CAN) used for hash lookup. A prototype implementation of MCAN was tested on real-life datasets of image features, protein symbols, and text, and the observed results are reported. We also compared the performance of MCAN with three other recently proposed distributed data structures for similarity search in metric spaces.