
    Average cost of orthogonal range queries in multiattribute trees

    Abstract available in the attached files.

    Online Data Structures in External Memory

    The original publication is available at www.springerlink.com. The data sets for many of today's computer applications are too large to fit within the computer's internal memory and must instead be stored on external storage devices such as disks. A major performance bottleneck can be the input/output communication (or I/O) between the external and internal memories. In this paper we discuss a variety of online data structures for external memory, some very old and some very new, such as hashing (for dictionaries), B-trees (for dictionaries and 1-D range search), buffer trees (for batched dynamic problems), interval trees with weight-balanced B-trees (for stabbing queries), priority search trees (for 3-sided 2-D range search), and R-trees and other spatial structures. We also discuss several open problems along the way.

    A Framework for Index Bulk Loading and Dynamization

    In this paper we investigate automated methods for externalizing internal memory data structures. We consider a class of balanced trees that we call weight-balanced partitioning trees (or wp-trees) for indexing a set of points in R^d. Well-known examples of wp-trees include kd-trees, BBD-trees, pseudo quad trees, and BAR trees. These trees are defined with fixed degree and are thus suited for internal memory implementations. Given an efficient wp-tree construction algorithm, we present a general framework for automatically obtaining a new dynamic external data structure. Using this framework together with a new general construction (bulk loading) technique of independent interest, we obtain data structures with guaranteed good update performance in terms of I/O transfers. Our approach gives considerably improved construction and update I/O bounds for, e.g., kd-trees and BBD-trees.

    High-dimensional indexing methods utilizing clustering and dimensionality reduction

    The emergence of novel database applications has resulted in the prevalence of a new paradigm for similarity search. These applications include multimedia databases, medical imaging databases, time series databases, DNA and protein sequence databases, and many others. Features of data objects are extracted and transformed into high-dimensional data points. Searching for objects becomes a search on points in the high-dimensional feature space. The dissimilarity between two objects is determined by the distance between two feature vectors. Similarity search is usually implemented as nearest neighbor search in feature vector spaces. The cost of processing k-nearest neighbor (k-NN) queries via a sequential scan increases as the number of objects and the number of features increase. A variety of multi-dimensional index structures have been proposed to improve the efficiency of k-NN query processing; these work well in low-dimensional space but lose their efficiency in high-dimensional space due to the curse of dimensionality. This study addresses that inefficiency with three methods: Clustering and Singular Value Decomposition (CSVD) with indexing, the Persistent Main Memory (PMM) index, and the Stepwise Dimensionality Increasing (SDI-tree) index. CSVD is an approximate nearest neighbor search method. The performance of CSVD with indexing is studied and the approximation to the distance in the original space is investigated. For a given Normalized Mean Square Error (NMSE), the higher the degree of clustering, the higher the recall. However, more clusters require more disk page accesses. A suitable number of clusters can therefore be chosen to achieve higher recall while keeping the query processing cost relatively low.
The Clustering and Indexing using Persistent Main Memory (CIPMM) framework is motivated by the following considerations: (a) a significant fraction of index pages are accessed randomly, incurring a high positioning time for each access; (b) disk transfer rate is improving 40% annually, while the improvement in positioning time is only 8%; (c) query processing incurs less CPU time for main memory resident than for disk resident indices. CIPMM aims at reducing the elapsed time for query processing by utilizing sequential, rather than random, disk accesses. A specific instance of the CIPMM framework, CIPOP, which indexes using the Persistent Ordered Partition (OP-tree), is elaborated and compared with CISR, clustering and indexing using the SR-tree. The results show that CIPOP outperforms CISR, and the higher the dimensionality, the higher the performance gains. The SDI-tree index is motivated by two observations: fanout decreases as dimensionality increases, and shorter vectors reduce cache misses. The index is built by using feature vectors transformed via principal component analysis, resulting in a structure with fewer dimensions at higher levels and with the number of dimensions increasing from one level to the next. Dimensions are retained in nonincreasing order of their variance according to a parameter p, which specifies the incremental fraction of variance at each level of the index. Experiments on three datasets have shown that SDI-trees with carefully tuned parameters incur fewer disk accesses than SR-trees and VAMSR-trees and, in addition, less CPU time than VA-Files.
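
The role of the parameter p in the SDI-tree can be illustrated with a minimal sketch. It assumes that level i of the index keeps the smallest number of leading principal components whose cumulative share of the total variance reaches i·p, capped at 100%; this reading of "incremental fraction of variance" is an assumption, not the thesis's stated procedure, and the function name `dims_per_level` is hypothetical. The variances are taken as already computed by PCA.

```python
import math

def dims_per_level(variances, p):
    """Number of leading principal components kept at each index level.

    Assumption (not from the source): level i keeps the fewest components
    whose cumulative variance fraction is at least min(1, i * p).
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ordered = sorted(variances, reverse=True)  # nonincreasing variance
    total = sum(ordered)
    n_levels = math.ceil(1 / p - 1e-9)         # guard against float noise
    result = []
    for level in range(1, n_levels + 1):
        target = min(1.0, level * p)
        cumulative = 0.0
        kept = len(ordered)                    # fall back to all dimensions
        for count, variance in enumerate(ordered, start=1):
            cumulative += variance
            if cumulative / total >= target - 1e-12:
                kept = count
                break
        result.append(kept)
    return result

# Top level first: fewer dimensions near the root, all of them at the leaves.
levels = dims_per_level([4.0, 3.0, 2.0, 1.0], p=0.5)
```

With these illustrative variances, the top level keeps two dimensions and the bottom level keeps all four, matching the description of fewer dimensions at higher levels.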

    Modeling and measurement of consumers' decision strategies

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Sloan School of Management, 2012. Cataloged from PDF version of thesis. Includes bibliographical references. This thesis consists of three related essays which explore new approaches to modeling and measurement of consumer decision strategies. The focus is on decision strategies that deviate from von Neumann-Morgenstern utility theory. Essays 1 and 2 explore decision rules that consumers use to form their consideration sets. Essay 1 proposes disjunctions-of-conjunctions (DOC) decision rules that generalize several well-studied decision models. Two methods are proposed for estimating the model. Consumers' consideration sets for global positioning systems are observed for both calibration and validation data. For the validation data, the cognitively simple DOC-based methods predict better than the ten benchmark methods on an information theoretic measure and on hit rates. The results are robust with respect to the format by which consideration is measured, the sample, and the presentation of profiles. Essay 2 develops and tests an active-machine-learning method to select questions adaptively when consumers use heuristic decision rules. The method tailors priors to each consumer based on a "configurator." Subsequent questions maximize information about the decision heuristics (minimize expected posterior entropy). To update posteriors after each question, the method approximates the posterior with a variational distribution and uses belief propagation. The method runs sufficiently fast to select new queries in under a second and provides significantly and substantially more information per question than existing methods based on random, market-based, or orthogonal questions. The algorithm is tested empirically in a web-based survey conducted by an American automotive manufacturer to study vehicle consideration. Adaptive questions outperform market-based questions when estimating heuristic decision rules.
Heuristic decision rules predict validation decisions better than compensatory rules. Essay 3 proposes a model of product search when preferences are constructed during the process of search: consumers learn what they like and dislike as they examine products. Product recommendations, whether made by sales people or online recommendation systems, bring products to the consumer's attention and impact his/her preferences. Changing preferences changes the products the consumer will choose to search; at the same time, the products the consumer chooses to search will determine the future shifts in preferences. Accounting for this two-way relationship between products and preferences is critical in optimizing recommendations. By Daria Dzyabura. Ph.D.
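
The disjunctions-of-conjunctions (DOC) rules of Essay 1 can be illustrated with a minimal sketch: a product profile enters the consideration set if it satisfies every condition of at least one conjunction. The function name `considers`, the feature names, and the example rule are hypothetical and are not taken from the thesis; the sketch shows only the rule's logical form, not the estimation methods.

```python
def considers(profile, doc_rule):
    """Return True if the profile satisfies at least one conjunction.

    doc_rule is a list of conjunctions; each conjunction is a dict mapping
    a feature name to the level that feature must take.
    """
    return any(
        all(profile.get(feature) == level
            for feature, level in conjunction.items())
        for conjunction in doc_rule
    )

# Hypothetical rule: (brand is A AND screen is large) OR (price is low).
rule = [
    {"brand": "A", "screen": "large"},
    {"price": "low"},
]

gps_1 = {"brand": "A", "screen": "large", "price": "high"}  # first conjunction
gps_2 = {"brand": "B", "screen": "small", "price": "low"}   # second conjunction
gps_3 = {"brand": "A", "screen": "small", "price": "high"}  # neither

considered = [considers(p, rule) for p in (gps_1, gps_2, gps_3)]
```

A rule with a single conjunction reduces to a purely conjunctive screen, and a rule whose conjunctions each contain one condition reduces to a purely disjunctive one, which is one way to see how the form covers simpler heuristics.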

    Coarse preferences: representation, elicitation, and decision making

    In this thesis we present a theory for learning and inference of user preferences with a novel hierarchical representation that captures preferential indifference. Such models of ‘Coarse Preferences’ represent the space of solutions with a uni-dimensional, discrete latent space of ‘categories’. This results in a partitioning of the space of solutions into preferential equivalence classes. This hierarchical model significantly reduces the computational burden of learning and inference, with improvements both in computation time and convergence behaviour with respect to number of samples. We argue that this Coarse Preferences model facilitates the efficient solution of previously computationally prohibitive recommendation procedures. The new problem of ‘coordination through set recommendation’ is one such procedure where we formulate an optimisation problem by leveraging the factored nature of our representation. Furthermore, we show how an on-line learning algorithm can be used for the efficient solution of this problem. Other benefits of our proposed model include increased quality of recommendations in Recommender Systems applications, in domains where users’ behaviour is consistent with such a hierarchical preference structure. We evaluate the usefulness of our proposed model and algorithms through experiments with two recommendation domains - a clothing retailer’s online interface, and a popular movie database. Our experimental results demonstrate computational gains over state of the art methods that use an additive decomposition of preferences in on-line active learning for recommendation.

    Explaining and Refining Decision-Theoretic Choices

    As the need to make complex choices among competing alternative actions is ubiquitous, the reasoning machinery of many intelligent systems will include an explicit model for making choices. Decision analysis is particularly useful for modelling such choices, and its potential use in intelligent systems motivates the construction of facilities for automatically explaining decision-theoretic choices and for helping users to incrementally refine the knowledge underlying them. The proposed thesis addresses the problem of providing such facilities. Specifically, we propose the construction of a domain-independent facility called UTIL, for explaining and refining a restricted but widely applicable decision-theoretic model called the additive multi-attribute value model. In this proposal we motivate the task, address the related issues, and present preliminary solutions in the context of examples from the domain of intelligent process control.
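
The additive multi-attribute value model named above scores an alternative as a weighted sum of per-attribute values, v(a) = Σ w_i · v_i(a_i). A minimal sketch follows; the function name `additive_value`, the attributes, the weights, and the component value functions are all hypothetical illustrations and do not describe UTIL itself.

```python
def additive_value(alternative, weights, value_functions):
    """Weighted sum of component values: sum of w_i * v_i(a_i)."""
    return sum(
        weights[attr] * value_functions[attr](alternative[attr])
        for attr in weights
    )

# Hypothetical attributes for a process-control choice.
weights = {"throughput": 0.5, "safety": 0.5}
value_functions = {
    "throughput": lambda units: units / 100,  # scale 0..100 to 0..1
    "safety": lambda level: {"low": 0.0, "medium": 0.5, "high": 1.0}[level],
}

option_a = {"throughput": 100, "safety": "medium"}
option_b = {"throughput": 50, "safety": "low"}

scores = {
    "a": additive_value(option_a, weights, value_functions),
    "b": additive_value(option_b, weights, value_functions),
}
best = max(scores, key=scores.get)
```

Because the total is a sum of independent terms, each term w_i · v_i(a_i) shows how much one attribute contributes to the difference between two alternatives, which is the kind of decomposition an explanation facility could build on.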