8 research outputs found

    Beyond Uniformity and Independence: Analysis of R-Trees Using the Concept of Fractal Dimension

    We propose the concept of the fractal dimension of a set of points to quantify its deviation from a uniform distribution. Using measurements on real data sets (road intersections of U.S. counties, star coordinates from NASA's Infrared-Ultraviolet Explorer, etc.), we provide evidence that real data are indeed skewed and, moreover, that they behave as mathematical fractals with a measurable, non-integer fractal dimension. Armed with this tool, we then show its practical use in predicting the performance of spatial access methods, specifically R-trees. We provide the first analysis of R-trees for skewed distributions of points: we develop a formula that estimates the number of disk accesses for range queries, given only the fractal dimension of the point set and its count. Experiments on real data sets show that the formula is very accurate: the relative error is usually below 5%, and it rarely exceeds 10%. We believe that the fractal dimension will help replace the uniformity and independence assumptions, allowing more accurate analysis for any spatial access method, as well as better estimates for query optimization on multi-attribute queries. NOTE: Appeared in PODS 1994. Christos Faloutsos and Ibrahim Kamel. "Beyond Uniformity and Independence: Analysis of R-Trees Using the Concept of Fractal Dimension", Proc. ACM SIGACT-SIGMOD-SIGART PODS, Minneapolis, MN (May 1994), pp. 4-13. (Also cross-referenced as UMIACS-TR-93-130.)
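
    The abstract above does not say how the fractal dimension is measured. As an illustration only, box counting is one standard estimator: overlay grids of shrinking cell size on the point set, count the occupied cells, and take the slope of log(occupied cells) against log(cells per side). The function names, the grid levels, and the assumption that points lie in the unit square are all hypothetical choices for this sketch, not details taken from the paper.

```python
import math


def _slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def box_counting_dimension(points, levels=(2, 4, 8, 16, 32)):
    """Estimate the box-counting dimension of 2-D points in [0, 1] x [0, 1].

    `levels` (an assumed default) lists the number of grid cells per side;
    the cell side length at each level is 1 / k.
    """
    log_inv_r = []
    log_count = []
    for k in levels:
        # Map every point to the grid cell that contains it and keep the
        # distinct cells; clamp so that a coordinate of exactly 1.0 stays
        # inside the last cell.
        occupied = {
            (min(int(x * k), k - 1), min(int(y * k), k - 1))
            for x, y in points
        }
        log_inv_r.append(math.log(k))
        log_count.append(math.log(len(occupied)))
    # The slope of log N(r) versus log(1/r) is the dimension estimate.
    return _slope(log_inv_r, log_count)
```

    On points spread along a diagonal line this estimate is close to 1, and on a dense regular grid it is close to 2; a skewed real data set would be expected to fall in between, which is the behavior the abstract describes.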

    A Survey of Information Retrieval and Filtering Methods

    We survey the major techniques for information retrieval. In the first part, we provide an overview of the traditional techniques (full text scanning, inversion, signature files and clustering). In the second part, we discuss attempts to include semantic information (natural language processing, latent semantic indexing and neural networks).

    Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension

    We examine the estimation of selectivities for range and spatial join queries in real spatial databases. As we have shown earlier, real point sets (a) consistently violate the "uniformity" and "independence" assumptions, and (b) can often be described as "fractals", with a non-integer (fractal) dimension. In this paper we show that, among the infinite family of fractal dimensions, the so-called "Correlation Dimension" D2 is the one needed to predict the selectivity of spatial joins. The main contribution is that, for all the real and synthetic point sets we tried, the average number of neighbors of a given point of the point set follows a power law, with D2 as the exponent. This immediately solves the selectivity estimation for spatial joins, as well as for "biased" range queries (i.e., queries whose centers prefer areas of high point density). We present the formulas to estimate the selectivity for the biased queries, including an integration constant (Kshape) for each query shape. Finally, we show results on real and synthetic point sets, where our formulas achieve very low relative errors (typically about 10%, versus 40%-100% under the uniformity assumption).
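
    The power law stated in this abstract (average number of neighbors within distance r growing as r to the power D2) suggests a direct way to estimate D2: count neighbor pairs at several radii and fit the log-log slope. The sketch below is a minimal brute-force illustration under that assumption; the function name, the choice of Euclidean distance, and the caller-supplied radii are hypothetical, and it does not reproduce the paper's selectivity formulas or the Kshape constant.

```python
import math


def correlation_dimension(points, radii):
    """Estimate D2 as the log-log slope of average neighbor count vs. radius.

    Brute force over all pairs, so it is only meant for small point sets.
    `radii` should lie in a range where the power law is expected to hold.
    """
    n = len(points)
    pair_counts = [0] * len(radii)
    for i in range(n):
        xi, yi = points[i]
        for j in range(i + 1, n):
            xj, yj = points[j]
            d = math.hypot(xi - xj, yi - yj)
            for k, r in enumerate(radii):
                if d <= r:
                    pair_counts[k] += 1
    if min(pair_counts) == 0:
        raise ValueError("every radius must capture at least one pair")

    # Each unordered pair contributes one neighbor to both of its endpoints.
    log_r = [math.log(r) for r in radii]
    log_avg = [math.log(2 * c / n) for c in pair_counts]

    # Least-squares slope of log(average neighbors) against log(radius).
    mean_x = sum(log_r) / len(log_r)
    mean_y = sum(log_avg) / len(log_avg)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_r, log_avg))
    den = sum((x - mean_x) ** 2 for x in log_r)
    return num / den
```

    For evenly spaced points on a line the estimate comes out near 1, and for a regular 2-D grid it is noticeably higher, though boundary effects pull it somewhat below 2 when the radii are large relative to the extent of the data.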

    Execution Performance Issues in Full-Text Information Retrieval

    The task of an information retrieval system is to identify documents that will satisfy a user’s information need. Effective fulfillment of this task has long been an active area of research, leading to sophisticated retrieval models for representing information content in documents and queries and measuring similarity between the two. The maturity and proven effectiveness of these systems have resulted in demand for increased capacity, performance, scalability, and functionality, especially as information retrieval is integrated into more traditional database management environments. In this dissertation we explore a number of functionality and performance issues in information retrieval. First, we consider creation and modification of the document collection, concentrating on management of the inverted file index. An inverted file architecture based on a persistent object store is described, and experimental results are presented for inverted file creation and modification. Our architecture provides performance that scales well with document collection size, and the database features supported by the persistent object store provide many solutions to issues that arise during integration of information retrieval into more general database environments. We then turn to query evaluation speed and introduce a new optimization technique for statistical ranking retrieval systems that support structured queries. Experimental results from a variety of query sets show that execution time can be reduced by more than 50% with no noticeable impact on retrieval effectiveness, making these more complex retrieval models attractive alternatives for environments that demand high performance.

    Query space reduction in information retrieval

    Today’s rapidly expanding and dynamic information age, coupled with users who are becoming more discerning about what information they want and when they want it, poses a serious challenge to information retrieval systems in their attempt to match users’ information needs with information repositories. To date, most research on information retrieval has concentrated on improving system effectiveness. However, as the amount of online information and the number of users concurrently accessing this information continue to grow at an exponential rate, the efficiency of information retrieval systems is now a core concern of information retrieval system developers. Users who were previously content to wait for information they needed are no longer willing or able to do so, because in today’s dynamic information age the ‘shelf life’ of information is getting shorter and shorter. This results in increasing pressure on information systems to provide the ‘right’ information at the ‘right’ time. This research focuses on improving the efficiency of information retrieval systems. To this end, we have developed and implemented a number of techniques aimed at reducing system response time by reducing the amount of data processed in order to respond effectively to a user’s information need.

    On B-tree Indices for Skewed Distributions

    It is often the case that the set of values over which a B-tree is constructed has a skewed distribution. We present a geometric growth technique to manage postings records in such cases, and show that the performance of such a technique is better than that of a straightforward fixed-length postings list: it guarantees 1 disk access on searching, and it takes a fraction of the space that its competitor requires (55% to 66%, in our experiments).