    Classifiers Based on Inverted Distances
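
    Only the title is indexed for this entry. Read literally, "inverted distances" suggests the family of nearest-neighbor classifiers that weight each neighbor's vote by the reciprocal of its distance to the query; whether that matches this paper's construction is an assumption here. A minimal sketch of that reading (all names and parameters are illustrative):

```python
import math
from collections import defaultdict

def idw_classify(train, labels, query, k=5, eps=1e-12):
    """Vote for a class with weights proportional to 1/distance.

    train  : list of feature tuples
    labels : class label for each training point
    query  : feature tuple to classify
    """
    # Sort training points by Euclidean distance to the query.
    dists = sorted((math.dist(x, query), y) for x, y in zip(train, labels))
    votes = defaultdict(float)
    for d, y in dists[:k]:
        votes[y] += 1.0 / (d + eps)  # closer neighbors count more
    return max(votes, key=votes.get)

# Tiny usage example with two 2-D classes.
train = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
labels = ["a", "a", "b", "b"]
print(idw_classify(train, labels, (0.2, 0.1), k=3))  # -> "a"
```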

    High dimensional kNN-graph construction using space filling curves
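
    No abstract is indexed for this entry, so the following is a sketch of the general technique the title names rather than of this paper's algorithm. A common space-filling-curve approach maps each point to a one-dimensional key (here a Z-order/Morton code, an assumption), sorts the points by that key, and links each point to its best candidates within a small window of the sorted order:

```python
import math

def morton_key(point, bits=10):
    """Interleave the bits of quantized coordinates (Z-order curve)."""
    coords = [min(int(c * (1 << bits)), (1 << bits) - 1) for c in point]
    key = 0
    for b in range(bits):                # bit position
        for j, c in enumerate(coords):   # dimension
            key |= ((c >> b) & 1) << (b * len(coords) + j)
    return key

def knn_graph(points, k=2, window=4):
    """Approximate kNN graph: sort by Morton key, scan a local window."""
    order = sorted(range(len(points)), key=lambda i: morton_key(points[i]))
    graph = {}
    for pos, i in enumerate(order):
        lo, hi = max(0, pos - window), min(len(order), pos + window + 1)
        cands = [j for j in order[lo:hi] if j != i]
        cands.sort(key=lambda j: math.dist(points[i], points[j]))
        graph[i] = cands[:k]
    return graph

pts = [(0.1, 0.1), (0.12, 0.09), (0.8, 0.8), (0.82, 0.79), (0.5, 0.5)]
print(knn_graph(pts, k=1))
```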

    Data Management Techniques For Fast Function Approximation

    Computing and information technology is fundamentally changing the face of modern science. Traditional methods of performing scientific studies are now making way for the next generation of methods that use computing technology. However, the kinds of calculations that scientists wish to perform, and the ways in which they want to collect, archive and analyze information, have posed several new challenges in data management and algorithm design. Achieving the goal of using computing technology effectively in scientific applications has become an important area of research in computer science, called eScience or data-driven science. Most data-driven scientific applications are aimed at studying and understanding some real-world physical phenomenon. The general methodology followed by a scientist is to first model the physical phenomenon, either directly from the mathematical equations governing the phenomenon or from a large dataset of observations about it. Recent advances in data management, data mining and machine learning have addressed numerous challenges that arise in this first stage of model building. However, the state-of-the-art methods are inadequate in addressing challenges that arise in the second stage of a data-driven scientific study, where the scientist uses the model she has built to help her understand the physical phenomenon, using tools such as computer simulation and visualization. This thesis identifies and addresses data management challenges that arise when a complex model built for a real-world phenomenon is analyzed by a scientist to gain insights about the phenomenon. The first part of the thesis concentrates on high-dimensional function approximation (HFA), a problem relevant to virtually all applications that use computer simulation as the methodology for understanding complex models. We explore various aspects of HFA in depth, identify key data management problems, and propose solutions that significantly speed up long-running scientific simulations. Besides computer simulation, visualizing low-dimensional summaries of a complex model is another method commonly used by scientists to understand models. Most real-world models are complex and involve thousands of attributes. In order to gain a thorough understanding of a model, a scientist generates a very large number of low-dimensional summaries for it. Generating large sets of summaries for a complex model presents a challenging data management task, and the second part of the thesis develops scalable algorithms for solving this problem.
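
    The abstract does not spell out the HFA techniques themselves. Purely as an illustration of the problem setting, and not as the thesis's actual method, one simple way to speed up a long-running simulation is to cache expensive function evaluations and answer later queries from the nearest cached sample whenever it lies within a tolerance:

```python
import math

class ApproxCache:
    """Approximate an expensive function by reusing nearby cached samples.

    Brute-force nearest-neighbor lookup keeps the sketch short; a real
    system would use a spatial index (k-d tree, grid, etc.).
    """

    def __init__(self, fn, tol):
        self.fn = fn          # the expensive function to approximate
        self.tol = tol        # reuse a sample within this distance
        self.samples = []     # list of (point, value) pairs

    def __call__(self, x):
        if self.samples:
            pt, val = min(self.samples, key=lambda s: math.dist(s[0], x))
            if math.dist(pt, x) <= self.tol:
                return val            # cheap approximate answer
        val = self.fn(x)              # expensive exact evaluation
        self.samples.append((x, val))
        return val

# Usage: pretend `simulate` is a long-running simulation step.
simulate = lambda x: math.sin(x[0]) * math.cos(x[1])
f = ApproxCache(simulate, tol=0.05)
print(f((0.30, 0.70)), f((0.31, 0.69)))  # second call reuses the first
```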

    Accounting for Boundary Effects in Nearest Neighbor Searching

    Given n data points in d-dimensional space, nearest neighbor searching involves determining the nearest of these data points to a given query point. Most average-case analyses of nearest neighbor searching algorithms are made under the simplifying assumption that d is fixed and that n is so large relative to d that boundary effects can be ignored. This means that for any query point the statistical distribution of the data points surrounding it is independent of the location of the query point. However, in many applications of nearest neighbor searching (such as data compression by vector quantization) this assumption is not met, since the number of data points n grows roughly as 2^d. Largely for this reason, the actual performances of many nearest neighbor algorithms tend to be much better than their theoretical analyses would suggest. We present evidence of why this is the case. We provide an accurate analysis of the number of cells visited in nearest neighbor searching by the bucketing and k-d tree algorithms. We assume..
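
    To make the measured quantity concrete, here is a minimal sketch of the bucketing algorithm the abstract refers to (an illustration, not the authors' code): hash the points into a uniform grid, examine cells in expanding rings around the query's cell, stop once the best point found so far is provably closer than any unvisited cell, and count the cells along the way.

```python
import math, random
from collections import defaultdict

def nn_bucketing(points, query, m=4):
    """Nearest neighbor via an m x m grid over [0,1]^2; returns the
    neighbor and the number of grid cells visited."""
    grid = defaultdict(list)
    for p in points:
        grid[(min(int(p[0] * m), m - 1), min(int(p[1] * m), m - 1))].append(p)
    qc = (min(int(query[0] * m), m - 1), min(int(query[1] * m), m - 1))
    best, best_d, visited = None, float("inf"), 0
    for r in range(m):  # expanding square rings of cells around qc
        for i in range(qc[0] - r, qc[0] + r + 1):
            for j in range(qc[1] - r, qc[1] + r + 1):
                if max(abs(i - qc[0]), abs(j - qc[1])) != r:
                    continue  # only the ring's boundary cells
                if 0 <= i < m and 0 <= j < m:
                    visited += 1
                    for p in grid[(i, j)]:
                        d = math.dist(p, query)
                        if d < best_d:
                            best, best_d = p, d
        # every unvisited cell lies at least r cell-widths away
        if best is not None and best_d < r / m:
            break
    return best, visited

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(200)]
print(nn_bucketing(pts, (0.5, 0.5)))
```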

    New Fundamental Technologies in Data Mining

    The progress of data mining technology and its broad public popularity establish a need for a comprehensive text on the subject. The series of books entitled "Data Mining" addresses this need by presenting in-depth descriptions of novel mining algorithms and many useful applications. Beyond explaining each topic in depth, the two books offer useful hints and strategies for solving the problems posed in the following chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence lead to significant development in the field of data mining.

    Accounting for Boundary Effects in Nearest-Neighbor Searching (Discrete & Computational Geometry, © 1996 Springer-Verlag New York Inc.)

    Given n data points in d-dimensional space, nearest-neighbor searching involves determining the nearest of these data points to a given query point. Most average-case analyses of nearest-neighbor searching algorithms are made under the simplifying assumption that d is fixed and that n is so large relative to d that boundary effects can be ignored. This means that for any query point the statistical distribution of the data points surrounding it is independent of the location of the query point. However, in many applications of nearest-neighbor searching (such as data compression by vector quantization) this assumption is not met, since the number of data points n grows roughly as 2^d. Largely for this reason, the actual performances of many nearest-neighbor algorithms tend to be much better than their theoretical analyses would suggest. We present evidence of why this is the case. We provide an accurate analysis of the number of cells visited in nearest-neighbor searching by the bucketing and k-d tree algorithms. We assume m^d points uniformly distributed in dimension d, where m is a fixed integer ≥ 2. Further, we assume that distances are measured in the L∞ metric. Our analysis is tight in the limit as d approaches infinity.
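
    The m^d setup invites a quick empirical check of the boundary effect: under the L∞ metric, a query near a corner of the unit cube sees a more distant nearest neighbor on average than a query at the center, because much of its metric ball falls outside the data region. A small simulation of that setup (parameter values chosen arbitrarily for illustration):

```python
import random

def linf(a, b):
    """L-infinity (Chebyshev) distance between two points."""
    return max(abs(x - y) for x, y in zip(a, b))

def avg_nn_dist(query, m=2, d=10, trials=30):
    """Average L-infinity NN distance from `query` to m**d uniform points."""
    random.seed(0)  # same point sets for both queries, for a fair comparison
    total = 0.0
    for _ in range(trials):
        pts = [tuple(random.random() for _ in range(d)) for _ in range(m ** d)]
        total += min(linf(p, query) for p in pts)
    return total / trials

d = 10
center = tuple(0.5 for _ in range(d))
corner = tuple(0.0 for _ in range(d))
print("center:", avg_nn_dist(center, d=d))  # smaller on average
print("corner:", avg_nn_dist(corner, d=d))  # boundary inflates NN distance
```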