2,915 research outputs found

    HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces

    Nearest neighbor search over large databases in high-dimensional spaces is inherently difficult due to the curse of dimensionality. Some form of approximation is therefore necessary to solve the problem in practice. In this paper, we propose a novel yet simple indexing scheme, HD-Index, to answer approximate k-nearest neighbor queries in massive high-dimensional databases. HD-Index consists of a set of novel hierarchical structures called RDB-trees built on Hilbert keys of database objects. The leaves of the RDB-trees store distances of database objects to reference objects, thereby allowing efficient pruning using distance filters. In addition to the triangle inequality, we also use the Ptolemaic inequality to produce better lower bounds. Experiments on massive (up to billion scale) high-dimensional (up to 1000+) datasets show that HD-Index is effective, efficient, and scalable. Comment: PVLDB 11(8):906-919, 201
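
    The distance filters mentioned in the abstract can be illustrated with a small sketch. It is not HD-Index itself (there are no RDB-trees or Hilbert keys here, and the search below is exact rather than approximate); it only shows how stored distances to reference objects give lower bounds on the query-to-object distance. The function names, the Euclidean metric, and the choice of reference objects are assumptions made for illustration.

```python
import heapq
import math
from itertools import combinations


def triangle_lower_bound(q_ref, o_ref):
    """Triangle inequality: d(q, o) >= |d(q, r) - d(o, r)| for every reference r."""
    return max(abs(dq - do) for dq, do in zip(q_ref, o_ref))


def ptolemaic_lower_bound(q_ref, o_ref, ref_ref):
    """Ptolemaic inequality (holds for Euclidean distance):
    d(q, o) >= |d(q, ri) * d(o, rj) - d(q, rj) * d(o, ri)| / d(ri, rj)."""
    best = 0.0
    for i, j in combinations(range(len(q_ref)), 2):
        if ref_ref[i][j] > 0:
            bound = abs(q_ref[i] * o_ref[j] - q_ref[j] * o_ref[i]) / ref_ref[i][j]
            best = max(best, bound)
    return best


def filtered_knn(query, objects, refs, k):
    """Exact kNN that skips objects whose lower bound already exceeds the
    current k-th best distance. Returns (sorted results, exact distance count)."""
    ref_ref = [[math.dist(a, b) for b in refs] for a in refs]
    q_ref = [math.dist(query, r) for r in refs]

    # In a real index the object-to-reference distances would be precomputed
    # and stored; here they are computed on the fly (an assumption).
    bounds = []
    for idx, obj in enumerate(objects):
        o_ref = [math.dist(obj, r) for r in refs]
        lb = max(triangle_lower_bound(q_ref, o_ref),
                 ptolemaic_lower_bound(q_ref, o_ref, ref_ref))
        bounds.append((lb, idx))
    bounds.sort()

    best = []  # max-heap of (-distance, index)
    exact = 0
    for lb, idx in bounds:
        if len(best) == k and lb >= -best[0][0]:
            break  # every remaining object has an even larger lower bound
        d = math.dist(query, objects[idx])
        exact += 1
        if len(best) < k:
            heapq.heappush(best, (-d, idx))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, idx))
    return sorted((-nd, idx) for nd, idx in best), exact
```

    Processing candidates in increasing order of lower bound means the loop can stop as soon as one bound reaches the current k-th best distance, so the second return value shows how many exact distance computations the filter left to do.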

    Digital Image Access & Retrieval

    The 33rd Annual Clinic on Library Applications of Data Processing, held at the University of Illinois at Urbana-Champaign in March of 1996, addressed the theme of "Digital Image Access & Retrieval." The papers from this conference cover a wide range of topics concerning digital imaging technology for visual resource collections. Papers covered three general areas: (1) systems, planning, and implementation; (2) automatic and semi-automatic indexing; and (3) preservation, with the bulk of the conference focusing on indexing and retrieval. published or submitted for publicatio

    Terrestrial applications: An intelligent Earth-sensing information system

    For Abstract see A82-2214

    A dynamic adaptive framework for improving case-based reasoning system performance

    Optimal performance of a Case-Based Reasoning (CBR) system means that the system must be efficient both in time and in size, and must be optimally competent. Efficiency in time depends on an efficient retrieval process over the Case Base of the CBR system. Efficiency in size means that the Case Library (CL) should be as small as possible, so it depends on the policies for case learning, meta-case learning, case forgetting, etc. Optimal competence, in turn, means that the number of problems the CBR system can solve satisfactorily must be maximal. Improving all three dimensions at the same time is a difficult challenge because they are interrelated, and it becomes even more difficult when the CBR system is applied to a dynamic or continuous domain (data stream). In this thesis, a Dynamic Adaptive Case Library framework (DACL) is proposed to improve CBR system performance by reducing the retrieval time, increasing the system's competence, and maintaining and adapting the CL so that it stays efficient in size, especially in continuous domains. DACL learns cases and organizes them into dynamic cluster structures. It is able to adapt itself to a dynamic environment, where new clusters, meta-cases or prototypes of cases, and associated indexing structures (discriminant trees, k-d trees, etc.) can be formed, updated, or even removed. DACL offers a possible solution to the management of the large amount of data generated in an unsupervised continuous domain (data stream). In addition, we propose the use of a Multiple Case Library (MCL), which is a static version of a DACL with the same structure, defined statically for use in supervised domains. The thesis proposes several techniques for improving the indexing and retrieval tasks.
    The most important indexing method is the NIAR k-d tree algorithm, which improves retrieval time and competence compared against the baseline approach (a flat CL) and against well-known techniques based on standard k-d tree strategies. The proposed Partial Matching Exploration (PME) technique explores a hierarchical case library with a tree indexing structure, aiming not to lose the cases most similar to a query case. This technique explores not only the best matching path but also several alternative partial matching paths. Experiments on a set of well-known supervised benchmark databases show an improvement in competence and in the retrieval time of similar cases. The dynamic building of prototypes in DACL has been tested in an unsupervised domain (an environmental domain) in which air pollution is evaluated. The core task of building prototypes in a DACL is the implementation of a stochastic method for learning new cases and managing prototypes. Finally, the whole dynamic framework, integrating all the main proposed approaches, has been tested in simulated unsupervised domains with several well-known databases processed incrementally, as data streams are processed in real life. The experimental results indicate that the proposed dynamic adaptive framework (DACL/MCL), together with the contributed indexing strategies and exploration techniques and with the proposed stochastic case learning and meta-case learning policies, improves the performance of standard CBR systems both in supervised domains (MCL) and in unsupervised continuous domains (DACL). Postprint (published version
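
    The "standard k-d tree strategies" that the thesis uses as a point of comparison can be sketched as follows. This is the textbook k-d tree with nearest-neighbor search, not the NIAR k-d tree or the PME technique, whose details the abstract does not give. Representing each case as a plain tuple of numeric features and using Euclidean distance as the similarity measure are assumptions made for illustration.

```python
import math


def build_kdtree(cases, depth=0):
    """Build a k-d tree by splitting on the median, cycling through the axes."""
    if not cases:
        return None
    axis = depth % len(cases[0])
    cases = sorted(cases, key=lambda c: c[axis])
    mid = len(cases) // 2
    return {
        "case": cases[mid],
        "axis": axis,
        "left": build_kdtree(cases[:mid], depth + 1),
        "right": build_kdtree(cases[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Return (distance, case) for the stored case closest to the query."""
    if node is None:
        return best
    d = math.dist(query, node["case"])
    if best is None or d < best[0]:
        best = (d, node["case"])
    diff = query[node["axis"]] - node["case"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    # Visit the other branch only if the splitting plane is closer than the
    # best match so far; otherwise nothing on that side can be more similar.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best
```

    A flat case library corresponds to scanning every case with `min(...)`; the tree returns the same nearest case while skipping the branches that the splitting-plane test rules out.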

    Retrieve: An Engineering Tool for Searching Remote Sensing and Environmental Engineering Databases

    This thesis presents the design and development of a semi-automatic information retrieval system that features manual indexing and an inverted file structure. The system requires manual indexing by an expert in the subject field to ensure high-precision searching. High recall is achieved through the implementation of the inverted file. The system provides an interactive environment, a thesaurus for normalization of the indexing language, ranking of retrieved documents, and flexible output specifications. The purpose of the thesis is to present in-house search-aid software for small document collections, intended for Remote Sensing and Environmental Engineering users.
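
    The combination of an inverted file, thesaurus normalization, and ranked output can be sketched briefly. The class below is a hypothetical illustration, not the Retrieve system: the `InvertedFile` name, the thesaurus as a simple synonym-to-preferred-term mapping, and ranking by the number of matched index terms are all assumptions, since the abstract does not describe the actual ranking or thesaurus format.

```python
from collections import Counter, defaultdict


class InvertedFile:
    """Maps each normalized index term to the set of documents carrying it."""

    def __init__(self, thesaurus=None):
        # thesaurus: synonym -> preferred term (assumed format)
        self.thesaurus = {k.lower(): v.lower() for k, v in (thesaurus or {}).items()}
        self.postings = defaultdict(set)

    def normalize(self, term):
        term = term.strip().lower()
        return self.thesaurus.get(term, term)

    def add(self, doc_id, index_terms):
        """Record the terms an indexer assigned manually to a document."""
        for term in index_terms:
            self.postings[self.normalize(term)].add(doc_id)

    def search(self, query_terms):
        """Return (doc_id, matched_term_count) pairs, best match first."""
        hits = Counter()
        for term in {self.normalize(t) for t in query_terms}:
            for doc_id in self.postings.get(term, ()):
                hits[doc_id] += 1
        return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


index = InvertedFile({"satellite imagery": "remote sensing"})
index.add("doc1", ["Remote Sensing", "soil moisture"])
index.add("doc2", ["satellite imagery"])
index.add("doc3", ["groundwater"])
print(index.search(["satellite imagery", "soil moisture"]))
```

    Because both the index terms and the query terms pass through the same normalization, a query for a synonym still reaches documents indexed under the preferred term.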

    Down the Rabbit Hole: Robust Proximity Search and Density Estimation in Sublinear Space

    For a set of $n$ points in $\Re^d$, and parameters $k$ and $\varepsilon$, we present a data structure that answers $(1+\varepsilon,k)$-ANN queries in logarithmic time. Surprisingly, the space used by the data structure is $\tilde{O}(n/k)$; that is, the space used is sublinear in the input size if $k$ is sufficiently large. Our approach provides a novel way to summarize geometric data, such that meaningful proximity queries on the data can be carried out using this sketch. Using this, we provide a sublinear space data structure that can estimate the density of a point set under various measures, including: (i) the sum of distances of the $k$ closest points to the query point, and (ii) the sum of squared distances of the $k$ closest points to the query point. Our approach generalizes to other distance-based estimations of densities of similar flavor. We also study the problem of approximating some of these quantities when using sampling. In particular, we show that a sample of size $\tilde{O}(n/k)$ is sufficient, in some restricted cases, to estimate the above quantities. Remarkably, the sample size has only linear dependency on the dimension.
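
    To make the two density measures concrete, the sketch below computes them exactly by brute force in linear space. It is only a reference definition of the quantities that the paper's sublinear-space structure estimates, not that structure; the function name and the use of Euclidean distance are assumptions.

```python
import math


def knn_distance_sums(points, query, k):
    """Return (sum of distances, sum of squared distances) from the query
    to its k closest points, computed exactly by scanning every point."""
    dists = sorted(math.dist(query, p) for p in points)[:k]
    return sum(dists), sum(d * d for d in dists)


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
print(knn_distance_sums(points, (0.0, 0.0), 3))  # (3.0, 5.0)
```

    Smaller sums indicate that the query sits in a denser region of the point set.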

    Combination of Evidence in Dempster-Shafer Theory
