20 research outputs found

    Exploiting subspace distance equalities in high-dimensional data for kNN queries

    Efficient k-nearest neighbor computation for high-dimensional data is an important, yet challenging task. The response times of state-of-the-art indexing approaches depend heavily on factors such as the distribution of the data. For clustered data, such approaches are several factors faster than a sequential scan. However, if various dimensions contain uniform or Gaussian data, they tend to be clearly outperformed by a simple sequential scan. Hence, we require an approach that generally delivers good response times, independent of the data distribution. As a solution, we propose to exploit a novel concept to efficiently compute nearest neighbors. We name it sub-space distance equality; it aims at reducing the number of distance computations independent of the data distribution. We integrate kNN algorithms into the Elf index structure, which allows us to study the sub-space distance equality concept in isolation and in combination with a main-memory optimized storage layout. In a large comparative study with twelve data sets, our results indicate that indexes based on sub-space distance equalities compute the fewest distances. For clustered data, our Elf kNN algorithm delivers a performance increase of at least a factor of two and up to two orders of magnitude, without losing the performance gain compared to sequential scans for uniform or Gaussian data.
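
    For orientation, the sequential-scan baseline that the abstract compares against can be sketched as follows. This is a minimal illustration, not the paper's code: the function name `knn_sequential_scan`, the Euclidean distance, and the tuple-based points are all assumptions, and it shows neither the Elf index nor the sub-space distance equality concept.

```python
import heapq
import math


def knn_sequential_scan(query, data, k):
    """Return the k nearest points to `query` as sorted (distance, index) pairs.

    Every point is visited once, so exactly len(data) distances are computed.
    That is the cost an index tries to undercut.
    """
    heap = []  # max-heap of the best k so far, stored as (-distance, index)
    for idx, point in enumerate(data):
        d = math.dist(query, point)  # Euclidean distance (assumed metric)
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))
        elif d < -heap[0][0]:
            # Closer than the current k-th neighbor: replace the worst entry.
            heapq.heapreplace(heap, (-d, idx))
    return sorted((-neg_d, idx) for neg_d, idx in heap)


if __name__ == "__main__":
    points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (10.0, 10.0)]
    print(knn_sequential_scan((0.0, 0.0), points, k=2))
```

    The scan needs no knowledge of the data distribution, which is consistent with the abstract's observation that it stays competitive on uniform or Gaussian data.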

    Indexing for efficient main memory processing

    Ph.D. (Doctor of Philosophy)

    A Content-Addressable Network for Similarity Search in Metric Spaces

    Because of the ongoing digital data explosion, more advanced search paradigms than the traditional exact match are needed for content-based retrieval in huge and ever-growing collections of data produced in application areas such as multimedia, molecular biology, marketing, computer-aided design and purchasing assistance. As the variety of data types rapidly moves toward databases used directly by people, computer systems must be able to model fundamental human reasoning paradigms, which are naturally based on similarity. The ability to perceive similarities is crucial for recognition, classification, and learning, and it plays an important role in scientific discovery and creativity. Recently, the mathematical notion of metric space has become a useful abstraction of similarity, and many similarity search indexes have been developed. In this thesis, we accept the metric space similarity paradigm and concentrate on the scalability issues. By exploiting computer networks and applying Peer-to-Peer communication paradigms, we build a structured network of computers able to process similarity queries in parallel. Since no centralized entities are used, such architectures are fully scalable. Specifically, we propose a Peer-to-Peer system for similarity search in metric spaces called Metric Content-Addressable Network (MCAN), which is an extension of the well-known Content-Addressable Network (CAN) used for hash lookup. A prototype implementation of MCAN was tested on real-life datasets of image features, protein symbols, and text; the observed results are reported. We also compared the performance of MCAN with three other recently proposed distributed data structures for similarity search in metric spaces.

    Improving Distance-Join Query Processing with Voronoi-Diagram based Partitioning in SpatialHadoop

    SpatialHadoop is an extended MapReduce framework supporting global indexing techniques that partition spatial datasets across several machines and improve spatial query processing performance compared to traditional Hadoop systems. SpatialHadoop supports several spatial operations (e.g., Nearest Neighbor search, range query, spatial intersection join, etc.) and seven spatial partitioning techniques (Grid, Quadtree, STR, STR+, k-d tree, Z-curve and Hilbert-curve). Distance-Join Queries (DJQs), like the Nearest Neighbors Join Query (NNJQ) and Closest Pairs Query (CPQ), are common operations used in numerous spatial applications. DJQs are costly operations, since they combine spatial joins with distance-based search. Data partitioning improves the management of large datasets and speeds up query performance. Therefore, performing DJQs efficiently with new partitioning methods in SpatialHadoop is a challenging task. In this paper, a new data partitioning technique based on Voronoi-Diagrams is designed and implemented in SpatialHadoop. Moreover, improved NNJQ and CPQ MapReduce algorithms, using the new partitioning mechanism, are also designed and developed for SpatialHadoop. Finally, the results of an extensive set of experiments with real-world datasets are presented, demonstrating that the new partitioning technique and the improved DJQ MapReduce algorithms are efficient, scalable and robust in SpatialHadoop.

    Indexing Metric Spaces for Exact Similarity Search

    With the continued digitalization of societal processes, we are seeing an explosion in available data. This is referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity and variety. Many studies address volume or velocity, while far fewer concern variety. Metric space is ideal for addressing variety because it can accommodate any type of data as long as its associated distance notion satisfies the triangle inequality. To accelerate search in metric space, a collection of indexing techniques for metric data has been proposed. However, existing surveys each offer only narrow coverage, and no comprehensive empirical study of those techniques exists. We offer a survey of all the existing metric indexes that can support exact similarity search, by i) summarizing all the existing partitioning, pruning and validation techniques used for metric indexes, ii) providing a time and storage complexity analysis of index construction, and iii) reporting on a comprehensive empirical comparison of their similarity query processing performance. Empirical comparisons are used to evaluate index performance during search because the complexity analyses of similarity query processing show little difference, and query performance depends on pruning and validation abilities that are related to the data distribution. This article aims at revealing the strengths and weaknesses of different indexing techniques in order to offer guidance on selecting an appropriate indexing technique for a given setting, and at directing future research on metric indexes.
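
    The role of the triangle inequality in pruning can be illustrated with a generic pivot-table filter. This is a sketch under stated assumptions, not a technique taken from the survey itself: the names `build_pivot_table` and `range_query` are hypothetical, Euclidean distance stands in for an arbitrary metric, and the pivots are chosen by hand. For any pivot p, the triangle inequality gives |d(q, p) - d(o, p)| <= d(q, o), so an object whose lower bound already exceeds the query radius can be discarded without computing d(q, o).

```python
import math


def build_pivot_table(data, pivots, dist=math.dist):
    """Precompute d(object, pivot) for every object and pivot."""
    return [[dist(obj, p) for p in pivots] for obj in data]


def range_query(query, radius, data, pivots, table, dist=math.dist):
    """Return (matches, number of object distances actually computed)."""
    q_to_pivots = [dist(query, p) for p in pivots]
    matches, computed = [], 0
    for obj, row in zip(data, table):
        # Lower bound on d(query, obj) derived from the triangle inequality.
        lower = max(abs(qp - op) for qp, op in zip(q_to_pivots, row))
        if lower > radius:
            continue  # pruned: obj cannot lie within the radius
        computed += 1  # validation step: compute the real distance
        if dist(query, obj) <= radius:
            matches.append(obj)
    return matches, computed


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
    pivots = [(0.0, 0.0)]
    table = build_pivot_table(data, pivots)
    print(range_query((0.5, 0.0), 1.0, data, pivots, table))
```

    How many objects survive the filter depends on where the pivots sit relative to the data, which is one reason query performance varies with the data distribution.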

    Resource Description and Selection for Similarity Search in Metric Spaces: Problems and Problem-Solving Approaches

    In times of an ever-increasing amount of data and a growing diversity of data types in different application contexts, there is a strong need for large-scale and flexible indexing and search techniques. Metric access methods (MAMs) provide this flexibility, because they only assume that the dissimilarity between two data objects is modeled by a distance metric. Furthermore, scalable solutions can be built with the help of distributed MAMs. Both IF4MI and RS4MI, which are presented in this thesis, represent metric access methods. IF4MI belongs to the group of centralized MAMs. It is based on an inverted file and thus offers a hybrid access method providing text retrieval capabilities in addition to content-based search in arbitrary metric spaces. In contrast to IF4MI, RS4MI is a distributed MAM based on resource description and selection techniques. Here, data objects are physically distributed. However, RS4MI is by no means restricted to a certain type of distributed information retrieval system. Various application fields for the resource description and selection techniques are possible, for example in the context of visual analytics. Due to the metric space assumption, possible application fields go far beyond content-based image retrieval applications, which provide the example scenario here.

    COMPLEX QUERY PROCESSING AND RECOVERY IN DISTRIBUTED SYSTEMS

    Ph.D. (Doctor of Philosophy)

    Advanced similarity queries and their application in data mining

    Ph.D. (Doctor of Philosophy)

    3D exemplar-based image inpainting in electron microscopy

    In electron microscopy (EM), a common problem is the non-availability of data, which causes artefacts in reconstructions. The goal of this thesis is to generate artificial data where it is missing in EM by using exemplar-based inpainting (EBI). We implement an accelerated 3D version tailored to applications in EM, which reduces reconstruction times from days to minutes. We develop intelligent sampling strategies to find optimal data as input for reconstruction methods. Further, we investigate approaches to reduce electron dose and acquisition time. Sparse sampling followed by inpainting is the most promising approach. As common evaluation measures may lead to misinterpretation of results in EM and falsify a subsequent analysis, we propose to use application-driven metrics and demonstrate this in a segmentation task. A further application of our technique is the artificial generation of projections in tilt-based EM. EBI is used to generate missing projections, such that the full angular range is covered. Subsequent reconstructions are significantly enhanced in terms of resolution, which facilitates further analysis of samples. In conclusion, EBI proves promising when used as an additional data generation step to tackle the non-availability of data in EM, which is evaluated in selected applications. Enhancing adaptive sampling methods and refining EBI, especially considering their mutual influence, promotes higher throughput in EM using less electron dose without lessening quality.