    Efficient Similarity Search in Structured Data

    Modern database applications are characterized by two major aspects: the use of complex data types with internal structure and the need for new data analysis methods. The focus of database users has shifted from simple queries to complex analyses of the data, known as knowledge discovery in databases. Important tasks in this area are the grouping of data objects (clustering), the classification of new data objects, and the detection of exceptional data objects (outlier detection). Most algorithms for solving these problems are based on similarity search in databases. This makes efficient similarity search in large databases of structured objects an important basic operation for modern database applications. In this thesis we develop efficient methods for similarity search in large databases of structured data and improve the efficiency of existing query processing techniques. For the data objects, only a tree or graph structure is assumed, which can be extended with arbitrary attribute information. Starting with an analysis of the demands of two example applications, several important requirements for similarity measures are identified. One aspect is the adaptability of the similarity search method to the requirements of the user and the application domain. This can even imply a change of the similarity measure between two successive queries of the same user. An explanation component that makes clear why the system considers objects similar is a necessary precondition for a purposeful adaptation of the measure. Consequently, the edit distance, well known from string processing, is a common similarity measure for graph-structured objects. Its popularity stems from its ability to visualize corresponding substructures and the possibility of weighting single operations.
But it turns out that the edit distance and similar measures for tree structures are computationally extremely complex, which makes them unsuitable for today's large and still growing databases. Therefore, we develop a multi-step query processing architecture that significantly reduces the number of necessary distance calculations. This is achieved by employing suitable filter methods. Furthermore, we show that by easing certain restrictions on the similarity measure, a significant performance gain can be obtained without reducing the quality of the measure. To achieve this, matchings of substructures (vertices or edges) of the data objects are determined. An additional cost function for those matchings makes it possible to derive a similarity measure for structured data, called the edge matching distance, from the cost-optimal matching of the substructures. But even for this new similarity measure, efficiency can be improved significantly by using a multi-step query processing approach. This allows the use of the edge matching distance for knowledge discovery applications in large databases. Within the thesis, the properties of our new similarity search methods are demonstrated both theoretically and through experiments.
Modern database applications are characterized mainly by two essential aspects: the use of complex data types with internal structure, and the need for new ways of searching the data. The focus of database use has shifted from simple queries to complex analyses of the stored data, known as knowledge discovery in databases. Important analysis techniques in this area include grouping the data into subsets (clustering), classifying new data objects with respect to the existing data, and detecting outliers in the data (outlier identification).
The basis of most methods for solving these tasks is determining the similarity of database objects. Efficient similarity search in large databases of structured objects is therefore an important basic operation for modern database applications. This doctoral thesis therefore develops efficient methods for similarity search in large sets of structured objects and significantly improves the efficiency of existing methods. Only a tree-like or, more generally, graph-like internal structure of the data objects is assumed, which is extended by arbitrary attributes. Starting from an analysis of the requirements on similarity search methods in two example applications from the areas of image search and protein docking, several important aspects of similarity search were identified. A first aspect is making the similarity measure adaptable by the user, since the underlying notion of similarity depends on both the user and the situation, which can go as far as a change of the similarity notion between two successive queries. A prerequisite for a targeted adaptation of the similarity notion is an explanation component that shows the user how a similarity value came about. The edit distance, known from string processing, is therefore a widely used measure for the similarity of graph-structured objects, because it allows individual operations to be weighted and provides an explanation component through a mapping between substructures of the structures being compared. It turns out, however, that computing the edit distance and comparable similarity measures for tree or graph structures is extremely time-consuming.
Therefore, a multi-step query processing model is developed first, which uses suitable filter steps to massively reduce the number of necessary distance computations and thus significantly increases query processing speed, or makes it acceptable for large data sets in the first place. In the next step, it is shown how relaxing some conditions on the similarity measure yields significant speedups without any loss in the quality of the query results. To this end, pairings of substructures (vertices or edges) of the objects being compared are determined and additionally weighted by a cost function. A pairing of all substructures that is optimal with respect to this cost function then yields a measure of the similarity of the compared objects, the so-called "edge matching distance". It turns out, however, that even for this new similarity measure, multi-step query processing together with corresponding novel filter methods allows a considerable performance gain. This is the prerequisite for applying the methods to knowledge discovery in large databases. The stated properties of the newly developed methods are demonstrated both theoretically and through practical experiments.
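
The multi-step (filter-and-refine) idea described in this abstract can be illustrated with a minimal sketch. It is not the thesis's method: strings stand in for tree- or graph-structured objects, the Levenshtein distance stands in for the expensive measure, and the length difference is an assumed example of a cheap lower-bounding filter. The function names `edit_distance`, `length_filter`, and `range_query` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (the 'expensive' step)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def length_filter(a: str, b: str) -> int:
    """Cheap lower bound: the edit distance is at least the length difference."""
    return abs(len(a) - len(b))


def range_query(query, database, radius):
    """Return (results, number of exact distance computations)."""
    results, exact_calls = [], 0
    for obj in database:
        if length_filter(query, obj) > radius:
            continue  # safely pruned: the lower bound already exceeds the radius
        exact_calls += 1
        if edit_distance(query, obj) <= radius:
            results.append(obj)
    return results, exact_calls


if __name__ == "__main__":
    db = ["tree", "trees", "graph", "subgraph", "t"]
    print(range_query("tree", db, radius=1))
```

Because the filter never overestimates the exact distance, pruning cannot discard a true result; it only saves exact computations.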

    Database Similarity Search in Metric Spaces: Limitations and Opportunities

    Generic database similarity search is one of the most challenging problems in current database research. Generic data are not simply structured data with several keys of numeric or alphabetic types, so traditional search algorithms that only check specified fields and keys are not effective. Similarity searches find the objects that are similar to a target under a specified similarity criterion. Tree-structured indexing techniques based on metric spaces are widely used to solve this problem. Existing methods can be divided into two categories: approaches based upon Voronoi partitions and approaches based upon reference points. The latter is the focus of this research. The problem of database similarity search using reference points in metric spaces is formulated, and the key issues are addressed. This research focuses upon two broad sets of open problems: analysis of the limitations of approaches to similarity search using metric spaces, and development of criteria that can be used to evaluate the opportunities for new design methods. The performance limitations of similarity search based on metric spaces are analyzed and proved to be imposed by statistical characteristics of the data collection. A new concept, the range threshold, is defined to evaluate the feasibility of tree-structured indexing techniques based upon reference points in metric spaces. A method to estimate the range threshold is provided, which makes it possible to check the feasibility of this approach for a data set prior to implementation. The opportunities for different approaches are evaluated by criteria based on search efficiency and utility. Comparisons of different Minkowski metrics and of data extraction methods using PCA (principal component analysis) are presented. Search utilities are demonstrated by examples. Several issues related to index tree structure are addressed. Experimental results show that a taller tree yields better performance.
All these results indicate that the approaches based upon reference points in metric spaces are promising.
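
How reference points allow objects to be skipped can be sketched with the triangle inequality: for any reference point p, |d(q, p) - d(o, p)| is a lower bound on d(q, o). The sketch below assumes Euclidean distance and a flat table of precomputed distances rather than the tree-structured index the abstract discusses; `build_pivot_table` and `pivot_range_query` are hypothetical names.

```python
import math


def build_pivot_table(data, pivots, dist=math.dist):
    """Precompute the distance from every object to every reference point."""
    return [[dist(obj, p) for p in pivots] for obj in data]


def pivot_range_query(query, data, pivots, table, radius, dist=math.dist):
    """Range query that prunes with the triangle inequality.

    Returns (results, number of object distances actually computed).
    """
    q_to_pivots = [dist(query, p) for p in pivots]
    results, computed = [], 0
    for obj, row in zip(data, table):
        # Best lower bound on d(query, obj) over all reference points.
        lower_bound = max(abs(qp - op) for qp, op in zip(q_to_pivots, row))
        if lower_bound > radius:
            continue  # obj cannot lie within the query radius
        computed += 1
        if dist(query, obj) <= radius:
            results.append(obj)
    return results, computed


if __name__ == "__main__":
    data = [(0, 0), (1, 0), (5, 5), (10, 0)]
    pivots = [(0, 0)]
    table = build_pivot_table(data, pivots)
    print(pivot_range_query((0.5, 0), data, pivots, table, radius=1))
```

The larger the query radius relative to the spread of the precomputed distances, the fewer objects the bound can exclude, which is one intuition for why pruning power depends on the statistics of the data collection.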

    Building Semantic Corpus from WordNet

    We propose a novel methodology for extracting semantic similarity knowledge from semi-structured sources, such as WordNet. Unlike existing approaches that only explore the structured information (e.g., the hypernym relationship in WordNet), we present a framework that allows us to utilize all available information, including natural language descriptions. Our approach constructs a semantic corpus, represented as a graph that models the relationships between phrases numerically. The data in the semantic corpus can be used to measure the similarity between phrases, to measure the similarity between documents, or to perform a semantic search over a set of documents that uses the meaning of words and phrases (i.e., a search that is not keyword-based).

    ADAMpro: Database Support for Big Multimedia Retrieval

    For supporting retrieval tasks within large multimedia collections, not only the sheer size of the data but also the complexity of the data and their associated metadata pose a challenge. Applications that have to deal with big multimedia collections need to manage the volume of data and to search within these data effectively and efficiently. When providing similarity search, a multimedia retrieval system has to consider the actual multimedia content, the corresponding structured metadata (e.g., content author, creation date, etc.) and, for providing similarity queries, the extracted low-level features stored as densely populated high-dimensional feature vectors. In this paper, we present ADAMpro, a combined database and information retrieval system that is particularly tailored to big multimedia collections. ADAMpro follows a modular architecture for storing structured metadata as well as the extracted feature vectors, and it provides various index structures, namely Locality-Sensitive Hashing, Spectral Hashing, and the VA-File, for fast retrieval in the context of a similarity search. Since similarity queries are often long-running, ADAMpro supports progressive queries that provide the user with streaming result lists by returning (possibly imprecise) results as soon as they become available. We provide the results of an evaluation of ADAMpro on the basis of several collection sizes up to 50 million entries and feature vectors with different numbers of dimensions.
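
The notion of a progressive query, which returns possibly imprecise results early and refines them over time, can be sketched with a generator. This is not ADAMpro's API or one of its index structures: it assumes a brute-force scan over in-memory vectors in fixed-size chunks, and `progressive_knn` is a hypothetical name.

```python
import heapq
import math


def progressive_knn(query, vectors, k, chunk_size):
    """Yield the best-k-so-far list after each scanned chunk.

    Early snapshots may be imprecise; the last one is the exact k-NN result.
    """
    best = []  # max-heap via negated distances: entries are (-distance, index)
    for start in range(0, len(vectors), chunk_size):
        for idx in range(start, min(start + chunk_size, len(vectors))):
            d = math.dist(query, vectors[idx])
            if len(best) < k:
                heapq.heappush(best, (-d, idx))
            elif d < -best[0][0]:
                # Closer than the current worst candidate: replace it.
                heapq.heapreplace(best, (-d, idx))
        # Snapshot sorted by ascending distance.
        yield sorted((-neg_d, idx) for neg_d, idx in best)


if __name__ == "__main__":
    vectors = [(5, 5), (4, 4), (0, 0), (1, 1)]
    for snapshot in progressive_knn((0, 0), vectors, k=2, chunk_size=2):
        print([idx for _, idx in snapshot])
```

A caller can display each snapshot as it arrives and stop consuming the generator once the results are good enough.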

    Ranked List Loss for Deep Metric Learning

    The objective of deep metric learning (DML) is to learn embeddings that can capture semantic similarity and dissimilarity information among data points. Existing pairwise or tripletwise loss functions used in DML are known to suffer from slow convergence due to a large proportion of trivial pairs or triplets as the model improves. To improve this, ranking-motivated structured losses have recently been proposed to incorporate multiple examples and exploit the structured information among them. They converge faster and achieve state-of-the-art performance. In this work, we unveil two limitations of existing ranking-motivated structured losses and propose a novel ranked list loss to solve both of them. First, given a query, only a fraction of data points is incorporated to build the similarity structure. Consequently, some useful examples are ignored and the structure is less informative. To address this, we propose to build a set-based similarity structure by exploiting all instances in the gallery. The learning setting can be interpreted as few-shot retrieval: given a mini-batch, every example is iteratively used as a query, and the remaining examples compose the gallery to search, i.e., the support set in the few-shot setting. The remaining examples are split into a positive set and a negative set. For every mini-batch, the learning objective of ranked list loss is to make the query closer to the positive set than to the negative set by a margin. Second, previous methods aim to pull positive pairs as close as possible in the embedding space. As a result, the intraclass data distribution tends to be extremely compressed. In contrast, we propose to learn a hypersphere for each class in order to preserve useful similarity structure inside it, which functions as regularisation. Extensive experiments demonstrate the superiority of our proposal by comparing with the state-of-the-art methods. Comment: Accepted to T-PAMI; the official version is available on IEEE Xplore.
Fine-grained image retrieval task. Our source code is available online: https://github.com/XinshaoAmosWang/Ranked-List-Loss-for-DM
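
The objective described in this abstract can be sketched as a hinge-style loss over a mini-batch. This is one plausible reading, not the authors' implementation: the parameter names `alpha` (distance that negatives should exceed) and `margin` (so that positives only need to fall within `alpha - margin`, leaving a hypersphere per class instead of collapsing it to a point), the default values, the unweighted averaging, and the use of plain Python without automatic differentiation are all assumptions.

```python
import math


def ranked_list_loss(embeddings, labels, alpha=1.2, margin=0.4):
    """Average hinge-style loss with every example used once as the query.

    Positives are penalized only when farther than (alpha - margin);
    negatives are penalized only when closer than alpha.
    """
    total = 0.0
    for i, (q, yq) in enumerate(zip(embeddings, labels)):
        pos_terms, neg_terms = [], []
        for j, (x, yx) in enumerate(zip(embeddings, labels)):
            if i == j:
                continue  # the query itself is not part of the gallery
            d = math.dist(q, x)
            if yx == yq:
                pos_terms.append(max(0.0, d - (alpha - margin)))
            else:
                neg_terms.append(max(0.0, alpha - d))
        if pos_terms:
            total += sum(pos_terms) / len(pos_terms)
        if neg_terms:
            total += sum(neg_terms) / len(neg_terms)
    return total / len(embeddings)


if __name__ == "__main__":
    separated = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]
    print(ranked_list_loss(separated, [0, 0, 1, 1]))
```

Once positives sit inside the radius and negatives outside `alpha`, the loss is zero, so well-placed examples stop contributing and positives are not squeezed any closer together.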

    Latent semantic analysis and cosine similarity for hadith search engine

    Search engine technology is used to find needed information easily, quickly, and efficiently, including information about the hadith, which is the second guideline of life for Muslims besides the Holy Qur'an. This study aimed to build a specialized search engine to find information about a complete and eleven hadith in the Indonesian language. In this research, the search engine worked by applying latent semantic analysis (LSA) and cosine similarity to the keywords entered. The LSA and cosine similarity methods were used to form structured representations of the text data and to calculate the similarity between the entered keyword text and the hadith text data, so that the hadith information returned matched what was searched for. Across 50 test runs, LSA and cosine similarity achieved a high success rate in finding hadith information, with an average recall of 87.83%, although only a small share of the retrieved results was semantically relevant, as indicated by an average precision of 36.25%.
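
The ranking and evaluation steps mentioned in this abstract can be sketched as follows. The sketch covers cosine similarity and the precision/recall computation only: the SVD-based dimensionality reduction of LSA is omitted, raw term counts stand in for the LSA representation, and the toy documents and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query, documents):
    """Return (score, index) pairs, best match first."""
    q = Counter(query.lower().split())
    scored = [(cosine_similarity(q, Counter(doc.lower().split())), i)
              for i, doc in enumerate(documents)]
    return sorted(scored, key=lambda s: (-s[0], s[1]))


def precision_recall(retrieved, relevant):
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    docs = ["prayer at night", "charity and fasting", "night prayer and fasting"]
    ranking = rank_documents("night prayer", docs)
    print(ranking)
    print(precision_recall([i for _, i in ranking], relevant=[0, 2]))
```

Returning every document, as the usage example does, gives full recall but lower precision, the same kind of trade-off the reported averages reflect.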

    Table Augmentation in Data Lakes

    Data lakes are centralized repositories that store large quantities of raw, unstructured, and structured data, allowing for ad-hoc data analysis, exploratory data analysis, and machine learning. However, the lack of metadata and schema in data lakes makes it challenging to work with tabular data and find related information stored in different tables. How to retrieve these tables efficiently at large scale under data lake settings remains an open problem. The thesis introduces a novel approach to table augmentation that enables efficient data integration from multiple sources in a data lake. Table augmentation involves adding new data to an existing table in a horizontal fashion, by retrieving tables that can be horizontally concatenated to a query table. The proposed approach, named TASH, consists of several components, including data lake hashing, join search, similarity, and augmentation. TASH is a framework based on a spatial index in which tables are mapped and queried. Its goal is to identify the most useful columns for subsequent machine learning tasks. The table retrieval process employs a combination of set containment search and similarity search. Candidate tables are initially identified using set containment search and then ranked based on their similarity to the query. Experimental results demonstrate that TASH can effectively identify joinable tables and select the most relevant features, thereby enabling efficient table augmentation in data lakes. This research contributes to the field of big data by providing a practical solution to the challenges of data integration and analysis in data lake environments.
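
The two-stage retrieval described in this abstract, set containment search followed by similarity ranking, can be sketched as below. This is not TASH itself: the sketch uses exact set operations instead of hashing or a spatial index, Jaccard similarity is an assumed choice for the ranking step, and the names `containment`, `jaccard`, `find_joinable`, and the `min_containment` threshold are hypothetical.

```python
def containment(query_col, cand_col):
    """Fraction of distinct query values that also occur in the candidate column."""
    q = set(query_col)
    return len(q & set(cand_col)) / len(q) if q else 0.0


def jaccard(a, b):
    """Jaccard similarity of the two columns' distinct value sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_joinable(query_col, lake, min_containment=0.5):
    """Stage 1: keep columns that contain enough of the query's values.
    Stage 2: rank the surviving candidates by similarity to the query."""
    candidates = [(name, col) for name, col in lake.items()
                  if containment(query_col, col) >= min_containment]
    return sorted(((jaccard(query_col, col), name) for name, col in candidates),
                  reverse=True)


if __name__ == "__main__":
    lake = {
        "t1.id": ["a", "b", "c", "d", "e"],
        "t2.id": ["a", "b", "x", "y", "z", "w"],
        "t3.id": ["q", "r"],
    }
    print(find_joinable(["a", "b", "c", "d"], lake))
```

Containment is asymmetric, so a large candidate column is not penalized for holding extra values, while the ranking step then favors candidates whose overall value set is closest to the query's.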