An examination of fast similarity search trees with gating
The emergence of complex data objects that must be indexed and queried in databases has created a need for access methods that are both generic and efficient. Traditional search algorithms that only check specified fields and keys are no longer effective. Tree-structured indexing techniques based on metric spaces are widely used to solve this problem. Unfortunately, these data structures can be slow, because computing the distance between two points in a metric space can be expensive. This thesis will explore data structures for the evaluation of range queries in general metric spaces. The performance limitations of metric spaces will be analyzed and opportunities for improvement will be discussed. It will culminate with the introduction of the Fast Similarity Search Tree as a viable alternative to existing methodologies.
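To illustrate why distance computations dominate the cost of range queries in metric spaces, the following is a minimal sketch of pivot-based pruning with the triangle inequality. It is not the Fast Similarity Search Tree from the thesis; the function names, the single pivot, and the Euclidean distance are assumptions chosen only for illustration.

```python
import math

def euclidean(a, b):
    # Stand-in for a possibly expensive metric distance (assumption).
    return math.dist(a, b)

def build_pivot_table(objects, pivot, dist):
    # Precompute d(pivot, o) for every object once, at index-build time.
    return [dist(pivot, o) for o in objects]

def range_query(objects, pivot, pivot_dists, query, radius, dist):
    # By the triangle inequality, |d(q, p) - d(p, o)| <= d(q, o).
    # If that lower bound already exceeds the radius, o cannot qualify,
    # so d(q, o) never has to be computed.
    d_qp = dist(query, pivot)
    computed = 1  # the query-to-pivot distance
    results = []
    for obj, d_po in zip(objects, pivot_dists):
        if abs(d_qp - d_po) > radius:
            continue  # pruned without a distance computation
        computed += 1
        if dist(query, obj) <= radius:
            results.append(obj)
    return results, computed
```

The second return value counts how many distances were actually evaluated, which is the quantity such index structures try to reduce.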
A Learned Index for Exact Similarity Search in Metric Spaces
Indexing is an effective way to support efficient query processing in large
databases. Recently the concept of learned index has been explored actively to
replace or supplement traditional index structures with machine learning models
to reduce storage and search costs. However, accurate and efficient similarity
query processing in high-dimensional metric spaces remains an open
challenge. In this paper, a novel indexing approach called LIMS is proposed to
use data clustering and pivot-based data transformation techniques to build
learned indexes for efficient similarity query processing in metric spaces. The
underlying data is partitioned into clusters such that each cluster follows a
relatively uniform data distribution. Data redistribution is achieved by
utilizing a small number of pivots for each cluster. Similar data are mapped
into compact regions and the mapped values are totally ordinal. Machine
learning models are developed to approximate the position of each data record
on the disk. Efficient algorithms are designed for processing range queries and
nearest neighbor queries based on LIMS, and for index maintenance with dynamic
updates. Extensive experiments on real-world and synthetic datasets demonstrate
the superiority of LIMS compared with traditional indexes and state-of-the-art
learned indexes.
Comment: 14 pages, 14 figures, submitted to Transactions on Knowledge and Data Engineering
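The idea of a model that approximates the position of a record can be sketched with a one-dimensional learned index. This is not LIMS itself: the class name, the single least-squares line, and the in-memory sorted list are assumptions made for illustration, and the sketch expects a non-empty, sorted list of numeric keys.

```python
import bisect

class LinearLearnedIndex:
    """Hypothetical sketch: a line predicts a key's position in a sorted
    list, and the recorded maximum error bounds the final search window."""

    def __init__(self, sorted_keys):
        self.keys = list(sorted_keys)
        n = len(self.keys)
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        # Least-squares fit of position as a function of key.
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Worst prediction error over the indexed keys.
        self.max_err = max(abs(i - self._predict(k))
                           for i, k in enumerate(self.keys))

    def _predict(self, key):
        return round(self.slope * key + self.intercept)

    def lookup(self, key):
        n = len(self.keys)
        guess = self._predict(key)
        # Clamp the search window to valid positions.
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < n and self.keys[i] == key else -1
```

The error bound is what keeps the lookup exact: any indexed key lies within `max_err` positions of its predicted slot, so a binary search over that window is sufficient.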
Implementation Notes for the Soft Cosine Measure
The standard bag-of-words vector space model (VSM) is efficient and ubiquitous in information retrieval, but it underestimates the similarity of documents with the same meaning but different terminology. To overcome this limitation, Sidorov et al. proposed the Soft Cosine Measure (SCM), which incorporates term similarity relations. Charlet and Damnati showed that the SCM is highly effective in question answering (QA) systems. However, the orthonormalization algorithm proposed by Sidorov et al. has an impractical time complexity of O(n^4), where n is the size of the vocabulary. In this paper, we prove a tighter, lower worst-case time complexity bound of O(n^3). We also present an algorithm for computing the similarity between documents and we show that its worst-case time complexity is O(1) given realistic conditions. Lastly, we describe implementation in general-purpose vector databases such as Annoy and Faiss, and in the inverted indices of text search engines such as Apache Lucene and ElasticSearch. Our results enable the deployment of the SCM in real-world information retrieval systems.
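For reference, the soft cosine of two bag-of-words vectors x and y under a term similarity matrix S is x^T S y / (sqrt(x^T S x) * sqrt(y^T S y)). The sketch below evaluates that definition directly in time quadratic in the vocabulary size; it is not the faster algorithm the paper describes, and the function name and the similarity values in the usage note are hypothetical.

```python
import math

def soft_cosine(x, y, S):
    # x, y: term-weight vectors over the same vocabulary.
    # S: square term similarity matrix, with S[i][j] the similarity
    #    of term i and term j (ones on the diagonal).
    def bilinear(a, b):
        return sum(a[i] * S[i][j] * b[j]
                   for i in range(len(a))
                   for j in range(len(b)))

    norm = math.sqrt(bilinear(x, x)) * math.sqrt(bilinear(y, y))
    return bilinear(x, y) / norm if norm else 0.0
```

With a two-term vocabulary and an assumed cross-term similarity of 0.5, `soft_cosine([1, 0], [0, 1], [[1, 0.5], [0.5, 1]])` gives 0.5, whereas the ordinary cosine of those two vectors is 0. With the identity matrix as S, the measure reduces to the ordinary cosine.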
Chemoinformatics Research at the University of Sheffield: A History and Citation Analysis
This paper reviews the work of the Chemoinformatics Research Group in the Department of Information Studies at the University of Sheffield, focusing particularly on the work carried out in the period 1985-2002. Four major research areas are discussed, involving the development of methods for: substructure searching in databases of three-dimensional structures, including both rigid and flexible molecules; the representation and searching of the Markush structures that occur in chemical patents; similarity searching in databases of both two-dimensional and three-dimensional structures; and compound selection and the design of combinatorial libraries. An analysis of citations to 321 publications from the Group shows that it attracted a total of 3725 residual citations during the period 1980-2002. These citations appeared in 411 different journals, and involved 910 different citing organizations from 54 different countries, thus demonstrating the widespread impact of the Group's work.
Efficient and Effective Similarity Search on Complex Objects
Due to the rapid development of computer technology and new methods for the extraction of data in the last few years, more and more database applications have emerged for which an efficient and effective similarity search is of great importance. Application areas of similarity search include multimedia, computer-aided engineering, marketing, image processing and many more. Of special interest is the task of finding similar objects in large amounts of data with complex representations, such as set-valued objects and tree-structured or graph-structured objects. The grouping of similar objects, known as clustering, is a fundamental analysis technique that makes it possible to search through extensive data sets.
The goal of this dissertation is to develop new efficient and effective methods for similarity search in large quantities of complex objects. Furthermore, the efficiency of existing density-based clustering algorithms is to be improved when applied to complex objects.
The first part of this work motivates the use of vector sets for similarity modeling. For this purpose, a metric distance function is defined, which is suitable for various application areas, but time-consuming to compute. Therefore, a filter-refinement technique is proposed to efficiently process range queries and k-nearest neighbor queries, two basic query types within the field of similarity search. Several filter distances are presented, which approximate the exact object distance and can be computed efficiently. Moreover, a multi-step query processing approach is described, which can be directly integrated into the well-known density-based clustering algorithms DBSCAN and OPTICS.
In the second part of this work, new application areas for density-based hierarchical clustering using OPTICS are discussed. A prototype is introduced, which has been developed for these new application areas and is based on the aforementioned similarity models and accelerated clustering algorithms for complex objects. This prototype facilitates interactive semi-automatic cluster analysis and allows visual search for similar objects in multimedia databases. Another prototype extends these concepts and enables the user to analyze multi-represented and multi-instance data. Finally, the problem of music genre classification is addressed as another application supporting multi-represented and multi-instance data objects. An extensive experimental evaluation examines efficiency and effectiveness of the presented techniques using real-world data and points out advantages in comparison to conventional approaches.
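The filter-refinement idea for range queries can be sketched as follows. This is a generic illustration rather than the dissertation's vector-set distance or its filters: the function names are hypothetical, the exact distance is assumed to be Euclidean, and the filter is the Euclidean distance over the first k coordinates, which never exceeds the full distance and therefore never discards a true result.

```python
import math

def exact_distance(a, b):
    # Stand-in for an expensive exact object distance (assumption).
    return math.dist(a, b)

def filter_distance(a, b, k=2):
    # Cheap lower bound: dropping coordinates can only shrink the
    # Euclidean distance, so filter_distance <= exact_distance.
    return math.dist(a[:k], b[:k])

def filter_refine_range_query(objects, query, radius):
    # Filter step: keep only objects the lower bound cannot rule out.
    candidates = [o for o in objects if filter_distance(query, o) <= radius]
    # Refinement step: evaluate the exact distance on candidates only.
    results = [o for o in candidates if exact_distance(query, o) <= radius]
    return results, len(candidates)
```

The candidate count returned alongside the results shows how many exact distance computations the filter step left over.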
Automated Protein Structure Classification: A Survey
Classification of proteins based on their structure provides a valuable
resource for studying protein structure, function and evolutionary
relationships. With the rapidly increasing number of known protein structures,
manual and semi-automatic classification is becoming ever more difficult and
prohibitively slow. Therefore, there is a growing need for automated, accurate
and efficient classification methods to generate classification databases or
increase the speed and accuracy of semi-automatic techniques. Recognizing this
need, several automated classification methods have been developed. In this
survey, we overview recent developments in this area. We classify different
methods based on their characteristics and compare their methodology, accuracy
and efficiency. We then present a few open problems and explain future
directions.
Comment: 14 pages, Technical Report CSRG-589, University of Toronto
Digital Image Access & Retrieval
The 33rd Annual Clinic on Library Applications of Data Processing, held at the University of Illinois at Urbana-Champaign in March of 1996, addressed the theme of "Digital Image Access & Retrieval." The papers from this conference cover a wide range of topics concerning digital imaging technology for visual resource collections. Papers covered three general areas: (1) systems, planning, and implementation; (2) automatic and semi-automatic indexing; and (3) preservation, with the bulk of the conference focusing on indexing and retrieval.
Published or submitted for publication