
    Survey of Vector Database Management Systems

    There are now over 20 commercial vector database management systems (VDBMSs), all produced within the past five years. Yet embedding-based retrieval has been studied for over ten years, and similarity search for more than half a century. Driving this shift from algorithms to systems are new data-intensive applications, notably large language models, that demand vast stores of unstructured data coupled with reliable, secure, fast, and scalable query processing capability. A variety of new data management techniques now exist to address these needs, but there is no comprehensive survey that thoroughly reviews these techniques and systems. We start by identifying five main obstacles to vector data management, namely the vagueness of semantic similarity, the large size of vectors, the high cost of similarity comparison, the lack of a natural partitioning that can be used for indexing, and the difficulty of efficiently answering hybrid queries that require both attributes and vectors. Overcoming these obstacles has led to new approaches to query processing, storage and indexing, and query optimization and execution. For query processing, a variety of similarity scores and query types are now well understood; for storage and indexing, techniques include vector compression, namely quantization, and partitioning based on randomization, learned partitioning, and navigable partitioning; for query optimization and execution, we describe new operators for hybrid queries, as well as techniques for plan enumeration, plan selection, and hardware-accelerated execution. These techniques lead to a variety of VDBMSs across a spectrum of design and runtime characteristics, including native systems specialized for vectors and extended systems that incorporate vector capabilities into existing systems. We then discuss benchmarks, and finally we outline research challenges and point the direction for future work.
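
    As a rough illustration of the partitioning idea this survey covers, the sketch below (not taken from the survey; the partition count, probe count, and random data are arbitrary assumptions) builds an inverted-file style index: vectors are clustered around learned centroids, and a query probes only the few closest partitions, trading some recall for far fewer distance computations.

```python
# Hypothetical inverted-file (IVF) style partitioning for approximate
# nearest-neighbour search; an illustrative sketch, not any system's API.
import numpy as np

def train_partitions(vectors, n_partitions=16, n_iters=10, seed=0):
    """Learn partition centroids with a few Lloyd (k-means) iterations."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_partitions, replace=False)]
    for _ in range(n_iters):
        assign = ((vectors[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
        for c in range(n_partitions):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Final assignment defines the inverted lists (partition -> vector ids).
    assign = ((vectors[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
    lists = {c: np.flatnonzero(assign == c) for c in range(n_partitions)}
    return centroids, lists

def search(query, vectors, centroids, lists, k=5, n_probe=4):
    """Probe only the n_probe partitions whose centroids are closest to the query."""
    order = ((centroids - query) ** 2).sum(-1).argsort()[:n_probe]
    cand = np.concatenate([lists[c] for c in order])
    dists = ((vectors[cand] - query) ** 2).sum(-1)
    return cand[dists.argsort()[:k]]

if __name__ == "__main__":
    data = np.random.default_rng(1).normal(size=(2000, 64)).astype(np.float32)
    centroids, lists = train_partitions(data)
    print(search(data[0], data, centroids, lists))  # data[0] itself should rank first
```

    Raising n_probe moves the search back toward exact brute force; compressing the stored vectors with quantization would further shrink memory at the cost of approximate distances.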

    Deep Generative Models for Semantic Text Hashing

    As the amount of textual data has been rapidly increasing over the past decade, efficient similarity search methods have become a crucial component of large-scale information retrieval systems. A popular strategy is to represent original data samples by compact binary codes through hashing. A spectrum of machine learning methods have been utilized, but they often lack the expressiveness and modeling flexibility to learn effective representations. Recent advances of deep learning in a wide range of applications have demonstrated its capability to learn robust and powerful feature representations for complex data. In particular, deep generative models naturally combine the expressiveness of probabilistic generative models with the high capacity of deep neural networks, which makes them well suited to text modeling. However, little work has leveraged this recent progress in deep learning for text hashing. Meanwhile, most state-of-the-art semantic hashing approaches require large amounts of hand-labeled training data, which are often expensive and time consuming to collect; the cost of obtaining labeled data is the key bottleneck in deploying these hashing methods. Furthermore, most existing text hashing approaches treat each document separately and learn hash codes only from the content of the documents. In reality, documents are related to each other either explicitly, through an observed linkage such as citations, or implicitly, through unobserved connections such as adjacency in the original space. Such document relationships are pervasive in the real world, yet they are largely ignored in prior semantic hashing work. In this thesis, we propose a series of novel deep document generative models for text hashing to address the aforementioned challenges. Based on the deep generative modeling framework, our models employ deep neural networks to learn complex mappings from the original space to the hash space. We first introduce an unsupervised model for text hashing. We then introduce supervised models that utilize document labels/tags and account for document-specific factors that affect the generation of words. To address the lack of labeled data, we employ unsupervised ranking methods such as BM25 to extract weak signals from training data, and we propose two deep generative semantic hashing models that leverage these weak signals. Finally, we propose node2hash, an unsupervised deep generative model for semantic text hashing that utilizes graph context; it is designed to incorporate both document content and connection information through a probabilistic formulation. The probabilistic generative formulation of the proposed models provides a principled framework for model extension, uncertainty estimation, simulation, and interpretability. Based on variational inference and reparameterization, the proposed models can be interpreted as encoder-decoder deep neural networks and are thus capable of learning complex nonlinear distributed representations of the original documents. We conduct a comprehensive set of experiments on various public testbeds. The experimental results demonstrate the effectiveness of the proposed models over competitive baselines.
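
    To make the encoder-decoder view concrete, here is a minimal sketch of a variational text-hashing model in the spirit described above (assumptions: PyTorch, bag-of-words inputs, a Gaussian latent with the reparameterization trick, and per-bit median thresholding for the binary codes; this is an illustration, not the thesis implementation).

```python
# Sketch of a variational encoder-decoder for semantic hashing (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticHasher(nn.Module):
    def __init__(self, vocab_size, code_bits=32, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, code_bits)
        self.logvar = nn.Linear(hidden, code_bits)
        self.decoder = nn.Linear(code_bits, vocab_size)

    def forward(self, bow):
        h = self.encoder(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        log_probs = F.log_softmax(self.decoder(z), dim=-1)
        recon = -(bow * log_probs).sum(-1).mean()                 # multinomial word likelihood
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon + kl, mu

    @torch.no_grad()
    def hash_codes(self, bow):
        # Binarize the latent mean at the per-bit batch median to get compact codes.
        mu = self.mu(self.encoder(bow))
        return (mu > mu.median(dim=0).values).to(torch.uint8)
```

    Supervised or weakly supervised variants could add a term to the loss that ties the latent codes to labels or to BM25-derived neighbour signals; graph-aware variants could additionally decode the observed links between documents.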

    Next Generation Indexing for Genomic Intervals

    Di4 (1D intervals incremental inverted index) is a multi-resolution, single-dimension indexing framework for efficient, scalable, and extensible computation of genomic interval expressions. The framework has a tri-layer architecture: the semantic layer provides orthogonal and generic means (including support for user-defined functions) of sense-making and higher-level reasoning over region-based datasets; the logical layer provides building blocks for region calculus and topological relations between intervals; the physical layer abstracts from the persistence technology and makes the model adaptable to a variety of persistence technologies, spanning from small-scale (e.g., B+tree) to large-scale (e.g., LevelDB). The extensibility of Di4 to application scenarios is shown with an example of comparative evaluation of ChIP-seq and DNase-seq replicates. Performance of Di4 is benchmarked at small and large scale under common bioinformatics application scenarios. Di4 is freely available from https://genometric.github.io/Di4.
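
    The name "1D intervals incremental inverted index" suggests the following idea, sketched below purely as an illustration (this is not Di4's actual implementation): every distinct interval boundary becomes an index key whose posting list records the intervals open at that position, so an overlap query reduces to a range lookup over those keys.

```python
# Illustrative sketch of an inverted index over 1D interval boundaries.
from bisect import bisect_left, bisect_right
from collections import defaultdict

def build_index(intervals):
    """intervals: list of (start, end, id) with half-open [start, end)."""
    events = defaultdict(list)
    for start, end, iid in intervals:
        events[start].append(("open", iid))
        events[end].append(("close", iid))
    keys, snapshots, open_set = [], [], set()
    for pos in sorted(events):
        for kind, iid in events[pos]:
            (open_set.add if kind == "open" else open_set.discard)(iid)
        keys.append(pos)                      # boundary key
        snapshots.append(frozenset(open_set)) # intervals open from this boundary on
    return keys, snapshots

def overlapping(keys, snapshots, q_start, q_end):
    """Return ids of indexed intervals overlapping [q_start, q_end)."""
    lo = bisect_right(keys, q_start) - 1      # snapshot in force at q_start
    hi = bisect_left(keys, q_end)             # boundaries strictly before q_end
    hits = set()
    for snap in snapshots[max(lo, 0):hi]:
        hits |= snap
    return hits

if __name__ == "__main__":
    keys, snaps = build_index([(1, 5, "a"), (3, 8, "b"), (10, 12, "c")])
    print(overlapping(keys, snaps, 4, 11))    # {'a', 'b', 'c'}
```

    In a persistent setting, the sorted boundary keys and their posting lists map naturally onto ordered key-value stores such as a B+tree or LevelDB, which matches the role of the physical layer described above.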

    Modular and ontogenetic evolution of virtual organisms

    The increase of computational power and the development of new methods in artificial intelligence now allow many real-world problems to be solved automatically by a computer program without human interaction. This includes the automated design of walking robots in a simulated physical environment, which can eventually result in the construction of real robots. This work compares two different approaches to evolving virtual robotic organisms: artificial ontogeny, where the organism first grows through an artificial ontogenetic process, and more direct methods without such a process. Building on this comparison, it proposes a novel approach to evolving virtual robotic organisms: Hypercube-based artificial ontogeny, a combination of artificial ontogeny and Hypercube-based neuroevolution of augmenting topologies (HyperNEAT).