205 research outputs found
Bridging Dense and Sparse Maximum Inner Product Search
Maximum inner product search (MIPS) over dense and sparse vectors have
progressed independently in a bifurcated literature for decades; the latter is
better known as top- retrieval in Information Retrieval. This duality exists
because sparse and dense vectors serve different end goals. That is despite the
fact that they are manifestations of the same mathematical problem. In this
work, we ask if algorithms for dense vectors could be applied effectively to
sparse vectors, particularly those that violate the assumptions underlying
top- retrieval methods. We study IVF-based retrieval where vectors are
partitioned into clusters and only a fraction of clusters are searched during
retrieval. We conduct a comprehensive analysis of dimensionality reduction for
sparse vectors, and examine standard and spherical KMeans for partitioning. Our
experiments demonstrate that IVF serves as an efficient solution for sparse
MIPS. As byproducts, we identify two research opportunities and demonstrate
their potential. First, we cast the IVF paradigm as a dynamic pruning technique
and turn that insight into a novel organization of the inverted index for
approximate MIPS for general sparse vectors. Second, we offer a unified regime
for MIPS over vectors that have dense and sparse subspaces, and show its
robustness to query distributions
Exact heat kernel on a hypersphere and its applications in kernel SVM
Many contemporary statistical learning methods assume a Euclidean feature
space. This paper presents a method for defining similarity based on
hyperspherical geometry and shows that it often improves the performance of
support vector machine compared to other competing similarity measures.
Specifically, the idea of using heat diffusion on a hypersphere to measure
similarity has been previously proposed, demonstrating promising results based
on a heuristic heat kernel obtained from the zeroth order parametrix expansion;
however, how well this heuristic kernel agrees with the exact hyperspherical
heat kernel remains unknown. This paper presents a higher order parametrix
expansion of the heat kernel on a unit hypersphere and discusses several
problems associated with this expansion method. We then compare the heuristic
kernel with an exact form of the heat kernel expressed in terms of a uniformly
and absolutely convergent series in high-dimensional angular momentum
eigenmodes. Being a natural measure of similarity between sample points
dwelling on a hypersphere, the exact kernel often shows superior performance in
kernel SVM classifications applied to text mining, tumor somatic mutation
imputation, and stock market analysis
Extension of the BRIEF descriptor for color images and its evaluation in robotic applications.
During the development of the project, new extensions of the BRIEF descriptor are defined based in the use of color information in different color spaces. They are evaluated against each other and the original BRIEF descriptor (that uses intensity pixel values in grayscale images) by means of a matching test of different points of two different images of the same scene. The selected extension is used to recognize objects in a real time robotic application by means of a classification method. Therefore, the target is to improve the BRIEF descriptor to be applicable in color images by defining the different color extensions and making a comparative evaluation to ensure selecting the best one that most improve the basic BRIEF descriptor with the use of color information. The speed and the recognition rate are computed for every extension and the basic BRIEF to compare the performance of each descriptor. Ones an extension is selected, it is used the Bag Of Features model to create an algorithm able to recognize objects. The BOF model detects the keypoints with the FAST algorithm, describes them by means of the selected extension of the BRIEF descriptor and classifies them using a linear Support Vector Machine. Finally, the algorithm is tested in different situations and conditions, varying illumination, a little bit rotation and background, and checking performance with occluded object
Compressive Spectral Clustering
Spectral clustering has become a popular technique due to its high
performance in many contexts. It comprises three main steps: create a
similarity graph between N objects to cluster, compute the first k eigenvectors
of its Laplacian matrix to define a feature vector for each object, and run
k-means on these features to separate objects into k classes. Each of these
three steps becomes computationally intensive for large N and/or k. We propose
to speed up the last two steps based on recent results in the emerging field of
graph signal processing: graph filtering of random signals, and random sampling
of bandlimited graph signals. We prove that our method, with a gain in
computation time that can reach several orders of magnitude, is in fact an
approximation of spectral clustering, for which we are able to control the
error. We test the performance of our method on artificial and real-world
network data.Comment: 12 pages, 2 figure
Design of New Dispersants Using Machine Learning and Visual Analytics
Artificial intelligence (AI) is an emerging technology that is revolutionizing the discovery of new materials. One key application of AI is virtual screening of chemical libraries, which enables the accelerated discovery of materials with desired properties. In this study, we developed computational models to predict the dispersancy efficiency of oil and lubricant additives, a critical property in their design that can be estimated through a quantity named blotter spot. We propose a comprehensive approach that combines machine learning techniques with visual analytics strategies in an interactive tool that supports domain experts’ decision-making. We evaluated the proposed models quantitatively and illustrated their benefits through a case study. Specifically, we analyzed a series of virtual polyisobutylene succinimide (PIBSI) molecules derived from a known reference substrate. Our best-performing probabilistic model was Bayesian Additive Regression Trees (BART), which achieved a mean absolute error of (Formula presented.) and a root mean square error of (Formula presented.), as estimated through 5-fold cross-validation. To facilitate future research, we have made the dataset, including the potential dispersants used for modeling, publicly available. Our approach can help accelerate the discovery of new oil and lubricant additives, and our interactive tool can aid domain experts in making informed decisions based on blotter spot and other key propertie
Real-Time Multi-Fisheye Camera Self-Localization and Egomotion Estimation in Complex Indoor Environments
In this work a real-time capable multi-fisheye camera self-localization and egomotion estimation framework is developed. The thesis covers all aspects ranging from omnidirectional camera calibration to the development of a complete multi-fisheye camera SLAM system based on a generic multi-camera bundle adjustment method
Learning to compress and search visual data in large-scale systems
The problem of high-dimensional and large-scale representation of visual data
is addressed from an unsupervised learning perspective. The emphasis is put on
discrete representations, where the description length can be measured in bits
and hence the model capacity can be controlled. The algorithmic infrastructure
is developed based on the synthesis and analysis prior models whose
rate-distortion properties, as well as capacity vs. sample complexity
trade-offs are carefully optimized. These models are then extended to
multi-layers, namely the RRQ and the ML-STC frameworks, where the latter is
further evolved as a powerful deep neural network architecture with fast and
sample-efficient training and discrete representations. For the developed
algorithms, three important applications are developed. First, the problem of
large-scale similarity search in retrieval systems is addressed, where a
double-stage solution is proposed leading to faster query times and shorter
database storage. Second, the problem of learned image compression is targeted,
where the proposed models can capture more redundancies from the training
images than the conventional compression codecs. Finally, the proposed
algorithms are used to solve ill-posed inverse problems. In particular, the
problems of image denoising and compressive sensing are addressed with
promising results.Comment: PhD thesis dissertatio
- …