Tabu search model selection for SVM
A model selection method based on tabu search is proposed to build support vector machines (binary decision functions) of reduced complexity and efficient generalization. The aim is to build a fast and efficient support vector machine classifier. A criterion is defined to evaluate the quality of the decision function, blending the recognition rate and the complexity of the binary decision function. The simplification level (via vector quantization), a feature subset, and the support vector machine hyperparameters are selected by tabu search to optimize the defined quality criterion, so that a good sub-optimal model is found in tractable time.
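The selection loop described above can be sketched in a few lines. The hyperparameter grid and the quality criterion `crit` below are toy stand-ins, not the paper's blended recognition-rate/complexity criterion:

```python
import itertools

def tabu_search(candidates, score, n_iters=50, tabu_size=5):
    """Generic tabu search over a discrete candidate grid.

    `score` is the quality criterion to minimize (in the paper it
    blends recognition rate and decision-function complexity).
    """
    current = candidates[0]
    best, best_score = current, score(current)
    tabu = [current]
    for _ in range(n_iters):
        # Neighbors: candidates differing from the current one
        # in exactly one coordinate, and not on the tabu list.
        neighbors = [c for c in candidates
                     if c not in tabu
                     and sum(a != b for a, b in zip(c, current)) == 1]
        if not neighbors:
            break
        current = min(neighbors, key=score)  # best non-tabu move
        tabu.append(current)
        if len(tabu) > tabu_size:
            tabu.pop(0)                      # forget oldest tabu entry
        if score(current) < best_score:
            best, best_score = current, score(current)
    return best, best_score

# Toy grid: (C, gamma, codebook size); values are illustrative only.
grid = list(itertools.product([0.1, 1, 10, 100],
                              [0.01, 0.1, 1],
                              [8, 16, 32]))
# Hypothetical criterion: error-like term plus a complexity penalty.
crit = lambda p: (p[0] - 1) ** 2 + (p[1] - 0.1) ** 2 + 0.01 * p[2]
best, s = tabu_search(grid, crit)
```

Unlike plain hill climbing, the tabu list lets the search accept non-improving moves without immediately cycling back, which is what makes it usable for the non-convex model-selection landscape described above.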
Distributed multi-label learning on Apache Spark
This thesis proposes a series of multi-label learning algorithms for classification and feature selection implemented on the Apache Spark distributed computing model. Five approaches for determining the optimal architecture to speed up multi-label learning methods are presented, ranging from local parallelization using threads to distributed computing using independent or shared memory spaces. It is shown that the optimal approach performs hundreds of times faster than the baseline method. Three distributed multi-label k-nearest-neighbors methods built on top of the Spark architecture are proposed: an exact iterative method that computes pairwise distances, an approximate tree-based method that indexes the instances across multiple nodes, and an approximate locality-sensitive hashing method that builds multiple hash tables to index the data. The results indicate that the predictions of the tree-based method are on par with those of the exact method while reducing the execution time in all scenarios. This method is then used to evaluate the quality of a selected feature subset. The optimal adaptation of a multi-label feature selection criterion is discussed, and two distributed feature selection methods for multi-label problems are proposed: one that selects the feature subset maximizing the Euclidean norm of individual information measures, and one that selects the subset maximizing the geometric mean. The results indicate that each method excels in different scenarios depending on the type of features and the number of labels. Rigorous experimental studies and statistical analyses over many multi-label metrics and datasets confirm that the proposals achieve better performance and scale better to larger data than state-of-the-art methods.
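A single-machine sketch of the exact pairwise-distance variant of multi-label kNN may help fix ideas; the function name and the majority-voting rule are illustrative, and the thesis distributes the distance computation across Spark partitions rather than running it locally:

```python
import numpy as np

def ml_knn_predict(X_train, Y_train, X_test, k=3):
    """Exact multi-label kNN: pairwise distances, then label voting.

    Y_train is a binary indicator matrix (n_samples x n_labels).
    A label is predicted when more than half of the k nearest
    neighbors carry it.
    """
    # Pairwise squared Euclidean distances (test x train).
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :k]   # k nearest per test point
    votes = Y_train[idx].sum(axis=1)      # label counts among neighbors
    return (votes > k / 2).astype(int)

# Toy data: two well-separated groups with two labels.
X_tr = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
Y_tr = np.array([[1, 0], [1, 0], [0, 1], [1, 1]])
X_te = np.array([[0., 0.5], [5., 5.5]])
pred = ml_knn_predict(X_tr, Y_tr, X_te, k=2)
```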
Efficient Image and Video Representations for Retrieval
Image (video) retrieval is the problem of retrieving images (videos) similar to a given query. Images (videos) are represented in an input (feature) space, and similar items are obtained by finding nearest neighbors in that representation space. Numerous input representations, in both real-valued and binary spaces, have been proposed for faster retrieval. In this thesis, we present techniques that obtain improved input representations for retrieval in both supervised and unsupervised settings, for images and videos.
Supervised retrieval is the well-known problem of retrieving images of the same class as the query. In the first part, we address the practical aspects of achieving faster retrieval with binary codes as input representations in the supervised setting, where binary codes are used as addresses into hash tables. In practice, using binary codes as addresses does not guarantee fast retrieval, as similar images are not mapped to the same binary code (address). We address this problem by presenting an efficient supervised hashing (binary encoding) method that aims to explicitly map all images of the same class to a unique binary code. We refer to the binary codes of the images as 'semantic binary codes' and the unique code shared by all images of a class as the 'class binary code'. We also propose a new class-based Hamming metric that dramatically reduces retrieval times for larger databases, since Hamming distances are computed only to the class binary codes. We further propose a deep semantic binary code model, obtained by replacing the output layer of a popular convolutional neural network (AlexNet) with the class binary codes, and show that the hashing functions learned in this way outperform the state of the art while providing fast retrieval times.
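The speed-up from the class-based metric comes from ranking a handful of class codes instead of scanning every database item. A minimal sketch (function name and codes are invented for illustration):

```python
import numpy as np

def class_hamming_retrieval(query_code, class_codes):
    """Rank classes by Hamming distance between the query's binary
    code and each class binary code, instead of comparing against
    every database item."""
    # Hamming distance = number of differing bits (codes are 0/1).
    dists = (query_code[None, :] != class_codes).sum(axis=1)
    return np.argsort(dists, kind="stable"), dists

# Three classes, 4-bit class binary codes (toy values).
class_codes = np.array([[0, 0, 1, 1],
                        [1, 1, 0, 0],
                        [1, 0, 1, 0]])
query = np.array([0, 0, 1, 0])
order, d = class_hamming_retrieval(query, class_codes)
```

Only one Hamming comparison per class is needed, so retrieval cost grows with the number of classes rather than the database size.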
In the second part, we address the problem of supervised retrieval by taking into account the relationships between classes. For a given query image, we want to retrieve images that preserve a relative order: all images of the same class first, then images of related classes, before images of different classes. We learn such relationship-aware binary codes by minimizing the difference between the inner product of the binary codes and the similarity between the classes. We calculate the similarity between classes using output embedding vectors, which are vector representations of classes. Our method deviates from other supervised binary encoding schemes in that it is the first to use output embeddings for learning hashing functions. We also introduce new performance metrics that take related-class retrieval results into account and show significant gains over the state of the art.
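The objective can be illustrated in a few lines; the embeddings, the candidate codes, and the 4-bit code length below are made up for the example and are not the thesis's learned values:

```python
import numpy as np

# Hypothetical output embeddings for 3 classes (e.g. word vectors);
# classes 0 and 1 are related, class 2 is unrelated.
E = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
E /= np.linalg.norm(E, axis=1, keepdims=True)
S = E @ E.T                       # class-to-class similarity matrix

# Candidate class binary codes in {-1, +1}, b = 4 bits.
B = np.array([[ 1,  1,  1, -1],
              [ 1,  1, -1, -1],
              [-1, -1,  1,  1]])
b = B.shape[1]

# Objective from the text: make the (scaled) inner products of the
# binary codes match the class similarities.
loss = ((B @ B.T / b - S) ** 2).sum()
```

Minimizing this loss over the codes pushes related classes (high entries of S) toward codes with large inner products, which is what yields the desired retrieval order.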
High-dimensional descriptors such as Fisher Vectors or Vectors of Locally Aggregated Descriptors have been shown to improve the performance of many computer vision applications, including retrieval. In the third part, we discuss an unsupervised technique for compressing high-dimensional vectors into high-dimensional binary codes to reduce storage complexity. In this approach, we deviate from traditional hyperplane hashing functions and instead learn hyperspherical hashing functions. The proposed method overcomes the computational challenges of directly applying the spherical hashing algorithm, which is intractable for compressing high-dimensional vectors. A practical hierarchical model that uses divide-and-conquer techniques, via the Random Select and Adjust (RSA) procedure, to compress such high-dimensional vectors is presented. We show that our high-dimensional binary codes outperform the binary codes obtained with traditional hyperplane methods at higher compression ratios.
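The core binarization rule of hyperspherical hashing is simple to state: each bit records whether the point falls inside one hypersphere. The pivots and radii below are fixed toy values; the thesis learns them (hierarchically, via the RSA procedure) for high-dimensional inputs:

```python
import numpy as np

def spherical_hash(X, pivots, radii):
    """Hyperspherical hashing: bit k is 1 iff the point lies inside
    the k-th hypersphere (distance to its pivot below its radius)."""
    # Distances from every point to every pivot: shape (n, n_bits).
    d = np.linalg.norm(X[:, None, :] - pivots[None, :, :], axis=2)
    return (d < radii[None, :]).astype(int)

# Two hyperspheres in 2-D for illustration (real use is high-D).
pivots = np.array([[0.0, 0.0], [3.0, 3.0]])
radii = np.array([1.5, 1.5])
X = np.array([[0.5, 0.5], [3.2, 2.9], [1.0, 3.0]])
codes = spherical_hash(X, pivots, radii)
```

Bounded regions (sphere interiors) can enclose clusters more tightly than hyperplane half-spaces, which is the intuition behind preferring spheres at high compression ratios.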
In the last part of the thesis, we propose a retrieval-based solution to the zero-shot event classification problem, a setting in which no training videos are available for the event. To do this, we learn a generic set of concept detectors and represent both videos and query events in the concept space. We then compute the similarity between the query event and each video in the concept space, and videos similar to the query event are classified as belonging to the event. We show that concept features from other modalities significantly boost performance.
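The concept-space matching step reduces to a similarity ranking; a minimal sketch using cosine similarity (the concept names, detector scores, and event vector below are invented for illustration):

```python
import numpy as np

def zero_shot_scores(video_concepts, event_concepts):
    """Score each video against a query event by cosine similarity
    in a shared concept space."""
    v = video_concepts / np.linalg.norm(video_concepts,
                                        axis=1, keepdims=True)
    e = event_concepts / np.linalg.norm(event_concepts)
    return v @ e

# Concept detector outputs per video, e.g. [dog, skateboard, crowd].
videos = np.array([[0.9, 0.8, 0.1],
                   [0.1, 0.0, 0.9]])
# Query event described in the same concept space.
event = np.array([1.0, 1.0, 0.0])
scores = zero_shot_scores(videos, event)
```

Videos with the highest scores are assigned to the event, so no event-specific training videos are ever required, only the generic concept detectors.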
A Survey of Adaptive Resonance Theory Neural Network Models for Engineering Applications
This survey samples from the ever-growing family of adaptive resonance theory
(ART) neural network models used to perform the three primary machine learning
modalities, namely, unsupervised, supervised and reinforcement learning. It
comprises a representative list from classic to modern ART models, thereby
painting a general picture of the architectures developed by researchers over
the past 30 years. The learning dynamics of these ART models are briefly
described, and their distinctive characteristics such as code representation,
long-term memory and corresponding geometric interpretation are discussed.
Useful engineering properties of ART (speed, configurability, explainability,
parallelization and hardware implementation) are examined along with current
challenges. Finally, a compilation of online software libraries is provided. It
is expected that this overview will be helpful to new and seasoned ART
researchers.
Kernel Methods in Computer-Aided Constructive Drug Design
A drug is typically a small molecule that interacts with the binding site of some
target protein. Drug design involves the optimization of this interaction so that the
drug effectively binds with the target protein while not binding with other proteins
(an event that could produce dangerous side effects). Computational drug design
involves the geometric modeling of drug molecules, with the goal of generating
similar molecules that will be more effective drug candidates. It is necessary that
algorithms incorporate strategies to measure molecular similarity by comparing
molecular descriptors that may involve dozens to hundreds of attributes. We use
kernel-based methods to define these measures of similarity. Kernels are general
functions that can be used to formulate similarity comparisons.
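As a concrete illustration of a kernel-based similarity measure, the Gaussian (RBF) kernel is one common choice for comparing descriptor vectors; the descriptor values and the `gamma` setting below are toy assumptions, not the thesis's actual descriptors:

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: similarity decays with squared
    Euclidean distance between descriptor vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * d2)

# Toy molecular descriptors: similar molecules score near 1,
# dissimilar ones near 0.
m1 = [1.0, 0.2, 3.0]
m2 = [1.1, 0.2, 3.1]   # close to m1
m3 = [5.0, 4.0, 0.0]   # far from m1
```

Because the kernel depends only on the descriptor vectors, the same similarity machinery applies whether the descriptors have dozens or hundreds of attributes.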
The overall goal of this thesis is to develop effective and efficient computational
methods that are reliant on transparent mathematical descriptors of molecules with
applications to affinity prediction, detection of multiple binding modes, and generation
of new drug leads. While in this thesis we derive computational strategies for
the discovery of new drug leads, our approach differs from the traditional ligand-based
approach. We have developed novel procedures to calculate inverse mappings
and subsequently recover the structure of a potential drug lead. The contributions
of this thesis are the following:
1. We propose a vector space model molecular descriptor (VSMMD) that is suitable
for kernel studies in QSAR modeling.
Our experiments have provided convincing comparative empirical evidence
that our descriptor formulation in conjunction with kernel based regression
algorithms can provide sufficient discrimination to predict various biological
activities of a molecule with reasonable accuracy.
2. We present a new component selection algorithm KACS (Kernel Alignment
Component Selection) based on kernel alignment for a QSAR study. Kernel
alignment has been developed as a measure of similarity between two kernel
functions. In our algorithm, we refine kernel alignment as an evaluation tool,
using recursive component elimination to eventually select the most important
components for classification. We have demonstrated empirically and proven
theoretically that our algorithm works well for finding the most important
components in different QSAR data sets.
3. We extend the VSMMD in conjunction with a kernel based clustering algorithm
to the prediction of multiple binding modes, a challenging area of
research that has been previously studied by means of time-consuming docking
simulations. The results reported in this study provide strong empirical
evidence that our strategy has enough resolving power to distinguish multiple
binding modes through the use of a standard k-means algorithm.
4. We develop a set of reverse engineering strategies for QSAR modeling based
on our VSMMD. These strategies include:
(a) The use of a kernel feature space algorithm to design or modify descriptor
image points in a feature space.
(b) The deployment of a pre-image algorithm to map the newly defined
descriptor image points in the feature space back to the input space of
the descriptors.
(c) The design of a probabilistic strategy to convert new descriptors to meaningful
chemical graph templates.
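The kernel alignment measure underlying contribution 2 can be stated in a few lines: it is the normalized Frobenius inner product between two Gram matrices. The toy data and labels below are invented for illustration:

```python
import numpy as np

def kernel_alignment(K1, K2):
    """Empirical kernel alignment between two kernel (Gram)
    matrices: <K1, K2>_F / (||K1||_F * ||K2||_F)."""
    num = (K1 * K2).sum()
    return num / (np.linalg.norm(K1) * np.linalg.norm(K2))

# Align a linear-kernel Gram matrix with the ideal target y y^T,
# whose entries are +1 for same-class pairs and -1 otherwise.
X = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
K = X @ X.T
align = kernel_alignment(K, np.outer(y, y))
```

Recursive component elimination, as in the KACS algorithm described above, would repeatedly drop the descriptor component whose removal least degrades (or most improves) this alignment score.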
The most important aspect of these contributions is the presentation of strategies that actually generate the structure of a new drug candidate. While the training set is still used to generate a new image point in the feature space, the reverse engineering strategies just described allow us to develop a new drug candidate that is independent of issues related to probability distribution constraints placed on test-set molecules.
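The clustering step in contribution 3 relies on standard k-means once molecules are embedded as descriptor vectors; a minimal sketch with toy two-dimensional data standing in for the high-dimensional VSMMD descriptors:

```python
import numpy as np

def kmeans(X, k=2, n_iters=20, seed=0):
    """Plain k-means: alternate nearest-center assignment and
    center recomputation (a kernelized variant would operate on
    feature-space distances instead)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest center.
        d2 = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = np.argmin(d2, axis=1)
        # Recompute each center as the mean of its points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Two well-separated groups, standing in for two binding modes.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels = kmeans(X, k=2)
```

If the descriptor has enough resolving power, as the study above reports, the recovered clusters correspond to distinct binding modes without any docking simulation.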