8,520 research outputs found
An efficient -means-type algorithm for clustering datasets with incomplete records
The -means algorithm is arguably the most popular nonparametric clustering
method but cannot generally be applied to datasets with incomplete records. The
usual practice then is to either impute missing values under an assumed
missing-completely-at-random mechanism or to ignore the incomplete records, and
apply the algorithm on the resulting dataset. We develop an efficient version
of the -means algorithm that allows for clustering in the presence of
incomplete records. Our extension is called -means and reduces to the
-means algorithm when all records are complete. We also provide
initialization strategies for our algorithm and methods to estimate the number
of groups in the dataset. Illustrations and simulations demonstrate the
efficacy of our approach in a variety of settings and patterns of missing data.
Our methods are also applied to the analysis of activation images obtained from
a functional Magnetic Resonance Imaging experiment.Comment: 21 pages, 12 figures, 3 tables, in press, Statistical Analysis and
Data Mining -- The ASA Data Science Journal, 201
Spiking Neural Networks for Inference and Learning: A Memristor-based Design Perspective
On metrics of density and power efficiency, neuromorphic technologies have
the potential to surpass mainstream computing technologies in tasks where
real-time functionality, adaptability, and autonomy are essential. While
algorithmic advances in neuromorphic computing are proceeding successfully, the
potential of memristors to improve neuromorphic computing have not yet born
fruit, primarily because they are often used as a drop-in replacement to
conventional memory. However, interdisciplinary approaches anchored in machine
learning theory suggest that multifactor plasticity rules matching neural and
synaptic dynamics to the device capabilities can take better advantage of
memristor dynamics and its stochasticity. Furthermore, such plasticity rules
generally show much higher performance than that of classical Spike Time
Dependent Plasticity (STDP) rules. This chapter reviews the recent development
in learning with spiking neural network models and their possible
implementation with memristor-based hardware
Maximum Roaming Multi-Task Learning
Multi-task learning has gained popularity due to the advantages it provides
with respect to resource usage and performance. Nonetheless, the joint
optimization of parameters with respect to multiple tasks remains an active
research topic. Sub-partitioning the parameters between different tasks has
proven to be an efficient way to relax the optimization constraints over the
shared weights, may the partitions be disjoint or overlapping. However, one
drawback of this approach is that it can weaken the inductive bias generally
set up by the joint task optimization. In this work, we present a novel way to
partition the parameter space without weakening the inductive bias.
Specifically, we propose Maximum Roaming, a method inspired by dropout that
randomly varies the parameter partitioning, while forcing them to visit as many
tasks as possible at a regulated frequency, so that the network fully adapts to
each update. We study the properties of our method through experiments on a
variety of visual multi-task data sets. Experimental results suggest that the
regularization brought by roaming has more impact on performance than usual
partitioning optimization strategies. The overall method is flexible, easily
applicable, provides superior regularization and consistently achieves improved
performances compared to recent multi-task learning formulations.Comment: Accepted at the 35th AAAI Conference on Artificial Intelligence (AAAI
2021
Multiclass Data Segmentation using Diffuse Interface Methods on Graphs
We present two graph-based algorithms for multiclass segmentation of
high-dimensional data. The algorithms use a diffuse interface model based on
the Ginzburg-Landau functional, related to total variation compressed sensing
and image processing. A multiclass extension is introduced using the Gibbs
simplex, with the functional's double-well potential modified to handle the
multiclass case. The first algorithm minimizes the functional using a convex
splitting numerical scheme. The second algorithm is a uses a graph adaptation
of the classical numerical Merriman-Bence-Osher (MBO) scheme, which alternates
between diffusion and thresholding. We demonstrate the performance of both
algorithms experimentally on synthetic data, grayscale and color images, and
several benchmark data sets such as MNIST, COIL and WebKB. We also make use of
fast numerical solvers for finding the eigenvectors and eigenvalues of the
graph Laplacian, and take advantage of the sparsity of the matrix. Experiments
indicate that the results are competitive with or better than the current
state-of-the-art multiclass segmentation algorithms.Comment: 14 page
Similarity search and data mining techniques for advanced database systems.
Modern automated methods for measurement, collection, and analysis of data in industry and science are providing more and more data with drastically increasing structure complexity. On the one hand, this growing complexity is justified by the need for a richer and more precise description of real-world objects, on the other hand it is justified by the rapid progress in measurement and analysis techniques that allow the user a versatile exploration of objects. In order to manage the huge volume of such complex data, advanced database systems are employed. In contrast to conventional database systems that support exact match queries, the user of these advanced database systems focuses on applying similarity search and data mining techniques.
Based on an analysis of typical advanced database systems — such as biometrical, biological, multimedia, moving, and CAD-object database systems — the following three challenging characteristics of complexity are detected: uncertainty (probabilistic feature vectors), multiple instances (a set of homogeneous feature vectors), and multiple representations (a set of heterogeneous feature vectors). Therefore, the goal of this thesis is to develop similarity search and data mining techniques that are capable of handling uncertain, multi-instance, and multi-represented objects.
The first part of this thesis deals with similarity search techniques. Object identification is a similarity search technique that is typically used for the recognition of objects from image, video, or audio data. Thus, we develop a novel probabilistic model for object identification. Based on it, two novel types of identification queries are defined. In order to process the novel query types efficiently, we introduce an index structure called Gauss-tree. In addition, we specify further probabilistic models and query types for uncertain multi-instance objects and uncertain spatial objects. Based on the index structure, we develop algorithms for an efficient processing of these query types. Practical benefits of using probabilistic feature vectors are demonstrated on a real-world application for video similarity search. Furthermore, a similarity search technique is presented that is based on aggregated multi-instance objects, and that is suitable for video similarity search. This technique takes multiple representations into account in order to achieve better effectiveness.
The second part of this thesis deals with two major data mining techniques: clustering and classification. Since privacy preservation is a very important demand of distributed advanced applications, we propose using uncertainty for data obfuscation in order to provide privacy preservation during clustering. Furthermore, a model-based and a density-based clustering method for multi-instance objects are developed. Afterwards, original extensions and enhancements of the density-based clustering algorithms DBSCAN and OPTICS for handling multi-represented objects are introduced. Since several advanced database systems like biological or multimedia database systems handle predefined, very large class systems, two novel classification techniques for large class sets that benefit from using multiple representations are defined. The first classification method is based on the idea of a k-nearest-neighbor classifier. It employs a novel density-based technique to reduce training instances and exploits the entropy impurity of the local neighborhood in order to weight a given representation. The second technique addresses hierarchically-organized class systems. It uses a novel hierarchical, supervised method for the reduction of large multi-instance objects, e.g. audio or video, and applies support vector machines for efficient hierarchical classification of multi-represented objects. User benefits of this technique are demonstrated by a prototype that performs a classification of large music collections.
The effectiveness and efficiency of all proposed techniques are discussed and verified by comparison with conventional approaches in versatile experimental evaluations on real-world datasets
Mutual Information-based Generalized Category Discovery
We introduce an information-maximization approach for the Generalized
Category Discovery (GCD) problem. Specifically, we explore a parametric family
of loss functions evaluating the mutual information between the features and
the labels, and find automatically the one that maximizes the predictive
performances. Furthermore, we introduce the Elbow Maximum Centroid-Shift
(EMaCS) technique, which estimates the number of classes in the unlabeled set.
We report comprehensive experiments, which show that our mutual
information-based approach (MIB) is both versatile and highly competitive under
various GCD scenarios. The gap between the proposed approach and the existing
methods is significant, more so when dealing with fine-grained classification
problems. Our code: https://github.com/fchiaroni/Mutual-Information-Based-GCD
- …