Robust PCA as Bilinear Decomposition with Outlier-Sparsity Regularization
Principal component analysis (PCA) is widely used for dimensionality
reduction, with well-documented merits in various applications involving
high-dimensional data, including computer vision, preference measurement, and
bioinformatics. In this context, the fresh look advocated here draws benefits from variable selection and compressive sampling to robustify PCA against outliers. A least-trimmed squares estimator of a low-rank bilinear factor analysis model is shown to be closely related to that obtained from an ℓ0-(pseudo)norm-regularized criterion encouraging sparsity in a matrix
explicitly modeling the outliers. This connection suggests robust PCA schemes
based on convex relaxation, which lead naturally to a family of robust
estimators encompassing Huber's optimal M-class as a special case. Outliers are
identified by tuning a regularization parameter, which amounts to controlling
sparsity of the outlier matrix along the whole robustification path of (group)
least-absolute shrinkage and selection operator (Lasso) solutions. Beyond its
neat ties to robust statistics, the developed outlier-aware PCA framework is
versatile enough to accommodate novel and scalable algorithms to: i) track the
low-rank signal subspace robustly, as new data are acquired in real time; and
ii) determine principal components robustly in (possibly) infinite-dimensional
feature spaces. Synthetic and real data tests corroborate the effectiveness of
the proposed robust PCA schemes, when used to identify aberrant responses in
personality assessment surveys, as well as unveil communities in social
networks, and intruders from video surveillance data.
Comment: 30 pages, submitted to IEEE Transactions on Signal Processing
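The convex-relaxation idea above can be sketched in a few lines: alternate a rank-r fit of the outlier-compensated data with entrywise soft-thresholding of the residual (the Lasso step), so the nonzero entries of the estimated sparse matrix flag outliers. This is a minimal illustration under assumed names (`robust_pca`, the scalar regularizer `lam`), not the paper's actual algorithm:

```python
import numpy as np

def robust_pca(X, rank=2, lam=1.0, n_iter=50):
    """Alternating minimization of ||X - L - O||_F^2 + 2*lam*||O||_1
    subject to rank(L) <= r. Nonzero entries of O flag outliers."""
    O = np.zeros_like(X)
    for _ in range(n_iter):
        # Low-rank fit to the outlier-compensated data (truncated SVD)
        U, s, Vt = np.linalg.svd(X - O, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Sparse outlier estimate: entrywise soft-thresholding of the residual
        R = X - L
        O = np.sign(R) * np.maximum(np.abs(R) - lam, 0.0)
    return L, O
```

Sweeping `lam` traces the robustification path mentioned in the abstract: large values declare no outliers, small values declare many.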
Application of statistical learning theory to plankton image analysis
Submitted to the Joint Program in Applied Ocean Science and Engineering
in partial fulfillment of the requirements for the degree of Doctor of Philosophy
At the Massachusetts Institute of Technology
and the Woods Hole Oceanographic Institution
June 2006
A fundamental problem in limnology and oceanography is the inability to quickly
identify and map distributions of plankton. This thesis addresses the problem by
applying statistical machine learning to video images collected by an optical sampler,
the Video Plankton Recorder (VPR). The research is focused on development
of a real-time automatic plankton recognition system to estimate plankton abundance.
The system includes four major components: pattern representation/feature
measurement, feature extraction/selection, classification, and abundance estimation.
After an extensive study on a traditional learning vector quantization (LVQ)
neural network (NN) classifier built on shape-based features and different pattern
representation methods, I developed a classification system combining multi-scale co-occurrence matrix features with a support vector machine (SVM) classifier. This new method
outperforms the traditional shape-based-NN classifier method by 12% in classification
accuracy. Subsequent plankton abundance estimates are improved in the regions of
low relative abundance by more than 50%.
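The multi-scale co-occurrence plus SVM pipeline described above can be sketched as follows, assuming scikit-learn for the classifier; the feature extractor, the set of offsets ("scales"), and the quantization levels are illustrative choices, not the thesis implementation:

```python
import numpy as np
from sklearn.svm import SVC

def glcm_features(img, levels=8, offsets=((0, 1), (1, 0), (0, 2), (2, 0))):
    """Gray-level co-occurrence matrices at several pixel offsets,
    normalized and flattened into one texture feature vector."""
    q = np.minimum((img * levels).astype(int), levels - 1)  # quantize gray levels
    feats = []
    for dr, dc in offsets:
        a = q[:q.shape[0] - dr, :q.shape[1] - dc].ravel()
        b = q[dr:, dc:].ravel()
        M = np.zeros((levels, levels))
        np.add.at(M, (a, b), 1.0)  # count co-occurring gray-level pairs
        feats.append((M / M.sum()).ravel())
    return np.concatenate(feats)

# Toy "plankton classes": smooth-gradient vs. checkerboard textures
rng = np.random.default_rng(1)
def sample(kind):
    base = (np.indices((16, 16)).sum(0) % 2 if kind
            else np.linspace(0, 1, 256).reshape(16, 16))
    return np.clip(base + 0.05 * rng.standard_normal((16, 16)), 0, 1)

X = np.array([glcm_features(sample(k)) for k in [0, 1] * 10])
y = np.array([0, 1] * 10)
clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
```

The texture classes separate cleanly in co-occurrence space, which is the property the thesis exploits for field-collected images.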
Neither the NN nor the SVM classifier has a rejection metric. In this thesis, two rejection metrics were developed. One was based on the Euclidean distance in the feature space for the NN classifier. The other used dual-classifier (NN and SVM) voting as its output. Using the dual-classification method alone yields abundance estimates almost as good as human labeling on a test bed of real-world data. However, the distance rejection metric for the NN classifier may be more useful when the training samples are not "good," i.e., representative of the field data.
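The two rejection metrics can be illustrated with minimal sketches (hypothetical helper names; the thesis details are not reproduced here): voting rejects a sample whenever the two classifiers disagree, and distance rejection discards samples that fall too far from any training prototype.

```python
import numpy as np

REJECT = -1  # label assigned to rejected samples

def dual_vote(pred_nn, pred_svm):
    """Dual-classifier voting: accept a sample only when the NN and
    SVM classifiers agree; otherwise reject it."""
    pred_nn, pred_svm = np.asarray(pred_nn), np.asarray(pred_svm)
    return np.where(pred_nn == pred_svm, pred_nn, REJECT)

def distance_reject(dists, labels, threshold):
    """Distance-based rejection for the NN classifier: reject samples whose
    nearest-prototype distance in feature space exceeds the threshold."""
    dists, labels = np.asarray(dists), np.asarray(labels)
    return np.where(dists <= threshold, labels, REJECT)
```

Rejected samples can then be excluded from (or corrected in) the abundance estimates rather than silently mislabeled.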
In summary, this thesis advances the state of the art in plankton recognition by demonstrating that multi-scale texture-based features are more suitable for classifying field-collected images. The system was verified on a very large real-world dataset in a systematic way for the first time. The accomplishments include developing a multi-scale co-occurrence matrix and support vector machine system, a dual-classification system, automatic correction of abundance estimates, and the ability to obtain accurate abundance estimates from real-time automatic classification. The methods developed are generic and are likely to work on a range of other image classification applications.
This work was supported by National Science Foundation Grant OCE-9820099 and the Woods Hole Oceanographic Institution academic programs.
Comprehensive Assessment of the Performance of Deep Learning Classifiers Reveals a Surprising Lack of Robustness
Reliable and robust evaluation methods are a necessary first step towards
developing machine learning models that are themselves robust and reliable.
Unfortunately, current evaluation protocols typically used to assess
classifiers fail to comprehensively evaluate performance as they tend to rely
on limited types of test data, and ignore others. For example, using the
standard test data fails to evaluate the predictions made by the classifier for
samples from classes it was not trained on. On the other hand, testing with
data containing samples from unknown classes fails to evaluate how well the
classifier can predict the labels for known classes. This article advocates
benchmarking performance using a wide range of different types of data and
using a single metric that can be applied to all such data types to produce a
consistent evaluation of performance. Using such a benchmark it is found that
current deep neural networks, including those trained with methods that are
believed to produce state-of-the-art robustness, are extremely vulnerable to
making mistakes on certain types of data. This means that such models will be
unreliable in real-world scenarios where they may encounter data from many
different domains, and that they are insecure as they can easily be fooled into
making the wrong decisions. It is hoped that these results will motivate the
wider adoption of more comprehensive testing methods that will, in turn, lead
to the development of more robust machine learning methods in the future.
Code is available at:
\url{https://codeberg.org/mwspratling/RobustnessEvaluation}
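One way to realize the single-metric idea above is to give unknown-class samples the same label the classifier uses for "reject," so that ordinary accuracy applies uniformly across all data types. This is an illustrative sketch, not the paper's actual metric or protocol:

```python
import numpy as np

UNKNOWN = -1  # label for samples from untrained classes, and the "reject" output

def unified_accuracy(y_true, y_pred):
    """Single metric across data types: a known-class sample is correct if its
    label is predicted; an unknown-class sample is correct if it is rejected."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def benchmark(predict, datasets):
    """Score a classifier on several test sets (standard, corrupted,
    unknown-class, ...) with the same metric, plus a pooled average."""
    scores = {name: unified_accuracy(y, predict(x))
              for name, (x, y) in datasets.items()}
    scores["overall"] = float(np.mean([scores[n] for n in datasets]))
    return scores
```

A model that scores well on the standard set but collapses on the unknown-class set is exposed immediately, which is the failure mode the article reports.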
Biometric Liveness Detection for the Fingerprint Recognition Technology
This work focuses on liveness detection for fingerprint recognition technology. The first part of this thesis describes biometrics, biometric systems, and liveness detection, and proposes a method for liveness detection based on the spectroscopic characteristics of human skin. The second part describes and summarizes the performed experiments. Finally, the results are discussed and further improvements are proposed.
Perceptual and semantic representations at encoding contribute to true and false recognition of objects
When encoding new episodic memories, visual and semantic processing are proposed to make distinct contributions to accurate memory and memory distortions. Here, we used functional magnetic resonance imaging (fMRI) and preregistered representational similarity analysis (RSA) to uncover the representations that predict true and false recognition of unfamiliar objects. Two semantic models captured coarse-grained taxonomic categories and specific object features, respectively, while two perceptual models embodied low-level visual properties. Twenty-eight female and male participants encoded images of objects during fMRI scanning, and later had to discriminate studied objects from similar lures and novel objects in a recognition memory test. Both perceptual and semantic models predicted true memory. When studied objects were later identified correctly, neural patterns corresponded to low-level visual representations of these object images in the early visual cortex, lingual, and fusiform gyri. In a similar fashion, alignment of neural patterns with fine-grained semantic feature representations in the fusiform gyrus also predicted true recognition. However, emphasis on coarser taxonomic representations predicted forgetting more anteriorly in the anterior ventral temporal cortex, left inferior frontal gyrus and, in an exploratory analysis, left perirhinal cortex. In contrast, false recognition of similar lure objects was associated with weaker visual analysis posteriorly in early visual and left occipitotemporal cortex. The results implicate multiple perceptual and semantic representations in successful memory encoding and suggest that fine-grained semantic as well as visual analysis contributes to accurate later recognition, while processing visual image detail is critical for avoiding false recognition errors.
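The RSA comparison above follows a standard recipe: build a representational dissimilarity matrix (RDM) from the response patterns of each model and each brain region, then rank-correlate their upper triangles. A minimal sketch assuming SciPy, with illustrative names; it is not the study's preregistered pipeline:

```python
import numpy as np
from scipy.stats import spearmanr

def rdm(patterns):
    """Representational dissimilarity matrix: 1 - Pearson correlation
    between the response patterns of every pair of stimuli.
    `patterns` has shape (n_stimuli, n_features)."""
    return 1.0 - np.corrcoef(patterns)

def rsa_score(neural_patterns, model_patterns):
    """Spearman correlation between the upper triangles of the neural
    and model RDMs -- the standard RSA model-fit statistic."""
    iu = np.triu_indices(neural_patterns.shape[0], k=1)
    rho, _ = spearmanr(rdm(neural_patterns)[iu], rdm(model_patterns)[iu])
    return rho
```

A high score for, say, a fine-grained semantic feature model in the fusiform gyrus is the kind of evidence the abstract reports for true recognition.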
Graph set data mining
Graphs are among the most versatile abstract data types in computer science. With the variety comes great adoption in various application fields, such as chemistry, biology, social analysis, logistics, and computer science itself. With the growing capacities of digital storage, the collection of large amounts of data has become the norm in many application fields. Data mining, i.e., the automated extraction of non-trivial patterns from data, is a key step to extract knowledge from these datasets and generate value. This thesis is dedicated to concurrent scalable data mining algorithms beyond traditional notions of efficiency for large-scale datasets of small labeled graphs; more precisely, structural clustering and representative subgraph pattern mining. It is motivated by, but not limited to, the need to analyze molecular libraries of ever-increasing size in the drug discovery process. Structural clustering makes use of graph theoretical concepts, such as (common) subgraph isomorphisms and frequent subgraphs, to model cluster commonalities directly in the application domain. It is considered computationally demanding for non-restricted graph classes and, with very few exceptions, prior algorithms are only suitable for very small datasets. This thesis discusses the first truly scalable structural clustering algorithm StruClus with linear worst-case complexity. At the same time, StruClus embraces the inherent values of structural clustering algorithms, i.e., interpretable, consistent, and high-quality results. A novel two-fold sampling strategy with stochastic error bounds for frequent subgraph mining is presented. It enables fast extraction of cluster commonalities in the form of common subgraph representative sets. StruClus is the first structural clustering algorithm with a directed selection of structural cluster-representative patterns regarding homogeneity and separation aspects in the high-dimensional subgraph pattern space.
Furthermore, a novel concept of cluster homogeneity balancing using dynamically-sized representatives is discussed. The second part of this thesis discusses the representative subgraph pattern mining problem in more general terms. A novel objective function maximizes the number of represented graphs for a cardinality-constrained representative set. It is shown that the problem is a special case of the maximum coverage problem and is NP-hard. Based on the greedy approximation of Nemhauser, Wolsey, and Fisher for submodular set function maximization, a novel sampling approach is presented. It mines candidate sets that contain an optimal greedy solution with a probabilistic maximum error. This leads to a constant-time algorithm to generate the candidate sets given a fixed-size sample of the dataset. In combination with a cheap single-pass streaming evaluation of the candidate sets, this enables scalability to datasets with billions of molecules on a single machine. Ultimately, the sampling approach leads to the first distributed subgraph pattern mining algorithm that distributes the pattern space and the dataset graphs at the same time.
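The greedy approximation of Nemhauser, Wolsey, and Fisher underlying the representative-set selection can be sketched directly on the maximum-coverage formulation: repeatedly pick the pattern that covers the most not-yet-represented graphs, which guarantees a (1 - 1/e) fraction of the optimal coverage. This is an illustrative sketch, not the thesis code:

```python
def greedy_max_coverage(candidates, k):
    """Greedy (1 - 1/e)-approximation for cardinality-constrained maximum
    coverage. `candidates` maps a subgraph pattern to the set of dataset
    graph ids it represents; `k` is the representative-set budget."""
    covered, chosen = set(), []
    for _ in range(k):
        # Pick the pattern with the largest marginal gain
        best = max(candidates, key=lambda p: len(candidates[p] - covered))
        gain = candidates[best] - covered
        if not gain:
            break  # every remaining pattern adds nothing
        chosen.append(best)
        covered |= gain
    return chosen, covered
```

The thesis's sampling approach avoids materializing `candidates` in full, mining only candidate sets that contain an optimal greedy solution up to a probabilistic error.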