67 research outputs found

    Sparse machine learning methods with applications in multivariate signal processing

    This thesis details theoretical and empirical work that draws on two main subject areas: Machine Learning (ML) and Digital Signal Processing (DSP). A unified general framework is given for the application of sparse machine learning methods to multivariate signal processing. In particular, methods that enforce sparsity are employed for reasons of computational efficiency, regularisation, and compressibility. The methods presented can be seen as modular building blocks that can be applied to a variety of applications. Application-specific prior knowledge can be used in various ways, resulting in a flexible and powerful set of tools. The motivation for the methods is to be able to learn and generalise from a set of multivariate signals. In addition to testing on benchmark datasets, a series of empirical evaluations on real-world datasets was carried out. These included: the classification of musical genre from polyphonic audio files; a study of how the sampling rate in a digital radar can be reduced through the use of Compressed Sensing (CS); analysis of human perception of different modulations of musical key from Electroencephalography (EEG) recordings; and classification of the genre of musical pieces to which a listener is attending from Magnetoencephalography (MEG) brain recordings. These applications demonstrate the efficacy of the framework and highlight interesting directions for future research.
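    The abstract above relies on sparsity-enforcing methods and on Compressed Sensing. As a rough illustration only, and not the thesis's own algorithm, the sketch below uses the standard iterative shrinkage-thresholding algorithm (ISTA) to solve an L1-regularised least-squares problem from fewer measurements than unknowns. The problem sizes, the random measurement matrix, and the value of `lam` are assumptions chosen for the example.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t (proximal operator of t * ||.||_1)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_objective(A, y, x, lam):
    """0.5 * ||Ax - y||^2 + lam * ||x||_1."""
    return 0.5 * np.sum((A @ x - y) ** 2) + lam * np.sum(np.abs(x))

def ista(A, y, lam=0.05, n_iter=500):
    """Iterative shrinkage-thresholding for the Lasso objective above."""
    # Step size 1/L, where L is the Lipschitz constant of the smooth part's gradient.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - step * grad, step * lam)
    return x

# Hypothetical toy problem: 40 random measurements of a 100-dimensional, 5-sparse signal.
rng = np.random.default_rng(0)
n_measurements, n_features, n_nonzero = 40, 100, 5
A = rng.standard_normal((n_measurements, n_features)) / np.sqrt(n_measurements)
x_true = np.zeros(n_features)
x_true[rng.choice(n_features, n_nonzero, replace=False)] = 1.0
y = A @ x_true
x_hat = ista(A, y)
```

    How closely `x_hat` matches `x_true` depends on the sparsity level, the number of measurements, and `lam`; the sketch only shows the mechanics of the sparsity-inducing update.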

    Design of Machine Learning Algorithms with Applications to Breast Cancer Detection

    Machine learning is concerned with the design and development of algorithms and techniques that allow computers to 'learn' from experience with respect to some class of tasks and performance measure. One application of machine learning is to improve the accuracy and efficiency of computer-aided diagnosis systems to assist physicians, radiologists, cardiologists, neuroscientists, and health-care technologists. This thesis focuses on machine learning and its applications to breast cancer detection. Emphasis is laid on preprocessing of features, pattern classification, and model selection. Before the classification task, feature selection and feature transformation may be performed to reduce the dimensionality of the features and to improve the classification performance. A genetic algorithm (GA) can be employed for feature selection based on different measures of data separability or the estimated risk of a chosen classifier. A separate nonlinear transformation can be performed by applying kernel principal component analysis and kernel partial least squares. Different classifiers are proposed in this work. The SOM-RBF network combines self-organizing maps (SOMs) and radial basis function (RBF) networks, with the RBF centers set as the weight vectors of neurons from the competitive layer of a trained SOM. The pairwise Rayleigh quotient (PRQ) classifier seeks one discriminating boundary by maximizing an unconstrained optimization objective, named the PRQ criterion, formed with a set of pairwise constraints instead of individual training samples. The strict 2-surface proximal (S2SP) classifier seeks two proximal planes that are not necessarily parallel to fit the distribution of the samples in the original feature space or a kernel-defined feature space, by maximizing two strict optimization objectives with a 'square of sum' optimization factor.
Two variations of the support vector data description (SVDD) with negative samples (NSVDD) are proposed by involving different forms of slack vectors, which learn a closed spherically shaped boundary, named the supervised compact hypersphere (SCH), around a set of samples in the target class. We extend the NSVDDs to solve multi-class classification problems based on distances between the samples and the centers of the learned SCHs in a kernel-defined feature space, using a combination of linear discriminant analysis and the nearest-neighbor rule. The problem of model selection is studied to pick the best values of the hyperparameters for a parametric classifier. To choose the optimal kernel or regularization parameters of a classifier, we investigate different criteria, such as the validation error estimate and the leave-one-out bound, as well as different optimization methods, such as grid search, gradient descent, and GA. By viewing the tuning problem of the multiple parameters of a 2-norm support vector machine (SVM) as an identification problem of a nonlinear dynamic system, we design a tuning system by employing the extended Kalman filter based on cross validation. Independent kernel optimization based on different measures of data separability is also investigated for different kernel-based classifiers. Numerous computer experiments using the benchmark datasets verify the theoretical results and compare the techniques in terms of classification accuracy or area under the receiver operating characteristic curve. Computational requirements, such as the computing time and the number of hyper-parameters, are also discussed. All of the presented methods are applied to breast cancer detection from fine-needle aspiration and in mammograms, as well as screening of knee-joint vibroarthrographic signals and automatic monitoring of roller bearings with vibration signals.
Experimental results demonstrate the excellence of these methods with improved classification performance. For breast cancer detection, instead of only providing a binary diagnostic decision of 'malignant' or 'benign', we propose methods to assign a measure of confidence of malignancy to an individual mass, by calculating probabilities of being benign and malignant with a single classifier or a set of classifiers.

    Two-Stage Fuzzy Multiple Kernel Learning Based on Hilbert-Schmidt Independence Criterion

    © 1993-2012 IEEE. Multiple kernel learning (MKL) is a principled approach to kernel combination and selection for a variety of learning tasks, such as classification, clustering, and dimensionality reduction. In this paper, we develop a novel fuzzy multiple kernel learning model based on the Hilbert-Schmidt independence criterion (HSIC) for classification, which we call HSIC-FMKL. In this model, we first propose an HSIC Lasso-based MKL formulation, which not only has a clear statistical interpretation, namely that minimally redundant kernels with maximum dependence on the output labels are found and combined, but also enables the globally optimal solution to be computed efficiently by solving a Lasso optimization problem. Since the traditional support vector machine (SVM) is sensitive to outliers or noise in the dataset, fuzzy SVM (FSVM) is used to select the prediction hypothesis once the optimal kernel has been obtained. The main advantage of FSVM is that we can associate a fuzzy membership with each data point such that these data points can have different effects on the training of the learning machine. We propose a new fuzzy membership function using a heuristic strategy based on the HSIC. The proposed HSIC-FMKL is a two-stage kernel learning approach and the HSIC is applied in both stages. We perform extensive experiments on real-world datasets from the UCI benchmark repository and the application domain of computational biology, which validate the superiority of the proposed model in terms of prediction accuracy.
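    The abstract above scores kernels by their dependence on the output labels through the HSIC. As a hedged illustration, the sketch below computes the standard biased empirical HSIC estimate, trace(KHLH)/(n-1)^2, between a candidate kernel and a label kernel. It does not reproduce the paper's HSIC Lasso formulation or its fuzzy membership function; the Gaussian bandwidth, the delta label kernel, and the toy features are assumptions made for the example.

```python
import numpy as np

def gaussian_gram(x, bandwidth=1.0):
    """Gaussian (RBF) Gram matrix of a 1-D or 2-D sample array."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def label_gram(y):
    """Delta kernel on class labels: 1 if two labels agree, else 0."""
    y = np.asarray(y)
    return (y[:, None] == y[None, :]).astype(float)

def hsic(K, L):
    """Biased empirical HSIC estimate between two Gram matrices."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# Hypothetical toy data: one feature that tracks the labels and one that is pure noise.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
informative = 3.0 * y + rng.standard_normal(200)
uninformative = rng.standard_normal(200)

L = label_gram(y)
score_informative = hsic(gaussian_gram(informative), L)
score_uninformative = hsic(gaussian_gram(uninformative), L)
```

    A kernel built on the label-dependent feature receives the larger score, which is the kind of signal a kernel-selection stage could use to weight candidate kernels.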

    Probabilistic Machine Learning Approaches in Process Systems Engineering via Parametric Distribution Approximation

    Doctoral thesis -- Seoul National University Graduate School: College of Engineering, School of Chemical and Biological Engineering, August 2021. Jong Min Lee. With the rapid development of measurement technology, higher-quality and vast amounts of process data have become available. Nevertheless, process data are 'scarce' in many cases as they are sampled only at certain operating conditions while the dimensionality of the system is large. Furthermore, the process data are inherently stochastic due to the internal characteristics of the system or measurement noise. For this reason, uncertainty is inevitable in process systems, and estimating it becomes a crucial part of engineering tasks as prediction errors can lead to misguided decisions and cause severe casualties or economic losses. A popular approach to this is applying probabilistic inference techniques that can model the uncertainty in terms of probability. However, most of the existing probabilistic inference techniques are based on recursive sampling, which makes it difficult to use them for industrial applications that require processing a high-dimensional and massive amount of data. To address this issue, this thesis proposes probabilistic machine learning approaches based on parametric distribution approximation, which can model the uncertainty of the system while also circumventing the computational complexity. The proposed approach is applied to three major process engineering tasks: process monitoring, system modeling, and process design. First, a process monitoring framework is proposed that utilizes a probabilistic classifier for fault classification. To enhance the accuracy of the classifier and reduce the computational cost of its training, a feature extraction method called probabilistic manifold learning is developed and applied to the process data ahead of the fault classification.
We demonstrate that this manifold approximation process not only reduces the dimensionality of the data but also casts the data into a clustered structure, so that the classifier depends only weakly on the type and dimension of the data. By exploiting this property, non-metric information (e.g., fault labels) of the data is effectively incorporated and the diagnosis performance is drastically improved. Second, a probabilistic modeling approach based on Bayesian neural networks is proposed. The parameters of deep neural networks are transformed into Gaussian distributions and trained using variational inference. The redundancy of each parameter is inferred automatically during model training, and insignificant parameters are eliminated a posteriori. Through a verification study, we demonstrate that the proposed approach can not only produce high-fidelity models that describe the stochastic behaviors of the system but also produce the optimal model structure. Finally, a novel process design framework is proposed based on reinforcement learning. Unlike conventional optimization methods that recursively evaluate the objective function to find an optimal value, the proposed method approximates the objective function surface by parametric probability distributions. This allows learning a continuous action policy without introducing any cumbersome discretization process. Moreover, the probabilistic policy provides a means of effectively controlling the exploration and exploitation rates according to the certainty information. We demonstrate that the proposed framework can learn process design heuristics during the solution process and use them to solve similar design problems.
Contents:
Chapter 1 Introduction: 1.1 Motivation; 1.2 Outline of the thesis
Chapter 2 Backgrounds and preliminaries: 2.1 Bayesian inference; 2.2 Monte Carlo; 2.3 Kullback-Leibler divergence; 2.4 Variational inference; 2.5 Riemannian manifold; 2.6 Finite extended-pseudo-metric space; 2.7 Reinforcement learning; 2.8 Directed graph
Chapter 3 Process monitoring and fault classification with probabilistic manifold learning: 3.1 Introduction; 3.2 Methods (3.2.1 Uniform manifold approximation; 3.2.2 Clusterization; 3.2.3 Projection; 3.2.4 Mapping of unknown data query; 3.2.5 Inference); 3.3 Verification study (3.3.1 Dataset description; 3.3.2 Experimental setup; 3.3.3 Process monitoring; 3.3.4 Projection characteristics; 3.3.5 Fault diagnosis; 3.3.6 Computational Aspects)
Chapter 4 Process system modeling with Bayesian neural networks: 4.1 Introduction; 4.2 Methods (4.2.1 Long Short-Term Memory (LSTM); 4.2.2 Bayesian LSTM (BLSTM)); 4.3 Verification study (4.3.1 System description; 4.3.2 Estimation of the plasma variables; 4.3.3 Dataset description; 4.3.4 Experimental setup; 4.3.5 Weight regularization during training; 4.3.6 Modeling complex behaviors of the system; 4.3.7 Uncertainty quantification and model compression)
Chapter 5 Process design based on reinforcement learning with distributional actor-critic networks: 5.1 Introduction; 5.2 Methods (5.2.1 Flowsheet hashing; 5.2.2 Behavioral cloning; 5.2.3 Neural Monte Carlo tree search (N-MCTS); 5.2.4 Distributional actor-critic networks (DACN); 5.2.5 Action masking); 5.3 Verification study (5.3.1 System description; 5.3.2 Experimental setup; 5.3.3 Result and discussions)
Chapter 6 Concluding remarks: 6.1 Summary of the contributions; 6.2 Future works
Appendix: A.1 Proof of Lemma 1; A.2 Performance indices for dimension reduction; A.3 Model equations for process units
Bibliography
Abstract (in Korean)
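    The abstract above describes replacing network parameters with Gaussian distributions, training them by variational inference, and eliminating insignificant parameters afterwards. The sketch below shows three generic ingredients of such a scheme under stated assumptions: the closed-form KL divergence between diagonal Gaussians (the regulariser in the variational objective), a reparameterised weight sample, and a pruning mask. The standard-normal prior and the signal-to-noise pruning rule with its threshold are assumptions for illustration, not the thesis's stated criterion.

```python
import numpy as np

def kl_gaussian(mu_q, sigma_q, mu_p=0.0, sigma_p=1.0):
    """KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)), summed over independent weights."""
    return np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
        - 0.5
    )

def sample_weights(mu, sigma, rng):
    """Reparameterised draw w = mu + sigma * eps with eps ~ N(0, I)."""
    return mu + sigma * rng.standard_normal(mu.shape)

def prune_mask(mu, sigma, snr_threshold=1.0):
    """Keep weights whose posterior signal-to-noise ratio |mu| / sigma reaches the threshold.

    The threshold value is a hypothetical choice for this sketch.
    """
    return np.abs(mu) / sigma >= snr_threshold

# Hypothetical posterior over four weights.
rng = np.random.default_rng(0)
mu = np.array([2.0, 0.1, -1.5, 0.0])
sigma = np.array([1.0, 1.0, 0.5, 1.0])
w = sample_weights(mu, sigma, rng)
keep = prune_mask(mu, sigma)
```

    Weights whose posterior mean is small relative to its spread contribute mostly noise, which is why a ratio of this kind is a common heuristic for post-training compression.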

    Data Reduction Algorithms in Machine Learning and Data Science

    Raw data usually need to be pre-processed for better representation or discrimination of classes. This pre-processing can be done by data reduction, i.e., reduction in either dimensionality or numerosity (cardinality). Dimensionality reduction can be used for feature extraction or data visualization. Numerosity reduction is useful for ranking data points or finding the most and least important data points. This thesis proposes several algorithms for data reduction, covering both dimensionality and numerosity reduction, in machine learning and data science. Dimensionality reduction comprises feature extraction and feature selection methods, while numerosity reduction includes prototype selection and prototype generation approaches. This thesis focuses on feature extraction and prototype selection for data reduction. Dimensionality reduction methods can be divided into three categories, i.e., spectral, probabilistic, and neural network-based methods. The spectral methods have a geometrical point of view and mostly reduce to the generalized eigenvalue problem. Probabilistic and network-based methods have stochastic and information-theoretic foundations, respectively. Numerosity reduction methods can be divided into methods based on variance, geometry, and isolation. For dimensionality reduction, under the spectral category, I propose weighted Fisher discriminant analysis, Roweis discriminant analysis, and image quality aware embedding. I also propose quantile-quantile embedding as a probabilistic method where the distribution of the embedding is chosen by the user. Backprojection, Fisher losses, and dynamic triplet sampling using Bayesian updating are other proposed methods in the neural network-based category. Backprojection is for training shallow networks with a projection-based perspective in manifold learning. Two Fisher losses are proposed for training Siamese triplet networks for increasing and decreasing the inter- and intra-class variances, respectively.
Two dynamic triplet mining methods, which are based on Bayesian updating to draw triplet samples stochastically, are proposed. For numerosity reduction, principal sample analysis and instance ranking by matrix decomposition are the proposed variance-based methods; these methods rank instances using inter-/intra-class variances and matrix factorization, respectively. Curvature anomaly detection, in which the points are assumed to be the vertices of a polyhedron, and isolation Mondrian forest are the proposed methods based on geometry and isolation, respectively. To assess the proposed tools developed for data reduction, I apply them to some applications in medical image analysis, image processing, and computer vision. Data reduction, used as a pre-processing tool, has different applications because it provides various ways of feature extraction and prototype selection for applying to different types of data. Dimensionality reduction extracts informative features and prototype selection selects the most informative data instances. For example, for medical image analysis, I use Fisher losses and dynamic triplet sampling for embedding histopathology image patches and demonstrating how different the tumorous cancer tissue types are from the normal ones. I also propose offline/online triplet mining using extreme distances for this embedding. In image processing and computer vision applications, I propose Roweisfaces and Roweisposes for face recognition and 3D action recognition, respectively, using my proposed Roweis discriminant analysis method. I also introduce the concepts of anomaly landscape and anomaly path using the proposed curvature anomaly detection and use them to denoise images and video frames. I report extensive experiments, on different datasets, to show the effectiveness of the proposed algorithms.
Through experiments, I demonstrate that the proposed methods are useful for extracting informative features and instances for better accuracy, representation, prediction, class separation, data reduction, and embedding. I show that the proposed dimensionality reduction methods can extract informative features for better separation of classes. An example is obtaining an embedding space for separating cancer histopathology patches from the normal patches, which helps hospitals diagnose cancers more easily in an automatic way. I also show that the proposed numerosity reduction methods are useful for ranking data instances based on their importance and reducing data volumes without a significant drop in performance of machine learning and data science algorithms.
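    The abstract above notes that spectral dimensionality reduction methods mostly reduce to a generalized eigenvalue problem. As an illustration, the sketch below implements classical Fisher discriminant analysis, not the thesis's weighted or Roweis variants, by solving S_b w = lambda S_w w for the between-class and within-class scatter matrices. The small ridge term `reg` and the toy data are assumptions added for the example.

```python
import numpy as np

def fisher_directions(X, y, n_components=1, reg=1e-6):
    """Leading solutions of the generalized eigenproblem S_b w = lambda * S_w w."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))  # within-class scatter
    S_b = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        S_w += (Xc - class_mean).T @ (Xc - class_mean)
        diff = (class_mean - overall_mean)[:, None]
        S_b += len(Xc) * (diff @ diff.T)
    S_w += reg * np.eye(d)  # ridge term keeps S_w invertible (an assumption of this sketch)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(-eigvals.real)
    return eigvecs[:, order[:n_components]].real

# Hypothetical toy data: classes differ along axis 0, while axis 1 carries large shared noise.
rng = np.random.default_rng(0)
X0 = rng.standard_normal((200, 2)) * [1.0, 5.0]
X1 = rng.standard_normal((200, 2)) * [1.0, 5.0] + [4.0, 0.0]
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 200)

w = fisher_directions(X, y)[:, 0]
z = X @ w  # one-dimensional embedding
```

    The recovered direction weights the low-noise, class-separating axis far more heavily than the high-variance axis, which is the behaviour a variance-only method such as PCA would not show on this data.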

    Learning and Testing Powerful Hypotheses

    Progress in science is driven by formulating hypotheses about phenomena of interest and by collecting evidence that supports or refutes them. While some hypotheses are amenable to deductive proofs, other hypotheses can only be accessed in a data-driven manner. For most phenomena, scientists cannot control all degrees of freedom and hence data is often inherently stochastic. This stochasticity makes it impossible to test hypotheses with absolute certainty. The field of statistical hypothesis testing formalizes the probabilistic assessment of hypotheses, enabling researchers to control the error rates, for example, the rate at which they reject a true hypothesis, while aiming to reject false hypotheses as often as possible. But how do we come up with promising hypotheses, and how can we test them efficiently? Can we use machine learning systems to automatically generate promising hypotheses? This thesis studies different aspects of this question. A simple rule for statistical hypothesis testing states that one should not peek at the data when formulating a hypothesis. This is indeed true if done naively, that is, when the hypothesis is then simply tested with the data as if one had not looked at it yet. However, we show that in principle using the same data for learning the hypothesis and testing it is feasible if we can correct for the selection of the hypothesis. We treat this in the case of the two-sample problem. Given two samples, the hypothesis to be tested is whether the samples originate from the same distribution. We can reformulate this by testing whether the maximum mean discrepancy over a (unit ball of a) reproducing kernel Hilbert space is zero. We show that we can learn the kernel function, hence the exact test we use, and perform the test with the same data, while still correctly controlling the Type-I error rates.
Likewise, we demonstrate experimentally that taking all data into account can lead to more powerful testing procedures than the data splitting approach. However, deriving the formulae that correct for the selection procedure requires strong assumptions, which are only valid for one specific estimate of the maximum mean discrepancy, the linear-time estimate. In more general settings it is difficult, if not impossible, to adjust for the selection. We thus also analyze the case where we split the data and use part of it to learn a test statistic. The maximum mean discrepancy implicitly optimizes a mean discrepancy over the unit ball of a reproducing kernel Hilbert space, and often the kernel itself is optimized on held-out data. We instead propose to optimize a witness function directly on held-out data and use its mean discrepancy as a test statistic. This allows us to directly maximize the test power, simplifies the theoretical treatment, and makes testing more efficient. We provide and implement algorithms to learn the test statistics. Furthermore, we show analytically that the optimization objective to learn powerful tests for the two-sample problem is closely related to the objectives used in standard supervised learning tasks, namely the least-squares loss and cross-entropy loss. This allows us to use existing machine learning tools when learning powerful hypotheses. Furthermore, since we use held-out data for learning the test statistic, we can use any kind of model-selection and cross-validation techniques to maximize the performance. To facilitate this for practitioners, we provide an open-source Python package 'autotst' implementing an interface to existing libraries and running the whole testing pipeline, including the learning of the hypothesis. Our presented methods reach state-of-the-art performance on two-sample testing tasks.
We also show how to trade off the computational resources required for the test by sacrificing some statistical power, which can be important in practice. Furthermore, our test easily allows interpreting the results. Having more computational power potentially allows extracting more information from data and thus obtaining more significant results. Hence, investigating whether quantum computers can help in machine learning tasks has gained popularity over the past years. We investigate this in light of the two-sample problem. We define the quantum mean embedding, mapping probability distributions onto quantum states, and analyze when this mapping is injective. While this is conceptually interesting on its own, we do not find a straightforward way of harnessing any speed-up. The main problem here is that there is no known way to efficiently create the quantum mean embedding. On the contrary, fundamental results in quantum information theory show that this might generally be hard to do. For two-sample testing, the usage of reproducing kernel Hilbert spaces has been established for many years and proven important both theoretically and practically. In this case, we thus focused on practically relevant aspects to make the tests as powerful and easy to use as possible. For other hypothesis testing tasks, the usage of advanced machine learning tools still lags far behind. For specification tests based on conditional moment restrictions, popular in econometrics, we take the first steps by defining a consistent test based on kernel methods. Our test already has promising performance, but optimizing it, potentially with the other insights gained in this thesis, is an open task.
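    The abstract above frames the two-sample problem as testing whether the maximum mean discrepancy (MMD) over a reproducing kernel Hilbert space is zero. As a baseline illustration only, the sketch below computes the standard biased quadratic-time MMD estimate with a fixed Gaussian kernel and calibrates it with a permutation test. It implements neither the thesis's selective-inference correction nor its learned witness function, and it does not use the 'autotst' package; the bandwidth, sample sizes, and number of permutations are assumptions.

```python
import numpy as np

def gaussian_gram(A, B, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * bandwidth ** 2))

def mmd2_biased(X, Y, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples X and Y."""
    return (gaussian_gram(X, X, bandwidth).mean()
            + gaussian_gram(Y, Y, bandwidth).mean()
            - 2.0 * gaussian_gram(X, Y, bandwidth).mean())

def permutation_p_value(X, Y, n_permutations=200, bandwidth=1.0, seed=0):
    """p-value of the observed MMD under random relabelling of the pooled sample."""
    rng = np.random.default_rng(seed)
    observed = mmd2_biased(X, Y, bandwidth)
    pooled = np.vstack([X, Y])
    n = len(X)
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        if mmd2_biased(pooled[perm[:n]], pooled[perm[n:]], bandwidth) >= observed:
            exceed += 1
    return (exceed + 1) / (n_permutations + 1)

# Hypothetical toy samples: one pair from the same distribution, one pair with a mean shift.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 1))
Y_same = rng.standard_normal((100, 1))
Y_shift = rng.standard_normal((100, 1)) + 2.0

p_same = permutation_p_value(X, Y_same)
p_shift = permutation_p_value(X, Y_shift)
```

    The permutation step controls the Type-I error without distributional assumptions, at the cost of recomputing the statistic many times; this is the kind of computational overhead that motivates the trade-offs discussed in the abstract.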