6 research outputs found
A Preconditioned Interior Point Method for Support Vector Machines Using an ANOVA-Decomposition and NFFT-Based Matrix-Vector Products
In this paper we consider the numerical solution to the soft-margin support
vector machine optimization problem. This problem is typically solved using the
SMO algorithm, given the high computational complexity of traditional
optimization algorithms when dealing with large-scale kernel matrices. In this
work, we propose employing an NFFT-accelerated matrix-vector product using an
ANOVA decomposition for the feature space that is used within an interior point
method for the overall optimization problem. As this method requires the
solution of a linear system of saddle point form we suggest a preconditioning
approach that is based on low-rank approximations of the kernel matrix together
with a Krylov subspace solver. We compare the accuracy of the ANOVA-based
kernel with the default LIBSVM implementation. We investigate the performance
of the different preconditioners as well as the accuracy of the ANOVA kernel on
several large-scale datasets.Comment: Official Code https://github.com/wagnertheresa/NFFTSVMip
Learning in High-Dimensional Feature Spaces Using ANOVA-Based Fast Matrix-Vector Multiplication
Kernel matrices are crucial in many learning tasks such as support vector
machines or kernel ridge regression. The kernel matrix is typically dense and
large-scale. Depending on the dimension of the feature space even the
computation of all of its entries in reasonable time becomes a challenging
task. For such dense matrices the cost of a matrix-vector product scales
quadratically with the dimensionality N , if no customized methods are applied.
We propose the use of an ANOVA kernel, where we construct several kernels based
on lower-dimensional feature spaces for which we provide fast algorithms
realizing the matrix-vector products. We employ the non-equispaced fast Fourier
transform (NFFT), which is of linear complexity for fixed accuracy. Based on a
feature grouping approach, we then show how the fast matrix-vector products can
be embedded into a learning method choosing kernel ridge regression and the
conjugate gradient solver. We illustrate the performance of our approach on
several data sets.Comment: Official Code https://github.com/wagnertheresa/NFFT4ANOV
New approaches for unsupervised transcriptomic data analysis based on Dictionary learning
The era of high-throughput data generation enables new access to biomolecular profiles and exploitation thereof. However, the analysis of such biomolecular data, for example, transcriptomic data, suffers from the so-called "curse of dimensionality". This occurs in the analysis of datasets with a significantly larger number of variables than data points. As a consequence, overfitting and unintentional learning of process-independent patterns can appear. This can lead to insignificant results in the application. A common way of counteracting this problem is the application of dimension reduction methods and subsequent analysis of the resulting low-dimensional representation that has a smaller number of variables.
In this thesis, two new methods for the analysis of transcriptomic datasets are introduced and evaluated. Our methods are based on the concepts of Dictionary learning, which is an unsupervised dimension reduction approach. Unlike many dimension reduction approaches that are widely applied for transcriptomic data analysis, Dictionary learning does not impose constraints on the components that are to be derived. This allows for great flexibility when adjusting the representation to the data. Further, Dictionary learning belongs to the class of sparse methods. The result of sparse methods is a model with few non-zero coefficients, which is often preferred for its simplicity and ease of interpretation. Sparse methods exploit the fact that the analysed datasets are highly structured. Indeed, a characteristic of transcriptomic data is particularly their structuredness, which appears due to the connection of genes and pathways, for example. Nonetheless, the application of Dictionary learning in medical data analysis is mainly restricted to image analysis. Another advantage of Dictionary learning is that it is an interpretable approach. Interpretability is a necessity in biomolecular data analysis to gain a holistic understanding of the investigated processes.
Our two new transcriptomic data analysis methods are each designed for one main task: (1) identification of subgroups for samples from mixed populations, and (2) temporal ordering of samples from dynamic datasets, also referred to as "pseudotime estimation". Both methods are evaluated on simulated and real-world data and compared to other methods that are widely applied in transcriptomic data analysis. Our methods convince through high performance and overall outperform the comparison methods