    Dataset Reduction Techniques to Speed Up SVD Analyses on Big Geo-Datasets

    The Singular Value Decomposition (SVD) is a mathematical procedure with multiple applications in the geosciences. For instance, it is used in dimensionality reduction and as a support operator for various analytical tasks on spatio-temporal data. Performing SVD analyses on large datasets, however, can be computationally costly, time-consuming, and sometimes practically infeasible. Techniques exist, though, to arrive at the same output, or at a close approximation, with far less effort. This article examines several such techniques in relation to the inherent scale of the structure within the data. When the values of a dataset vary slowly, e.g., in a spatial field of temperature over a country, the data are autocorrelated and the field contains large-scale structure. Such fields do not need a high resolution to be described, and their analysis can benefit from alternative SVD techniques based on rank deficiency, coarsening, or matrix factorization. We use both simulated Gaussian Random Fields with various levels of autocorrelation and real-world geospatial datasets to examine the accuracy of the various SVD techniques. As its main result, this article provides researchers with a decision tree indicating which technique to use when, and predicting the resulting level of accuracy, based on the dataset's structure scale.
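
    As a rough illustration of the low-rank idea this abstract describes (a minimal sketch using NumPy/SciPy, not the authors' code; the field size, smoothing scale, and truncation rank are arbitrary choices), a truncated SVD of a strongly autocorrelated field can reconstruct it accurately from only a few singular values:

        import numpy as np
        from scipy.ndimage import gaussian_filter

        rng = np.random.default_rng(0)

        # Simulate a smooth spatial field: white noise blurred with a Gaussian
        # kernel, a cheap stand-in for a Gaussian Random Field with strong
        # autocorrelation (sigma controls the structure scale).
        field = gaussian_filter(rng.standard_normal((512, 512)), sigma=20)

        # Full SVD, then truncate to rank k: smooth fields concentrate their
        # energy in the leading singular values, so a small k can suffice.
        U, s, Vt = np.linalg.svd(field, full_matrices=False)
        k = 10
        approx = (U[:, :k] * s[:k]) @ Vt[:k, :]

        rel_err = np.linalg.norm(field - approx) / np.linalg.norm(field)
        print(f"rank-{k} relative reconstruction error: {rel_err:.3e}")

    Because a field this smooth concentrates its variance in the leading singular values, the rank-10 error is typically small; a field with weaker autocorrelation needs a larger rank for the same accuracy, which is the trade-off the article's decision tree addresses.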

    Data Dimensionality Reduction Techniques: What Works with Machine Learning Models

    High-dimensional data has a wide range of applications in research fields such as education, health, and social media. However, high dimensionality can raise many problems for data analysis. This study focuses on commonly used dimensionality reduction techniques (DRTs) for machine learning models, which play an essential role in data preprocessing and statistical analysis. The main issues high-dimensional data poses for machine learning tasks concern the accuracy of data classification and visualization. Therefore, in this study, machine learning algorithms are used to predict and classify datasets, and the accuracy, precision, recall, and F1 score of the results are evaluated and compared by mean, variance, confidence interval, and coverage. Eight DRTs (Principal Component Analysis, Kernel Principal Component Analysis, Singular Value Decomposition, Non-negative Matrix Factorization, Independent Component Analysis, Multidimensional Scaling, Isomap, and Auto-encoder) are compared and evaluated on simulated datasets with different features. Specifically, the study compares the performance of these techniques through Monte Carlo simulation studies with four machine learning classification models: logistic regression, linear support vector machine, nonlinear support vector machine, and k-nearest neighbors. The results indicated that the DRTs decreased accuracy, precision, recall, and F1 scores compared with results without DRTs. Overall, MDS performed dramatically better than the other DRTs. SVD, PCA, and ICA had similar results because they are all linear DRTs. Although it is also a linear DRT, NMF performed as poorly as KPCA, which is a nonlinear DRT. The other two nonlinear DRTs, Isomap and Auto-encoder, had the worst performance in this study. The results provide recommendations for empirical researchers using machine learning models with high-dimensional data under specific conditions.
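
    As a minimal sketch of the kind of comparison this abstract describes (assuming scikit-learn; the simulated data dimensions, the choice of PCA as the DRT, and the component count are illustrative, not the study's settings), one can cross-validate a classifier with and without a DRT in front of it:

        from sklearn.datasets import make_classification
        from sklearn.decomposition import PCA
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        # Simulated high-dimensional classification data: 200 features,
        # of which only 20 carry class signal.
        X, y = make_classification(n_samples=1000, n_features=200,
                                   n_informative=20, random_state=0)

        # Baseline pipeline (no DRT) vs. the same classifier after PCA.
        baseline = make_pipeline(StandardScaler(),
                                 LogisticRegression(max_iter=1000))
        reduced = make_pipeline(StandardScaler(), PCA(n_components=20),
                                LogisticRegression(max_iter=1000))

        for name, model in [("no DRT", baseline), ("PCA -> logistic", reduced)]:
            scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
            print(f"{name}: mean accuracy {scores.mean():.3f}")

    Whether the reduced pipeline matches the baseline depends on how much of the class signal the retained components capture, which is exactly the kind of condition the study's Monte Carlo simulations vary.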