11 research outputs found

    Principal component analysis as tool for data reduction with an application

    The recent trend of collecting huge datasets poses a great challenge, brought about by high dimensionality and aggravated by the presence of irrelevant dimensions. Machine learning models for regression are recognized as a convenient way of improving estimation in empirical models; a popular one is support vector regression (SVR). Here, the use of principal component analysis (PCA) as a variable reduction method along with SVR is suggested. PCA helps in building a predictive model that is efficient and simple in that it contains the smallest number of variables. In this paper, we investigate the competence of SVR combined with PCA and explore its performance for more accurate estimation. A simulation study and Renal Failure (RF) data were used to evaluate SVR optimized by four different kernel functions (linear, polynomial, radial basis, and sigmoid) in R software, version R x64 3.2.5, comparing the behavior of ε-SVR and ν-SVR models for sample sizes ranging from small through moderate to large (50, 100, and 150). The performance criteria, root mean squared error (RMSE) and the coefficient of determination (R²), showed the superiority of ε-SVR over ν-SVR. Furthermore, implementing SVR after employing PCA improves the results. The simulation results also showed that the best performing kernel function is the linear kernel, while for the real data the best kernels are the linear and radial basis functions. It is also clear that, with both ε-SVR and ν-SVR, the RMSE values for almost all kernel functions decreased with increasing sample size, and the performance of ε-SVR improved after applying PCA. In addition, a sample size of n = 50 gave good results for the linear and radial kernels.
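    A minimal sketch of the kind of pipeline the abstract describes, assuming scikit-learn and a synthetic dataset in place of the simulation and Renal Failure data; the number of retained components, epsilon, and all other settings below are illustrative choices, not the paper's.

        from sklearn.datasets import make_regression
        from sklearn.decomposition import PCA
        from sklearn.metrics import mean_squared_error, r2_score
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVR

        # Synthetic stand-in for the simulation / Renal Failure data.
        X, y = make_regression(n_samples=150, n_features=20, noise=5.0, random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        for kernel in ["linear", "poly", "rbf", "sigmoid"]:
            model = make_pipeline(
                StandardScaler(),
                PCA(n_components=0.95),           # keep components explaining 95% of the variance
                SVR(kernel=kernel, epsilon=0.1),  # epsilon-SVR; NuSVR would give the nu variant
            )
            model.fit(X_train, y_train)
            pred = model.predict(X_test)
            rmse = mean_squared_error(y_test, pred) ** 0.5
            print(f"{kernel:8s}  RMSE={rmse:7.2f}  R2={r2_score(y_test, pred):.3f}")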

    Classifying Imbalanced Financial Fraud Data Utilizing Enhanced Random Forest Algorithm

    Imbalanced datasets have been a unique challenge for machine learning, requiring specialized approaches to correctly classify the minority class. Financial fraud detection involves highly imbalanced datasets with a class imbalance of up to 0.01% frauds to 99.99% regular transactions. It is essential to identify all frauds in financial fraud detection, even if the precision of some classifications is low. I developed a random forest ensemble that separates fraudulent transactions into tiers of precision. With this approach, 96% of fraudulent transactions are identified, an 8% increase in recall compared to standard approaches. For 59% of fraud classifications, precision increases by 10%, up to 98%, by optimizing several random forests on different fitness functions. These models are then combined to act as a sieve with increasing tolerance for low-precision classifications. The effectiveness of random forest for financial fraud detection is also improved through feature extraction techniques. Random forests are weak at detecting patterns between interdependent features; this problem is addressed through unsupervised feature extraction. I demonstrate a new random forest architecture, the PCA-embedded random forest, which increased random forest performance.
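    A minimal sketch of the general idea, assuming scikit-learn and a synthetic imbalanced dataset; the single class-weighted random forest, the appended PCA features, and the probability thresholds are my simplifications for illustration, not the tiered multi-forest architecture of the thesis.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.decomposition import PCA
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split

        # Imbalanced synthetic "transactions" (about 1% fraud; the real data is far more skewed).
        X, y = make_classification(n_samples=20000, n_features=30, weights=[0.99], random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

        # "PCA-embedded" inputs: append a few unsupervised components to the raw features.
        pca = PCA(n_components=5).fit(X_train)
        X_train_aug = np.hstack([X_train, pca.transform(X_train)])
        X_test_aug = np.hstack([X_test, pca.transform(X_test)])

        rf = RandomForestClassifier(n_estimators=300, class_weight="balanced_subsample",
                                    random_state=0).fit(X_train_aug, y_train)
        proba = rf.predict_proba(X_test_aug)[:, 1]

        # Sieve the flagged transactions into tiers of decreasing expected precision.
        tiers = {"high": proba >= 0.9,
                 "medium": (proba >= 0.5) & (proba < 0.9),
                 "review": (proba >= 0.2) & (proba < 0.5)}
        for name, mask in tiers.items():
            flagged = int(mask.sum())
            precision = y_test[mask].mean() if flagged else float("nan")
            print(f"{name:7s} flagged={flagged:5d} precision={precision:.2f}")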

    The detection of age groups by dynamic gait outcomes using machine learning approaches

    The prevalence of gait impairments increases with age and is associated with mobility decline, fall risk, and loss of independence. For geriatric patients, the risk of having gait disorders is even higher. Consequently, gait assessment in the clinic has become increasingly important. The purpose of the present study was to classify healthy young to middle-aged adults, older adults, and geriatric patients based on dynamic gait outcomes. The classification performance of three supervised machine learning methods was compared. From trunk 3D accelerations of 239 subjects obtained during walking, 23 dynamic gait outcomes were calculated. Kernel Principal Component Analysis (KPCA) was applied for dimensionality reduction of the data prior to Support Vector Machine (SVM) classification. Random Forest (RF) and Artificial Neural Network (ANN) classifiers were applied to the 23 gait outcomes without prior data reduction. Classification accuracy was 89% for SVM, 73% for RF, and 90% for ANN. Gait outcomes that significantly contributed to classification included Root Mean Square (Anterior-Posterior, Vertical), Cross Entropy (Medio-Lateral, Vertical), Lyapunov Exponent (Vertical), step regularity (Vertical), and gait speed. ANN is preferable due to its automated data reduction and identification of significant gait outcomes. For clinicians, these gait outcomes could be used for diagnosing subjects with mobility disabilities and fall risk, and for monitoring interventions. (This work was supported by the Keep Control project, with funding from the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 721577.)
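    A minimal sketch of the SVM branch of the comparison (KPCA for dimensionality reduction, then an SVM), assuming scikit-learn; the synthetic features, component count, and kernel settings are placeholders rather than the study's measured gait outcomes.

        from sklearn.datasets import make_classification
        from sklearn.decomposition import KernelPCA
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVC

        # Placeholder for 239 subjects x 23 dynamic gait outcomes, 3 groups
        # (young-middle aged, older adults, geriatric patients).
        X, y = make_classification(n_samples=239, n_features=23, n_informative=10,
                                   n_classes=3, random_state=0)

        clf = make_pipeline(StandardScaler(),
                            KernelPCA(n_components=10, kernel="rbf"),
                            SVC(kernel="rbf"))
        print("5-fold accuracy:", cross_val_score(clf, X, y, cv=5).mean().round(3))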

    Comparative analysis of methods for microbiome study

    Microbiome analysis is garnering much interest with benefits including improved treatment options, enhanced capabilities for personalized medicine, greater understanding of the human body, and contributions to ecological study. Data from these communities of bacteria, viruses, and fungi are feature rich, sparse, and have sample sizes not appreciably larger than the feature space, making analysis challenging and necessitating a coordinated approach utilizing multiple techniques alongside domain expertise. This thesis provides an overview and comparative analysis of these methods, with a case study on cirrhosis and hepatic encephalopathy demonstrating a selection of methods. Approaches are considered in a medically motivated context where relationships between microbes in the human body and diseases or conditions are of primary interest, with additional objectives being the identification of how microbes influence each other and how these influences relate to the diseases and conditions being studied. These analysis methods are partitioned into three categories: univariate statistical methods, classifier-based methods, and joint analysis methods. Univariate statistical methods provide results corresponding to how much a single variable or feature differs between groups in the data. Classifier-based approaches can be generalized as those where a classification model with microbe abundance as inputs and disease states as outputs is used, resulting in a predictive model which is then analyzed to learn about the data. The joint analysis category corresponds to techniques which specifically target relationships between microbes and compare those relationships among subpopulations within the data. Despite significant differences between these categories and the individual methods, each has strengths and weaknesses and plays an important role in microbiome analysis.
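    A minimal sketch contrasting the first two categories on toy abundance data, assuming SciPy and scikit-learn: a univariate test run per microbe versus a classifier-based ranking from random forest feature importances. The data, the planted association, and all parameters are illustrative only, not the thesis case study.

        import numpy as np
        from scipy.stats import mannwhitneyu
        from sklearn.ensemble import RandomForestClassifier

        rng = np.random.default_rng(0)
        n_samples, n_microbes = 60, 200
        abundance = rng.poisson(2.0, size=(n_samples, n_microbes)).astype(float)  # sparse counts
        disease = rng.integers(0, 2, size=n_samples)   # 0 = control, 1 = case
        abundance[disease == 1, 0] += 3.0              # plant one true association

        # Univariate: test each microbe separately between groups.
        pvals = [mannwhitneyu(abundance[disease == 0, j],
                              abundance[disease == 1, j]).pvalue
                 for j in range(n_microbes)]

        # Classifier-based: fit a predictive model, then inspect what it relied on.
        rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(abundance, disease)

        print("top microbe by p-value:   ", int(np.argmin(pvals)))
        print("top microbe by importance:", int(np.argmax(rf.feature_importances_)))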

    Kernel PCA for feature extraction and de-noising in nonlinear regression

    In this paper, we propose the application of the Kernel Principal Component Analysis (PCA) technique for feature selection in a high-dimensional feature space, where input variables are mapped by a Gaussian kernel. The extracted features are employed in the regression problems of chaotic Mackey-Glass time-series prediction in a noisy environment and estimating human signal detection performance from brain event-related potentials elicited by task-relevant signals. We compared results obtained using either Kernel PCA or linear PCA as data preprocessing steps. On the human signal detection task, we report the superiority of Kernel PCA feature extraction over linear PCA. Similar to linear PCA, we demonstrate de-noising of the original data by the appropriate selection of various nonlinear principal components. The theoretical relation and experimental comparison of Kernel Principal Components Regression, Kernel Ridge Regression and ε-insensitive Support Vector Regression are also provided. © Springer-Verlag London Limited 2001
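    A minimal sketch of the preprocessing comparison, assuming scikit-learn: linear PCA versus Gaussian-kernel KPCA as feature extraction before kernel ridge regression and ε-SVR, plus KPCA de-noising via the approximate pre-image. The toy data stands in for the Mackey-Glass and ERP sets, and all hyperparameters are arbitrary.

        import numpy as np
        from sklearn.decomposition import PCA, KernelPCA
        from sklearn.kernel_ridge import KernelRidge
        from sklearn.model_selection import cross_val_score
        from sklearn.svm import SVR

        rng = np.random.default_rng(0)
        X = rng.uniform(-3, 3, size=(300, 5))
        y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(300)   # nonlinear target plus noise

        for name, reducer in [("linear PCA", PCA(n_components=3)),
                              ("kernel PCA", KernelPCA(n_components=3, kernel="rbf"))]:
            Z = reducer.fit_transform(X)
            for reg_name, reg in [("kernel ridge", KernelRidge(kernel="rbf")),
                                  ("eps-SVR", SVR(kernel="rbf", epsilon=0.1))]:
                score = cross_val_score(reg, Z, y, cv=5).mean()
                print(f"{name:10s} + {reg_name:12s} R2={score:.3f}")

        # De-noising: project onto the leading nonlinear components, then map back
        # to input space via the (approximate) pre-image.
        kpca = KernelPCA(n_components=3, kernel="rbf", fit_inverse_transform=True)
        X_denoised = kpca.inverse_transform(kpca.fit_transform(X))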

    Development of transfer learning via deep artificial neural networks for modeling the speech transmission index

    Advisor: Full Prof. Dr.-Ing. Paulo Henrique Trombetta Zannin. Doctoral thesis, Universidade Federal do Paraná, Setor de Tecnologia, Programa de Pós-Graduação em Engenharia Mecânica. Defense: Curitiba, 14/06/2023. Includes references. Area of concentration: Transport Phenomena and Solid Mechanics.
    Abstract: Acoustic comfort in classrooms interferes with the teaching-learning dynamics. Classrooms with acoustic problems therefore cause a socioeconomic burden, compromising the efficiency of education. Speech intelligibility is measured by the Speech Transmission Index (STI); however, measuring this index is complex and expensive. In the current literature, there are STI predictive models that use Reverberation Time (RT) as a regression variable, but these models do not consider the spectral effects of the dual time-frequency location of the Room Impulse Response (RIR) signal, nor the spectral content of the background noise. This work aimed to generate a predictive model of the STI by applying a deep one-dimensional convolutional neural network (1DCONVNET). Due to the difficulty of performing in situ measurements of the STI, and to provide robustness to the estimates of the 1DCONVNET, a new Transfer Learning (TF) algorithm was proposed, called Non-Parametric Projective-Adaptive Variational Minimization (NP-PAVM). NP-PAVM aims to optimize a generative model trained with simulated data and transfer the generalization to a measurement dataset. The NP-PAVM algorithm is based on generative artificial intelligence via Variational Autoencoder (VAE) networks. The VAE network generated an aggregated posterior distribution in the source latent space that was applied as a parameterization of the distribution of the target set. The relationship between the source and target sets was established via Kernel Principal Component Regression (KPCR) on the latent spaces generated by the VAE network. The training and validation sets of the VAE network comprised 25,000 and 5,000 classrooms, respectively. The 1DCONVNET has two input signals, the RIR, simulated via the Image Method, and the background noise (BGN); both inputs were processed by Power Spectral Density (PSD), and the output was the STI. The experimental validation of the 1DCONVNET was carried out in 13 classrooms, with 301 RT measurements (ISO 3288-2) and 90 STI measurements (IEC 60268-16). From these measurements, a test set of 600 classrooms was generated and used to evaluate the 1DCONVNET in scenarios with and without TF; this scenario configuration is represented by coupling the NP-PAVM algorithm into the cost function. The validation of the algorithm was based on an ANOVA test, with p-value < 0.01, showing a significant improvement in model accuracy on the test set for the 1DCONVNET. The average errors were MSPE = 4.49%, RMPSE = 2.23%, and MSE = 0.09 for the 600 test classrooms. The improvement with TF was 6-fold, in terms of the mean MSE under cross-validation (CV = 10), between the scenarios with and without TF. These results demonstrate the effectiveness of the TF method in improving the accuracy of STI prediction. It is concluded that the NP-PAVM algorithm can be used as a support tool for STI assessment in classrooms, using only the instrumentation associated with RT measurements.
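    A minimal sketch of the KPCR step only (the VAE and the 1DCONVNET are omitted), assuming scikit-learn: kernel principal components of source latent codes regressed onto target latent codes. The arrays are random placeholders for the VAE latent spaces, and the dimensions and kernel choice are illustrative.

        import numpy as np
        from sklearn.decomposition import KernelPCA
        from sklearn.linear_model import LinearRegression
        from sklearn.pipeline import make_pipeline

        rng = np.random.default_rng(0)
        z_source = rng.standard_normal((500, 16))   # placeholder latent codes, simulated rooms
        z_target = rng.standard_normal((500, 16))   # placeholder latent codes, measured rooms

        # KPCR = kernel PCA on the inputs followed by ordinary least squares regression.
        kpcr = make_pipeline(KernelPCA(n_components=8, kernel="rbf"),
                             LinearRegression())
        kpcr.fit(z_source, z_target)                # learn the source-to-target latent mapping
        z_mapped = kpcr.predict(z_source)           # source latents expressed in the target space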