13 research outputs found

    Parsimonious Mahalanobis Kernel for the Classification of High Dimensional Data

    The classification of high-dimensional data with kernel methods is considered in this article. Exploiting the emptiness property of high-dimensional spaces, a kernel based on the Mahalanobis distance is proposed. The computation of the Mahalanobis distance requires the inversion of a covariance matrix. In high-dimensional spaces, the estimated covariance matrix is ill-conditioned and its inversion is unstable or impossible. Using a parsimonious statistical model, namely the High Dimensional Discriminant Analysis model, the specific signal and noise subspaces are estimated for each considered class, making the inverse of the class-specific covariance matrix explicit and stable and leading to the definition of a parsimonious Mahalanobis kernel. An SVM-based framework is used for selecting the hyperparameters of the parsimonious Mahalanobis kernel by optimizing the so-called radius-margin bound. Experimental results on three high-dimensional data sets show that the proposed kernel is suitable for classifying high-dimensional data, providing better classification accuracies than the conventional Gaussian kernel.
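The core idea can be sketched as follows: keep the top-d eigenvalues of each class covariance (the signal subspace) and replace the remaining ones by their mean (a common noise level), which makes the inverse explicit and numerically stable. This is a minimal sketch of an HDDA-style regularization, not the authors' exact estimator; the function names and the fixed subspace dimension `d` are illustrative.

```python
import numpy as np

def parsimonious_inverse(X, d):
    """HDDA-style sketch: explicit, stable inverse of a class covariance.

    The top-d eigenvalues (signal subspace) are kept as-is; the remaining
    eigenvalues are replaced by their mean, interpreted as a common noise
    variance, so the inverse never divides by a near-zero eigenvalue.
    """
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]      # reorder to descending
    noise = vals[d:].mean()                     # common noise level
    inv_vals = np.concatenate([1.0 / vals[:d],
                               np.full(len(vals) - d, 1.0 / noise)])
    return (vecs * inv_vals) @ vecs.T           # explicit inverse

def mahalanobis_kernel(x, y, inv_cov):
    """Gaussian-type kernel built on the (regularized) Mahalanobis distance."""
    diff = x - y
    return np.exp(-0.5 * diff @ inv_cov @ diff)
```

In the paper's setting, one such inverse would be estimated per class, with the subspace dimensions and kernel scale tuned via the radius-margin bound.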

    Attributes regrouping in Fuzzy Rule Based Classification Systems: an intra-classes approach

    Fuzzy rule-based classification systems (FRBCS) are able to build linguistically interpretable models: they automatically generate fuzzy if-then rules and use them to classify new observations. However, in these supervised learning systems, a high number of predictive attributes leads to an exponential increase in the number of generated rules. Moreover, the antecedent conditions of the obtained rules are very large, since they contain all the attributes that describe the examples. Therefore, both the accuracy of these systems and their interpretability degrade. To address this problem, we propose to use ensemble methods for FRBCS, where the decisions of different classifiers are combined in order to form the final classification model. We are interested in particular in ensemble methods which split the attributes into subgroups and treat each subgroup separately. We propose to regroup attributes by searching for correlations among the training-set elements that belong to the same class; such an intra-class correlation search allows each class to be characterized separately. Several experiments were carried out on various data sets. The results show a reduction in the number of rules and of antecedents without degrading accuracy; on the contrary, classification rates are even improved.
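The intra-class regrouping step can be sketched as a correlation search performed separately within each class, merging attributes whose correlation is high in some class. This is an illustrative sketch only; the pairwise threshold and the union-find merging are assumptions, not the paper's exact procedure.

```python
import numpy as np

def intra_class_groups(X, y, threshold=0.8):
    """Group attributes that are strongly correlated within at least one class.

    For each class, the correlation matrix is computed over that class's
    examples only; attribute pairs whose absolute correlation exceeds
    `threshold` are merged into the same group (union-find).
    """
    n_attr = X.shape[1]
    parent = list(range(n_attr))

    def find(i):                       # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for c in np.unique(y):
        corr = np.corrcoef(X[y == c], rowvar=False)
        for i in range(n_attr):
            for j in range(i + 1, n_attr):
                if abs(corr[i, j]) > threshold:
                    parent[find(i)] = find(j)

    groups = {}
    for i in range(n_attr):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Each resulting subgroup would then feed one FRBCS in the ensemble, keeping the per-classifier rule antecedents short.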

    Development Of A Deep Neural Network Based Method For Quality Control Of Modis Maiac Aerosol Data For Aerosol Modeling Applications

    Quality-assured satellite aerosol data have been shown to improve aerosol analysis and forecasts in chemical transport models. However, biases present in satellite-based aerosol data can also introduce non-negligible uncertainties into the downstream aerosol forecasts and impact model forecast accuracy. Therefore, in this study we evaluated uncertainties in Moderate Resolution Imaging Spectroradiometer (MODIS) Multi-Angle Implementation of Atmospheric Correction (MAIAC) aerosol products and developed a deep neural network (DNN) based method for quality control of Terra and Aqua MODIS MAIAC Aerosol Optical Depth (AOD) data, using the version 3 level 2 AErosol RObotic NETwork (AERONET) data as the ground truth. The method is trained on 14 years of Aqua MODIS (2002-2016) and 16 years of Terra MODIS (2000-2016) MAIAC data collocated with the AERONET observations. The resulting trained network, which is tested on one year of Aqua/Terra data, can detect and significantly reduce noisy retrievals in MAIAC AOD data, resulting in an approximate 31%/27% reduction in root-mean-square error in Aqua/Terra MODIS MAIAC AOD with an associated 14%/16% data loss. A sensitivity study performed in this effort suggests that reducing the number of output categories and hidden layers can significantly improve performance of the deep neural network in this case. This study suggests that DNNs can be used as an effective method for quality control of satellite-based AOD data for potential modeling applications.
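The quality-control setup can be sketched on synthetic data: label a collocated retrieval "noisy" when it deviates from the AERONET truth by more than an error envelope, train a small network to flag such retrievals from satellite-side features alone, and keep only retrievals the network accepts. Everything below is a stand-in for the study's configuration: the features, the 0.1 envelope, and the layer sizes are assumptions, and the data is synthetic rather than real MAIAC/AERONET collocations.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 2000
aeronet_aod = rng.uniform(0.0, 1.0, n)                 # ground-truth AOD
contamination = rng.uniform(0.0, 1.0, n)               # e.g. residual cloud fraction
# Retrieval error grows with contamination (purely illustrative model):
modis_aod = aeronet_aod + rng.normal(0.0, 1.0, n) * (0.02 + 0.15 * contamination)

features = np.column_stack([modis_aod, contamination])
labels = (np.abs(modis_aod - aeronet_aod) > 0.1).astype(int)   # 1 = noisy

clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
clf.fit(features[:1500], labels[:1500])
keep = clf.predict(features[1500:]) == 0               # screened retrievals
```

In the study itself, accepting only retrievals the network flags as good is what trades a modest data loss (14%/16%) for the reported RMSE reduction.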

    Indice di Pericolosità Offensiva: a prediction method for football matches

    This thesis presents an innovative method for computing and implementing an index used to predict the outcome of football matches by means of neural networks. It builds on an existing index, devised by experts in the field, and seeks to extend and improve its performance (increasing the percentage of correctly predicted match outcomes). The thesis shows that the method achieves good results and encourages reflection on possible future developments in the use and spread of cutting-edge technologies for the improvement of football in general, as has already happened in other sports.

    Principal Component Analysis (PCA) to improve the learning performance of the Support Vector Machine (SVM) and Multilayer Neural Network (MLNN) algorithms

    This thesis explores the problem of data sets with a high number of attributes and their impact on the learning performance of the Support Vector Machine (SVM) and Multilayer Neural Network (MLNN) algorithms. To address this problem, we propose the following hypothesis: "Applying Principal Component Analysis (PCA) to the data set will improve the learning performance of the Support Vector Machine (SVM) and Multilayer Neural Network (MLNN) algorithms." In line with this hypothesis, our general objective is to improve the learning performance of the SVM and MLNN algorithms through the application of PCA to the data set. To implement the algorithms (SVM, MLNN and PCA), we used the QSAR biodegradation data set, obtained from the free UCI Machine Learning Repository; all of the implementation was done in Matlab 2014a. Once the algorithms were implemented, we tested the hypothesis: we created two data sets, one with PCA applied and one without, and measured the learning performance of the SVM and MLNN algorithms on both. The results showed that both SVM and MLNN gained a significant improvement in learning performance compared with simply training the algorithms without applying PCA to the data set.
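The comparison described above (same learner, with and without PCA preprocessing) can be sketched in a few lines. The dataset and the number of retained components are stand-ins: the thesis used the QSAR biodegradation set in Matlab 2014a, while this sketch uses a bundled scikit-learn dataset for self-containment.

```python
from sklearn.datasets import load_breast_cancer   # stand-in for QSAR biodegradation
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One pipeline on the raw attributes, one with a PCA reduction in front.
raw = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)
pca = make_pipeline(StandardScaler(), PCA(n_components=10), SVC()).fit(X_tr, y_tr)

acc_raw = raw.score(X_te, y_te)
acc_pca = pca.score(X_te, y_te)
```

Whether PCA helps depends on the data; the point of the experiment is to measure both accuracies under otherwise identical conditions, as the thesis does for SVM and MLNN.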

    Projection-Based Clustering through Self-Organization and Swarm Intelligence

    This work covers aspects of unsupervised machine learning used for knowledge discovery in data science and introduces a data-driven approach to cluster analysis, the Databionic swarm (DBS). DBS combines clustering with a 3D landscape visualization of the data; the 3D landscape enables 3D printing of high-dimensional data structures. The clustering, the number of clusters, or the absence of cluster structure can be verified from the 3D landscape at a glance. DBS is the first swarm-based technique that shows emergent properties while exploiting concepts of swarm intelligence, self-organization and the Nash equilibrium concept from game theory. This eliminates the need for a global objective function and for parameter setting. Through the accompanying R package, DBS can be applied to data drawn from diverse research fields and used even by non-professionals in the field of data mining.

    Classifying the suras by their lexical semantics: an exploratory multivariate analysis approach to understanding the Qur'an

    PhD Thesis. The Qur'an is at the heart of Islamic culture. Careful, well-informed interpretation of it is fundamental both to the faith of millions of Muslims throughout the world, and to the non-Islamic world's understanding of their religion. There is a long and venerable tradition of Qur'anic interpretation, and it has necessarily been based on literary-historical methods for exegesis of hand-written and printed text. Developments in electronic text representation and analysis since the second half of the twentieth century now offer the opportunity to supplement traditional techniques by applying the newly emergent computational technology of exploratory multivariate analysis to interpretation of the Qur'an. The general aim of the present discussion is to take up that opportunity. Specifically, the discussion develops and applies a methodology for discovering the thematic structure of the Qur'an based on a fundamental idea in a range of computationally oriented disciplines: that, with respect to some collection of texts, the lexical frequency profiles of the individual texts are a good indicator of their semantic content, and thus provide a reliable criterion for their conceptual categorization relative to one another. This idea is applied to the discovery of thematic interrelationships among the suras that constitute the Qur'an by abstracting lexical frequency data from them and then analyzing that data using exploratory multivariate methods, in the hope that this will generate hypotheses about the thematic structure of the Qur'an. The discussion is in eight main parts. The first part introduces the discussion. The second gives an overview of the structure and thematic content of the Qur'an and of the tradition of Qur'anic scholarship devoted to its interpretation. The third part defines the research question to be addressed, together with a methodology for doing so. The fourth reviews the existing literature on the research question.
The fifth outlines general principles of data creation and applies them to the creation of the data on which the analysis of the Qur'an in this study is based. The sixth outlines general principles of exploratory multivariate analysis, describes in detail the analytical methods selected for use, and applies them to the data created in part five. The seventh part interprets the results of the analyses conducted in part six with reference to the existing results in Qur'anic interpretation described in part two. And, finally, the eighth part draws conclusions relative to the research question and identifies directions along which the work presented in this study can be developed.
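The methodological core (lexical frequency profiles as a basis for conceptual categorization via exploratory multivariate analysis) can be sketched on toy documents. The four miniature "texts" below are placeholders for suras, and TF-IDF plus hierarchical clustering is one concrete choice of profile and multivariate method, not necessarily the thesis's exact selection.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for texts; each document's word frequencies form its profile.
docs = [
    "mercy forgiveness prayer mercy",
    "prayer mercy charity forgiveness",
    "battle victory army battle sword",
    "army sword battle victory",
]

# Lexical frequency profiles (TF-IDF-weighted here) ...
profiles = TfidfVectorizer().fit_transform(docs).toarray()

# ... grouped by an exploratory multivariate method (hierarchical clustering).
labels = AgglomerativeClustering(n_clusters=2).fit_predict(profiles)
```

Texts sharing vocabulary end up in the same cluster, which is the mechanism by which thematic interrelationships among suras are hypothesized in the thesis.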

    Projection-Based Clustering through Self-Organization and Swarm Intelligence: Combining Cluster Analysis with the Visualization of High-Dimensional Data

    Cluster Analysis; Dimensionality Reduction; Swarm Intelligence; Visualization; Unsupervised Machine Learning; Data Science; Knowledge Discovery; 3D Printing; Self-Organization; Emergence; Game Theory; Advanced Analytics; High-Dimensional Data; Multivariate Data; Analysis of Structured Data

    Kernel-based techniques for texture analysis in biomedical imaging

    In real-world problems it is relevant to study the importance of all the variables obtained, so that noise can be removed; it is at this point that variable selection techniques arise. These techniques aim to find the subset of variables that best describes the useful information contained in the data, allowing performance to be improved. In high-dimensional spaces, kernel-based techniques are of special relevance, as they have demonstrated high efficiency due to their ability to generalize in such spaces. In this work, a new approach for texture analysis in biomedical imaging is proposed in which kernel-based techniques are used to integrate different types of texture data and to select the most representative variables, with the aim of improving the results obtained in classification and in the interpretability of the selected variables. To validate this proposal, an experimental design with four distinct phases was formalized: data extraction, data pre-processing, learning, and selection of the best model, ensuring the reproducibility of the results while allowing a comparison under equal conditions. Two-dimensional electrophoresis gel images were used.
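The variable-selection step can be sketched with recursive feature elimination driven by an SVM, one common kernel-method-based selector. The synthetic data below stands in for texture descriptors extracted from 2-D electrophoresis gel images, and RFE over a linear SVM is an illustrative choice, not necessarily the thesis's exact technique.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in: 30 "texture descriptors", only 5 of which carry signal.
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           n_redundant=0, random_state=0)

# Recursive feature elimination ranks variables by the SVM's weights and
# iteratively drops the least informative ones.
selector = RFE(SVC(kernel="linear"), n_features_to_select=5).fit(X, y)
selected = np.flatnonzero(selector.support_)   # indices of retained variables
```

The retained subset is what would then be compared against the full feature set for classification accuracy and interpretability, under identical experimental conditions.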