424 research outputs found

    An ordinal CNN approach for the assessment of neurological damage in Parkinson’s disease patients

    3D image scans are an assessment tool for neurological damage in Parkinson’s disease (PD) patients. This diagnosis process can be automated to help medical staff through Decision Support Systems (DSSs), and Convolutional Neural Networks (CNNs) are good candidates because they are effective when applied to spatial data. This paper proposes a 3D CNN ordinal model for assessing the level of neurological damage in PD patients. Given that CNNs need large datasets to achieve acceptable performance, a data augmentation method is adapted to work with spatial data. We consider the Ordinal Graph-based Oversampling via Shortest Paths (OGO-SP) method, which applies a gamma probability distribution for inter-class data generation. A modification of OGO-SP is proposed, the OGO-SP-β algorithm, which applies the beta distribution for generating synthetic samples in the inter-class region, a distribution better suited to this purpose than the gamma. The evaluation of the different methods is based on a novel 3D image dataset provided by the Hospital Universitario ‘Reina Sofía’ (Córdoba, Spain). We show how the ordinal methodology improves performance with respect to the nominal one, and how OGO-SP-β yields better performance than OGO-SP.
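
The abstract above contrasts gamma-distributed and beta-distributed generation of synthetic samples between classes. The following is a minimal sketch of that contrast only, not the OGO-SP implementation: it omits the graph and shortest-path steps entirely, and the function names and distribution parameters are assumptions chosen for illustration. It shows one plausible reason a bounded distribution may suit interpolation: a beta draw always lies in [0, 1], whereas a gamma draw has no upper bound.

```python
import random


def interpolate(x_a, x_b, t):
    """Return the point x_a + t * (x_b - x_a), computed feature by feature."""
    return [a + t * (b - a) for a, b in zip(x_a, x_b)]


def beta_sample_between(x_a, x_b, alpha=2.0, beta=2.0, rng=random):
    # Hypothetical helper: betavariate returns a value in [0, 1], so the
    # synthetic point stays on the segment joining the two patterns.
    t = rng.betavariate(alpha, beta)
    return interpolate(x_a, x_b, t)


def gamma_sample_between(x_a, x_b, shape=2.0, scale=0.25, rng=random):
    # Hypothetical helper: gammavariate is unbounded above, so t can
    # exceed 1 and the synthetic point can overshoot x_b.
    t = rng.gammavariate(shape, scale)
    return interpolate(x_a, x_b, t)
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible.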

    Small data oversampling: improving small data prediction accuracy using the geometric SMOTE algorithm

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics. In the age of Big Data, many machine learning tasks in numerous industries are still restricted by the use of small datasets. The limited availability of data often results in unsatisfactory prediction performance of supervised learning algorithms and, consequently, poor decision making. The current research work aims to mitigate the small dataset problem through artificial data generation in the pre-processing phase of the data analysis process. The oversampling technique Geometric SMOTE is applied to generate new training instances and enhance crisp data structures. Experimental results show a significant improvement in prediction accuracy compared with the use of the original, small datasets and with other oversampling techniques such as Random Oversampling, SMOTE and Borderline SMOTE. These findings show that artificial data creation is a promising approach to overcoming the problem of small data in classification tasks.
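
For orientation, the plain SMOTE baseline named in the abstract above can be sketched in a few lines. This is an illustrative simplification, not the dissertation's code: the function name `smote_sample` and its parameters are assumptions, and Geometric SMOTE itself (which generates points in a wider geometric region around the selected sample rather than only on a line segment) is not reproduced here.

```python
import math
import random


def smote_sample(minority, k=2, rng=random):
    """Create one synthetic point by moving a randomly chosen minority
    sample part of the way toward one of its k nearest minority neighbours."""
    base = rng.choice(minority)
    # Exclude the base point itself, then rank the rest by Euclidean distance.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    # Uniform step in [0, 1) along the segment from base to neighbour.
    t = rng.random()
    return [b + t * (n - b) for b, n in zip(base, neighbour)]
```

Calling `smote_sample` repeatedly yields as many synthetic minority points as needed to enlarge a small training set.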

    Estudio de métodos de construcción de ensembles de clasificadores y aplicaciones [A study of methods for constructing classifier ensembles, and applications]

    Artificial intelligence is concerned with building computer systems that behave intelligently. Within this field, machine learning studies the creation of systems that learn by themselves. One type of machine learning is supervised learning, in which the system is given both the inputs and the expected output and learns from these data. A system of this kind is called a classifier. Sometimes, in the set of examples the system learns from, the number of examples of one type is much larger than the number of examples of another type. When this happens, the dataset is said to be imbalanced. A combination of several classifiers is called an "ensemble", and it often gives better results than any of its individual members. One of the keys to a well-performing ensemble is diversity. This thesis focuses on developing new ensemble construction algorithms, centered on techniques for increasing diversity and on imbalanced problems. In addition, these techniques are applied to solve several industrial problems. Ministerio de Economía y Competitividad, project TIN-2011-2404

    Sentiment Analysis and Stance Detection on German YouTube Comments on Gender Diversity

    This paper explores different options for detecting the stance of German YouTube comments on the topic of gender diversity and compares the results with those of sentiment analysis, showing that these are two very different NLP tasks focusing on distinct characteristics of the discourse. While an existing model (BERT) was used to analyze the comments' sentiment, the comments' stance was first annotated and then used to train different models (SVM with TF-IDF, DistilBERT, LSTM and CNN) to predict the stance of unseen comments. The best results were achieved by the CNN, reaching 78.3% accuracy (92% after dataset normalization) on the test set. Whereas the most common stance identified in the comments is a neutral one (neither completely in favor of nor completely against gender diversity), the overall sentiment of the discourse turns out to be negative. This shows that the discourse around gender diversity in YouTube comments is filled with strong opinions, on the one hand, but also opens up a space for anonymously inquiring and learning about the topic and its implications, on the other. Our research thereby (1) contributes to the understanding and application of different NLP tasks used to predict the sentiment and stance of unstructured textual data, and (2) provides relevant insights into society's attitudes towards a changing system of values and beliefs. Melnyk, L.; Feld, L. (2022). Sentiment Analysis and Stance Detection on German YouTube Comments on Gender Diversity. Journal of Computer-Assisted Linguistic Research. 6:59-86. https://doi.org/10.4995/jclr.2022.18224

    Geometric SMOTE for imbalanced datasets with nominal and continuous features

    Fonseca, J., & Bacao, F. (2023). Geometric SMOTE for imbalanced datasets with nominal and continuous features. Expert Systems with Applications, 234(December), 1-9. [121053]. https://doi.org/10.1016/j.eswa.2023.121053 --- This research was supported by research grants of the Portuguese Foundation for Science and Technology (“Fundação para a Ciência e a Tecnologia”), references SFRH/BD/151473/2021, DSAIPA/DS/0116/2019, and by project UIDB/04152/2020, Centro de Investigação em Gestão de Informação (MagIC). Imbalanced learning can be addressed in three different ways: resampling, algorithmic modifications and cost-sensitive solutions. Resampling, and oversampling in particular, is a more general approach than algorithmic and cost-sensitive methods. Since the proposal of the Synthetic Minority Oversampling TEchnique (SMOTE), various SMOTE variants and neural network-based oversampling methods have been developed. However, the options for oversampling datasets with nominal and continuous features are limited. We propose Geometric SMOTE for Nominal and Continuous features (G-SMOTENC), based on a combination of G-SMOTE and SMOTENC. Our method modifies SMOTENC’s encoding and generation mechanism for nominal features, while using G-SMOTE’s data selection mechanism to determine the center observation and k-nearest neighbors, and its generation mechanism for continuous features. G-SMOTENC’s performance is compared against SMOTENC’s along with two other baseline methods, a state-of-the-art oversampling method and no oversampling. The experiment was performed over 20 datasets with varying imbalance ratios, numbers of metric and non-metric features, and numbers of target classes. We found a significant improvement in classification performance when using G-SMOTENC as the oversampling method. An open-source implementation of G-SMOTENC is made available in the Python programming language.

    The Role of Synthetic Data in Improving Supervised Learning Methods: The Case of Land Use/Land Cover Classification

    A thesis submitted in partial fulfillment of the requirements for the degree of Doctor in Information Management. In remote sensing, Land Use/Land Cover (LULC) maps constitute important assets for various applications, promoting environmental sustainability and good resource management. However, their production continues to be a challenging task. Various factors contribute to the difficulty of generating accurate, timely updated LULC maps, whether via automatic or photo-interpreted LULC mapping. Data preprocessing, a crucial step for any Machine Learning (ML) task, is particularly important in the remote sensing domain because of the overwhelming amount of raw, unlabeled data continuously gathered from multiple remote sensing missions. However, a significant part of the state of the art focuses on scenarios with full access to labeled training data with relatively balanced class distributions. This thesis focuses on the challenges found in automatic LULC classification tasks, specifically in data preprocessing. We focus on the development of novel Active Learning (AL) and imbalanced learning techniques to improve ML performance in situations with limited training data and/or rare classes. We also show that many of the contributions presented are successful not only in remote sensing problems, but also in various other multidisciplinary classification problems. The work presented in this thesis used open access datasets to test the contributions made in imbalanced learning and AL. All the data pulling, preprocessing and experiments are made available at https://github.com/joaopfonseca/publications. The algorithmic implementations are made available in the Python package ml-research at https://github.com/joaopfonseca/ml-research