4,554 research outputs found

    A systematic review of data quality issues in knowledge discovery tasks

    The volume of data is growing rapidly because organizations continuously capture data to support better decision-making. The most fundamental challenge is to explore these large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks; however, much of the data is of poor quality. We present a systematic review of data quality issues in knowledge discovery tasks and a case study applied to the agricultural disease known as coffee rust.

    Differentiating Mental Stress Levels: Analysing Machine Learning Algorithms Comparatively For EEG-Based Mental Stress Classification Using MNE-Python

    Mental stress is a prevalent and consequential condition that impacts individuals' well-being and productivity. Accurate classification of mental stress levels using electroencephalogram (EEG) signals is a promising avenue for early detection and intervention. In this study, we present a comprehensive investigation into mental stress classification using EEG data processed with the MNE-Python library. Our research leverages a diverse set of machine learning algorithms, including Random Forest (RF), Decision Tree, K-Nearest Neighbors (KNN), Multilayer Perceptron (MLP), Support Vector Machine (SVM), AdaBoost, and Extreme Gradient Boosting (XGBoost), to discern differences in classification performance. We employed a single dataset to ensure consistency in our experiments, facilitating a direct comparison of these algorithms. The EEG data were pre-processed using MNE-Python, which included tasks such as signal cleaning and feature selection. Subsequently, we applied the selected machine learning models to the processed data and assessed their classification performance in terms of accuracy, precision, recall, and F1-score. Our results demonstrate notable variations in the classification accuracy of mental stress levels across the different algorithms. These findings suggest that the choice of machine learning technique plays a pivotal role in the effectiveness of EEG-based mental stress classification. Our study not only highlights the potential of MNE-Python for EEG signal processing but also provides valuable insights into the selection of appropriate machine learning algorithms for accurate and reliable mental stress assessment. These outcomes hold promise for the development of robust and practical systems for real-time mental stress monitoring, contributing to enhanced well-being and performance in various domains such as healthcare, education, and workplace environments.
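
The four metrics named in this abstract can be illustrated with a minimal sketch. It assumes a binary stressed/relaxed labelling and uses made-up labels; the function name `binary_metrics` and the data are hypothetical and are not taken from the study.

```python
from collections import Counter

def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for one positive class."""
    counts = Counter()
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            counts["tp"] += 1   # true positive
        elif p == positive:
            counts["fp"] += 1   # false positive
        elif t == positive:
            counts["fn"] += 1   # false negative
        else:
            counts["tn"] += 1   # true negative
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels: 1 = stressed, 0 = relaxed (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
scores = binary_metrics(y_true, y_pred)
```

Computing the same dictionary for each classifier on the same held-out data gives the kind of side-by-side comparison the abstract describes.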

    Ensemble approach on enhanced compressed noise EEG data signal in wireless body area sensor network

    The Wireless Body Area Sensor Network (WBASN) is used for communication among sensor nodes operating on or inside the human body in order to monitor vital body parameters and movements. One important application of WBASN is healthcare monitoring of patients with chronic conditions such as epileptic seizures. Normally, electroencephalograph (EEG) data for epileptic seizures is captured and compressed in order to reduce its transmission time. However, compression also contaminates the data and lowers classification accuracy. Existing work has also not taken into account the large size of the collected EEG data, which makes it bandwidth intensive. Hence, the main goal of this work is to design a unified compression and classification framework for the delivery of EEG data in order to address its large size. Noise at the receiver side also contaminates the data and lowers classification accuracy, so another goal is to reconstruct the compressed data and then recognize it. Therefore, a Noise Signal Combination (NSC) technique is proposed for compressing the transmitted EEG data and enhancing its classification accuracy at the receiving side in the presence of noise and incomplete data. The proposed framework combines compressive sensing and the discrete cosine transform (DCT) in order to reduce the size of the transmitted data. Moreover, a Gaussian noise model of the transmission channel is implemented in the framework. At the receiving side, the proposed NSC is based on weighted voting over four classification techniques. The accuracies of these techniques, namely Artificial Neural Network, Naïve Bayes, k-Nearest Neighbour, and Support Vector Machine classifiers, are fed to the proposed NSC.
The experimental results showed that the proposed technique outperforms the conventional techniques by achieving the highest accuracy for both noiseless and noisy data. Furthermore, the framework plays a significant role in reducing the size of the data and in classifying both noisy and noiseless data. The key contributions are the unified framework and the proposed NSC, which improved accuracy on large noiseless and noisy EEG data. The results demonstrate the effectiveness of the proposed framework and show several benefits, including simplicity and improved accuracy. Finally, the research improves the clinical information available about patients who suffer not only from epilepsy but also from neurological disorders and mental or physiological problems.
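
Accuracy-weighted voting of the kind mentioned in this abstract might be sketched as follows. The abstract does not give the exact weighting rule, so this sketch assumes each classifier's vote simply counts in proportion to its accuracy; the function name, classifier keys, weights, and labels are all hypothetical.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Combine per-classifier labels; each vote counts by its weight."""
    totals = defaultdict(float)
    for name, label in predictions.items():
        totals[label] += weights[name]
    # Ties resolve to the label that sorts first (an arbitrary choice).
    return max(sorted(totals), key=lambda label: totals[label])

# Hypothetical accuracies used as vote weights (not results from the paper).
weights = {"ann": 0.90, "naive_bayes": 0.70, "knn": 0.80, "svm": 0.85}

# Hypothetical predictions for one EEG segment.
predictions = {"ann": "seizure", "naive_bayes": "normal",
               "knn": "normal", "svm": "seizure"}

decision = weighted_vote(predictions, weights)
```

Here the two "seizure" votes carry a combined weight of 1.75 against 1.50 for "normal", so the more accurate classifiers decide the outcome even though the raw vote is split two to two.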

    SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

    The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to the simplicity of its design, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data. It is also featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges in extending SMOTE to Big Data problems. This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project 887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016; and the National Science Foundation (NSF) Grant IIS-1447795.
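
The core idea behind SMOTE, creating a synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours, can be sketched in a few lines. This is a simplified illustration rather than the reference algorithm: it picks base points at random instead of iterating over every minority sample, and the function name, parameters, and data are hypothetical.

```python
import random

def smote_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base (squared Euclidean distance).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic

# Hypothetical 2-D minority-class points (illustrative only).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_samples(minority, n_new=5)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region the minority class already occupies instead of duplicating existing rows.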

    Methods to Improve the Prediction Accuracy and Performance of Ensemble Models

    The application of ensemble predictive models has been an important research area in predicting medical diagnostics, engineering diagnostics, and other related smart devices and technologies. Most current predictive models are complex and unreliable despite numerous past efforts by the research community. The expected accuracy of predictive models has not always been realised because of factors such as complexity and class imbalance. There is therefore a need to improve the predictive accuracy of current ensemble models and to enhance their applications, their reliability, and non-visual predictive tools. The research presented in this thesis adopts a pragmatic phased approach to propose and develop new ensemble models using multiple methods, and validates those methods through rigorous testing and implementation in different phases. The first phase comprises empirical investigations of standalone and ensemble algorithms, carried out to ascertain how the complexity and simplicity of the classifiers affect their performance. The second phase comprises an improved ensemble model based on the integration of the Extended Kalman Filter (EKF), Radial Basis Function Network (RBFN) and AdaBoost algorithms. The third phase comprises an extended model based on early stopping concepts, the AdaBoost algorithm, and the statistical performance of the training samples, designed to minimize overfitting of the proposed model. The fourth phase comprises an enhanced analytical multivariate logistic regression predictive model developed to minimize the complexity and improve the prediction accuracy of the logistic regression model. To facilitate the practical application of the proposed models, an ensemble non-invasive analytical tool is proposed and developed. The tool bridges the gap between theoretical concepts and their practical application to predicting breast cancer survivability.
The empirical findings suggested that: (1) increasing the complexity and topology of algorithms does not necessarily lead to better algorithmic performance; (2) boosting by resampling performs slightly better than boosting by reweighting; (3) the proposed ensemble EKF-RBFN-AdaBoost model achieved better prediction accuracy than several established ensemble models; (4) the proposed early-stopped model converges faster and minimizes overfitting better compared with other models; (5) the proposed multivariate logistic regression concept minimizes model complexity; and (6) the proposed analytical non-invasive tool performed comparatively better than many of the benchmark analytical tools used in predicting breast cancer and diabetic ailments. The research contributions to ensemble practice are: (1) the integration and development of the EKF, RBFN and AdaBoost algorithms as an ensemble model; (2) the development and validation of an ensemble model based on early stopping concepts, AdaBoost, and statistical concepts of the training samples; (3) the development and validation of a predictive logistic regression model based on breast cancer; and (4) the development and validation of non-invasive breast cancer analytic tools based on the predictive models proposed and developed in this thesis. To validate the prediction accuracy of the ensemble models, the proposed models were applied to modelling breast cancer survivability and to diabetes diagnosis tasks. In comparison with other established models, the simulation results showed improved predictive accuracy. The research outlines the benefits of the proposed models while proposing new directions for future work that could further extend and improve them.