    Attribute Noise-Sensitivity Impact: Model Performance and Feature Ranking

    Developing robust, less complex models that can cope with environment volatility is the goal of every data mining project. This study attempts to establish heuristics for investigating how noise in instance attribute data affects learning model volatility. In addition, it introduces an alternative method for determining attribute importance and feature ranking, based on attribute sensitivity to noise. We present an empirical analysis of the effect of attribute noise on model performance and how it impacts the overall learning process. The study employs datasets drawn from different domains, including medicine, CRM, and security. The proposed technique has practical implications: it supports building low-volatility, high-performance predictive models prior to production deployment. The study also has implications for research by filling the gap in attribute noise research and its impact
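
    The idea of ranking attributes by their sensitivity to noise can be sketched roughly as follows. This is not the authors' method, only a minimal illustration under stated assumptions: the data are synthetic, the classifier is a simple nearest-centroid model, the noise is additive Gaussian, and all function names (`make_data`, `fit_centroids`, `noise_sensitivity`) are hypothetical. Each attribute is perturbed in turn on the test set, and attributes are ranked by the resulting drop in accuracy.

```python
import random


def make_data(n, rng):
    """Synthetic two-class data (assumption): attribute 0 is informative, 1 and 2 are not."""
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x0 = rng.gauss(3.0 * y, 1.0)  # mean depends on the class
        x1 = rng.gauss(0.0, 1.0)      # independent of the class
        x2 = rng.gauss(0.0, 1.0)      # independent of the class
        rows.append([x0, x1, x2])
        labels.append(y)
    return rows, labels


def fit_centroids(rows, labels):
    """Nearest-centroid 'training': the per-class mean of every attribute."""
    sums, counts = {}, {}
    for row, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, row):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(row, centroids[y])),
    )


def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)


def noise_sensitivity(centroids, rows, labels, sigma, rng):
    """Accuracy drop when Gaussian noise is added to one attribute at a time."""
    base = accuracy(centroids, rows, labels)
    drops = []
    for j in range(len(rows[0])):
        noisy = [r[:j] + [r[j] + rng.gauss(0.0, sigma)] + r[j + 1:] for r in rows]
        drops.append(base - accuracy(centroids, noisy, labels))
    return drops


rng = random.Random(0)  # fixed seed so the sketch is repeatable
train_x, train_y = make_data(400, rng)
test_x, test_y = make_data(400, rng)
model = fit_centroids(train_x, train_y)

drops = noise_sensitivity(model, test_x, test_y, sigma=3.0, rng=rng)
# Most noise-sensitive attribute first.
ranking = sorted(range(len(drops)), key=lambda j: drops[j], reverse=True)
print("accuracy drop per attribute:", drops)
print("attribute ranking:", ranking)
```

    In this toy setup the informative attribute should come out on top, since perturbing it degrades accuracy far more than perturbing the uninformative ones. The noise level `sigma` is an arbitrary choice here; a real study would likely sweep several noise levels and model types.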

    A new dataset for coffee rust detection in Colombian crops based on classifiers (Un nuevo conjunto de datos para la detección de roya en cultivos de café Colombianos basado en clasificadores)

    Coffee production is the main agricultural activity in Colombia. More than 350,000 Colombian families depend on the coffee harvest. Since coffee rust disease was first reported in the country in 1983, these families have had to face severe consequences. Recently, machine learning approaches have built datasets for monitoring coffee rust incidence that take into account weather conditions and the physical properties of the crops. This background encouraged us to build a dataset for coffee rust detection in Colombian crops by following the Cross Industry Standard Process for Data Mining (CRISP-DM). In this paper we define a dataset intended to support accurate models; once built, the dataset is tested using three classifiers: Support Vector Regression, Backpropagation Neural Networks and Regression Trees.

    A Statistical Comparison of Classification Algorithms on a Single Data Set

    This research uses four classification algorithms in standard and boosted forms to predict members of a class for an online community. We compare two performance measures, area under the curve (AUC) and accuracy, in the standard and boosted forms. The research compares four popular algorithms: Bayes, logistic regression, J48 and Nearest Neighbor (NN). The analysis shows that there are significant differences among the base classification algorithms; J48 had the best accuracy. Additionally, the results show that boosted methods improved the accuracy of logistic regression. ANOVA was used to detect the differences between the algorithms; post hoc analysis shows the differences between specific algorithms

    Noise simulation in classification with the noisemodel R package: Applications analyzing the impact of errors with chemical data

    Classification datasets created from chemical processes can be affected by errors, which impair the accuracy of the models built. This fact highlights the importance of analyzing the robustness of classifiers against different types and levels of noise to know their behavior against potential errors. In this context, noise models have been proposed to study noise-related phenomenology in a controlled environment, allowing errors to be introduced into the data in a supervised manner. This paper introduces the noisemodel R package, which contains the first extensive implementation of noise models for classification datasets, proposing it as a support tool to analyze the impact of errors related to chemical data. It provides 72 noise models found in the specialized literature that allow errors to be introduced in different ways in classes and attributes. Each of them is properly documented and referenced, unifying their results through a specific S3 class, which benefits from customized print, summary and plot methods. The usage of the package is illustrated through four application examples considering real-world chemical datasets, where errors are prone to occur. The software presented will help to deepen the understanding of the problem of noisy chemical data, as well as to develop new robust algorithms and noise preprocessing methods properly adapted to different types of errors in this scenario. University of Granada/CBU