8 research outputs found

    How False Data Affects Machine Learning Models in Electrochemistry?

    Recently, machine learning models have often been selected based only on the data distribution, without considering the noise in the data. This study aims to distinguish which models perform well on noisy data and to establish whether stacking machine learning models actually provides robustness to models that are otherwise weak to noise. The electrochemical data were tested with 12 standalone models (XGB, LGBM, RF, GB, ADA, NN, ELAS, LASS, RIDGE, SVM, KNN, and DT) and a stacking model. Linear models are found to handle noise well, with an average slope of 1.75 F g-1 of error per 100% noise added, but they suffer in prediction accuracy, with an average estimated error of 60.19 F g-1 at 0% noise added. Tree-based models handle noise poorly (average slope of 55.24 F g-1 per 100% noise added) but can provide higher prediction accuracy (lowest error of 23.9 F g-1) than linear models. To address the trade-off between prediction accuracy and noise handling, a stacking model was constructed, which not only shows high accuracy (intercept of 25.03 F g-1) but also exhibits good noise handling (slope of 43.58 F g-1), making stacking models a relatively low-risk and viable choice for beginner and experienced machine learning researchers in electrochemistry. Although neural networks (NN) are gaining popularity in the electrochemistry field, this study shows that NN is not suitable for electrochemical data and that improper tuning results in a model that is susceptible to noise. Thus, STACK models should provide better benefits in that, even with untuned base models, they can achieve an accurate and noise-tolerant model. Overall, this work provides insight into machine learning model selection for electrochemical data, which should aid the understanding of data science in a chemistry context. Comment: 40 pages, 11 figures
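One way to read the slope and intercept figures quoted above is as the coefficients of a straight line fitted to a model's prediction error against the fraction of noise added: the intercept is the error at 0% noise, and the slope is the extra error per 100% noise. The sketch below assumes that reading. The function name `fit_line` and all error values are invented for illustration and are not results from the study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("noise levels must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noise levels: fraction of noise added (1.0 means 100%).
noise_levels = [0.0, 0.25, 0.5, 0.75, 1.0]

# Hypothetical errors in F g-1 (made up, not taken from the paper).
errors_flat = [60.0, 60.5, 61.0, 61.5, 62.0]    # less accurate, noise-tolerant
errors_steep = [24.0, 38.0, 52.0, 66.0, 80.0]   # more accurate, noise-sensitive

flat_slope, flat_intercept = fit_line(noise_levels, errors_flat)
steep_slope, steep_intercept = fit_line(noise_levels, errors_steep)

print(f"flat:  intercept={flat_intercept:.2f}, slope={flat_slope:.2f}")
print(f"steep: intercept={steep_intercept:.2f}, slope={steep_slope:.2f}")
```

Under this reading, a low intercept indicates good accuracy on clean data and a low slope indicates good noise tolerance, so the two numbers can be compared across models independently.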

    Noise Reduction for Instance-Based Learning with a Local Maximal Margin Approach

    To some extent the problem of noise reduction in machine learning has been finessed by the development of learning techniques that are noise-tolerant. However, it is difficult to make instance-based learning noise-tolerant, and noise reduction still plays an important role in k-nearest neighbour classification. There are also other motivations for noise reduction: for instance, the elimination of noise may result in simpler models, or data cleansing may be an end in itself. In this paper we present a novel approach to noise reduction based on local Support Vector Machines (LSVM), which brings the benefits of maximal margin classifiers to bear on noise reduction. This provides a more robust alternative to the majority rule on which almost all the existing noise reduction techniques are based. Roughly speaking, for each training sample an SVM is trained on its neighbourhood, and if the SVM classification for the central sample disagrees with its actual class there is evidence in favour of removing it from the training set. We provide an empirical evaluation on 15 real datasets showing improved classification accuracy when using training data edited with our method, as well as specific experiments regarding the spam filtering application domain. We present a further evaluation on two artificial datasets where we analyse two different types of noise (Gaussian sample noise and mislabelling noise) and the influence of different class densities. The conclusion is that LSVM noise reduction is significantly better than the other analysed algorithms for real datasets and for artificial datasets perturbed by Gaussian noise and in the presence of uneven class densities.
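The editing loop described above (classify each training sample from its own neighbourhood and drop it when the prediction disagrees with its label) can be sketched in a few lines. This is not the paper's LSVM implementation: the local classifier here is the plain majority rule that the abstract names as the common baseline, and the function names, the parameter `k`, and the toy data are all hypothetical. A classifier trained on the neighbourhood, such as an SVM, could be passed as `local_classifier` instead.

```python
import math
from collections import Counter


def k_nearest(index, X, k):
    """Indices of the k training points closest to X[index], excluding itself."""
    dists = [(math.dist(X[index], X[j]), j) for j in range(len(X)) if j != index]
    dists.sort()
    return [j for _, j in dists[:k]]


def majority_vote(neigh_X, neigh_y, query):
    """Stand-in local classifier: the most common label in the neighbourhood."""
    return Counter(neigh_y).most_common(1)[0][0]


def edit_training_set(X, y, k=3, local_classifier=majority_vote):
    """Return indices of samples whose label agrees with the local prediction."""
    keep = []
    for i in range(len(X)):
        nbrs = k_nearest(i, X, k)
        predicted = local_classifier(
            [X[j] for j in nbrs], [y[j] for j in nbrs], X[i]
        )
        if predicted == y[i]:
            keep.append(i)
    return keep


# Toy data (hypothetical): two clusters, with the last point mislabelled
# as "b" although it sits in the middle of cluster "a".
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6), (0.5, 0.5)]
y = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]

kept = edit_training_set(X, y, k=3)
print(kept)
```

Every sample is judged against the original, unedited training set, so the result does not depend on the order in which samples are visited.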

    Reputation-based maintenance in case-based reasoning

    Case Base Maintenance algorithms update the contents of a case base in order to improve case-based reasoner performance. In this paper, we introduce a new case base maintenance method called Reputation-Based Maintenance (RBM) with the aim of increasing the classification accuracy of a Case-Based Reasoning system while reducing the size of its case base. The proposed RBM algorithm calculates a case property called Reputation for each member of the case base, the value of which reflects the competence of the related case. Based on this case property, several removal policies and maintenance methods have been designed, each focusing on different aspects of the case base maintenance. The performance of the RBM method was compared with well-known state-of-the-art algorithms. The tests were performed on 30 datasets selected from the UCI repository. The results show that the RBM method in all its variations achieves greater accuracy than a baseline CBR, while some variations significantly outperform the state-of-the-art methods. We particularly highlight the RBM_ACBR algorithm, which achieves the highest accuracy among the methods in the comparison to a statistically significant degree, and the RBMcr algorithm, which increases the baseline accuracy while removing, on average, over half of the case base. This work has been partially supported by the Spanish Ministry of Science and Innovation with project MISMIS-LANGUAGE (grant number PGC2018-096212-B-C33), by the Catalan Agency of University and Research Grants Management (AGAUR) (grants number 2017 SGR 341 and 2017 SGR 574), by the Spanish Network "Learning Machines for Singular Problems and Applications (MAPAS)" (TIN2017-90567-REDT, MINECO/FEDER EU) and by the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 860843. Peer Reviewed. Postprint (author's final draft)

    Noise cleaning for classification based on neighborhood and concept changes over time

    Classification is an important field within data mining and pattern recognition, and it is necessary in a number of present-day processes. Several works and methods have been proposed with the goal of making classifiers increasingly effective. However, most of them consider the training sets to be perfectly clustered, without taking into account that they might contain incorrectly classified data. The process of removing incorrectly classified objects is called noise cleaning. Noise cleaning considerably influences the classification of new samples. In this work, we present a neighborhood-based algorithm for noise cleaning on data streams for classification. In addition, it considers the changes in data distribution that may occur over time. Several experiments measured the effect of the method on the automatic building of training sets, using databases from the UCI repository and two synthetic ones. The obtained results demonstrate the efficacy of the proposed noise cleaning strategy and its influence on the correct classification of new samples.

    A dynamic adaptive framework for improving case-based reasoning system performance

    Optimal performance of a Case-Based Reasoning (CBR) system means that the system must be efficient in both time and size, and must be optimally competent. Efficiency in time is closely related to an efficient and optimal retrieval process over the Case Base of the CBR system. Efficiency in size means that the Case Library (CL) size should be minimal. Therefore, efficiency in size is closely related to optimal case learning policies, optimal meta-case learning policies, optimal case forgetting policies, etc. On the other hand, the optimal competence of a CBR system means that the number of problems that the system can satisfactorily solve must be maximal. Improving or optimizing all three dimensions of a CBR system at the same time is a difficult challenge because they are interrelated, and it becomes even more difficult when the CBR system is applied to a dynamic or continuous domain (data stream). In this thesis, a Dynamic Adaptive Case Library framework (DACL) is proposed to improve CBR system performance, focusing especially on reducing the retrieval time, increasing the CBR system competence, and maintaining and adapting the CL to be efficient in size, especially in continuous domains. DACL learns cases and organizes them into dynamic cluster structures. The DACL is able to adapt itself to a dynamic environment, where new clusters, meta-cases or prototypes of cases, and associated indexing structures (discriminant trees, k-d trees, etc.) can be formed, updated, or even removed. DACL offers a possible solution to the management of the large amount of data generated in an unsupervised continuous domain (data stream). In addition, we propose the use of a Multiple Case Library (MCL), which is a static version of a DACL, with the same structure but defined statically for use in supervised domains. The thesis work proposes some techniques for improving the indexing and retrieval tasks.
The most important indexing method is the NIAR k-d tree algorithm, which improves retrieval time and competence compared with the baseline approach (a flat CL) and with well-known techniques based on standard k-d tree strategies. The proposed Partial Matching Exploration (PME) technique explores a hierarchical case library with a tree indexing structure, aiming not to lose the cases most similar to a query case. This technique explores not only the best matching path but also several alternative partial matching paths. Experimental tests on a set of well-known supervised benchmark databases show an improvement in competence and in the retrieval time of similar cases. The dynamic building of prototypes in DACL has been tested in an unsupervised domain (environmental domain) where air pollution is evaluated. The core task of building prototypes in a DACL is the implementation of a stochastic method for learning new cases and managing prototypes. Finally, the whole dynamic framework, integrating all the main proposed approaches of the research work, has been tested in simulated unsupervised domains with several well-known databases in an incremental way, as data streams are processed in real life. The experimental results indicate that the proposed dynamic adaptive framework (DACL/MCL), together with the contributed indexing strategies and exploration techniques and the proposed stochastic case learning and meta-case learning policies, improves the performance of standard CBR systems both in supervised domains (MCL) and in unsupervised continuous domains (DACL). Postprint (published version)

    Learning from noisy data through robust feature selection, ensembles and simulation-based optimization

    The presence of noise and uncertainty in real scenarios makes machine learning a challenging task. Acquisition errors or missing values can lead to models that do not generalize well on new data. Under-fitting and over-fitting can occur because of feature redundancy in high-dimensional problems as well as data scarcity. In these contexts the learning task can show difficulties in extracting relevant and stable information from noisy features or from a limited set of samples with high variance. In some extreme cases, the presence of only aggregated data instead of individual samples prevents the use of instance-based learning. In these contexts, parametric models can be learned through simulations to take into account the inherent stochastic nature of the processes involved. This dissertation includes contributions to different learning problems characterized by noise and uncertainty. In particular, we propose i) a novel approach for robust feature selection based on the neighborhood entropy, ii) an approach based on ensembles for robust salary prediction in the IT job market, and iii) a parametric simulation-based approach for dynamic pricing and what-if analyses in hotel revenue management when only aggregated data are available

    A Principled Methodology: A Dozen Principles of Software Effort Estimation

    Software effort estimation (SEE) is the activity of estimating the total effort required to complete a software project. Correctly estimating the effort required for a software project is of vital importance for the competitiveness of organizations. Both under- and over-estimation lead to undesirable consequences. Under-estimation may result in budget and schedule overruns, which in turn may cause the cancellation of projects, thereby wasting the entire effort spent until that point. Over-estimation may cause promising projects not to be funded, hence harming organizational competitiveness. Due to the significant role of SEE for software organizations, considerable research effort has been invested in SEE. Thanks to the accumulation of decades of prior research, today we are able to identify the core issues and search for the right principles to tackle pressing questions. For example, despite decades of work, we still lack concrete answers to important questions such as: What is the best SEE method? The introduced estimation methods make use of local data; however, not all companies have their own data, so: How can we handle the lack of local data? Common SEE methods take size attributes for granted, yet size attributes are costly and practitioners place very little trust in them. Hence, we ask: How can we avoid the use of size attributes? Collection of data, particularly dependent variable information (i.e. effort values), is costly: How can we find an essential subset of the SEE data sets? Finally, studies make use of sampling methods to justify a new method's performance on SEE data sets. Yet the trade-off among different variants is ignored: How should we choose sampling methods for SEE experiments? This thesis is a rigorous investigation towards the identification and tackling of the pressing issues in SEE.
Our findings rely on extensive experimentation performed with a large corpus of estimation techniques on a large set of public and proprietary data sets. We summarize our findings and industrial experience in the form of 12 principles: 1) Know your domain 2) Let the Experts Talk 3) Suspect your data 4) Data Collection is Cyclic 5) Use a Ranking Stability Indicator 6) Assemble Superior Methods 7) Weighting Analogies is Over-elaboration 8) Use Easy-path Design 9) Use Relevancy Filtering 10) Use Outlier Pruning 11) Combine Outlier and Synonym Pruning 12) Be Aware of Sampling Method Trade-off

    Large dataset complexity reduction for classification: An optimization perspective

    Doctor of Philosophy. Computational complexity in data mining is attributed to algorithms but lies hugely with the data. Different algorithms may exist to solve the same problem, but the simplest is not always the best. At the same time, data of astronomical proportions is rather common, boosted by automation, and the fuller the data, the better the resolution of the concept it projects. Paradoxically, it is the computing power that is lacking. Perhaps a fast algorithm can be run on the data, but not the optimal one. Even then any modeling is much constrained, involving serial application of many algorithms. The only other way to relieve the computational load is to make the data lighter. Any representative subset has to preserve the essence of the data, ideally suiting any algorithm. The reduction should minimize the error of approximation while trading precision for performance. Data mining is a wide field. We concentrate on classification. In the literature review we present a variety of methods, emphasizing the effort of the past decade. Two major objects of reduction are instances and attributes. The data can also be recast into a more economical format. We address sampling, noise reduction, class domain binarization, feature ranking, feature subset selection, feature extraction, and also discretization of continuous features. Achievements are tremendous, but so are possibilities. We improve an existing technique of data cleansing and suggest a way of data condensing as the extension. We also touch on noise reduction. Instance similarity, excepting the class mix, prompts a technique of feature selection. Additionally, we consider multivariate discretization, enabling a compact data representation without changing the size. We compare the proposed methods with alternative techniques, which we newly introduce, implement, or use as available.