1,182 research outputs found

    On the role of pre and post-processing in environmental data mining

    Get PDF
    The quality of discovered knowledge is highly depending on data quality. Unfortunately real data use to contain noise, uncertainty, errors, redundancies or even irrelevant information. The more complex is the reality to be analyzed, the higher the risk of getting low quality data. Knowledge Discovery from Databases (KDD) offers a global framework to prepare data in the right form to perform correct analyses. On the other hand, the quality of decisions taken upon KDD results, depend not only on the quality of the results themselves, but on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex and environmental users particularly require clarity in their results. In this paper some details about how this can be achieved are provided. The role of the pre and post processing in the whole process of Knowledge Discovery in environmental systems is discussed

    Interpretable Categorization of Heterogeneous Time Series Data

    Get PDF
    Understanding heterogeneous multivariate time series data is important in many applications ranging from smart homes to aviation. Learning models of heterogeneous multivariate time series that are also human-interpretable is challenging and not adequately addressed by the existing literature. We propose grammar-based decision trees (GBDTs) and an algorithm for learning them. GBDTs extend decision trees with a grammar framework. Logical expressions derived from a context-free grammar are used for branching in place of simple thresholds on attributes. The added expressivity enables support for a wide range of data types while retaining the interpretability of decision trees. In particular, when a grammar based on temporal logic is used, we show that GBDTs can be used for the interpretable classi cation of high-dimensional and heterogeneous time series data. Furthermore, we show how GBDTs can also be used for categorization, which is a combination of clustering and generating interpretable explanations for each cluster. We apply GBDTs to analyze the classic Australian Sign Language dataset as well as data on near mid-air collisions (NMACs). The NMAC data comes from aircraft simulations used in the development of the next-generation Airborne Collision Avoidance System (ACAS X).Comment: 9 pages, 5 figures, 2 tables, SIAM International Conference on Data Mining (SDM) 201

    Discretization of Continuous Attributes

    No full text
    7 pagesIn the data mining field, many learning methods -like association rules, Bayesian networks, induction rules (Grzymala-Busse & Stefanowski, 2001)- can handle only discrete attributes. Therefore, before the machine learning process, it is necessary to re-encode each continuous attribute in a discrete attribute constituted by a set of intervals, for example the age attribute can be transformed in two discrete values representing two intervals: less than 18 (a minor) and 18 and more (of age). This process, known as discretization, is an essential task of the data preprocessing, not only because some learning methods do not handle continuous attributes, but also for other important reasons: the data transformed in a set of intervals are more cognitively relevant for a human interpretation (Liu, Hussain, Tan & Dash, 2002); the computation process goes faster with a reduced level of data, particularly when some attributes are suppressed from the representation space of the learning problem if it is impossible to find a relevant cut (Mittal & Cheong, 2002); the discretization can provide non-linear relations -e.g., the infants and the elderly people are more sensitive to illness

    An association rule mining method for estimating the impact of project management policies on software quality, development time and effort

    Get PDF
    Accurate and early estimations are essential for effective decision making in software project management. Nowadays, classical estimation models are being replaced by data mining models due to their application simplicity and the rapid production of profitable results. In this work, a method for mining association rules that relate project attributes is proposed. It deals with the problem of discretizing continuous data in order to generate a manageable number of high confident association rules. The method was validated by applying it to data from a Software Project Simulator. The association model obtained allows us to estimate the influence of certain management policy factors on various software project attributes simultaneously

    Local Rule-Based Explanations of Black Box Decision Systems

    Get PDF
    The recent years have witnessed the rise of accurate but obscure decision systems which hide the logic of their internal decision processes to the users. The lack of explanations for the decisions of black box systems is a key ethical issue, and a limitation to the adoption of machine learning components in socially sensitive and safety-critical contexts. %Therefore, we need explanations that reveals the reasons why a predictor takes a certain decision. In this paper we focus on the problem of black box outcome explanation, i.e., explaining the reasons of the decision taken on a specific instance. We propose LORE, an agnostic method able to provide interpretable and faithful explanations. LORE first leans a local interpretable predictor on a synthetic neighborhood generated by a genetic algorithm. Then it derives from the logic of the local interpretable predictor a meaningful explanation consisting of: a decision rule, which explains the reasons of the decision; and a set of counterfactual rules, suggesting the changes in the instance's features that lead to a different outcome. Wide experiments show that LORE outperforms existing methods and baselines both in the quality of explanations and in the accuracy in mimicking the black box

    Multivariate discretization of continuous valued attributes.

    Get PDF
    The area of Knowledge discovery and data mining is growing rapidly. Feature Discretization is a crucial issue in Knowledge Discovery in Databases (KDD), or Data Mining because most data sets used in real world applications have features with continuously values. Discretization is performed as a preprocessing step of the data mining to make data mining techniques useful for these data sets. This thesis addresses discretization issue by proposing a multivariate discretization (MVD) algorithm. It begins withal number of common discretization algorithms like Equal width discretization, Equal frequency discretization, Naïve; Entropy based discretization, Chi square discretization, and orthogonal hyper planes. After that comparing the results achieved by the multivariate discretization (MVD) algorithm with the accuracy results of other algorithms. This thesis is divided into six chapters, covering a few common discretization algorithms and tests these algorithms on a real world datasets which varying in size and complexity, and shows how data visualization techniques will be effective in determining the degree of complexity of the given data set. We have examined the multivariate discretization (MVD) algorithm with the same data sets. After that we have classified discrete data using artificial neural network single layer perceptron and multilayer perceptron with back propagation algorithm. We have trained the Classifier using the training data set, and tested its accuracy using the testing data set. Our experiments lead to better accuracy results with some data sets and low accuracy results with other data sets, and this is subject ot the degree of data complexity then we have compared the accuracy results of multivariate discretization (MVD) algorithm with the results achieved by other discretization algorithms. We have found that multivariate discretization (MVD) algorithm produces good accuracy results in comparing with the other discretization algorithm

    Investigating hybrids of evolution and learning for real-parameter optimization

    Get PDF
    In recent years, more and more advanced techniques have been developed in the field of hybridizing of evolution and learning, this means that more applications with these techniques can benefit from this progress. One example of these advanced techniques is the Learnable Evolution Model (LEM), which adopts learning as a guide for the general evolutionary search. Despite this trend and the progress in LEM, there are still many ideas and attempts which deserve further investigations and tests. For this purpose, this thesis has developed a number of new algorithms attempting to combine more learning algorithms with evolution in different ways. With these developments, we expect to understand the effects and relations between evolution and learning, and also achieve better performances in solving complex problems. The machine learning algorithms combined into the standard Genetic Algorithm (GA) are the supervised learning method k-nearest-neighbors (KNN), the Entropy-Based Discretization (ED) method, and the decision tree learning algorithm ID3. We test these algorithms on various real-parameter function optimization problems, especially the functions in the special session on CEC 2005 real-parameter function optimization. Additionally, a medical cancer chemotherapy treatment problem is solved in this thesis by some of our hybrid algorithms. The performances of these algorithms are compared with standard genetic algorithms and other well-known contemporary evolution and learning hybrid algorithms. Some of them are the CovarianceMatrix Adaptation Evolution Strategies (CMAES), and variants of the Estimation of Distribution Algorithms (EDA). Some important results have been derived from our experiments on these developed algorithms. Among them, we found that even some very simple learning methods hybridized properly with evolution procedure can provide significant performance improvement; and when more complex learning algorithms are incorporated with evolution, the resulting algorithms are very promising and compete very well against the state of the art hybrid algorithms both in well-defined real-parameter function optimization problems and a practical evaluation-expensive problem

    Vision-based neural network classifiers and their applications

    Get PDF
    A thesis submitted for the degree of Doctor of Philosophy of University of LutonVisual inspection of defects is an important part of quality assurance in many fields of production. It plays a very useful role in industrial applications in order to relieve human inspectors and improve the inspection accuracy and hence increasing productivity. Research has previously been done in defect classification of wood veneers using techniques such as neural networks, and a certain degree of success has been achieved. However, to improve results in tenus of both classification accuracy and running time are necessary if the techniques are to be widely adopted in industry, which has motivated this research. This research presents a method using rough sets based neural network with fuzzy input (RNNFI). Variable precision rough set (VPRS) method is proposed to remove redundant features utilising the characteristics of VPRS for data analysis and processing. The reduced data is fuzzified to represent the feature data in a more suitable foml for input to an improved BP neural network classifier. The improved BP neural network classifier is improved in three aspects: additional momentum, self-adaptive learning rates and dynamic error segmenting. Finally, to further consummate the classifier, a uniform design CUD) approach is introduced to optimise the key parameters because UD can generate a minimal set of uniform and representative design points scattered within the experiment domain. Optimal factor settings are achieved using a response surface (RSM) model and the nonlinear quadratic programming algorithm (NLPQL). Experiments have shown that the hybrid method is capable of classifying the defects of wood veneers with a fast convergence speed and high classification accuracy, comparing with other methods such as a neural network with fuzzy input and a rough sets based neural network. The research has demonstrated a methodology for visual inspection of defects, especially for situations where there is a large amount of data and a fast running speed is required. It is expected that this method can be applied to automatic visual inspection for production lines of other products such as ceramic tiles and strip steel

    Supervised classification and mathematical optimization

    Get PDF
    Data Mining techniques often ask for the resolution of optimization problems. Supervised Classification, and, in particular, Support Vector Machines, can be seen as a paradigmatic instance. In this paper, some links between Mathematical Optimization methods and Supervised Classification are emphasized. It is shown that many different areas of Mathematical Optimization play a central role in off-the-shelf Supervised Classification methods. Moreover, Mathematical Optimization turns out to be extremely useful to address important issues in Classification, such as identifying relevant variables, improving the interpretability of classifiers or dealing with vagueness/noise in the data.Ministerio de Ciencia e InnovaciónJunta de Andalucí
    corecore