
    Towards a framework for designing full model selection and optimization systems

    People from a variety of industrial domains are beginning to realise that appropriate use of machine learning techniques in their data mining projects could bring great benefits. End-users now face a new problem: how to choose a combination of data processing tools and algorithms for a given dataset. This problem is usually termed the Full Model Selection (FMS) problem. Extending our previous work [10], this paper introduces a framework for designing FMS algorithms. Under this framework, we propose a novel algorithm, named GPS (which stands for GA-PSO-FMS), that combines genetic algorithms (GA) and particle swarm optimization (PSO): a GA searches for the optimal structure of a data mining solution, and PSO searches for the optimal parameters of a particular structure instance. Given a classification dataset, GPS outputs an FMS solution as a directed acyclic graph consisting of diverse data mining operators that are available to the problem. Experimental results demonstrate the benefit of the algorithm. We also present, with detailed analysis, two model-tree-based variants for speeding up the GPS algorithm.
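The parameter-search half of such a GA-PSO scheme can be illustrated with a generic particle swarm optimizer. The sketch below is a textbook global-best PSO, not the GPS implementation from the paper; the function name `pso`, the coefficients `w`, `c1`, `c2`, and the toy objective `toy_error` (a stand-in for the validation error of one fixed pipeline structure) are all assumptions made for illustration.

```python
import random

def pso(objective, bounds, n_particles=20, n_iters=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise `objective` over a box given as [(low, high), ...]."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial positions inside the box, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull towards personal best + pull towards global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)  # clamp to box
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def toy_error(params):
    """Hypothetical error surface with its minimum at (1, -2)."""
    x, y = params
    return (x - 1.0) ** 2 + (y + 2.0) ** 2

best_params, best_error = pso(toy_error, bounds=[(-5.0, 5.0), (-5.0, 5.0)])
```

In a full model selection setting, an outer search (the GA in the abstract) would propose a structure, and an inner loop like this one would tune that structure's numeric parameters, with the objective replaced by a cross-validated error estimate.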

    Comparison of the C4.5 Algorithm and PSO-Based C4.5 for Predicting Monthly Fuel (BBM) Usage at the Department of Environment and Hygiene of East Lombok Regency (Dinas Lingkungan Hidup dan Kebersihan Kabupaten Lombok Timur)

    East Lombok Regency is a second-level region of West Nusa Tenggara Province, located on the east side of Lombok Island. Its capital is the city of Selong, where all government agencies are based. One of them is the Department of Environment and Hygiene of East Lombok Regency. The department's operational vehicles run on fuel oil (BBM) subsidized by the government. Daily fuel use must therefore be recorded properly so that the amount of fuel used each month can be predicted. However, the department has difficulty processing such data in large quantities. The head of the agency needs predicted fuel-use information to support decisions and policies. For this problem, the appropriate data mining technique is classification. One classification method is the decision tree algorithm C4.5. C4.5 has weaknesses in handling large amounts of data, so the researchers apply Particle Swarm Optimization (PSO) for attribute weighting and selection to increase the accuracy of C4.5. The researchers use data mining software to compare C4.5 with PSO-based C4.5 and obtain the best accuracy in predicting the monthly amount of fuel used at the Department of Environment and Hygiene of East Lombok Regency. DOI: 10.29408/jit.v2i1.117
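C4.5 ranks candidate split attributes by gain ratio, that is, information gain divided by the entropy of the split itself. The sketch below shows that criterion only; it is not the study's software, and the attribute values and class labels (`vehicle_type`, `weekday`, `usage_class`) are hypothetical data invented for illustration.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def gain_ratio(values, labels):
    """Gain ratio of splitting `labels` on a categorical attribute `values`."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy of the class labels after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)  # penalises attributes with many values
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical records: monthly usage class per vehicle.
usage_class = ["low", "low", "high", "high"]
vehicle_type = ["pickup", "pickup", "truck", "truck"]   # separates the classes
weekday = ["mon", "tue", "mon", "tue"]                  # carries no class signal

informative = gain_ratio(vehicle_type, usage_class)
uninformative = gain_ratio(weekday, usage_class)
```

An attribute-selection wrapper such as the PSO step described in the abstract would sit outside this criterion, choosing or weighting which attributes the tree is allowed to consider.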

    Automatic classification of oranges using image processing and data mining techniques

    Data mining is the discovery of patterns and regularities in large amounts of data using machine learning algorithms. It can be applied to object recognition using image processing techniques. In fruit and vegetable production lines, quality assurance is done by trained people who inspect the fruits as they move along a conveyor belt and classify them into several categories based on visual features. In this paper we present an automatic orange classification system, which uses visual inspection to extract features from images captured with a digital camera. These features are used to train several data mining algorithms, which should classify the fruits into one of three pre-established categories. The data mining algorithms used are five different decision trees (J48, Classification and Regression Tree (CART), Best First Tree, Logistic Model Tree (LMT) and Random Forest), three artificial neural networks (Multilayer Perceptron with Backpropagation, Radial Basis Function Network (RBF Network), Sequential Minimal Optimization for Support Vector Machine (SMO)) and a classification rule (1Rule). The results obtained are encouraging because of the good accuracy achieved by the classifiers and the low computational costs. Workshop de Agentes y Sistemas Inteligentes (WASI). Red de Universidades con Carreras en Informática (RedUNCI).

    An Investigation in Efficient Spatial Patterns Mining

    Technical progress in computerized spatial data acquisition and storage has led to the growth of vast spatial databases. Faced with large and increasing amounts of spatial data, an end user has more difficulty understanding them without helpful knowledge extracted from spatial databases. Thus, spatial data mining has been brought under the umbrella of data mining and is attracting more attention. Spatial data mining presents challenges. Unlike ordinary data, spatial data includes not only positional data and attribute data, but also spatial relationships among spatial events. Further, the instances of spatial events are embedded in a continuous space and share a variety of spatial relationships, so the mining of spatial patterns demands new techniques. In this thesis, several contributions were made. Some new techniques were proposed, i.e., fuzzy co-location mining, CPI-tree (Co-location Pattern Instance Tree), maximal co-location patterns mining, AOI-ags (Attribute-Oriented Induction based on Attributes’ Generalization Sequences), and fuzzy association prediction. Three algorithms were put forward for co-location pattern mining: the fuzzy co-location mining algorithm, the CPI-tree based co-location mining algorithm (CPI-tree algorithm) and the order-clique-based maximal prevalence co-location mining algorithm (order-clique-based algorithm). An attribute-oriented induction algorithm based on attributes’ generalization sequences (AOI-ags algorithm) is further given, which unifies the attribute thresholds and the tuple thresholds. A fuzzy association prediction algorithm is designed for two real-world databases with time-series data. A cell-based spatial object fusion algorithm is also proposed. Two fuzzy clustering methods using domain knowledge were proposed: the Natural Method and the Graph-Based Method, both of which are controlled by a threshold. The threshold was confirmed by polynomial regression.
Finally, a prototype system for mining spatial co-location patterns was developed, and it shows the relative efficiencies of the co-location techniques proposed. The techniques presented in the thesis focus on improving the feasibility, usefulness, effectiveness, and scalability of the related algorithms. In the design of the fuzzy co-location mining algorithm, a new data structure, the binary partition tree, was proposed to improve the process of fuzzy equivalence partitioning. A prefix-based approach that partitions the prevalent event set search space into subsets, where each sub-problem can be solved in main memory, was also presented. The scalability of the CPI-tree algorithm is guaranteed since it does not require expensive spatial joins or instance joins for identifying co-location table instances. In the order-clique-based algorithm, the co-location table instances do not need to be stored after computing the Pi value of the corresponding co-location, which dramatically reduces the execution time and space needed for mining maximal co-locations. Some technologies, for example, partitions, equivalence partition trees, prune optimization strategies and interestingness, were used to improve the efficiency of the AOI-ags algorithm. To implement the fuzzy association prediction algorithm, the “growing window” and proximity computation pruning were introduced to reduce both I/O and CPU costs in computing the fuzzy semantic proximity between time series. For the new techniques and algorithms, theoretical analysis and experimental results on synthetic and real-world datasets were presented and discussed in the thesis.

    Evolutionary Algorithms in Decision Tree Induction

    One of the biggest problems that many data analysis techniques have to deal with nowadays is Combinatorial Optimization, which in the past led many methods to be set aside. Today, the greater (though still insufficient) computing power available makes it possible to apply such techniques within certain bounds. Since other research fields such as Artificial Intelligence have been (and still are) dealing with such problems, their contribution to statistics has been very significant. This chapter tries to cast Combinatorial Optimization methods into the Artificial Intelligence framework, particularly with respect to Decision Tree Induction, which is considered a powerful instrument for knowledge extraction and decision-making support. When the exhaustive enumeration and evaluation of all possible candidate solutions to a Tree-based Induction problem is not computationally affordable, Nature Inspired Optimization Algorithms, which have proven to be powerful instruments for attacking many combinatorial optimization problems, can be of great help. In this respect, attention is focused on three main problems involving Decision Tree Induction, mainly with reference to the Classification and Regression Tree (CART) algorithm (Breiman et al., 1984). First, the problem of splitting complex predictors, such as multi-attribute ones, is addressed through the use of Genetic Algorithms. In addition, the possibility of growing “optimal” exploratory trees is investigated by making use of the Ant Colony Optimization (ACO) algorithm. Finally, the derivation of a subset of decision trees for modelling a multi-attribute response on the basis of a data-driven heuristic is also described. The proposed approaches might be useful for knowledge extraction from large databases as well as for data mining applications.
The solution they offer for complicated data modelling and data analysis problems might be considered for possible implementation in a Decision Support System (DSS). The remainder of the chapter is organized as follows. Section 2 describes the main features and recent developments of Decision Tree Induction. An overview of Combinatorial Optimization, with a particular focus on Genetic Algorithms and Ant Colony Optimization, is presented in section 3. The use of these two algorithms within the Decision Tree Induction framework is described in section 4, together with the algorithm for modelling a multi-attribute response. Section 5 summarizes the results of the proposed method on real and simulated datasets. Concluding remarks are presented in section 6. The chapter also includes an appendix that presents J-Fast, a Java-based software package for decision trees that currently implements Genetic Algorithms and Ant Colony Optimization.

    Efficiently mining frequent itemsets from very large databases

    Efficient algorithms for mining frequent itemsets are crucial for mining association rules and for other data mining tasks. Methods for mining frequent itemsets and for iceberg data cube computation have been implemented using a prefix-tree structure, known as an FP-tree, for storing compressed frequency information. Numerous experimental results have demonstrated that these algorithms perform extremely well. In this thesis we present a novel FP-array technique that greatly reduces the need to traverse FP-trees, thus obtaining significantly improved performance for FP-tree based algorithms. The technique works especially well for sparse datasets. We then present new algorithms for mining all frequent itemsets, maximal frequent itemsets, and closed frequent itemsets. The algorithms use the FP-tree data structure in combination with the FP-array technique efficiently, and incorporate various optimization techniques. In the algorithm for mining maximal frequent itemsets, a variant FP-tree data structure, called an MFI-tree, and an efficient maximality-checking approach are used. Another variant FP-tree data structure, called a CFI-tree, and an efficient closedness-testing approach are also given in the algorithm for mining closed frequent itemsets. Experimental results show that our methods outperform the existing methods not only in speed, but also in memory consumption and scalability. We also notice that most algorithms for mining frequent itemsets assume that main memory is large enough for the data structures used in the mining, and very few efficient algorithms deal with the cases when the database is very large or the minimum support is very low. We thus investigate approaches to mining frequent itemsets when data structures are too large to fit in main memory. Several divide-and-conquer algorithms are presented for mining from disks. Many novel techniques are introduced.
Experimental results show that the techniques reduce the required disk accesses by orders of magnitude, and enable truly scalable data mining.
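To make the prefix-tree idea concrete, the sketch below builds a plain FP-tree: infrequent items are dropped, each transaction is sorted by descending item frequency, and shared prefixes are merged into counted nodes. It is a minimal illustration only and does not include the FP-array, MFI-tree, or CFI-tree techniques of the thesis; the names `FPNode`, `build_fp_tree`, `count_nodes`, and the toy transactions are assumptions.

```python
from collections import Counter

class FPNode:
    """One node of the prefix tree: an item, a count, and child links."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # First pass: count in how many transactions each item occurs.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    root = FPNode(None)
    # Second pass: insert each transaction with items in a fixed global order
    # (descending support, ties broken by item name) so prefixes are shared.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent

def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(c) for c in node.children.values())

transactions = [["a", "b"], ["b", "c", "d"], ["a", "b", "d"],
                ["a", "b", "c"], ["b", "e"]]
tree, supports = build_fp_tree(transactions, min_support=2)
```

Here the five transactions collapse into six tree nodes because every transaction shares the prefix `b`; that compression is what lets FP-tree based miners avoid repeated database scans.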

    A Data Mining Methodology for Vehicle Crashworthiness Design

    This study develops a systematic design methodology based on data mining theory for decision-making in the development of crashworthy vehicles. The new data mining methodology allows the exploration of a large crash simulation dataset to discover the underlying relationships among vehicle crash responses and design variables at multiple levels and to derive design rules based on the whole-vehicle safety requirements to make decisions about component-level and subcomponent-level design. The method can resolve a major issue with existing design approaches related to vehicle crashworthiness: that is, limited abilities to explore information from large datasets, which may hamper decision-making in the design processes. At the component level, two structural design approaches were implemented for detailed component design with the data mining method: namely, a dimension-based approach and a node-based approach to handle structures with regular and irregular shapes, respectively. These two approaches were used to design a thin-walled vehicular structure, the S-shaped beam, against crash loading. A large number of design alternatives were created, and their responses under loading were evaluated by finite element simulations. The design variables and computed responses formed a large design dataset. This dataset was then mined to build a decision tree. Based on the decision tree, the interrelationships among the design parameters were revealed, and design rules were generated to produce a set of good designs. After the data mining, the critical design parameters were identified and the design space was reduced, which can simplify the design process. To partially replace the expensive finite element simulations, a surrogate model was used to model the relationships between design variables and response. Four machine learning algorithms, which can be used for surrogate model development, were compared. 
Based on the results, Gaussian process regression was determined to be the most suitable technique in the present scenario, and an optimization process was developed to tune the algorithm’s hyperparameters, which govern the model structure and training process. To account for engineering uncertainty in the data mining method, a new decision tree for uncertain data was proposed based on the joint probability in uncertain spaces, and it was again applied to the design of the S-beam structure. The findings show that the new decision tree can produce effective decision-making rules for engineering design under uncertainty. To evaluate the new approaches developed in this work, a comprehensive case study was conducted by designing a vehicle system against frontal crash. A publicly available vehicle model was simplified and validated. Using the newly developed approaches, new component designs for this vehicle were generated and integrated back into the vehicle model so their crash behavior could be simulated. Based on the simulation results, one can conclude that the designs produced with the new method can outperform the original design in terms of mass, intrusion and peak acceleration. Therefore, the performance of the new design methodology has been confirmed. The current study demonstrates that the new data mining method can be used in vehicle crashworthiness design, and it has the potential to be applied to other complex engineering systems with a large amount of design data.