74,055 research outputs found

    An intelligent assistant for exploratory data analysis

    In this paper we present an account of the main features of SNOUT, an intelligent assistant for exploratory data analysis (EDA) of social science survey data that incorporates a range of data mining techniques. EDA has much in common with existing data mining techniques: its main objective is to help an investigator reach an understanding of the important relationships in a data set rather than simply develop predictive models for selected variables. Brief descriptions of a number of novel techniques developed for use in SNOUT are presented. These include heuristic variable level inference and classification, automatic category formation, the use of similarity trees to identify groups of related variables, interactive decision tree construction and model selection using a genetic algorithm.

    A Multi-Tiered Genetic Algorithm for Data Mining and Hypothesis Refinement

    While there are many approaches to data mining, it seems that there is a hole in the ability to make use of the advantages of multiple techniques. There are many methods that use rigid heuristics and guidelines in constructing rules for data, and are thus limited in their ability to describe patterns. Genetic algorithms provide a more flexible approach, and yet the genetic algorithms that have been employed don't capitalize on the fact that data models have two levels: individual rules and the overall data model. This dissertation introduces a multi-tiered genetic algorithm capable of evolving individual rules and the data model at the same time. The multi-tiered genetic algorithm also provides a means for taking advantage of the strengths of the more rigid methods by using their output as input to the genetic algorithm. Most genetic algorithms use a single "roulette wheel" approach. As such, they are only able to select either good data models or good rules, but are incapable of selecting for both simultaneously. With the additional roulette wheel of the multi-tiered genetic algorithm, the fitness of both rules and data models can be evaluated, enabling the algorithm to select good rules from good data models. This also more closely emulates how genes are passed from parents to children in actual biology. Consequently, this technique strengthens the "genetics" of genetic algorithms. For ease of discussion, the multi-tiered genetic algorithm has been named "Arcanum." This technique was tested on thirteen data sets obtained from The University of California Irvine Knowledge Discovery in Databases Archive. Results for these same data sets were gathered for GAssist, another genetic algorithm designed for data mining, and J4.8, the WEKA implementation of C4.5. 
While both of the other techniques outperformed Arcanum overall, it was able to provide comparable or better results for 5 of the 13 data sets, indicating that the algorithm can be used for data mining, although it needs improvement. The second stage of testing was on the ability to take results from a previous algorithm and perform refinement on the data model. Initially, Arcanum was used to refine its own data models. Of the six data models used for hypothesis refinement, Arcanum was able to improve upon 3 of them. Next, results from the LEM2 algorithm were used as input to Arcanum. Of the three data models used from LEM2, Arcanum was able to improve upon all three data models by sacrificing accuracy in order to improve coverage, resulting in a better data model overall. The last phase of hypothesis refinement was performed upon C4.5. It required several attempts, each using different parameters, but Arcanum was finally able to make a slight improvement to the C4.5 data model. From the experimental results, Arcanum was shown to yield results comparable to GAssist and C4.5 on some of the data sets. It was also able to take data models from three different techniques and improve upon them. While there is certainly room for improvement of the multi-tiered genetic algorithm described in this dissertation, the experimental evidence supports the claims that it can perform both data mining and hypothesis refinement of data models from other data mining techniques
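
The two-wheel selection idea described in this abstract can be illustrated with a minimal Python sketch. It is not the Arcanum implementation: the function names, the representation of a data model as a tuple of rules, and the fitness callables are all assumptions made for illustration. The first wheel picks a data model in proportion to model fitness, and the second wheel picks a rule from that model in proportion to rule fitness.

```python
import random


def roulette_pick(items, fitnesses, rng):
    """Fitness-proportionate ("roulette wheel") selection of one item."""
    total = sum(fitnesses)
    if total <= 0:
        # No usable fitness signal: fall back to a uniform choice.
        return rng.choice(items)
    spin = rng.random() * total  # a point in [0, total)
    running = 0.0
    for item, fit in zip(items, fitnesses):
        running += fit
        if spin < running:
            return item
    return items[-1]  # guard against floating-point round-off


def two_tier_pick(models, model_fitness, rule_fitness, rng):
    """Hypothetical two-tier selection.

    Wheel 1 selects a data model (here: a tuple of rules) by model fitness.
    Wheel 2 selects one rule inside that model by rule fitness.
    """
    model = roulette_pick(models, [model_fitness(m) for m in models], rng)
    rule = roulette_pick(list(model), [rule_fitness(r) for r in model], rng)
    return model, rule
```

Calling `two_tier_pick(models, model_fitness, rule_fitness, random.Random(0))` returns a `(model, rule)` pair, so a good rule is only ever drawn from a model that was itself selected on its own fitness.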

    Application of data mining techniques in bioinformatics

    With the widespread use of databases and the explosive growth in their sizes, there is a need to effectively utilize these massive volumes of data. This is where data mining comes in handy, as it scours databases to extract hidden patterns, find hidden information, and support decision making and hypothesis testing. Bioinformatics, an emerging field that involves large databases, can be searched effectively with data mining techniques to derive useful rules. Based on the type of knowledge that is mined, data mining techniques [1] can be mainly classified into association rules, decision trees and clustering. Until recently, biology lacked the tools to analyze massive repositories of information such as the human genome database [3]. Data mining techniques are effectively used to extract meaningful relationships from these data. Data mining is especially used in microarray analysis, which studies the activity of different cells under different conditions. Two algorithms under each mining technique were implemented for a large database and compared with each other: (1) association rule mining: (a) Apriori, (b) partition; (2) clustering: (a) k-means, (b) k-medoids; (3) classification rule mining: decision tree generation using (a) the Gini index, (b) the entropy value. Genetic algorithms were applied to the association and classification techniques. Further, the k-means and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering techniques [1] were applied to a microarray dataset and compared. The microarray dataset was downloaded from the internet using the Gene Array Analyzer Software (GAAS). The clustering was done on the basis of the signal color intensity of the genes in the microarray experiment. The following results were obtained: (1) Association: for smaller databases the Apriori algorithm works better than the partition algorithm, and for larger databases partition works better. (2) Clustering: with respect to the number of interchanges, the k-medoids algorithm works better than the k-means algorithm. (3) Classification: the results were similar for both indices (Gini index and entropy value). The application of a genetic algorithm improved the efficiency of the association and classification techniques. For the microarray dataset, it was found that DBSCAN is less efficient than k-means when the database is small, but for a larger database DBSCAN is more accurate and efficient in terms of the number of clusters and the execution time. DBSCAN execution time increases linearly with the size of the database and was much lower than that of k-means for the larger database. Owing to the involvement of large datasets and the need to derive results from them, data mining techniques can be effectively put to use in the field of bioinformatics [2]. The techniques can be applied to find associations among genes, cluster similar gene and protein sequences, and draw decision trees to classify genes. Further, data mining techniques can be made more efficient by applying genetic algorithms, which greatly improve the search procedure and reduce the execution time.
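
The two split criteria compared in this abstract, the Gini index and entropy, can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the helper names and the exhaustive `x <= t` threshold search are assumptions chosen to keep the example small.

```python
import math
from collections import Counter


def gini(labels):
    """Gini index: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def weighted_impurity(groups, impurity):
    """Size-weighted average impurity of the child groups of a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * impurity(g) for g in groups)


def best_split(rows, labels, impurity):
    """Return (score, feature_index, threshold) for the binary split
    'x[feature] <= threshold' with the lowest weighted impurity,
    or None if no split separates the rows."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must put rows on both sides
            score = weighted_impurity([left, right], impurity)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best
```

Passing `gini` or `entropy` as the `impurity` argument lets the same data be split under either criterion, which is the kind of side-by-side comparison the abstract reports.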

    Tuberculosis Disease Forecasting Among Indian Patients

    Tuberculosis is a prominent disease in developing countries, including India, and a leading cause of death. It is caused by bacteria that attack parts of the human body, primarily the lungs. The aim of this paper is to predict tuberculosis using data mining techniques, which tends to make the medical diagnosis of tuberculosis more rigorous. Data mining techniques help determine whether it is reasonable to start tuberculosis treatment for suspected patients without waiting for precise medical test results. This study emphasizes patient health and supports low-cost treatment through forecasting systems. Various parameters are used for predicting tuberculosis, such as cough, chest pain, night sweats, age, weight loss, gender, fever, coughing up blood, and loss of appetite. Both genetic algorithms and neural networks perform better than other techniques. Tuberculosis forecasting is accomplished with soft computing techniques. The genetic algorithm offers the best fitness value and solves optimization problems, whereas the neural network takes the parameters as input, uses genetic operators for training, and produces an output for predicting tuberculosis. This research outlines the main review and technical papers on tuberculosis detection that are implemented using a variety of data mining techniques. The review of papers concludes that soft computing techniques achieve the highest accuracy.

    Modelling epistasis in genetic disease using Petri nets, evolutionary computation and frequent itemset mining

    Petri nets are useful for mathematically modelling disease-causing genetic epistasis. A Petri net model of an interaction has the potential to lead to biological insight into the cause of a genetic disease. However, defining a Petri net by hand for a particular interaction is extremely difficult because of the sheer complexity of the problem and degrees of freedom inherent in a Petri net’s architecture. We propose therefore a novel method, based on evolutionary computation and data mining, for automatically constructing Petri net models of non-linear gene interactions. The method comprises two main steps. Firstly, an initial partial Petri net is set up with several repeated sub-nets that model individual genes and a set of constraints, comprising relevant common sense and biological knowledge, is also defined. These constraints characterise the class of Petri nets that are desired. Secondly, this initial Petri net structure and the constraints are used as the input to a genetic algorithm. The genetic algorithm searches for a Petri net architecture that is both a superset of the initial net, and also conforms to all of the given constraints. The genetic algorithm evaluation function that we employ gives equal weighting to both the accuracy of the net and also its parsimony. We demonstrate our method using an epistatic model related to the presence of digital ulcers in systemic sclerosis patients that was recently reported in the literature. Our results show that although individual “perfect” Petri nets can frequently be discovered for this interaction, the true value of this approach lies in generating many different perfect nets, and applying data mining techniques to them in order to elucidate common and statistically significant patterns of interaction

    Prediction of Stock Market Index Using Genetic Algorithm

    The generation of profitable trading rules for stock market investments is a difficult but popular problem. The first stage classifies the likely direction of the price of BSE index (India cements stock price index (ICSPI)) futures with several technical indicators using artificial intelligence techniques. The second stage mines trading rules to resolve conflicts among the outputs of the first stage using evolutionary learning. We found a trading rule that would have yielded the highest return over a certain time period using historical data. These preliminary results suggest that genetic algorithms are a promising model that yields higher profit than other comparable models and the buy-and-sell strategy. Experimental results for buying and selling with the trading rules were outstanding. Key words: Data mining, Trading rule, Genetic algorithm, ANN, ICSPI prediction

    Data Mining with Multivariate Kernel Regression Using Information Complexity and the Genetic Algorithm

    Kernel density estimation is a data smoothing technique that depends heavily on the bandwidth selection. The current literature has focused on optimal selectors for the univariate case that are primarily data driven. Plug-in and cross validation selectors have recently been extended to the general multivariate case. This dissertation will introduce and develop new and novel techniques for data mining with multivariate kernel density regression using information complexity and the genetic algorithm as a heuristic optimizer to choose the optimal bandwidth and the best predictors in kernel regression models. Simulated and real data will be used to cross validate the optimal bandwidth selectors using information complexity. The genetic algorithm is used in conjunction with information complexity to determine kernel density estimates for variable selection from high dimension multivariate data sets. Kernel regression is also hybridized with the implicit enumeration algorithm to determine the set of independent variables for the global optimal solution using information criteria as the objective function. The results from the genetic algorithm are compared to the optimal solution from the implicit enumeration algorithm and the known global optimal solution from an explicit enumeration of all possible subset models
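
To make the role of the bandwidth concrete, here is a minimal Python sketch of kernel regression with bandwidth selection. It deliberately swaps in simpler techniques than the dissertation describes: it is univariate rather than multivariate, it scores bandwidths by leave-one-out cross-validation rather than information complexity, and it searches a fixed candidate list rather than using a genetic algorithm. All function names are assumptions made for illustration.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nw_predict(x0, xs, ys, h):
    """Nadaraya-Watson estimate at x0: a kernel-weighted mean of ys."""
    weights = [gaussian_kernel((x0 - x) / h) for x in xs]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed: fall back to the plain mean.
        return sum(ys) / len(ys)
    return sum(w * y for w, y in zip(weights, ys)) / total


def loo_cv_error(xs, ys, h):
    """Mean squared leave-one-out prediction error for bandwidth h."""
    err = 0.0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        err += (ys[i] - nw_predict(xs[i], rest_x, rest_y, h)) ** 2
    return err / len(xs)


def select_bandwidth(xs, ys, candidates):
    """Pick the candidate bandwidth with the lowest leave-one-out error."""
    return min(candidates, key=lambda h: loo_cv_error(xs, ys, h))
```

A very large bandwidth pulls every prediction toward the global mean, while a very small one follows the nearest points, so the selected value reflects the trade-off that any bandwidth selector has to resolve.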

    A new model for iris data set classification based on linear support vector machine parameter's optimization

    Data mining is the process of detecting patterns in substantial amounts of data, as a process of knowledge discovery. Classification is a form of data analysis that extracts a model describing important data classes. One of the outstanding classification methods in data mining is the support vector machine (SVM). It is capable of predicting results and is often more effective than other classification methods. The SVM is a well-known supervised machine learning technique that has been applied successfully to a variety of problems, including regression, classification, and clustering, in diverse domains such as gene expression and web text mining. In this study, we propose a new model for classifying the iris data set using an SVM classifier, with a genetic algorithm to optimize the C and gamma parameters of the linear SVM; in addition, principal component analysis (PCA) was used for feature reduction.

    Customer churn prediction in the telecommunications industry using data mining methods

    At present, in the competitive space between companies and organizations, customer churn is their most important challenge. When a customer churns, organizations lose one of their most important assets, which can lead to financial losses and even bankruptcy. Customer churn prediction using data mining techniques can alleviate these problems to some extent. The aim of the present study is to provide a hybrid method based on a genetic algorithm and a modular neural network for customer churn prediction in the telecommunication industry, using Irancell data as a sample. The accuracy of this method, 95.5%, ranks highest in comparison with the results of other methods. This shows that using a modular neural network with two feedforward neural network modules, and using a genetic algorithm to obtain the optimal structure for those modules, are the most important factors enabling this method to reach the highest accuracy among the compared methods.