5 research outputs found

    O Algoritmo de Classificação C4.5 e suas Aplicações na Área Médica

    [Translated from Portuguese] Technological progress in data storage, and the consequent accumulation of data, created the need to extract useful knowledge from these repositories. Data mining emerged to meet this need, offering several tasks for knowledge discovery, such as classification. Among classification methods are decision trees and the various algorithms that implement them. This article discusses the C4.5 algorithm and examples of its application to medical databases, with the aim of supporting decision making.
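The abstract above describes C4.5 without code; at the heart of the algorithm is choosing the split attribute that most reduces class entropy. The following sketch (the `patients`/`diagnosis` toy data and function names are invented for illustration, not taken from the paper) computes entropy and information gain for a categorical attribute:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction in label entropy from splitting on a categorical attribute."""
    n = len(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    remainder = sum(len(part) / n * entropy(part) for part in by_value.values())
    return entropy(labels) - remainder

# Toy medical-style data: does the symptom attribute separate the diagnosis?
patients = [{"fever": "yes"}, {"fever": "yes"}, {"fever": "no"}, {"fever": "no"}]
diagnosis = ["sick", "sick", "healthy", "healthy"]
print(information_gain(patients, diagnosis, "fever"))  # 1.0: a perfect split
```

C4.5 actually ranks attributes by gain ratio (information gain normalised by split information) rather than raw gain, which penalises attributes with many distinct values.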

    HYBRID METODE BOOSTRAP DAN TEKNIK IMPUTASI PADA METODE C4-5 UNTUK PREDIKSI PENYAKIT GINJAL KRONIS

    Missing values are a serious problem frequently found in real-world data. The C4.5 method is a popular predictive classification model, widely used because it is easy to implement. However, C4.5 performs poorly when the test data contain a large proportion of missing values. In this study we used a hybrid approach combining the bootstrap method with k-NN imputation to handle missing values. The proposed method was tested on Chronic Kidney Disease (CKD) data and evaluated using accuracy and AUC. The results showed that the proposed method was superior at handling missing values in the CKD data. We conclude that the proposed method can overcome missing values for chronic kidney disease prediction.
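The paper pairs bootstrap resampling with k-NN imputation; as a rough illustration of the imputation half only, here is a minimal pure-Python k-NN imputer (the `knn_impute` helper and toy data are assumptions for illustration, not the authors' code):

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column among the k nearest
    complete rows, where distance is computed over the columns both rows share."""
    complete = [r for r in rows if None not in r]
    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        def dist(other):
            return math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(row, other) if a is not None))
        neighbours = sorted(complete, key=dist)[:k]
        filled.append([v if v is not None
                       else sum(n[i] for n in neighbours) / len(neighbours)
                       for i, v in enumerate(row)])
    return filled

data = [[1.0, 2.0], [1.2, 2.2], [10.0, 9.0], [1.1, None]]
print(knn_impute(data)[-1])  # the gap is filled from the two closest rows
```

In the hybrid approach described in the abstract, this imputation step would be repeated over bootstrap resamples of the training data before fitting C4.5.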

    Penerapan Metode Average Gain, Threshold Pruning Dan Cost Complexity Pruning Untuk Split Atribut Pada Algoritma C4.5

    C4.5 is a supervised learning classifier that builds a decision tree from data. Attribute splitting is the main process in forming a decision tree in C4.5, but the standard split criterion does not account for misclassification cost, which affects classifier performance. After attribute splitting, the next process is pruning: cutting or eliminating unnecessary branches. Unnecessary branches or nodes can make the decision tree very large, a condition known as over-fitting, which remains an open problem. Existing methods for splitting attributes include the Gini Index, Information Gain, Gain Ratio, and the Average Gain proposed by Mitchell; Average Gain not only addresses the weakness of Information Gain but also helps solve the problems of Gain Ratio. The split method proposed in this research uses the average gain value multiplied by the difference in misclassification cost, while pruning combines threshold pruning with cost-complexity pruning. The proposed method was applied to several datasets and its performance compared with attribute-split methods using the Gini Index, Information Gain, and Gain Ratio. Selecting split attributes by average gain multiplied by the difference in misclassification cost improved the classification performance of C4.5: a Friedman test showed that the proposed split method, combined with threshold pruning and cost-complexity pruning, ranked first in accuracy, and the decision trees formed by the proposed method were smaller.
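Two of the baseline split criteria named above can be contrasted in a few lines. This sketch (a hand-rolled illustration, not the paper's proposed method) computes information gain and gain ratio for a categorical attribute, showing how gain ratio penalises a many-valued attribute (such as a patient ID) that raw information gain would rank as a perfect split:

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy of a sequence of discrete values, in bits."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def gain_and_ratio(values, labels):
    """Information gain and gain ratio for a categorical split attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)  # penalises attributes with many values
    return gain, (gain / split_info if split_info else 0.0)

labels = ["sick", "sick", "healthy", "healthy"]
print(gain_and_ratio(["p1", "p2", "p3", "p4"], labels))   # unique IDs: gain 1.0, ratio 0.5
print(gain_and_ratio(["yes", "yes", "no", "no"], labels)) # real signal: gain 1.0, ratio 1.0
```

Both attributes achieve the same information gain, but gain ratio halves the score of the ID-like attribute; the paper's proposal layers a misclassification-cost term on top of a related average-gain criterion.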

    Software defect prediction using maximal information coefficient and fast correlation-based filter feature selection

    Software quality assurance aims to ensure that the applications developed are failure free. Some modern systems are intricate, owing to the complexity of their information processes. Software fault prediction is an important quality-assurance activity: a mechanism that correctly predicts the defect proneness of modules and classifies modules saves resources, time and developers' effort. In this study, a model that selects relevant features for use in defect prediction was proposed. The literature review revealed that process metrics are better predictors of defects in version systems, being based on historic source code over time. These metrics are extracted from the source-code module and include, for example, the number of additions and deletions in the source code, the number of distinct committers and the number of modified lines. In this research, defect prediction was conducted using open source software (OSS) of software product lines (SPL), hence process metrics were chosen. Data sets used in defect prediction may contain non-significant and redundant attributes that affect the accuracy of machine-learning algorithms. To improve the prediction accuracy of classification models, only features that are significant to the defect prediction process are used. In machine learning, feature selection techniques are applied to identify the relevant data: feature selection is a pre-processing step that reduces the dimensionality of the data, and its techniques include information-theoretic methods based on the entropy concept. This study evaluated the efficiency of these feature selection techniques, and found that software defect prediction using significant attributes improves prediction accuracy.
    A novel MICFastCR model was developed, which uses the Maximal Information Coefficient (MIC) to select significant attributes and the Fast Correlation-Based Filter (FCBF) to eliminate redundant attributes. Machine learning algorithms were then run to predict software defects. MICFastCR achieved the highest prediction accuracy as reported by various performance measures. School of Computing, Ph.D. (Computer Science)
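The FCBF half of the pipeline ranks features by symmetrical uncertainty with the target, then drops any feature more strongly correlated with an already-kept feature than with the target. The sketch below is a simplified pure-Python illustration of that idea for discrete features (the `fcbf` helper, threshold value and toy data are assumptions; the MIC relevance step of MICFastCR is not implemented here):

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy of a sequence of discrete values, in bits."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), the score FCBF ranks features by."""
    mutual_info = entropy(x) + entropy(y) - entropy(list(zip(x, y)))
    denom = entropy(x) + entropy(y)
    return 2 * mutual_info / denom if denom else 0.0

def fcbf(features, target, threshold=0.1):
    """Keep features relevant to the target; drop a feature when it is more
    strongly correlated with an already-kept feature than with the target."""
    ranked = sorted(range(len(features)),
                    key=lambda i: -symmetrical_uncertainty(features[i], target))
    kept = []
    for i in ranked:
        su = symmetrical_uncertainty(features[i], target)
        if su < threshold:
            continue  # not relevant enough to the target
        if all(symmetrical_uncertainty(features[i], features[j]) < su for j in kept):
            kept.append(i)  # not redundant with anything already kept
    return kept

# Demo: feature 0 predicts the target, feature 1 duplicates it, feature 2 is noise.
target = [0, 0, 1, 1]
features = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
print(fcbf(features, target))  # prints [0]: the duplicate and the noise are dropped
```

A full MICFastCR-style pipeline would first score relevance with MIC on continuous metrics before applying this redundancy filter.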