7 research outputs found

    Penerapan Metode Average Gain, Threshold Pruning Dan Cost Complexity Pruning Untuk Split Atribut Pada Algoritma C4.5 (Application of Average Gain, Threshold Pruning and Cost Complexity Pruning for Attribute Splitting in the C4.5 Algorithm)

    C4.5 is a supervised learning classifier that builds a decision tree from data. Attribute splitting is the main process in the formation of a decision tree in C4.5, but the split criterion in C4.5 does not account for misclassification cost, which affects classifier performance. After attribute splitting, the next process is pruning: cutting away unnecessary branches. Unneeded branches or nodes can make the decision tree very large, a condition known as over-fitting, which remains an open research problem. Established split criteria include the Gini Index, Information Gain, Gain Ratio, and the Average Gain proposed by Mitchell. Average Gain not only overcomes a weakness of Information Gain but also helps solve the problems of Gain Ratio. The split criterion proposed in this research multiplies the average gain value by the difference in misclassification cost, while pruning combines threshold pruning and cost complexity pruning. The proposed method is applied to several datasets and its performance is compared with attribute splitting using the Gini Index, Information Gain, and Gain Ratio. Selecting split attributes by average gain multiplied by the difference in misclassification cost improves the classification performance of C4.5: a Friedman test ranks the proposed split criterion, combined with threshold pruning and cost complexity pruning, first in accuracy, and the decision trees formed by the proposed method are smaller.
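The split criteria named in this abstract can be sketched in a few lines of Python. The `average_gain` formulation below (information gain divided by the number of distinct attribute values) is one common reading of Mitchell's proposal and may differ from the paper's exact definition; the paper's cost-weighted variant is not reproduced here.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    # Reduction in label entropy from splitting on a discrete attribute
    n = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

def gain_ratio(values, labels):
    # Information gain normalised by split information, which penalises
    # attributes with many distinct values
    split_info = entropy(values)  # entropy of the attribute's own distribution
    g = information_gain(values, labels)
    return g / split_info if split_info > 0 else 0.0

def average_gain(values, labels):
    # One common formulation: information gain averaged over the
    # number of distinct attribute values
    return information_gain(values, labels) / len(set(values))
```

A perfectly separating binary attribute yields an information gain equal to the label entropy, a gain ratio of 1, and an average gain of half the information gain.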

    Design an Optimal Decision Tree based Algorithm to Improve Model Prediction Performance

    The performance of a decision tree is assessed by its prediction accuracy on unobserved instances. To generate optimised decision trees with high classification accuracy and small size, this study pre-processes the data and addresses and enhances several decision tree components. The goal is an algorithm that produces precise, compact decision trees with improved prediction performance and a small memory footprint. The standard decision tree technique was created for classification and is applied to various kinds of uncertain information; before classification, the uncertain dataset was processed through missing-data treatment and other uncertainty-handling procedures to produce a balanced dataset. The proposed algorithm was tested on three real-world datasets: the Titanic dataset, the PIMA Indian Diabetes dataset, and a heart disease dataset. Its performance was assessed in terms of precision, recall, F-measure, and accuracy, and the results were contrasted with those of the standard decision tree. On all three datasets, the decision tree with Gini impurity optimization performed remarkably well.
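Gini impurity, the split criterion this study found to perform best, can be sketched as follows; this is a minimal illustration of the standard measure, not the paper's implementation.

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions;
    # 0 for a pure node, maximal for a uniform class mix
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(values, labels):
    # Weighted Gini impurity after splitting on a discrete attribute;
    # the split with the lowest weighted impurity is preferred
    n = len(labels)
    total = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        total += (len(subset) / n) * gini(subset)
    return total
```

For a balanced two-class node the impurity is 0.5, and a perfectly separating split drives the weighted impurity to 0.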

    Handling over-fitting in test cost-sensitive decision tree learning by feature selection, smoothing and pruning

    Cost-sensitive learning algorithms are typically designed to minimize the total cost when multiple costs are taken into account. Like other learning algorithms, cost-sensitive learning algorithms face a significant challenge: over-fitting. Specifically, they can produce good results on training data but often fail to yield an optimal model when applied to unseen data in real-world applications; this is called data over-fitting. This paper addresses data over-fitting with three simple and efficient strategies (feature selection, smoothing, and threshold pruning) applied to the TCSDT (test cost-sensitive decision tree) method. Feature selection pre-processes the data set before the TCSDT algorithm is applied; smoothing and threshold pruning are used within the TCSDT algorithm before the class probability estimate is calculated for each decision tree leaf. To evaluate these approaches, extensive experiments were conducted on selected UCI data sets across different cost ratios, and on a real-world data set, KDD-98, with real misclassification costs. The experimental results show that the proposed algorithms outperform both the original TCSDT and other competing algorithms in reducing data over-fitting.
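The smoothing step described above is commonly realised with a Laplace or m-estimate correction to the raw class frequency at a leaf, which avoids the extreme 0/1 probability estimates that small leaves produce. The sketch below shows both standard corrections; the exact correction used inside TCSDT may differ.

```python
def laplace_probability(class_count, leaf_total, n_classes):
    # Laplace (add-one) smoothed class probability estimate at a tree leaf:
    # pulls small-sample estimates toward the uniform distribution
    return (class_count + 1) / (leaf_total + n_classes)

def m_estimate(class_count, leaf_total, prior, m=2.0):
    # m-estimate smoothing: shrinks the leaf estimate toward the class
    # prior, with m controlling the strength of the shrinkage
    return (class_count + m * prior) / (leaf_total + m)
```

A leaf holding 3 positive examples out of 3 gets a smoothed estimate of 0.8 rather than a raw, over-confident 1.0.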

    Applications of artificial neural networks in financial market forecasting

    This thesis evaluates the utility of Artificial Neural Networks (ANNs) applied to financial market and macroeconomic forecasting. ANNs are compared against traditional forecasting models to determine whether their nonlinear and adaptive properties yield superior forecasting performance in terms of robustness and accuracy. Furthermore, as ANNs are data-driven models, an emphasis is placed on the data collection stage by compiling extensive candidate input variable pools, a task frequently neglected in prior research. ANNs are applied to three domains: exchange rate forecasting, volatility forecasting, and macroeconomic forecasting. For exchange rate forecasting, ANNs forecast the daily logarithmic returns of the EUR/USD over a short-term horizon of one period. Technical Analysis (TA), and in particular its technical indicators, is used to compile an extensive candidate input variable pool featuring standard and advanced indicators measuring all technical aspects of the EUR/USD time series. The pool is then subjected to a two-stage Input Variable Selection (IVS) process, producing an informative subset of technical indicators to serve as ANN inputs. A collection of ANNs is trained and tested on the EUR/USD time series, with performance evaluated over a 5-year sample period (2012 to 2016), reserving the last two years for out-of-sample testing. A Moving Average Convergence Divergence (MACD) model serves as a benchmark, and the in-sample and out-of-sample empirical results demonstrate that the MACD is the superior forecasting model across most evaluation metrics. For volatility forecasting, ANNs forecast the volatility of the Nikkei 225 Index over a short-term horizon of one period.
An extensive candidate input variable pool is compiled, consisting of implied volatility models and historical volatility models, and subjected to the two-stage IVS process. A collection of ANNs is trained and tested on the Nikkei 225 Index time series, with performance evaluated over a 4-year sample period (2014 to 2017), reserving the last year for out-of-sample testing. A GARCH(1,1) model serves as a benchmark, and the out-of-sample empirical results find the GARCH(1,1) model to be the superior volatility forecasting model. The research concludes with macroeconomic forecasting, where ANNs forecast the monthly percent change in U.S. civilian unemployment and the quarterly percent change in U.S. Gross Domestic Product (GDP). For both studies, an extensive candidate input variable pool is compiled from relevant macroeconomic indicator data sourced from the Federal Reserve Bank of St. Louis and subjected to the two-stage IVS process. A collection of ANNs is trained and tested on the U.S. unemployment time series (UNEMPLOY) and the U.S. GDP time series, with sample periods of 1972 to 2017 and 1960 to 2016 respectively, reserving the last 20% of the data for out-of-sample testing. In both studies, the ANNs are benchmarked against a Support Vector Regression (SVR) model and a naïve forecast, and the ANNs outperform the SVR benchmark. The empirical results demonstrate that ANNs are superior forecasting models in the domain of macroeconomic forecasting, with the Modular Neural Network performing notably well. However, the results question the utility of ANNs in exchange rate and volatility forecasting: a MACD model outperforms ANNs in exchange rate forecasting both in-sample and out-of-sample, and a GARCH(1,1) model outperforms ANNs in volatility forecasting.
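The MACD benchmark used in the exchange-rate study is a standard technical indicator built from two exponential moving averages. A minimal sketch follows, using the conventional 12/26/9 parameters, which are not necessarily those used in the thesis.

```python
def ema(series, span):
    # Exponential moving average with smoothing factor alpha = 2 / (span + 1),
    # seeded with the first observation
    alpha = 2.0 / (span + 1)
    out = [series[0]]
    for price in series[1:]:
        out.append(alpha * price + (1 - alpha) * out[-1])
    return out

def macd(prices, fast=12, slow=26, signal=9):
    # MACD line = fast EMA minus slow EMA; signal line = EMA of the MACD line.
    # The MACD line crossing above its signal line is conventionally read
    # as a bullish signal, and crossing below as bearish.
    macd_line = [f - s for f, s in zip(ema(prices, fast), ema(prices, slow))]
    signal_line = ema(macd_line, signal)
    return macd_line, signal_line
```

On a flat price series both lines stay at zero; on a steadily rising series the fast EMA leads the slow one, so the MACD line turns positive.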

    Data Mining As A Tool To Evaluate Thermal Comfort Of Horses

    Thermal comfort is of great importance in preserving body-temperature homeostasis under thermal stress conditions. Although the thermal comfort of horses has been widely studied, research has not reported its relationship to surface temperature (TS). The aim of this study was to investigate the potential of data mining techniques as a tool to associate surface temperature with the thermal comfort of horses. TS was measured using infrared thermographic image processing. Physiological and environmental variables were used to define the predicted class, which classified thermal comfort as "comfort" or "discomfort". The TS variables for the armpit, croup, breast and groin of the horses, together with the predicted class, were then submitted to a machine learning process. All dataset variables were considered relevant to the classification problem, and the decision-tree model yielded an accuracy rate of 74.0%. The feature selection methods used to reduce computational cost and simplify predictive learning reduced the model accuracy to 70.1%; however, the model became simpler, with representative rules. For these selection methods, and for classification using all attributes, the TS of the armpit and breast had the highest predictive power for thermal comfort. The data mining techniques discovered new variables related to the thermal comfort of horses.
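A pruned decision tree of the kind described often reduces to a few threshold rules on single attributes. The decision stump below is purely illustrative: the attribute, the 33.0 °C threshold, and the measurements are all invented, not taken from the study.

```python
def threshold_rule(ts_armpit, threshold=33.0):
    # One-attribute decision stump: predict "discomfort" when the armpit
    # surface temperature exceeds the (hypothetical) threshold in degrees C
    return "discomfort" if ts_armpit > threshold else "comfort"

def accuracy(predictions, labels):
    # Fraction of predictions that match the true labels
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

# Hypothetical armpit surface temperatures (degrees C) and comfort labels,
# for illustration only
temps = [31.5, 34.2, 32.8, 35.0, 30.9]
labels = ["comfort", "discomfort", "comfort", "discomfort", "comfort"]
preds = [threshold_rule(t) for t in temps]
```

Evaluating such a rule against labelled observations with `accuracy(preds, labels)` mirrors, in miniature, the accuracy comparison the study reports between the full-attribute model and the reduced one.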