3,429 research outputs found

    Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

    Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyze the predictive performance after using the k-NN missing data imputation technique to see whether it is better to tolerate missing data or to impute missing values and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, but the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly if the missing data percentage exceeds 40%.
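For readers unfamiliar with the imputation step this abstract describes, the following is a minimal sketch of k-NN imputation on numeric data. It is not the authors' implementation: the function name `knn_impute`, the use of `None` for missing values, the distance (root mean squared difference over the features both rows have), and the mean-of-neighbours fill rule are all assumptions made for illustration. The C4.5 step that follows imputation is not shown.

```python
import math


def knn_impute(rows, k=3):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Hypothetical helper: distance is the root mean squared difference over
    the features that both rows have observed (an assumed choice).
    """
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A donor row must be a different row and must observe feature j.
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:  # leave the gap if no donor row exists
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


projects = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [9.0, 9.0, 100.0],
    [1.0, 2.0, None],  # missing value to impute
]
filled = knn_impute(projects, k=2)
```

In practice the features would usually be scaled before computing distances, since otherwise the feature with the largest range dominates the neighbour search.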

    Omnivariate rule induction using a novel pairwise statistical test

    Rule learning algorithms, for example RIPPER, induce univariate rules, that is, a propositional condition in a rule uses only one feature. In this paper, we propose an omnivariate induction of rules where, under each condition, both a univariate and a multivariate condition are trained, and the best is chosen according to a novel statistical test. This paper has three main contributions. First, we propose a novel statistical test to compare two classifiers, the combined 5 x 2 cv t test, which is a variant of the 5 x 2 cv t test, and give its connections to other tests such as the 5 x 2 cv F test and the k-fold paired t test. Second, we propose a multivariate version of RIPPER, where a support vector machine with a linear kernel is used to find multivariate linear conditions. Third, we propose an omnivariate version of RIPPER, where the model selection is done via the combined 5 x 2 cv t test. Our results indicate that 1) the combined 5 x 2 cv t test has higher power (lower type II error), lower type I error, and higher replicability compared to the 5 x 2 cv t test, and 2) omnivariate rules are better in that they choose whichever condition is more accurate, selecting the right model automatically and separately for each condition in a rule.
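As background for the tests named in this abstract, here is a small sketch of the two established statistics it builds on: the standard 5 x 2 cv paired t test and the 5 x 2 cv F test. It does not implement the combined 5 x 2 cv t test the paper proposes, whose definition is not given in the abstract. The function name `five_by_two_cv_stats` and the input layout (five pairs of per-fold differences in error rate between the two classifiers) are assumptions for illustration.

```python
import math


def five_by_two_cv_stats(diffs):
    """Return (t, f) for the standard 5 x 2 cv t test and 5 x 2 cv F test.

    diffs: five (p1, p2) pairs, where p1 and p2 are the differences in error
    rate between two classifiers on the two folds of one 2-fold replication.
    """
    if len(diffs) != 5:
        raise ValueError("expected five replications of 2-fold cross-validation")
    variances = []
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2
        # Estimated variance of the difference for this replication.
        variances.append((p1 - mean) ** 2 + (p2 - mean) ** 2)
    var_sum = sum(variances)
    if var_sum == 0:
        raise ValueError("all replications have zero variance")
    # t test: first difference of the first replication over the pooled
    # standard deviation; compared against a t distribution with 5 d.o.f.
    t_stat = diffs[0][0] / math.sqrt(var_sum / 5)
    # F test: uses all ten differences; compared against F with (10, 5) d.o.f.
    f_stat = sum(p1 ** 2 + p2 ** 2 for p1, p2 in diffs) / (2 * var_sum)
    return t_stat, f_stat


example = [(0.02, 0.04), (0.03, 0.01), (0.02, 0.02), (0.05, 0.03), (0.01, 0.03)]
t_value, f_value = five_by_two_cv_stats(example)
```

At the 0.05 level, the t statistic would be compared against roughly 2.571 (two-sided, 5 degrees of freedom) and the F statistic against roughly 4.74 (10 and 5 degrees of freedom). The example numbers are made up and are not results from the paper.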

    Predicting Corporate Bankruptcy: Lessons from the Past

    The need for corporate bankruptcy prediction models arose in 1960 after an increase in the incidence of major bankruptcies. Over the years, episodes of financial turmoil have increased in number, and so have these bankruptcy prediction models. Existing reviews of bankruptcy models are either narrowly focused or outdated. The current study aims to provide an overview of the existing models for predicting bankruptcy and to review the significance of these models. Furthermore, it highlights the problems and issues in the existing models that hinder accuracy in predicting bankruptcy.

    An academic review: applications of data mining techniques in finance industry

    With the development of Internet techniques, data volumes are doubling every two years, faster than predicted by Moore’s Law. Big Data Analytics is becoming particularly important for enterprise business. Modern computational technologies provide effective tools to help understand hugely accumulated data and leverage this information to get insights into the finance industry. Because there are no physical products to manufacture in the finance industry, data has become the most valuable asset of financial organisations seeking actionable insights into the business. This is where data mining techniques come to their rescue, by allowing access to the right information at the right time. These techniques are used by the finance industry in various areas such as fraud detection, intelligent forecasting, credit rating, loan management, customer profiling, money laundering, marketing and prediction of price movements, to name a few. This work aims to survey the research on data mining techniques applied to the finance industry from 2010 to 2015. The review finds that stock prediction and credit rating have received the most attention from researchers, compared to loan prediction, money laundering and time series prediction. Due to the dynamics, uncertainty and variety of data, nonlinear mapping techniques have been studied more deeply than linear techniques. It has also been shown that hybrid methods are the most accurate in prediction, closely followed by neural network techniques. This survey could provide an overview of applications of data mining techniques in the finance industry and a summary of methodologies for researchers in this area. In particular, it could provide a good view of data mining techniques in computational finance for beginners who want to work in the field.

    Statistical techniques vs. SEE5 algorithm: an application to a small business environment

    The aim of this research is to compare the accuracy of a rule induction classifier system (Quinlan’s SEE5) with linear discriminant analysis and logit. The classification task chosen is the differentiation of the most efficient companies from the least efficient ones on the basis of a set of financial variables. The sample consists of a database containing the annual accounts of the companies located in the Principality of Asturias (Spain), which are mainly small businesses. The main results indicate that SEE5 outperforms logit, but it is not clearly better than discriminant analysis. However, SEE5 models suffer from bigger increases in error rates when tested with validation samples. Another interesting finding is that in SEE5 systems both the number of variables selected and the number of rules inferred grow when sample size increases.