6 research outputs found

    Prediction of earnings per share for industry

    Prediction of Earnings Per Share (EPS) is a fundamental problem in the finance industry. Various data mining technologies have been widely used in computational finance. This research aims to predict future EPS from previous values using data mining technologies, and thus to give decision makers a reference or evidence for their economic strategies and business activity. We created three models for the regression problem: Linear Regression (LR), Radial Basis Function (RBF) and Multilayer Perceptron (MLP). Our experiments with these models were carried out on real datasets provided by a software company. Performance was assessed using the Correlation Coefficient and Root Mean Squared Error. The algorithms were validated with data from six different companies, and some differences between the models were observed. In most cases, Linear Regression and Multilayer Perceptron are capable of predicting future EPS effectively, but for highly nonlinear data MLP performs better.
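The two assessment measures named in this abstract, the correlation coefficient and the Root Mean Squared Error, can be computed as in the minimal sketch below. It assumes the Pearson form of the correlation coefficient, which the abstract does not specify. The function names and the EPS values are hypothetical and invented purely for illustration; they are not taken from the paper.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

def pearson_r(actual, predicted):
    """Pearson correlation coefficient (assumed form of the 'Correlation Coefficient')."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

# Hypothetical EPS values, invented for illustration only.
actual_eps = [1.0, 1.2, 1.5, 1.9]
predicted_eps = [1.1, 1.1, 1.6, 1.8]

print("RMSE:", rmse(actual_eps, predicted_eps))
print("Correlation:", pearson_r(actual_eps, predicted_eps))
```

A lower RMSE and a correlation closer to 1 both indicate predictions that track the actual series more closely, which is how the two measures complement each other when comparing models.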

    Information gain directed genetic algorithm wrapper feature selection for credit rating

    Financial credit scoring is one of the most crucial processes in the finance industry, used to assess the creditworthiness of individuals and enterprises. Various statistics-based machine learning techniques have been employed for this task. The "curse of dimensionality" remains a significant challenge for machine learning techniques. Some research has been carried out on Feature Selection (FS) using a genetic algorithm as a wrapper to improve the performance of credit scoring models. However, the challenge lies in finding an overall best method for credit scoring problems and in speeding up the time-consuming process of feature selection. In this study, the credit scoring problem is investigated through feature selection to improve classification performance. This work proposes a novel approach to feature selection in credit scoring applications, called the Information Gain Directed Feature Selection algorithm (IGDFS), which ranks features by information gain and then propagates the top m features through the GA wrapper (GAW) algorithm, using three classical machine learning algorithms (KNN, Naïve Bayes and Support Vector Machine (SVM)) for credit scoring. The first stage, information gain guided feature selection, can help reduce the computing complexity of the GA wrapper, and the information gain of the features selected with IGDFS can indicate their importance to decision making.
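The first stage described in this abstract, ranking features by information gain and keeping the top m, might look roughly like the sketch below. This is not the authors' IGDFS implementation: it covers only the ranking step for discrete features, the GA wrapper stage is omitted, and the function names, feature names and toy records are all hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def top_m_features(columns, labels, m):
    """Rank features by information gain and return the m best, plus all gains."""
    gains = {name: information_gain(values, labels) for name, values in columns.items()}
    ranked = sorted(gains, key=gains.get, reverse=True)
    return ranked[:m], gains

# Toy credit records, invented for illustration only.
labels = ["good", "good", "bad", "bad"]
columns = {
    "has_default": ["no", "no", "yes", "yes"],
    "region": ["a", "b", "a", "b"],
    "income_band": ["high", "low", "low", "low"],
}

selected, gains = top_m_features(columns, labels, m=2)
print(selected)
```

Only the features in `selected` would then be handed to a wrapper search, which is how a filter stage of this kind can shrink the search space the genetic algorithm has to explore.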

    Methods to Improve the Prediction Accuracy and Performance of Ensemble Models

    The application of ensemble predictive models has been an important research area in predicting medical diagnostics, engineering diagnostics, and other related smart devices and technologies. Most current predictive models are complex and not reliable, despite numerous past efforts by the research community. The performance accuracy of predictive models has not always been realised, owing to factors such as complexity and class imbalance. There is therefore a need to improve the predictive accuracy of current ensemble models and to enhance their applications, reliability and non-visual predictive tools. The research presented in this thesis adopted a pragmatic phased approach to propose and develop new ensemble models using multiple methods, and validated the methods through rigorous testing and implementation in different phases. The first phase comprises empirical investigations of standalone and ensemble algorithms, carried out to ascertain their performance effects on the complexity and simplicity of the classifiers. The second phase comprises an improved ensemble model based on the integration of the Extended Kalman Filter (EKF), Radial Basis Function Network (RBFN) and AdaBoost algorithms. The third phase comprises an extended model based on early stop concepts, the AdaBoost algorithm, and the statistical performance of the training samples, intended to minimize overfitting of the proposed model. The fourth phase comprises an enhanced analytical multivariate logistic regression predictive model, developed to minimize the complexity and improve the prediction accuracy of the logistic regression model. To facilitate the practical application of the proposed models, an ensemble non-invasive analytical tool is proposed and developed. The tool bridges the gap between theoretical concepts and the practical application of theories to predict breast cancer survivability.
The empirical findings suggested that: (1) increasing the complexity and topology of algorithms does not necessarily lead to better algorithmic performance, (2) boosting by resampling performs slightly better than boosting by reweighting, (3) the proposed ensemble EKF-RBFN-AdaBoost model achieved better prediction accuracy than several established ensemble models, (4) the proposed early stopped model converges faster and minimizes overfitting better compared with other models, (5) the proposed multivariate logistic regression concept minimizes model complexity, and (6) the proposed analytical non-invasive tool performed comparatively better than many of the benchmark analytical tools used in predicting breast cancer and diabetes. The research contributions to ensemble practice are: (1) the integration and development of the EKF, RBFN and AdaBoost algorithms as an ensemble model, (2) the development and validation of an ensemble model based on early stop concepts, AdaBoost, and statistical concepts of the training samples, (3) the development and validation of a predictive logistic regression model based on breast cancer, and (4) the development and validation of a non-invasive breast cancer analytic tool based on the predictive models proposed and developed in this thesis. To validate the prediction accuracy of the ensemble models, the proposed models were applied to modelling breast cancer survivability and to diabetes diagnostic tasks. In comparison with other established models, the simulation results showed improved predictive accuracy. The research outlines the benefits of the proposed models and proposes new directions for future work that could further extend and improve them.

    Data mining in computational finance

    Computational finance is a relatively new discipline whose birth can be traced back to the early 1950s. Its major objective is to develop and study practical models, focusing on techniques that apply directly to financial analyses. The large number of decisions and computationally intensive problems involved in this discipline make data mining and machine learning models integral to improving, automating, and expanding current processes. One objective of this research is to present the state of the art of the data mining and machine learning techniques applied in the core areas of computational finance. Next, a detailed analysis of public and private finance datasets is performed in an attempt to find interesting facts in the data and to draw conclusions about the usefulness of features within the datasets. Credit risk evaluation is one of the crucial modern concerns in this field. Credit scoring is essentially a classification problem in which models are built using information about past applicants to categorise new applicants as ‘creditworthy’ or ‘non-creditworthy’. We appraise the performance of a few classical machine learning algorithms on the credit scoring problem. Typically, credit scoring databases are large and characterised by redundant and irrelevant features, which makes the classification task more computationally demanding. Feature selection is the process of selecting an optimal subset of relevant features. We propose an improved information-gain directed wrapper feature selection method using genetic algorithms and successfully evaluate its effectiveness against baseline and generic wrapper methods on three benchmark datasets. One of the tasks of financial analysts is to estimate a company’s worth. In the last piece of work, this study predicts the growth rate of companies’ earnings using three machine learning techniques.
We employed the technique of lagged features, which allowed varying amounts of recent history to be brought into the prediction task and transformed the time series forecasting problem into a supervised learning problem. This work was applied to a private time series dataset.
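The lagged-feature technique mentioned above can be illustrated with a short sketch: each training row holds the previous `n_lags` observations and the target is the next value. The function name `make_lagged_dataset` and the sample series are hypothetical, and the thesis may construct its lags differently.

```python
def make_lagged_dataset(series, n_lags):
    """Turn a univariate series into (rows, targets) for supervised learning.

    Each row contains the n_lags values that precede the corresponding target.
    """
    if n_lags < 1 or n_lags >= len(series):
        raise ValueError("n_lags must be at least 1 and smaller than the series length")
    rows, targets = [], []
    for t in range(n_lags, len(series)):
        rows.append(list(series[t - n_lags:t]))  # the recent history window
        targets.append(series[t])                # the value to predict
    return rows, targets

# Hypothetical earnings series, invented for illustration only.
earnings = [10, 12, 15, 19, 24]
rows, targets = make_lagged_dataset(earnings, n_lags=2)

for row, target in zip(rows, targets):
    print(row, "->", target)
```

Changing `n_lags` is what lets varying amounts of recent history enter the prediction task; any standard regressor can then be fitted on `rows` and `targets`.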

    Applicability study of the SWAT model for the hydrological management of mountain basins (Estudio de aplicabilidad del modelo SWAT para la gestión hidrológica de cuencas de montaña)

    The forest area of the mountain basins of Catalonia has increased progressively in recent decades. Tools that allow the impact of land use changes on water resources to be assessed are essential for proper management and planning. To this end, this work configured, calibrated and validated the ‘Soil and Water Assessment Tool’ (SWAT) hydrological model in the tributary sub-basin of La Baells Reservoir, a mountain basin straddling the Pyrenees and the Catalan Pre-Pyrenees and belonging to the Llobregat River basin. A "Split-sample test" type of Calibration-Validation was followed, using flow data restored to the natural regime at the outlet of the reservoir, with a calibration period from 01/01/1988 to 12/31/1999 (12 years) and 12 years of warm-up, and a validation period from 01/01/1978 to 12/31/1987 (10 years) and 2 years of warm-up. After identifying the parameters that provided the greatest improvement in model behavior, combinations of them were tested, increasing the number of parameters one at a time in order of individual improvement. The best of these parameter combinations, in terms of statistics on model behavior in the calibration period, were selected. The calibrated model was then run with these configurations over the validation period, and the configuration with the best behavior in both the validation and calibration periods was chosen. The chosen configuration consisted of a relative change in the values of the parameters SOL_AWC (.sol), SOL_Z (.sol) and SOL_CBN (.sol) of 0.444773, 0.7875 and 4.41853 respectively, and the replacement of the values of CO2 (.sub) and LAI_INIT (.mgt, {[],1} (Planting)) by 316 and 52.5 respectively.
The study also evaluated the effects of: a) introducing meteorological data for temperature and precipitation only (Spain02_v5 grid of Universidad de Cantabria (UC) and Agencia Estatal de Meteorología (AEMET)), b) introducing these meteorological data by creating virtual stations at the centroids of the sub-basins, based on the proportion of the area of influence of the grid points with data according to the accumulated cost when the orography is taken into account, or c) using more complete data interpolated for the centroids of the sub-basins with the Meteoland App tool of the Laboratori Forestal Català (Centre de Recerca Ecològica i Aplicacions Forestals - Centre de Ciència i Tecnologia Forestal de Catalunya, CREAF-CTFC). The third option yielded the best results, with average improvements in the statistics of 14.11% and 23.86%, respectively, compared to the first option during the calibration period for the uncalibrated model. As a result of this work, the behavior of the model in the calibration period was rated very good according to the Coefficient of Determination (R2) (0.86), the ‘Nash-Sutcliffe Efficiency’ coefficient (NSE) (0.81), the ‘Ratio of Standard Deviation of Observation to Root Mean Square Error’ (RSR) (0.44) and the ‘Index of Agreement’ (d) (0.96), and good according to ‘Percent Bias’ (PBIAS) (6.7%), with average improvements in the statistics of 32.48% compared to the uncalibrated model. For the validation period, it was rated very good according to PBIAS (0.1%) and d (0.92), good according to R2 (0.81) and satisfactory according to NSE (0.53) and RSR (0.68), with average improvements in the statistics of 28.05% compared to the uncalibrated model.
In this way, we can affirm that the SWAT hydrological model can be considered a useful and robust tool for estimating flows in the study basin and, thereby, for quantifying the effects that changes in climate and/or land use could have on them.
Sustainable Development Goals::13 - Climate Action
Sustainable Development Goals::11 - Sustainable Cities and Communities
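The goodness-of-fit statistics cited in the SWAT abstract (R2, NSE, RSR, PBIAS and d) can be computed from paired observed and simulated flows as in the sketch below. It assumes the commonly used definitions: RSR as RMSE divided by the standard deviation of the observations, PBIAS with the sign convention in which positive values mean underestimation, and Willmott's form of the index of agreement. The study may use variants of these, and the flow values here are invented purely for illustration.

```python
import math

def fit_statistics(observed, simulated):
    """Return R2, NSE, RSR, PBIAS (%) and d for paired flow series."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_s = sum(simulated) / n
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))   # squared model error
    sst = sum((o - mean_o) ** 2 for o in observed)                 # spread of observations
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(observed, simulated))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    potential = sum((abs(s - mean_o) + abs(o - mean_o)) ** 2
                    for o, s in zip(observed, simulated))
    return {
        "R2": cov ** 2 / (sst * var_s),   # squared Pearson correlation
        "NSE": 1.0 - sse / sst,           # Nash-Sutcliffe Efficiency
        "RSR": math.sqrt(sse / sst),      # RMSE / std. dev. of observations
        "PBIAS": 100.0 * sum(o - s for o, s in zip(observed, simulated)) / sum(observed),
        "d": 1.0 - sse / potential,       # index of agreement (Willmott form, assumed)
    }

# Hypothetical flows, invented for illustration only.
observed_flow = [2.0, 4.0, 6.0, 8.0]
simulated_flow = [3.0, 4.0, 5.0, 8.0]

stats = fit_statistics(observed_flow, simulated_flow)
for name, value in stats.items():
    print(name, round(value, 4))
```

Under these definitions RSR equals the square root of 1 - NSE, which agrees with the pairs reported in the abstract (NSE 0.81 with RSR 0.44, and NSE 0.53 with RSR 0.68).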