58 research outputs found

    An Extensive Analysis of Machine Learning Based Boosting Algorithms for Software Maintainability Prediction

    Software maintainability is an indispensable indicator of software quality. It describes the ease with which maintenance activities can be performed to adapt software to a modified environment. The availability and growing popularity of a wide range of Machine Learning (ML) algorithms for data analysis further motivates predicting maintainability. However, an extensive analysis and comparison of ML-based Boosting Algorithms (BAs) for Software Maintainability Prediction (SMP) has not yet been made. Therefore, the current study analyzes and compares five BAs, i.e., AdaBoost, GBM, XGB, LightGBM, and CatBoost, for SMP using open-source datasets. The performance of the proposed prediction models was evaluated using Root Mean Square Error (RMSE), Mean Magnitude of Relative Error (MMRE), Pred(0.25), Pred(0.30), and Pred(0.75) as prediction accuracy measures, followed by a non-parametric statistical test and a post hoc analysis to account for the differences in the performance of the BAs. Based on the residual errors obtained, GBM was the best performer for RMSE, followed by LightGBM, whereas for MMRE, XGB performed best on six of the seven datasets (85.71%), providing minimum MMRE values ranging from 0.90 to 3.82. Further, the statistical test and post hoc analysis showed that significant differences exist in the performance of the BAs, and that XGB and CatBoost outperformed all other BAs for MMRE. Lastly, BAs were compared with four other ML algorithms to demonstrate their superiority. This study may help software developers make more precise predictions in good time and hence reduce overall maintenance costs.
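
The accuracy measures named in this abstract (RMSE, MMRE, and Pred(q)) can be illustrated with a minimal sketch. It assumes the conventional definitions, which the abstract itself does not spell out: the magnitude of relative error is |actual - predicted| / |actual|, MMRE is its mean, and Pred(q) is the fraction of observations whose relative error is at most q. The function names and sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error over paired observations."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def relative_errors(actual, predicted):
    """Magnitude of relative error per observation (assumes no actual value is 0)."""
    return [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]


def mmre(actual, predicted):
    """Mean magnitude of relative error."""
    errors = relative_errors(actual, predicted)
    return sum(errors) / len(errors)


def pred(actual, predicted, q):
    """Pred(q): share of observations whose relative error is at most q."""
    errors = relative_errors(actual, predicted)
    return sum(1 for e in errors if e <= q) / len(errors)


# Hypothetical maintainability values, for illustration only.
actual = [10.0, 20.0, 40.0, 50.0]
predicted = [12.0, 18.0, 30.0, 50.0]

print(rmse(actual, predicted))
print(mmre(actual, predicted))
print(pred(actual, predicted, 0.25))
```

Lower RMSE and MMRE indicate a better fit, while a higher Pred(q) is better, which is why a study can rank models differently depending on the measure used.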

    Group Method of Data Handling Using Christiano–Fitzgerald Random Walk Filter for Insulator Fault Prediction

    Disruptive failures threaten the reliability of electric supply in power branches, often indicated by the rise of leakage current in distribution insulators. This paper presents a novel, hybrid method for fault prediction based on the time series of the leakage current of contaminated insulators. In a controlled high-voltage laboratory simulation, 15 kV-class insulators from an electrical power distribution network were exposed to increasing contamination in a salt chamber. The leakage current was recorded over 28 h of effective exposure, culminating in a flashover in all considered insulators. This flashover event served as the prediction mark that this paper proposes to evaluate. The proposed method applies the Christiano–Fitzgerald random walk (CFRW) filter for trend decomposition and the group method of data handling (GMDH) for time series prediction. The CFRW filter, with its versatility, proved to be more effective than seasonal decomposition using moving averages in reducing non-linearities. The CFRW-GMDH method, with a root-mean-squared error of 3.44 × 10⁻¹², outperformed both the standard GMDH and long short-term memory models in fault prediction. This superior performance suggests that the CFRW-GMDH method is a promising tool for predicting faults in power grid insulators based on leakage current data. This approach can provide power utilities with a reliable tool for monitoring insulator health and predicting failures, thereby enhancing the reliability of the power supply.

    Automobile Insurance Fraud Detection Using Data Mining: A Systematic Literature Review

    Insurance is a pivotal element in modern society, but insurers face a persistent challenge from fraudulent behaviour by policyholders. This behaviour can be detrimental to both insurance companies and their honest customers, but the intricate nature of insurance fraud severely complicates its efficient, automated detection. This study surveys fifty recent publications on automobile insurance fraud detection, published between January 2019 and March 2023, and presents the most commonly used data sets and methods for resampling and detection, as well as interesting, novel approaches. The study adopts the highly cited Systematic Literature Review (SLR) methodology for software engineering research proposed by Kitchenham and Charters and collected studies from four online databases. The findings indicate limited public availability of automobile insurance fraud data sets. In terms of detection methods, the prevailing approach involves supervised machine learning methods that utilise structured, intrinsic features of claims or policies and that do not consider an example-dependent cost of misclassification. However, alternative techniques are also explored, including graph-based methods, unstructured textual data, and cost-sensitive classifiers. The most common resampling approach was found to be oversampling. This SLR has identified commonly used methods in recent automobile insurance fraud detection research, and interesting directions for future research. It adds value over a related review by also including studies published from 2021 onward, and by detailing the methodology used. Limitations of this SLR include its restriction to a small number of publication years and limited validation of choices made during the process.
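
As a rough illustration of the oversampling that this review reports as the most common resampling approach, the sketch below duplicates randomly chosen minority-class rows until every class matches the size of the largest one. This is plain random oversampling, chosen here as an assumption; the surveyed studies may use other variants. The function name, seed parameter, and sample data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random rows of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        # Rows of this class in the original data; duplicates are drawn from here.
        pool = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical claims: six legitimate, two fraudulent.
rows = [{"claim_id": i} for i in range(8)]
labels = ["legit"] * 6 + ["fraud"] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Oversampling should normally be applied to the training split only, so that duplicated rows never leak into the data used for evaluation.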

    Machine Learning Approach for Credit Score Predictions

    This paper addresses the problem of managing the significant rise in requests for credit products that banking and financial institutions face. The aim is to propose an adaptive, dynamic, heterogeneous ensemble credit model that integrates the XGBoost and Support Vector Machine models to improve the accuracy and reliability of credit scoring models for risk assessment. The method employs machine learning techniques to recognise patterns and trends in past data in order to anticipate future occurrences. The proposed approach is compared with existing credit score models to validate its efficacy using five popular evaluation metrics: Accuracy, ROC AUC, Precision, Recall, and F1 score. The paper highlights the challenges of credit scoring models, such as class imbalance, verification latency, and concept drift. The results show that the proposed approach outperforms the existing models on the evaluation metrics, achieving a balance between predictive accuracy and computational cost. The conclusion emphasises the significance of the proposed approach for the banking and financial sector in developing robust and reliable credit scoring models to evaluate the creditworthiness of clients.
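
Four of the five evaluation metrics listed in this abstract follow directly from the confusion-matrix counts, as the minimal sketch below shows. It uses the standard textbook definitions and is not taken from the paper; ROC AUC is left out because it needs predicted scores rather than hard labels. The function names and the sample labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1; ratios with a zero denominator are reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: 1 = defaulted, 0 = repaid.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Under class imbalance, which the abstract names as a challenge, accuracy alone can look high while recall on the minority class stays low, so the metrics are best read together.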

    Predicting stable gravel-bed river hydraulic geometry: A test of novel, advanced, hybrid data mining algorithms

    Accurate prediction of stable alluvial hydraulic geometry, in which erosion and sedimentation are in equilibrium, is one of the most difficult but critical topics in the field of river engineering. Data mining algorithms have been gaining more attention in this field due to their high performance and flexibility. However, an understanding of the potential for these algorithms to provide fast, cheap, and accurate predictions of hydraulic geometry is lacking. This study provides the first quantification of this potential. Using at-a-station field data, predictions of flow depth, water-surface width, and longitudinal water-surface slope are made using three standalone data mining techniques (Instance-based Learning (IBK), KStar, and Locally Weighted Learning (LWL)) along with four types of novel hybrid algorithms in which the standalone models are trained with the Vote, Attribute Selected Classifier (ASC), Regression by Discretization (RBD), and Cross-validation Parameter Selection (CVPS) algorithms (Vote-IBK, Vote-KStar, Vote-LWL, ASC-IBK, ASC-KStar, ASC-LWL, RBD-IBK, RBD-KStar, RBD-LWL, CVPS-IBK, CVPS-KStar, CVPS-LWL). Through a comparison of their predictive performance and a sensitivity analysis of the driving variables, the results reveal that: (1) Shields stress was the most effective parameter in the prediction of all geometry dimensions; (2) hybrid models had higher predictive power than standalone data mining models, empirical equations, and traditional machine learning algorithms; and (3) the Vote-KStar model had the highest performance in predicting depth and width, and ASC-KStar in estimating slope, each providing very good prediction performance. Through these algorithms, the hydraulic geometry of any river can potentially be predicted accurately and with ease using just a few readily available flow and channel parameters. Thus, the results reveal that these models have great potential for use in stable channel design in data-poor catchments, especially in developing nations where technical modelling skills and understanding of the hydraulic and sediment processes occurring in the river system may be lacking.

    Poly(vinylidene fluoride) electrospun nonwovens morphology: Prediction and optimization of the size and number of beads on fibers through response surface methodology and machine learning regressions

    Electrospinning is one of the leading techniques for fiber development. Still, one of the biggest challenges of the technique is to control the nanofiber morphology without many trial-and-error tests. This study demonstrates that, via design of experiments (DoE), response surface methodology (RSM), and machine learning regressions (MLR), it is possible to predict the beads-on-string size, size distribution, and bead density in electrospun poly(vinylidene fluoride) (PVDF) mats with a small number of tests. PVDF concentration, dimethylacetamide/acetone ratio, tip-to-collector voltage, and tip-to-collector distance were the parameters considered for the design. The results show good agreement between the experimental and modeled data. It was found that concentration and solvent ratio play the main roles in minimizing bead size and number, distance tends to reduce them, and voltage does not play a significant role. As an evaluation of the potential of the method, bead-free fibers were obtained using the predicted parameter values. A comparison of the performance of the two methods is presented for the first time in electrospinning research. Response surface methodology proved much faster, but MLR achieved a lower error and better generalization. This approach and the availability of the MLR script used in this work may help other groups implement it in their research and find information hidden in the data while improving model prediction performance.
    Affiliation: Trupp, Federico Javier. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Física. Laboratorio de Polímeros y Materiales Compuestos; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Física de Buenos Aires. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Física de Buenos Aires; Argentina
    Affiliation: Cibils, Roberto Manuel. Invap S.E.; Argentina
    Affiliation: Goyanes, Silvia Nair. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Física. Laboratorio de Polímeros y Materiales Compuestos; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Física de Buenos Aires. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Física de Buenos Aires; Argentina

    A systematic literature review of machine learning techniques for software maintainability prediction

    Context: Software maintainability is one of the fundamental quality attributes of software engineering. The accurate prediction of software maintainability is a significant challenge for the effective management of the software maintenance process. Objective: The major aim of this paper is to present a systematic review of studies related to the prediction of maintainability of object-oriented software systems using machine learning techniques. This review identifies and investigates a number of research questions to comprehensively summarize, analyse, and discuss various viewpoints concerning software maintainability measurements, metrics, datasets, evaluation measures, individual models, and ensemble models. Method: The review uses the standard systematic literature review method applied to the most common computer science digital database libraries from January 1991 to July 2018. Results: We survey 56 relevant studies in 35 journals and 21 conference proceedings. The results indicate that there is relatively little activity in the area of software maintainability prediction compared with other software quality attributes. CHANGE maintenance effort and the maintainability index were the most commonly used software measurements (dependent variables) employed in the selected primary studies, and most made use of class-level product metrics as the independent variables. Several private datasets were used in the selected studies, and there is a growing demand to publish datasets publicly. Most studies focused on regression problems and performed k-fold cross-validation. Individual prediction models were employed in the majority of studies, while ensemble models were used relatively rarely. Conclusion: Based on the findings of this systematic literature review, ensemble models demonstrated higher prediction accuracy than individual models and have been shown to be useful in predicting software maintainability. However, their application is relatively rare, and there is a need to apply these and other models to a wide variety of datasets with the aim of improving the accuracy and consistency of results.
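
The k-fold cross-validation that this review reports for most studies can be sketched as an index split: the samples are divided into k folds, and each fold serves once as the test set while the remaining folds form the training set. The sketch below is a minimal version without shuffling or stratification, which real studies often add; the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering every sample exactly once as test."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    fold_sizes = [base + (1 if i < extra else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(10, 3)
for train, test in folds:
    print("train:", train, "test:", test)
```

A model would be fitted on each training split and scored on the matching test split, with the k scores averaged to give the reported estimate.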

    Prediction of iceberg-seabed interaction using machine learning algorithms

    Every year thousands of icebergs are born out of glaciers in the Arctic zone and carried away by the currents and winds into the North Atlantic. These icebergs may touch the sea bottom in shallow waters and scratch the seabed, an incident called “ice-gouging”. Ice-gouging may endanger the integrity of buried subsea pipelines and power cables because of subgouge soil displacement. In other words, the shear resistance of the soil causes the subgouge soil displacement to extend much deeper than the ice keel tip. This, in turn, may cause the displacement of pipelines and cables buried deeper than the most possible gouge depth. Determining the best burial depth of the pipeline is a key design aspect and needs advanced continuum numerical modeling and costly centrifuge tests. Empirical equations suggested by design codes may also be used, but they usually result in an over-conservative design. Iceberg management, i.e., iceberg towing and re-routing, in which approaching icebergs are hooked and towed in a safe direction, is currently the most reliable approach to protecting subsea and offshore structures. Iceberg management is costly and involves a range of marine fleets and advanced subsea survey tools to determine the iceberg draft, etc. The industry is constantly looking for cost-effective and quick alternatives to predict the iceberg draft and subgouge soil displacements. In this study, powerful machine learning (ML) algorithms were used as an alternative cost-effective approach to first screen the threatening icebergs by determining their drafts and then to predict the subgouge soil displacement to be fed into the structural integrity analysis. Developing a reliable solution to predict the iceberg draft and subgouge soil displacement requires a profound understanding of the problem's dominant parameters. Therefore, the present study started with dimensional analyses to identify the dimensionless groups of key parameters governing the physics of the problem. Two comprehensive datasets were constructed using the monitored characteristics of real icebergs for draft prediction and experimental studies reported in the literature for the subgouge soil displacements. Using the constructed database, 14 ML algorithms ranging from neural network-based (NN-based) to tree-based methods were sequentially used to predict the iceberg draft and the subgouge soil displacement. The studies were conducted for both clay and sand seabeds. Using different combinations of the input parameters, several ML models were developed and assessed by performing sensitivity analysis, error analysis, discrepancy analysis, uncertainty analysis, and partial derivative sensitivity analysis to identify the superior ML models along with the most influential input parameters. The best ML model was able to predict the iceberg drafts alongside the subgouge soil features with the highest level of precision and correlation and the lowest degree of complexity. A set of ML-based explicit equations was also derived from the wide range of field and experimental measurements for the estimation of iceberg drafts, subgouge soil deformations, and ice keel reaction forces, which outperformed the existing empirical equations. The study resulted in a set of tools that can be used both for cost-effective screening of threatening icebergs and for the prediction of the corresponding subgouge soil displacements. The outcome of the study can contribute to a significant reduction of iceberg management costs and greenhouse gas (GHG) emissions through the mitigation of the marine spread operation.

    Aplicação de modelos preditivos para o setor alimentar: um estudo comparativo (Application of predictive models for the food sector: a comparative study)

    Master's in Applied Econometrics and Forecasting. In today's society, innovation plays an increasingly prominent role in companies. This report results from a curricular internship at a world leader in the wholesale of olive oil, with the main objective of finding a model capable of predicting the prices of its goods. To this end, several methodologies were analyzed, combining traditional models with more innovative and recent ones. Accordingly, the ARIMA, ARIMAX, and VAR models were analyzed as the more traditional models, in contrast to artificial neural networks of the MLP and GMDH types. For the case study, data from the three olive oils of most interest to the company were used, distributed over two different time sets, thus allowing the analysis of the impact of sample size on the forecasts. The impact of independent variables (namely meteorological, macroeconomic, and others that affect olive production) on the purchase prices of olive oil was studied. The results point to a better performance of the VAR model in all groups of data under analysis, thus obtaining the best forecasts within the set of models. Also noteworthy is the preference for more traditional models when the series is shorter, and the better performance of neural networks on larger data sets, with the GMDH-type network preferred over the MLP network. It is also concluded that, within the vast set of variables under analysis, a binary variable that influences production (the harvest, "safra") has the greatest impact on the forecasts.