62 research outputs found

    An Extensive Analysis of Machine Learning Based Boosting Algorithms for Software Maintainability Prediction

    Get PDF
    Software maintainability is an indispensable factor in assessing the quality of a piece of software. It describes the ease with which maintenance activities can be performed to adapt the software to a changed environment. The availability and growing popularity of a wide range of Machine Learning (ML) algorithms for data analysis further motivate predicting this maintainability. However, an extensive analysis and comparison of various ML-based Boosting Algorithms (BAs) for Software Maintainability Prediction (SMP) has not been made yet. Therefore, the current study analyzes and compares five different BAs, i.e., AdaBoost, GBM, XGB, LightGBM, and CatBoost, for SMP using open-source datasets. Performance of the proposed prediction models has been evaluated using Root Mean Square Error (RMSE), Mean Magnitude of Relative Error (MMRE), Pred(0.25), Pred(0.30), and Pred(0.75) as prediction accuracy measures, followed by a non-parametric statistical test and a post hoc analysis to account for differences in the performance of the various BAs. Based on the residual errors obtained, GBM is the best performer for RMSE, followed by LightGBM, whereas for MMRE, XGB performed the best on six of the seven datasets, i.e., 85.71% of the datasets, providing the minimum MMRE values, ranging from 0.90 to 3.82. Further, the statistical test and post hoc analysis showed that significant differences exist in the performance of the different BAs and that XGB and CatBoost outperformed all other BAs for MMRE. Lastly, a comparison of the BAs with four other ML algorithms has also been made to bring out the superiority of BAs over the other algorithms. This study would open new doors for software developers to carry out comparatively more precise predictions well in time and hence reduce overall maintenance costs.
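As a hedged illustration (a sketch, not the study's code) of how such a comparison can be scripted, the following uses scikit-learn's AdaBoost and GBM regressors on a synthetic dataset; XGB, LightGBM and CatBoost expose the same fit/predict interface and could be swapped in. The MMRE and Pred(q) helpers follow their standard definitions; all sizes and parameters are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic stand-in for a maintainability dataset
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
y = np.abs(y) + 1.0  # keep targets positive so relative error is well defined
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def mmre(y_true, y_pred):
    """Mean Magnitude of Relative Error."""
    return np.mean(np.abs(y_true - y_pred) / y_true)

def pred_at(y_true, y_pred, q):
    """Pred(q): fraction of predictions within relative error q."""
    return np.mean(np.abs(y_true - y_pred) / y_true <= q)

for name, model in [("AdaBoost", AdaBoostRegressor(random_state=0)),
                    ("GBM", GradientBoostingRegressor(random_state=0))]:
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    rmse = mean_squared_error(y_te, y_hat) ** 0.5
    print(f"{name}: RMSE={rmse:.2f}  MMRE={mmre(y_te, y_hat):.2f}  "
          f"Pred(0.25)={pred_at(y_te, y_hat, 0.25):.2f}")
```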

    Visual Transfer Learning in the Absence of the Source Data

    Get PDF
    Image recognition has become one of the most popular topics in machine learning. With the development of deep Convolutional Neural Networks (CNNs) and the help of large-scale labeled image databases such as ImageNet, modern image recognition models can achieve performance competitive with human annotation on some general image recognition tasks. Many IT companies have adopted them to improve their vision-related tasks. However, training these large-scale deep neural networks requires thousands or even millions of labeled images, which is an obstacle when applying them to a specific visual task with limited training data. Visual transfer learning is proposed to solve this problem. Visual transfer learning aims at transferring the knowledge from a source visual task to a target visual task. Typically, the target task is related to the source task, and the training data in the target task is relatively small. In visual transfer learning, the majority of existing methods assume that the source data is freely available and use it to measure the discrepancy between the source and target tasks to help the transfer process. However, in many real applications, source data are often subject to legal, technical and contractual constraints between data owners and data customers. Beyond privacy and disclosure obligations, customers are often reluctant to share their data. When operating customer care, collected data may include information on recent technical problems, which is a highly sensitive topic that companies are not willing to share. This scenario, where the source data is absent, is often called Hypothesis Transfer Learning (HTL). Therefore, these previous methods cannot be applied to many real visual transfer learning problems. In this thesis, we investigate the visual transfer learning problem under the HTL setting. Instead of using the source data to measure the discrepancy, we use the source model as a proxy to transfer the knowledge from the source task to the target task. Compared to the source data, a well-trained source model is usually freely accessible in many tasks and contains equivalent source knowledge. Specifically, in this thesis, we investigate visual transfer learning in two scenarios: domain adaptation and learning new categories. In contrast to previous methods in HTL, our methods can both leverage knowledge from more types of source models and achieve better transfer performance. In chapter 3, we investigate the visual domain adaptation problem under the HTL setting. We propose Effective Multiclass Transfer Learning (EMTLe), which can effectively transfer knowledge when the target set is small. Specifically, EMTLe uses the outputs of the source models as an auxiliary bias to adjust the prediction in the target task. Experimental results show that EMTLe can outperform other baselines under the HTL setting. In chapter 4, we investigate the semi-supervised domain adaptation scenario under the HTL setting and propose our framework, Generalized Distillation Semi-supervised Domain Adaptation (GDSDA). Specifically, we show that GDSDA can effectively transfer knowledge using unlabeled data. We also demonstrate that the imitation parameter, the hyperparameter in GDSDA that balances the knowledge from the source and target tasks, is important to the transfer performance. We then propose GDSDA-SVM, which uses SVMs as the base classifiers in GDSDA.
We show that GDSDA-SVM can determine the imitation parameter in GDSDA autonomously. Compared to previous methods, whose imitation parameter can only be determined by either brute-force search or background knowledge, GDSDA-SVM is more effective in real applications. In chapter 5, we investigate the problem of fine-tuning a deep CNN to learn new food categories using the large ImageNet database as our source. Without access to the source data, i.e. the ImageNet dataset, we show that by fine-tuning the parameters of the source model on our target food dataset, we can achieve better performance than previous methods. To conclude, the main contribution of this thesis is an investigation of the visual transfer learning problem under the HTL setting. We propose several methods to transfer knowledge from the source task in supervised and semi-supervised learning scenarios. Extensive experimental results show that, without access to any source data, our methods can outperform previous work.
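The key mechanism in GDSDA is distillation from the source hypothesis weighted by an imitation parameter. The following is a minimal, hypothetical sketch of that idea, assuming a probabilistic source classifier and a fixed imitation parameter lam, with a ridge student fitted to soft targets; it does not reproduce GDSDA-SVM's autonomous tuning of the parameter, and all names and sizes are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes = 2

# stand-ins: a source model trained elsewhere (the transferred "hypothesis"),
# a small labeled target set, and a pool of unlabeled target data
source_model = SVC(probability=True).fit(
    rng.normal(size=(100, 5)), rng.integers(0, n_classes, 100))
X_l, y_l = rng.normal(size=(20, 5)), rng.integers(0, n_classes, 20)
X_u = rng.normal(size=(200, 5))

lam = 0.5  # imitation parameter: weight given to the teacher's soft labels

# labeled data: blend one-hot ground truth with the teacher's probabilities;
# unlabeled data: use the teacher's probabilities alone
T_l = (1 - lam) * np.eye(n_classes)[y_l] + lam * source_model.predict_proba(X_l)
T_u = source_model.predict_proba(X_u)
X_all, T_all = np.vstack([X_l, X_u]), np.vstack([T_l, T_u])

# student: least-squares fit to the soft targets, one output per class
student = Ridge(alpha=1.0).fit(X_all, T_all)
print(student.predict(X_l).argmax(axis=1))  # predicted target-task classes
```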

    A-SFS: Semi-supervised Feature Selection based on Multi-task Self-supervision

    Full text link
    Feature selection is an important process in machine learning. It builds an interpretable and robust model by selecting the features that contribute the most to the prediction target. However, most mature feature selection algorithms, both supervised and semi-supervised, fail to fully exploit the complex potential structure among features. We believe these structures are very important for the feature selection process, especially when labels are lacking and data is noisy. To this end, we introduce a deep learning-based self-supervised mechanism into feature selection, namely batch-Attention-based Self-supervision Feature Selection (A-SFS). First, a multi-task self-supervised autoencoder is designed to uncover the hidden structure among features with the support of two pretext tasks. Guided by the integrated information from the multi-task self-supervised learning model, a batch-attention mechanism is designed to generate feature weights according to batch-based feature selection patterns, to alleviate the impact of a handful of noisy data points. This method is compared to 14 strong benchmarks, including LightGBM and XGBoost. Experimental results show that A-SFS achieves the highest accuracy on most datasets. Furthermore, this design significantly reduces the reliance on labels: only 1/10 of the labeled data is needed to achieve the same performance as the state-of-the-art baselines. The results also show that A-SFS is the most robust to noisy and missing data.
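The paper's exact pretext tasks and batch-attention mechanism are not reproduced here, but the following hedged PyTorch sketch shows the general shape of a multi-task self-supervised autoencoder, assuming the two pretext tasks are feature reconstruction and corruption-mask prediction (an assumption for illustration, not the paper's specification):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PretextAE(nn.Module):
    """Autoencoder trained on two pretext tasks over corrupted inputs."""
    def __init__(self, d, h=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, h), nn.ReLU())
        self.dec_recon = nn.Linear(h, d)  # task 1: reconstruct the clean features
        self.dec_mask = nn.Linear(h, d)   # task 2: locate the corrupted entries
    def forward(self, x):
        z = self.enc(x)
        return self.dec_recon(z), torch.sigmoid(self.dec_mask(z))

d = 16
model = PretextAE(d)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(128, d)  # a batch of unlabeled samples
for _ in range(100):
    mask = (torch.rand_like(x) < 0.3).float()              # corrupt ~30% of entries
    x_tilde = x * (1 - mask) + torch.randn_like(x) * mask  # noised input
    recon, mask_hat = model(x_tilde)
    loss = F.mse_loss(recon, x) + F.binary_cross_entropy(mask_hat, mask)
    opt.zero_grad(); loss.backward(); opt.step()
```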

    Machine Learning Approach for Credit Score Predictions

    Get PDF
    This paper addresses the problem of managing the significant rise in requests for credit products that banking and financial institutions face. The aim is to propose an adaptive, dynamic, heterogeneous ensemble credit model that integrates the XGBoost and Support Vector Machine models to improve the accuracy and reliability of risk assessment credit scoring models. The method employs machine learning techniques to recognise patterns and trends in past data in order to anticipate future occurrences. The proposed approach is compared with existing credit score models to validate its efficacy using five popular evaluation metrics: Accuracy, ROC AUC, Precision, Recall and F1_Score. The paper highlights the challenges faced by credit scoring models, such as class imbalance, verification latency and concept drift. The results show that the proposed approach outperforms the existing models on the evaluation metrics, achieving a balance between predictive accuracy and computational cost. The conclusion emphasises the significance of the proposed approach for the banking and financial sector in developing robust and reliable credit scoring models to evaluate the creditworthiness of their clients.
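A minimal sketch of a heterogeneous XGBoost-plus-SVM ensemble, hedged as an illustration rather than the paper's adaptive, dynamic integration scheme; it assumes the xgboost package is installed and uses soft voting over an imbalanced synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from xgboost import XGBClassifier  # assumes the xgboost package is installed

# imbalanced synthetic data, loosely mimicking credit default rates
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

ensemble = VotingClassifier(
    estimators=[("xgb", XGBClassifier(eval_metric="logloss")),
                ("svm", SVC(probability=True))],
    voting="soft")  # average predicted probabilities of the two base models
ensemble.fit(X_tr, y_tr)
print("ROC AUC:", roc_auc_score(y_te, ensemble.predict_proba(X_te)[:, 1]))
```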

    Dropout Prediction: A Systematic Literature Review

    Get PDF
    Dropout prediction is a challenging analysis process that requires appropriate approaches to address dropout. Existing approaches are applied in different areas such as education, telecommunications, retail, social networks, and banking services. The goal is to identify customers at risk of dropout in order to support retention strategies. This research developed a systematic literature review to evaluate the state of existing studies that predict dropout using machine learning, following the guidelines recommended by Kitchenham and Peterson. The systematic review followed three phases: planning, conducting, and reporting. The selection of the most relevant articles was based on the Active Systematic Review tool, which uses artificial intelligence algorithms. The criteria identified 28 articles, and several research lines were identified. Dropout is a transversal problem for several sectors of economic activity, and countermeasures can be taken before it happens if it is detected early.

    Quadri-dimensional approach for data analytics in mobile networks

    Get PDF
    The telecommunication market is growing at a very fast pace with the evolution of new technologies that support high-speed throughput and the availability of a wide range of services and applications in mobile networks. This has led communication service providers (CSPs) to shift their focus from monitoring network elements towards monitoring services and subscriber satisfaction, introducing service quality management (SQM) and customer experience management (CEM). These require fast responses to reduce the time to find and solve network problems, to ensure efficient and proactive maintenance, and to improve the quality of service (QoS) and quality of experience (QoE) of subscribers. While both SQM and CEM demand multiple kinds of information from different interfaces, managing multiple data sources adds an extra layer of complexity to data collection. Although several studies have been conducted on data analytics in mobile networks, most of them did not consider analytics based on the four dimensions involved in the mobile network environment, namely the subscriber, the handset, the service and the network element, with multiple-interface correlation. The main objective of this research was to develop mobile network analytics models applied to the 3G packet-switched domain, analysing data from the radio network on the Iub interface and from the core network on the Gn interface to provide a fast root cause analysis (RCA) approach that considers the four dimensions involved in mobile networks. This was achieved by using the latest computer engineering advancements, namely Big Data platforms and data mining techniques through machine learning algorithms.

    A machine learning framework for applications in the petrochemical industry (Koneoppimiskehys petrokemianteollisuuden sovelluksille)

    Get PDF
    Machine learning has many potentially useful applications in the process industry, for example in process monitoring and control. Continuously accumulating process data, together with recent developments in software and hardware that enable more advanced machine learning, fulfil the prerequisites for developing and deploying machine learning applications integrated with process automation that improve existing functionalities or even implement artificial intelligence. In this master's thesis, a framework is designed and implemented at a proof-of-concept level to enable easy acquisition of process data for use with modern machine learning libraries, and to enable scalable online deployment of the trained models. The literature part of the thesis concentrates on the current state of, and approaches to, digital advisory systems for process operators, as a potential application to be developed on the machine learning framework. The literature study shows that approaches to process operators' decision support tools have shifted from rule-based and knowledge-based methods to machine learning. However, no standard methods can be identified, and most use cases are quite application-specific. The developed machine learning framework uses both commercial software and open-source components with permissive licenses. Data is acquired over OPC UA and then processed in Python, which is currently almost the de facto standard language in data analytics. A microservice architecture with containerization is used for the online deployment, and in a qualitative evaluation it proved to be a versatile and functional solution.
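For the data-acquisition step, here is a minimal sketch of reading one process measurement over OPC UA with the open-source python-opcua library; the endpoint URL and node id are hypothetical placeholders:

```python
from opcua import Client  # python-opcua package; asyncua is its maintained successor

client = Client("opc.tcp://localhost:4840/freeopcua/server/")  # hypothetical endpoint
client.connect()
try:
    node = client.get_node("ns=2;i=2")  # hypothetical node id of a process measurement
    value = node.get_value()            # read the current value over OPC UA
    print("measurement:", value)
finally:
    client.disconnect()
```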

    Development of sustainable groundwater management methodologies to control saltwater intrusion into coastal aquifers with application to a tropical Pacific island country

    Get PDF
    Saltwater intrusion due to the over-exploitation of groundwater in coastal aquifers is a critical challenge facing groundwater-dependent coastal communities throughout the world. Sustainable management of coastal aquifers, with abstracted groundwater quality maintained within permissible salinity limits, is an important groundwater management problem that urgently necessitates reliable and optimal management methodologies. This study focuses on the development and evaluation of groundwater salinity prediction tools, coastal aquifer multi-objective management strategies, and adaptive management strategies using new prediction models, coupled simulation-optimization (S/O) models, and monitoring network design, respectively.

Predicting the extent of saltwater intrusion into coastal aquifers in response to existing and changing pumping patterns is a prerequisite of any groundwater management framework. This study investigates the feasibility of using support vector machine regression (SVMR), an artificial intelligence-based machine learning algorithm, to predict salinity at monitoring wells in an illustrative aquifer under variable groundwater pumping conditions. For evaluation purposes, the prediction results of SVMR are compared with well-established genetic programming (GP) based surrogate models. The prediction capabilities of the two learning machines are evaluated using several measures to ensure their practicality and generalisation ability. Also, a sensitivity analysis methodology is proposed for assessing the impact of pumping rates on salt concentrations at monitoring locations. The performance evaluations suggest that the predictive capability of SVMR is superior to that of the GP models. The sensitivity analysis identifies a subset of the most influential pumping rates, which is used to construct new SVMR surrogate models with improved predictive capabilities. The improved predictive capability and generalisation ability of SVMR models, together with the ability to improve prediction accuracy by refining the training dataset, make the use of SVMR models attractive.

Coupled S/O models are efficient tools for designing multi-objective coastal aquifer management strategies. This study applies a regional-scale coupled S/O methodology with a Pareto-front clustering technique to prescribe optimal groundwater withdrawal patterns from the Bonriki aquifer in the Pacific island country of Kiribati. A numerical simulation model is developed, calibrated and validated using field data from the Bonriki aquifer. For computational feasibility, SVMR surrogate models are trained and tested using input-output datasets generated with the flow and transport numerical simulation model. The developed surrogate models were externally coupled with a multi-objective genetic algorithm optimization (MOGA) model as a substitute for the numerical model. The study area contains freshwater pumping wells for extracting groundwater. Pumping from barrier wells installed along the coastlines is also considered as a management option to hydraulically control saltwater intrusion. The objective of the multi-objective management model was to maximise pumping from production wells and minimise pumping from barrier wells (which provide a hydraulic barrier), while ensuring that the water quality at different monitoring locations remains within pre-specified limits. The executed multi-objective coupled S/O model generated 700 Pareto-optimal solutions.
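A minimal sketch of the surrogate idea underlying these management models: a support vector regressor trained on synthetic pumping-rate/salinity pairs (standing in for simulator output), followed by a crude one-at-a-time sensitivity probe. The well count, units and response function are assumptions for illustration, not the study's setup.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)
Q = rng.uniform(0, 100, size=(300, 8))  # pumping rates at 8 hypothetical wells
# synthetic linear-plus-noise salinity response standing in for the simulator
salinity = 50 + Q @ rng.uniform(0.1, 1.0, 8) + rng.normal(0, 5, 300)

Q_tr, Q_te, s_tr, s_te = train_test_split(Q, salinity, random_state=0)
surrogate = SVR(kernel="rbf", C=100.0).fit(Q_tr, s_tr)
print("surrogate R^2:", round(r2_score(s_te, surrogate.predict(Q_te)), 3))

# crude one-at-a-time sensitivity: bump each pumping rate, record the response
base = Q_te.mean(axis=0)
for j in range(Q.shape[1]):
    bumped = base.copy()
    bumped[j] += 10.0
    delta = (surrogate.predict(bumped.reshape(1, -1))[0]
             - surrogate.predict(base.reshape(1, -1))[0])
    print(f"well {j}: salinity change ~ {delta:+.2f}")
```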
Analysing a large set of Pareto-optimal solutions is a challenging task for decision-makers. Hence, the k-means clustering technique was utilized to reduce the large Pareto-optimal solution set and help solve the large-scale saltwater intrusion problem in the Bonriki aquifer.

The S/O-based management models have delivered optimal saltwater intrusion management strategies. However, uncertainties in the numerical simulation model due to uncertain aquifer parameters are often not incorporated into the management models. The present study explicitly incorporates aquifer parameter uncertainty into a multi-objective management model for the optimal design of groundwater pumping strategies for the unconfined Bonriki aquifer. To achieve computational efficiency and feasibility of the management model, the calibrated numerical simulation model in the S/O model was replaced with ensembles of SVMR surrogate models. Each standalone SVMR surrogate model in the ensemble is constructed using datasets from a different numerical simulation model with different hydraulic conductivity and porosity values. These ensemble SVMR models were coupled to the MOGA model to solve the Bonriki aquifer management problem, ensuring sustainable withdrawal rates that maintain the specified salinity limits. The executed optimization model presented a Pareto front with 600 non-dominated optimal trade-off pumping solutions. The reliability of the management model, established after validation of the optimal solution results, suggests that the constraints of the optimization problem were satisfied; i.e., the salinities at monitoring locations remained within the pre-specified limits.

The correct implementation of a prescribed optimal management strategy based on the coupled S/O model is always a concern for decision-makers. The management strategy actually implemented in the field sometimes deviates from the recommended optimal strategy, resulting in field-level deviations. Monitoring such field-level deviations during the actual implementation of the recommended optimal management strategy, and sequentially updating the strategy using feedback information, is an important step towards adaptive management of coastal groundwater resources. In this study, a three-phase adaptive management framework for a coastal aquifer subjected to saltwater intrusion is applied and evaluated for a regional-scale coastal aquifer study area. The methodology comprises three sequential components. First, an optimal management strategy (consisting of groundwater extraction from production and barrier wells) is derived and implemented for the optimal management of the aquifer. The implemented management strategy is obtained by solving a homogeneous ensemble-based coupled S/O model. Second, a regional-scale optimal monitoring network is designed for the aquifer system, which considers possible user noncompliance with a recommended management strategy and uncertainty in aquifer parameter estimates. A new monitoring network design is formulated to ensure that candidate monitoring wells are placed at high-risk (highly contaminated) locations. In addition, a k-means clustering methodology is utilized to select candidate monitoring wells in areas representative of the entire model domain. Finally, feedback information in the form of salinity measurements at monitoring wells is used to sequentially modify pumping strategies for future time periods in the management horizon.
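k-means appears twice above: to compress the 700-solution Pareto front and to pick representative monitoring wells. A hedged sketch of the first use, clustering a synthetic two-objective front and keeping one representative per cluster (choosing the member closest to each centroid is a common convention, not necessarily the study's):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# synthetic 700-point trade-off front: production pumping vs. barrier pumping
t = np.sort(rng.uniform(0, 1, 700))
pareto = np.column_stack([t, 1 - t**2]) + rng.normal(0, 0.01, (700, 2))

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(pareto)
# from each cluster, keep the member closest to the centroid as a representative
for c in range(5):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(pareto[members] - km.cluster_centers_[c], axis=1)
    rep = members[np.argmin(dists)]
    print(f"cluster {c}: solution {rep}, objectives {pareto[rep].round(3)}")
```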
The developed adaptive management framework is evaluated by applying it to the Bonriki aquifer system. Overall, the results suggest that the implemented adaptive management strategy has the potential to address practical implementation issues arising from user noncompliance, deviations between the predicted and actual consequences of implementing a management strategy, and uncertainty in aquifer parameters.

Ensemble prediction models are known to be more accurate than standalone prediction models. The present study develops and utilises homogeneous and heterogeneous ensemble models based on several standalone algorithms, including artificial neural networks (ANN), GP, SVMR and Gaussian process regression (GPR). These models are used to predict groundwater salinity in the Bonriki aquifer. Standalone and ensemble prediction models are trained and validated using identical pumping and salinity-concentration datasets generated by solving 3D transient density-dependent coastal aquifer flow and transport numerical simulation models. After validation, the ensemble models are used to predict salinity concentrations at selected monitoring wells in the modelled aquifer under variable groundwater pumping conditions. The predictive capabilities of the developed ensemble models are quantified using standard statistical procedures. The performance evaluation results suggest that the predictive capabilities of the standalone prediction models (ANN, GP, SVMR and GPR) are comparable to those of the variable-density groundwater flow and salt transport numerical simulation model. However, the GPR standalone models had better predictive capabilities than the other standalone models, and the SVMR and GPR standalone models were more efficient (in terms of computational training time) than the others. Among the ensemble models, the performance of the homogeneous GPR ensemble model was found to be superior to that of the other homogeneous and heterogeneous ensemble models.

Employing data-driven predictive models as replacements for complex groundwater flow and transport models enables the prediction of future scenarios and also saves computational time, effort and resources when developing optimal coastal aquifer management strategies based on coupled S/O models. In this study, a new data-driven model, namely the Group Method of Data Handling (GMDH) approach, is developed and utilized to predict salinity concentrations in a coastal aquifer and, simultaneously, determine the input predictor variables (pumping rates) with the most influence on the outcomes (salinity at monitoring locations). To confirm the importance of variables, three tests are conducted in which new GMDH models are constructed using subsets of the original datasets. In TEST 1, new GMDH models are constructed using the most influential variables only. In TEST 2, a subset of 20 variables (the 10 most and 10 least influential) is used to develop new GMDH models. In TEST 3, a subset of the least influential variables is used to develop GMDH models. A performance evaluation demonstrates that the GMDH models developed using the entire dataset have reasonable predictive accuracy and efficiency. A comparison of the performance evaluations of the three tests highlights the importance of appropriately selecting input pumping rates when developing predictive models.
These results suggest that incorporating the least influential variables decreases model accuracy; thus, considering only the most influential variables in salinity prediction models is beneficial and appropriate.

This study also investigated the efficiency and viability of using artificial freshwater recharge (AFR) to increase fresh groundwater pumping rates from production wells. First, the effect of AFR on the inland encroachment of saline water is quantified for existing scenarios. Specifically, groundwater head and salinity differences at monitoring locations before and after artificial recharge are presented. Second, a multi-objective management model incorporating groundwater pumping and AFR is implemented to control groundwater salinization in an illustrative coastal aquifer system. A coupled SVMR-MOGA model is developed for prescribing optimal management strategies that incorporate AFR and groundwater pumping wells. The Pareto-optimal front obtained from the SVMR-MOGA optimization model presents a set of optimal solutions for the sustainable management of the coastal aquifer. The pumping strategies obtained as Pareto-optimal solutions with and without freshwater recharge show that saltwater intrusion is sensitive to AFR. Also, the hydraulic head lenses created by AFR can be used as one practical option to control saltwater intrusion. The developed 3D saltwater intrusion model, the predictive capabilities of the developed SVMR models, and the feasibility of the proposed coupled multi-objective SVMR-MOGA optimization model make the proposed methodology potentially suitable for solving large-scale regional saltwater intrusion management problems.

Overall, the development and evaluation of the various groundwater numerical simulation models, predictive models, multi-objective management strategies and adaptive methodologies will provide decision-makers with tools for the sustainable management of coastal aquifers. It is envisioned that the outcomes of this research will provide useful information to groundwater managers and stakeholders, and offer potential resolutions to policy-makers regarding the sustainable management of groundwater resources. The real-life case study of the Bonriki aquifer presented here gives the scientific community a broader understanding of groundwater resource issues in coastal aquifers and establishes the practical utility of the developed management strategies.
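As a closing illustration of the homogeneous ensemble-of-surrogates idea used above, here is a hedged sketch in which one SVR surrogate is trained per aquifer-parameter realization and predictions are averaged; the simulate function is a synthetic stand-in for a density-dependent flow-and-transport run, and all values are hypothetical.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)

def simulate(Q, conductivity):
    # stand-in for one flow-and-transport run at a given hydraulic conductivity
    return 50 + conductivity * Q.sum(axis=1) + rng.normal(0, 2, len(Q))

Q = rng.uniform(0, 100, size=(200, 8))  # pumping rates at 8 hypothetical wells
# one surrogate per aquifer-parameter realization (a homogeneous SVR ensemble)
ensemble = [SVR(C=100.0).fit(Q, simulate(Q, k)) for k in (0.4, 0.5, 0.6)]

Q_new = rng.uniform(0, 100, size=(5, 8))
mean_salinity = np.mean([m.predict(Q_new) for m in ensemble], axis=0)
print(mean_salinity.round(1))  # ensemble-averaged salinity predictions
```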

    An academic review: applications of data mining techniques in finance industry

    Get PDF
    With the development of Internet techniques, data volumes are doubling every two years, faster than predicted by Moore's Law. Big Data analytics has become particularly important for enterprise business. Modern computational technologies provide effective tools that help understand the hugely accumulated data and leverage this information to gain insights into the finance industry. In order to obtain actionable business insights, data has become the most valuable asset of financial organisations, as there are no physical products in the finance industry to manufacture. This is where data mining techniques come to the rescue by allowing access to the right information at the right time. These techniques are used by the finance industry in various areas such as fraud detection, intelligent forecasting, credit rating, loan management, customer profiling, money laundering detection, marketing and the prediction of price movements, to name a few. This work surveys the research on data mining techniques applied to the finance industry from 2010 to 2015. The review finds that stock prediction and credit rating have received the most attention from researchers, compared to loan prediction, money laundering and time series prediction. Due to the dynamics, uncertainty and variety of data, nonlinear mapping techniques have been studied more deeply than linear techniques. It has also been shown that hybrid methods are more accurate in prediction, closely followed by neural network techniques. This survey provides an overview of the applications of data mining techniques in the finance industry and a summary of methodologies for researchers in this area. In particular, it offers a good overview of data mining techniques in computational finance for beginners who want to work in the field.