
    On Identifying Critical Nuggets Of Information During Classification Task

    In large databases, there may exist critical nuggets - small collections of records or instances that contain domain-specific important information. This information can be used for future decision making, such as labeling critical, unlabeled data records and improving classification results by reducing false positive and false negative errors. In recent years, data mining efforts have focused on pattern and outlier detection methods; however, not much effort has been dedicated to finding critical nuggets within a data set. This work introduces the idea of critical nuggets, proposes an innovative domain-independent method to measure criticality, suggests a heuristic to reduce the search space for finding critical nuggets, and isolates and validates critical nuggets from some real-world data sets. Only a few subsets appear to qualify as critical nuggets, underlining the importance of finding them, and the proposed methodology is able to detect them. This work also identifies certain properties of critical nuggets and provides experimental validation of these properties. Critical nuggets were then applied to two important classification-related performance metrics: classification accuracy and misclassification costs. Experimental results helped validate that critical nuggets can assist in improving classification accuracies on real-world data sets when compared with other standalone classification algorithms, and the improvements in accuracy were statistically significant. Extensive studies were also undertaken on real-world data sets that utilized critical nuggets to help minimize misclassification costs. In this case as well, the critical-nugget-based approach yielded statistically significant, lower misclassification costs than standalone classification methods.
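    As a hedged illustration of the general idea only (not the paper's criticality measure), the sketch below scores a small candidate subset of training records by how much its removal changes a classifier's held-out accuracy; the data set, classifier and subset sizes are all assumptions.

        # Illustrative sketch only: score a small candidate subset of training
        # records by the drop in held-out accuracy when it is removed. This is
        # NOT the paper's criticality measure, just a hypothetical stand-in.
        import numpy as np
        from sklearn.datasets import load_breast_cancer
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import accuracy_score
        from sklearn.model_selection import train_test_split

        X, y = load_breast_cancer(return_X_y=True)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

        def impact_score(subset_idx):
            """Accuracy change on the test set when the subset is removed from training."""
            base = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
            keep = np.setdiff1d(np.arange(len(y_tr)), subset_idx)
            reduced = RandomForestClassifier(random_state=0).fit(X_tr[keep], y_tr[keep])
            return (accuracy_score(y_te, base.predict(X_te))
                    - accuracy_score(y_te, reduced.predict(X_te)))

        # Score a few random candidate subsets of 10 records each.
        rng = np.random.default_rng(0)
        candidates = [rng.choice(len(y_tr), size=10, replace=False) for _ in range(5)]
        scores = sorted((impact_score(s), i) for i, s in enumerate(candidates))
        print(scores[-1])  # candidate whose removal hurts accuracy the most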

    Predictive Modelling of Retail Banking Transactions for Credit Scoring, Cross-Selling and Payment Pattern Discovery

    Evaluating transactional payment behaviour offers a competitive advantage in the modern payment ecosystem, not only for confirming the presence of good credit applicants or unlocking the cross-selling potential between the respective product and service portfolios of financial institutions, but also for ruling out bad credit applicants precisely within transactional payment streams. In a diagnostic test for analysing payment behaviour, I have used a hybrid approach comprising a combination of supervised and unsupervised learning algorithms to discover behavioural patterns. Supervised learning algorithms can compute a range of credit scores and cross-sell candidates, although the applied methods discover only limited behavioural patterns across the payment streams. Moreover, the performance of the applied supervised learning algorithms varies across the different data models, and their optimisation is inversely related to the pre-processed dataset. Subsequently, the research experiments conducted suggest that the Two-Class Decision Forest is an effective algorithm for determining both the cross-sell candidates and the creditworthiness of customers. In addition, a deep-learning model using a neural network has been considered to provide a meaningful interpretation of future payment behaviour through categorised payment transactions, in particular by offering additional deep insights through graph-based visualisations. However, the research shows that unsupervised learning algorithms play a central role in evaluating the transactional payment behaviour of customers: discovering associations using market basket analysis based on previous payment transactions, finding the frequent transaction categories, and developing interesting rules when each transaction category is performed on the same payment stream. The current research also reveals that transactional payment behaviour analysis is multifaceted in the financial industry, serving to assess the diagnostic ability of promotion candidates and to classify bad credit applicants from among the entire customer base. The developed predictive models can also be used to estimate the credit risk of any credit applicant based on his/her transactional payment behaviour profile, combined with deep insights from the categorised payment transaction analysis. The research study provides a full review of the performance characteristics of the different developed data models. Thus, the demonstrated data science approach is a possible proof of how machine learning models can be turned into cost-sensitive data models.
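    As a minimal sketch of the market basket analysis step described above, assuming categorised payment transactions grouped into per-statement baskets (the table, category names and thresholds are invented for illustration, not the thesis's data model):

        # Pairwise association rules (support, confidence, lift) over
        # categorised payment transactions grouped per statement.
        import pandas as pd
        from itertools import combinations

        # Hypothetical categorised transactions: one basket per statement.
        baskets = pd.DataFrame({
            "statement_id": [1, 1, 2, 2, 2, 3, 3, 4, 4],
            "category": ["groceries", "fuel", "groceries", "travel", "fuel",
                         "groceries", "insurance", "groceries", "fuel"],
        })

        # One-hot encode categories per statement.
        onehot = pd.crosstab(baskets["statement_id"], baskets["category"]).astype(bool)

        support = onehot.mean()
        rules = []
        for a, b in combinations(onehot.columns, 2):
            joint = (onehot[a] & onehot[b]).mean()
            if joint > 0:
                rules.append({"rule": f"{a} -> {b}", "support": joint,
                              "confidence": joint / support[a],
                              "lift": joint / (support[a] * support[b])})
        print(pd.DataFrame(rules).sort_values("lift", ascending=False))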

    Integrated performance framework to guide facade retrofit

    The façade retrofit market faces some key barriers: the selection of performance criteria and the reliability of the performance data. On the demand side, the problem is approached from an investment perspective, which creates "split incentives" between the stakeholders who pay for the investment and those who benefit from it. On the supply side, there is an inherent complexity in modeling these options because of the incomplete knowledge of the physical and cost parameters involved in the performance evaluation. The thermal comfort of the building occupant is an important component of the retrofit performance assessment. This research attempts to fill a gap in the approach to façade retrofit decisions by 1) quantifying uncertainties in these three dimensions of performance, 2) incorporating new financing models available in the retrofit market, and 3) considering the target and risk attitude of the decision maker. The methodology proposed in this research integrates key indicators for the delivery process, environmental performance, and investment performance. The purpose is to provide a methodological framework for performance evaluation. A residential case study is conducted to test the proposed framework. Three retrofit scenarios, including their financing structure, are examined. Each façade retrofit scenario is then evaluated based on the level of confidence that it will meet or exceed a specific target improvement in Net Present Value and the risk that it will fall below a minimum improvement threshold. The case study results confirm that risk must be considered for more reliable façade retrofit decision-making. Research findings point to further research needed to expand the understanding of the interdependencies among uncertain parameters.
    PhD. Committee Chair: Augenbroe, Godfried; Committee Chair: De Wilde, Pieter; Committee Member: Aksamija, Ajla; Committee Member: Brown, Jason; Committee Member: Eastman, Charle
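    A minimal sketch of the kind of probabilistic NPV check described above: Monte Carlo samples of uncertain cost and savings give the confidence of meeting a target improvement and the risk of falling below a minimum threshold (all figures and distributions here are invented for illustration).

        # Monte Carlo sketch: probability that a retrofit scenario meets a target
        # NPV improvement, and the risk of falling below a minimum threshold.
        # Every parameter value and distribution is an illustrative assumption.
        import numpy as np

        rng = np.random.default_rng(42)
        n_sims, years, discount = 100_000, 20, 0.05

        capex = rng.normal(250_000, 25_000, n_sims)        # uncertain retrofit cost
        annual_saving = rng.normal(22_000, 5_000, n_sims)  # uncertain energy savings
        discount_factor = sum(1 / (1 + discount) ** t for t in range(1, years + 1))

        npv = annual_saving * discount_factor - capex

        target, floor = 30_000, 0.0
        p_meet_target = np.mean(npv >= target)   # confidence of meeting the target
        p_below_floor = np.mean(npv < floor)     # downside risk
        print(f"P(NPV >= target) = {p_meet_target:.2%}, P(NPV < floor) = {p_below_floor:.2%}")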

    Customer retention

    A research report submitted to the Faculty of Engineering and the Built Environment, University of the Witwatersrand, Johannesburg, in partial fulfillment of the requirements for the degree of Master of Science in Engineering. Johannesburg, May 2018.
    The aim of this study is to model the probability that a customer will attrite/defect from a bank where, for example, the bank is not their preferred/primary bank for salary deposits. The termination of deposit inflow serves as the outcome parameter, and the random forest modelling technique was used to predict the outcome, with new data sources (transactional data) explored to add predictive power. The conventional logistic regression modelling technique was used to benchmark the random forest's results. It was found that the random forest model slightly overfits during the training process and loses predictive power on validation data and on data outside the training period. The random forest model nevertheless remains predictive and performs better than logistic regression at a cut-off probability of 20%.
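    A minimal sketch of the benchmarking step, with synthetic data standing in for the bank's transactional features (an assumption) and the same 20% cut-off probability:

        # Random forest vs. logistic regression for predicting attrition,
        # both scored at a 20% probability cut-off on synthetic data.
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import f1_score
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=5000, n_features=20,
                                   weights=[0.9, 0.1], random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

        cutoff = 0.20
        for name, model in [("random forest", RandomForestClassifier(random_state=0)),
                            ("logistic regression", LogisticRegression(max_iter=1000))]:
            proba = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
            pred = (proba >= cutoff).astype(int)
            print(f"{name}: F1 at 20% cut-off = {f1_score(y_te, pred):.3f}")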

    A Comparative Study on Statistical and Machine Learning Forecasting Methods for an FMCG Company

    Demand forecasting has been an area of study among scholars and businessmen ever since the start of the industrial revolution, and it has gained renewed focus in recent years with the advancements in AI. Accurate forecasts are no longer a luxury but a necessity for effective decisions in production planning and marketing. Many aspects of the business depend on demand, and this is particularly true for the Fast-Moving Consumer Goods industry, where high volume and demand volatility pose a challenge for planners trying to generate accurate forecasts as consumer demand complexity rises. Inaccurate demand forecasts lead to multiple issues, such as high holding costs on excess inventory, shortages of certain SKUs in the market leading to lost sales, and a significant impact on both the top line and the bottom line of the business. Researchers have examined the performance of statistical time series models in comparison to machine learning methods to evaluate their robustness, computational time and power. In this paper, a comparative study was conducted using statistical and machine learning techniques to generate an accurate forecast from the shipment data of an FMCG company. The naïve method was used as a benchmark to evaluate the performance of the other forecasting techniques and was compared with exponential smoothing, ARIMA, KNN, Facebook Prophet and LSTM using the past three years of shipments. The methodology followed was CRISP-DM, covering data exploration, pre-processing and transformation before applying the different forecasting algorithms and evaluating them. Moreover, secondary goals of this paper include understanding associations between SKUs through market basket analysis, and clustering using KNN based on brand, customer, order quantity and value to propose a product segmentation strategy. The results of both the clustering and forecasting models are then evaluated to choose the optimal forecasting technique, and a visual representation of the forecast and the exploratory analysis conducted is displayed using R.
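    As a hedged sketch of the benchmarking idea (shown in Python rather than the R used in the study, and on a synthetic monthly series rather than the company's shipments), a naïve last-value forecast can be compared with exponential smoothing and ARIMA on a hold-out period using MAPE:

        # Naive forecast (last observed value) vs. exponential smoothing and
        # ARIMA, scored with MAPE on a 6-month hold-out of a synthetic series.
        import numpy as np
        import pandas as pd
        from statsmodels.tsa.arima.model import ARIMA
        from statsmodels.tsa.holtwinters import ExponentialSmoothing

        rng = np.random.default_rng(1)
        idx = pd.date_range("2020-01-01", periods=36, freq="MS")   # 3 years, monthly
        y = pd.Series(100 + np.arange(36) * 2 + rng.normal(0, 5, 36), index=idx)
        train, test = y[:-6], y[-6:]

        def mape(actual, forecast):
            return np.mean(np.abs((actual - forecast) / actual)) * 100

        naive = pd.Series(train.iloc[-1], index=test.index)
        ets = ExponentialSmoothing(train, trend="add").fit().forecast(6)
        arima = ARIMA(train, order=(1, 1, 1)).fit().forecast(6)

        for name, fc in [("naive", naive), ("exp. smoothing", ets), ("ARIMA(1,1,1)", arima)]:
            print(f"{name}: MAPE = {mape(test, fc):.1f}%")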

    Developing collaborative planning support tools for optimised farming in Western Australia

    Land-use (farm) planning is a highly complex and dynamic process. A land-use plan can be optimal at one point in time, but its currency can change quickly due to the dynamic nature of the variables driving the land-use decision-making process. These include external drivers such as weather and produce markets, which also interact with the biophysical interactions and management activities of crop production. The active environment of an annual farm planning process can be envisioned as being cone-like. At the beginning of the sowing year, the number of options open to the manager is huge, although uncertainty is high due to the inability to foresee future weather and market conditions. As the production year reveals itself, the uncertainties around weather and markets resolve, as does the impact of weather and management activities on future production levels. This restricts the number of alternative management options available to the farm manager. Moreover, every decision made, such as the crop type sown in a paddock, constrains the range of management activities possible in that paddock for the rest of the growing season. This research has developed a prototype Land-use Decision Support System (LUDSS) to aid farm managers in their tactical farm management decision making. The prototype applies an innovative approach that mimics the way in which a farm manager and/or consultant would search for optimal solutions at a whole-farm level. The model captures the range of possible management activities available to the manager and the impact that both external (to the farm) and internal drivers have on crop production and the environment. It also captures the risk and uncertainty found in the decision space. The developed prototype is based on a Multiple Objective Decision-Making (MODM), a posteriori approach incorporating an exhaustive search method. The objective set used for the model is: maximising profit and minimising environmental impact. Pareto optimisation theory was chosen as the method to select the optimal solution, and a Monte Carlo simulator is integrated into the prototype to incorporate the dynamic nature of the farm decision-making process. The prototype has a user-friendly front and back end to allow farmers to input data, drive the application and extract information easily.
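    A hedged sketch of the exhaustive-search and Pareto-selection idea, with Monte Carlo sampling of crop margins; the crops, paddock areas, margins and impact scores below are all invented for illustration, not the LUDSS model:

        # Enumerate candidate whole-farm plans, estimate expected profit by
        # Monte Carlo under uncertain margins, and keep the Pareto-optimal
        # plans on (expected profit, environmental impact).
        import itertools
        import numpy as np

        rng = np.random.default_rng(7)
        crops = {  # crop: (mean margin $/ha, margin std, environmental impact score/ha)
            "wheat":  (320, 90, 2.0),
            "canola": (410, 150, 3.5),
            "lupins": (250, 60, 1.2),
        }
        paddocks_ha = [120, 80, 200]   # three paddocks

        plans = list(itertools.product(crops, repeat=len(paddocks_ha)))  # exhaustive search

        def evaluate(plan, n_sims=2000):
            profit = np.zeros(n_sims)
            impact = 0.0
            for crop, ha in zip(plan, paddocks_ha):
                mean, std, env = crops[crop]
                profit += rng.normal(mean, std, n_sims) * ha   # Monte Carlo margins
                impact += env * ha
            return profit.mean(), impact

        scored = [(p, *evaluate(p)) for p in plans]

        # Keep plans not dominated by any other plan (higher profit, lower impact).
        pareto = [s for s in scored
                  if not any(o[1] >= s[1] and o[2] <= s[2] and o != s for o in scored)]
        for plan, profit, impact in pareto:
            print(plan, round(profit), impact)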

    Essays on the Economic Analysis of Transportation Systems

    This dissertation consists of four essays on the economic analysis of transportation systems. In the first chapter, the conventional disaggregate travel demand model, a probability model for the modeling of multiple modes generally called random utility maximization (RUM), is expanded to a model of the count of mode choices. The extended travel demand model is derived from general economic theory -- maximizing instantaneous utility over the time horizon, subject to a budget constraint -- and can capture the dynamic behavior of countable travel demand. Because the model is for countable dependent variables, it rests on a more realistic set of assumptions for explaining travel demand than the RUM model. An empirical test of the theoretical model was performed using a toll facility user survey in the New York City area. The results show that the theoretical model explains more than 50 percent of the trip frequency behavior observed among New York City toll facility users. Travel demand of facility users increases with household employment, household vehicle count, and employer payment for tolls, and decreases with travel time, road pricing, travel distance and mass transit access. In the second chapter, we perform a statistical comparison of driving travel demand on toll facilities between Electronic Toll Collection (ETC) users, as a treatment group, and non-users, as a control group, in order to examine the effect of ETC on travel demand that uses toll facilities. The data used for the comparison is a user survey of the ten toll bridges and tunnels in New York City; it contains individual users' travel attributes and demographic characteristics, as well as their frequency of toll facility usage, and thus allows us to compare the travel demand of E-ZPass (the electronic toll collection system used on highways in the Northeastern United States) tag holders and non-tag-holders. We find that the estimated difference in travel demand between E-ZPass users and non-users is biased due to model misspecification and sample selection, and that E-ZPass has no statistically significant effect on travel demand after controlling for possible sources of bias. In the third chapter, we develop a parallel sparse matrix-transpose-matrix multiplication algorithm using the outer products of row vectors. The outer product algorithm works with matrices in compressed sparse row (CSR) form, and as such it does not require a transposition operation prior to performing the multiplication. In addition, since the parallel implementation of the outer product algorithm decomposes a matrix by rows, it imposes no additional restrictions on matrix size and shape. We particularly focus on implementing this technique for rectangular matrices with a large number of rows and a small number of columns, for performing statistical analysis on large-scale data. We test the outer product algorithm on randomly generated matrices and then apply it to compute descriptive statistics of the New York City taxicab data, originally given as a 140.56 GB file. The performance measures from the test and the application show that the outer product algorithm is effective and performs well for large-scale matrix multiplication in a parallel computing environment. In the last chapter, I develop a taxi market mechanism design model that demonstrates the role of a regulated taxi fare system in taxi drivers' route choice behavior. In this model, a fare system is imposed by a taxi market authority in the presence of asymmetric information between passengers and drivers (in this case, about road network and traffic conditions) and of taxi trip demand that is heterogeneous and uncertain across origins and destinations. I derive a prediction from the model showing that drivers have an incentive to make trips longer than optimal when they have a passenger on board.
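    As a hedged, serial sketch of the outer-product idea from the third chapter (the chapter's implementation is parallel and row-decomposed; the matrix here is random), A^T A can be accumulated row by row from a CSR matrix without ever forming the transpose explicitly:

        # Accumulate A^T A as the sum of outer products of A's rows, read
        # directly from the CSR arrays, and check against scipy's A.T @ A.
        import numpy as np
        from scipy import sparse

        A = sparse.random(1000, 20, density=0.05, format="csr", random_state=0)  # tall, thin

        def ata_outer(A_csr):
            n_cols = A_csr.shape[1]
            result = np.zeros((n_cols, n_cols))
            indptr, indices, data = A_csr.indptr, A_csr.indices, A_csr.data
            for i in range(A_csr.shape[0]):      # each row contributes one outer product
                cols = indices[indptr[i]:indptr[i + 1]]
                vals = data[indptr[i]:indptr[i + 1]]
                result[np.ix_(cols, cols)] += np.outer(vals, vals)
            return result

        assert np.allclose(ata_outer(A), (A.T @ A).toarray())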

    Modelling and solution methods for stochastic optimisation

    This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. In this thesis we consider two research problems, namely, (i) language constructs for modelling stochastic programming (SP) problems and (ii) solution methods for processing instances of different classes of SP problems. We first describe a new design of an SP modelling system which provides greater extensibility and reuse. We implement this enhanced system and develop solver connections. We also investigate in detail the following important classes of SP problems: single-stage SP with risk constraints, and two-stage linear and stochastic integer programming problems. We report improvements to solution methods for single-stage problems with second-order stochastic dominance constraints and for two-stage SP problems; in both cases we use the level method as a regularisation mechanism. We also develop novel heuristic methods for stochastic integer programming based on variable neighbourhood search. We describe an algorithmic framework for implementing decomposition methods such as the L-shaped method within our SP solver system. Based on this framework we implement a number of established solution algorithms as well as a new regularisation method for stochastic linear programming. We compare the performance of these methods and their scale-up properties on an extensive set of benchmark problems. We also implement several solution methods for stochastic integer programming and report a computational study comparing their performance. The three solution methods, (a) processing of a single-stage problem with second-order stochastic dominance constraints, (b) regularisation by the level method for two-stage SP, and (c) a method for solving integer SP problems, are novel approaches, and each of these makes a contribution to knowledge. Financial support was obtained from OptiRisk Systems.
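    As a hedged sketch of the decomposition idea behind the L-shaped method, the single-cut loop below solves a toy newsvendor-style two-stage problem with invented numbers; it is an illustration of the cutting-plane mechanism, not the thesis's solver system or its regularised variants.

        # Toy L-shaped (Benders) loop: choose first-stage capacity x at unit cost
        # c; each demand scenario xi incurs a shortage penalty q per unmet unit.
        import numpy as np
        from scipy.optimize import linprog

        c, q = 1.0, 3.0
        scenarios = np.array([40.0, 70.0, 100.0])   # demand realisations
        probs = np.array([0.3, 0.4, 0.3])

        cuts = []                                   # each cut: theta >= e - E * x
        for _ in range(20):
            # Master over (x, theta): min c*x + theta s.t. accumulated optimality cuts.
            A_ub = [[-E, -1.0] for E, e in cuts] or None
            b_ub = [-e for E, e in cuts] or None
            res = linprog([c, 1.0], A_ub=A_ub, b_ub=b_ub,
                          bounds=[(0, None), (0, None)], method="highs")
            x, theta = res.x

            # Recourse Q(x, xi) = q * max(xi - x, 0); subproblem dual is q if xi > x else 0.
            duals = np.where(scenarios > x, q, 0.0)
            E = float(np.sum(probs * duals))                    # cut slope
            e = float(np.sum(probs * duals * scenarios))        # cut intercept
            expected_recourse = float(np.sum(probs * q * np.maximum(scenarios - x, 0.0)))

            if theta >= expected_recourse - 1e-6:   # master approximation is tight: stop
                break
            cuts.append((E, e))

        print(f"x* = {x:.1f}, total cost = {c * x + expected_recourse:.1f}")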