
    Ant colony optimization approach for stacking configurations

    In data mining, classifiers are generated to predict the class labels of instances. An ensemble is a decision-making system that applies certain strategies to combine the predictions of different classifiers and generate a collective decision. Previous research has demonstrated, both empirically and theoretically, that an ensemble classifier can be more accurate and stable than its component classifiers in most cases. Stacking is a well-known ensemble which adopts a two-level structure: base-level classifiers generate predictions and a meta-level classifier makes the collective decision. A consequential problem is: which learning algorithms should be used to generate the base-level and meta-level classifiers in the Stacking configuration? It is not easy to find a suitable configuration for a specific dataset. In some early works, the selection of the meta classifier and its training data were the major concern. Recently, researchers have tried to apply metaheuristic methods to optimize the configuration of the base classifiers and the meta classifier. Ant Colony Optimization (ACO), which is inspired by the foraging behaviour of real ant colonies, is one of the most popular metaheuristics. In this work, we propose a novel ACO-Stacking approach that uses ACO to tackle the Stacking configuration problem; this is the first work to apply ACO to the Stacking configuration problem. Different implementations of the ACO-Stacking approach are developed. The first version identifies appropriate learning algorithms for generating the base-level classifiers while using a fixed algorithm to create the meta-level classifier. The second version simultaneously finds suitable learning algorithms for both the base-level classifiers and the meta-level classifier. Moreover, we study how different kinds of local information about the classifiers affect the classification results. Several pieces of local information collected from the initial phase of ACO-Stacking are considered, such as the precision and F-measure of each classifier and the correlative differences of paired classifiers. A series of experiments compares the ACO-Stacking approach with other ensembles on a number of datasets of different domains and sizes. The experiments show that the new approach achieves promising results and gains advantages over other ensembles, and that the correlative differences of the classifiers may be the best local information for this approach. Under the agile ACO-Stacking framework, an application to a direct marketing problem is explored. A real-world database from a US-based catalog company, containing more than 100,000 customer marketing records, is used in the experiments. The results indicate that our approach gains larger cumulative response lifts and cumulative profit lifts in the top deciles. In conclusion, it is competitive with some well-known conventional and ensemble data mining methods.
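    For readers unfamiliar with the two-level structure described above, the following is a minimal sketch of a Stacking ensemble using scikit-learn. The particular base learners and the logistic-regression meta-learner are illustrative assumptions only, not the ACO-selected configuration studied in the thesis.

```python
# Minimal sketch of a two-level Stacking ensemble (illustrative configuration only,
# not the ACO-optimised one from the thesis). Assumes scikit-learn is installed.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Base-level classifiers: each produces predictions on internal cross-validation folds.
base_level = [
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("nb", GaussianNB()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]

# Meta-level classifier: combines the base-level predictions into a collective decision.
stacking = StackingClassifier(estimators=base_level,
                              final_estimator=LogisticRegression(max_iter=1000),
                              cv=5)

print("Stacking accuracy: %.3f" % cross_val_score(stacking, X, y, cv=5).mean())
```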

    Advances and applications in Ensemble Learning


    Advances in Data Mining Knowledge Discovery and Applications

    Advances in Data Mining Knowledge Discovery and Applications aims to help data miners, researchers, scholars, and PhD students who wish to apply data mining techniques. The primary contribution of this book is to highlight frontier fields and implementations of knowledge discovery and data mining. It may seem that the same things are repeated, but in general the same approaches and techniques can help us in different fields and areas of expertise. This book presents knowledge discovery and data mining applications in two sections. As is well known, data mining covers areas of statistics, machine learning, data management and databases, pattern recognition, artificial intelligence, and other areas. In this book, most of these areas are covered by different data mining applications. The eighteen chapters have been classified into two parts: Knowledge Discovery and Data Mining Applications.

    Unsupervised learning for anomaly detection in Australian medical payment data

    Fraudulent or wasteful medical insurance claims made by health care providers are costly for insurers. Typically, OECD healthcare organisations lose 3–8% of total expenditure to fraud. As Australia's universal public health insurer, Medicare Australia, spends approximately A$34 billion per annum on the Medicare Benefits Schedule (MBS) and Pharmaceutical Benefits Scheme, wasted spending of A$1–2.7 billion could be expected. However, fewer than 1% of claims to Medicare Australia are detected as fraudulent, below international benchmarks. Variation is common in medicine, and health conditions, along with their presentation and treatment, are heterogeneous by nature. Increasing volumes of data and rapidly changing patterns bring challenges which require novel solutions. Machine learning and data mining are becoming commonplace in this field, but no gold standard is yet available. In this project, requirements are developed for real-world application to compliance analytics at the Australian Government Department of Health and Aged Care (DoH), covering: unsupervised learning; problem generalisation; human interpretability; context discovery; and cost prediction. Three novel methods are presented which rank providers by potentially recoverable costs. These methods use association analysis, topic modelling, and sequential pattern mining to provide interpretable, expert-editable models of typical provider claims. Anomalous providers are identified through comparison to the typical models, using metrics based on the costs of excess or upgraded services. Domain knowledge is incorporated in a machine-friendly way in two of the methods through the use of the MBS as an ontology. Validation by subject-matter experts and comparison to existing techniques show that the methods perform well. The methods are implemented in a software framework which enables rapid prototyping and quality assurance. The code is deployed at the DoH, and further applications as decision-support systems are in progress. The developed requirements will apply to future work in this field.
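    As a rough illustration of the topic-modelling idea described above, the sketch below fits an LDA model of "typical" provider claim patterns on synthetic item-code counts and ranks providers by an approximate excess-cost score. The data, item costs and scoring choices are placeholders; the thesis's methods, metrics and use of the MBS ontology are considerably more involved.

```python
# Rough sketch: model "typical" provider claim patterns with LDA over item-code
# counts, then rank providers by how far their claims exceed the typical patterns,
# weighted by billed cost. Synthetic placeholder data throughout.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
n_providers, n_items = 200, 50

# Synthetic provider-by-item-code claim counts and per-item costs.
claims = rng.poisson(lam=2.0, size=(n_providers, n_items))
item_cost = rng.uniform(20, 200, size=n_items)

# Learn "typical" claiming patterns as topics.
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(claims)
topic_item = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# Reconstruct each provider's expected item distribution from their topic mixture
# and measure how much observed claiming exceeds it.
mixture = lda.transform(claims)                      # provider-by-topic weights
expected = mixture @ topic_item                      # expected item distribution
observed = claims / claims.sum(axis=1, keepdims=True)
excess = np.clip(observed - expected, 0, None)       # items claimed more than expected

# Rank providers by an approximate "potentially recoverable cost".
recoverable = (excess * claims.sum(axis=1, keepdims=True) * item_cost).sum(axis=1)
print(np.argsort(recoverable)[::-1][:10])            # ten most anomalous providers
```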

    Optimisation approaches for data mining in biological systems

    The advances in data acquisition technologies have generated massive amounts of data that present a considerable challenge for analysis. How to efficiently and automatically mine through the data and extract the maximum value by identifying hidden patterns is an active research area, called data mining. This thesis tackles several problems in data mining, including data classification, regression analysis and community detection in complex networks, with considerable applications in various biological systems. First, the problem of data classification is investigated. An existing classifier is adopted from the literature and two novel solution procedures are proposed, which are shown to improve the predictive accuracy of the original method and significantly reduce the computational time. Disease classification using high-throughput genomic data is also addressed. To tackle the problem of analysing a large number of genes against a small number of samples, a new approach of incorporating extra biological knowledge and constructing higher-level composite features for classification is proposed, together with a novel model to optimise the construction of composite features. Subsequently, regression analysis is considered, where two piecewise linear regression methods are presented. The first method partitions one feature into multiple complementary intervals and fits each with a distinct linear function. The other method is a more generalised variant of the first and performs recursive binary partitioning that permits partitioning of multiple features. Lastly, community detection in complex networks is investigated, where a new optimisation framework is introduced to identify the modular structure hidden in directed networks via optimisation of modularity. A non-linear model is first proposed before its linearised variant is presented. The optimisation framework consists of two major steps: solving the non-linear model to identify a coarse initial partition, and then repeatedly solving the linearised models to refine the network partition.
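    The piecewise linear regression idea above (partitioning one feature into complementary intervals and fitting each with a distinct linear function) can be illustrated with the naive sketch below, which uses fixed breakpoints on synthetic data; the thesis instead optimises the partitioning itself, and its second method recursively partitions multiple features.

```python
# Naive illustration of piecewise linear regression on a single feature:
# fixed breakpoints and an independent linear fit per interval. Synthetic data;
# the thesis optimises the partitioning rather than fixing it in advance.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=300).reshape(-1, 1)
y = np.where(x[:, 0] < 5, 2 * x[:, 0], 20 - 1.5 * x[:, 0]) + rng.normal(0, 0.5, 300)

breakpoints = np.array([0, 5, 10])          # intervals [0, 5) and [5, 10]
models = []
for lo, hi in zip(breakpoints[:-1], breakpoints[1:]):
    # Last interval is closed on the right so no point is dropped.
    mask = (x[:, 0] >= lo) & (x[:, 0] < hi) if hi < breakpoints[-1] else (x[:, 0] >= lo)
    models.append(((lo, hi), LinearRegression().fit(x[mask], y[mask])))

for (lo, hi), m in models:
    print(f"[{lo}, {hi}): y = {m.coef_[0]:.2f} * x + {m.intercept_:.2f}")
```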

    A Principled Methodology: A Dozen Principles of Software Effort Estimation

    Software effort estimation (SEE) is the activity of estimating the total effort required to complete a software project. Correctly estimating the effort required for a software project is of vital importance for the competitiveness of organizations. Both under- and over-estimation lead to undesirable consequences. Under-estimation may result in overruns in budget and schedule, which in turn may cause the cancellation of projects, thereby wasting the entire effort spent until that point. Over-estimation may cause promising projects not to be funded, hence harming organizational competitiveness. Due to the significant role of SEE for software organizations, there is a considerable research effort invested in SEE. Thanks to the accumulation of decades of prior research, today we are able to identify the core issues and search for the right principles to tackle pressing questions. For example, despite decades of work, we still lack concrete answers to important questions such as: What is the best SEE method? The introduced estimation methods make use of local data; however, not all companies have their own data, so: How can we handle the lack of local data? Common SEE methods take size attributes for granted, yet size attributes are costly and practitioners place very little trust in them. Hence, we ask: How can we avoid the use of size attributes? Collection of data, particularly dependent-variable information (i.e. effort values), is costly: How can we find an essential subset of the SEE data sets? Finally, studies make use of sampling methods to justify a new method's performance on SEE data sets, yet the trade-off among different variants is ignored: How should we choose sampling methods for SEE experiments? This thesis is a rigorous investigation towards identifying and tackling the pressing issues in SEE. Our findings rely on extensive experimentation performed with a large corpus of estimation techniques on a large set of public and proprietary data sets. We summarize our findings and industrial experience in the form of 12 principles: 1) Know your domain; 2) Let the Experts Talk; 3) Suspect your data; 4) Data Collection is Cyclic; 5) Use a Ranking Stability Indicator; 6) Assemble Superior Methods; 7) Weighting Analogies is Over-elaboration; 8) Use Easy-path Design; 9) Use Relevancy Filtering; 10) Use Outlier Pruning; 11) Combine Outlier and Synonym Pruning; 12) Be Aware of Sampling Method Trade-offs.
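    As a small illustration of the analogy-based estimation family that principles such as "Weighting Analogies is Over-elaboration" refer to, the sketch below predicts effort as the unweighted mean of the k most similar past projects. The project features and effort values are synthetic; this is not the thesis's experimental framework.

```python
# Minimal sketch of unweighted analogy-based effort estimation (nearest neighbours
# over project features). Synthetic project data; illustrative only.
import numpy as np

rng = np.random.default_rng(2)
features = rng.uniform(0, 1, size=(40, 4))        # normalised project attributes (synthetic)
effort = 100 + 500 * features[:, 0] + rng.normal(0, 20, 40)   # person-hours (synthetic)

def estimate_effort(new_project, k=3):
    """Predict effort as the plain mean of the k closest past projects (Euclidean)."""
    distances = np.linalg.norm(features - new_project, axis=1)
    nearest = np.argsort(distances)[:k]
    return effort[nearest].mean()

print(round(estimate_effort(np.array([0.4, 0.6, 0.2, 0.8])), 1))
```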

    Advances in Artificial Intelligence: Models, Optimization, and Machine Learning

    The present book contains all the articles accepted and published in the Special Issue “Advances in Artificial Intelligence: Models, Optimization, and Machine Learning” of the MDPI Mathematics journal, which covers a wide range of topics connected to the theory and applications of artificial intelligence and its subfields. These topics include, among others, deep learning and classic machine learning algorithms, neural modelling, architectures and learning algorithms, biologically inspired optimization algorithms, algorithms for autonomous driving, probabilistic models and Bayesian reasoning, and intelligent agents and multiagent systems. We hope that the scientific results presented in this book will serve as valuable sources of documentation and inspiration for anyone willing to pursue research in artificial intelligence, machine learning and their widespread applications.

    Fuzzy Rules from Ant-Inspired Computation

    Centre for Intelligent Systems and their Applications
    This research identifies and investigates major issues in inducing accurate and comprehensible fuzzy rules from datasets. A review of the current literature on fuzzy rulebase induction uncovers two significant issues: A. there is a tradeoff between inducing accurate fuzzy rules and inducing comprehensible fuzzy rules; and B. a common strategy for the induction of fuzzy rulebases, that of iterative rule learning, where the rules are generated one by one and independently of each other, may not be an optimal one. FRANTIC, a system that provides a framework for exploring these claims, is developed. At its core lies a mechanism for creating individual fuzzy rules, based on a significantly modified social-insect-inspired heuristic for combinatorial optimisation: Ant Colony Optimisation. The rule discovery mechanism is utilised in two very different strategies for the induction of a complete fuzzy rulebase: 1. the first follows the common iterative rule learning approach for the induction of crisp and fuzzy rules; 2. the second was designed during this research explicitly for the induction of a fuzzy rulebase, and generates all rules in parallel. Both strategies have been tested on a number of classification problems, including medical diagnosis and industrial plant fault detection, and compared against other crisp or fuzzy induction algorithms that use more well-established approaches. The results challenge statement A above by presenting evidence that one criterion need not be met at the expense of the other. This research also uncovers the cost that is paid, that of computational expenditure, and makes concrete suggestions on how this may be resolved. With regard to statement B, until now little or no evidence had been put forward to support or disprove the claim. The results of this research indicate that the second, simultaneous strategy offers definite advantages that the iterative one does not, including improved accuracy over a wide range of values for several key system parameters. However, both approaches also fare well when compared to other learning algorithms. This is due to the rule discovery mechanism itself, the adapted Ant Colony Optimisation algorithm, which affords several additional advantages: a simple mechanism within the rule construction process that enables it to cope with datasets that have an imbalanced class distribution, and another for controlling the amount of fit to the training data. In addition, several system parameters have been designed to be semi-autonomous so as to avoid unnecessary user intervention, and in future work the social insect metaphor may be exploited and extended further to deal with industrial-strength data mining issues involving large volumes of data, and distributed and/or heterogeneous databases.
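    The ant-inspired rule construction idea at the core of FRANTIC can be sketched roughly as follows: fuzzy terms are selected into a rule antecedent with probability proportional to their pheromone levels, and pheromone is reinforced according to rule quality. The term vocabulary, quality value and update scheme below are simplified placeholders, not FRANTIC's actual construction graph, quality measure or fuzzy inference machinery.

```python
# Highly simplified sketch of ant-inspired rule construction: terms are chosen in
# proportion to pheromone, and pheromone is evaporated then reinforced by rule quality.
import random

# Hypothetical vocabulary of candidate fuzzy terms (attribute, linguistic value).
terms = [("temperature", "low"), ("temperature", "high"),
         ("pressure", "low"), ("pressure", "high")]
pheromone = {t: 1.0 for t in terms}

def construct_rule(max_terms=2):
    """One ant builds a rule antecedent, using at most one term per attribute."""
    rule, used_attributes = [], set()
    candidates = list(terms)
    while candidates and len(rule) < max_terms:
        weights = [pheromone[t] for t in candidates]
        term = random.choices(candidates, weights=weights)[0]
        rule.append(term)
        used_attributes.add(term[0])
        candidates = [t for t in candidates if t[0] not in used_attributes]
    return rule

def reinforce(rule, quality, evaporation=0.1):
    """Evaporate all pheromone, then deposit on the terms of the constructed rule."""
    for t in pheromone:
        pheromone[t] *= (1 - evaporation)
    for t in rule:
        pheromone[t] += quality

rule = construct_rule()
reinforce(rule, quality=0.8)   # in practice, quality would come from evaluating the rule on data
print(rule, pheromone)
```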

    Decision Support Systems for Risk Assessment in Credit Operations Against Collateral

    With the global economic crisis, which reached its peak in the second half of 2008, and facing a market shaken by economic instability, financial institutions took steps to protect themselves against default risk, which directly affected how credit analysis is performed for individuals and for corporate entities. To mitigate risk in credit operations, most banks use a graded scale of customer risk, which determines the provision that banks must make according to the default risk level of each credit transaction. Credit analysis involves the ability to make a credit decision in a scenario of uncertainty, constant change and incomplete information. This ability depends on the capacity to analyse situations logically, which are often complex, and to reach a conclusion that is clear, practical and feasible to implement. Credit scoring models are used to predict the probability that a customer applying for credit will default at any given time, based on personal and financial information that may influence the customer's ability to repay the debt. This estimated probability, called the score, is an estimate of the risk of default of a customer in a given period. This increased concern has been in no small part caused by the weaknesses of existing risk management techniques revealed by the recent financial crisis and by the growing demand for consumer credit. Constant change affects several banking areas because it hinders the ability to investigate the data that is produced and stored in computer systems which too often depend on manual techniques. Among the many alternatives used around the world to balance this risk, the provision of collateral in the formalization of credit agreements stands out. In theory, collateral does not ensure the return of the credit, as it is not computed as payment of the obligation within the project. There is also the fact that it will only be effective if triggered, which involves the legal department of the banking institution. The truth is that collateral is a mitigating element of credit risk. Collateral is divided into two types: the individual guarantee (sponsor) and the asset guarantee (fiduciary). Both aim to increase security in credit operations, as an alternative form of payment to the credit holder should the borrower be unable to meet its obligations on time. For the creditor, it provides liquidity security for the receiving operation. The measurement of credit recoverability is a system that evaluates the efficiency of the mechanism for recovering the invested capital through collateral. In an attempt to identify the sufficiency of collateral in credit operations, this thesis presents an assessment of intelligent classifiers that use contextual information to assess whether collateral allows the recovery of granted credit to be predicted in the decision-making process, before the credit transaction becomes insolvent.
The results observed, when compared with other approaches in the literature, together with the comparative analysis of the most relevant artificial intelligence solutions, show that classifiers which use collateral as a parameter to calculate risk contribute to advancing the state of the art, increasing the commitment to financial institutions.
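    As a minimal illustration of the credit-scoring setting described above, the sketch below trains a logistic regression that outputs an estimated probability of default, with a hypothetical collateral-coverage ratio among the features. The data are synthetic; the thesis evaluates several intelligent classifiers with contextual information rather than this particular model.

```python
# Minimal credit-scoring sketch: logistic regression producing an estimated probability
# of default, with a hypothetical collateral-coverage feature. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 1000
income = rng.lognormal(10, 0.5, n)
debt_ratio = rng.uniform(0, 1, n)
collateral_coverage = rng.uniform(0, 2, n)        # collateral value / loan amount (hypothetical)

# Synthetic default labels: more debt and less collateral imply higher default risk.
logit = -2 + 3 * debt_ratio - 1.5 * collateral_coverage
default = rng.uniform(size=n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([np.log(income), debt_ratio, collateral_coverage])
X_train, X_test, y_train, y_test = train_test_split(X, default, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # estimated probability of default
print("mean predicted default probability: %.3f" % scores.mean())
```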