311 research outputs found

    Explainable temporal data mining techniques to support the prediction task in Medicine

    In recent decades, the increasing amount of data available in all fields has raised the need to discover new knowledge and to explain the hidden information found. On one hand, the rapid increase of interest in, and use of, artificial intelligence (AI) in computer applications has raised a parallel concern about its ability (or lack thereof) to provide understandable, or explainable, results to users. In the biomedical informatics and computer science communities, there is considerable discussion about the "un-explainable" nature of artificial intelligence, where algorithms and systems often leave users, and even developers, in the dark with respect to how results were obtained. Especially in the biomedical context, the need to explain the results of an artificial intelligence system is legitimated by the importance of patient safety. On the other hand, current database systems enable us to store huge quantities of data, and their analysis through data mining techniques makes it possible to extract relevant knowledge and useful hidden information. Relationships and patterns within these data could provide new medical knowledge. The analysis of such healthcare/medical data collections could greatly help to observe the health conditions of the population and extract useful information that can be exploited in the assessment of healthcare/medical processes. In particular, the prediction of medical events is essential for preventing disease, understanding disease mechanisms, and increasing patient quality of care. In this context, an important aspect is to verify whether the database content supports the capability of predicting future events. In this thesis, we start by addressing the problem of explainability, discussing some of the most significant challenges that need to be addressed with scientific and engineering rigor in a variety of biomedical domains. We analyze the "temporal component" of explainability, detailing different perspectives such as the use of temporal data, the temporal task, temporal reasoning, and the dynamics of explainability with respect to the user perspective and to knowledge. Starting from this panorama, we focus our attention on two different temporal data mining techniques. The first is based on trend abstractions: starting from the concept of Trend-Event Pattern and moving through the concept of prediction, we propose a new kind of predictive temporal pattern, namely Predictive Trend-Event Patterns (PTE-Ps). The framework aims to combine complex temporal features to extract a compact and non-redundant predictive set of patterns composed of such temporal features. The second is based on functional dependencies: we propose a methodology for deriving a new kind of approximate temporal functional dependency, called Approximate Predictive Functional Dependencies (APFDs), based on a three-window framework. We then discuss the concept of approximation, the data complexity of deriving an APFD, two new error measures, and finally the quality of APFDs in terms of coverage and reliability. Exploiting these methodologies, we analyze intensive care unit data from the MIMIC dataset.
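    The APFD idea builds on approximate functional dependencies, i.e. dependencies that hold on most but not all of the data. As a rough illustration only (the thesis defines its own, new error measures over a three-window framework, which are not reproduced here), the sketch below computes the classical g3-style error of a candidate dependency: the minimum fraction of tuples that would have to be removed for the dependency to hold exactly. The field names and records are hypothetical.

        from collections import defaultdict

        def g3_error(rows, lhs, rhs):
            """Classical g3-style error for an approximate dependency lhs -> rhs:
            the minimum fraction of rows to remove so the dependency holds exactly.
            (Illustrative only; not the thesis's APFD-specific measures.)"""
            groups = defaultdict(lambda: defaultdict(int))
            for row in rows:
                key = tuple(row[a] for a in lhs)
                groups[key][row[rhs]] += 1
            # Within each lhs-group, keep the most frequent rhs value and count the rest as violations.
            violations = sum(sum(c.values()) - max(c.values()) for c in groups.values())
            return violations / len(rows)

        # Hypothetical ICU-style records: does (drug, dose) approximately determine outcome?
        records = [
            {"drug": "A", "dose": 10, "outcome": "stable"},
            {"drug": "A", "dose": 10, "outcome": "stable"},
            {"drug": "A", "dose": 10, "outcome": "worse"},
            {"drug": "B", "dose": 5,  "outcome": "stable"},
        ]
        print(g3_error(records, lhs=("drug", "dose"), rhs="outcome"))  # 0.25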

    Deep language models for software testing and optimisation

    Developing software is difficult. A challenging part of production development is ensuring programs are correct and fast, two properties satisfied with software testing and optimisation. While both tasks still rely on manual effort and expertise, the recent surge in software applications has made them tedious and time-consuming. In this fast-paced environment, manual testing and optimisation hinder productivity significantly and lead to error-prone or sub-optimal programs that waste energy and frustrate users. In this thesis, we propose three novel approaches to automate software testing and optimisation with modern language models based on deep learning. In contrast to our methods, the few existing techniques in these two domains have limited scalability and struggle when they face real-world applications. Our first contribution lies in the field of software testing and aims to automate the test oracle problem, which is the procedure of determining the correctness of test executions. The test oracle is still largely manual, relying on human experts. Automating the oracle is a non-trivial task that requires software specifications or derived information that are often too difficult to extract. We present the first application of deep language models over program execution traces to predict runtime correctness. Our technique classifies test executions of large-scale codebases used in production as “pass” or “fail”. Our proposed approach reduces by 86% the number of test inputs an expert has to label, by training on only 14% and classifying the rest automatically. Our next two contributions improve the effectiveness of compiler optimisation. Compilers optimise programs by applying heuristic-based transformations constructed by compiler engineers. Selecting the right transformations requires extensive knowledge of the compiler, the subject program and the target architecture. Predictive models have been successfully used to automate heuristic construction, but their performance is hindered by a shortage of training benchmarks in quantity and feature diversity. Our next contributions address the scarcity of compiler benchmarks by generating human-like synthetic programs to improve the performance of predictive models. Our second contribution is BENCHPRESS, the first steerable deep learning synthesizer for executable compiler benchmarks. BENCHPRESS produces human-like programs that compile at a rate of 87%. It targets parts of the feature space previously unreachable by other synthesizers, addressing the scarcity of high-quality training data for compilers. BENCHPRESS improves the performance of a device mapping predictive model by 50% when it introduces synthetic benchmarks into its training data. BENCHPRESS is restricted by a feature-agnostic synthesizer that requires thousands of random inferences to select a few that target the desired features. Our third contribution addresses this inefficiency. We develop BENCHDIRECT, a directed language model for compiler benchmark generation. BENCHDIRECT synthesizes programs by jointly observing the source code context and the compiler features that are targeted. This enables efficient steerable generation on large-scale tasks. Compared to BENCHPRESS, BENCHDIRECT successfully matches 1.8× more Rodinia target benchmarks, while being up to 36% more accurate and up to 72% faster in targeting three different feature spaces for compilers. All three contributions demonstrate the exciting potential of deep learning and language models to simplify the testing of programs and the construction of better optimisation heuristics for compilers. The outcomes of this thesis provide developers with tools to keep up with the rapidly evolving landscape of software engineering.
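    As a rough illustration of the oracle idea only (the thesis uses deep language models over real production traces; this linear bag-of-tokens model is merely a stand-in), the sketch below trains a classifier on a handful of hypothetical execution traces labelled pass/fail and then labels an unseen trace automatically. The trace strings, tokens and labels are invented for the example.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        # Toy execution traces (hypothetical) with expert-provided labels.
        traces = [
            "alloc buffer size=64 write ok close ok",
            "alloc buffer size=64 write ok close ok",
            "alloc buffer size=0 write error close ok",
            "alloc buffer size=128 write ok close error",
        ]
        labels = ["pass", "pass", "fail", "fail"]

        # Bag-of-tokens features plus a linear classifier as a simplified oracle.
        oracle = make_pipeline(CountVectorizer(token_pattern=r"\S+"), LogisticRegression())
        oracle.fit(traces, labels)

        # Classify an unseen trace instead of asking an expert to label it.
        print(oracle.predict(["alloc buffer size=32 write error close ok"]))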

    Predictive Modelling Approach to Data-Driven Computational Preventive Medicine

    This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. In the early parts of this research, the thesis proposes a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine learning methods, data preprocessing techniques, model training estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus. Midway through this research, the thesis advances preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, and a novel imbalanced resampling approach, minority pattern reconstruction (MPR), guided by information theory. The thesis also extends the area of model performance evaluation with a novel classification performance ranking metric called XDistance. The experimental results show that building predictive models with the methods guided by the new framework (Octopus) yields domain experts’ approval of the new reliable models’ performance. Performing the data quality checks and applying the MMI process led healthcare practitioners to favour predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies led to performances better aligned with experts’ success criteria than traditional imbalanced-data resampling techniques. Finally, the XDistance performance ranking metric was found to be more effective in ranking the performance of several classifiers while offering an indication of class bias, unlike existing performance metrics. The overall contributions of this thesis can be summarised as follows. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework and produce new reliable classifiers; in addition, we offer a further understanding of the impact of newly engineered features, the physical activity index (PAI) and biological effective dose (BED). Second, new methods (MMI, MPR and XDistance) were developed within the framework. Finally, the newly accepted predictive models help detect adverse health events, namely visceral fat-associated diseases and advanced breast cancer radiotherapy toxicity side effects. These contributions could be used to guide future theories, experiments and healthcare interventions in preventive medicine and data mining.
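    As a loose illustration of the multimethod imputation idea only (the thesis's MMI process and its quality checks are more elaborate and are not reproduced here), the sketch below masks a fraction of a synthetic data matrix, fills the gaps with several generic candidate imputers, and keeps the method with the lowest reconstruction error on the masked entries. The data and the candidate set are invented.

        import numpy as np
        from sklearn.impute import SimpleImputer, KNNImputer

        rng = np.random.default_rng(1)
        complete = rng.normal(size=(200, 5))
        complete[:, 2] += 0.8 * complete[:, 0]        # give one column some structure
        mask = rng.random(complete.shape) < 0.1       # hide 10% of entries
        observed = complete.copy()
        observed[mask] = np.nan

        candidates = {
            "mean": SimpleImputer(strategy="mean"),
            "median": SimpleImputer(strategy="median"),
            "knn": KNNImputer(n_neighbors=5),
        }

        # Score each candidate by RMSE on the artificially masked entries and keep the best.
        scores = {}
        for name, imputer in candidates.items():
            filled = imputer.fit_transform(observed)
            scores[name] = np.sqrt(np.mean((filled[mask] - complete[mask]) ** 2))
        best = min(scores, key=scores.get)
        print(scores, "-> selected:", best)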

    Decisioning 2022: Collaboration in knowledge discovery and decision making: Applications to sustainable agriculture

    Sustainable agriculture is one of the Sustainable Development Goals (SDGs) proposed by the United Nations (UN), but little systematic work on knowledge discovery and decision making has been applied to it. Knowledge discovery and decision making have become active research areas in recent years. We are now in the era of FAIR (Findable, Accessible, Interoperable, Reusable) data science, in which linked data with a high degree of variety and different degrees of veracity can be easily correlated and put in perspective to gain an empirical and scientific perception of best practices in the sustainable agriculture domain. This requires combining multiple methods such as elicitation, specification, validation, semantic web technologies, information retrieval, formal concept analysis, collaborative work, semantic interoperability, ontological matching, smart contracts, and multi-criteria decision making. Decisioning 2022 is the first workshop on Collaboration in knowledge discovery and decision making: Applications to sustainable agriculture. It has been organized by six research teams from France, Argentina, Colombia and Chile to explore the current frontier of knowledge and applications in different areas related to knowledge discovery and decision making. The format of this workshop aims at discussion and knowledge exchange between academia and industry.

    Fundamentals

    Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, through summarisation and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to their resource requirements and to how scalability can be enhanced on diverse computing architectures, ranging from embedded systems to large computing clusters.

    Towards a new generation of geographical information systems


    Application of Machine Learning Algorithms to Actuarial Ratemaking within Property and Casualty Insurance

    A scientific pricing assessment is essential for maintaining viable customer relationship management (CRM) solutions for various stakeholders, including consumers, insurance intermediaries, and insurers. The thesis aims to examine research problems surrounding the ratemaking process, including relaxing the conventional loss model assumptions of homogeneity and independence. The thesis identified three major research scopes within multi-peril insurance settings: heterogeneity in consumer behaviour on pricing decisions, loss trending under non-linearity and temporal dependencies, and loss modelling in the presence of inflationary pressure. Heterogeneous consumer behaviour on pricing decisions was examined using a demand- and loyalty-based strategy. A hybrid decision tree classification framework is implemented that includes a semi-supervised learning model, a variable selection technique, and a partitioning approach with different treatment effects in order to achieve adequate risk profiling. The thesis also explored a supervised tree learning mechanism under highly imbalanced, overlapping classes and a non-linear response-predictor relationship. The two-phase classification framework is applied to an owner-occupied property portfolio from a personal insurance brokerage powered by a digital platform within the Canadian market. A hybrid three-phase tree algorithm, which includes conditional inference trees, a random forest wrapped by the Boruta algorithm, and model-based recursive partitioning under a multinomial generalized linear model, is proposed to study the price sensitivity ranking of digital consumers. The empirical results suggest a well-defined segmentation of digital consumers with differential price sensitivity. Further, with highly imbalanced and overlapping classes, the resampling technique was modelled together with the decision tree algorithm, providing a more scientific approach to overcoming classification problems than traditional multinomial regression. The resulting segmentation was able to identify the high-sensitivity consumer group, for which premium rate reductions are recommended to reduce the churn rate. Other consumers are classified as an insensitive group, for which a price strategy that increases the premium rate is expected to have only a slight impact on the closing ratio and retention rate. Incurred insurance losses frequently exhibit abnormal characteristics such as temporal dependence, a nonlinear relationship between dependent and independent variables, seasonal variation, and a mixture distribution resulting from the implicit claim inflation component. With such characteristics, the severity and frequency components may exhibit an altered trending pattern that changes over time and never repeats. This could have a profound impact on the experience rating model, where the estimates of the pure premium and the rate relativity of a tariff class are likely to be under- or over-estimated. A discussion of the pros and cons of the conventional loss trending approach leads to an alternative framework for the loss cost structure. The conventional pure premium is further split into base severity and severity deflator random variables using a do(·) operator within causal inference. The components are modelled separately on different time bases using a semiparametric generalized additive model (GAM) with a spline curve.
    To capture the claim inflation calendar-year effect and improve the efficiency of severity trending, this thesis refines the claim inflation estimation by adapting Taylor’s [86] separation method, which estimates the inflation index from a loss development triangle. In the second phase of developing the severity trend model, we integrate both the base severity and the severity deflator under a new generalized mechanism known as Discount, Model, and Trend (DMT). The two-phase modelling was built to overcome the mixture distribution effect on the final trend estimates. A simulation study constructed using the claims paid development triangle from a Canadian Insurtech broker’s houseowners/householders portfolio was used in a severity trend movement prediction analysis. We discovered that the conventional framework understated the severity trends more than the separation cum DMT framework. GAMs provide a flexible and effective mechanism for modelling nonlinear time series in studies of the frequency loss trend. However, a GAM assumes that residuals are independent and identically distributed (iid), while frequency loss time series can be correlated at adjacent time points. This thesis introduces a new model called the Generalized Additive Model with Seasonal Autoregressive term (GAMSAR), which accounts for temporal dependency and seasonal variation in order to improve prediction confidence intervals. Parameters of the GAMSAR model are estimated by maximum partial likelihood using a modified Newton’s method developed by Yang et al. [97], and the goodness-of-fit of GAM and GAMSAR is demonstrated using a simulation study. Simulation results show that the bias of the mean estimates from GAM differs greatly from their true value. The proposed GAMSAR model proves to be superior, especially in the presence of seasonal variation. Further, a comparison study is conducted between GAMSAR and the Generalized Additive Model with Autoregressive term (GAMAR) developed by Yang et al. [97], and the coverage rate of the 95% confidence interval confirms that the GAMSAR model is able to incorporate nonlinear trend effects as well as capture the serial correlation between observations. In the empirical analysis, a claim dataset of personal property insurance obtained from digital brokers in Canada is used to show that GAMSAR(1)12 captures the periodic dependence structure of the data precisely compared to standard regression models. The proposed frequency-severity trend models support the thesis’s goal of establishing a scientific approach to pricing that is robust under different trending processes.
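    As a simplified stand-in for the GAMSAR idea only (not the thesis's estimator: the spline smoother is replaced by a crude polynomial trend with month dummies, and the seasonal autoregressive term by a plain AR(1) on the residuals), the sketch below fits a structural trend-plus-seasonality model to synthetic monthly claim frequencies and then corrects the one-step forecast with the residual autocorrelation. All data and parameter values are invented.

        import numpy as np

        rng = np.random.default_rng(0)
        months = np.arange(120)                       # 10 years of synthetic monthly frequencies
        season = 0.3 * np.sin(2 * np.pi * months / 12)
        trend = 0.01 * months
        y = np.exp(1.5 + trend + season) + rng.normal(0, 0.5, size=months.size)

        # Stage 1: least-squares fit of a crude smooth trend plus month-of-year dummies.
        month_of_year = months % 12
        X = np.column_stack([
            np.ones_like(months, dtype=float),
            months, months**2,                        # polynomial proxy for a spline trend
            *(month_of_year == m for m in range(1, 12)),
        ])
        beta, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
        resid = np.log(y) - X @ beta

        # Stage 2: AR(1) coefficient on the residuals to capture serial correlation.
        phi = np.dot(resid[1:], resid[:-1]) / np.dot(resid[:-1], resid[:-1])

        # One-step-ahead forecast combining the structural fit and the AR(1) correction.
        x_next = np.array([1.0, 120.0, 120.0**2, *((120 % 12) == np.arange(1, 12))])
        log_forecast = x_next @ beta + phi * resid[-1]
        print(f"AR(1) coefficient: {phi:.3f}, forecast: {np.exp(log_forecast):.2f}")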