
    The Role of Synthetic Data in Improving Supervised Learning Methods: The Case of Land Use/Land Cover Classification

    A thesis submitted in partial fulfillment of the requirements for the degree of Doctor in Information Management. In remote sensing, Land Use/Land Cover (LULC) maps constitute important assets for various applications, promoting environmental sustainability and good resource management. However, their production continues to be a challenging task. Various factors contribute to the difficulty of generating accurate, timely updated LULC maps, whether via automatic or photo-interpreted LULC mapping. Data preprocessing, a crucial step in any Machine Learning task, is particularly important in the remote sensing domain due to the overwhelming amount of raw, unlabeled data continuously gathered from multiple remote sensing missions. However, a significant part of the state-of-the-art focuses on scenarios with full access to labeled training data and relatively balanced class distributions. This thesis focuses on the challenges found in automatic LULC classification tasks, specifically in data preprocessing. We focus on the development of novel Active Learning (AL) and imbalanced learning techniques to improve ML performance in situations with limited training data and/or the existence of rare classes. We also show that many of the contributions presented are successful not only in remote sensing problems but also in various other multidisciplinary classification problems. The work presented in this thesis used open-access datasets to test the contributions made in imbalanced learning and AL. All the data pulling, preprocessing, and experiments are made available at https://github.com/joaopfonseca/publications. The algorithmic implementations are made available in the Python package ml-research at https://github.com/joaopfonseca/ml-research.
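    As a hedged illustration of the Active Learning setting this thesis addresses, the sketch below implements generic margin-based uncertainty sampling with scikit-learn on a synthetic imbalanced dataset; it is not the specific AL strategy proposed in the thesis or in the ml-research package, and all dataset and loop parameters are invented for illustration.

```python
# Generic pool-based Active Learning loop with margin (uncertainty) sampling.
# Illustrative only: the dataset, query size, and classifier are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=5,
                           weights=[0.8, 0.15, 0.05], random_state=0)  # imbalanced
rng = np.random.default_rng(0)
labeled = rng.choice(len(X), size=20, replace=False)          # tiny initial pool
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)

clf = RandomForestClassifier(random_state=0)
for _ in range(10):                                           # 10 AL iterations
    clf.fit(X[labeled], y[labeled])
    proba = np.sort(clf.predict_proba(X[unlabeled]), axis=1)
    margin = proba[:, -1] - proba[:, -2]                      # small = ambiguous
    query = unlabeled[np.argsort(margin)[:10]]                # query most ambiguous
    labeled = np.concatenate([labeled, query])                # oracle "labels" them
    unlabeled = np.setdiff1d(unlabeled, query)
```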

    Statistical Machine Learning Methodology for Individualized Treatment Rule Estimation in Precision Medicine

    Precision medicine aims to deliver optimal, individualized treatments for patients by accounting for their unique characteristics. With a foundation in reinforcement learning, decision theory, and causal inference, the field of precision medicine has seen many advancements in recent years. Significant focus has been placed on creating algorithms to estimate individualized treatment rules (ITRs), which map from patient covariates to the space of available treatments with the goal of maximizing patient outcomes. In Chapter 1, we extend ITR estimation methodology to the scenario where the variance of the outcome is heterogeneous with respect to treatment and covariates. Accordingly, we propose Stabilized Direct Learning (SD-Learning), which exploits heteroscedasticity in the error term through a residual reweighting framework that models residual variance via flexible machine learning algorithms such as XGBoost and random forests. We also develop an internal cross-validation scheme that determines the best residual model among competing models. Further, we extend this methodology to multi-arm treatment scenarios. In Chapter 2, we develop ITR estimation methodology for situations where clinical decision-making involves balancing multiple outcomes of interest. Our proposed framework estimates an ITR which maximizes a combination of the multiple clinical outcomes, accounting for the fact that patients may ascribe importance to outcomes differently (utility heterogeneity). This approach employs inverse reinforcement learning (IRL) techniques through an expert-augmentation solution, whereby physicians provide input to guide the utility estimation and ITR learning processes. In Chapter 3, we apply an end-to-end precision medicine workflow to novel data from older adults with Type 1 Diabetes in order to understand the heterogeneous treatment effects of continuous glucose monitoring (CGM) and develop an interpretable ITR to identify patients for whom CGM confers a major safety benefit. The results from this analysis elucidate the demographic and clinical markers which moderate CGM's success, provide the basis for using diagnostic CGM to inform therapeutic CGM decisions, and serve to augment clinical decision-making. Finally, in Chapter 4, as a future research direction, we propose a deep autoencoder framework which simultaneously performs feature selection and ITR optimization, contributing to methodology built for direct consumption of unstructured, high-dimensional data in the precision medicine pipeline. Doctor of Philosophy.
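    To make the residual-reweighting idea concrete, here is a hedged toy sketch in the spirit of SD-Learning: fit an outcome model, model the squared residuals with a random forest, and use inverse-variance weights in a D-learning-style fit whose sign gives the treatment rule. The synthetic data, models, and simplified objective are illustrative assumptions, not the thesis's actual estimator.

```python
# Toy residual-reweighting sketch (not the published SD-Learning algorithm).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 5))
A = rng.choice([-1, 1], size=n)                       # randomized treatment, +/-1
Y = X[:, 0] * A + rng.normal(scale=np.exp(X[:, 1]))   # heteroscedastic noise

XA = np.column_stack([X, A])
mean_model = RandomForestRegressor(random_state=0).fit(XA, Y)
resid = Y - mean_model.predict(XA)

var_model = RandomForestRegressor(random_state=0).fit(X, resid ** 2)
weights = 1.0 / np.clip(var_model.predict(X), 1e-3, None)  # inverse-variance weights

# D-learning-style weighted fit: regress 2AY on X; the sign of the fit is the ITR.
itr_model = RandomForestRegressor(random_state=0).fit(X, 2 * A * Y,
                                                      sample_weight=weights)
recommended = np.sign(itr_model.predict(X))           # estimated treatment rule
```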

    Optimal sizing and placement of Electrical Vehicle charging stations to serve Battery Electric Trucks

    For Norway to reach the emission limits of the Paris Agreement, CO2 emissions must be substantially reduced. Road traffic alone accounted for a high percentage of total emissions in 2021. This thesis focuses on electrifying the transport sector and analyzing charging infrastructure for heavy-duty electric vehicles. New charging infrastructure for heavy-duty Electric Vehicles (EVs) raises profitability issues due to currently low adoption rates. However, heavy-duty EVs use the same charging sockets as passenger EVs. As a result, passenger EVs may finance the charging infrastructure needed to increase the adoption of heavy-duty EVs. Projections from Norwegian grid operators suggest that the total electricity surplus will diminish over the next few years and turn negative by 2027. This highlights the importance of modeling the power system in combination with finding optimal locations for charging stations. This study uses prescriptive analytics to suggest optimal locations for charging infrastructure that maximize returned profits, motivating station builders to implement more charging stations. A soft-linking with PyPSA-Eur is used to model the power system, where the new infrastructure is added as an additional load. The results show that charging infrastructure has the potential to become profitable as the adoption rate for heavy-duty EVs rises. The combination of the models offers an open-source tool for scholars, researchers, and planners to study how new charging infrastructure affects key components of the Norwegian power system, and could be useful in modeling state-of-the-art technologies.
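    A minimal sketch of the soft-linking step described above: a new truck-charging station is represented as an additional load on a PyPSA network. The bus, generator, and demand numbers below are invented for illustration; the thesis works with the full PyPSA-Eur model of the Norwegian power system.

```python
# Adding a heavy-duty EV charging station as an extra load in a PyPSA network.
# Toy network with made-up numbers; requires a linear solver for optimize().
import numpy as np
import pypsa

n = pypsa.Network()
n.set_snapshots(range(24))                       # one day, hourly resolution

n.add("Bus", "grid_node")
n.add("Generator", "supply", bus="grid_node", p_nom=50, marginal_cost=40)

charging_profile = np.full(24, 2.0)
charging_profile[6:20] = 8.0                     # daytime truck-charging demand (MW)
n.add("Load", "hd_ev_charger", bus="grid_node", p_set=charging_profile)

n.optimize()                                     # solve the linear optimal power flow
print(n.generators_t.p)                          # dispatch with the new load included
```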

    Designing similarity functions

    The concept of similarity is important in many areas of cognitive science, computer science, and statistics. In machine learning, functions that measure similarity between two instances form the core of instance-based classifiers. Past similarity measures have been primarily based on simple Euclidean distance. As machine learning has matured, it has become obvious that a simple numeric instance representation is insufficient for most domains. Similarity functions for symbolic attributes have been developed, and simple methods for combining these functions with numeric similarity functions were devised. This sequence of events has revealed three important issues, which this thesis addresses. The first issue concerns combining multiple measures of similarity. There is no equivalence between units of numeric similarity and units of symbolic similarity. Existing similarity functions for numeric and symbolic attributes have no common foundation, and so various schemes have been devised to avoid biasing the overall similarity towards one type of attribute. The similarity function design framework proposed by this thesis produces probability distributions that describe the likelihood of transforming between two attribute values. Because common units of probability are employed, similarities may be combined using standard methods. It is empirically shown that the resulting similarity functions treat different attribute types coherently. The second issue relates to the instance representation itself. The current choice of numeric and symbolic attribute types is insufficient for many domains, in which more complicated representations are required. For example, a domain may require varying numbers of features, or features with structural information. The framework proposed by this thesis is sufficiently general to permit virtually any type of instance representation: all that is required is that a set of basic transformations that operate on the instances be defined. To illustrate the framework's applicability to different instance representations, several example similarity functions are developed. The third, and perhaps most important, issue concerns the ability to incorporate domain knowledge within similarity functions. Domain information plays an important part in choosing an instance representation. However, even given an adequate instance representation, domain information is often lost. For example, numeric features with a modulo structure (such as the time of day) can be perfectly represented as a numeric attribute, but simple linear similarity functions ignore the modulo nature of the attribute. Similarly, symbolic attributes may have inter-symbol relationships that should be captured in the similarity function. The design framework proposed by this thesis allows domain information to be captured in the similarity function, both in the transformation model and in the probabilities assigned to basic transformations. Empirical results indicate that such domain information improves classifier performance, particularly when training data is limited.
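    To illustrate the framework's central idea, the hedged sketch below expresses each attribute's similarity as a transformation probability so that numeric, cyclic (modulo), and symbolic attributes combine on a common scale. The particular distributions, the symbol table, and the product combination are hand-picked stand-ins, not the thesis's learned models.

```python
# Per-attribute "transformation probability" similarities on a common scale.
# All distributions and constants are illustrative assumptions.
import math

def numeric_sim(a, b, scale=1.0):
    # Probability-like score from a Gaussian transformation model
    return math.exp(-((a - b) ** 2) / (2 * scale ** 2))

def modulo_sim(a, b, period=24.0, scale=2.0):
    # Wrap-around distance captures cyclic domain knowledge (e.g. time of day)
    d = abs(a - b) % period
    d = min(d, period - d)
    return math.exp(-(d ** 2) / (2 * scale ** 2))

SYMBOL_TRANSITIONS = {("red", "red"): 1.0, ("red", "orange"): 0.6,
                      ("red", "blue"): 0.1}      # inter-symbol relationships

def symbolic_sim(a, b):
    return SYMBOL_TRANSITIONS.get((a, b), SYMBOL_TRANSITIONS.get((b, a), 0.0))

def instance_sim(x, y):
    # Common probabilistic units allow a principled combination (here: product)
    return (numeric_sim(x["size"], y["size"])
            * modulo_sim(x["hour"], y["hour"])
            * symbolic_sim(x["color"], y["color"]))

print(instance_sim({"size": 2.0, "hour": 23, "color": "red"},
                   {"size": 2.5, "hour": 1,  "color": "orange"}))
```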

    LIPIcs, Volume 258, SoCG 2023, Complete Volume

    LIPIcs, Volume 258, SoCG 2023, Complete Volume

    Machine Learning for Observational Cosmology

    An array of large observational programs using ground-based and space-borne telescopes is planned for the next decade. The forthcoming wide-field sky surveys are expected to deliver a volume of data exceeding an exabyte. Processing this large amount of multiplexed astronomical data is technically challenging, and fully automated technologies based on machine learning and artificial intelligence are urgently needed. Maximizing scientific returns from these big data requires community-wide efforts. We summarize recent progress in machine learning applications in observational cosmology. We also address crucial issues in high-performance computing that are needed for the data processing and statistical analysis.
    Comment: 55 pages, 8 figures, submitted to Reports on Progress in Physics.

    Efficient Bayesian inference via Monte Carlo and machine learning algorithms

    International Mention in the doctoral degree. In many fields of science and engineering, we are faced with an inverse problem where we aim to recover an unobserved parameter or variable of interest from a set of observed variables. Bayesian inference is a probabilistic approach for inferring this unknown parameter that has become extremely popular, finding application in myriad problems in fields such as machine learning, signal processing, remote sensing and astronomy. In Bayesian inference, all the information about the parameter is summarized by the posterior distribution. Unfortunately, the study of the posterior distribution requires the computation of complicated integrals that are analytically intractable and need to be approximated. Monte Carlo is a huge family of sampling algorithms for performing optimization and numerical integration that has become the main workhorse for carrying out Bayesian inference. The main idea of Monte Carlo is that we can approximate the posterior distribution by a set of samples, obtained by an iterative process that involves sampling from a known distribution. Markov chain Monte Carlo (MCMC) and importance sampling (IS) are two important groups of Monte Carlo algorithms. This thesis focuses on developing and analyzing Monte Carlo algorithms (MCMC, IS, or combinations of both) under the different challenging scenarios presented below. In summary, in this thesis we address several important points, enumerated (a)–(f), that currently represent a challenge in Bayesian inference via Monte Carlo. A first challenge that we address is the problematic exploration of the parameter space by off-the-shelf MCMC algorithms when there is (a) multimodality, or (b) highly concentrated posteriors. Another challenge that we address is (c) proposal construction in IS. Furthermore, in recent applications we need to deal with (d) expensive posteriors, and/or we need to handle (e) noisy posteriors. Finally, the Bayesian framework also offers a way of comparing competing hypotheses (models) in a principled way by means of marginal likelihoods. Hence, a task of fundamental importance is (f) marginal likelihood computation. Chapters 2 and 3 deal with (a), (b), and (c). In Chapter 2, we propose a novel population MCMC algorithm called Parallel Metropolis-Hastings Coupler (PMHC). PMHC is very suitable for multimodal scenarios since it works with a population of states, instead of a single one, hence allowing for sharing information. PMHC combines independent exploration, by the use of parallel Metropolis-Hastings algorithms, with cooperative exploration, by the use of a population MCMC technique called Normal Kernel Coupler. In Chapter 3, population MCMC is combined with IS within the layered adaptive IS (LAIS) framework. The combination of MCMC and IS serves two purposes. First, it provides an automatic proposal construction. Second, it aims at increasing robustness, since the MCMC samples are not used directly to form the sample approximation of the posterior. The use of minibatches of data is proposed to deal with highly concentrated posteriors. Other extensions for reducing the costs with respect to the vanilla LAIS framework, based on recycling and clustering, are discussed and analyzed. Chapters 4, 5 and 6 deal with (c), (d) and (e). The use of nonparametric approximations of the posterior plays an important role in the design of efficient Monte Carlo algorithms. Nonparametric approximations of the posterior can be obtained using machine learning algorithms for nonparametric regression, such as Gaussian Processes and Nearest Neighbors. They can then serve as cheap surrogate models, or as the basis for building efficient proposal distributions. In Chapter 4, in the context of expensive posteriors, we propose adaptive quadratures of posterior expectations and the marginal likelihood using a sequential algorithm that builds and refines a nonparametric approximation of the posterior. In Chapter 5, we propose Regression-based Adaptive Deep Importance Sampling (RADIS), an adaptive IS algorithm that uses a nonparametric approximation of the posterior as the proposal distribution. We illustrate the proposed algorithms in applications of astronomy and remote sensing. Chapters 4 and 5 consider noiseless posterior evaluations for building the nonparametric approximations. More generally, in Chapter 6 we give an overview and classification of MCMC and IS schemes using surrogates built with noisy evaluations. The motivation here is the study of posteriors that are both costly and noisy. The classification reveals a connection between algorithms that use the posterior approximation as a cheap surrogate, and algorithms that use it for building an efficient proposal. We illustrate specific instances of the classified schemes in an application of reinforcement learning. In Chapter 7, we study noisy IS, namely IS when the posterior evaluations are noisy, and derive optimal proposal distributions for the different estimators in this setting. Chapter 8 deals with (f): we provide an exhaustive review of methods for marginal likelihood computation, with special focus on those based on Monte Carlo, derive many connections among the methods, and compare them in several simulation setups. Finally, in Chapter 9 we summarize the contributions of this thesis and discuss some potential avenues of future research. Doctoral Program in Mathematical Engineering, Universidad Carlos III de Madrid. Committee: President: Valero Laparra Pérez-Muelas; Secretary: Michael Peter Wiper; Member: Omer Deniz Akyildiz.
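    As a minimal illustration of two quantities at the heart of this thesis, the sketch below uses (self-normalized) importance sampling to estimate a posterior expectation and the marginal likelihood in a Gaussian toy model; the data, prior, proposal, and sample sizes are illustrative assumptions, not any of the thesis's algorithms.

```python
# Toy self-normalized importance sampling (IS) estimates of a posterior mean
# and the marginal likelihood Z for a Gaussian model with known noise scale.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=1.0, size=20)           # observed data

prior = stats.norm(0.0, 2.0)                          # p(theta)
proposal = stats.norm(np.mean(y), 1.0)                # q(theta), roughly posterior-shaped

theta = proposal.rvs(size=5000, random_state=rng)
log_like = stats.norm.logpdf(y[:, None], loc=theta, scale=1.0).sum(axis=0)
log_w = log_like + prior.logpdf(theta) - proposal.logpdf(theta)

shift = log_w.max()                                   # stabilize the exponentials
w = np.exp(log_w - shift)
Z_hat = w.mean() * np.exp(shift)                      # marginal likelihood estimate
post_mean = np.sum(w * theta) / np.sum(w)             # self-normalized IS expectation
print(Z_hat, post_mean)
```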

    Data-driven methods to estimate the committor function in conceptual ocean models

    In recent years, several climate subsystems have been identified that may undergo a relatively rapid transition compared to the changes in their forcing. Such transitions are rare events in general, and simulating trajectories long enough to gather sufficient data to determine transition statistics would be too expensive. Rare-event algorithms like TAMS (Trajectory-Adaptive Multilevel Sampling) instead encourage the transition while keeping track of the model statistics. However, this algorithm relies on a score function whose choice is crucial to ensure its efficiency. The optimal score function, called the committor function, is in practice very difficult to compute. In this paper, we compare different data-based methods (Analogue Markov Chains, Neural Networks, Reservoir Computing, Dynamical Galerkin Approximation) for estimating the committor from trajectory data. We apply these methods to two models of the Atlantic Ocean circulation featuring very different dynamical behavior. We compare the methods in terms of two measures evaluating how close the estimate is to the true committor, and in terms of computational time. We find that all methods are able to extract information from the data and provide a good estimate of the committor. Analogue Markov Chains provide a very reliable estimate of the true committor in simple models but prove less robust when applied to systems with a more complex phase space. Neural network methods clearly stand out by their relatively low testing time, and their training time scales more favorably with the complexity of the model than that of the other methods. In particular, feedforward neural networks consistently achieve the best performance when trained with enough data, making this method promising for committor estimation in sophisticated climate models.
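    A hedged sketch of the feedforward-network approach to committor estimation: state snapshots are labeled by whether their trajectory reached set B (transition) before returning to set A, and the network's predicted probability approximates the committor q(x). The synthetic states and labels below stand in for the paper's ocean-model trajectory data.

```python
# Committor estimation as probabilistic classification (illustrative data).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
states = rng.normal(size=(5000, 2))                   # snapshots of the model state
# Synthetic labels: 1 if the trajectory starting here went on to reach B first
reached_B = (states[:, 0] + 0.5 * rng.normal(size=5000) > 0.8).astype(int)

net = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0)
net.fit(states, reached_B)

committor = net.predict_proba(states)[:, 1]           # q(x) ~ P(reach B before A | x)
```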

    Privacy-preserving distributed data mining

    This thesis is concerned with privacy-preserving distributed data mining algorithms. The main challenges in this setting are inference attacks and the formation of collusion groups. The inference problem is the reconstruction of sensitive data by attackers from non-sensitive sources, such as intermediate results, exchanged messages, or public information. Moreover, in a distributed scenario, malicious insiders can organize collusion groups to deploy more effective inference attacks. This thesis shows that existing privacy measures do not adequately protect privacy against inference and collusion. Therefore, new measures based on information theory are developed to overcome the identified limitations. Furthermore, a new distributed data clustering algorithm is presented. The clustering approach is based on a kernel density estimate approximation that generates a controlled amount of ambiguity in the density estimates and thereby provides privacy for the original data. This thesis also introduces the first privacy-preserving algorithms for frequent pattern discovery in distributed time series: time series are transformed into sets of n-dimensional data points, and finding frequent patterns is reduced to finding local maxima in the n-dimensional density space. The proposed algorithms are linear in the size of the dataset and have low communication costs, as validated by experimental evaluation on different datasets.

    This thesis addresses privacy-preserving data mining in distributed environments, with a focus on selected N-agent attack scenarios for the inference problem in data clustering and time series analysis. These are attacks by individual agents or subgroups of agents within a distributed data mining group, or by a single agent outside that group. First, this thesis introduces two new privacy measures which, unlike previously existing ones, satisfy the privacy-preservation properties generally required in distributed data mining, and whose measured degree of privacy relates to the data analysis method used and the number of attackers. For privacy-preserving distributed data clustering, a new kernel-density-estimation-based method called KDECS is presented. KDECS uses an approximation of the original local kernel density estimate such that the original data of other agents in the data mining group cannot be reconstructed with a probability higher than a predefined threshold. The method is provably more secure than data clustering with generative mixture models and SMC-based secure k-means clustering. In addition, we present new methods, named DPD-TS, DPD-HE, and DPD-FS, for privacy-preserving distributed pattern discovery in time series, whose complexity and degree of security we analyze with the new privacy measures mentioned above. The minimum degree of security of DPD-TS and DPD-FS, as specified by the individual agents of a data mining group, depends only on the dimensionality reduction of the time series values and their discretization, and can easily be verified. Even stronger protection of sensitive data is provided by DPD-HE with the help of homomorphic encryption. Besides the theoretical analysis, experimental performance evaluations of the developed methods were carried out with various publicly available datasets.
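    The reduction from frequent-pattern discovery to density maxima can be sketched as follows, without any of the privacy machinery: time-series windows are embedded as n-dimensional points, and modes (local maxima) of a kernel density estimate are taken as frequent patterns. MeanShift is used here as an off-the-shelf KDE mode finder; the DPD-* algorithms add the privacy-preserving distributed layer on top, and the series, window length, and bandwidth below are invented for illustration.

```python
# Frequent-pattern discovery as KDE mode finding (simplified, non-private).
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(0)
t = np.arange(500)
series = np.sin(0.3 * t) + 0.1 * rng.normal(size=500)    # toy series with motifs

n = 8                                                    # window (pattern) length
windows = np.lib.stride_tricks.sliding_window_view(series, n)

ms = MeanShift(bandwidth=0.8).fit(windows)               # climb to density maxima
frequent_patterns = ms.cluster_centers_                  # density modes ~ motifs
print(frequent_patterns.shape)
```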

    An Efficient Global Optimization Algorithm with Adaptive Estimates of the Local Lipschitz Constants

    In this work, we present a new deterministic partition-based Global Optimization (GO) algorithm that uses estimates of the local Lipschitz constants associated with different sub-regions of the domain of the objective function. The estimates of the local Lipschitz constants associated with each partition are obtained by adaptively balancing the global and local information gathered so far by the algorithm, expressed in terms of absolute slopes. We also propose a coupling strategy with local optimization algorithms to accelerate the convergence of the proposed approach. Finally, we compare our approach, HALO (Hybrid Adaptive Lipschitzian Optimization), against popular GO algorithms on hundreds of test functions. The numerical results show that the performance of HALO is very promising and can extend our arsenal of efficient procedures for attacking challenging real-world GO problems. The Python code of HALO is publicly available on GitHub: https://github.com/dannyzx/HALO
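    A toy one-dimensional illustration of the ingredient HALO builds on: local Lipschitz constants are estimated from absolute slopes between evaluated points and turned into lower bounds on sub-intervals, which a partition-based method can use to decide where to refine. HALO's actual partitioning scheme, global/local balancing, and local-solver coupling are more elaborate than this sketch, and the objective and mixing rule below are invented.

```python
# Piyavskii-style interval lower bounds from slope-based Lipschitz estimates.
import numpy as np

f = lambda x: np.sin(3 * x) + 0.5 * x                 # toy objective to minimize
xs = np.linspace(0.0, 4.0, 9)                         # evaluated points
fs = f(xs)

slopes = np.abs(np.diff(fs) / np.diff(xs))            # absolute slopes
L_global = slopes.max()                               # global Lipschitz estimate

for i in range(len(xs) - 1):
    # Local estimate: largest nearby slope, mixed with the global estimate
    # to guard against underestimation (an illustrative rule, not HALO's).
    L_local = max(slopes[max(i - 1, 0):i + 2].max(), 0.5 * L_global)
    # Lower bound on [x_i, x_{i+1}] for an L_local-Lipschitz function
    bound = 0.5 * (fs[i] + fs[i + 1]) - 0.5 * L_local * (xs[i + 1] - xs[i])
    print(f"[{xs[i]:.2f}, {xs[i + 1]:.2f}] lower bound {bound:.3f}")
```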