64 research outputs found

    Wind turbine power curve estimation based on earth mover distance and artificial neural networks

    Get PDF
    A data-based estimation of the wind–power curve in wind turbines may be a challenging task due to the presence of anomalous data, possibly due to wrong sensor reads, operation halts, malfunctions or other. In this study, the authors describe a data-based procedure to build a robust and accurate estimate of the wind–power curve. In particular, they combine a joint clustering procedure, where both the wind speeds and the power data are clustered, with an Earth Mover Distance-based Extreme Learning Machine algorithm to filter out data that poorly contribute to explain the unknown curve. After estimating the cut-in and the rated speed, they use a radial basis function neural network to fit the filtered data and obtain the curve estimate. They extensively compared the proposed procedure against other conventional methodologies over measured data of nine turbines, to assess and discuss its performance

    Learning and inference with Wasserstein metrics

    Get PDF
    Thesis: Ph. D., Massachusetts Institute of Technology, Department of Brain and Cognitive Sciences, 2018.Cataloged from PDF version of thesis.Includes bibliographical references (pages 131-143).This thesis develops new approaches for three problems in machine learning, using tools from the study of optimal transport (or Wasserstein) distances between probability distributions. Optimal transport distances capture an intuitive notion of similarity between distributions, by incorporating the underlying geometry of the domain of the distributions. Despite their intuitive appeal, optimal transport distances are often difficult to apply in practice, as computing them requires solving a costly optimization problem. In each setting studied here, we describe a numerical method that overcomes this computational bottleneck and enables scaling to real data. In the first part, we consider the problem of multi-output learning in the presence of a metric on the output domain. We develop a loss function that measures the Wasserstein distance between the prediction and ground truth, and describe an efficient learning algorithm based on entropic regularization of the optimal transport problem. We additionally propose a novel extension of the Wasserstein distance from probability measures to unnormalized measures, which is applicable in settings where the ground truth is not naturally expressed as a probability distribution. We show statistical learning bounds for both the Wasserstein loss and its unnormalized counterpart. The Wasserstein loss can encourage smoothness of the predictions with respect to a chosen metric on the output space. We demonstrate this property on a real-data image tagging problem, outperforming a baseline that doesn't use the metric. In the second part, we consider the probabilistic inference problem for diffusion processes. Such processes model a variety of stochastic phenomena and appear often in continuous-time state space models. Exact inference for diffusion processes is generally intractable. In this work, we describe a novel approximate inference method, which is based on a characterization of the diffusion as following a gradient flow in a space of probability densities endowed with a Wasserstein metric. Existing methods for computing this Wasserstein gradient flow rely on discretizing the underlying domain of the diffusion, prohibiting their application to problems in more than several dimensions. In the current work, we propose a novel algorithm for computing a Wasserstein gradient flow that operates directly in a space of continuous functions, free of any underlying mesh. We apply our approximate gradient flow to the problem of filtering a diffusion, showing superior performance where standard filters struggle. Finally, we study the ecological inference problem, which is that of reasoning from aggregate measurements of a population to inferences about the individual behaviors of its members. This problem arises often when dealing with data from economics and political sciences, such as when attempting to infer the demographic breakdown of votes for each political party, given only the aggregate demographic and vote counts separately. Ecological inference is generally ill-posed, and requires prior information to distinguish a unique solution. We propose a novel, general framework for ecological inference that allows for a variety of priors and enables efficient computation of the most probable solution. Unlike previous methods, which rely on Monte Carlo estimates of the posterior, our inference procedure uses an efficient fixed point iteration that is linearly convergent. Given suitable prior information, our method can achieve more accurate inferences than existing methods. We additionally explore a sampling algorithm for estimating credible regions.by Charles Frogner.Ph. D

    MailTrout:a machine learning browser extension for detecting phishing emails

    Get PDF
    The onset of the COVID-19 pandemic has given rise to an increase in cyberattacks and cybercrime, particularly with respect to phishing attempts. Cybercrime associated with phishing emails can significantly impact victims, who may be subjected to monetary loss and identity theft. Existing anti-phishing tools do not always catch all phishing emails, leaving the user to decide the legitimacy of an email. The ability of machine learning technology to identify reoccurring patterns yet cope with overall changes complements the nature of anti-phishing techniques, as phishing attacks may vary in wording but often follow similar patterns. This paper presents a browser extension called MailTrout, which incorporates machine learning within a usable security tool to assist users in detecting phishing emails. MailTrout demonstrated high levels of accuracy when detecting phishing emails and high levels of usability for end-users

    An Unsupervised Cluster: Learning Water Customer Behavior Using Variation of Information on a Reconstructed Phase Space

    Get PDF
    The unsupervised clustering algorithm described in this dissertation addresses the need to divide a population of water utility customers into groups based on their similarities and differences, using only the measured flow data collected by water meters. After clustering, the groups represent customers with similar consumption behavior patterns and provide insight into ‘normal’ and ‘unusual’ customer behavior patterns. This research focuses upon individually metered water utility customers and includes both residential and commercial customer accounts serviced by utilities within North America. The contributions of this dissertation not only represent a novel academic work, but also solve a practical problem for the utility industry. This dissertation introduces a method of agglomerative clustering using information theoretic distance measures on Gaussian mixture models within a reconstructed phase space. The clustering method accommodates a utility’s limited human, financial, computational, and environmental resources. The proposed weighted variation of information distance measure for comparing Gaussian mixture models places emphasis upon those behaviors whose statistical distributions are more compact over those behaviors with large variation and contributes a novel addition to existing comparison options

    Generation and optimisation of real-world static and dynamic location-allocation problems with application to the telecommunications industry.

    Get PDF
    The location-allocation (LA) problem concerns the location of facilities and the allocation of demand, to minimise or maximise a particular function such as cost, profit or a measure of distance. Many formulations of LA problems have been presented in the literature to capture and study the unique aspects of real-world problems. However, some real-world aspects, such as resilience, are still lacking in the literature. Resilience ensures uninterrupted supply of demand and enhances the quality of service. Due to changes in population shift, market size, and the economic and labour markets - which often cause demand to be stochastic - a reasonable LA problem formulation should consider some aspect of future uncertainties. Almost all LA problem formulations in the literature that capture some aspect of future uncertainties fall in the domain of dynamic optimisation problems, where new facilities are located every time the environment changes. However, considering the substantial cost associated with locating a new facility, it becomes infeasible to locate facilities each time the environment changes. In this study, we propose and investigate variations of LA problem formulations. Firstly, we develop and study new LA formulations, which extend the location of facilities and the allocation of demand to add a layer of resilience. We apply the population-based incremental learning algorithm for the first time in the literature to solve the new novel LA formulations. Secondly, we propose and study a new dynamic formulation of the LA problem where facilities are opened once at the start of a defined period and are expected to be satisfactory in servicing customers' demands irrespective of changes in customer distribution. The problem is based on the idea that customers will change locations over a defined period and that these changes have to be taken into account when establishing facilities to service changing customers' distributions. Thirdly, we employ a simulation-based optimisation approach to tackle the new dynamic formulation. Owing to the high computational costs associated with simulation-based optimisation, we investigate the concept of Racing, an approach used in model selection, to reduce the high computational cost by employing the minimum number of simulations for solution selection

    Highly efficient low-level feature extraction for video representation and retrieval.

    Get PDF
    PhDWitnessing the omnipresence of digital video media, the research community has raised the question of its meaningful use and management. Stored in immense multimedia databases, digital videos need to be retrieved and structured in an intelligent way, relying on the content and the rich semantics involved. Current Content Based Video Indexing and Retrieval systems face the problem of the semantic gap between the simplicity of the available visual features and the richness of user semantics. This work focuses on the issues of efficiency and scalability in video indexing and retrieval to facilitate a video representation model capable of semantic annotation. A highly efficient algorithm for temporal analysis and key-frame extraction is developed. It is based on the prediction information extracted directly from the compressed domain features and the robust scalable analysis in the temporal domain. Furthermore, a hierarchical quantisation of the colour features in the descriptor space is presented. Derived from the extracted set of low-level features, a video representation model that enables semantic annotation and contextual genre classification is designed. Results demonstrate the efficiency and robustness of the temporal analysis algorithm that runs in real time maintaining the high precision and recall of the detection task. Adaptive key-frame extraction and summarisation achieve a good overview of the visual content, while the colour quantisation algorithm efficiently creates hierarchical set of descriptors. Finally, the video representation model, supported by the genre classification algorithm, achieves excellent results in an automatic annotation system by linking the video clips with a limited lexicon of related keywords

    AI-assisted patent prior art searching - feasibility study

    Get PDF
    This study seeks to understand the feasibility, technical complexities and effectiveness of using artificial intelligence (AI) solutions to improve operational processes of registering IP rights. The Intellectual Property Office commissioned Cardiff University to undertake this research. The research was funded through the BEIS Regulators’ Pioneer Fund (RPF). The RPF fund was set up to help address barriers to innovation in the UK economy

    Semantically enhanced document clustering

    Get PDF
    This thesis advocates the view that traditional document clustering could be significantly improved by representing documents at different levels of abstraction at which the similarity between documents is considered. The improvement is with regard to the alignment of the clustering solutions to human judgement. The proposed methodology employs semantics with which the conceptual similarity be-tween documents is measured. The goal is to design algorithms which implement the meth-odology, in order to solve the following research problems: (i) how to obtain multiple deter-ministic clustering solutions; (ii) how to produce coherent large-scale clustering solutions across domains, regardless of the number of clusters; (iii) how to obtain clustering solutions which align well with human judgement; and (iv) how to produce specific clustering solu-tions from the perspective of the user’s understanding for the domain of interest. The developed clustering methodology enhances separation between and improved coher-ence within clusters generated across several domains by using levels of abstraction. The methodology employs a semantically enhanced text stemmer, which is developed for the pur-pose of producing coherent clustering, and a concept index that provides generic document representation and reduced dimensionality of document representation. These characteristics of the methodology enable addressing the limitations of traditional text document clustering by employing computationally expensive similarity measures such as Earth Mover’s Distance (EMD), which theoretically aligns the clustering solutions closer to human judgement. A threshold for similarity between documents that employs many-to-many similarity matching is proposed and experimentally proven to benefit the traditional clustering algorithms in pro-ducing clustering solutions aligned closer to human judgement. 4 The experimental validation demonstrates the scalability of the semantically enhanced document clustering methodology and supports the contributions: (i) multiple deterministic clustering solutions and different viewpoints to a document collection are obtained; (ii) the use of concept indexing as a document representation technique in the domain of document clustering is beneficial for producing coherent clusters across domains; (ii) SETS algorithm provides an improved text normalisation by using external knowledge; (iv) a method for measuring similarity between documents on a large scale by using many-to-many matching; (v) a semantically enhanced methodology that employs levels of abstraction that correspond to a user’s background, understanding and motivation. The achieved results will benefit the research community working in the area of document management, information retrieval, data mining and knowledge management
    corecore