86 research outputs found

    New procedures for visualizing data and diagnosing regression models

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Sloan School of Management, Operations Research Center, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 97-103). This thesis presents new methods for exploring data using visualization techniques. The first part of the thesis develops a procedure for visualizing the sampling variability of a plot. The motivation is that reporting a single plot of a sample of data without a description of its sampling variability can be as uninformative and misleading as reporting a sample mean without a confidence interval. The thesis then develops a method for simplifying large scatter plot matrices, using techniques similar to the above procedure. The second part of the thesis introduces a new diagnostic method for regression called backward selection search. Backward selection search identifies a relevant feature set and a set of influential observations with good accuracy, given the difficulty of the problem, and additionally provides a description, in the form of a set of plots, of how the regression inferences would change under other, near-optimal model choices. This description is useful because an observation that one analyst identifies as an outlier could be identified by another analyst as the most important observation in the data set. The key idea behind backward selection search has implications for methodology improvements beyond the realm of visualization; these are described following the presentation of backward selection search. Real and simulated examples, provided throughout the thesis, demonstrate that the methods developed in the first part improve the effectiveness and validity of data visualization, while the methods developed in the second part improve analysts' ability to select robust models. By Rajiv Menjoge. Ph.D.
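    The abstract does not spell out how the sampling variability of a plot is displayed; a minimal sketch of the general idea, assuming a bootstrap-style resampling scheme (an assumption, not the thesis's stated procedure), might look like this:

```python
# Illustrative sketch only: convey the sampling variability of a scatter plot
# by overlaying smooth fits computed on bootstrap resamples of the data.
# The resampling scheme and the polynomial smoother are assumptions, not the
# thesis's actual method.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = np.sin(x) + rng.normal(scale=0.5, size=n)    # synthetic sample

plt.scatter(x, y, s=10, alpha=0.5, label="observed sample")

grid = np.linspace(0, 10, 100)
for _ in range(50):                               # bootstrap replicates
    idx = rng.integers(0, n, n)                   # resample rows with replacement
    coeffs = np.polyfit(x[idx], y[idx], deg=5)    # simple smooth fit per replicate
    plt.plot(grid, np.polyval(coeffs, grid), color="gray", alpha=0.2)

plt.legend()
plt.title("Spread of replicate fits conveys the plot's sampling variability")
plt.show()
```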

    Model-based compressive sensing with Earth Mover's Distance constraints

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013. Cataloged from PDF version of thesis. Includes bibliographical references (p. 71-72). In compressive sensing, we want to recover ... from linear measurements of the form ... describes the measurement process. Standard results in compressive sensing show that it is possible to exactly recover the signal x from only m ... measurements for certain types of matrices. Model-based compressive sensing reduces the number of measurements even further by limiting the supports of x to a subset of the ... possible supports. Such a family of supports is called a structured sparsity model. In this thesis, we introduce a structured sparsity model for two-dimensional signals that have similar support in neighboring columns. We quantify the change in support between neighboring columns with the Earth Mover's Distance (EMD), which measures both how many elements of the support change and how far the supported elements move. We prove that for a reasonable limit on the EMD between adjacent columns, we can recover signals in our model from only ... measurements, where w is the width of the signal. This is an asymptotic improvement over the ... bound in standard compressive sensing. While developing the algorithmic tools for our proposed structured sparsity model, we also extend the model-based compressed sensing framework. In order to use a structured sparsity model in compressive sensing, we need a model projection algorithm that, given an arbitrary signal x, returns the best approximation in the model. We relax this constraint and develop a variant of IHT, an existing sparse recovery algorithm, that works with approximate model projection algorithms. By Ludwig Schmidt. S.M.
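    The recovery algorithm is a variant of iterative hard thresholding (IHT) with the model projection as a pluggable step. A minimal sketch of generic IHT follows, with a plain top-k projection standing in for the thesis's (approximate) projection onto the EMD-constrained model; the projection, step size, and toy problem are illustrative assumptions:

```python
# Sketch of iterative hard thresholding (IHT) with a pluggable projection step.
# The top-k projection is a stand-in; the thesis replaces it with an
# approximate projection onto its EMD-based structured sparsity model.
import numpy as np

def top_k_projection(x, k):
    """Keep the k largest-magnitude entries of x, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def iht(y, A, k, iterations=200, project=top_k_projection):
    """Recover a sparse x from y ~ A x via projected gradient steps."""
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # conservative step size
    for _ in range(iterations):
        x = project(x + step * (A.T @ (y - A @ x)), k)
    return x

# toy usage: 80 random measurements of a 10-sparse signal of length 200
rng = np.random.default_rng(1)
n, m, k = 200, 80, 10
A = rng.normal(size=(m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
x_hat = iht(A @ x_true, A, k)
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```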

    Generation and optimisation of real-world static and dynamic location-allocation problems with application to the telecommunications industry.

    The location-allocation (LA) problem concerns the location of facilities and the allocation of demand so as to minimise or maximise a particular function such as cost, profit or a measure of distance. Many formulations of LA problems have been presented in the literature to capture and study the unique aspects of real-world problems. However, some real-world aspects, such as resilience, are still lacking in the literature. Resilience ensures uninterrupted supply of demand and enhances the quality of service. Because changes in population distribution, market size, and the economic and labour markets often cause demand to be stochastic, a reasonable LA problem formulation should consider some aspect of future uncertainty. Almost all LA problem formulations in the literature that capture some aspect of future uncertainty fall in the domain of dynamic optimisation problems, where new facilities are located every time the environment changes. However, considering the substantial cost associated with locating a new facility, it is infeasible to locate facilities each time the environment changes. In this study, we propose and investigate variations of LA problem formulations. Firstly, we develop and study new LA formulations that extend the location of facilities and the allocation of demand to add a layer of resilience, and we apply the population-based incremental learning (PBIL) algorithm for the first time in the literature to solve these novel formulations. Secondly, we propose and study a new dynamic formulation of the LA problem in which facilities are opened once at the start of a defined period and are expected to service customers' demands satisfactorily irrespective of changes in customer distribution. The problem is based on the idea that customers will change locations over the defined period and that these changes have to be taken into account when establishing facilities to service the changing customer distribution. Thirdly, we employ a simulation-based optimisation approach to tackle the new dynamic formulation. Owing to the high computational cost associated with simulation-based optimisation, we investigate the concept of Racing, an approach used in model selection, to reduce this cost by employing the minimum number of simulations needed for solution selection.
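    The abstract does not give implementation details for the population-based incremental learning (PBIL) step; a minimal sketch of generic PBIL on a binary open/closed facility encoding, with an entirely illustrative cost function, might look like this:

```python
# Illustrative PBIL sketch on a binary facility-location encoding
# (bit i = 1 means "open candidate facility i"). The objective below is a
# placeholder; the thesis's resilient LA formulations are not reproduced here.
import numpy as np

def pbil(cost, n_bits, pop_size=50, generations=200,
         learning_rate=0.1, mutation_prob=0.02, mutation_shift=0.05, seed=0):
    rng = np.random.default_rng(seed)
    p = np.full(n_bits, 0.5)                       # probability vector
    best, best_cost = None, np.inf
    for _ in range(generations):
        pop = rng.random((pop_size, n_bits)) < p   # sample candidate solutions
        costs = np.array([cost(ind) for ind in pop])
        elite = pop[np.argmin(costs)]
        if costs.min() < best_cost:
            best, best_cost = elite.copy(), costs.min()
        p = (1 - learning_rate) * p + learning_rate * elite   # pull p toward the elite
        mutate = rng.random(n_bits) < mutation_prob           # small random drift
        p[mutate] = (1 - mutation_shift) * p[mutate] + mutation_shift * rng.random(mutate.sum())
    return best, best_cost

# toy objective: assignment distance plus a fixed cost per opened facility
rng = np.random.default_rng(1)
demand = rng.random((30, 2))                       # demand point coordinates
sites = rng.random((15, 2))                        # candidate facility sites

def la_cost(open_mask):
    if not open_mask.any():
        return np.inf
    d = np.linalg.norm(demand[:, None, :] - sites[open_mask][None, :, :], axis=2)
    return d.min(axis=1).sum() + 0.5 * open_mask.sum()

solution, value = pbil(la_cost, n_bits=len(sites))
print(solution.astype(int), value)
```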

    Learning and inference with Wasserstein metrics

    Thesis: Ph. D., Massachusetts Institute of Technology, Department of Brain and Cognitive Sciences, 2018. Cataloged from PDF version of thesis. Includes bibliographical references (pages 131-143). This thesis develops new approaches for three problems in machine learning, using tools from the study of optimal transport (or Wasserstein) distances between probability distributions. Optimal transport distances capture an intuitive notion of similarity between distributions by incorporating the underlying geometry of the domain of the distributions. Despite their intuitive appeal, optimal transport distances are often difficult to apply in practice, as computing them requires solving a costly optimization problem. In each setting studied here, we describe a numerical method that overcomes this computational bottleneck and enables scaling to real data. In the first part, we consider the problem of multi-output learning in the presence of a metric on the output domain. We develop a loss function that measures the Wasserstein distance between the prediction and the ground truth, and describe an efficient learning algorithm based on entropic regularization of the optimal transport problem. We additionally propose a novel extension of the Wasserstein distance from probability measures to unnormalized measures, which is applicable in settings where the ground truth is not naturally expressed as a probability distribution. We show statistical learning bounds for both the Wasserstein loss and its unnormalized counterpart. The Wasserstein loss can encourage smoothness of the predictions with respect to a chosen metric on the output space. We demonstrate this property on a real-data image tagging problem, outperforming a baseline that doesn't use the metric. In the second part, we consider the probabilistic inference problem for diffusion processes. Such processes model a variety of stochastic phenomena and appear often in continuous-time state space models. Exact inference for diffusion processes is generally intractable. In this work, we describe a novel approximate inference method, which is based on a characterization of the diffusion as following a gradient flow in a space of probability densities endowed with a Wasserstein metric. Existing methods for computing this Wasserstein gradient flow rely on discretizing the underlying domain of the diffusion, prohibiting their application to problems in more than several dimensions. In the current work, we propose a novel algorithm for computing a Wasserstein gradient flow that operates directly in a space of continuous functions, free of any underlying mesh. We apply our approximate gradient flow to the problem of filtering a diffusion, showing superior performance where standard filters struggle. Finally, we study the ecological inference problem, which is that of reasoning from aggregate measurements of a population to inferences about the individual behaviors of its members. This problem arises often when dealing with data from economics and political science, such as when attempting to infer the demographic breakdown of votes for each political party, given only the aggregate demographic and vote counts separately. Ecological inference is generally ill-posed, and requires prior information to distinguish a unique solution. We propose a novel, general framework for ecological inference that allows for a variety of priors and enables efficient computation of the most probable solution. Unlike previous methods, which rely on Monte Carlo estimates of the posterior, our inference procedure uses an efficient fixed-point iteration that is linearly convergent. Given suitable prior information, our method can achieve more accurate inferences than existing methods. We additionally explore a sampling algorithm for estimating credible regions. By Charles Frogner. Ph.D.
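    The entropic regularization mentioned above is typically computed with Sinkhorn-style matrix scaling. The following is a minimal sketch of that generic computation, not the thesis's learning algorithm; the cost matrix, regularization strength, and iteration count are illustrative assumptions:

```python
# Sketch of entropic-regularized optimal transport via Sinkhorn matrix scaling,
# the kind of computation that makes a Wasserstein loss tractable in practice.
import numpy as np

def sinkhorn(a, b, cost, epsilon=0.05, iterations=500):
    """Approximate the optimal transport cost between histograms a and b."""
    K = np.exp(-cost / epsilon)           # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iterations):
        v = b / (K.T @ u)                 # alternating scaling updates
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]    # approximate transport plan
    return np.sum(plan * cost)

# toy usage: two histograms on a 1-D grid with squared-distance ground cost
grid = np.linspace(0, 1, 50)
cost = (grid[:, None] - grid[None, :]) ** 2
a = np.exp(-((grid - 0.3) ** 2) / 0.01); a /= a.sum()
b = np.exp(-((grid - 0.7) ** 2) / 0.01); b /= b.sum()
print(sinkhorn(a, b, cost))               # close to (0.7 - 0.3)**2 = 0.16
```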

    High-throughput genomic/proteomic studies: finding structure and meaning by similarity

    The post-genomic challenge was to develop high-throughput technologies for measuring genome-scale mRNA expression levels. Analyses of these data rely on computers in an unprecedented way to make the results accessible to researchers. My research in this area enabled the first compendium of microarray experiments for a multi-cellular eukaryote, Caenorhabditis elegans. Prior to this research, approximately 6% of the C. elegans genome had been studied, and little was known about global expression patterns in this organism. Here I cluster data from 553 different microarray experiments and show that the results are stable, statistically significant, and highly enriched for specific biological functions. These enrichments allow identification of gene function for the majority of C. elegans genes. Tissue-specific expression patterns are discovered, suggesting roles for particular proteins in digestion, tumor suppression, and protection from bacteria and heavy metals. I report evidence that genome instability in males involves transposons, and I find co-expression patterns among sperm proteins, protein kinases, and phosphatases, suggesting that sperm, which are transcriptionally inactive cells, commonly use phosphorylation to regulate protein activities. My subsequent research addresses protein concentrations and interactions, beginning with a simultaneous comparison of multiple data sets to analyze Saccharomyces cerevisiae gene-expression (cell cycle and exit from stationary phase/G0) and protein-interaction studies. Here, I find that G1-regulated genes are not co-regulated during exit from stationary phase, indicating that the cells are not synchronized. The tight clustering of other genes during exit from stationary phase does indicate that the physiological responses during G0 exit are separable from cell-cycle events. Subsequently, I report in vivo proteomic research investigating population phenotypes in stationary-phase cultures using the yeast Green Fluorescent Protein-fusion library (4156 strains) together with flow cytometry. Stationary-phase cultures consist of dense quiescent (Q) and less dense non-quiescent (NQ) fractions. The Q-cell fraction is generally composed of daughter cells with high concentrations of proteins involved in the citric acid cycle and the electron transport chain, for example Cit1p. The NQ fraction has subpopulations of cells that can be separated by low and high concentrations of these mitochondrial proteins; that is, NQ cells often show double intensity peaks, a bright fraction and a much dimmer fraction, as is the case for Cit1p. The Q fraction uses oxygen 6 times as rapidly as the NQ fraction, and 1.6 times as rapidly as exponentially growing cells. NQ cells are less reproductively capable than Q cells and show evidence of reactive oxygen species stress. These phenotypes develop as early as 20-24 hours after the diauxic shift, which is as early as we can make a differentiating measurement using fluorescence intensities. Finally, I propose a new way to analyze multidimensional flow cytometry data, which may lead to a better understanding of Q/NQ cell differentiation.

    Requirements for Defining Utility Drive Cycles: An Exploratory Analysis of Grid Frequency Regulation Data for Establishing Battery Performance Testing Standards

    Battery testing procedures are important for understanding battery performance, including degradation over the life of the battery, and standards are important for providing clear rules and uniformity to an industry. The work described in this report addresses the need for standard battery testing procedures that reflect real-world applications of energy storage systems providing regulation services to grid operators. This work was motivated by the need to develop Vehicle-to-Grid (V2G) testing procedures, or V2G drive cycles. Likewise, the stationary energy storage community is equally interested in standardized testing protocols that reflect real-world grid applications for providing regulation services. As the first of several steps toward standardizing battery testing cycles, this work focused on a statistical analysis of frequency regulation signals from the Pennsylvania-New Jersey-Maryland (PJM) Interconnection, with the goal of identifying patterns in the regulation signal that are representative of the entire signal and can serve as a typical regulation data set. Results from an extensive time-series analysis are discussed and explained from both the statistical and the battery-testing perspectives. The results are then interpreted in the context of defining a small set of V2G drive cycles for standardization, and some recommendations are offered for the next steps toward standardizing testing protocols.
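    The report's statistical approach is summarized here only at a high level. One hedged illustration of the underlying idea — picking a segment of the regulation signal whose summary statistics are closest to those of the full signal — follows; the synthetic signal, window length, and choice of statistics are assumptions, not the report's method:

```python
# Illustrative only: choose a "representative" window of a regulation signal by
# matching a few window statistics to those of the entire signal.
import numpy as np

rng = np.random.default_rng(0)
signal = np.cumsum(rng.normal(size=7 * 24 * 900))       # stand-in for a week of 4-second data
signal = np.clip(signal / np.abs(signal).max(), -1, 1)  # normalized regulation command

window = 900                                            # one hour of 4-second samples
n_windows = len(signal) // window
segments = signal[: n_windows * window].reshape(n_windows, window)

def summary(x):
    """Statistics relevant to battery cycling: mean, spread, per-step throughput."""
    return np.array([x.mean(), x.std(), np.abs(np.diff(x)).mean()])

target = summary(signal)
scores = np.linalg.norm([summary(seg) - target for seg in segments], axis=1)
best = int(np.argmin(scores))
print(f"most representative hour: window {best} (score {scores[best]:.4f})")
```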

    Automated Bedform Identification—A Meta-Analysis of Current Methods and the Heterogeneity of Their Outputs

    Ongoing efforts to characterize underwater dunes have led to a considerable number of freely available tools that identify these bedforms in a (semi-)automated way. However, these tools differ with regard to their research focus and appear to produce results that are far from unequivocal. We scrutinize this impression by comparing the results of five recently published dune identification tools in a comprehensive meta-analysis. Specifically, we analyze dune populations identified in three bathymetries under diverse flow conditions and compare the resulting dune characteristics quantitatively. Beyond the impact of the underlying definitions, we show that the main heterogeneity arises from whether a secondary dune scale is considered, which has a significant influence on the statistical distributions. Based on the quantitative results, we discuss the individual strengths and limitations of each algorithm, with the aim of outlining adequate fields of application. The concerted bedform analysis and subsequent combination of results have another benefit: they yield a benchmarking data set that is inherently less biased by any individual focus and is therefore a valuable instrument for future validations. Nevertheless, it is apparent that the available tools are still very specific and that end-users would profit from their being merged into a universal and modular toolbox.
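    As a hedged illustration of the kind of quantitative comparison described (not the paper's actual analysis), the distributions of a dune characteristic reported by two tools for the same bathymetry could be compared as follows; the data below are synthetic stand-ins:

```python
# Compare dune-height distributions from two hypothetical identification tools.
# Tool B also detects a secondary (smaller) dune scale, which shifts its
# statistics; a two-sample Kolmogorov-Smirnov test quantifies the mismatch.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
heights_tool_a = rng.lognormal(mean=0.0, sigma=0.4, size=300)    # primary dunes only
heights_tool_b = np.concatenate([
    rng.lognormal(mean=0.0, sigma=0.4, size=300),                # primary dunes
    rng.lognormal(mean=-1.5, sigma=0.3, size=200),               # secondary dune scale
])

statistic, p_value = stats.ks_2samp(heights_tool_a, heights_tool_b)
print(f"KS statistic {statistic:.3f}, p-value {p_value:.3g}")
print(f"median height: {np.median(heights_tool_a):.2f} vs {np.median(heights_tool_b):.2f}")
```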

    An Unsupervised Cluster: Learning Water Customer Behavior Using Variation of Information on a Reconstructed Phase Space

    The unsupervised clustering algorithm described in this dissertation addresses the need to divide a population of water utility customers into groups based on their similarities and differences, using only the measured flow data collected by water meters. After clustering, the groups represent customers with similar consumption behavior patterns and provide insight into ‘normal’ and ‘unusual’ customer behavior patterns. This research focuses on individually metered water utility customers and includes both residential and commercial customer accounts serviced by utilities within North America. The contributions of this dissertation not only represent a novel academic work but also solve a practical problem for the utility industry. The dissertation introduces a method of agglomerative clustering that uses information-theoretic distance measures on Gaussian mixture models within a reconstructed phase space. The clustering method accommodates a utility’s limited human, financial, computational, and environmental resources. The proposed weighted variation of information distance measure for comparing Gaussian mixture models places more emphasis on behaviors whose statistical distributions are compact than on behaviors with large variation, and contributes a novel addition to existing comparison options.
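    The abstract outlines the pipeline but not its implementation. A minimal sketch of the general shape of such a pipeline — time-delay embedding of each customer's flow series, a Gaussian mixture model per customer, a pairwise distance between mixtures, and agglomerative clustering on that distance matrix — is given below; the distance here is a Monte Carlo symmetrized KL divergence used purely as a stand-in for the dissertation's weighted variation of information measure, and all data are synthetic:

```python
# Illustrative pipeline: phase-space reconstruction -> per-customer GMM ->
# pairwise distance between GMMs -> agglomerative clustering.
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def embed(series, dim=3, delay=1):
    """Reconstruct a phase space from a scalar series via time-delay embedding."""
    n = len(series) - (dim - 1) * delay
    return np.column_stack([series[i * delay : i * delay + n] for i in range(dim)])

def gmm_distance(g1, g2, n_samples=2000):
    """Symmetrized KL divergence between two fitted GMMs, estimated by sampling.
    A stand-in for the dissertation's weighted variation of information."""
    x1, _ = g1.sample(n_samples)
    x2, _ = g2.sample(n_samples)
    kl12 = np.mean(g1.score_samples(x1) - g2.score_samples(x1))
    kl21 = np.mean(g2.score_samples(x2) - g1.score_samples(x2))
    return max(0.5 * (kl12 + kl21), 0.0)

# synthetic hourly flow series for a handful of customers
rng = np.random.default_rng(0)
t = np.arange(30 * 24)
flows = [np.sin(2 * np.pi * t / 24 * freq) + 0.2 * rng.normal(size=t.size)
         for freq in (1.0, 1.0, 2.0, 2.0)]

models = [GaussianMixture(n_components=3, random_state=0).fit(embed(s)) for s in flows]
k = len(models)
dist = np.zeros((k, k))
for i in range(k):
    for j in range(i + 1, k):
        dist[i, j] = dist[j, i] = gmm_distance(models[i], models[j])

labels = fcluster(linkage(squareform(dist), method="average"), t=2, criterion="maxclust")
print(labels)   # customers with similar consumption dynamics share a label
```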

    Coevolutionary algorithms for the optimization of strategies for red teaming applications

    Red teaming (RT) is a process that assists an organization in finding vulnerabilities in a system, whereby the organization itself takes on the role of an "attacker" to test the system. It is used in various domains, including military operations. Traditionally it is a manual process with some obvious weaknesses: it is expensive, time-consuming, and limited by humans "thinking inside the box". Automated RT is an approach that has the potential to overcome these weaknesses. In this approach both the red team (enemy forces) and the blue team (friendly forces) are modelled as intelligent agents in a multi-agent system, and the idea is to run many computer simulations, pitting the plan of the red team against the plan of the blue team. This research project investigated techniques that can support automated red teaming by conducting a systematic study involving a genetic algorithm (GA), a basic coevolutionary algorithm and three variants of the coevolutionary algorithm. An initial pilot study involving the GA showed some limitations, as GAs only support the optimization of a single population at a time against a fixed strategy. In red teaming, however, it is not sufficient to consider just one, or even a few, opponent strategies, as in reality each team needs to adjust its strategy to account for the different strategies that competing teams may utilize at different points. Coevolutionary algorithms (CEAs) were identified as suitable algorithms capable of optimizing two teams simultaneously for red teaming. The subsequent investigation of CEAs examined their performance in addressing the characteristics of red teaming problems, such as intransitive relationships and multimodality, before employing them to optimize two red teaming scenarios. A number of measures were used to evaluate the performance of CEAs, and in terms of multimodality this study introduced a novel n-peak problem and a new performance measure based on the Circular Earth Mover's Distance. Results from the investigations involving an intransitive number problem, a multimodal problem and two red teaming scenarios showed that, in terms of the performance measures used, no single algorithm consistently outperforms the others across the four test problems. Applications of CEAs to the red teaming scenarios showed that all four variants produced interesting evolved strategies at the end of the optimization process, providing evidence of the potential of CEAs for future application in red teaming. The developed techniques can potentially be used for red teaming in military operations or in analysis for the protection of critical infrastructure. The benefits include the modelling of more realistic interactions between the teams, the ability to anticipate and counteract potentially new types of attacks, and a cost-effective solution.
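    The abstract describes the coevolutionary setup only at a high level. A minimal sketch of a basic competitive coevolutionary loop — two populations evolving simultaneously, each individual scored against a sample of the opposing population — is given below; the strategy encoding, the placeholder engagement payoff, and all parameter values are illustrative assumptions, not the thesis's scenarios:

```python
# Sketch of competitive coevolution: red and blue strategy populations evolve
# in lock-step, with fitness computed from simulated engagements against a
# sample of the opposing population. simulate() is a placeholder payoff.
import numpy as np

rng = np.random.default_rng(0)
STRATEGY_LEN, POP_SIZE, GENERATIONS, OPPONENT_SAMPLE = 10, 30, 100, 5

def simulate(red, blue):
    """Placeholder engagement: returns red's payoff (blue seeks to minimize it)."""
    return float(red @ blue[::-1] - 0.1 * np.sum(red ** 2))

def evaluate_red(red_pop, blue_pop):
    sample = blue_pop[rng.choice(len(blue_pop), OPPONENT_SAMPLE, replace=False)]
    return np.array([np.mean([simulate(r, b) for b in sample]) for r in red_pop])

def evaluate_blue(blue_pop, red_pop):
    sample = red_pop[rng.choice(len(red_pop), OPPONENT_SAMPLE, replace=False)]
    return np.array([-np.mean([simulate(r, b) for r in sample]) for b in blue_pop])

def evolve(pop, fitness):
    """Binary tournament selection plus Gaussian mutation."""
    children = []
    for _ in range(len(pop)):
        i, j = rng.integers(len(pop), size=2)
        parent = pop[i] if fitness[i] >= fitness[j] else pop[j]
        children.append(parent + rng.normal(scale=0.1, size=parent.shape))
    return np.array(children)

red = rng.normal(size=(POP_SIZE, STRATEGY_LEN))
blue = rng.normal(size=(POP_SIZE, STRATEGY_LEN))
for _ in range(GENERATIONS):
    red = evolve(red, evaluate_red(red, blue))
    blue = evolve(blue, evaluate_blue(blue, red))
```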