207 research outputs found

    Approximate Computation and Implicit Regularization for Very Large-scale Data Analysis

    Database theory and database practice are typically the domain of computer scientists who adopt what may be termed an algorithmic perspective on their data. This perspective is very different from the more statistical perspective adopted by statisticians, scientific computers, machine learners, and others who work on what may be broadly termed statistical data analysis. In this article, I will address fundamental aspects of this algorithmic-statistical disconnect, with an eye to bridging the gap between these two very different approaches. A concept that lies at the heart of this disconnect is that of statistical regularization, a notion that has to do with how robust the output of an algorithm is to the noise properties of the input data. Although it is nearly completely absent from computer science, which historically has taken the input data as given and modeled algorithms discretely, regularization in one form or another is central to nearly every application domain that applies algorithms to noisy data. By using several case studies, I will illustrate, both theoretically and empirically, the nonobvious fact that approximate computation, in and of itself, can implicitly lead to statistical regularization. This and other recent work suggests that, by exploiting in a more principled way the statistical properties implicit in worst-case algorithms, one can in many cases satisfy the bicriteria of having algorithms that are scalable to very large-scale databases and that also have good inferential or predictive properties. Comment: To appear in the Proceedings of the 2012 ACM Symposium on Principles of Database Systems (PODS 2012)

    Application of nearly linear solvers to electric power system computation

    To meet the future needs of the electric power system, improvements need to be made in the areas of power system algorithms, simulation, and modeling, specifically to achieve a time frame that is useful to industry. If power system time-domain simulations could run in real-time, then system operators would have situational awareness to implement and avoid cascading failures, significantly improving power system reliability. Several power system applications rely on the solution of a very large linear system. As the demands on power systems continue to grow, there is a greater computational complexity involved in solving these large linear systems within reasonable time. This project expands on the current work in fast linear solvers, developed for solving symmetric and diagonally dominant linear systems, in order to produce power system specific methods that can be solved in nearly-linear run times. The work explores a new theoretical method that is based on ideas in graph theory and combinatorics. The technique builds a chain of progressively smaller approximate systems with preconditioners based on the system's low stretch spanning tree. The method is compared to traditional linear solvers and shown to reduce the time and iterations required for an accurate solution, especially as the system size increases. A simulation validation is performed, comparing the solution capabilities of the chain method to LU factorization, which is the standard linear solver for power flow. The chain method was successfully demonstrated to produce accurate solutions for power flow simulation on a number of IEEE test cases, and a discussion on how to further improve the method's speed and accuracy is included. --Abstract, page iv

    Techniques of High Performance Reservoir Simulation for Unconventional Challenges

    The quest to improve the performance of reservoir simulators has been evolving with the newly encountered challenges of modeling more complex recovery mechanisms and related phenomena. Reservoir subsidence, fracturing, fault reactivation, etc. require coupled flow and poroelastic simulation. These features, in turn, place a heavy burden on linear solvers. The booming unconventional plays such as shale/tight oil in North America demand reservoir simulation techniques that handle more physics (or more hypotheses). This dissertation deals with three aspects of improving the performance of reservoir simulation toward these unconventional challenges. Compositional simulation is often required for reservoir studies with complex recovery mechanisms such as gas injection. However, it is time consuming and its parallelization often suffers from severe load imbalance problems. In the first section, a novel approach based on domain over-decomposition is investigated and implemented to improve the parallel performance of compositional simulation. For a realistic reservoir case, it is shown that the speedup is improved from 29.27 to 62.38 on 64 processors using this technique. Another critical part that determines the performance of a reservoir simulator is the linear solver. In the second section, a new type of linear solver based on the combinatorial multilevel method (CML) is introduced and investigated for several reservoir simulation applications. The results show that CML has better empirical scalability and performance and is well-suited for coupled poroelastic problems. These results also suggest that CML might be a promising preconditioning approach for flow simulation with and without coupled poroelastic calculations. In order to handle unconventional petroleum fluid properties for tight oil, the third section incorporates a simulator with extended vapor-liquid equilibrium calculations to consider the capillarity effect caused by the dynamic nanopore properties. The enhanced simulator can correctly capture the pressure-dependent impact of the nanopore on rock and fluid properties. It is shown that including these enhanced physics in simulation will lead to significant improvements in field operation decision-making and greatly enhance the reliability of recovery predictions.

    Doctor of Philosophy

    Network emulation has become an indispensable tool for the conduct of research in networking and distributed systems. It offers more realism than simulation and more control and repeatability than experimentation on a live network. However, emulation testbeds face a number of challenges, most prominently realism and scale. Because emulation allows the creation of arbitrary networks exhibiting a wide range of conditions, there is no guarantee that emulated topologies reflect real networks; the burden of selecting parameters to create a realistic environment is on the experimenter. While there are a number of techniques for measuring the end-to-end properties of real networks, directly importing such properties into an emulation has been a challenge. Similarly, while there exist numerous models for creating realistic network topologies, the lack of addresses on these generated topologies has been a barrier to using them in emulators. Once an experimenter obtains a suitable topology, that topology must be mapped onto the physical resources of the testbed so that it can be instantiated. A number of restrictions make this an interesting problem: testbeds typically have heterogeneous hardware, scarce resources which must be conserved, and bottlenecks that must not be overused. User requests for particular types of nodes or links must also be met. In light of these constraints, the network testbed mapping problem is NP-hard. Though the complexity of the problem increases rapidly with the size of the experimenter's topology and the size of the physical network, the runtime of the mapper must not; long mapping times can hinder the usability of the testbed. This dissertation makes three contributions towards improving realism and scale in emulation testbeds. First, it meets the need for realistic network conditions by creating Flexlab, a hybrid environment that couples an emulation testbed with a live-network testbed, inheriting strengths from each. Second, it attends to the need for realistic topologies by presenting a set of algorithms for automatically annotating generated topologies with realistic IP addresses. Third, it presents a mapper, assign, that is capable of assigning experimenters' requested topologies to testbeds' physical resources in a manner that scales well enough to handle large environments.

    Proceedings of the 17th Cologne-Twente Workshop on Graphs and Combinatorial Optimization


    Variational methods and its applications to computer vision

    Many computer vision applications such as image segmentation can be formulated in a "variational" way as energy minimization problems. Unfortunately, the computational task of minimizing these energies is usually difficult, as it generally involves nonconvex functions in a space with thousands of dimensions, and the associated combinatorial problems are often NP-hard to solve. Furthermore, they are ill-posed inverse problems and are therefore extremely sensitive to perturbations (e.g. noise). For this reason, in order to compute a physically reliable approximation from given noisy data, it is necessary to incorporate into the mathematical model appropriate regularizations that require complex computations. The main aim of this work is to describe variational segmentation methods that are particularly effective for curvilinear structures. Due to the complex geometry of these structures, classical regularization techniques cannot be adopted because they lead to the loss of most low-contrast details. In contrast, the proposed method not only better preserves curvilinear structures, but also reconnects some parts that may have been disconnected by noise. Moreover, it is easily extensible to graphs and has been successfully applied to different types of data such as medical imagery (e.g. vessels, heart coronaries, etc.), material samples (e.g. concrete) and satellite signals (e.g. streets, rivers, etc.). In particular, we will show results and performance for an implementation targeting a new generation of High Performance Computing (HPC) architectures where different types of coprocessors cooperate. The dataset involved consists of approximately 200 images of cracks, captured in three different tunnels by a robotic machine designed for the European ROBO-SPECT project. Open Access

    Efficient Network Domination for Life Science Applications

    With the ever-increasing size of data available to researchers, traditional methods of analysis often cannot scale to match the problems being studied. Often only a subset of variables may be utilized or studied further, motivating the need for techniques that can prioritize variable selection. This dissertation describes the development and application of graph theoretic techniques, particularly the notion of domination, for this purpose. In the first part of this dissertation, algorithms for vertex prioritization in the field of network controllability are studied. Here, the number of solutions to which a vertex belongs is used to classify said vertex and determine its suitability in controlling a network. Novel efficient scalable algorithms are developed and analyzed. Empirical tests demonstrate the improvement of these algorithms over those already established in the literature. The second part of this dissertation concerns the prioritization of genes for loss-of-function allele studies in mice. The International Mouse Phenotyping Consortium leads the initiative to develop a loss-of-function allele for each protein coding gene in the mouse genome. Only a small proportion of untested genes can be selected for further study. To address the need to prioritize genes, a generalizable data science strategy is developed. This strategy models genes as a gene-similarity graph, and from it selects a subset that will be further characterized. Empirical tests demonstrate the method’s utility over that of pseudorandom selection and less computationally demanding methods. Finally, part three addresses the important task of preprocessing in the context of noisy public health data. Many public health databases have been developed to collect, curate, and store a variety of environmental measurements. Idiosyncrasies in these measurements, however, introduce noise to data found in these databases in several ways, including missing, incorrect, outlying, and incompatible data. Beyond noisy data, multiple measurements of similar variables can introduce problems of multicollinearity. Domination is again employed in a novel graph method to handle autocorrelation. Empirical results using the Public Health Exposome dataset are reported. Together these three parts demonstrate the utility of subset selection via domination when applied to a multitude of data sources from a variety of disciplines in the life sciences.
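
To make the notion of domination concrete: a dominating set is a set of vertices such that every vertex is either in the set or adjacent to a member of it. The following is a minimal Python sketch of a standard greedy heuristic for picking such a set. The toy graph, its vertex names, and the function `greedy_dominating_set` are hypothetical and chosen only for illustration; the abstract does not specify the dissertation's algorithms, which may differ substantially.

```python
def greedy_dominating_set(adjacency):
    """Return a dominating set chosen by a greedy heuristic.

    adjacency maps each vertex to the set of its neighbours
    (undirected graph, so the relation is assumed symmetric).
    """
    undominated = set(adjacency)
    chosen = []
    while undominated:
        # Pick the vertex whose closed neighbourhood covers the most
        # still-undominated vertices; ties go to the smallest vertex name.
        best = max(
            sorted(adjacency),
            key=lambda v: len(({v} | adjacency[v]) & undominated),
        )
        chosen.append(best)
        undominated -= {best} | adjacency[best]
    return chosen


# Hypothetical toy graph: vertices stand for variables (or genes),
# edges join pairs that are considered similar.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
selected = greedy_dominating_set(graph)
print(selected)
```

The greedy rule does not guarantee a minimum dominating set; it is shown here only because it is short and makes the subset-selection idea easy to follow.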

    Development of Human Body CAD Models and Related Mesh Processing Algorithms with Applications in Bioelectromagnetics

    Simulation of the electromagnetic response of the human body relies heavily upon efficient computational CAD models or phantoms. The Visible Human Project (VHP)-Female v. 3.1, a new platform-independent full-body electromagnetic computational model, is presented. This is part of a significant international initiative to develop powerful computational models representing the human body. This model’s unique feature is full compatibility both with MATLAB and with specialized FEM computational software packages such as ANSYS HFSS/Maxwell 3D and CST MWS. Various mesh processing algorithms, such as an automatic intersection resolver and Boolean operations on meshes, used for the development of the Visible Human Project (VHP)-Female are presented. The VHP-Female CAD model is applied to two specific low frequency applications: Transcranial Magnetic Stimulation (TMS) and Transcranial Direct Current Stimulation (tDCS). TMS and tDCS are increasingly used as diagnostic and therapeutic tools for numerous neuropsychiatric disorders. The development of a CAD model based on an existing voxel model of a Japanese pregnant woman is also presented. TMS for treatment of depression is an appealing alternative to drugs which are teratogenic for pregnant women. This CAD model was used to study fetal wellbeing during induced peak currents by TMS in two possible scenarios: (i) pregnant woman as a patient; and (ii) pregnant woman as an operator. An insight into future work and potential areas of research such as a deformable phantom, implants, and RF applications will be presented.

    Faster Randomized Interior Point Methods for Tall/Wide Linear Programs

    Linear programming (LP) is an extremely useful tool which has been successfully applied to solve various problems in a wide range of areas, including operations research, engineering, economics, or even more abstract mathematical areas such as combinatorics. It is also used in many machine learning applications, such as $\ell_1$-regularized SVMs, basis pursuit, nonnegative matrix factorization, etc. Interior Point Methods (IPMs) are one of the most popular methods to solve LPs both in theory and in practice. Their underlying complexity is dominated by the cost of solving a system of linear equations at each iteration. In this paper, we consider both feasible and infeasible IPMs for the special case where the number of variables is much larger than the number of constraints. Using tools from Randomized Linear Algebra, we present a preconditioning technique that, when combined with iterative solvers such as Conjugate Gradient or Chebyshev Iteration, provably guarantees that IPM algorithms (suitably modified to account for the error incurred by the approximate solver) converge to a feasible, approximately optimal solution, without increasing their iteration complexity. Our empirical evaluations verify our theoretical results on both real-world and synthetic data. Comment: Extended version of the NeurIPS 2020 submission. arXiv admin note: substantial text overlap with arXiv:2003.0807
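
As a rough illustration of the "preconditioner plus iterative solver" pattern mentioned in this abstract, the sketch below runs preconditioned Conjugate Gradient on a tiny system of the form (A D Aᵀ) x = b with a wide matrix A. Several things here are assumptions rather than the paper's method: the normal-equations form is taken as the kind of system an IPM iteration solves, the diagonal (Jacobi) preconditioner is a simple stand-in for the randomized preconditioner the abstract refers to, and the matrix `A`, weights `d`, and right-hand side `b` are made-up values.

```python
def preconditioned_cg(matvec, b, apply_preconditioner, tol=1e-10, max_iter=100):
    """Solve M x = b for a symmetric positive definite M by preconditioned CG.

    matvec(v) returns M v; apply_preconditioner(r) returns an approximation
    of M^{-1} r. Returns the solution and the number of iterations used.
    """
    x = [0.0] * len(b)
    r = list(b)  # residual b - M x for the starting guess x = 0
    z = apply_preconditioner(r)
    p = list(z)
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for iteration in range(max_iter):
        if sum(ri * ri for ri in r) ** 0.5 <= tol:
            return x, iteration
        mp = matvec(p)
        alpha = rz / sum(pi * mi for pi, mi in zip(p, mp))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * mi for ri, mi in zip(r, mp)]
        z = apply_preconditioner(r)
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Hypothetical wide system: 2 constraints, 4 variables, positive weights d.
A = [[1.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 3.0, 1.0]]
d = [1.0, 0.5, 2.0, 4.0]
b = [1.0, 2.0]
m, n = len(A), len(d)


def normal_matvec(v):
    """Compute (A D A^T) v without forming the matrix explicitly."""
    w = [d[j] * sum(A[i][j] * v[i] for i in range(m)) for j in range(n)]
    return [sum(A[i][j] * w[j] for j in range(n)) for i in range(m)]


# Jacobi preconditioner: divide by the diagonal of A D A^T (a stand-in only).
diagonal = [sum(d[j] * A[i][j] ** 2 for j in range(n)) for i in range(m)]


def jacobi(r):
    return [ri / di for ri, di in zip(r, diagonal)]


solution, iterations = preconditioned_cg(normal_matvec, b, jacobi)
residual = [bi - yi for bi, yi in zip(b, normal_matvec(solution))]
print(solution, iterations, residual)
```

A better preconditioner leaves the loop unchanged and only replaces `apply_preconditioner`, which is why this pattern separates the solver from the preconditioning step.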