
    Multi-Resolution Functional ANOVA for Large-Scale, Many-Input Computer Experiments

    The Gaussian process is a standard tool for building emulators for both deterministic and stochastic computer experiments. However, application of Gaussian process models is greatly limited in practice, particularly for the large-scale, many-input computer experiments that have become typical. We propose a multi-resolution functional ANOVA model as a computationally feasible emulation alternative. More generally, this model can be used for large-scale, many-input non-linear regression problems. An overlapping group lasso approach is used for estimation, ensuring computational feasibility in a large-scale, many-input setting. New results on consistency and inference for the (potentially overlapping) group lasso in a high-dimensional setting are developed and applied to the proposed multi-resolution functional ANOVA model. Importantly, these results allow us to quantify the uncertainty in our predictions. Numerical examples demonstrate that the proposed model enjoys marked computational advantages. Its data capabilities, in terms of both sample size and dimension, meet or exceed those of the best available emulation tools, while matching or exceeding their emulation accuracy.
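    The estimation step described above reduces to a group lasso fit over a multi-resolution basis. As a rough illustration of that building block, here is a minimal numpy sketch of group-lasso estimation by proximal gradient descent for non-overlapping groups; the overlapping case used in the paper needs a duplicated-variable (latent group lasso) reformulation, and nothing here reproduces the authors' implementation.

```python
import numpy as np

def group_lasso(X, y, groups, lam, n_iter=500):
    """Minimize (1/2n) ||y - X b||^2 + lam * sum_g ||b_g||_2 by ISTA.

    groups: list of disjoint index arrays, one per group (a simplification;
    the paper's groups may overlap)."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2       # 1 / Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n        # gradient of the least-squares loss
        z = beta - step * grad                 # plain gradient step
        beta = z.copy()
        for g in groups:                       # block soft-thresholding (prox step)
            norm_g = np.linalg.norm(z[g])
            shrink = max(0.0, 1 - step * lam / norm_g) if norm_g > 0 else 0.0
            beta[g] = z[g] * shrink
    return beta
```

    In the functional ANOVA setting one would let each group index the basis functions of a single main-effect or interaction term at a given resolution level, so that whole terms are selected in or out together (an assumption about the wiring, not the paper's code).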

    BClean: A Bayesian Data Cleaning System

    There is a considerable body of work on data cleaning which employs various principles to rectify erroneous data and transform a dirty dataset into a cleaner one. One of the prevalent approaches is probabilistic methods, including Bayesian methods. However, existing probabilistic methods often assume a simplistic distribution (e.g., a Gaussian distribution), which frequently underfits in practice, or they require experts to provide a complex prior distribution (e.g., via a programming language). This requirement is both labor-intensive and costly, rendering these methods less suitable for real-world applications. In this paper, we propose BClean, a Bayesian Cleaning system that features automatic Bayesian network construction and user interaction. We recast the data cleaning problem as Bayesian inference that fully exploits the relationships between attributes in the observed dataset and any prior information provided by users. To this end, we present an automatic Bayesian network construction method that extends a structure-learning-based functional dependency discovery method with similarity functions to capture the relationships between attributes. Furthermore, our system allows users to modify the generated Bayesian network in order to specify prior information or correct inaccuracies identified by the automatic generation process. We also design an effective scoring model (called the compensative scoring model) for the Bayesian inference. To enhance the efficiency of data cleaning, we propose several approximation strategies for the Bayesian inference, including graph partitioning, domain pruning, and pre-detection. Evaluating on both real-world and synthetic datasets, we demonstrate that BClean achieves an F-measure of up to 0.9 in data cleaning, outperforming existing Bayesian methods by 2% and other data cleaning methods by 15%.
    Comment: Our source code is available at https://github.com/yyssl88/BClea
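    As a loose illustration of the Bayesian view of cleaning (my own toy example, not BClean's compensative scoring model or its Bayesian network construction), the snippet below scores candidate repairs for a dirty cell with a smoothed conditional probability estimated from the observed data and keeps the highest-scoring value:

```python
from collections import Counter, defaultdict

# Toy relation with one dirty cell in the "country" attribute.
rows = [
    {"city": "Oslo",   "country": "Norway"},
    {"city": "Oslo",   "country": "Norway"},
    {"city": "Oslo",   "country": "Norwy"},     # dirty cell
    {"city": "Bergen", "country": "Norway"},
]

# Estimate P(country | city) from the data with add-one smoothing.
counts = defaultdict(Counter)
for r in rows:
    counts[r["city"]][r["country"]] += 1

def posterior(city, candidates):
    c = counts[city]
    total = sum(c.values()) + len(candidates)
    return {v: (c[v] + 1) / total for v in candidates}

# Repair the dirty cell: pick the candidate with the highest posterior score.
cands = {"Norway", "Norwy"}
scores = posterior("Oslo", cands)
print(max(scores, key=scores.get), scores)      # -> "Norway"
```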

    Analysis of cuttings concentration experimental data using exploratory data analysis

    Cuttings transportation is a complex phenomenon involving many interacting variables. Experimental investigations on cuttings transport have been carried out by different research groups for decades, and the varying findings reported point to the need for a methodical data analysis approach. In the current paper, six experimental datasets (702 observations) are analyzed using exploratory data analysis (EDA) in a two-fold manner: univariate and multivariate analysis. Univariate analysis shows the asymmetry in the distribution of each experimental parameter, indicating the need for a nonparametric modeling approach. Multivariate analysis shows the interaction of the experimental parameters among themselves and their influence on downhole cuttings concentration (Cc) using 6D scatter plots and correlation coefficients (Kendall's τ). EDA of the current experimental data reveals the following major findings:
    • Smaller Cc in concentric vertical wells compared to concentric non-vertical wells.
    • Drilling fluid flow rate is a dominant operational parameter in vertical wellbore cleaning, while string rotation (RPM) is dominant in non-vertical wellbore cleaning.
    • Little impact of RPM in concentric vertical well and negative-eccentric deviated/highly deviated well cleaning. However, RPM together with drilling fluid flow rate provides better cleaning of non-vertical wells with positive eccentricity.
    • RPM has a higher influence on cuttings transport in a narrow annulus compared to a wide annulus.
    • Assuming a drilling fluid of sufficient viscosity and drill string rotation are present, a low-viscosity fluid under turbulent flow and a high-viscosity fluid under laminar flow provide better hole cleaning. Further, Kendall's τ indicates that apparent viscosity plays a more significant role in cleaning deviated wellbores compared to other inclinations for the current dataset.
    • Drilling fluid flow rate influences the transport of heavier and larger cuttings more, while RPM has a higher influence on the transport of lighter and smaller cuttings.
    • Better hole cleaning by heavier drilling fluids than by lighter fluids.
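    A small sketch of the two EDA ingredients named above: skewness per parameter (the asymmetry check) and Kendall's τ of each operational parameter against Cc. The file name and column names are assumptions for illustration, not the paper's actual dataset.

```python
import pandas as pd
from scipy.stats import kendalltau, skew

df = pd.read_csv("cuttings_experiments.csv")     # assumed file with the 702 observations
params = ["flow_rate", "rpm", "apparent_viscosity", "eccentricity", "inclination"]

# Univariate step: skewness of each parameter and of Cc (asymmetry suggests
# nonparametric, rank-based tools rather than Pearson correlation).
print(df[params + ["Cc"]].apply(skew))

# Multivariate step: rank correlation of each parameter with Cc.
for p in params:
    tau, pval = kendalltau(df[p], df["Cc"])      # no normality assumption needed
    print(f"{p:>20s}: tau = {tau:+.2f} (p = {pval:.3f})")
```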

    Exact Bayesian Inference on Discrete Models via Probability Generating Functions: A Probabilistic Programming Approach

    We present an exact Bayesian inference method for discrete statistical models, which can find exact solutions to many discrete inference problems, even with infinite support and continuous priors. To express such models, we introduce a probabilistic programming language that supports discrete and continuous sampling, discrete observations, affine functions, (stochastic) branching, and conditioning on events. Our key tool is probability generating functions: they provide a compact closed-form representation of distributions that are definable by programs, thus enabling the exact computation of posterior probabilities, expectation, variance, and higher moments. Our inference method is provably correct, fully automated, and uses automatic differentiation (specifically, Taylor polynomials), but does not require computer algebra. Our experiments show that its performance on a range of real-world examples is competitive with approximate Monte Carlo methods, while avoiding approximation errors.
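    To make the generating-function idea concrete, here is a toy posterior computation on a model of the kind the abstract describes: a Poisson prior, a binomial-thinning observation, and conditioning on a discrete observation. The model and parameter values are my own example, and it uses sympy for brevity, whereas the paper's method works with Taylor polynomials and automatic differentiation rather than computer algebra.

```python
import sympy as sp

# Prior X ~ Poisson(lam); observation Y | X ~ Binomial(X, p); condition on Y = k.
# Joint PGF: G(t, s) = E[t^X s^Y] = exp(lam * (t * (1 - p + p*s) - 1)).
t, s = sp.symbols('t s', nonnegative=True)
lam, p, k = sp.Rational(3), sp.Rational(1, 4), 2

G = sp.exp(lam * (t * (1 - p + p * s) - 1))
coeff = sp.diff(G, s, k).subs(s, 0) / sp.factorial(k)   # coefficient of s^k in G(t, s)
posterior_pgf = sp.simplify(coeff / coeff.subs(t, 1))   # normalize so it is a PGF in t

# Posterior moments fall out of derivatives of the PGF at t = 1.
mean = sp.diff(posterior_pgf, t).subs(t, 1)
second_factorial_moment = sp.diff(posterior_pgf, t, 2).subs(t, 1)
var = sp.simplify(second_factorial_moment + mean - mean**2)
print(posterior_pgf, mean, var)   # posterior of X is 2 + Poisson(9/4): mean 17/4, variance 9/4
```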

    CLADAG 2021 BOOK OF ABSTRACTS AND SHORT PAPERS

    The book collects the short papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS). The meeting was organized by the Department of Statistics, Computer Science and Applications of the University of Florence, under the auspices of the Italian Statistical Society and the International Federation of Classification Societies (IFCS). CLADAG is a member of the IFCS, a federation of national, regional, and linguistically-based classification societies. It is a non-profit, non-political scientific organization, whose aims are to further classification research.

    Digital Learning in the Wild: Re-Imagining New Ruralism, Digital Equity, and Deficit Discourses through the Thirdspace

    Digital media is becoming increasingly important to learning in today’s changing times. At the same time, digital technologies and related digital skills are unevenly distributed. Further, deficit-based notions of this digital divide define the public’s educational paradigm. Against this backdrop, I forayed into the social reality of one rural Americana to examine digital learning in the wild. The larger purpose of this dissertation was to spatialize understandings of rural life and pervasive social ills therein, in order to rethink digital equity, such that we dismantle deficit thinking, problematize new ruralism, and re-imagine more just rural geographies. Under a Thirdspace understanding of space as dynamic, relational, and agentive (Soja, 1996), I examined how digital learning is caught up spatially to position the rural struggle over geography amid the ‘Right to the City’ rhetoric (Lefebvre, 1968). In response to this limiting and urban-centric rhetoric, I contest digital inequity as a spatial issue of justice in rural areas. After exploring how digital learning opportunities are distributed at state and local levels, I geo-ethnographically explored digital use to story how families across socio-economic spaces were utilizing digital tools. Last, because ineffective and deficit-based models of understanding erupt from blaming the oppressed for their own self-made oppression, or framing problems (e.g., digital inequity) as solely human-centered, I drew in posthumanist Latourian (2005) social cartographies of Thirdspace. From this, I re-imagined educational equity within rural space to recast digital equity not in terms of the “haves and have nots” but as an account of mutually transformative socio-technical agency. Last, I pay the price of criticism by suggesting possible actions and solutions to the social ills denounced throughout this dissertation.

    Untangling hotel industry’s inefficiency: An SFA approach applied to a renowned Portuguese hotel chain

    The present paper explores the technical efficiency of four hotels from the Teixeira Duarte Group, a renowned Portuguese hotel chain. An efficiency ranking of these four hotel units located in Portugal is established using Stochastic Frontier Analysis. This methodology makes it possible to discriminate between measurement error and systematic inefficiencies in the estimation process, enabling investigation of the main causes of inefficiency. Several suggestions concerning efficiency improvement are offered for each hotel studied.
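    For readers unfamiliar with the method, below is a self-contained sketch of the normal/half-normal stochastic frontier model fitted by maximum likelihood. The simulated data and parameter values are illustrative assumptions only; this is not the paper's specification or the hotels' data.

```python
# Stochastic frontier: y = b0 + b1*x + v - u, with symmetric noise
# v ~ N(0, sigma_v^2) and one-sided inefficiency u ~ half-normal(sigma_u^2).
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1, 10, n)                      # e.g. log of a hotel input (labour, rooms)
u = np.abs(rng.normal(0, 0.3, n))              # inefficiency term
v = rng.normal(0, 0.2, n)                      # measurement error
y = 1.0 + 0.8 * x + v - u                      # observed (log) output

def neg_loglik(theta):
    b0, b1, log_sv, log_su = theta
    sv, su = np.exp(log_sv), np.exp(log_su)
    sigma = np.sqrt(sv**2 + su**2)
    lam = su / sv
    eps = y - b0 - b1 * x                      # composed error v - u
    # Aigner-Lovell-Schmidt density of eps for the normal/half-normal model
    ll = (np.log(2) - np.log(sigma) + norm.logpdf(eps / sigma)
          + norm.logcdf(-eps * lam / sigma))
    return -ll.sum()

fit = minimize(neg_loglik, x0=[0.0, 0.5, -1.0, -1.0], method="Nelder-Mead")
print(fit.x)   # estimates of b0, b1, log sigma_v, log sigma_u
```

    Separating sigma_v (noise) from sigma_u (inefficiency) is what lets the analysis attribute shortfalls from the frontier to systematic inefficiency rather than measurement error.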

    Variable selection and sensitivity analysis using dynamic trees, with an application to computer code performance tuning

    We investigate an application in the automatic tuning of computer codes, an area of research that has come to prominence alongside the recent rise of distributed scientific processing and heterogeneity in high-performance computing environments. Here, the response function is nonlinear and noisy and may not be smooth or stationary. Clearly needed are variable selection, decomposition of influence, and analysis of main and secondary effects for both real-valued and binary inputs and outputs. Our contribution is a novel set of tools for variable selection and sensitivity analysis based on the recently proposed dynamic tree model. We argue that this approach is uniquely well suited to the demands of our motivating example. In illustrations on benchmark data sets, we show that the new techniques are faster and offer richer feature sets than do similar approaches in the static tree and computer experiment literature. We apply the methods in code-tuning optimization, examination of a cold-cache effect, and detection of transformation errors.
    Comment: Published at http://dx.doi.org/10.1214/12-AOAS590 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
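    As a rough stand-in that conveys the variable-selection goal on this kind of response (an analogue I am substituting, not the authors' dynamic tree method), a static tree ensemble with permutation importance already separates active from inert inputs on a noisy, non-smooth function:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.uniform(size=(1000, 6))                        # six tuning parameters
# Non-smooth, noisy response: only the first two inputs matter.
y = np.where(X[:, 0] > 0.5, 2.0, 0.0) + X[:, 1] ** 2 + 0.1 * rng.normal(size=1000)

surrogate = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
imp = permutation_importance(surrogate, X, y, n_repeats=10, random_state=0)
print(imp.importances_mean.round(3))                   # large only for inputs 0 and 1
```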