
    Interpretable statistics for complex modelling: quantile and topological learning

    As the complexity of our data has increased exponentially over the last decades, so has our need for interpretable features. This thesis revolves around two paradigms for approaching this quest for insights. In the first part we focus on parametric models, where the problem of interpretability can be seen as one of “parametrization selection”. We introduce a quantile-centric parametrization and show the advantages of our proposal in the context of regression, where it bridges the gap between classical generalized linear (mixed) models and increasingly popular quantile methods. The second part of the thesis, concerned with topological learning, tackles the problem from a non-parametric perspective. As topology can be thought of as a way of characterizing data in terms of their connectivity structure, it allows us to represent complex and possibly high-dimensional data through a few features, such as the number of connected components, loops and voids. We illustrate how the emerging branch of statistics devoted to recovering topological structures in the data, Topological Data Analysis, can be exploited for both exploratory and inferential purposes, with a special emphasis on kernels that preserve the topological information in the data. Finally, we show with an application how these two approaches can borrow strength from one another in the identification and description of brain activity through fMRI data from the ABIDE project.

    Statistical supervised learning with engineering data: a case study of low frequency noise measured on semiconductor devices

    The authors thank the Laboratory of Nanoelectronics in the Research Centre for Information and Communications Technologies (CITIC-UGR) at the University of Granada (Spain) for providing the data for the study. This work was supported in part by the Spanish Ministry of Science and Innovation through grants RTI2018-099723-B-I00 and PID2020-120217RB-I00; the Spanish Junta de Andalucia through grants B-FQM-284-UGR20 and B-CTS-184-UGR20; and the IMAG-Maria de Maeztu grant CEX2020-001105-/AEI/10.13039/501100011033. The comments from two anonymous reviewers and the Associate Editor, which helped to improve the quality of the paper, are also acknowledged.
    Our practical motivation is the analysis of potential correlations between spectral noise current and threshold voltage from common on-wafer MOSFETs. The usual strategy is to apply standard techniques based on Normal linear regression, which are easily accessible in all statistical software (both free and commercial). However, these statistical methods are not appropriate because the assumptions they rely on are not met, so more sophisticated methods are required. A new strategy is proposed, based on recent nonparametric techniques that are data-driven and thus free from questionable parametric assumptions. A backfitting algorithm accounting for random effects and nonparametric regression is designed and implemented. The nature of the correlation between threshold voltage and noise is examined by conducting a statistical test, which is based on a novel technique that summarizes all the relevant information in the data in a color map. The way the results are presented in the plot makes it easy for a non-expert in data analysis to understand the underlying structure. The good performance of the method is shown through simulations, and it is applied to a case study in a field where these modern statistical techniques are novel and prove very efficient.

    Process Monitoring and Uncertainty Quantification for Laser Powder Bed Fusion Additive Manufacturing

    Metal additive manufacturing (AM) processes such as Laser Powder-Bed Fusion (LPBF) offer new opportunities for building parts with geometries and features that traditional processes cannot match. At the same time, LPBF imposes new challenges on practitioners. These challenges include the high complexity of simulating the AM process, anisotropic mechanical properties, and the need for new monitoring methods. One part of this dissertation develops a new method for layerwise anomaly detection during LPBF. The method uses high-speed thermal imaging to capture melt pool temperature and is composed of a procedure utilizing spatial statistics and machine learning. Another part of this dissertation addresses the efficient use of computer simulation models. Simulation models are vital for accelerated development of LPBF because we can integrate multiple computer simulation models at different scales to optimize the process prior to part fabrication. This integration of computer models often happens in a hierarchical fashion, and the final model predicts the behavior of the most important Quantity of Interest (QoI). Once all the models are coupled, a system of models is created for which a formal Uncertainty Quantification (UQ) is needed to calibrate the unknown model parameters and analyze the discrepancy between the models and the real world in order to identify regions of missing physics. This dissertation presents a framework for UQ of LPBF models with the following features: (1) models have multiple outputs instead of a single output, (2) models are coupled using the input and output variables that they share, and (3) models can have partially unobservable outputs for which no experimental data are present. This work proposes using Gaussian processes (GP) and Bayesian networks (BN) as the main tools for handling UQ for a system of computer models with the aforementioned properties. For each of our methodologies, we present a case study of a specific alloy system. Experimental data are captured by additively manufacturing parts and single tracks to evaluate the proposed method. Our results show that the combination of GP and BN is a powerful and flexible tool for answering UQ problems for LPBF.

    Model Testing Based on Regression Spline

    Tests based on regression splines are developed in this chapter for testing nonparametric functions in nonparametric, partial linear and varying-coefficient models, respectively. These models are more flexible than the linear regression model. However, an important question is whether it is really necessary to use such complex models containing nonparametric functions. For this purpose, p-values for testing the linearity and constancy of the nonparametric functions are established based on regression splines and the fiducial method. In applying spline-based methods, the determination of knots is difficult but plays an important role in inferring the regression curve. In order to infer the nonparametric regression at different smoothing levels (scales) and locations, multi-scale smoothing methods based on regression splines are developed to test the structures of the regression curve and to compare multiple regression curves. This approach can sidestep the determination of knots while giving more reliable results when using spline-based methods.

    Time Series Analysis of MODIS NDVI data with Cloudy Pixels: Frequency-domain and SiZer analyses of vegetation change in Western Rwanda

    Remote sensing is a valuable source of data for the study of human ecology in rural areas. In this thesis, I attempt to analyze the presence of a long-term trend indicative of post-resettlement adaptation in the vegetation signals of Western Rwanda. There is a dearth of research utilizing medium-resolution imagery to study difficult environments, such as tropical-montane regions, where complex topography and cloud cover diminish image accuracy. I attempt to add to the extant literature on frequency-domain smoothing methods, as well as the literature on human-environment interaction in tropical-montane regions, by applying a harmonic filtering and smoothing algorithm to the ‘MOD13Q1’ 16-day composite, 250m, NDVI, MODIS imagery. To create a more robust time series, I combine Gaussian generalized additive models and discrete Fourier analysis of the residuals to impute values to a filtered time series, based on MODIS’s own pixel reliability data. These methods significantly improve the quality of the time series being analyzed, compared with the raw data or imputation of the mean signal. To control for confounding variables, I take a difference-in-differences (DD) approach (Abadie, 2005) comparing resettled regions to older regions, identified in Google Earth. Harmonic filtering and smoothing show a definite long-term trend of post-resettlement changes in the vegetation signal, demonstrated by the DD approach and analyzed in SiZer maps (Chaudhuri & Marron, 1999). Further research will be needed to determine whether this is indicative of cropping changes or other impacts of post-resettlement adaptation.

    Multiscale inference about a density

    We introduce a multiscale test statistic based on local order statistics and spacings that provides simultaneous confidence statements for the existence and location of local increases and decreases of a density or a failure rate. The procedure provides guaranteed finite-sample significance levels, is easy to implement and possesses certain asymptotic optimality and adaptivity properties. Comment: Version 2 is an extended version (Technical report 56, IMSV, Univ. Bern) which is referred to in version 3. Published at http://dx.doi.org/10.1214/07-AOS521 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).

    PS-SiZer map to investigate significant features of body-weight profile changes in HIV infected patients in the IeDEA Collaboration

    Objectives: We extend the method of Significant Zero Crossings of Derivatives (SiZer) to address within-subject correlations of repeatedly collected longitudinal biomarker data and the computational aspects of the methodology when analyzing massive biomarker databases. SiZer is a powerful visualization tool for exploring structures in curves by mapping areas where the first derivative is increasing, decreasing or does not change (plateau), thus exploring changes and normalization of biomarkers in the presence of therapy. Methods: We propose a penalized spline SiZer (PS-SiZer) which can be expressed as a linear mixed model of the longitudinal biomarker process to account for irregularly collected data and within-subject correlations. Through simulations we show how sensitive PS-SiZer is in detecting existing features in longitudinal data versus existing versions of SiZer. In a real-world data analysis, PS-SiZer maps are used to map areas where the first derivative of weight change after antiretroviral therapy (ART) start is significantly increasing, decreasing or does not change, thus exploring the durability of weight increase after the start of therapy. We use weight data repeatedly collected from persons living with HIV initiating ART in five regions in the International Epidemiologic Databases to Evaluate AIDS (IeDEA) worldwide collaboration and compare the durability of weight gain between ART regimens containing and not containing the drug stavudine (d4T), which has been associated with shorter durability of weight gain. Results: Through simulations we show that the PS-SiZer is more accurate in detecting relevant features in longitudinal data than existing SiZer variants such as the local linear smoother (LL) SiZer and the SiZer with smoothing splines (SS-SiZer). In the illustration we include data from 185,010 persons living with HIV who started ART with a d4T (53.1%) versus non-d4T (46.9%) containing regimen. The largest difference in durability of weight gain identified by the SiZer maps was observed in Southern Africa, where weight gain in patients treated with d4T-containing regimens lasted 59.9 weeks compared to 133.8 weeks for those with non-d4T-containing regimens. In the other regions, persons receiving d4T-containing regimens experienced weight gains lasting 38-62 weeks versus 55-93 weeks in those receiving non-d4T-based regimens. Discussion: PS-SiZer, a SiZer variant, can handle irregularly collected longitudinal data and within-subject correlations and is sensitive in detecting even subtle features in biomarker curves.

    Nonparametric Comparison of Multiple Regression Curves in Scale-Space

    This paper concerns testing the equality of multiple curves in a nonparametric regression context. The proposed test forms an ANOVA-type test statistic based on kernel smoothing and examines the ratio of between-group to within-group variation. The empirical distribution of the test statistic is obtained by permutation. Unlike traditional kernel smoothing approaches, the test is conducted in scale-space, so it does not require the selection of an optimal smoothing level but instead considers a wide range of scales. The proposed method also visualizes its testing results as a color map and graphically summarizes the statistical differences between curves across multiple locations and scales. A numerical study using simulated and real examples is conducted to demonstrate the finite-sample performance of the proposed method.
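The idea in the abstract above can be illustrated with a minimal sketch. It is not the paper's procedure: it assumes a Gaussian-kernel Nadaraya-Watson smoother, a simple between/within ratio, and one permutation p-value per bandwidth with no adjustment for multiple scales or locations. All function names (`nw_smooth`, `f_ratio`, `permutation_pvalues`) and the simulated data are hypothetical.

```python
import numpy as np


def nw_smooth(x, y, grid, h):
    """Nadaraya-Watson estimate of E[y | x] on `grid` with a Gaussian kernel."""
    w = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)


def f_ratio(x, y, groups, grid, h):
    """Between-group over within-group variation at a single bandwidth h."""
    pooled = nw_smooth(x, y, grid, h)  # fit that ignores group labels
    between, within = 0.0, 0.0
    for g in np.unique(groups):
        m = groups == g
        fit_g = nw_smooth(x[m], y[m], grid, h)
        # Distance of the group curve from the pooled curve, weighted by size.
        between += m.sum() * np.mean((fit_g - pooled) ** 2)
        # Residuals of the group's observations around its own curve.
        resid = y[m] - np.interp(x[m], grid, fit_g)
        within += np.sum(resid ** 2)
    return between / within


def permutation_pvalues(x, y, groups, bandwidths, n_perm=200, seed=0):
    """One permutation p-value per bandwidth (assumed, unadjusted version)."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(x.min(), x.max(), 50)
    observed = np.array([f_ratio(x, y, groups, grid, h) for h in bandwidths])
    exceed = np.zeros(len(bandwidths))
    for _ in range(n_perm):
        shuffled = rng.permutation(groups)  # break the link between label and curve
        stats = np.array([f_ratio(x, y, shuffled, grid, h) for h in bandwidths])
        exceed += stats >= observed
    return (exceed + 1) / (n_perm + 1)


# Simulated example: two groups sharing a sine curve, the second shifted by 1.
rng = np.random.default_rng(1)
x = np.tile(np.linspace(0.0, 1.0, 40), 2)
groups = np.repeat([0, 1], 40)
y = np.sin(2 * np.pi * x) + 1.0 * groups + rng.normal(0.0, 0.1, x.size)
bandwidths = np.array([0.05, 0.1, 0.2, 0.4])
pvals = permutation_pvalues(x, y, groups, bandwidths)
print(dict(zip(bandwidths.tolist(), pvals.round(3).tolist())))
```

Evaluating the statistic over a range of bandwidths, rather than at one chosen value, is what stands in here for the scale-space view; a color map over locations and scales would additionally require a statistic computed per grid point.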