
    A computational framework for data-driven infrastructure engineering using advanced statistical learning, prediction, and curing

    Over the past few decades, data-driven research has become a promising next-generation research paradigm in most science and engineering fields, driven by noticeable advances in computing power and the accumulation of valuable databases. Despite this progress, the leveraging of these databases is still in its infancy. To address this issue, this dissertation carries out the following studies using advanced statistical methods. The first study develops a computational framework for collecting and transforming data from heterogeneous Federal Aviation Administration databases and builds a flexible predictive model using a generalized additive model (GAM) to predict runway incursions (RIs) over 15 years at the top 36 major US airports. Results show that GAM is a powerful method for RI prediction, with high prediction accuracy. A direct search for the best predictor variables appears superior to a variable selection approach based on principal component analysis, and the predictive power of GAM is comparable to that of an artificial neural network (ANN). The second study builds an accurate predictive model from earthquake engineering databases. As in the previous study, GAM is adopted as the predictive model, and the results show its promising predictive power when applied to existing reinforced concrete shear wall databases. The third study proposes an efficient predictor variable selection method and quantifies the relative importance of predictor variables using field survey pavement data and simulated airport pavement data. Results show that the direct search method always finds the best predictor model, but its runtime grows with the size of the data and the dimensionality of the variables. The results also indicate that not all variables are necessary for the best prediction and identify the relative importance of the variables selected for the GAM model. The fourth study examines the impact of fractional hot-deck imputation (FHDI) on statistical and machine learning prediction using practical engineering databases. Multiple response rates and internal parameters (i.e., category number and donor number) are investigated with respect to the behavior and impact of FHDI on prediction models; GAM, ANN, support vector machines, and extremely randomized trees are adopted as predictive models. Results show that FHDI has a positive impact on prediction for engineering databases, and optimal internal parameters are suggested to achieve better prediction accuracy. The last study offers a systematic computational framework, including data collection, transformation, and squashing, to develop a prediction model for the structural behavior of a target bridge. Missing values in the bridge data are cured with the FHDI method to avoid inaccurate analysis due to bias and sparseness in the data, and results show that applying FHDI improves prediction performance. This dissertation is expected to provide a notable computational framework for data processing, suggest a seamless data curing method, and offer an advanced statistical predictive model based on multiple projects. This research approach will help researchers investigate their databases with a better understanding and build statistical models with high accuracy informed by their knowledge of the data.
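    As a concrete illustration of the kind of GAM fit described above, the following is a minimal sketch in Python using the pygam library; the library choice, the synthetic data, and the placeholder predictors are assumptions for illustration only, not the dissertation's actual FAA variables or tooling.

```python
# Minimal sketch of fitting a generalized additive model (GAM) to tabular
# engineering data, assuming the pygam library. The predictors and the
# target below are illustrative placeholders, not the dissertation's
# actual runway-incursion variables.
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(0)

# Hypothetical airport-level predictors (stand-ins for the real FAA data).
X = rng.normal(size=(200, 3))
y = 2.0 * np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=200)

# One smooth term per predictor; grid search over smoothing penalties.
gam = LinearGAM(s(0) + s(1) + s(2)).gridsearch(X, y)

gam.summary()            # per-term effective degrees of freedom, p-values, etc.
y_hat = gam.predict(X)   # in-sample predictions
```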

    A machine-learning approach to modeling picophytoplankton abundances in the South China Sea

    Picophytoplankton, the smallest phytoplankton (<3 µm), contribute significantly to primary production in the oligotrophic South China Sea. To improve our ability to predict picophytoplankton abundances in the South China Sea and infer the underlying mechanisms, we compared four machine learning algorithms for estimating the horizontal and vertical distributions of picophytoplankton abundances. The inputs to the algorithms include spatiotemporal variables (longitude, latitude, sampling depth, and date) and environmental variables (sea surface temperature, chlorophyll, and light). The algorithms were fit to a dataset of 2442 samples collected from 2006 to 2012. We find that Boosted Regression Trees (BRT) give the best prediction performance, with R2 ranging from 77% to 85% for Chl a concentration and the abundances of three picophytoplankton groups. The model outputs confirm that temperature and light play important roles in shaping picophytoplankton distributions. Prochlorococcus, Synechococcus, and picoeukaryotes show decreasing preference for oligotrophy. These insights are reflected in the vertical patterns of Chl a and picoeukaryotes, which form subsurface maximum layers in summer and spring, contrasting with Prochlorococcus and Synechococcus, which are most abundant at the surface. Our forecasts suggest that, under the “business-as-usual” scenario, total Chl a will decrease but Prochlorococcus abundances will increase significantly by the end of this century. Synechococcus abundances will also increase, but the trend is significant only in coastal waters. Our study advances the ability to predict picophytoplankton abundances in the South China Sea and suggests that BRT is a useful machine learning technique for modelling plankton distributions.
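    The abstract does not specify an implementation, but the following hedged sketch shows how a boosted-regression-tree fit of this kind might look using scikit-learn's GradientBoostingRegressor; the synthetic data and feature names (longitude, latitude, depth, day of year, SST, chlorophyll, light) are placeholders for the study's actual predictors.

```python
# Illustrative boosted-regression-tree (BRT) sketch in the spirit of the
# abstract; all data below are synthetic placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 500

# Hypothetical spatiotemporal and environmental predictors.
X = pd.DataFrame({
    "lon": rng.uniform(108, 122, n),
    "lat": rng.uniform(5, 23, n),
    "depth_m": rng.uniform(0, 150, n),
    "doy": rng.integers(1, 366, n),
    "sst_c": rng.uniform(20, 31, n),
    "chl_sat": rng.lognormal(-1.5, 0.5, n),
    "par": rng.uniform(5, 60, n),
})
# Synthetic log-abundance target standing in for, e.g., Prochlorococcus.
y = 4 + 0.1 * X["sst_c"] - 0.01 * X["depth_m"] + rng.normal(0, 0.3, n)

brt = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                max_depth=3, subsample=0.8)
scores = cross_val_score(brt, X, y, cv=5, scoring="r2")
print("cross-validated R^2:", scores.mean())

brt.fit(X, y)
# Relative influence of each predictor, analogous to BRT variable importance.
for name, imp in zip(X.columns, brt.feature_importances_):
    print(f"{name}: {imp:.3f}")
```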

    Emulation and calibration with smoothed system and simulator data

    This thesis is concerned with structuring the statistical model with which we relate physical systems and computer simulators. The novelty of the work lies in relating them via imagined smoothed versions of themselves, reflecting the belief that they are similar on large scales but discrepant when it comes to small-scale details. Our central, paradigmatic example involves relating the planet’s climate to a climate simulator. Here the simulator is suspected to be incapable of faithfully reproducing changes in the system as time or certain physical parameters are changed by a small amount, but it is still considered informative about changes in the system over long time scales and large parameter changes.
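    As a rough illustration (not the thesis's formalism) of relating a system and a simulator through smoothed versions of themselves, the toy sketch below smooths two synthetic curves that agree on large scales but differ in small-scale detail; all quantities are invented for illustration.

```python
# Toy illustration: the raw system and simulator outputs disagree in their
# fine-scale structure, but their heavily smoothed versions agree closely.
import numpy as np
from scipy.ndimage import gaussian_filter1d

t = np.linspace(0, 10, 1000)
rng = np.random.default_rng(1)

# "System": slow trend plus fine-scale structure the simulator cannot resolve.
system = np.sin(t) + 0.3 * np.sin(15 * t) + rng.normal(0, 0.05, t.size)
# "Simulator": captures the trend, but its small-scale detail is discrepant.
simulator = np.sin(t) + 0.3 * np.cos(17 * t)

smooth_sys = gaussian_filter1d(system, sigma=50)
smooth_sim = gaussian_filter1d(simulator, sigma=50)

print("raw RMS discrepancy:     ", np.sqrt(np.mean((system - simulator) ** 2)))
print("smoothed RMS discrepancy:", np.sqrt(np.mean((smooth_sys - smooth_sim) ** 2)))
```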

    Weak Gravitational Lensing by Large-Scale Structures: A Tool for Constraining Cosmology

    There is now very strong evidence that our Universe is undergoing a period of accelerated expansion, as if it were under the influence of a gravitationally repulsive “dark energy” component. Furthermore, most of the mass of the Universe seems to be in the form of non-luminous matter, the so-called “dark matter”. Together, these “dark” components, whose nature remains unknown today, represent around 96% of the matter-energy budget of the Universe. Unraveling the true nature of dark energy and dark matter has thus become one of the primary goals of present-day cosmology. Weak gravitational lensing, or weak lensing for short, is the effect whereby light emitted by distant galaxies is slightly deflected by the tidal gravitational fields of intervening foreground structures. Because it relies only on the physics of gravity, weak lensing has the unique ability to probe the distribution of mass in a direct and unbiased way. This technique is at present routinely used to study dark matter, typical applications being the mass reconstruction of galaxy clusters and the study of the properties of the dark halos surrounding galaxies. Another, more recent application of weak lensing, on which we focus in this thesis, is the analysis of the cosmological lensing signal induced by large-scale structures, the so-called “cosmic shear”. This signal can be used to measure the growth of structures and the expansion history of the Universe, which makes it particularly relevant to the study of dark energy. Of all weak lensing effects, the cosmic shear is the most subtle, and its detection requires the accurate analysis of the shapes of millions of distant, faint galaxies in the near infrared. So far, the main factor limiting cosmic shear measurement accuracy has been the relatively small sky areas covered. The next generation of wide-field, multicolor surveys will, however, overcome this hurdle by covering a much larger portion of the sky with improved image quality. The resulting statistical errors will then become subdominant compared to systematic errors, which will instead become the main source of uncertainty. In fact, uncovering key properties of dark energy will only be achievable if these systematics are well understood and reduced to the required level. The major sources of uncertainty reside in the shape measurement algorithm used, the convolution of the original image by the instrumental and possibly atmospheric point spread function (PSF), the pixelation effect caused by the integration of light falling on the detector pixels, and the degradation caused by various sources of noise. Measuring the cosmic shear thus entails solving the difficult inverse problem of recovering the shear signal from blurred, pixelated and noisy galaxy images while keeping errors within the limits demanded by future weak lensing surveys. Reaching this goal is not without challenges: the best available shear measurement methods would need a tenfold improvement in accuracy to match the requirements of a space mission like Euclid from ESA, scheduled for launch at the end of this decade. Significant progress has nevertheless been made in the last few years, with substantial contributions from initiatives such as the GREAT (GRavitational lEnsing Accuracy Testing) challenges, open competitions whose main objective is to foster the development of new and more accurate shear measurement methods.
We start this work with a quick overview of modern cosmology: its fundamental tenets, its achievements, and the challenges it faces today. We then review the theory of weak gravitational lensing and explain how cosmic shear observations can be used to place constraints on cosmology. The last part of this thesis focuses on the practical challenges associated with the accurate measurement of the cosmic shear. After a review of the subject, we present the main contributions we have made in this area: the development of the gfit shear measurement method, and new algorithms for point spread function (PSF) interpolation and image denoising. The gfit method emerged as one of the top performers in the GREAT10 Galaxy Challenge. It essentially consists in fitting two-dimensional elliptical Sérsic light profiles to observed galaxy images in order to produce estimates for the shear power spectrum. PSF correction is automatic, and an efficient shape-preserving denoising algorithm can optionally be applied prior to fitting the data. PSF interpolation is also an important issue in shear measurement because the PSF is only known at star positions, while PSF correction has to be performed at any position on the sky. We developed innovative PSF interpolation algorithms for the GREAT10 Star Challenge, a competition dedicated to the PSF interpolation problem. Our participation was very successful: one of our interpolation methods won the Star Challenge, while the remaining four achieved the next highest scores of the competition. Finally, we participated in the development of a wavelet-based, shape-preserving denoising method particularly well suited to weak lensing analysis.
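To make the profile-fitting idea behind gfit more concrete, the following is a toy sketch that fits an elliptical Sérsic model to a synthetic galaxy image using astropy; it omits PSF convolution, pixelation handling, and denoising, and is an illustrative assumption rather than the actual gfit pipeline.

```python
# Toy model-fitting shear-estimation step: fit an elliptical Sersic profile
# to a synthetic galaxy image and read off its ellipticity. Not the gfit
# pipeline; PSF convolution, pixelation and denoising are ignored here.
import numpy as np
from astropy.modeling.models import Sersic2D
from astropy.modeling.fitting import LevMarLSQFitter

ny, nx = 64, 64
y, x = np.mgrid[0:ny, 0:nx]

# Synthetic "observed" galaxy: a Sersic profile plus Gaussian pixel noise.
truth = Sersic2D(amplitude=1.0, r_eff=6.0, n=1.5, x_0=32.0, y_0=32.0,
                 ellip=0.3, theta=0.5)
rng = np.random.default_rng(7)
image = truth(x, y) + rng.normal(0, 0.02, size=(ny, nx))

# Fit a Sersic model to the image by least squares.
init = Sersic2D(amplitude=0.5, r_eff=4.0, n=2.0, x_0=30.0, y_0=30.0,
                ellip=0.1, theta=0.0)
fitter = LevMarLSQFitter()
best = fitter(init, x, y, image)

print("fitted ellipticity:   ", best.ellip.value)
print("fitted position angle:", best.theta.value)
```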