160 research outputs found

    Kernel Carpentry for Online Regression using Randomly Varying Coefficient Model

    We present a Bayesian formulation of locally weighted learning (LWL) using the novel concept of a randomly varying coefficient model. Based on this…

    Bayesian locally weighted online learning

    Locally weighted regression is a non-parametric regression technique that can cope with non-stationarity of the input distribution. Online algorithms like Receptive Field Weighted Regression and Locally Weighted Projection Regression use a sparse representation of the locally weighted model to approximate a target function, resulting in an efficient learning algorithm. However, these algorithms are fairly sensitive to parameter initializations and have multiple open learning parameters that are usually set using insights into the problem and local heuristics. In this thesis, we attempt to alleviate these problems by using a probabilistic formulation of locally weighted regression followed by a principled Bayesian inference of the parameters. In the Randomly Varying Coefficient (RVC) model developed in this thesis, locally weighted regression is set up as an ensemble of regression experts that provide a local linear approximation to the target function. We train the individual experts independently and then combine their predictions using a Product of Experts formalism. Independent training of the experts allows us to adapt the complexity of the regression model dynamically while learning in an online fashion. The local experts themselves are modelled using a hierarchical Bayesian probability distribution, with Variational Bayesian Expectation Maximization steps used to learn the posterior distributions over the parameters. The Bayesian modelling of the local experts leads to an inference procedure that is fairly insensitive to parameter initializations and avoids problems like overfitting. We further exploit the Bayesian inference procedure to derive efficient online update rules for the parameters. Learning in the regression setting is also extended to handle classification tasks by using logistic regression to model discrete class labels. The main contribution of the thesis is a spatially localised online learning algorithm, set up in a probabilistic framework with principled Bayesian inference rules for the parameters, that learns local models completely independently of each other, uses only local information, and adapts the local model complexity in a data-driven fashion. This thesis, for the first time, brings together the computational efficiency and adaptability of ‘non-competitive’ locally weighted learning schemes and the modelling guarantees of the Bayesian formulation.
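
    As a rough illustration of the Product of Experts combination described above, the hedged Python sketch below merges the predictions of independently trained local linear experts by precision-weighted averaging. The expert fields (centre, bandwidth, coefficients, noise variance) and the Gaussian locality kernel are illustrative assumptions, not the thesis' exact RVC parameterisation.

```python
import numpy as np

def poe_predict(x, experts):
    """Combine independent local linear experts with a Product of (Gaussian) Experts.

    Each expert is a dict with illustrative fields (assumed, not from the thesis):
      'center'    - centre of its receptive field
      'bandwidth' - kernel width controlling locality
      'beta'      - local linear coefficients (intercept first)
      'noise_var' - predictive variance of the expert at its centre
    """
    means, precisions = [], []
    for e in experts:
        # Locality weight of this expert for the query point (Gaussian kernel).
        w = np.exp(-0.5 * np.sum((x - e['center'])**2) / e['bandwidth']**2)
        # Local linear prediction around the expert's centre.
        mean = e['beta'][0] + e['beta'][1:] @ (x - e['center'])
        # Down-weight the expert's precision far from its centre.
        precisions.append(w / e['noise_var'])
        means.append(mean)
    precisions = np.asarray(precisions)
    means = np.asarray(means)
    total = precisions.sum()
    # Product-of-Gaussians: precision-weighted mean and combined variance.
    return (precisions @ means) / total, 1.0 / total
```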

    Lazy Lasso for local regression

    Locally weighted regression is a technique that predicts the response for new data items from their neighbors in the training data set, where closer data items are assigned higher weights in the prediction. However, the original method may suffer from overfitting and fail to select the relevant variables. In this paper we propose combining a regularization approach with locally weighted regression to achieve sparse models. Specifically, the lasso is a shrinkage and selection method for linear regression. We present an algorithm that embeds the lasso in an iterative procedure that alternately computes weights and performs lasso-wise regression. The algorithm is tested on three synthetic scenarios and two real data sets. Results show that the proposed method outperforms linear and local models in several kinds of scenarios.
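
    The sketch below is one plausible reading of the iterative procedure, for a single query point: locality weights are computed from distance to the query, a weighted lasso is fitted, and the weights are then recomputed. The kernel, the residual-based reweighting step, and all parameter values are assumptions, not the authors' exact algorithm.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lazy_lasso_predict(x_query, X, y, alpha=0.1, bandwidth=1.0, n_iter=5):
    """Hedged sketch of a locally weighted lasso prediction for one query point.

    Alternates between (re)computing locality weights and fitting a weighted
    lasso, loosely following the abstract; details are illustrative only.
    """
    # Initial locality weights: Gaussian kernel on squared distance to the query.
    d2 = np.sum((X - x_query)**2, axis=1)
    w = np.exp(-0.5 * d2 / bandwidth**2)

    model = Lasso(alpha=alpha)
    for _ in range(n_iter):
        model.fit(X, y, sample_weight=w)
        # Recompute weights: damp points the current local model fits poorly
        # (an assumed robustness step, not taken from the paper).
        resid = y - model.predict(X)
        w = np.exp(-0.5 * d2 / bandwidth**2) / (1.0 + resid**2)
    return model.predict(x_query.reshape(1, -1))[0]
```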

    Land valuation using an innovative model combining machine learning and spatial context

    Valuation predictions are used by buyers, sellers, regulators, and authorities to assess the fairness of the value being asked. Urbanization demands a modern and efficient land valuation system, since the conventional approach is costly, slow, and relatively subjective towards locational factors. This necessitates the development of alternative methods that are faster, user-friendly, and digitally based. These approaches should use geographic information systems and strong analytical tools to produce reliable and accurate valuations. Location information in the form of spatial data is crucial because the price can vary significantly based on the neighborhood and context of where the parcel is located. In this thesis, a model is proposed that combines machine learning and spatial context. It integrates raster information derived from remote sensing as well as vector information from geospatial analytics to predict land values in the City of Springfield. These are used to investigate whether a joint model can improve the value estimation. The study also identifies the factors that are most influential in driving these models. A geodatabase was created by calculating proximity and accessibility to key locations, integrating socio-economic variables, and adding statistics on green space density and vegetation index derived from Sentinel-2 satellite data. The model was trained on Greene County government data as ground-truth appraisal land values using supervised machine learning models, and the impact of each data type on price prediction was explored. Two types of modeling were conducted. Initially, only spatial context data were used to assess their predictive capability. Subsequently, socio-economic variables were added to the dataset to compare the performance of the models. The results showed only a slight difference in performance between the random forest and gradient boosting algorithms, as well as between using GIS-derived distance measures alone and adding socio-economic variables to them. Furthermore, spatial autocorrelation analysis was conducted to investigate how the distribution of similar attributes related to the location of the land affects its value. This analysis also aimed to identify the disparities that exist in the socio-economic structure and to measure their magnitude.
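
    A minimal sketch of the comparison described above: two supervised models are cross-validated on spatial-context features alone and then with socio-economic variables added. The file name, column names, and model settings are hypothetical placeholders, not the thesis' actual geodatabase schema.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical geodatabase export: file and column names are illustrative only.
parcels = pd.read_csv("parcels.csv")
spatial_cols = ["dist_cbd", "dist_highway", "dist_school", "green_density", "ndvi_mean"]
socio_cols = ["median_income", "population_density"]
target = "appraised_land_value"

for label, cols in [("spatial only", spatial_cols),
                    ("spatial + socio-economic", spatial_cols + socio_cols)]:
    X, y = parcels[cols], parcels[target]
    for name, model in [("random forest", RandomForestRegressor(n_estimators=300, random_state=0)),
                        ("gradient boosting", GradientBoostingRegressor(random_state=0))]:
        # 5-fold cross-validated R^2 for each feature set / model combination.
        r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
        print(f"{label:26s} {name:18s} mean CV R^2 = {r2:.3f}")
```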

    An Automatic Method for Extracting Chemical Impurity Profiles of Illicit Drugs from Chromatographic-Mass Spectrometric Data and Their Comparison Using Bayesian Reasoning

    In this work, an automated procedure for extracting chemical profiles of illicit drugs from chromatographic-mass spectrometric data is presented, along with a method for comparing the profiles using Bayesian inference. The described methods aim to ease the work of a forensic chemist who is tasked with comparing two samples of a drug, such as amphetamine, and delivering an answer to a question of the form 'Are these two samples from the same source?' Additionally, more statistical rigour is introduced to the comparison process. The chemical profiles consist of the relative amounts of certain impurities present in seized drug samples. In order to obtain such profiles, the amounts of the target compounds must be recovered from chromatographic-mass spectrometric measurements, which amounts to searching the raw signals for peaks corresponding to the targets. The areas of these peaks must then be integrated and normalized by the sum of all target peak areas. The automated impurity profile extraction presented in this thesis works by first filtering the data corresponding to a sample, which includes discarding irrelevant parts of the raw data, estimating and removing the signal baseline using the asymmetrically reweighted penalized least squares (arPLS) algorithm, and smoothing the relevant signals using a Savitzky-Golay (SG) filter. The SG filter is also used to estimate signal derivatives. These derivatives are used in the next step to detect signal peaks, from which parameters are estimated for an exponential-Gaussian hybrid peak model. The signal is reconstructed using the estimated model peaks, and optimal parameters are found by fitting the reconstructed signal to the measurements via non-linear least squares methods. In the last step, impurity profiles are extracted by integrating the areas of the optimized models for the target compound peaks. These areas are then normalized by their sum to obtain the relative amounts of the substances. In order to separate peaks from noise, a model for the dependency of noise on signal level was fitted non-parametrically to replicate measurements of amphetamine quality control samples. This model was used to compute detection limits based on the estimated baseline of the signals. Finally, the classical Pearson correlation based comparison method for these impurity profiles was compared to two Bayesian methods, the Bayes factor (BF) and the predictive agreement (PA). The Bayesian methods used a probabilistic model assuming normally distributed values with a normal-gamma prior distribution for the mean and precision parameters. These methods were compared using simulation tests and application to 90 samples of seized amphetamine.
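
    The sketch below illustrates a few of the extraction steps named above (Savitzky-Golay smoothing, peak detection, area integration, and normalization to relative amounts). The arPLS baseline removal, the exponential-Gaussian hybrid fit, and the noise model are omitted, and all thresholds are assumed values.

```python
import numpy as np
from scipy.signal import savgol_filter, find_peaks
from scipy.integrate import trapezoid

def impurity_profile(time, intensity, window=21, polyorder=3):
    """Hedged sketch of part of the profile-extraction pipeline.

    Smooths a chromatographic trace with a Savitzky-Golay filter, detects
    peaks, integrates their areas, and normalizes them to relative amounts.
    Baseline removal (arPLS) and the exponential-Gaussian hybrid peak fit
    described in the thesis are not reproduced here.
    """
    smoothed = savgol_filter(intensity, window, polyorder)
    # Assumed prominence/width thresholds for separating peaks from noise.
    peaks, props = find_peaks(smoothed, prominence=0.05 * smoothed.max(), width=3)
    areas = []
    for i in range(len(peaks)):
        left = int(np.floor(props["left_ips"][i]))
        right = int(np.ceil(props["right_ips"][i])) + 1
        areas.append(trapezoid(smoothed[left:right], time[left:right]))
    areas = np.asarray(areas)
    # Relative amounts = peak areas normalized by their sum.
    return peaks, areas / areas.sum()
```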

    Nonparametric variable selection and dimension reduction methods and their applications in pharmacogenomics

    Nowadays it is common to collect large volumes of data in many fields, with an extensive number of variables but often only a small or moderate number of samples. For example, in the analysis of genomic data, the number of genes can be very large, varying from tens of thousands to several millions, whereas the number of samples is only several hundred to a few thousand. Pharmacogenomics is the example of genomic data analysis considered here. Pharmacogenomics research uses whole-genome genetic information to predict individuals' drug response. Because whole-genome data are high dimensional and their relationships to drug response are complicated, we develop a variety of nonparametric methods, including variable selection using local regression and extended dimension reduction techniques, to detect nonlinear patterns in the relationship between genetic variants and clinical response.

    High dimensional data analysis has become a popular research topic in the statistics community in recent years. However, the nature of high dimensional data makes many traditional statistical methods fail, because most methods rely on the assumption that the sample size n is larger than the variable dimension p. Consequently, variable selection or dimension reduction is often the first step in high dimensional data analysis. Meanwhile, another important issue arises: the choice of an appropriate statistical modeling strategy for conducting variable selection or dimension reduction. Our studies have found that the traditional parametric linear model might not work well for detecting nonlinear patterns in the relationships between predictors and response. The limitations of the linear model and other parametric statistical approaches motivate us to consider nonparametric/nonlinear models for conducting variable selection or dimension reduction.

    The thesis is composed of two major parts. In the first part, we develop a nonparametric predictive model of the response based on a small number of predictors, which are selected by a nonparametric forward variable selection procedure. We also propose strategies to identify subpopulations with enhanced treatment effects. In the second part, we develop an alternating least squares method to extend classical Sliced Inverse Regression (SIR) [Li, 1991] to the context of high dimensional data. Both methods are demonstrated by simulation studies and a pharmacogenomics study of bortezomib in multiple myeloma [Mulligan et al., 2007]. The proposed methods have favorable performance compared to other existing methods in the literature.
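
    For reference, the sketch below implements classical Sliced Inverse Regression [Li, 1991], which the second part of the thesis extends; the alternating least squares extension for high-dimensional data is not reproduced here, and the slice count and number of directions are arbitrary choices.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_directions=2):
    """Classical SIR (Li, 1991): estimate effective dimension reduction directions.

    Assumes n > p so the sample covariance is invertible; the thesis' alternating
    least squares variant for the high-dimensional case is not shown.
    """
    n, p = X.shape
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Whiten the predictors via the inverse square root of the covariance.
    evals, evecs = np.linalg.eigh(cov)
    cov_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    Z = (X - mu) @ cov_inv_sqrt

    # Slice the response and average the standardized predictors within each slice.
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)

    # Leading eigenvectors of the slice-mean covariance give the directions,
    # mapped back to the original predictor scale.
    w, v = np.linalg.eigh(M)
    return cov_inv_sqrt @ v[:, ::-1][:, :n_directions]
```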