5 research outputs found

    Robust-to-outliers square-root LASSO, simultaneous inference with a MOM approach

    We consider the least-squares regression problem with unknown noise variance, where the observed data points are allowed to be corrupted by outliers. Building on the median-of-means (MOM) method introduced by Lecue and Lerasle (Ann. Statist. 48(2):906-931, April 2020) in the case of known noise variance, we propose a general MOM approach for simultaneous inference of both the regression function and the noise variance, requiring only an upper bound on the noise level. Interestingly, this generalization requires care due to regularity issues that are intrinsic to the underlying convex-concave optimization problem. In the general case where the regression function belongs to a convex class, we show that our simultaneous estimator achieves with high probability the same convergence rates and a similar risk bound as if the noise level were known, as well as convergence rates for the estimated noise standard deviation. In the high-dimensional sparse linear setting, our estimator yields a robust analog of the square-root LASSO. Under weak moment conditions, it jointly achieves with high probability the minimax rates of estimation $s^{1/p} \sqrt{(1/n) \log(p/s)}$ for the $\ell_p$-norm of the coefficient vector, and the rate $\sqrt{(s/n) \log(p/s)}$ for the estimation of the noise standard deviation. Here $n$ denotes the sample size, $p$ the dimension and $s$ the sparsity level. We finally propose an extension to the case of unknown sparsity level $s$, providing a jointly adaptive estimator $(\widetilde \beta, \widetilde \sigma, \widetilde s)$. It simultaneously estimates the coefficient vector, the noise level and the sparsity level, with proven bounds on each of these three components that hold with high probability. Comment: 70 pages
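As a point of orientation, the basic median-of-means idea that the abstract builds on can be sketched for the simplest case, estimating a mean. This is not the paper's regression estimator: the function name `median_of_means`, the block count, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def median_of_means(x, n_blocks, seed=0):
    """Median-of-means estimate of the mean of a 1-D sample.

    The sample is shuffled, split into n_blocks roughly equal blocks,
    and the median of the block means is returned.
    """
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(x)
    blocks = np.array_split(shuffled, n_blocks)
    block_means = np.array([b.mean() for b in blocks])
    return float(np.median(block_means))

# Synthetic data (an assumption): true mean 2.0, five gross outliers.
rng = np.random.default_rng(1)
sample = rng.normal(loc=2.0, scale=1.0, size=1000)
sample[:5] = 1e6

mom_estimate = median_of_means(sample, n_blocks=21)
plain_mean = float(sample.mean())
```

With five outliers and 21 blocks, at most five block means are contaminated, so the median stays close to the true mean while the plain average does not.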

    Spatial clustering of waste reuse in a circular economy: A spatial autocorrelation analysis on locations of waste reuse in the Netherlands using global and local Moran’s I

    In recent years, implementing a circular economy in cities has been considered by policy makers as a potential solution for achieving sustainability. Existing literature on circular cities is mainly focused on two perspectives: urban governance and urban metabolism. Both these perspectives, to some extent, miss an understanding of space. A spatial perspective is important because circular activities, such as the recycling, reuse, or storage of materials, require space and have a location. It is therefore useful to understand where circular activities are located, and how they are affected by their location and surrounding geography. This study therefore aims to understand the existing state of waste reuse activities in the Netherlands from a spatial perspective, by analyzing the degree, scale, and locations of spatial clusters of waste reuse. This was done by measuring the spatial autocorrelation of waste reuse locations using global and local Moran’s I, with waste reuse data from the national waste registry of the Netherlands. The analysis was done for 10 material types: minerals, plastic, wood and paper, fertilizer, food, machinery and electronics, metal, mixed construction materials, glass, and textile. It was found that all materials except for glass and textiles formed spatial clusters. By varying the grid cell sizes used for data aggregation, it was found that different materials had different “best fit” cell sizes where spatial clustering was the strongest. The best fit cell size is ∼7 km for materials associated with construction and agricultural industries, and ∼20–25 km for plastic and metals. The best fit cell sizes indicate the average distance of companies from each other within clusters, and suggest a suitable spatial resolution at which the material can be understood. Hotspot maps were also produced for each material to show where reuse activities are most spatially concentrated.
    Climate Design and Sustainability; Statistics; Environmental Technology and Design
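For readers unfamiliar with the statistic, global Moran's I on gridded counts can be sketched as follows. This is a minimal illustration, not the study's pipeline: the rook-contiguity weights, the 6 x 6 grid, and the helper names `rook_weights` and `global_morans_i` are assumptions.

```python
import numpy as np

def rook_weights(n_rows, n_cols):
    """Binary spatial weights: cells sharing an edge are neighbours."""
    n = n_rows * n_cols
    w = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            if r + 1 < n_rows:          # neighbour below
                j = (r + 1) * n_cols + c
                w[i, j] = w[j, i] = 1.0
            if c + 1 < n_cols:          # neighbour to the right
                j = r * n_cols + c + 1
                w[i, j] = w[j, i] = 1.0
    return w

def global_morans_i(values, w):
    """I = (n / S0) * (z' W z) / (z' z), with z the centred values."""
    x = np.asarray(values, dtype=float).ravel()
    z = x - x.mean()
    return float(len(x) / w.sum() * (z @ w @ z) / (z @ z))

w = rook_weights(6, 6)

# Two toy grids (assumptions): one half-and-half, one checkerboard.
clustered = np.zeros((6, 6))
clustered[:, 3:] = 1.0
checker = np.indices((6, 6)).sum(axis=0) % 2

i_clustered = global_morans_i(clustered, w)
i_checker = global_morans_i(checker, w)
```

On these toy grids the half-and-half pattern gives I = 0.8 (strong clustering) and the checkerboard gives I = -1 (perfect dispersion). Repeating the calculation after aggregating the same points to different cell sizes is one way to look for a "best fit" resolution of the kind the abstract describes.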

    A multifunctional matching algorithm for sample design in agricultural plots

    Collection of accurate and representative data from agricultural fields is required for efficient crop management. Since growers have limited available resources, there is a need for advanced methods to select representative points within a field in order to best satisfy sampling or sensing objectives. The main purpose of this work was to develop a data-driven method for selecting locations across an agricultural field given observations of some covariates at every point in the field. These chosen locations should be representative of the distribution of the covariates in the entire population and represent the spatial variability in the field. They can then be used to sample an unknown target feature whose sampling is expensive and cannot be realistically done at the population scale. An algorithm for determining these optimal sampling locations, namely the multifunctional matching (MFM) criterion, was based on matching of moments (functionals) between sample and population. The selected functionals in this study were standard deviation, mean, and Kendall's tau. An additional algorithm defined the minimal number of observations that could represent the population according to a desired level of accuracy. The MFM was applied to datasets from two agricultural plots: a vineyard and a peach orchard. The data from the plots included measured values of slope, topographic wetness index, normalized difference vegetation index, and apparent soil electrical conductivity. The MFM algorithm selected the number of sampling points according to a representation accuracy of 90% and determined the optimal location of these points. The algorithm was validated against values of vine or tree water status measured as crop water stress index (CWSI). Algorithm performance was then compared to two other sampling methods: the conditioned Latin hypercube sampling (cLHS) model and a uniform random sample with spatial constraints. Comparison among sampling methods was based on measures of similarity between the target variable population distribution and the distribution of the selected sample. MFM represented CWSI distribution better than the cLHS and the uniform random sampling, and the selected locations showed smaller deviations from the mean and standard deviation of the entire population. The MFM functioned better in the vineyard, where spatial variability was larger than in the orchard. In both plots, the spatial pattern of the selected samples captured the spatial variability of CWSI. MFM can be adjusted and applied using other moments/functionals and may be adopted by other disciplines, particularly in cases where small sample sizes are desired.
    Statistics
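The moment-matching idea can be illustrated with a small sketch that picks the subset whose mean, standard deviation, and Kendall's tau are closest to those of the full population. The abstract does not describe the optimisation procedure, so the random search below, the absolute-difference loss, the two synthetic covariates, and every function name are assumptions; spatial constraints are ignored.

```python
import numpy as np

def kendall_tau_a(x, y):
    """Kendall's tau-a: average concordance sign over all ordered pairs."""
    dx = np.sign(x[:, None] - x[None, :])
    dy = np.sign(y[:, None] - y[None, :])
    n = len(x)
    return float((dx * dy).sum() / (n * (n - 1)))

def functionals(data):
    """Means and standard deviations of both covariates, plus their tau."""
    tau = kendall_tau_a(data[:, 0], data[:, 1])
    return np.concatenate([data.mean(axis=0), data.std(axis=0), [tau]])

def select_sample(population, k, n_candidates=2000, seed=0):
    """Random search for the size-k subset that best matches the population."""
    rng = np.random.default_rng(seed)
    target = functionals(population)
    best_idx, best_loss = None, np.inf
    for _ in range(n_candidates):
        idx = rng.choice(len(population), size=k, replace=False)
        loss = np.abs(functionals(population[idx]) - target).sum()
        if loss < best_loss:
            best_idx, best_loss = idx, loss
    return np.sort(best_idx), float(best_loss)

# Synthetic field with two correlated covariates (an assumption).
rng = np.random.default_rng(42)
first = rng.normal(size=300)
second = 0.7 * first + rng.normal(scale=0.5, size=300)
population = np.column_stack([first, second])

chosen, mismatch = select_sample(population, k=15)
```

Here `chosen` holds the indices of the 15 selected locations and `mismatch` is the summed absolute difference between sample and population functionals; a smaller value means a more representative sample under this particular loss.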