    JMASM43: TEEReg: Trimmed Elemental Estimation (R)

    Trimmed elemental regression is robust to outliers and violations of model assumptions. Its properties and statistical inference were evaluated using bias-corrected and accelerated (BCa) bootstrap confidence intervals. An R package named TEEReg was developed to compute the trimmed elemental estimates and the corresponding bootstrap confidence intervals. Two examples are provided to demonstrate its usage.
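
    As a rough illustration of the elemental idea behind the package, the sketch below fits exact regressions to random subsets of p observations, scores each elemental fit by a robust criterion on the full data, and averages the fits that survive trimming. The trimming rule, the robust criterion, the use of random elemental sets in place of full enumeration, and the function name are illustrative assumptions; the TEEReg package may define the estimator differently, and the BCa bootstrap step is omitted here.

        # Illustrative sketch of trimmed elemental estimation (not the TEEReg implementation).
        import numpy as np

        def trimmed_elemental_estimate(X, y, trim=0.2, n_subsets=2000, seed=0):
            """Average the elemental (exact p-point) fits that survive trimming.

            X : (n, p) design matrix including an intercept column.
            y : (n,) response vector.
            trim : fraction of elemental fits discarded (worst robust criterion).
            """
            rng = np.random.default_rng(seed)
            n, p = X.shape
            fits, scores = [], []
            for _ in range(n_subsets):
                s = rng.choice(n, p, replace=False)              # random elemental set of p cases
                Xs, ys = X[s], y[s]
                if np.linalg.matrix_rank(Xs) < p:                # skip singular elemental sets
                    continue
                beta = np.linalg.solve(Xs, ys)                   # exact fit through p points
                fits.append(beta)
                scores.append(np.median(np.abs(y - X @ beta)))   # robust criterion on all data
            fits, scores = np.array(fits), np.array(scores)
            keep = scores <= np.quantile(scores, 1.0 - trim)     # drop the worst-fitting fraction
            return fits[keep].mean(axis=0)

    With X an n-by-p design matrix including an intercept column, trimmed_elemental_estimate(X, y) returns a single coefficient vector; the package additionally provides BCa bootstrap confidence intervals for such estimates.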

    A procedure for robust estimation and diagnostics in regression

    We propose a new procedure for computing an approximation to regression estimates based on the minimization of a robust scale. The procedure can be applied with a large number of independent variables, where the usual methods based on resampling would require infeasible or extremely costly computing time. An important advantage of the procedure is that it can be incorporated into any high breakdown procedure and improve it with just a few seconds of computing time. The procedure minimizes the robust scale over a set of tentative parameter vectors. Each of these parameter vectors is obtained as follows. We represent each data point by the vector of changes in the least squares forecasts of that observation when each of the observations is deleted. The sets of possible outliers are then obtained as the extreme points of the principal components of these vectors, or as the set of points with large residuals. The good performance of the procedure allows the identification of multiple outliers while avoiding masking effects. The efficiency of the procedure for robust estimation and its power as an outlier detection tool are investigated in a simulation study and some examples.
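
    The candidate-set construction can be sketched as follows: the change in the least squares forecast of case i when case j is deleted has the closed form h_ij * e_j / (1 - h_jj), so the whole matrix of deletion effects is available from a single fit; candidate outlier sets are then read off the leading principal components of that matrix and off the largest residuals, and a robust scale is evaluated after refitting without each candidate set. The particular robust scale (a median of absolute residuals), the number of components, and the function name below are simplifications, not the paper's exact procedure.

        # Simplified sketch of candidate generation from deletion effects and PCA.
        import numpy as np

        def robust_fit_via_candidates(X, y, trim_frac=0.2):
            n, p = X.shape
            H = X @ np.linalg.solve(X.T @ X, X.T)              # hat matrix
            beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
            e = y - X @ beta_ls
            # D[i, j] = change in the LS forecast of case i when case j is deleted
            D = H * (e / (1.0 - np.diag(H)))[None, :]

            # Candidate outlier sets: extreme cases along the two leading principal
            # components of D, plus the cases with the largest absolute residuals.
            Dc = D - D.mean(axis=0)
            _, _, Vt = np.linalg.svd(Dc, full_matrices=False)
            scores = Dc @ Vt[:2].T
            k = max(1, int(trim_frac * n))
            candidate_sets = [
                set(np.argsort(-np.abs(scores[:, 0]))[:k]),
                set(np.argsort(-np.abs(scores[:, 1]))[:k]),
                set(np.argsort(-np.abs(e))[:k]),
            ]

            # Keep the tentative fit with the smallest robust residual scale.
            best_beta, best_scale = beta_ls, np.median(np.abs(e))
            for cand in candidate_sets:
                keep = np.array(sorted(set(range(n)) - cand))
                beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
                scale = np.median(np.abs(y - X @ beta))
                if scale < best_scale:
                    best_beta, best_scale = beta, scale
            return best_beta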

    Inconsistency of Resampling Algorithms for High Breakdown Regression Estimators and a New Algorithm

    Since high breakdown estimators (HBE) are impractical to compute exactly in large samples, approximate algorithms are used. Such an algorithm generally produces an estimator with a lower consistency rate and breakdown value than the exact theoretical estimator. This discrepancy grows with the sample size, with the implication that huge computations are needed for good approximations in large, high-dimensional samples. The workhorse for HBE has been the ‘elemental set’, or ‘basic resampling’, algorithm. This turns out to be completely ineffective in high dimensions with high levels of contamination. However, enriching it with a “concentration” step turns it into a method that is able to handle even high levels of contamination, provided the regression outliers are located on random cases. It remains ineffective if the regression outliers are concentrated on high leverage cases. We focus on the multiple regression problem, but several of the broad conclusions – notably the inadequacy of fixed numbers of elemental starts – are relevant to multivariate location and dispersion estimation as well. We introduce a new algorithm – the “X-cluster” method – for large, high-dimensional multiple regression data sets that are beyond the reach of standard resampling methods. This algorithm departs sharply from current HBE algorithms in that, even at a constant percentage of contamination, it is more effective the larger the sample, making a compelling case for using it in the large-sample situations that current methods serve poorly. A multi-pronged analysis, using both traditional OLS and L1 methods along with newer resistant techniques, will often detect departures from the multiple regression model that cannot be detected by any single estimator.
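
    The “concentration” enrichment mentioned above can be illustrated as follows: starting from a random elemental fit, repeatedly refit least squares to the h cases with the smallest squared residuals, and keep the start whose trimmed sum of squared residuals is lowest. This is a generic sketch in the spirit of concentration steps, not the X-cluster algorithm introduced in the paper; the coverage h, the number of starts, and the function name are illustrative choices.

        # Generic elemental starts refined by concentration steps (illustrative only).
        import numpy as np

        def concentrated_elemental_fit(X, y, n_starts=50, h=None, n_csteps=10, seed=0):
            rng = np.random.default_rng(seed)
            n, p = X.shape
            h = h or (n + p + 1) // 2                        # coverage of the trimmed fit
            best_beta, best_obj = None, np.inf
            for _ in range(n_starts):
                idx = rng.choice(n, p, replace=False)        # random elemental start
                if np.linalg.matrix_rank(X[idx]) < p:
                    continue
                beta = np.linalg.solve(X[idx], y[idx])
                for _ in range(n_csteps):                    # concentration steps
                    r2 = (y - X @ beta) ** 2
                    keep = np.argsort(r2)[:h]                # h best-fitting cases
                    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
                obj = np.sort((y - X @ beta) ** 2)[:h].sum() # trimmed sum of squares
                if obj < best_obj:
                    best_beta, best_obj = beta, obj
            return best_beta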

    Sparse least trimmed squares regression for analyzing high-dimensional large data sets

    Sparse model estimation is a topic of high importance in modern data analysis due to the increasing availability of data sets with a large number of variables. Another common problem in applied statistics is the presence of outliers in the data. This paper combines robust regression and sparse model estimation. A robust and sparse estimator is introduced by adding an L1 penalty on the coefficient estimates to the well-known least trimmed squares (LTS) estimator. The breakdown point of this sparse LTS estimator is derived, and a fast algorithm for its computation is proposed. In addition, the sparse LTS is applied to protein and gene expression data of the NCI-60 cancer cell panel. Both a simulation study and the real data application show that the sparse LTS has better prediction performance than its competitors in the presence of leverage points.
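
    The combination of trimming and an L1 penalty can be sketched with lasso-based concentration steps: alternate a lasso fit on the current h-subset with reselecting the h observations that have the smallest squared residuals, and keep the start with the lowest penalized trimmed objective. The penalty level, subset size, number of starts, and the use of scikit-learn's Lasso are illustrative assumptions; the paper's fast algorithm and its breakdown-point analysis are not reproduced here.

        # Simplified sparse LTS sketch via lasso-based concentration steps.
        import numpy as np
        from sklearn.linear_model import Lasso

        def sparse_lts(X, y, alpha=0.05, h=None, n_starts=20, n_csteps=10, seed=0):
            rng = np.random.default_rng(seed)
            n, p = X.shape
            h = h or int(0.75 * n)                           # fraction of cases kept
            best_model, best_obj = None, np.inf
            for _ in range(n_starts):
                subset = rng.choice(n, h, replace=False)     # random initial h-subset
                for _ in range(n_csteps):
                    model = Lasso(alpha=alpha).fit(X[subset], y[subset])
                    r2 = (y - model.predict(X)) ** 2
                    subset = np.argsort(r2)[:h]              # concentration step
                # Penalized trimmed objective: h smallest squared residuals + L1 penalty.
                obj = np.sort(r2)[:h].sum() + alpha * h * np.abs(model.coef_).sum()
                if obj < best_obj:
                    best_model, best_obj = model, obj
            return best_model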

    Using hierarchical information-theoretic criteria to optimize subsampling of extensive datasets

    This paper addresses the challenge of subsampling large datasets, aiming to generate a smaller dataset that retains a significant portion of the original information. To achieve this objective, we present a subsampling algorithm that integrates hierarchical data partitioning with a specialized tool that identifies the most informative observations in a dataset for a specified underlying linear model (not necessarily first-order) relating responses and inputs. The hierarchical data partitioning procedure systematically and incrementally aggregates information from smaller-sized samples into new samples. Simultaneously, our selection tool employs Semidefinite Programming for numerical optimization to maximize the information content of the chosen observations. We validate the effectiveness of our algorithm through extensive testing, using both benchmark and real-world datasets. The real-world dataset is related to the physicochemical characterization of white variants of Portuguese Vinho Verde. Our results are highly promising, demonstrating the algorithm's capability to efficiently identify and select the most informative observations while keeping computational requirements at a manageable level.
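
    As a much-simplified stand-in for the selection tool, the sketch below grows a subsample greedily, at each step adding the observation that most increases the log-determinant of the information matrix X'X of an assumed linear model. The paper's approach combines hierarchical partitioning with Semidefinite Programming rather than this greedy heuristic; the function name and the ridge term are illustrative.

        # Greedy information-maximizing subsampling (illustrative heuristic, not the paper's SDP tool).
        import numpy as np

        def greedy_d_optimal_subsample(X, m, ridge=1e-8):
            """Select m rows of X approximately maximizing log det(X_S' X_S)."""
            n, p = X.shape
            selected, remaining = [], list(range(n))
            M = ridge * np.eye(p)                            # regularized information matrix
            for _ in range(m):
                Minv = np.linalg.inv(M)
                # log-det increase from adding row i: log(1 + x_i' M^{-1} x_i)
                gains = [np.log1p(X[i] @ Minv @ X[i]) for i in remaining]
                best = remaining[int(np.argmax(gains))]
                selected.append(best)
                remaining.remove(best)
                M = M + np.outer(X[best], X[best])
            return np.array(selected)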

    Sub-sampling a large physical soil archive for additional analyses to support spatial mapping; a pre-registered experiment in the Southern Nations, Nationalities, and Peoples Region (SNNPR) of Ethiopia

    The value of physical archives of soil material from field sampling activities has been widely recognized. If we want to use archive material for new destructive analyses to support a task, such as spatial mapping, then an efficient sub-sampling strategy is needed, both to manage analytical costs and to conserve the archive material. In this paper we present an approach to this problem when the objective is spatial mapping by ordinary kriging. Our objective was to subsample the physical archive from the Ethiopia Soil Information System (EthioSIS) survey of the Southern Nations, Nationalities and Peoples Region (SNNPR) for spatial mapping of two variables, concentrations of particular fractions of selenium and iodine in the soil, which had not been measured there. We used data from cognate parts of surrounding regions of Ethiopia to estimate variograms of these properties, and then computed prediction error variances for maps in SNNPR based on proposed subsets of the archive of different sizes, selected to optimize a spatial coverage criterion (with some close sample pairs included). On this basis a subsample was selected. This is a pre-registered experiment in that we have proposed criteria for evaluating the success of our approach, and are publishing them in advance of receiving analytical data on the subsampled material from the laboratories where it is being processed. A subsequent short report will publish the outcome. The use of pre-registered trials is widely recommended and used in areas of science including public health, and we believe that it is a sound strategy to promote reproducible research in soil science.
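
    The quantity traded off against subsample size here is the ordinary kriging prediction error variance. A minimal sketch is given below, assuming a spherical variogram with placeholder parameters (not values estimated from the EthioSIS data); in the study such variances are evaluated over the mapping region for each proposed subset of the archive.

        # Ordinary kriging prediction error variance at one target location (illustrative parameters).
        import numpy as np

        def spherical_variogram(h, nugget=0.1, sill=1.0, range_=50.0):
            h = np.asarray(h, dtype=float)
            g = nugget + (sill - nugget) * (1.5 * h / range_ - 0.5 * (h / range_) ** 3)
            return np.where(h == 0, 0.0, np.where(h > range_, sill, g))

        def ok_prediction_variance(sample_xy, target_xy, variogram=spherical_variogram):
            """Ordinary kriging variance at target_xy given sample locations sample_xy."""
            d = np.linalg.norm(sample_xy[:, None, :] - sample_xy[None, :, :], axis=-1)
            d0 = np.linalg.norm(sample_xy - target_xy, axis=-1)
            k = len(sample_xy)
            A = np.ones((k + 1, k + 1))                      # ordinary kriging system
            A[:k, :k] = variogram(d)
            A[k, k] = 0.0
            b = np.append(variogram(d0), 1.0)
            sol = np.linalg.solve(A, b)                      # kriging weights and Lagrange multiplier
            return float(sol[:k] @ variogram(d0) + sol[k])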

    Comparison of Data Mining and Mathematical Models for Estimating Fuel Consumption of Passenger Vehicles

    A number of analytical models have been described in the literature to estimate the fuel consumption of vehicles. Most of these require a wide range of vehicle- and trip-related parameters as input data, which can limit their practical applicability when such data are not readily available. To overcome this drawback, this study describes the development of three data mining models for estimating the fuel consumption of a vehicle: linear regression, artificial neural networks, and support vector machines. The paper presents comparison results against five instantaneous fuel consumption models from the literature, using real data collected from three passenger vehicles on three routes. The results indicate that while the prediction accuracy of the instantaneous fuel consumption models varies across the data sets, that of the regression models is significantly better and more robust against changes in the input data.
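
    The kind of comparison described above can be sketched with scikit-learn as follows. The feature names, hyperparameters, and the cross-validated RMSE metric are illustrative assumptions; the study's actual trip-level inputs and evaluation protocol are those described in the paper.

        # Sketch of comparing the three data mining models on fuel-consumption data.
        from sklearn.linear_model import LinearRegression
        from sklearn.neural_network import MLPRegressor
        from sklearn.svm import SVR
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        def compare_fuel_models(X, y):
            """X: trip features (e.g. speed, acceleration, road grade); y: fuel consumption."""
            models = {
                "linear regression": LinearRegression(),
                "neural network": make_pipeline(StandardScaler(),
                                                MLPRegressor(hidden_layer_sizes=(32,),
                                                             max_iter=2000, random_state=0)),
                "support vector machine": make_pipeline(StandardScaler(), SVR(C=10.0)),
            }
            for name, model in models.items():
                rmse = -cross_val_score(model, X, y, cv=5,
                                        scoring="neg_root_mean_squared_error").mean()
                print(f"{name:>24s}: CV RMSE = {rmse:.3f}")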