41 research outputs found

    Robustness concepts for sliced inverse regression

    A typical difficulty with nonparametric regression with a large number of regressor variables is the so-called curse of dimensionality. That is, as the dimension of the regressor space increases, more data are needed to fill the space densely enough to accurately estimate an underlying regression function. As a remedy, various dimension reduction procedures, such as SIR, SIR II (Li, 1991), SAVE (Cook and Weisberg (1991), Cook (2000)), or MAVE (Xia et al. (2002)), have been proposed for identifying an appropriate, smaller subspace of the original regressor space before fitting an underlying regression function. Because the estimation of a regression curve or link function ultimately relies on the correct identification of the linear combinations that span the dimension reduction subspace, it is crucial to understand the robustness properties of a dimension reduction procedure. That is, it is important to consider just how sensitive dimension reduction procedures and their subspace estimates are to data contamination. The focus of this thesis is a detailed investigation of the robustness properties of the dimension reduction procedure SIR (Li (1991)). In particular, we emphasize the finite sample behavior of the SIR procedure under data contamination, considering various types of contamination (i.e., directions of contamination) which may produce a “worst case” subspace estimate. We demonstrate that the data contamination scenarios that produce bad subspace estimates in SIR also depend on the covariance structure of the regressor variables as well as the dimension K of the final dimension reduction subspace. We show that the type of data contamination that causes SIR to yield an erroneous subspace estimate can change depending on whether the covariance of the regressors is known or not.
Initial efforts to define a breakdown point concept for dimension reduction procedures in the finite sample case go back to the dissertation of Hilker (1997) and involved canonical correlations as a “distance measure” between estimated and true regression subspaces (cf. Hilker (1997); Becker (2001); Gather, Hilker and Becker (2002)). Hilker's work stipulated that breakdown occurs if one basis vector of an estimated subspace is orthogonal to the true subspace. However, this formulation of breakdown in dimension reduction has some drawbacks. For one, it is arguably worse to estimate and select the entire orthogonal subspace of the true regression subspace of interest, so the previous concept of breakdown may not be adequate. Another problematic point is that breakdown classically involves the use of an underlying metric in its definition, but canonical correlations as a measure of “closeness” between spaces do not constitute a metric. The dissertation develops an alternative definition of breakdown in dimension reduction in the finite sample case and investigates an upper bound for the breakdown point in this situation. This formulation of breakdown uses an appropriate metric based on the Frobenius norm to measure the distance between subspaces and defines breakdown under data contamination as the case in which the distance between an estimated regression subspace and the true subspace is maximal under the metric. Because a subspace is characterized by its projection matrix, a suitable metric between spaces is obtained by applying a matrix norm to the difference of two projection matrices. This gives a geometrically meaningful definition for the finite sample breakdown point of methods such as SIR. This thesis also contains a simulation study used to numerically support our theoretical findings.
A well-known phenomenon in the estimation of nonparametric regression models is the so-called curse of dimensionality. It states that as the number of predictor variables, i.e., the dimension of the regressor space, increases, the amount of data needed for an adequate estimate of the underlying model grows exponentially. To circumvent this problem, there are dimension reduction procedures that aim at a substantial reduction of the dimension of the regressor space. Examples of such procedures are SIR, SIR II (Li, 1991), SAVE (Cook and Weisberg (1991), Cook (2000)), or MAVE (Xia et al. (2002)), which estimate a subspace of the original regressor space, called the e.d.r. space. Correct identification of this subspace is consequently decisive for the subsequent fitting of the regression model, and knowledge of the sensitivity of such dimension reduction procedures to data contamination is therefore of particular interest. The central question of this dissertation is a detailed analysis of the robustness properties of the dimension reduction procedure SIR (Li, 1991). Particular attention is paid to the behavior of the procedure in the finite sample case under data contamination. The aim of the work is to show which type of data contamination causes a so-called “worst case” estimate of the e.d.r. space. It turns out that knowledge of both the covariance structure of the regressor vector and the dimension K of the e.d.r. space matters for the estimation. The work shows that the direction in which data contamination must be placed to obtain a “worst case” estimate depends decisively on whether the covariance matrix of the regressor vector is known or unknown.
Furthermore, initial results on a suitable definition of breakdown behavior in the finite sample case from the dissertation of Hilker (1997) are analyzed and extended to the multidimensional case. It turns out that the canonical correlation distance measure used by Hilker and the breakdown point definition he introduced are no longer suitable for the extension to the multidimensional case. An alternative breakdown point definition for the finite sample case is therefore proposed, based on a metric suitable for subspaces. The results obtained in the dissertation are supported by a simulation study.
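
The projection-matrix distance described in this abstract can be illustrated with a minimal Python sketch. It assumes each subspace is given as a list of spanning vectors; the function names (`orthonormal_basis`, `projection_matrix`, `subspace_distance`) are hypothetical and not taken from the thesis, whose exact breakdown definition may differ.

```python
import math

def orthonormal_basis(vectors, tol=1e-12):
    """Modified Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            dot = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - dot * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > tol:  # skip vectors already (numerically) in the span
            basis.append([wi / norm for wi in w])
    return basis

def projection_matrix(vectors):
    """Orthogonal projection matrix P = sum_k b_k b_k^T onto span(vectors)."""
    basis = orthonormal_basis(vectors)
    p = len(vectors[0])
    return [[sum(b[i] * b[j] for b in basis) for j in range(p)]
            for i in range(p)]

def subspace_distance(span_a, span_b):
    """Frobenius norm of the difference of the two projection matrices."""
    pa = projection_matrix(span_a)
    pb = projection_matrix(span_b)
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(pa, pb)
                         for x, y in zip(row_a, row_b)))

# Hypothetical example in R^3: a "true" direction and two estimates.
true_space = [[1.0, 0.0, 0.0]]
same_space = [[2.0, 0.0, 0.0]]        # same line, different scaling
orthogonal_space = [[0.0, 1.0, 0.0]]  # orthogonal to the true direction

d_same = subspace_distance(true_space, same_space)
d_orth = subspace_distance(true_space, orthogonal_space)
```

Because the distance depends only on the projection matrices, it does not change when a different basis of the same subspace is supplied. For two K-dimensional subspaces it is 0 when they coincide and reaches the square root of 2K when they are mutually orthogonal, which is the maximal-distance case in this sketch.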

    A Robust Approach to Automatically Locating Grooves in 3D Bullet Land Scans

    Land engraved areas (LEAs) provide evidence to address the same source–different source problem in forensic firearms examination. Collecting 3D images of bullet LEAs requires capturing portions of the neighboring groove engraved areas (GEAs). Analyzing LEA and GEA data separately is essential for accuracy in automated comparison methods such as the one developed by Hare et al. (Ann Appl Stat 2017;11, 2332). Existing standard statistical modeling techniques often fail to adequately separate LEA and GEA data due to the atypical structure of 3D bullet data. We developed a method for automated removal of GEA data based on robust locally weighted regression (LOESS). This automated method was tested on high-resolution 3D scans of LEAs from two bullet test sets with a total of 622 LEA scans. Our robust LOESS method outperforms a previously proposed “rollapply” method. We conclude that our method is a major improvement upon rollapply, but that further validation needs to be conducted before the method can be applied in a fully automated fashion.

    rotations: An R Package for SO(3) Data

    In this article we introduce the rotations package, which provides users with the ability to simulate, analyze and visualize three-dimensional rotation data. More specifically, it includes four commonly used distributions from which to simulate data, four estimators of the central orientation, six confidence region estimation procedures and two approaches to visualizing rotation data. All of these features are available for two different parameterizations of rotations: three-by-three matrices and quaternions. In addition, two datasets are included that illustrate the use of rotation data in practice.
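
The two parameterizations named in this abstract, three-by-three matrices and quaternions, are related by a standard conversion. The following Python sketch is only an illustration of that relationship, not the API of the rotations R package; the function names and the (w, x, y, z) component ordering are assumptions made here.

```python
import math

def axis_angle_to_quaternion(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation by `angle` about `axis`."""
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0.0:
        raise ValueError("rotation axis must be non-zero")
    half = angle / 2.0
    s = math.sin(half) / norm
    return (math.cos(half), axis[0] * s, axis[1] * s, axis[2] * s)

def quaternion_to_matrix(q):
    """3x3 rotation matrix for quaternion q = (w, x, y, z); q is normalized first."""
    norm = math.sqrt(sum(c * c for c in q))
    if norm == 0.0:
        raise ValueError("quaternion must be non-zero")
    w, x, y, z = (c / norm for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

# Hypothetical example: a quarter turn about the z-axis.
q_quarter_turn = axis_angle_to_quaternion((0.0, 0.0, 1.0), math.pi / 2)
R_quarter_turn = quaternion_to_matrix(q_quarter_turn)
```

A quaternion and its negative describe the same rotation, so both map to the same matrix under this conversion.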

    The Power in Groups: Using Cluster Analysis to Critically Quantify Women’s STEM Enrollment

    Despite efforts to close the gender gap in science, technology, engineering, and math (STEM), disparities still exist, especially in math-intensive STEM (MISTEM) majors. Females and males receive similar academic preparation and, overall, perform similarly, yet females continue to enroll in STEM majors less frequently than males. In examining academic preparation, most research considers performance measures individually, ignoring the possible interrelationships between these measures. We address this problem by using hierarchical agglomerative clustering, a statistical technique that identifies groups (i.e., clusters) of students who are similar on multiple factors. We first applied this technique to readily available institutional data to determine whether we could identify distinct groups. The results showed that it was possible to identify nine unique groups. We then examined differences in STEM enrollment by group and by gender. We found that the proportion of females differed by group, and the gap between males and females also varied by group. Overall, males enrolled in STEM at a higher proportion than females and did so regardless of the strength of their academic preparation. Our results provide a novel yet feasible approach to examining gender differences in STEM enrollment in postsecondary education.
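
The hierarchical agglomerative clustering mentioned in this abstract can be sketched in a few lines of Python. The abstract does not state the linkage rule or distance measure, so average linkage with Euclidean distance is an assumption here, as are the function name `agglomerative_clusters` and the toy data, which are hypothetical standardized scores rather than the study's institutional data.

```python
import math
from itertools import combinations

def agglomerative_clusters(points, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain.

    Returns a list of clusters, each a sorted list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def average_linkage(a, b):
        # Mean Euclidean distance over all cross-cluster pairs.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # combinations yields index pairs (a, b) with a < b.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because b > a
    return [sorted(c) for c in clusters]

# Hypothetical students described by two standardized performance measures.
students = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0), (10.0, 0.0)]
groups = agglomerative_clusters(students, n_clusters=3)
```

This naive version recomputes all pairwise linkages at every merge, which is fine for a small illustration but too slow for large data sets, where a library implementation would be the practical choice.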