9 research outputs found

    tourrGui: A gWidgets GUI for the Tour to Explore High-Dimensional Data Using Low-Dimensional Projections

    This paper describes a graphical user interface (GUI) for the tourr package in R. The tour is a dynamic graphical method for viewing multivariate data. The GUI allows users to interact with a tour in order to explore the data for structures such as clustering, outliers, and nonlinear dependence. Users can pause the tour, choose a subset of variables, color points by other variables, and switch between several different types of tours.

    Visual Diagnostics for Constrained Optimisation with Application to Guided Tours

    A guided tour helps to visualise high-dimensional data by showing low-dimensional projections along a projection pursuit optimisation path. Projection pursuit is a generalisation of principal component analysis, in the sense that different indexes are used to define the interestingness of the projected data. While much work in the literature has gone into developing new indexes, less has been done on understanding the optimisation. Index functions can be noisy, might have multiple local maxima as well as an optimal maximum, and are constrained to generate orthonormal projection frames, which complicates the optimisation. In addition, projection pursuit is primarily used for exploratory data analysis, so finding the local maxima is also useful. The guided tour is especially useful for exploration, because it conducts geodesic interpolation connecting steps in the optimisation and shows how the projected data changes as a maximum is approached. This work provides new visual diagnostics for examining a choice of optimisation procedure, based on a new data object which collects information throughout the optimisation. It has helped to diagnose and fix several problems with the projection pursuit guided tour. This work might be useful more broadly for diagnosing optimisers and comparing their performance. The diagnostics are implemented in the R package ferrn.
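
The constrained optimisation described in this abstract can be illustrated with a minimal Python sketch. It is not the tourr or ferrn implementation: the function names, the simple random-search optimiser, the cooling schedule, and the holes-style index are all assumptions chosen for illustration. The sketch keeps each candidate frame orthonormal via a QR decomposition and records the index value at every step, loosely mirroring the idea of collecting information throughout the optimisation.

```python
import numpy as np


def orthonormalise(frame):
    """Return a matrix with orthonormal columns spanning the columns of `frame`."""
    q, _ = np.linalg.qr(frame)  # reduced QR: q has the same shape as frame
    return q


def holes_index(projected):
    """Holes-style index (assumed form): larger when few points sit near the centre.

    Assumes the data were centred and scaled before projection.
    """
    d = projected.shape[1]
    mean_kernel = np.mean(np.exp(-0.5 * np.sum(projected ** 2, axis=1)))
    return (1.0 - mean_kernel) / (1.0 - np.exp(-d / 2.0))


def random_search(data, d=2, n_iter=500, step=0.5, cooling=0.99, seed=0):
    """Hypothetical random-search optimiser over orthonormal p-by-d frames.

    Returns the best frame found and the trace of best index values,
    one entry for the start plus one per iteration.
    """
    rng = np.random.default_rng(seed)
    p = data.shape[1]
    frame = orthonormalise(rng.normal(size=(p, d)))
    best = holes_index(data @ frame)
    trace = [best]
    for _ in range(n_iter):
        # Perturb the current frame, then project back onto orthonormal frames.
        candidate = orthonormalise(frame + step * rng.normal(size=(p, d)))
        value = holes_index(data @ candidate)
        if value > best:  # accept only improvements
            frame, best = candidate, value
        trace.append(best)
        step *= cooling  # shrink the search neighbourhood over time
    return frame, np.array(trace)
```

Plotting the returned trace against the iteration number gives a basic view of how quickly a search stalls. A noisy index or a local maximum would show up as a long flat stretch.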

    PPtreeViz: An R Package for Visualizing Projection Pursuit Classification Trees

    PPtreeViz, an R package, was developed to explore projection pursuit methods for classification. It provides functions to calculate various projection pursuit indices for classification and to explore the results in the space of projection. It also provides functions for the projection pursuit classification tree. The visualization methods of the tree structure and the features of each node in PPtreeViz can be used to easily explore the projection pursuit classification tree structure and determine the characteristics of each class. To speed up the calculation and optimization of the projection pursuit indices, we use the Rcpp and RcppArmadillo packages in R.

    Racial Differences in Vaginal Fluid Metabolites and Association with Systemic Inflammation Markers among Ovarian Cancer Patients: A Pilot Study

    The vaginal microbiome differs by race and contributes to inflammation by directly producing or consuming metabolites or by indirectly inducing host immune response, but its potential contributions to ovarian cancer (OC) disparities remain unclear. In this exploratory cross-sectional study, we examine whether vaginal fluid metabolites differ by race among patients with OC, whether they are associated with systemic inflammation, and whether such associations differ by race. Study participants were recruited from the Ovarian Cancer Epidemiology, Healthcare Access, and Disparities Study between March 2021 and September 2022. Our study included 36 participants with ovarian cancer who provided biospecimens: 20 randomly selected White patients and all 16 eligible Black patients, aged 50–70 years. Acylcarnitines (n = 45 species), sphingomyelins (n = 34), and ceramides (n = 21) were assayed on cervicovaginal fluid, while four cytokines (IL-1β, IL-10, TNF-α, and IL-6) were assayed on saliva. Seven metabolites showed >2-fold differences, two showed significant differences using the Wilcoxon rank-sum test (p < 0.05), and 30 metabolites had coefficients > ±0.1 in a Penalized Discriminant Analysis that achieved two distinct clusters by race. Arachidonoylcarnitine, the carnitine adduct of arachidonic acid, appeared to be consistently different by race. Thirty-eight vaginal fluid metabolites were significantly correlated with systemic inflammation biomarkers, irrespective of race. These findings suggest that vaginal fluid metabolites may differ by race, are linked with systemic inflammation, and hint at a potential role for mitochondrial dysfunction and sphingolipid metabolism in OC disparities. Larger studies are needed to verify these findings and further establish specific biological mechanisms that may link the vaginal microbiome with OC racial disparities.

    Bagged projection methods for supervised classification in big data

    Classification methods are widely used for problems where rules to sort observations into groups are needed. There are many different methods to fit classification models, but none is universally best. This research develops a new classification method, along with visual tools for exploring the algorithms and results introduced in this work. The new classification method is a random forest built on trees using linear combinations of variables, which improves the predictive performance when the separation between classes is in combinations of variables. It is called a projection pursuit random forest (PPF). The benefit of the method is demonstrated using a simulation study and a suite of benchmark data. It is implemented in the R package PPforest, with core functions in Rcpp to improve the computational speed. The process of bagging and combining results from multiple trees produces numerous diagnostics which, with interactive graphics, can provide considerable insight into the class structure in high dimensions. A web app was designed and developed for this purpose. In the process of developing the PPF, some deficiencies were observed in the tree algorithm, PPtree, that forms the basic building block. This led to modifications to the algorithm, implemented in the R package PPtreeExt, and a small web app to help digest differences between various model parameter choices.

    Explorations of the lineup protocol for visual inference: application to high dimension, low sample size problems and metrics to assess the quality

    Statistical graphics play an important role in exploratory data analysis, model checking and diagnosis. Recent developments suggest that visual inference helps to quantify the significance of findings made from graphics. In visual inference, lineups embed the plot of the data among a set of null plots, and engage a human observer to select the plot that is most different from the rest. If the data plot is selected, this corresponds to the rejection of a null hypothesis. With high-dimensional data, statistical graphics are obtained by plotting low-dimensional projections; for example, in classification tasks projection pursuit is used to find low-dimensional projections that reveal differences between labelled groups. In many contemporary data sets the number of observations is relatively small compared to the number of variables, which is known as a high dimension, low sample size (HDLSS) problem. The research conducted and described in this thesis explores the use of visual inference for understanding low-dimensional pictures of HDLSS data. This approach may help to broaden the understanding of issues related to HDLSS data in the data analysis community. Methods are illustrated using data from a published paper, which erroneously found real separation in microarray data. The thesis also describes metrics developed to assist the use of lineups for making inferential statements. The metrics measure the quality of the lineup, and help to understand what people see in the data plots. The null plots represent a finite sample from a null distribution, and the selected sample potentially affects the ease or difficulty of a lineup. Distance metrics are designed to describe how close the true data plot is to the null plots, and how close the null plots are to each other. The distribution of the distance metrics is studied to learn how well it matches what people detect in the plots, and to assess the effect of the null-generating mechanism and plot choices for particular tasks. The analysis was conducted on data collected from Amazon Mechanical Turk studies that used lineups to study an array of exploratory data analysis tasks. Finally, an R package is constructed to provide open source tools for visual inference and distance metrics.
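
The lineup construction described in this abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the thesis's R package: the function name `make_lineup`, the choice of 20 panels, and the use of permutation as the null-generating mechanism (which breaks any association between the two variables) are all hypothetical choices made for the example.

```python
import numpy as np


def make_lineup(x, y, m=20, seed=0):
    """Hide the observed (x, y) data among m - 1 permutation-null data sets.

    Returns the list of m panels, each an (x, y) pair ready to be plotted,
    and the index of the panel that holds the observed data.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    y = np.asarray(y)
    position = int(rng.integers(m))  # random slot for the observed data
    panels = []
    for i in range(m):
        if i == position:
            panels.append((x, y))  # the data plot
        else:
            # Null plot: permuting y removes any real x-y relationship.
            panels.append((x, rng.permutation(y)))
    return panels, position
```

If the null hypothesis holds, the data panel is indistinguishable from the rest, so a single observer would pick it with probability 1/m. An observer who does pick it therefore provides evidence against the null.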