36 research outputs found

    Statistical Techniques for Exploratory Analysis of Structured Three-Way and Dynamic Network Data.

    In this thesis, I develop different techniques for the pattern extraction and visual exploration of a collection of data matrices. Specifically, I present methods to help home in on and visualize an underlying structure and its evolution over ordered (e.g., time) or unordered (e.g., experimental conditions) index sets. The first part of the thesis introduces a biclustering technique for such three-dimensional data arrays. This technique is capable of discovering potentially overlapping groups of samples and variables that evolve similarly with respect to a subset of conditions. To facilitate and enhance visual exploration, I introduce a framework that utilizes kernel smoothing to guide the estimation of bicluster responses over the array. In the second part of the thesis, I introduce two matrix factorization models. The first is a data integration model that decomposes the data into two factors: a basis common to all data matrices, and a coefficient matrix that varies for each data matrix. The second model is meant for visual clustering of nodes in dynamic network data, which often contains complex evolving structure. Hence, this approach is more flexible and additionally lets the basis evolve for each matrix in the array. Both models utilize a regularization within the framework of non-negative matrix factorization to encourage local smoothness of the basis and coefficient matrices, which improves interpretability and highlights the structural patterns underlying the data, while mitigating noise effects. I also address computational aspects of applying regularized non-negative matrix factorization models to large data arrays by presenting multiple algorithms, including an approximation algorithm based on alternating least squares.
    PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/99838/1/smankad_1.pd

    Non-Standard Errors

    In statistics, samples are drawn from a population in a data-generating process (DGP). Standard errors measure the uncertainty in estimates of population parameters. In science, evidence is generated to test hypotheses in an evidence-generating process (EGP). We claim that EGP variation across researchers adds uncertainty: non-standard errors (NSEs). We study NSEs by letting 164 teams test the same hypotheses on the same data. NSEs turn out to be sizable, but smaller for more reproducible or higher-rated research. Adding peer-review stages reduces NSEs. We further find that this type of uncertainty is underestimated by participants.

    Biclustering Three-Dimensional Data Arrays With Plaid Models

    Three-dimensional data arrays (collections of individual data matrices) are increasingly prevalent in modern data and pose unique challenges to pattern extraction and visualization. This article introduces a biclustering technique for exploration and pattern detection in such complex structured data. The proposed framework couples the popular plaid model together with tools from functional data analysis to guide the estimation of bicluster responses over the array. We present an efficient algorithm that first detects biclusters that exhibit strong deviations for some data matrices, and then estimates their responses over the entire data array. Altogether, the framework is useful to home in on and display underlying structure and its evolution over conditions/time. The methods are scalable to large datasets, and can accommodate a variety of dynamic patterns. The proposed techniques are illustrated on gene expression data and bilateral trade networks. Supplementary materials are available online.

    A for Effort? Using the Crowd to Identify Moral Hazard in NYC Restaurant Hygiene Inspections

    From an upset stomach to a life-threatening foodborne illness, getting sick is all too common after eating in restaurants. While health inspection programs are designed to protect consumers, such inspections typically occur at wide intervals of time, allowing restaurant hygiene to remain unmonitored in the interim periods. Information provided in online reviews may be effectively used in these interim periods to gauge restaurant hygiene. In this paper, we provide evidence for how information from online reviews of restaurants can be effectively used to identify cases of hygiene violations in restaurants, even after the restaurant has been inspected and certified. We use data from restaurant hygiene inspections in New York City, from the launch of an inspection program in 2010 through 2016, and combine this data with online reviews for the same set of restaurants. Using supervised machine learning techniques, we then create a hygiene dictionary specifically crafted to identify hygiene-related concerns, and use it to identify systematic instances of moral hazard, wherein restaurants with positive hygiene inspection scores are seen to regress in their hygiene maintenance within 90 days of receiving the inspection scores. To the extent that social media provides some visibility into the hygiene practices of restaurants, we argue that the effects of information asymmetry that lead to moral hazard may be partially mitigated in this context. Based on our work, we also provide strategies for how cities and policy-makers may design effective restaurant inspection programs, through a combination of traditional inspections and the appropriate use of social media.