
    Partial identification in the statistical matching problem

    The statistical matching problem involves the integration of multiple datasets where some variables are not observed jointly. This missing data pattern leaves most statistical models unidentifiable. Statistical inference is still possible when operating under the framework of partially identified models, where the goal is to bound the parameters rather than to estimate them precisely. In many matching problems, developing feasible bounds on the parameters is equivalent to finding the set of positive-definite completions of a partially specified covariance matrix. Existing methods for characterising the set of possible completions do not extend to high-dimensional problems. A Gibbs sampler to draw from the set of possible completions is proposed. The variation in the observed samples gives an estimate of the feasible region of the parameters. The Gibbs sampler extends easily to high-dimensional statistical matching problems.
    Daniel Ahfock, Saumyadipta Pyne, Sharon X. Lee, Geoffrey J. McLachla
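
The link between feasible bounds and positive-definite completions can be illustrated in the smallest case. The sketch below is not the Gibbs sampler from the abstract. It assumes a hypothetical three-variable setting in which the correlations of X with Y and of X with Z are known while the correlation of Y with Z is never observed, and it uses the standard closed-form interval obtained from requiring a non-negative determinant. The function names are illustrative only.

```python
import math
from typing import Tuple


def correlation_bounds(rho_xy: float, rho_xz: float) -> Tuple[float, float]:
    """Interval of rho_yz values that keep the 3x3 correlation matrix
    positive semidefinite, given the two observed correlations."""
    for rho in (rho_xy, rho_xz):
        if not -1.0 <= rho <= 1.0:
            raise ValueError("correlations must lie in [-1, 1]")
    centre = rho_xy * rho_xz
    half_width = math.sqrt((1.0 - rho_xy ** 2) * (1.0 - rho_xz ** 2))
    return centre - half_width, centre + half_width


def determinant(rho_xy: float, rho_xz: float, rho_yz: float) -> float:
    """Determinant of the 3x3 correlation matrix with the given off-diagonals."""
    return (1.0 + 2.0 * rho_xy * rho_xz * rho_yz
            - rho_xy ** 2 - rho_xz ** 2 - rho_yz ** 2)


if __name__ == "__main__":
    lower, upper = correlation_bounds(0.6, 0.8)
    print(f"rho_yz is only identified up to [{lower:.2f}, {upper:.2f}]")
```

In higher dimensions no such closed form is generally available, which is the situation the abstract says motivates sampling from the set of completions instead.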

    Indeterministic Handling of Uncertain Decisions in Duplicate Detection

    In current research, duplicate detection is usually considered as a deterministic approach in which tuples are either declared as duplicates or not. However, most often it is not completely clear whether two tuples represent the same real-world entity or not. Deterministic approaches ignore this uncertainty, which in turn can lead to false decisions. In this paper, we present an indeterministic approach for handling uncertain decisions in a duplicate detection process by using a probabilistic target schema. Thus, instead of deciding between multiple possible worlds, all these worlds can be modeled in the resulting data. This approach minimizes the negative impacts of false decisions. Furthermore, the duplicate detection process becomes almost fully automatic and human effort can be reduced to a large extent. Unfortunately, a fully indeterministic approach is by definition too expensive (in time as well as in storage) and hence impractical. For that reason, we additionally introduce several semi-indeterministic methods for heuristically reducing the set of indeterministically handled decisions in a meaningful way.

    Vehicle-Rear: A New Dataset to Explore Feature Fusion for Vehicle Identification Using Convolutional Neural Networks

    This work addresses the problem of vehicle identification through non-overlapping cameras. As our main contribution, we introduce a novel dataset for vehicle identification, called Vehicle-Rear, that contains more than three hours of high-resolution videos, with accurate information about the make, model, color and year of nearly 3,000 vehicles, in addition to the position and identification of their license plates. To explore our dataset we design a two-stream CNN that simultaneously uses two of the most distinctive and persistent features available: the vehicle's appearance and its license plate. This is an attempt to tackle a major problem: false alarms caused by vehicles with similar designs or by very close license plate identifiers. In the first network stream, shape similarities are identified by a Siamese CNN that uses a pair of low-resolution vehicle patches recorded by two different cameras. In the second stream, we use a CNN for OCR to extract textual information, confidence scores, and string similarities from a pair of high-resolution license plate patches. Then, features from both streams are merged by a sequence of fully connected layers for decision. In our experiments, we compared the two-stream network against several well-known CNN architectures using single or multiple vehicle features. The architectures, trained models, and dataset are publicly available at https://github.com/icarofua/vehicle-rear

    Minimal inference from incomplete 2x2-tables

    Estimates based on 2x2 tables of frequencies are widely used in statistical applications. However, in many cases these tables are incomplete in the sense that the data required to compute the frequencies for a subset of the cells defining the table are unavailable. Minimal inference addresses those situations where this incompleteness leads to target parameters for these tables that are interval, rather than point, identifiable. In particular, we develop the concept of corroboration as a measure of the statistical evidence in the observed data that is not based on likelihoods. The corroboration function identifies the parameter values that are the hardest to refute, i.e., those values which, under repeated sampling, remain interval identified. This enables us to develop a general approach to inference from incomplete 2x2 tables when the additional assumptions required to support a likelihood-based approach cannot be sustained based on the data available. This minimal inference approach then provides a foundation for further analysis that aims at making sharper inference supported by plausible external beliefs
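
A minimal sketch of interval identification in a 2x2 table follows. It assumes one particular form of incompleteness chosen here for illustration, not taken from the abstract: only the row total, column total, and grand total are known, and the top-left cell count is missing. The classical Fréchet bounds then give the range of cell counts consistent with the margins. The function name and arguments are hypothetical.

```python
from typing import Tuple


def cell_bounds(row_total: int, col_total: int, grand_total: int) -> Tuple[int, int]:
    """Fréchet bounds on the top-left cell of a 2x2 table whose margins
    are known but whose interior counts are not observed."""
    if not (0 <= row_total <= grand_total and 0 <= col_total <= grand_total):
        raise ValueError("margins must lie between 0 and the grand total")
    lower = max(0, row_total + col_total - grand_total)
    upper = min(row_total, col_total)
    return lower, upper


if __name__ == "__main__":
    low, high = cell_bounds(row_total=60, col_total=70, grand_total=100)
    print(f"the cell count is interval identified: {low} to {high}")
```

Any parameter computed from the cell, such as a joint proportion, inherits an interval rather than a point value under this assumption.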

    Partial Identification in Matching Models for the Marriage Market

    We study partial identification of the preference parameters in models of one-to-one matching with perfectly transferable utilities, without imposing parametric distributional restrictions on the unobserved heterogeneity and with data on one large market. We provide a tractable characterisation of the identified set, under various classes of nonparametric distributional assumptions on the unobserved heterogeneity. Using our methodology, we re-examine some of the relevant questions in the empirical literature on the marriage market which have been previously studied under the Multinomial Logit assumption