Partial identification in the statistical matching problem
The statistical matching problem involves the integration of multiple datasets in which some variables are not observed jointly. This missing-data pattern leaves most statistical models unidentified. Statistical inference is still possible when operating under the framework of partially identified models, where the goal is to bound the parameters rather than to estimate them precisely. In many matching problems, developing feasible bounds on the parameters is equivalent to finding the set of positive-definite completions of a partially specified covariance matrix. Existing methods for characterising the set of possible completions do not extend to high-dimensional problems. A Gibbs sampler to draw from the set of possible completions is proposed. The variation in the observed samples gives an estimate of the feasible region of the parameters. The Gibbs sampler extends easily to high-dimensional statistical matching problems.
Daniel Ahfock, Saumyadipta Pyne, Sharon X. Lee, Geoffrey J. McLachlan
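The covariance-completion idea can be illustrated in the simplest case: with three standardized variables X, Y, Z where only the pairs (X, Y) and (X, Z) are observed jointly, the unobserved correlation rho_YZ is bounded by the values that keep the full correlation matrix positive definite. A minimal brute-force sketch of that feasible set, not the paper's Gibbs sampler (all names illustrative):

```python
import numpy as np

def feasible_rho_yz(rho_xy, rho_xz, n_grid=2001):
    """Scan candidate values of the unobserved correlation rho_YZ and keep
    those that make the 3x3 correlation matrix positive definite.
    (Grid-scan illustration only; the paper's Gibbs sampler is what scales
    to high-dimensional matrices.)"""
    feasible = []
    for r in np.linspace(-1.0, 1.0, n_grid):
        m = np.array([[1.0,    rho_xy, rho_xz],
                      [rho_xy, 1.0,    r],
                      [rho_xz, r,      1.0]])
        if np.all(np.linalg.eigvalsh(m) > 0):   # positive definite?
            feasible.append(r)
    return min(feasible), max(feasible)

# Analytic bounds: rho_xy*rho_xz +/- sqrt((1 - rho_xy^2)(1 - rho_xz^2))
lo, hi = feasible_rho_yz(0.6, 0.5)
```

The endpoints recovered by the scan match the analytic interval for the 3x3 case; a grid scan is infeasible in high dimensions, which is the gap the proposed sampler addresses.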
Indeterministic Handling of Uncertain Decisions in Duplicate Detection
In current research, duplicate detection is usually treated as a deterministic process in which tuples are either declared as duplicates or not. However, it is often not completely clear whether two tuples represent the same real-world entity. Deterministic approaches ignore this uncertainty, which in turn can lead to false decisions. In this paper, we present an indeterministic approach for handling uncertain decisions in a duplicate detection process by using a probabilistic target schema. Thus, instead of deciding between multiple possible worlds, all these worlds can be modeled in the resulting data. This approach minimizes the negative impact of false decisions. Furthermore, the duplicate detection process becomes almost fully automatic, and human effort can be reduced to a large extent. Unfortunately, a fully indeterministic approach is by definition too expensive (in time as well as in storage) and hence impractical. For that reason, we additionally introduce several semi-indeterministic methods for heuristically reducing the set of indeterministically handled decisions in a meaningful way.
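The contrast between deterministic and semi-indeterministic handling can be sketched as follows: rather than thresholding a match score into a hard yes/no, scores inside an uncertainty band keep both possible worlds with weights. This is an illustration, not the paper's method; the thresholds and the linear weighting are assumptions:

```python
def duplicate_worlds(sim, lo=0.4, hi=0.8):
    """Return possible worlds with weights for a tuple pair with match
    score `sim`. Scores in the uncertainty band (lo, hi) keep both worlds
    (semi-indeterministic handling); outside the band the decision is hard.
    Thresholds and linear weighting are illustrative assumptions."""
    if sim >= hi:
        return [("duplicate", 1.0)]
    if sim <= lo:
        return [("distinct", 1.0)]
    p = (sim - lo) / (hi - lo)   # interpolated match probability
    return [("duplicate", p), ("distinct", 1.0 - p)]
```

Narrowing the band [lo, hi] reduces the set of indeterministically handled decisions, trading storage and query cost against the risk of false hard decisions.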
Vehicle-Rear: A New Dataset to Explore Feature Fusion for Vehicle Identification Using Convolutional Neural Networks
This work addresses the problem of vehicle identification through
non-overlapping cameras. As our main contribution, we introduce a novel dataset
for vehicle identification, called Vehicle-Rear, that contains more than three
hours of high-resolution videos, with accurate information about the make,
model, color and year of nearly 3,000 vehicles, in addition to the position and
identification of their license plates. To explore our dataset we design a
two-stream CNN that simultaneously uses two of the most distinctive and
persistent features available: the vehicle's appearance and its license plate.
This is an attempt to tackle a major problem: false alarms caused by vehicles
with similar designs or by nearly identical license plates. In the first
network stream, shape similarities are identified by a Siamese CNN that uses a
pair of low-resolution vehicle patches recorded by two different cameras. In
the second stream, we use a CNN for OCR to extract textual information,
confidence scores, and string similarities from a pair of high-resolution
license plate patches. Then, features from both streams are merged by a
sequence of fully connected layers for the final decision. In our experiments, we
compared the two-stream network against several well-known CNN architectures
using single or multiple vehicle features. The architectures, trained models,
and dataset are publicly available at https://github.com/icarofua/vehicle-rear
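The feature-level fusion described above, where embeddings from the two streams are concatenated and passed through fully connected layers, can be sketched with random weights. The embedding sizes and layer widths below are hypothetical, not those of the paper's network:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_relu(x, w, b):
    """One fully connected layer with ReLU activation."""
    return np.maximum(x @ w + b, 0.0)

# Hypothetical embedding sizes for the two streams (not the paper's values).
shape_feat = rng.normal(size=64)   # Siamese stream: appearance similarity
ocr_feat = rng.normal(size=32)     # OCR stream: text, confidences, similarity

fused = np.concatenate([shape_feat, ocr_feat])   # feature-level fusion
# Weights scaled by 1/sqrt(fan_in) to keep activations well-behaved.
h = dense_relu(fused, rng.normal(size=(96, 16)) / np.sqrt(96), np.zeros(16))
logit = float(h @ (rng.normal(size=16) / np.sqrt(16)))   # same-vehicle score
prob = 1.0 / (1.0 + np.exp(-logit))                      # in (0, 1)
```

In a trained network the weights would be learned jointly, so the decision layers can down-weight the OCR stream when plate resolution is poor and the shape stream when designs are near-identical.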
Minimal inference from incomplete 2x2-tables
Estimates based on 2x2 tables of frequencies are widely used in statistical
applications. However, in many cases these tables are incomplete in the sense
that the data required to compute the frequencies for a subset of the cells
defining the table are unavailable. Minimal inference addresses those
situations where this incompleteness leads to target parameters for these
tables that are interval, rather than point, identifiable. In particular, we
develop the concept of corroboration as a measure of the statistical evidence
in the observed data that is not based on likelihoods. The corroboration
function identifies the parameter values that are the hardest to refute, i.e.,
those values which, under repeated sampling, remain interval identified. This
enables us to develop a general approach to inference from incomplete 2x2
tables when the additional assumptions required to support a likelihood-based
approach cannot be sustained based on the data available. This minimal
inference approach then provides a foundation for further analysis that aims at
making sharper inferences supported by plausible external beliefs.
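Interval (rather than point) identification can be illustrated with the simplest incomplete-data case: bounding a success probability when some outcomes are unobserved. This is a worst-case-bounds sketch of the general idea, not the paper's corroboration function:

```python
def proportion_bounds(successes, observed, total):
    """Interval identification of a success probability when outcomes are
    missing for `total - observed` units: the lower bound assumes all
    missing outcomes are failures, the upper bound that all are successes.
    Worst-case-bounds illustration, not the paper's corroboration function."""
    missing = total - observed
    lower = successes / total
    upper = (successes + missing) / total
    return lower, upper

lo, hi = proportion_bounds(30, 80, 100)  # 20 outcomes unobserved
```

No likelihood for the missing cells is assumed; every probability in [lo, hi] is consistent with the observed data, which is exactly the situation where the target parameter is interval rather than point identified.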
Partial Identification in Matching Models for the Marriage Market
We study partial identification of the preference parameters in models of
one-to-one matching with perfectly transferable utilities, without imposing
parametric distributional restrictions on the unobserved heterogeneity and with
data on one large market. We provide a tractable characterisation of the
identified set, under various classes of nonparametric distributional
assumptions on the unobserved heterogeneity. Using our methodology, we
re-examine some of the relevant questions in the empirical literature on the
marriage market which have been previously studied under the Multinomial Logit
assumption.