Model Selection in an Information Economy: Choosing What to Learn
As online markets for the exchange of goods and services become more common, the study of markets composed at least in part of autonomous agents has taken on increasing importance. In contrast to traditional complete-information economic scenarios, agents that are operating in an electronic marketplace often do so under considerable uncertainty. In order to reduce their uncertainty, these agents must learn about the world around them. When an agent producer is engaged in a learning task in which data collection is costly, such as learning the preferences of a consumer population, it is faced with a classic decision problem: when to explore and when to exploit. If the agent has a limited number of chances to experiment, it must explicitly weigh the cost of learning (in terms of foregone profit) against the value of the information acquired. Information goods add an additional dimension to this problem; due to their flexibility, they can be bundled and priced according to a number of different price schedules. An optimizing producer should consider the profit each price schedule can extract, as well as the difficulty of learning that schedule. In this paper, we demonstrate the tradeoff between complexity and profitability for a number of common price schedules. We begin with a one-shot decision as to which schedule to learn. Schedules with moderate complexity are preferred in the short and medium term, as they are learned quickly, yet extract a significant fraction of the available profit. We then turn to the repeated version of this one-shot decision and show that moderate-complexity schedules, in particular the two-part tariff, perform well when the producer must adapt to nonstationarity in the consumer population. When a producer can dynamically change schedules as it learns, it can use an explicit decision-theoretic formulation to greedily select the schedule which appears to yield the greatest profit in the next period.
By explicitly considering both the learnability of and the profit extracted by different price schedules, a producer can extract more profit as it learns than if it naively chose models that are accurate once learned.
Keywords: online learning; information economics; model selection; direct search
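The tradeoff above can be made concrete with a toy profit calculation. The sketch below assumes a hypothetical consumer population with linear demand q(p) = max(0, a - b*p) and zero marginal cost (none of which comes from the paper), evaluates a two-part tariff (entry fee plus unit price), and greedily picks the candidate schedule that appears most profitable:

```python
# Toy two-part tariff profit model. Assumptions (not from the paper):
# linear consumer demand q(p) = max(0, a - b*p), zero marginal cost,
# and a consumer buys in iff her surplus covers the entry fee.

def consumer_surplus(a, b, p):
    """Surplus of a linear-demand consumer facing unit price p."""
    q = max(0.0, a - b * p)
    return q * q / (2.0 * b)

def two_part_tariff_profit(population, fee, p):
    """Total profit: each participating consumer pays fee + p per unit."""
    profit = 0.0
    for a, b in population:
        if consumer_surplus(a, b, p) >= fee:
            profit += fee + p * max(0.0, a - b * p)
    return profit

# Greedy selection over a coarse grid of candidate schedules: pick the
# (fee, p) pair that appears most profitable for the next period.
population = [(10.0, 1.0), (8.0, 1.0), (12.0, 2.0)]
candidates = [(fee, p) for fee in (0.0, 5.0, 10.0, 20.0)
              for p in (0.0, 1.0, 2.0, 4.0)]
best = max(candidates, key=lambda fp: two_part_tariff_profit(population, *fp))
```

A one-shot learner would compare the best such profit against simpler flat-fee and linear schedules, netting out the experimentation spent learning each.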
Space-efficient data sketching algorithms for network applications
Sketching techniques are widely adopted in network applications. Sketching algorithms "encode" data into succinct data structures that can later be accessed and "decoded" for various purposes, such as network measurement, accounting, and anomaly detection. Bloom filters and counter braids are two well-known representatives in this category. These sketching algorithms usually need to strike a tradeoff between performance (how much information can be revealed, and how fast) and cost (storage, transmission and computation). This dissertation is dedicated to the research and development of several sketching techniques, including improved forms of stateful Bloom filters, statistical counter arrays and error estimating codes. The Bloom filter is a space-efficient randomized data structure for approximately representing a set in order to support membership queries. The Bloom filter and its variants have found widespread use in many networking applications, where it is important to minimize the cost of storing and communicating network data. In this thesis, we propose a family of Bloom filter variants augmented by a rank-indexing method. We show that such augmentation can significantly reduce both the space and the number of memory accesses required, especially when deletions of set elements from the Bloom filter need to be supported. The exact active counter array is another important building block in many sketching algorithms, where the storage cost of the array is of paramount concern. Previous approaches reduce the storage cost while either losing accuracy or supporting only passive measurements. In this thesis, we propose an exact statistics counter array architecture that supports active measurements (real-time read and write). It also leverages the aforementioned rank-indexing method and exploits statistical multiplexing to minimize the storage cost of the counter array. Error estimating coding (EEC) has recently been established as an important tool for estimating bit error rates in the transmission of packets over wireless links. In essence, the EEC problem is also a sketching problem, since EEC codes can be viewed as a sketch of the packet sent, which is decoded by the receiver to estimate the bit error rate. In this thesis, we first investigate the asymptotic bound of error estimating coding by viewing the problem from a two-party computation perspective, and then investigate its coding/decoding efficiency using Fisher information analysis. Further, we develop several sketching techniques, including the enhanced tug-of-war (EToW) sketch and the generalized EEC (gEEC) sketch family, which achieve around a 70% reduction in sketch size with similar estimation accuracy. For all the solutions proposed above, we use theoretical tools such as information theory and communication complexity to investigate how far our proposed solutions are from the theoretical optimum, and we show that they are asymptotically or empirically very close to the theoretical bounds.
PhD thesis. Committee Chair: Xu, Jun; Committee Members: Feamster, Nick; Li, Baochun; Romberg, Justin; Zegura, Ellen W.
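For readers unfamiliar with the building blocks discussed above, a minimal Bloom filter can be sketched as follows. This is the textbook structure, not the rank-indexed variants proposed in the thesis; the SHA-256 indexing scheme and the sizes are illustrative only:

```python
# Minimal Bloom filter: k hash positions per item, no deletions.
# Sizes and the SHA-256 indexing scheme are illustrative only.
import hashlib

class BloomFilter:
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits)      # one byte per bit, for clarity

    def _positions(self, item):
        for i in range(self.k):            # k derived hash positions
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # May return a false positive; never a false negative.
        return all(self.bits[pos] for pos in self._positions(item))

flows = BloomFilter(1024, 4)
flows.add("10.0.0.1:443")
```

Membership is then a constant-time check (`"10.0.0.1:443" in flows`), at the cost of a tunable false-positive rate governed by m and k, which is exactly the performance/cost tradeoff the dissertation studies.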
Detecting multivariate interactions in spatial point patterns with Gibbs models and variable selection
We propose a method for detecting significant interactions in very large multivariate spatial point patterns, developing high-dimensional data analysis in the point process setting. The method is based on modelling the patterns with a flexible Gibbs point process model that directly characterises point-to-point interactions at different spatial scales. By using the Gibbs framework, significant interactions can be captured even at small scales. The Gibbs point process is then fitted via a pseudo-likelihood approximation, and significant interactions are selected automatically by applying the group lasso penalty to this likelihood approximation, so that the multivariate interactions are estimated stably even in this setting. We demonstrate the feasibility of the method with a simulation study and show its power by applying it to a large and complex rainforest plant population data set of 83 species.
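The selection step rests on the group lasso's ability to zero out whole groups of interaction coefficients at once. A minimal sketch of its proximal operator (block soft-thresholding), with illustrative values unrelated to the rainforest data:

```python
# Block soft-thresholding: the proximal operator of the group lasso
# penalty lam * ||beta_g||_2. A whole group of interaction coefficients
# is shrunk jointly, and dropped entirely when its norm is below lam.
# Groups and values are illustrative, not fitted coefficients.
import math

def group_soft_threshold(group, lam):
    norm = math.sqrt(sum(g * g for g in group))
    if norm <= lam:
        return [0.0] * len(group)          # interaction removed from the model
    scale = 1.0 - lam / norm
    return [scale * g for g in group]
```

A strong group such as [3.0, 4.0] survives shrinkage, while a weak one such as [0.1, 0.1] is set exactly to zero, which is what makes the interaction selection automatic.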
Uniform Inference for Kernel Density Estimators with Dyadic Data
Dyadic data is often encountered when quantities of interest are associated with the edges of a network. As such, it plays an important role in statistics, econometrics and many other data science disciplines. We consider the problem of uniformly estimating a dyadic Lebesgue density function, focusing on nonparametric kernel-based estimators taking the form of dyadic empirical processes. Our main contributions include the minimax-optimal uniform convergence rate of the dyadic kernel density estimator, along with strong approximation results for the associated standardized and Studentized t-processes. A consistent variance estimator enables the construction of valid and feasible uniform confidence bands for the unknown density function. We showcase the broad applicability of our results by developing novel counterfactual density estimation and inference methodology for dyadic data, which can be used for causal inference and program evaluation. A crucial feature of dyadic distributions is that they may be "degenerate" at certain points in the support of the data, a property which makes our analysis somewhat delicate; nonetheless, our methods for uniform inference remain robust to the potential presence of such points. For implementation purposes, we discuss inference procedures based on positive semi-definite covariance estimators, mean squared error optimal bandwidth selectors and robust bias correction techniques. We illustrate the empirical finite-sample performance of our methods both in simulations and with real-world trade data, for which we compare observed and counterfactual trade distributions in different years. Our technical results concerning strong approximations and maximal inequalities are of potential independent interest.
Comment: Article: 23 pages, 3 figures. Supplemental appendix: 72 pages, 3 figures.
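The point estimator at the core of these results averages a kernel over all n(n-1)/2 edge observations. A minimal sketch, with an illustrative Epanechnikov kernel and bandwidth; the paper's inference layer (Studentization, bias correction, confidence bands) is not shown:

```python
# Dyadic kernel density estimator: average a kernel over the n(n-1)/2
# edge outcomes W[i][j]. Kernel and bandwidth choices are illustrative.
import math

def epanechnikov(u):
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def dyadic_kde(W, x, h):
    """W is a symmetric n-by-n matrix of edge outcomes; diagonal ignored."""
    n = len(W)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):          # one term per unordered pair
            total += epanechnikov((W[i][j] - x) / h) / h
            pairs += 1
    return total / pairs
```

Because the n(n-1)/2 terms share underlying nodes, they are not independent, which is what makes the uniform inference theory for this estimator delicate.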
Higher-order properties of approximate estimators
Many modern estimation methods in econometrics approximate an objective function, for instance through simulation or discretization. These approximations typically affect both the bias and the variance of the resulting estimator. We first provide a higher-order expansion of such "approximate" estimators that takes into account the errors due to the use of approximations. We show how a Newton–Raphson adjustment can reduce the impact of approximations. Then we use our expansions to develop inferential tools that take into account approximation errors: we propose adjustments of the approximate estimator that remove its first-order bias and adjust its standard errors. These corrections apply to a class of approximate estimators that includes all known simulation-based procedures. A Monte Carlo simulation on the mixed logit model shows that our proposed adjustments can yield significant improvements at a low computational cost.
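The Newton–Raphson adjustment can be illustrated on a toy quadratic objective: an estimator that minimizes a crude approximation is pushed toward the exact minimizer by a single Newton step on a finer approximation. All numbers below are illustrative assumptions, not the paper's setup:

```python
# One Newton-Raphson step on a finer approximation of the objective,
# starting from the minimizer of a crude approximation. The objective
# Q(theta) = (theta - 2)**2 and the starting value are toy choices.

def newton_step(theta, grad, hess):
    """theta - grad(theta) / hess(theta): one Newton-Raphson update."""
    return theta - grad(theta) / hess(theta)

grad = lambda t: 2.0 * (t - 2.0)   # Q'(theta)
hess = lambda t: 2.0               # Q''(theta)

theta_crude = 2.3                  # biased minimizer of a crude approximation
theta_adj = newton_step(theta_crude, grad, hess)
```

For a quadratic objective one step lands exactly on the minimizer; more generally it removes the leading-order effect of the approximation error, which is the sense in which the adjustment reduces the impact of approximations.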
Bayesian score calibration for approximate models
Scientists continue to develop increasingly complex mechanistic models to
reflect their knowledge more realistically. Statistical inference using these
models can be challenging since the corresponding likelihood function is often
intractable and model simulation may be computationally burdensome.
Fortunately, in many of these situations, it is possible to adopt a surrogate
model or approximate likelihood function. It may be convenient to conduct
Bayesian inference directly with the surrogate, but this can result in bias and
poor uncertainty quantification. In this paper we propose a new method for
adjusting approximate posterior samples to reduce bias and produce more
accurate uncertainty quantification. We do this by optimizing a transform of
the approximate posterior that maximizes a scoring rule. Our approach requires
only a (fixed) small number of complex model simulations and is numerically
stable. We demonstrate good performance of the new method on several examples
of increasing complexity.Comment: 27 pages, 8 figures, 5 table
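As a loudly simplified sketch of the idea, the snippet below learns an affine adjustment of approximate posterior output from a handful of calibration simulations. The paper instead optimizes the transform against a proper scoring rule; this least-squares fit of mean behaviour is only a caricature of that step, with made-up numbers:

```python
# Loudly simplified: learn an affine correction for approximate posterior
# output from a few calibration runs (true parameter, approximate
# posterior mean). The real method optimizes a transform of the
# approximate posterior against a proper scoring rule.
from statistics import mean

def fit_affine(approx_means, true_params):
    """Least-squares slope and intercept mapping approximate means to truths."""
    mx, my = mean(approx_means), mean(true_params)
    sxx = sum((x - mx) ** 2 for x in approx_means)
    sxy = sum((x - mx) * (y - my) for x, y in zip(approx_means, true_params))
    slope = sxy / sxx
    return slope, my - slope * mx

def adjust(samples, slope, intercept):
    """Apply the learned transform to approximate posterior samples."""
    return [slope * s + intercept for s in samples]

# Calibration runs where the surrogate posterior is biased upward by ~1.
approx_means = [2.1, 3.0, 4.1]
true_params = [1.0, 2.0, 3.0]
slope, intercept = fit_affine(approx_means, true_params)
```

Only a fixed, small number of complex-model simulations is needed to learn the correction, after which it can be applied to any number of cheap surrogate samples.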
Training Normalizing Flows from Dependent Data
Normalizing flows are powerful non-parametric statistical models that
function as a hybrid between density estimators and generative models. Current
learning algorithms for normalizing flows assume that data points are sampled
independently, an assumption that is frequently violated in practice, which may
lead to erroneous density estimation and data generation. We propose a
likelihood objective of normalizing flows incorporating dependencies between
the data points, for which we derive a flexible and efficient learning
algorithm suitable for different dependency structures. We show that respecting
dependencies between observations can improve empirical results on both
synthetic and real-world data, and leads to higher statistical power in a
downstream application to genome-wide association studies
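The key change is replacing the i.i.d. base density over the latents with a joint density over all of them. In one dimension, with two observations and a fixed flow x = exp(z), the change-of-variables likelihood with a correlated Gaussian base is closed-form; the flow, correlation and data here are illustrative assumptions, not the paper's learned model:

```python
# One-dimensional flow x = exp(z) with a *joint* (correlated) standard
# bivariate Gaussian base over the latents of two observations, instead
# of the usual i.i.d. product. All choices here are illustrative.
import math

def bivariate_normal_logpdf(z1, z2, rho):
    """Standard bivariate normal log-density with correlation rho."""
    det = 1.0 - rho * rho
    quad = (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / det
    return -0.5 * quad - 0.5 * math.log(det) - math.log(2.0 * math.pi)

def dependent_flow_loglik(x1, x2, rho):
    z1, z2 = math.log(x1), math.log(x2)   # inverse flow
    log_det_jac = -z1 - z2                # |dz/dx| = 1/x per observation
    return bivariate_normal_logpdf(z1, z2, rho) + log_det_jac
```

With rho = 0 this reduces to the usual independent objective; a learning algorithm would optimize the flow parameters (and possibly the dependency structure) on this joint likelihood.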
Design of resource to backbone transmission for a high wind penetration future
In a high wind penetration future, transmission must be designed to integrate groups of new wind farms with a high-capacity, inter-regional "backbone" transmission system. A design process is described which begins by identifying feasible sites for future wind farms, identifies an optimal set of those wind farms for a specified future, and designs a reliable, low-cost "resource to backbone" collector transmission network to connect each individual wind farm to the backbone transmission network. A model of the transmission and generation system in the state of Iowa is used to test these methods and to make observations about the nature of these resource-to-backbone networks.
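A crude stand-in for the collector-network step is a minimum spanning tree over straight-line distances from candidate sites to a backbone substation; real designs must also weigh reliability, line capacity and terrain, and the coordinates below are made up:

```python
# Minimum spanning tree over straight-line distances as a crude stand-in
# for collector-network design: connect every candidate wind farm site
# to the backbone substation (index 0) with minimum total line length.
import math

def mst_length(nodes):
    """Prim's algorithm; returns the total edge length of the tree."""
    in_tree = {0}
    total = 0.0
    while len(in_tree) < len(nodes):
        dist, nxt = min((math.dist(nodes[i], nodes[j]), j)
                        for i in in_tree
                        for j in range(len(nodes)) if j not in in_tree)
        total += dist
        in_tree.add(nxt)
    return total

sites = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]   # backbone first, then farms
```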