4,211 research outputs found
Cheaper and Better: Selecting Good Workers for Crowdsourcing
Crowdsourcing provides a popular paradigm for data collection at scale. We
study the problem of selecting subsets of workers from a given worker pool to
maximize the accuracy under a budget constraint. One natural question is
whether we should hire as many workers as the budget allows, or restrict on a
small number of top-quality workers. By theoretically analyzing the error rate
of a typical setting in crowdsourcing, we frame the worker selection problem
into a combinatorial optimization problem and propose an algorithm to solve it
efficiently. Empirical results on both simulated and real-world datasets show
that our algorithm is able to select a small number of high-quality workers,
and performs as good as, sometimes even better than, the much larger crowds as
the budget allows
Priors for Random Count Matrices Derived from a Family of Negative Binomial Processes
We define a family of probability distributions for random count matrices
with a potentially unbounded number of rows and columns. The three
distributions we consider are derived from the gamma-Poisson, gamma-negative
binomial, and beta-negative binomial processes. Because the models lead to
closed-form Gibbs sampling update equations, they are natural candidates for
nonparametric Bayesian priors over count matrices. A key aspect of our analysis
is the recognition that, although the random count matrices within the family
are defined by a row-wise construction, their columns can be shown to be i.i.d.
This fact is used to derive explicit formulas for drawing all the columns at
once. Moreover, by analyzing these matrices' combinatorial structure, we
describe how to sequentially construct a column-i.i.d. random count matrix one
row at a time, and derive the predictive distribution of a new row count vector
with previously unseen features. We describe the similarities and differences
between the three priors, and argue that the greater flexibility of the gamma-
and beta- negative binomial processes, especially their ability to model
over-dispersed, heavy-tailed count data, makes these well suited to a wide
variety of real-world applications. As an example of our framework, we
construct a naive-Bayes text classifier to categorize a count vector to one of
several existing random count matrices of different categories. The classifier
supports an unbounded number of features, and unlike most existing methods, it
does not require a predefined finite vocabulary to be shared by all the
categories, and needs neither feature selection nor parameter tuning. Both the
gamma- and beta- negative binomial processes are shown to significantly
outperform the gamma-Poisson process for document categorization, with
comparable performance to other state-of-the-art supervised text classification
algorithms.Comment: To appear in Journal of the American Statistical Association (Theory
and Methods). 31 pages + 11 page supplement, 5 figure
Non-adaptive pooling strategies for detection of rare faulty items
We study non-adaptive pooling strategies for detection of rare faulty items.
Given a binary sparse N-dimensional signal x, how to construct a sparse binary
MxN pooling matrix F such that the signal can be reconstructed from the
smallest possible number M of measurements y=Fx? We show that a very low number
of measurements is possible for random spatially coupled design of pools F. Our
design might find application in genetic screening or compressed genotyping. We
show that our results are robust with respect to the uncertainty in the matrix
F when some elements are mistaken.Comment: 5 page
Probabilistic Inference of Transcription Factor Binding from Multiple Data Sources
An important problem in molecular biology is to build a complete understanding of transcriptional regulatory processes in the cell. We have developed a flexible, probabilistic framework to predict TF binding from multiple data sources that differs from the standard hypothesis testing (scanning) methods in several ways. Our probabilistic modeling framework estimates the probability of binding and, thus, naturally reflects our degree of belief in binding. Probabilistic modeling also allows for easy and systematic integration of our binding predictions into other probabilistic modeling methods, such as expression-based gene network inference. The method answers the question of whether the whole analyzed promoter has a binding site, but can also be extended to estimate the binding probability at each nucleotide position. Further, we introduce an extension to model combinatorial regulation by several TFs. Most importantly, the proposed methods can make principled probabilistic inference from multiple evidence sources, such as, multiple statistical models (motifs) of the TFs, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments and other (prior) sequence-based biological knowledge. We developed both a likelihood and a Bayesian method, where the latter is implemented with a Markov chain Monte Carlo algorithm. Results on a carefully constructed test set from the mouse genome demonstrate that principled data fusion can significantly improve the performance of TF binding prediction methods. We also applied the probabilistic modeling framework to all promoters in the mouse genome and the results indicate a sparse connectivity between transcriptional regulators and their target promoters. To facilitate analysis of other sequences and additional data, we have developed an on-line web tool, ProbTF, which implements our probabilistic TF binding prediction method using multiple data sources. Test data set, a web tool, source codes and supplementary data are available at: http://www.probtf.org
Ligand-Based Virtual Screening Using Bayesian Inference Network and Reweighted Fragments
Many of the similarity-based virtual screening approaches assume that molecular fragments that are not related to the biological activity carry the same weight as the important ones. This was the reason that led to the use of Bayesian networks as an alternative to existing tools for similarity-based virtual screening. In our recent work, the retrieval performance of the Bayesian inference network (BIN) was observed to improve significantly when molecular fragments were reweighted using the relevance feedback information. In this paper, a set of active reference structures were used to reweight the fragments in the reference structure. In this approach, higher weights were assigned to those fragments that occur more frequently in the set of active reference structures while others were penalized. Simulated virtual screening experiments with MDL Drug Data Report datasets showed that the proposed approach significantly improved the retrieval effectiveness of ligand-based virtual screening, especially when the active molecules being sought had a high degree of structural heterogeneity
Partition MCMC for inference on acyclic digraphs
Acyclic digraphs are the underlying representation of Bayesian networks, a
widely used class of probabilistic graphical models. Learning the underlying
graph from data is a way of gaining insights about the structural properties of
a domain. Structure learning forms one of the inference challenges of
statistical graphical models.
MCMC methods, notably structure MCMC, to sample graphs from the posterior
distribution given the data are probably the only viable option for Bayesian
model averaging. Score modularity and restrictions on the number of parents of
each node allow the graphs to be grouped into larger collections, which can be
scored as a whole to improve the chain's convergence. Current examples of
algorithms taking advantage of grouping are the biased order MCMC, which acts
on the alternative space of permuted triangular matrices, and non ergodic edge
reversal moves.
Here we propose a novel algorithm, which employs the underlying combinatorial
structure of DAGs to define a new grouping. As a result convergence is improved
compared to structure MCMC, while still retaining the property of producing an
unbiased sample. Finally the method can be combined with edge reversal moves to
improve the sampler further.Comment: Revised version. 34 pages, 16 figures. R code available at
https://github.com/annlia/partitionMCM
- …