Crowdsourcing complex workflows under budget constraints
We consider the problem of task allocation in crowdsourcing systems with multiple complex workflows, each of which consists of a set of interdependent micro-tasks. We propose Budgeteer, an algorithm to solve this problem under a budget constraint. In particular, our algorithm first calculates an efficient way to allocate the budget to each workflow. It then determines the number of interdependent micro-tasks and the price to pay for each task within each workflow, given the corresponding budget constraints. We empirically evaluate it on a well-known crowdsourcing-based text correction workflow using Amazon Mechanical Turk, and show that Budgeteer achieves accuracy similar to current benchmarks while being on average 45% cheaper.
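The two-stage structure described above can be sketched as follows. The proportional split by `weight`, the fixed per-task price, and the integer-cent accounting are all illustrative assumptions for this sketch, not the published Budgeteer algorithm.

```python
def allocate_budget(workflows, total_budget):
    """Stage 1 (illustrative): split the total budget across workflows
    in proportion to an assumed per-workflow utility weight."""
    total_weight = sum(w["weight"] for w in workflows)
    return {w["name"]: total_budget * w["weight"] / total_weight
            for w in workflows}

def plan_workflow(budget_cents, price_cents, min_tasks=1):
    """Stage 2 (illustrative): given a workflow's budget and an assumed
    fixed price per micro-task, decide how many micro-tasks to post.
    Prices are kept in integer cents to avoid floating-point drift."""
    n_tasks = max(min_tasks, budget_cents // price_cents)
    return {"n_tasks": n_tasks, "spent_cents": n_tasks * price_cents}

# Hypothetical workflows and budget for illustration.
workflows = [{"name": "correction", "weight": 2.0},
             {"name": "verification", "weight": 1.0}]
split = allocate_budget(workflows, total_budget=90.0)
plan = plan_workflow(budget_cents=6000, price_cents=5)
```

In a real deployment, stage 2 would also trade off task count against price per task based on expected worker accuracy, which this sketch deliberately omits.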
SERIMI: Class-Based Matching for Instance Matching Across Heterogeneous Datasets
State-of-the-art instance matching approaches do not perform well when used for matching instances across heterogeneous datasets. This shortcoming stems from their reliance on direct matching, which involves a direct comparison of instances in the source with instances in the target dataset. Direct matching is not suitable when the overlap between the datasets is small. To resolve this problem, we propose a new paradigm called class-based matching. Given a class of instances from the source dataset, called the class of interest, and a set of candidate matches retrieved from the target, class-based matching refines the candidates by filtering out those that do not belong to the class of interest. For this refinement, only data in the target is used, i.e., no direct comparison between source and target is involved. Based on extensive experiments using public benchmarks, we show that our approach greatly improves the quality of state-of-the-art systems, especially on difficult matching tasks.
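The refinement step can be sketched as a filter over candidate instances that consults only target-side data. The dictionary shape, the `classes` field, and the example URIs are illustrative assumptions, not the actual SERIMI pipeline.

```python
def class_based_refine(candidates, class_of_interest):
    """Filter candidate matches using only target-side class information:
    keep a candidate iff its class set contains the class of interest.
    No source/target instance comparison is performed."""
    return [c for c in candidates if class_of_interest in c["classes"]]

# Hypothetical target-side candidates retrieved for a source class of cities.
candidates = [
    {"uri": "t:Paris",        "classes": {"City", "Capital"}},
    {"uri": "t:Paris_Hilton", "classes": {"Person"}},
    {"uri": "t:Paris_Texas",  "classes": {"City"}},
]
refined = class_based_refine(candidates, class_of_interest="City")
```

The point of the paradigm is visible even in this toy: the ambiguous label "Paris" is disambiguated purely by class membership in the target, with no access to the source instance's attributes.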
Structure Functions of Nuclei at Small x and Diffraction at HERA
Gribov theory is applied to investigate the shadowing effects in the
structure functions of nuclei. In this approach these effects are related to
the process of diffractive dissociation of a virtual photon. A model for this
diffractive process, which describes well the HERA data, is used to calculate
the shadowing in nuclear structure functions. A reasonable description of the
x, Q^2 and A-dependence of nuclear shadowing is achieved.
Comment: TeX, 10 pages, 7 figures in 6 PS files
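Schematically, and neglecting the longitudinal form-factor suppression at small x (normalization conventions vary between models), the Gribov relation used here ties the double-scattering shadowing correction to the forward diffractive dissociation cross section integrated over the mass of the diffractive system:

```latex
\frac{\delta F_2^A}{A\,F_2^N} \;\sim\;
-\,4\pi\,\frac{A-1}{A}\,\frac{1}{\sigma_{\gamma^* N}}
\int d^2b \; T_A^2(b)
\int dM^2 \left.\frac{d\sigma^{D}_{\gamma^* N}}{dM^2\,dt}\right|_{t=0}
```

where T_A(b) is the nuclear thickness function; this is only the generic double-scattering form, and the model details (the diffractive cross section fitted to HERA data) fix the actual x, Q^2, and A dependence.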
Hard Diffraction at HERA and the Gluonic Content of the Pomeron
We show that the previously introduced CKMT model, based on conventional
Regge theory, gives a good description of the HERA data on the structure
function F_2^D for large rapidity gap (diffractive) events. These data allow one
not only to determine the valence and sea quark content of the Pomeron, but
also, through their Q^2 dependence, to obtain information on its gluonic content.
Using DGLAP evolution, we find that the gluon distribution in the Pomeron is
very hard and the gluons carry more momentum than the quarks. This indicates
that the Pomeron, unlike ordinary hadrons, is a mostly gluonic object. With our
definition of the Pomeron flux factor the total momentum carried by quarks and
gluons turns out to be 0.3-0.4, strongly violating the momentum sum rule.
Comment: C-Shell archive of a PostScript file containing a 20-page paper with
text and 12 figures
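In symbols, the quoted result states that, with the paper's choice of Pomeron flux factor, the total momentum fraction carried by the Pomeron's partons falls well short of the sum-rule value of one (writing the singlet quark distribution as \Sigma and the gluon as g):

```latex
\int_0^1 dx\; x\left[\Sigma^{I\!P}(x,Q^2) + g^{I\!P}(x,Q^2)\right]
\;\approx\; 0.3\text{--}0.4 \;\neq\; 1
```

Because this integral depends on how the Pomeron flux factor is normalized, the violation is a statement about the chosen convention as much as about the Pomeron's parton content.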
Deep Metric Learning Meets Deep Clustering: A Novel Unsupervised Approach for Feature Embedding
Unsupervised Deep Distance Metric Learning (UDML) aims to learn sample
similarities in the embedding space from an unlabeled dataset. Traditional UDML
methods usually use the triplet loss or pairwise loss which requires the mining
of positive and negative samples w.r.t. anchor data points. This is, however,
challenging in an unsupervised setting as the label information is not
available. In this paper, we propose a new UDML method that overcomes that
challenge. In particular, we propose to use a deep clustering loss to learn
centroids, i.e., pseudo labels, that represent semantic classes. During
learning, these centroids are also used to reconstruct the input samples. This
ensures the representativeness of the centroids: each centroid represents
visually similar samples. Therefore, the centroids give information about
positive (visually similar) and negative (visually dissimilar) samples. Based
on pseudo labels, we propose a novel unsupervised metric loss which enforces
the positive concentration and negative separation of samples in the embedding
space. Experimental results on benchmarking datasets show that the proposed
approach outperforms other UDML methods.
Comment: Accepted in BMVC 202
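The core idea of using cluster centroids as pseudo labels for positive/negative mining can be sketched as follows. The tiny 2-D points, the fixed centroids, and the pairwise margin loss are illustrative assumptions; the paper's actual method uses a learned deep embedding, a clustering loss, and a reconstruction term.

```python
import math

def assign_pseudo_labels(points, centroids):
    """Give each point the index of its nearest centroid as a pseudo label."""
    def d2(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return [min(range(len(centroids)), key=lambda k: d2(p, centroids[k]))
            for p in points]

def metric_loss(points, labels, margin=1.0):
    """Toy contrastive-style metric loss over all pairs: pull together
    points that share a pseudo label, push the rest apart beyond `margin`."""
    loss, pairs = 0.0, 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            loss += d ** 2 if labels[i] == labels[j] else max(0.0, margin - d) ** 2
            pairs += 1
    return loss / pairs

# Two tight points near one centroid, one point near the other.
points = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0)]
labels = assign_pseudo_labels(points, centroids=[(0.0, 0.0), (3.0, 3.0)])
```

Here the pseudo labels stand in for the missing ground-truth classes: pairs with the same label act as positives and pairs with different labels as negatives, which is exactly the mining step that supervised triplet/pairwise losses take for granted.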