Get Another Label? Improving Data Quality and Data Mining
This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity and show several main results: (i) Repeated-labeling can improve label and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research
Repeated Labeling Using Multiple Noisy Labelers
This paper addresses the repeated acquisition of labels for data items
when the labeling is imperfect. We examine the improvement (or lack
thereof) in data quality via repeated labeling, and focus especially on
the improvement of training labels for supervised induction. With the
outsourcing of small tasks becoming easier, for example via Amazon's
Mechanical Turk, it often is possible to obtain less-than-expert
labeling at low cost. With low-cost labeling, preparing the unlabeled
part of the data can become considerably more expensive than labeling.
We present repeated-labeling strategies of increasing complexity, and
show several main results. (i) Repeated-labeling can improve label
quality and model quality, but not always. (ii) When labels are noisy,
repeated labeling can be preferable to single labeling even in the
traditional setting where labels are not particularly cheap. (iii) As
soon as the cost of processing the unlabeled data is not free, even the
simple strategy of labeling everything multiple times can give
considerable advantage. (iv) Repeatedly labeling a carefully chosen set
of points is generally preferable, and we present a set of robust
techniques that combine different notions of uncertainty to select data
points for which quality should be improved. The bottom line: the
results show clearly that when labeling is not perfect, selective
acquisition of multiple labels is a strategy that data miners should
have in their repertoire. For certain label-quality/cost regimes, the
benefit is substantial.
This work was supported by the National Science Foundation under Grant
No. IIS-0643846, by an NSERC Postdoctoral Fellowship, and by an NEC
Faculty Fellowship.
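The integration step at the heart of repeated labeling can be illustrated with a small majority-vote simulation. The labeler-accuracy setup below is a hypothetical sketch for illustration, not the paper's experimental code:

```python
import random
from collections import Counter

def noisy_label(true_label, accuracy, rng):
    """Return the true binary label with probability `accuracy`, else flip it."""
    return true_label if rng.random() < accuracy else 1 - true_label

def majority_label(true_label, n_labelers, accuracy, rng):
    """Integrate n independent noisy labels for one item by majority vote."""
    votes = Counter(noisy_label(true_label, accuracy, rng) for _ in range(n_labelers))
    return votes.most_common(1)[0][0]

rng = random.Random(0)
items = [rng.randint(0, 1) for _ in range(10_000)]

# Compare a single 70%-accurate labeler against a majority vote of five.
single = sum(noisy_label(y, 0.7, rng) == y for y in items) / len(items)
repeated = sum(majority_label(y, 5, 0.7, rng) == y for y in items) / len(items)
print(f"single-label accuracy: {single:.3f}")
print(f"5-label majority vote: {repeated:.3f}")
```

With five 70%-accurate labelers, the binomial majority is correct with probability about 0.84, which matches result (i): repeated labeling improves label quality here, though the gain shrinks as individual accuracy approaches 0.5 or 1.0.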
PetroPlot: A plotting and data management tool set for Microsoft Excel
PetroPlot is a 4000-line software code written in Visual Basic for the spreadsheet program Excel that automates plotting and data management tasks for large amounts of data. The major plotting functions include: automation of large numbers of multiseries XY plots; normalized diagrams (e.g., spider diagrams); replotting of any complex formatted diagram with multiple series for any other axis parameters; addition of customized labels for individual data points; and labeling of flexible log-scale axes. Other functions include: assignment of groups for samples based on multiple customized criteria; removal of nonnumeric values; calculation of averages/standard deviations; calculation of correlation matrices; deletion of nonconsecutive rows; and compilation of multiple rows of data for a single sample into single rows appropriate for plotting. A cubic spline function permits curve fitting to complex time series, and comparison of data to the fits. For users of Excel, PetroPlot increases the efficiency of data manipulation and visualization by orders of magnitude and allows exploration of large data sets that would not be possible making plots individually. The source code is open to all users.
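The normalization behind a spider diagram is a simple per-element ratio against a reference composition. PetroPlot itself is VBA for Excel, but the operation can be sketched in Python; the element list and reference values below are illustrative stand-ins, not PetroPlot built-ins:

```python
# Spider-diagram normalization sketch: divide each sample concentration by a
# reference value, element by element (reference values here are illustrative).
reference = {"La": 0.237, "Ce": 0.613, "Nd": 0.457, "Sm": 0.148}

def normalize(sample, reference):
    """Return sample/reference ratios in the reference's element order."""
    return {el: sample[el] / ref for el, ref in reference.items()}

basalt = {"La": 6.5, "Ce": 15.0, "Nd": 9.0, "Sm": 2.6}
print(normalize(basalt, reference))
```

Plotting the resulting ratios on a log scale against the fixed element order reproduces the familiar spider-diagram shape.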
On Compatible Matchings
A matching is compatible to two or more labeled point sets of size n with
labels {1, ..., n} if its straight-line drawing on each of these point sets
is crossing-free. We study the maximum number of edges in a matching compatible
to two or more labeled point sets in general position in the plane. We show
that for any two labeled convex sets of n points there exists a compatible
matching with floor(sqrt(2n)) edges. More generally, for any l labeled point
sets we construct compatible matchings of size Omega(n^{1/l}). As a
corresponding upper bound, we use probabilistic arguments to show that for
any l given sets of n points there exists a labeling of each set such that
the largest compatible matching has O(n^{2/(l+1)}) edges. Finally, we show
that Theta(log n) copies of any set of n points are necessary and sufficient
for the existence of a labeling such that any compatible matching consists
only of a single edge.
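Checking whether a given matching is compatible with one labeled point set reduces to pairwise segment-crossing tests on its straight-line drawing. A minimal sketch, assuming points in general position:

```python
def orient(p, q, r):
    """Sign of the cross product (q-p) x (r-p): >0 left turn, <0 right turn."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def segments_cross(a, b, c, d):
    """Proper intersection test for segments ab and cd (general position)."""
    return (orient(a, b, c) * orient(a, b, d) < 0 and
            orient(c, d, a) * orient(c, d, b) < 0)

def is_compatible(matching, points):
    """True if drawing `matching` (pairs of labels) on `points` is crossing-free."""
    segs = [(points[i], points[j]) for i, j in matching]
    return all(not segments_cross(*segs[x], *segs[y])
               for x in range(len(segs)) for y in range(x + 1, len(segs)))

pts = {1: (0, 0), 2: (2, 0), 3: (0, 2), 4: (2, 2)}
print(is_compatible([(1, 2), (3, 4)], pts))  # two horizontal edges: True
print(is_compatible([(1, 4), (2, 3)], pts))  # the two diagonals cross: False
```

A matching compatible to several labeled sets must pass this test on every set simultaneously, which is what makes the labeled multi-set question hard.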
Consistent labeling of rotating maps
Dynamic maps that allow continuous map rotations, for example on mobile
devices, encounter geometric labeling issues that do not arise in static maps. We study the
following dynamic map labeling problem: The input is an abstract map consisting of a set P
of points in the plane with attached horizontally aligned rectangular labels. While the map
with the point set P is rotated, all labels remain horizontally aligned. We are interested in
a consistent labeling of P under rotation, i.e., an assignment of a single (possibly empty)
active interval of angles for each label that determines its visibility under rotations such
that visible labels neither intersect each other (soft conflicts) nor occlude points in P at any
rotation angle (hard conflicts). Our goal is to find a consistent labeling that maximizes the
number of visible labels integrated over all rotation angles.
We first introduce a general model for labeling rotating maps and derive basic geometric
properties of consistent solutions. We show NP-hardness of the above optimization
problem even for unit-square labels. We then present a constant-factor approximation for
this problem based on line stabbing, and refine it further into an efficient polynomial-time
approximation scheme (EPTAS).
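The two conflict types can be tested per rotation angle by rotating the anchor points while keeping each label box axis-aligned at its anchor. A simplified sketch (labels anchored at a corner of their point, an illustrative assumption rather than the paper's model):

```python
import math

def rotate(p, theta):
    """Rotate point p about the origin by angle theta (radians)."""
    x, y = p
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

def label_box(anchor, w, h):
    """Axis-aligned label box with its lower-left corner at the anchor."""
    x, y = anchor
    return (x, y, x + w, y + h)

def boxes_overlap(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

def point_in_box(p, b):
    x, y = p
    x1, y1, x2, y2 = b
    return x1 < x < x2 and y1 < y < y2

def conflicts(points, w, h, theta):
    """Soft conflicts (label-label) and hard conflicts (label-point) at angle theta."""
    rotated = [rotate(p, theta) for p in points]
    boxes = [label_box(p, w, h) for p in rotated]
    soft = [(i, j) for i in range(len(boxes)) for j in range(i + 1, len(boxes))
            if boxes_overlap(boxes[i], boxes[j])]
    hard = [(i, j) for i, b in enumerate(boxes) for j, p in enumerate(rotated)
            if i != j and point_in_box(p, b)]
    return soft, hard

pts = [(0.0, 0.0), (1.5, 0.0)]
print(conflicts(pts, 1.0, 0.5, 0.0))  # narrow labels: no conflicts
print(conflicts(pts, 2.0, 0.5, 0.0))  # wider labels overlap: one soft conflict
```

A consistent labeling then assigns each label a single interval of angles on which it is visible, so the hard work is aggregating such per-angle checks into conflict intervals over the full rotation.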
An Algorithmic Framework for Labeling Road Maps
Given an unlabeled road map, we consider, from an algorithmic perspective,
the cartographic problem to place non-overlapping road labels embedded in their
roads. We first decompose the road network into logically coherent road
sections, e.g., parts of roads between two junctions. Based on this
decomposition, we present and implement a new and versatile framework for
placing labels in road maps such that the number of labeled road sections is
maximized. In an experimental evaluation with road maps of 11 major cities we
show that our proposed labeling algorithm is fast in practice and reaches
near-optimal solution quality, where optimal solutions are obtained by
mixed-integer linear programming. In comparison to the standard OpenStreetMap
renderer Mapnik, our algorithm labels 31% more road sections on average.
Comment: extended version of a paper to appear at GIScience 201
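The decomposition step can be sketched as splitting each road polyline at shared (junction) nodes. This toy version, which uses only node incidence counts, is an illustrative simplification, not the paper's decomposition:

```python
from collections import defaultdict

def road_sections(roads):
    """Split each road (a node sequence) at junction nodes, i.e. nodes
    touched by more than one road polyline (or revisited by the same one)."""
    degree = defaultdict(int)
    for road in roads:
        for node in road:
            degree[node] += 1  # how many polyline visits touch this node
    sections = []
    for road in roads:
        current = [road[0]]
        for node in road[1:]:
            current.append(node)
            if degree[node] > 1 and node != road[-1]:
                sections.append(current)  # cut at an interior junction
                current = [node]
        sections.append(current)
    return sections

# Two roads crossing at node "X": each splits into two logically coherent sections.
roads = [["A", "X", "B"], ["C", "X", "D"]]
print(road_sections(roads))
```

Each resulting section is then a candidate to receive one embedded label, and the optimization maximizes how many sections get one.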
Multiple Description Vector Quantization with Lattice Codebooks: Design and Analysis
The problem of designing a multiple description vector quantizer with lattice
codebook Lambda is considered. A general solution is given to a labeling
problem which plays a crucial role in the design of such quantizers. Numerical
performance results are obtained for quantizers based on the lattices A_2 and
Z^i, i=1,2,4,8, that make use of this labeling algorithm. The high-rate
squared-error distortions for this family of L-dimensional vector quantizers
are then analyzed for a memoryless source with probability density function p
and differential entropy h(p) < infty. For any a in (0,1) and rate pair (R,R),
it is shown that the two-channel distortion d_0 and the channel 1 (or channel
2) distortions d_s satisfy lim_{R -> infty} d_0 2^{2R(1+a)} = (1/4) G(Lambda)
2^{2h(p)} and lim_{R -> infty} d_s 2^{2R(1-a)} = G(S_L) 2^{2h(p)}, where
G(Lambda) is the normalized second moment of a Voronoi cell of the lattice
Lambda and G(S_L) is the normalized second moment of a sphere in L dimensions.
Comment: 46 pages, 14 figures
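The constant G(Lambda) is the normalized second moment of the lattice's Voronoi cell. For the one-dimensional lattice Z the cell is the interval [-1/2, 1/2] with volume 1, so G reduces to the integral of x^2 over the cell, which is 1/12. A quick numerical check (illustrative only, not from the paper):

```python
# Normalized second moment G(Lambda) = (1 / V^{1 + 2/L}) * integral of ||x||^2
# over the Voronoi cell. For Z (L = 1) the cell is [-1/2, 1/2] with V = 1,
# so G is just the integral of x^2, approximated here by the midpoint rule.
N = 1_000_000
step = 1.0 / N
second_moment = sum(((i + 0.5) * step - 0.5) ** 2 for i in range(N)) * step
print(second_moment)  # ~ 0.0833333 = 1/12
```

The same definition gives G(S_L) for a sphere, and the gap between G(Lambda) and G(S_L) is what the high-rate distortion expressions above quantify.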