Get Another Label? Improving Data Quality and Data Mining
This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity and show several main results: (i) Repeated-labeling can improve label and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research
Repeated Labeling Using Multiple Noisy Labelers
This paper addresses the repeated acquisition of labels for data items
when the labeling is imperfect. We examine the improvement (or lack
thereof) in data quality via repeated labeling, and focus especially on
the improvement of training labels for supervised induction. With the
outsourcing of small tasks becoming easier, for example via Amazon's
Mechanical Turk, it often is possible to obtain less-than-expert
labeling at low cost. With low-cost labeling, preparing the unlabeled
part of the data can become considerably more expensive than labeling.
We present repeated-labeling strategies of increasing complexity, and
show several main results. (i) Repeated-labeling can improve label
quality and model quality, but not always. (ii) When labels are noisy,
repeated labeling can be preferable to single labeling even in the
traditional setting where labels are not particularly cheap. (iii) As
soon as the cost of processing the unlabeled data is not free, even the
simple strategy of labeling everything multiple times can give
considerable advantage. (iv) Repeatedly labeling a carefully chosen set
of points is generally preferable, and we present a set of robust
techniques that combine different notions of uncertainty to select data
points for which quality should be improved. The bottom line: the
results show clearly that when labeling is not perfect, selective
acquisition of multiple labels is a strategy that data miners should
have in their repertoire. For certain label-quality/cost regimes, the
benefit is substantial.
This work was supported by the National Science Foundation under Grant
No. IIS-0643846, by an NSERC Postdoctoral Fellowship, and by an NEC
Faculty Fellowship.
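The integration step at the heart of repeated labeling can be illustrated with a small majority-vote simulation. The labeler-accuracy setup below is a hypothetical sketch for illustration, not the paper's experimental code:

```python
import random
from collections import Counter

def noisy_label(true_label, accuracy, rng):
    """Return the true binary label with probability `accuracy`, else flip it."""
    return true_label if rng.random() < accuracy else 1 - true_label

def majority_label(true_label, n_labelers, accuracy, rng):
    """Integrate n independent noisy labels for one item by majority vote."""
    votes = Counter(noisy_label(true_label, accuracy, rng) for _ in range(n_labelers))
    return votes.most_common(1)[0][0]

rng = random.Random(0)
items = [rng.randint(0, 1) for _ in range(10_000)]

# Compare a single 70%-accurate labeler against a majority vote of five.
single = sum(noisy_label(y, 0.7, rng) == y for y in items) / len(items)
repeated = sum(majority_label(y, 5, 0.7, rng) == y for y in items) / len(items)
print(f"single-label accuracy: {single:.3f}")
print(f"5-label majority vote: {repeated:.3f}")
```

With five 70%-accurate labelers, the binomial majority is correct with probability about 0.84, which matches result (i): repeated labeling improves label quality here, though the gain shrinks as individual accuracy approaches 0.5 or 1.0.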
PetroPlot: A plotting and data management tool set for Microsoft Excel
PetroPlot is a 4000-line software code written in Visual Basic for the spreadsheet program Excel that automates plotting and data management tasks for large amounts of data. The major plotting functions include: automation of large numbers of multiseries XY plots; normalized diagrams (e.g., spider diagrams); replotting of any complex formatted diagram with multiple series for any other axis parameters; addition of customized labels for individual data points; and labeling of flexible log-scale axes. Other functions include: assignment of groups for samples based on multiple customized criteria; removal of nonnumeric values; calculation of averages/standard deviations; calculation of correlation matrices; deletion of nonconsecutive rows; and compilation of multiple rows of data for a single sample into single rows appropriate for plotting. A cubic spline function permits curve fitting to complex time series, and comparison of data to the fits. For users of Excel, PetroPlot increases the efficiency of data manipulation and visualization by orders of magnitude and allows exploration of large data sets that would not be possible making plots individually. The source code is open to all users.
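The normalization behind a spider diagram is a simple per-element ratio against a reference composition. PetroPlot itself is VBA for Excel, but the operation can be sketched in Python; the element list and reference values below are illustrative stand-ins, not PetroPlot built-ins:

```python
# Spider-diagram normalization sketch: divide each sample concentration by a
# reference value, element by element (reference values here are illustrative).
reference = {"La": 0.237, "Ce": 0.613, "Nd": 0.457, "Sm": 0.148}

def normalize(sample, reference):
    """Return sample/reference ratios in the reference's element order."""
    return {el: sample[el] / ref for el, ref in reference.items()}

basalt = {"La": 6.5, "Ce": 15.0, "Nd": 9.0, "Sm": 2.6}
print(normalize(basalt, reference))
```

Plotting the resulting ratios on a log scale against the fixed element order reproduces the familiar spider-diagram shape.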
On Compatible Matchings
A matching is compatible to two or more labeled point sets of size n with
labels {1, ..., n} if its straight-line drawing on each of these point sets
is crossing-free. We study the maximum number of edges in a matching compatible
to two or more labeled point sets in general position in the plane. We show
that for any two labeled convex sets of n points there exists a compatible
matching with floor(sqrt(2n)) edges. More generally, for any l labeled point
sets we construct compatible matchings of size Omega(n^{1/l}). As a
corresponding upper bound, we use probabilistic arguments to show that for
any l given sets of n points there exists a labeling of each set such that
the largest compatible matching has O(n^{2/(l+1)}) edges. Finally, we show
that Theta(log n) copies of any set of n points are necessary and sufficient
for the existence of a labeling such that any compatible matching consists
only of a single edge.
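Checking whether a given matching is compatible with one labeled point set reduces to pairwise segment-crossing tests on its straight-line drawing. A minimal sketch, assuming points in general position:

```python
def orient(p, q, r):
    """Sign of the cross product (q-p) x (r-p): >0 left turn, <0 right turn."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def segments_cross(a, b, c, d):
    """Proper intersection test for segments ab and cd (general position)."""
    return (orient(a, b, c) * orient(a, b, d) < 0 and
            orient(c, d, a) * orient(c, d, b) < 0)

def is_compatible(matching, points):
    """True if drawing `matching` (pairs of labels) on `points` is crossing-free."""
    segs = [(points[i], points[j]) for i, j in matching]
    return all(not segments_cross(*segs[x], *segs[y])
               for x in range(len(segs)) for y in range(x + 1, len(segs)))

pts = {1: (0, 0), 2: (2, 0), 3: (0, 2), 4: (2, 2)}
print(is_compatible([(1, 2), (3, 4)], pts))  # two horizontal edges: True
print(is_compatible([(1, 4), (2, 3)], pts))  # the two diagonals cross: False
```

A matching compatible to several labeled sets must pass this test on every set simultaneously, which is what makes the labeled multi-set question hard.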
Consistent labeling of rotating maps
Dynamic maps that allow continuous map rotations, for example on mobile
devices, encounter geometric labeling issues that do not arise in static maps. We study the
following dynamic map labeling problem: The input is an abstract map consisting of a set P
of points in the plane with attached horizontally aligned rectangular labels. While the map
with the point set P is rotated, all labels remain horizontally aligned. We are interested in
a consistent labeling of P under rotation, i.e., an assignment of a single (possibly empty)
active interval of angles for each label that determines its visibility under rotations such
that visible labels neither intersect each other (soft conflicts) nor occlude points in P at any
rotation angle (hard conflicts). Our goal is to find a consistent labeling that maximizes the
number of visible labels integrated over all rotation angles.
We first introduce a general model for labeling rotating maps and derive basic geometric
properties of consistent solutions. We show NP-hardness of the above optimization
problem even for unit-square labels. We then present a constant-factor approximation for
this problem based on line stabbing, and refine it further into an efficient polynomial-time
approximation scheme (EPTAS).
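The two conflict types can be tested per rotation angle by rotating the anchor points while keeping each label box axis-aligned at its anchor. A simplified sketch (labels anchored at a corner of their point, an illustrative assumption rather than the paper's model):

```python
import math

def rotate(p, theta):
    """Rotate point p about the origin by angle theta (radians)."""
    x, y = p
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

def label_box(anchor, w, h):
    """Axis-aligned label box with its lower-left corner at the anchor."""
    x, y = anchor
    return (x, y, x + w, y + h)

def boxes_overlap(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

def point_in_box(p, b):
    x, y = p
    x1, y1, x2, y2 = b
    return x1 < x < x2 and y1 < y < y2

def conflicts(points, w, h, theta):
    """Soft conflicts (label-label) and hard conflicts (label-point) at angle theta."""
    rotated = [rotate(p, theta) for p in points]
    boxes = [label_box(p, w, h) for p in rotated]
    soft = [(i, j) for i in range(len(boxes)) for j in range(i + 1, len(boxes))
            if boxes_overlap(boxes[i], boxes[j])]
    hard = [(i, j) for i, b in enumerate(boxes) for j, p in enumerate(rotated)
            if i != j and point_in_box(p, b)]
    return soft, hard

pts = [(0.0, 0.0), (1.5, 0.0)]
print(conflicts(pts, 1.0, 0.5, 0.0))  # narrow labels: no conflicts
print(conflicts(pts, 2.0, 0.5, 0.0))  # wider labels overlap: one soft conflict
```

A consistent labeling then assigns each label a single interval of angles on which it is visible, so the hard work is aggregating such per-angle checks into conflict intervals over the full rotation.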
An Algorithmic Framework for Labeling Road Maps
Given an unlabeled road map, we consider, from an algorithmic perspective,
the cartographic problem to place non-overlapping road labels embedded in their
roads. We first decompose the road network into logically coherent road
sections, e.g., parts of roads between two junctions. Based on this
decomposition, we present and implement a new and versatile framework for
placing labels in road maps such that the number of labeled road sections is
maximized. In an experimental evaluation with road maps of 11 major cities we
show that our proposed labeling algorithm is fast in practice and reaches
near-optimal solution quality, where optimal solutions are obtained by
mixed-integer linear programming. In comparison to the standard OpenStreetMap
renderer Mapnik, our algorithm labels 31% more road sections on average.
Comment: extended version of a paper to appear at GIScience 201
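The decomposition step can be sketched as splitting each road polyline at shared (junction) nodes. This toy version, which uses only node incidence counts, is an illustrative simplification, not the paper's decomposition:

```python
from collections import defaultdict

def road_sections(roads):
    """Split each road (a node sequence) at junction nodes, i.e. nodes
    touched by more than one road polyline (or revisited by the same one)."""
    degree = defaultdict(int)
    for road in roads:
        for node in road:
            degree[node] += 1  # how many polyline visits touch this node
    sections = []
    for road in roads:
        current = [road[0]]
        for node in road[1:]:
            current.append(node)
            if degree[node] > 1 and node != road[-1]:
                sections.append(current)  # cut at an interior junction
                current = [node]
        sections.append(current)
    return sections

# Two roads crossing at node "X": each splits into two logically coherent sections.
roads = [["A", "X", "B"], ["C", "X", "D"]]
print(road_sections(roads))
```

Each resulting section is then a candidate to receive one embedded label, and the optimization maximizes how many sections get one.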
Multiple Description Vector Quantization with Lattice Codebooks: Design and Analysis
The problem of designing a multiple description vector quantizer with lattice
codebook Lambda is considered. A general solution is given to a labeling
problem which plays a crucial role in the design of such quantizers. Numerical
performance results are obtained for quantizers based on the lattices A_2 and
Z^i, i=1,2,4,8, that make use of this labeling algorithm. The high-rate
squared-error distortions for this family of L-dimensional vector quantizers
are then analyzed for a memoryless source with probability density function p
and differential entropy h(p) < infty. For any a in (0,1) and rate pair (R,R),
it is shown that the two-channel distortion d_0 and the channel 1 (or channel
2) distortions d_s satisfy lim_{R -> infty} d_0 2^{2R(1+a)} = (1/4) G(Lambda)
2^{2h(p)} and lim_{R -> infty} d_s 2^{2R(1-a)} = G(S_L) 2^{2h(p)}, where
G(Lambda) is the normalized second moment of a Voronoi cell of the lattice
Lambda and G(S_L) is the normalized second moment of a sphere in L dimensions.
Comment: 46 pages, 14 figures
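The constant G(Lambda) is the normalized second moment of the lattice's Voronoi cell. For the one-dimensional lattice Z the cell is the interval [-1/2, 1/2] with volume 1, so G reduces to the integral of x^2 over the cell, which is 1/12. A quick numerical check (illustrative only, not from the paper):

```python
# Normalized second moment G(Lambda) = (1 / V^{1 + 2/L}) * integral of ||x||^2
# over the Voronoi cell. For Z (L = 1) the cell is [-1/2, 1/2] with V = 1,
# so G is just the integral of x^2, approximated here by the midpoint rule.
N = 1_000_000
step = 1.0 / N
second_moment = sum(((i + 0.5) * step - 0.5) ** 2 for i in range(N)) * step
print(second_moment)  # ~ 0.0833333 = 1/12
```

The same definition gives G(S_L) for a sphere, and the gap between G(Lambda) and G(S_L) is what the high-rate distortion expressions above quantify.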