    Get Another Label? Improving Data Quality and Data Mining

    This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity and show several main results: (i) Repeated-labeling can improve label and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial. NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research

    Repeated Labeling Using Multiple Noisy Labelers

    This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated-labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a set of robust techniques that combine different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire. For certain label-quality/cost regimes, the benefit is substantial. This work was supported by the National Science Foundation under Grant No. IIS-0643846, by an NSERC Postdoctoral Fellowship, and by an NEC Faculty Fellowship.
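
    Result (i) above can be illustrated with a minimal sketch. It assumes majority voting over an odd number of independent labelers who all share the same accuracy p; the abstract does not state these assumptions, and the function name is hypothetical. Under them, the chance that the majority label is correct is a binomial tail sum, which rises with more labels when p > 0.5 and falls when p < 0.5.

```python
from math import comb


def majority_vote_quality(p: float, n: int) -> float:
    """Probability that the majority of n independent labels is correct.

    Assumes every labeler is correct with the same probability p and that
    n is odd, so ties cannot occur (illustrative assumptions only).
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of labels to avoid ties")
    # Sum the binomial probabilities of k correct labels for every k > n/2.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    for p in (0.4, 0.5, 0.7):
        row = [round(majority_vote_quality(p, n), 3) for n in (1, 3, 5, 11)]
        print(f"p={p}: quality for n=1,3,5,11 -> {row}")
```

    With p = 0.7, three labels raise the quality from 0.7 to 0.784; with p = 0.4, three labels lower it to 0.352. This matches the "but not always" caveat.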

    On Compatible Matchings

    A matching is compatible to two or more labeled point sets of size $n$ with labels $\{1,\dots,n\}$ if its straight-line drawing on each of these point sets is crossing-free. We study the maximum number of edges in a matching compatible to two or more labeled point sets in general position in the plane. We show that for any two labeled convex sets of $n$ points there exists a compatible matching with $\lfloor \sqrt{2n} \rfloor$ edges. More generally, for any $\ell$ labeled point sets we construct compatible matchings of size $\Omega(n^{1/\ell})$. As a corresponding upper bound, we use probabilistic arguments to show that for any $\ell$ given sets of $n$ points there exists a labeling of each set such that the largest compatible matching has $\mathcal{O}(n^{2/(\ell+1)})$ edges. Finally, we show that $\Theta(\log n)$ copies of any set of $n$ points are necessary and sufficient for the existence of a labeling such that any compatible matching consists only of a single edge.

    Consistent labeling of rotating maps

    Dynamic maps that allow continuous map rotations, for example, on mobile devices, encounter new geometric labeling issues not seen in static maps. We study the following dynamic map labeling problem: The input is an abstract map consisting of a set P of points in the plane with attached horizontally aligned rectangular labels. While the map with the point set P is rotated, all labels remain horizontally aligned. We are interested in a consistent labeling of P under rotation, i.e., an assignment of a single (possibly empty) active interval of angles for each label that determines its visibility under rotations such that visible labels neither intersect each other (soft conflicts) nor occlude points in P at any rotation angle (hard conflicts). Our goal is to find a consistent labeling that maximizes the number of visible labels integrated over all rotation angles. We first introduce a general model for labeling rotating maps and derive basic geometric properties of consistent solutions. We show NP-hardness of the above optimization problem even for unit-square labels. We then present a constant-factor approximation for this problem based on line stabbing, and refine it further into an efficient polynomial-time approximation scheme (EPTAS).

    An Algorithmic Framework for Labeling Road Maps

    Given an unlabeled road map, we consider, from an algorithmic perspective, the cartographic problem of placing non-overlapping road labels embedded in their roads. We first decompose the road network into logically coherent road sections, e.g., parts of roads between two junctions. Based on this decomposition, we present and implement a new and versatile framework for placing labels in road maps such that the number of labeled road sections is maximized. In an experimental evaluation with road maps of 11 major cities we show that our proposed labeling algorithm is fast in practice and reaches near-optimal solution quality, where optimal solutions are obtained by mixed-integer linear programming. In comparison to the standard OpenStreetMap renderer Mapnik, our algorithm labels 31% more road sections on average. Comment: extended version of a paper to appear at GIScience 201

    Multiple Description Vector Quantization with Lattice Codebooks: Design and Analysis

    The problem of designing a multiple description vector quantizer with lattice codebook $\Lambda$ is considered. A general solution is given to a labeling problem which plays a crucial role in the design of such quantizers. Numerical performance results are obtained for quantizers based on the lattices $A_2$ and $\mathbb{Z}^i$, $i=1,2,4,8$, that make use of this labeling algorithm. The high-rate squared-error distortions for this family of $L$-dimensional vector quantizers are then analyzed for a memoryless source with probability density function $p$ and differential entropy $h(p) < \infty$. For any $a \in (0,1)$ and rate pair $(R,R)$, it is shown that the two-channel distortion $d_0$ and the channel 1 (or channel 2) distortions $d_s$ satisfy $\lim_{R \to \infty} d_0 \, 2^{2R(1+a)} = \frac{1}{4} G(\Lambda) \, 2^{2h(p)}$ and $\lim_{R \to \infty} d_s \, 2^{2R(1-a)} = G(S_L) \, 2^{2h(p)}$, where $G(\Lambda)$ is the normalized second moment of a Voronoi cell of the lattice $\Lambda$ and $G(S_L)$ is the normalized second moment of a sphere in $L$ dimensions. Comment: 46 pages, 14 figures
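
    As a worked consequence of the two limits stated in this abstract (a derivation sketched here, not quoted from the paper), multiplying them cancels the parameter $a$, since $2^{2R(1+a)} \cdot 2^{2R(1-a)} = 2^{4R}$:

$$
\lim_{R \to \infty} d_0 \, d_s \, 2^{4R}
  = \left( \tfrac{1}{4} \, G(\Lambda) \, 2^{2h(p)} \right) \left( G(S_L) \, 2^{2h(p)} \right)
  = \tfrac{1}{4} \, G(\Lambda) \, G(S_L) \, 2^{4h(p)} .
$$

    Read this way, $a$ only shifts distortion between the two-channel and single-channel cases: at high rate the product $d_0 d_s$ decays as $2^{-4R}$ whatever value of $a \in (0,1)$ is chosen.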