
    The Devil of Face Recognition is in the Noise

    The growing scale of face recognition datasets empowers us to train strong convolutional networks for face recognition. While a variety of architectures and loss functions have been devised, we still have a limited understanding of the source and consequence of label noise inherent in existing datasets. We make the following contributions: 1) We contribute cleaned subsets of popular face databases, i.e., the MegaFace and MS-Celeb-1M datasets, and build a new large-scale noise-controlled IMDb-Face dataset. 2) With the original datasets and cleaned subsets, we profile and analyze label noise properties of MegaFace and MS-Celeb-1M. We show that a few orders of magnitude more samples are needed to achieve the same accuracy yielded by a clean subset. 3) We study the association of different types of noise, i.e., label flips and outliers, with the accuracy of face recognition models. 4) We investigate ways to improve data cleanliness, including a comprehensive user study on the influence of data labeling strategies on annotation accuracy. The IMDb-Face dataset has been released on https://github.com/fwang91/IMDb-Face. Comment: accepted to ECCV'1

    Towards Cleaning-up Open Data Portals: A Metadata Reconciliation Approach

    This paper presents an approach for metadata reconciliation, curation and linking for Open Governmental Data Portals (ODPs). ODPs have lately become the standard solution for governments willing to make their public data available to society. Portal managers use several types of metadata to organize the datasets, one of the most important being tags. However, the tagging process is subject to many problems, such as synonyms, ambiguity or incoherence, among others. As our empirical analysis of ODPs shows, these issues are currently prevalent in most ODPs and effectively hinder the reuse of Open Data. In order to address these problems, we develop and implement an approach for tag reconciliation in Open Data Portals, encompassing local actions related to individual portals, and global actions for adding a semantic metadata layer above individual portals. The local part aims to enhance the quality of tags in a single portal, and the global part is meant to interlink ODPs by establishing relations between tags. Comment: 8 pages, 10 Figures - Under Revision for ICSC201

    CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

    Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there does not exist a rigorous study on how exactly cleaning affects ML: the ML community usually focuses on developing ML algorithms that are robust to some particular noise types of certain distributions, while the database (DB) community has mostly studied the problem of data cleaning alone, without considering how data is consumed by downstream ML analytics. We propose a CleanML study that systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both algorithms commonly used in practice and state-of-the-art solutions from the academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we also control the false discovery rate in our experiments using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a systematic way to derive many interesting and nontrivial observations. We also put forward multiple research directions for researchers. Comment: published in ICDE 202

    Level Playing Field for Million Scale Face Recognition

    Face recognition is perceived as a solved problem; however, when tested at the million scale, it exhibits dramatic variation in accuracy across different algorithms. Are the algorithms very different? Is access to good/big training data their secret weapon? Where should face recognition improve? To address those questions, we created a benchmark, MF2, that requires all algorithms to be trained on the same data and tested at the million scale. MF2 is a public large-scale set with 672K identities and 4.7M photos, created with the goal of leveling the playing field for large-scale face recognition. We contrast our results with findings from the other two large-scale benchmarks, MegaFace Challenge and MS-Celebs-1M, where groups were allowed to train on any private/public/big/small set. Some key discoveries: 1) algorithms trained on MF2 were able to achieve state-of-the-art results comparable to those of algorithms trained on massive private sets, 2) some outperformed themselves once trained on MF2, 3) invariance to aging suffers from low accuracies, as in MegaFace, identifying the need for larger age variations, possibly within identities, or adjustment of algorithms in future testing.

    A review of GIS-based information sharing systems

    GIS-based information sharing systems have been implemented in many of England and Wales' Crime and Disorder Reduction Partnerships (CDRPs). The information sharing role of these systems is seen as vital to help in the review of crime, disorder and misuse of drugs; to sustain strategic objectives; to monitor interventions and initiatives; and to support action plans for service delivery. This evaluation of these systems aimed to identify the lessons learned from existing systems, identify how these systems can best be used to support the business functions of CDRPs, identify common weaknesses across the systems, and produce guidelines on how these systems should be further developed. At present there are in excess of 20 major systems distributed across England and Wales. This evaluation considered a representative sample of ten systems. To date, the systems have collected little documented evidence demonstrating the direct impact they are having in reducing crime and disorder, and the misuse of drugs. All point to how they are contributing to more effective partnership working, but all systems must be encouraged to record how they are contributing to improving community safety. Demonstrating this impact will help them to assure their future role in their CDRPs. By reviewing the systems as a whole, several key ingredients were identified that evidently contribute to the effectiveness of these systems. These included the need for an effective partnership business model within which the system operates, and the generation of good quality multi-agency intelligence products from the system. In helping to determine the future development of GIS-based information sharing systems, four key community safety partnership business service functions have been identified that these systems can most effectively support. These functions support the performance review requirements of CDRPs, operate a problem solving scanning and analysis role, and offer an interface with the public. Following these business service functions as a template will provide for a more effective application of these systems nationally.