31,395 research outputs found
The Devil of Face Recognition is in the Noise
The growing scale of face recognition datasets empowers us to train strong
convolutional networks for face recognition. While a variety of architectures
and loss functions have been devised, we still have a limited understanding of
the source and consequence of label noise inherent in existing datasets. We
make the following contributions: 1) We contribute cleaned subsets of popular
face databases, i.e., MegaFace and MS-Celeb-1M datasets, and build a new
large-scale noise-controlled IMDb-Face dataset. 2) With the original datasets
and cleaned subsets, we profile and analyze label noise properties of MegaFace
and MS-Celeb-1M. We show that a few orders more samples are needed to achieve
the same accuracy yielded by a clean subset. 3) We study the association
between different types of noise, i.e., label flips and outliers, with the
accuracy of face recognition models. 4) We investigate ways to improve data
cleanliness, including a comprehensive user study on the influence of data
labeling strategies to annotation accuracy. The IMDb-Face dataset has been
released on https://github.com/fwang91/IMDb-Face.Comment: accepted to ECCV'1
Towards Cleaning-up Open Data Portals: A Metadata Reconciliation Approach
This paper presents an approach for metadata reconciliation, curation and
linking for Open Governamental Data Portals (ODPs). ODPs have been lately the
standard solution for governments willing to put their public data available
for the society. Portal managers use several types of metadata to organize the
datasets, one of the most important ones being the tags. However, the tagging
process is subject to many problems, such as synonyms, ambiguity or
incoherence, among others. As our empiric analysis of ODPs shows, these issues
are currently prevalent in most ODPs and effectively hinders the reuse of Open
Data. In order to address these problems, we develop and implement an approach
for tag reconciliation in Open Data Portals, encompassing local actions related
to individual portals, and global actions for adding a semantic metadata layer
above individual portals. The local part aims to enhance the quality of tags in
a single portal, and the global part is meant to interlink ODPs by establishing
relations between tags.Comment: 8 pages,10 Figures - Under Revision for ICSC201
CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks
Data quality affects machine learning (ML) model performances, and data
scientists spend considerable amount of time on data cleaning before model
training. However, to date, there does not exist a rigorous study on how
exactly cleaning affects ML -- ML community usually focuses on developing ML
algorithms that are robust to some particular noise types of certain
distributions, while database (DB) community has been mostly studying the
problem of data cleaning alone without considering how data is consumed by
downstream ML analytics. We propose a CleanML study that systematically
investigates the impact of data cleaning on ML classification tasks. The
open-source and extensible CleanML study currently includes 14 real-world
datasets with real errors, five common error types, seven different ML models,
and multiple cleaning algorithms for each error type (including both commonly
used algorithms in practice as well as state-of-the-art solutions in academic
literature). We control the randomness in ML experiments using statistical
hypothesis testing, and we also control false discovery rate in our experiments
using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a
systematic way to derive many interesting and nontrivial observations. We also
put forward multiple research directions for researchers.Comment: published in ICDE 202
Level Playing Field for Million Scale Face Recognition
Face recognition has the perception of a solved problem, however when tested
at the million-scale exhibits dramatic variation in accuracies across the
different algorithms. Are the algorithms very different? Is access to good/big
training data their secret weapon? Where should face recognition improve? To
address those questions, we created a benchmark, MF2, that requires all
algorithms to be trained on same data, and tested at the million scale. MF2 is
a public large-scale set with 672K identities and 4.7M photos created with the
goal to level playing field for large scale face recognition. We contrast our
results with findings from the other two large-scale benchmarks MegaFace
Challenge and MS-Celebs-1M where groups were allowed to train on any
private/public/big/small set. Some key discoveries: 1) algorithms, trained on
MF2, were able to achieve state of the art and comparable results to algorithms
trained on massive private sets, 2) some outperformed themselves once trained
on MF2, 3) invariance to aging suffers from low accuracies as in MegaFace,
identifying the need for larger age variations possibly within identities or
adjustment of algorithms in future testings
A review of GIS-based information sharing systems
GIS-based information sharing systems have been implemented in many of England and Wales' Crime and Disorder Reduction Partnerships (CDRPs). The information sharing role of these systems is seen as being vital to help in the review of crime, disorder and misuse of drugs; to sustain strategic objectives, to monitor interventions and initiatives; and support action plans for service delivery. This evaluation into these systems aimed to identify the lessons learned from existing systems, identify how these systems can be best used to support the business functions of CDRPs, identify common weaknesses across the systems, and produce guidelines on how these systems should be further developed. At present there are in excess of 20 major systems distributed across England and Wales. This evaluation considered a representative sample of ten systems. To date, little documented evidence has been collected by the systems that demonstrate the direct impact they are having in reducing crime and disorder, and the misuse of drugs. All point to how they are contributing to more effective partnership working, but all systems must be encouraged to record how they are contributing to improving community safety. Demonstrating this impact will help them to assure their future role in their CDRPs. By reviewing the systems wholly, several key ingredients were identified that were evident in contributing to the effectiveness of these systems. These included the need for an effective partnership business model within which the system operates, and the generation of good quality multi-agency intelligence products from the system. In helping to determine the future development of GIS-based information sharing systems, four key community safety partnership business service functions have been identified that these systems can most effectively support. These functions support the performance review requirements of CDRPs, operate a problem solving scanning and analysis role, and offer an interface with the public. By following these business service functions as a template will provide for a more effective application of these systems nationally
- …