908 research outputs found
Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective
Data-centric AI is at the center of a fundamental shift in software
engineering where machine learning becomes the new software, powered by big
data and computing infrastructure. Here software engineering needs to be
re-thought where data becomes a first-class citizen on par with code. One
striking observation is that a significant portion of the machine learning
process is spent on data preparation. Without good data, even the best machine
learning algorithms cannot perform well. As a result, data-centric AI practices
are now becoming mainstream. Unfortunately, many datasets in the real world are
small, dirty, biased, and even poisoned. In this survey, we study the research
landscape for data collection and data quality primarily for deep learning
applications. Data collection is important because there is lesser need for
feature engineering for recent deep learning approaches, but instead more need
for large amounts of data. For data quality, we study data validation,
cleaning, and integration techniques. Even if the data cannot be fully cleaned,
we can still cope with imperfect data during model training using robust model
training techniques. In addition, while bias and fairness have been less
studied in traditional data management research, these issues become essential
topics in modern machine learning applications. We thus study fairness measures
and unfairness mitigation techniques that can be applied before, during, or
after model training. We believe that the data management community is well
poised to solve these problems
Identifying Semantic Divergences in Parallel Text without Annotations
Recognizing that even correct translations are not always semantically
equivalent, we automatically detect meaning divergences in parallel sentence
pairs with a deep neural model of bilingual semantic similarity which can be
trained for any parallel corpus without any manual annotation. We show that our
semantic model detects divergences more accurately than models based on surface
features derived from word alignments, and that these divergences matter for
neural machine translation.Comment: Accepted as a full paper to NAACL 201
Beyond data collection: Objectives and methods of research using VGI and geo-social media for disaster management
This paper investigates research using VGI and geo-social media in the disaster management
context. Relying on the method of systematic mapping, it develops a classification schema that
captures three levels of main category, focus, and intended use, and analyzes the relationships
with the employed data sources and analysis methods. It focuses the scope to the pioneering
field of disaster management, but the described approach and the developed classification
schema are easily adaptable to different application domains or future developments. The
results show that a hypothesized consolidation of research, characterized through the building
of canonical bodies of knowledge and advanced application cases with refined methodology,
has not yet happened. The majority of the studies investigate the challenges and potential
solutions of data handling, with fewer studies focusing on socio-technological issues or
advanced applications. This trend is currently showing no sign of change, highlighting that VGI
research is still very much technology-driven as opposed to theory- or application-driven. From
the results of the systematic mapping study, the authors formulate and discuss several
research objectives for future work, which could lead to a stronger, more theory-driven
treatment of the topic VGI in GIScience.Carlos Granell has been partly funded by the Ramón y Cajal Programme (grant number RYC-2014-16913
HAPI: A Large-scale Longitudinal Dataset of Commercial ML API Predictions
Commercial ML APIs offered by providers such as Google, Amazon and Microsoft
have dramatically simplified ML adoption in many applications. Numerous
companies and academics pay to use ML APIs for tasks such as object detection,
OCR and sentiment analysis. Different ML APIs tackling the same task can have
very heterogeneous performance. Moreover, the ML models underlying the APIs
also evolve over time. As ML APIs rapidly become a valuable marketplace and a
widespread way to consume machine learning, it is critical to systematically
study and compare different APIs with each other and to characterize how APIs
change over time. However, this topic is currently underexplored due to the
lack of data. In this paper, we present HAPI (History of APIs), a longitudinal
dataset of 1,761,417 instances of commercial ML API applications (involving
APIs from Amazon, Google, IBM, Microsoft and other providers) across diverse
tasks including image tagging, speech recognition and text mining from 2020 to
2022. Each instance consists of a query input for an API (e.g., an image or
text) along with the API's output prediction/annotation and confidence scores.
HAPI is the first large-scale dataset of ML API usages and is a unique resource
for studying ML-as-a-service (MLaaS). As examples of the types of analyses that
HAPI enables, we show that ML APIs' performance change substantially over
time--several APIs' accuracies dropped on specific benchmark datasets. Even
when the API's aggregate performance stays steady, its error modes can shift
across different subtypes of data between 2020 and 2022. Such changes can
substantially impact the entire analytics pipelines that use some ML API as a
component. We further use HAPI to study commercial APIs' performance
disparities across demographic subgroups over time. HAPI can stimulate more
research in the growing field of MLaaS.Comment: Preprint, to appear in NeurIPS 202
Recommended from our members
Robust Algorithms for Clustering with Applications to Data Integration
A growing number of data-based applications are used for decision-making that have far-reaching consequences and significant societal impact. Entity resolution, community detection and taxonomy construction are some of the building blocks of these applications and for these methods, clustering is the fundamental underlying concept. Therefore, the use of accurate, robust and scalable methods for clustering cannot be overstated. We tackle the various facets of clustering with a multi-pronged approach described below.
1. While identification of clusters that refer to different entities is challenging for automated strategies, it is relatively easy for humans. We study the robustness of clustering methods that leverage supervision through an oracle i.e an abstraction of crowdsourcing. Additionally, we focus on scalability to handle web-scale datasets.
2. In community detection applications, a common setback in evaluation of the quality of clustering techniques is the lack of ground truth data. We propose a generative model that considers dependent edge formation and devise techniques for efficient cluster recovery
- …