HAPI: A Large-scale Longitudinal Dataset of Commercial ML API Predictions
Commercial ML APIs offered by providers such as Google, Amazon and Microsoft
have dramatically simplified ML adoption in many applications. Numerous
companies and academics pay to use ML APIs for tasks such as object detection,
OCR and sentiment analysis. Different ML APIs tackling the same task can have
very heterogeneous performance. Moreover, the ML models underlying the APIs
also evolve over time. As ML APIs rapidly become a valuable marketplace and a
widespread way to consume machine learning, it is critical to systematically
study and compare different APIs with each other and to characterize how APIs
change over time. However, this topic is currently underexplored due to the
lack of data. In this paper, we present HAPI (History of APIs), a longitudinal
dataset of 1,761,417 instances of commercial ML API applications (involving
APIs from Amazon, Google, IBM, Microsoft and other providers) across diverse
tasks including image tagging, speech recognition and text mining from 2020 to
2022. Each instance consists of a query input for an API (e.g., an image or
text) along with the API's output prediction/annotation and confidence scores.
HAPI is the first large-scale dataset of ML API usages and is a unique resource
for studying ML-as-a-service (MLaaS). As examples of the types of analyses that
HAPI enables, we show that ML APIs' performance changes substantially over
time--several APIs' accuracies dropped on specific benchmark datasets. Even
when an API's aggregate performance stays steady, its error modes can shift
across different subtypes of data between 2020 and 2022. Such changes can
substantially impact entire analytics pipelines that use an ML API as a
component. We further use HAPI to study commercial APIs' performance
disparities across demographic subgroups over time. HAPI can stimulate more
research in the growing field of MLaaS.
Comment: Preprint, to appear in NeurIPS 2022
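Because every instance pairs a query with the API's prediction and confidence, longitudinal questions like the accuracy drops above reduce to group-bys over (API, dataset, year). A minimal pandas sketch with hypothetical column names (the released dataset's actual schema may differ):

```python
import pandas as pd

# Hypothetical HAPI-style records: one row per (query, API call).
records = pd.DataFrame([
    {"api": "provider_a_vision", "dataset": "bench_x", "year": 2020,
     "pred": "cat", "label": "cat", "conf": 0.97},
    {"api": "provider_a_vision", "dataset": "bench_x", "year": 2022,
     "pred": "dog", "label": "cat", "conf": 0.81},
])

records["correct"] = records["pred"] == records["label"]
acc = (records.groupby(["api", "dataset", "year"])["correct"]
              .mean()
              .unstack("year"))        # one accuracy column per year
acc["delta"] = acc[2022] - acc[2020]   # negative delta = accuracy drop
print(acc)
```

The same pivot, grouped by a demographic attribute instead of a dataset, supports the subgroup-disparity analyses the paper describes.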
Domino: Discovering Systematic Errors with Cross-Modal Embeddings
Machine learning models that achieve high overall accuracy often make
systematic errors on important subsets (or slices) of data. Identifying
underperforming slices is particularly challenging when working with
high-dimensional inputs (e.g. images, audio), where important slices are often
unlabeled. In order to address this issue, recent studies have proposed
automated slice discovery methods (SDMs), which leverage learned model
representations to mine input data for slices on which a model performs poorly.
To be useful to a practitioner, these methods must identify slices that are
both underperforming and coherent (i.e. united by a human-understandable
concept). However, no quantitative evaluation framework currently exists for
rigorously assessing SDMs with respect to these criteria. Additionally, prior
qualitative evaluations have shown that SDMs often identify slices that are
incoherent. In this work, we address these challenges by first designing a
principled evaluation framework that enables a quantitative comparison of SDMs
across 1,235 slice discovery settings in three input domains (natural images,
medical images, and time-series data). Then, motivated by the recent
development of powerful cross-modal representation learning approaches, we
present Domino, an SDM that leverages cross-modal embeddings and a novel
error-aware mixture model to discover and describe coherent slices. We find
that Domino accurately identifies 36% of the 1,235 slices in our framework - a
12 percentage point improvement over prior methods. Further, Domino is the
first SDM that can provide natural language descriptions of identified slices,
correctly generating the exact name of the slice in 35% of settings.
Comment: ICLR 2022 (Oral)
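The error-aware mixture idea can be approximated in a few lines: cluster instances in the cross-modal embedding space while letting the evaluated model's error signal pull erroneous points into shared components. A simplified sketch, not the released Domino implementation (the component count and error weighting are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def discover_slices(embeddings, errors, n_slices=10, error_weight=5.0):
    """embeddings: (n, d) cross-modal (e.g., CLIP-style) embeddings.
    errors: (n,) 0/1 array marking where the evaluated model erred."""
    # Append a scaled error signal so components trade off coherence in
    # embedding space against purity in error space.
    feats = np.hstack([embeddings, error_weight * errors[:, None]])
    gmm = GaussianMixture(n_components=n_slices, covariance_type="diag",
                          random_state=0).fit(feats)
    assign = gmm.predict(feats)
    # Rank candidate slices by how strongly errors concentrate in them.
    err_rate = np.array([errors[assign == k].mean() if (assign == k).any()
                         else 0.0 for k in range(n_slices)])
    return assign, np.argsort(err_rate)[::-1]
```

Working in a shared image-text embedding space is also what makes descriptions possible: a slice's mean embedding can be matched against candidate text embeddings to name the concept that unites it.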
Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
Machine learning models are increasingly being scaled in both sequence length
and model dimension to reach longer contexts and better performance. However,
existing architectures such as Transformers scale quadratically along both
these axes. We ask: are there performant architectures that can scale
sub-quadratically along sequence length and model dimension? We introduce
Monarch Mixer (M2), a new architecture that uses the same sub-quadratic
primitive along both sequence length and model dimension: Monarch matrices, a
simple class of expressive structured matrices that captures many linear
transforms, achieves high hardware efficiency on GPUs, and scales
sub-quadratically. As a proof of concept, we explore the performance of M2 in
three domains: non-causal BERT-style language modeling, ViT-style image
classification, and causal GPT-style language modeling. For non-causal
BERT-style modeling, M2 matches BERT-base and BERT-large in downstream GLUE
quality with up to 27% fewer parameters, and achieves up to 9.1x higher
throughput at sequence length 4K. On ImageNet, M2 outperforms ViT-b by 1% in
accuracy, with only half the parameters. Causal GPT-style models introduce a
technical challenge: enforcing causality via masking introduces a quadratic
bottleneck. To alleviate this bottleneck, we develop a novel theoretical view
of Monarch matrices based on multivariate polynomial evaluation and
interpolation, which lets us parameterize M2 to be causal while remaining
sub-quadratic. Using this parameterization, M2 matches GPT-style Transformers
at 360M parameters in pretraining perplexity on The PILE--showing for the first
time that it may be possible to match Transformer quality without attention or
MLPs.
Comment: NeurIPS 2023 (Oral)
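The sub-quadratic primitive is easy to sketch: a Monarch matrix factors into block-diagonal multiplies interleaved with a fixed reshape-transpose permutation, giving an O(n^1.5) matrix-vector product. A minimal NumPy sketch of one such parameterization (the paper's exact factorization, including where the permutations sit, may differ):

```python
import numpy as np

def monarch_matvec(x, L_blocks, R_blocks):
    """Apply a Monarch-style matrix (block-diagonal factors L and R
    interleaved with a reshape-transpose permutation) to x.
    x: (n,) with n = m*m; L_blocks, R_blocks: (m, m, m), i.e. m blocks
    of size m x m forming two block-diagonal matrices."""
    m = R_blocks.shape[0]
    X = x.reshape(m, m)                        # split x into m sub-vectors
    X = np.einsum("kij,kj->ki", R_blocks, X)   # block-diagonal multiply (R)
    X = X.T                                    # the fixed permutation
    X = np.einsum("kij,kj->ki", L_blocks, X)   # block-diagonal multiply (L)
    return X.reshape(-1)

# Two batched m x m matmuls cost 2*m*m^2 = 2*n**1.5 multiplies,
# versus n**2 for a dense matrix-vector product.
m = 32
x = np.random.randn(m * m)
L = np.random.randn(m, m, m)
R = np.random.randn(m, m, m)
assert monarch_matvec(x, L, R).shape == (m * m,)
```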
DataPerf: Benchmarks for Data-Centric AI Development
Machine learning research has long focused on models rather than datasets,
and prominent datasets are used for common ML tasks without regard to the
breadth, difficulty, and faithfulness of the underlying problems. Neglecting
the fundamental importance of data has given rise to inaccuracy, bias, and
fragility in real-world applications, and research is hindered by saturation
across existing dataset benchmarks. In response, we present DataPerf, a
community-led benchmark suite for evaluating ML datasets and data-centric
algorithms. We aim to foster innovation in data-centric AI through competition,
comparability, and reproducibility. We enable the ML community to iterate on
datasets, instead of just architectures, and we provide an open, online
platform with multiple rounds of challenges to support this iterative
development. The first iteration of DataPerf contains five benchmarks covering
a wide spectrum of data-centric techniques, tasks, and modalities in vision,
speech, acquisition, debugging, and diffusion prompting, and we support hosting
new contributed benchmarks from the community. The benchmarks, online
evaluation platform, and baseline implementations are open source, and the
MLCommons Association will maintain DataPerf to ensure long-term benefits to
academia and industry.
Comment: NeurIPS 2023 Datasets and Benchmarks Track
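The data-centric framing inverts the usual benchmark contract: the model and test set stay fixed, and submissions compete on the training data itself, for example which subset to select or which labels to repair. A hypothetical sketch of that loop, not DataPerf's actual harness:

```python
# Hypothetical data-centric benchmark loop: model and test set are
# frozen; only the selected training data varies across submissions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, random_state=0)

def score_submission(selected_idx):
    """Score a submission = a choice of training indices into the pool."""
    model = LogisticRegression(max_iter=1000)          # fixed model
    model.fit(X_pool[selected_idx], y_pool[selected_idx])
    return model.score(X_test, y_test)                 # fixed evaluation

baseline = score_submission(np.arange(len(y_pool)))    # train on everything
subset = np.random.choice(len(y_pool), 300, replace=False)  # a 300-point entry
print(baseline, score_submission(subset))
```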
HAPI Explorer: Comprehension, Discovery, and Explanation on History of ML APIs
Machine learning prediction APIs offered by Google, Microsoft, Amazon, and many other providers have been continuously adopted in a plethora of applications, such as visual object detection, natural language comprehension, and speech recognition. Despite the importance of systematically studying and comparing different APIs over time, this topic is currently under-explored because of the lack of data and user-friendly exploration tools. To address this issue, we present HAPI Explorer (History of API Explorer), an interactive system that offers easy access to millions of instances of commercial API applications collected over three years, prioritizes attention on user-defined instance regimes, and explains interesting patterns across different APIs, subpopulations, and time periods via visual and natural language interfaces. HAPI Explorer can facilitate further comprehension and exploitation of ML prediction APIs.
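The core interaction (filter to a user-defined regime, then compare APIs over time and summarize the result) maps onto a simple query pattern. A hypothetical sketch over HAPI-style records (HAPI Explorer itself is an interactive visual system, and these column names are illustrative):

```python
import pandas as pd

def describe_regime(df, regime, name):
    """df: HAPI-style records with hypothetical columns
    ["api", "year", "pred", "label", "conf"]; regime: a boolean filter,
    e.g. lambda d: d["conf"] < 0.5 for low-confidence instances."""
    sub = df[regime(df)].assign(
        correct=lambda d: (d["pred"] == d["label"]).astype(float))
    acc = sub.pivot_table(index="api", columns="year",
                          values="correct", aggfunc="mean")
    # Summarize the most-changed API in plain language, in the spirit of
    # the system's natural-language pattern explanations.
    delta = (acc[acc.columns.max()] - acc[acc.columns.min()]).abs()
    top = delta.idxmax()
    return f"In regime '{name}', {top} changed most: {delta[top]:.1%} accuracy shift."
```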