Producing Scheduling that Causes Concurrent Programs to Fail
A noise maker is a tool that seeds a concurrent program with conditional synchronization primitives (such as yield()) to increase the likelihood that a bug manifests itself. This work explores the theory and practice of choosing where in the program to induce such thread switches at runtime. We introduce a novel fault model that classifies locations as "good", "neutral", or "bad", based on the effect of a thread switch at the location. Using the model, we explore how an efficient search for real-life concurrent bugs can be carried out. We accordingly justify the use of probabilistic algorithms for this search and gain a deeper insight into the work done so far on noise making. We validate our approach by experimenting with a set of programs taken from a publicly available multi-threaded benchmark suite. Our empirical evidence demonstrates that real-life behavior is similar to what our model predicts.
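The noise-injection idea above can be illustrated with a minimal Python sketch. This is not the authors' tool: the racy shared counter, the `noise()` helper, and the probability parameter are invented here purely to show how a conditional yield at a "good" location can make a lost-update bug more likely to manifest.

```python
import random
import threading
import time

def noise(p=0.5):
    """Conditional noise primitive: with probability p, hint the scheduler
    to switch threads at this program location."""
    if random.random() < p:
        time.sleep(0)  # relinquish the interpreter briefly

counter = 0

def increment(n, noisy):
    """Racy read-modify-write loop; the gap between read and write is a
    candidate 'good' location for a thread switch."""
    global counter
    for _ in range(n):
        tmp = counter        # read
        if noisy:
            noise()          # seeded thread-switch location
        counter = tmp + 1    # write back (lost updates possible)

def run(noisy, threads=4, iters=1000):
    """Run the racy workload; a final count below threads*iters means
    the lost-update bug manifested."""
    global counter
    counter = 0
    ts = [threading.Thread(target=increment, args=(iters, noisy))
          for _ in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return counter
```

Running `run(True)` typically yields a count well below 4000, while `run(False)` often completes without a visible failure, mirroring the paper's premise that where and whether noise is injected changes the probability that the bug surfaces.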
Characterizing how 'distributional' NLP corpora distance metrics are
A corpus of vector-embedded text documents has some empirical distribution.
Given two corpora, we want to calculate a single metric of distance (e.g.,
Mauve, Frechet Inception) between them. We describe an abstract quality, called
`distributionality', of such metrics. A non-distributional metric tends to use
very local measurements, or uses global measurements in a way that does not
fully reflect the distributions' true distance. For example, if individual
pairwise nearest-neighbor distances are low, it may judge the two corpora to
have low distance, even if their two distributions are in fact far from each
other. A more distributional metric will, in contrast, better capture the
distributions' overall distance. We quantify this quality by constructing a
Known-Similarity Corpora set from two paraphrase corpora and calculating the
distance between paired corpora from it. The distances' trend shape as set
element separation increases should quantify the distributionality of the
metric. We propose that Average Hausdorff Distance and energy distance between
corpora are representative examples of non-distributional and distributional
distance metrics, to which other metrics can be compared, to evaluate how
distributional they are.
Comment: Published in the August 2023 Joint Statistical Meetings proceedings
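The two metrics named above as representative examples can be sketched in a few lines of NumPy. This is a generic illustration, not the paper's evaluation code; the toy 2-D point sets stand in for corpora of vector-embedded documents, and the energy-distance estimator below is the simple biased (V-statistic) form that includes self-distances.

```python
import numpy as np

def pairwise_dists(A, B):
    """Euclidean distance matrix between rows of A and rows of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def avg_hausdorff(A, B):
    """Average Hausdorff Distance: mean nearest-neighbor distance in
    both directions -- a 'non-distributional' metric driven by local
    pairwise measurements."""
    D = pairwise_dists(A, B)
    return 0.5 * (D.min(axis=1).mean() + D.min(axis=0).mean())

def energy_distance(A, B):
    """Energy distance (biased estimator): 2*E|X-Y| - E|X-X'| - E|Y-Y'|,
    a 'distributional' metric sensitive to the overall distributions."""
    dab = pairwise_dists(A, B).mean()
    daa = pairwise_dists(A, A).mean()
    dbb = pairwise_dists(B, B).mean()
    return 2 * dab - daa - dbb

corpus_a = np.array([[0.0, 0.0], [1.0, 0.0]])
corpus_b = corpus_a + np.array([5.0, 0.0])  # same shape, shifted apart
```

Both metrics are zero for identical corpora and grow as the point clouds separate; the paper's point is that they grow in characteristically different ways as distributions diverge.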
Detection of data drift and outliers affecting machine learning model performance over time
A trained ML model is deployed on another `test' dataset where target feature
values (labels) are unknown. Drift is distribution change between the training
and deployment data, which is concerning if model performance changes. For a
cat/dog image classifier, for instance, drift during deployment could be rabbit
images (new class) or cat/dog images with changed characteristics (change in
distribution). We wish to detect these changes but can't measure accuracy
without deployment data labels. We instead detect drift indirectly by
nonparametrically testing the distribution of model prediction confidence for
changes. This generalizes our method and sidesteps domain-specific feature
representation.
We address important statistical issues, particularly Type-1 error control in
sequential testing, using Change Point Models (CPMs; see Adams and Ross 2012).
We also use nonparametric outlier methods to show the user suspicious
observations for model diagnosis, since the before/after change confidence
distributions overlap significantly. In experiments to demonstrate robustness,
we train on a subset of MNIST digit classes, then insert drift (e.g., unseen
digit class) in deployment data in various settings (gradual/sudden changes in
the drift proportion). A novel loss function is introduced to compare the
performance (detection delay, Type-1 and 2 errors) of a drift detector under
different levels of drift class contamination.
Comment: In: JSM Proceedings, Nonparametric Statistics Section, 2020.
Philadelphia, PA: American Statistical Association. 144--16
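The core indirect-detection idea above can be sketched without labels: compare the distribution of model prediction confidences before and after deployment with a nonparametric two-sample test. The sketch below uses a simple batch Kolmogorov-Smirnov statistic rather than the sequential Change Point Models the paper actually employs; the `threshold` value and the Beta-distributed toy confidences are invented for illustration.

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the empirical CDFs of x and y, evaluated at all sample points."""
    data = np.concatenate([x, y])
    cdf_x = np.searchsorted(np.sort(x), data, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), data, side="right") / len(y)
    return float(np.max(np.abs(cdf_x - cdf_y)))

def detect_drift(ref_conf, new_conf, threshold=0.2):
    """Flag drift when the confidence distributions differ by more than
    an (illustrative) fixed threshold -- no deployment labels needed."""
    return ks_statistic(np.asarray(ref_conf), np.asarray(new_conf)) > threshold

rng = np.random.default_rng(0)
ref = rng.beta(8, 2, 500)      # confident predictions on familiar data
same = rng.beta(8, 2, 500)     # fresh sample, no drift
drifted = rng.beta(2, 2, 500)  # confidence collapses under drift
```

A fixed threshold like this does not control Type-1 error under repeated testing, which is precisely why the paper turns to Change Point Models for the sequential setting.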
Using Fuzzy Matching of Queries to Optimize Database Workloads
Directed Acyclic Graphs (DAGs) are commonly used in Databases and Big Data
computational engines like Apache Spark for representing the execution plan of
queries. We refer to such graphs as Query Directed Acyclic Graphs (QDAGs). This
paper uses similarity hashing to derive a fingerprint for a QDAG that
embodies the compute requirements of the underlying query. The
fingerprint, thus obtained, can be used to predict the runtime behaviour of a
query based on queries executed in the past having similar QDAGs. We discuss
two approaches to arrive at a fingerprint, their pros and cons and how aspects
of both approaches can be combined to improve the predictions. Using a hybrid
approach, we demonstrate that we are able to predict runtime behaviour of a
QDAG with more than 80% accuracy.
Comment: 9 pages, 5 figures
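One standard way to realize similarity hashing over graphs like QDAGs is a MinHash signature over structural features. The sketch below is a generic illustration, not the paper's scheme: the dict-based QDAG representation, the operator-edge feature set, and the signature length are all assumptions made here for the example.

```python
import hashlib

def edge_features(qdag):
    """Extract (parent-op, child-op) pairs as the structural feature set.
    qdag: dict mapping node id -> (operator name, list of child node ids)
    -- a hypothetical representation chosen for this sketch."""
    feats = set()
    for node, (op, children) in qdag.items():
        if not children:
            feats.add((op, None))  # leaf operators (e.g. table scans)
        for c in children:
            feats.add((op, qdag[c][0]))
    return feats

def minhash_signature(features, num_hashes=64):
    """MinHash: for each of num_hashes seeded hash functions, keep the
    minimum hash value over the feature set. Signature agreement rate
    estimates Jaccard similarity of the feature sets."""
    return [
        min(int(hashlib.sha1(f"{i}:{f}".encode()).hexdigest(), 16)
            for f in features)
        for i in range(num_hashes)
    ]

def similarity(sig_a, sig_b):
    """Fraction of matching signature positions (estimated Jaccard)."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

plan_a = {"s": ("Scan", []), "f": ("Filter", ["s"]), "g": ("Aggregate", ["f"])}
plan_b = {"s": ("Scan", []), "j": ("Join", ["s"])}
```

Queries whose QDAGs share most operator edges get near-identical signatures, so past runtime observations for similar fingerprints can be reused as predictions, as the abstract describes.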
Predicting Question-Answering Performance of Large Language Models through Semantic Consistency
Semantic consistency of a language model is broadly defined as the model's
ability to produce semantically-equivalent outputs, given
semantically-equivalent inputs. We address the task of assessing
question-answering (QA) semantic consistency of contemporary large language
models (LLMs) by manually creating a benchmark dataset with high-quality
paraphrases for factual questions, and we release the dataset to the community.
We further combine the semantic consistency metric with additional
measurements suggested in prior work as correlating with LLM QA accuracy, for
building and evaluating a framework for factual QA reference-less performance
prediction -- predicting the likelihood that a language model will accurately
answer a question. Evaluating the framework on five contemporary LLMs, we
demonstrate encouraging results that significantly outperform baselines.
Comment: EMNLP2023 GEM workshop, 17 pages
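The semantic-consistency measurement described above reduces to checking answer agreement across paraphrases of the same question. A minimal sketch, assuming a caller-supplied `answer_fn` that stands in for an LLM query (the stub answers and exact-match normalization below are illustrative, not the paper's metric definition):

```python
def consistency_score(answer_fn, paraphrases):
    """Fraction of paraphrase pairs that receive the same normalized
    answer -- 1.0 means the model answered every phrasing identically."""
    answers = [answer_fn(q).strip().lower() for q in paraphrases]
    pairs = [(i, j)
             for i in range(len(answers))
             for j in range(i + 1, len(answers))]
    if not pairs:
        return 1.0
    return sum(answers[i] == answers[j] for i, j in pairs) / len(pairs)

# Stub in place of a real LLM call, purely for demonstration.
def stub_llm(question):
    return {
        "What is the capital of France?": "Paris",
        "Which city is France's capital?": "Paris",
        "France's capital city is?": "Rome",  # an inconsistent answer
    }[question]
```

A score like this, combined with other reference-less signals, is the kind of feature the abstract's framework uses to predict whether the model's answer can be trusted.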