4,645 research outputs found
Quantifying the relationship between specialisation and reputation in an online platform
Online platforms implement digital reputation systems in order to steer individual user behaviour towards outcomes that are deemed desirable on a collective level. At the same time, most online platforms are highly decentralised environments, leaving their users plenty of room to pursue different strategies and diversify behaviour. We provide a statistical characterisation of the user behaviour emerging from the interplay of such competing forces in Stack Overflow, a long-standing knowledge sharing platform. Over the 11 years covered by our analysis, we represent the interactions between users and topics as bipartite networks. We find such networks to display nested structures akin to those observed in ecological systems, demonstrating that the platform's user base consistently self-organises into specialists and generalists, i.e., users who focus on narrow and broad sets of topics, respectively. We relate the emergence of these behaviours to the platform's reputation system with a series of data-driven models, and find specialisation to be statistically associated with a higher ability to post the best answers to a question. We contrast our findings with observations made in top-down environments, such as firms and corporations, where generalist skills are consistently found to be more successful.
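As a minimal sketch of the nestedness analysis described above, the Python snippet below scores a binary user-topic matrix with a simplified NODF-style metric. The toy matrix, the function name, and the exact metric variant are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np
from itertools import combinations

def nodf(matrix):
    """Simplified NODF-style nestedness score (0-100) for a binary user-topic matrix.

    Rows are users, columns are topics; matrix[u, t] = 1 means user u was
    active on topic t. Higher scores indicate a more nested structure:
    specialists' topic sets tend to sit inside generalists' topic sets.
    """
    m = np.asarray(matrix, dtype=bool)

    def contributions(rows):
        fills = rows.sum(axis=1)
        out = []
        for i, j in combinations(range(len(rows)), 2):
            hi, lo = (i, j) if fills[i] > fills[j] else (j, i)
            if fills[hi] == fills[lo] or fills[lo] == 0:
                out.append(0.0)  # NODF rewards strictly decreasing fill only
            else:
                overlap = (rows[hi] & rows[lo]).sum()
                out.append(100.0 * overlap / fills[lo])
        return out

    return float(np.mean(contributions(m) + contributions(m.T)))

# Perfectly nested toy matrix: each user's topics contain the next user's.
users_by_topics = np.array([
    [1, 1, 1, 1],  # generalist
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],  # specialist
])
print(f"NODF = {nodf(users_by_topics):.1f}")  # 100.0 for this matrix
```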
Evaluating Large Language Models on a Highly-specialized Topic, Radiation Oncology Physics
We present the first study to investigate Large Language Models (LLMs) in answering radiation oncology physics questions. Because popular exams like AP Physics, LSAT, and GRE have large test-taker populations and ample test preparation resources in circulation, they may not allow for accurately assessing the true potential of LLMs. This paper proposes evaluating LLMs on a highly-specialized topic, radiation oncology physics, which may be more pertinent to scientific and medical communities in addition to being a valuable benchmark of LLMs. We developed an exam consisting of 100 radiation oncology physics questions based on our expertise at Mayo Clinic. Four LLMs, ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against medical physicists and non-experts. ChatGPT (GPT-4) outperformed all other LLMs as well as medical physicists, on average. The performance of ChatGPT (GPT-4) was further improved when prompted to explain first, then answer. ChatGPT (GPT-3.5 and GPT-4) showed a high level of consistency in its answer choices across a number of trials, whether correct or incorrect, a characteristic that was not observed in the human test groups. In evaluating ChatGPT's (GPT-4) deductive reasoning ability using a novel approach (substituting the correct answer with "None of the above choices is the correct answer."), ChatGPT (GPT-4) demonstrated surprising accuracy, suggesting the potential presence of an emergent ability. Finally, although ChatGPT (GPT-4) performed well overall, its intrinsic properties did not allow for further improvement when scoring was based on a majority vote across trials. In contrast, a team of medical physicists was able to greatly outperform ChatGPT (GPT-4) using a majority vote. This study suggests a great potential for LLMs to work alongside radiation oncology experts as highly knowledgeable assistants.
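As a rough illustration of the majority-vote scoring mentioned above, the sketch below tallies per-question votes across repeated answer sets (model trials or human experts) and scores the result against an answer key. The data, tie-handling rule, and function names are illustrative assumptions, not the study's exact protocol.

```python
from collections import Counter

def majority_vote_score(answer_sets, answer_key):
    """Score a group of answer sets by per-question majority vote.

    answer_sets: list of dicts mapping question id -> chosen option, e.g. one
                 dict per LLM trial or per human expert.
    answer_key:  dict mapping question id -> correct option.
    Ties count as incorrect (an assumption; the paper's tie rule is not given).
    """
    correct = 0
    for qid, truth in answer_key.items():
        votes = Counter(ans[qid] for ans in answer_sets if qid in ans)
        if not votes:
            continue
        (top_choice, top_count), *rest = votes.most_common()
        tied = rest and rest[0][1] == top_count
        if not tied and top_choice == truth:
            correct += 1
    return correct / len(answer_key)

# Made-up example: three trials of one model whose errors are highly correlated.
key = {"q1": "B", "q2": "D", "q3": "A"}
llm_trials = [
    {"q1": "B", "q2": "C", "q3": "A"},
    {"q1": "B", "q2": "C", "q3": "A"},
    {"q1": "B", "q2": "C", "q3": "D"},
]
print(f"majority-vote accuracy: {majority_vote_score(llm_trials, key):.2f}")  # 0.67
```

Because a model's wrong answers tend to repeat across trials (q2 above), voting over its own trials adds little, whereas a panel whose errors are less correlated benefits more from the same aggregation.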
Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
Generalist foundation models such as GPT-4 have displayed surprising capabilities in a wide variety of domains and tasks. Yet, there is a prevalent assumption that they cannot match specialist capabilities of fine-tuned models. For example, most explorations to date on medical competency benchmarks have leveraged domain-specific training, as exemplified by efforts on BioGPT and Med-PaLM. We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. Rather than using simple prompting to highlight the model's out-of-the-box capabilities, we perform a systematic exploration of prompt engineering. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks. The prompting methods we explore are general purpose, and make no specific use of domain expertise, removing the need for expert-curated content. Our experimental design carefully controls for overfitting during the prompt engineering process. We introduce Medprompt, based on a composition of several prompting strategies. With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite. The method outperforms leading specialist models such as Med-PaLM 2 by a significant margin with an order of magnitude fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27% reduction in error rate on the MedQA dataset over the best methods to date achieved with specialist models and surpasses a score of 90% for the first time. Beyond medical problems, we show the power of Medprompt to generalize to other domains and provide evidence for the broad applicability of the approach via studies of the strategy on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology.
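The abstract does not name the individual prompting strategies that Medprompt composes. As one illustrative example of this kind of composition, the sketch below implements choice-shuffle ensembling (asking the same multiple-choice question several times with shuffled options and voting over the answers); `ask_llm` is a hypothetical callable standing in for a model call, not an API from the paper.

```python
import random
from collections import Counter

def choice_shuffle_ensemble(question, options, ask_llm, n_votes=5, seed=0):
    """Choice-shuffle ensembling for multiple-choice QA (illustrative sketch).

    The same question is posed n_votes times with the answer options shown in
    a different random order each time; votes are tallied over the option
    text, not the letter, so positional bias in the model is diluted.
    ask_llm(prompt, labels) is a hypothetical callable returning one of the
    displayed letter labels.
    """
    rng = random.Random(seed)
    letters = "ABCDEFGH"
    votes = Counter()
    for _ in range(n_votes):
        shuffled = options[:]
        rng.shuffle(shuffled)
        prompt = question + "\n" + "\n".join(
            f"{letters[i]}. {opt}" for i, opt in enumerate(shuffled)
        )
        label = ask_llm(prompt, letters[: len(shuffled)])
        votes[shuffled[letters.index(label)]] += 1  # vote for the option text
    return votes.most_common(1)[0][0]

# Toy stand-in for a real model call: picks the label of the line mentioning "electron".
toy_llm = lambda prompt, labels: next(
    label for label, line in zip(labels, prompt.splitlines()[1:]) if "electron" in line
)
print(choice_shuffle_ensemble(
    "Which particle carries a negative charge?",
    ["proton", "electron", "neutron", "photon"],
    ask_llm=toy_llm,
))  # -> "electron"
```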
Privacy in Public and the contextual conditions of agency
Current technology and surveillance practices make behaviors traceable to persons in unprecedented ways. This causes a loss of anonymity and of many privacy measures relied on in the past. These de facto privacy losses are seen by many as problematic for individual psychology, intimate relations and democratic practices such as free speech and free assembly. I share most of these concerns but propose that an even more fundamental problem might be that our very ability to act as autonomous and purposive agents relies on some degree of privacy, perhaps particularly as we act in public and semi-public spaces. I suggest that basic issues concerning action choices have been left largely unexplored, due to a series of problematic theoretical assumptions at the heart of privacy debates. One such assumption has to do with the influential conceptualization of privacy as pertaining to personal intimate facts belonging to a private sphere as opposed to a public sphere of public facts. As Helen Nissenbaum has pointed out, the notion of privacy in public sounds almost like an oxymoron given this traditional private-public dichotomy. I discuss her important attempt to defend privacy in public through her concept of ‘contextual integrity.’ Context is crucial, but Nissenbaum’s descriptive notion of existing norms seems to fall short of a solution. I here agree with Joel Reidenberg’s recent worries regarding any approach that relies on ‘reasonable expectations’. The problem is that in many current contexts we have no such expectations. Our contexts have already lost their integrity, so to speak. By way of a functional and more biologically inspired account, I analyze the relational and contextual dynamics of both privacy needs and harms. Through an understanding of action choice as situated and options and capabilities as relational, a more consequence-oriented notion of privacy begins to appear. I suggest that privacy needs, harms and protections are relational. Privacy might have less to do with seclusion and absolute transactional control than hitherto thought. It might instead hinge on capacities to limit the social consequences of our actions through knowing and shaping our perceptible agency and social contexts of action. To act with intent we generally need the ability to conceal during exposure. If this analysis is correct, then relational privacy is an important condition for autonomous, purposive and responsible agency, particularly in public space. Overall, this chapter offers a first stab at a reconceptualization of our privacy needs as relational to contexts of action. In terms of ‘rights to privacy’ this means that we should expand our view from the regulation and protection of the information of individuals to questions of the kind of contexts we are creating. I am here particularly interested in what I call ‘unbounded contexts’, i.e. cases of context collapses, hidden audiences and even unknowable future agents.
Ask Me Anything: A simple strategy for prompting language models
Large language models (LLMs) transfer well to new tasks out-of-the-box simply given a natural language prompt that demonstrates how to perform the task and no additional training. Prompting is a brittle process wherein small modifications to the prompt can cause large variations in the model predictions, and therefore significant effort is dedicated towards designing a painstakingly "perfect prompt" for a task. To mitigate the high degree of effort involved in prompt-design, we instead ask whether producing multiple effective, yet imperfect, prompts and aggregating them can lead to a high quality prompting strategy. Our observations motivate our proposed prompting method, ASK ME ANYTHING (AMA). We first develop an understanding of the effective prompt formats, finding that question-answering (QA) prompts, which encourage open-ended generation ("Who went to the park?"), tend to outperform those that restrict the model outputs ("John went to the park. Output True or False."). Our approach recursively uses the LLM itself to transform task inputs to the effective QA format. We apply the collected prompts to obtain several noisy votes for the input's true label. We find that the prompts can have very different accuracies and complex dependencies and thus propose to use weak supervision, a procedure for combining the noisy predictions, to produce the final predictions for the inputs. We evaluate AMA across open-source model families (e.g., EleutherAI, BLOOM, OPT, and T0) and model sizes (125M-175B parameters), demonstrating an average performance lift of 10.2% over the few-shot baseline. This simple strategy enables the open-source GPT-J-6B model to match and exceed the performance of few-shot GPT3-175B on 15 of 20 popular benchmarks. Averaged across these tasks, the GPT-J-6B model outperforms few-shot GPT3-175B. We release our code here: https://github.com/HazyResearch/ama_promptin
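A rough sketch of the aggregation step described above: several imperfect prompts each cast a noisy vote on each input, and the votes are combined into a final label. The sketch uses a plain (optionally dev-set-weighted) vote as a simplified stand-in for the weak-supervision label model AMA actually uses; the dictionaries and names are made up for illustration.

```python
from collections import defaultdict

def aggregate_prompt_votes(votes_per_prompt, dev_labels=None):
    """Combine noisy per-prompt predictions into final labels.

    votes_per_prompt: dict of prompt_id -> {example_id: predicted_label}.
    dev_labels:       optional {example_id: gold_label}; if given, each prompt
                      is weighted by its accuracy on this small labelled set.

    This is a plain (optionally weighted) vote, a simplified stand-in for the
    weak-supervision label model AMA uses to estimate prompt accuracies and
    dependencies without labelled data.
    """
    weights = {}
    for pid, preds in votes_per_prompt.items():
        if dev_labels:
            scored = [(ex, y) for ex, y in dev_labels.items() if ex in preds]
            hits = sum(preds[ex] == y for ex, y in scored)
            weights[pid] = hits / len(scored) if scored else 1.0
        else:
            weights[pid] = 1.0

    tallies = defaultdict(lambda: defaultdict(float))  # example -> label -> weight
    for pid, preds in votes_per_prompt.items():
        for ex, label in preds.items():
            tallies[ex][label] += weights[pid]
    return {ex: max(t, key=t.get) for ex, t in tallies.items()}

# Three imperfect prompts voting on two examples.
votes = {
    "qa_prompt_1": {"x1": "yes", "x2": "no"},
    "qa_prompt_2": {"x1": "yes", "x2": "yes"},
    "qa_prompt_3": {"x1": "no",  "x2": "no"},
}
print(aggregate_prompt_votes(votes))  # {'x1': 'yes', 'x2': 'no'}
```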
The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation
Human variation in labeling is often considered noise. Annotation projects for machine learning (ML) aim at minimizing human label variation, on the assumption that doing so maximizes data quality and, in turn, machine learning metrics. However, this conventional practice assumes that there exists a ground truth, and neglects that there exists genuine human variation in labeling due to disagreement, subjectivity in annotation or multiple plausible answers. In this position paper, we argue that this big open problem of human label variation persists and critically needs more attention to move our field forward. This is because human label variation impacts all stages of the ML pipeline: data, modeling and evaluation. However, few works consider all of these dimensions jointly, and existing research is fragmented. We reconcile different previously proposed notions of human label variation, provide a repository of publicly-available datasets with un-aggregated labels, depict approaches proposed so far, identify gaps and suggest ways forward. As datasets are becoming increasingly available, we hope that this synthesized view on the 'problem' will lead to an open discussion on possible strategies to devise fundamentally new directions.
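As a small illustration of why keeping un-aggregated labels matters, the sketch below contrasts a majority-vote label with the full distribution of annotator labels for one item. The labels, task, and function are made up for illustration and are not from the paper's repository.

```python
from collections import Counter

def summarize_item_labels(annotations):
    """Contrast an aggregated label with the full label distribution.

    annotations: list of labels from different annotators for one item.
    Returns the majority label, the soft (probability) distribution over
    labels, and simple observed agreement, so downstream modeling and
    evaluation can keep the variation instead of discarding it.
    """
    counts = Counter(annotations)
    n = len(annotations)
    majority = counts.most_common(1)[0][0]
    soft = {label: c / n for label, c in counts.items()}
    agreement = counts.most_common(1)[0][1] / n
    return majority, soft, agreement

# One item where annotators genuinely disagree.
print(summarize_item_labels(["toxic", "not_toxic", "toxic", "unsure"]))
# ('toxic', {'toxic': 0.5, 'not_toxic': 0.25, 'unsure': 0.25}, 0.5)
```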
Welfare Polls: A Synthesis
"Welfare polls" are survey instruments that seek to quantify the determinants of human well-being. Currently, three "welfare polling" formats are dominant: contingent-valuation surveys, QALY surveys, and happiness surveys. Each format has generated a large, specialized, scholarly literature, but no comprehensive discussion of welfare polling as a general enterprise exists. This Article seeks to fill that gap. Part I describes the trio of existing formats. Part II discusses the actual and potential uses of welfare polls in government decision making. Part III analyzes in detail the obstacles that welfare polls must overcome to provide useful well-being information, and concludes that they can be genuinely informative. Part IV synthesizes the case for welfare polls, arguing against two types of challenges: the revealed-preference tradition in economics, which insists on using behavior rather than surveys to learn about well-being; and the civic-republican tradition in political theory, which accepts surveys but insists that respondents should be asked to take a "citizen", rather than "consumer" perspective. Part V suggests new directions for welfare polls.
Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery
Despite growing interest in using large language models (LLMs) in healthcare, current explorations do not assess the real-world utility and safety of LLMs in clinical settings. Our objective was to determine whether two LLMs can serve information needs submitted by physicians as questions to an informatics consultation service in a safe and concordant manner. Sixty-six questions from an informatics consult service were submitted to GPT-3.5 and GPT-4 via simple prompts. Twelve physicians assessed the LLM responses' possibility of patient harm and concordance with existing reports from an informatics consultation service. Physician assessments were summarized based on majority vote. For no question did a majority of physicians deem either LLM response harmful. For GPT-3.5, responses to 8 questions were concordant with the informatics consult report, 20 were discordant, and 9 could not be assessed. There were 29 responses with no majority on "Agree", "Disagree", and "Unable to assess". For GPT-4, responses to 13 questions were concordant, 15 were discordant, and 3 could not be assessed. There were 35 responses with no majority. Responses from both LLMs were largely devoid of overt harm, but less than 20% of the responses agreed with an answer from an informatics consultation service, responses contained hallucinated references, and physicians were divided on what constitutes harm. These results suggest that while general purpose LLMs are able to provide safe and credible responses, they often do not meet the specific information need of a given question. A definitive evaluation of the usefulness of LLMs in healthcare settings will likely require additional research on prompt engineering, calibration, and custom-tailoring of general purpose models.
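A minimal sketch of the majority-vote summarization described above: each physician's rating for a question is tallied, and a category is reported only if it wins a strict majority, otherwise the question is counted as having no majority. Whether the study used a strict majority or a plurality is not stated in the abstract; the threshold, panel, and function here are illustrative assumptions.

```python
from collections import Counter

def summarize_assessments(ratings):
    """Summarize one question's physician ratings by strict majority vote.

    ratings: one label per reviewing physician, e.g. "Agree", "Disagree",
    or "Unable to assess". A category is reported only if it receives more
    than half of the votes; otherwise the question is counted as having no
    majority (an assumption; the abstract does not define the threshold).
    """
    counts = Counter(ratings)
    top_label, top_count = counts.most_common(1)[0]
    return top_label if top_count > len(ratings) / 2 else "No majority"

# Illustrative (made-up) panel of 12 physicians reviewing one LLM response.
panel = ["Agree"] * 5 + ["Disagree"] * 4 + ["Unable to assess"] * 3
print(summarize_assessments(panel))  # "No majority": 5 of 12 is not a strict majority
```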