4,645 research outputs found

    Quantifying the relationship between specialisation and reputation in an online platform

    Get PDF
    Online platforms implement digital reputation systems in order to steer individual user behaviour towards outcomes that are deemed desirable on a collective level. At the same time, most online platforms are highly decentralised environments, leaving their users plenty of room to pursue different strategies and diversify behaviour. We provide a statistical characterisation of the user behaviour emerging from the interplay of such competing forces in Stack Overflow, a long-standing knowledge sharing platform. Over the 11 years covered by our analysis, we represent the interactions between users and topics as bipartite networks. We find such networks to display nested structures akin to those observed in ecological systems, demonstrating that the platform's user base consistently self-organises into specialists and generalists, i.e., users who focus on narrow and broad sets of topics, respectively. We relate the emergence of these behaviours to the platform's reputation system with a series of data-driven models, and find specialisation to be statistically associated with a higher ability to post the best answers to a question. We contrast our findings with observations made in top-down environments-such as firms and corporations-where generalist skills are consistently found to be more successful

    Evaluating Large Language Models on a Highly-specialized Topic, Radiation Oncology Physics

    Full text link
    We present the first study to investigate Large Language Models (LLMs) in answering radiation oncology physics questions. Because popular exams like AP Physics, LSAT, and GRE have large test-taker populations and ample test preparation resources in circulation, they may not allow for accurately assessing the true potential of LLMs. This paper proposes evaluating LLMs on a highly-specialized topic, radiation oncology physics, which may be more pertinent to scientific and medical communities in addition to being a valuable benchmark of LLMs. We developed an exam consisting of 100 radiation oncology physics questions based on our expertise at Mayo Clinic. Four LLMs, ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against medical physicists and non-experts. ChatGPT (GPT-4) outperformed all other LLMs as well as medical physicists, on average. The performance of ChatGPT (GPT-4) was further improved when prompted to explain first, then answer. ChatGPT (GPT-3.5 and GPT-4) showed a high level of consistency in its answer choices across a number of trials, whether correct or incorrect, a characteristic that was not observed in the human test groups. In evaluating ChatGPTs (GPT-4) deductive reasoning ability using a novel approach (substituting the correct answer with "None of the above choices is the correct answer."), ChatGPT (GPT-4) demonstrated surprising accuracy, suggesting the potential presence of an emergent ability. Finally, although ChatGPT (GPT-4) performed well overall, its intrinsic properties did not allow for further improvement when scoring based on a majority vote across trials. In contrast, a team of medical physicists were able to greatly outperform ChatGPT (GPT-4) using a majority vote. This study suggests a great potential for LLMs to work alongside radiation oncology experts as highly knowledgeable assistants

    Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine

    Full text link
    Generalist foundation models such as GPT-4 have displayed surprising capabilities in a wide variety of domains and tasks. Yet, there is a prevalent assumption that they cannot match specialist capabilities of fine-tuned models. For example, most explorations to date on medical competency benchmarks have leveraged domain-specific training, as exemplified by efforts on BioGPT and Med-PaLM. We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. Rather than using simple prompting to highlight the model's out-of-the-box capabilities, we perform a systematic exploration of prompt engineering. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks. The prompting methods we explore are general purpose, and make no specific use of domain expertise, removing the need for expert-curated content. Our experimental design carefully controls for overfitting during the prompt engineering process. We introduce Medprompt, based on a composition of several prompting strategies. With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite. The method outperforms leading specialist models such as Med-PaLM 2 by a significant margin with an order of magnitude fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27% reduction in error rate on the MedQA dataset over the best methods to date achieved with specialist models and surpasses a score of 90% for the first time. Beyond medical problems, we show the power of Medprompt to generalize to other domains and provide evidence for the broad applicability of the approach via studies of the strategy on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology.Comment: 21 pages, 7 figure

    Privacy in Public and the contextual conditions of agency

    Get PDF
    Current technology and surveillance practices make behaviors traceable to persons in unprecedented ways. This causes a loss of anonymity and of many privacy measures relied on in the past. These de facto privacy losses are by many seen as problematic for individual psychology, intimate relations and democratic practices such as free speech and free assembly. I share most of these concerns but propose that an even more fundamental problem might be that our very ability to act as autonomous and purposive agents relies on some degree of privacy, perhaps particularly as we act in public and semi-public spaces. I suggest that basic issues concerning action choices have been left largely unexplored, due to a series of problematic theoretical assumptions at the heart of privacy debates. One such assumption has to do with the influential conceptualization of privacy as pertaining to personal intimate facts belonging to a private sphere as opposed to a public sphere of public facts. As Helen Nissenbaum has pointed out, the notion of privacy in public sounds almost like an oxymoron given this traditional private-public dichotomy. I discuss her important attempt to defend privacy in public through her concept of ‘contextual integrity.’ Context is crucial, but Nissenbaum’s descriptive notion of existing norms seems to fall short of a solution. I here agree with Joel Reidenberg’s recent worries regarding any approach that relies on ‘reasonable expectations’ . The problem is that in many current contexts we have no such expectations. Our contexts have already lost their integrity, so to speak. By way of a functional and more biologically inspired account, I analyze the relational and contextual dynamics of both privacy needs and harms. Through an understanding of action choice as situated and options and capabilities as relational, a more consequence-oriented notion of privacy begins to appear. I suggest that privacy needs, harms and protections are relational. Privacy might have less to do with seclusion and absolute transactional control than hitherto thought. It might instead hinge on capacities to limit the social consequences of our actions through knowing and shaping our perceptible agency and social contexts of action. To act with intent we generally need the ability to conceal during exposure. If this analysis is correct then relational privacy is an important condition for autonomic purposive and responsible agency—particularly in public space. Overall, this chapter offers a first stab at a reconceptualization of our privacy needs as relational to contexts of action. In terms of ‘rights to privacy’ this means that we should expand our view from the regulation and protection of the information of individuals to questions of the kind of contexts we are creating. I am here particularly interested in what I call ‘unbounded contexts’, i.e. cases of context collapses, hidden audiences and even unknowable future agents

    Ask Me Anything: A simple strategy for prompting language models

    Full text link
    Large language models (LLMs) transfer well to new tasks out-of-the-box simply given a natural language prompt that demonstrates how to perform the task and no additional training. Prompting is a brittle process wherein small modifications to the prompt can cause large variations in the model predictions, and therefore significant effort is dedicated towards designing a painstakingly "perfect prompt" for a task. To mitigate the high degree of effort involved in prompt-design, we instead ask whether producing multiple effective, yet imperfect, prompts and aggregating them can lead to a high quality prompting strategy. Our observations motivate our proposed prompting method, ASK ME ANYTHING (AMA). We first develop an understanding of the effective prompt formats, finding that question-answering (QA) prompts, which encourage open-ended generation ("Who went to the park?") tend to outperform those that restrict the model outputs ("John went to the park. Output True or False."). Our approach recursively uses the LLM itself to transform task inputs to the effective QA format. We apply the collected prompts to obtain several noisy votes for the input's true label. We find that the prompts can have very different accuracies and complex dependencies and thus propose to use weak supervision, a procedure for combining the noisy predictions, to produce the final predictions for the inputs. We evaluate AMA across open-source model families (e.g., EleutherAI, BLOOM, OPT, and T0) and model sizes (125M-175B parameters), demonstrating an average performance lift of 10.2% over the few-shot baseline. This simple strategy enables the open-source GPT-J-6B model to match and exceed the performance of few-shot GPT3-175B on 15 of 20 popular benchmarks. Averaged across these tasks, the GPT-J-6B model outperforms few-shot GPT3-175B. We release our code here: https://github.com/HazyResearch/ama_promptin

    The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation

    Full text link
    Human variation in labeling is often considered noise. Annotation projects for machine learning (ML) aim at minimizing human label variation, with the assumption to maximize data quality and in turn optimize and maximize machine learning metrics. However, this conventional practice assumes that there exists a ground truth, and neglects that there exists genuine human variation in labeling due to disagreement, subjectivity in annotation or multiple plausible answers. In this position paper, we argue that this big open problem of human label variation persists and critically needs more attention to move our field forward. This is because human label variation impacts all stages of the ML pipeline: data, modeling and evaluation. However, few works consider all of these dimensions jointly; and existing research is fragmented. We reconcile different previously proposed notions of human label variation, provide a repository of publicly-available datasets with un-aggregated labels, depict approaches proposed so far, identify gaps and suggest ways forward. As datasets are becoming increasingly available, we hope that this synthesized view on the 'problem' will lead to an open discussion on possible strategies to devise fundamentally new directions.Comment: EMNLP 202

    Welfare Polls: A Synthesis

    Get PDF
    "Welfare polls" are survey instruments that seek to quantify the determinants of human well-being. Currently, three "welfare polling" formats are dominant: contingent-valuation surveys, QALY surveys, and happiness surveys. Each format has generated a large, specialized, scholarly literature, but no comprehensive discussion of welfare polling as a general enterprise exists. This Article seeks to fill that gap. Part I describes the trio of existing formats. Part II discusses the actual and potential uses of welfare polls in government decision making. Part III analyzes in detail the obstacles that welfare polls must overcome to provide useful well-being information, and concludes that they can be genuinely informative. Part IV synthesizes the case for welfare polls, arguing against two types of challenges: the revealed-preference tradition in economics, which insists on using behavior rather than surveys to learn about well-being; and the civic-republican tradition in political theory, which accepts surveys but insists that respondents should be asked to take a "citizen", rather than "consumer" perspective. Part V suggests new directions for welfare polls.

    Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery

    Full text link
    Despite growing interest in using large language models (LLMs) in healthcare, current explorations do not assess the real-world utility and safety of LLMs in clinical settings. Our objective was to determine whether two LLMs can serve information needs submitted by physicians as questions to an informatics consultation service in a safe and concordant manner. Sixty six questions from an informatics consult service were submitted to GPT-3.5 and GPT-4 via simple prompts. 12 physicians assessed the LLM responses' possibility of patient harm and concordance with existing reports from an informatics consultation service. Physician assessments were summarized based on majority vote. For no questions did a majority of physicians deem either LLM response as harmful. For GPT-3.5, responses to 8 questions were concordant with the informatics consult report, 20 discordant, and 9 were unable to be assessed. There were 29 responses with no majority on "Agree", "Disagree", and "Unable to assess". For GPT-4, responses to 13 questions were concordant, 15 discordant, and 3 were unable to be assessed. There were 35 responses with no majority. Responses from both LLMs were largely devoid of overt harm, but less than 20% of the responses agreed with an answer from an informatics consultation service, responses contained hallucinated references, and physicians were divided on what constitutes harm. These results suggest that while general purpose LLMs are able to provide safe and credible responses, they often do not meet the specific information need of a given question. A definitive evaluation of the usefulness of LLMs in healthcare settings will likely require additional research on prompt engineering, calibration, and custom-tailoring of general purpose models.Comment: 27 pages including supplemental informatio
    • …
    corecore