    Stochastic Parrots Looking for Stochastic Parrots: LLMs are Easy to Fine-Tune and Hard to Detect with other LLMs

    The self-attention revolution allowed generative language models to scale and achieve increasingly impressive abilities. Such models - commonly referred to as Large Language Models (LLMs) - have recently gained prominence with the general public, thanks to conversational fine-tuning, which put their behavior in line with public expectations regarding AI. This prominence amplified prior concerns regarding the misuse of LLMs and led to the emergence of numerous tools to detect LLMs in the wild. Unfortunately, most such tools are critically flawed. While major publications in the LLM detectability field suggested that LLMs were easy to detect with fine-tuned autoencoders, the limitations of their results are easy to overlook. Specifically, they assumed publicly available generative models without fine-tunes or non-trivial prompts. While the importance of these assumptions has been demonstrated, until now, it remained unclear how well such detection could be countered. Here, we show that an attacker with access to such detectors' reference human texts and output not only evades detection but can fully frustrate the detector training - with a reasonable budget and all its outputs labeled as such. Achieving this required combining a common "reinforcement from critic" loss-function modification with the AdamW optimizer, which led to surprisingly good fine-tuning generalization. Finally, we warn against the temptation to transpose the conclusions obtained in RNN-driven text GANs to LLMs due to their better representative ability. These results have critical implications for the detection and prevention of malicious use of generative language models, and we hope they will aid the designers of generative models and detectors. Comment: 15 pages, 6 figures; 10 pages, 7 figures Supplementary Materials; under review at ECML 202

    Practical AI Value Alignment Using Stories

    As more machine learning agents interact with humans, there is a growing prospect that an agent trained to perform a task optimally - using only a measure of task performance as feedback - can violate societal norms for acceptable behavior or cause harm. Consequently, it becomes necessary both to prioritize task performance and to ensure that AI actions do not have detrimental effects. Value alignment is a property of intelligent agents, wherein they solely pursue goals and activities that are non-harmful and beneficial to humans. Current approaches to value alignment largely depend on imitation learning or learning from demonstration methods. However, the dynamic nature of values makes it difficult to learn values through imitation learning-based approaches. To overcome the limitations of imitation learning-based approaches, in this work, we introduced a complementary technique in which a value-aligned prior is learned from naturally occurring stories that embody societal norms. This value-aligned prior can detect the normative and non-normative behavior of human society as well as describe the underlying social norms associated with these behaviors. To train our models, we sourced data from the children’s educational comic strip, Goofus & Gallant. Additionally, we have built another dataset by utilizing a crowdsourcing platform. This dataset was created specifically to identify the norms or principles exhibited in the actions depicted within the comic strips. To build a normative prior model, we trained multiple machine learning models to classify natural language descriptions and visual demonstrations of situations found in the comic strip as either normative or non-normative and into different social norms. Finally, to train a value-aligned agent, we introduced a reinforcement learning-based method, in which we train an agent with two reward signals: a standard task performance reward plus a normative behavior reward. The test environment provides the standard task performance reward, while the normative behavior reward is derived from the value-aligned prior model. We show how variations on a policy shaping technique can balance these two sources of reward and produce policies that are both effective and perceived as being more normative. We test our value-alignment technique on different interactive text-based worlds; each world is designed specifically to challenge agents with a task as well as provide opportunities to deviate from the task to engage in normative and/or altruistic behavior.
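
    The abstract above does not spell out how its policy shaping variants combine the two signals. As a minimal sketch, assuming the commonly used multiplicative form of policy shaping, one could turn the task values into an action distribution and reweight it by the probability that each action is normative. The names `shape_policy`, `task_q`, and `norm_probs` are hypothetical and the numbers are toy values, not results from the work.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shape_policy(task_q, norm_probs, temperature=1.0):
    """Multiply the task policy by a normative prior and renormalize.

    task_q     -- hypothetical task action values from the environment reward
    norm_probs -- hypothetical per-action probabilities of being normative,
                  standing in for the output of a value-aligned prior
    """
    task_probs = softmax(task_q, temperature)
    combined = [p * n for p, n in zip(task_probs, norm_probs)]
    total = sum(combined)
    if total == 0.0:
        # The prior vetoed every action; fall back to the task policy.
        return task_probs
    return [c / total for c in combined]


# Toy example: actions 0 and 1 are equally good for the task,
# but the (assumed) prior judges action 0 far more normative.
task_q = [1.0, 1.0, 0.0]
norm_probs = [0.9, 0.1, 0.5]
policy = shape_policy(task_q, norm_probs)
print(policy)
```

    In this form the temperature controls how strongly the task reward dominates; a uniform prior leaves the task policy unchanged.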

    Language Models have a Moral Dimension

    Artificial writing is permeating our lives due to recent advances in large-scale, transformer-based language models (LMs) such as BERT, its variants, GPT-2/3, and others. Using them as pretrained models and fine-tuning them for specific tasks, researchers have extended the state of the art for many NLP tasks and shown that they not only capture linguistic knowledge but also retain general knowledge implicitly present in the data. These and other successes are exciting. Unfortunately, LMs trained on unfiltered text corpora suffer from degenerate and biased behaviour. While this is well established, we show that recent improvements of LMs also store ethical and moral values of the society and actually bring a "moral dimension" to the surface: the values are captured geometrically by a direction in the embedding space, reflecting well the agreement of phrases to social norms implicitly expressed in the training texts. This provides a path for attenuating or even preventing toxic degeneration in LMs. Since one can now rate the (non-)normativity of arbitrary phrases without explicitly training the LM for this task, the moral dimension can be used as a "moral compass" guiding (even other) LMs towards producing normative text, as we will show.
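
    To illustrate what rating phrases along "a direction in the embedding space" can look like, the sketch below projects phrase embeddings onto a single unit vector. The abstract does not say how the direction is obtained; taking the difference between the mean embeddings of normative and non-normative example phrases is an assumption made here for simplicity, and the three-dimensional vectors are invented toy data rather than outputs of any real language model.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def moral_direction(normative, non_normative):
    """Unit vector pointing from non-normative towards normative examples.

    Assumption: difference of class means. The original work may derive
    its direction differently.
    """
    pos = mean_vector(normative)
    neg = mean_vector(non_normative)
    diff = [p - q for p, q in zip(pos, neg)]
    length = math.sqrt(sum(d * d for d in diff))
    return [d / length for d in diff]


def moral_score(embedding, direction):
    """Scalar projection of a phrase embedding onto the direction."""
    return sum(e * d for e, d in zip(embedding, direction))


# Invented toy embeddings standing in for sentence-encoder outputs.
normative = [[1.0, 0.2, 0.0], [0.8, 0.1, 0.1]]
non_normative = [[-0.9, 0.3, 0.0], [-1.0, 0.0, 0.2]]
direction = moral_direction(normative, non_normative)

score_a = moral_score([0.7, 0.5, 0.0], direction)   # resembles the normative set
score_b = moral_score([-0.6, 0.4, 0.1], direction)  # resembles the other set
print(score_a, score_b)
```

    A higher projection indicates closer agreement with the normative examples; no model is trained for the rating itself, only the embeddings are reused.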

    Law Informs Code: A Legal Informatics Approach to Aligning Artificial Intelligence with Humans

    We are currently unable to specify human goals and societal values in a way that reliably directs AI behavior. Law-making and legal interpretation form a computational engine that converts opaque human values into legible directives. "Law Informs Code" is the research agenda embedding legal knowledge and reasoning in AI. Similar to how parties to a legal contract cannot foresee every potential contingency of their future relationship, and legislators cannot predict all the circumstances under which their proposed bills will be applied, we cannot ex ante specify rules that provably direct good AI behavior. Legal theory and practice have developed arrays of tools to address these specification problems. For instance, legal standards allow humans to develop shared understandings and adapt them to novel situations. In contrast to more prosaic uses of the law (e.g., as a deterrent of bad behavior through the threat of sanction), leveraged as an expression of how humans communicate their goals, and what society values, Law Informs Code. We describe how data generated by legal processes (methods of law-making, statutory interpretation, contract drafting, applications of legal standards, legal reasoning, etc.) can facilitate the robust specification of inherently vague human goals. This increases human-AI alignment and the local usefulness of AI. Toward society-AI alignment, we present a framework for understanding law as the applied philosophy of multi-agent alignment. Although law is partly a reflection of historically contingent political power - and thus not a perfect aggregation of citizen preferences - if properly parsed, its distillation offers the most legitimate computational comprehension of societal values available. If law eventually informs powerful AI, engaging in the deliberative political process to improve law takes on even more meaning. Comment: Forthcoming in Northwestern Journal of Technology and Intellectual Property, Volume 2

    Self-Supervised Learning of Machine Ethics

    In recent years Artificial Intelligence (AI), especially deep learning, has proven to be a technology driver in industry. However, while advancing existing and creating novel technologies, automating processes, and assisting humans in essential areas such as drug discovery, these systems raise many concerns, as other groundbreaking technologies did before them. In this case, these concerns include, for instance, models producing stereotypical and derogatory content as well as gender and racial biases. Since AI technologies will permeate more of our lives in the coming years, these concerns need to be addressed. This thesis examines recent data-driven approaches, which often suffer from degenerated and biased behavior through their self-supervised training on large-scale noisy web data, which contains potentially inappropriate data. While this is well-established, we will investigate and demonstrate the promises of deep models’ acquired knowledge and capabilities through the provision of this very particular potentially inappropriate data. Importantly, we present the first approaches for learning ethics from data. Our findings suggest that if we build an AI system that learns an improved representation of data and that is able to better understand and produce it, in the process, it will also acquire more accurate societal knowledge, in this case, historical cultural associations to make human-like "right" and "wrong" choices. Furthermore, based on these findings, we consequently ask the arguably "circular" question of whether a machine can help us mitigate its associated concerns. Importantly, we demonstrate the importance of these models' ability to distinguish between "right" and "wrong" and how utilizing them can mitigate associated risks surrounding large-scale models themselves. However, we also highlight the role of human-machine interaction to explore and reinforce AI systems’ properties, including their flaws and merits, and present how human feedback on explanations can align deep learning based models with our precepts. We present these algorithms and corresponding findings, providing important insights for the goal of putting human values into AI systems, which, summarized, may not be insurmountable in the long run.

    Talkin' 'Bout AI Generation: Copyright and the Generative-AI Supply Chain

    "Does generative AI infringe copyright?" is an urgent question. It is also a difficult question, for two reasons. First, "generative AI" is not just one product from one company. It is a catch-all name for a massive ecosystem of loosely related technologies, including conversational text chatbots like ChatGPT, image generators like Midjourney and DALL-E, coding assistants like GitHub Copilot, and systems that compose music and create videos. These systems behave differently and raise different legal issues. The second problem is that copyright law is notoriously complicated, and generative-AI systems manage to touch on a great many corners of it: authorship, similarity, direct and indirect liability, fair use, and licensing, among much else. These issues cannot be analyzed in isolation, because there are connections everywhere. In this Article, we aim to bring order to the chaos. To do so, we introduce the generative-AI supply chain: an interconnected set of stages that transform training data (millions of pictures of cats) into generations (a new, potentially never-seen-before picture of a cat that has never existed). Breaking down generative AI into these constituent stages reveals all of the places at which companies and users make choices that have copyright consequences. It enables us to trace the effects of upstream technical designs on downstream uses, and to assess who in these complicated sociotechnical systems bears responsibility for infringement when it happens. Because we engage so closely with the technology of generative AI, we are able to shed more light on the copyright questions. We do not give definitive answers as to who should and should not be held liable. Instead, we identify the key decisions that courts will need to make as they grapple with these issues, and point out the consequences that would likely flow from different liability regimes. Comment: Forthcoming, Journal of the Copyright Society of the USA '2

    “I Can See the Forest for the Trees”: Examining Personality Traits with Transformers

    Our understanding of Personality and its structure is rooted in linguistic studies operating under the assumptions made by the Lexical Hypothesis: personality characteristics that are important to a group of people will at some point be codified in their language, with the number of encoded representations of a personality characteristic indicating its importance. Qualitative and quantitative efforts in the dimension reduction of our lexicon throughout the mid-20th century have played a vital role in the field’s eventual arrival at the widely accepted Five Factor Model (FFM). However, there are a number of presently unresolved conflicts regarding the breadth and structure of this model (cf. Hough, Oswald, & Ock, 2015). The present study sought to address such issues through previously unavailable language modeling techniques. The Distributional Semantic Hypothesis (DSH) argues that the meaning of words may be formed through some function of their co-occurrence with other words. There is evidence that DSH-based techniques are cognitively valid, serving as a proxy for learned associations between stimuli (Günther et al., 2019). Given that Personality is often measured through self-report surveys, the present study proposed that a Personality measure be created directly from this source data, using large pre-trained Transformers (a type of neural network that is adept at encoding and decoding semantic representations from natural language). An inventory was constructed and administered, and the response data were analyzed using partial correlation networks. This exploratory study identifies differences in the internal structure of trait-domains, while simultaneously demonstrating a quantitative approach to item creation and survey development.
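
    For readers unfamiliar with partial correlation networks: each edge is the correlation between two variables after controlling for the others. The sketch below shows the three-variable case using the standard first-order partial correlation formula; it is not the study's analysis pipeline, and the item responses are invented toy numbers chosen so that two items correlate only through a third.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_correlation(x, y, z):
    """Correlation of x and y after controlling for z (first-order formula)."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Invented toy responses: items x and y both track item z.
z = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.2, 1.8, 3.0, 3.8, 5.2]
y = [1.2, 2.2, 3.0, 3.8, 4.8]

raw = pearson(x, y)                     # strong marginal correlation
edge = partial_correlation(x, y, z)     # edge weight once z is controlled for
print(raw, edge)
```

    With more than three variables, the same quantities are usually read off the inverse of the correlation matrix, often with regularization to keep the network sparse.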
