116 research outputs found
LexGPT 0.1: pre-trained GPT-J models with Pile of Law
This research aims to build generative language models specialized for the
legal domain. The manuscript presents the development of LexGPT models based on
GPT-J models and pre-trained on the Pile of Law. The foundation model built in
this manuscript is the initial step for the development of future applications
in the legal domain, such as further training with reinforcement learning from
human feedback. Another objective of this manuscript is to assist legal
professionals in utilizing language models through the "No Code" approach. By
fine-tuning models with specialized data and without modifying any source code,
legal professionals can create custom language models for downstream tasks with
minimum effort and technical knowledge. The downstream task in this manuscript
is to turn a LexGPT model into a classifier, although the performance is
notably lower than the state-of-the-art result. How to enhance downstream task
performance without modifying the model or its source code is a research topic
for future exploration.
Comment: 10 pages and 2 figures. To be published in the Proceedings of the
Seventeenth International Workshop on Juris-informatics (JURISIN 2023),
hosted by JSAI International Symposia on AI 202
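The "No Code" idea above can be sketched as follows: labelled examples are serialised as plain prompt/completion text so an off-the-shelf fine-tuning script can turn a generative model into a classifier without touching its source code. The field names and separator below are illustrative assumptions, not the paper's exact format.

```python
import json

def to_finetune_record(text, label, sep="\n###\n"):
    # Hypothetical record layout: the separator marks where the model
    # should stop reading the input and start emitting the label.
    return {"prompt": text + sep, "completion": " " + label}

examples = [
    ("The lessee shall pay rent monthly.", "lease"),
    ("The employee agrees to a non-compete clause.", "employment"),
]
# One JSON line per example, ready for a generic fine-tuning pipeline.
records = [json.dumps(to_finetune_record(t, l)) for t, l in examples]
```

At inference time, classification reduces to generating the completion after the separator, so no model or source-code changes are needed.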
Evaluating Generative Patent Language Models
Generative language models are promising for assisting human writing in
various domains. This manuscript aims to build generative language models in
the patent domain and evaluate model performance from a human-centric
perspective. The perspective is to measure the ratio of keystrokes that can be
saved by autocompletion based on generative patent language models. A higher
ratio means a more effective model that can save more keystrokes. This metric
can be used to benchmark model performance. The metric is different from
conventional machine-centric metrics that are token-based instead of
keystroke-based. In terms of model size, the largest model built in this
manuscript is 6B, which is state-of-the-art in the patent domain. Based on the
metric, it is found that the largest model is not necessarily the best for the
human-centric metric. This finding suggests that continuing to increase model
size in the patent domain may be unnecessary if the purpose is to assist human
writing with autocompletion. Several patent language models are pre-trained
from scratch in this research. The pre-trained models are released for future
researchers. Several visualization tools are also provided. The importance of
building a generative language model in the patent domain is the potential to
facilitate creativity and innovation in the future.
Comment: 12 pages, 7 figures, and 5 table
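The keystroke-based metric above can be sketched as follows. This is an illustrative simplification, not the paper's exact protocol: tokens are counted without spaces, and accepting a suggested token is assumed to cost a single keystroke (e.g. pressing Tab).

```python
def keystrokes_saved_ratio(target_tokens, suggest):
    """Fraction of keystrokes saved by token-level autocompletion.
    `suggest` maps a prefix (tuple of tokens) to the model's predicted
    next token, or None if it makes no suggestion."""
    typed = 0
    total = 0
    prefix = []
    for tok in target_tokens:
        total += len(tok)
        if suggest(tuple(prefix)) == tok:
            typed += 1           # one key to accept the suggestion
        else:
            typed += len(tok)    # type the token in full
        prefix.append(tok)
    return 1 - typed / total

# Toy "model" that always predicts "claim" right after "the".
toy = lambda prefix: "claim" if prefix and prefix[-1] == "the" else None
ratio = keystrokes_saved_ratio(["the", "claim", "of", "the", "claim"], toy)
```

A better model drives `typed` down and the ratio up, which is what makes the metric human-centric rather than token-based.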
Do Language Models Plagiarize?
Past literature has illustrated that language models (LMs) often memorize
parts of training instances and reproduce them in natural language generation
(NLG) processes. However, it is unclear to what extent LMs "reuse" a training
corpus. For instance, models can generate paraphrased sentences that are
contextually similar to training samples. In this work, therefore, we study
three types of plagiarism (i.e., verbatim, paraphrase, and idea) among GPT-2
generated texts, in comparison to its training data, and further analyze the
plagiarism patterns of fine-tuned LMs with domain-specific corpora which are
extensively used in practice. Our results suggest that (1) three types of
plagiarism widely exist in LMs beyond memorization, (2) both size and decoding
methods of LMs are strongly associated with the degrees of plagiarism they
exhibit, and (3) fine-tuned LMs' plagiarism patterns vary based on their corpus
similarity and homogeneity. Given that a majority of LMs' training data is
scraped from the Web without informing content owners, their reiteration of
words, phrases, and even core ideas from training sets into generated texts has
ethical implications. These patterns are likely to worsen as both the size
of LMs and their training data increase, raising concerns about
indiscriminately pursuing larger models with larger training corpora.
Plagiarized content can also contain individuals' personal and sensitive
information. These findings overall cast doubt on the practicality of current
LMs in mission-critical writing tasks and urge more discussions around the
observed phenomena. Data and source code are available at
https://github.com/Brit7777/LM-plagiarism.
Comment: Accepted to WWW'2
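Of the three plagiarism types above, the verbatim category is the simplest to measure. A crude sketch (an assumption for illustration, not the paper's method) is the fraction of a generated text's n-grams that occur verbatim in the training corpus; paraphrase and idea reuse require semantic-similarity methods instead.

```python
def ngrams(tokens, n):
    # All contiguous token n-grams of the sequence, as a set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(generated, corpus, n=4):
    """Fraction of the generated text's n-grams that appear verbatim
    in the corpus -- a rough proxy for verbatim plagiarism only."""
    gen_grams = ngrams(generated.split(), n)
    if not gen_grams:
        return 0.0
    return len(gen_grams & ngrams(corpus.split(), n)) / len(gen_grams)

corpus = "a b c d e f g h i j"
generated = "a b c d e f g h x y"
score = verbatim_overlap(generated, corpus, n=4)
```

Larger n makes the check stricter; memorized spans show up as long runs of matching n-grams.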
The artefacts of intelligence: governing scientists' contribution to AI proliferation
This DPhil dissertation is about attempts to govern how artificial intelligence (AI) researchers share their work. There is growing concern that the software artefacts built by AI researchers will have adverse impacts on society if made freely available online. AI research is a scientific field, and openly sharing these artefacts is routine and expected as part of the functioning of that field. Recently, members of the AI research community have trialled new ways of sharing their work, in response to concerns that it poses risks to society. Three case studies are examined: the ‘staged release’ of the GPT-2 language model, where more capable models were gradually released; the platform through which researchers and developers could access GPT-3, the successor to GPT-2; and a wave of new ethics regimes for AI conference publications. The study relies on 42 qualitative interviews with members of the AI research community, conducted between 2019 and 2021, as well as many other publicly available sources such as blog posts and Twitter. The aim is to understand how concerns about risk can become a feature of the way AI research is shared. Major themes are: the relationship between science and society; the relationship between industry AI labs and academia; the interplay between AI risks and AI governance regimes; and how the existing scientific field provides an insecure footing for new governance regimes.
Generative Transformers for Design Concept Generation
Generating novel and useful concepts is essential during the early design
stage to explore a large variety of design opportunities, which usually
requires advanced design thinking ability and a wide range of knowledge from
designers. A growing body of work on computer-aided tools has explored the
retrieval of knowledge and heuristics from design data. However, these tools
provide stimuli that inspire designers in only a limited range of ways. This
study explores recent advances in natural language generation (NLG) in the
artificial intelligence (AI) field to automate early-stage design concept
generation.
Specifically, a novel approach utilizing the generative pre-trained transformer
(GPT) is proposed to leverage the knowledge and reasoning from textual data and
transform them into new concepts in understandable language. Three concept
generation tasks are defined to leverage different knowledge and reasoning:
domain knowledge synthesis, problem-driven synthesis, and analogy-driven
synthesis. The experiments with both human and data-driven evaluation show good
performance in generating novel and useful concepts.
Comment: Accepted by J. Comput. Inf. Sci. En
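The three concept-generation tasks named above differ mainly in what the prompt supplies to the model. The templates below are hypothetical illustrations of that difference, not the prompts used in the study.

```python
# Hypothetical prompt templates for the three tasks; the study's actual
# prompt wording is an assumption here.
TEMPLATES = {
    "domain":  "Domain: {domain}\nSynthesise a new design concept for this domain:",
    "problem": "Problem: {problem}\nPropose a design concept that solves this problem:",
    "analogy": ("Source concept: {source}\nTarget domain: {domain}\n"
                "Generate an analogous design concept in the target domain:"),
}

def build_prompt(task, **fields):
    # Fill the chosen template; a GPT model would then complete the text.
    return TEMPLATES[task].format(**fields)

prompt = build_prompt("problem", problem="reduce water use in irrigation")
```

Domain-knowledge synthesis conditions on a field, problem-driven synthesis on a need, and analogy-driven synthesis on a source concept plus a target domain.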