398,466 research outputs found

    PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications

    Full text link
    Large protein language models are adept at capturing the underlying evolutionary information in primary structures, offering significant practical value for protein engineering. Compared to natural language models, protein amino acid sequences have a smaller data volume and a limited combinatorial space. Choosing an appropriate vocabulary size to optimize the pre-trained model is a pivotal issue. Moreover, despite the wealth of benchmarks and studies in the natural language community, there remains a lack of a comprehensive benchmark for systematically evaluating protein language model quality. Given these challenges, PETA trained language models with 14 different vocabulary sizes under three tokenization methods. It conducted thousands of tests on 33 diverse downstream datasets to assess the models' transfer learning capabilities, incorporating two classification heads and three random seeds to mitigate potential biases. Extensive experiments indicate that vocabulary sizes between 50 and 200 optimize the model, whereas sizes exceeding 800 detrimentally affect the model's representational performance. Our code, model weights and datasets are available at https://github.com/ginnm/ProteinPretraining.Comment: 46 pages, 4figures, 9 table

    Natural language processing

    Get PDF
    Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST Chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems - text summarization, information extraction, information retrieval, etc., including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of www and digital libraries ; and (iv) evaluation of NLP systems

    The REVERE project:Experiments with the application of probabilistic NLP to systems engineering

    Get PDF
    Despite natural language’s well-documented shortcomings as a medium for precise technical description, its use in software-intensive systems engineering remains inescapable. This poses many problems for engineers who must derive problem understanding and synthesise precise solution descriptions from free text. This is true both for the largely unstructured textual descriptions from which system requirements are derived, and for more formal documents, such as standards, which impose requirements on system development processes. This paper describes experiments that we have carried out in the REVERE1 project to investigate the use of probabilistic natural language processing techniques to provide systems engineering support

    Overcoming Language Dichotomies: Toward Effective Program Comprehension for Mobile App Development

    Full text link
    Mobile devices and platforms have become an established target for modern software developers due to performant hardware and a large and growing user base numbering in the billions. Despite their popularity, the software development process for mobile apps comes with a set of unique, domain-specific challenges rooted in program comprehension. Many of these challenges stem from developer difficulties in reasoning about different representations of a program, a phenomenon we define as a "language dichotomy". In this paper, we reflect upon the various language dichotomies that contribute to open problems in program comprehension and development for mobile apps. Furthermore, to help guide the research community towards effective solutions for these problems, we provide a roadmap of directions for future work.Comment: Invited Keynote Paper for the 26th IEEE/ACM International Conference on Program Comprehension (ICPC'18

    A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes

    Full text link
    We propose a model to automatically describe changes introduced in the source code of a program using natural language. Our method receives as input a set of code commits, which contains both the modifications and message introduced by an user. These two modalities are used to train an encoder-decoder architecture. We evaluated our approach on twelve real world open source projects from four different programming languages. Quantitative and qualitative results showed that the proposed approach can generate feasible and semantically sound descriptions not only in standard in-project settings, but also in a cross-project setting.Comment: Accepted at ACL 201

    A Guided Tour Of Conceptual Engineering and Conceptual Ethics

    Get PDF
    In this Introduction, we aim to introduce the reader to the basic topic of this book. As part of this, we explain why we are using two different expressions (‘conceptual engineering’ and ‘conceptual ethics’) to describe the topics in the book. We then turn to some of the central foundational issues that arise for conceptual engineering and conceptual ethics, and finally we outline various views one might have about their role in philosophy and inquiry more generally

    Abstract State Machines 1988-1998: Commented ASM Bibliography

    Get PDF
    An annotated bibliography of papers which deal with or use Abstract State Machines (ASMs), as of January 1998.Comment: Also maintained as a BibTeX file at http://www.eecs.umich.edu/gasm
    • …
    corecore