
    Incorporating External Knowledge through Pre-training for Natural Language to Code Generation

    Open-domain code generation aims to generate code in a general-purpose programming language (such as Python) from natural language (NL) intents. Motivated by the intuition that developers usually retrieve resources on the web when writing code, we explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow, and programming language API documentation. Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa. The code and resources are available at https://github.com/neulab/external-knowledge-codegen.
    Comment: Accepted by ACL 2020
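    The retrieval-based re-sampling idea lends itself to a short illustration. The sketch below is a minimal interpretation, not the authors' exact procedure: mined NL-code pairs are scored by TF-IDF cosine similarity against intents from the target distribution and sampled in proportion to that score, so pairs that resemble the testbed are seen more often during pre-training. The function name and toy data are invented for illustration.

```python
# A minimal sketch of retrieval-based data re-sampling (not the authors' exact
# procedure): mined pairs whose intents look like the target distribution are
# sampled more often. All names and toy data here are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def resample_mined_pairs(mined_pairs, target_intents, k, seed=0):
    """Sample k mined (intent, code) pairs, weighted by each pair's best
    TF-IDF cosine similarity to any intent from the target distribution."""
    intents = [nl for nl, _ in mined_pairs]
    vec = TfidfVectorizer().fit(intents + target_intents)
    scores = cosine_similarity(vec.transform(intents),
                               vec.transform(target_intents)).max(axis=1)
    probs = scores / scores.sum()  # assumes at least one nonzero score
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(mined_pairs), size=k, replace=True, p=probs)
    return [mined_pairs[i] for i in idx]

mined = [("sort a list in reverse order", "xs.sort(reverse=True)"),
         ("read a file line by line", "lines = open('f.txt').readlines()"),
         ("join a list of strings", "', '.join(xs)")]
print(resample_mined_pairs(mined, ["sort a list in descending order"], k=2))
```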

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent achieves an end-to-end task success rate of only 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, show that current state-of-the-art large language models are far from perfect performance on these real-life tasks, and establish WebArena as a way to measure such progress.
    Comment: Our code, data, environment reproduction resources, and video demonstrations are publicly available at https://webarena.dev
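    The "reasoning before acting" baseline pattern can be sketched generically. The skeleton below is a hypothetical stand-in, not WebArena's actual API or action grammar (see https://webarena.dev for the real harness): the agent repeatedly shows the model the task, its recent actions, and the current page rendering, then executes the single action the model proposes.

```python
# A hypothetical skeleton of a reason-then-act web agent; the env/llm
# interfaces are illustrative stand-ins, NOT WebArena's actual API or
# action grammar (see https://webarena.dev for the real harness).
from dataclasses import dataclass

@dataclass
class Observation:
    url: str
    page_text: str  # e.g., a text/accessibility-tree rendering of the page

def run_agent(env, llm, task, max_steps=30):
    """env.reset(task) -> Observation; env.step(action) -> (Observation, done).
    llm(prompt) -> a single action string."""
    obs = env.reset(task)
    history = []
    for _ in range(max_steps):
        # Reason before acting: show the task, recent actions, and the page,
        # then ask for exactly one next action.
        prompt = (f"Task: {task}\n"
                  f"Recent actions: {history[-5:]}\n"
                  f"Page ({obs.url}):\n{obs.page_text}\n"
                  "Think step by step, then output ONE action, e.g. "
                  "click [id] / type [id] [text] / stop [answer].")
        action = llm(prompt)
        history.append(action)
        obs, done = env.step(action)
        if done:
            return action  # the final 'stop [answer]' action
    return None  # ran out of step budget
```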

    Neutrino Oscillations and the Supernova 1987A Signal

    We study the impact of neutrino oscillations on the interpretation of the supernova (SN) 1987A neutrino signal by means of a maximum-likelihood analysis. We focus on oscillations of ν̄_e with ν̄_μ or ν̄_τ with those mixing parameters that would solve the solar neutrino problem. For the small-angle MSW solution (Δm² ≈ 10⁻⁵ eV², sin²2Θ₀ ≈ 0.007), there are no significant oscillation effects on the Kelvin-Helmholtz cooling signal; we confirm previous best-fit values for the neutron-star binding energy and average spectral ν̄_e temperature. There is only marginal overlap between the upper end of the 95.4% CL inferred range of ⟨E_ν̄e⟩ and the lower end of the range of theoretical predictions. Any admixture of the stiffer ν̄_μ spectrum by oscillations aggravates the conflict between experimentally inferred and theoretically predicted spectral properties. For mixing parameters in the neighborhood of the large-angle MSW solution (Δm² ≈ 10⁻⁵ eV², sin²2Θ₀ ≈ 0.7), the oscillations in the SN are adiabatic, but one needs to include the regeneration effect in the Earth, which causes the Kamiokande and IMB detectors to observe different ν̄_e spectra. For the solar vacuum solution (Δm² ≈ 10⁻¹⁰ eV², sin²2Θ₀ ≈ 1) the oscillations in the SN are nonadiabatic; vacuum oscillations take place between the SN and the detector. If either of the large-angle solutions were borne out by the upcoming round of solar neutrino experiments, one would have to conclude that the SN 1987A ν̄_μ and/or ν̄_e spectra had been much softer than predicted by current…
    Comment: Final version with very minor wording changes, to be published in Phys. Rev.
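    For readers outside the subfield, the adiabatic/nonadiabatic distinction above turns on standard two-flavor quantities; the textbook forms are reproduced below and are not specific to this paper's analysis.

```latex
% Standard two-flavor forms (textbook material, not this paper's notation).
% Vacuum survival probability after propagating a distance L at energy E:
P(\bar\nu_e \to \bar\nu_e) = 1 - \sin^2 2\Theta_0 \,\sin^2\!\left(\frac{\Delta m^2 L}{4E}\right)
% MSW adiabaticity parameter at the resonance density; \gamma \gg 1 means the
% matter-induced level crossing is adiabatic, \gamma \lesssim 1 nonadiabatic:
\gamma = \frac{\Delta m^2}{2E}\,\frac{\sin^2 2\Theta_0}{\cos 2\Theta_0}\,
         \left|\frac{\mathrm{d}\ln n_e}{\mathrm{d}r}\right|^{-1}
```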

    A Systematic Evaluation of Large Language Models of Code

    Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (e.g., Codex (Chen et al., 2021)) are not publicly available, leaving many questions about their model and data design decisions. We aim to fill in some of these blanks through a systematic evaluation of the largest existing models: Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot, across various programming languages. Although Codex itself is not open-source, we find that existing open-source models achieve close results in some programming languages, even though they are targeted mainly at natural language modeling. We further identify an important missing piece in the form of a large open-source model trained exclusively on a multi-lingual corpus of code. We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine. In the C programming language, PolyCoder outperforms all models, including Codex. Our trained models are open-source and publicly available at https://github.com/VHellendoorn/Code-LMs, which enables future research and application in this area.
    Comment: DL4C@ICLR 2022, and MAPS@PLDI 2022
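    Since the checkpoints are public, trying the model takes only a few lines with Hugging Face transformers. The model id below is an assumption (a community conversion of the released 2.7B checkpoint); the canonical weights live in the GitHub repository above.

```python
# A minimal completion sketch with Hugging Face transformers. The model id is
# an assumption (a community conversion of the released 2.7B checkpoint);
# verify it, or convert the canonical weights from the GitHub repo yourself.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "NinedayWang/PolyCoder-2.7B"  # assumed Hugging Face id
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

prompt = "def binary_search(arr, target):\n"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```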

    Active Retrieval Augmented Generation

    Despite the remarkable ability of large language models (LMs) to comprehend and generate language, they have a tendency to hallucinate and create factually inaccurate output. Augmenting LMs by retrieving information from external knowledge resources is one promising solution. Most existing retrieval-augmented LMs employ a retrieve-and-generate setup that only retrieves information once, based on the input. This is limiting, however, in more general scenarios involving generation of long texts, where continually gathering information throughout the generation process is essential. There have been some past efforts to retrieve information multiple times while generating outputs, which mostly retrieve documents at fixed intervals using the previous context as queries. In this work, we provide a generalized view of active retrieval augmented generation: methods that actively decide when and what to retrieve across the course of the generation. We propose Forward-Looking Active REtrieval augmented generation (FLARE), a generic retrieval-augmented generation method that iteratively uses a prediction of the upcoming sentence to anticipate future content; if that sentence contains low-confidence tokens, it is used as a query to retrieve relevant documents, and the sentence is regenerated. We test FLARE along with baselines comprehensively over 4 long-form, knowledge-intensive generation tasks/datasets. FLARE achieves superior or competitive performance on all tasks, demonstrating the effectiveness of our method. Code and datasets are available at https://github.com/jzbjyb/FLARE
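    The control flow is simple enough to sketch. The following is a simplified reading of the method, not the authors' implementation (which also supports masking uncertain tokens before querying); the lm and retriever callables and the toy stubs are invented for illustration.

```python
# A simplified sketch of FLARE's forward-looking loop (not the authors'
# implementation): draft the next sentence; if any token is low-confidence,
# retrieve with the draft as the query and regenerate the sentence.
def flare_generate(lm, retriever, question, theta=0.6, max_sentences=10):
    """lm(prompt) -> (next_sentence, per_token_probs); retriever(q) -> docs."""
    answer = ""
    for _ in range(max_sentences):
        draft, probs = lm(f"{question}\n{answer}")
        if not draft:
            break  # the model chose to stop
        if probs and min(probs) < theta:
            # Low-confidence draft: use it as a retrieval query (the paper
            # also considers masking the uncertain tokens first), then
            # regenerate the sentence with retrieved documents in context.
            docs = retriever(draft)
            draft, _ = lm("\n".join(docs) + f"\n{question}\n{answer}")
        answer += draft + " "
    return answer.strip()

# Toy stubs exercising both paths: the first draft is low-confidence, so it
# triggers retrieval and regeneration; the next call ends generation.
drafts = iter([
    ("He studied somewhere in Delaware, maybe.", [0.3, 0.5, 0.4]),
    ("He attended the University of Delaware.", [0.9, 0.95, 0.9]),
])
fake_lm = lambda prompt: next(drafts, ("", []))
fake_retriever = lambda q: ["Joe Biden attended the University of Delaware."]
print(flare_generate(fake_lm, fake_retriever, "Where did Biden study?"))
```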