7 research outputs found

    MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks

    Recently, there has been a rapid advancement in research on Large Language Models (LLMs), resulting in significant progress in several Natural Language Processing (NLP) tasks. Consequently, there has been a surge in LLM evaluation research to comprehend the models' capabilities and limitations. However, much of this research has been confined to the English language, leaving LLM building and evaluation for non-English languages relatively unexplored. There has been an introduction of several new LLMs, necessitating their evaluation on non-English languages. This study aims to expand our MEGA benchmarking suite by including six new datasets to form the MEGAVERSE benchmark. The benchmark comprises 22 datasets covering 81 languages, including low-resource African languages. We evaluate several state-of-the-art LLMs like GPT-3.5-Turbo, GPT4, PaLM2, and Llama2 on the MEGAVERSE datasets. Additionally, we include two multimodal datasets in the benchmark and assess the performance of the LLaVa-v1.5 model. Our experiments suggest that GPT4 and PaLM2 outperform the Llama models on various tasks, notably on low-resource languages, with GPT4 outperforming PaLM2 on more datasets than vice versa. However, issues such as data contamination must be addressed to obtain an accurate assessment of LLM performance on non-English languages. Comment: 23 pages, 30 figures and 1 table.

    Effectiveness of a national quality improvement programme to improve survival after emergency abdominal surgery (EPOCH): a stepped-wedge cluster-randomised trial

    Background: Emergency abdominal surgery is associated with poor patient outcomes. We studied the effectiveness of a national quality improvement (QI) programme to implement a care pathway to improve survival for these patients. Methods: We did a stepped-wedge cluster-randomised trial of patients aged 40 years or older undergoing emergency open major abdominal surgery. Eligible UK National Health Service (NHS) hospitals (those that had an emergency general surgical service, a substantial volume of emergency abdominal surgery cases, and contributed data to the National Emergency Laparotomy Audit) were organised into 15 geographical clusters and commenced the QI programme in a random order, based on a computer-generated random sequence, over an 85-week period with one geographical cluster commencing the intervention every 5 weeks from the second to the 16th time period. Patients were masked to the study group, but it was not possible to mask hospital staff or investigators. The primary outcome measure was mortality within 90 days of surgery. Analyses were done on an intention-to-treat basis. This study is registered with the ISRCTN registry, number ISRCTN80682973. Findings: Treatment took place between March 3, 2014, and Oct 19, 2015. 22 754 patients were assessed for eligibility. Of 15 873 eligible patients from 93 NHS hospitals, primary outcome data were analysed for 8482 patients in the usual care group and 7374 in the QI group. Eight patients in the usual care group and nine patients in the QI group were not included in the analysis because of missing primary outcome data. The primary outcome of 90-day mortality occurred in 1210 (16%) patients in the QI group compared with 1393 (16%) patients in the usual care group (HR 1·11, 0·96–1·28). Interpretation: No survival benefit was observed from this QI programme to implement a care pathway for patients undergoing emergency abdominal surgery. Future QI programmes should ensure that teams have both the time and resources needed to improve patient care. Funding: National Institute for Health Research Health Services and Delivery Research Programme.

    Evolving Mario levels in the latent space of a deep convolutional generative adversarial network

    © 2018 Copyright held by the owner/author(s). Generative Adversarial Networks (GANs) are a machine learning approach capable of generating novel example outputs across a space of provided training examples. Procedural Content Generation (PCG) of levels for video games could benefit from such models, especially for games where there is a pre-existing corpus of levels to emulate. This paper trains a GAN to generate levels for Super Mario Bros using a level from the Video Game Level Corpus. The approach successfully generates a variety of levels similar to one in the original corpus, but is further improved by application of the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Specifically, various fitness functions are used to discover levels within the latent space of the GAN that maximize desired properties. Simple static properties are optimized, such as a given distribution of tile types. Additionally, the champion A* agent from the 2009 Mario AI competition is used to assess whether a level is playable, and how many jumping actions are required to beat it. These fitness functions allow for the discovery of levels that exist within the space of examples designed by experts, and also guide the search towards levels that fulfill one or more specified objectives.