
    GAIA: a benchmark for General AI Assistants

    We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks, which suggests targeting tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit robustness similar to that of the average human on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining answers to 300 of them to power a leaderboard available at https://huggingface.co/gaia-benchmark

    July 2021 Complete


    Improving Neural Question Answering with Retrieval and Generation

    Text-based Question Answering (QA) is a subject of interest both for its practical applications, and as a test-bed to measure the key Artificial Intelligence competencies of Natural Language Processing (NLP) and the representation and application of knowledge. QA has progressed a great deal in recent years by adopting neural networks, the construction of large training datasets, and unsupervised pretraining. Despite these successes, QA models require large amounts of hand-annotated data, struggle to apply supplied knowledge effectively, and can be computationally expensive to operate. In this thesis, we employ natural language generation and information retrieval techniques in order to explore and address these three issues. We first approach the task of Reading Comprehension (RC), with the aim of lifting the requirement for in-domain hand-annotated training data. We describe a method for inducing RC capabilities without requiring hand-annotated RC instances, and demonstrate performance on par with early supervised approaches. We then explore multi-lingual RC, and develop a dataset to evaluate methods which enable training RC models in one language, and testing them in another. Second, we explore open-domain QA (ODQA), and consider how to build models which best leverage the knowledge contained in a Wikipedia text corpus. We demonstrate that retrieval-augmentation greatly improves the factual predictions of large pretrained language models in unsupervised settings. We then introduce a class of retrieval-augmented generator model, and demonstrate its strength and flexibility across a range of knowledge-intensive NLP tasks, including ODQA. Lastly, we study the relationship between memorisation and generalisation in ODQA, developing a behavioural framework based on memorisation to contextualise the performance of ODQA models. Based on these insights, we introduce a class of ODQA model based on the concept of representing knowledge as question-answer pairs, and demonstrate how, by using question generation, such models can achieve high accuracy, fast inference, and well-calibrated predictions

    Condition matters: pupil voices on the design and condition of secondary schools

    This research was produced by Sheffield Hallam University. The project aimed to inform the creation of a national schools Facilities Management network and an ongoing programme to research and benchmark the impact of school condition and design on pupils

    Understanding Teacher Morale

    This study emerged from discussions within the Policy and Planning Council of the Metropolitan Educational Research Consortium (MERC), a research alliance between Virginia Commonwealth University’s School of Education and seven surrounding school divisions. The project has two goals. The first goal is to develop an understanding of the factors that impact teachers’ experience of their work in the current PK12 public school context. Although this topic could be, and has been, investigated through a number of lenses (e.g., burnout, trust, motivation), this project focuses on the idea of teacher morale, a choice that will be discussed in detail in the next section of the report. The study addresses the following three questions: 1. How do teachers experience job satisfaction and morale? 2. What are the dynamics between a teacher’s job-related ideal and the professional culture of the school that support or hinder the experience of job satisfaction and morale? 3. How do differences between schools related to policy context and social context affect the dynamics of job satisfaction and morale? To answer these questions MERC assembled a research team composed of a university researcher, graduate students, and a team of school personnel from the MERC school divisions. Over the course of two years, the team developed a conceptual framework for understanding teacher morale, designed a research study that involved observing and interviewing teachers (n=44) across three purposefully selected middle schools in the Richmond region, and then collected and analyzed the data. This report shares both the process and the findings of this collaborative research effort. The second goal of this research project is to support action by local policy makers, school division leaders, central office personnel, principals, and teachers. The study was commissioned by local school leaders not just to document and reflect on teacher morale, but more importantly to do something about it. As argued above, teachers and the conditions of teachers’ work matter for our students, our schools, and the well-being of our communities and society. In this regard, this report is only one piece of this project’s action and impact plan. While the report does contain a series of recommendations based on findings and how they can be used, the release of the report is tied to additional dissemination and professional development efforts designed to effect change

    Characterizing the Adolescent Male Reader: A Narrative Inquiry into the Reading Lives of 12th Grade Boys Enrolled in IB Language A: Literature

    The problem this research seeks to understand is the intersection of the gender achievement gap in reading and the gendered practices of “the school” as a powerful social institution. This project draws upon the studies of the 1990s-2000s in order to discover the potential impact of the social, cultural, and political movements and shifts that began in 2017. Most earlier research into adolescent male readers was conducted from a deficit perspective. This study explores the reading lives of boys who are positioned at the end of their K-12 public schooling career and are finishing their second year in IB (International Baccalaureate) Language A: Literature HL (Higher Level). Narrative inquiry was employed to develop an understanding of the experiences that shaped the reading lives of 12th grade boys enrolled in the most rigorous English/Language Arts course offered at their high school. The boys composed literacy narratives and reflected upon their reading lives through journaling, discussion, writing, revision, interviews, and a focus group. Analysis of this data set resulted in a continuation of the characterization of the male reader, as well as advice for how educators can create classrooms and curriculum that support gender as a social construct and an expansion of hegemonic masculinity

    Hybrid human-machine information systems for data classification

    Over the last decade, we have seen an intense development of machine learning approaches for solving various tasks in diverse domains. Despite the remarkable advancements in this field, there are still task categories in which machine learning models fall short of the required accuracy. This is the case with tasks that require human cognitive skills, such as sentiment analysis and emotional or contextual understanding. On the other hand, human-based computation approaches, such as crowdsourcing, are popular for solving such tasks. Crowdsourcing enables access to a vast number of groups with different expertise and, if managed properly, generates high-quality results. However, crowdsourcing as a standalone approach is not scalable due to the latency and cost it incurs. Addressing the distinct challenges and limitations of the human-based and machine-based approaches requires bridging the two fields into a hybrid intelligence, seen as a promising approach to solve critical and complex real-world tasks. This thesis focuses on hybrid human-machine information systems, combining machine and human intelligence and leveraging their complementary strengths: the data processing efficiency of machine learning and the data quality generated by crowdsourcing. In this thesis, we present hybrid human-machine models to address challenges along three dimensions: accuracy, latency, and cost. Solving data classification tasks in different domains has different requirements concerning accuracy, latency, and cost criteria. Motivated by this fact, we introduce a master component that evaluates these criteria to find the suitable model as a trade-off solution. In hybrid human-machine information systems, incorporating human judgments is expected to improve the accuracy of the system. To ensure this, we focus on the human intelligence component, integrating profile-aware crowdsourcing for task assignment and data quality control mechanisms in the hybrid pipelines. The proposed conceptual hybrid human-machine models are realized in experiments. Motivated by challenging scenarios and using real-world datasets, we implement the hybrid models in three experiments. Evaluations show that the implemented hybrid human-machine architectures for data classification tasks lead to better results than either of the two approaches individually, improving the overall accuracy at an acceptable cost and latency