1,313 research outputs found
GAIA: a benchmark for General AI Assistants
We introduce GAIA, a benchmark for General AI Assistants that, if solved,
would represent a milestone in AI research. GAIA proposes real-world questions
that require a set of fundamental abilities such as reasoning, multi-modality
handling, web browsing, and generally tool-use proficiency. GAIA questions are
conceptually simple for humans yet challenging for most advanced AIs: we show
that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins.
This notable performance disparity contrasts with the recent trend of LLMs
outperforming humans on tasks requiring professional skills in e.g. law or
chemistry. GAIA's philosophy departs from the current trend in AI benchmarks
suggesting to target tasks that are ever more difficult for humans. We posit
that the advent of Artificial General Intelligence (AGI) hinges on a system's
capability to exhibit similar robustness as the average human does on such
questions. Using GAIA's methodology, we devise 466 questions and their answers.
We release our questions while retaining answers to 300 of them to power a
leader-board available at https://huggingface.co/gaia-benchmark
Improving Neural Question Answering with Retrieval and Generation
Text-based Question Answering (QA) is a subject of interest both for its practical applications, and as a test-bed to measure the key Artificial Intelligence competencies of Natural Language Processing (NLP) and the representation and application of knowledge. QA has progressed a great deal in recent years by adopting neural networks, the construction of large training datasets, and unsupervised pretraining. Despite these successes, QA models require large amounts of hand-annotated data, struggle to apply supplied knowledge effectively, and can be computationally expensive to operate. In this thesis, we employ natural language generation and information retrieval techniques in order to explore and address these three issues.
We first approach the task of Reading Comprehension (RC), with the aim of lifting the requirement for in-domain hand-annotated training data. We describe a method for inducing RC capabilities without requiring hand-annotated RC instances, and demonstrate performance on par with early supervised approaches. We then explore multi-lingual RC, and develop a dataset to evaluate methods which enable training RC models in one language, and testing them in another.
Second, we explore open-domain QA (ODQA), and consider how to build models which best leverage the knowledge contained in a Wikipedia text corpus. We demonstrate that retrieval-augmentation greatly improves the factual predictions of large pretrained language models in unsupervised settings. We then introduce a class of retrieval-augmented generator model, and demonstrate its strength and flexibility across a range of knowledge-intensive NLP tasks, including ODQA.
Lastly, we study the relationship between memorisation and generalisation in ODQA, developing a behavioural framework based on memorisation to contextualise the performance of ODQA models. Based on these insights, we introduce a class of ODQA model based on the concept of representing knowledge as question-answer pairs, and demonstrate how, by using question generation, such models can achieve high accuracy, fast inference, and well-calibrated predictions.
Condition matters: pupil voices on the design and condition of secondary schools
This research was produced by Sheffield Hallam University. The project aimed to inform the creation of a national schools Facilities Management network and an ongoing programme to research and benchmark the impact of school condition and design on pupils.
Approaches to researching digital-pedagogical competence development in VE-based teacher education
For the past two decades, Virtual Exchange (VE) has enjoyed increasing popularity in university education, including initial (language) teacher education programmes (O'Dowd, 2018). Collaborating online with colleagues and students from different cultural backgrounds and educational systems has allowed trainees to experience and reflect on issues related to technology and pedagogy in authentic linguistic and intercultural contexts. In 2017/2018, the Evaluating and Upscaling Telecollaborative Teacher Education (EVALUATE) project, an Erasmus+ funded European Policy Experimentation (EPE), collected and analysed data from VEs across the curriculum involving over 1,000 participants at Initial Teacher Education (ITE) institutions in Europe and beyond.
Here, we specifically focus on the impact of VE on their digital-pedagogical competence development. Following a mixed-methods design, we used the Technological Pedagogical Content Knowledge (TPACK) work of Mishra and Koehler (2006) and Schmidt et al. (2009) in a pre-post-test manner. These were complemented by qualitative content analysis of prompted diary entries at key stages during the exchanges to collect further evidence of existing and emerging digital-pedagogical skills among the trainees. Based on one case study of a German-Polish EVALUATE exchange, we will exemplify the aforementioned research methods and associated challenges. We will illustrate the urgent need for initial and in-service teacher education that combines technology and pedagogy and argue for VE as an ideal context to this effect. Finally, we will demonstrate how the chosen research approach has contributed to providing the kind of evidence required by education administrators and policy makers for a systematic integration of VE into teacher education programmes.
Understanding Teacher Morale
This study emerged from discussions within the Policy and Planning Council of the Metropolitan Educational Research Consortium (MERC), a research alliance between Virginia Commonwealth University's School of Education and seven surrounding school divisions.
The project has two goals. The first goal is to develop an understanding of the factors that impact teachers' experience of their work in the current PK12 public school context. Although this topic could be, and has been, investigated through a number of lenses (e.g., burnout, trust, motivation), this project focuses on the idea of teacher morale, a choice that will be discussed in detail in the next section of the report. The study addresses the following three questions:
1. How do teachers experience job satisfaction and morale?
2. What are the dynamics between a teacher's job-related ideal and the professional culture of the school that support or hinder the experience of job satisfaction and morale?
3. How do differences between schools related to policy context and social context affect the dynamics of job satisfaction and morale?
To answer these questions MERC assembled a research team comprised of a university researcher, graduate students, and a team of school personnel from the MERC school divisions. Over the course of two years, the team developed a conceptual framework for understanding teacher morale, designed a research study that involved observing and interviewing teachers (n=44) across three purposefully selected middle schools in the Richmond region, and then collected and analyzed the data. This report shares both the process and the findings of this collaborative research effort.
The second goal of this research project is to support action by local policy makers, school division leaders, central office personnel, principals, and teachers. The study was commissioned by local school leaders not just to document and reflect on teacher morale, but more importantly to do something about it. As argued above, teachers and the conditions of teachers' work matter for our students, our schools, and the well-being of our communities and society. In this regard, this report is only one piece of this project's action and impact plan. While the report does contain a series of recommendations based on findings and how they can be used, the release of the report is tied to additional dissemination and professional development efforts designed to effect change.
Characterizing the Adolescent Male Reader: A Narrative Inquiry into the Reading Lives of 12th Grade Boys Enrolled in IB Language A: Literature
The problem this research seeks to understand is the intersection of the gender achievement gap in reading and the gendered practices of "the school" as a powerful social institution. This project draws upon the studies of the 1990s-2000s in order to discover the potential impact of the social, cultural, and political movements and shifts that began in 2017. Most earlier research into adolescent male readers was conducted from a deficit perspective. This study explores the reading lives of boys who are positioned at the end of their K-12 public schooling career and are finishing their second year in IB (International Baccalaureate) Language A: Literature HL (Higher Level). Narrative inquiry was employed to develop an understanding of the experiences that shaped the reading lives of 12th grade boys enrolled in the most rigorous English/Language Arts course offered at their high school. The boys composed literacy narratives and reflected upon their reading lives through journaling, discussion, writing, revision, interviews, and a focus group. Analysis of this data set resulted in a continuation of the characterization of the male reader, as well as advice for how educators can create classrooms and curriculum that support gender as a social construct and an expansion of hegemonic masculinity.
Hybrid human-machine information systems for data classification
Over the last decade, we have seen an intense development of machine learning approaches for solving various tasks in diverse domains. Despite the remarkable advancements in this field, there are still task categories for which machine learning models fall short of the required accuracy. This is the case with tasks that require human cognitive skills, such as sentiment analysis, emotional or contextual understanding. On the other hand, human-based computation approaches, such as crowdsourcing, are popular for solving such tasks. Crowdsourcing enables access to a vast number of groups with different expertise, and if managed properly, generates high-quality results. However, crowdsourcing as a standalone approach is not scalable due to the latency and cost it brings in.
Addressing the distinct challenges and limitations of the human-based and machine-based approaches requires bridging the two fields into a hybrid intelligence, seen as a promising approach to solve critical and complex real-world tasks. This thesis focuses on hybrid human-machine information systems, combining machine and human intelligence and leveraging their complementary strengths: the data processing efficiency of machine learning and the data quality generated by crowdsourcing.
In this thesis, we present hybrid human-machine models to address the challenges falling into three dimensions: accuracy, latency, and cost. Solving data classification tasks in different domains has different requirements concerning accuracy, latency, and cost criteria. Motivated by this fact, we introduce a master component that evaluates these criteria to find the suitable model as a trade-off solution. In hybrid human-machine information systems, incorporating human judgments is expected to improve the accuracy of the system. Therefore, to ensure this, we focus on the human intelligence component, integrating profile-aware crowdsourcing for task assignment and data quality control mechanisms in the hybrid pipelines.
The proposed conceptual hybrid human-machine models materialize in conducted experiments. Motivated by challenging scenarios and using real-world datasets, we implement the hybrid models in three experiments. Evaluations show that the implemented hybrid human-machine architectures for data classification tasks lead to better results as compared to each of the two approaches individually, improving the overall accuracy at an acceptable cost and latency.
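The abstract above mentions a master component that weighs accuracy, latency, and cost to pick a suitable model. The listing gives no detail on how that component works, so the following is only a minimal sketch of one plausible selection rule: discard candidates that violate the task's requirements, then prefer the cheapest and fastest of the rest. The names `Candidate` and `select_model`, the tie-breaking rule, and all numbers are hypothetical illustrations, not taken from the thesis.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A hypothetical pipeline option with assumed per-item estimates."""
    name: str
    accuracy: float   # expected fraction of correct labels, 0..1
    latency_s: float  # expected seconds per item
    cost: float       # expected cost per item (arbitrary currency)


def select_model(
    candidates: Sequence[Candidate],
    min_accuracy: float,
    max_latency_s: float,
    max_cost: float,
) -> Optional[Candidate]:
    """Return the cheapest (then fastest) candidate meeting all three limits.

    This is an assumed trade-off rule for illustration only; it returns
    None when no candidate satisfies the requirements.
    """
    feasible = [
        c for c in candidates
        if c.accuracy >= min_accuracy
        and c.latency_s <= max_latency_s
        and c.cost <= max_cost
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda c: (c.cost, c.latency_s))


# Illustrative, made-up estimates for three pipeline styles.
CANDIDATES = [
    Candidate("machine_only", accuracy=0.80, latency_s=0.1, cost=0.001),
    Candidate("crowd_only", accuracy=0.95, latency_s=3600.0, cost=0.10),
    Candidate("hybrid", accuracy=0.92, latency_s=600.0, cost=0.03),
]

if __name__ == "__main__":
    choice = select_model(CANDIDATES, min_accuracy=0.90,
                          max_latency_s=1800.0, max_cost=0.05)
    print(choice.name if choice else "no feasible model")
```

Under these made-up estimates, a task that needs at least 0.90 accuracy within 30 minutes and a 0.05 budget per item can only be served by the hybrid option, while a task with looser accuracy requirements falls back to the machine-only option because it is cheapest.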