5 research outputs found

A framework for categorising AI evaluation instruments

Author: Cohn Anthony G
Hernández-Orallo José
Mboli Julius Sechang
Moros-Daval Yael
Xiang Zhiliang
Zhou Lexin
Publication venue: CEUR Workshop Proceedings
Publication date: 24/07/2022

The current and future capabilities of Artificial Intelligence (AI) are typically assessed with an ever-increasing number of benchmarks, competitions, tests and evaluation standards, which are meant to work as AI evaluation instruments (EIs). These EIs are increasing not only in number but also in complexity and diversity, making it hard to understand this evaluation landscape in a meaningful way. In this paper we present an approach for categorising EIs using a set of 18 facets, accompanied by a rubric that allows anyone to apply the framework to any existing or new EI. We apply the rubric to 23 EIs in different domains through a team of raters, and analyse how consistent the rubric is and how well it works to distinguish between EIs and map the evaluation landscape in AI.

Online Research @ Cardiff

Predictable Artificial Intelligence

Author: Burden John
Burnell Ryan
Cheke Lucy
Ferri Cèsar
Hernández-Orallo José
hÉigeartaigh Seán Ó
Marcoci Alexandru
Martínez-Plumed Fernando
Mehrbakhsh Behzad
Moreno-Casares Pablo A.
Moros-Daval Yael
Rutar Danaja
Schellaert Wout
Voudouris Konstantinos
Zhou Lexin
Publication date: 09/10/2023

We introduce the fundamental ideas and challenges of Predictable AI, a nascent research area that explores the ways in which we can anticipate key indicators of present and future AI ecosystems. We argue that achieving predictability is crucial for fostering trust, liability, control, alignment and safety of AI ecosystems, and thus should be prioritised over performance. While distinctive from other areas of technical and non-technical AI research, the questions, hypotheses and challenges relevant to Predictable AI were yet to be clearly described. This paper aims to elucidate them, calls for identifying paths towards AI predictability and outlines the potential impact of this emergent field.

Comment: 11 pages excluding references, 4 figures, and 2 tables. Paper under review.

arXiv.org e-Print Archive

Multimodal annotation of vision demand scales to estimate object detection capabilities

Author: Moros Daval Yael
Publication venue: Universitat Politècnica de València
Publication date: 15/10/2024

This project aims to identify the vision demands required by various object detection problems, adapting a vision demands scale originally created by a group of experts from the OECD. The scale is organised into five levels of increasing difficulty and describes the visual features that may influence machine vision performance in object detection benchmarks, such as blur, occlusion, lighting conditions, object orientation, overlapping or the presence of multiple objects. The scales are converted into rubrics that can be used by large language models with vision capabilities, such as GPT4-Vision, to annotate the vision demands of large samples of images in multiple object detection datasets (e.g., COCO or VOC). We also use few-shot learning to ensure the annotator's responses align with the expected difficulty levels. Once the benchmarks are annotated, we process all the images framed for two different tasks, object detection and localisation, using a variety of computer vision algorithms, with a particular focus on the YOLO family. We see how performance generally decreases for increasing levels, allowing us to represent agent characteristic curves for different methods and families. The outcome of this work is a methodology to estimate the level of visual capability of current object detection algorithms, rather than their performance, as well as providing some insight into their evolution over time.

Moros Daval, Y. (2024). Multimodal annotation of vision demand scales to estimate object detection capabilities. Universitat Politècnica de València. http://hdl.handle.net/10251/21015

RiuNet

Automated Annotation of Meta-Features for Predicting Language Model Performance in Natural Language Processing Tasks

Author: Moros Daval Yael
Publication venue: Universitat Politècnica de València
Publication date: 19/09/2023

Large language models can be used for a wide range of tasks. The performance on each task instance depends on the specific characteristics of the question (e.g., knowledge or reasoning required) but also on its linguistic components (such as syntactic or semantic elaboration). It is important to determine whether failures depend on the specific elements of the task or on a more general linguistic factor. To this end, this project introduces new methods to evaluate the base linguistic complexity of any task that is expressed in natural language, by identifying and annotating a set of linguistic meta-features that may affect performance. This work proposes a comprehensive list of meta-features, such as the presence of uncertainty, negation or reasoning. For each meta-feature, we identify a set of difficulty levels and write a rubric to map each example to one of these levels. Using this rubric, we also automate the annotation process using large language models, such as GPT. To validate the meta-features and their annotations, both univariate and multivariate analyses are performed to demonstrate the predictability of performance based on meta-feature levels. Large repositories such as BIG-bench and HELM are used for this validation, providing instance-level results for many models and tasks. The project explores the advantages and disadvantages of this automated annotation method, highlighting its flexibility and scalability. However, it also acknowledges the need for post-processing, the cost of tokens, the need for an initial pool of annotated examples, and the prompt engineering effort. By analysing performance on an illustrative set of tasks and models from the previous repositories, the main take-away of this work is to demonstrate the general applicability of the meta-feature approach, its effectiveness and its value in assessing the complexity of NLP tasks.

Moros Daval, Y. (2023). Automated Annotation of Meta-Features for Predicting Language Model Performance in Natural Language Processing Tasks. Universitat Politècnica de València. http://hdl.handle.net/10251/19672

RiuNet

Larger and more instructable language models become less reliable.

Author: Ferri Cèsar
Hernández-Orallo José
Martínez-Plumed Fernando
Moros-Daval Yael
Schellaert Wout
Zhou Lexin
Publication venue: The Nature Conservancy
Publication date: 02/10/2024

The prevailing methods to make large language models more powerful and amenable have been based on continuous scaling up (that is, increasing their size, data volume and computational resources) and bespoke shaping up (including post-filtering, fine tuning or use of human feedback). However, larger and more instructable large language models may have become less reliable. By studying the relationship between difficulty concordance, task avoidance and prompting stability of several language model families, here we show that easy instances for human participants are also easy for the models, but scaled-up, shaped-up models do not secure areas of low difficulty in which either the model does not err or human supervision can spot the errors. We also find that early models often avoid user questions but scaled-up, shaped-up models tend to give an apparently sensible yet wrong answer much more often, including errors on difficult questions that human supervisors frequently overlook. Moreover, we observe that stability to different natural phrasings of the same question is improved by scaling-up and shaping-up interventions, but pockets of variability persist across difficulty levels. These findings highlight the need for a fundamental shift in the design and development of general-purpose artificial intelligence, particularly in high-stakes areas for which a predictable distribution of errors is paramount.

Apollo (Cambridge)