On evaluating agent performance in a fixed period of time
The evaluation of several agents over a given task in a finite period of time is a very common problem in experimental design, statistics, computer science, economics and, in general, any experimental science. It is also crucial for intelligence evaluation. In reinforcement learning, the task is formalised as an interactive environment with observations, actions and rewards. Typically, the decision the agent has to make is a choice among a set of actions, cycle after cycle. However, in real evaluation scenarios, time can be intentionally modulated by the agent. Consequently, agents not only choose an action but also the moment at which they perform it. This is natural in biological systems, but it is also an issue in control. In this paper we revisit the classical reward aggregating functions commonly used in reinforcement learning and related areas, analyse their problems, and propose a modification of the average reward to obtain a consistent measurement for continuous time.
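The shift from per-cycle to per-time aggregation that the abstract refers to can be sketched as follows; the (duration, reward) pair convention and the function names are illustrative assumptions for this summary, not the paper's actual formulation.

```python
from typing import List, Tuple

def average_reward_per_cycle(history: List[Tuple[float, float]]) -> float:
    """Classical aggregation: mean reward per decision cycle."""
    rewards = [r for _, r in history]
    return sum(rewards) / len(rewards)

def average_reward_per_time(history: List[Tuple[float, float]]) -> float:
    """Time-consistent aggregation: accumulated reward divided by elapsed time.
    Each entry is a (duration, reward) pair, where 'duration' is the
    (agent-chosen) time spent on that cycle -- an illustrative convention."""
    total_time = sum(d for d, _ in history)
    total_reward = sum(r for _, r in history)
    return total_reward / total_time

# Two agents obtaining the same rewards, but the second takes twice as long per cycle.
fast = [(1.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
slow = [(2.0, 1.0), (2.0, 0.0), (2.0, 1.0)]
print(average_reward_per_cycle(fast), average_reward_per_cycle(slow))  # identical
print(average_reward_per_time(fast), average_reward_per_time(slow))    # slow agent scores half
```

Under the per-cycle view both agents look identical; only the time-based aggregation distinguishes agents that modulate how long they take to act.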
AI Generality and Spearman's Law of Diminishing Returns
[EN] Many areas of AI today use benchmarks and competitions with larger and wider sets of tasks. This tries to deter AI systems (and research effort) from specialising to a single task, and to encourage them to be prepared to solve previously unseen tasks. It is unclear, however, whether the methods with the best performance are actually those that are most general and, in perspective, whether the trend moves towards more general AI systems. This question has a striking similarity with the analysis of the so-called positive manifold and general factors in the area of human intelligence. In this paper, we first show how the existence of a manifold (positive average pairwise task correlation) can also be analysed in AI, and how this relates to the notion of agent generality, from the individual and the populational points of view. From the populational perspective, we analyse the following question: is this manifold correlation higher for the most or for the least able group of agents? We contrast this analysis with one of the most controversial issues in human intelligence research, the so-called Spearman's Law of Diminishing Returns (SLODR), which basically states that the relevance of a general factor diminishes for the most able human groups. We perform two empirical studies on these issues in AI. We analyse the results of the 2015 general video game AI (GVGAI) competition, with games as tasks and "controllers" as agents, and the results of a synthetic setting, with modified elementary cellular automata (ECA) rules as tasks and simple interactive programs as agents. In both cases, we see that SLODR does not appear. The data, and the use of just two scenarios, do not clearly support the reverse either, a Universal Law of Augmenting Returns (ULOAR), but call for more experiments on this question.

I thank the anonymous reviewers of ECAI'2016 for their comments on an early version of the experiments shown in Section 4. I'm really grateful to Philip J. Bontrager, Ahmed Khalifa, Diego Perez-Liebana and Julian Togelius for providing me with the GVGAI competition data that made Section 3 possible. David Stillwell and Aiden Loe suggested the use of person-fit as a measure of generality. The JAIR reviewers have provided very insightful and constructive comments, which have greatly helped to improve the final version of this paper. This work has been partially supported by the EU (FEDER) and Spanish MINECO grant TIN2015-69175-C4-1-R, and by Generalitat Valenciana PROMETEOII/2015/013 and PROMETEO/2019/098. I also thank the support from the Future of Life Institute through FLI grant RFP2-152. Part of this work has been done while visiting the Leverhulme Centre for the Future of Intelligence, generously funded by the Leverhulme Trust. I also thank the UPV for granting me a sabbatical leave and the funding from the Spanish MECD programme "Salvador de Madariaga" (PRX17/00467) and a BEST grant (BEST/2017/045) from the Generalitat Valenciana for another research stay also at the CFI.

Hernández-Orallo, J. (2019). AI Generality and Spearman's Law of Diminishing Returns. Journal of Artificial Intelligence Research. 64:529-562. https://doi.org/10.1613/jair.1.11388
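As a rough illustration of the populational analysis described above (using synthetic placeholder data and assumed function names, not the paper's data or code), one can compute the average pairwise task correlation from an agents-by-tasks result matrix and compare it for the least and most able halves of the agents:

```python
import numpy as np

def average_pairwise_task_correlation(results: np.ndarray) -> float:
    """Average Pearson correlation between all pairs of task columns
    (a 'positive manifold' indicator when this average is positive)."""
    corr = np.corrcoef(results, rowvar=False)      # tasks x tasks correlation matrix
    upper = corr[np.triu_indices_from(corr, k=1)]  # off-diagonal entries only
    return float(np.mean(upper))

def slodr_check(results: np.ndarray) -> tuple:
    """Compare the manifold correlation for the least and most able halves of
    agents, ranked by their mean score over all tasks."""
    ability = results.mean(axis=1)
    order = np.argsort(ability)
    half = len(order) // 2
    low, high = results[order[:half]], results[order[half:]]
    return (average_pairwise_task_correlation(low),
            average_pairwise_task_correlation(high))

rng = np.random.default_rng(0)
results = rng.random((100, 8))   # 100 agents, 8 tasks (synthetic placeholder data)
low_corr, high_corr = slodr_check(results)
# SLODR would predict high_corr < low_corr; the paper finds no such effect.
print(low_corr, high_corr)
```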
Unbridled Mental Power
Hernández-Orallo, J. (2019). Unbridled Mental Power. Nature Physics. 15(1). https://doi.org/10.1038/s41567-018-0388-1
Measuring (machine) intelligence universally: An interdisciplinary challenge
Artificial intelligence (AI) is having a deep impact on the way humans work, communicate and enjoy their leisure time. AI systems have traditionally been devised to solve specific tasks, such as playing chess, diagnosing a disease or driving a car. However, more and more AI systems are now being devised to be generally adaptable, and to learn to solve a variety of tasks or to assist humans and organisations in their everyday tasks. As a result, an increasing number of robots, bots, avatars and 'smart' devices are enhancing our capabilities as individuals, collectives and humanity as a whole. What are these systems capable of doing? What is their global intelligence? How can we tell whether they are meeting their specifications? Are the organisations that include AI systems becoming less predictable and more difficult to govern? The truth is that we lack proper measurement tools to evaluate the cognitive abilities and expected behaviour of this variety of systems, including hybrids (e.g., machine-enhanced humans) and collectives. Once the relevance and difficulty of AI evaluation are realised, we will survey what has been done in the past twenty years in this area, focussing on approaches based on algorithmic information theory and Kolmogorov complexity, and their relation to other disciplines concerned with intelligence evaluation in humans and animals, such as psychometrics and comparative cognition. This will lead us to the notion of a universal intelligence test and the new endeavour of universal psychometrics.
I.G
This report summarises the key ideas for a simple concept of I.G. and its
explanatory value for several areas.

Hernández-Orallo, J. (2018). I.G. http://hdl.handle.net/10251/10026
Threshold Choice Methods: the Missing Link
Many performance metrics have been introduced for the evaluation of
classification performance, with different origins and niches of application:
accuracy, macro-accuracy, area under the ROC curve, the ROC convex hull, the
absolute error, and the Brier score (with its decomposition into refinement and
calibration). One way of understanding the relation among some of these metrics
is the use of variable operating conditions (either in the form of
misclassification costs or class proportions). Thus, a metric may correspond to
some expected loss over a range of operating conditions. One dimension for the
analysis has been precisely the distribution we take for this range of
operating conditions, leading to some important connections in the area of
proper scoring rules. However, we show that there is another dimension which
has not received attention in the analysis of performance metrics. This new
dimension is given by the decision rule, which is typically implemented as a
threshold choice method when using scoring models. In this paper, we explore
many old and new threshold choice methods: fixed, score-uniform, score-driven,
rate-driven and optimal, among others. By calculating the loss of these methods
for a uniform range of operating conditions we get the 0-1 loss, the absolute
error, the Brier score (mean squared error), the AUC and the refinement loss,
respectively. This provides a comprehensive view of performance metrics as well
as a systematic approach to loss minimisation, namely: take a model, apply
several threshold choice methods consistent with the information which is (and
will be) available about the operating condition, and compare their expected
losses. In order to assist in this procedure we also derive several connections
between the aforementioned performance metrics, and we highlight the role of
calibration in choosing the threshold choice method.
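The procedure summarised above (take a model, apply a threshold choice method, and compare expected losses over a range of operating conditions) can be sketched numerically. The conventions below are assumptions made for this sketch and may differ from the paper's exact formulation: scores are estimated probabilities of the positive class, the operating condition c is the share of the total cost assigned to false positives, and the loss is normalised by a factor of 2. Under these conventions the expected loss of the score-driven method approximates the Brier score, while a fixed threshold of 0.5 generally gives a different (here larger) loss.

```python
import numpy as np

def expected_loss(y, s, threshold_fn, grid=1001):
    """Expected cost-weighted 0-1 loss over a uniform range of operating
    conditions c in [0, 1], for a given threshold choice method."""
    y, s = np.asarray(y), np.asarray(s)
    losses = []
    for c in np.linspace(0.0, 1.0, grid):
        t = threshold_fn(c)
        fn = np.mean((y == 1) & (s < t))   # positives predicted negative
        fp = np.mean((y == 0) & (s >= t))  # negatives predicted positive
        losses.append(2 * ((1 - c) * fn + c * fp))
    return float(np.mean(losses))

# Illustrative labels and scores for a small test set.
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
s = np.array([0.9, 0.7, 0.4, 0.3, 0.2, 0.6, 0.8, 0.1])

fixed_loss = expected_loss(y, s, lambda c: 0.5)        # 'fixed' threshold method
score_driven_loss = expected_loss(y, s, lambda c: c)   # 'score-driven' method
brier = float(np.mean((y - s) ** 2))
print(fixed_loss, score_driven_loss, brier)  # score_driven_loss ~= brier
```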
Exploring AI Safety in Degrees: Generality, Capability and Control
[EN] The landscape of AI safety is frequently explored in different ways: by contrasting specialised AI versus general AI (or AGI), by analysing the short-term hazards of systems with limited capabilities against the more long-term risks posed by 'superintelligence', and by conceptualising sophisticated ways of bounding the control an AI system has over its environment and itself (impact, harm to humans, self-harm, containment, etc.). In this position paper we reconsider these three aspects of AI safety as quantitative factors (generality, capability and control), suggesting that by defining metrics for these dimensions, AI risks can be characterised and analysed more precisely. As an example, we illustrate how to define these metrics and their values for some simple agents in a toy scenario within a reinforcement learning setting.

We thank the anonymous reviewers for their comments. This work was funded by the Future of Life Institute, FLI, under grant RFP2-152, and also supported by the EU (FEDER) and Spanish MINECO under RTI2018-094403-B-C32, and Generalitat Valenciana under PROMETEO/2019/098.

Burden, J.; Hernández-Orallo, J. (2020). Exploring AI Safety in Degrees: Generality, Capability and Control. ceur-ws.org. 36-40. http://hdl.handle.net/10251/177484
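One way to make the three factors concrete in a toy setting is sketched below. The specific definitions used here (capability as mean normalised reward, generality as the evenness of that reward across tasks, control as the fraction of environment states the agent can influence) are illustrative assumptions for this summary, not the metrics defined by Burden and Hernández-Orallo.

```python
import numpy as np

def capability(rewards: np.ndarray) -> float:
    """Mean normalised reward over a battery of tasks (rewards assumed in [0, 1])."""
    return float(np.mean(rewards))

def generality(rewards: np.ndarray) -> float:
    """Evenness of performance across tasks: near 1 means equally good everywhere,
    near 0 means the capability is concentrated in a few tasks."""
    if rewards.sum() == 0:
        return 0.0
    p = rewards / rewards.sum()
    logs = np.log(p, where=p > 0, out=np.zeros_like(p))
    entropy = -np.sum(p * logs)
    return float(entropy / np.log(len(rewards)))

def control(influenceable_states: int, total_states: int) -> float:
    """Fraction of environment states the agent can influence through its actions."""
    return influenceable_states / total_states

# Two hypothetical agents evaluated on four tasks in a toy environment of 100 states.
specialist = np.array([1.0, 0.0, 0.0, 0.0])
generalist = np.array([0.4, 0.5, 0.45, 0.5])
for name, r in [("specialist", specialist), ("generalist", generalist)]:
    print(name, capability(r), round(generality(r), 2), control(20, 100))
```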
Dual Indicators to Analyse AI Benchmarks: Difficulty, Discrimination, Ability and Generality
[EN] With the purpose of better analyzing the results of artificial intelligence (AI) benchmarks, we present two indicators on the side of the AI problems, difficulty and discrimination, and two indicators on the side of the AI systems, ability and generality. The first three are adapted from psychometric models in item response theory (IRT), whereas generality is defined as a new metric that evaluates whether an agent is consistently good at easy problems and bad at difficult ones. We illustrate how these key indicators give us more insight on the results of two popular benchmarks in AI, the Arcade Learning Environment (Atari 2600 games) and the General Video Game AI competition, and we include some guidelines to estimate and interpret these indicators for other AI benchmarks and competitions.

This work was supported by the U.S. Air Force Office of Scientific Research under Award FA9550-17-1-0287; in part by the EU (FEDER) and the Spanish MINECO under Grant TIN2015-69175-C4-1-R; and in part by the Generalitat Valenciana PROMETEOII/2015/013. The work of F. Martínez-Plumed was supported by INCIBE (Ayudas para la excelencia de los equipos de investigación avanzada en ciberseguridad), the European Commission, JRC's Centre for Advanced Studies, HUMAINT project (Expert Contract CT-EX2018D335821-101), and UPV PAID-06-18 Ref. SP20180210. The work of J. Hernández-Orallo was supported in part by the Salvador de Madariaga grant (PRX17/00467) from the Spanish MECD, in part by the BEST Grant (BEST/2017/045) from the GVA for research stays at the CFI, and in part by the FLI grant RFP2-152.

Martínez-Plumed, F.; Hernández-Orallo, J. (2020). Dual Indicators to Analyse AI Benchmarks: Difficulty, Discrimination, Ability and Generality. IEEE Transactions on Games. 12(2):121-131. https://doi.org/10.1109/TG.2018.2883773
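In IRT, difficulty, discrimination and ability are typically related through the two-parameter logistic (2PL) item characteristic curve; the sketch below uses that standard model together with an illustrative generality proxy (how consistently an agent passes easy items and fails hard ones), which is only a stand-in for the metric defined in the paper.

```python
import numpy as np

def irt_2pl(theta: float, difficulty: float, discrimination: float) -> float:
    """Two-parameter logistic IRT model: probability that an agent with
    ability theta succeeds on an item with the given difficulty and
    discrimination parameters."""
    return 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))

def generality(successes: np.ndarray, difficulties: np.ndarray) -> float:
    """Illustrative generality proxy (not the paper's exact definition):
    correlation between item easiness and the agent's success pattern."""
    easiness = -difficulties
    return float(np.corrcoef(easiness, successes.astype(float))[0, 1])

difficulties = np.array([-1.0, 0.0, 1.0, 2.0])            # from easy to hard
print(irt_2pl(theta=0.5, difficulty=0.0, discrimination=1.5))
print(generality(np.array([1, 1, 0, 0]), difficulties))   # consistent agent: high value
print(generality(np.array([0, 1, 0, 1]), difficulties))   # erratic agent: low or negative
```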