LIPIcs, Volume 251, ITCS 2023, Complete Volume
Reinforcement Learning and Bandits for Speech and Language Processing: Tutorial, Review and Outlook
In recent years, reinforcement learning and bandits have transformed a wide
range of real-world applications, including healthcare, finance, recommendation
systems, robotics, and, last but not least, speech and natural language
processing. While most speech and language applications of reinforcement
learning algorithms are centered around improving the training of deep neural
networks with their flexible optimization properties, there is still much
ground to explore in utilizing the benefits of reinforcement learning, such as
its reward-driven adaptability, state representations, temporal structures and
generalizability. In this survey, we present an overview of recent advancements
in reinforcement learning and bandits, and discuss how they can be effectively
employed to solve speech and natural language processing problems with models
that are adaptive, interactive and scalable.
Comment: To appear in Expert Systems with Applications. Accompanying
INTERSPEECH 2022 Tutorial on the same topic. Including latest advancements in
large language models (LLMs).
Multi-objective bandit algorithms with Chebyshev scalarization.
In this paper we analyze several alternatives for Chebyshev scalarization in multi-objective bandit problems. The alternatives are evaluated on a reference bi-objective benchmark problem of Pareto frontier approximation. Performance is analyzed according to three measures: probability of selecting an optimal action, regret, and unfairness. The paper presents a new algorithm that improves the speed of convergence over previous proposals by at least one order of magnitude.
Funded by the Research Plan (Plan Propio de Investigación) of the Universidad de Málaga - Campus de Excelencia Internacional Andalucía Tech. L. Mandow supported by project IRIS PID2021-122812OB-I00 (co-financed by FEDER funds). This research is partially supported by the Spanish Ministry of Science and Innovation, the European Regional Development Fund (FEDER), Junta de Andalucía (JA), and Universidad de Málaga (UMA) through the research projects with reference PID2021-122381OB-I00 and UMA20-FEDERJA-065. S. Martín-Albo supported by an undergraduate and master's research initiation grant (Beca de Iniciación a la Investigación), I Plan Propio de Investigación y Transferencia de la Universidad de Málaga, Spain.
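To make the idea concrete, here is a minimal sketch of Chebyshev scalarization for arm selection in a bi-objective bandit. The arm statistics, weights, and utopia point are hypothetical illustrations, not the paper's algorithm or benchmark: the scalar score of each arm is the (negated) largest weighted distance of its mean reward vector from a utopia point, and the arm maximizing that score is picked.

```python
import numpy as np

def chebyshev_scalarize(means, weights, utopia):
    """Chebyshev scalarization: score a vector reward by the negated
    weighted max distance to a utopia point (higher score is better)."""
    return -np.max(weights * (utopia - means), axis=-1)

# Toy bi-objective bandit: 3 arms with empirical mean reward vectors.
arm_means = np.array([[0.8, 0.2],
                      [0.5, 0.5],
                      [0.1, 0.9]])
weights = np.array([0.5, 0.5])           # objective weights
utopia = arm_means.max(axis=0) + 1e-3    # slightly above best observed values

scores = chebyshev_scalarize(arm_means, weights, utopia)
best_arm = int(np.argmax(scores))        # the balanced arm [0.5, 0.5] wins here
```

With equal weights, the scalarization favours the compromise arm over the two extreme arms, which is exactly the behaviour that makes it useful for approximating the Pareto front.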
Infinite Action Contextual Bandits with Reusable Data Exhaust
For infinite action contextual bandits, smoothed regret and reduction to
regression results in state-of-the-art online performance with computational
cost independent of the action set: unfortunately, the resulting data exhaust
does not have well-defined importance weights. This frustrates the execution of
downstream data science processes such as offline model selection. In this
paper we describe an online algorithm with an equivalent smoothed regret
guarantee, but which generates well-defined importance weights: in exchange,
the online computational cost increases, but only to order smoothness (i.e.,
still independent of the action set). This removes a key obstacle to adoption
of smoothed regret in production scenarios.
Comment: Final version after responding to reviewers.
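The "data exhaust" the abstract refers to is logged bandit data, and the downstream processes it mentions (such as offline model selection) typically rely on importance-weighted estimators. The following is a generic inverse-propensity-scoring sketch of why well-defined importance weights matter; the log format, policies, and data are hypothetical, not this paper's algorithm:

```python
def ips_estimate(logs, target_policy):
    """Inverse-propensity-scoring value estimate from logged bandit data.
    Each log entry: (context, action, reward, logging_propensity)."""
    total = 0.0
    for x, a, r, p_log in logs:
        w = target_policy(x, a) / p_log   # importance weight
        total += w * r
    return total / len(logs)

# Hypothetical exhaust from a logging policy with known propensities.
logs = [("u1", 0, 1.0, 0.5),
        ("u2", 1, 0.0, 0.5),
        ("u3", 0, 1.0, 0.25)]
uniform = lambda x, a: 0.5               # candidate policy to evaluate offline

value = ips_estimate(logs, uniform)
```

When the logging algorithm cannot supply the propensities `p_log` at all, as with the prior smoothed-regret reductions the abstract describes, this kind of offline evaluation simply cannot be run, which is the obstacle the paper removes.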
On Reward Structures of Markov Decision Processes
A Markov decision process can be parameterized by a transition kernel and a
reward function. Both play essential roles in the study of reinforcement
learning as evidenced by their presence in the Bellman equations. In our
inquiry of various kinds of "costs" associated with reinforcement learning
inspired by the demands in robotic applications, rewards are central to
understanding the structure of a Markov decision process and reward-centric
notions can elucidate important concepts in reinforcement learning.
Specifically, we study the sample complexity of policy evaluation and develop
a novel estimator with an instance-specific error bound for estimating a
single state value. Under
the online regret minimization setting, we refine the transition-based MDP
constant, diameter, into a reward-based constant, maximum expected hitting
cost, and with it, provide a theoretical explanation for how a well-known
technique, potential-based reward shaping, could accelerate learning with
expert knowledge. In an attempt to study safe reinforcement learning, we model
hazardous environments with irrecoverability and propose a quantitative notion
of safe learning via reset efficiency. In this setting, we modify a classic
algorithm to account for resets, achieving promising preliminary numerical
results. Lastly, for MDPs with multiple reward functions, we develop a planning
algorithm that efficiently computes Pareto-optimal stochastic policies.
Comment: This PhD thesis draws heavily from arXiv:1907.02114 and
arXiv:2002.06299; minor edits.
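Potential-based reward shaping, the well-known technique the abstract says this work explains theoretically, follows a standard rule (Ng, Harada & Russell, 1999): add the discounted change in a potential function to each reward, which provably preserves the optimal policy. A minimal sketch, with a hypothetical chain environment and potential function chosen purely for illustration:

```python
def shaped_reward(r, s, s_next, potential, gamma=0.99):
    """Potential-based shaping: r' = r + gamma * Phi(s_next) - Phi(s).
    This transformation preserves the set of optimal policies."""
    return r + gamma * potential(s_next) - potential(s)

# Hypothetical 1-D chain where the goal is state 10; a natural potential
# encoding expert knowledge is the negative distance to the goal.
phi = lambda s: -abs(10 - s)

# Moving toward the goal earns a positive shaping bonus even when the
# environment reward is zero, which is how shaping accelerates learning.
bonus = shaped_reward(0.0, s=3, s_next=4, potential=phi)
```

Here `bonus` is 0.99 * (-6) - (-7) = 1.06 > 0: the agent gets dense feedback from the expert's potential long before it ever reaches the sparse goal reward.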
Deep Reinforcement Learning Approaches for Technology Enhanced Learning
Artificial Intelligence (AI) has advanced significantly in recent years, transforming various industries and domains. Its ability to extract patterns and insights from large volumes of data has revolutionised areas such as image recognition, natural language processing, and autonomous systems. As AI systems become increasingly integrated into daily human life, there is a growing need for meaningful collaboration and mutual engagement between humans and AI, known as Human-AI Collaboration. This collaboration involves combining AI with human workflows to achieve shared objectives.
In the current educational landscape, the integration of AI methods in Technology Enhanced Learning (TEL) has become crucial for providing high-quality education and facilitating lifelong learning. Human-AI Collaboration also plays a vital role in TEL, particularly in Intelligent Tutoring Systems (ITS). The COVID-19 pandemic has further emphasised the need for effective educational technologies to support remote learning and bridge the gap between traditional classrooms and online platforms. To maximise the performance of ITS while minimising the input and interaction required from students, it is essential to design collaborative systems that effectively leverage the capabilities of AI and foster effective collaboration between students and ITS.
However, there are several challenges that need to be addressed in this context. One challenge is the lack of clear guidance on designing and building user-friendly systems that facilitate collaboration between humans and AI. This challenge is relevant not only to education researchers but also to Human-Computer Interaction (HCI) researchers and developers. Another challenge is the scarcity of interaction data in the early stages of ITS development, which hampers the accurate modelling of students' knowledge states and learning trajectories, known as the cold start problem. Moreover, the effectiveness of ITS in delivering personalised instruction is hindered by the limitations of existing Knowledge Tracing (KT) models, which often struggle to provide accurate predictions. Therefore, addressing these challenges is crucial for enhancing the collaborative process between humans and AI in the development of ITS.
This thesis aims to address these challenges and improve the collaborative process between students and ITS in TEL. It proposes innovative approaches to generate simulated student behavioural data and enhance the performance of KT models. The thesis starts with a comprehensive survey of human-AI collaborative systems, identifying key challenges and opportunities. It then presents a structured framework for the student-ITS collaborative process, providing insights into designing user-friendly and efficient systems.
To overcome the challenge of data scarcity in ITS development, the thesis proposes two student modelling approaches: Sim-GAIL and SimStu. SimStu leverages a deep learning method, the Decision Transformer, to simulate student interactions and enhance ITS training. Sim-GAIL utilises a reinforcement learning method, Generative Adversarial Imitation Learning (GAIL), to generate high-fidelity and diverse simulated student behavioural data, addressing the cold start problem in ITS training.
Furthermore, the thesis focuses on improving the performance of KT models. It introduces the MLFBKT model, which integrates multiple features and mines latent relations in student interaction data, aiming to improve the accuracy and efficiency of KT models. Additionally, the thesis proposes the LBKT model, which combines the strengths of the BERT model and LSTM to process long sequence data in KT models effectively.
Overall, this thesis contributes to the field of Human-AI Collaboration in TEL by addressing key challenges and proposing innovative approaches to enhance ITS training and KT model performance. The findings have the potential to improve the learning experiences and outcomes of students in educational settings.
Exact Pareto Optimal Search for Multi-Task Learning and Multi-Criteria Decision-Making
Given multiple non-convex objective functions and objective-specific weights,
Chebyshev scalarization (CS) is a well-known approach to obtain an Exact Pareto
Optimal (EPO), i.e., a solution on the Pareto front (PF) that intersects the
ray defined by the inverse of the weights. First-order optimizers that use the
CS formulation to find EPO solutions encounter practical problems of
oscillations and stagnation that affect convergence. Moreover, when initialized
with a PO solution, they do not guarantee a controlled trajectory that lies
completely on the PF. These shortcomings lead to modeling limitations and
computational inefficiency in multi-task learning (MTL) and multi-criteria
decision-making (MCDM) methods that utilize CS for their underlying non-convex
multi-objective optimization (MOO). To address these shortcomings, we design a
new MOO method, EPO Search. We prove that EPO Search converges to an EPO
solution and empirically illustrate its computational efficiency and robustness
to initialization. When initialized on the PF, EPO Search can trace the PF and
converge to the required EPO solution at a linear rate of convergence. Using
EPO Search we develop new algorithms: PESA-EPO for approximating the PF in a
posteriori MCDM, and GP-EPO for preference elicitation in interactive MCDM;
experiments on benchmark datasets confirm their advantages over competing
alternatives. EPO Search scales linearly with the number of decision variables
which enables its use for training deep networks. Empirical results on real
data from personalized medicine, e-commerce and hydrometeorology demonstrate
the efficacy of EPO Search for deep MTL.
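For reference, the Chebyshev scalarization and the exactness condition the abstract relies on are commonly written as follows; the notation ($m$ objectives $f_i$, weights $w_i$) is a standard formulation and may differ in detail from the paper's:

```latex
% Chebyshev scalarization of m non-convex objectives f_i with weights w_i
% (minimization form):
\min_{x} \; \max_{1 \le i \le m} \; w_i \, f_i(x)

% Exact Pareto Optimality: the solution x^* lies on the Pareto front at its
% intersection with the ray defined by the inverse weights, i.e. all
% weighted objectives are equal at x^*:
w_1 f_1(x^*) \;=\; w_2 f_2(x^*) \;=\; \cdots \;=\; w_m f_m(x^*)
```

The oscillation and stagnation problems mentioned above arise because first-order methods on this max-of-functions objective repeatedly switch which objective attains the maximum; EPO Search is designed to control that trajectory.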
Operational Research: methods and applications
This is the final version. Available on open access from Taylor & Francis via the DOI in this record.
Throughout its history, Operational Research has evolved to include methods, models and algorithms that have been applied to a wide range of contexts. This encyclopedic article consists of two main sections: methods and applications. The first summarises the up-to-date knowledge and provides an overview of the state-of-the-art methods and key developments in the various subdomains of the field. The second offers a wide-ranging list of areas where Operational Research has been applied. The article is meant to be read in a nonlinear fashion and used as a point of reference by a diverse pool of readers: academics, researchers, students, and practitioners. The entries within the methods and applications sections are presented in alphabetical order. The authors dedicate this paper to the 2023 Turkey/Syria earthquake victims. We sincerely hope that advances in OR will play a role towards minimising the pain and suffering caused by this and future catastrophes.