1,010 research outputs found

    Master-slave Deep Architecture for Top-K Multi-armed Bandits with Non-linear Bandit Feedback and Diversity Constraints

    We propose a novel master-slave architecture to solve the top-K combinatorial multi-armed bandits problem with non-linear bandit feedback and diversity constraints, which, to the best of our knowledge, is the first combinatorial bandits setting to consider diversity constraints under bandit feedback. Specifically, to efficiently explore the combinatorial and constrained action space, we introduce six slave models with distinct merits to generate diversified samples that balance rewards, constraints, and efficiency. Moreover, we propose teacher-learning-based optimization and a policy co-training technique to boost the performance of the multiple slave models. The master model then collects the elite samples provided by the slave models and selects the best sample, as estimated by a neural contextual UCB-based network, to make a decision that trades off exploration and exploitation. Thanks to the elaborate design of the slave models, the co-training mechanism among them, and the novel interactions between the master and slave models, our approach significantly surpasses existing state-of-the-art algorithms on both synthetic and real datasets for recommendation tasks. The code is available at: \url{https://github.com/huanghanchi/Master-slave-Algorithm-for-Top-K-Bandits}.
    Comment: IEEE Transactions on Neural Networks and Learning Systems
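
    As a rough illustration of the master's selection step described above, the sketch below scores elite candidate samples with a contextual UCB rule and picks the best one. It uses a linear UCB stand-in for the paper's neural contextual UCB network; the class and variable names, the feature dimension, and the six-candidate setup are illustrative assumptions, not taken from the authors' repository.

```python
# Simplified stand-in for the master's contextual UCB selection over elite samples.
import numpy as np

class LinearContextualUCB:
    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha                  # exploration weight
        self.A = np.eye(dim)                # regularized design matrix
        self.b = np.zeros(dim)              # accumulated reward-weighted contexts

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b              # ridge-regression reward estimate
        mean = theta @ x
        bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # exploration bonus
        return mean + bonus

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

# Usage: each slave model proposes one elite top-K sample, featurized as a context
# vector; the master picks the highest-UCB candidate and learns from the observed
# (non-linear) bandit feedback. Feature dimension and reward value are assumed.
master = LinearContextualUCB(dim=8)
elite_samples = [np.random.rand(8) for _ in range(6)]   # one candidate per slave model
best = max(elite_samples, key=master.score)
observed_reward = 0.7                                    # feedback from the environment
master.update(best, observed_reward)
```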

    META-LEARNING NEURAL MACHINE TRANSLATION CURRICULA

    Curriculum learning hypothesizes that presenting training samples in a meaningful order to machine learners during training helps improve model quality and convergence rate. In this dissertation, we explore this framework for learning in the context of Neural Machine Translation (NMT). NMT systems are typically trained on a large amount of heterogeneous data and have the potential to benefit greatly from curriculum learning in terms of both speed and quality. We concern ourselves with three primary questions in our investigation: (i) how do we design a task- and/or dataset-specific curriculum for NMT training? (ii) can we leverage human intuition about learning in this design, or can we learn the curriculum itself? (iii) how do we featurize training samples (e.g., easy versus hard) so that they can be effectively slotted into a curriculum? We begin by empirically exploring various hand-designed curricula and their effect on translation performance and training speed of NMT systems. We show that these curricula, most of which are based on human intuition, can improve NMT training speed but are highly sensitive to hyperparameter settings. Next, instead of using a hand-designed curriculum, we meta-learn a curriculum for the task of learning from noisy translation samples using reinforcement learning. We demonstrate that this learned curriculum significantly outperforms a random-curriculum baseline and matches the strongest hand-designed curriculum. We then extend this approach to the task of multilingual NMT with an emphasis on accumulating knowledge and learning from multiple training runs. Again, we show that this technique can match the strongest baseline obtained via expensive fine-grained grid search for the (learned) hyperparameters. We conclude with an extension that requires no prior knowledge of sample relevance to the task and uses sample features instead, hence learning both the relevance of each training sample to the task and the appropriate curriculum jointly. We show that this technique outperforms state-of-the-art results on a noisy filtering task.
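
    To make the learned-curriculum idea concrete, the sketch below shows a generic bandit-style scheduler (EXP3) that chooses which data bin to draw the next training batch from and is rewarded by dev-loss improvement. This is an assumed, simplified formulation for illustration only; the dissertation's actual meta-learning setup, sample features, and reward signal may differ, and the bin names and reward scaling here are hypothetical.

```python
# Generic EXP3-style curriculum scheduler: pick a data bin, reward it by dev-loss improvement.
import math, random

class Exp3Curriculum:
    def __init__(self, n_bins, gamma=0.1):
        self.gamma = gamma
        self.weights = [1.0] * n_bins

    def _probs(self):
        total = sum(self.weights)
        k = len(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / k for w in self.weights]

    def choose_bin(self):
        p = self._probs()
        return random.choices(range(len(p)), weights=p)[0], p

    def update(self, bin_idx, reward, probs):
        # Importance-weighted update; reward is assumed scaled to [0, 1].
        x_hat = reward / probs[bin_idx]
        self.weights[bin_idx] *= math.exp(self.gamma * x_hat / len(self.weights))

# Usage sketch: bins might be "clean", "mildly noisy", "noisy" parallel data.
curriculum = Exp3Curriculum(n_bins=3)
prev_dev_loss = 2.0
for step in range(100):
    bin_idx, probs = curriculum.choose_bin()
    # An NMT update on a batch drawn from bin_idx would run here (omitted).
    new_dev_loss = prev_dev_loss - random.uniform(0.0, 0.02)   # stand-in for evaluation
    reward = max(0.0, min(1.0, prev_dev_loss - new_dev_loss))  # clipped improvement
    curriculum.update(bin_idx, reward, probs)
    prev_dev_loss = new_dev_loss
```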

    Note, The Death Penalty in Late Imperial, Modern, and Post-Tiananmen China

    This paper seeks to explore the crucial determinants that shape the Chinese legal system's use of the death penalty. Why have the Chinese relied so heavily on execution as a form of sentencing? What factors and conditions account for the major changes in the frequency of China's use of the death penalty? What indigenous traditions are reflected in China's implementation of the death penalty? In order to inquire into the role and function of the legal system in affecting the severity of criminal punishment in China, this study will focus only on those death sentences carried out by the state in a judicial context.

    Quantum-enhanced reinforcement learning

    Master's dissertation in Physics Engineering (Engenharia Física). The field of Artificial Intelligence has lately witnessed extraordinary results. The ability to design a system capable of beating the world champion of Go, an ancient Chinese game whose mastery was seen as the holy grail of AI, caused a spark worldwide, making people believe that something revolutionary is about to happen. A different flavor of learning, called Reinforcement Learning, is at the core of this revolution. In parallel, we are witnessing the emergence of a new field, that of Quantum Machine Learning, which has already shown promising results in supervised/unsupervised learning. In this dissertation, we explore the interplay between Quantum Computing and Reinforcement Learning. This learning by interaction was made possible in the quantum setting using the concept of oraculization of task environments suggested by Dunjko in 2015. We extend the oracular instances previously suggested so that they work in more general stochastic environments. On top of this quantum agent-environment paradigm, we develop a novel quantum algorithm for near-optimal decision-making based on the Reinforcement Learning paradigm known as Sparse Sampling, obtaining a quantum speedup compared to the classical counterpart. The result is a quantum algorithm whose complexity is independent of the number of states of the environment. This independence guarantees its suitability for large state spaces where planning may otherwise be intractable. The most important open questions are whether it is possible to improve the oracular instances of task environments to deal with even more general environments, especially the ability to represent negative rewards as a natural mechanism for negative feedback instead of some normalization of the reward, and whether the algorithm can be extended to perform an informed tree-based search instead of the uninformed search proposed. Improvements on this result would allow a comparison between the algorithm and more recent classical Reinforcement Learning algorithms.
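
    For context on the classical counterpart mentioned above, the sketch below implements the standard Sparse Sampling lookahead, whose per-decision cost grows with the sampling width and horizon rather than with the number of states. The toy generative model and parameter values are assumptions for illustration only; the dissertation's quantum algorithm is not reproduced here.

```python
# Classical Sparse Sampling: estimate Q-values by recursively sampling C next states
# per action from a generative model, so planning cost is independent of |S|.
import random

def sparse_sampling_q(state, action, depth, C, actions, generative_model, gamma=0.95):
    """Estimate Q(state, action) by averaging over C sampled transitions."""
    if depth == 0:
        return 0.0
    total = 0.0
    for _ in range(C):
        next_state, reward = generative_model(state, action)
        best_next = max(
            sparse_sampling_q(next_state, a, depth - 1, C, actions, generative_model, gamma)
            for a in actions
        )
        total += reward + gamma * best_next
    return total / C

def plan(state, depth, C, actions, generative_model):
    """Return a near-optimal action at `state` under the sampled lookahead."""
    return max(actions,
               key=lambda a: sparse_sampling_q(state, a, depth, C, actions, generative_model))

# Toy stochastic environment (assumed): integer states, actions +1/-1 move the state
# with noise, and reward is given only at state 0.
def toy_generative_model(state, action):
    next_state = state + action + random.choice([-1, 0, 1])
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, reward

print(plan(state=3, depth=3, C=2, actions=[-1, 1], generative_model=toy_generative_model))
```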