
    Reinforcement Learning for Bandit Neural Machine Translation with Simulated Human Feedback

    Machine translation is a natural candidate problem for reinforcement learning from human feedback: users provide quick, dirty ratings on candidate translations to guide a system to improve. Yet, current neural machine translation training focuses on expensive human-generated reference translations. We describe a reinforcement learning algorithm that improves neural machine translation systems from simulated human feedback. Our algorithm combines the advantage actor-critic algorithm (Mnih et al., 2016) with the attention-based neural encoder-decoder architecture (Luong et al., 2015). This algorithm (a) is well-designed for problems with a large action space and delayed rewards, (b) effectively optimizes traditional corpus-level machine translation metrics, and (c) is robust to skewed, high-variance, granular feedback modeled after actual human behaviors. Comment: 11 pages, 5 figures, In Proceedings of Empirical Methods in Natural Language Processing (EMNLP) 201
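    The abstract combines an advantage actor-critic learner with bandit-style feedback, meaning one rating per output and no reference translation. The sketch below is a heavily simplified, hypothetical illustration and not the authors' implementation. Instead of an attention-based encoder-decoder generating translations token by token, the "actor" is a softmax over a fixed set of candidate translations with made-up quality scores, and the "critic" is a single learned baseline. The function simulated_rating and its parameters (skew, noise_std, levels) are our own assumptions about how skewed, high-variance, granular feedback might be simulated; the paper's exact perturbation model may differ.

```python
import math
import random


def simulated_rating(score, skew=1.0, noise_std=0.0, levels=None, rng=random):
    """Hypothetical perturbation of a [0, 1] quality score into a 'human-like' rating.

    skew:      exponent applied to the score (values > 1 make ratings harsher)
    noise_std: std. dev. of additive Gaussian noise (rating variance)
    levels:    if given, round to this many evenly spaced values (granularity)
    """
    r = score ** skew
    r += rng.gauss(0.0, noise_std)
    r = min(1.0, max(0.0, r))  # clip back into [0, 1]
    if levels is not None and levels > 1:
        r = round(r * (levels - 1)) / (levels - 1)
    return r


def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_actor_critic(true_quality, steps=5000, actor_lr=0.1, critic_lr=0.05,
                       seed=0, **feedback_kwargs):
    """Single-state advantage actor-critic over a fixed set of candidate outputs."""
    rng = random.Random(seed)
    prefs = [0.0] * len(true_quality)  # actor parameters (softmax logits)
    value = 0.0                        # critic: running estimate of the expected rating
    for _ in range(steps):
        probs = softmax(prefs)
        a = rng.choices(range(len(prefs)), weights=probs)[0]  # sample one candidate
        reward = simulated_rating(true_quality[a], rng=rng, **feedback_kwargs)
        advantage = reward - value     # how much better than expected this rating was
        value += critic_lr * advantage
        # Gradient of log softmax(a) w.r.t. prefs[i] is 1[i == a] - probs[i].
        for i in range(len(prefs)):
            indicator = 1.0 if i == a else 0.0
            prefs[i] += actor_lr * advantage * (indicator - probs[i])
    return softmax(prefs), value
```

    For example, train_actor_critic([0.2, 0.5, 0.8], skew=2.0, noise_std=0.1, levels=5) should concentrate most of the probability on the third candidate despite coarse, noisy ratings. Subtracting the critic's estimate from each rating (the advantage) centers the learning signal, which reduces the variance of the policy-gradient update.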

    Using feedback in adaptive and user-dependent one-step decision making

    Several machine learning approaches are used to train systems and agents by exploiting users' feedback on the service provided. For example, various semi-supervised approaches use this kind of information during learning to guide the agent toward more adaptive and possibly personalized behavior. Whether for recommendation systems, companion robots, or smart home assistance, the trained agent must adapt to different users (with different profiles, preferences, etc.), cope with dynamic environments (e.g., changing preferences), and scale up with a minimal number of training examples. In this paper, we are interested in one-step decision making for adaptive and user-dependent services using users' feedback. We focus on the quality of such services while dealing with ambiguities (noise) in the received feedback. We describe our problem and concentrate on presenting a state of the art of the methods that could be applied. We detail two algorithms based on existing approaches and present comparative results, including scaling and convergence analyses, on clean and noisy simulated data.
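    The abstract does not name its two algorithms. As a hypothetical illustration only, one-step decision making from user feedback can be framed as a multi-armed bandit maintained separately for each user. The sketch below uses epsilon-greedy exploration and simulates ambiguous feedback by flipping a binary like/dislike signal with a fixed probability. The class PerUserEpsilonGreedy, the function noisy_feedback, and the flip-noise model are assumptions, not the paper's methods.

```python
import random
from collections import defaultdict


class PerUserEpsilonGreedy:
    """Hypothetical per-user epsilon-greedy agent for one-step decisions."""

    def __init__(self, n_actions, epsilon=0.1, seed=0):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # Separate statistics per user, so each user gets a personalized policy.
        self.counts = defaultdict(lambda: [0] * n_actions)
        self.means = defaultdict(lambda: [0.0] * n_actions)

    def best(self, user):
        """Greedy action: highest average feedback so far (ties go to the lowest index)."""
        means = self.means[user]
        return max(range(self.n_actions), key=lambda a: means[a])

    def act(self, user):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)  # explore a random action
        return self.best(user)                         # exploit current estimates

    def update(self, user, action, feedback):
        # Incremental sample mean: averaging repeated feedback dampens label noise.
        self.counts[user][action] += 1
        n = self.counts[user][action]
        self.means[user][action] += (feedback - self.means[user][action]) / n


def noisy_feedback(liked, flip_prob, rng):
    """Binary user feedback that is flipped with probability flip_prob (simulated noise)."""
    signal = 1.0 if liked else 0.0
    return 1.0 - signal if rng.random() < flip_prob else signal
```

    In a simulation where each hypothetical user prefers one action and 20% of feedback is flipped, the per-user averages should still separate preferred from non-preferred actions after enough interactions, because flipping with probability below 0.5 preserves the ordering of expected feedback. Keeping statistics per user is the simplest way to obtain user-dependent behavior, at the cost of learning each user from scratch.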