502 research outputs found

    Evolutionary Reinforcement Learning: A Survey

    Full text link
    Reinforcement learning (RL) is a machine learning approach that trains agents to maximize cumulative rewards through interactions with environments. The integration of RL with deep learning has recently resulted in impressive achievements in a wide range of challenging tasks, including board games, arcade games, and robot control. Despite these successes, there remain several crucial challenges, including brittle convergence properties caused by sensitive hyperparameters, difficulties in temporal credit assignment with long time horizons and sparse rewards, a lack of diverse exploration, especially in continuous search space scenarios, difficulties in credit assignment in multi-agent reinforcement learning, and conflicting objectives for rewards. Evolutionary computation (EC), which maintains a population of learning agents, has demonstrated promising performance in addressing these limitations. This article presents a comprehensive survey of state-of-the-art methods for integrating EC into RL, referred to as evolutionary reinforcement learning (EvoRL). We categorize EvoRL methods according to key research fields in RL, including hyperparameter optimization, policy search, exploration, reward shaping, meta-RL, and multi-objective RL. We then discuss future research directions in terms of efficient methods, benchmarks, and scalable platforms. This survey serves as a resource for researchers and practitioners interested in the field of EvoRL, highlighting the important challenges and opportunities for future research. With the help of this survey, researchers and practitioners can develop more efficient methods and tailored benchmarks for EvoRL, further advancing this promising cross-disciplinary research field.
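As an illustration of the population-based search that EvoRL methods build on, the following is a minimal evolution-strategies loop for a linear policy. The environment, policy shape, and all hyperparameters are illustrative assumptions and are not taken from the survey.

```python
# A minimal evolution-strategies (ES) policy search loop in the spirit of the
# population-based methods surveyed above. Environment and hyperparameters are
# illustrative assumptions, not the survey's configuration.
import numpy as np
import gymnasium as gym

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

def episode_return(theta: np.ndarray) -> float:
    """Run one episode with a linear policy parameterized by theta."""
    w = theta.reshape(obs_dim, n_actions)
    obs, _ = env.reset(seed=0)
    total, done = 0.0, False
    while not done:
        action = int(np.argmax(obs @ w))          # greedy linear policy
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward
        done = terminated or truncated
    return total

theta = np.zeros(obs_dim * n_actions)             # mean of the search distribution
sigma, lr, pop_size = 0.1, 0.05, 32
rng = np.random.default_rng(0)

for gen in range(50):
    noise = rng.standard_normal((pop_size, theta.size))
    fitness = np.array([episode_return(theta + sigma * eps) for eps in noise])
    ranks = (fitness - fitness.mean()) / (fitness.std() + 1e-8)   # normalize fitness
    theta += lr / (pop_size * sigma) * noise.T @ ranks            # ES gradient estimate
    print(f"generation {gen:2d}  mean return {fitness.mean():.1f}")
```

Because only episode returns are needed, the same loop applies unchanged to sparse-reward tasks, which is one reason EC-based exploration is attractive for the challenges listed in the abstract.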

    Exploratory Hybrid Search in Hierarchical Reinforcement Learning

    Get PDF
    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, College of Engineering, August 2020. Advisor: Byung-Ro Moon.
    Balancing exploitation and exploration is a great challenge in many optimization problems. Evolutionary algorithms, such as evolution strategies and genetic algorithms, are algorithms inspired by biological evolution. They have been used for various optimization problems, such as combinatorial optimization and continuous optimization. However, evolutionary algorithms lack fine-tuning near local optima; in other words, they lack exploitation power. This drawback can be overcome by hybridization. Hybrid genetic algorithms, or memetic algorithms, are successful examples of hybridization. Although the solution space is exponentially vast in some optimization problems, these algorithms successfully find satisfactory solutions. In the deep learning era, the problem of exploitation and exploration has been relatively neglected. In deep reinforcement learning problems, however, balancing exploitation and exploration is more crucial than in problems with supervision. Many environments in the real world have an exponentially wide state space that must be explored by agents. Without sufficient exploration power, agents reveal only a small portion of the state space and end up seeking only instant rewards. In this thesis, a hybridization method is proposed that combines gradient-based policy optimization, which has strong exploitation power, with evolutionary policy optimization, which has strong exploration power. First, gradient-based policy optimization and evolutionary policy optimization are analyzed in various environments. The results demonstrate that evolutionary policy optimization is robust to sparse rewards but weak for instant rewards, whereas gradient-based policy optimization is effective for instant rewards but weak for sparse rewards. This difference between the two optimizations reveals the potential of hybridization in policy optimization. Then, a hybrid search is suggested in the framework of hierarchical reinforcement learning. The results demonstrate that the hybrid search finds an effective agent for complex environments with sparse rewards thanks to its balanced exploitation and exploration.
    Contents:
    I. Introduction
    II. Background: 2.1 Evolutionary Computations (2.1.1 Hybrid Genetic Algorithm, 2.1.2 Evolutionary Strategy); 2.2 Hybrid Genetic Algorithm Example: Brick Layout Problem (2.2.1 Problem Statement, 2.2.2 Hybrid Genetic Algorithm, 2.2.3 Experimental Results, 2.2.4 Discussion); 2.3 Reinforcement Learning (2.3.1 Policy Optimization, 2.3.2 Proximal Policy Optimization); 2.4 Neuroevolution for Reinforcement Learning; 2.5 Hierarchical Reinforcement Learning (2.5.1 Option-based HRL, 2.5.2 Goal-based HRL, 2.5.3 Exploitation versus Exploration)
    III. Understanding Features of Evolutionary Policy Optimizations: 3.1 Experimental Setup; 3.2 Feature Analysis (3.2.1 Convolution Filter Inspection, 3.2.2 Saliency Map); 3.3 Discussion (3.3.1 Behavioral Characteristics, 3.3.2 ES Agent without Inputs)
    IV. Hybrid Search for Hierarchical Reinforcement Learning: 4.1 Method; 4.2 Experimental Setup (4.2.1 Environment, 4.2.2 Network Architectures, 4.2.3 Training); 4.3 Results (4.3.1 Comparison, 4.3.2 Experimental Results, 4.3.3 Behavior of Low-Level Policy); 4.4 Conclusion
    V. Conclusion: 5.1 Summary; 5.2 Future Work
    Bibliography
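The hybridization described in this abstract, evolutionary search for exploration combined with gradient-based refinement for exploitation, can be illustrated with a toy memetic loop. The objective function and all hyperparameters below are hypothetical stand-ins; the thesis applies the idea to policy optimization in a hierarchical RL setting, not to this function.

```python
# A minimal, hypothetical sketch of the hybridization idea described above:
# an evolutionary outer loop supplies exploration, while gradient ascent on each
# individual supplies local exploitation (a memetic-algorithm pattern).
import numpy as np

rng = np.random.default_rng(0)

def fitness(x: np.ndarray) -> float:
    """Multimodal toy objective: many local optima, global optimum near the origin."""
    return float(-np.sum(x**2) + 2.0 * np.sum(np.cos(3.0 * x)))

def numerical_grad(x: np.ndarray, h: float = 1e-4) -> np.ndarray:
    """Finite-difference gradient, standing in for a policy-gradient estimate."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (fitness(x + e) - fitness(x - e)) / (2 * h)
    return grad

def hybrid_search(dim=5, pop_size=20, generations=40, sigma=0.5, lr=0.05, grad_steps=10):
    population = rng.normal(scale=2.0, size=(pop_size, dim))      # exploration: random init
    for _ in range(generations):
        # Exploitation: refine every individual with a few gradient-ascent steps.
        for i in range(pop_size):
            for _ in range(grad_steps):
                population[i] += lr * numerical_grad(population[i])
        # Exploration: keep the best half and mutate it to refill the population.
        order = np.argsort([fitness(ind) for ind in population])[::-1]
        elite = population[order[: pop_size // 2]]
        children = elite + rng.normal(scale=sigma, size=elite.shape)
        population = np.concatenate([elite, children])
    best = max(population, key=fitness)
    return best, fitness(best)

best, score = hybrid_search()
print("best fitness:", round(score, 3))
```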

    Machine Learning for Ad Publishers in Real Time Bidding

    Get PDF

    A review on deep reinforcement learning for fluid mechanics: an update

    Full text link
    In the past couple of years, the interest of the fluid mechanics community in deep reinforcement learning (DRL) techniques has increased at a fast pace, leading to a growing bibliography on the topic. While the capabilities of DRL to solve complex decision-making problems make it a valuable tool for active flow control, recent publications also demonstrate applications to other fields, such as shape optimization or microfluidics. The present work aims to provide an exhaustive review of the existing literature and is a follow-up to our previous review on the topic. The contributions are grouped by field of application and compared with regard to algorithmic and technical choices, such as state selection, reward design, time granularity, and more. Based on these comparisons, general conclusions are drawn regarding the current state of the art in the domain, and perspectives for future improvements are sketched.

    my Human Brain Project (mHBP)

    Get PDF
    How can we make an agent that thinks like us humans? An agent with proprioception and intrinsic motivation that can identify deception, use small amounts of energy, transfer knowledge between tasks, and evolve? This is the problem that this thesis focuses on. Being able to create a piece of software that can perform tasks like a human being is a goal that, if achieved, will allow us to extend our own capabilities to a very high level and have more tasks performed in a predictable fashion. This is one of the motivations for this thesis. To address this problem, we have proposed a modular architecture for Reinforcement Learning computation and developed an implementation to exercise this architecture. This software, which we call mHBP, is written in Python, using Webots as the environment for the agent and Neo4J, a graph database, as memory. mHBP takes sensory data or other inputs and produces, based on the body parts / tools available to the agent, an output consisting of actions to perform. This thesis involves experimental design with several iterations, exploring a theoretical approach to RL based on graph databases. We conclude from the work in this thesis that it is possible to represent episodic data in a graph, and that it is also possible to interconnect Webots, Python, and Neo4J to support a stable architecture for Reinforcement Learning. In this work we also find a way to search for policies using the Neo4J query language, Cypher. Another key conclusion of this work is that state representation needs further research to find a state definition that enables policy search to produce more useful policies. The article "REINFORCEMENT LEARNING: A LITERATURE REVIEW (2020)" on ResearchGate, DOI 10.13140/RG.2.2.30323.76327, is an outcome of this thesis.
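The abstract above mentions storing episodes in a Neo4J graph and searching for policies from Python with Cypher. Below is a minimal, hypothetical sketch of how such a query could be issued; the node labels, relationship types, and properties are invented for illustration and are not mHBP's actual schema.

```python
# Hypothetical sketch of querying an episodic graph in Neo4J from Python,
# in the spirit of the mHBP architecture described above. The graph schema
# (:State nodes, :ACTION relationships, reward/name properties) is assumed.
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"                    # assumed local Neo4J instance
driver = GraphDatabase.driver(URI, auth=("neo4j", "password"))

# Cypher: among stored episodes, find the action sequence from a given start
# state to any terminal state that maximizes the accumulated reward.
BEST_PATH_QUERY = """
MATCH p = (s:State {id: $start_id})-[:ACTION*1..20]->(t:State {terminal: true})
WITH p, reduce(total = 0.0, r IN relationships(p) | total + r.reward) AS ret
RETURN [r IN relationships(p) | r.name] AS actions, ret
ORDER BY ret DESC
LIMIT 1
"""

def best_known_policy(start_id: str):
    """Return the highest-return action sequence recorded from start_id, if any."""
    with driver.session() as session:
        record = session.run(BEST_PATH_QUERY, start_id=start_id).single()
        return (record["actions"], record["ret"]) if record else None

print(best_known_policy("s0"))
driver.close()
```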

    Human Activity Recognition from Egocentric Videos and Robustness Analysis of Deep Neural Networks

    Get PDF
    In recent years, there has been a significant amount of research on human activity classification relying either on Inertial Measurement Unit (IMU) data or on data from static cameras providing a third-person view. There has been relatively less work using wearable cameras, which provide an egocentric (first-person) view of the environment as seen by the wearer. Using only IMU data limits the variety and complexity of the activities that can be detected. Deep machine learning has achieved great success in image and video processing in recent years, and neural-network-based models provide improved accuracy in multiple fields of computer vision. However, there has been relatively less work focusing on designing specific models to improve the performance of egocentric image/video tasks. As deep neural networks keep improving the accuracy of computer vision tasks, the robustness and resilience of these networks should be improved as well, so that they can be applied in safety-critical areas such as autonomous driving.
    Motivated by these considerations, in the first part of the thesis, the problem of human activity detection and classification from egocentric cameras is addressed. First, a new method is presented to count the number of footsteps and compute the total traveled distance by using the data from the IMU sensors and camera of a smartphone. By incorporating data from multiple sensor modalities and calculating the length of each step, instead of using preset stride lengths and assuming equal-length steps, the proposed method provides much higher accuracy than commercially available step-counting apps. Beyond footstep counting, more complicated human activities, such as the steps of preparing a recipe or sitting on a sofa, are then taken into consideration. Multiple classification methods, both non-deep-learning and deep-learning-based, are presented, employing both egocentric camera and IMU data. Then, a Genetic Algorithm-based approach is employed to set the parameters of an activity classification network autonomously, and its performance is compared with that of empirically set parameters. Next, a new framework is introduced to reduce the computational cost of human temporal activity recognition from egocentric videos while maintaining accuracy at a comparable level. The actor-critic model of reinforcement learning is applied to optical flow data to locate a bounding box around the region of interest, which is then used to clip a sub-image from a video frame. A shallow and a deeper 3D convolutional neural network are designed to process the original image and the clipped image region, respectively.
    Next, a systematic method is introduced that autonomously and simultaneously optimizes multiple parameters of any deep neural network by using a bi-generative adversarial network (Bi-GAN) guiding a genetic algorithm (GA). The proposed Bi-GAN allows the autonomous exploitation and choice of the number of neurons for the fully-connected layers and the number of filters for the convolutional layers from a large range of values. The Bi-GAN involves two generators, and two different models compete and improve each other progressively with a GAN-based strategy to optimize the networks during the GA evolution. In this analysis, three different neural network layer types and datasets are taken into consideration: (i) 3D convolutional layers on the ModelNet40 dataset, a dataset of 3D point clouds where the goal is shape classification over 40 shape classes; (ii) LSTM layers on the UCI HAR dataset, which is composed of Inertial Measurement Unit (IMU) data captured during activities of standing, sitting, lying, walking, walking upstairs, and walking downstairs, performed by 30 subjects, with 3-axial linear acceleration and 3-axial angular velocity collected at a constant rate of 50 Hz; and (iii) 2D convolutional layers on the Chars74k dataset, which contains 64 classes (0-9, A-Z, a-z), with 7705 characters obtained from natural images, 3410 hand-drawn characters using a tablet PC, and 62992 characters synthesised from computer fonts, giving a total of over 74K images.
    In the final part of the thesis, the robustness and resilience of neural network models are investigated with respect to adversarial examples (AEs) and autonomous driving conditions. The transferability of adversarial examples across a wide range of real-world computer vision tasks, including image classification, explicit content detection, optical character recognition (OCR), and object detection, is investigated. This represents the cybercriminal's situation, where an ensemble of different detection mechanisms needs to be evaded all at once. A novel Dispersion Reduction (DR) attack is designed: a practical attack that overcomes existing attacks' limitation of requiring task-specific loss functions by targeting the "dispersion" of an internal feature map. In the autonomous driving scenario, adversarial machine learning attacks against the complete visual perception pipeline are studied. A novel attack technique, tracker hijacking, that can effectively fool Multi-Object Tracking (MOT) using AEs on object detection is presented. Using this technique, successful AEs on as few as one single frame can move an existing object into or out of the headway of an autonomous vehicle to cause potential safety hazards.
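The Dispersion Reduction idea summarized above can be sketched as follows: perturb the input so that the standard deviation ("dispersion") of an intermediate feature map shrinks, without using any task-specific loss. The backbone, truncation point, and attack hyperparameters below are illustrative assumptions, not the thesis configuration.

```python
# Minimal sketch of a dispersion-reduction-style attack as described above.
# Requires a recent torchvision for the weights enum; backbone choice is assumed.
import torch
from torchvision import models

# Truncated backbone: an intermediate convolutional block of VGG-16 (assumed choice).
backbone = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:16].eval()
for p in backbone.parameters():
    p.requires_grad_(False)

def dispersion_reduction(x, steps=50, alpha=2 / 255, eps=8 / 255):
    """Return x_adv that minimizes the std ("dispersion") of the internal feature map."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        dispersion = backbone(x_adv).std()       # task-agnostic objective of the attack
        dispersion.backward()
        with torch.no_grad():
            x_adv = x_adv - alpha * x_adv.grad.sign()    # descend on dispersion
            x_adv = x + (x_adv - x).clamp(-eps, eps)     # stay in the L_inf ball around x
            x_adv = x_adv.clamp(0, 1)                    # keep a valid image
        x_adv = x_adv.detach()
    return x_adv

# Usage: a random tensor stands in for a real, normalized image here.
x = torch.rand(1, 3, 224, 224)
x_adv = dispersion_reduction(x)
print("max perturbation:", (x_adv - x).abs().max().item())
```

Because the objective depends only on internal features, the same perturbation can degrade several downstream tasks at once, which matches the cross-task transferability setting described in the abstract.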