Abstract
Direct policy search, which employs an optimization algorithm to search the parameters of a policy so as to maximize its total reward, often yields high-quality policies in complex reinforcement learning problems. Classification-based optimization is a recently developed framework for derivative-free optimization that has been shown to be effective and efficient for non-convex optimization problems with many local optima, and may therefore provide a powerful optimization tool for direct policy search. However, this framework requires sampling a batch of solutions for every update of the search model, while in reinforcement learning the environment often offers only sequential policy evaluation. Classification-based optimization may thus not be efficient for direct policy search, where solutions have to be sampled sequentially. In this paper, we adapt classification-based optimization to sequentially sampled solutions by forming the sample batch through the reuse of historical solutions. Experiments on a helicopter hovering task and on control tasks in OpenAI Gym show that the new algorithm significantly improves on the performance of several state-of-the-art derivative-free optimization approaches.
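
The idea of forming the batch from historical solutions can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `sequential_classification_search`, its parameters, the axis-aligned box used as the "classifier", and the rule of discarding the worst stored solution are all assumptions made for illustration. The sketch minimizes a cost, so a policy's total reward would be negated before being passed in.

```python
import random


def sequential_classification_search(evaluate, dim, lower, upper, budget,
                                     pool_size=20, num_positive=2,
                                     explore_prob=0.05, seed=0):
    """Hypothetical sketch: minimize `evaluate` one sample at a time.

    The pool of previously evaluated solutions plays the role of the batch
    that a batch-mode classification-based optimizer would resample.
    """
    rng = random.Random(seed)

    def uniform_point():
        return [rng.uniform(lower, upper) for _ in range(dim)]

    # Initial pool, evaluated sequentially; kept sorted from best to worst.
    pool = []
    for _ in range(pool_size):
        x = uniform_point()
        pool.append((evaluate(x), x))
    pool.sort(key=lambda pair: pair[0])
    evaluations = pool_size

    while evaluations < budget:
        positives = pool[:num_positive]
        negatives = [x for _, x in pool[num_positive:]]

        if rng.random() < explore_prob:
            # Occasionally sample from the whole space to keep exploring.
            candidate = uniform_point()
        else:
            # "Learn" a box that contains one positive solution and
            # excludes the stored negative solutions.
            anchor = rng.choice(positives)[1]
            box_low = [lower] * dim
            box_high = [upper] * dim
            remaining = negatives  # negatives still inside the box
            for _ in range(100 * dim):  # guard against degenerate ties
                if not remaining:
                    break
                k = rng.randrange(dim)
                neg = rng.choice(remaining)
                if anchor[k] >= neg[k]:
                    cut = rng.uniform(neg[k], anchor[k])
                    box_low[k] = max(box_low[k], cut)
                else:
                    cut = rng.uniform(anchor[k], neg[k])
                    box_high[k] = min(box_high[k], cut)
                remaining = [x for x in remaining
                             if box_low[k] <= x[k] <= box_high[k]]
            candidate = [rng.uniform(box_low[i], box_high[i])
                         for i in range(dim)]

        # A single sequential evaluation, e.g. one policy rollout.
        value = evaluate(candidate)
        evaluations += 1

        # Reuse history: add the new solution and drop the worst one, so the
        # next model update still sees a full batch.
        pool.append((value, candidate))
        pool.sort(key=lambda pair: pair[0])
        pool.pop()

    return pool[0]
```

In this sketch every model update costs one new evaluation instead of a full batch, because the remaining `pool_size - 1` solutions are carried over from earlier iterations. Calling it with a simple cost such as `lambda x: sum(v * v for v in x)` returns a `(value, solution)` pair for the best solution found within the evaluation budget.
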
| Original language | English |
|---|---|
| Pages | 2029-2035 |
| Number of pages | 7 |
| State | Published - 2017 |
| Externally published | Yes |
| Event | 31st AAAI Conference on Artificial Intelligence, AAAI 2017, San Francisco, United States (4 Feb 2017 → 10 Feb 2017) |
Conference
| Conference | 31st AAAI Conference on Artificial Intelligence, AAAI 2017 |
|---|---|
| Country/Territory | United States |
| City | San Francisco |
| Period | 4 Feb 2017 → 10 Feb 2017 |