Reinforcement Learning Notes (1): Introduction to Reinforcement Learning

Author: Noven Kan/杆 · Date: 2018-03-13

Preface

I am writing these study notes on the blog to better understand and consolidate the material, and also to make it easier for readers to point out misunderstandings on my part. For systematic study, I recommend an excellent book, Reinforcement Learning: An Introduction.


David Silver's lecture videos and the Chinese notes for his course


Now let's get to the main topic:


INTRODUCTION


The disciplines in which reinforcement learning is applied, and its relationship to machine learning, are shown below:


Characteristics of reinforcement learning:



  • There is no supervisor, only a reward signal

  • Feedback is delayed, not instantaneous

  • Time matters (the data is sequential)

  • The agent's actions affect the subsequent data it receives


Next, the basic architecture of reinforcement learning, shown in the figure below. The "central processor" of reinforcement learning is called the agent. Briefly, its job is to interact with the environment: it receives feedback from the environment (state, reward), makes a decision (action) based on that feedback, and acts back on the environment.
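As a rough sketch of this interaction loop (the `CounterEnv` and `RandomAgent` classes below are made-up placeholders, not from any particular library):

```python
# A minimal sketch of the agent-environment loop described above.
# `CounterEnv` and `RandomAgent` are made-up placeholders, not from any library.
import random

class CounterEnv:
    """Toy environment: the state is a counter in [0, 10]; the episode ends at 0."""
    def __init__(self):
        self.state = 5

    def step(self, action):
        # action is -1 or +1; the environment returns (next state, reward, done)
        self.state = min(max(self.state + action, 0), 10)
        done = (self.state == 0)
        reward = 0 if done else -1          # -1 per time-step until the episode ends
        return self.state, reward, done

class RandomAgent:
    def act(self, state):
        return random.choice([-1, +1])      # the agent's decision, based on the state

env, agent = CounterEnv(), RandomAgent()
state, done, total_reward = env.state, False, 0
while not done:
    action = agent.act(state)                      # agent -> action
    state, reward, done = env.step(action)         # environment -> (state, reward)
    total_reward += reward
print("episode return:", total_reward)
```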


Under Full Observability (i.e., the agent directly observes all states of the environment; most of what follows deals with this case), the interaction above is a Markov decision process, which therefore serves as a typical entry point to the subject.


Beyond the agent and environment, a reinforcement learning system includes four sub-elements: a policy, a reward signal, a value function, and a model of the environment.


Policy


A policy is the agent's behavior function; it defines the agent's way of behaving at a given time. It is a mapping from perceived states to actions. In some cases the policy may be a simple function or a lookup table. In general, a policy may be stochastic: $\pi(a|s)=\mathbb{P}[A_t=a|S_t=s]$, where $S_t$ represents the state and $A_t$ the action at time $t$.
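As a purely illustrative sketch, a stochastic policy over a small discrete state space could be stored as a lookup table of action probabilities; the states and actions below are invented:

```python
import random

# A stochastic policy pi(a|s) stored as a lookup table:
# for each state, a dict mapping action -> probability (probabilities sum to 1).
policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(pi, state):
    """Sample an action a ~ pi(.|state)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "s0"))  # mostly "left", occasionally "right"
```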


Reward Signal


A reward signal defines the goal in a reinforcement learning problem. The reward signal is a scalar number that defines what the good and bad events are for the agent, and it is the primary basis for altering the policy.


Value Function



Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.


For example, a state might always yield a low immediate reward but still have a high value because it is regularly followed by other states that yield high rewards. Or the reverse could be true. To make a human analogy, rewards are somewhat like pleasure (if high) and pain (if low), whereas values correspond to a more refined and farsighted judgment of how pleased or displeased we are that our environment is in a particular state.



Reward is primary and value is secondary; nevertheless, it is value with which we are most concerned when making and evaluating decisions. Action choices are made based on value judgments.


The value function can be defined as


$$v_\pi(s) = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\right]$$

where $\gamma \in [0,1]$ is the discount factor, which determines the present value of future rewards.
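To make the formula concrete, the small sketch below computes the discounted return $R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$ for one sampled reward sequence; $v_\pi(s)$ is the expectation of this quantity over trajectories starting from $s$. The reward numbers are invented:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for one trajectory."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# rewards received after leaving some state s (invented numbers)
print(discounted_return([-1, -1, -1, 10], gamma=0.9))  # -1 - 0.9 - 0.81 + 7.29 = 4.58
```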


Model


A model is something that mimics the behavior of the environment and predicts what the environment will do next. For example, given a state and an action, the model might predict the resultant next state


$$\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$$

and next reward


$$\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$$

A model is used for planning; note, however, that a model is not necessary for reinforcement learning.
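As a rough sketch, a tabular model could simply store, for each (state, action) pair, the observed next-state counts and the average observed reward, and use them to estimate $\mathcal{P}^a_{ss'}$ and $\mathcal{R}^a_s$; the data structures and names below are assumptions, not a standard API:

```python
from collections import defaultdict

# A toy tabular model built from observed experience.
transition_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
reward_sum = defaultdict(float)                            # (s, a) -> total observed reward
visit_count = defaultdict(int)                             # (s, a) -> number of visits

def update_model(s, a, r, s_next):
    transition_counts[(s, a)][s_next] += 1
    reward_sum[(s, a)] += r
    visit_count[(s, a)] += 1

def predicted_transition(s, a, s_next):
    """Estimate P^a_{ss'} as the empirical transition frequency."""
    total = visit_count[(s, a)]
    return transition_counts[(s, a)][s_next] / total if total else 0.0

def predicted_reward(s, a):
    """Estimate R^a_s as the average observed reward."""
    total = visit_count[(s, a)]
    return reward_sum[(s, a)] / total if total else 0.0

update_model("s0", "right", -1, "s1")
update_model("s0", "right", -1, "s1")
print(predicted_transition("s0", "right", "s1"), predicted_reward("s0", "right"))  # 1.0 -1.0
```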


Now, an example will be given below to show how a reinforcement learning system works.


Here is a maze. We define the reward as $-1$ per time-step, the actions of the agent are N, E, S, W, and the state is the agent's location.
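A minimal sketch of such a maze as a grid world (the 4x4 layout, goal position, and hand-written policy below are invented, since the original figure is not reproduced here):

```python
# A tiny grid-world maze: the state is the agent's (row, col) location,
# the actions are N, E, S, W, and the reward is -1 per time-step until the goal.
# The 4x4 layout and goal position are invented for illustration.
GOAL = (3, 3)
MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}

def step(state, action, size=4):
    dr, dc = MOVES[action]
    r = min(max(state[0] + dr, 0), size - 1)   # bumping into a wall keeps you in the grid
    c = min(max(state[1] + dc, 0), size - 1)
    next_state = (r, c)
    done = (next_state == GOAL)
    return next_state, -1, done                # -1 reward per time-step

state, done, steps = (0, 0), False, 0
while not done:
    state, reward, done = step(state, "E" if state[1] < 3 else "S")  # a hand-written policy
    steps += 1
print("steps to goal:", steps)  # 6 with this simple policy
```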



At the beginning, the policy in every state could be a random policy. After several rounds of policy evaluation and improvement, it converges to an optimal policy, as shown in the picture below.



Similar to the policy, the value of each state can also be initialized to zero at the start. It is then updated by value evaluation and improvement, and converges to a fixed value, like the values shown in the picture below.


*The methods of policy/value evaluation and improvement will be introduced in the following lectures.



The agent may have an internal model of the environment. In the picture below, the grid layout represents the transitions $\mathcal{P}^a_{ss'}$, and the numbers represent the immediate reward $\mathcal{R}^a_s$ from each state. Note that the model here is the agent's own representation, not the actual operating mechanism of the environment, i.e., how actions really change the state, which is called the dynamics.



Categorizing RL agents


From the maze game, we can see that a feasible solution can be found using only the policy or only the value function. So we can separate RL methods into three major categories: value based, policy based, and actor-critic, which uses both a policy and a value function.


We can also divide RL methods into model-free and model-based methods. A model can be learned efficiently by supervised learning methods. With the model, simulated experience can be generated and fed into model-free learning, which can be useful in some cases, such as AlphaGo.



Exploration and Exploitation


Reinforcement learning is like trial-and-error learning: the agent's goal is to find a good policy from its experience of the environment. To achieve this, the agent needs exploration, which gathers more information about the environment, such as the long-term reward of states or actions it has not tried, and exploitation, which uses the information it already has to maximize reward. It is usually important to explore as well as exploit.


For example, consider choosing a restaurant when you want to eat. Some restaurants you have eaten in, and some you have not. Exploitation here means choosing your favorite known restaurant; its drawback is that you never learn the taste of the restaurants you have not tried. Exploration means trying a new restaurant and updating your favorite accordingly.
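A simple and common way to trade off the two is ε-greedy action selection; the sketch below applies it to the restaurant analogy with made-up ratings:

```python
import random

# epsilon-greedy selection over the restaurant analogy:
# with probability epsilon, explore a random restaurant; otherwise exploit the best known one.
ratings = {"noodle shop": 4.2, "pizza place": 3.8, "new cafe": 0.0}  # made-up estimates

def choose_restaurant(ratings, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(ratings))      # explore: try anything
    return max(ratings, key=ratings.get)         # exploit: current favorite

print(choose_restaurant(ratings))
```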


Prediction and Control


Another pair of terms to compare in RL is prediction and control. Prediction means evaluating the future: the policy is given, and the agent evaluates the value of each state, in other words, it predicts the long-term reward of each state while following that policy. In control, the agent compares the values of the states, finds the optimal values, and updates the policy to optimise the future.
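As a very rough sketch of the distinction (the numbers and names below are invented): prediction estimates state values under a fixed policy, while control uses those estimates to choose better actions:

```python
# Prediction vs control, sketched on a tiny example with invented numbers.

# Prediction: given a fixed policy, estimate v_pi(s) by averaging sampled returns from s.
sampled_returns = {"s0": [4.0, 6.0, 5.0], "s1": [1.0, 3.0]}   # returns collected while following pi
v = {s: sum(g) / len(g) for s, g in sampled_returns.items()}
print("predicted values:", v)   # {'s0': 5.0, 's1': 2.0}

# Control: compare the values of the states each action leads to,
# and update the policy toward the best one.
successor = {"left": "s0", "right": "s1"}                     # which state each action leads to
best_action = max(successor, key=lambda a: v[successor[a]])
print("improved policy picks:", best_action)                  # 'left', since s0 has higher value
```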


The end.