Q-learning Algorithm Implementation 1 (MATLAB)
阿新 • Published: 2019-02-07
Algorithm pseudocode:
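The pseudocode figure is not reproduced in this copy. As a reference point, the update that the MATLAB code in this post applies after every move can be written as below. The symbols s' (next state) and a' (an action available in s') are notation assumed here and are not taken from the missing figure; α, γ and r correspond to `alpha`, `gamma` and `r` in the code.

```latex
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]
```

With `alpha=1`, as set in the code, this reduces to Q(s,a) ← r + γ max_{a'} Q(s',a').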
Once the Q-table has been obtained, the optimal policy is selected with the following algorithm:
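The figure for this step is also missing. The selection loop in the code amounts to the greedy rule sketched below; the names π and s_goal are assumed notation, with s_goal corresponding to `final_state` in the code.

```latex
\pi(s) = \arg\max_{a} Q(s,a), \qquad s \leftarrow \pi(s) \quad \text{repeated until } s = s_{\text{goal}}
```

In the room example the action index is the room the robot moves to, which is why the greedy action can be assigned directly as the next state.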
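Taking the robot-in-rooms problem as an example, the code is implemented as follows:
Note: room states 0-5 in the original article correspond to 1-6 in the code, because MATLAB indices start at 1.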
% Q-learning implementation of the robot-in-rooms example
%% Basic parameters
episode=100;     % number of exploration episodes
alpha=1;         % update step size (learning rate)
gamma=0.8;       % discount factor
state_num=6;
action_num=6;
final_state=6;   % goal room
% Reward_table(s,a): -1 means no door from room s to room a
Reward_table = [
    -1 -1 -1 -1  0  -1;   % 1
    -1 -1 -1  0 -1 100;   % 2
    -1 -1 -1  0 -1  -1;   % 3
    -1  0  0 -1  0  -1;   % 4
     0 -1 -1  0 -1 100;   % 5
    -1  0 -1 -1  0 100    % 6
    ];

%% Update the Q-table
% initialize Q(s,a)
Q_table=zeros(state_num,action_num);
for i=1:episode
    % randomly choose a start state
    current_state=randperm(state_num,1);
    while current_state~=final_state
        % randomly choose one of the actions available in the current state
        optional_action=find(Reward_table(current_state,:)>-1);
        chosen_action=optional_action(randperm(length(optional_action),1));
        % take the action, observe the reward and the next state
        r=Reward_table(current_state,chosen_action);
        next_state=chosen_action;   % the action index is the room moved to
        % update the Q-table
        next_possible_action=find(Reward_table(next_state,:)>-1);
        maxQ=max(Q_table(next_state,next_possible_action));
        Q_table(current_state,chosen_action)=Q_table(current_state,chosen_action)+alpha*(r+gamma*maxQ-Q_table(current_state,chosen_action));
        current_state=next_state;
    end
end

%% Select the optimal path
% randomly choose a start state
currentstate=randperm(state_num,1);
fprintf('Initialized state %d\n',currentstate);
% choose the action that satisfies Q(s,a)=max{Q(s,a')}
while currentstate~=final_state
    [maxQtable,index]=max(Q_table(currentstate,:));
    chosenaction=index;
    nextstate=chosenaction;
    fprintf('the robot goes to %d\n',nextstate);
    currentstate=nextstate;
end
Code output:
Q-table:
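The original Q-table output is not reproduced in this copy. As a hand check rather than the author's result, the values the table should settle at can be worked out from the code alone, assuming every valid state-action pair is visited often enough within the 100 episodes. With `alpha=1` each update is Q(s,a) = r + 0.8 · max Q(s',·). Row 6 is never updated, because an episode stops on reaching room 6, so its maximum stays 0.

```latex
\begin{aligned}
Q(2,6) &= Q(5,6) = 100 + 0.8 \cdot 0 = 100 \\
Q(1,5) &= Q(4,2) = Q(4,5) = 0 + 0.8 \cdot 100 = 80 \\
Q(2,4) &= Q(3,4) = Q(5,4) = Q(5,1) = 0 + 0.8 \cdot 80 = 64 \\
Q(4,3) &= 0 + 0.8 \cdot 64 = 51.2
\end{aligned}
```

All other entries, including the whole of row 6, remain 0 under these assumptions. A run that has not visited every pair may show smaller values.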
Optimal policy:
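The policy output is likewise missing from this copy. The selection loop in the code takes the maximum over an entire Q-table row, so a row that is still all zeros can send the robot through a non-existent door or make it loop forever. A minimal sketch of a more defensive rollout is given below. `greedy_path`, `route` and `max_steps` are hypothetical names introduced here, not part of the original post. In MATLAB the function would go in its own file `greedy_path.m` or at the end of the script.

```matlab
% Hypothetical helper (not in the original post): follow the greedy policy,
% but only over actions that exist in Reward_table, and stop after
% max_steps moves so an under-trained Q-table cannot loop forever.
function route = greedy_path(Q_table, Reward_table, start_state, final_state, max_steps)
    route = start_state;
    current = start_state;
    for step = 1:max_steps
        if current == final_state
            return;                                   % goal reached
        end
        valid = find(Reward_table(current, :) > -1);  % doors out of this room
        [~, k] = max(Q_table(current, valid));        % best valid action (first on ties)
        current = valid(k);                           % action index = next room
        route(end + 1) = current;                     % record the move
    end
    % If the goal was never reached, route holds the max_steps moves taken.
end
```

With the variables from the script, a call such as `greedy_path(Q_table, Reward_table, 3, final_state, state_num)` returns the sequence of rooms visited from room 3. Using `state_num` as the step cap is an assumption made here on the grounds that a shortest route never needs to revisit a room.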