
Q-learning Algorithm Implementation 1 (MATLAB)

Algorithm pseudocode:
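In brief, tabular Q-learning repeats the following update at every state transition; this is exactly what the `Q_table(...)` update line in the code below implements:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

where $s'$ is the state reached after taking action $a$ in state $s$, $\alpha$ is the step size, and $\gamma$ is the discount factor.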

Once the Q-table has been learned, the optimal policy is selected with the following procedure:
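Concretely, starting from a state $s$, the greedy policy repeatedly moves to the room given by

$$a^{*} = \arg\max_{a} Q(s,a)$$

until the goal room is reached; the final loop of the code below does exactly this.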

Taking the classic robot-in-rooms problem as an example, the code is implemented as follows:

Note: rooms/states 0-5 in the original article correspond to states 1-6 in the code (MATLAB indices start at 1).

% Q-learning implementation for the robot room-navigation example
%% Basic parameters
episode = 100;       % number of training episodes
alpha = 1;           % learning rate (step size)
gamma = 0.8;         % discount factor
state_num = 6;
action_num = 6;
final_state = 6;     % goal room (terminal state)
% Reward matrix: -1 = no door, 0 = door, 100 = door into the goal room
Reward_table = [
    -1 -1 -1 -1  0  -1;  %1
    -1 -1 -1  0 -1 100;  %2
    -1 -1 -1  0 -1  -1;  %3
    -1  0  0 -1  0  -1;  %4
     0 -1 -1  0 -1 100;  %5
    -1  0 -1 -1  0 100   %6
    ];
%% Update the Q-table
% initialize Q(s,a)
Q_table = zeros(state_num, action_num);
for i = 1:episode
    % randomly choose a starting state
    current_state = randperm(state_num, 1);
    while current_state ~= final_state
        % randomly choose a feasible action from the current state
        optional_action = find(Reward_table(current_state,:) > -1);
        chosen_action = optional_action(randperm(length(optional_action), 1));
        % take the action, observe the reward and the next state
        r = Reward_table(current_state, chosen_action);
        next_state = chosen_action;
        % update the Q-table
        next_possible_action = find(Reward_table(next_state,:) > -1);
        maxQ = max(Q_table(next_state, next_possible_action));
        Q_table(current_state, chosen_action) = Q_table(current_state, chosen_action) ...
            + alpha*(r + gamma*maxQ - Q_table(current_state, chosen_action));
        current_state = next_state;
    end
end
%% Extract the optimal path
% randomly choose a starting state
currentstate = randperm(state_num, 1);
fprintf('Initialized state %d\n', currentstate);
% choose the action that satisfies Q(s,a) = max{Q(s,a')}
while currentstate ~= final_state
    [maxQtable, index] = max(Q_table(currentstate,:));
    chosenaction = index;
    nextstate = chosenaction;
    fprintf('the robot goes to %d\n', nextstate);
    currentstate = nextstate;
end
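As an optional addition (not part of the original listing), the learned table and the greedy next room for each state can be printed for a quick check. Note that the row for the terminal state 6 is never updated inside the training loop, so it stays at zero.

disp(round(Q_table));                       % learned Q-values; actions leading into room 6 approach 100
[~, greedy_action] = max(Q_table, [], 2);   % greedy (highest-Q) next room for each state
fprintf('greedy next room per state: ');
fprintf('%d ', greedy_action(1:5));         % state 6 is the goal, so it is omitted
fprintf('\n');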
        

Code output:

Q-table:

Optimal policy: