Reinforcement Learning with Python - Part 1
- Introduction
- Reinforcement Learning
- Finite Markov Decision Processes
- The Bellman equation
- The Q-algorithm
- Interface for the lake problem
Introduction
Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend. The surface is described using a grid like the following:
- Blue tile : the moving agent at the starting point, safe
- White tile : frozen surface, safe
- Red tiles : holes, fall to your doom
- Green tile : goal, where the frisbee is located
The episode ends when you reach the goal or fall in a hole. Consequently, you have to find a safe path across the lake to the frisbee and, if possible, the shortest one.
Of course, we assume that the agent doesn't know where the holes and the frisbee are.
The agent receives a reward when it reaches the frisbee (in the implementation later in this article, reaching the goal gives a positive reward and falling into a hole gives a negative one).

Each timestep, the agent chooses an action, and the environment returns an observation and a reward. The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water (in red). Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile (in green).
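This setting is exactly the classic FrozenLake environment shipped with the Gym/Gymnasium toolkit, so you can experiment with it before implementing anything yourself. The following sketch is a minimal random-agent loop; it assumes the gymnasium package is installed (older gym versions return slightly different tuples from reset and step).

```python
# Minimal sketch: a random agent on the frozen lake (assumes the gymnasium package).
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=True)   # slippery ice: moves are stochastic
obs, info = env.reset()

done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()               # choose a random action (0..3)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated                    # goal reached, hole, or step limit

print("episode finished, total reward:", total_reward)
```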
Reinforcement Learning
Reinforcement learning (RL) is the subfield of machine learning concerned with decision making.
It studies how an agent can learn to achieve goals in a complex, uncertain environment.
The environment can be deterministic or stochastic, hostile (a chess game) or not (a Tetris game),
partially observable (mobile robotics), and known or unknown to the decision-making agent.
According to Sutton and Barto, reinforcement learning refers to any adaptive method for solving a sequential decision problem. The term "adaptive" means that we start from an inefficient solution and improve it progressively based on the experience of the agent (or agents).
Reinforcement learning consists in finding:
- a sequence of actions which provides an optimal reward
- a strategy in which the agent sometimes settles for small rewards in the short term in order to obtain much better rewards in the long term
RL is very general, encompassing all problems that involve making a sequence of decisions: for example, controlling a robot’s motors so that it’s able to run and jump, making business decisions like pricing and inventory management, or playing video games and board games. RL can even be applied to supervised learning problems with sequential or structured outputs.
RL algorithms have started to achieve good results in many difficult environments. RL has a long history, but until recent advances in deep learning, it required lots of problem-specific engineering. DeepMind’s StarCraft II results and AlphaGo all used deep RL algorithms which did not make too many assumptions about their environment, and thus can be applied in other settings.
Finite Markov Decision Processes
At the beginning of the 20th century, the mathematician Andrey Markov studied stochastic processes without memory, called Markov chains. Such a process has a fixed number of states and evolves randomly from one state to another at each step.
The probability that it goes from a state $s$ to a state $s'$ is fixed, and it depends only on the pair $(s, s')$, not on the states visited earlier (this is why we say that the chain has no memory).
The Markov chain in the figure describes the probability of the various weather patterns given the current weather state. In the "sun" state, for example, there is a 50% probability of staying in that state and a 20% probability of moving to the "rain" state.
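To make this concrete, here is a short sketch of such a weather chain represented as a transition matrix. Only the 50% stay-in-sun and 20% sun-to-rain probabilities come from the figure above; the third "cloud" state and all the other probabilities are assumptions made up for illustration.

```python
# A weather Markov chain as a transition matrix (mostly made-up probabilities).
import numpy as np

states = ["sun", "rain", "cloud"]
P = np.array([
    [0.5, 0.2, 0.3],   # from "sun": 50% stay, 20% rain (from the figure), 30% assumed
    [0.3, 0.4, 0.3],   # from "rain"  (assumed values)
    [0.4, 0.3, 0.3],   # from "cloud" (assumed values)
])

rng = np.random.default_rng(0)
state = 0                 # start in the "sun" state
for t in range(10):
    # the next state depends only on the current one: this is the Markov property
    state = rng.choice(len(states), p=P[state])
    print(t, states[state])
```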
We introduce the formal problem of finite Markov decision processes, or finite MDPs.
They were first introduced by Richard Bellman in the 1950s.
They are very similar to Markov chains, but at each step an agent can choose an action among several possible actions, and the transition probabilities also depend on the chosen action.
In addition, certain transitions between states return a positive or negative reward.
The agent's goal is then to find a policy that maximizes the rewards over time.
We will use this framework to solve the previous problem.
MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations, or states, and through those, future rewards. Thus MDPs involve delayed reward and the need to trade off immediate and delayed reward.
MDPs are a mathematically idealized form of the reinforcement learning problem for which precise
theoretical statements can be made. We introduce key elements of the problem's mathematical structure,
such as returns, value functions, and Bellman equations.
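As a reminder, using Sutton and Barto's standard notation, the return at time step $t$ is the discounted sum of future rewards,

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$

where $\gamma$ (between 0 and 1) is the discount rate that will reappear in the Bellman equation below.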
MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve
a goal. The learner and decision maker is called the agent. The thing it interacts with, comprising
everything outside the agent, is called the environment. These interact continually, the agent selecting
actions and the environment responding to these actions and presenting new situations to the agent.
The environment also gives rise to rewards, special numerical values that the agent seeks to maximize
over time through its choice of actions.
More specifically, the agent and environment interact at each of a sequence of discrete time steps, $t = 0, 1, 2, 3, \ldots$
The MDP and agent together thereby give rise to a sequence or trajectory $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots$
In a finite MDP, the sets of states, actions, and rewards all have a finite number of elements. In this case, the random variables $S_t$ and $R_t$ have well-defined discrete probability distributions that depend only on the preceding state and action.
[MDP figure: a grid of states S0, S1, ... with up to four possible actions per state. From a given state, a chosen action may lead to different successor states; for example, we can imagine that an action has a probability of 90% of reaching the intended state and 10% of leaving the agent where it is. Certain transitions return a positive or negative reward.]
Let us now give each state $s$ an optimal value $V^*(s)$, defined as the expected sum of discounted future rewards the agent can obtain from $s$ if it acts optimally.
The Bellman equation
In the following, we assume that the agent knows the MDP of the system perfectly. We will see in Part 2 what kind of algorithms we can use when this is not the case. These algorithms are based on Richard Bellman's theory.
We assume that the optimal state values satisfy the following recursive relation, known as the Bellman optimality equation:

$$V^*(s) = \max_a \sum_{s'} P(s, a, s')\,\bigl[\, r(s, a, s') + \gamma\, V^*(s') \,\bigr] \quad \text{for all } s,$$

where
- $P(s, a, s')$ is the transition probability from state $s$ to state $s'$ knowing that the agent chose action $a$.
- Function $r(s, a, s')$ is the reward obtained by going from state $s$ to state $s'$ by choosing action $a$. (For the previous MDP figure, the values of these two functions are encoded in the arrays P and r built in the exercise below.)
- Coefficient $\gamma$ is the discount rate. It belongs to the interval $[0, 1)$. We will see the key role of this coefficient in the next exercises.
This equation is a necessary condition for optimality. It expresses the value of a decision problem at a certain point in time in terms of the payoff from some initial choices and the value of the remaining decision problem that results from those initial choices. This breaks a dynamic optimization problem into a sequence of simpler subproblems. The recursive equation says that if the agent acts in an optimal way, then the optimal value of the current state is equal to the reward it will obtain on average after carrying out an optimal action, plus the expected optimal value of all the following states to which this action can lead.
This equation then leads us directly to the following algorithm (known as Value Iteration), which allows us to estimate the optimal value of each state: initialize all the state value estimates to zero, then repeatedly update them with

$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} P(s, a, s')\,\bigl[\, r(s, a, s') + \gamma\, V_k(s') \,\bigr] \quad \text{for all } s.$$
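To see what a few of these updates do, here is a tiny hand-rolled sketch on a made-up two-state, two-action MDP; every probability and reward below is an assumption chosen only for illustration, not the MDP of the figure.

```python
# Value Iteration on a made-up 2-state, 2-action MDP (illustration only).
import numpy as np

# P[s, a, s'] : transition probabilities, r[s, a, s'] : rewards (all values assumed)
P = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[1.0, 0.0], [0.1, 0.9]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 0.0], [0.0, 1.0]]])
gamma = 0.9

V = np.zeros(2)
for _ in range(100):
    # Bellman backup: for each state, take the best action's expected value
    V = np.array([max(np.sum(P[s, a] * (r[s, a] + gamma * V)) for a in range(2))
                  for s in range(2)])
print(V)   # approximate optimal value of each state
```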
Exercise [Bellman equation]
In this exercise, we want to implement this Bellman value estimation in Python and apply it to the previous MDP figure.
- For this implementation, we need to define $P(s, a, s')$ and $r(s, a, s')$. To do this, we create NumPy arrays for these two functions. We assume that the value of $P(s, a, s')$ will be given by the NumPy array P[i,a,j], where the indexes are ordered like this: start state, action, arrival state. You can use the numpy.full function to create an initial array filled with zeros. After that, you can write an automatic loop that initializes all the probabilities. We assume here that the probability of performing the chosen action is 100%.
- Define the MDP class with:
  - the attributes of the object: the NumPy arrays P (transition probability array) and r (reward array), and a list of the possible actions of each state. For instance, S0 has two possible actions and S1 has three possible actions; consequently actions=[[0,3],[0,2,3],...].
  - the Bellman method, which returns a list containing the estimated optimal values of each state after niter iterations for a given $\gamma$ value ($\gamma$ and niter are parameters). We assume that $\gamma = 0.95$ and that the maximum number of iterations is niter = 200.

```python
class MDP(object):
    def __init__(self, P, r, actions):
        pass  # To Do

    def Bellman(self, gamma=0.95, niter=200):
        # To Do
        return V
```
- Create an object of the MDP class for the previous MDP problem and compare the results of the Bellman method for different values of $\gamma$ with niter=200. Can you explain the differences? In fact, a reward (or payment) in the future is not worth quite as much as a reward now. For example, being promised 10000€ next year is worth only 90% as much as receiving 10000€ right now; with this kind of discounting, a payment k steps in the future is worth only $\gamma^k$ of a payment now.
- We assume now that we have the following MDP figure, a three-state MDP whose transition probabilities, rewards and possible actions are encoded in the arrays given below. In state S0 it seems clear that action a0 is the best possible, and in state S2 the agent must choose action a1, its only option. But in state S1, it is not easy to determine whether the agent should stay there (action a0) or go through the flames (action a2). Apply your Bellman algorithm to this case and calculate the expected discounted average future value of each state.
```python
import numpy as np

m = 12   # number of states
n = 4    # number of actions (0: +4, 1: -4, 2: -1, 3: +1 on the state index)
P = np.full([m, n, m], 0.0)
r = np.full([m, n, m], 0.0)
for i in range(m):
    for a in range(n):
        for j in range(m):
            if (a == 0 and j == i + n) or (a == 1 and j == i - n) \
               or (a == 2 and j == i - 1) or (a == 3 and j == i + 1):
                # no transitions from or into the blocked state S5,
                # and none out of the terminal state S11
                if (i != 5 and j != 5 and i != (m - 1)):
                    P[i, a, j] = 1.0
r[3, 0, 7] = -1.0
r[6, 3, 7] = -1.0
r[11, 1, 7] = -1.0
r[7, 0, 11] = 1.0
r[10, 3, 11] = 1.0
```
```python
class MDP(object):
    def __init__(self, P, r, actions):
        self.P = P
        self.r = r
        self.actions = actions
        self.m = P.shape[0]   # number of states

    def Bellman(self, gamma=0.95, niter=200):
        V0 = [0.0 for i in range(self.m)]
        for i in range(niter):
            V1 = list(V0)   # copy, so the whole sweep uses the previous iteration's values
            for s in range(self.m):
                V1[s] = np.max([np.sum([self.P[s, a, sp] * (self.r[s, a, sp] + gamma * V0[sp])
                                        for sp in range(self.m)])
                                for a in self.actions[s]])
            V0 = V1
        return V0

actions = []
for i in range(m):
    temp = []
    for a in range(n):
        if (a == 0 and i + 4 < m) or (a == 1 and i - 4 >= 0) \
           or (a == 2 and (i % 4 - 1) >= 0) or (a == 3 and (i % 4 + 1) <= 3):
            temp.append(a)
    actions.append(temp)

model = MDP(P, r, actions)
print(model.Bellman())
```
```python
P = np.array([[[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
              [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
              [[0.0, 0.0, 0.0], [0.8, 0.1, 0.1], [0.0, 0.0, 0.0]]])
r = np.array([[[10, 0, 0], [0, 0, 0], [0, 0, 0]],
              [[0, 0, 0], [0, 0, 0], [0, 0, -50]],
              [[0, 0, 0], [40, 0, 0], [0, 0, 0]]])
actions = [[0, 1, 2], [0, 2], [1]]
```
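These arrays can be fed directly to the MDP class defined above; a minimal check might look like this (the printed values depend on the chosen gamma and niter):

```python
# Sketch: solve the three-state "flames" MDP with the class defined above.
model2 = MDP(P, r, actions)
print(model2.Bellman(gamma=0.95, niter=200))   # expected discounted value of each state
```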
The Q-algorithm
As we can see in the previous exercise, the Bellman equation only gives us the optimal average value of each state.
It does not tell the agent what to do, that is, which optimal action to choose in each state. For this, Richard Bellman designed a new equation called the Q-equation.
This leads to the Q-algorithm methodology. The Q-algorithm is one of the techniques for finding the optimal policy in an MDP.
The objective of the Q-algorithm is to find a policy that is optimal in the sense that the expected value of the total reward over all successive steps is the maximum achievable. In other words, the goal of the Q-algorithm is to find the optimal policy by learning the optimal Q-values of each state-action pair.
In the following, I give the Q-algorithm: start from $Q_0(s, a) = 0$ for all pairs $(s, a)$, then iterate

$$Q_{k+1}(s, a) = \sum_{s'} P(s, a, s')\,\bigl[\, r(s, a, s') + \gamma \max_{a'} Q_k(s', a') \,\bigr] \quad \text{for all } (s, a).$$
Exercise [Q algorithm]
- Implement this Q-algorithm. For this, you will define a new method of the MDP class called Qalgo with n, gamma, and niter as parameters. Parameter n is the number of possible actions.

```python
def Qalgo(self, n, gamma=0.95, niter=200):
    Q = np.full([self.m, n], 0.0)   # one row per state, one column per action
    # To Do
    return Q
```
- In order to obtain the optimal action of each state, apply numpy.argmax(Q, axis=1) to the returned result. Show that the optimal Q-values and the optimal actions for each state of the first MDP are:
Table of optimal Q-values (one row per state S0–S11, one column per action a0–a3):

| State | a0 | a1 | a2 | a3 |
|-------|----|----|----|----|
| S0 | 0.81450625 | 0 | 0 | 0.81450625 |
| S1 | 0 | 0 | 0.77378094 | 0.857375 |
| S2 | 0.9025 | 0 | 0.81450625 | 0.81450625 |
| S3 | -0.05 | 0 | 0.857375 | 0 |
| S4 | 0.857375 | 0.77378094 | 0 | 0 |
| S5 | 0 | 0 | 0 | 0 |
| S6 | 0.95 | 0.857375 | 0 | -0.05 |
| S7 | 1 | 0.81450625 | 0.9025 | 0 |
| S8 | 0 | 0.81450625 | 0 | 0.9025 |
| S9 | 0 | 0 | 0.857375 | 0.95 |
| S10 | 0 | 0.9025 | 0.9025 | 1 |
| S11 | 0 | 0 | 0 | 0 |
Optimal actions for each state: [0 3 0 2 0 0 0 0 3 3 3 0], and the Bellman algorithm gives us the following V-values (to be compared with the optimal Q-values): [0.8145, 0.8573, 0.9025, 0.8573, 0.8573, 0.0, 0.95, 1.0, 0.9025, 0.95, 1.0, 0.0]
```python
class MDP(object):
    def __init__(self, P, r, actions):
        self.P = P
        self.r = r
        self.actions = actions
        self.m = P.shape[0]   # number of states

    def Bellman(self, gamma=0.95, niter=200):
        V0 = [0.0 for i in range(self.m)]
        for i in range(niter):
            V1 = list(V0)   # copy of the previous iteration's values
            for s in range(self.m):
                V1[s] = np.max([np.sum([self.P[s, a, sp] * (self.r[s, a, sp] + gamma * V0[sp])
                                        for sp in range(self.m)])
                                for a in self.actions[s]])
            V0 = V1
        return V0

    def Qalgo(self, n, gamma=0.95, niter=200):
        Q = np.full([self.m, n], 0.0)
        for i in range(niter):
            Q1 = Q.copy()   # copy of the previous iteration's Q-values
            for s in range(self.m):
                for a in self.actions[s]:
                    Q[s, a] = np.sum([self.P[s, a, sp] * (self.r[s, a, sp] + gamma * np.max(Q1[sp]))
                                      for sp in range(self.m)])
        return Q

model = MDP(P, r, actions)
print(model.Bellman())
res = model.Qalgo(4)
print(res)
print("Optimal action for each state:")
print(np.argmax(res, axis=1))
```
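As a small follow-up (not part of the original exercise), you can read the greedy path off this Q array by repeatedly taking the argmax action and applying the corresponding index offset on the grid; the offsets +4, -4, -1 and +1 are the ones used above when building P.

```python
# Sketch: follow the greedy policy from state S0 to the terminal state S11.
moves = {0: +4, 1: -4, 2: -1, 3: +1}   # action -> change of state index on the 4-column grid
state = 0
path = [state]
while state != 11 and len(path) < 20:   # small safety bound on the path length
    action = np.argmax(res[state])
    state = state + moves[action]
    path.append(state)
print(path)   # sequence of visited states
```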
Interface for the lake problem
To create the interface for the lake problem, solved using the reinforcement learning strategy, we will use the tkinter module.
Exercise [tkinter interface class]
- Create a class Game which enables:
  - to initialize the size (p x p) of the graphical grid of the lake
  - to allow the user to define with the mouse all the holes (red tiles)
  - to set the frisbee location (green tile)
  - to create the MDP object for the RL resolution

For this, you can use the following code, which draws a square of the grid according to the mouse event coordinates.

```python
from tkinter import *

root = Tk()

def key(event):
    print("pressed", repr(event.char))

def callback(event):
    global dim
    canvas.focus_set()
    print("clicked at", event.x, event.y)
    i = event.x // dim
    j = event.y // dim
    canvas.itemconfig(rectangles_array[i][j], fill='blue')

canvas = Canvas(root, width=300, height=300)
canvas.bind("<Key>", key)
canvas.bind("<Button-1>", callback)
canvas.pack()

rectangles_array = []
dim = 60
for i in range(5):
    raw = []
    for j in range(5):
        if (i + j) % 2 == 0:
            raw.append(canvas.create_rectangle(i*dim, j*dim, (i+1)*dim, (j+1)*dim, fill='green'))
        else:
            raw.append(canvas.create_rectangle(i*dim, j*dim, (i+1)*dim, (j+1)*dim, fill='red'))
    rectangles_array.append(raw)

root.mainloop()
```
- Add to the class a method path which "plays" the reinforcement strategy until the agent finds the frisbee.
You can start from the following skeleton and complete the TO DO parts:

```python
from tkinter import *


class TextBox(Entry):
    def __init__(self, parent=None, width=30):
        self.__text = StringVar()
        super().__init__(parent, width=width, textvariable=self.__text)

    def get(self):
        return self.__text.get()

    def set(self, text):
        self.__text.set(text)


# interface class for creating and displaying the problem
class Game(Tk):
    def __init__(self):
        super().__init__()
        self.sizeG = 0              # size of the square grid
        self.sizeC = 30             # size of a square in pixels
        self.configureG = 0         # configuration step: 0 = stopped, 1 = start point, 2 = goal point, 3 = holes
        self.depart = (0, 0)        # position of the start point
        self.arrivee = (0, 0)       # position of the goal point
        self.trous = []             # list of the hole positions
        self.rectangles_array = []  # list of the square objects of the grid
        self.create_widgets()
        self.MDPO = None            # MDP object for the resolution

    def create_widgets(self):
        # creation of the frames
        # ----------------- cadre11: configuration of the grid ----------------
        # TO DO
        # ----------------- cadre12: resolution -------------------------------
        # TO DO
        # ----------------- cadre13: message display --------------------------
        # TO DO
        # ----------------- cadre2: display of the grid -----------------------
        self.cadre2 = Frame(self, width=10, height=10, borderwidth=5, relief=GROOVE)
        self.cadre2.pack(fill=X)
        self.gride = Canvas(self.cadre2, width=self.sizeG*self.sizeC, height=self.sizeG*self.sizeC)
        self.rectangles_array = []
        self.gride.bind("<Button-1>", self.callback)
        self.gride.pack()

    # handle mouse clicks in the canvas
    def callback(self, event):
        if self.configureG != 0:
            # mouse coordinates in the grid
            j = event.x // self.sizeC
            i = event.y // self.sizeC
            if self.configureG == 1:
                # definition of the start point
                self.gride.itemconfig(self.rectangles_array[j][i], fill='blue')
                self.configureG = 2
                self.depart = (i, j)
                self.label1['text'] = "Choose the goal point"
            elif self.configureG == 2:
                # definition of the goal point
                self.gride.itemconfig(self.rectangles_array[j][i], fill='green')
                self.configureG = 3
                self.arrivee = (i, j)
                self.label1['text'] = "Choose the locations of the holes\nClick the button to finish"
                self.bouton_config['state'] = 'normal'   # the configuration can now be stopped
            else:
                # definition of the holes
                self.gride.itemconfig(self.rectangles_array[j][i], fill='red')
                self.trous.append((i, j))

    # colour one cell of the grid
    def colorize(self, j, i, color='black'):
        pass  # TO DO

    # creation of the grid
    def creerG(self, size):
        pass  # TO DO

    # start / stop the configuration of the grid
    def configure(self):
        if self.configureG != 0:
            # stop the configuration
            self.bouton_config.config(relief=RAISED)
            self.bouton_config['text'] = 'Configure the grid'
            self.bouton_calcul['state'] = 'normal'
            self.configureG = 0
            self.label1['text'] = ''
            self.creerMDP()
        else:
            # start the configuration
            v = self.texte1.get()
            if v.isnumeric():
                self.creerG(int(v))
                self.bouton_config.config(relief=SUNKEN)
                self.bouton_config['state'] = 'disable'
                self.bouton_config['text'] = 'End of configuration'
                self.bouton_calcul['state'] = 'disable'
                self.configureG = 1
                self.trous = []
                self.label1['text'] = "Choose the start point"
            else:
                self.label1['text'] = "the grid size must be an integer!"

    # creation of the MDP object from the configuration of the grid
    def creerMDP(self):
        pass  # TO DO

    # solve the problem and display the path
    def path(self):
        pass  # TO DO


# start the interface
fenetre = Game()
fenetre.mainloop()
```
```python
from tkinter import *
from tkinter.filedialog import askopenfilename
import numpy as np


# MDP class for solving the problem
class MDP(object):
    # P : numpy array of shape [m, n, m], with m = number of states and n = number of actions
    #     P[i, a, j] contains the probability of going from state i to state j with action a
    # r : numpy array of shape [m, n, m]
    #     r[i, a, j] contains the reward obtained by going from state i to state j with action a
    # actions : list of m lists
    #     actions[i] contains the list of the possible actions from state i
    def __init__(self, P, r, actions):
        self.P = P
        self.r = r
        self.actions = actions
        self.m = P.shape[0]
        self.nb_action = P.shape[1]

    # computation of the optimal value of each state with the Richard Bellman equation
    def Bellman(self, gamma=0.95, niter=200):
        V0 = [0.0 for i in range(self.m)]
        for i in range(niter):
            V1 = list(V0)   # copy of the previous iteration's values
            for s in range(self.m):
                V1[s] = np.max([np.sum([self.P[s, a, sp] * (self.r[s, a, sp] + gamma * V0[sp])
                                        for sp in range(self.m)])
                                for a in self.actions[s]])
            V0 = V1
        return V0

    # computation of the optimal value of each action from each state
    def Qalgo(self, gamma=0.95, niter=200):
        Q = np.full([self.m, self.nb_action], 0.0)
        for i in range(niter):
            Q1 = Q.copy()   # copy of the previous iteration's Q-values
            for s in range(self.m):
                for a in self.actions[s]:
                    Q[s, a] = np.sum([self.P[s, a, sp] * (self.r[s, a, sp] + gamma * np.max(Q1[sp]))
                                      for sp in range(self.m)])
        return Q


class TextBox(Entry):
    def __init__(self, parent=None, width=30):
        self.__text = StringVar()
        super().__init__(parent, width=width, textvariable=self.__text)

    def get(self):
        return self.__text.get()

    def set(self, text):
        self.__text.set(text)


# interface class for creating and displaying the problem
class Game(Tk):
    def __init__(self):
        super().__init__()
        self.sizeG = 0              # size of the square grid
        self.sizeC = 30             # size of a square in pixels
        self.configureG = 0         # configuration step: 0 = stopped, 1 = start point, 2 = goal point, 3 = holes
        self.depart = (0, 0)        # position of the start point
        self.arrivee = (0, 0)       # position of the goal point
        self.trous = []             # list of the hole positions
        self.rectangles_array = []  # list of the square objects of the grid
        self.create_widgets()
        self.MDPO = None            # MDP object for the resolution

    def create_widgets(self):
        # creation of the frames
        # ----------------- cadre11: configuration of the grid ----------------
        self.cadre11 = Frame(self, width=400, height=50, borderwidth=5)
        self.cadre11.pack(fill=X)
        self.label11 = Label(self.cadre11, text="size of the square grid:")
        self.label11.pack(side=LEFT, padx=10, pady=10)
        self.texte1 = TextBox(self.cadre11, width=30)
        self.texte1.pack(side=LEFT, padx=10, pady=10)
        self.texte1.set('10')
        self.bouton_config = Button(self.cadre11, text="Configure the grid", command=self.configure)
        self.bouton_config.pack(side=LEFT, padx=10, pady=10)
        # ----------------- cadre12: resolution -------------------------------
        self.cadre12 = Frame(self, width=400, height=50, borderwidth=5)
        self.cadre12.pack(fill=X)
        self.label2 = Label(self.cadre12, text="number of iterations:")
        self.label2.pack(side=LEFT, padx=10, pady=10)
        self.texte2 = TextBox(self.cadre12, width=30)
        self.texte2.pack(side=LEFT, padx=10, pady=10)
        self.texte2.set('100')
        self.bouton_calcul = Button(self.cadre12, text="Compute the path", command=self.path)
        self.bouton_calcul.pack(side=LEFT, padx=10, pady=10)
        self.bouton_calcul['state'] = 'disable'
        # ----------------- cadre13: message display --------------------------
        self.cadre13 = Frame(self, width=400, height=50, borderwidth=5)
        self.cadre13.pack(fill=X)
        self.label1 = Label(self.cadre13, text="")
        self.label1.pack(side=TOP, padx=10, pady=10)
        # ----------------- cadre2: display of the grid -----------------------
        self.cadre2 = Frame(self, width=10, height=10, borderwidth=5, relief=GROOVE)
        self.cadre2.pack(fill=X)
        self.gride = Canvas(self.cadre2, width=self.sizeG*self.sizeC, height=self.sizeG*self.sizeC)
        self.rectangles_array = []
        self.gride.bind("<Button-1>", self.callback)
        self.gride.pack()

    # handle mouse clicks in the canvas
    def callback(self, event):
        if self.configureG != 0:
            # mouse coordinates in the grid
            j = event.x // self.sizeC
            i = event.y // self.sizeC
            if self.configureG == 1:
                # definition of the start point
                self.gride.itemconfig(self.rectangles_array[j][i], fill='blue')
                self.configureG = 2
                self.depart = (i, j)
                self.label1['text'] = "Choose the goal point"
            elif self.configureG == 2:
                # definition of the goal point
                self.gride.itemconfig(self.rectangles_array[j][i], fill='green')
                self.configureG = 3
                self.arrivee = (i, j)
                self.label1['text'] = "Choose the locations of the holes\nClick the button to finish"
                self.bouton_config['state'] = 'normal'   # the configuration can now be stopped
            else:
                # definition of the holes
                self.gride.itemconfig(self.rectangles_array[j][i], fill='red')
                self.trous.append((i, j))

    # colour one cell of the grid
    def colorize(self, j, i, color='black'):
        self.gride.itemconfig(self.rectangles_array[j][i], fill=color)

    # creation of the grid
    def creerG(self, size):
        self.configureG = 0
        self.bouton_config.config(relief=RAISED)
        self.label1['text'] = ''
        self.sizeG = int(size)
        self.gride.delete("all")
        # update the size of the canvas
        self.gride.config(width=self.sizeG*self.sizeC, height=self.sizeG*self.sizeC)
        self.rectangles_array = []
        # creation of the squares of the grid
        for i in range(self.sizeG):
            raw = []
            for j in range(self.sizeG):
                raw.append(self.gride.create_rectangle(i*self.sizeC, j*self.sizeC,
                                                       (i+1)*self.sizeC, (j+1)*self.sizeC, fill='white'))
            self.rectangles_array.append(raw)

    # start / stop the configuration of the grid
    def configure(self):
        if self.configureG != 0:
            # stop the configuration
            self.bouton_config.config(relief=RAISED)
            self.bouton_config['text'] = 'Configure the grid'
            self.bouton_calcul['state'] = 'normal'
            self.configureG = 0
            self.label1['text'] = ''
            self.creerMDP()
        else:
            # start the configuration
            v = self.texte1.get()
            if v.isnumeric():
                self.creerG(int(v))
                self.bouton_config.config(relief=SUNKEN)
                self.bouton_config['state'] = 'disable'
                self.bouton_config['text'] = 'End of configuration'
                self.bouton_calcul['state'] = 'disable'
                self.configureG = 1
                self.trous = []
                self.label1['text'] = "Choose the start point"
            else:
                self.label1['text'] = "the grid size must be an integer!"

    # creation of the MDP object from the configuration of the grid
    def creerMDP(self):
        m = self.sizeG * self.sizeG
        n = 4
        P = np.full([m, n, m], 0.0)
        # initialization of P
        for i in range(m):
            for a in range(n):
                for j in range(m):
                    if (a == 0 and j == i + self.sizeG) or (a == 1 and j == i - self.sizeG) \
                       or (a == 2 and j == i - 1) or (a == 3 and j == i + 1):
                        # the goal is a terminal state: no transitions out of it
                        if i != self.arrivee[0]*self.sizeG + self.arrivee[1]:
                            P[i, a, j] = 1.0
        # initialization of the possible actions of each state
        actions = []
        for i in range(m):
            l = i // self.sizeG
            c = i % self.sizeG
            cactions = []
            if l < self.sizeG - 1:
                cactions.append(0)
            if l > 0:
                cactions.append(1)
            if c > 0:
                cactions.append(2)
            if c < self.sizeG - 1:
                cactions.append(3)
            actions.append(cactions)
        # initialization of r
        r = np.full([m, n, m], 0.0)
        # reward for reaching the goal point
        ind = self.arrivee[0]*self.sizeG + self.arrivee[1]
        ind2 = (self.arrivee[0]-1)*self.sizeG + self.arrivee[1]
        if ind2 >= 0 and ind2 < m:
            r[ind2, 0, ind] = 1.0
        ind2 = (self.arrivee[0]+1)*self.sizeG + self.arrivee[1]
        if ind2 >= 0 and ind2 < m:
            r[ind2, 1, ind] = 1.0
        ind2 = self.arrivee[0]*self.sizeG + self.arrivee[1] + 1
        if ind2 >= 0 and ind2 < m:
            r[ind2, 2, ind] = 1.0
        ind2 = self.arrivee[0]*self.sizeG + self.arrivee[1] - 1
        if ind2 >= 0 and ind2 < m:
            r[ind2, 3, ind] = 1.0
        # penalties for falling into the holes
        for i in range(len(self.trous)):
            ind = self.trous[i][0]*self.sizeG + self.trous[i][1]
            ind2 = (self.trous[i][0]-1)*self.sizeG + self.trous[i][1]
            if ind2 >= 0 and ind2 < m:
                r[ind2, 0, ind] = -10.0
            ind2 = (self.trous[i][0]+1)*self.sizeG + self.trous[i][1]
            if ind2 >= 0 and ind2 < m:
                r[ind2, 1, ind] = -10.0
            ind2 = self.trous[i][0]*self.sizeG + self.trous[i][1] + 1
            if ind2 >= 0 and ind2 < m:
                r[ind2, 2, ind] = -10.0
            ind2 = self.trous[i][0]*self.sizeG + self.trous[i][1] - 1
            if ind2 >= 0 and ind2 < m:
                r[ind2, 3, ind] = -10.0
        self.MDPO = MDP(P, r, actions)

    # solve the problem and display the path
    def path(self):
        niter = self.texte2.get()
        if self.MDPO is not None and niter.isnumeric():
            Q = self.MDPO.Qalgo(gamma=0.95, niter=int(niter))
            ind = self.depart[0]*self.sizeG + self.depart[1]
            ind_final = self.arrivee[0]*self.sizeG + self.arrivee[1]
            while ind != ind_final:
                action = np.argmax(Q[ind])
                if Q[ind][action] > 0:
                    if action == 0:
                        ind = ind + self.sizeG
                    if action == 1:
                        ind = ind - self.sizeG
                    if action == 2:
                        ind = ind - 1
                    if action == 3:
                        ind = ind + 1
                    if ind != ind_final:
                        self.colorize(ind % self.sizeG, ind // self.sizeG)
                else:
                    raise Exception("path blocked!")


# start the interface
fenetre = Game()
fenetre.mainloop()
```
References
- R.S. Sutton, A.G. Barto (2017), Reinforcement Learning: An Introduction.
- R. Bellman (1957), A Markovian Decision Process