Policy Learning in Dynamic Bayesian Networks
The Policy Learning function is only available in Validation Mode for Dynamic Networks that include both Decision Nodes and Utility Nodes.
In this context, the Temporal Toolbar features two additional buttons, one for Exploration and one for Learning. Each button toggles between two states:

- Activates Exploration, i.e., testing random actions during the temporal simulation.
- Deactivates Exploration.
- Activates Learning of the state/action qualities during the temporal simulation.
- Deactivates Learning.
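To illustrate what toggling Exploration means, here is a minimal sketch of an epsilon-greedy action choice. It is a generic illustration, not the tool's actual implementation: the function name `choose_action`, the `explore` flag, and the `epsilon` parameter are all assumptions made for this example.

```python
import random

def choose_action(q_values, explore, epsilon=0.1, rng=random):
    """Pick an action index from a list of state/action qualities.

    Hypothetical helper: `explore` stands in for the Exploration toggle,
    and `epsilon` is an assumed probability of trying a random action.
    """
    if explore and rng.random() < epsilon:
        # Exploration on: occasionally test a random action.
        return rng.randrange(len(q_values))
    # Otherwise exploit: take the action with the highest quality.
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With `explore=False` the call always returns the index of the highest quality, which corresponds to simulating with the current policy only.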
The reinforcement learning algorithm updates the quality of each state/action pair based on the discounted sum of expected utilities. Its parameters can be adjusted in the settings.
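The text does not name the exact update rule, so the following is only a sketch of one common way to update state/action qualities from discounted utilities: a tabular Q-learning step. The function `update_quality`, the `learn` flag, and the parameters `learning_rate` and `discount` are assumptions for illustration and may not match the names or defaults in the settings.

```python
def update_quality(q, state, action, utility, next_state, learn,
                   learning_rate=0.5, discount=0.9):
    """Apply one tabular Q-learning step to q[state][action].

    Hypothetical sketch: `q` maps each state to a list of action qualities,
    and `learn` stands in for the Learning toggle.
    """
    if not learn:
        # Learning off: qualities stay unchanged during the simulation.
        return q[state][action]
    # Discounted estimate of future utility from the best next action.
    target = utility + discount * max(q[next_state])
    # Move the current quality part of the way toward that target.
    q[state][action] += learning_rate * (target - q[state][action])
    return q[state][action]

q = {"s0": [0.0, 0.0], "s1": [1.0, 2.0]}
update_quality(q, "s0", 1, utility=1.0, next_state="s1", learn=True)
```

In this example the quality of action 1 in state `s0` moves from 0.0 to about 1.4, halfway toward the target of 1.0 + 0.9 × 2.0 = 2.8.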
The complexity of this type of problem means that arriving at an optimal policy cannot be guaranteed. Hence, you should carry out several iterations and keep track of the policies yielding the best results.
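The advice to repeat the process and keep the best policies can be sketched as a simple loop. This is an illustration under stated assumptions: `run_once` is a hypothetical callable standing in for one learning session that returns a policy and its score, and `toy_run` is an invented stand-in, not anything the tool provides.

```python
import random

def keep_best_policy(run_once, n_runs, seed=0):
    """Run several learning sessions and keep the best-scoring policy."""
    rng = random.Random(seed)
    best_policy, best_score = None, float("-inf")
    for _ in range(n_runs):
        policy, score = run_once(rng)
        if score > best_score:
            best_policy, best_score = policy, score
    return best_policy, best_score

def toy_run(rng):
    # Invented stand-in for one session: a random policy over two states,
    # scored by a made-up utility.
    policy = {"s0": rng.randrange(2), "s1": rng.randrange(2)}
    score = policy["s0"] + 2 * policy["s1"]
    return policy, score

best_policy, best_score = keep_best_policy(toy_run, n_runs=50)
```

Only the best result is kept here; storing every policy with its score would work just as well if you want to compare runs afterwards.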