Chapter 6: Learning policies
Static and dynamic Bayesian networks constitute decision support systems, since they allow the probability of the states to be computed with respect to the available evidence. However, this support can be greatly improved by adding two kinds of nodes: Utility nodes and Decision nodes.
6.1 Utilities
A Utility node assigns a value to each state defined by a combination of the modalities of its parents. These numerical values represent the quality or the cost[1] of these states.
To illustrate the use of Utility nodes, we take the classical example of drilling an oil well. The variable Oil has three modalities representing the state of the soil with respect to oil: dry, wet and soak. The variable Drill indicates whether we drill or not. The Utility node describes the quality of the different states. For example, the worst case corresponds to drilling when the soil does not contain oil (-70), and the best case to drilling when the soil is soaked in oil (200).
Validation mode allows the expected utility to be computed with respect to the evidence. For example, the a priori expected utility is equal to 10, whereas it is equal to 20 when we set the evidence that we are going to drill. The monitor associated with a Utility node has two bars: the first indicates its expected utility, and the second indicates the expected sum of all the utilities defined in the Bayesian network.
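The two expected utilities quoted above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not of BayesiaLab's inference engine. The text gives only two utilities (-70 and 200). The prior over Oil (0.5, 0.3, 0.2), the utility of 50 for drilling in wet soil, the utility of 0 for not drilling, and the uniform a priori distribution over Drill are all assumptions, chosen because they are common in textbook versions of this problem and because they yield the values 10 and 20.

```python
# Assumed prior over the soil state (not given in the text).
P_OIL = {"dry": 0.5, "wet": 0.3, "soak": 0.2}

# Utility table indexed by (Drill, Oil). Only -70 and 200 come from the text;
# every other entry is an assumption.
UTILITY = {
    ("yes", "dry"): -70.0,
    ("yes", "wet"): 50.0,
    ("yes", "soak"): 200.0,
    ("no", "dry"): 0.0,
    ("no", "wet"): 0.0,
    ("no", "soak"): 0.0,
}


def expected_utility(p_drill):
    """Expected utility for a distribution over Drill, Oil being independent of Drill."""
    return sum(
        p_drill[d] * P_OIL[o] * UTILITY[(d, o)]
        for d in p_drill
        for o in P_OIL
    )


eu_prior = expected_utility({"yes": 0.5, "no": 0.5})  # no evidence on Drill (assumed uniform)
eu_drill = expected_utility({"yes": 1.0, "no": 0.0})  # evidence: we drill
print(eu_prior, eu_drill)
```

Under these assumptions, setting the evidence on Drill simply removes the "no" rows of the utility table from the weighted sum, which is why the expected utility doubles.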
6.2 Decisions
In fact, the Drill node in the example above represents an action. BayesiaLab provides Decision nodes to model this kind of node. As for classical nodes, the effect of a decision is modeled by the conditional probability tables associated with its children. On the other hand, the conditional probability table of a Decision node is replaced by a quality table that indicates, for each state, i.e. each combination of parent modalities, the quality of each action.
In this example, when we only know the a priori probability distribution of the soil quality, the optimal policy is to drill (this action appears in light blue in the table and the monitor).
This quality table can be filled in by an expert and/or learned automatically by BayesiaLab (Learning menu, Validation mode). In the latter case, a reinforcement learning algorithm updates the quality of each state/action pair with respect to the global expected utility. The parameters of this algorithm can be changed in the Preferences (quality initialization, learning rate, exploration rate and number of learning steps).
For the easiest problems, such as this Drill problem, the quality initialization step of BayesiaLab is sufficient to find the optimal policy, even when we take the entire model, as illustrated below. The Bayesian network contains another Decision node to take into account the choice of carrying out a seismic test to gather some uncertain information on the soil quality. Even though this test has a cost and gives uncertain results, the optimal policy learned by BayesiaLab consists in testing and then drilling, except when the result of the test indicates that there is no oil.
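To give an idea of how a learning rate and an exploration rate interact, here is a minimal sketch of a quality update for the Drill decision. It is a generic epsilon-greedy scheme and should not be read as BayesiaLab's actual algorithm, which the text does not detail. The function name, the default parameter values and the utility of 0 for not drilling are assumptions; the expected utility of 20 for drilling is the value quoted in section 6.1.

```python
import random

# Expected utility of each action when only the prior on Oil is known.
# 20 comes from the text; 0 for not drilling is an assumption.
EXPECTED_UTILITY = {"drill": 20.0, "no_drill": 0.0}


def learn_qualities(steps=200, learning_rate=0.5, exploration_rate=0.2,
                    initial_quality=0.0, seed=0):
    """Epsilon-greedy learning of the action qualities for a single state."""
    rng = random.Random(seed)
    quality = {action: initial_quality for action in EXPECTED_UTILITY}
    actions = list(quality)
    for _ in range(steps):
        if rng.random() < exploration_rate:
            action = rng.choice(actions)            # explore: random action
        else:
            action = max(actions, key=quality.get)  # exploit: best known action
        # Move the quality a fraction of the way toward the expected utility.
        target = EXPECTED_UTILITY[action]
        quality[action] += learning_rate * (target - quality[action])
    return quality


quality = learn_qualities()
policy = max(quality, key=quality.get)
print(quality, policy)
```

Because the Drill node has no parent here, there is a single state and the quality table reduces to one value per action; with parents, the same update would be applied to each state/action pair.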
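The policy described above can be checked by direct enumeration on a small numerical instance. The sketch below is not BayesiaLab's learning procedure, and almost all of its numbers are assumptions: the prior over Oil, the utility of 50 for wet soil, the test cost of 10, the three test results and their conditional probabilities are borrowed from common textbook versions of this problem. Only -70 and 200 come from the text.

```python
P_OIL = {"dry": 0.5, "wet": 0.3, "soak": 0.2}         # assumed prior
U_DRILL = {"dry": -70.0, "wet": 50.0, "soak": 200.0}  # 50 is assumed
TEST_COST = 10.0                                       # assumed
# Assumed P(test result | Oil); "none" is the result suggesting no oil.
P_RESULT = {
    "dry":  {"none": 0.6, "open": 0.3, "closed": 0.1},
    "wet":  {"none": 0.3, "open": 0.4, "closed": 0.3},
    "soak": {"none": 0.1, "open": 0.4, "closed": 0.5},
}


def best_drill_decision(weights):
    """Return (action, value) maximising the (unnormalised) expected utility."""
    eu_drill = sum(weights[o] * U_DRILL[o] for o in weights)
    return ("drill", eu_drill) if eu_drill > 0.0 else ("no_drill", 0.0)


# Without the test: decide on the prior alone.
_, eu_without_test = best_drill_decision(P_OIL)

# With the test: one drilling decision per result, weighted by P(Oil, result).
policy_after_test = {}
eu_with_test = -TEST_COST
for result in ("none", "open", "closed"):
    joint = {o: P_OIL[o] * P_RESULT[o][result] for o in P_OIL}
    action, value = best_drill_decision(joint)
    policy_after_test[result] = action
    eu_with_test += value

print(policy_after_test)
print(eu_without_test, eu_with_test)
```

With these assumed numbers, testing is worth its cost (the expected utility rises from 20 to 22.5), and drilling is rejected only after the "none" result, which agrees with the policy stated in the text.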
6.3 Policy of dynamic systems
BayesiaLab provides the same kind of decision support for dynamic Bayesian networks. To illustrate these functionalities, we return to our fluid distribution process (cf. Chapter 5). The Bayesian network below defines a maintenance system in which it is not possible to repair more than one valve at a time, and in which the repair time depends on the valve. This Bayesian network also assigns values to the different states of the process by using four Utility nodes: fixed costs, a cost that depends on the valve being repaired, and the income and raw material cost, which both depend on the system availability.
The maintenance policy is periodic and therefore depends directly on the variable Time. This variable has one modality per month, and is therefore defined as the modulo of the value of BayesiaLab's temporal counter (green node).
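The definition of Time amounts to a one-line modulo operation. The sketch below assumes that one time step corresponds to one month, that there are 12 modalities, and that modalities are indexed from 0; the function name is hypothetical.

```python
MONTHS_PER_YEAR = 12  # one modality per month (12 is an assumption)


def time_modality(counter, period=MONTHS_PER_YEAR):
    """Map the temporal counter to the index of the Time modality."""
    return counter % period


print([time_modality(t) for t in (0, 11, 12, 25)])  # [0, 11, 0, 1]
```

Since the policy is indexed by this modality rather than by the raw counter, the same maintenance decisions recur every period.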
As for static Bayesian networks, the policy of a dynamic Bayesian network can be learned in Validation mode. A dedicated toolbar appears when switching to that mode. The Exploration button allows actions to be tested at random during the temporal simulation. The learning button activates the learning of the state/action qualities during this simulation. The length of the simulation is set directly by using the counter.
The reinforcement learning algorithm updates the quality of each state/action pair based on the discounted sum of expected utilities. Its parameters (discount factor, learning rate and exploration rate) can be changed in the settings. The maintenance policy illustrated below corresponds to the policy learned by BayesiaLab over 1000 time steps, with a discount factor equal to 0.99, a learning rate equal to 0.5 and an initial exploration rate equal to 1.
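To show the role of these three parameters, here is a minimal sketch of tabular Q-learning, a standard reinforcement learning algorithm based on the discounted sum of utilities. The text does not say which algorithm BayesiaLab implements, so this is an illustration only. The one-valve process, its utilities (income of 100, repair cost of 30), its failure probability of 0.1 and the linear decay of the exploration rate are all hypothetical; only the discount factor, the learning rate, the initial exploration rate and the number of steps are taken from the text.

```python
import random

STATES = ("ok", "failed")
ACTIONS = ("wait", "repair")


def step(state, action, rng):
    """Hypothetical one-valve process: returns (utility, next_state)."""
    utility = 100.0 if state == "ok" else 0.0   # income when the system is available
    if action == "repair":
        return utility - 30.0, "ok"             # pay the repair cost, valve restored
    if state == "ok" and rng.random() < 0.1:    # random failure
        return utility, "failed"
    return utility, state


def q_update(q, state, action, utility, next_state, learning_rate, discount):
    """One update toward utility + discounted best quality of the next state."""
    target = utility + discount * max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += learning_rate * (target - q[(state, action)])


def learn_policy(steps=1000, discount=0.99, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    state = "ok"
    for t in range(steps):
        exploration_rate = 1.0 - t / steps      # assumed linear decay from 1
        if rng.random() < exploration_rate:
            action = rng.choice(ACTIONS)        # exploration
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        utility, next_state = step(state, action, rng)
        q_update(q, state, action, utility, next_state, learning_rate, discount)
        state = next_state
    return q


q = learn_policy()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}
print(policy)
```

A discount factor close to 1, such as 0.99, makes the qualities account for utilities far in the future, which suits a maintenance problem where a repair pays off over many time steps.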
The graph below represents the evolution of the probability of the system availability over 200 time steps when we apply the learned maintenance policy (with neither exploration nor learning). The mean total expected utility per time step obtained with this policy is equal to 182,998.
[1] The costs associated with the variables in the adaptive questionnaire framework correspond to the cost of observing the value of the variable. This cost is identical for all the values of the variable.







