Chapter 3 - Extracting a Bayesian network from a database
3.1 Introducing the problem
Over the specialist's career, he or she created a data base of information on his or her patients suffering from lung
disease. This data base leaves an important record if ever another specialist needs to take over. One row in the
data base corresponds to the diagnosis of one patient. Here are the first two lines of the data base:
| N° | Smoker | Cancer | Age | Tuberculosis | TbOrCa | VisiteAsia | XRay | Bronchititis | Dyspnea |
| 1 | True | False | 25 | False | False | False | Normal | False | True |
The first line corresponds to the names of the variables and the second corresponds to the patient's information. In total, this data base contains the records for 10,000 patients.
Here is a list of the variables in more detail:
N° : Patient identification number. A positive numerical value
Smoker : Does he or she smoke? True/False
Cancer : Does he or she have cancer? True/False
Age : Age of the patient? A positive numerical value
Tuberculosis : Does the patient have tuberculosis? True/False
TbOrCa : This is a variable created to simplify the probability tables (logical 'or') True/False
VisitAsia : Did the patient visit Asia? True/False
XRay : Result of the X-ray? Normal/Abnormal
Bronchitis : Does the patient have bronchitis? True/False
Dyspnea : Does he or she have some respiratory difficulties? True/False
Assuming another specialist has taken over the case, he or she will want to identify all the associations among these variables in order to assess the chances that a given patient has cancer or tuberculosis and especially to be able to determine whether or not an X-ray is necessary.
3.2 Loading the data base
After having chosen which data base to open (a text file), a first assistant for the specification of the file is launched in order to choose the characters used as separators, then whether or not a line containing the name of variables appears at the beginning of the file, then, if the file contains missing values, what are the characters used to indicate these missing values.
A second assistant appears then to determine the operations to carry out on the fields.
This assistant allows specifying the type of the variables (discrete or continuous) and also allows not distributing some fields. For example, the patient number being unprofitable for lung disease modeling, it is not necessary to distribute this field.
This assistant also gives a set of information about the data.
The data base containing some missing values, one can specify a treatment for each variable having missing values
- Either the missing values are not treated, i.e. they are consider as a new modality (the non observation is sometime an information)
- Either they are replaced by some other value (the modal value if the variable is discrete, the average if it is continuous, or any other value)
- Either the power of inference is used to estimate the missing values with the current Bayesian network
- Dynamic completion: after each structural modification (arc addition, reversal, and removal), the missing values are dynamically estimated according to the resulting Bayesian network and the available data, and replaced by the corresponding modal values.
- Structural EM: after each structural modification and for each variable with missing values, the current Bayesian network and the available data are used to compute the probabilities of each modality. These probabilities are directly used for structural learning and parameter estimation. The main difference with the dynamic completion is then the lack of completion with a “winning” modality. This method is the more precise but also more time consuming than the other ones.
Each variable has its own missing value processing. If one wants to apply the same processing for all these variables, it is necessary to select all the columns (double click on a column or “Ctrl + A”). As we have very few missing values (2.77%), we choose to replace all the missing values by their modal values.
There exists another type of processing that is applicable to any values: the filtering. It would be possible
for example to filter the smokers aged of less than 15 years. To filter these patients, one would click on the
smoker’s icon
and would click once on the True modality. This modality would become blue, which would indicate an And
Filter (a second click switch to an Or Filter, red). We would then select the Age variable and would enter 15 as a min
value.
As the data base contains a continuous variable, we have to make it discrete. There are several ways to do this:
- Equal distance: the range is cut up into intervals of the same length.
- Equal frequency: more interesting because it involves intervals having the same number of associated cases.
- Decision tree: intervals are chosen according to the information they contribute to the target variable (a discrete target variable is always necessary).
Our specialist, who has decided to make the Age variable discrete using a decision tree, selects the target node Cancer. He or she now wants to cut up the range into a maximum of four intervals (we need to consider the maximum value since the software can decide to use less).
Once the assistant is closed, a worksheet appears. It contains all the nodes that correspond to the distributed variables of the data base. Now we only need to run the learning algorithms to use the data base. Now, our specialist discovers the 9 nodes corresponding to the distributed variables:
3.3 Unsupervised learning for discovering all the probabilistic relations in the data (association discovery)
BayesiaLab uses three algorithms for association discovery:
- SopLEQ: a search based on a global characterization of the data and on the exploitation of the properties of equivalent Bayesian networks (fast)
- Taboo: structural learning using a Taboo search.
- Taboo order: learning using a Taboo search for finding the best order for the nodes (a more complete search and therefore longer)
Our specialist can use the SopLEQ method to discover the associations between nodes:
To make it easier to see the Bayesian network, BayesiaLab has two tools that automatically reorganize the nodes: a dynamic positioning and a positioning with a genetic algorithm.
The automatic positioning can be stop by using the red light of the state bar,
. The potentiometer located at the
upper right corner can be used to vary the length of arcs.
Note that the “P” key allows launching the automatic positioning for a predefined number of steps.
Now, the physician has a usable representation of its data. The obtained Bayesian networks looks a lot to the one constructed thanks to the expert knowledge. However, the arc between Smoker and Cancer is inverted. Actually, without expert knowledge indicating that smoking is one cause of cancer, it is impossible to find the causal orientation of this arc. In Validation mode, BayesiaLab allows visualizing the set of arcs for which the orientation cannot be decided without expert knowledge (Inference Menu, Show the Edges).
By using the arc contextual menu, it is then possible to specify the orientation according to our knowledge. The changes of orientations are then reflected on the whole Bayesian network and the probability tables are automatically modified.
3.4 Using the knowledge hidden in the data base
We will now monitor the variables Smoker, Age, Cancer and Tuberculosis. In order to do this, we must first select the variables, and then choosing "monitor" from the contextual menu of one of the selected node (right click on the node):
We can notice that thanks to the discretization by decision tree with Cancer as target node, BayesiaLab cut out the Age variable in three intervals, instead of the four initially specified. These intervals correspond to people with an age less than or equal to 37 years, those with an age less than or equal to 60 years old, and those older than 60 years old.
The conditional probability table of the Age variable (as seen above) shows that 34.79% of the patients consulted were more than 60 years old. What would happen if we indicate that the patient that we are considering is more than 60 years old? Would the probability of having cancer increase? All we have to do is double click on the value ">60.152" on the Age variable and then analyze the new probabilities:
Because of the propagation of the new information, the probability distributions for every node have been updated and we can see that the probability of having Cancer goes from 3.89% to 7.70%.
Let's look at the problem from the opposite direction: what characterizes people having cancer (on the variables that we monitored)? While clicking on the state "True" for cancer, we find:
What we find, generally, are patients who are more than 60 years old (to 68.89%) and smokers (to 64.78%).
The approach taken here is completely different from the one described in the previous chapter. Here, the
specialist isn't only trying to verify his or her model. Instead, the goal is to discover knowledge from the data base.
BayesiaLab provides a tool that, in a couple of clicks, leads to findings that would never have been as simple
to discover by reading the 10000 lines of the data base.
BayesiaLab can generate an analysis report that generalizes this kind of analyses for a target node. The report
is generated by setting Cancer as target variable and by choosing Target Analysis Report in the Inference
Menu.
This report contains: the analysis context (the set of observations), the marginal distribution of the target node, a list of nodes sorted according to the information they bring to the knowledge of the target variable, and the profile for each modality of Cancer. We can see from this report that the patients that have cancer generally have an abnormal X-Ray, are older than 60 years old, suffer from dyspnea and are smoker.
3.5 Unsupervised learning for discovering new concepts (clustering, segmentation)
In addition to association discovery, BayesiaLab has clustering algorithms. The goal of these algorithms is the
discovery of natural partitions from examples described in the data base; these partitions share a set of common
properties. The learning menu allows launching the Clustering assistant.
Two methods are proposed:
- Fixed number of classes : the number of classes (partitions) is determined by the user.
- Automatic selection of the number of classes : the system searches for the best number of classes for optimal partitioning. The search method is based on a directed random walk. It needs then an initial number of classes and a maximum number of classes.
, while preserving the best solution
At the end of learning, the following report is generated. It contains the learning parameters and the learning results, gives the marginal distributions of the clusters found, sorted according to their weight, and the distribution of these clusters on the data base used for learning. As one can note it, these two distributions are not identical. The first one is theoretical; the second one is computed by assigning to each example the cluster with the higher probability. It could be possible, for example, that a modality representing 2% theoretically but do not have any assigned example in the data base, i.e. this modality never has the higher probability. Actually, this kind of cluster is automatically removed in the BayesiaLab’s clustering algorithm.
The second part of the report first gives a list of nodes sorted according to the information they bring to the Cluster node, then, for each cluster, it describes its probabilistic profile. In the report shown below, 5 clusters have been found:
- Patients in good health.
- Patients suffering from bronchitis and being mainly smoker.
- Aged smoker patients without dyspnea.
- Aged patient that do not smoke suffering from bronchitis.
- Patients having cancer or tuberculosis
Actually, this report used the various analysis functionalities proposed in the inference menu (cf Chapter 4).
3.6 Supervised learning dedicated entirely to characterizing a particular variable
BayesiaLab also offers a third type of Bayesian network learning from a data base. While the first two (association discovery and new concept discovery) are "unsupervised" algorithms, this last category contains “supervised” algorithms. This is due to the fact that the induction of the Bayesian network is completely guided throughout the characterization of a particular node.
BayesiaLab uses several characterization algorithms. To illustrate this functionality, we leave our initial
problem of lung diseases to be interested in a simplified modeling of car operation. The optimal Bayesian network
having allowed the generation of the data ("Menu Data base - Generate data base" in Validation mode) is as
follows:
Our target node is BatteryPower.
Naïve Bayes
This is a Bayesian network in which the target node is the father of all other nodes. In this structure, knowing the value of the target makes each node independent of the others. The low number of probabilities to be estimated makes this a very robust structure and its learning is very fast since the structure is known and then, we just have to learn the parameters. The screen shot below corresponds to the Naïve Bayes for BatteryPower after an arc analysis and automatic positioning.
Augmented Naive Bayes
This architecture is composed of a naive structure, enriched by relations among son nodes knowing the value of the target node. With this algorithm, we can get more precise results than with the naive architecture, but searching for the conditional relations costs more in time.
Sons & Spouses
Here, the target node is the father of a subset of nodes possibly having other relationships (spouses). This algorithm has the advantage of showing the set of nodes being indirectly linked to the target. The time cost is of the same order as for the augmented naive Bayes.
Markov Blanket
This algorithm searches for the nodes that belong to the Markov Blanket of the target: fathers, sons and spouses nodes. With this structure's search, which is entirely focused on the target node, we can get the sub-set of the nodes that are actually relevant in a time frame that is much lower than with the other two architectures Augmented Naïve Bayes and Sons & Spouses. This method is a very good tool for analyzing one variable
The observation of the nodes belonging to the Markov Blanket makes the target node independent of all the other nodes.
Augmented Markov Blanket
This involves the Markov Blanket learning with the addition of an unsupervised search for the probabilistic relations between each of the variables belonging to the Markov Blanket. This additional search entails a supplementary cost but in exchange there is a gain in precision with respect to the simple version. In our example, such relations do not exist and the resulting structure is then the Bayesian network obtained by the Markov Blanket algorithm.
Evaluation of the learned Bayesian networks
Once these supervised models learned, it is possible to evaluate their performances on a test set (ideally different from the learning set). In Validation mode, the inference menu proposes the item Targeted Evaluation. This evaluation provides:
- The total precision of the Bayesian network
- Three confusion matrices allowing to precisely measure the performances of the model for each modality (occurrence number, reliability: proportion of the cases with a correct prediction, and precision: proportion of the cases where the true value is correctly predicted).
- A lift curve relative to a target modality. It represents the detection rate of the target value (Y-axis) with respect to the number of processed cases (X-axis). Cases are first sorted with respect to the probability inferred by the model (cases with the highest target modality probability appear at the beginning of the list).Thus, if the target modality represents 10 % of the cases, the optimal lift curve has an Y-value at 100 % for an X-value at 10 % (all the target cases are detected with no false detection). Conversely, a random tagging policy leads to a diagonal lift curve (the expected detection rate is directly given by the selection rate: if you randomly select x% of the cases, the expected detection rate of the cases with the target value is x %).
- A ROC curve representing the False Positive rate of the target modality along X axis and the True Positive rate along the Y axis.















