Français Search

Chapter 3 - Extracting a Bayesian network from a database


3.1    Introducing the problem
Over the specialist's career, he or she created a data base of information on his or her patients suffering from lung disease.  This data base leaves an important record if ever another specialist needs to take over. One row in the data base corresponds to the diagnosis of one patient. Here are the first two lines of the data base:

 

Smoker Cancer Age Tuberculosis TbOrCa VisiteAsia XRay Bronchititis Dyspnea
1 True False 25 False False False Normal False True

The first line corresponds to the names of the variables and the second corresponds to the patient's information. In total, this data base contains the records for 10,000 patients.  

Here is a list of the variables in more detail: 

N° : Patient identification number. A positive numerical value

Smoker : Does he or she smoke? True/False

Cancer : Does he or she have cancer? True/False

Age : Age of the patient? A positive numerical value

Tuberculosis : Does the patient have tuberculosis? True/False

TbOrCa : This is a variable created to simplify the probability tables (logical 'or')  True/False

VisitAsia : Did the patient visit Asia? True/False

XRay : Result of the X-ray? Normal/Abnormal

Bronchitis : Does the patient have bronchitis? True/False

Dyspnea : Does he or she have some respiratory difficulties? True/False

Assuming another specialist has taken over the case, he or she will want to identify all the associations among these variables in order to assess the chances that a given patient has cancer or tuberculosis and especially to be able to determine whether or not an X-ray is necessary.  

3.2 Loading the data base

images 1

After having chosen which data base to open (a text file), a first assistant for the specification of the file is launched in order to choose the characters used as separators, then whether or not a line containing the name of variables appears at the beginning of the file, then, if the file contains missing values, what are the characters used to indicate these missing values.

images 2

A second assistant appears then to determine the operations to carry out on the fields.

images 3

This assistant allows specifying the type of the variables (discrete or continuous) and also allows not distributing some fields. For example, the patient number being unprofitable for lung disease modeling, it is not necessary to distribute this field.

This assistant also gives a set of information about the data.

images 4

The data base containing some missing values, one can specify a treatment for each variable having missing values

  1. Either the missing values are not treated, i.e. they are consider as a new modality (the non observation is sometime an information)
  2. Either they are replaced by some other value (the modal value if the variable is discrete, the average if it is continuous, or any other value)
  3. Either the power of inference is used to estimate the missing values with the current Bayesian network
    1. Dynamic completion: after each structural modification (arc addition, reversal, and removal), the missing values are dynamically estimated according to the resulting Bayesian network and the available data, and replaced by the corresponding modal values.
    2. Structural EM: after each structural modification and for each variable with missing values, the current Bayesian network and the available data are used to compute the probabilities of each modality. These probabilities are directly used for structural learning and parameter estimation. The main difference with the dynamic completion is then the lack of completion with a “winning” modality. This method is the more precise but also more time consuming than the other ones.

Each variable has its own missing value processing. If one wants to apply the same processing for all these variables, it is necessary to select all the columns (double click on a column or “Ctrl + A”). As we have very few missing values (2.77%), we choose to replace all the missing values by their modal values.

images 5

There exists another type of processing that is applicable to any values: the filtering. It would be possible for example to filter the smokers aged of less than 15 years. To filter these patients, one would click on the smoker’s icon images 6 and would click once on the True modality. This modality would become blue, which would indicate an And Filter (a second click switch to an Or Filter, red). We would then select the Age variable and would enter 15 as a min value.

images 7

As the data base contains a continuous variable, we have to make it discrete. There are several ways to do this:

  • Equal distance: the range is cut up into intervals of the same length.
  • Equal frequency: more interesting because it involves intervals having the same number of associated cases.
  • Decision tree: intervals are chosen according to the information they contribute to the target variable (a discrete target variable is always necessary).

Our specialist, who has decided to make the Age variable discrete using a decision tree, selects the target node Cancer.  He or she now wants to cut up the range into a maximum of four intervals (we need to consider the maximum value since the software can decide to use less). 

images 8

Once the assistant is closed, a worksheet appears.  It contains all the nodes that correspond to the distributed variables of the data base. Now we only need to run the learning algorithms to use the data base. Now, our specialist discovers the 9 nodes corresponding to the distributed variables:

images 9

3.3 Unsupervised learning for discovering all the probabilistic relations in the data (association discovery)

BayesiaLab uses three algorithms for association discovery: 

  • SopLEQ: a search based on a global characterization of the data and on the exploitation of the properties of equivalent Bayesian networks (fast)
  • Taboo: structural learning using a Taboo search.
  • Taboo order: learning using a Taboo search for finding the best order for the nodes (a more complete search and therefore longer)
images 10


Our specialist can use the SopLEQ method to discover the associations between nodes:  

images 11

To make it easier to see the Bayesian network, BayesiaLab has two tools that automatically reorganize the nodes: a dynamic positioning and a positioning with a genetic algorithm.   

images 12

The automatic positioning can be stop by using the red light of the state bar, images 13. The potentiometer located at the upper right corner can be used to vary the length of arcs.

Note that the “P” key allows launching the automatic positioning for a predefined number of steps.

images 14

Now, the physician has a usable representation of its data. The obtained Bayesian networks looks a lot to the one constructed thanks to the expert knowledge. However, the arc between Smoker and Cancer is inverted. Actually, without expert knowledge indicating that smoking is one cause of cancer, it is impossible to find the causal orientation of this arc. In Validation mode, BayesiaLab allows visualizing the set of arcs for which the orientation cannot be decided without expert knowledge (Inference Menu, Show the Edges).

By using the arc contextual menu, it is then possible to specify the orientation according to our knowledge. The changes of orientations are then reflected on the whole Bayesian network and the probability tables are automatically modified.

images 15

3.4 Using the knowledge hidden in the data base

We will now monitor the variables Smoker, Age, Cancer and Tuberculosis. In order to do this, we must first select the variables, and then choosing "monitor" from the contextual menu of one of the selected node (right click on the node):

images 16

We can notice that thanks to the discretization by decision tree with Cancer as target node, BayesiaLab cut out the Age variable in three intervals, instead of the four initially specified. These intervals correspond to people with an age less than or equal to 37 years, those with an age less than or equal to 60 years old, and those older than 60 years old.

images 17

The conditional probability table of the Age variable (as seen above) shows that 34.79% of the patients consulted were more than 60 years old. What would happen if we indicate that the patient that we are considering is more than 60 years old? Would the probability of having cancer increase? All we have to do is double click on the value ">60.152" on the Age variable and then analyze the new probabilities: 

images 18

Because of the propagation of the new information, the probability distributions for every node have been updated and we can see that the probability of having Cancer goes from 3.89% to 7.70%.

Let's look at the problem from the opposite direction: what characterizes people having cancer (on the variables that we monitored)? While clicking on the state "True" for cancer, we find: 

images 19

What we find, generally, are patients who are more than 60 years old (to 68.89%) and smokers (to 64.78%).  

The approach taken here is completely different from the one described in the previous chapter. Here, the specialist isn't only trying to verify his or her model. Instead, the goal is to discover knowledge from the data base. BayesiaLab provides a tool that, in a couple of clicks, leads to findings that would never have been as simple to discover by reading the 10000 lines of the data base.
BayesiaLab can generate an analysis report that generalizes this kind of analyses for a target node. The report is generated by setting Cancer as target variable and by choosing Target Analysis Report in the Inference Menu.

images 20

This report contains: the analysis context (the set of observations), the marginal distribution of the target node, a list of nodes sorted according to the information they bring to the knowledge of the target variable, and the profile for each modality of Cancer. We can see from this report that the patients that have cancer generally have an abnormal X-Ray, are older than 60 years old, suffer from dyspnea and are smoker.

3.5 Unsupervised learning for discovering new concepts (clustering, segmentation)
In addition to association discovery, BayesiaLab has clustering algorithms. The goal of these algorithms is the discovery of natural partitions from examples described in the data base; these partitions share a set of common properties. The learning menu allows launching the Clustering assistant.

images 21

Two methods are proposed:

  1. Fixed number of classes : the number of classes (partitions) is determined by the user.
  2. Automatic selection of the number of classes : the system searches for the best number of classes for optimal partitioning. The search method is based on a directed random walk. It needs then an initial number of classes and a maximum number of classes.
Some parameters can be set to tune these methods: the sample size used for learning, and the number of steps for the random walk.  This last number can be set extremely high since the walk can be stopped by using the red light of the state bar images 22 , while preserving the best solution

 

At the end of learning, the following report is generated. It contains the learning parameters and the learning results, gives the marginal distributions of the clusters found, sorted according to their weight, and the distribution of these clusters on the data base used for learning. As one can note it, these two distributions are not identical. The first one is theoretical; the second one is computed by assigning to each example the cluster with the higher probability. It could be possible, for example, that a modality representing 2% theoretically but do not have any assigned example in the data base, i.e. this modality never has the higher probability. Actually, this kind of cluster is automatically removed in the BayesiaLab’s clustering algorithm.

The second part of the report first gives a list of nodes sorted according to the information they bring to the Cluster node, then, for each cluster, it describes its probabilistic profile. In the report shown below, 5 clusters have been found:

  1. Patients in good health.
  2. Patients suffering from bronchitis and being mainly smoker.
  3. Aged smoker patients without dyspnea.
  4. Aged patient that do not smoke suffering from bronchitis.
  5. Patients having cancer or tuberculosis
images 23

images 24

Actually, this report used the various analysis functionalities proposed in the inference menu (cf Chapter 4).

3.6 Supervised learning dedicated entirely to characterizing a particular variable


BayesiaLab also offers a third type of Bayesian network learning from a data base. While the first two (association discovery and new concept discovery) are "unsupervised" algorithms, this last category contains “supervised” algorithms.  This is due to the fact that the induction of the Bayesian network is completely guided throughout the characterization of a particular node.

images 25


BayesiaLab uses several characterization algorithms. To illustrate this functionality, we leave our initial problem of lung diseases to be interested in a simplified modeling of car operation. The optimal Bayesian network having allowed the generation of the data ("Menu Data base - Generate data base" in Validation mode) is as follows:

images 26

Our target node is BatteryPower.

Naïve Bayes

This is a Bayesian network in which the target node is the father of all other nodes. In this structure, knowing the value of the target makes each node independent of the others. The low number of probabilities to be estimated makes this a very robust structure and its learning is very fast since the structure is known and then, we just have to learn the parameters. The screen shot below corresponds to the Naïve Bayes for BatteryPower after an arc analysis and automatic positioning.

images 27

Augmented Naive Bayes

This architecture is composed of a naive structure, enriched by relations among son nodes knowing the value of the target node. With this algorithm, we can get more precise results than with the naive architecture, but searching for the conditional relations costs more in time.

images 28

Sons & Spouses

Here, the target node is the father of a subset of nodes possibly having other relationships (spouses). This algorithm has the advantage of showing the set of nodes being indirectly linked to the target. The time cost is of the same order as for the augmented naive Bayes.

images 29

Markov Blanket

This algorithm searches for the nodes that belong to the Markov Blanket of the target: fathers, sons and spouses nodes. With this structure's search, which is entirely focused on the target node, we can get the sub-set of the nodes that are actually relevant in a time frame that is much lower than with the other two architectures Augmented Naïve Bayes and Sons & Spouses. This method is a very good tool for analyzing one variable

images 30

The observation of the nodes belonging to the Markov Blanket makes the target node independent of all the other nodes.

Augmented Markov Blanket

This involves the Markov Blanket learning with the addition of an unsupervised search for the probabilistic relations between each of the variables belonging to the Markov Blanket. This additional search entails a supplementary cost but in exchange there is a gain in precision with respect to the simple version. In our example, such relations do not exist and the resulting structure is then the Bayesian network obtained by the Markov Blanket algorithm.

Evaluation of the learned Bayesian networks

Once these supervised models learned, it is possible to evaluate their performances on a test set (ideally different from the learning set). In Validation mode, the inference menu proposes the item Targeted Evaluation. This evaluation provides:

  1. The total precision of the Bayesian network
  2. Three confusion matrices allowing to precisely measure the performances of the model for each modality (occurrence number, reliability: proportion of the cases with a correct prediction, and precision: proportion of the cases where the true value is correctly predicted).
  3. A lift curve relative to a target modality. It represents the detection rate of the target value (Y-axis) with respect to the number of processed cases (X-axis). Cases are first sorted with respect to the probability inferred by the model (cases with the highest target modality probability appear at the beginning of the list).Thus, if the target modality represents 10 % of the cases, the optimal lift curve has an Y-value at 100 % for an X-value at 10 % (all the target cases are detected with no false detection). Conversely, a random tagging policy leads to a diagonal lift curve (the expected detection rate is directly given by the selection rate: if you randomly select x% of the cases, the expected detection rate of the cases with the target value is x %).
  4. A ROC curve representing the False Positive rate of the target modality along X axis and the True Positive rate along the Y axis.
The screen shot below corresponds to the target evaluation of the naïve Bayes structure.
images 31