In the previous chapter, we described the application of Bayesian networks for evidential reasoning. In that example, all available knowledge was manually encoded in the Bayesian network. In this chapter, we additionally use data for defining Bayesian networks. This provides the basis for the following chapters, which will present applications that utilize machine learning to generate Bayesian networks entirely from data.
For machine learning with BayesiaLab, concepts derived from information theory, such as entropy and mutual information, are particularly important and should be understood by the researcher. However, these measures are not nearly as familiar to most scientists as common statistical measures, e.g., covariance and correlation.
We present a straightforward research task to introduce these presumably unfamiliar information-theoretic concepts. The objective is to establish the predictive importance of a range of variables concerning a target variable. The domain of this example is residential real estate, and we wish to examine the relationships between home characteristics and sales prices. In this context, it is natural to ask questions related to variable importance, such as, which is the most important predictive variable pertaining to home value? By attempting to answer this question, we can explain what entropy and mutual information mean in practice and how BayesiaLab computes these measures. In this process, we also demonstrate a number of BayesiaLab’s data-handling functions.
The dataset for this chapter’s exercise describes the sale of individual residential properties in Ames, Iowa, from 2006 to 2010. It contains a total of 2,930 observations and a large number of explanatory variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous). This dataset was first used by De Cock (2011) as an educational tool for statistics students. The objective of that study was the same as ours, i.e., modeling sale prices as a function of the property attributes.
To make this dataset more convenient for demonstration purposes, we reduced the total number of variables to 49. This pre-selection was fairly straightforward as numerous variables essentially do not apply to homes in Ames, e.g., variables relating to pool quality and pool size (there are practically no pools) or roof material (it is the same for virtually all homes).
De Cock, D. (2011). Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project. Journal of Statistics Education, 19(3).
As the first step, we start BayesiaLab’s Data Import Wizard by selecting Main Menu > Data > Open Data Source > Text File
.
Then, we select the file “AmesHousePriceData.csv”, a comma-delimited, flat text file, which you can download here:
This brings up the first screen of the Data Import Wizard, which previews the to-be-imported dataset.
For this example, the coding options for Missing Values and Filtered Values are particularly important. By default, BayesiaLab lists commonly used codes that indicate an absence of data, e.g., #NUL! or NR (non-response). In the Ames dataset, a blank field (“ ”) indicates a Missing Value, and “FV” stands for Filtered Value. These are recognized automatically. If other codes were used, we could add them to the respective lists on this screen.
Clicking Next, we proceed to the screen that allows us to define variable types.
BayesiaLab scans all variables in the database and provides a best guess regarding the variable type. Variables identified as Continuous are shown in turquoise, and those identified as Discrete are highlighted in pastel red.
In BayesiaLab, a Continuous variable contains a wide range of numerical values (discrete or continuous), which need to be transformed into a more limited number of discrete states. Some other variables in the database only have very few distinct numerical values to begin with, e.g., [1,2,3,4,5], and BayesiaLab automatically recognizes such variables as Discrete. For them, the number of numerical states is small enough that creating bins of values is unnecessary. Also, variables containing text values are automatically considered Discrete.
For this dataset, however, we need to make a number of adjustments to the suggested data types. For instance, we set all numerical variables to Continuous, including those highlighted in red that were originally identified as Discrete. As a result, all columns in the data preview of the Data Import Wizard are now shown in turquoise.
Given that our database contains some missing values, we need to select the type of Missing Values Processing in the next step. Instead of using ad hoc methods, such as pairwise or listwise deletion, BayesiaLab can leverage more sophisticated techniques and provide estimates (or temporary placeholders) for such missing values—without discarding any original data.
We will discuss Missing Values Processing in detail in Chapter 9. For this example, however, we leave the default setting of Structural EM.
At this point, however, we must introduce a very special type of missing value for which we must not generate any estimates: the so-called Filtered Value. Filtered Values are “impossible” values that do not or cannot exist given a specific set of evidence, as opposed to values that do exist but are not observed. For example, a home that does not have a garage cannot have any value for the variable Garage Type, such as Attached to Home, Detached from Home, or Basement Garage. If there is no garage, there cannot be a garage type. As a result, it makes no sense to calculate an estimate of a Filtered Value. In a database, unfortunately, a Filtered Value typically looks identical to a “true” missing value, i.e., one that does exist but is not observed. The database typically contains the same code, such as a blank, NULL, or N/A, for both cases.
Therefore, instead of “normal” missing values, which can be left as-is in the database, we must mark Filtered Values with a specific code, e.g., “FV.” The Filtered Value declaration should be done during data preparation before importing any data into BayesiaLab. BayesiaLab will then add a Filtered State (marked with “*”) to the discrete states of the variables with Filtered Values and utilize a special approach for actively disregarding such Filtered States so that they are not taken into account during machine learning or for estimating effects.
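This marking step belongs to data preparation and can be sketched in a few lines of preprocessing code. The records and column names below are illustrative, not taken from the actual Ames file:

```python
# Hypothetical sketch: before importing into BayesiaLab, mark "impossible"
# values as Filtered Values ("FV") while leaving true missing values blank.
rows = [
    {"GarageCars": "2", "GarageType": "Attchd"},
    {"GarageCars": "0", "GarageType": ""},  # no garage: GarageType is impossible
    {"GarageCars": "1", "GarageType": ""},  # garage exists: truly unobserved
]

for row in rows:
    # A home without a garage cannot have a garage type, so we mark the
    # blank as a Filtered Value ("FV") rather than leaving it as missing.
    if row["GarageCars"] == "0" and row["GarageType"] == "":
        row["GarageType"] = "FV"

print([r["GarageType"] for r in rows])  # ['Attchd', 'FV', '']
```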
As the next step in the Data Import Wizard, all Continuous values must be discretized (or binned). We show a sequence of screenshots to highlight the necessary steps. The initial view of the Discretization and Aggregation step appears.
By default, the first column is highlighted, which happens to be SalePrice, the variable of principal interest in this example. Instead of selecting any available automatic discretization algorithms, we pick Manual from the Type drop-down menu, which brings up the Cumulative Distribution Function (CDF) of the SalePrice variable.
By clicking Density Function, we can bring up the Probability Density Function (PDF) of SalePrice.
Either view allows us to examine the distribution and identify any salient points. We stay on the current screen to set the thresholds for each discretization bin. In many instances, we would use an algorithm to define the bins automatically. For a variable that will serve as the target, however, we usually rely on available expert knowledge to define the binning. In this example, we wish to have evenly spaced, round numbers for the interval boundaries. We add boundaries by right-clicking on the plot (right-clicking on an existing boundary removes it again). Furthermore, we can fine-tune a threshold’s position by entering a precise value in the Threshold Value field. We use {75000, 150000, 225000, 300000} as the interval boundaries.
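As a sketch, the manual binning defined by these thresholds can be reproduced in a few lines of Python; the interval labels are illustrative, and the intervals are taken as left-open and right-closed:

```python
import bisect

# Assign each SalePrice to one of the five bins defined by the manually
# chosen thresholds {75000, 150000, 225000, 300000}.
thresholds = [75000, 150000, 225000, 300000]
labels = ["<=75000", "75000-150000", "150000-225000", "225000-300000", ">300000"]

def bin_price(price):
    # bisect_left returns the index of the first threshold >= price,
    # which is exactly the bin index for left-open, right-closed intervals.
    return labels[bisect.bisect_left(thresholds, price)]

print(bin_price(68000))   # <=75000
print(bin_price(150000))  # 75000-150000
print(bin_price(310000))  # >300000
```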
Now that we have manually discretized the target variable SalePrice (column highlighted), we still need to discretize the remaining continuous variables. However, we will take advantage of an automatic discretization algorithm for those variables.
We click Select All Continuous. BayesiaLab automatically excludes SalePrice from this selection because we have already discretized it.
Numerous automatic discretization algorithms are available, but for the purpose of this example, we only consider the bivariate Tree discretization algorithm.
Please see the main entry for Discretization in this library for a detailed description of all available algorithms.
As its name suggests, the Tree discretization algorithm machine-learns a decision tree that uses the to-be-discretized variable to represent the conditional probability distribution of the target variable. Once the tree is learned, it is analyzed to extract the most useful thresholds. This is the method of choice in the context of Supervised Learning, i.e., when planning to machine-learn a model to predict the target variable.
At the same time, we do not recommend using Tree in the context of Unsupervised Learning. The Tree algorithm creates bins that are biased toward the designated target variable. Naturally, emphasizing one particular variable would run counter to the intent of Unsupervised Learning.
Note that if the to-be-discretized variable is independent of the target variable, it will be impossible to build a tree, and BayesiaLab will prompt the selection of a univariate discretization algorithm.
In this example, we focus our analysis on SalePrice, which can be considered a type of Supervised Learning. Therefore, we discretize all continuous variables with the Tree algorithm, using SalePrice as the Target variable. Note the Target must either be a Discrete variable or a Continuous variable that has already been manually discretized, which is the case for SalePrice.
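For intuition, a one-level sketch of tree-based discretization can be written in plain Python: it searches for the single threshold on a variable that most reduces the entropy of the target. This only illustrates the principle, not BayesiaLab’s actual algorithm (a full tree would recurse on each side to extract several thresholds), and the data are made up:

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy (bits) of a list of class labels.
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split(x, y):
    # Find the threshold on x that most reduces the entropy of y.
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best, best_gain = None, 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal x values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        gain = entropy([p[1] for p in pairs]) - cond
        if gain > best_gain:
            best_gain = gain
            best = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint threshold
    return best, best_gain

# Toy data: the price class flips around 1000 sq ft of living area.
area = [600, 800, 900, 1100, 1400, 1600]
price_class = ["low", "low", "low", "high", "high", "high"]
print(best_split(area, price_class))  # (1000.0, 1.0)
```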
Clicking Finish completes the import process.
The import process concludes with a pop-up window that offers to display the Import Report.
Clicking Yes brings up the Import Report, which can be saved in HTML format. It lists the discretization intervals of the Continuous variables, the States of the Discrete variables, and the discretization method used for each variable.
Once we close out this report, we can see the result of the import process. All the imported variables are now represented as nodes on the Graph Panel. The dashed borders of some nodes indicate that the corresponding variables were discretized during data import.
The lack of warning icons on any nodes indicates that all their parameters, i.e., their marginal probability distributions, were automatically estimated upon data import.
To verify, we open the Node Editor of SalePrice (Node Context Menu > Edit > Probability Distribution > Probabilistic
) and check the node’s marginal distribution.
Clicking on the Occurrences tab shows the observations per cell, which were used for the Maximum Likelihood Estimation of the marginal distribution.
The following animation shows all the above steps in a continuous workflow.
The Node Names displayed by default are taken directly from the column header of the imported dataset. To keep the Graph Panel uncluttered, we will keep these "short" names as the formal Node Names. On the other hand, we may want to have longer, more descriptive names available when interpreting the network or presenting it to an audience.
BayesiaLab offers three levels of "node names" for each node:
The Node Name uniquely identifies a node and is displayed by default.
A Long Name can be displayed instead of the Node Name on the Graph Panel, on the Monitors in the Monitor Panel, on reports, and in the context of many analysis functions.
A Node Comment provides additional space for supplemental information about a node. For instance, if nodes represent survey responses, the Node Comment could accommodate the verbatim survey question.
Long Names can be added to a network in two ways:
One by one for each node via the Properties tab of the Node Editor (Node Context Menu > Edit > Properties
).
Using a Dictionary to provide Long Names for multiple nodes at once.
Given that we want to apply Long Names to 49 nodes, using a Dictionary will be much more convenient. The format of a Dictionary is rather straightforward:
We define a plain text file that includes one Node Name per line. Spaces and special characters in the Node Name require backslash "\" as an escape character.
Each Node Name is followed by a delimiter (“=”, tab, or space) and then by the Long Name.
Here is a preview of the Dictionary:
You can download the complete Dictionary file here:
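For illustration, a Dictionary file in this format could also be generated programmatically. The node names and Long Names below are hypothetical, and escaping “=” is shown as one example of escaping a special character:

```python
# Illustrative sketch of generating a BayesiaLab Dictionary file that maps
# Node Names to Long Names. Names below are made up for the example.
long_names = {
    "SalePrice": "Sale Price (USD)",
    "LotArea": "Lot Area (sq ft)",
    "YearBuilt": "Year of Construction",
}

def escape(node_name):
    # Backslash-escape spaces (and other special characters, such as "="),
    # as required by the Dictionary format.
    return node_name.replace(" ", r"\ ").replace("=", r"\=")

lines = [f"{escape(name)}={long_name}" for name, long_name in long_names.items()]
print("\n".join(lines))

# To write the file: open("AmesLongNames.txt", "w").write("\n".join(lines))
```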
To attach this Dictionary, select Main Menu > Data > Associate Dictionary > Node > Long Names
.
Next, we select the Dictionary file, “AmesLongNames.txt”.
Upon loading the Dictionary file, the appearance of the network does not change. Only if an error occurred would a warning triangle appear in the lower right corner of the Graph Window. Also, any error details would be available in the Console.
We now have the option of turning on the Long Names for individual nodes or all nodes. For our purposes, we want to see the Long Names on all nodes:
Select all nodes, e.g., using Ctrl+A.
Node Context Menu > Properties > Rendering Properties > Show Long Name
.
Check the Show Long Name box in the pop-up window:
Click OK.
Instead of the "short" Node Names, BayesiaLab now displays the Long Names for all nodes.
Now that we have the Ames dataset represented internally in BayesiaLab, we need to become familiar with how BayesiaLab can quantify the probabilistic properties of these nodes and their relationships.
In traditional statistical analysis, we would presumably examine correlation and covariance between the variables to establish their relative importance, especially regarding the target variable Sale Price. In this chapter, we take an alternative approach based on information theory. Instead of computing the correlation coefficient, we consider how the uncertainty of the states of a to-be-predicted variable is affected by observing a predictor variable.
It is fair to say that we would need detailed information about a property to predict its value reasonably. However, in the absence of any specific information, would we be entirely uncertain about its value? Probably not. Even if we did not know anything about a particular house, we would have some contextual knowledge, i.e., that the house is in Ames, Iowa, rather than in midtown-Manhattan, and that the property is a private home rather than a shopping mall. That knowledge significantly reduces the range of possible values. True uncertainty would mean that a value of $0.01 is as probable as a value of $1 million or $1 billion. That is clearly not the case here. So, how uncertain are we about the value of a random home in Ames prior to learning anything about that particular home? The answer is that we can compute the entropy from the marginal probability distribution of home values in Ames. Since we have the Ames dataset already imported into BayesiaLab, we can display a histogram of SalePrice by bringing up its Monitor.
Entering the values displayed in the Monitor, we obtain:

H(SalePrice) = -\sum_{i=1}^{5} P(s_i) \log_2 P(s_i) \approx 1.85 \text{ bits}
No uncertainty means that the probability of one bin (or state) of SalePrice is 100%. This could be, for instance, P(SalePrice<=75000)=1.
We now compute the entropy of this distribution once again:

H(SalePrice) = -\left(1 \cdot \log_2 1 + 4 \cdot 0 \cdot \log_2 0\right) = 0

Here, 0 \cdot \log_2 0 is taken as 0, given the limit

\lim_{p \to 0^{+}} p \log_2 p = 0

This means that “no uncertainty” has zero Entropy.
What about the opposite end of the spectrum, i.e., complete uncertainty? Maximum uncertainty exists when all possible states of a distribution are equally probable, i.e., when we have a uniform distribution:

P(s_i) = \frac{1}{5}, \quad i = 1, \ldots, 5

Once again, we calculate the entropy:

H(SalePrice) = -\sum_{i=1}^{5} \frac{1}{5} \log_2 \frac{1}{5} = \log_2 5 \approx 2.3219 \text{ bits}
How do such entropy values help us to establish the importance of predictive variables? If there is no uncertainty regarding a variable, one state of this variable has to have a 100% probability, and predicting that particular state must be correct. This would be like predicting the presence of clouds during rain. On the other hand, if the probability distribution of the target variable is uniform, e.g., the outcome of a fair coin toss, a random prediction has to be correct with a probability of 50%.
In the context of house prices, knowing the marginal distribution of SalePrice and assuming this distribution is still true when we make the prediction, predicting SalePrice>=150000 would have a 41.28% probability of being correct, even if we knew nothing else. However, we would expect that observing an attribute of a specific home would reduce our uncertainty concerning its SalePrice and increase our probability of making a correct prediction for this particular home. In other words, conditional upon learning an attribute of a home, i.e., by observing a predictive variable, we expect a lower uncertainty for the target variable, SalePrice.
For instance, the moment we learn of a particular home that LotArea=200,000 (measured in square feet; 200,000 ft² ≈ 4.6 acres ≈ 18,581 m² ≈ 1.86 ha), and assuming, again, that the estimated marginal distribution is still true when we are making the prediction, we can be certain that SalePrice>300000. This means that upon learning the value of this home’s LotArea, the entropy of SalePrice goes from 1.85 to 0. Learning the lot size reduces our entropy by 1.85 bits. Alternatively, we can say that we gain information amounting to 1.85 bits.
The information gain or entropy reduction from learning about LotArea of this house is obvious. Observing a different home with a more common lot size, e.g., LotArea=10,000, would presumably provide less information and, thus, have less predictive value for that home.
However, we wish to know how much information we would gain on average—considering all values of LotArea along with their probabilities—by generally observing it as a predictive variable for SalePrice. Knowing this “average information gain” would reflect the predictive importance of observing the variable LotArea.
To compute this, we need two quantities: first, the marginal entropy of the target variable, H(SalePrice), and second, the conditional entropy of the target variable given the predictive variable:

H(SalePrice \mid LotArea) = \sum_{x} P(LotArea = x) \, H(SalePrice \mid LotArea = x)

which is equivalent to:

H(SalePrice \mid LotArea) = -\sum_{x} P(LotArea = x) \sum_{y} P(SalePrice = y \mid LotArea = x) \log_2 P(SalePrice = y \mid LotArea = x)

and furthermore also equivalent to:

H(SalePrice \mid LotArea) = -\sum_{x, y} P(LotArea = x, SalePrice = y) \log_2 P(SalePrice = y \mid LotArea = x)
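To make these equivalences concrete, here is a small numerical check with a made-up joint distribution (the state names are illustrative):

```python
from math import log2

def H(dist):
    # Entropy in bits of a probability distribution given as a list.
    return sum(-q * log2(q) for q in dist if q > 0)

# Made-up joint distribution P(X, Y) with a LotArea-like X and a
# SalePrice-like Y; p[x][y] = P(X=x, Y=y).
p = {
    "small": {"cheap": 0.30, "expensive": 0.10},
    "large": {"cheap": 0.15, "expensive": 0.45},
}
px = {x: sum(py.values()) for x, py in p.items()}  # marginal P(X)

# First form: H(Y|X) = sum_x P(x) * H(Y | X = x)
h1 = sum(px[x] * H([pxy / px[x] for pxy in p[x].values()]) for x in p)

# Last form: H(Y|X) = -sum_{x,y} P(x, y) * log2 P(y | x)
h3 = -sum(pxy * log2(pxy / px[x]) for x in p for pxy in p[x].values())

print(round(h1, 4), round(h3, 4))  # 0.8113 0.8113
assert abs(h1 - h3) < 1e-12
```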
Furthermore, we can see icons that indicate the presence of Missing Values and Filtered Values in the respective nodes.
Beyond our common-sense understanding of uncertainty, there is a more formal quantification of uncertainty in information theory: Entropy. More specifically, we use Entropy to quantify the uncertainty manifested in the probability distribution of a variable or of a set of variables. In the context of our example, the uncertainty relates to the to-be-predicted home price.
This Monitor reflects the discretization intervals that we defined during the data import. It is now easy to see the frequency of prices in each price interval, i.e., the marginal distribution of SalePrice. For instance, only about 2% of homes sold had a price of $75,000 or less. On the basis of this probability distribution, we can now compute the Entropy. The definition of Entropy for a discrete distribution is:

H(X) = -\sum_{x \in X} P(x) \log_2 P(x)
In information theory, the unit of information is the “bit,” which is why we use the base-2 logarithm. On its own, the calculated value of 1.85 bits may not be a meaningful measure. To understand how much or how little uncertainty this value represents, we compare it to two easily interpretable reference points, i.e., “no uncertainty” and “complete uncertainty.”
The value 5 in the logarithm of the simplified equation reflects the number of states. This means that the Entropy is a function of the variable’s discretization. In addition to the previously computed Marginal Entropy of 1.85, we now have the values 0 and 2.3219 for “no uncertainty” and “complete uncertainty,” respectively.
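These two reference points are easy to verify in code. A minimal sketch of the entropy computation, using the convention that 0 · log₂ 0 = 0:

```python
from math import log2

def entropy(probs):
    # Shannon entropy in bits; terms with p = 0 are dropped,
    # reflecting the convention 0 * log2(0) = 0.
    return sum(-p * log2(p) for p in probs if p > 0)

# "No uncertainty": one state holds all the probability mass.
print(entropy([1.0, 0, 0, 0, 0]))  # 0.0

# "Complete uncertainty": uniform distribution over 5 states, log2(5) bits.
print(round(entropy([0.2] * 5), 4))  # 2.3219
```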
The difference between the marginal entropy of the target variable and the conditional entropy of the target variable given the predictive variable is formally known as Mutual Information, denoted by I. In our example, the Mutual Information between SalePrice and LotArea is the marginal entropy of SalePrice minus the conditional entropy of SalePrice given LotArea:

I(SalePrice; LotArea) = H(SalePrice) - H(SalePrice \mid LotArea)
More generally, the Mutual Information I between variables X and Y is defined by:

I(X; Y) = \sum_{y \in Y} \sum_{x \in X} P(x, y) \log_2 \frac{P(x, y)}{P(x) \, P(y)}
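This definition can be checked numerically against the entropy-difference form, again with a small, made-up joint distribution:

```python
from math import log2

# Made-up joint distribution over (X, Y); keys are (x, y) state pairs.
p = {("small", "cheap"): 0.30, ("small", "expensive"): 0.10,
     ("large", "cheap"): 0.15, ("large", "expensive"): 0.45}

px, py = {}, {}
for (x, y), q in p.items():
    px[x] = px.get(x, 0.0) + q  # marginal P(X)
    py[y] = py.get(y, 0.0) + q  # marginal P(Y)

# Direct definition: I(X;Y) = sum_{x,y} P(x,y) * log2( P(x,y) / (P(x)P(y)) )
mi = sum(q * log2(q / (px[x] * py[y])) for (x, y), q in p.items())

# Entropy-difference form: I(X;Y) = H(Y) - H(Y|X)
h_y = sum(-q * log2(q) for q in py.values())
h_y_given_x = -sum(q * log2(q / px[x]) for (x, y), q in p.items())

assert abs(mi - (h_y - h_y_given_x)) < 1e-12
print(round(mi, 4))  # 0.1815
```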
This allows us to compute the Mutual Information between a target variable and any possible predictor. As a result, we can find out which predictor provides the maximum information gain and, thus, has the greatest predictive importance.
Now we can see the real benefit of bringing all variables as nodes into BayesiaLab. To calculate Mutual Information, all the terms of the equation can be easily computed with BayesiaLab once we have a fully specified network.
We start with a pair of nodes, namely Neighborhood and SalePrice. As opposed to LotArea, which is a discretized Continuous variable, Neighborhood is categorical, and, as such, it has been automatically treated as Discrete in BayesiaLab. This is the reason the node corresponding to Neighborhood has a solid border. We now add an arc between these two nodes to explicitly represent the dependency between them:
Counting all records, we obtain the marginal count of each state of Neighborhood.
Given that our Bayesian network structure says that Neighborhood is the parent node of SalePrice, we now count the states of SalePrice conditional on Neighborhood. This is simply a cross-tabulation.
Once we translate these counts into probabilities (by normalizing by the total number of occurrences for each row in the table), this table becomes a CPT. Together, the network structure (qualitative) and the CPTs (quantitative) comprise the Bayesian network.
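The normalization step can be sketched as follows; the neighborhoods and counts are made up for illustration:

```python
# Sketch of Maximum Likelihood parameter estimation: normalize the
# co-occurrence counts of (Neighborhood, SalePrice) row by row to obtain
# the CPT P(SalePrice | Neighborhood). Counts are illustrative only.
counts = {
    "OldTown": {"<=75000": 24, "75000-150000": 180, ">150000": 35},
    "NridgHt": {"<=75000": 0,  "75000-150000": 5,   ">150000": 161},
}

cpt = {}
for hood, row in counts.items():
    total = sum(row.values())
    cpt[hood] = {state: n / total for state, n in row.items()}

for hood, row in cpt.items():
    # Each row of a CPT must sum to 1.
    assert abs(sum(row.values()) - 1.0) < 1e-12
    print(hood, {s: round(p, 3) for s, p in row.items()})
```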
In practice, however, we do not need to bother with these individual steps. Rather, BayesiaLab can automatically learn all marginal and conditional probabilities from the associated database. We select Main Menu > Learning > Parameter Estimation
to perform this task.
This model now provides the basis for computing the Mutual Information between Neighborhood and SalePrice. BayesiaLab computes Mutual Information on demand and can display its value in numerous ways. For instance, in Validation Mode (F5), we can select Main Menu > Analysis > Visual > Arc > Overall > Mutual Information
.
The top number in the Arc Comment box shows the actual Mutual Information value, i.e., 0.6462 bits. We should also point out that Mutual Information is a symmetric measure: the amount of Mutual Information that Neighborhood provides about SalePrice is the same as the amount that SalePrice provides about Neighborhood. This means that knowing SalePrice also reduces the uncertainty with regard to Neighborhood, even though that direction may not be of interest.
Without context, however, the value of Mutual Information is not meaningful. Hence, BayesiaLab provides additional, normalized measures, such as the Relative Mutual Information. Previously, we computed the Marginal Entropy of SalePrice to be 1.85 bits. Dividing the Mutual Information by the Marginal Entropy of SalePrice gives us a sense of how much our uncertainty regarding SalePrice is reduced:
Conversely, the red number shows the Relative Mutual Information with regard to the parent node, Neighborhood. Here, we divide the Mutual Information, which is the same in both directions, by the Marginal Entropy of Neighborhood:
This means that, by knowing Neighborhood, we reduce our uncertainty regarding SalePrice by 32% on average. Conversely, by knowing SalePrice, we reduce our uncertainty regarding Neighborhood by 14% on average. These values are readily interpretable. However, to determine which node is the most important, we need to know these values for all nodes.
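The two directions of Relative Mutual Information can be illustrated with a toy joint distribution (all numbers are made up); the asymmetry stems solely from the different marginal entropies, since the Mutual Information itself is symmetric:

```python
from math import log2

def H(dist):
    # Entropy in bits of a probability distribution given as a list.
    return sum(-q * log2(q) for q in dist if q > 0)

# Made-up joint distribution over (X, Y), used only for the normalization.
p = {("A", "low"): 0.25, ("A", "high"): 0.05,
     ("B", "low"): 0.15, ("B", "high"): 0.55}

px, py = {}, {}
for (x, y), q in p.items():
    px[x] = px.get(x, 0.0) + q
    py[y] = py.get(y, 0.0) + q

# Mutual Information is symmetric...
mi = sum(q * log2(q / (px[x] * py[y])) for (x, y), q in p.items())

# ...but Relative Mutual Information is not, because each direction
# normalizes by a different marginal entropy.
rel_to_y = mi / H(py.values())  # share of H(Y) removed by knowing X
rel_to_x = mi / H(px.values())  # share of H(X) removed by knowing Y
print(round(rel_to_y, 3), round(rel_to_x, 3))
```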
For the node SalePrice, we select Node Context Menu > Set as Target Node
. Alternatively, we can double-click the node while pressing T
.
Strictly speaking, we are not learning a network in the true sense of machine learning. Rather, we are specifying a naive structure, i.e., arcs from the Target Node to all other nodes, and then estimating the parameters.
Due to its simplicity, the Naive Bayes network is presumably the most commonly used Bayesian network. As a result, we find it implemented in many software packages. For instance, the so-called Bayesian anti-spam systems are based on this model.
However, it is important to note that the Naive Bayes network is merely the first step towards embracing the Bayesian network paradigm.
The following standalone graphic highlights the order of the arcs and nodes in this Naive Bayes network:
As an alternative to this visualization, we can run a report: Main Menu > Analysis > Report > Relationship
:
Wouldn’t this report look the same if it were computed based on correlation? In fact, the rightmost column in this Relationship Analysis Report shows Pearson’s Correlation for reference. As we can see, the order would be different if we chose Pearson’s Correlation as the ranking metric.
So, what have we gained over correlation? One of the key advantages of Mutual Information is that it can be computed (and interpreted) between numerical and categorical variables without any variable transformation. For instance, we can easily compute the Mutual Information between the categorical variable Neighborhood and the numerical variable SalePrice. The question regarding the most important predictive variable can now be answered: it is Neighborhood.
Now that we have established the central role of Entropy and Mutual Information, we can apply these concepts in the next chapters for machine learning and network analysis.
The yellow warning triangle reminds us that the Conditional Probability Table (CPT) of SalePrice given Neighborhood has not been defined yet. In Chapter 4, we defined the CPT based on existing knowledge. Here, however, we have an associated database, which BayesiaLab can use to estimate the CPT via Maximum Likelihood, i.e., by “counting” the (co-)occurrences of the states of the variables in our data. The table below shows the first 10 records of the variables SalePrice and Neighborhood from the Ames dataset.
Upon completing the Parameter Estimation, the warning triangle has disappeared, and we can verify the results by double-clicking SalePrice to open the Node Editor. Under the tab Probability Distribution > Probabilistic
we can see the probabilities of the states of SalePrice given Neighborhood. The CPT presented in the Node Editor is indeed identical to the table shown above.
The value of Mutual Information is now represented graphically in the thickness of the arc. This does not give us much insight because we only have a single arc in this network. So, we click the Show Arc Comments icon in the Toolbar to show the numerical values.
Rather than computing the relationships individually for each pair of nodes, we ask BayesiaLab to estimate a Naive Bayes network. A Naive Bayes structure is a network with only one parent node, the Target Node, i.e., the only arcs in the graph are those directly connecting the Target Node to all other nodes. By designating SalePrice as the Target Node, we can automatically compute its Mutual Information with all other available nodes.
The special status of the Target Node is highlighted by a bullseye symbol. We can now proceed to learn the Naive Bayes network: select Main Menu > Learning > Supervised Learning > Naive Bayes
.
Now we have the network that allows computing the Mutual Information between the Target Node and all other nodes.
We switch to Validation Mode (F5) and select Main Menu > Analysis > Visual > Overall > Arc > Mutual Information
.
The different levels of Mutual Information are now reflected in the thickness of the arcs.
However, given the grid layout of the nodes and the overlapping arcs, it is difficult to establish a rank order of the nodes in terms of Mutual Information. To address this, we adjust the layout and select Main Menu > View > Layout > Radial Layout
.
This generates a circular arrangement of all nodes with the Target Node, SalePrice, in the center. Clicking the Stretch icon repeatedly, we expand the network to make it fit into the available screen space widthwise.
Also, having run the Radial Layout while the Arc Mutual Information function was still active, the arcs and nodes are ordered clockwise from strongest to weakest Mutual Information.
To improve the interpretability further, we select Main Menu > View > Hide Information
. Alternatively, we click the Hide Information icon in the Toolbar. This removes the information icons from the arcs; their presence indicates that further information is available to display, e.g., the numerical value of the Mutual Information of each arc.
This illustration shows that Neighborhood provides the highest amount of Mutual Information and, at the opposite end of the range, RoofMtl (Roof Material) provides the least.