In BayesiaLab, nearly all learning and analysis functions are based on principles and metrics from the field of Information Theory.
In this section, we summarize some of these concepts and attempt to relate them to the corresponding BayesiaLab functions.
Furthermore, we include several relevant statistical concepts for understanding BayesiaLab's estimates and visualizations.
A Joint Probability Distribution is the distribution of Joint Probabilities over all combinations of variable values in a domain.
A Joint Probability is the probability of specific values of variables jointly occurring in that domain.
We observe the variables Hair Color and Eye Color in a population of college students.
Joint Probability refers to the probability of specific values for Hair Color and Eye Color jointly occurring in this population.
For instance,
P(Eye Color=Blue, Hair Color=Blond)=15.88% means that the probability of a student having blue eyes and blond hair in the given population is 15.88%.
P(Eye Color=Green, Hair Color=Black)=0.84% means that the probability of having green eyes and black hair in that population is only 0.84%.
We can now look across all possible combinations of Hair Color and Eye Color, compute all Joint Probabilities and list them in a Joint Probability Table, with one row for each combination of the states of the variables.
In this example, the size of the Joint Probability Table is manageable: Number of States (Hair Color) × Number of States (Eye Color) = 4 × 4 = 16
This Joint Probability Table is a direct and complete representation of the Joint Probability Distribution for the variables Hair Color and Eye Color:
As the Joint Probability Distribution covers all possible combinations, it represents all regularities and patterns (or the lack thereof) within a domain.
Knowing the Joint Probability Distribution is required for performing two key operations for data analysis and inference:
Marginalization, which is calculating the marginal probability of a variable, e.g., P(Hair Color=Black)=18.25%.
Conditioning, which refers to inferring the values of a variable, given a specific value of another variable, e.g., P(Hair Color=Blond | Eye Color=Blue)=43.7%.
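These two operations can be sketched in plain Python using the Joint Probability Table from this section (the function names are illustrative, not BayesiaLab API):

```python
# Joint probabilities (in percent) of Hair Color and Eye Color,
# taken from the Joint Probability Table in this section (Snee, 1974).
joint = {
    ("Black", "Brown"): 11.49, ("Brown", "Brown"): 20.10,
    ("Red", "Brown"): 4.39,    ("Blond", "Brown"): 1.18,
    ("Black", "Blue"): 3.38,   ("Brown", "Blue"): 14.19,
    ("Red", "Blue"): 2.87,     ("Blond", "Blue"): 15.88,
    ("Black", "Hazel"): 2.53,  ("Brown", "Hazel"): 9.12,
    ("Red", "Hazel"): 2.36,    ("Blond", "Hazel"): 1.69,
    ("Black", "Green"): 0.84,  ("Brown", "Green"): 4.90,
    ("Red", "Green"): 2.36,    ("Blond", "Green"): 2.70,
}

def marginal_hair(hair):
    # Marginalization: sum the joint probabilities over all Eye Color states.
    return sum(p for (h, e), p in joint.items() if h == hair)

def conditional_hair_given_eye(hair, eye):
    # Conditioning: divide the joint probability by the marginal of the evidence.
    p_eye = sum(p for (h, e), p in joint.items() if e == eye)
    return joint[(hair, eye)] / p_eye

print(round(marginal_hair("Black"), 2))                       # ~18.24 (percent)
print(round(conditional_hair_given_eye("Blond", "Blue"), 3))  # ~0.437
```

The small discrepancy against the 18.25% quoted above stems from the two-decimal rounding of the table entries.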
In high-dimensional domains, however, calculating and listing the Joint Probabilities in a Joint Probability Table can become intractable.
The size of a Joint Probability Table grows exponentially with the number of variables. For example, if we had 20 variables with 4 states each, the size of the corresponding Joint Probability Table would exceed 1 trillion rows.
While the arithmetic is straightforward, the sheer number of calculations can easily exceed the available computational power, both for generating the Joint Probability Table as well as for performing Marginalization and Conditioning.
"The only way to deal with such large distributions is to constrain the nature of the variable interactions in some manner, both to render specification and ultimately inference in such systems tractable. The key idea is to specify which variables are independent of others, leading to a structured factorisation of the joint probability distribution. [Bayesian] Belief Networks are a convenient framework for representing such factorisations into local conditional distributions." (Barber, 2012)
This means that Bayesian networks are extremely practical for approximating Joint Probability Distributions in complex, high-dimensional problem domains.
Barber, D. (2012). Bayesian Reasoning and Machine Learning. Cambridge: Cambridge University Press. doi:10.1017/CBO9780511804779
| Hair Color | Eye Color | Joint Probability |
|---|---|---|
| Black | Brown | 11.49% |
| Brown | Brown | 20.10% |
| Red | Brown | 4.39% |
| Blond | Brown | 1.18% |
| Black | Blue | 3.38% |
| Brown | Blue | 14.19% |
| Red | Blue | 2.87% |
| Blond | Blue | 15.88% |
| Black | Hazel | 2.53% |
| Brown | Hazel | 9.12% |
| Red | Hazel | 2.36% |
| Blond | Hazel | 1.69% |
| Black | Green | 0.84% |
| Brown | Green | 4.90% |
| Red | Green | 2.36% |
| Blond | Green | 2.70% |
| Sum | | 100.00% |
DL(G) = \sum\limits_{i=1}^{n} \left( \log_2(n) + \log_2 \binom{n}{\left\| Pa_i \right\|} \right)
where n is the number of nodes and ‖Pa_i‖ is the number of parents of node X_i in graph G.
As the probability p cannot be known prior to learning the network, BayesiaLab uses the classical heuristic of encoding each probability with \frac{\log_2(N)}{2} bits, where N is the number of samples in the dataset.
The Minimum Description Length Score (MDL Score) is derived from Information Theory and has been used extensively in the Artificial Intelligence community.
It consists of the sum of two components that estimate:
the minimum number of bits required to represent a model, and
the minimum number of bits required to represent the data given the model.
However, in the specific context of Bayesian networks, we need to explain the exact meaning and the notation of these two components:
The goal of this structural part is to apply Occam's Razor, or the law of parsimony, i.e., to choose the simplest hypothesis, all other things being equal.
The data component measures how well the model fits the data: the less probable the observed dataset is under the Bayesian network model, the more bits are required to describe it.
BayesiaLab attempts to minimize the MDL Score by evaluating candidate networks during structural learning.
“Bayesian inference is important because it provides a normative and general-purpose procedure for reasoning under uncertainty.”
Inductive Reasoning: Experimental, Developmental, and Computational Approaches, edited by Aidan Feeney and Evan Heit
Bayesian inference refers to an approach first proposed by Rev. Thomas Bayes (1702-1761), whose rule allows calculating the probability of an event A upon observing an event B.
Bayes' rule or Bayes' theorem relates the conditional and marginal probabilities of events A and B (provided that the probability of B is not equal to zero). More specifically, Bayes' rule allows calculating the conditional probability of event A given event B from the inverse conditional probability of event B given event A:

P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}
P(A|B) is the conditional probability of event A given event B. It is also called the "posterior" probability because it depends on knowledge of event B. This is the probability of interest.
Note that referring to "posterior" should not be interpreted in a temporal sense, i.e., it does not imply a temporal order between events A and B.
P(A) is the prior probability (or "unconditional" or "marginal" probability) of event A. The unconditional probability P(A) was first called "a priori" by Sir Ronald A. Fisher. It is a "prior" probability because it does not consider any information about event B.
P(B) is the prior or marginal probability of event B.
Note that "prior," just like "posterior," does not imply a temporal order.
P(B|A)/P(B) is the Bayes factor or likelihood ratio.
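As a numeric sketch, Bayes' rule can be checked in Python with marginals summed from the Hair Color / Eye Color joint table used earlier in this section:

```python
# Marginals computed from the Hair Color / Eye Color joint table:
p_blue = 0.3632             # P(Eye Color = Blue)
p_blond = 0.2145            # P(Hair Color = Blond)
p_blond_and_blue = 0.1588   # P(Hair Color = Blond, Eye Color = Blue)
p_blond_given_blue = p_blond_and_blue / p_blue  # P(B|A)

# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
p_blue_given_blond = p_blond_given_blue * p_blue / p_blond
print(round(p_blue_given_blond, 3))  # 0.74
```

The result agrees with computing P(Blue | Blond) directly from the joint table, i.e., 0.1588 / 0.2145.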
Bayesian networks are models that consist of two parts:
A qualitative part to represent the dependencies using a Directed Acyclic Graph (DAG).
A quantitative part, using local probability distributions, for specifying the probabilistic relationships.
A Directed Acyclic Graph (DAG) consists of nodes and directed links:
Nodes represent variables of interest (e.g., the temperature of a device, the gender of a patient, a feature of an object, or the occurrence of an event).
Nodes can correspond to symbolic/categorical variables, numerical variables with discrete values, or discretized continuous variables.
Directed arcs represent statistical (informational) or causal dependencies among the variables. The directions are used to define kinship relations, i.e., parent-child relationships.
For example, in a Bayesian network with an arc from X to Y, X is the parent node of Y, and Y is the child node.
The local probability distributions can be either marginal for nodes without parents (Root Nodes) or conditional for nodes with parents.
In the latter case, the dependencies are quantified by Conditional Probability Tables (CPT) for each node given its parents in the Directed Acyclic Graph (DAG).
Once fully specified, a Bayesian network compactly represents the Joint Probability Distribution (JPD).
Thus, the Bayesian network can be used for computing the posterior probabilities of any subset of nodes given evidence set on any other subset.
The following illustration shows a simple Bayesian network, which consists of only two nodes and one directed arc.
This Bayesian network represents the Joint Probability Distribution (JPD) of the variables Eye Color and Hair Color in a population of students (Snee, 1974).
Eye Color is a Root Node and, therefore, does not have any Parents. In other words, Eye Color does not depend on any other node.
As a result, the table associated with Eye Color is a Probability Table, i.e., it represents the marginal distribution of Eye Color unconditionally.
On the other hand, the probabilities of Hair Color are only defined conditionally upon the values of its parent node, Eye Color.
Hence, the probabilities of Hair Color are provided in a Conditional Probability Table (CPT).
It is important to point out that this Bayesian network does not imply any causal relationships, even though the arc direction may suggest that to a casual observer.
The arc direction merely defines the parent-child relationship of the nodes for purposes of representing the Joint Probability Distribution (JPD).
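The factorization behind this two-node network can be sketched as follows; the CPT entry is derived from the joint table earlier in this section:

```python
# P(Eye Color, Hair Color) = P(Eye Color) * P(Hair Color | Eye Color)
p_eye_blue = 0.3632                        # marginal of the Root Node Eye Color
p_blond_given_blue = 0.1588 / p_eye_blue   # one CPT entry of Hair Color

# The factorization recovers the original joint probability:
p_joint = p_eye_blue * p_blond_given_blue
print(round(p_joint, 4))  # 0.1588
```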
To calculate the description length of the data given the Bayesian network, we utilize the fact that the description length corresponds to the negative log-probability of the observed data inferred by the model:

DL(D|B) = - \sum\limits_{j=1}^{N} \log_2 \left( p(o_j) \right)

where p(o_j) is the joint probability that the Bayesian network B returns for the observation o_j in row j of the dataset D.
The chain rule allows rewriting this joint probability as the product of the local conditional probabilities defined in the network:

p(o_j) = \prod\limits_{i=1}^{n} p\left( x_i^j \mid Pa_i^j \right)
Normalized Entropy is a metric that takes into account the maximum possible value of Entropy and returns a normalized measure of the uncertainty associated with the variable:

H_N(X) = \frac{H(X)}{\log_2(S_X)}

where S_X is the number of states of the variable X.
In this new example, we now compare the variables X1 and X2, which each represent ball colors:
X1 ∈ {blue, red}
X2 ∈ {blue, red, green, yellow, purple, orange, brown, black}
Normalized Entropy allows us to compare the degree of uncertainty even though these two variables have different numbers of states, i.e., two versus eight states.
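A small Python sketch illustrates the comparison; since the section does not list the actual distributions of X1 and X2, uniform distributions are assumed here:

```python
from math import log2

def entropy(probs):
    # H(X) = -sum_x p(x) * log2(p(x)); zero-probability states contribute nothing
    return -sum(p * log2(p) for p in probs if p > 0)

def normalized_entropy(probs):
    # H(X) divided by the maximum possible Entropy, log2(number of states)
    return entropy(probs) / log2(len(probs))

x1 = [0.5] * 2    # two ball colors, assumed equally likely
x2 = [0.125] * 8  # eight ball colors, assumed equally likely

print(entropy(x1), entropy(x2))                        # 1.0 vs. 3.0 bits
print(normalized_entropy(x1), normalized_entropy(x2))  # both 1.0 -> comparable
```

The raw Entropies differ (1 bit vs. 3 bits), but the Normalized Entropies coincide, which is what makes the metric comparable across variables.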
In BayesiaLab, the values of Entropy and Normalized Entropy can be accessed in a number of ways:
You can also sort the Monitors in the Monitor Panel according to their Normalized Entropy via Monitor Context Menu > Sort > Normalized Entropy.
The Normalized Entropy is also available as a Node Analysis metric for Size and Color in the 2D and 3D Mapping Tools.
In Function Nodes, Entropy and Normalized Entropy are available as Inference Functions in the Equation tab.
Entropy: Entropy(?X1?, False)
Normalized Entropy: Entropy(?X1?, True)
To illustrate these concepts, we use the familiar Visit Asia network:
In BayesiaLab, the Expected Log-Loss values can be shown in the context of the Monitors.
On the left, the Monitors of the two nodes show their marginal distributions.
Instead of replacing the states' probabilities with the Expected Log-Loss values in a Monitor, you can bring up the Expected Log-Loss values ad hoc as a Tooltip.
Then, when you hover over any Monitor with your cursor, a Tooltip shows the Expected Log-Loss values.
Entropy is expressed in bits and defined as follows:

H(X) = - \sum\limits_{x \in X} p(x) \log_2 \left( p(x) \right)
Let's assume we have four containers, A through D, which are filled with balls that can be either blue or red.
Container A is filled exclusively with blue balls.
Container B has an equal amount of red and blue balls.
In Container C, 10.89% of all balls are blue, and the remainder is red.
Container D only holds red balls.
Within each container, the order of balls is entirely random.
A volunteer who already knows the proportions of red and blue balls in each container now randomly draws one ball from each container. What is his degree of uncertainty regarding the ball color at the moment of each draw?
Needless to say, with Containers A and D, there is no uncertainty at all. From Containers A and D, he will draw a blue and red ball, respectively, with perfect certainty. What about the degree of certainty or, rather, uncertainty for Containers B and C?
The concept of Entropy can formally represent the degree of uncertainty.
Using the definition of Entropy from above, we can compute the Entropy value applicable to each draw.
We can also plot Entropy as a function of the probability of drawing a red ball.
This was an example of a variable with two states only. As we introduce more possible states, e.g., another ball color, the maximum possible Entropy increases.
As a result, one cannot compare the Entropy values of variables with different numbers of states.
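The container example can be verified with a short Python sketch of the binary Entropy function:

```python
from math import log2

def binary_entropy(p_red):
    # Entropy in bits of a draw with P(red) = p_red and P(blue) = 1 - p_red;
    # adding 0.0 normalizes the sign of -0.0 in the certain cases.
    return -sum(p * log2(p) for p in (p_red, 1 - p_red) if p > 0) + 0.0

# Containers A-D, with P(red) taken from the example above:
for name, p_red in [("A", 0.0), ("B", 0.5), ("C", 0.8911), ("D", 1.0)]:
    print(name, round(binary_entropy(p_red), 3))  # A 0.0, B 1.0, C 0.497, D 0.0
```

Containers A and D yield 0 bits (no uncertainty), Container B yields the maximum of 1 bit, and Container C falls in between at roughly 0.5 bits.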
DL(B) is the number of bits required to represent a Bayesian network B. We can break down this value into the sum of two components:

DL(B) = DL(G) + DL(P)

where
DL(G) stands for the number of bits required to represent the graph G of the Bayesian network, and
DL(P) represents the number of bits required to represent the set of probability tables P.
To calculate DL(G), we need to determine the number of nodes and the number of their parent nodes:
n is the number of random variables (nodes),
Pa_i is the set of the random variables that are parents of X_i in graph G,
and ‖Pa_i‖ is the number of parents of the random variable X_i.
Computing DL(P) is straightforward, as it is proportional to the number of cells in all probability tables:
S_i is the number of states of the random variable X_i, and
p is the probability associated with a cell of a probability table.
Calculating Complexity: DL(B)

"The minimum number of bits required to represent a model" is denoted DL(B) ("Description Length of the Bayesian network B") and refers to the structural complexity of the Bayesian network model B, which includes the network graph and all probability tables.
For brevity, we often use the shorthand "complexity" or "structure" to refer to DL(B).
Small values of DL(B) suggest a simple model structure, and large values a complex model.

Calculating Fit: DL(D|B)
"The minimum number of bits required to represent the data given the model" is denoted DL(D|B) ("Description Length of the data D given the Bayesian network B") and refers to the likelihood of the data with respect to the Bayesian network model B.
Put simply, DL(D|B) refers to the "fit" of the model to the data.
Small values of DL(D|B) suggest a well-fitting model; large values, conversely, imply a poor fit.
o_j is the n-dimensional observation described in row j, and
p(o_j) is the joint probability of this observation returned by the Bayesian network B.
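The fit term can be sketched directly from this definition; the observation probabilities below are hypothetical:

```python
from math import log2

def dl_data_given_network(observation_probs):
    # DL(D|B) = -sum_j log2( p(o_j) ), summed over the rows of the dataset
    return -sum(log2(p) for p in observation_probs)

# Hypothetical joint probabilities a network might return for four rows:
probs = [0.5, 0.25, 0.25, 0.5]
print(dl_data_given_network(probs))  # 6.0 bits
```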
In Validation Mode, with the Information Mode activated, hovering over a Monitor with your cursor brings up a Tooltip that includes Entropy and Normalized Entropy.
The Log-Loss LL_B(E) reflects the number of bits required to encode an n-dimensional piece of evidence E (or observation) given the current Bayesian network B. As a shorthand for "the number of bits required to encode," we use the term "cost" in the sense that "more bits required" means computationally "more expensive."

LL_B(E) = - \log_2 \left( P_B(E) \right)

where P_B(E) is the joint probability of the evidence E computed by the network B.
Furthermore, one of the key metrics in Information Theory is Entropy:

H_B(X) = - \sum\limits_{x \in X} P_B(x) \log_2 \left( P_B(x) \right) = \sum\limits_{x \in X} P_B(x) LL_B(x)

As a result, Entropy can be considered the sum of the Expected Log-Loss values of each state x of variable X given network B.
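This decomposition is easy to check in Python with a hypothetical two-state marginal distribution:

```python
from math import log2

dist = {"True": 0.25, "False": 0.75}  # hypothetical marginal distribution

# Expected Log-Loss of each state x: p(x) * LL(x) = -p(x) * log2(p(x))
expected_log_loss = {x: -p * log2(p) for x, p in dist.items()}

# Entropy is the sum of the Expected Log-Loss values of all states
entropy = sum(expected_log_loss.values())
print(round(expected_log_loss["True"], 4))  # 0.5
print(round(entropy, 4))                    # 0.8113
```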
We consider two nodes in the VisitAsia.xbl network.
On the right, we set evidence on the first node, which updates the probability distribution of the second node.
On the Monitor of that node, we now select Monitor Context Menu > Show Expected Log-Loss,
so that the Expected Log-Loss values of its states are shown instead of their probabilities.
This is an interesting example because setting the evidence does reduce the Entropy of the node but does not seem to change its Expected Log-Loss values.
The following plot illustrates the Entropy and the Expected Log-Loss values as a function of the state probability p.
In this binary case, the curves show how the Entropy can be decomposed into the Expected Log-Loss values of the two states.
The blue curve also confirms that the Expected Log-Loss values are identical for the two probabilities, i.e., 80.54% and 42.52%.
Click on the Information Mode icon in the Toolbar.
Entropy, denoted H(X), is a key metric in BayesiaLab for measuring the uncertainty associated with the probability distribution of a variable X.
The Entropy of a variable can also be understood as the sum of the Expected Log-Losses of its states.
We use a binary variable to represent the color of the ball.
| | Container A | Container B | Container C | Container D |
|---|---|---|---|---|
| P(Blue) | 100% | 50% | 10.89% | 0% |
| P(Red) | 0% | 50% | 89.11% | 100% |
| Entropy | 0 bits | 1 bit | ≈0.5 bits | 0 bits |
We see that Entropy reaches its maximum value for p=0.5, i.e., when drawing a red or a blue ball is equally probable. A 50/50 mix of red and blue balls is indeed the situation with the highest possible degree of uncertainty.
More specifically, the maximum value of Entropy increases logarithmically with the number of states of node X:

H_{max}(X) = \log_2(S_X)

where S_X is the number of states of the variable X.
To make Entropy comparable across variables, the Normalized Entropy metric is available, which takes into account this Maximum Entropy.
The Hellinger distance measures the similarity or dissimilarity between two probability distributions. It is often used in statistics and information theory to compare how two probability distributions differ or overlap.
The Hellinger distance is often used in various applications, such as statistical hypothesis testing, image processing, machine learning, and ecology, where comparing and quantifying the similarity or difference between probability distributions is important.
The Hellinger distance has several useful properties:
It is a metric: It satisfies the properties of a metric, which means it is non-negative, symmetric, and obeys the triangle inequality. In other words, it measures the distance between two distributions in a mathematically consistent way.
Interpretability: The Hellinger distance has a meaningful interpretation in terms of probability distributions. It quantifies how much the square root of the probability density functions of the two distributions differ.
In BayesiaLab, the Kullback-Leibler Divergence (or KL Divergence) is used to measure the strength of the relationship between two nodes that are directly connected by an arc.
We commonly refer to the KL Divergence also as Arc Force.
We interpret this difference DKL as the "force of the arc" or Arc Force.
Note that Filtered Values are taken into account for computing the Arc Force.
Throughout this website, we use Kullback-Leibler Divergence, KL Divergence, and Arc Force interchangeably.
The Log-Loss LL_B(E) reflects the number of bits required to encode an n-dimensional piece of evidence E (or observation) given the current Bayesian network B. As a shorthand for "the number of bits required to encode," we use the term "cost" in the sense that "more bits required" means computationally "more expensive."

LL_B(E) = - \log_2 \left( P_B(E) \right)

where P_B(E) is the joint probability of the evidence E computed by the network B.
In other words, the lower the probability of E given the network B, the higher the Log-Loss LL_B(E).
Note that E refers to a single piece of n-dimensional evidence, not an entire dataset.
For discrete probability distributions P and Q defined over the same probability space, the Hellinger distance is defined as:

H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{ \sum\limits_{i} \left( \sqrt{p_i} - \sqrt{q_i} \right)^2 }

where
H(P, Q) is the Hellinger distance between distributions P and Q, and
p_i and q_i are the probabilities associated with the event or outcome i in the two distributions.
It is bounded: The Hellinger distance is bounded between 0 and 1, where 0 indicates that the two distributions are identical, and 1 indicates that they are entirely dissimilar.
Square-root transformation: The square-root transformation in the formula gives more weight to the differences in the tails of the distributions compared to some other distance measures, such as the Bhattacharyya distance.
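The definition above translates directly into a few lines of Python:

```python
from math import sqrt

def hellinger(p, q):
    # H(P, Q) = (1 / sqrt(2)) * sqrt( sum_i (sqrt(p_i) - sqrt(q_i))^2 )
    return sqrt(sum((sqrt(pi) - sqrt(qi)) ** 2 for pi, qi in zip(p, q))) / sqrt(2)

print(hellinger([0.5, 0.5], [0.5, 0.5]))  # 0.0 -> identical distributions
print(hellinger([1.0, 0.0], [0.0, 1.0]))  # 1.0 -> entirely dissimilar
```

The two calls confirm the bounds stated above: identical distributions yield 0, and distributions with disjoint support yield 1.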
Formally, the Kullback-Leibler Divergence DKL measures the difference between two distributions P and Q:

D_{KL}(P \| Q) = \sum\limits_{x} P(x) \log_2 \frac{P(x)}{Q(x)}
For our purposes, we consider the Bayesian network B that does include the arc for which we wish to compute the Arc Force, and the Bayesian network B′ that does not contain that arc but is otherwise identical.
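The idea can be sketched with a hypothetical 2x2 joint distribution: the network with the arc represents the joint cells directly, while the arc-free network implies the product of the marginals:

```python
from math import log2

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) * log2( P(x) / Q(x) ), in bits
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical joint cells P(X, Y) with the arc X -> Y, and the independent
# distribution implied by removing the arc (both marginals are 0.5 / 0.5):
p_with_arc = [0.4, 0.1, 0.1, 0.4]
p_without_arc = [0.25, 0.25, 0.25, 0.25]
print(round(kl_divergence(p_with_arc, p_without_arc), 4))  # 0.2781 -> Arc Force
```

A stronger dependency between the two nodes would push the joint cells further from the product of the marginals and thus increase the Arc Force.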
In earlier versions of BayesiaLab, Information Gain was named Consistency.
So, for a network model to be useful, there should generally be more sets of evidence with a positive Information Gain, i.e., consistent observations, than sets of evidence with a negative Information Gain, i.e., conflicting observations.
Information Gain and Evidence Analysis
Network Performance
The Deviance measure is based on the difference between the Entropy of the data given the to-be-evaluated network B and the Entropy of the data given the complete (i.e., fully connected) network C:

Deviance(B) = 2 N \ln(2) \left( H_B(D) - H_C(D) \right)

The closer the Deviance value is to 0, the better the network B represents the dataset.
H_B(D) is the Entropy of the dataset D given the to-be-evaluated network B.
H_C(D) is the Entropy of the dataset D given the complete (i.e., fully connected) network C. In the complete network, all nodes are directly connected to all other nodes. Therefore, the complete network is an exact representation of the chain rule. As such, it does not utilize any conditional independence assumptions for representing the Joint Probability Distribution.
N is the size of the dataset.
Contingency Table Fit (CTF) measures the quality of the representation of the Joint Probability Distribution by a Bayesian network B compared to a complete (i.e., fully connected) network C:

CTF(B) = 100 \times \frac{H_U(D) - H_B(D)}{H_U(D) - H_C(D)}

where
H_U(D) is the entropy of the data D given the unconnected network U,
H_B(D) is the entropy of the data D given the evaluated network B, and
H_C(D) is the entropy of the data D given the complete (i.e., fully connected) network C. In the complete network, all nodes are directly connected to all other nodes. Therefore, the complete network is an exact representation of the chain rule. As such, it does not utilize any conditional independence assumptions for representing the Joint Probability Distribution.
CTF is equal to 100 if the Joint Probability Distribution is represented without any approximation, i.e., the entropy of the evaluated network B is the same as that obtained with the complete network C.
CTF is equal to 0 if the Joint Probability Distribution is represented by considering that all the variables are independent, i.e., the entropy of the evaluated network B is the same as the one obtained with the unconnected network U.
CTF can also be negative if the parameters of network B do not correspond to the dataset.
The Information Gain regarding evidence E is the difference between the:
Log-Loss LL_U(E), given an unconnected network U, i.e., a so-called straw model, in which all nodes are marginally independent; and the
Log-Loss LL_B(E), given a reference network B:

IG(E) = LL_U(E) - LL_B(E)

The Log-Loss reflects the "cost" in bits of applying the network to evidence E, i.e., the number of bits that are needed to encode evidence E. The lower the probability of evidence E, the higher the Log-Loss.
As a result, a positive value of Information Gain reflects a "cost saving" for encoding evidence E by virtue of having network B. In other words, encoding E with network B is less "costly" than encoding it with the straw model U. Therefore, evidence E is consistent with network B.
Conversely, a negative Information Gain indicates a so-called conflict: the Log-Loss of evidence E is higher with the reference network B than with the straw model U. Note that conflicting evidence does not necessarily mean that the reference network B is wrong. Rather, it probably indicates that such a set of evidence belongs to the tail of the distribution that is represented by the reference network B.
However, if evidence E is drawn from the original data on which the reference network B was learned, the probability of observing conflicting evidence should be smaller than the probability of observing consistent evidence.
Therefore, the mean value of the Information Gain of a reference network B compared to a straw model U is a useful performance indicator of the reference network B.
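A minimal numeric sketch with hypothetical evidence probabilities:

```python
from math import log2

def log_loss(p):
    # LL(E) = -log2( P(E) ): the cost in bits of encoding evidence E
    return -log2(p)

p_straw = 0.01       # hypothetical P(E) under the unconnected straw model U
p_reference = 0.08   # hypothetical P(E) under the reference network B

# Information Gain: IG(E) = LL_U(E) - LL_B(E)
information_gain = log_loss(p_straw) - log_loss(p_reference)
print(round(information_gain, 4))  # 3.0 -> E is consistent with network B
```

Since the reference network makes the evidence eight times more probable than the straw model, encoding it saves log2(8) = 3 bits.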
BayesiaLab estimates the parameters of a Bayesian network using Maximum Likelihood Estimation.
The probability of a state of a node corresponds to the frequency with which the state is observed in the dataset.
Let's consider this simple network:
The marginal probability distribution of X is estimated as:

P(X = x) = \frac{N(X = x)}{N}

where N(\cdot) represents the number of occurrences of the specified configuration in the dataset, and N is the total number of samples.
The conditional probability distribution of X given its parents Pa is estimated as:

P(X = x \mid Pa = pa) = \frac{N(X = x, Pa = pa)}{N(Pa = pa)}
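Maximum Likelihood Estimation by counting can be sketched with a small hypothetical dataset of (parent, child) observations:

```python
from collections import Counter

# Hypothetical rows of (Pa, X): e.g., Pa = weather, X = ground condition
rows = [("rain", "wet"), ("rain", "wet"), ("rain", "dry"),
        ("sun", "dry"), ("sun", "dry"), ("sun", "wet")]

n = len(rows)
pa_counts = Counter(pa for pa, x in rows)  # counts of parent configurations
joint_counts = Counter(rows)               # counts of (parent, child) pairs

p_rain = pa_counts["rain"] / n                                        # P(Pa = rain)
p_wet_given_rain = joint_counts[("rain", "wet")] / pa_counts["rain"]  # P(X = wet | Pa = rain)

print(p_rain)                      # 0.5
print(round(p_wet_given_rain, 4))  # 0.6667
```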
Priors reflect any a priori knowledge of an analyst regarding the domain, in other words, expert knowledge. See also Prior Knowledge for Structural Learning.
These priors are expressed with an analyst-specified, initial Bayesian network (structure and parameters) plus analyst-specified Prior Samples.
Prior Samples represent the analyst's subjective degree of confidence in the Priors.
BayesiaLab uses these two terms to generate virtual samples that are subsequently combined with the observed samples from the dataset.
With your current Bayesian network, you can generate Priors: select Main Menu > Data > Prior Samples > Generate.
Edit Number of Uniform Prior Samples allows you to define prior knowledge in such a way that all the variables are marginally independent (fully unconnected network), and the marginal probability distributions of all nodes are uniform.
For instance, if the number of Prior Samples is set to 1, one observation ("occurrence") would be "spread across" all states of each node, essentially assigning a "fraction of an observation" to each node's states.
To apply Smoothed Probability Estimation, select Main Menu > Edit > Edit Smoothed Probability Estimation and specify the number of Prior Samples.
In BayesiaLab's approach to learning and analyzing Bayesian networks, statistical concepts play a secondary role compared to concepts from the field of Information Theory.
Nevertheless, statistical measures, such as correlation, can provide certain insights that are unavailable from non-statistical measures.
Where the covariance is defined by:

cov(X, Y) = \sum\limits_{i} \sum\limits_{j} p(x_i, y_j) (x_i - \mu_X)(y_j - \mu_Y)

And the standard deviation:

\sigma_X = \sqrt{ \sum\limits_{i} p(x_i) (x_i - \mu_X)^2 }
In BayesiaLab, there are Discrete Nodes and Continuous Nodes with discretized numerical states. As a result, the value of a node's state may not always be apparent:
For Discrete Nodes that have states with integer or real values, BayesiaLab uses these numerical values directly.
For Continuous Nodes, BayesiaLab uses the mean value of each interval.
Please see Mean, Value, and Standard Deviations for a detailed discussion.
BayesiaLab utilizes proprietary score-based learning algorithms.
However, choosing too low a value for the Structural Coefficient might result in "overfitting," i.e., learning "insignificant" relationships, in other words, discovering patterns in what turns out to be mere noise.
BayesiaLab can help reduce the risk of overfitting with the Structural Coefficient Analysis feature.
The Markov Blanket of node A is the set of nodes composed of A’s parents, its children, and its children’s other parents (i.e., spouses).
The Markov Blanket of node A contains all the nodes that, if we know their states, i.e., we have hard evidence for these nodes, will make A independent of all other nodes.
This means that the Markov Blanket of node A is the only knowledge needed to predict the posterior probability distribution of that node.
Learning a Markov Blanket selects the most relevant predictor nodes, which is particularly helpful when there are many variables in a data set. As a result, this can serve as a highly efficient variable selection method.
BayesiaLab can also take into account Priors when estimating parameters:

P(X = x) = \frac{N(X = x) + M_0 \times P_0(X = x)}{N + M_0}

where
M_0 is the degree of confidence in the Prior, and
P_0 is the joint probability returned by the prior Bayesian network.
You can specify M_0 by setting the number of Prior Samples.
BayesiaLab uses the current Bayesian network to compute P_0.
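Under these definitions, parameter estimation with Priors amounts to mixing observed counts with M0 virtual samples drawn from P0; the numbers below are hypothetical:

```python
# P(X = x) = ( N(X = x) + M0 * P0(X = x) ) / ( N + M0 )
def smoothed_estimate(count_x, n_total, p0_x, m0):
    return (count_x + m0 * p0_x) / (n_total + m0)

# A state never observed in 10 samples, with a uniform prior over two states:
print(round(smoothed_estimate(0, 10, 0.5, 1), 4))  # 0.0455 with one Prior Sample
print(smoothed_estimate(0, 10, 0.5, 0))            # 0.0 -> pure Maximum Likelihood
```

Even a single Prior Sample prevents unobserved states from receiving a hard zero probability.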
The existence of a new Virtual Database is indicated by an icon in the lower right corner of the graph window, next to the "real dataset" icon.
Right-clicking on the Virtual Database icon displays the structure of the prior knowledge that was used for generating the Virtual Samples.
These Virtual Samples will be combined with the observed "real" samples during the learning process.
The Pearson Correlation Coefficient r between two nodes X and Y is defined as the covariance of the two corresponding variables divided by the product of their standard deviations:

r = \frac{cov(X, Y)}{\sigma_X \sigma_Y}

where
x_i is the value that is associated with the state i of node X,
\mu_X is the Mean of the node X,
p(x_i) is the marginal probability of state x_i returned by the Bayesian network, and
p(x_i, y_j) is the joint probability of states x_i and y_j returned by the Bayesian network.
For calculating the Pearson Correlation r, BayesiaLab must use the values of node states.
For Discrete Nodes that have states without values, e.g., {red, green, blue}, BayesiaLab uses the indices of the states as values, i.e., {red, green, blue} would have the values {0, 1, 2} for the purpose of calculating r. Note that the index of states starts at 0.
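Putting the definitions together, the correlation can be computed from a discrete joint distribution; the 2x2 distribution below is hypothetical, with state values 0 and 1:

```python
from math import sqrt

values_x = [0, 1]
values_y = [0, 1]
joint = [[0.4, 0.1],   # joint[i][j] = P(X = values_x[i], Y = values_y[j])
         [0.1, 0.4]]

px = [sum(row) for row in joint]        # marginal distribution of X
py = [sum(col) for col in zip(*joint)]  # marginal distribution of Y
mu_x = sum(v * p for v, p in zip(values_x, px))  # mean of X
mu_y = sum(v * p for v, p in zip(values_y, py))  # mean of Y
sd_x = sqrt(sum(p * (v - mu_x) ** 2 for v, p in zip(values_x, px)))
sd_y = sqrt(sum(p * (v - mu_y) ** 2 for v, p in zip(values_y, py)))
cov = sum(joint[i][j] * (values_x[i] - mu_x) * (values_y[j] - mu_y)
          for i in range(2) for j in range(2))

r = cov / (sd_x * sd_y)  # Pearson Correlation Coefficient
print(round(r, 2))  # 0.6
```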
As opposed to constraint-based algorithms that use independence tests for adding or removing arcs between nodes, BayesiaLab employs the MDL Score to measure the quality of candidate networks with respect to the available data.
In BayesiaLab, the computation of the MDL Score also includes the so-called Structural Coefficient α as a weighting factor for the structural component DL(B).
With that, the MDL Score is calculated using the following formula:

MDL(B, D) = \alpha \times DL(B) + DL(D \mid B)

As a result, the choice of value for the Structural Coefficient α affects the relative weighting of the two components DL(B) and DL(D|B).
You can arbitrarily modify the Structural Coefficient α within the range of 0 to 150.
α = 1, the default value, means the components DL(B) and DL(D|B) are weighted equally.
α < 1 reduces the contribution of DL(B) in the formula and, thus, allows for more "structural complexity."
α > 1 increases the contribution of DL(B) in the formula, i.e., it penalizes "structural complexity," forcing a simpler model.
There is another way to interpret the Structural Coefficient α, which can help understand its role in learning a Bayesian network.
Weighting DL(B) with a factor α is equivalent to changing the original number of observations N in a dataset to a new number of observations N′:

N' = \frac{N}{\alpha}

An α value of 0 would be the same as having an infinite number of observations N′. As a result, the MDL Score would only be based on the fit component of the score, i.e., DL(D|B), and BayesiaLab's structural learning algorithms would produce a fully connected network.
At the other extreme, an α value of 150 would massively favor the simplest possible network structures, as the new equivalent number of observations N′ would only be 1/150th of N.
It is perhaps more intuitive to consider the new number of observations N′ as weighted counts of the actual observations N. For instance, α = 0.5 is equivalent to counting all observations twice.
From a practical perspective, the Structural Coefficient α can be considered a kind of "significance" threshold for structural learning.
The higher you set the α value, the higher the threshold for discovering probabilistic relationships. Conversely, the lower you set the α value, the lower the discovery threshold, and weaker probabilistic relationships will still be found and represented by arcs.
Reducing α can be helpful if you have a small dataset from which you want to learn a model. Perhaps at the default value, α = 1, the learning algorithm would not find any arcs.
Based on Mutual Information, Normalized Mutual Information includes a normalization factor, log2(S_Y),
where S_Y denotes the number of states of Y.
This means that the Mutual Information is divided by the maximum possible entropy of Y, i.e., log2(S_Y).
With that, the formal definition of Normalized Mutual Information is:

NMI(X, Y) = \frac{I(X, Y)}{\log_2(S_Y)}
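A short sketch computes Mutual Information and its normalized variant for a hypothetical 2x2 joint distribution:

```python
from math import log2

def mutual_information(joint, px, py):
    # I(X, Y) = sum_xy p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
    return sum(joint[i][j] * log2(joint[i][j] / (px[i] * py[j]))
               for i in range(len(px)) for j in range(len(py))
               if joint[i][j] > 0)

joint = [[0.4, 0.1], [0.1, 0.4]]  # hypothetical joint distribution
px, py = [0.5, 0.5], [0.5, 0.5]   # its marginals

mi = mutual_information(joint, px, py)
nmi = mi / log2(len(py))          # divide by the maximum entropy of Y
print(round(mi, 4), round(nmi, 4))  # 0.2781 0.2781 (log2(2) = 1 here)
```

With a binary Y, the normalization factor is 1 bit, so MI and NMI coincide; for nodes with more states, NMI rescales MI to a comparable range.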
BayesiaLab reports the Normalized Mutual Information in the Target Analysis Report: Main Menu > Analysis > Report > Target > Relationship with Target Node.
Note that this table shows the Normalized Mutual Information of each node, e.g., XRay, Dyspnea, etc., with regard to the Target Node, Cancer.
In Preferences, Child refers to the Normalized Mutual Information from the Parent onto the Child node, i.e., in the direction of the arc.
Conversely, Parent refers to the Normalized Mutual Information from the Child onto the Parent node, i.e., in the opposite direction of the arc.
Latent Variables (or Factors) and Manifest Nodes are central to building many types of models in BayesiaLab, including Probabilistic Structural Equation Models.
A Manifest Variable is a variable for which there are recorded observations from a given domain.
Manifest Variables include variables such as the temperature, speed, or mass of an object, i.e., properties that are generally measurable.
Furthermore, survey responses are also typical Manifest Variables, as they refer to ratings or assessments directly stated by respondents. In this case, a consumer's opinion is made manifest in the survey response.
manifest (adj.), late 14c., "clearly revealed to the eye or the understanding, open to view or comprehension," from Old French manifest "evident, palpable," (12c.), or directly from Latin manifestus "plainly apprehensible, clear, apparent, evident;" of offenses, "proved by direct evidence;" of offenders, "caught in the act."
A Latent variable (or Factor), as opposed to a Manifest Variable, is a variable that cannot be directly observed. As a result, such a variable would not be recorded in a dataset collected from the original problem domain.
Latent typically refers to a theoretical or "hidden" concept or construct that cannot be observed directly, such as safety, health, freedom, etc.
Connecting Latent Variables to several Manifest Variables often allows inferring values of the Latent Variables based on the measurements of the Manifest Variables.
For instance, values for the Latent Variable Health could be inferred from a patient's Manifest Variables, such as body mass index, blood pressure, heart rate, lung function, etc.
The term Factor is entirely equivalent to Latent Variable. In the Bayesia Knowledge Base & Library, we use both terms interchangeably. Occasionally, we also refer to a Latent Factor, which is also the same.
Using Factor expresses intuitively that a Latent Variable can be a hidden cause of Manifest Variables. Consistent with the Latin origin of the word, a Factor can be the "doer" or "maker" behind Manifest Variables.
latent (adj.), mid-15c., "concealed, secret," from Latin latentem (nominative latens) "lying hid, concealed, secret, unknown," present participle of latere "lie hidden, lurk, be concealed."
factor (n.), early 15c., "commercial agent, deputy, one who buys or sells for another," from French facteur "agent, representative" (Old French factor, faitor "doer, author, creator"), from Latin factor "doer, maker, performer," in Medieval Latin, "agent," agent noun from past participle stem of facere "to do."
Probabilistic Structural Equation Models
Webinar: Factor Analysis Reinvented—Probabilistic Latent Factor Induction
Difference between SEM and PSEM Factors
The Mutual Information I(X, Y) measures the amount of information gained on variable X (the reduction in the Expected Log-Loss) by observing variable Y:
I(X, Y) = H(X) − H(X|Y)
The Venn Diagram below illustrates this concept:
The Conditional Entropy H(X|Y) measures, in bits, the Expected Log-Loss associated with variable X once we have information on variable Y:
H(X|Y) = −∑x ∑y P(x, y) log2 P(x|y)
Hence, the Conditional Entropy is a key element in defining the Mutual Information between X and Y.
Note that
I(X, Y) = H(X) − H(X|Y)
is equivalent to:
I(X, Y) = H(Y) − H(Y|X)
and furthermore equivalent to:
I(X, Y) = H(X) + H(Y) − H(X, Y)
This allows computing the Mutual Information between any two variables.
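The three equivalent expressions can be checked numerically. The Python sketch below uses a hypothetical joint distribution and derives the conditional entropies via the chain rule H(X|Y) = H(X, Y) − H(Y):

```python
import math

def h(probs):
    """Entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint probability table P(X, Y); values are illustrative.
joint = {('a', '0'): 0.25, ('a', '1'): 0.25,
         ('b', '0'): 0.40, ('b', '1'): 0.10}

p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

hx  = h(p_x.values())
hy  = h(p_y.values())
hxy = h(joint.values())            # joint entropy H(X, Y)

# Conditional entropies via the chain rule.
hx_given_y = hxy - hy
hy_given_x = hxy - hx

# All three expressions for the Mutual Information agree.
i1 = hx - hx_given_y
i2 = hy - hy_given_x
i3 = hx + hy - hxy
print(round(i1, 4))
```

Any of the three forms can therefore be used, depending on which entropies are easiest to compute.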
For a given network, BayesiaLab can report the Mutual Information in several contexts:
Main Menu > Analysis > Report > Target > Relationship with Target Node.
Note that this table shows the Mutual Information of each node, e.g., XRay, Dyspnea, etc., only with regard to the Target Node, Cancer.
Main Menu > Analysis > Report > Relationship Analysis:
Note that the corresponding options under Preferences > Analysis > Visual Analysis > Arc's Mutual Information Analysis have to be selected first:
In Preferences, Child refers to the Relative Mutual Information from the Parent onto the Child node, i.e., in the direction of the arc.
Conversely, Parent refers to the Relative Mutual Information from the Child onto the Parent node, i.e., in the opposite direction of the arc.
The Symmetric Normalized Mutual Information measure takes the difference of the respective entropies of X and Y into account:
For a given network, BayesiaLab can report the Symmetric Normalized Mutual Information in several contexts:
Main Menu > Analysis > Report > Relationship Analysis:
Note that the corresponding options under Preferences > Analysis > Visual Analysis > Arc's Mutual Information Analysis have to be selected first:
In Preferences, Child refers to the Relative Mutual Information from the Parent onto the Child node, i.e., in the direction of the arc.
Conversely, Parent refers to the Relative Mutual Information from the Child onto the Parent node, i.e., in the opposite direction of the arc.
In older versions of BayesiaLab, Relative Mutual Information was also called Normalized Mutual Information.
Please see the up-to-date definition of Normalized Mutual Information.
BayesiaLab reports the Relative Mutual Information in the Target Analysis Report: Main Menu > Analysis > Report > Target > Relationship with Target Node.
Note that this table shows the Relative Mutual Information of each node, e.g., XRay, Dyspnea, etc., only with regard to the Target Node, Cancer.
Note that the corresponding options under Preferences > Analysis > Visual Analysis > Arc's Mutual Information Analysis have to be selected first:
In Preferences, Child refers to the Relative Mutual Information from the Parent onto the Child node, i.e., in the direction of the arc.
Conversely, Parent refers to the Relative Mutual Information from the Child onto the Parent node, i.e., in the opposite direction of the arc.
At the top of each Monitor, the items Mean, Dev, and Value are displayed.
Mean refers to the Mean Value m and is only shown in the Monitors of numerical nodes.
Dev stands for Standard Deviation and is shown alongside Mean.
The calculations for Expected Value and Mean Value are shown in the context of the following examples:
Let's take the discrete node Age with three categorical Node States:
Child
Adult
Senior
In the Node Editor, you can assign State Values to the Node States of Age.
A Monitor of a categorical node does not show a Mean value.
Let's suppose that the node Age has three numerical Node States instead of categorical Node States.
In this context, we need to consider two conditions, with and without State Values specified in the Node Editor:
No State Values Specified
Here, State Values are not specified in the Values tab of the Node Editor. Note the empty Value column below.
As a result, BayesiaLab uses the numerical values of the Node States, as they appear in the States tab, as the State Values.
Furthermore, as Age is a numerical node, its Monitor will now display the Mean (Mean) and the Standard Deviation (Dev) in addition to the Expected Value (Value).
The Mean m is computed using the numerical values of the Node States and the marginal probability distribution of the Node States:
m = ∑i pi · xi
Note that Mean and Value are identical in this case.
State Values Specified
However, if State Values are separately specified in the Values tab of the Node Editor, they will be used for the calculation of Value in the Monitor.
To highlight the distinction between the Node States {10, 40, 70} and the State Values, we assign unrelated arbitrary State Values of 0, 1, and 2.
Note that Mean and Value are not identical in this case.
Let's now consider a continuous variable Age defined in the domain [0; 99], discretized into three states:
Child: [0 ; 18]
Adult: ]18 ; 65]
Senior: ]65 ; 99]
Given Age is a numerical node, its Monitor shows the Mean (Mean), the Standard Deviation (Dev), plus the Expected Value (Value).
No Associated Data
So, the Mean Value m is computed as follows:
Associated Data
If you set a new piece of evidence on a node that modifies the distribution of the node, the Monitor displays a delta value in parentheses adjacent to Value.
This delta is the difference between the current Expected Value v and:
The Normalized Mutual Information can also be shown by selecting Main Menu > Analysis > Visual > Overall > Arc > Mutual Information
and then clicking the Show Arc Comments icon or selecting Main Menu > View > Show Arc Comments.
Note that the corresponding options under Main Menu > Preferences > Analysis > Visual Analysis > Arc's Mutual Information Analysis have to be selected first:
Prior to inferring the values of a newly-created Latent Variable, it would appear as a Hidden Node on BayesiaLab's Graph Panel.
The Mutual Information can also be shown by selecting Main Menu > Analysis > Visual > Overall > Arc > Mutual Information
and then clicking the Show Arc Comments icon or selecting Main Menu > View > Show Arc Comments.
The following Venn Diagram illustrates that the Mutual Information is symmetrical for the two variables X and Y, i.e., I(X, Y) = I(Y, X).
However, the variables and can each have a different number of states. Therefore, their respective entropies can be very different.
This means that the absolute value of the Mutual Information cannot be interpreted without context. In the Venn Diagram, for instance, the Mutual Information reduces H(Y) by a bigger percentage than it reduces H(X). As such, X would be more "important" with regard to Y than Y would be with regard to X.
As a result, we have an easy-to-interpret measure that relates the Mutual Information to both X and Y together.
The Symmetric Normalized Mutual Information can also be shown by selecting Main Menu > Analysis > Visual > Overall > Arc > Mutual Information
and then clicking the Show Arc Comments icon or selecting Main Menu > View > Show Arc Comments.
Based on the Mutual Information I(X, Y), Relative Mutual Information is defined as:
RMI(X, Y) = I(X, Y) / H(Y)
Relative Mutual Information expresses in percent how much the entropy (or uncertainty) of Y is reduced by observing X.
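As a quick numeric sketch — assuming the Relative Mutual Information divides I(X, Y) by the entropy H(Y) of the variable being predicted, and using a hypothetical 2×2 joint table:

```python
import math

# Hypothetical joint probability table P(X, Y); values are illustrative.
joint = [[0.35, 0.15],
         [0.10, 0.40]]

p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

h_y = -sum(p * math.log2(p) for p in p_y if p > 0)   # entropy H(Y)

mi = sum(p * math.log2(p / (p_x[i] * p_y[j]))
         for i, row in enumerate(joint)
         for j, p in enumerate(row) if p > 0)

# Relative Mutual Information: share of Y's uncertainty removed by X.
rmi = mi / h_y
print(f"{rmi:.1%}")
```

The result reads directly as a percentage of the uncertainty in Y that observing X removes.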
The Relative Mutual Information can also be shown by selecting Main Menu > Analysis > Visual > Overall > Arc > Mutual Information
and then clicking the Show Arc Comments icon or selecting Main Menu > View > Show Arc Comments.
Value refers to the Expected Value and is shown in all Monitors, regardless of the node type, i.e., categorical or numerical.
For each node, the Expected Value is computed using the assigned State Values and the marginal probability distribution of the Node States:
v = ∑i pi · vi
where pi is the marginal probability of state i and vi is its associated value.
The Monitor shows the result as the Value of Age.
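The computation behind the Value item is a simple probability-weighted sum. A minimal Python sketch, with illustrative probabilities and State Values for the Age example (the numbers are assumptions, not BayesiaLab output):

```python
# Expected Value v = sum of p_i * v_i over all Node States.
# Probabilities and State Values below are illustrative only.
probs  = [0.20, 0.60, 0.20]   # marginal probabilities p_i (Child, Adult, Senior)
values = [10.0, 40.0, 70.0]   # State Values v_i

expected_value = sum(p * v for p, v in zip(probs, values))
print(round(expected_value, 2))
```

With these numbers, the weighted sum 0.2·10 + 0.6·40 + 0.2·70 yields 40.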
where xi is the numerical value of the Node State.
The Expected Value is now computed using the assigned State Values {0, 1, 2} and the marginal probability distribution of the Node States:
v = ∑i pi · vi
where pi is the marginal probability of state i and vi is its associated value.
The Monitor shows the result as the Value of Age.
If no data is associated with the node, both the Mean m and the Expected Value v are computed from state values defined as the mid-points of the minimum and maximum values of each Node State. For example, for the Node State Adult ]18; 65], the mid-point is 41.5.
The Expected Value is calculated analogously:
If data is associated with the node, the value of each Node State is defined as the arithmetic mean of the data points that are associated with that state.
Furthermore, clicking on the Generate Values button in the Node Editor sets the values to the current arithmetic means of each Node State.
the Expected Value before setting the modifying evidence, or
the Expected Value that corresponds to the Reference Probability Distribution, which you can set with the icon in the toolbar.
If only some Node States have an associated value, the Expected Value is computed from the subset of Node States that do have an associated value.
If a node has only a single Node State with an associated value, the corresponding Monitor does not report the Expected Value v.
The Mutual Information between two variables X and Y is defined as follows:
I(X, Y) = ∑x ∑y P(x, y) log2( P(x, y) / (P(x) · P(y)) )
The Kullback-Leibler Divergence (or KL Divergence) is used to measure the strength of the relationship between two nodes that are directly connected by an arc.
We commonly refer to the KL Divergence as Arc Force.
Formally, the Kullback-Leibler Divergence measures the difference between two distributions P and Q:
DKL(P || Q) = ∑x P(x) log2( P(x) / Q(x) )
For our purposes, we consider the Bayesian network B that does include the arc for which we wish to compute the Arc Force, and the Bayesian network B′ that does not contain that arc but is otherwise identical.
We interpret this difference as the "force of the arc" or Arc Force.
Mutual Information can be rewritten as a Kullback-Leibler Divergence between the joint distribution and the product of the marginal distributions:
I(X, Y) = DKL( P(X, Y) || P(X) · P(Y) )
Therefore, Mutual Information and Arc Force are identical if there are no spouses (co-parents) involved in the relationship of interest.
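This identity is easy to verify for a two-node network: removing the only arc leaves the product of the marginals, so the KL Divergence between the joint distribution and that product is exactly the Mutual Information. A Python sketch with a hypothetical joint table:

```python
import math

# Hypothetical joint table P(X, Z) for a two-node network X -> Z.
# Removing the arc replaces P(X, Z) by P(X) * P(Z), so the Arc Force
# D_KL(P(X,Z) || P(X)P(Z)) equals the Mutual Information I(X, Z).
joint = [[0.30, 0.20],
         [0.10, 0.40]]

p_x = [sum(row) for row in joint]
p_z = [sum(col) for col in zip(*joint)]

arc_force = sum(p * math.log2(p / (p_x[i] * p_z[j]))
                for i, row in enumerate(joint)
                for j, p in enumerate(row) if p > 0)
print(round(arc_force, 4))
```

With spouses (co-parents) present, the two networks differ by more than a product of marginals, which is where Arc Force and Mutual Information diverge.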
Let's consider the following network consisting of two nodes, X and Z.
The Conditional Probability Table associated with the node Z is defined as follows:
The top number in the box shows the Mutual Information I(X, Z).
The bottom number in the box is the Symmetric Normalized Mutual Information.
The top number in the box shows the Arc Force.
The bottom number, in blue, represents the relative weight of this arc compared to the sum of all Arc Forces in the network. Given that this network consists only of one arc, this arc's weight accounts for 100%.
So, for now, both analyses return the same value, i.e., 0.3436. As we stated above, Mutual Information and Arc Force are identical with regard to an arc if no spouses (co-parents) are involved in the relationship of interest.
However, as soon as we have spouses (co-parents) involved, the Arc Force provides a more comprehensive characterization of the relationship.
Let's consider the following deterministic example, in which node Z represents an Exclusive-OR (XOR) gate with regard to its inputs X and Y.
The Truth Table associated with the node Z is defined as follows:
We can easily validate this assessment by simulating evidence for X and Y individually.
Indeed, neither X nor Y individually has any impact on Z.
The Arc Force, which takes into account the network as a whole, reveals the perfectly-deterministic relationship between X, Y, and Z.
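The XOR behavior can be reproduced in a few lines of Python: the pairwise Mutual Information between X and Z is exactly zero, even though Z is fully determined by X and Y together. This sketch assumes uniform, independent inputs, as in the example above:

```python
import math

def mi(joint, p_a, p_b):
    """Mutual Information in bits from a joint table and its marginals."""
    return sum(p * math.log2(p / (p_a[i] * p_b[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)

# Z = XOR(X, Y), with X and Y independent and uniform:
# each (x, y) pair has probability 0.25, and z = x ^ y.
states = [(x, y, x ^ y) for x in (0, 1) for y in (0, 1)]

# Pairwise joint P(X, Z): marginalize out Y.
joint_xz = [[0.0, 0.0], [0.0, 0.0]]
for x, y, z in states:
    joint_xz[x][z] += 0.25

p_x = [sum(row) for row in joint_xz]
p_z = [sum(col) for col in zip(*joint_xz)]

# X alone carries no information about Z ...
print(mi(joint_xz, p_x, p_z))   # 0.0

# ... yet (X, Y) together determine Z with certainty: H(Z | X, Y) = 0.
```

This is precisely the case where the arc-by-arc Mutual Information misses a relationship that the Arc Force, computed on the network as a whole, captures.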
Symmetric Relative Mutual Information computes the percentage of information gained by observing X and Y:
This normalization is calculated similarly to Pearson's Correlation Coefficient r:
r(X, Y) = cov(X, Y) / √(var(X) · var(Y))
where var denotes variance.
So, Mutual Information is comparable to covariance, and Entropy is analogous to variance.
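Carrying the analogy into code, the sketch below normalizes the Mutual Information by the geometric mean of the two entropies, mirroring r = cov(X, Y) / √(var(X)·var(Y)). Both this Pearson-style normalization and the joint table are illustrative assumptions, not a restatement of BayesiaLab's exact formula:

```python
import math

def h(probs):
    """Entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint probability table P(X, Y); values are illustrative.
joint = [[0.30, 0.20],
         [0.05, 0.45]]

p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

mi = sum(p * math.log2(p / (p_x[i] * p_y[j]))
         for i, row in enumerate(joint)
         for j, p in enumerate(row) if p > 0)

# Pearson-style normalization (assumed form): divide the Mutual
# Information by the geometric mean of the two entropies, just as
# the correlation divides covariance by the two standard deviations.
srmi = mi / math.sqrt(h(p_x) * h(p_y))
print(round(srmi, 4))
```

Like a correlation coefficient, the normalized value is symmetric in X and Y and independent of which variable plays the role of the target.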
For a given network, BayesiaLab can report the Symmetric Relative Mutual Information in several contexts:
Main Menu > Analysis > Report > Relationship Analysis:
Note that the corresponding options under Preferences > Analysis > Visual Analysis > Arc's Mutual Information Analysis have to be selected first:
In Preferences, Child refers to the Relative Mutual Information from the Parent onto the Child node, i.e., in the direction of the arc.
Conversely, Parent refers to the Relative Mutual Information from the Child onto the Parent node, i.e., in the opposite direction of the arc.
The Total Effect (TE) is estimated as the derivative of the Target Node with respect to the driver node under study.
The Total Effect represents the change in the mean of the Target Node associated with — and not necessarily caused by — a small modification of the mean of a driver node.
The Total Effect is the ratio of these two changes, i.e., the change in the mean of the Target Node divided by the change in the mean of the driver node.
The Standardized Total Effect (STE) is also displayed. It represents the Total Effect multiplied by the ratio of the standard deviation of the driver node and the standard deviation of the Target Node.
This means that Standardized Total Effect takes into account the “potential” of the driver under study.
To provide some intuition for the Arc Force and Node Force measures computed by BayesiaLab, we use the water hose and balloon metaphor:
Imagine that we have a Bayesian network in which the variables are balloons and the arcs are elastic, perforated water hoses. The size of the holes in the hose represents the uncertainty contained in the conditional probability table associated with the child node.
For a deterministic relationship (i.e., we know the state of one variable given the state of the other one with certainty), there are no holes at all in the hose, and therefore, no water is lost between these two nodes.
Conversely, for an entirely uncertain relationship, in which information on one variable does not yield any information regarding the other one (such a “relationship” cannot be machine-learned as there is no correlation in the dataset), the size of the holes would be so large that no water could be transmitted from one node to the other.
Now, we are sending a constant flow of water into this system. The thickness of a hose represents the actual water flow and is inversely proportional to the size of its holes. Big holes mean that most water leaks, and the effective water flow is minimal.
The pressure in a balloon, and therefore its size, depends on the number of connected hoses and the sizes of their respective holes.
BayesiaLab's Mapping function visualizes Node Force and Arc Force so you can easily identify the most important variables in a network, even in high-dimensional spaces.
In the networks below, for instance, the most important nodes are Country, Age, and Gender:
We now analyze this relationship in terms of Mutual Information in Validation Mode using Main Menu > Analysis > Visual > Overall > Arc > Arcs' Mutual Information
and click on the Arc Comments icon in the Toolbar.
Next, we analyze this relationship in terms of Arc Force using Main Menu > Analysis > Visual > Overall > Arc > Kullback-Leibler
and, again, click on the Arc Comments icon in the Toolbar.
where X is the analyzed variable and Y is the Target Node.