In BayesiaLab, nearly all learning and analysis functions are based on principles and metrics from the field of Information Theory.
In this section, we summarize some of these concepts and attempt to relate them to the corresponding BayesiaLab functions.
Furthermore, we include several relevant statistical concepts for understanding BayesiaLab's estimates and visualizations.
A Joint Probability Distribution is the distribution of Joint Probabilities over all combinations of variable values in a domain.
A Joint Probability is the probability of specific values of variables jointly occurring in that domain.
We observe the variables Hair Color and Eye Color in a population of college students.
Joint Probability refers to the probability of specific values for Hair Color and Eye Color jointly occurring in this population.
For instance,
P(Eye Color=Blue, Hair Color=Blond)=15.88% means that the probability of a student having blue eyes and blond hair in the given population is 15.88%.
P(Eye Color=Green, Hair Color=Black)=0.84% means that the probability of having green eyes and black hair in that population is only 0.84%.
We can now look across all possible combinations of Hair Color and Eye Color, compute all Joint Probabilities and list them in a Joint Probability Table, with one row for each combination of the states of the variables.
In this example, the size of the Joint Probability Table is manageable: Number of States (Hair Color) × Number of States (Eye Color) = 4 × 4 = 16
This Joint Probability Table is a direct and complete representation of the Joint Probability Distribution for the variables Hair Color and Eye Color:
As the Joint Probability Distribution covers all possible combinations, it represents all regularities and patterns (or the lack thereof) within a domain.
Knowing the Joint Probability Distribution is required for performing two key operations for data analysis and inference:
Marginalization, which is calculating the marginal probability of a variable, e.g., P(Hair Color=Black)=18.25%.
Conditioning, which refers to inferring the values of a variable, given a specific value of another variable, e.g., P(Hair Color=Blond | Eye Color=Blue)=43.7%.
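These two operations can be sketched in plain Python using the Joint Probability Table from this section (the function names are illustrative, not BayesiaLab API):

```python
# Joint probabilities (in percent) of Hair Color and Eye Color,
# taken from the Joint Probability Table in this section (Snee, 1974).
joint = {
    ("Black", "Brown"): 11.49, ("Brown", "Brown"): 20.10,
    ("Red", "Brown"): 4.39,    ("Blond", "Brown"): 1.18,
    ("Black", "Blue"): 3.38,   ("Brown", "Blue"): 14.19,
    ("Red", "Blue"): 2.87,     ("Blond", "Blue"): 15.88,
    ("Black", "Hazel"): 2.53,  ("Brown", "Hazel"): 9.12,
    ("Red", "Hazel"): 2.36,    ("Blond", "Hazel"): 1.69,
    ("Black", "Green"): 0.84,  ("Brown", "Green"): 4.90,
    ("Red", "Green"): 2.36,    ("Blond", "Green"): 2.70,
}

def marginal_hair(hair):
    # Marginalization: sum the joint probabilities over all Eye Color states.
    return sum(p for (h, e), p in joint.items() if h == hair)

def conditional_hair_given_eye(hair, eye):
    # Conditioning: divide the joint probability by the marginal of the evidence.
    p_eye = sum(p for (h, e), p in joint.items() if e == eye)
    return joint[(hair, eye)] / p_eye

print(round(marginal_hair("Black"), 2))                       # ~18.24 (percent)
print(round(conditional_hair_given_eye("Blond", "Blue"), 3))  # ~0.437
```

The small discrepancy against the 18.25% quoted above stems from the two-decimal rounding of the table entries.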
In high-dimensional domains, however, calculating and listing the Joint Probabilities in a Joint Probability Table can become intractable.
The size of a Joint Probability Table grows exponentially with the number of variables. For example, if we had 20 variables with 4 states each, the size of the corresponding Joint Probability Table would exceed 1 trillion rows.
While the arithmetic is straightforward, the sheer number of calculations can easily exceed the available computational power, both for generating the Joint Probability Table as well as for performing Marginalization and Conditioning.
"The only way to deal with such large distributions is to constrain the nature of the variable interactions in some manner, both to render specification and ultimately inference in such systems tractable. The key idea is to specify which variables are independent of others, leading to a structured factorisation of the joint probability distribution. [Bayesian] Belief Networks are a convenient framework for representing such factorisations into local conditional distributions." (Barber, 2012)
This means that Bayesian networks are extremely practical for approximating Joint Probability Distributions in complex, high-dimensional problem domains.
Barber, D. (2012). Bayesian Reasoning and Machine Learning. Cambridge: Cambridge University Press. doi:10.1017/CBO9780511804779
| Hair Color | Eye Color | Joint Probability |
|---|---|---|
| Black | Brown | 11.49% |
| Brown | Brown | 20.10% |
| Red | Brown | 4.39% |
| Blond | Brown | 1.18% |
| Black | Blue | 3.38% |
| Brown | Blue | 14.19% |
| Red | Blue | 2.87% |
| Blond | Blue | 15.88% |
| Black | Hazel | 2.53% |
| Brown | Hazel | 9.12% |
| Red | Hazel | 2.36% |
| Blond | Hazel | 1.69% |
| Black | Green | 0.84% |
| Brown | Green | 4.90% |
| Red | Green | 2.36% |
| Blond | Green | 2.70% |
| Sum | | 100.00% |
DL(G) = \sum\limits_{i=1}^{n} \left( \log_2(n) + \log_2 \binom{n}{\left\| Pa_i \right\|} \right)
where n is the number of nodes and ‖Pa_i‖ is the number of parents of node X_i in graph G.
As the probability p cannot be known prior to learning the network, BayesiaLab uses the classical heuristic of encoding each probability with \frac{\log_2(N)}{2} bits, where N is the number of samples in the dataset.
The Minimum Description Length Score (MDL Score) is derived from Information Theory and has been used extensively in the Artificial Intelligence community.
It consists of the sum of two components that estimate:
the minimum number of bits required to represent a model, and
the minimum number of bits required to represent the data given the model.
However, in the specific context of Bayesian networks, we need to explain the exact meaning and the notation of these two components:
The goal of this structural part is to apply Occam's Razor, or the law of parsimony, i.e., to choose the simplest hypothesis, all other things being equal.
The data component measures how well the model fits the data: the less probable the observed dataset is under the Bayesian network model, the more bits are required to describe it.
BayesiaLab attempts to minimize the MDL Score by evaluating candidate networks during structural learning.
“Bayesian inference is important because it provides a normative and general-purpose procedure for reasoning under uncertainty.”
Inductive Reasoning: Experimental, Developmental, and Computational Approaches, edited by Aidan Feeney and Evan Heit
Bayesian inference refers to an approach first proposed by Rev. Thomas Bayes (1702-1761), whose rule allows calculating the probability of an event A upon observing an event B.
Bayes' rule or Bayes' theorem relates the conditional and marginal probabilities of events A and B (provided that the probability of B is not equal to zero). More specifically, Bayes' rule allows calculating the conditional probability of event A given event B from the inverse conditional probability of event B given event A:

P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}
P(A|B) is the conditional probability of event A given event B. It is also called the "posterior" probability because it depends on knowledge of event B. This is the probability of interest.
Note that referring to "posterior" should not be interpreted in a temporal sense, i.e., it does not imply a temporal order between events A and B.
P(A) is the prior probability (or "unconditional" or "marginal" probability) of event A. The unconditional probability P(A) was first called "a priori" by Sir Ronald A. Fisher. It is a "prior" probability because it does not consider any information about event B.
P(B) is the prior or marginal probability of event B.
Note that "prior," just like "posterior," does not imply a temporal order.
P(B|A)/P(B) is the Bayes factor or likelihood ratio.
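As a numeric sketch, Bayes' rule can be checked in Python with marginals summed from the Hair Color / Eye Color joint table used earlier in this section:

```python
# Marginals computed from the Hair Color / Eye Color joint table:
p_blue = 0.3632             # P(Eye Color = Blue)
p_blond = 0.2145            # P(Hair Color = Blond)
p_blond_and_blue = 0.1588   # P(Hair Color = Blond, Eye Color = Blue)
p_blond_given_blue = p_blond_and_blue / p_blue  # P(B|A)

# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
p_blue_given_blond = p_blond_given_blue * p_blue / p_blond
print(round(p_blue_given_blond, 3))  # 0.74
```

The result agrees with computing P(Blue | Blond) directly from the joint table, i.e., 0.1588 / 0.2145.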
Bayesian networks are models that consist of two parts:
A qualitative part to represent the dependencies using a Directed Acyclic Graph (DAG).
A quantitative part, using local probability distributions, for specifying the probabilistic relationships.
A Directed Acyclic Graph (DAG) consists of nodes and directed links:
Nodes represent variables of interest (e.g., the temperature of a device, the gender of a patient, a feature of an object, or the occurrence of an event).
Nodes can correspond to symbolic/categorical variables, numerical variables with discrete values, or discretized continuous variables.
Directed arcs represent statistical (informational) or causal dependencies among the variables. The directions are used to define kinship relations, i.e., parent-child relationships.
For example, in a Bayesian network with an arc from X to Y, X is the parent node of Y, and Y is the child node.
The local probability distributions can be either marginal for nodes without parents (Root Nodes) or conditional for nodes with parents.
In the latter case, the dependencies are quantified by Conditional Probability Tables (CPT) for each node given its parents in the Directed Acyclic Graph (DAG).
Once fully specified, a Bayesian network compactly represents the Joint Probability Distribution (JPD).
Thus, the Bayesian network can be used for computing the posterior probabilities of any subset of nodes given evidence set on any other subset.
The following illustration shows a simple Bayesian network, which consists of only two nodes and one directed arc.
This Bayesian network represents the Joint Probability Distribution (JPD) of the variables Eye Color and Hair Color in a population of students (Snee, 1974).
Eye Color is a Root Node and, therefore, does not have any Parents. In other words, Eye Color does not depend on any other node.
As a result, the table associated with Eye Color is a Probability Table, i.e., it represents the marginal distribution of Eye Color unconditionally.
On the other hand, the probabilities of Hair Color are only defined conditionally upon the values of its parent node, Eye Color.
Hence, the probabilities of Hair Color are provided in a Conditional Probability Table (CPT).
It is important to point out that this Bayesian network does not imply any causal relationships, even though the arc direction may suggest that to a casual observer.
The arc direction merely defines the parent-child relationship of the nodes for purposes of representing the Joint Probability Distribution (JPD).
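The factorization behind this two-node network can be sketched as follows; the CPT entry is derived from the joint table earlier in this section:

```python
# P(Eye Color, Hair Color) = P(Eye Color) * P(Hair Color | Eye Color)
p_eye_blue = 0.3632                        # marginal of the Root Node Eye Color
p_blond_given_blue = 0.1588 / p_eye_blue   # one CPT entry of Hair Color

# The factorization recovers the original joint probability:
p_joint = p_eye_blue * p_blond_given_blue
print(round(p_joint, 4))  # 0.1588
```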
To calculate the description length of the data given the Bayesian network, we utilize the fact that the description length corresponds to the negative log-probability of the observed data inferred by the model:

DL(D|B) = - \sum\limits_{j=1}^{N} \log_2 \left( p(o_j) \right)

where p(o_j) is the joint probability that the Bayesian network B returns for the observation o_j in row j of the dataset D.
The chain rule allows rewriting this joint probability as the product of the local conditional probabilities defined in the network:

p(o_j) = \prod\limits_{i=1}^{n} p\left( x_i^j \mid Pa_i^j \right)
Normalized Entropy is a metric that takes into account the maximum possible value of Entropy and returns a normalized measure of the uncertainty associated with the variable:

H_N(X) = \frac{H(X)}{\log_2(S_X)}

where S_X is the number of states of the variable X.
In this new example, we now compare the variables X1 and X2, which each represent ball colors:
X1 ∈ {blue, red}
X2 ∈ {blue, red, green, yellow, purple, orange, brown, black}
Normalized Entropy allows us to compare the degree of uncertainty even though these two variables have different numbers of states, i.e., two versus eight states.
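A small Python sketch illustrates the comparison; since the section does not list the actual distributions of X1 and X2, uniform distributions are assumed here:

```python
from math import log2

def entropy(probs):
    # H(X) = -sum_x p(x) * log2(p(x)); zero-probability states contribute nothing
    return -sum(p * log2(p) for p in probs if p > 0)

def normalized_entropy(probs):
    # H(X) divided by the maximum possible Entropy, log2(number of states)
    return entropy(probs) / log2(len(probs))

x1 = [0.5] * 2    # two ball colors, assumed equally likely
x2 = [0.125] * 8  # eight ball colors, assumed equally likely

print(entropy(x1), entropy(x2))                        # 1.0 vs. 3.0 bits
print(normalized_entropy(x1), normalized_entropy(x2))  # both 1.0 -> comparable
```

The raw Entropies differ (1 bit vs. 3 bits), but the Normalized Entropies coincide, which is what makes the metric comparable across variables.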
In BayesiaLab, the values of Entropy and Normalized Entropy can be accessed in a number of ways:
You can also sort the Monitors in the Monitor Panel according to their Normalized Entropy via Monitor Context Menu > Sort > Normalized Entropy.
The Normalized Entropy is also available as a Node Analysis metric for Size and Color in the 2D and 3D Mapping Tools.
In Function Nodes, Entropy and Normalized Entropy are available as Inference Functions in the Equation tab.
Entropy: Entropy(?X1?, False)
Normalized Entropy: Entropy(?X1?, True)
To illustrate these concepts, we use the familiar Visit Asia network:
In BayesiaLab, the Expected Log-Loss values can be shown in the context of the Monitors.
On the left, the Monitors of the two nodes show their marginal distributions.
Instead of replacing the states' probabilities with the Expected Log-Loss values in a Monitor, you can bring up the Expected Log-Loss values ad hoc as a Tooltip.
Then, when you hover over any Monitor with your cursor, a Tooltip shows the Expected Log-Loss values.
Entropy is expressed in bits and defined as follows:

H(X) = - \sum\limits_{x \in X} p(x) \log_2 \left( p(x) \right)
Let's assume we have four containers, A through D, which are filled with balls that can be either blue or red.
Container A is filled exclusively with blue balls.
Container B has an equal amount of red and blue balls.
In Container C, 10.89% of all balls are blue, and the remainder is red.
Container D only holds red balls.
Within each container, the order of balls is entirely random.
A volunteer who already knows the proportions of red and blue balls in each container now randomly draws one ball from each container. What is his degree of uncertainty regarding the ball color at the moment of each draw?
Needless to say, with Containers A and D, there is no uncertainty at all. From Containers A and D, he will draw a blue and red ball, respectively, with perfect certainty. What about the degree of certainty or, rather, uncertainty for Containers B and C?
The concept of Entropy can formally represent the degree of uncertainty.
Using the definition of Entropy from above, we can compute the Entropy value applicable to each draw.
We can also plot Entropy as a function of the probability of drawing a red ball.
This was an example of a variable with two states only. As we introduce more possible states, e.g., another ball color, the maximum possible Entropy increases.
As a result, one cannot compare the Entropy values of variables with different numbers of states.
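The container example can be verified with a short Python sketch of the binary Entropy function:

```python
from math import log2

def binary_entropy(p_red):
    # Entropy in bits of a draw with P(red) = p_red and P(blue) = 1 - p_red;
    # adding 0.0 normalizes the sign of -0.0 in the certain cases.
    return -sum(p * log2(p) for p in (p_red, 1 - p_red) if p > 0) + 0.0

# Containers A-D, with P(red) taken from the example above:
for name, p_red in [("A", 0.0), ("B", 0.5), ("C", 0.8911), ("D", 1.0)]:
    print(name, round(binary_entropy(p_red), 3))  # A 0.0, B 1.0, C 0.497, D 0.0
```

Containers A and D yield 0 bits (no uncertainty), Container B yields the maximum of 1 bit, and Container C falls in between at roughly 0.5 bits.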
DL(B) is the number of bits required to represent a Bayesian network B. We can break down this value into the sum of two components:

DL(B) = DL(G) + DL(P)

where
DL(G) stands for the number of bits required to represent the graph G of the Bayesian network, and
DL(P) represents the number of bits required to represent the set of probability tables P.
To calculate DL(G), we need to determine the number of nodes and the number of their parent nodes:
n is the number of random variables (nodes),
Pa_i is the set of the random variables that are parents of X_i in graph G,
and ‖Pa_i‖ is the number of parents of the random variable X_i.
Computing DL(P) is straightforward, as it is proportional to the number of cells in all probability tables:
S_i is the number of states of the random variable X_i, and
p is the probability associated with a cell of a probability table.
Calculating Complexity: DL(B)

"The minimum number of bits required to represent a model" is denoted DL(B) ("Description Length of the Bayesian network B") and refers to the structural complexity of the Bayesian network model B, which includes the network graph and all probability tables.
For brevity, we often use the shorthand "complexity" or "structure" to refer to DL(B).
Small values of DL(B) suggest a simple model structure, and large values a complex model.

Calculating Fit: DL(D|B)
"The minimum number of bits required to represent the data given the model" is denoted DL(D|B) ("Description Length of the data D given the Bayesian network B") and refers to the likelihood of the data with respect to the Bayesian network model B.
Put simply, DL(D|B) refers to the "fit" of the model to the data.
Small values of DL(D|B) suggest a well-fitting model; large values, conversely, imply a poor fit.
o_j is the n-dimensional observation described in row j, and
p(o_j) is the joint probability of this observation returned by the Bayesian network B.
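The fit term can be sketched directly from this definition; the observation probabilities below are hypothetical:

```python
from math import log2

def dl_data_given_network(observation_probs):
    # DL(D|B) = -sum_j log2( p(o_j) ), summed over the rows of the dataset
    return -sum(log2(p) for p in observation_probs)

# Hypothetical joint probabilities a network might return for four rows:
probs = [0.5, 0.25, 0.25, 0.5]
print(dl_data_given_network(probs))  # 6.0 bits
```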
In Validation Mode, with the Information Mode activated, hovering over a Monitor with your cursor brings up a Tooltip that includes Entropy and Normalized Entropy.
The Log-Loss LL_B(E) reflects the number of bits required to encode an n-dimensional piece of evidence E (or observation) given the current Bayesian network B. As a shorthand for "the number of bits required to encode," we use the term "cost" in the sense that "more bits required" means computationally "more expensive."

LL_B(E) = - \log_2 \left( P_B(E) \right)

where P_B(E) is the joint probability of the evidence E computed by the network B.
Furthermore, one of the key metrics in Information Theory is Entropy:

H_B(X) = - \sum\limits_{x \in X} P_B(x) \log_2 \left( P_B(x) \right) = \sum\limits_{x \in X} P_B(x) LL_B(x)

As a result, Entropy can be considered the sum of the Expected Log-Loss values of each state x of variable X given network B.
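This decomposition is easy to check in Python with a hypothetical two-state marginal distribution:

```python
from math import log2

dist = {"True": 0.25, "False": 0.75}  # hypothetical marginal distribution

# Expected Log-Loss of each state x: p(x) * LL(x) = -p(x) * log2(p(x))
expected_log_loss = {x: -p * log2(p) for x, p in dist.items()}

# Entropy is the sum of the Expected Log-Loss values of all states
entropy = sum(expected_log_loss.values())
print(round(expected_log_loss["True"], 4))  # 0.5
print(round(entropy, 4))                    # 0.8113
```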
We consider two nodes in the VisitAsia.xbl network.
On the right, we set evidence on the first node, which updates the probability distribution of the second node.
On the Monitor of that node, we now select Monitor Context Menu > Show Expected Log-Loss,
so that the Expected Log-Loss values of its states are shown instead of their probabilities.
This is an interesting example because setting the evidence does reduce the Entropy of the node but does not seem to change its Expected Log-Loss values.
The following plot illustrates the Entropy and the Expected Log-Loss values as a function of the state probability p.
In this binary case, the curves show how the Entropy can be decomposed into the Expected Log-Loss values of the two states.
The blue curve also confirms that the Expected Log-Loss values are identical for the two probabilities, i.e., 80.54% and 42.52%.
Click on the Information Mode icon in the Toolbar.
Entropy, denoted H(X), is a key metric in BayesiaLab for measuring the uncertainty associated with the probability distribution of a variable X.
The Entropy of a variable can also be understood as the sum of the Expected Log-Losses of its states.
We use a binary variable to represent the color of the ball.
| | Container A | Container B | Container C | Container D |
|---|---|---|---|---|
| P(Blue) | 100% | 50% | 10.89% | 0% |
| P(Red) | 0% | 50% | 89.11% | 100% |
| Entropy | 0 bits | 1 bit | ≈0.5 bits | 0 bits |
We see that Entropy reaches its maximum value for p=0.5, i.e., when drawing a red or a blue ball is equally probable. A 50/50 mix of red and blue balls is indeed the situation with the highest possible degree of uncertainty.
More specifically, the maximum value of Entropy increases logarithmically with the number of states of node X:

H_{max}(X) = \log_2(S_X)

where S_X is the number of states of the variable X.
To make Entropy comparable across variables, the Normalized Entropy metric is available, which takes into account this Maximum Entropy.
The Hellinger distance measures the similarity or dissimilarity between two probability distributions. It is often used in statistics and information theory to compare how two probability distributions differ or overlap.
The Hellinger distance is often used in various applications, such as statistical hypothesis testing, image processing, machine learning, and ecology, where comparing and quantifying the similarity or difference between probability distributions is important.
The Hellinger distance has several useful properties:
It is a metric: It satisfies the properties of a metric, which means it is non-negative, symmetric, and obeys the triangle inequality. In other words, it measures the distance between two distributions in a mathematically consistent way.
Interpretability: The Hellinger distance has a meaningful interpretation in terms of probability distributions. It quantifies how much the square root of the probability density functions of the two distributions differ.
In BayesiaLab, the Kullback-Leibler Divergence (or KL Divergence) is used to measure the strength of the relationship between two nodes that are directly connected by an arc.
We commonly refer to the KL Divergence also as Arc Force.
We interpret this difference DKL as the "force of the arc" or Arc Force.
Note that Filtered Values are taken into account for computing the Arc Force.
Throughout this website, we use Kullback-Leibler Divergence, KL Divergence, and Arc Force interchangeably.
The Log-Loss LL_B(E) reflects the number of bits required to encode an n-dimensional piece of evidence E (or observation) given the current Bayesian network B. As a shorthand for "the number of bits required to encode," we use the term "cost" in the sense that "more bits required" means computationally "more expensive."

LL_B(E) = - \log_2 \left( P_B(E) \right)

where P_B(E) is the joint probability of the evidence E computed by the network B.
In other words, the lower the probability of E given the network B, the higher the Log-Loss LL_B(E).
Note that E refers to a single piece of n-dimensional evidence, not an entire dataset.
For discrete probability distributions P and Q defined over the same probability space, the Hellinger distance is defined as:

H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{ \sum\limits_{i} \left( \sqrt{p_i} - \sqrt{q_i} \right)^2 }

where
H(P, Q) is the Hellinger distance between distributions P and Q, and
p_i and q_i are the probabilities associated with the event or outcome i in the two distributions.
It is bounded: The Hellinger distance is bounded between 0 and 1, where 0 indicates that the two distributions are identical, and 1 indicates that they are entirely dissimilar.
Square-root transformation: The square-root transformation in the formula gives more weight to the differences in the tails of the distributions compared to some other distance measures, such as the Bhattacharyya distance.
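The definition above translates directly into a few lines of Python:

```python
from math import sqrt

def hellinger(p, q):
    # H(P, Q) = (1 / sqrt(2)) * sqrt( sum_i (sqrt(p_i) - sqrt(q_i))^2 )
    return sqrt(sum((sqrt(pi) - sqrt(qi)) ** 2 for pi, qi in zip(p, q))) / sqrt(2)

print(hellinger([0.5, 0.5], [0.5, 0.5]))  # 0.0 -> identical distributions
print(hellinger([1.0, 0.0], [0.0, 1.0]))  # 1.0 -> entirely dissimilar
```

The two calls confirm the bounds stated above: identical distributions yield 0, and distributions with disjoint support yield 1.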
Formally, the Kullback-Leibler Divergence DKL measures the difference between two distributions P and Q:

D_{KL}(P \| Q) = \sum\limits_{x} P(x) \log_2 \frac{P(x)}{Q(x)}
For our purposes, we consider the Bayesian network B that does include the arc for which we wish to compute the Arc Force, and the Bayesian network B′ that does not contain that arc but is otherwise identical.
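The idea can be sketched with a hypothetical 2x2 joint distribution: the network with the arc represents the joint cells directly, while the arc-free network implies the product of the marginals:

```python
from math import log2

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) * log2( P(x) / Q(x) ), in bits
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical joint cells P(X, Y) with the arc X -> Y, and the independent
# distribution implied by removing the arc (both marginals are 0.5 / 0.5):
p_with_arc = [0.4, 0.1, 0.1, 0.4]
p_without_arc = [0.25, 0.25, 0.25, 0.25]
print(round(kl_divergence(p_with_arc, p_without_arc), 4))  # 0.2781 -> Arc Force
```

A stronger dependency between the two nodes would push the joint cells further from the product of the marginals and thus increase the Arc Force.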
In earlier versions of BayesiaLab, Information Gain was named Consistency.
So, for a network model to be useful, there should generally be more sets of evidence with a positive Information Gain, i.e., consistent observations, than sets of evidence with a negative Information Gain, i.e., conflicting observations.
Information Gain and Evidence Analysis
Network Performance
The Deviance measure is based on the difference between the Entropy of the data given the to-be-evaluated network B and the Entropy of the data given the complete (i.e., fully connected) network C:

Deviance(B) = 2 N \ln(2) \left( H_B(D) - H_C(D) \right)

The closer the Deviance value is to 0, the better the network B represents the dataset.
H_B(D) is the Entropy of the dataset D given the to-be-evaluated network B.
H_C(D) is the Entropy of the dataset D given the complete (i.e., fully connected) network C. In the complete network, all nodes are directly connected to all other nodes. Therefore, the complete network is an exact representation of the chain rule. As such, it does not utilize any conditional independence assumptions for representing the Joint Probability Distribution.
N is the size of the dataset.
Contingency Table Fit (CTF) measures the quality of the representation of the Joint Probability Distribution by a Bayesian network B compared to a complete (i.e., fully connected) network C:

CTF(B) = 100 \times \frac{H_U(D) - H_B(D)}{H_U(D) - H_C(D)}

where
H_U(D) is the entropy of the data D given the unconnected network U,
H_B(D) is the entropy of the data D given the evaluated network B, and
H_C(D) is the entropy of the data D given the complete (i.e., fully connected) network C. In the complete network, all nodes are directly connected to all other nodes. Therefore, the complete network is an exact representation of the chain rule. As such, it does not utilize any conditional independence assumptions for representing the Joint Probability Distribution.
CTF is equal to 100 if the Joint Probability Distribution is represented without any approximation, i.e., the entropy of the evaluated network B is the same as that obtained with the complete network C.
CTF is equal to 0 if the Joint Probability Distribution is represented by considering that all the variables are independent, i.e., the entropy of the evaluated network B is the same as the one obtained with the unconnected network U.
CTF can also be negative if the parameters of network B do not correspond to the dataset.
The Information Gain regarding evidence E is the difference between the:
Log-Loss LL_U(E), given an unconnected network U, i.e., a so-called straw model, in which all nodes are marginally independent; and the
Log-Loss LL_B(E), given a reference network B:

IG(E) = LL_U(E) - LL_B(E)

The Log-Loss reflects the "cost" in bits of applying the network to evidence E, i.e., the number of bits that are needed to encode evidence E. The lower the probability of evidence E, the higher the Log-Loss.
As a result, a positive value of Information Gain reflects a "cost saving" for encoding evidence E by virtue of having network B. In other words, encoding E with network B is less "costly" than encoding it with the straw model U. Therefore, evidence E is consistent with network B.
Conversely, a negative Information Gain indicates a so-called conflict: the Log-Loss of evidence E is higher with the reference network B than with the straw model U. Note that conflicting evidence does not necessarily mean that the reference network B is wrong. Rather, it probably indicates that such a set of evidence belongs to the tail of the distribution that is represented by the reference network B.
However, if evidence E is drawn from the original data on which the reference network B was learned, the probability of observing conflicting evidence should be smaller than the probability of observing consistent evidence.
Therefore, the mean value of the Information Gain of a reference network B compared to a straw model U is a useful performance indicator of the reference network B.
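A minimal numeric sketch with hypothetical evidence probabilities:

```python
from math import log2

def log_loss(p):
    # LL(E) = -log2( P(E) ): the cost in bits of encoding evidence E
    return -log2(p)

p_straw = 0.01       # hypothetical P(E) under the unconnected straw model U
p_reference = 0.08   # hypothetical P(E) under the reference network B

# Information Gain: IG(E) = LL_U(E) - LL_B(E)
information_gain = log_loss(p_straw) - log_loss(p_reference)
print(round(information_gain, 4))  # 3.0 -> E is consistent with network B
```

Since the reference network makes the evidence eight times more probable than the straw model, encoding it saves log2(8) = 3 bits.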
BayesiaLab estimates the parameters of a Bayesian network using Maximum Likelihood Estimation.
The probability of a state of a node corresponds to the frequency with which the state is observed in the dataset.
Let's consider this simple network:
The marginal probability distribution of X is estimated as:

P(X = x) = \frac{N(X = x)}{N}

where N(\cdot) represents the number of occurrences of the specified configuration in the dataset, and N is the total number of samples.
The conditional probability distribution of X given its parents Pa is estimated as:

P(X = x \mid Pa = pa) = \frac{N(X = x, Pa = pa)}{N(Pa = pa)}
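Maximum Likelihood Estimation by counting can be sketched with a small hypothetical dataset of (parent, child) observations:

```python
from collections import Counter

# Hypothetical rows of (Pa, X): e.g., Pa = weather, X = ground condition
rows = [("rain", "wet"), ("rain", "wet"), ("rain", "dry"),
        ("sun", "dry"), ("sun", "dry"), ("sun", "wet")]

n = len(rows)
pa_counts = Counter(pa for pa, x in rows)  # counts of parent configurations
joint_counts = Counter(rows)               # counts of (parent, child) pairs

p_rain = pa_counts["rain"] / n                                        # P(Pa = rain)
p_wet_given_rain = joint_counts[("rain", "wet")] / pa_counts["rain"]  # P(X = wet | Pa = rain)

print(p_rain)                      # 0.5
print(round(p_wet_given_rain, 4))  # 0.6667
```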
Priors reflect any a priori knowledge of an analyst regarding the domain, in other words, expert knowledge. See also Prior Knowledge for Structural Learning.
These priors are expressed with an analyst-specified, initial Bayesian network (structure and parameters) plus analyst-specified Prior Samples.
Prior Samples represent the analyst's subjective degree of confidence in the Priors.
BayesiaLab uses these two terms to generate virtual samples that are subsequently combined with the observed samples from the dataset.
With your current Bayesian network, you can generate Priors: select Main Menu > Data > Prior Samples > Generate.
Edit Number of Uniform Prior Samples allows you to define prior knowledge in such a way that all the variables are marginally independent (fully unconnected network), and the marginal probability distributions of all nodes are uniform.
For instance, if the number of Prior Samples is set to 1, one observation ("occurrence") would be "spread across" all states of each node, essentially assigning a "fraction of an observation" to each node's states.
To apply Smoothed Probability Estimation, select Main Menu > Edit > Edit Smoothed Probability Estimation and specify the number of Prior Samples.
In BayesiaLab's approach to learning and analyzing Bayesian networks, statistical concepts play a secondary role compared to concepts from the field of Information Theory.
Nevertheless, statistical measures, such as correlation, can provide certain insights that are unavailable from non-statistical measures.
Where the covariance is defined by:

cov(X, Y) = \sum\limits_{i} \sum\limits_{j} p(x_i, y_j) (x_i - \mu_X)(y_j - \mu_Y)

And the standard deviation:

\sigma_X = \sqrt{ \sum\limits_{i} p(x_i) (x_i - \mu_X)^2 }
In BayesiaLab, there are Discrete Nodes and Continuous Nodes with discretized numerical states. As a result, the value of a node's state may not always be apparent:
For Discrete Nodes that have states with integer or real values, BayesiaLab uses these numerical values directly.
For Continuous Nodes, BayesiaLab uses the mean value of each interval.
Please see Mean, Value, and Standard Deviations for a detailed discussion.
BayesiaLab utilizes proprietary score-based learning algorithms.
However, choosing too low a value for the Structural Coefficient might result in "overfitting," i.e., learning "insignificant" relationships, in other words, discovering patterns in what turns out to be mere noise.
BayesiaLab can help reduce the risk of overfitting with the Structural Coefficient Analysis feature.
The Markov Blanket of node A is the set of nodes composed of A’s parents, its children, and its children’s other parents (i.e., spouses).
The Markov Blanket of node A contains all the nodes that, if we know their states, i.e., we have hard evidence for these nodes, will make A independent of all other nodes.
This means that the Markov Blanket of node A is the only knowledge needed to predict the posterior probability distribution of that node.
Learning a Markov Blanket selects the most relevant predictor nodes, which is particularly helpful when there are many variables in a data set. As a result, this can serve as a highly efficient variable selection method.
BayesiaLab can also take into account Priors when estimating parameters:

P(X = x) = \frac{N(X = x) + M_0 \times P_0(X = x)}{N + M_0}

where
M_0 is the degree of confidence in the Prior, and
P_0 is the joint probability returned by the prior Bayesian network.
You can specify M_0 by setting the number of Prior Samples.
BayesiaLab uses the current Bayesian network to compute P_0.
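Under these definitions, parameter estimation with Priors amounts to mixing observed counts with M0 virtual samples drawn from P0; the numbers below are hypothetical:

```python
# P(X = x) = ( N(X = x) + M0 * P0(X = x) ) / ( N + M0 )
def smoothed_estimate(count_x, n_total, p0_x, m0):
    return (count_x + m0 * p0_x) / (n_total + m0)

# A state never observed in 10 samples, with a uniform prior over two states:
print(round(smoothed_estimate(0, 10, 0.5, 1), 4))  # 0.0455 with one Prior Sample
print(smoothed_estimate(0, 10, 0.5, 0))            # 0.0 -> pure Maximum Likelihood
```

Even a single Prior Sample prevents unobserved states from receiving a hard zero probability.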
The existence of a new Virtual Database is indicated by an icon in the lower right corner of the graph window, next to the "real dataset" icon.
Right-clicking on the Virtual Database icon displays the structure of the prior knowledge that was used for generating the Virtual Samples.
These Virtual Samples will be combined with the observed "real" samples during the learning process.
The Pearson Correlation Coefficient r between two nodes X and Y is defined as the covariance of the two corresponding variables divided by the product of their standard deviations:

r = \frac{cov(X, Y)}{\sigma_X \sigma_Y}

where
x_i is the value that is associated with the state i of node X,
\mu_X is the Mean of the node X,
p(x_i) is the marginal probability of state x_i returned by the Bayesian network, and
p(x_i, y_j) is the joint probability of states x_i and y_j returned by the Bayesian network.
For calculating the Pearson Correlation r, BayesiaLab must use the values of node states.
For Discrete Nodes that have states without values, e.g., {red, green, blue}, BayesiaLab uses the indices of the states as values, i.e., {red, green, blue} would have the values {0, 1, 2} for the purpose of calculating r. Note that the index of states starts at 0.
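Putting the definitions together, the correlation can be computed from a discrete joint distribution; the 2x2 distribution below is hypothetical, with state values 0 and 1:

```python
from math import sqrt

values_x = [0, 1]
values_y = [0, 1]
joint = [[0.4, 0.1],   # joint[i][j] = P(X = values_x[i], Y = values_y[j])
         [0.1, 0.4]]

px = [sum(row) for row in joint]        # marginal distribution of X
py = [sum(col) for col in zip(*joint)]  # marginal distribution of Y
mu_x = sum(v * p for v, p in zip(values_x, px))  # mean of X
mu_y = sum(v * p for v, p in zip(values_y, py))  # mean of Y
sd_x = sqrt(sum(p * (v - mu_x) ** 2 for v, p in zip(values_x, px)))
sd_y = sqrt(sum(p * (v - mu_y) ** 2 for v, p in zip(values_y, py)))
cov = sum(joint[i][j] * (values_x[i] - mu_x) * (values_y[j] - mu_y)
          for i in range(2) for j in range(2))

r = cov / (sd_x * sd_y)  # Pearson Correlation Coefficient
print(round(r, 2))  # 0.6
```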
As opposed to constraint-based algorithms that use independence tests for adding or removing arcs between nodes, BayesiaLab employs the MDL Score to measure the quality of candidate networks with respect to the available data.
In BayesiaLab, the computation of the MDL Score also includes the so-called Structural Coefficient α as a weighting factor for the structural component DL(B).
With that, the MDL Score is calculated using the following formula:

MDL(B, D) = \alpha \times DL(B) + DL(D \mid B)

As a result, the choice of value for the Structural Coefficient α affects the relative weighting of the two components DL(B) and DL(D|B).
You can arbitrarily modify the Structural Coefficient α within the range of 0 to 150.
α = 1, the default value, means the components DL(B) and DL(D|B) are weighted equally.
α < 1 reduces the contribution of DL(B) in the formula and, thus, allows for more "structural complexity."
α > 1 increases the contribution of DL(B) in the formula, i.e., it penalizes "structural complexity," forcing a simpler model.
There is another way to interpret the Structural Coefficient α, which can help understand its role in learning a Bayesian network.
Weighting DL(B) with a factor α is equivalent to changing the original number of observations N in a dataset to a new number of observations N′:

N' = \frac{N}{\alpha}

An α value of 0 would be the same as having an infinite number of observations N′. As a result, the MDL Score would only be based on the fit component of the score, i.e., DL(D|B), and BayesiaLab's structural learning algorithms would produce a fully connected network.
At the other extreme, an α value of 150 would massively favor the simplest possible network structures, as the new equivalent number of observations N′ would only be 1/150th of N.
It is perhaps more intuitive to consider the new number of observations N′ as weighted counts of the actual observations N. For instance, α = 0.5 is equivalent to counting all observations twice.
From a practical perspective, the Structural Coefficient α can be considered a kind of "significance" threshold for structural learning.
The higher you set the α value, the higher the threshold for discovering probabilistic relationships. Conversely, the lower you set the α value, the lower the discovery threshold, and weaker probabilistic relationships will still be found and represented by arcs.
Reducing α can be helpful if you have a small dataset from which you want to learn a model. Perhaps at the default value, α = 1, the learning algorithm would not find any arcs.
Based on Mutual Information, Normalized Mutual Information includes a normalization factor, log2(S_Y),
where S_Y denotes the number of states of Y.
This means that the Mutual Information is divided by the maximum possible entropy of Y, i.e., log2(S_Y).
With that, the formal definition of Normalized Mutual Information is:

NMI(X, Y) = \frac{I(X, Y)}{\log_2(S_Y)}
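A short sketch computes Mutual Information and its normalized variant for a hypothetical 2x2 joint distribution:

```python
from math import log2

def mutual_information(joint, px, py):
    # I(X, Y) = sum_xy p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
    return sum(joint[i][j] * log2(joint[i][j] / (px[i] * py[j]))
               for i in range(len(px)) for j in range(len(py))
               if joint[i][j] > 0)

joint = [[0.4, 0.1], [0.1, 0.4]]  # hypothetical joint distribution
px, py = [0.5, 0.5], [0.5, 0.5]   # its marginals

mi = mutual_information(joint, px, py)
nmi = mi / log2(len(py))          # divide by the maximum entropy of Y
print(round(mi, 4), round(nmi, 4))  # 0.2781 0.2781 (log2(2) = 1 here)
```

With a binary Y, the normalization factor is 1 bit, so MI and NMI coincide; for nodes with more states, NMI rescales MI to a comparable range.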
BayesiaLab reports the Normalized Mutual Information in the Target Analysis Report: Main Menu > Analysis > Report > Target > Relationship with Target Node.
Note that this table shows the Normalized Mutual Information of each node, e.g., XRay, Dyspnea, etc., with regard to the Target Node, Cancer.
In Preferences, Child refers to the Normalized Mutual Information from the Parent onto the Child node, i.e., in the direction of the arc.
Conversely, Parent refers to the Normalized Mutual Information from the Child onto the Parent node, i.e., in the opposite direction of the arc.
Latent Variables (or Factors) and Manifest Nodes are central to building many types of models in BayesiaLab, including Probabilistic Structural Equation Models.
A Manifest Variable is a variable for which there are recorded observations from a given domain.
Manifest Variables include variables such as the temperature, speed, or mass of an object, i.e., properties that are generally measurable.
Furthermore, survey responses are also typical Manifest Variables, as they refer to ratings or assessments directly stated by respondents. In this case, a consumer's opinion is made manifest in the survey response.
manifest (adj.), late 14c., "clearly revealed to the eye or the understanding, open to view or comprehension," from Old French manifest "evident, palpable," (12c.), or directly from Latin manifestus "plainly apprehensible, clear, apparent, evident;" of offenses, "proved by direct evidence;" of offenders, "caught in the act."
A Latent variable (or Factor), as opposed to a Manifest Variable, is a variable that cannot be directly observed. As a result, such a variable would not be recorded in a dataset collected from the original problem domain.
Latent typically refers to a theoretical or "hidden" concept or construct that cannot be observed directly, such as safety, health, freedom, etc.
Connecting Latent Variables to several Manifest Variables often allows inferring values of the Latent Variables based on the measurements of the Manifest Variables.
For instance, values for the Latent Variable Health could be inferred from a patient's Manifest Variables, such as body mass index, blood pressure, heart rate, lung function, etc.
The term Factor is entirely equivalent to Latent Variable. In the Bayesia Knowledge Base & Library, we use both terms interchangeably. Occasionally, we also refer to a Latent Factor, which is also the same.
Using Factor expresses intuitively that a Latent Variable can be a hidden cause of Manifest Variables. Consistent with the Latin origin of the word, a Factor can be the "doer" or "maker" behind Manifest Variables.
latent (adj.), mid-15c., "concealed, secret," from Latin latentem (nominative latens) "lying hid, concealed, secret, unknown," present participle of latere "lie hidden, lurk, be concealed."
factor (n.), early 15c., "commercial agent, deputy, one who buys or sells for another," from French facteur "agent, representative" (Old French factor, faitor "doer, author, creator"), from Latin factor "doer, maker, performer," in Medieval Latin, "agent," agent noun from past participle stem of facere "to do."
Probabilistic Structural Equation Models
Webinar: Factor Analysis Reinvented—Probabilistic Latent Factor Induction
Difference between SEM and PSEM Factors
The Mutual Information I(X, Y) measures the amount of information gained on variable X (the reduction in the Expected Log-Loss) by observing variable Y:
I(X, Y) = H(X) − H(X|Y)
The Venn Diagram below illustrates this concept:
The Conditional Entropy H(X|Y) measures, in bits, the Expected Log-Loss associated with variable X once we have information on variable Y:
H(X|Y) = −∑x ∑y P(x, y) log2 P(x|y)
Hence, the Conditional Entropy is a key element in defining the Mutual Information between X and Y.
Note that
I(X, Y) = H(X) − H(X|Y)
is equivalent to:
I(X, Y) = H(Y) − H(Y|X)
and furthermore equivalent to:
I(X, Y) = H(X) + H(Y) − H(X, Y)
This allows computing the Mutual Information between any two variables.
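The three equivalent expressions can be checked numerically. The Python sketch below uses a hypothetical joint distribution and derives the conditional entropies via the chain rule H(X|Y) = H(X, Y) − H(Y):

```python
import math

def h(probs):
    """Entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint probability table P(X, Y); values are illustrative.
joint = {('a', '0'): 0.25, ('a', '1'): 0.25,
         ('b', '0'): 0.40, ('b', '1'): 0.10}

p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

hx  = h(p_x.values())
hy  = h(p_y.values())
hxy = h(joint.values())            # joint entropy H(X, Y)

# Conditional entropies via the chain rule.
hx_given_y = hxy - hy
hy_given_x = hxy - hx

# All three expressions for the Mutual Information agree.
i1 = hx - hx_given_y
i2 = hy - hy_given_x
i3 = hx + hy - hxy
print(round(i1, 4))
```

Any of the three forms can therefore be used, depending on which entropies are easiest to compute.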
For a given network, BayesiaLab can report the Mutual Information in several contexts:
Main Menu > Analysis > Report > Target > Relationship with Target Node.
Note that this table shows the Mutual Information of each node, e.g., XRay, Dyspnea, etc., only with regard to the Target Node, Cancer.
Main Menu > Analysis > Report > Relationship Analysis:
Note that the corresponding options under Preferences > Analysis > Visual Analysis > Arc's Mutual Information Analysis have to be selected first:
In Preferences, Child refers to the Relative Mutual Information from the Parent onto the Child node, i.e., in the direction of the arc.
Conversely, Parent refers to the Relative Mutual Information from the Child onto the Parent node, i.e., in the opposite direction of the arc.
The Symmetric Normalized Mutual Information measure takes the difference of the respective entropies of X and Y into account:
For a given network, BayesiaLab can report the Symmetric Normalized Mutual Information in several contexts:
Main Menu > Analysis > Report > Relationship Analysis:
Note that the corresponding options under Preferences > Analysis > Visual Analysis > Arc's Mutual Information Analysis have to be selected first:
In Preferences, Child refers to the Relative Mutual Information from the Parent onto the Child node, i.e., in the direction of the arc.
Conversely, Parent refers to the Relative Mutual Information from the Child onto the Parent node, i.e., in the opposite direction of the arc.
In older versions of BayesiaLab, Relative Mutual Information was also called Normalized Mutual Information.
Please see the up-to-date definition of Normalized Mutual Information.
BayesiaLab reports the Relative Mutual Information in the Target Analysis Report: Main Menu > Analysis > Report > Target > Relationship with Target Node.
Note that this table shows the Relative Mutual Information of each node, e.g., XRay, Dyspnea, etc., only with regard to the Target Node, Cancer.
Note that the corresponding options under Preferences > Analysis > Visual Analysis > Arc's Mutual Information Analysis have to be selected first:
In Preferences, Child refers to the Relative Mutual Information from the Parent onto the Child node, i.e., in the direction of the arc.
Conversely, Parent refers to the Relative Mutual Information from the Child onto the Parent node, i.e., in the opposite direction of the arc.
At the top of each Monitor, the items Mean, Dev, and Value are displayed.
Mean refers to the Mean Value m and is only shown in the Monitors of numerical nodes.
Dev stands for Standard Deviation and is shown alongside Mean.
The calculations for Expected Value and Mean Value are shown in the context of the following examples:
Let's take the discrete node Age with three categorical Node States:
Child
Adult
Senior
In the Node Editor, you can assign State Values to the Node States of Age.
A Monitor of a categorical node does not show a Mean value.
Let's suppose that the node Age has three numerical Node States instead of categorical Node States.
In this context, we need to consider two conditions, with and without State Values specified in the Node Editor:
No State Values Specified
Here, State Values are not specified in the Values tab of the Node Editor. Note the empty Value column below.
As a result, BayesiaLab uses the numerical values of the Node States, as they appear in the States tab, as the State Values.
Furthermore, as Age is a numerical node, its Monitor will now display the Mean (Mean) and the Standard Deviation (Dev) in addition to the Expected Value (Value).
The Mean m is computed using the numerical values of the Node States and the marginal probability distribution of the Node States:
m = ∑i pi · xi
Note that Mean and Value are identical in this case.
State Values Specified
However, if State Values are separately specified in the Values tab of the Node Editor, they will be used for the calculation of Value in the Monitor.
To highlight the distinction between the Node States {10, 40, 70} and the State Values, we assign unrelated arbitrary State Values of 0, 1, and 2.
Note that Mean and Value are not identical in this case.
Let's now consider a continuous variable Age defined in the domain [0; 99], discretized into three states:
Child: [0 ; 18]
Adult: ]18 ; 65]
Senior: ]65 ; 99]
Given Age is a numerical node, its Monitor shows the Mean (Mean), the Standard Deviation (Dev), plus the Expected Value (Value).
No Associated Data
So, the Mean Value m is computed as follows:
Associated Data
If you set a new piece of evidence on a node that modifies the distribution of the node, the Monitor displays a delta value in parentheses adjacent to Value.
This delta is the difference between the current Expected Value v and:
The Normalized Mutual Information can also be shown by selecting Main Menu > Analysis > Visual > Overall > Arc > Mutual Information
and then clicking the Show Arc Comments icon or selecting Main Menu > View > Show Arc Comments.
Note that the corresponding options under Main Menu > Preferences > Analysis > Visual Analysis > Arc's Mutual Information Analysis have to be selected first:
Prior to inferring the values of a newly-created Latent Variable, it would appear as a Hidden Node on BayesiaLab's Graph Panel.
The Mutual Information can also be shown by selecting Main Menu > Analysis > Visual > Overall > Arc > Mutual Information
and then clicking the Show Arc Comments icon or selecting Main Menu > View > Show Arc Comments.
The following Venn Diagram illustrates that the Mutual Information is symmetrical for the two variables X and Y, i.e., I(X, Y) = I(Y, X).
However, the variables and can each have a different number of states. Therefore, their respective entropies can be very different.
This means that the absolute value of the Mutual Information cannot be interpreted without context. In the Venn Diagram, for instance, the Mutual Information reduces H(Y) by a bigger percentage than it reduces H(X). As such, X would be more "important" with regard to Y than Y would be with regard to X.
As a result, we have an easy-to-interpret measure that relates the Mutual Information to both X and Y together.
The Symmetric Normalized Mutual Information can also be shown by selecting Main Menu > Analysis > Visual > Overall > Arc > Mutual Information
and then clicking the Show Arc Comments icon or selecting Main Menu > View > Show Arc Comments.
Based on the Mutual Information I(X, Y), Relative Mutual Information is defined as:
RMI(X, Y) = I(X, Y) / H(Y)
Relative Mutual Information expresses in percent how much the entropy (or uncertainty) of Y is reduced by observing X.
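As a quick numeric sketch — assuming the Relative Mutual Information divides I(X, Y) by the entropy H(Y) of the variable being predicted, and using a hypothetical 2×2 joint table:

```python
import math

# Hypothetical joint probability table P(X, Y); values are illustrative.
joint = [[0.35, 0.15],
         [0.10, 0.40]]

p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

h_y = -sum(p * math.log2(p) for p in p_y if p > 0)   # entropy H(Y)

mi = sum(p * math.log2(p / (p_x[i] * p_y[j]))
         for i, row in enumerate(joint)
         for j, p in enumerate(row) if p > 0)

# Relative Mutual Information: share of Y's uncertainty removed by X.
rmi = mi / h_y
print(f"{rmi:.1%}")
```

The result reads directly as a percentage of the uncertainty in Y that observing X removes.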
The Relative Mutual Information can also be shown by selecting Main Menu > Analysis > Visual > Overall > Arc > Mutual Information
and then clicking the Show Arc Comments icon or selecting Main Menu > View > Show Arc Comments.
Value refers to the Expected Value and is shown in all Monitors, regardless of the node type, i.e., categorical or numerical.
For each node, the Expected Value is computed using the assigned State Values and the marginal probability distribution of the Node States:
v = ∑i pi · vi
where pi is the marginal probability of state i and vi is its associated value.
The Monitor shows the result as the Value of Age.
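The computation behind the Value item is a simple probability-weighted sum. A minimal Python sketch, with illustrative probabilities and State Values for the Age example (the numbers are assumptions, not BayesiaLab output):

```python
# Expected Value v = sum of p_i * v_i over all Node States.
# Probabilities and State Values below are illustrative only.
probs  = [0.20, 0.60, 0.20]   # marginal probabilities p_i (Child, Adult, Senior)
values = [10.0, 40.0, 70.0]   # State Values v_i

expected_value = sum(p * v for p, v in zip(probs, values))
print(round(expected_value, 2))
```

With these numbers, the weighted sum 0.2·10 + 0.6·40 + 0.2·70 yields 40.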
where xi is the numerical value of the Node State.
The Expected Value is now computed using the assigned State Values {0, 1, 2} and the marginal probability distribution of the Node States:
v = ∑i pi · vi
where pi is the marginal probability of state i and vi is its associated value.
The Monitor shows the result as the Value of Age.
If no data is associated with the node, both the Mean m and the Expected Value v are computed from state values defined as the mid-points of the minimum and maximum values of each Node State. For example, for the Node State Adult ]18; 65], the mid-point is 41.5.
The Expected Value is calculated analogously:
If data is associated with the node, the value of each Node State is defined as the arithmetic mean of the data points that are associated with that state.
Furthermore, clicking on the Generate Values button in the Node Editor sets the values to the current arithmetic means of each Node State.
the Expected Value before setting the modifying evidence, or
the Expected Value that corresponds to the Reference Probability Distribution, which you can set with the icon in the toolbar.
If only some Node States have an associated value, the Expected Value is computed from the subset of Node States that do have an associated value.
If a node has only a single Node State with an associated value, the corresponding Monitor does not report the Expected Value v.
The Mutual Information between two variables X and Y is defined as follows:
I(X, Y) = ∑x ∑y P(x, y) log2( P(x, y) / (P(x) · P(y)) )
The Kullback-Leibler Divergence (or KL Divergence) is used to measure the strength of the relationship between two nodes that are directly connected by an arc.
We commonly refer to the KL Divergence as Arc Force.
Formally, the Kullback-Leibler Divergence measures the difference between two distributions P and Q:
DKL(P || Q) = ∑x P(x) log2( P(x) / Q(x) )
For our purposes, we consider the Bayesian network B that does include the arc for which we wish to compute the Arc Force, and the Bayesian network B′ that does not contain that arc but is otherwise identical.
We interpret this difference as the "force of the arc" or Arc Force.
Mutual Information can be rewritten as a Kullback-Leibler Divergence between the joint distribution and the product of the marginal distributions:
I(X, Y) = DKL( P(X, Y) || P(X) · P(Y) )
Therefore, Mutual Information and Arc Force are identical if there are no spouses (co-parents) involved in the relationship of interest.
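This identity is easy to verify for a two-node network: removing the only arc leaves the product of the marginals, so the KL Divergence between the joint distribution and that product is exactly the Mutual Information. A Python sketch with a hypothetical joint table:

```python
import math

# Hypothetical joint table P(X, Z) for a two-node network X -> Z.
# Removing the arc replaces P(X, Z) by P(X) * P(Z), so the Arc Force
# D_KL(P(X,Z) || P(X)P(Z)) equals the Mutual Information I(X, Z).
joint = [[0.30, 0.20],
         [0.10, 0.40]]

p_x = [sum(row) for row in joint]
p_z = [sum(col) for col in zip(*joint)]

arc_force = sum(p * math.log2(p / (p_x[i] * p_z[j]))
                for i, row in enumerate(joint)
                for j, p in enumerate(row) if p > 0)
print(round(arc_force, 4))
```

With spouses (co-parents) present, the two networks differ by more than a product of marginals, which is where Arc Force and Mutual Information diverge.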
Let's consider the following network consisting of two nodes, X and Z.
The Conditional Probability Table associated with the node Z is defined as follows:
The top number in the box shows the Mutual Information I(X, Z).
The bottom number in the box is the Symmetric Normalized Mutual Information.
The top number in the box shows the Arc Force.
The bottom number, in blue, represents the relative weight of this arc compared to the sum of all Arc Forces in the network. Given that this network consists only of one arc, this arc's weight accounts for 100%.
So, for now, both analyses return the same value, i.e., 0.3436. As we stated above, Mutual Information and Arc Force are identical with regard to an arc if no spouses (co-parents) are involved in the relationship of interest.
However, as soon as we have spouses (co-parents) involved, the Arc Force provides a more comprehensive characterization of the relationship.
Let's consider the following deterministic example, in which node Z represents an Exclusive-OR (XOR) gate with regard to its inputs X and Y.
The Truth Table associated with the node Z is defined as follows:
We can easily validate this assessment by simulating evidence for X and Y individually.
Indeed, neither X nor Y individually has any impact on Z.
The Arc Force, which takes into account the network as a whole, reveals the perfectly-deterministic relationship between X, Y, and Z.
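The XOR behavior can be reproduced in a few lines of Python: the pairwise Mutual Information between X and Z is exactly zero, even though Z is fully determined by X and Y together. This sketch assumes uniform, independent inputs, as in the example above:

```python
import math

def mi(joint, p_a, p_b):
    """Mutual Information in bits from a joint table and its marginals."""
    return sum(p * math.log2(p / (p_a[i] * p_b[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)

# Z = XOR(X, Y), with X and Y independent and uniform:
# each (x, y) pair has probability 0.25, and z = x ^ y.
states = [(x, y, x ^ y) for x in (0, 1) for y in (0, 1)]

# Pairwise joint P(X, Z): marginalize out Y.
joint_xz = [[0.0, 0.0], [0.0, 0.0]]
for x, y, z in states:
    joint_xz[x][z] += 0.25

p_x = [sum(row) for row in joint_xz]
p_z = [sum(col) for col in zip(*joint_xz)]

# X alone carries no information about Z ...
print(mi(joint_xz, p_x, p_z))   # 0.0

# ... yet (X, Y) together determine Z with certainty: H(Z | X, Y) = 0.
```

This is precisely the case where the arc-by-arc Mutual Information misses a relationship that the Arc Force, computed on the network as a whole, captures.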
Symmetric Relative Mutual Information computes the percentage of information gained by observing X and Y:
This normalization is calculated similarly to Pearson's Correlation Coefficient r:
r(X, Y) = cov(X, Y) / √(var(X) · var(Y))
where var denotes variance.
So, Mutual Information is comparable to covariance, and Entropy is analogous to variance.
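Carrying the analogy into code, the sketch below normalizes the Mutual Information by the geometric mean of the two entropies, mirroring r = cov(X, Y) / √(var(X)·var(Y)). Both this Pearson-style normalization and the joint table are illustrative assumptions, not a restatement of BayesiaLab's exact formula:

```python
import math

def h(probs):
    """Entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint probability table P(X, Y); values are illustrative.
joint = [[0.30, 0.20],
         [0.05, 0.45]]

p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

mi = sum(p * math.log2(p / (p_x[i] * p_y[j]))
         for i, row in enumerate(joint)
         for j, p in enumerate(row) if p > 0)

# Pearson-style normalization (assumed form): divide the Mutual
# Information by the geometric mean of the two entropies, just as
# the correlation divides covariance by the two standard deviations.
srmi = mi / math.sqrt(h(p_x) * h(p_y))
print(round(srmi, 4))
```

Like a correlation coefficient, the normalized value is symmetric in X and Y and independent of which variable plays the role of the target.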
For a given network, BayesiaLab can report the Symmetric Relative Mutual Information in several contexts:
Main Menu > Analysis > Report > Relationship Analysis:
Note that the corresponding options under Preferences > Analysis > Visual Analysis > Arc's Mutual Information Analysis have to be selected first:
In Preferences, Child refers to the Relative Mutual Information from the Parent onto the Child node, i.e., in the direction of the arc.
Conversely, Parent refers to the Relative Mutual Information from the Child onto the Parent node, i.e., in the opposite direction of the arc.
The Total Effect (TE) is estimated as the derivative of the Target Node with respect to the driver node under study.
The Total Effect represents the change in the mean of the Target Node associated with — and not necessarily caused by — a small modification of the mean of a driver node.
The Total Effect is the ratio of these two changes, i.e., the change in the mean of the Target Node divided by the change in the mean of the driver node.
The Standardized Total Effect (STE) is also displayed. It represents the Total Effect multiplied by the ratio of the standard deviation of the driver node and the standard deviation of the Target Node.
This means that Standardized Total Effect takes into account the “potential” of the driver under study.
To provide some intuition for the Arc Force and Node Force measures computed by BayesiaLab, we use the water hose and balloon metaphor:
Imagine that we have a Bayesian network in which the variables are balloons and the arcs are elastic, perforated water hoses. The size of the holes in the hose represents the uncertainty contained in the conditional probability table associated with the child node.
For a deterministic relationship (i.e., we know the state of one variable given the state of the other one with certainty), there are no holes at all in the hose, and therefore, no water is lost between these two nodes.
Conversely, for an entirely uncertain relationship, in which information on one variable does not yield any information regarding the other one (such a “relationship” cannot be machine-learned as there is no correlation in the dataset), the size of the holes would be so large that no water could be transmitted from one node to the other.
Now, we are sending a constant flow of water into this system. The thickness of a hose represents the actual water flow and is inversely proportional to the size of its holes. Big holes mean that most water leaks, and the effective water flow is minimal.
The pressure in a balloon, and therefore its size, depends on the number of connected hoses and the sizes of their respective holes.
BayesiaLab's Mapping function visualizes Node Force and Arc Force so you can easily identify the most important variables in a network, even in high-dimensional spaces.
In the networks below, for instance, the most important nodes are Country, Age, and Gender:
We now analyze this relationship in terms of Mutual Information in Validation Mode using Main Menu > Analysis > Visual > Overall > Arc > Arcs' Mutual Information
and click on the Arc Comments icon in the Toolbar.
Next, we analyze this relationship in terms of Arc Force using Main Menu > Analysis > Visual > Overall > Arc > Kullback-Leibler
and, again, click on the Arc Comments icon in the Toolbar.
where X is the analyzed variable and Y is the Target Node.