
Context

In BayesiaLab, nearly all learning and analysis functions are based on principles and metrics from the field of Information Theory.

In this section, we summarize some of these concepts and attempt to relate them to the corresponding BayesiaLab functions.

Furthermore, we include several relevant statistical concepts for understanding BayesiaLab's estimates and visualizations.

Context

“Bayesian inference is important because it provides a normative and general-purpose procedure for reasoning under uncertainty.”

Inductive Reasoning: Experimental, Developmental, and Computational Approaches, edited by Aidan Feeney and Evan Heit

Bayesian inference refers to an approach first proposed by Rev. Thomas Bayes (1702-1761), whose rule allows calculating the probability of an event A upon observing an event B.

Bayes' Rule

Bayes' rule or Bayes' theorem relates the conditional and marginal probabilities of events A and B (provided that the probability of B is not equal to zero). More specifically, Bayes' rule allows calculating the conditional probability of event A given event B from the inverse conditional probability of event B given event A:

$P(A|B) = P(A) \times {{P(B|A)} \over {P(B)}}$

Posterior

$P(A|B)$ is the conditional probability of event A given event B. It is also called the "posterior" probability because it depends on knowledge of event B. This is the probability of interest.

Note that referring to "posterior" should not be interpreted in a temporal sense, i.e., it does not imply a temporal order between events A and B.

Prior

$P(A)$ is the prior probability (or “unconditional” or “marginal” probability) of event A. The unconditional probability P(A) was first called “a priori” by Sir Ronald A. Fisher. It is a “prior” probability because it does not take into account any information about event B.

$P(B)$ is the prior or marginal probability of event B.

Note that "prior," just like "posterior," does not imply a temporal order.

Likelihood Ratio

${{P(B|A)} \over {P(B)}}$ is the Bayes factor or likelihood ratio.
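As a quick numerical illustration (the probability values below are invented for this sketch, not taken from the text), Bayes' rule can be verified directly:

```python
# Hypothetical example: P(A), P(B|A), and P(B|not A) are assumed values,
# chosen only to illustrate Bayes' rule.
p_a = 0.10          # prior P(A)
p_b_given_a = 0.80  # likelihood P(B|A)
p_b_given_not_a = 0.30

# Marginal probability of B via the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior: P(A|B) = P(A) * P(B|A) / P(B)
p_a_given_b = p_a * p_b_given_a / p_b
print(round(p_a_given_b, 4))  # → 0.2286
```

Observing B raises the probability of A from the prior 10% to a posterior of roughly 23%, because B is much more likely under A than otherwise.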

Calculating Fit: DL(D|B)

To calculate the description length of the data given the Bayesian network, we utilize the fact that the description length is inversely proportional to the probability of the observed data inferred by the model.

$\begin{array}{l} DL(D|B) = \sum\limits_{j = 1}^N {DL({e_j}|B)} \\ DL(D|B) = \sum\limits_{j = 1}^N {{{\log }_2}\left( {\frac{1}{{{P_B}({e_j})}}} \right)} \\ DL(D|B) = - \sum\limits_{j = 1}^N {{{\log }_2}\left( {{P_B}({e_j})} \right)} \end{array}$

where

${e_j}$ is the n-dimensional observation described in row ${j}$, and

${P_B}\left( {{e_j}} \right)$ is the joint probability of this observation returned by the Bayesian network $B$.

The chain rule allows rewriting this equation with:

$\begin{array}{l} DL(D|B) = - \sum\limits_{j = 1}^N {{{\log }_2}\left( {\prod\limits_{i = 1}^n {{P_B}({x_{ij}}|{\pi _{ij}})} } \right)} \\ DL(D|B) = - \sum\limits_{j = 1}^N {\sum\limits_{i = 1}^n {{{\log }_2}\left( {{P_B}({x_{ij}}|{\pi _{ij}})} \right)} } \end{array}$
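As a hedged sketch of the last form of this equation, the sum can be computed directly for a tiny two-node network X → Y. The network, its probabilities, and the dataset below are all invented for illustration; they are not BayesiaLab output.

```python
import math

# Hypothetical two-node network X -> Y (all numbers assumed for illustration)
p_x = {"a": 0.6, "b": 0.4}                      # marginal P(X)
p_y_given_x = {"a": {"t": 0.9, "f": 0.1},       # CPT P(Y|X)
               "b": {"t": 0.2, "f": 0.8}}

# A small dataset D of N observations (rows e_j)
data = [("a", "t"), ("a", "t"), ("b", "f"), ("a", "f"), ("b", "t")]

# DL(D|B) = -sum_j log2 P_B(e_j), with P_B(e_j) factored by the chain rule
# into P(x_j) * P(y_j | x_j)
dl = -sum(math.log2(p_x[x] * p_y_given_x[x][y]) for x, y in data)
print(round(dl, 2))  # → 11.12
```

Rows that the model considers improbable (e.g., `("a", "f")`) contribute many bits; probable rows contribute few, which is why a better-fitting network yields a smaller $DL(D|B)$.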

Definition

The **Minimum Description Length Score (MDL Score)** is derived from Information Theory and has been used extensively in the Artificial Intelligence community.

It consists of the sum of two components that estimate:

the minimum number of bits required to represent a model, and

the minimum number of bits required to represent the data given the model.

However, in the specific context of Bayesian networks, we need to explain the exact meaning and the notation of these two components:

"the minimum number of bits required to represent a model" is denoted $DL\left( {B} \right)$ (="Description Length of the Bayesian network $B$") and refers to the structural complexity of the Bayesian network model $B$, which includes the network graph and all probability tables.

For brevity, we often use the shorthand "complexity" or "structure" to refer to $DL\left( {B} \right)$.

Small values of $DL\left( {B} \right)$ suggest a simple model structure, and large values a complex model.

The goal of this structural part is to apply Occam's Razor, or the law of parsimony, i.e., to choose the simplest hypothesis, all other things being equal.

"the minimum number of bits required to represent the data given the model" is denoted $DL\left( {D|B} \right)$ (="Description Length of the data $D$ given the Bayesian network $B$") and refers to the likelihood of the data $D$ with respect to the Bayesian network model $B$.

The data likelihood is inversely proportional to the probability of the observed dataset, as inferred by the Bayesian network model.

Put simply, $DL\left( {D|B} \right)$ refers to the "fit" of the model to the data.

Small values of $DL\left( {D|B} \right)$ suggest a well-fitting model; large values, conversely, imply a poor fit.

BayesiaLab attempts to minimize the **MDL Score** by evaluating candidate networks during structural learning.

Learn More About Calculating Complexity & Fit

Definition

A Joint Probability Distribution is the distribution of Joint Probabilities.

A Joint Probability is the probability of specific values of variables jointly occurring in a domain.

Example

We observe the variables Hair Color and Eye Color in a population of college students.

Joint Probability refers to the probability of specific values for Hair Color and Eye Color jointly occurring in this population.

For instance,

P(Eye Color=Blue, Hair Color=Blond)=15.86% means that the probability of a student having blue eyes and blond hair in the given population is 15.86%.

P(Eye Color=Green, Hair Color=Black)=0.85% means that the probability of having green eyes and black hair in that population is only 0.85%.

We can now look across all possible combinations of Hair Color and Eye Color, compute all Joint Probabilities and list them in a Joint Probability Table, with one row for each combination of the states of the variables.

In this example, the size of the Joint Probability Table is manageable: Number of States (Hair Color) × Number of States (Eye Color) = 4 × 4 = 16

This Joint Probability Table is a direct and complete representation of the Joint Probability Distribution for the variables Hair Color and Eye Color:

Relevance

As the Joint Probability Distribution covers all possible combinations, it represents all regularities and patterns (or the lack thereof) within a domain.

Knowing the Joint Probability Distribution is required for performing two key operations for data analysis and inference:

Marginalization, which is calculating the marginal probability of a variable, e.g., P(Hair Color=Black)=18.25%.

Conditioning, which refers to inferring the values of a variable, given a specific value of another variable, e.g., P(Hair Color=Blond | Eye Color=Blue)=43.7%.
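Both operations can be sketched on a small joint probability table. The table below is a hypothetical, reduced version (the full 4 × 4 table from the example is not reproduced here, so the values are assumed for illustration):

```python
# Hypothetical joint probability table, keyed by (Eye Color, Hair Color).
# The values are assumed for illustration and sum to 1.
joint = {
    ("Blue",  "Blond"): 0.16,
    ("Blue",  "Black"): 0.10,
    ("Green", "Blond"): 0.04,
    ("Green", "Black"): 0.01,
    ("Brown", "Blond"): 0.05,
    ("Brown", "Black"): 0.64,
}

# Marginalization: P(Hair Color = Black), summing over all Eye Color states
p_black = sum(p for (eye, hair), p in joint.items() if hair == "Black")

# Conditioning: P(Hair Color = Blond | Eye Color = Blue)
p_blue = sum(p for (eye, hair), p in joint.items() if eye == "Blue")
p_blond_given_blue = joint[("Blue", "Blond")] / p_blue

print(round(p_black, 2), round(p_blond_given_blue, 3))  # → 0.75 0.615
```

Marginalization sums rows out of the table; conditioning renormalizes the rows consistent with the evidence, which is all that is needed once the full joint distribution is known.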

Challenge

In high-dimensional domains, however, calculating and listing the Joint Probabilities in a Joint Probability Table can become intractable.

The size of a Joint Probability Table grows exponentially with the number of variables. For example, if we had 20 variables with 4 states each, the size of the corresponding Joint Probability Table would exceed 1 trillion rows.
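The arithmetic behind that claim is easy to check: a table over $n$ variables with $s$ states each has $s^n$ rows.

```python
# Rows in the Joint Probability Table for 20 variables with 4 states each
n_rows = 4 ** 20
print(n_rows)  # → 1099511627776 (about 1.1 trillion rows)
```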

While the arithmetic is straightforward, the sheer number of calculations can easily exceed the available computational power, both for generating the Joint Probability Table as well as for performing Marginalization and Conditioning.

"The only way to deal with such large distributions is to constrain the nature of the variable interactions in some manner, both to render specification and ultimately inference in such systems tractable. The key idea is to specify which variables are independent of others, leading to a structured factorisation of the joint probability distribution. [Bayesian] Belief Networks are a convenient framework for representing such factorisations into local conditional distributions." (Barber, 2012)

This means that Bayesian networks are extremely practical for approximating Joint Probability Distributions in complex, high-dimensional problem domains.

References

Barber, D. (2012). Bayesian Reasoning and Machine Learning. Cambridge: Cambridge University Press. doi:10.1017/CBO9780511804779

Context

Bayesian networks are models that consist of two parts:

A qualitative part to represent the dependencies using a Directed Acyclic Graph (DAG).

A quantitative part, using local probability distributions, for specifying the probabilistic relationships.

A Directed Acyclic Graph (DAG) consists of nodes and directed links:

Nodes represent variables of interest (e.g., the temperature of a device, the gender of a patient, a feature of an object, or the occurrence of an event).

Nodes can correspond to symbolic/categorical variables, numerical variables with discrete values, or discretized continuous variables.

Directed arcs represent statistical (informational) or causal dependencies among the variables. The directions are used to define kinship relations, i.e., parent-child relationships.

For example, in a Bayesian network with an arc from X to Y, X is the parent node of Y, and Y is the child node.

The local probability distributions can be either marginal for nodes without parents (Root Nodes) or conditional for nodes with parents.

In the latter case, the dependencies are quantified by Conditional Probability Tables (CPT) for each node given its parents in the Directed Acyclic Graph (DAG).

Thus, the Bayesian network can be used for computing the posterior probabilities of any subset of nodes given evidence set on any other subset.

Example

The following illustration shows a simple Bayesian network, which consists of only two nodes and one directed arc.

Eye Color is a Root Node and, therefore, does not have any Parents. In other words, Eye Color does not depend on any other node.

As a result, the table associated with Eye Color is a Probability Table, i.e., it represents the marginal distribution of Eye Color unconditionally.

On the other hand, the probabilities of Hair Color are only defined conditionally upon the values of its parent node, Eye Color.

Hence, the probabilities of Hair Color are provided in a Conditional Probability Table (CPT).

It is important to point out that this Bayesian network does not imply any causal relationships, even though the arc direction may suggest that to a casual observer.
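The two-node structure can be sketched in code: the joint distribution is recovered as P(Eye Color) × P(Hair Color | Eye Color). All probability values below are placeholders, since the original tables are not reproduced here.

```python
# Hypothetical two-node Bayesian network: Eye Color -> Hair Color.
# All probabilities are assumed placeholders for illustration.
p_eye = {"Blue": 0.3, "Green": 0.1, "Brown": 0.6}   # marginal table (root node)
p_hair_given_eye = {                                 # CPT of the child node
    "Blue":  {"Blond": 0.5, "Black": 0.5},
    "Green": {"Blond": 0.4, "Black": 0.6},
    "Brown": {"Blond": 0.1, "Black": 0.9},
}

# Joint probability via the factorization P(Eye, Hair) = P(Eye) * P(Hair | Eye)
def joint(eye, hair):
    return p_eye[eye] * p_hair_given_eye[eye][hair]

# The factorized joint distribution sums to 1 over all combinations
total = sum(joint(e, h) for e in p_eye for h in ("Blond", "Black"))
print(round(total, 10))  # → 1.0
```

Note that the two tables together hold 3 + 6 = 9 probabilities, yet they fully determine all 6 joint probabilities; with more variables, this factorization is what keeps the representation compact.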

[Table: Hair Color | Eye Color | Joint Probability — cell values not reproduced in this extract]

Calculating Complexity: DL(B)

$DL(B)$ is the number of bits required to represent a Bayesian network. We can break down this value into the sum of two components:

$DL(G)$, which stands for the number of bits required to represent the graph $G$ of the Bayesian network, and

$DL(P)$, which represents the number of bits required to represent the set of probability tables $P$.

To calculate $DL(G)$, we need to determine the number of nodes and the number of their parent nodes:

$n$ is the number of random variables (nodes),

$\pi_i$ is the set of the random variables that are parents of $X_i$ in graph $G$,

and $\left| {\pi_i} \right|$ is the number of parents of the random variable $X_i$.

Computing $DL(P)$ is straightforward, as it is proportional to the number of cells in all probability tables:

$S_i$ is the number of states of the random variable $X_i$, and

$p$ is the probability value associated with each cell.
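As a sketch of the cell-counting idea only (the exact BayesiaLab formula is not reproduced here), the total number of cells across all tables can be computed as follows: each node's table has one cell per state of the node for every configuration of its parents. The three-node network and its state counts are hypothetical.

```python
from math import prod

# Hypothetical network X1 -> X3 <- X2, with assumed state counts per node
states = {"X1": 3, "X2": 2, "X3": 4}            # S_i: number of states of node i
parents = {"X1": [], "X2": [], "X3": ["X1", "X2"]}

# Cells in node i's table: S_i * product of the parents' state counts
# (prod over an empty sequence is 1, so root nodes contribute S_i cells)
n_cells = sum(states[i] * prod(states[p] for p in parents[i]) for i in states)
print(n_cells)  # → 3 + 2 + 4*3*2 = 29
```

Because the cell count of a node's table multiplies across its parents, adding arcs increases $DL(P)$ quickly, which is how the complexity term penalizes dense structures.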

Once fully specified, a Bayesian network compactly represents the Joint Probability Distribution.

This Bayesian network represents the Joint Probability Distribution of the variables Eye Color and Hair Color in a population of students.

The arc direction merely defines the parent-child relationship of the nodes for purposes of representing the Joint Probability Distribution.