# Bayesian Networks

Probabilistic models based on directed acyclic graphs (DAGs) have a long and rich tradition, beginning with the work of geneticist Sewall Wright in the 1920s. Variants have appeared in many fields. Within statistics, such models are known as directed graphical models; within cognitive science and artificial intelligence, such models are known as Bayesian networks. The name honors the Rev. Thomas Bayes (1702-1761), whose rule for updating probabilities in the light of new evidence is the foundation of the approach.

Rev. Bayes addressed both the case of discrete probability distributions of data and the more complicated case of continuous probability distributions. In the discrete case, Bayes’ theorem relates the conditional and marginal probabilities of events A and B, provided that the probability of B is not zero:

P(A|B) = P(B|A) P(A) / P(B)

In Bayes’ theorem, each probability has a conventional name: P(A) is the prior probability (or “unconditional” or “marginal” probability) of A. It is “prior” in the sense that it does not take into account any information about B; however, the event B need not occur after event A. In the nineteenth century, the unconditional probability P(A) in Bayes’ rule was called the “antecedent” probability; in deductive logic, the antecedent set of propositions and the inference rule imply consequences. The unconditional probability P(A) was called “a priori” by Ronald A. Fisher.

• P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B.
• P(B|A) is the conditional probability of B given A. It is also called the likelihood.
• P(B) is the prior or marginal probability of B, and acts as a normalizing constant.
• P(B|A) / P(B) is the Bayes factor or likelihood ratio.

Bayes’ theorem in this form gives a mathematical representation of how the conditional probability of event A given B is related to the converse conditional probability of B given A.
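The update rule can be sketched numerically. In this hypothetical example, A is "patient has a disease" and B is "test is positive"; all probabilities are made-up illustration values, and P(B) is expanded by the law of total probability.

```python
# Minimal numeric sketch of Bayes' theorem (all numbers are hypothetical).
def posterior(p_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) = P(B|A) P(A) / P(B).

    P(B) is expanded by total probability:
    P(B) = P(B|A) P(A) + P(B|not A) P(not A).
    """
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)  # normalizing constant
    return p_b_given_a * p_a / p_b

# Rare disease (prior 1%), fairly accurate test: the posterior is still
# only about 16%, showing how strongly the prior weighs on the result.
p = posterior(p_a=0.01, p_b_given_a=0.95, p_b_given_not_a=0.05)
print(round(p, 4))  # -> 0.161
```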

The initial development of Bayesian networks in the late 1970s was motivated by the necessity of modeling top-down (semantic) and bottom-up (perceptual) combinations of evidence for inference. The capability for bi-directional inferences, combined with a rigorous probabilistic foundation, led to the rapid emergence of Bayesian networks. They became the method of choice for uncertain reasoning in artificial intelligence and expert systems, replacing earlier, ad hoc rule-based schemes.

Bayesian networks are models that consist of two parts: a qualitative part, based on a DAG that indicates the dependencies, and a quantitative part, based on local probability distributions that specify the probabilistic relationships. The DAG consists of nodes and directed links:

• Nodes represent variables of interest (e.g. the temperature of a device, the gender of a patient, a feature of an object, the occurrence of an event). Even though Bayesian networks can handle continuous variables, we exclusively discuss Bayesian networks with discrete nodes in this book. Such nodes can correspond to symbolic/categorical variables, numerical variables with discrete values, or discretized continuous variables.
• Directed links represent statistical (informational) or causal dependencies among the variables. The directions are used to define kinship relations, i.e. parent-child relationships. For example, in a Bayesian network with a link from X to Y, X is the parent node of Y, and Y is the child node.

The local probability distributions can be either marginal, for nodes without parents (root nodes), or conditional, for nodes with parents. In the latter case, the dependencies are quantified by conditional probability tables (CPTs), one for each node given its parents in the graph.

Once fully specified, a Bayesian network compactly represents the joint probability distribution (JPD) and, thus, can be used for computing the posterior probabilities of any subset of variables given evidence about any other subset.
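How the DAG yields the joint distribution, and how evidence updates a posterior, can be sketched for a hypothetical two-node network Rain → GrassWet (probabilities are made up). The chain rule for this network gives P(Rain, GrassWet) = P(Rain) · P(GrassWet | Rain), and a posterior query is answered by enumerating and normalizing the joint.

```python
# Local distributions of the hypothetical network (illustration values).
p_rain = {True: 0.2, False: 0.8}
p_wet_given_rain = {
    True:  {True: 0.9, False: 0.1},
    False: {True: 0.1, False: 0.9},
}

def joint(rain, wet):
    """JPD entry via the chain rule: P(Rain) * P(GrassWet | Rain)."""
    return p_rain[rain] * p_wet_given_rain[rain][wet]

# Posterior P(Rain = True | GrassWet = True) by enumeration:
# sum the joint over the query variable, then normalize.
scores = {r: joint(r, wet=True) for r in (True, False)}
posterior = scores[True] / sum(scores.values())
print(round(posterior, 4))  # -> 0.6923
```

Observing wet grass raises the probability of rain from the prior 0.2 to about 0.69; the same enumerate-and-normalize scheme extends to any subset of query and evidence variables, though exact inference becomes expensive as networks grow.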