[JLJ - This short paper is the best introduction to Bayesian Networks that I have found.]
Bayesian networks (BNs), also known as belief networks (or Bayes nets for short), belong to the family of
probabilistic graphical models (GMs). These graphical structures are used to represent knowledge about an uncertain domain.
In particular, each node in the graph represents a random variable, while the edges between the nodes represent probabilistic
dependencies among the corresponding random variables. These conditional dependencies in the graph are often estimated
using known statistical and computational methods. Hence, BNs combine principles from graph theory, probability theory, computer
science, and statistics.
BNs correspond to a GM structure known as a directed acyclic graph (DAG), which is popular in the statistics,
machine learning, and artificial intelligence communities. BNs are both mathematically rigorous and intuitively understandable.
They enable an effective representation and computation of the joint probability distribution (JPD) over a set of random variables
[3]. The structure of a DAG is defined by two sets: the set of nodes (vertices) and the set of directed edges. The nodes represent
random variables and are drawn as circles labeled with the variable names. The edges represent direct dependence among the variables
and are drawn as arrows between nodes. In particular, an edge from node Xi to node Xj represents a statistical dependence
between the corresponding variables. Thus, the arrow indicates that the value taken by variable Xj depends on the value taken
by variable Xi, or roughly speaking, that variable Xi “influences” Xj. Node Xi is then referred to as a parent
of Xj and, similarly, Xj is referred to as a child of Xi.
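
As an illustration of this parent/child structure, the short Python sketch below encodes a DAG as a mapping from each node to its set of parents. The network shown (the classic sprinkler example, with variables Cloudy, Sprinkler, Rain, and WetGrass) is an illustrative assumption, not taken from the paper.

    # Minimal sketch of a BN's DAG, assuming the illustrative sprinkler network.
    # Each node maps to the set of its parents, which fully specifies the directed edges.
    parents = {
        "Cloudy":    set(),                   # root node: no parents
        "Sprinkler": {"Cloudy"},              # Cloudy -> Sprinkler
        "Rain":      {"Cloudy"},              # Cloudy -> Rain
        "WetGrass":  {"Sprinkler", "Rain"},   # Sprinkler -> WetGrass, Rain -> WetGrass
    }

    def children(node, parent_sets):
        """Children of `node`: every variable that lists `node` among its parents."""
        return {child for child, pa in parent_sets.items() if node in pa}

    print(children("Cloudy", parents))   # {'Sprinkler', 'Rain'}
    print(parents["WetGrass"])           # {'Sprinkler', 'Rain'}

Storing only the parent set of each node determines all of the directed edges, so the child relation can be recovered by a simple lookup, as in the children function above.
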
A BN reflects a simple conditional independence statement: namely, that each
variable is independent of its nondescendants in the graph given the state of its parents. This property allows the JPD to be
factored into a product of local conditional distributions, P(X1, ..., Xn) = P(X1 | parents(X1)) × ... × P(Xn | parents(Xn)),
and is used to reduce, sometimes significantly, the number of parameters required to characterize the JPD of the variables.
This reduction provides an efficient way to compute the posterior probabilities given the evidence [3, 6, 7].
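
To make the parameter reduction concrete, the sketch below counts free parameters for the same illustrative sprinkler network, assuming every variable is binary: the full JPD over n binary variables requires 2^n - 1 free parameters, whereas the factored form requires only one free parameter per parent configuration of each node.

    # Rough sketch of the parameter-count reduction implied by the factorization
    # P(X1, ..., Xn) = prod_i P(Xi | parents(Xi)), assuming all variables are binary
    # and the illustrative sprinkler structure used earlier.
    parents = {
        "Cloudy":    set(),
        "Sprinkler": {"Cloudy"},
        "Rain":      {"Cloudy"},
        "WetGrass":  {"Sprinkler", "Rain"},
    }

    n = len(parents)

    # Full joint distribution over n binary variables: 2**n - 1 free parameters.
    full_jpd_params = 2 ** n - 1

    # Factored BN: each binary node needs one free parameter per configuration
    # of its parents, i.e. 2**len(parent set) parameters for that node.
    bn_params = sum(2 ** len(pa) for pa in parents.values())

    print("full JPD parameters:", full_jpd_params)   # 15
    print("factored BN parameters:", bn_params)      # 1 + 2 + 2 + 4 = 9

For this four-node network the count drops from 15 to 9; more generally, the full JPD grows exponentially in the number of variables, while the factored form grows only linearly when each node has a small, bounded number of parents.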