This vignette covers the whole process, from Bayesian network structure learning to parameter estimation and data simulation.
Basic workflow with the abn package
The package abn is a collection of functions for modelling additive
Bayesian networks. It contains routines to score Bayesian networks based
on a Bayesian (default) or information-theoretic formulation of
generalized linear models. The package supports data sets that mix
continuous, discrete, and count variables. The following table shows
which distribution types are supported by each estimation method:
Distribution type | method = "bayes" | method = "mle" |
---|---|---|
Gaussian | ✅ | ✅ |
Binomial | ✅ | ✅ |
Poisson | ✅ | ✅ |
Multinomial | ❌ | ✅ |
Structure learning of additive Bayesian networks with abn is a
three-step process. In the first step, based on a set of model
specifications (data, maximal number of possible parent nodes,
restricted or enforced arcs, etc.), abn calculates the score of the data
given the model (buildScoreCache()). In the second step, this list of
scores is used to estimate the most probable Bayesian network structure
(“structure learning”). In the third step, the selected structure is
fitted to the data (fitAbn()). Four structure-learning algorithms have
been implemented in abn: hill-climbing, “exact search”, simulated
annealing, and tabu search. With the network structure inferred, the
package provides routines to estimate the parameters of the network and
to simulate data from the fitted additive Bayesian network model.
The following example shows how to find the best fitting graphical structure using an exact search algorithm.
Model specification
Load the example dataset ex1.dag.data
This artificial data set comes with abn and contains 10000 observations
of 10 variables. The variables are a mixture of continuous (gaussian),
binary (binomial), and count (poisson) data. The data were simulated
from a known network structure.
mydat <- ex1.dag.data
str(mydat)
Set up distribution list for each node
abn requires a named list that gives the distribution type of each node
in the data set.
mydists <- list(b1 = "binomial",
                p1 = "poisson",
                g1 = "gaussian",
                b2 = "binomial",
                p2 = "poisson",
                b3 = "binomial",
                g2 = "gaussian",
                b4 = "binomial",
                b5 = "binomial",
                g3 = "gaussian")
Set the parent limits node-wise
The max.par variable, passed below to the max.parents argument of
buildScoreCache(), sets the maximum number of parent nodes for each node
in the data set. It can be set to a single value for all nodes or to a
list with the node names as keys and the maximum number of parent nodes
as values. This is a crucial parameter for speeding up model estimation
in abn, as it limits the number of possible parent combinations.
# max.par <- list("b1"=1,"p1"=2,"g1"=3,"b2"=4,"p2"=1,"b3"=2,"g2"=3,"b4"=4,"b5"=1,"g3"=2) # set different max parents for each node
max.par <- 4 # set the same max parents for all nodes
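To give a feel for why the parent limit matters, the following base R sketch counts the candidate parent sets per node for different limits. It is only a back-of-the-envelope count: the helper `count_parent_sets()` is hypothetical (not part of abn), and it ignores any banned or retained arcs.

```r
# Hypothetical helper (not part of abn): number of candidate parent sets
# for one node, i.e. all subsets of the other (n.nodes - 1) variables
# that contain at most max.par elements.
count_parent_sets <- function(n.nodes, max.par) {
  k <- 0:min(max.par, n.nodes - 1)
  sum(choose(n.nodes - 1, k))
}

n.nodes <- 10  # number of variables in ex1.dag.data
limits <- c(1, 2, 4, 9)

per_node <- sapply(limits, function(m) count_parent_sets(n.nodes, m))
names(per_node) <- paste0("max.par=", limits)

print(per_node)            # candidate parent sets per node
print(per_node * n.nodes)  # candidate node/parent-set combinations overall
```

With 10 variables, a limit of 4 parents leaves 256 candidate parent sets per node, compared with 512 when every subset of the other 9 variables is allowed.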
Build the score cache
The score cache is a list of scores for each possible parent combination for each node in the data set. It is used to learn the structure of the Bayesian network in the next step.
mycache <- buildScoreCache(data.df = mydat,
                           data.dists = mydists,
                           method = "bayes", # the default method is "bayes"
                           max.parents = max.par)
The minimal input for buildScoreCache() is the data set and the
distribution list. By default, the function uses the Bayesian score,
which is based on the posterior probability of the model given the data.
To use the log-likelihood score, Akaike Information Criterion (AIC) or
Bayesian Information Criterion (BIC) instead, the method argument can be
set to "mle".
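To make the information-theoretic scores more concrete, the sketch below scores a single node under two candidate parent sets using base R's `glm()`. It illustrates the general idea of node-wise scoring only and is not abn's internal implementation; the variables `g` and `b` are simulated here and are unrelated to ex1.dag.data.

```r
set.seed(1)
n <- 1000

# Simulated toy data (an assumption for illustration):
g <- rnorm(n)                              # continuous candidate parent
b <- rbinom(n, 1, plogis(-0.5 + 1.5 * g))  # binary child that depends on g

# Two candidate parent sets for node b: no parents vs. parent g
fit_none <- glm(b ~ 1, family = binomial)
fit_g    <- glm(b ~ g, family = binomial)

# Collect the three scores for both candidates
scores <- data.frame(
  parents = c("none", "g"),
  logLik  = c(as.numeric(logLik(fit_none)), as.numeric(logLik(fit_g))),
  AIC     = c(AIC(fit_none), AIC(fit_g)),
  BIC     = c(BIC(fit_none), BIC(fit_g))
)
print(scores)
```

Lower AIC and BIC values indicate the better-supported parent set. In this toy example the model with `g` as a parent should be preferred, because `b` was generated from `g`.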
The function buildScoreCache() also accepts banned and retained arcs,
which can be used to restrict or enforce the presence of certain arcs in
the network structure. This is useful when prior knowledge about the
network structure is available, e.g. when expert knowledge or previous
analyses indicate that certain arcs must be present or absent.
The max.parents argument sets the maximum number of parent nodes for
each node in the data set. Together with the dag.banned and dag.retained
arguments, it restricts the model search space and can speed up model
estimation in abn.
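As a rough sketch, banned and retained arcs can be encoded as square 0/1 matrices with the node names as row and column names. The two constraints below are invented for illustration, and the orientation (rows as child nodes, columns as parent nodes) is an assumption that should be checked against `?buildScoreCache` before use.

```r
node.names <- c("b1", "p1", "g1", "b2", "p2", "b3", "g2", "b4", "b5", "g3")

# Empty constraint matrix: 0 means "no constraint on this arc"
empty <- matrix(0, nrow = length(node.names), ncol = length(node.names),
                dimnames = list(node.names, node.names))

# Assumed convention: rows are child nodes, columns are parent nodes.
ban <- empty
ban["b1", "b4"] <- 1     # forbid the arc b4 -> b1 (hypothetical constraint)

retain <- empty
retain["g2", "g1"] <- 1  # enforce the arc g1 -> g2 (hypothetical constraint)

# The matrices would then be passed on when building the score cache, e.g.:
# mycache <- buildScoreCache(data.df = mydat, data.dists = mydists,
#                            dag.banned = ban, dag.retained = retain,
#                            max.parents = max.par)
```

The call to buildScoreCache() is left commented out so that the snippet runs without the package loaded.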
Structure learning
The next step is to find the best fitting graphical structure of the
Bayesian network. In this example, we use the exact search algorithm to
find the most probable Bayesian network structure given the score cache.
We supply the score cache from the previous step, an object of class
abnCache, to the structure learning function.
mp.dag <- mostProbable(score.cache = mycache)
The mostProbable() function returns an object of class abnLearned, which
contains the most probable Bayesian network structure and the score of
the model given the data.
Estimate the parameters of the network
The parameters of the network can be estimated using the fitAbn()
function.
myfit <- fitAbn(object = mp.dag)
The fitAbn() function returns an object of class abnFit, which contains
the estimated parameters of the network.
Simulate data from the fitted model
The simulateAbn() function can be used to simulate data from the fitted
model.
simdat <- simulateAbn(object = myfit,
                      n.iter = 10000L)
summary(simdat)
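Conceptually, simulating from a fitted additive Bayesian network means drawing each node from its conditional distribution given its parents, visiting parents before children. The base R sketch below shows this idea for a hypothetical three-node network `b -> g -> p`. The structure and all coefficients are invented, and the code is not meant to describe how simulateAbn() works internally.

```r
set.seed(42)
n <- 1000

# Hypothetical network b -> g -> p with made-up coefficients.
# Nodes are drawn in topological order: parents before children.
b <- rbinom(n, size = 1, prob = plogis(0.2))  # binomial root node (logit link)
g <- rnorm(n, mean = 1 + 0.8 * b, sd = 1)     # Gaussian child of b (identity link)
p <- rpois(n, lambda = exp(0.3 + 0.5 * g))    # Poisson child of g (log link)

sim <- data.frame(b = factor(b), g = g, p = p)
summary(sim)
```

Each line mirrors one node-wise generalized linear model: the linear predictor is built from the parent values and then passed through the inverse link of the node's distribution.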