This vignette covers the whole process of Bayesian network structure learning to parameter estimation and data simulation. # Find the best fitting graphical structure using an exact search algorithm

## Basic workflow with the `abn`

package

The package `abn`

is a collection of functions for
modelling of additive Bayesian networks. It contains routines to score
Bayesian Networks based on Bayesian (default) or information-theoretic
formulation of generalized linear models. Depending on the type of data,
the package supports a possible mixture of continuous, discrete, and
count data. The following table shows which of distribution types are
supported by for each method of estimation:

Distribution type | `method = "bayes"` |
`method = "mle"` |
---|---|---|

Gaussian | ✅ | ✅ |

Binomial | ✅ | ✅ |

Poisson | ✅ | ✅ |

Multinomial | ❌ | ✅ |

Structure learning of additive Bayesian networks with
`abn`

is a three-step process. Based on a set of model
specifications (data, maximal number of possible parent nodes,
restricted or enforced arcs, etc.), `abn`

calculates in a
first step the score of the data given the model
(`buildScoreCache()`

). This list of scores is then used to
estimate the most probable Bayesian network structure (“structure
learning”) and to infer the network structure in a third step
(`fitAbn()`

). Four structure-learning algorithms have been
implemented in `abn`

: the hill-climbing algorithm, the “exact
search” algorithm, the simulated annealing algorithm and tabu search
algorithm. With the network structure inferred, the package provides
routines to estimate the parameters of the network and to simulate data
from the fitted additive Bayesian network model.

The following example shows how to find the best fitting graphical structure using an exact search algorithm.

## Model specification

### Load the example dataset `ex1.dag.data`

This artificial data set comes with `abn`

and contains
10000 observations of 10 variables. The variables are a mixture of
continuous (`gaussian`

), binary (`binomial`

), and
count (`poisson`

) data. The data set is a simulated data set
from a known network structure.

```
mydat <- ex1.dag.data
str(mydat)
```

### Set up distribution list for each node

`abn`

requires a list of the type of distribution for each
node in the data set.

```
mydists <- list(b1="binomial",
p1="poisson",
g1="gaussian",
b2="binomial",
p2="poisson",
b3="binomial",
g2="gaussian",
b4="binomial",
b5="binomial",
g3="gaussian")
```

### Set the parent limits node-wise

The `max.par`

argument sets the maximum number of parent
nodes for each node in the data set. It can be set to a single value for
all nodes or to a list with the node names as keys and the maximum
number of parent nodes as values. This is a crucial parameter to speed
up the model estimation in `abn`

as it limits the number of
possible combinations.

```
# max.par <- list("b1"=1,"p1"=2,"g1"=3,"b2"=4,"p2"=1,"b3"=2,"g2"=3,"b4"=4,"b5"=1,"g3"=2) # set different max parents for each node
max.par <- 4 # set the same max parents for all nodes
```

## Build the score cache

The score cache is a list of scores for each possible parent combination for each node in the data set. It is used to learn the structure of the Bayesian network in the next step.

```
mycache <- buildScoreCache(data.df = mydat,
data.dists = mydists,
method = "bayes", # the default method is "bayes"
max.parents = max.par)
```

The minimal number of input arguments for
`buildScoreCache()`

is the data set and the distribution
list. By default, the function uses the Bayesian score which is based on
the posterior probability of the model given the data. To use the
Log-Likelihood score, Akaike Information Criterion (AIC) or Bayesian
Information Criterion (BIC) instead, the `method`

argument
can be set to `"mle"`

.

The function `buildScoreCache()`

also accepts a list of
banned and retained arcs, which can be used to enforce or restrict the
presence of certain arcs in the network structure. This can be useful if
prior knowledge about the network structure is available, e.g. from
expert knowledge or from previous analyses it is known that certain arcs
must be present or have to be absent.

The `max.parents`

argument sets the maximum number of
parent nodes for each node in the data set and together with the
`dag.banned`

and `dag.retained`

arguments, it
restricts the model search space and can speed up the model estimation
in `abn`

.

## Structure learning

The next step is to find the best fitting graphical structure of the
Bayesian network. In this example, we use the exact search algorithm to
find the most probable Bayesian network structure given the score cache
from the previous step. We supply the score cache as
`abnCache`

object from the previous step to the structure
learning function.

`mp.dag <- mostProbable(score.cache = mycache)`

The `mostProbable()`

function returns an object of class
`abnLearned`

which contains the most probable Bayesian
network structure and the score of the model given the data.

## Estimate the parameters of the network

The parameters of the network can be estimated using the
`fitAbn()`

function.

`myfit <- fitAbn(object = mp.dag)`

The `fitAbn()`

function returns an object of class
`abnFit`

which contains the estimated parameters of the
network.

## Simulate data from the fitted model

The `simulateAbn()`

function can be used to simulate data
from the fitted model.

```
simdat <- simulateAbn(object = myfit,
n.iter = 10000L)
summary(simdat)
```