Additive Bayesian Network Modelling in R

Bayesian network analysis is a form of probabilistic graphical modelling which derives a directed acyclic graph (DAG) from empirical data.

Introduction

Bayesian network modelling is a data analysis technique which is ideally suited to messy, highly correlated and complex datasets. This methodology is rather distinct from other forms of statistical modelling in that its focus is on structure discovery: determining an optimal graphical model which describes the inter-relationships in the underlying processes which generated the data. It is a multivariate technique and can be used for one or many dependent variables. This is a data-driven approach, as opposed to relying only on subjective expert opinion to determine how variables of interest are inter-related (for example, structural equation modelling). An example can be found in the American Journal of Epidemiology, where this approach was used to investigate risk factors for child diarrhoea. A special issue of Preventive Veterinary Medicine on graphical modelling features a number of articles which use abn to fit epidemiological data. An introduction to this methodology can be found in Emerging Themes in Epidemiology.

This website provides some cookbook-type examples of how to perform Bayesian network structure discovery analyses with observational data. The particular type of Bayesian network model considered here is the additive Bayesian network. These are rather different, mathematically speaking, from the standard form of Bayesian network models (for binary or categorical data) presented in the academic literature, which typically use an analytically elegant, but arguably opaque to interpret, contingency table parametrization. An additive Bayesian network model is simply a multidimensional regression model, i.e. directly analogous to generalised linear modelling but with all variables potentially dependent. All examples presented use an extension library for R called abn.
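To make the "multidimensional regression" idea concrete, here is a minimal sketch in base R. It does not use abn and does not reproduce abn's Bayesian scoring: each node is fitted by maximum likelihood with glm() given its parents, and the node-wise BIC values are summed as a stand-in network score. The simulated data, the variable names, the helper functions fit_node() and network_bic(), and the row = child, column = parent matrix layout are all assumptions made for this illustration.

```r
set.seed(1)
n <- 500

# Simulated data from a known chain x -> y -> z (hypothetical variables)
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
z <- rnorm(n, mean = 1 + 2 * y, sd = 1)
dat <- data.frame(x = x, y = y, z = z)

# One distribution (GLM family) per node
dists <- list(x = gaussian(), y = binomial(), z = gaussian())

# DAG as an adjacency matrix: row = child, column = parent (assumed layout)
nodes <- names(dat)
dag_chain <- matrix(0, 3, 3, dimnames = list(nodes, nodes))
dag_chain["y", "x"] <- 1
dag_chain["z", "y"] <- 1
dag_empty <- dag_chain * 0  # no arcs: all variables independent

# Fit one node as a GLM on its parents (intercept only if it has none)
fit_node <- function(node, dag, data, dists) {
  parents <- colnames(dag)[dag[node, ] == 1]
  rhs <- if (length(parents) > 0) paste(parents, collapse = " + ") else "1"
  glm(as.formula(paste(node, "~", rhs)), family = dists[[node]], data = data)
}

# Stand-in network score: sum of node-wise BIC values (lower is better)
network_bic <- function(dag, data, dists) {
  fits <- lapply(rownames(dag), fit_node, dag = dag, data = data, dists = dists)
  sum(vapply(fits, BIC, numeric(1)))
}

bic_chain <- network_bic(dag_chain, dat, dists)
bic_empty <- network_bic(dag_empty, dat, dists)
print(c(chain = bic_chain, empty = bic_empty))
```

With data simulated from the chain, the chain DAG should score better (lower BIC) than the empty DAG. Structure discovery amounts to carrying out this kind of comparison over many candidate graphs.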

Contact the abn package maintainer by email: gilles.kratzer at math.uzh.ch

Contributors: Gilles Kratzer, Marta Pittavino, Fraser Lewis and Reinhard Furrer


Installation

The abn R package can easily be installed from CRAN using:

install.packages("abn", dependencies = TRUE)

However, further libraries may be needed to make full use of the abn features.


Quickstart

Simple examples illustrate how to perform data analyses using additive Bayesian networks with abn (see the installation procedure). The data sets used are provided with abn. Many more examples are given at the end of the relevant manual pages in R, e.g. see ?fitabn, ?buildscorecache, ?mostprobable, ?search.hillclimber. More realistic examples are given in the case studies.


Case studies

More case studies will be presented as additional features are added to abn. The general approach for structure discovery is broadly similar and relatively independent of the specific problem data. While Bayesian network modelling is computationally intensive, comparing across potentially large numbers of different models, it should not be treated as a black box approach, as each individual data set has its own quirks and difficulties.
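To give a sense of why comparing models is computationally intensive, the sketch below counts the labelled DAGs on n variables using Robinson's recurrence. The function name count_dags() is hypothetical and is not part of abn; it only illustrates how quickly the search space grows.

```r
# Number of labelled DAGs on n nodes via Robinson's recurrence:
#   a(0) = 1
#   a(m) = sum_{k=1}^{m} (-1)^(k+1) * choose(m, k) * 2^(k * (m - k)) * a(m - k)
count_dags <- function(n) {
  a <- numeric(n + 1)
  a[1] <- 1  # a[i] stores a(i - 1), so a[1] is a(0)
  for (m in seq_len(n)) {
    k <- seq_len(m)
    a[m + 1] <- sum((-1)^(k + 1) * choose(m, k) * 2^(k * (m - k)) * a[m - k + 1])
  }
  a[n + 1]
}

dag_counts <- vapply(1:5, count_dags, numeric(1))
print(dag_counts)
# 1 3 25 543 29281
```

Even with five variables there are already 29281 candidate structures, which is why restrictions such as a limit on the number of parents per node, or heuristic searches, are commonly used in practice.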



Literature

General note

Typical BN models involving binary nodes, arguably the most commonly used type of BN, use a contingency table rather than an additive parameter formulation. This facilitates mathematical elegance and means that key metrics like model goodness of fit and marginal posterior parameters can be estimated analytically (e.g. from a formula) rather than numerically (an approximation). The downside is that this parametrisation is likely far from parsimonious, and the interpretation of the model parameters is less clear than in the usual GLM-type models (which are common across all areas of science). This is, while practically important, a fairly low-level technical distinction, as the key aspect of BN modelling is that it is a form of graphical modelling, that is, a model of the joint probability distribution of the data. It is this joint, multidimensional aspect which makes the methodology so attractive for analyses of complex data and which distinguishes it from the more standard regression techniques, e.g. GLMs, GLMMs etc., which are only one-dimensional in that the covariates are all assumed independent. The latter is entirely reasonable in a classical experimental design scenario, but completely unrealistic for many observational studies in medicine, veterinary science, ecology and biology.
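The parsimony point can be illustrated with a simple parameter count for a single binary node with p binary parents. This is a sketch under the assumption that the additive model contains main effects only (an intercept plus one coefficient per parent); the object name params is chosen for this example.

```r
# Parameters needed to describe one binary node given p binary parents
p <- 0:10
params <- data.frame(
  parents           = p,
  contingency_table = 2^p,   # one probability per configuration of the parents
  additive_logit    = p + 1  # intercept plus one log-odds coefficient per parent
)
print(params)
```

The contingency table count doubles with every added parent, while the additive count grows by one. The price of the additive form is that interactions between parents are not modelled unless they are added explicitly.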


Technical articles


Application articles


Further resources

Workshops

Causality:

ABN modeling:


Presentations