Working with Markov chains from other tools in R
There are lots of little tools floating around that apply particular models using Bayesian data analysis. These programs give you the output but don’t have the functionality to work with the posterior distributions they generate or make inferences from them. Like a champ, R can fill the gap using the coda package for analyzing and diagnosing MCMC simulations.
> library(coda)
For now, don’t mind the model or the application, but let’s work with the output from bayesfst, which fits a model of population divergence and selection using allele frequency data. The first few rows and columns of output look like
The first column is the likelihood and the remaining columns are samples from the posterior distributions of each parameter. Reading the values into R
> read.table("http://gist.github.com/raw/641378/fst.out") -> fst
The columns in the bayesfst are not labelled, but the documentation says that the 2nd column will be the parameter for the 1st loci. In this case, it is the gene for Glucose-6-phosphate dehydrogenase (G6-PD). We can grab this column and turn it into an MCMC object using the mcmc function.
> mcmc(fst[,2]) -> g6pd
From here, we can start to look at the posterior distribution of the model parameter characterizing the divergence of this gene.
> summary(g6pd)
Iterations = 1:2000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 2000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
0.625945 0.335141 0.007494 0.028919
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
-0.060 0.410 0.640 0.850 1.260
Which tells us that there were 2000 samples in this chain. The mcmc object representation is not perfect because it does not know how many iterations the chain was actually run or how many of the samples were not stored (the thinning interval), but presumably if we’ve compiled bayesfst we already know what these values are. coda can generate some pretty plots but a histogram will do to visualize the posterior distribution
hist(g6pd, xlim=c(-2, 2))

According to the model, values of the loci parameters that are greater than 0 indicate adaptive evolution. What percentage of the simulated draws are greater than 0?
> table(g6pd > 0) / length(g6pd)
FALSE TRUE
0.0375 0.9625
- M.A. Beaumont & D.J. Balding (2004) Identifying adaptive genetic divergence among populations from genome scans. Molecular Ecology, 13: 969-980, 2004
- code for this example.