November 17, 2010
Passing around dot-dot-dots in R

The ... (pronounced dot-dot-dot) is a special argument for R functions that captures any arguments to a function that are not otherwise named. It is useful for making functions that can process arbitrary numbers of arguments. One such commonly used function that uses the dot-dot-dot is paste

This can be really cool way to make a function that can process as many objects as you care to through at it (a sequence of model outputs, for example). The way to get hold of the arguments is to turn them into a list, then loop through them.

I had a problem, however, where I wanted to tie two dot-dot-dot functions together. The form is lovely when you are working live in on the R prompt, but how to do it programmatically?

This is often a tangle you can get into wander far enough in to the R jungle. Learning something like dot-dot-dot makes part of your life easier but complicates something else you want to do. I can’t imagine this is something that ever happens to SPSS users but if you are familiar with LISP, it might make you smile.

Anyway, the trick is to use the do.call function, which lets you name a function and then pass it a list of arguments to apply the function to.

So, put those two things together:

October 22, 2010
Working with Markov chains from other tools in R

There are lots of little tools floating around that apply particular models using Bayesian data analysis. These programs give you the output but don’t have the functionality to work with the posterior distributions they generate or make inferences from them. Like a champ, R can fill the gap using the coda package for analyzing and diagnosing MCMC simulations.

> library(coda)

For now, don’t mind the model or the application, but let’s work with the output from bayesfst, which fits a model of population divergence and selection using allele frequency data. The first few rows and columns of output look like

The first column is the likelihood and the remaining columns are samples from the posterior distributions of each parameter. Reading the values into R

> read.table("http://gist.github.com/raw/641378/fst.out") -> fst

The columns in the bayesfst are not labelled, but the documentation says that the 2nd column will be the parameter for the 1st loci. In this case, it is the gene for Glucose-6-phosphate dehydrogenase (G6-PD). We can grab this column and turn it into an MCMC object using the mcmc function.

> mcmc(fst[,2]) -> g6pd

From here, we can start to look at the posterior distribution of the model parameter characterizing the divergence of this gene.

> summary(g6pd)

Iterations = 1:2000
Thinning interval = 1 
Number of chains = 1 
Sample size per chain = 2000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

          Mean             SD       Naive SE Time-series SE 
      0.625945       0.335141       0.007494       0.028919 

2. Quantiles for each variable:

  2.5%    25%    50%    75%  97.5% 
-0.060  0.410  0.640  0.850  1.260

Which tells us that there were 2000 samples in this chain. The mcmc object representation is not perfect because it does not know how many iterations the chain was actually run or how many of the samples were not stored (the thinning interval), but presumably if we’ve compiled bayesfst we already know what these values are. coda can generate some pretty plots but a histogram will do to visualize the posterior distribution

hist(g6pd, xlim=c(-2, 2))

According to the model, values of the loci parameters that are greater than 0 indicate adaptive evolution. What percentage of the simulated draws are greater than 0?

> table(g6pd > 0) / length(g6pd)

 FALSE   TRUE 
0.0375 0.9625 

  • M.A. Beaumont & D.J. Balding (2004) Identifying adaptive genetic divergence among populations from genome scans. Molecular Ecology, 13: 969-980, 2004
  • code for this example.

August 15, 2010
Sugary syntax in R with match.call

Tooling around with functional magic in R.

It can make the inside of your functions rather hairy but then then lets you sprinkle the rest of your code with functions that look really nice. Libraries like ggplot rely heavily on craft of this sort.

August 10, 2010
"I can’t really say why that is, but I can irresponsibly theorize."

Christian Rudder, OkTrends

Two thoughts

9:05pm  |   URL: http://tmblr.co/Z23PQyteCOS
Filed under: data analysis 
July 22, 2010
ECP15: Fungible regression

Because I had my own symposium yesterday on animal personality, I missed lots of great looking talks.

I am a sucker for methodology and pretty pictures so I quite enjoyed Niels Waller talk on fungible weights in regression. Waller used geometric principles to demonstrate that, while ordinary least-squares will also give the best fit to plain linear data, there are actually an infinite series of alternative solutions to a regression equation that give almost as good a fit. The paper inclues R code.

Other than that there is a lot of bashing on about the general factor of personality. There’s a symposium this afternoon devoted to the idea that a single, underlying factor explains most of the variance in all personality traits. A lot of the argument is methodological. One side says that it is just an artifact of factor analyzing noise (to put it mildly) while the other side claims that of course you won’t find it in personality inventories that have been designed to produce orthogonal factors. For me, the solution is to consider why (and when) we might expect a species to evolve a single underlying factor.

Waller. Fungible Weights in Multiple Regression. Psychometrika (2008) vol. 73 (4) pp. 691-703

March 29, 2010
The tools that bind

Over at The Hardest Science, Srivastava is gearing up to teach a course on SEM and thinking about how to warn students against the pixie dust approach to modeling and analysis.

This is definitely not easy, and something I struggle with, even when I have the time to prepare and know better. Last week I was helping someone learn how to do multi-level modeling in SPSS, a software package that apparently I am supposed to be competent with. I was so happy when I figured out the MIXED function to get the output I wanted, that this is all I was able to focus on. I totally forgot about conveying the higher level concepts of model formulation, such as writing out equations for each of your models and plotting regression lines for each of the subjects or treatment levels to get a feel for where modeling energy should be expended. Part of the problem is that I don’t know anything about plotting in SPSS. A poor excuse, I know.

Our tools do constrain us, and while knowing the principles and concepts is important, there is another aspect of statistical literacy, which is making your tools do what you want.

Photo cc-by FordRanger