May 4, 2011
Personality and the senses

there is an apparent relationship between certain personality traits and sensory capacities.…[S]ensory capacity may provide a filter through which we perceive the world, and that this filter may influence the picture we receive of the world….We found no coherence between personality traits and gustatory modality (mainly related to eating) but significant coherence between personality traits and olfactory, trigeminal sensory and electrical cutaneous modality; systems usually thought to be related to detection of social cues and awareness of danger.

If behavior is the output of a reaction to a stimulus, part of the behavior (and therefore personality) will be related to how each stimulus is perceived. It is interesting to find this at such a basic level (as opposed to, say, the global stimulation experienced by the extravert versus the introvert).

  • Croy I, Springborn M, Lötsch J, Johnston ANB, Hummel T, 2011 Agreeable Smellers and Sensitive Neurotics – Correlations among Personality Traits and Sensory Thresholds. PLoS ONE 6(4): e18701. doi:10.1371/journal.pone.0018701

(Source: plosone.org)

February 14, 2011
Hold on to the keys to your tool shed

Opening up SPSS the other day, the program, staring at me with a screwed up face, offered the following greeting:

A quick journey through the “Authentication Wizard” soon had the program placated. Whether the amnesic spell came from SPSS or the operating system, the experience emphasizes the importance of open tools for science. Don’t allow yourself to be locked out of your own analysis, never mind the hurdle that closed tools present to anyone who wants to check or extend your work.

4:00pm  |   URL: http://tmblr.co/Z23PQy34GVXR
  
Filed under: one column 
January 21, 2011
Supplementary Information 4: Project expenses

There is an intriguing new paper suggesting that people in your social network may be more or less likely than chance to share specific genes with you. I haven’t quite wrapped my head around all the analysis (Daniel MacArthur at Genetic Future has a good summary), but this being one of those papers that was all over the news before actual publication, I was quite struck by some of the criticism:

“If this was a study looking for shared genes in patients with diabetes, it would not be up to the standards of the field. We set these standards after 10 years of seeing so many irreproducible results in gene-association studies.” — David Altshuler

It certainly is a provocative study — I would have loved to have seen it done with information from the rest of the genome. — Stanley Nelson

Only six SNPs were studied in an association study and such sweeping conclusions were arrived at? Were the reviewers not geneticists? — Tabrez Siddiqui

In other words, don’t do this study, do another, much more expensive study in its place.

Don’t get sidetracked by the p-values or multiple comparisons. Replicating the direction and effect size in two samples is very strong evidence that these genes influence the formation of friendships to the degree estimated.

These results should put the authors in a good position to secure funding to do further genotyping in their sample. That is the way funding should work, I think: show some cost-effective results and then do the larger study. What if they’d done it the other way around, getting funding for a GWAS and then not finding anything? That would have been a terrible waste, but given the apparent bias that the more expensive a study is to conduct the more worthwhile it is, this must be happening a lot.

Along with open protocols, open data, and open access, publications can start including project expenses as supplementary material. We explicitly reward people for getting grants (during hiring and promotion) but not for doing science on the cheap.

10:12am  |   URL: http://tmblr.co/Z23PQy2gCOmZ
Filed under: one column 
January 5, 2011
Terminology: dimensions, factors, domains

Rough definitions for dimensional models of personality

  • dimension: The latent thing that encompasses components, factors, and domains. This is what we are trying to study.
  • component, factor: Yielded by EFA and PCA. Not interpreted.
  • domain: Actual scores for each individual on a component or factor (loading).

None of this addresses the question of why personality should come out as (mostly) independent dimensions.

December 13, 2010
Installing matplotlib on OS X 10.6 with homebrew

This one took me a while to figure out…

For some reason once I am working with plots that involve calculus, R doesn’t feel right. Part of me might prefer Mathematica for tasks like this but, alas, it is quite out of my price range and the only way I know to access it at uni is through the computing cluster. So I’m using matplotlib as an alternative to R for more math-y type plotting.

I have a new machine on my desk so I’m starting with a clean slate. I don’t know the ins and outs of building python and MacPorts did drive me to drink, so here it is with homebrew

The only dependency outside of python-land is a FORTRAN compiler. The step that I really got tripped up on is that pip will by default attempt to install an older version of matplotlib that breaks with the latest numpy (it installs fine, though).

5:51pm  |   URL: http://tmblr.co/Z23PQy29EIfB
Filed under: drafting one column 
November 17, 2010
Passing around dot-dot-dots in R

The ... (pronounced dot-dot-dot) is a special argument for R functions that captures any arguments to a function that are not otherwise named. It is useful for making functions that can process arbitrary numbers of arguments. One such commonly used function that uses the dot-dot-dot is paste
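As a quick illustration (the strings here are made up), paste collects its unnamed arguments through the dot-dot-dot, while a named argument such as sep is matched separately:

```r
# Every unnamed argument is captured by paste()'s ... argument
spaced <- paste("a", "b", "c")
spaced   # "a b c"

# sep is a named argument, so it is not swallowed by ...
dashed <- paste("a", "b", "c", sep = "-")
dashed   # "a-b-c"
```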

This can be a really cool way to make a function that can process as many objects as you care to throw at it (a sequence of model outputs, for example). The way to get hold of the arguments is to turn them into a list, then loop through them.
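A minimal sketch of that pattern, using a hypothetical helper called count_each (the name and what it computes are only for illustration):

```r
# Hypothetical helper: report how many elements each argument contains.
count_each <- function(...) {
  args <- list(...)                # turn everything passed via ... into a list
  sizes <- integer(length(args))
  for (i in seq_along(args)) {     # loop through the captured arguments
    sizes[i] <- length(args[[i]])
  }
  sizes
}

count_each(1:3, letters, "x")      # 3 26 1
```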

I had a problem, however, where I wanted to tie two dot-dot-dot functions together. The form is lovely when you are working live at the R prompt, but how to do it programmatically?

This is the kind of tangle you can get into when you wander far enough into the R jungle. Learning something like dot-dot-dot makes part of your life easier but complicates something else you want to do. I can’t imagine this is something that ever happens to SPSS users, but if you are familiar with LISP, it might make you smile.

Anyway, the trick is to use the do.call function, which lets you name a function and then pass it a list of arguments to apply the function to.
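For example, with made-up arguments, do.call takes the function (or its name as a string) and a list whose elements become the arguments of the call:

```r
# Unnamed list elements become positional arguments, named ones stay named
args <- list("a", "b", sep = "-")
joined <- do.call(paste, args)          # same as paste("a", "b", sep = "-")
joined                                  # "a-b"

# The function can also be given by name
total <- do.call("sum", list(1, 2, 3))  # same as sum(1, 2, 3)
total                                   # 6
```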

So, put those two things together:
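The snippet from the original post is not reproduced here. A minimal sketch of the combination, with hypothetical functions inner and outer_fn standing in for the two dot-dot-dot functions, might look like this:

```r
# Hypothetical inner function that accepts any number of arguments.
inner <- function(...) paste(..., sep = "/")

# Hypothetical outer function: capture its own ... as a list, change the
# list programmatically, then pass the whole list on with do.call.
outer_fn <- function(..., prefix = "root") {
  args <- list(...)
  args <- c(list(prefix), args)   # prepend an extra argument
  do.call(inner, args)
}

outer_fn("usr", "local", "bin")   # "root/usr/local/bin"
```

Because the arguments live in an ordinary list between the two calls, they can be filtered, reordered, or extended before being handed on.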

October 22, 2010
Working with Markov chains from other tools in R

There are lots of little tools floating around that apply particular models using Bayesian data analysis. These programs give you the output but don’t have the functionality to work with the posterior distributions they generate or make inferences from them. Like a champ, R can fill the gap using the coda package for analyzing and diagnosing MCMC simulations.

> library(coda)

For now, don’t mind the model or the application, but let’s work with the output from bayesfst, which fits a model of population divergence and selection using allele frequency data. The first few rows and columns of output look like

The first column is the likelihood and the remaining columns are samples from the posterior distributions of each parameter. Reading the values into R

> read.table("http://gist.github.com/raw/641378/fst.out") -> fst

The columns in the bayesfst output are not labelled, but the documentation says that the 2nd column will be the parameter for the 1st locus. In this case, it is the gene for Glucose-6-phosphate dehydrogenase (G6-PD). We can grab this column and turn it into an MCMC object using the mcmc function.

> mcmc(fst[,2]) -> g6pd

From here, we can start to look at the posterior distribution of the model parameter characterizing the divergence of this gene.

> summary(g6pd)

Iterations = 1:2000
Thinning interval = 1 
Number of chains = 1 
Sample size per chain = 2000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

          Mean             SD       Naive SE Time-series SE 
      0.625945       0.335141       0.007494       0.028919 

2. Quantiles for each variable:

  2.5%    25%    50%    75%  97.5% 
-0.060  0.410  0.640  0.850  1.260

Which tells us that there were 2000 samples in this chain. The mcmc object representation is not perfect because it does not know how many iterations the chain was actually run or how many of the samples were not stored (the thinning interval), but presumably if we’ve compiled bayesfst we already know what these values are. coda can generate some pretty plots but a histogram will do to visualize the posterior distribution

> hist(g6pd, xlim=c(-2, 2))

According to the model, values of the locus parameters that are greater than 0 indicate adaptive evolution. What percentage of the simulated draws are greater than 0?

> table(g6pd > 0) / length(g6pd)

 FALSE   TRUE 
0.0375 0.9625 
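The same proportion can also be read off as the mean of the logical vector, and quantile gives a central interval. A sketch in base R on simulated draws standing in for g6pd (the normal draws below are an assumption for illustration, not the bayesfst output):

```r
set.seed(1)
# Stand-in for the posterior samples; NOT the real g6pd chain
draws <- rnorm(2000, mean = 0.63, sd = 0.34)

# Proportion of draws above 0, computed two equivalent ways
prop_table <- table(draws > 0) / length(draws)
prop_mean  <- mean(draws > 0)      # TRUE counts as 1, FALSE as 0

# Central 95% interval of the simulated draws
interval <- quantile(draws, c(0.025, 0.975))
```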

  • M.A. Beaumont & D.J. Balding (2004) Identifying adaptive genetic divergence among populations from genome scans. Molecular Ecology, 13: 969-980, 2004
  • code for this example.

September 30, 2010
Preparing scientific articles

The send-up of news articles on scientific findings brings to mind Harry Harlow’s truly classic advice on writing psychology research articles. A few gems:

  • The best part of an introduction is that “one is not constrained by facts”
  • “Since the data will already have been collected and processed, you will have no difficulty in making insightful predictions.”
  • “the author should remember that he is not reading the literature—just citing it.”
  • if your figures are legible “there is a real danger that editors and readers will compare the information given in the graph with what is written in the Results”
  • “Whereas there are firm rules and morals concerning the collection and reporting of data which should be placed in the Results, these rules no longer are in force when one comes to the Discussion.”

  • Harlow, H F (1962) Fundamental principles for preparing psychology journal articles. J Comp Physiol Psychol 55. 893-896.

(Source: sciencedirect.com)

5:09pm  |   URL: http://tmblr.co/Z23PQy18U-9k
  
Filed under: one column 
August 15, 2010
Sugary syntax in R with match.call

Tooling around with functional magic in R.

It can make the inside of your functions rather hairy, but it then lets you sprinkle the rest of your code with functions that look really nice. Libraries like ggplot rely heavily on craft of this sort.
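The code from the original post is not reproduced here. As a minimal sketch of the kind of trick involved, a hypothetical helper show_args can use match.call to recover the expressions the caller typed and use them as labels:

```r
# Hypothetical helper: evaluate the arguments and label each result with
# the expression that was typed in the call.
show_args <- function(...) {
  # expand.dots = FALSE keeps the ... arguments together, unevaluated
  exprs <- match.call(expand.dots = FALSE)$`...`
  labels <- vapply(exprs, function(e) paste(deparse(e), collapse = " "),
                   character(1))
  values <- list(...)              # the evaluated arguments
  names(values) <- unname(labels)
  values
}

a <- 1:3
res <- show_args(a, sqrt(16))
names(res)                         # "a" "sqrt(16)"
```

The hairy part (digging through the call) stays inside the function, and the call site stays clean.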

March 5, 2010
The mob is organizing to participate

Companies with work that can be broken into short, repetitive tasks that still require a discerning human to complete are turning to cloud labor to distribute these tasks to workers throughout the world. Amazon’s Mechanical Turk was the first to define the field, followed by more targeted and nimble services like CrowdFlower and txteagle.

I have noticed a number of researchers using these tools to recruit participants and collect data, so I wanted to see whether they would suit my own research. I opted for CrowdFlower because they have a completely unintimidating sign-up procedure that encourages you to play with designing tasks before you push your survey out into the world.

Once you are in, you can go about creating a job. If you are making a survey or trying to field a psychological instrument, the interface suggests that you can get started without first adding data. Because most users of these services are oriented toward tasks and jobs, you normally populate the Job with the information you want workers to work on (such as a list of URLs to visit). This is not quite what we want, but I have found the rest of the system doesn’t work if you don’t populate the Job with some sort of data.

So what you can do is just create a two line .csv file with something like a survey identification number.
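One way to produce such a file from R, assuming a hypothetical column name survey_id and a temporary file location (neither comes from CrowdFlower):

```r
# Hypothetical file and column names; the point is only to have
# a header line plus one data line.
units_file <- file.path(tempdir(), "units.csv")
writeLines(c("survey_id", "1"), units_file)
readLines(units_file)   # "survey_id" "1"
```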

Place that in a plain text file and upload it to CrowdFlower (if your browser blocks Flash, whitelist crowdflower.com; the uploader depends on it).

Now you’re ready to get cracking. CrowdFlower has a very nice little form editor (under the Edit tab). However, you’d probably like to avoid pointing, clicking, and dragging as much as possible. It is also likely you already have the item content for your questionnaire. The thing to do, then, is to skip the GUI form editor and head straight for the CrowdFlower markup language, which gives you XML tags for designing the content of your survey. This is the real gem of CrowdFlower’s platform.

Once you are through designing your survey, you are ready to Order Judgments (sounds serious, right?). You’ll want to skip the calibration step, which presents you with a dummy copy of your survey and times you as you complete it. You should already know how long it takes to do your questionnaire.

Advanced Settings holds all the action. For Judgments per unit, put the number of individuals you want to fill out your survey. Remember that this whole interface is for microtasks, so it assumes a worker might get a page of 4 or 5 tasks to complete at once. That is not what we want, since all of our questions are in a single survey. Thus, Units per assignment should be 1.

CrowdFlower ties into two different labor communities: Amazon Turk and Give Work/Samasource. However, they also have a free internal interface that generates a URL you can give to your participants. For example, the survey I just made asks: Are you an individual?

So while CrowdFlower has many nice features, it isn’t quite suited for psychological surveys. This is hardly a criticism, since it wasn’t designed for this type of task. The main problem is the screen participants see once they’ve submitted the task. It isn’t what I’d describe as a good debriefing. That said, there is a type of task that this interface is quite suited for, which is assessing personality in nonhuman primates. Like the kind of jobs that CrowdFlower was designed for, having a number of raters assess the personality of some apes or monkeys is precisely a matter of getting a number of judgments by each worker (the raters) on the assigned units (the primate subjects).

Lastly, a few tips

  • In the CSV file you use to import the units, include not only the variables that will appear to the raters in the task information, but also metadata that will help you sort and organize the data later. Examples are stud numbers or database IDs for each animal.
  • You cannot edit the survey content once the job is running, so get it right before starting.

  • In CML, you can provide an alternative name that will show up as the column label in the dataset. Capitalization of this label won’t be preserved.

  • The output gives you some other information about the workers, such as an ID (which I assume is somehow tied to cookies in their browser) and the city they are connecting from.
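As a sketch of the first tip, a units file with extra metadata columns could be written from R like this. The animals, identifiers, column names, and file name are all hypothetical:

```r
# Hypothetical units: one row per animal to be rated.
units <- data.frame(
  name        = c("Alpha", "Beta"),   # shown to the raters
  stud_number = c("0001", "0002"),    # metadata for sorting later
  db_id       = c(101L, 102L),        # metadata: database ID
  stringsAsFactors = FALSE
)

units_path <- file.path(tempdir(), "primate_units.csv")
write.csv(units, units_path, row.names = FALSE)

# Reading everything back as character keeps leading zeros in stud numbers
back <- read.csv(units_path, colClasses = "character")
```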

photo cc-by Amodiovalerio Verde

February 19, 2010
Public data display with Tableau: Case study from NHANES

Nathan at Flowingdata turned me on to Tableau Public, a Software as a Service application for data sharing and visualization. The key feature, in addition to a graphical interface for data exploration and graphing, is embedding the visualizations on the web. Tableau takes care of hosting and gives you a snippet of Javascript that will render an interactive version of the visualization. As a trial run, I extracted some data from the National Health and Nutrition Examination Survey. Among all the other physiological and psychological data they collect, they ask about marital status and sexual behavior. One oddity that the study authors point out is that the number of people who are or have been married but who claim never to have had sex has increased over the last few years:

That sort of gets the point across and it took longer to export the data from SPSS than to start taking stabs at the visualization. It did take me a while to figure out how to get Tableau to display the timeseries (I think because ‘year’ was a dimension rather than a measure at first). Sharing interactive graphics on the web is a great place to end up, but I could only manage to produce graphs by trial and error. When manipulating the interface, it isn’t really clear what the results of most of your actions will be, though at least they are displayed immediately.

5:57pm  |   URL: http://tmblr.co/Z23PQyNncu1
  
Filed under: data one column