Following the high-profile cases of Diederik Stapel, Dirk Smeesters, and Marc Hauser, there has been a lot of blogging on NewAPPs and elsewhere about academic fraud in experimental psychology. Between outright fraud (fabricating data, omitting data, etc.) and running clean experiments lies a large grey zone. How do we delineate what is and isn't ethical in experimental psychology?
Richard Feynman compared grey-zone practices in psychology and other fields to Pacific cargo cults, whose adherents believe that the ancestors will come and shower them with cargo if only they perform the right rituals. His admonition was that such scientists are, in effect, behaving like cargo cultists: going through the motions of science without its substance. To avoid cargo cult science, we need a rigorous ethics of conducting and reporting experiments. Here is a recent take on this for psychology.
However, publishing practices in experimental psychology make it difficult to adopt these simple principles. In reporting a series of experiments, a team of psychologists is not merely reporting the testing of a hypothesis but constructing a compelling narrative. Below the fold are some common practices (as I gather from speaking with experimental psychologists).
- Scenario 1 - you have hypothesis H in mind. You conduct experiments 1, 2 and 3. However, the results are such that it's more natural, and more narratively compelling, to report experiment 3 first, then 2, then 1. Is it OK to switch the order? Note: in this scenario, each experiment has different participants (so no possible within-subject carry-over effects).
- Scenario 2 - you have hypothesis H_1 in mind. You run experiments. They confirm H_1 (as you hoped), but they also provide evidence for an additional hypothesis H_2 which you did not originally formulate. Is it OK to present the study as if you had been thinking about H_2 all along?
- Scenario 3 - you test for significant effects of several independent variables v_1, v_2, … v_n. You find a significant effect for v_1, your target variable, and for some other variables, but no statistically significant effect for v_3 and v_4. Is it OK to omit mentioning that you tested for v_3 and v_4 at all?
- Scenario 4 - you have n subjects (say, 20) and a p value of .059. You recruit more participants until you reach significance. Is this OK, or should you start over with a fresh sample?
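The statistical cost of scenario 3 is easy to make concrete. Even when every effect is truly zero, the chance that at least one of k tested variables comes out "significant" at the .05 level grows quickly with k. Here is a back-of-envelope sketch (my own illustration, not from any of the studies discussed; it assumes the k tests are independent, which real dependent measures rarely are exactly):

```python
def familywise_rate(alpha, k):
    """Chance of at least one false positive across k independent null tests,
    each conducted at significance level alpha: 1 - (1 - alpha)^k."""
    return 1 - (1 - alpha) ** k

for k in (1, 2, 4, 8):
    print(f"{k} variables tested -> "
          f"{familywise_rate(0.05, k):.1%} chance of a spurious 'hit'")
```

With four variables the chance of at least one spurious hit is already about 18.5%, and with eight it exceeds a third. Silently dropping the non-significant variables hides exactly this inflation from the reader.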
I personally think scenario 1 is acceptable, but scenario 4 is unacceptable because it greatly inflates the rate of false positives. Scenarios 2 and 3, despite their pervasiveness, are also problematic.
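The inflation in scenario 4 can be demonstrated with a short simulation (again my own sketch, not from the post's sources; it uses a one-sample z-test on simulated null data, with the look schedule, batch size, and cap chosen arbitrarily for illustration):

```python
import math
import random

def z_test_p(sample):
    """Two-sided one-sample z-test of mean = 0, population sd known to be 1."""
    z = (sum(sample) / len(sample)) * math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))

def run(n_sims=5000, n_start=20, n_step=10, n_max=100, alpha=0.05):
    rng = random.Random(1)
    fixed_hits = peek_hits = 0
    for _ in range(n_sims):
        # The null is true: every observation is pure noise.
        data = [rng.gauss(0, 1) for _ in range(n_max)]
        # Fixed-n design: one test at the preset sample size.
        if z_test_p(data[:n_start]) < alpha:
            fixed_hits += 1
        # Data peeking: test after every extra batch, stop at the
        # first "significant" result or at the budget cap.
        n = n_start
        while n <= n_max:
            if z_test_p(data[:n]) < alpha:
                peek_hits += 1
                break
            n += n_step
    return fixed_hits / n_sims, peek_hits / n_sims

fixed_rate, peek_rate = run()
print(f"fixed-n false positive rate: {fixed_rate:.3f}")
print(f"peeking false positive rate: {peek_rate:.3f}")
```

The fixed-n design keeps its nominal 5% false positive rate; the peeking design, which merely keeps adding participants until p dips below .05, roughly triples it, even with the modest cap of 100 participants used here. The "effects" found this way are exactly the kind that fail to replicate.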
Many of these slippages in conceptual clarity and experimental rigor can be easily amended. For scenario 1, a footnote suffices ("the original order of these experiments was …; the order has been changed for clarity"). For scenario 2, you simply say "we additionally observed E, which can be explained post hoc by hypothesis H_2"; more radically, if the unexpected finding is really spectacular, you start from zero with a fresh set of participants and run a new experiment designed to test H_2. This is how Dehaene et al. discovered the SNARC effect for number. For scenario 3, you report all findings, including the non-significant ones. For scenario 4, you avoid data-peeking and set your number of participants in advance (making it large enough to detect an effect, if there is one); if that is no longer possible, you recruit a preset number of additional participants (preferably a large one) and test only once you have reached that preset number.
However, in spite of the simplicity of these measures, they are still rarely implemented, and I see no evidence of a radical change of attitude among psychologists. Some I talk to say: "They can't expect us to run a whole new experiment (in scenarios 2 and 4). We don't have enough money, time, etc."
So how can we change the research climate that made Smeesters believe, or at least allege, that he was morally off the hook, claiming "the culture in his field and his department is such that he does not feel personally responsible, and is convinced that in the area of marketing and (to a lesser extent) social psychology, many consciously leave out data to reach significance without saying so"?
Update: even without radically overhauling the research climate in x-psychology, it is possible, with a few simple measures, to improve the reporting of results. For instance, one can contribute to or consult the following databases (thanks to Dan Weiskopf and Ann Fischer for the pointers):