In a recent paper in Psychological Science, Joseph Simmons et al. report how listening to particular musical compositions affects your subjective age. In one experiment, they report that people who listened to Hot Potato (by The Wiggles) felt subjectively older than those who listened to a control song (Kalimba, an instrumental tune that comes with Windows 7). In a second study, they report that listening to the Beatles' When I'm 64 leads one to underestimate one's own age. Undergraduate students listened to either When I'm 64 or Kalimba. They were then asked to fill out a questionnaire that included their father's age and their own date of birth. Students who had listened to When I'm 64 reported being almost 1.5 years younger than those who had listened to Kalimba. If these results strike you as unbelievable, they are! The study is absolutely bogus, as the authors themselves freely admit. Yet they did not fabricate data or tamper with their results; they simply used practices that are common in experimental psychological research.
In 2009, I participated in an intensive two-week workshop in Oxford where I and a bunch of other lucky scholars in the humanities (mainly philosophers, anthropologists and literary scholars) were taught to do quantitative research in the cognitive science of religion. Our mentors Justin Barrett, Emma Cohen and Miguel Farias instructed us in good research practice. To give but one example, one of us asked: "What if I don't get a statistical result but only a trend with my sample? Can I just add more subjects?" Justin said: "You just throw all the results away and start with a new sample if you think it's borderline significant. If you keep the original dataset and add to it, you may well risk detecting a significance that isn't there, because of the initial (freak) trend." This makes sense; Simmons et al. found that a researcher who starts with 10 observations per condition and then tests for significance after every new observation ends up finding a significant effect 22% of the time. Justin recommended we determine beforehand how big our dataset is going to be, or at least outline a clear window in which data will be gathered (e.g., spring term).
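To make that point concrete, here is a minimal simulation sketch in Python (my own illustration, not code from Simmons et al.): both conditions are drawn from the same distribution, so any "effect" is a false positive, and the simulated researcher starts with 10 observations per condition, adds one at a time, and runs a t-test after every addition. The cap of 50 observations per condition is an assumption I've added to keep the simulation finite.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeking_finds_significance(n_start=10, n_max=50, alpha=0.05):
    """Simulate one experiment with no true effect: start with n_start
    observations per condition, add one observation to each condition at a
    time, and run a t-test after every addition. Return True if any of
    those interim tests comes out 'significant'."""
    a = list(rng.normal(size=n_start))
    b = list(rng.normal(size=n_start))
    while True:
        _, p = stats.ttest_ind(a, b)
        if p < alpha:
            return True
        if len(a) >= n_max:
            return False
        a.append(rng.normal())
        b.append(rng.normal())

n_sims = 2000
hits = sum(peeking_finds_significance() for _ in range(n_sims))
# With a fixed sample size, only about 5% of these null experiments would
# come out significant; with peeking after every observation, the rate
# climbs far above the nominal alpha.
print(f"False-positive rate with optional stopping: {hits / n_sims:.1%}")
```

The exact rate depends on the starting sample and the stopping cap, but the direction is the point: every extra peek gives chance another shot at crossing the .05 threshold.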
But submitting and refereeing for journals in psychology has been a different matter. For instance, in one paper I mentioned statistically insignificant results (e.g., p = 0.67). Referees urged me to omit them since, after all, the results weren't significant; indeed, they even recommended that I drop any reference to these tests altogether. Similarly, when I controlled for lurking variables and my results still held, I put in the statistical test and significance level. Referees suggested I leave those tests out, since, after all, my results remained significant. Once I was asked to referee a paper in cognitive psychology, and I recommended that the author perform a test to see how statistically powerful his results were (he did the study with a very small group, which could not be helped given the particular circumstances in which the experiment took place; I can't say more without revealing the identity of the author). However, the editor stepped in to say that his journal did not in fact publish such results, so the test was not performed. I have also been a participant in studies conducted by others, where I was not originally recruited as a participant but was asked to fill out a questionnaire at a later stage because referees had asked the authors for some more subjects to increase the statistical reliability. Of course, the proper thing to do there would have been to start all over again with a fresh sample; constraints on time and resources probably made this impractical for the authors. What I want to illustrate with these very limited experiences with research practices in psychology is that refereeing and editorial policies actually encourage the handling of data that, according to Simmons et al., can lead to an excess of false positives. A few simple changes in refereeing practice could avoid a lot of these problems.
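As an aside on the power analysis mentioned above, this is roughly the kind of check a referee could ask for. The sketch below uses statsmodels; the effect size (Cohen's d = 0.5) and the group size of 15 per condition are hypothetical numbers chosen for illustration, not the figures from the paper I refereed.

```python
# A rough sketch of a power check for a two-sample t-test, using
# statsmodels. The effect size and group sizes below are hypothetical.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power to detect a medium effect (Cohen's d = 0.5) with 15 subjects per group
power = analysis.solve_power(effect_size=0.5, nobs1=15, alpha=0.05, ratio=1.0)
print(f"Power with n = 15 per group: {power:.2f}")   # roughly 0.26

# Sample size per group needed to reach 80% power for the same effect
n_needed = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05, ratio=1.0)
print(f"n per group for 80% power: {n_needed:.0f}")  # roughly 64
```

On these hypothetical numbers, a study that small has only about a one-in-four chance of detecting even a medium effect, which is exactly the kind of limitation a reader should be told about.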
In particular, my experience has been that referees prefer tidiness, and this can get in the way of accuracy: "Review teams are the gatekeepers of the scientific community, and they should encourage authors not only to rule out alternative explanations, but also to more convincingly demonstrate that their findings are not due to chance alone. This means prioritizing transparency over tidiness; if a wonderful study is partially marred by a peculiar exclusion or an inconsistent condition, those imperfections should be retained" (p. 5).