Daryl Bem and the Replication Crisis (slate.com)
69 points by wellpast on May 24, 2017 | 47 comments


I might have missed it in this rather long article, but I think the whole debate could use more people who have read Karl Popper and are familiar with positivism vs. falsifiability. I would not be surprised if all this amounts to the observation that you can, of course, "replicate" results if you try often enough and ignore the falsifying results when it does not work.

I mean, of course the big problem is that applying strict falsifiability to social sciences does not work. You can always find a counter-example, the problems studied do not work like that. It is hard to reconcile this, but I think that is the heart of the replication crisis (together with some bad statistics). But (having done a PhD in a somewhat related field) I see only two options:

1. We need a complete overhaul of how experiments and study results are published, so that observers can see the failed results and we can try to assess how often a theory holds up

2. We have to limit non-hard sciences to questions that are not ambiguous, where one well-done negative result really shows a theory is wrong.

Of course, there is a third option: Just continue as it is now and ignore all results that seem unlikely, because given how those fields work, they are most likely wrong.


The problem with 2 is that psychology is happy to acknowledge that many of its results are context dependent [0]. The entire field of cross-cultural psychology is founded on the understanding that psychological findings may not generalize across cultures. While this suggests that there are likely no "natural laws" in psychology, it does nothing to dismiss the importance and usefulness of the non-natural laws we do find. Given this framework, falsifiability still applies, but may only suggest refinement of the conditions of our study.

Your third option seems to require the pre-empirical determination of which results are "unlikely", and amounts to choosing to believe only in the studies that confirm what we already intuitively believe.

Having said that, I completely agree with your first point. Replication studies alongside the increased visibility of studies that fail to reject the null hypothesis are absolutely essential.

[0] http://psych.wustl.edu/memory/Roddy%20article%20PDF's/Roedig...


> The problem with 2 is that psychology is happy to acknowledge that many of its results are context dependent

Yes. But that's only a specific kind of psychology. During my master's I had a psychology professor (who originally came from the physics department) who said plainly that all of that is bogus, and that the only thing psychology should do is clear-cut studies like finding just-noticeable differences. In his case, he was studying how the brain processes visual input that way. Very interesting, and it worked well for him.


You simply need to declare studies before you perform them and you avoid the kind of bias you are talking about. The core issue is not the method, but poor incentives leading to people gaming the system instead of doing science. No system is going to survive most people trying to break it.

On the whole the 'soft' sciences are not actually science right now. They are philosophy playing dress up. Sure, you get a few honest people trying to do real research, but mostly not.

PS: I remember seeing the same problem on a small scale in the physical sciences. One professor tried to weigh air by weighing a bag with and without air, which did not work for obvious reasons. But blow into it, adding enough saliva, and eventually you can finagle the number you want. I kept hoping he would eventually say "see, don't do this" and then give the details, but no, he felt like he got the results he wanted.


The article linked to this piece explaining exactly why declaring studies before you perform them isn't the whole answer: https://www.psychologicalscience.org/observer/why-preregistr...


It's a common and disingenuous criticism that registered reports disallow discovery. They don't prevent you from doing exploratory analysis, they simply require you to display it as exploratory rather than pretending that was your hypothesis all along.


Yeah, I have very little respect for this criticism.

In fact, the end of the Slate article gives a perfect example of how things should work. Bem did a large, preregistered psi study, found no effect, but noticed an interesting positive result he hadn't registered for. So he published the negative result, mentioned the positive result as an avenue for further research, and is now putting together a preregistered study on that basis.

This seems totally above board. (And again, Bem outperforms the standards of real fields.) There's nothing wrong with noticing something suggestive in a registered dataset, or even gathering exploratory data directly. You just have to confirm what you find with preregistration.


The problem with small changes is that they are going to benefit or hurt each party. If you say A || B || C || D, then party X will like A & D and party Y will like B & C, but there is no way to move forward.

Just look at giving DC a House seat when it's larger than other states that get not only a seat in the House but two in the Senate. Party A wins and party B loses, so it deadlocks into simple party politics, because they are not trying to follow what people want, just to change the system to benefit themselves.

IMO, the core issue is not the specifics, it's a system which has been corrupted over time. Consider that there is a North Dakota and a South Dakota simply because that gives them more seats in the Senate. So, because of that power grab all those years ago, we end up with a small population with more power than it would otherwise have and little reason to give it up.


Yeah, pasted that in the wrong thread. Oops, sorry.


Why does strict falsifiability not work in the social sciences? You just need to state your theories better if you can otherwise find counterexamples easily. Physicists don't say "smash these two particles together and you'll see a Higgs boson"; they have sufficiently nuanced theories that sifting through petabytes of data to find what you claim exists is justified.


It's about the problems you can study.

Take the Cornell Food Lab as an example. One of its findings was that people will eat less sweets if they are stashed away and not fully visible on a table. It's interesting and it might very well be true, overall. But it is absurdly hard to know for sure. For one, devising an experiment for this is very hard. But the main problem is that you will absolutely find at least one person who eats more sweets when they are stashed away. So, the theory is wrong? Not really.

It might be an invalid theory, though. But if you think that, there are very few things those fields could study. I'd say there would be no use for them.

> they have sufficiently nuanced theories that sifting through petabytes of data to find what you claim exists is justified.

That does not sound like searching for falsifiability to me. Rather, it sounds like the workaround I also tried: backing up a theory (one that predicts an overall result, not something true for each individual) with as much data as possible so one can reasonably assume it is true. But that's not really the correct approach.

No, for something like the Higgs boson they try to see it or its effects (and if they did not see it under the right circumstances, the theory would be false). It might be a bad example, though, given how it touches on physicists' theory building.


Is a theory that is true only part of the time actually any use to anyone?

For example, your example. I might be the guy who eats more sweets when they're hidden. I go to my psych to talk about my sweet problem. The psych says "it's a well-studied phenomenon that you will eat less sweets if they're hidden". I hide my sweets. Boom, I eat more. What happens then?

We know there's huge variety among humans. But statistics weeds that variety out and talks about "the average human". Which is great for statistics and researchers. But since there's no such thing as the average human, how does this actually help us?


If, say, 95% of people tend to eat less sweets when they're hidden, it's worth trying the method first (then discontinuing it if you turn out to be one of the 5%). Is a drug that cures most, but not all, instances of a specific infection useless?


> Why does strict falsifiability not work in social sciences?

Theories in physics provide specific numerical predictions. Theories in social psychology provide predictions like 'X is correlated with Y' which is very difficult to falsify because in practice everything is correlated with everything else to some degree.

Null hypothesis significance testing is not the same as Popperian falsification either - it attempts to falsify the null hypothesis rather than the theory in question, and in the social sciences the null hypothesis is almost never strictly true.
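To make that concrete, here's a minimal Python sketch (the sample size and the tiny "crud factor" correlation are invented for illustration): with a large enough sample, a trivially small correlation still "falsifies" the null at p < .05, which says very little about the substantive theory.

    # Simulation sketch: a tiny "crud factor" correlation plus a big sample.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 100_000            # large sample, as in big survey datasets
    true_r = 0.03          # trivially small "real" correlation

    x = rng.normal(size=n)
    y = true_r * x + np.sqrt(1 - true_r**2) * rng.normal(size=n)

    r, p = stats.pearsonr(x, y)
    print(f"observed r = {r:.3f}, p = {p:.2e}")
    # Typical output: r around 0.03 with p far below .05. The null is
    # "falsified", but the theory that predicted "X is correlated with Y"
    # has barely been tested at all.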

Meehl was talking about this as far back as 1990:

> Null hypothesis testing of correlational predictions from weak substantive theories in soft psychology is subject to the influence of ten obfuscating factors whose effects are usually (1) sizeable, (2) opposed, (3) variable, and (4) unknown

https://meehl.dl.umn.edu/sites/g/files/pua1696/f/144whysumma...

The consensus among folk who care about the problem seems to be that we should move from significance testing to predicting and estimating effect sizes.
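As a rough sketch of what that looks like in practice (the data below are simulated, not from any real study), you report the estimated effect size with an interval rather than a bare significance verdict:

    # Simulated data for two groups; the point is the reporting style, not the numbers.
    import numpy as np

    rng = np.random.default_rng(1)
    treatment = rng.normal(loc=0.2, scale=1.0, size=200)
    control = rng.normal(loc=0.0, scale=1.0, size=200)
    n1, n2 = len(treatment), len(control)

    diff = treatment.mean() - control.mean()
    pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
    d = diff / pooled_sd                      # Cohen's d

    # Approximate 95% CI for d (large-sample normal approximation)
    se_d = np.sqrt(1 / n1 + 1 / n2 + d**2 / (2 * (n1 + n2)))
    print(f"d = {d:.2f}, 95% CI = ({d - 1.96 * se_d:.2f}, {d + 1.96 * se_d:.2f})")
    # A wide interval around a small d tells you something a bare
    # "significant / not significant" verdict never could.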

And then of course on top of that we still have to deal with publication bias, HARKing, researcher degrees of freedom etc.


From the article:

"A few students—all of them white guys, Wu remembers—would hang around to ask about the research and to probe for flaws in its design. Wu still didn’t believe in ESP, but she found herself defending the experiments to these mansplaining guinea pigs. "

Was this race- and gender-related jab really necessary? (Honest, non-rhetorical question.)


No, it was a mistake and distracts from the article. But empirically, the 3 previous discussions about this article have been too distracted by this "easy target" to discuss the interesting (to me) underlying issues, and have been user flagged to death. You can find others' answers to your question by searching for the earlier postings. So without malice, I downvoted your comment not because it's wrong to ask, but to try to move it to the bottom of the page where it can do less harm.


Thanks for the heads-up and backstory. I didn't know this article had this kind of history here on HN.


Yeah, that seemed really incongruous. It left me wondering if I was missing some industry-specific context.


If you can get past the linkbait title and some questionable word choices on the part of the author, this is actually an excellent article. The article is not trying to convince you that ESP is real, at least not in the sense of in-the-world reality. Rather, it's an article about the standards of scientific proof, showing that there are cases where standard statistical practices can "prove" apparently absurd results.

The issue is that in general, it's hard to tell the difference between "absurd" and "unexpected". In theory, the scientific method is about designing experiments that produce replicable results that cannot be explained by current theory, and then refining (or replacing) the theory until it can explain the new results, while still making correct predictions about all cases covered by the old theory.

But what do you do when results are obtained that violate the foundations of science, such as the time order of cause and effect? Naturally, one should start by being skeptical of the experiment. Was the data accidentally recorded wrong? Was the data inappropriately filtered before being analyzed? Is the experimenter lying about the data that was obtained?

Usually, thinking about these issues yields some apparent cause of error that would explain the unexpected results without violating one's basic beliefs. Unfortunately, an apparent reason for disbelief can often be found even if the results truly are impeccable. Often, the result is that doubters continue their disbelief, and the believers continue believing, until one faction or the other retires from the field and the other belief becomes "consensus".

What Bem has done is to design an experiment that surpasses the statistical standards of many fields, yet "proves" a result that on the surface seems impossible. Most of the usual scientific errors have been avoided, and his methodology and analysis are better than most. One possibility is that our current conceptions of causality are wrong: we think that for A to cause B, A must happen before B, but in fact this is not a requirement.

Another possibility is that something is grossly wrong with our current interpretation of scientific results, and many longstanding theories which are considered "scientifically" proven may in fact be mistakes, or at least have no more "proof" than Bem has managed to show for ESP. For most scientists, both of these answers are problematic, yet it would seem that at least one of them must be true. One twist is that some believe that Bem's goal is not actually to prove that ESP is correct, rather to show that the foundations of science are faulty: http://andrewgelman.com/2013/08/25/a-new-bem-theory/.


There is a video from a debate with Daryl Bem about those results. It is worth a look, because he admits to using various methods that increase the degrees of freedom, without taking those into account during the statistical analysis. Essentially, he honestly admits to p-hacking, but doesn't seem to recognize anything wrong with that.

On those grounds, I think it is a little misleading to suggest that those studies surpass the statistical standards of many fields. The statistical tests are completely misapplied, but you can't tell just by looking at publications.
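For anyone who hasn't seen how much damage unreported degrees of freedom can do, here's a toy simulation (the specific numbers are mine, not anything from Bem's papers): analyze null data five different ways, keep the best p-value, and the nominal 5% false positive rate roughly quadruples.

    # Toy simulation of researcher degrees of freedom under the null.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n_experiments = 10_000
    n_subjects = 50
    n_analyses = 5      # e.g. five outcome measures or subgroup splits

    false_positives = 0
    for _ in range(n_experiments):
        # Null data: every "analysis" is an independent measure with no true effect.
        data = rng.normal(size=(n_analyses, n_subjects))
        pvals = [stats.ttest_1samp(measure, 0.0).pvalue for measure in data]
        if min(pvals) < 0.05:
            false_positives += 1

    print(f"false positive rate: {false_positives / n_experiments:.2%}")
    # Roughly 1 - 0.95**5, i.e. about 23% instead of 5%, and nothing in the
    # published tables would show that the extra analyses were ever run.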


Do you have a link to the debate you mean? I'd be interested to see it, but when I searched I found several. One theory is that while the initial experiments were p-hacked to some extent, the replications (by definition) were not. Some trustworthy sources argue that the problem is that the replications actually weren't, but I haven't looked into the specifics closely enough to know whether that's the case. That said, I thought Bem came across quite well in this interview: http://skeptiko.com/daryl-bem-responds-to-parapsychology-deb....

I think it is a little misleading to suggest that those studies surpass the statistical standards of many fields

I didn't mean to imply that the statistical standards were good ones, rather that the standards in some fields are abysmally low. For example, there's a high profile case going on with the Cornell Food Lab, where the extremely prolific lead researcher seemed not just blissfully unaware but proud of the unseen pitfalls around him: https://web.archive.org/web/20170312041524/http:/www.brianwa....

The statistical tests are completely misapplied, but you can't tell just by looking at publications.

I think that's true, but (in my opinion) the problem is finding papers outside the very hard sciences where they _aren't_ completely misapplied. For example, I feel there is an insurmountable issue with applying formal statistics to any sort of meta-studies, where the data is at best a convenience sample, and (almost?) all conclusions are conditional on the unknown biases of the sample. This of course doesn't mean that all (or even most) meta-analyses produce the wrong answer, but I think it does mean that they should be treated as rhetorical rather than logical arguments.


This caught my attention: "...insurmountable issue with applying formal statistics to any sort of meta-studies..."

If I wanted to get a better understanding of what formed your opinion on this, where might I look?

P.S. I enjoyed the way you formatted and responded to your parent comment.


My belief is mostly intuitive and likely more extreme than most, and I don't know of a good single source to point to. John Ioannidis' writings are probably a good starting point: http://retractionwatch.com/2016/09/13/we-have-an-epidemic-of.... Searching for "convenience sampling" on Andrew Gelman's blog yields lots of good discussion in the comments: http://andrewgelman.com/?s=%22convenience+sample%22. Miguel Hernan's book on Causal Inference gives a good sense of the pitfalls of biased sampling: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-.... Sorry I can't do better, and maybe others can add better sources.

P.S. I enjoyed the way you formatted and responded to your parent comment.

Thanks, the style is sometimes referred to as "interleaved" or "inline", as opposed to "top posting" and "bottom posting": https://brooksreview.net/2011/01/interleaved-email/. It was the norm for early online communications, but has mostly fallen out of favor. I think it works very well for some situations, although it too is surprisingly controversial: https://hackertimes.com/item?id=5233428.


There's also a Google Tech Talk on the subject of ESP, by Rupert Sheldrake.

https://www.youtube.com/watch?v=JnA8GUtXpXY


> Usually, thinking about these issues yields some apparent cause of error that would explain the unexpected results without violating one's basic beliefs. Unfortunately, an apparent reason for disbelief can often be found even if the results truly are impeccable. Often, the result is that doubters continue their disbelief, and the believers continue believing, until one faction or the other retires from the field and the other belief becomes "consensus".

This is a wonderfully succinct statement of the problem. It's not that no one can find a weakness in Bem's work, it's that no one can find a principled weakness, something that invalidates his result without invalidating more respectable results at the same time.

For scientists, this is an important guide to improving research practices. For the rest of us, it's largely a warning to continue relying on base rates, and not update too strongly even in the face of solid-looking results. With every failed replication, that's going to get even worse.


Your comment exceeds the article. Well done. This is why I read HN.


Agreed. I often go straight to the comments.


Last week, I attended PyCon in Portland. Two keynote speakers were from academia. Both speakers touched on challenges that scientific communities are having with regards to replication. One attempt to address this challenge is by using, and sharing as reference material, Jupyter notebooks (Python) used during research.


> and sharing as reference material

I'm hugely in favor of this. Lots of replication-crisis stuff focuses on handling statistical issues like salami slicing. That's great, but it's a one-problem fix. Releasing more comprehensive information is a far, far bigger advance, which enables everything from catching data fraud to verifying statistical analysis.

The Reinhart and Rogoff study, for instance, was afflicted with nontrivial formula errors in its analysis. A standard of releasing data and code with papers would have allowed this to be discovered in months rather than years.


It really calls into question whether psychology can ever be a science.


This is a very worthwhile article in terms of simple facts. It covers largely the same ground as the Slate Star Codex piece a few years back, updated and made more accessible. But I can't help wondering at the article's urge to challenge all of science or none of it. It feels rather protective, as though the author is unwilling to countenance the possibility that studies in psychology (and related subfields) in particular are in jeopardy.

"The replication crisis as it’s understood today may yet prove to be a passing worry or else a mild problem calling for a soft corrective. It might also grow and spread in years to come, flaring from the social sciences into other disciplines, burning trails of cinder through medicine, neuroscience, and chemistry."

This brackets the possible outcomes, certainly. But it's one hell of an excluded middle, implying that perhaps there will be no serious errors found (already out of the question) and perhaps entire fields will be wiped away. Realistically, we have a much better understanding of the crisis than this already.

Medicine has a disturbing number of process issues, but many (like ignoring NNTH) are unrelated to replication errors. The neuroscience result is serious but specific to fMRI studies, and the chemistry link there is aggressively misleading. It concerns documentation and yield statistics for very real reactions, not the sort of "whole theories are junk" issues social psych is up against. There is effectively zero chance that chemistry and psychology come out of this looking equally good (or bad), and I'm disturbed by the equivocation.

"If you bought into those results, you’d be admitting that much of what you understood about the universe was wrong. If you rejected them, you’d be admitting something almost as momentous: that the standard methods of psychology cannot be trusted, and that much of what gets published in the field—and thus, much of what we think we understand about the mind—could be total bunk."

This is hyperbole, fine, but it still aggravates me in light of the other section. These two things are not comparable! One contravenes fundamental physics, the other says most psychology results are unproven (not even false, just unproven). Michael Inzlicht says of the crisis, "I feel like the ground is moving from underneath me, and I no longer know what is real and what is not," but even he doesn't consider this comparable to the collapse of basic physics.

A quick look at Engber's other articles suggests a similar thread runs through all of them. He covers specific replication failures, but runs headlines like "Science Is Broken. How Much Should We Fix It? More rigor in research could stamp out false positive results. It might also do more harm than good." He writes about Gary Taubes (recently) as though past errors on carbs make Taubes' (consistently failed) claims true. The list goes on.

There is a weird, pervasive implication - not just in Engber's writing - that the replication crisis means we must throw everything up in the air equally. That maybe ego depletion is still true-as-studied, and maybe basic chemistry and biology are false. This sows needless confusion around the information we already have, and fuels a "teach the controversy" attitude on topics from stereotype threat to insulin. We would be better served by a less hedged but more cautious approach that makes a real effort to discuss how confident we should be on which points.


Scott Alexander wrote about Bem and parapsychology back in 2014, http://slatestarcodex.com/2014/04/28/the-control-group-is-ou...


I thought the back and forth between Johann and others deep in that thread added a lot to the piece: http://slatestarcodex.com/2014/04/28/the-control-group-is-ou...

One conclusion seems to be that seeing successive replications of an experiment that you believe to be flawed does not mean that one should eventually lose one's doubt. Unfortunately, it also does not mean that one's doubt is justified, and that the results can safely be ignored. Rather, it means that at some point (if you want to improve your knowledge) you have to figure out some way to analyze the experiment from another angle and remove (or confirm) the base cause of the doubt.


A line from Andrew Gelman I really appreciate:

"Again, they’re placing the original study in a privileged position. There’s nothing special about the original study, relative to the replication. The original study came first, that’s all. What we should really care about is what is happening in the general population."

There are two very different questions about replication.

One is whether the study got its results by chance, including forced-chance techniques like forking paths and salami slicing. This can be handled with either preregistration or exact replication. (And at p < .05, replication is a must, because even un-forced tests of true nulls will come up "significant" 5% of the time!)

But the other is whether the study got its non-chance results through methodological flaws or an actual insight about the world. Exact replications are no good for this - doing the wrong thing twice is no better than doing it once. The power poses study, for instance, used testosterone sampling procedures that introduced known confounders. What would help is a study equivalent of N-version programming: settle, preferably via preregistration, on multiple tests for the same effect. If they all work, you win. If some work (repeatably) and others don't, you've either made a design error or found a different effect than the one you were looking for.

This also explains how to work your confidence levels (a topic discussed in that SSC thread). You can't replicate a study endlessly and gain confidence every time. Given a prior for P(effect), exact replications boost P(effect ∪ bad study), and your P(effect) belief is bounded by the odds of a methodology error. It's a point I'd never considered until that SSC post, and one a lot of actual researchers still seem to miss.
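A back-of-the-envelope Bayes sketch of that bound, with priors and likelihoods invented purely for illustration:

    # Three hypotheses: real effect, shared methodological flaw, or chance.
    priors = {"effect": 0.01, "flawed_method": 0.05, "chance": 0.94}
    # Probability that a single exact replication comes out positive
    # under each hypothesis (numbers made up for the example).
    p_positive = {"effect": 0.8, "flawed_method": 0.8, "chance": 0.05}

    def posterior_after(k_successes):
        # Bayes' rule after k successful exact replications.
        weights = {h: priors[h] * p_positive[h] ** k_successes for h in priors}
        total = sum(weights.values())
        return {h: w / total for h, w in weights.items()}

    for k in (1, 5, 20):
        print(k, {h: round(p, 3) for h, p in posterior_after(k).items()})
    # P(effect) climbs but levels off around 0.01 / (0.01 + 0.05), i.e. ~0.17:
    # more exact replications can't separate a real effect from a repeatable
    # design flaw; only a different kind of test can.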


That's a great article. Scott Alexander is always worth a read.

I especially love this bit:

> Two experimenters in the same laboratory, using the same apparatus, having no contact with the subjects except to introduce themselves and flip a few switches – and whether one or the other was there that day completely altered the result. For a good time, watch the gymnastics they have to do in the paper to make this sound sufficiently sensical to even get published. This is the only journal article I’ve ever read where, in the part of the Discussion section where you’re supposed to propose possible reasons for your findings, both authors suggest maybe their co-author hacked into the computer and altered the results.


I love the details in that article.

"For example, Savva et al was listed as an “exact replication” of Bem, but it was performed in 2004 – seven years before Bem’s original study took place. I know Bem believes in precognition, but that’s going too far."


This is clearer and a lot more comprehensive than the Slate piece.

Thanks for pointing to it!


Oh man, I've been looking for this article for ages. When the new Bem article came out yesterday I hunted for it again and couldn't find it. Thank you so much!


Front page six days ago with a different title: https://hackertimes.com/item?id=14364573


Yes, we invited this repost in the hope that a different title would make for a less lame discussion: https://hackertimes.com/item?id=14372695.


Resampling the comments until you get the results you want is ironic, considering the problems this causes in some (related to the article) scientific fields.


We're not running a study to see what comments arise under random conditions. We already know that! In fact it's what we're trying to avoid.


Thanks - this incarnation of the thread has actually been really good, and I'm glad it came back through.


How do you invite someone to repost something?


They get an email with a special link. If you resubmit using that link the submission gets insta-promoted to the lower end of the front page.


Right, which usually guarantees a few minutes of frontpage time and community interest (or disinterest) takes it from there. This is described at https://hackertimes.com/item?id=11662380 and links back from there.


It's been briefly on the front page a couple more times as well (click "past" at the top of the page), but each time it was quickly flagged off by users. Likely this was because many of the early comments were concentrating on the flaws in the article rather than the intriguing underlying concept. Let's see if we can keep this one alive by staying focused on the interesting parts.



