# The difference between two t-tests doesn’t tell you what you think.

The situation is common. Let’s say you want to know if a drug increases sleep duration. You grab 20 mice, you give 10 of them the drug, and 10 of them vehicle. You measure their sleep durations and you do a t-test, and get p < 0.05. You conclude that the drug increases sleep time. Then you realize you did the experiment on all male mice, so you grab a cohort of 20 female mice, give 10 the drug and 10 vehicle. You measure how long they sleep and you do a t-test and get p > 0.05. You conclude that the drug does not increase sleep time in female mice. Thus, you conclude that the drug has different effects in males and females.
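To make the scenario concrete, here is a minimal sketch of that experiment in code. The numbers (baseline sleep of 10 hours, a one-hour drug effect, a standard deviation of 1) are made-up assumptions purely for illustration; the key point is that the true drug effect is identical in both sexes:

```python
# Simulate the two-cohort experiment: the drug has the SAME true effect
# in both sexes, yet the two separate t-tests can still disagree.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 1.0  # extra hours of sleep; identical for males and females

for sex in ("male", "female"):
    vehicle = rng.normal(loc=10.0, scale=1.0, size=10)  # control sleep (h)
    drug = rng.normal(loc=10.0 + true_effect, scale=1.0, size=10)
    t, p = stats.ttest_ind(drug, vehicle)
    print(f"{sex}: p = {p:.3f}")
```

Run this with different seeds and you will find runs where one cohort lands under 0.05 and the other does not, despite both being drawn from the same underlying effect.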

Seems reasonable, right? Well, it’s not. In fact, it’s even less reasonable than I thought.

If you believe the approach above is reasonable, I think a graph of data that could be generated in this situation will help you start to see the problem.

[Figure: hypothetical data showing a statistically significant effect of a drug in one group, and no significance in the other group.]

That means the drug had a different effect, right?

One way to frame the logical flaw that underpins this problem is that it is a mistake to conclude that p > 0.05 means the drug had no effect. When p = 0.06, it means that if the null hypothesis were true, we would get a t value as extreme, or more extreme, in 6 out of 100 trials. This is not the same thing as saying there is no effect of the drug. Another way to think about it is that we can’t conclude anything about the difference in the effect of the drug in males and females, because we never actually compared it.

Perhaps you say, “You’ve picked some specific case where the ‘two t-test’ approach doesn’t work, and okay, I get that some anally retentive statistician may have a problem with this approach, but it can’t be that bad.” Well, it is. At worst, this approach will give you a false answer 50% of the time! Let me show you.

If we generate data just like the above, where there is no true effect of sex, and we keep everything constant apart from an increasing effect of our “drug”, we can look at how often sex appears to have a statistically significant effect (i.e. the two t-tests disagree: one significant, one insignificant). How frequently this happens could be called the “false positive rate”. When we do that, we see that at its worst, the two t-test approach tells us sex has an effect, when there was none, in 50% of experiments: at a certain effect size (i.e. the difference between the group means divided by the standard deviation of each group), half of all experiments would tell us there is a significant effect of sex.
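A sketch of that simulation, under the same illustrative assumptions as before (10 animals per group, unit standard deviation, no true sex effect). For each effect size we count how often exactly one of the two t-tests comes out significant:

```python
# For a range of drug effect sizes, with NO true sex effect, count how
# often one t-test is significant and the other is not (the "false
# positive rate" of the two t-test approach).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, trials = 10, 2000

for effect_size in (0.0, 0.5, 1.0, 1.5, 2.0):
    disagree = 0
    for _ in range(trials):
        sig = []
        for _sex in range(2):  # same true drug effect in both "sexes"
            control = rng.normal(0.0, 1.0, n)
            drug = rng.normal(effect_size, 1.0, n)
            sig.append(stats.ttest_ind(drug, control).pvalue < 0.05)
        disagree += sig[0] != sig[1]
    print(f"effect size {effect_size:.1f}: apparent 'sex effect' "
          f"in {disagree / trials:.0%} of trials")
```

With n = 10 per group, the disagreement rate climbs toward roughly 50% around an effect size of 1, then falls away again at larger effects.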

If we plot that same graph, but instead of the effect size we use the theoretical power of the statistical test (i.e. how often each t-test would find an effect of drug), we will understand why this “two t-test” approach is so fraught. When the power of each statistical test is 50%, then in 50% of trials one test will be significant and the other insignificant.

You see, it comes down to statistical power. If each t-test has very low power, then both t-tests will almost always be insignificant, so you will rarely see an “effect” of sex. If each t-test has very high power, then both t-tests will almost always be significant, and so again, you will rarely see an effect of sex. But in the middle, when each t-test has about 50% power, then 50% of t-tests will be significant, which means that on average, in 50% of cases, one test will be positive and one will be negative, leading you to erroneously conclude that sex has an effect.

So what should you do instead? You should run a 2-way ANOVA. A 2-way ANOVA allows you to investigate whether there is an effect of drug, whether there is an effect of sex, and importantly, whether there is an interaction between these two factors, i.e. whether the effect of the drug depends on sex. If we do that, our problems go away.