Power and false positives¶

Peter Ralph

https://uodsci.github.io/dsci345

In [1]:
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (15, 8)
import pandas as pd
import numpy as np
import scipy.stats

rng = np.random.default_rng(seed=123)
In [2]:
2*(1-scipy.stats.t.cdf(2.25, df=29))
Out[2]:
np.float64(0.03220655492189728)

$$\renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\text{var}} \newcommand{\sd}{\text{sd}}$$ This is here so we can use \P and \E and \var and \sd in LaTeX below.

Power analysis¶

Suppose we're measuring expression of many ($\sim$ 1,000) genes in a bunch of organisms at night and during the day. We'd like to know which genes are expressed at different levels in night and day, and by how much.

However, gene expression is highly variable: some genes differ much more between individuals than others. Here's a model for $X$, the difference in log gene expression of a random gene in a random individual: $$ \begin{aligned} \text{mean difference:} \qquad M &\sim \begin{cases} 3 \qquad &\text{with probability}\; 1/2 \\ 0 \qquad &\text{with probability}\; 1/2 \end{cases} \\ \text{SD difference:} \qquad D &\sim \text{Exponential}(\text{mean}=2) \\ \text{measured difference:} \qquad X &\sim \text{Normal}(\text{mean}=M, \text{sd}=D) . \end{aligned}$$

Goals: We will want to, in the real data,

  • Identify genes that we're sure are differentially expressed.
  • Estimate how different their expression levels are.

So, let's simulate from this and test out our methods.

Simulate¶

(Question: what shapes will M, D, and X be?)

In [ ]:
 

Methods¶

How will we identify differentially expressed genes? For each gene, let's

  1. do a $t$-test, and
  2. Then, we'll take the genes with $p$-value below $\alpha$ as candidate differentially expressed genes. What proportion do we get right (power) and wrong (false positive rate), and how does this depend on $D$?
  3. Finally, find a 95% confidence interval.

First, the $t$-tests. How do we use scipy.stats.ttest_1samp()? What shapes will the results be?

Overall performance: How do we find how many true positives we have? False positives? True/false negatives?

The true positives are genes for which $M>0$ (i.e., there is actually a difference between night & day) and for which the $p$-value is less than $\alpha$ (which btw we have to pick). Going (arbitrarily) with $\alpha=0.01$, this is:

In [ ]:
 

So -our power (probability of detecting a difference when there is one) is 392/477 = 82%.

Similarly, our false positive rate is:

In [ ]:
 

Question: How can we summarise power and false positive rate as a function of $D$?

Well, let's just do the calculations above but binning by values of $D$.

Here's a plot of that data:

In [ ]: