When we're doing data science, we
Fundamental to all stages are randomness and uncertainty.
For instance:
Computing a statistic gives you a number that describes a data set.
Doing statistics helps you understand how reliable that description is and how well it applies to the wider world.
We understand uncertainty, conceptually and quantitatively, with randomness,
i.e., through probability.
Become familiar with different types of probability models.
Calculate properties of probability models.
Construct and simulate from realistic models of data generation.
Be able to test estimation and prediction methods with simulation.
Gain familiarity with fundamental statistical concepts.
We'll spend a lot of time on probability models, for applications from classical statistics to machine learning.
Course mechanics: https://uodsci.github.io/dsci345
(break: fill out the survey)
Today: Probability and expectation, conditional probabilities, and random variables.
First: let's get some (pseudo-)randomness.
rng = np.random.default_rng()
help(rng.random)
Help on built-in function random: random(...) method of numpy.random._generator.Generator instance random(size=None, dtype=np.float64, out=None) Return random floats in the half-open interval [0.0, 1.0). Results are from the "continuous uniform" distribution over the stated interval. To sample :math:`Unif[a, b), b > a` use `uniform` or multiply the output of `random` by ``(b - a)`` and add ``a``:: (b - a) * random() + a Parameters ---------- size : int or tuple of ints, optional Output shape. If the given shape is, e.g., ``(m, n, k)``, then ``m * n * k`` samples are drawn. Default is None, in which case a single value is returned. dtype : dtype, optional Desired dtype of the result, only `float64` and `float32` are supported. Byteorder must be native. The default value is np.float64. out : ndarray, optional Alternative output array in which to place the result. If size is not None, it must have the same shape as the provided size and must match the type of the output values. Returns ------- out : float or ndarray of floats Array of random floats of shape `size` (unless ``size=None``, in which case a single float is returned). See Also -------- uniform : Draw samples from the parameterized uniform distribution. Examples -------- >>> rng = np.random.default_rng() >>> rng.random() 0.47108547995356098 # random >>> type(rng.random()) <class 'float'> >>> rng.random((5,)) array([ 0.30220482, 0.86820401, 0.1654503 , 0.11659149, 0.54323428]) # random Three-by-two array of random numbers from [-5, 0): >>> 5 * rng.random((3, 2)) - 5 array([[-3.99149989, -0.52338984], # random [-2.99091858, -0.79479508], [-1.23204345, -1.75224494]])
x = rng.random(size=1000)
print(x[:10])
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(np.sort(x))
ax2.hist(x);
[0.23168809 0.20112958 0.75185691 0.62481795 0.82925585 0.03235133 0.5648251 0.12889729 0.31620117 0.07020354]
Exercise: put some other distributions into this code and see what happens.
Suppose that I got a positive result on an HIV test. What’s the chance I am HIV positive? (Here we really mean that “I” am a randomly chosen person from the US population.)
The HIV rapid test has 99.4% specificity and 99.8% sensitivity.
Refreshing from Wikipedia, specificity is the “true positive” rate and the sensitivity is the “true negative” rate:
There are currently around 1.2 million people with HIV in the US, out of a total of 333 million, giving an overall rate of 0.0036 = 0.36%.
true_pos = .994
true_neg = .998
pop_rate = 1.2 / 333
Answer 1: simulation
What is the proportion of people who got a positive test result who actually have HIV?
N = 1_000_000
hiv_status = [rng.uniform() < pop_rate for k in range(N)]
test_result = np.full((N,), "")
for k in range(N):
if hiv_status[k]:
if rng.uniform() < true_pos:
result = "+"
else:
result = "-"
else:
if rng.uniform() < true_neg:
result = "-"
else:
result = "+"
test_result[k] = result
test_result[:10], hiv_status[:10]
(array(['-', '-', '-', '-', '-', '-', '-', '-', '-', '-'], dtype='<U1'), [False, False, False, False, False, False, False, False, False, False])
test_result = pd.Series(test_result, name="test result")
hiv_status = pd.Series(hiv_status, name="HIV status")
pd.crosstab(test_result, hiv_status)
HIV status | False | True |
---|---|---|
test result | ||
+ | 2007 | 3586 |
- | 994393 | 14 |
3469/(1982+3469) # TODO: update based on simulation numbers (which change)
0.6363969913777289
Conclusion: only 63.6% of people with a positive test result actually have HIV.
Suppose that I got a positive result on an HIV test. What’s the chance I am HIV positive?
Does this answer my question?
Next answer: let's use math.
(i.e., the axioms of probability)
Probabilities are proportions: $\hspace{2em} 0 \le \P\{A\} \le 1$
Everything: $\hspace{2em} \P\{ \Omega \} = 1$
Complements: $\hspace{2em} \P\{ \text{not } A\} = 1 - \P\{A\}$
Disjoint events: If $\hspace{2em} \P\{A \text{ and } B\} = 0$ then $\hspace{2em} \P\{A \text{ or } B\} = \P\{A\} + \P\{B\}$.
Independence: $A$ and $B$ are independent iff $\P\{A \text{ and } B\} = \P\{A\} \P\{B\}$.
Conditional probability: $$\P\{A \;|\; B\} = \frac{\P\{A \text{ and } B\}}{ \P\{B\} }$$
A consequence is that
$$\P\{B \;|\; A\} = \frac{\P\{B\} \P\{A \;|\; B\}}{ \P\{A\} } .$$(Example: HIV test calculation, on the board.)
We want to know the proportion of people who got a positive test result who actually have HIV?
i.e.
What is the probability of getting a positive test result given that you have HIV $$ \mathbb{P}\{ B | A \} = $$ where we choose someone uniformly at random from the population, and give them an HIV test, and
What do we know? Well, the proportion of the population that has HIV is 0.0036 and the chance of getting a positive test result for someone with HIV is .994, and the chance of getting a positive test for someone without HIV is .998, so
num = pop_rate * true_pos
denom = num + (1 - pop_rate) * (1 - true_neg)
num / denom
0.6425339366515835