When we're doing data science, we
Fundamental to all stages are randomness and uncertainty.
For instance: randomized algorithms (e.g., stochastic gradient descent).
For instance:
Computing a statistic gives you a number that describes a data set.
Doing statistics helps you understand how reliable that description is and how well it applies to the wider world.
We understand uncertainty, conceptually and quantitatively, with randomness,
i.e., through probability.
Become familiar with different types of probability models.
Calculate properties of probability models.
Construct and simulate from realistic models of data generation.
Be able to test estimation and prediction methods with simulation.
Gain familiarity with fundamental statistical concepts.
We'll spend a lot of time on probability models, for applications from classical statistics to machine learning.
Course mechanics: https://uodsci.github.io/dsci345
(break: fill out the survey)
Today: Probability and expectation, conditional probabilities, and random variables.
First: let's get some (pseudo-)randomness.
rng = np.random.default_rng()
help(rng.random)
Help on built-in function random: random(...) method of numpy.random._generator.Generator instance random(size=None, dtype=np.float64, out=None) Return random floats in the half-open interval [0.0, 1.0). Results are from the "continuous uniform" distribution over the stated interval. To sample :math:`Unif[a, b), b > a` multiply the output of `random` by `(b-a)` and add `a`:: (b - a) * random() + a Parameters ---------- size : int or tuple of ints, optional Output shape. If the given shape is, e.g., ``(m, n, k)``, then ``m * n * k`` samples are drawn. Default is None, in which case a single value is returned. dtype : dtype, optional Desired dtype of the result, only `float64` and `float32` are supported. Byteorder must be native. The default value is np.float64. out : ndarray, optional Alternative output array in which to place the result. If size is not None, it must have the same shape as the provided size and must match the type of the output values. Returns ------- out : float or ndarray of floats Array of random floats of shape `size` (unless ``size=None``, in which case a single float is returned). Examples -------- >>> rng = np.random.default_rng() >>> rng.random() 0.47108547995356098 # random >>> type(rng.random()) <class 'float'> >>> rng.random((5,)) array([ 0.30220482, 0.86820401, 0.1654503 , 0.11659149, 0.54323428]) # random Three-by-two array of random numbers from [-5, 0): >>> 5 * rng.random((3, 2)) - 5 array([[-3.99149989, -0.52338984], # random [-2.99091858, -0.79479508], [-1.23204345, -1.75224494]])
x = rng.random(size=1000)
print(x[:10])
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(np.sort(x))
ax2.hist(x);
[0.80028468 0.1207662 0.02434086 0.9819454 0.99290216 0.74296774 0.52593177 0.01181226 0.00296221 0.31311247]
Exercise: put some other distributions into this code and see what happens.
Suppose that I got a positive result on an HIV test. What’s the chance I am HIV positive? (Here we really mean that “I” am a randomly chosen person from the US population.)
The HIV rapid test has 99.4% specificity and 99.8% sensitivity.
Refreshing from Wikipedia, specificity is the “true positive” rate and the sensitivity is the “true negative” rate:
There are currently around 1.2 million people with HIV in the US, out of a total of 333 million, giving an overall rate of 0.0036 = 0.36%.
true_pos = .994
true_neg = .998
pop_rate = 1.2 / 333
Answer 1: simulation
N = int(1e6)
hiv_status = pd.Series( rng.random(N) < pop_rate, name="HIV+")
n = np.sum(hiv_status)
test_result = pd.Series( np.full((N,), ""), name="test")
# hiv+ people
test_result[hiv_status] = ["+" if p < true_pos else "-" for p in rng.random(n)]
# hiv- people
test_result[~hiv_status] = ["-" if p < true_neg else "+" for p in rng.random(N - n)]
pd.crosstab(hiv_status, test_result, margins=True)
test | + | - | All |
---|---|---|---|
HIV+ | |||
False | 2000 | 994327 | 996327 |
True | 3648 | 25 | 3673 |
All | 5648 | 994352 | 1000000 |
What is the proportion of people who got a positive test result who actually have HIV?
pd.crosstab(hiv_status, test_result, margins=True)
test | + | - | All |
---|---|---|---|
HIV+ | |||
False | 2000 | 994327 | 996327 |
True | 3648 | 25 | 3673 |
All | 5648 | 994352 | 1000000 |
Suppose that I got a positive result on an HIV test. What’s the chance I am HIV positive?
hiv_given_plus = sum(hiv_status & (test_result == "+")) / np.sum(test_result == "+")
print(f"The proportion of the {np.sum(test_result == '+')} people "
f"that had a positive test result that actually have HIV is {100*hiv_given_plus:.2f}%.")
The proportion of the 5648 people that had a positive test result that actually have HIV is 64.59%.
Does this answer my question?
Next answer: let's use math.
(i.e., the axioms of probability)
Probabilities are proportions: $\hspace{2em} 0 \le \P\{A\} \le 1$
Everything: $\hspace{2em} \P\{ \Omega \} = 1$
Complements: $\hspace{2em} \P\{ \text{not } A\} = 1 - \P\{A\}$
Disjoint events: If $\hspace{2em} \P\{A \text{ and } B\} = 0$ then $\hspace{2em} \P\{A \text{ or } B\} = \P\{A\} + \P\{B\}$.
Independence: $A$ and $B$ are independent iff $\P\{A \text{ and } B\} = \P\{A\} \P\{B\}$.
Conditional probability: $$\P\{A \;|\; B\} = \frac{\P\{A \text{ and } B\}}{ \P\{B\} }$$
A consequence is that
$$\P\{B \;|\; A\} = \frac{\P\{B\} \P\{A \;|\; B\}}{ \P\{A\} } .$$(Example: HIV test calculation, on the board.)
pop_rate * true_pos / (pop_rate * true_pos + (1 - pop_rate) * (1 - true_neg))
0.6425339366515835