Probability and Statistics for Data Science¶
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (15, 8)
import numpy as np
import pandas as pd
This is here so we can use \P and \E in LaTeX below.
$$\renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}}$$
Uncertainty: (how to) deal with it¶
When we're doing data science, we
- look at data
- make predictions about future values
- infer aspects of the underlying process
Fundamental to all stages are randomness and uncertainty.
For instance:
Computing a statistic gives you a number that describes a data set.
Doing statistics helps you understand how reliable that description is and how well it applies to the wider world.
We understand uncertainty, conceptually and quantitatively, with randomness,
i.e., through probability.
Goals of this class¶
Become familiar with different types of probability models.
Calculate properties of probability models.
Construct and simulate from realistic models of data generation.
Be able to test estimation and prediction methods with simulation.
Gain familiarity with fundamental statistical concepts.
We'll spend a lot of time on probability models, for applications from classical statistics to machine learning.
Course mechanics: https://uodsci.github.io/dsci345
(break: fill out the survey)
Getting random¶
Today: Probability and expectation, conditional probabilities, and random variables.
First: let's get some (pseudo-)randomness.
rng = np.random.default_rng()
help(rng.random)
Help on method random in module numpy.random._generator:

random(size=None, dtype=<class 'numpy.float64'>, out=None) method of numpy.random._generator.Generator instance
    random(size=None, dtype=np.float64, out=None)

    Return random floats in the half-open interval [0.0, 1.0).

    Results are from the "continuous uniform" distribution over the stated
    interval.  To sample :math:`Unif[a, b), b > a` use `uniform` or multiply
    the output of `random` by ``(b - a)`` and add ``a``::

        (b - a) * random() + a

    Parameters
    ----------
    size : int or tuple of ints, optional
        Output shape.  If the given shape is, e.g., ``(m, n, k)``, then
        ``m * n * k`` samples are drawn.  Default is None, in which case a
        single value is returned.
    dtype : dtype, optional
        Desired dtype of the result, only `float64` and `float32` are
        supported.  Byteorder must be native.  The default value is np.float64.
    out : ndarray, optional
        Alternative output array in which to place the result.  If size is
        not None, it must have the same shape as the provided size and must
        match the type of the output values.

    Returns
    -------
    out : float or ndarray of floats
        Array of random floats of shape `size` (unless ``size=None``, in
        which case a single float is returned).

    See Also
    --------
    uniform : Draw samples from the parameterized uniform distribution.

    Examples
    --------
    >>> rng = np.random.default_rng()
    >>> rng.random()
    0.47108547995356098  # random
    >>> type(rng.random())
    <class 'float'>
    >>> rng.random((5,))
    array([ 0.30220482, 0.86820401, 0.1654503 , 0.11659149, 0.54323428])  # random

    Three-by-two array of random numbers from [-5, 0):

    >>> 5 * rng.random((3, 2)) - 5
    array([[-3.99149989, -0.52338984],  # random
           [-2.99091858, -0.79479508],
           [-1.23204345, -1.75224494]])
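An aside (not in the original notes): `default_rng()` seeds itself from the operating system, so every run gives different numbers. Passing a seed makes the stream reproducible, which is handy for debugging simulations:

```python
import numpy as np

# Two generators built with the same seed produce identical streams.
rng1 = np.random.default_rng(seed=12345)
rng2 = np.random.default_rng(seed=12345)

a = rng1.random(size=5)
b = rng2.random(size=5)
print(np.array_equal(a, b))  # prints True
```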
We can look at some typical values:
x = rng.random(size=1000)
print(x[:10])
[0.45876906 0.20363319 0.27681314 0.37475255 0.98781054 0.25777921 0.13543568 0.7840902 0.29159375 0.82445365]
or the largest and smallest values:
print(np.sort(x)[:10])
print(np.sort(x)[-10:])
[0.00219953 0.00234292 0.00417485 0.00429553 0.00569055 0.00629844 0.00651285 0.00666214 0.00733786 0.00816952] [0.99021119 0.99075891 0.99298004 0.99410224 0.99665872 0.99765117 0.99788518 0.9992273 0.9995685 0.99959431]
We can also plot the results: (left) the simulated values plotted in sorted order, making a steadily increasing line; (right) a histogram of the simulated values, which is more or less flat.
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(np.sort(x))
ax2.hist(x);
Exercise: put some other distributions into the code in one of the previous two slides and see what happens.
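For instance, one possible answer to the exercise, swapping in the standard normal (`rng.normal`):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()
x = rng.normal(size=1000)   # standard normal instead of Uniform[0, 1)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(np.sort(x))        # an S-shaped curve rather than a straight line
ax2.hist(x)                 # a bell shape rather than a flat histogram
plt.show()
```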
Example: false positives¶
Suppose that I got a positive result on an HIV test. What’s the chance I am HIV positive? (Here we really mean that “I” am a randomly chosen person from the US population.)
Background data¶
The HIV rapid test has 99.4% sensitivity and 99.8% specificity.
Refreshing from Wikipedia, sensitivity is the “true positive” rate and specificity is the “true negative” rate:
- if you have HIV, the chance that it (wrongly) comes out negative is .006 = 0.6%;
- if you don’t have HIV, the chance that it (wrongly) comes out positive is .002 = .2%.
There are currently around 1.2 million people with HIV in the US, out of a total of 333 million, giving an overall rate of 0.0036 = 0.36%.
true_pos = .994        # sensitivity: P(test + | have HIV)
true_neg = .998        # specificity: P(test - | don't have HIV)
pop_rate = 1.2 / 333   # overall proportion of the US population with HIV
Answer 1: simulation¶
What proportion of people who got a positive test result actually have HIV?
N = 1_000_000
# each person independently has HIV with probability pop_rate
hiv_status = [rng.uniform() < pop_rate for k in range(N)]
test_result = np.full((N,), "")  # will hold "+" or "-" for each person
for k in range(N):
    if hiv_status[k]:
        # HIV positive: the test comes out "+" with probability true_pos
        if rng.uniform() < true_pos:
            result = "+"
        else:
            result = "-"
    else:
        # HIV negative: the test comes out "-" with probability true_neg
        if rng.uniform() < true_neg:
            result = "-"
        else:
            result = "+"
    test_result[k] = result
test_result[:10], hiv_status[:10]
(array(['-', '-', '-', '-', '-', '-', '-', '-', '-', '-'], dtype='<U1'), [False, False, False, False, False, False, False, False, False, False])
test_result = pd.Series(test_result, name="test result")
hiv_status = pd.Series(hiv_status, name="HIV status")
pd.crosstab(test_result, hiv_status)
| test result | HIV status: False | HIV status: True |
|---|---|---|
| + | 1984 | 3409 |
| - | 994585 | 22 |
3409 / (1984 + 3409)  # true positives / all positives, from the table above (numbers change each run)
0.6321157055442239
Conclusion: only about 63% of people with a positive test result actually have HIV.
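As an aside (not in the original notes), the simulation above can be vectorized with NumPy, which runs much faster than a Python loop; a sketch using the same parameters:

```python
import numpy as np

rng = np.random.default_rng()
N = 1_000_000
true_pos, true_neg, pop_rate = .994, .998, 1.2 / 333

hiv = rng.random(N) < pop_rate            # True with probability pop_rate
pos = np.where(hiv,
               rng.random(N) < true_pos,  # test "+" for people with HIV
               rng.random(N) > true_neg)  # false "+" for people without HIV

# proportion of positive tests that are true positives: typically around 0.64
print((hiv & pos).sum() / pos.sum())
```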
Suppose that I got a positive result on an HIV test. What’s the chance I am HIV positive?
Does this answer my question?
Next answer: let's use math.
Probability rules¶
(i.e., the axioms of probability)
Probability rules¶
Probabilities are proportions: $\hspace{2em} 0 \le \P\{A\} \le 1$
Everything: $\hspace{2em} \P\{ \Omega \} = 1$
Complements: $\hspace{2em} \P\{ \text{not } A\} = 1 - \P\{A\}$
Disjoint events: If $\hspace{2em} \P\{A \text{ and } B\} = 0$ then $\hspace{2em} \P\{A \text{ or } B\} = \P\{A\} + \P\{B\}$.
Independence: $A$ and $B$ are independent iff $\P\{A \text{ and } B\} = \P\{A\} \P\{B\}$.
Conditional probability: $$\P\{A \;|\; B\} = \frac{\P\{A \text{ and } B\}}{ \P\{B\} }$$
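Applying the definition of conditional probability in both directions, since $\P\{A \text{ and } B\} = \P\{A \mid B\}\,\P\{B\} = \P\{B \mid A\}\,\P\{A\}$, gives Bayes' rule:
$$\P\{B \mid A\} = \frac{\P\{A \mid B\}\,\P\{B\}}{\P\{A\}} .$$
This is exactly the calculation we need for the HIV test example.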
(Example: HIV test calculation, on the board.)
We want to know: what proportion of people who got a positive test result actually have HIV?
i.e.,
What is the probability that someone has HIV given that they got a positive test result, $$ \mathbb{P}\{ B \mid A \} , $$ where we choose someone uniformly at random from the population, give them an HIV test, and
- $A$ is the event that they get a positive test result, and
- $B$ is the event that they have HIV.
What do we know? Well, the proportion of the population that has HIV is 0.0036, the chance of getting a positive test result for someone with HIV is .994, and the chance of getting a negative test result for someone without HIV is .998, so
- $\mathbb{P}\{A \text{ and } B\} = \mathbb{P}\{B\} \times \mathbb{P}\{A|B\} = 0.0036 \times .994$
- $\mathbb{P}\{A\} = \mathbb{P}\{A\text{ and }B\} + \mathbb{P}\{A\text{ and not }B\} = 0.0036 \times .994 + (1 - 0.0036) \times (1 - .998)$
num = pop_rate * true_pos                        # P(A and B)
denom = num + (1 - pop_rate) * (1 - true_neg)    # P(A), by splitting on B
num / denom                                      # P(B | A)
0.6425339366515835