Probability and Statistics for Data Science¶

Fall 2022: Peter Ralph

https://uodsci.github.io/dsci345

Uncertainty: (how to) deal with it¶

When we're doing data science, we

  • look at data
  • make predictions about future values
  • infer aspects of the underlying process

Fundamental to all stages are randomness and uncertainty.

For instance: randomized algorithms (e.g., stochastic gradient descent).

For instance:

Computing a statistic gives you a number that describes a data set.

Doing statistics helps you understand how reliable that description is and how well it applies to the wider world.

We understand uncertainty, conceptually and quantitatively, with randomness,

i.e., through probability.

Goals of this class¶

  • Become familiar with different types of probability models.

  • Calculate properties of probability models.

  • Construct and simulate from realistic models of data generation.

  • Be able to test estimation and prediction methods with simulation.

  • Gain familiarity with fundamental statistical concepts.

We'll spend a lot of time on probability models, for applications from classical statistics to machine learning.

Course mechanics: https://uodsci.github.io/dsci345

(break: fill out the survey)

Getting random¶

Today: Probability and expectation, conditional probabilities, and random variables.

First: let's get some (pseudo-)randomness.

In [2]:
import numpy as np               # setup: imports used throughout this notebook
import matplotlib.pyplot as plt
import pandas as pd

rng = np.random.default_rng()
help(rng.random)
Help on built-in function random:

random(...) method of numpy.random._generator.Generator instance
    random(size=None, dtype=np.float64, out=None)
    
    Return random floats in the half-open interval [0.0, 1.0).
    
    Results are from the "continuous uniform" distribution over the
    stated interval.  To sample :math:`Unif[a, b), b > a` multiply
    the output of `random` by `(b-a)` and add `a`::
    
      (b - a) * random() + a
    
    Parameters
    ----------
    size : int or tuple of ints, optional
        Output shape.  If the given shape is, e.g., ``(m, n, k)``, then
        ``m * n * k`` samples are drawn.  Default is None, in which case a
        single value is returned.
    dtype : dtype, optional
        Desired dtype of the result, only `float64` and `float32` are supported.
        Byteorder must be native. The default value is np.float64.
    out : ndarray, optional
        Alternative output array in which to place the result. If size is not None,
        it must have the same shape as the provided size and must match the type of
        the output values.
    
    Returns
    -------
    out : float or ndarray of floats
        Array of random floats of shape `size` (unless ``size=None``, in which
        case a single float is returned).
    
    Examples
    --------
    >>> rng = np.random.default_rng()
    >>> rng.random()
    0.47108547995356098 # random
    >>> type(rng.random())
    <class 'float'>
    >>> rng.random((5,))
    array([ 0.30220482,  0.86820401,  0.1654503 ,  0.11659149,  0.54323428]) # random
    
    Three-by-two array of random numbers from [-5, 0):
    
    >>> 5 * rng.random((3, 2)) - 5
    array([[-3.99149989, -0.52338984], # random
           [-2.99091858, -0.79479508],
           [-1.23204345, -1.75224494]])

In [64]:
x = rng.random(size=1000)
print(x[:10])

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(np.sort(x))
ax2.hist(x);
[0.80028468 0.1207662  0.02434086 0.9819454  0.99290216 0.74296774
 0.52593177 0.01181226 0.00296221 0.31311247]

Exercise: put some other distributions into this code and see what happens.
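For instance, here is a minimal sketch of that exercise (the choice of Normal and Exponential draws is just one possibility, not part of the original code):

y = rng.normal(size=1000)        # standard Normal draws
z = rng.exponential(size=1000)   # Exponential(1) draws

fig, axes = plt.subplots(2, 2)
axes[0, 0].plot(np.sort(y))      # Normal: sorted values
axes[0, 1].hist(y)               # Normal: histogram
axes[1, 0].plot(np.sort(z))      # Exponential: sorted values
axes[1, 1].hist(z);              # Exponential: histogram

The plot of sorted values traces out the empirical quantile function, so its shape changes noticeably from one distribution to another.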

Example: false positives¶

Suppose that I got a positive result on an HIV test. What’s the chance I am HIV positive? (Here we really mean that “I” am a randomly chosen person from the US population.)

Background data¶

The HIV rapid test has 99.4% sensitivity and 99.8% specificity.

Refreshing from Wikipedia: sensitivity is the “true positive” rate and specificity is the “true negative” rate:

  • if you have HIV, the chance that it (wrongly) comes out negative is .006 = 0.6%;
  • if you don’t have HIV, the chance that it (wrongly) comes out positive is .002 = .2%.

There are currently around 1.2 million people with HIV in the US, out of a total of 333 million, giving an overall rate of 0.0036 = 0.36%.

In [4]:
true_pos = .994
true_neg = .998
pop_rate = 1.2 / 333

Answer 1: simulation

In [5]:
N = int(1e6)
hiv_status = pd.Series( rng.random(N) < pop_rate, name="HIV+")
n = np.sum(hiv_status)
test_result = pd.Series( np.full((N,), ""), name="test")
# hiv+ people
test_result[hiv_status] = ["+" if p < true_pos else "-" for p in rng.random(n)]
# hiv- people
test_result[~hiv_status] = ["-" if p < true_neg else "+" for p in rng.random(N - n)]

pd.crosstab(hiv_status, test_result, margins=True)
Out[5]:
test        +        -      All
HIV+
False    2000   994327   996327
True     3648       25     3673
All      5648   994352  1000000

What is the proportion of people who got a positive test result who actually have HIV?

In [6]:
pd.crosstab(hiv_status, test_result, margins=True)
Out[6]:
test        +        -      All
HIV+
False    2000   994327   996327
True     3648       25     3673
All      5648   994352  1000000

Suppose that I got a positive result on an HIV test. What’s the chance I am HIV positive?

In [7]:
hiv_given_plus = sum(hiv_status & (test_result == "+")) / np.sum(test_result == "+")
print(f"The proportion of the {np.sum(test_result == '+')} people "
      f"that had a positive test result that actually have HIV is {100*hiv_given_plus:.2f}%.")
The proportion of the 5648 people that had a positive test result that actually have HIV is 64.59%.

Does this answer my question?

Next answer: let's use math.

Probability rules¶

(i.e., the axioms of probability)

Probability rules¶

  1. Probabilities are proportions: $\hspace{2em} 0 \le \P\{A\} \le 1$

  2. Everything: $\hspace{2em} \P\{ \Omega \} = 1$

  3. Complements: $\hspace{2em} \P\{ \text{not } A\} = 1 - \P\{A\}$

  4. Disjoint events: If $\P\{A \text{ and } B\} = 0$ then $\P\{A \text{ or } B\} = \P\{A\} + \P\{B\}$.

  5. Independence: $A$ and $B$ are independent iff $\P\{A \text{ and } B\} = \P\{A\} \P\{B\}$.

  6. Conditional probability: $$\P\{A \;|\; B\} = \frac{\P\{A \text{ and } B\}}{ \P\{B\} }$$
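As a quick check of rule 6 (a simulation sketch, not from the lecture; the dice events are an arbitrary example), let $A$ = "the two dice sum to 8" and $B$ = "the first die shows 6", so that $\P\{A \;|\; B\} = 1/6$:

n = 100_000
d1 = rng.integers(1, 7, size=n)      # first die: uniform on 1, ..., 6
d2 = rng.integers(1, 7, size=n)      # second die
A = (d1 + d2 == 8)                   # event A: the sum is 8
B = (d1 == 6)                        # event B: the first die shows 6
print(np.mean(A & B) / np.mean(B))   # estimate of P(A and B) / P(B)
print(np.mean(A[B]))                 # direct estimate of P(A | B); both should be near 1/6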

Bayes' rule¶

A consequence is that

$$\P\{B \;|\; A\} = \frac{\P\{B\} \P\{A \;|\; B\}}{ \P\{A\} } .$$

(Example: HIV test calculation, on the board.)
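A sketch of that calculation, plugging in the rounded numbers from above (sensitivity $0.994$, specificity $0.998$, base rate $\approx 0.0036$):

$$\P\{\text{HIV} \;|\; +\} = \frac{\P\{\text{HIV}\} \P\{+ \;|\; \text{HIV}\}}{\P\{\text{HIV}\} \P\{+ \;|\; \text{HIV}\} + \P\{\text{no HIV}\} \P\{+ \;|\; \text{no HIV}\}} = \frac{0.0036 \times 0.994}{0.0036 \times 0.994 + 0.9964 \times 0.002} \approx 0.64 ,$$

which agrees with the simulation; the cell below does the same computation without rounding.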

In [67]:
pop_rate * true_pos / (pop_rate * true_pos + (1 - pop_rate) * (1 - true_neg))
Out[67]:
0.6425339366515835