Simulation, and random variables¶

https://uodsci.github.io/dsci345

In [1]:
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (15, 8)
import numpy as np
import pandas as pd

rng = np.random.default_rng()

$$\renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\text{var}} \newcommand{\sd}{\text{sd}}$$ This is here so we can use \P and \E and \var and \sd in LaTeX below.

Outline¶

  1. What do we need to do to specify "the distribution" of a random number?
  2. How do we compute means and variances of these distributions?
  3. What are some examples?

Random variables¶

The mathematical definition of a "random variable" is "a mapping from the probability space $\Omega$ into the real numbers".

A random variable is just a variable, with some extra structure: a probability distribution.

We create random variables in programming all the time:

In [2]:
X = rng.random()

What's X? We don't know! But, we do know X is a number, and so can do algebra with it, e.g., declare that Y is twice X squared plus one:

In [3]:
Y = 2 * X**2 + 1

Do you want to know what X is, for reals? I could do print(X) but too bad, I won't: for the analogy to hold up, X should be the abstract instantiation of a draw from rng.random().

Example: the martingale

Let $G_i$ be a random variable that takes values either 0 or 2, with probability 1/2 each. Let $W_n = G_1 G_2 \cdots G_n$, and $X = \max_n W_n$.

  1. If $G_i$ is the outcome of a "double-or-nothing" bet, give an interpretation of $W_n$ and $X$.
  2. Write a function to simulate from $X$.
  3. Find the distribution of $W_n$ and $X$.
In [4]:
def sim_X():
    W = 1
    X = 0
    while W > 0:
        if rng.random() > 0.5: # win
            W *= 2
            X = W
        else:
            W = 0
    return X
sim_X()
Out[4]:
0

The distribution of a random variable¶

To say how $X$ behaves, we need to specify the probability of each possible outcome. For instance:

  • $X$ is the number rolled on a fair die: $\P\{X = k\} = 1/6$ for $k \in \{1, 2, 3, 4, 5, 6\}$.

  • $X$ is uniformly chosen in $[0, 1)$: $\P\{X < x\} = x$ for $0 \le x \le 1$.

  • $X$ is the number of times I get "heads" when flipping a fair coin before my first "tails": $\P\{X \ge k\} = 2^{-k}$.

For some of these, the set of possible values is discrete, while for others it is continuous.

A continuous distribution has a density function, i.e., a function $f_X(x)$ so that $$ \P\{ a \le X \le b \} = \int_a^b f_X(x) dx . $$

Example: Let $X \sim \text{Binomial}(10, 1/4)$. What is $\P\{X \ge 2\}$?

Answer: By complements, and then additivity of disjoint events, $$ \P\{ X \ge 2 \} = 1 - \P\{ X < 2 \} = 1 - \P\{X = 0\} - \P\{X = 1\} . $$ By Wikipedia, if $X \sim \text{Binomial}(n, p)$ then $\P\{X = k\} = \binom{n}{k} p^k (1-p)^{n-k}$, so this is $$ 1 - (3/4)^{10} - 10 \times (1/4) (3/4)^9 . $$

In [5]:
x = rng.binomial(10, 0.25, size=1000000)
np.mean(x >= 2)
Out[5]:
np.float64(0.755317)
In [6]:
1 - 0.75**10 - 10 * 0.25 * .75**9
Out[6]:
0.7559747695922852

Example: Let $X \sim \text{Normal}(0, 1)$. What is $\P\{ |X| > 2 \}$?

Answer: By additivity of disjoint events, and then by complements, $$ \P\{ |X| > 2 \} = \P\{ X < - 2 \} + \P\{ X > 2 \} = 1 - \P\{ -2 \le X \le 2 \} . $$ By Wikipedia, $X$ has density $e^{-x^2/2}/\sqrt{2\pi}$, so this is $$ 1 - \int_{-2}^2 \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} dx . $$

For a numerical answer, we can go to scipy:

In [7]:
x = rng.normal(0, 1, size=1000000)
np.mean(np.abs(x) > 2)
Out[7]:
np.float64(0.045365)
In [8]:
from scipy.stats import norm
norm.cdf(-2) + (1 - norm.cdf(2))
Out[8]:
np.float64(0.0455002638963584)

Terminology:¶

  • the cumulative distribution function (CDF) of $X$ is $$ F_X(x) = \P\{ X \le x \} . $$

  • the probability density function (PDF) of $X$, if it exists, is $$ f_X(x) = \frac{d}{dx} \P\{ X \le x \} , $$ or $$ \text{"} f_X(x) dx = \P\{ X = x \} dx \text{"} . $$

For discrete random variables, probabilities are sums. For continuous random variables, they are integrals of the PDF... but, integrals are just fancy sums, anyways.

Exercise: The Exponential(1) distribution has cumulative distribution function $\P\{ X < x \} = 1 - e^{-x}$. Plot this (a) empirically, using rng.exponential(size=1000000), and (b) by plotting this function.

In [9]:
x = rng.exponential(size=1000000)
max(x)
Out[9]:
np.float64(14.977939293891616)
In [10]:
xx = np.linspace(0, 15, 101)
yy = np.array([
    np.mean(x < y) for y in xx
])
In [11]:
plt.plot(xx, yy)
plt.plot(xx, 1 - np.exp(-xx))
Out[11]:
[<matplotlib.lines.Line2D at 0x7f808081cce0>]
No description has been provided for this image

What is the mean of the Exponential(1) distribution?

In [12]:
np.mean(x)
Out[12]:
np.float64(1.001067460363066)

Means and variances¶

The expected value¶

For a random variable $X$ and a function $f( )$, the expected value of $f(X)$ random variable is the weighted average of its possible values: $$ \E[f(X)] = \sum_x f(x) \P\{X = x\} . $$

The simplest example of this is the mean of $X$: $$ \E[X] = \sum_x x \P\{X = x\} . $$

Example: If $X$ is Binomial($n$, $p$), then $\P\{X = x\} = \binom{n}{x} p^x (1-p)^{n-x}$, so (by Wikipedia): $$ \E[X] = \sum_{x=0}^n x \binom{n}{x} p^x (1-p)^{n-x} = np .$$

Example: If $X$ is Exponential($\lambda$), then $X$ has density $\lambda e^{-\lambda x}$ so $$ \E[X] = \int_0^\infty x \lambda e^{-\lambda x} dx = \frac{1}{\lambda} .$$

Additivity of expectation¶

For any random variables $X$ and $Y$, $$ \E[X + Y] = \E[X] + \E[Y] . $$

Example: Suppose a random student spends on average \$2.50 for coffee each day. In a class of 30, what is the expected total amount of money they spent on coffee that day?

In [13]:
2.5 * 30
Out[13]:
75.0
In [14]:
2.5 + 7000
Out[14]:
7002.5

Example: Suppose also that the average amount of federal loans for a UO student is \$7,000/year. What is the average sum of a student's annual loan amount and the amount they spend on coffee in a month?

Multiplication and expectation¶

If $X$ and $Y$ are independent then $$ \E[X Y] = \E[X] \E[Y] .$$

Example: Let $U$ and $V$ be independent and Uniform on $[0, 1]$. What is the expected area of a rectangle with width $U$ and height $V$?

$\mathbb{E}[U] = \mathbb{E}[V] = 0.5$, so 1/4.

Non-example: Let $U$ be Uniform on $[0, 1]$. What is the expected area of a square with width $U$?

In [15]:
np.mean(rng.uniform(size=1000000)**2)
Out[15]:
np.float64(0.33342306610504213)

Exercise: Write down the integral for this area, and then evaluate it by simulation.

The variance¶

The variance of $X$ is $$ \var[X] = \E[X^2] - \E[X]^2 , $$ and the standard deviation is $$ \sd[X] = \sqrt{\var[X]} .$$

Equivalent definition: if $X$ has mean $\mu$, then $$ \var[X] = \E[(X-\mu)^2] . $$

The standard deviation tells us how much $X$ tends to differ from the mean.

Additivity of variance (or not)¶

If $X$ and $Y$ are independent, then $$ \var[X + Y] = \var[X] + \var[Y]. $$

Example: Suppose the amount of rain that falls each day is independent, with mean 0.1" and variance 0.25". What is the mean and variance of the total amount of rain in a week?

Some distributions¶

Here's a helper function we'll use to summarize the distributions.

In [16]:
def print_dist(x):
    x = np.sort(x)
    with np.printoptions(precision=2):
        print(f"{len(x)} values; mean={np.mean(x):.1}; sd={np.std(x):.1}")
        print(f"  smallest five values: {x[:5]}")
        print(f"  largest five values: {x[-5:]}")

Exponential¶

Simulate 100,000 draws from the Exponential(5) distribution:

In [17]:
np.min(x)
Out[17]:
np.float64(3.2185866245579723e-07)
In [18]:
x = rng.exponential(scale=5, size=100000)
print_dist(x)
100000 values; mean=5e+00; sd=5e+00
  smallest five values: [8.22e-05 9.05e-05 9.93e-05 1.61e-04 2.17e-04]
  largest five values: [46.08 46.16 46.51 47.41 49.44]

The histogram, which decays like $\exp(-x/5)$, with vertical lines at the mean, which is about in the middle, and the mean plus/minus the SD, which are at about zero and where the curve is starting to get low.

In [19]:
mean_x = np.mean(x)
sd_x = np.std(x)
plt.hist(x, bins=100)
plt.title(f"mean={mean_x:.2f}, sd={sd_x:.2f}")
plt.vlines(mean_x, 0, 10000, 'red')
plt.vlines([mean_x - sd_x, mean_x + sd_x], 0, 10000, 'red', ':');
No description has been provided for this image

Normal¶

Same thing, for the Normal distribution:

In [20]:
x = rng.normal(loc=2, scale=7, size=100000)
print_dist(x)
100000 values; mean=2e+00; sd=7e+00
  smallest five values: [-31.81 -29.92 -29.78 -27.74 -27.17]
  largest five values: [29.41 29.99 30.01 30.48 32.59]

The histogram, which is a rounded bump with longish tails, centered at about 2 and halfway down the slope at about -5 and 9.

In [21]:
mean_x = np.mean(x)
sd_x = np.std(x)

plt.hist(x, bins=100)
plt.title(f"mean={mean_x:.2f}, sd={sd_x:.2f}")
plt.vlines(mean_x, 0, 4000, 'red')
plt.vlines([mean_x - sd_x, mean_x + sd_x], 0, 4000, 'red', ':');
No description has been provided for this image

Binomial¶

The Binomial distribution

In [22]:
x = rng.binomial(n=50, p=0.4, size=100000)
print_dist(x)
100000 values; mean=2e+01; sd=3e+00
  smallest five values: [7 7 7 7 7]
  largest five values: [34 34 35 35 36]

The histogram here shows discrete values, but is shaped similar to the Normal; this one is centered at about 20 and halfway down the slopes are at about plus/minus 3.5.

In [23]:
mean_x = np.mean(x)
sd_x = np.std(x)

plt.hist(x, bins=100)
plt.title(f"mean={mean_x:.2f}, sd={sd_x:.2f}")
plt.vlines(mean_x, 0, 4000, 'red')
plt.vlines([mean_x - sd_x, mean_x + sd_x], 0, 4000, 'red', ':');
No description has been provided for this image

Poisson¶

Do the same thing with rng.poisson, for the Poisson distribution

In [24]:
rng.poisson(lam=3, size=10)
Out[24]:
array([5, 3, 3, 3, 0, 2, 3, 0, 4, 4])