{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "59742dcb",
   "metadata": {},
   "source": [
    "# Homework 2: Random variables\n",
    "\n",
    "*Instructions:*\n",
    "Please answer the following questions and submit your work\n",
    "by editing this jupyter notebook and submitting it on Canvas.\n",
    "Questions may involve math, programming, or neither,\n",
    "but you should make sure to *explain your work*:\n",
    "i.e., you should usually have a cell with at least a few sentences\n",
    "explaining what you are doing."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e94f2a0",
   "metadata": {},
   "source": [
    "Several times below I ask you to compare the results of a simulation to a theoretical distribution.\n",
    "Here is a function that makes this easy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "2d27b846",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "rng = np.random.default_rng()\n",
    "\n",
    "def comparison_table(x, expected, *sims):\n",
    "    \"\"\"\n",
    "    Returns a pandas DataFrame with columns corresponding to:\n",
    "    x: the possible values\n",
    "    expected: the expected frequencies these should happen at\n",
    "    sim1, sim2, ...: these are vectors of simulated values that\n",
    "        will be tabulated, and the frequencies of each of the values\n",
    "        of x will be put in a column in the result.\n",
    "    \"\"\"\n",
    "    df = pd.DataFrame(data={\"x\" : x})\n",
    "    df['expected'] = expected\n",
    "    for k, sim in enumerate(sims):\n",
    "        total = len(sim)\n",
    "        n = [np.sum(sim == y)/total for y in x]\n",
    "        df[f\"sim{k}\"] = n\n",
    "    df.style.format(\"{:.3f}\")\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "fc9fa547",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>x</th>\n",
       "      <th>expected</th>\n",
       "      <th>sim0</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>4.000000e-02</td>\n",
       "      <td>0.2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>8.000000e-03</td>\n",
       "      <td>0.1674</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>1.600000e-03</td>\n",
       "      <td>0.1240</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>3.200000e-04</td>\n",
       "      <td>0.0997</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>6.400000e-05</td>\n",
       "      <td>0.0796</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>6</td>\n",
       "      <td>1.280000e-05</td>\n",
       "      <td>0.0733</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>7</td>\n",
       "      <td>2.560000e-06</td>\n",
       "      <td>0.0503</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>8</td>\n",
       "      <td>5.120000e-07</td>\n",
       "      <td>0.0408</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>9</td>\n",
       "      <td>1.024000e-07</td>\n",
       "      <td>0.0333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>10</td>\n",
       "      <td>2.048000e-08</td>\n",
       "      <td>0.0252</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    x      expected    sim0\n",
       "0   1  4.000000e-02  0.2015\n",
       "1   2  8.000000e-03  0.1674\n",
       "2   3  1.600000e-03  0.1240\n",
       "3   4  3.200000e-04  0.0997\n",
       "4   5  6.400000e-05  0.0796\n",
       "5   6  1.280000e-05  0.0733\n",
       "6   7  2.560000e-06  0.0503\n",
       "7   8  5.120000e-07  0.0408\n",
       "8   9  1.024000e-07  0.0333\n",
       "9  10  2.048000e-08  0.0252"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = np.arange(1, 11)\n",
    "y = rng.geometric(0.2, size=10000)\n",
    "p = [0.2 ** n * 0.2 for n in x]\n",
    "comparison_table(x, p, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2eb4aa2c",
   "metadata": {},
   "source": [
    "For instance, here I'm simulating 1000 draws, twice, from a fair die (i.e., uniform draws from 1, 2, 3, 4, 5, 6)\n",
    "and using the function to see if it's close to the expected proportions of 1/6.  Note that `expected` is a vector of length 6, but `sim1` and `sim2` are vectors of length 1000.\n",
    "A good way to check your numbers are \"close enough\" is to do the simulations two or three times (like I do here),\n",
    "and so check that the differences between simulations are similar to the difference to the expected column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "67985d54",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>x</th>\n",
       "      <th>expected</th>\n",
       "      <th>sim0</th>\n",
       "      <th>sim1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0.166667</td>\n",
       "      <td>0.157</td>\n",
       "      <td>0.174</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>0.166667</td>\n",
       "      <td>0.180</td>\n",
       "      <td>0.178</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>0.166667</td>\n",
       "      <td>0.159</td>\n",
       "      <td>0.155</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>0.166667</td>\n",
       "      <td>0.152</td>\n",
       "      <td>0.155</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>0.166667</td>\n",
       "      <td>0.175</td>\n",
       "      <td>0.159</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>6</td>\n",
       "      <td>0.166667</td>\n",
       "      <td>0.177</td>\n",
       "      <td>0.179</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   x  expected   sim0   sim1\n",
       "0  1  0.166667  0.157  0.174\n",
       "1  2  0.166667  0.180  0.178\n",
       "2  3  0.166667  0.159  0.155\n",
       "3  4  0.166667  0.152  0.155\n",
       "4  5  0.166667  0.175  0.159\n",
       "5  6  0.166667  0.177  0.179"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "n = 1000\n",
    "x = np.arange(1, 7)\n",
    "expected = np.repeat(1/6, 6)\n",
    "sim1 = rng.choice(x, size=n, replace=True)\n",
    "sim2 = rng.choice(x, size=n, replace=True)\n",
    "comparison_table(x, expected, sim1, sim2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0091eb85",
   "metadata": {},
   "source": [
    "## 1. The geometric distribution\n",
    "\n",
    "A random variable $X$ has a *geometric distribution* with parameter $p$\n",
    "if $\\mathbb{P}\\{ X = k \\} = (1-p)^{k-1} p$, for $k \\ge 1$\n",
    "(at least, that's the version that np.random simulates from, as we learned in HW1).\n",
    "\n",
    "*(a)* This has the following interpretation:\n",
    "suppose you are trying to do something;\n",
    "on each attempt the chance of success is $p$, independently of how many times you've tried;\n",
    "then $X$ is the number of attempts before you succeed (including the last).\n",
    "Write a function to simulate a random number in this way\n",
    "(by explicitly simulating the failures and successes).\n",
    "\n",
    "*(b)* Check your answer in (a) by comparing the distribution you get\n",
    "to the Geometric distribution.\n",
    "You may do this by simulating at least 10,000 draws using your function in (a) with $p=0.2$\n",
    "and comparing the proportion of these that are $k$ to $(1-p)^{k-1} p$, for $k$ between 1 and 10.\n",
    "\n",
    "*(c)* Make up a story about a situation in which you'd get a Geometric distribution."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4001dbb1",
   "metadata": {},
   "source": [
    "## 2. The Binomial distribution\n",
    "\n",
    "A random variable $X$ has a Binomial($n$, $p$) distribution if\n",
    "$\\mathbb{P}\\{ X = k \\} = \\binom{n}{k} p^k (1-p)^{n-k}$,\n",
    "where $\\binom{n}{k} = n! / (k! (n-k)!)$,\n",
    "for $0 \\le k \\le n$.\n",
    "\n",
    "*(a)* This has the following interpretation:\n",
    "suppose you try to do something $n$ times, and each time the chance you succeed is $p$,\n",
    "independently of everything else.\n",
    "$X$ is the total number of successes.\n",
    "Write a function to simulate a random number in this way.\n",
    "\n",
    "*(b)* Check your function by simulating at least 10,000 random draws\n",
    "with $n=20$ and $p=0.3$,\n",
    "and making a table comparing the observed and expected proportions of these draws that are $k$\n",
    "for each $0 \\le k \\le 20$.\n",
    "\n",
    "*(c)* Make up a story in which you'd get a Binomial distribution.\n",
    "\n",
    "*Note:* you can do the factorial, $n!$, in numpy by `np.math.factorial( )`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c73008d",
   "metadata": {},
   "source": [
    "## 3. The Poisson distribution\n",
    "\n",
    "A random variable $X$ has a Poisson($\\lambda$) distribution if\n",
    "$\\mathbb{P}\\{ X = k \\} = \\frac{\\lambda^k}{k!} e^{-\\lambda}$,\n",
    "for $k \\ge 0$.\n",
    "\n",
    "*(a)* The Poisson distribution is a good approximation for\n",
    "\"the number of rare events\", i.e., for the Binomial when $n$ is large but $p$ is small.\n",
    "Simulate at least 10,000 draws from the Poisson($5$) and compare their distribution to\n",
    "the same number of draws from the Binomial(10, 0.5), the Binomial(100, 0.05), and the Binomial(1000, 0.005).\n",
    "*(Note: you may use `rng.binomial( )` instead of the function you wrote above.)*\n",
    "Explain at least one difference in the distributions you see between the Poisson and Binomial\n",
    "that gets smaller as $n$ gets bigger.\n",
    "\n",
    "*(b)* Make up a story in which you might get the Poisson distribution.\n",
    "\n",
    "*(c)* Suppose that $X \\sim \\text{Poisson}(5)$. Write down a mathematical expression for $\\mathbb{E}[X (X-1)]$\n",
    "using the definition of expectation, and evaluate it either with math or simulation."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16aeddbb",
   "metadata": {},
   "source": [
    "## 4. The Normal distribution\n",
    "\n",
    "The Normal distribution is *additive*,\n",
    "meaning that if $X_1$ is Normal with mean $\\mu_1$ and variance $\\sigma^2_1$\n",
    "and $X_2$ is Normal with mean $\\mu_2$ and variance $\\sigma^2_2$, independent of $X_1$,\n",
    "then $X_1 + X_2$ is again Normal, with mean $\\mu_1 + \\mu_2$ and variance $\\sigma_1^2 + \\sigma_2^2$.\n",
    "\n",
    "*(a)* Simulate a large number of draws of $X_1$ and $X_2$ above\n",
    "with $\\mu_1 = 0$, $\\sigma^2_1 = 2$, $\\mu_2 = 3$, and $\\sigma^2_2 = 1.5$,\n",
    "and compare the distribution to Normal(3, 3.5) by plotting the histogram of $X_1 + X_2$\n",
    "and the histogram of draws from the second Normal.\n",
    "\n",
    "*(b)* The Exponential distribution is *not* additive in this way;\n",
    "show this by plotting a histogram of $Y_1 + Y_2$, where $Y_1$ and $Y_2$ are independent Exponential(1)."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}