Probability¶
Concepts:
- probability distribution: method of getting random numbers
- cumulative distribution function
- probability density function (continuous histogram) or probability mass function (discrete "histogram")
- conditional probability, false positives, etc
- independence
Skills:
- how to generate random numbers in Python (e.g., rng.poisson(...))
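As a minimal sketch of that skill, the block below draws random numbers with NumPy's `Generator` interface. The seed and all parameter values are arbitrary choices for illustration, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(seed=123)  # seed is arbitrary; it fixes the random stream

# Each call returns an array of `size` independent draws.
pois = rng.poisson(lam=2.5, size=1000)            # counts: 0, 1, 2, ...
binom = rng.binomial(n=10, p=0.3, size=1000)      # integers from 0 to 10
norm = rng.normal(loc=1.0, scale=3.0, size=1000)  # loc = mean, scale = sd
expo = rng.exponential(scale=2.0, size=1000)      # scale = mean = 1 / rate
unif = rng.uniform(low=0.0, high=1.0, size=1000)  # continuous on [0, 1)

print(pois[:5], binom[:5], norm[:3])
```

Note that NumPy parameterizes the Exponential by its scale (the mean), not its rate.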
Probability rules:
Probabilities are proportions: $\hspace{2em} 0 \le \P\{A\} \le 1$
Everything: $\hspace{2em} \P\{ \Omega \} = 1$
Complements: $\hspace{2em} \P\{ \text{not } A\} = 1 - \P\{A\}$
Disjoint events: If $\hspace{2em} \P\{A \text{ and } B\} = 0$ then $\hspace{2em} \P\{A \text{ or } B\} = \P\{A\} + \P\{B\}$.
Independence: $A$ and $B$ are independent iff $\P\{A \text{ and } B\} = \P\{A\} \P\{B\}$.
Conditional probability: $$\P\{A \;|\; B\} = \frac{\P\{A \text{ and } B\}}{ \P\{B\} }$$
These imply Bayes' rule:
$$\P\{B \;|\; A\} = \frac{\P\{B\} \P\{A \;|\; B\}}{ \P\{A\} } .$$
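To make Bayes' rule concrete, here is a small sketch for a false-positive calculation. The function name and the screening-test numbers are hypothetical, chosen only for illustration; the denominator uses the rules above to split $\P\{A\}$ into the cases "$B$" and "not $B$".

```python
def bayes_posterior(prior_b, p_a_given_b, p_a_given_not_b):
    """Return P(B | A) from P(B), P(A | B), and P(A | not B)."""
    # P(A) = P(A | B) P(B) + P(A | not B) P(not B)
    p_a = p_a_given_b * prior_b + p_a_given_not_b * (1 - prior_b)
    # Bayes' rule: P(B | A) = P(B) P(A | B) / P(A)
    return prior_b * p_a_given_b / p_a

# Hypothetical screening test: 1% prevalence, 95% sensitivity, 5% false positive rate.
posterior = bayes_posterior(prior_b=0.01, p_a_given_b=0.95, p_a_given_not_b=0.05)
print(round(posterior, 3))
```

With these made-up numbers the posterior is near 0.16: most positive results are false positives because the condition is rare.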
Example problems¶
Question: The donut shop:
- 30% of the donuts are raised (and the remaining 70% are cake donuts)
- 20% of the donuts have sprinkles
- 50% of the raised donuts are filled, while only 10% of the cake donuts are filled
Suppose that we choose a donut randomly out of all that are available.
If the presence of sprinkles is statistically independent of whether the donut is raised, what is the chance we get a raised donut with sprinkles?
What is the probability the donut is filled?
What is the probability that the donut is raised, given that it is filled?
Random variables¶
Concepts and definitions:
- notation: what does $X \sim \text{Normal}(\text{mean}=1, \text{sd}=3)$ mean? How about $Y = 2X + 1$?
- "distribution of a RV": the CDF of $X$ is $F(x) = \P\{ X \le x \}$.
- examples: given a description in math, translate into words or code
- the mean is the "weighted average":
- if discrete, $\E[X] = \sum_x x \P\{X = x\}$
- if continuous, $\E[X] = \int x f_X(x) dx$, where $f_X(x)$ is the probability density function
- (those are the same, since $\int$ is a "continuous sum")
- transformations: $f(X)$ is also a random variable, and so $\E[f(X)] = \sum_x f(x) \P\{X = x\}$.
- random variables are independent if knowing the value of one doesn't give you any information about the value of the other(s)
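One way to translate the notation above into code is to simulate it. This sketch (seed and sample size are arbitrary) draws $X \sim \text{Normal}(\text{mean}=1, \text{sd}=3)$, forms $Y = 2X + 1$, and estimates a CDF value by a proportion.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed
n = 200_000

x = rng.normal(loc=1.0, scale=3.0, size=n)  # X ~ Normal(mean=1, sd=3)
y = 2 * x + 1                               # Y = 2X + 1, computed draw by draw

# E[Y] = 2 E[X] + 1 = 3 and sd[Y] = 2 sd[X] = 6, so these should be close to 3 and 6.
print(y.mean(), y.std())

# Empirical CDF of X at 1: the fraction of draws <= 1, which estimates F(1) = 0.5.
print(np.mean(x <= 1.0))
```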
Distributions:
What to know:
- What parameters do each have?
- What does it mean to have different parameterizations?
- What is the range of values each can produce?
- Is each continuous or discrete?
- Formulas for mean, variance, probability density.
Common ones, with some properties:
- Uniform
- Discrete uniform (random choice)
- Binomial(size $n$, prob $p$)
- mean = $np$, variance = $n p (1-p)$
- counts things out of $n$: number of "successes" in $n$ independent "trials" if the probability of success per trial is $p$
- Normal(mean, sd)
- "cumulative effect of lots of independent deviations"
- additive: adding independent Normals gives you another Normal
- Exponential(mean)
- alternative parameterization: rate = 1/mean
- appears as the waiting time between rare events
- Poisson(mean)
- counts things
- appears as the number of rare events happening in a given time period
- variance = mean
- Gamma(shape, scale)
- alternative parameterization: mean, variance
- appears as a sum of $k$ exponentials each with scale $\theta$
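Several of the properties listed above can be checked by simulation. The following is a rough sketch with arbitrary parameter values; the printed sample moments should be close to the stated formulas but will not match exactly.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed
n = 200_000

# Binomial(size=20, prob=0.3): mean n p = 6, variance n p (1 - p) = 4.2
b = rng.binomial(n=20, p=0.3, size=n)

# Exponential with mean 4, i.e. rate 1/4; NumPy's `scale` argument is the mean.
e = rng.exponential(scale=4.0, size=n)

# Poisson(mean=3): variance equals the mean
p = rng.poisson(lam=3.0, size=n)

# Sum of k=5 independent Exponentials with scale 2 versus Gamma(shape=5, scale=2)
g_sum = rng.exponential(scale=2.0, size=(n, 5)).sum(axis=1)
g = rng.gamma(shape=5.0, scale=2.0, size=n)

print(b.mean(), b.var())       # near 6 and 4.2
print(e.mean())                # near 4
print(p.mean(), p.var())       # both near 3
print(g_sum.mean(), g.mean())  # both near 10
```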
Means and variances¶
Expectation: another word for "mean" or "average", but specifically meaning "average over the probability distribution".
- additivity of means: $\E[X+Y] = \E[X] + \E[Y]$
- multiplication: if $X$ and $Y$ are independent, $\E[XY] = \E[X] \E[Y]$
Variance: nicer to do math with than SD but in less intuitive units
- variance is SD squared: $\var[X] = \sd[X]^2$
- scaling: $\var[a X] = a^2 \var[X]$ (if $a$ is a nonrandom number; think units)
- additivity: if $X$ and $Y$ are independent, then $\var[X+Y] = \var[X] + \var[Y]$
- law of total variance: $ \var[X] = \E[\var[X|R]] + \var[\E[X|R]] $
Covariance:
Defined as $$\begin{aligned} \cov[X, Y] &= \E[(X - \E[X])(Y - \E[Y])] \\ &= \E[XY] - \E[X] \E[Y] , \end{aligned}$$
- $\var[X] = \cov[X, X]$
- bilinearity: for instance,
- $\cov[X, Y + Z] = \cov[X, Y] + \cov[X, Z]$
- $\cov[A + B, C + D] = \cov[A, C] + \cov[A, D] + \cov[B, C] + \cov[B, D]$
- $\cov[a X , Y] = a \; \cov[X, Y]$ if $a$ is a nonrandom number; think "units"
- For a random vector $X = (X_1, \ldots, X_n)$, the matrix $\cov[X]_{ij} = \cov[X_i, X_j]$ is the "covariance matrix" of $X$.
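Bilinearity also gives the variance of a sum when the terms are not independent. As a short derivation using only the rules above (and the symmetry $\cov[X, Y] = \cov[Y, X]$):
$$\begin{aligned} \var[X + Y] &= \cov[X + Y, X + Y] \\ &= \cov[X, X] + \cov[X, Y] + \cov[Y, X] + \cov[Y, Y] \\ &= \var[X] + 2 \cov[X, Y] + \var[Y] . \end{aligned}$$
If $X$ and $Y$ are independent then $\cov[X, Y] = 0$, and this reduces to additivity of variances.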
Correlation:
Defined as $$ \cor[X, Y] = \frac{ \cov[X, Y] }{ \sd[X] \sd[Y] } $$
- Always between -1 and 1.
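- $\cor[X,Y] = 1$ implies that $Y = aX + b$ always, for some nonrandom $a > 0$ and $b$ (so $X = Y$ only if they also have the same mean and SD).
- $\cor[X,Y] = -1$ implies that $Y = aX + b$ always, for some nonrandom $a < 0$ and $b$.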
Sample variance and sample SD: the sample variance of a dataset $(x_1, \ldots, x_n)$ is: $$ s^2(x) = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2 , $$ and the sample SD is its square root, $s(x)$.
- this provides an unbiased estimator for the variance from a dataset
- the sample mean of a dataset is just the usual mean of a random draw from those $n$ numbers; the sample variance is almost just the usual variance except it has $n-1$ instead of $n$.
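A small sketch of the $n-1$ versus $n$ distinction in NumPy, using made-up data: `np.var` divides by $n$ by default, and `ddof=1` switches to the $n-1$ divisor.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 8.0])  # made-up data for illustration
n = len(x)
xbar = x.mean()

s2_manual = np.sum((x - xbar) ** 2) / (n - 1)  # divides by n - 1
s2_numpy = np.var(x, ddof=1)                   # ddof=1 requests the n - 1 divisor
var_n = np.var(x)                              # default ddof=0 divides by n

print(s2_manual, s2_numpy, var_n)
```

Here the squared deviations sum to 24, so the sample variance is 24/5 = 4.8 while the divide-by-$n$ version is 24/6 = 4.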
Skills:
- be able to write down the integral or sum corresponding to a given expression (for instance, $\E[X^2]$ where $X \sim \text{Exponential}(\text{rate=3})$)
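For the example just mentioned: the Exponential(rate=3) distribution has density $f_X(x) = 3 e^{-3x}$ for $x \ge 0$, so applying the continuous version of the transformation rule with $f(x) = x^2$ gives
$$ \E[X^2] = \int_0^\infty x^2 \, 3 e^{-3x} \, dx = \frac{2}{3^2} = \frac{2}{9} . $$
Writing down the integral is the skill being asked for; evaluating it (here by integrating by parts twice) is a bonus.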
Example problems¶
Question:
Give examples from your week of quantities that might be reasonably modeled as random draws from the following distributions:
(a) Binomial (b) Normal (c) Poisson (d) Exponential
In each case, give example parameter values (i.e., for (a), say what $n$ and $p$ are in your example).
Question: Basketball
You are standing outside the EMU with a basketball hoop, asking students who walk by to shoot 10 baskets, and recording how many times (out of 10) they get the ball in. Suppose that each individual student has a fixed probability of success, and that each basket the student shoots goes in the basket independently with that probability. The distribution of this probability of success across students is uniform between 0 and 1. Write down a model for $X$, the number of times a random student gets the ball in, using terms like "$X \sim $ name of some distribution(...)". Be sure to incorporate variation between students.
Question: Means and (co)variances
$X$ and $Y$ are random variables, with means $\mathbb{E}[X] = 1$ and $\mathbb{E}[Y] = 2$. Their (co)variances are $\text{var}[X] = 1$ and $\text{var}[Y] = 2$ and $\text{cov}[X,Y] = 1$. Let $Z = X+Y$. What is the mean and variance of $Z$? Say which facts about mean and variance you are using in this calculation.
Question: (correlation)
Suppose $X$ and $Y$ are independent normal(0,1) random variables; let $Z=X+Y$.
(a) Show that $X$ is positively correlated with $Z$.
(b) Find a number $a$ such that $X + aY$ is uncorrelated with $Z$.
Question: (correlation)
Let $X$ be a random variable which takes on values $\{-1, 0, 1\}$ as follows: $\mathbb{P}(X=1) = \mathbb{P}(X=-1)=\frac{1}{4}$, and $\mathbb{P}(X=0) = \frac{1}{2}$.
Also, let $Y = X^2 - \frac{1}{2}$. Note that both $X$ and $Y$ have mean zero.
(a) Explain why $X$ is uncorrelated with $Y$.
(b) Are $X$ and $Y$ independent? (This question's probably too hard for a final exam question).
Question: (expected value)
Suppose that $X$ is a random variable with $\mathbb{P}\{X=0\} = 3/5$, and $\mathbb{P}\{X=5\} = 1/5$, and $\mathbb{P}\{X=10\} = 1/5$. Find the mean and variance of $X$.
Question: (gamma; additivity of means and variances)
We learned that if $T_1, \ldots, T_k$ are independent Exponential random variables with scale $m$, then $$Y = T_1 + \cdots + T_k$$ has a Gamma(shape=$k$, scale=$m$) distribution. Using this fact, find the mean and variance of the Gamma(shape=$k$, scale=$m$) distribution. (Note: the Exponential distribution with scale $m$ has mean $m$ and variance $m^2$.)
Question: (additivity of means)
I play ten games at the carnival. On the $k$th game I have probability $1/(k+1)$ of winning, and if I win I get $k+1$ dollars (so, I win the first game with probability 1/2, and if I win I get 2 dollars). What is my expected total winnings?
Poisson approximation¶
The Poisson is a good approximation for "the number of rare events". In math, that is: when $N$ is large, the distributions Binomial($N$, $\lambda/N$) and Poisson($\lambda$) are close to each other. (Note: the mean of the Binomial here is $N \times \lambda/N = \lambda$.)
So, if:
- there are a lot of things that are independent
- (this distinction can be arbitrary, for instance, dividing time up into arbitrary small intervals)
- we count the number of these that look a particular way,
- and that number is probably small (a lot smaller than the total number of things),
- then that number is approximately Poisson distributed.
For instance, the probability you get zero of the things is $\exp(-\lambda)$, where $\lambda$ is the mean number.
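The closeness of Binomial($N$, $\lambda/N$) and Poisson($\lambda$) can be checked directly from the two probability mass functions. This sketch uses only the standard library; the helper names and the values $N = 2000$, $\lambda = 3$ are illustrative assumptions.

```python
import math

def binom_pmf(k, n, p):
    """P{X = k} for X ~ Binomial(size=n, prob=p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P{X = k} for X ~ Poisson(mean=lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Illustrative values (not from the text): N = 2000 trials, mean lam = 3.
N, lam = 2000, 3.0
for k in range(6):
    print(k, round(binom_pmf(k, N, lam / N), 5), round(poisson_pmf(k, lam), 5))

# Largest pointwise gap between the two mass functions over k = 0, ..., 19.
max_diff = max(abs(binom_pmf(k, N, lam / N) - poisson_pmf(k, lam)) for k in range(20))
print(max_diff)
```

The $k = 0$ row compares $(1 - \lambda/N)^N$ with $\exp(-\lambda)$, which is the "zero of the things" probability.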
Times between events:
If something happens "at rate $\lambda$" per unit time, and we can treat each bit of time as independent, then:
- the number of things that happen in time $t$ is Poisson(mean=$\lambda t$)
- the amount of time you have to wait for the next thing to happen is Exponential(rate=$\lambda$)
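These two facts fit together: if we build event times out of Exponential waiting times, the count in a window of length $t$ should look Poisson with mean $\lambda t$. A simulation sketch, with arbitrary values for the rate, window, and number of replicates:

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # arbitrary seed
rate, t, reps = 2.0, 5.0, 20_000     # illustrative rate per unit time and window length

# Waiting times between events are Exponential(rate), i.e. scale = 1 / rate.
# 60 gaps per replicate is far more than enough to cover a window of length t = 5.
gaps = rng.exponential(scale=1 / rate, size=(reps, 60))
event_times = gaps.cumsum(axis=1)

# Number of events that occur by time t in each replicate.
counts = (event_times <= t).sum(axis=1)

print(counts.mean(), counts.var())  # both should be near rate * t = 10
```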
Overdispersion:
Often we get overdispersed count data, where there are more zeros and/or large values than you'd expect from a Poisson model. One way to model this is to make the mean of the Poisson random, for instance: $$\begin{aligned} X &\sim \text{Poisson}(\text{mean}=R) \\ R &\sim \text{Gamma}(\text{scale}=\theta, \text{shape}=k). \end{aligned}$$
This model has:
- $\E[X] = \E[R] = k \theta$
- $\var[X] = \var[R] + \E[R] = k \theta (1 + \theta)$
- compare this to the Poisson model, for which $X$ would have variance equal to the mean
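A quick simulation sketch of this Gamma-Poisson model, with arbitrary values of $k$ and $\theta$, to compare the sample mean and variance with the formulas above:

```python
import numpy as np

rng = np.random.default_rng(seed=4)  # arbitrary seed
k, theta, n = 2.0, 3.0, 200_000      # illustrative shape, scale, and sample size

r = rng.gamma(shape=k, scale=theta, size=n)  # R ~ Gamma(shape=k, scale=theta)
x = rng.poisson(lam=r)                       # X ~ Poisson(mean=R), one draw per R

print(x.mean(), k * theta)               # mean: near k * theta = 6
print(x.var(), k * theta * (1 + theta))  # variance: near k * theta * (1 + theta) = 24
```

A plain Poisson with mean 6 would have variance 6, so the extra variance here comes entirely from the randomness in $R$.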
Example problems¶
Question: (Poisson limit)
Most camas flowers are blue, but a few are white. Suppose that each camas flower is white with probability 1/1000; write down and justify an approximate expression for the probability that a given patch of 4,000 camas flowers has no white ones.
Question: Empty road
After a lot of data collection and modeling I've determined that the number of drive-through coffee shacks seen next to a random $x$ miles of Oregon two-lane highway is Poisson distributed with a mean of $x/20$ coffee shacks.
It's 106 miles to Grants Pass on a two-lane highway, it's dark, and I'm wearing sunglasses. What's the probability that I don't pass any coffee shacks on my way there?
I've just left a coffee shack. What's the expected distance to the next one?
Central Limit Theorem¶
This says that "the sum of many independent things is approximately Normal", but with the caveat that they should not mostly be zero (since then it's Poisson).
Procedure:
- Do we have a sum of independent things? if yes, then Normal.
- Compute the mean and variance using additivity for means and variances of independent things.
Sample means:
Suppose we have $n$ observations $X_1, \ldots, X_n$ with $$ \E[X_i] = \mu \qquad \text{and} \qquad \var[X_i] = \sigma^2 .$$ The sample mean is $$ \bar X = \frac{1}{n} \left( X_1 + X_2 + \cdots + X_n \right) . $$
Additivity of means and variances of independent things says that $$ \E[\bar X] = \mu \qquad \var[\bar X] = \sigma^2 / n , $$ and so $\sd[\bar X] = \sigma / \sqrt{n}$.
So, approximately, $$ \bar X \sim \text{Normal}(\text{mean} = \mu, \text{sd} = \sigma/\sqrt{n}).$$
What does this mean? For instance, using the probability density function of the Normal, $$ \P\left\{ \frac{ \bar X - \mu }{ \sigma / \sqrt{n} } > x \right\} \approx \int_x^\infty \frac{e^{-y^2 / 2}}{\sqrt{2 \pi}} dy . $$
Sums:
Using the same $X_1, \ldots, X_n$ from above, let $$ S_n = X_1 + \cdots + X_n. $$ Additivity of means and variances of independent things says that $$ \E[S_n] = n \mu \qquad \var[S_n] = n \sigma^2 , $$ and so $\sd[S_n] = \sigma \sqrt{n}$.
So, approximately, $$ S_n \sim \text{Normal}(\text{mean} = n \mu, \text{sd} = \sigma \sqrt{n}).$$
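The claim about sums can be checked by simulation. This sketch uses Uniform(0, 1) terms as an arbitrary example (so $\mu = 1/2$ and $\sigma^2 = 1/12$); the number of terms and replicates are also arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=5)  # arbitrary seed
n, reps = 50, 100_000                # illustrative: 50 terms per sum, many replicate sums

# Each X_i ~ Uniform(0, 1), with mu = 1/2 and sigma^2 = 1/12.
mu, sigma = 0.5, np.sqrt(1 / 12)
s = rng.uniform(size=(reps, n)).sum(axis=1)  # one S_n per row

print(s.mean(), n * mu)             # near n * mu = 25
print(s.std(), sigma * np.sqrt(n))  # near sigma * sqrt(n), about 2.04

# Normal check: about 95% of S_n should fall within 1.96 SDs of the mean.
z = (s - n * mu) / (sigma * np.sqrt(n))
print(np.mean(np.abs(z) < 1.96))
```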
Example problems¶
Question: An average
Let $T_1, \ldots, T_{100}$ be independent and Exponential with mean 2, and let $M = (T_1 + \cdots + T_{100}) / 100$. What, approximately, is the distribution of $M$? Please give both the name and the values of any relevant parameters.
Question: Lots of exams
Let $E$ be a probability distribution which describes how many minutes it takes a professor to grade a statistics exam. $E$ has mean 5 and standard deviation 1, but I don't know exactly what the distribution is.
Suppose there are 100 exams. Let $X$ be the total amount of time it takes to grade them all. All of the grading times are independent, and have probability distribution $E$.
What, approximately, is the probability distribution of $X$? Make sure to give a reason why, and to give all values of any parameters that the distribution has (including at least its mean and standard deviation).
Multivariate Normal¶
If $$ X_i \sim \text{Normal}(\text{mean}=\mu_i, \text{sd}=\sigma_i), \qquad 1 \le i \le k, $$ are independent, and $a$ and $b$ are nonrandom, then:
$X_1 + \cdots + X_k$ is Normal, with mean $\mu_1 + \cdots + \mu_k$ and SD $\sqrt{\sigma_1^2 + \cdots + \sigma_k^2}$.
$a X_i + b$ is Normal(mean=$a \mu_i + b$, sd=$|a| \sigma_i$).
So: if $Z_1, \ldots, Z_k$ are independent Normal(0,1) and $A$ is a $k \times k$ matrix, then $$ X = AZ $$ is a $k$-dimensional random variable and $$ X \sim \text{Normal}(0, A A^T) .$$
In other words, $$ \cov[ X_i, X_j ] = \cov[(AZ)_i, (AZ)_j] = \left(A A^T \right)_{ij} = \sum_\ell A_{i \ell} A_{j \ell} . $$
Simulation from the multivariate Normal: To simulate a random vector from $$ X \sim \text{Normal}(\text{mean}=a, \text{cov}=C) :$$
- Let $A$ be the Cholesky factor of $C$ (so $C = A A^T$).
- Choose $Z$ to be a vector of independent Normal(0, 1).
- Let $X = a + AZ$.
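The three steps above can be sketched in NumPy as follows. The mean vector and covariance matrix are made up for illustration; `np.linalg.cholesky` returns the lower-triangular factor $A$ with $C = A A^T$.

```python
import numpy as np

rng = np.random.default_rng(seed=6)  # arbitrary seed

# Illustrative target mean and covariance (C must be symmetric positive definite).
a = np.array([1.0, -2.0])
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])

A = np.linalg.cholesky(C)          # lower-triangular, with A @ A.T equal to C
Z = rng.normal(size=(2, 100_000))  # each column is a vector of independent Normal(0, 1)
X = a[:, None] + A @ Z             # each column is one draw of X = a + A Z

print(X.mean(axis=1))  # near a
print(np.cov(X))       # near C (np.cov treats rows as variables)
```

NumPy also provides `rng.multivariate_normal(mean, cov, size)`, which does this in one call; writing it out shows where the Cholesky factor enters.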
Example problems¶
Question: (multivariate normal)
Let $M$ and $L$ be the matrices $$M=\left[\begin{matrix} 4 & 8 \\ 8 & 25\end{matrix}\right], \qquad L = \left[\begin{matrix} 2 & 0 \\ 4 & 3\end{matrix}\right]$$ and observe that $M = LL^T$. Also let $X,Y$ be two independent $N(0,1)$ random variables.
Describe, in terms of $X$, $Y$, $L$, $M$, a multivariate random normal variable whose mean is $\left[\begin{matrix} 1 \\ 2 \end{matrix}\right]$ and whose covariance matrix is $M$.
Question: (covariance matrix)
Suppose we have data for a wide variety of books in the library: for each book, number of words ($N$); number of pages ($P$); and font size ($F$; a larger number means the letters printed on the page are larger). What do you expect the covariance matrix between these three variables to look like? Please describe whether you expect each entry to be positive, negative, or near zero, and which of the diagonal entries you expect to be the largest.
Model fitting¶
Method of Moments¶
To fit a distribution to some data:
- Pick a particular form of the distribution.
- Choose parameter values for the distribution so that some of the "moments" match:
- if you have just one parameter for the distribution, make the mean match
- if you have two parameters, make the mean and the variance match
- other choices are possible
Notes:
- Procedure:
- compute the moments from the data (ex: mean, variance)
- write down what the moments are in terms of the parameters
- solve for the parameters to make these equal
- look at some plots to assess goodness-of-fit
- Sometimes this is the same as maximum likelihood.
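As a sketch of the procedure for a two-parameter family, here is a method-of-moments fit of a Gamma(shape=$k$, scale=$\theta$) distribution, which has mean $k\theta$ and variance $k\theta^2$. The "data" are simulated (an assumption made so the true parameters are known), and the names `k_hat` and `theta_hat` are just labels for the estimates.

```python
import numpy as np

rng = np.random.default_rng(seed=7)  # arbitrary seed

# Stand-in "data": simulated here so that the true parameters are known.
data = rng.gamma(shape=3.0, scale=2.0, size=50_000)

# Step 1: moments from the data.
m, v = data.mean(), data.var(ddof=1)

# Step 2: for Gamma(shape=k, scale=theta), mean = k * theta and variance = k * theta**2.
# Step 3: solve those two equations for the parameters.
theta_hat = v / m
k_hat = m**2 / v

print(k_hat, theta_hat)  # should be near 3 and 2
```

The last step of the procedure, looking at plots, could be done by overlaying a histogram of `data` with draws from the fitted distribution.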
Example problems¶
Question: Counting steeds
I have counted up the number of times the word "steed" appears on each page of a long book. The average number of times it appears per page across the book is 0.2. Let $X$ denote the number of times the word "steed" appears on a randomly chosen page in the book. What is a distribution that might be a good fit for $X$? Justify your choice, including the method used to pick the parameter(s).
Question: (method of moments; goodness-of-fit)
A report of weights of 500 loaded shipping containers finds that the average weight is 85,243 pounds, with a standard deviation of 12,025 pounds. I suggest the following model for the weight $W$ of a random shipping container: $$ W \sim \text{Normal}(\text{mean}=85,243 \,\text{lbs}, \text{sd}=12,025 \,\text{lbs}). $$ What method did I use to fit the parameters for this model, and what is one thing we could do to assess whether this is a good model?
Question: Linear model by method of moments
Suppose that we have measured carbon flux from a wetland many times over the course of several days. Our observations are a series of times of day ($T$, in units of hours past midnight) and carbon flux measurements ($Y$, in units of g/hr). We would like to fit this model: $$ Y \sim \text{Exponential}(\text{mean}=a \; \cos(T) + b) . $$ The mean value of $Y$ across our measurements is 12 g/hr, and the correlation between $Y$ and $\cos(T)$ is -0.25. Explain what steps we would use to fit this model using the Method of Moments with these two statistics.
Maximum likelihood¶
Given data $D$ and a model $M$ with parameters $\theta \in A$ and likelihood function $$ L_M(D|\theta) = \P_M\{D|\theta\} , $$ a maximum likelihood estimate of $\theta$ is $$ \theta^* = \text{argmax}_{\theta \in A}\{ L_M(D|\theta) \}, $$ i.e., $\theta^*$ is the set of parameter values that maximizes the likelihood.
Procedure:
- formulate a generative model that seems likely to fit the data with some free parameters,
- write down the likelihood (i.e., probability) of generating our actual data as a function of the parameter(s), and
- conclude that "gee, seems like a good guess of the parameter(s) values are the ones that make our data look most probable".
Remember:
- The likelihood of the data is (usually) the product of the likelihood of each data point.
- The likelihood of a single data point is either the probability density function (for continuous distributions) or the probability mass function (for discrete distributions). In both cases, this is "$\P\{X = x_i\}$", where $x_i$ is the $i^\text{th}$ data point.
- We almost always want to look at the log likelihood, so we use the sum of the log probabilities.
- We also sometimes use the negative log likelihood, and then minimize.
- The likelihood surface is a plot of the likelihood (of the data) against the parameters, which visualizes the thing we're trying to maximize.
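To see these pieces working together, here is a sketch that fits a Poisson(mean=$\lambda$) model to made-up counts by minimizing the negative log-likelihood over a grid. The grid search is a deliberately simple stand-in for a numerical optimizer, and the function name is hypothetical.

```python
import math
import numpy as np

counts = np.array([2, 0, 3, 1, 4, 2, 1, 3])  # made-up count data for illustration

def neg_log_lik(lam, x):
    """Negative log-likelihood of independent Poisson(mean=lam) observations x."""
    # log P{X = x_i} = -lam + x_i * log(lam) - log(x_i!), summed over data points
    return -sum(-lam + xi * math.log(lam) - math.lgamma(xi + 1) for xi in x)

# Evaluate on a grid of candidate values and take the minimizer.
grid = np.linspace(0.1, 6.0, 591)  # spacing 0.01
nll = np.array([neg_log_lik(lam, counts) for lam in grid])
lam_hat = grid[nll.argmin()]

print(lam_hat, counts.mean())  # for the Poisson, the MLE equals the sample mean
```

Plotting `nll` against `grid` would show the (one-dimensional) likelihood surface for this model.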
Skills and concepts:
- What likelihood function do we use? (How do we get the likelihood from the model?)
- Where does the data go in to the likelihood function, and how do the parameters go in?
Example problems¶
Question: Waiting times
The last five times I waited for the bus I wrote down how long I had to wait; the times were 1.2, 10, 5.8, 12, and 8 minutes. Suppose we'd like to fit an Exponential(rate=$\lambda$) distribution to these data using maximum likelihood. Write down (in math, not code) the likelihood function (that we'd maximize) or the negative log-likelihood function (that we'd minimize), being sure to say which one you've written down and what parameter is maximized (or minimized) over. You do not have to find the maximum likelihood estimate.
Question: Least squares
You are given a very small dataset: two variables with four observations each; they are $x = (0, -2, 3, 1)$ and $y = (3, 2, 9, 6)$. You are then asked to fit the following model by maximum likelihood: $$ \log(y_i) \sim \text{Normal}(\text{mean}=a x_i + b, \text{sd}=\sigma) . $$ Please write down either the likelihood or log-likelihood function that you would use to fit this model, and say which parameters will be estimated.
Classical statistics¶
Hypotheses and $p$-values and $t$ tests and confidence intervals for the mean.
Student's $t$¶
As above, suppose we have $n$ observations $X_1, \ldots, X_n$ with $$ \E[X_i] = \mu \qquad \text{and} \qquad \var[X_i] = \sigma^2 .$$ Using $\bar X$ as the sample mean and $S$ as the sample SD for a dataset of $n$ observations (again, as above), we know that $\E[\bar X] = \mu$ and $\sd[\bar X] = \sigma/\sqrt{n}$ and $\E[S^2] = \sigma^2$, so we expect "$\bar X$ to differ from $\mu$ by a few multiples of $S / \sqrt{n}$".
In fact: the difference between the sample mean and $\mu$, in units of $S/\sqrt{n}$: $$ T = \frac{\bar X - \mu}{S/\sqrt{n}} $$ has, approximately, Student's t distribution with $n-1$ degrees of freedom (exactly so if the $X_i$ are independent and Normal).
$p$-values¶
The $p$-value is the probability of seeing a result at least as surprising as what we observed in the data, if the null hypothesis is true.
The parts of this are:
the probability ... if the null hypothesis is true: we need a concrete model we can compute probabilities with
a result: a statistic summarizing how strongly our data suggest that model is not right
at least as surprising: usually, the statistic is chosen so that larger values are more surprising
Notes:
- Note the double negatives! $p$ measures the "strength of the evidence against the hypothesis".
- If $p$ is very small then we think the hypothesis is unlikely to be true. (i.e., the data do not seem consistent with the hypothesis)
- If $p$ is not small, then our data are consistent with the hypothesis (but statistically speaking, this is more "no evidence against" rather than "evidence for").
Simple examples for $p$-values:¶
We have counted up the number of successes out of $n$ trials, and wonder whether maybe the probability of success is a particular value of $p$ (for instance, is it reasonable that $p=1/2$?). Use: Binomial.
We have a bunch of observations and want to use a Poisson model to test whether it's reasonable that the mean is a given value of $\mu$.
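For the first example (successes out of $n$ trials), the $p$-value can be computed exactly from the Binomial probability mass function. This is a sketch with hypothetical data (16 successes in 20 trials) and a hypothetical helper name, using the number of successes as the statistic.

```python
import math

def binom_upper_tail(k, n, p):
    """P{X >= k} for X ~ Binomial(size=n, prob=p)."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

# Hypothetical data: 16 successes in 20 trials; null hypothesis p = 1/2.
n, k, p0 = 20, 16, 0.5
p_one_sided = binom_upper_tail(k, n, p0)

# Under p = 1/2 the distribution is symmetric, so "at least as far from n/2 = 10"
# in either direction doubles the one-sided tail.
p_two_sided = 2 * p_one_sided

print(p_one_sided, p_two_sided)
```

Here the one-sided value is $6196 / 2^{20}$, a bit under 0.006, so 16 or more successes would be quite surprising if $p$ were really 1/2.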
$p$-values for the sample mean¶
If we have measurements $x_1, \ldots, x_n$, and we wonder whether the (true, population) mean is actually (some particular value) $\mu$, then:
How do we estimate the mean? With the sample mean, $(x_1 + \cdots + x_n)/n$. And, the SD? With the sample SD, $s$.
How sure are we the mean is not $\mu$? Well, if it was $\mu$, then $t = (\bar x - \mu) / (s/\sqrt{n})$ would be Student's $t(n-1)$ distributed, so if $t$ is bigger than we'd expect then we can be pretty sure that the mean is not in fact $\mu$.
The $p$-value (for the "null hypothesis" that the mean is $\mu$) is $$ \P\{ T > t \} , $$ where $T$ is Student's t with $n-1$ degrees of freedom.
Notes:
- This is a one-sided $p$-value; you might want to take absolute values to get a two-sided one
- See notes above!
Confidence intervals¶
A "95% confidence interval" is a range of values such that if you do a great many experiments, and in each construct a 95% confidence interval for an estimated parameter, then the true value of the parameter should lie within the confidence intervals in 95% of those experiments. (Note: "95%" is not special here, other values can be used.)
Confidence interval for the mean: Rearranging the definition of the $T$ statistic (above): $$ \mu = \bar{X} - T \frac{S}{\sqrt{n}} . $$ In other words: the true mean is within a few multiples of $S/\sqrt{n}$ of the sample mean, where "a few" has the $t(\text{df}=n-1)$ distribution.
So, if we choose $t_*$ so that $$ \P\{ - t_* \le T \le t_* \} = 95\%, $$ then $$ \P\{ \bar X - t_* S / \sqrt{n} \le \mu \le \bar X + t_* S / \sqrt{n} \} = 95\% . $$ In other words, "from $\bar X - t_* S / \sqrt{n}$ to $\bar X + t_* S / \sqrt{n}$" is a 95% confidence interval.
Steps:
- Compute the sample mean ($\bar X$) and sample SD ($S$).
- Choose a confidence level (e.g., 95%)
- Compute $t_*$ from this confidence level and Student's $t$ distribution with $n-1$ degrees of freedom
- The confidence interval is $t_*$-many multiples of $S/\sqrt{n}$ on either side of $\bar X$.
Notes:
- "$\mu$ is outside the 95% confidence interval" is the same thing as "$p < 0.05$ for the hypothesis that the mean is $\mu$", where $p$ is the two-sided $p$-value
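The steps for the $t$ statistic, $p$-value, and confidence interval can be sketched with SciPy's Student's $t$ distribution. The measurements and the hypothesized mean `mu0` are made up; `stats.t.sf` is the upper tail probability and `stats.t.ppf` is the quantile function.

```python
import numpy as np
from scipy import stats

x = np.array([5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 5.2, 6.1])  # made-up measurements
mu0 = 5.0                                                # hypothesized mean (illustrative)

n = len(x)
xbar, s = x.mean(), x.std(ddof=1)  # sample mean and sample SD
se = s / np.sqrt(n)

# t statistic and two-sided p-value for the null hypothesis "the mean is mu0"
t_stat = (xbar - mu0) / se
p_two_sided = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# 95% confidence interval: t_star satisfies P{-t_star <= T <= t_star} = 0.95
t_star = stats.t.ppf(0.975, df=n - 1)
ci = (xbar - t_star * se, xbar + t_star * se)

print(t_stat, p_two_sided, ci)
```

Changing `mu0` and re-running shows the correspondence between the interval and the test: the two-sided $p$-value drops below 0.05 exactly when `mu0` falls outside `ci`.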
Example problems¶
Question: (p-value)
Give a definition of a $p$-value.
Question: (confidence interval)
Suppose we are testing the effectiveness of a new vaccine for preventing a disease. What does a confidence interval for the vaccine's effectiveness tell you that a $p$-value does not?
Question:
Suppose we have $n$ observations whose mean is $\bar x$ and whose standard deviation is $s$, and suppose that these come from a Normal(mean=$\mu$, sd=$\sigma$) distribution. What is the difference between $$ z = (\bar x - \mu) / (\sigma / \sqrt{n}), $$ and $$ t = (\bar x - \mu) / (s / \sqrt{n}) $$ -- explain the difference and say what their distributions are.
Question: Prices
You've got data on the price of sandwiches over the last five years in Eugene: for each sandwich, you have the date it was purchased ($T$, in units of fractional years after January 1st, 2021) and the price ($P$, in dollars). The dates have a mean of 2.5 years (corresponding to midway through 2023) and a standard deviation of 0.5 years; and the prices have a mean of \$8.00 with a standard deviation of \$2. The correlation between price and date is 0.5. Based on this, what would you predict that a sandwich would cost at the start of 2027?
Question: Cheese confidence
A company surveys a random sample of Eugene residents, and estimates that the percent of people in Eugene who enjoy cheese is about 77%, with a 95% confidence interval of 75% to 77%. Explain what the "95%" means in this statement, in concrete probabilistic terms.
Linear (and other) models¶
The one-dimensional linear model¶
Say we have data $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$, and want to predict $y$ based on $x$ with $$ \hat y_i = a x_i + b. $$ Then if we choose $a$ and $b$ either:
- by minimizing the mean squared error, $\frac{1}{n} \sum_{i=1}^n (y_i - \hat y_i)^2$ (equivalently, the sum of squared errors), or
- by maximizing the likelihood under the model $$ Y = a X + b + \epsilon, \qquad \epsilon \sim \text{Normal}(0, \text{sd}=\sigma), $$
then $$ a = \frac{\sd[y]}{\sd[x]} \cor[x, y] , $$ and $$ b = \bar y - a \bar x ,$$ where $\bar x$ and $\bar y$ are the means of the vectors.
In words, the slope, $a$, is equal to the correlation between $x$ and $y$, but in units of standard deviations.
Terminology:
- $a$ is the slope, i.e., how much $Y$ goes up by, on average, per unit of increase in $X$,
- $b$ is the intercept, i.e., the expected value of $Y$ when $X=0$ (if this makes sense),
- $\hat Y = a X + b$ is the predicted value of $Y$ given $X$,
- $\epsilon = Y - \hat Y$ is the residual, which for the above to be true must have mean zero.
Notes:
- The reason these are the same is that the log-likelihood of the Normal model is equal to the negative of the sum of squared errors, up to a scaling factor and an additive constant that do not depend on $a$ or $b$.
- Under this model, residuals are Normal with standard deviation $\sigma$.
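A sketch checking the slope and intercept formulas against a least-squares fit, using simulated data with arbitrary true values $a = 1.5$, $b = 2$, $\sigma = 1$:

```python
import numpy as np

rng = np.random.default_rng(seed=8)  # arbitrary seed
x = rng.uniform(0, 10, size=200)
y = 1.5 * x + 2.0 + rng.normal(scale=1.0, size=200)  # simulated with a=1.5, b=2, sigma=1

# Slope and intercept from the formulas in the text.
a = y.std() / x.std() * np.corrcoef(x, y)[0, 1]
b = y.mean() - a * x.mean()

# Least-squares fit for comparison; polyfit returns [slope, intercept] for degree 1.
a_ls, b_ls = np.polyfit(x, y, deg=1)

print(a, b)
print(a_ls, b_ls)  # should agree with the line above up to rounding error
```

The ratio of SDs does not depend on whether they are computed with $n$ or $n-1$, as long as both use the same divisor.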
The multivariate linear model¶
Now suppose that we want to predict $y_i$ using the predictors $X_i = (X_{i1}, \ldots, X_{ik})$, with $$ \hat y_i = a_1 X_{i1} + \cdots + a_k X_{ik} . $$ By either least squares or maximum likelihood with Normal residuals, the estimate of $a = (a_1, \ldots, a_k)$ solves $$ (X^T X) a = X^T y , $$ where $X$ is the $n \times k$ data matrix and $y$ is the $n$-vector of values.
Notes:
- There's no intercept because we can just assume that one of the $X_i$'s is always equal to 1.
- Again, $\sigma$ tells you how much $y$ differs at the same values of $x$.
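The matrix equation $(X^T X) a = X^T y$ can be solved directly. This sketch simulates a data matrix with an intercept column and made-up coefficients, then compares the direct solution with NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(seed=9)  # arbitrary seed
n = 100

# Data matrix with a column of 1s (the intercept) and two simulated predictors.
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
a_true = np.array([2.0, -1.0, 0.5])             # illustrative "true" coefficients
y = X @ a_true + rng.normal(scale=0.3, size=n)  # Normal residuals with sd 0.3

# Solve the normal equations (X^T X) a = X^T y directly ...
a_hat = np.linalg.solve(X.T @ X, X.T @ y)

# ... and compare with NumPy's least-squares routine.
a_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(a_hat, a_lstsq)
```

In practice `lstsq` is usually the safer choice numerically, since it does not form $X^T X$ explicitly.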
Transformations¶
Anything you can do to the data -- imagine making new columns in the data frame -- can give you new variables to put in your model. For instance, the model $$\begin{aligned} Y &= \exp\left( a X + b \cos(X) + c + \epsilon \right) \\ \epsilon &\sim \text{Normal}(\text{mean}=0, \text{sd}=\sigma) \end{aligned}$$ is a standard linear model because if we define new variables $Z = \log(Y)$ and $W = \cos(X)$, it's equivalent to: $$\begin{aligned} Z &= a X + b W + c + \epsilon \\ \epsilon &\sim \text{Normal}(\text{mean}=0, \text{sd}=\sigma) . \end{aligned}$$
The most common things to do are:
- take logs (do this if the variable naturally changes by multiplication - i.e., by percentages)
- add in powers of $X$, like $X^2$ (do this if the relationship between $Y$ and $X$ looks kinda curvy)
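The "new columns" idea can be sketched for the $\exp(aX + b\cos(X) + c + \epsilon)$ model above: simulate from it with made-up parameter values, build the columns $Z = \log(Y)$ and $W = \cos(X)$, and fit by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(seed=10)  # arbitrary seed
n = 500
a, b, c, sigma = 0.3, 1.0, 0.5, 0.2   # illustrative parameter values

# Simulate from Y = exp(a X + b cos(X) + c + eps).
X = rng.uniform(0, 10, size=n)
Y = np.exp(a * X + b * np.cos(X) + c + rng.normal(scale=sigma, size=n))

# New "columns": Z = log(Y) and W = cos(X); then Z is linear in (X, W, 1).
Z, W = np.log(Y), np.cos(X)
design = np.column_stack([X, W, np.ones(n)])
a_hat, b_hat, c_hat = np.linalg.lstsq(design, Z, rcond=None)[0]

print(a_hat, b_hat, c_hat)  # should be near 0.3, 1.0, 0.5
```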
Example problems¶
Question: (multivariate model)
Suppose that we have data $Y_1, \ldots, Y_n$ and $(X_{11}, X_{12}), \ldots, (X_{n1}, X_{n2})$, and we use the data to fit the following model by maximum likelihood: $$\begin{aligned} Y_i \sim \text{Normal}(\text{mean}=b_0 + b_1 X_{i1} + b_2 X_{i2}, \text{sd}=\sigma) . \end{aligned}$$ Suppose that $(b_0, b_1, b_2)$ are the maximum likelihood estimates we obtain. Write down the expression in terms of the data that $(b_0, b_1, b_2)$ minimizes.
Generalized Linear Models¶
A generalized linear model has three ingredients:
- a response distribution for $Y$ (the "family"),
- a linear predictor, $Xb$, and
- a link function $h( )$ connecting the linear predictor to the mean of the response, usually $h(\E[Y]) = Xb$.
$$ Y \sim \text{Response}(\text{mean}=h^{-1}(Xb), \text{maybe other parameters}). $$
Linear predictors and parameters¶
A linear predictor is just a linear combination of predictor variables. In other words, it is a collection of predictors, each multiplied by a parameter, and added up. Here:
- a predictor variable is something from the data that we're using to predict the response; think of this as a column in the data frame; something we do know
- a parameter is a variable that stands for a value that we don't know, but want to guess
- fitting the model generally means finding good guesses at values for the parameters
The terminology is tricky because the word "variable" can mean "a variable in your dataset" (i.e., a column in the data frame), while sometimes it just has the mathematical meaning of "a letter we're using to stand in for a maybe unknown number".
For instance: if $U$, $V$, $W$, and $Y$ are specific things that we measure, then in the model $$\begin{aligned} Y \sim \text{Normal}(\text{mean}=\exp(aU + bV + cW + d), \text{sd}=\sigma), \end{aligned}$$ the parameters are $a$, $b$, $c$, $d$, and $\sigma$. (Also, this is a Normal family GLM with log link function and linear predictor $aU+bV+cW+d$.)
Examples:¶
Here $Xb = X_1 b_1 + X_2 b_2 + \cdots + X_k b_k$ is the matrix product of the $n\times k$ data matrix $X$ and the vector $b$ of parameters; to include an intercept we have a column of all 1s in the data matrix.
Standard: Normal GLM with an identity link: $$\begin{aligned} Y &\sim \text{Normal}(\text{mean}=Xb, \text{sd}=\sigma) . \end{aligned}$$ Additional parameter: $\sigma$.
Poisson: "Poisson GLM with a log link": $$\begin{aligned} Y &\sim \text{Poisson}(\text{mean}=\exp(Xb)) . \end{aligned}$$
Logistic: "Binomial GLM with a logit link": where $N$ is a total count we observe, and $Y$ is an integer between 0 and $N$: $$\begin{aligned} Y &\sim \text{Binomial}(\text{size}=N, \text{prob}=\theta) \\ \theta &= \frac{1}{1 + e^{-Xb}} , \end{aligned}$$ where $f(x) = 1/(1 + e^{-x})$ is the logistic function. (Note: the terminology is a bit funny here; "logit" is the inverse of the logistic function, but the mean of $Y$ is $N\theta$.)
Robust: "Cauchy GLM with an identity link": $$\begin{aligned} Y &\sim \text{Cauchy}(\text{location}= Xb, \text{scale}=\sigma) . \end{aligned}$$
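One way to get a feel for these four models is to simulate a response from each, starting from the same linear predictor. The data matrix, parameter vector `b`, and the extra parameters (sd 1, size $N = 10$, scale 1) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=11)  # arbitrary seed
n = 1000

# Data matrix with an intercept column, and illustrative parameter values b.
X = np.column_stack([np.ones(n), rng.normal(size=n)])
b = np.array([0.5, 0.8])
eta = X @ b  # the linear predictor, one value per observation

y_normal = rng.normal(loc=eta, scale=1.0)      # Normal family, identity link
y_poisson = rng.poisson(lam=np.exp(eta))       # Poisson family, log link
theta = 1 / (1 + np.exp(-eta))                 # logistic function of Xb
y_binom = rng.binomial(n=10, p=theta)          # Binomial family (size N=10), logit link
y_cauchy = eta + 1.0 * rng.standard_cauchy(n)  # Cauchy with location Xb, scale 1

print(y_poisson[:5], y_binom[:5])
```

Fitting goes the other way: given `X` and one of these responses, find the `b` that maximizes the corresponding likelihood.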
Example problems¶
Question: Houses
Suppose the correlation between home value and elevation in South Eugene is 0.2. Also suppose that across these homes, the standard deviation of home value is \$250,000, the standard deviation of elevation is 300ft, and a linear model fits the data well. How much more expensive, on average, would we expect houses on Stonewood (elevation 900ft) to be than houses on West 36th (elevation 600ft)?
Question: (GLMs)
Write down the model we use when we fit a "Poisson GLM with a log link function".
Question: (Poisson model)
It turns out that fruit are slightly radioactive - though some types of fruit are more radioactive than others. A geiger counter placed in a fruit warehouse counts every time a piece of fruit gives off an alpha particle (as well as counting particles that come from background radiation). Suppose that we keep a geiger counter in a fruit warehouse for several weeks; each week we make note of how many apples there are, how many bananas there are, and how many decays there are. Describe a generalized linear model that we could fit to this data that would give us information about the relative rate at which apples and bananas decay (radioactively).
Question: (linear models)
Suppose we have a 20 by 3 data matrix $A$, with the columns representing height, number of leaves, and daily water intake of a plant. Each row represents a different plant.
We also have a (column) vector $v$ of the total mass of fruit that each plant produces. We want to fit a linear model to the data, including fitting a $y$ intercept, using linear algebra - so as to predict fruit yield based on the variables.
(a) What matrix equation do we solve, in order to do this? (b) What assumptions are we implicitly making about the residuals?
Question: Effectiveness
Researchers fit the following (logistic-Binomial GLM) model to epidemiological data: let $Y=1$ if a given randomly chosen individual in question contracted COVID in a given month and $Y=0$ if they did not; $V=1$ if the individual was vaccinated with the Moderna vaccine and $V=0$ if not; and $X$ be a vector of other predictors. Then the model is $$\begin{aligned}Y &\sim \text{Binomial}(1, p) \\ p &= \frac{1}{1 + e^{-(aV + Xb)}} ,\end{aligned}$$ where $a$ is a number and $b$ is a vector of numbers of the same length as $X$. (a) Which variables here are parameters that they have estimated? (b) Which parameter tells us about the effectiveness of the vaccine? (c) Do you expect the parameter identified in part (b) to be positive or negative?
Question: Transformations
Suppose that we expect $Y$ to depend on quantities $X_1$, $X_2$ as follows: $$Y = e^{\displaystyle a\sin(X_1) + bX_2 + c + \varepsilon}.$$ Here, $a,b,c$ are unknown parameters, and $\varepsilon \sim N(0,1)$ is the residual.
Suppose we have a data set of twenty samples from $(X_1, X_2, Y)$.
Explain how to use standard linear algebra to infer good values of $a,b,c$ from the data. Will the results be reliable? Why or why not?
Other topics¶
Principal Components Analysis (PCA)¶
Variance of a sum: If $X$ is a random, $k$-dimensional vector with $$ \cov[X] = C ,$$ and for a set of coefficients $a = (a_1, \ldots, a_{k})$, we define a new variable $Y = a \cdot X = \sum_{i=1}^k a_i X_i$. Then, $\var[Y] = a^T C a$.
So, among coefficient vectors of unit length ($\|a\| = 1$), the choice of $a$ that maximizes $\var[Y]$ is the top eigenvector of $C$ (the one with the largest eigenvalue).
Principal components: If $\lambda_i$ and $v_i$ are the $i^\text{th}$ eigenvalue and eigenvector of $C$, with $\|v_i\| = 1$, then $$ C = \sum_{i=1}^k \lambda_i v_i v_i^T . $$ Furthermore, since $C$ is symmetric, $v_i \cdot v_j = \delta_{ij}$, and $$ \sum_{ij} C_{ij}^2 = \sum_i \lambda_i^2 . $$ Then, we say that
- $(v_i \cdot X)$: the $i^\text{th}$ principal component, PC$i$
- $v_i$: the loadings of PC$i$ on each variable
- $\lambda_i$: the amount of variance explained by PC$i$ (since $\var[v_i \cdot X] = v_i^T C v_i = \lambda_i$)
Notes:
- each entry of $v_i$ tells "how important" is that variable for the PC$i$
- the plot with PCs on the coordinates gives us a new view of the data; this is "the best" view in the sense of maximizing the variance shown in that plot
- if one variable has a much larger variance than the others (for instance, because of its units), that one will probably dominate PC1, so you may need to scale the variables first
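A sketch of PCA by eigendecomposition of the sample covariance matrix, on simulated data with a made-up mixing matrix. It computes the principal components $v_i \cdot X$ and compares the sample variance of each one with $a^T C a$ evaluated at $a = v_i$, which is the corresponding eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(seed=12)  # arbitrary seed

# Simulated data: 2000 observations (rows) of 3 correlated variables (columns).
mix = np.array([[2.0, 0.0, 0.0],
                [1.0, 1.0, 0.0],
                [0.5, 0.3, 0.2]])  # illustrative mixing matrix
data = rng.normal(size=(2000, 3)) @ mix.T

C = np.cov(data, rowvar=False)    # 3 x 3 sample covariance matrix
evals, evecs = np.linalg.eigh(C)  # eigh is for symmetric matrices; ascending order
order = np.argsort(evals)[::-1]   # reorder so that PC1 comes first
evals, evecs = evals[order], evecs[:, order]

# Principal components: project the centered data onto each eigenvector (column of evecs).
pcs = (data - data.mean(axis=0)) @ evecs

print(evals)
print(pcs.var(axis=0, ddof=1))  # sample variance of each PC; compare with evals
```

The columns of `evecs` are the loadings, and plotting `pcs[:, 0]` against `pcs[:, 1]` gives the usual PC1 versus PC2 view of the data.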
Other concepts¶
- Goodness-of-fit: after you've fit a model, you can simulate from it to see if it looks like the real data.