import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (15, 8)
import numpy as np
import pandas as pd
import patsy
from dsci345 import pretty
rng = np.random.default_rng()
$$\renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\text{var}} \newcommand{\sd}{\text{sd}} \newcommand{\cov}{\text{cov}} \newcommand{\cor}{\text{cor}}$$
This is here so we can use \P, \E, \var, \cov, \cor, and \sd in LaTeX below.
Uncertainty: (how to) deal with it
When we're doing data science, we
- look at data
- make predictions about future values
- infer aspects of the underlying process
Fundamental to all stages are randomness and uncertainty.
For instance: randomized algorithms (e.g., stochastic gradient descent).
For instance:
Computing a statistic gives you a number that describes a data set.
Doing statistics helps you understand how reliable that description is and how well it applies to the wider world.
We understand uncertainty, conceptually and quantitatively, with randomness,
i.e., through probability.
Goals of this class
Become familiar with different types of probability models.
Calculate properties of probability models.
Construct and simulate from realistic models of data generation.
Be able to test estimation and prediction methods with simulation.
Gain familiarity with fundamental statistical concepts.
We'll spend a lot of time on probability models, for applications from classical statistics to machine learning.
Tools: Distributions
- Uniform
- Exponential
- Normal
- multivariate Normal
- Binomial
- Gamma
- Student's $t$
- Beta
- Cauchy
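All of these distributions can be sampled directly with numpy's `Generator` (the `rng` object created above); a quick sketch, with arbitrary parameter values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=11)  # seeded here only for reproducibility

u = rng.uniform(0, 1, size=1000)           # Uniform
e = rng.exponential(scale=2.0, size=1000)  # Exponential
z = rng.normal(0, 1, size=1000)            # Normal
b = rng.binomial(10, 0.3, size=1000)       # Binomial
g = rng.gamma(2.0, 1.5, size=1000)         # Gamma (shape, scale)
t = rng.standard_t(df=5, size=1000)        # Student's t
be = rng.beta(2.0, 3.0, size=1000)         # Beta
c = rng.standard_cauchy(size=1000)         # Cauchy
mv = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=1000)  # multivariate Normal
```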
Probability and Random Variables
- probability rules: like area
- rules for means and variances
- algebra with random variables: $$ \mathbb{E}[f(X)] = \sum_x \mathbb{P}\{X=x\} f(x) $$
- independence
- means minimize sum-of-squared-errors
- covariance and correlation
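For instance, the identity $\E[f(X)] = \sum_x \P\{X=x\} f(x)$ can be checked by simulation; a sketch using a fair six-sided die and $f(x) = x^2$:

```python
import numpy as np

rng = np.random.default_rng(seed=123)  # seeded for reproducibility

# X is a fair six-sided die; f(x) = x^2.
x_vals = np.arange(1, 7)
probs = np.full(6, 1 / 6)

# Exact expectation: sum over x of P{X=x} f(x) = 91/6
exact = np.sum(probs * x_vals**2)

# Monte Carlo check: average f(X) over many simulated draws of X
draws = rng.choice(x_vals, size=100_000, p=probs)
approx = np.mean(draws**2)
```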
Modeling:
- Central Limit Theorem: sums of little things get you Normals
- rare events get you Poisson counts and Exponential waiting times
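The Central Limit Theorem in action, as a sketch: sum up many small independent contributions (here, Uniforms, though any distribution with finite variance works) and the result looks Normal, with the mean and standard deviation the CLT predicts.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeded for reproducibility

# Each row: n = 100 small independent Uniform(0, 1) contributions.
n = 100
little_things = rng.uniform(0, 1, size=(10_000, n))
sums = little_things.sum(axis=1)

# CLT prediction: mean n * 1/2 = 50, sd sqrt(n / 12) ≈ 2.89
print(sums.mean(), sums.std())
```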
Linear models:
- the usual
- Poisson-exponential
- Binomial-logistic
- robust
- nonlinear linear models (linear in the parameters, not necessarily in the predictors)
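Simulating from one of these, as a sketch: a Binomial-logistic model with hypothetical coefficients $b_0$ and $b_1$, where $\P\{Y = 1 \mid x\} = 1 / (1 + e^{-(b_0 + b_1 x)})$.

```python
import numpy as np

rng = np.random.default_rng(seed=7)  # seeded for reproducibility

n = 5_000
x = rng.normal(size=n)
b0, b1 = -0.5, 2.0          # hypothetical coefficients, for illustration

# P(Y = 1 | x) = logistic(b0 + b1 * x)
p = 1 / (1 + np.exp(-(b0 + b1 * x)))
y = rng.binomial(1, p)
print(y.mean())
```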
Model fitting:
- method of moments
- maximum likelihood
- penalized (or, regularized) likelihood
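A method-of-moments sketch: for a Gamma distribution with shape $k$ and scale $\theta$, the mean is $k\theta$ and the variance is $k\theta^2$, so matching the sample moments gives $\hat k = \bar x^2 / s^2$ and $\hat \theta = s^2 / \bar x$ (parameter values here are arbitrary, for illustration).

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded for reproducibility

true_shape, true_scale = 3.0, 2.0
x = rng.gamma(true_shape, true_scale, size=100_000)

# Match sample mean and variance to k*theta and k*theta^2:
m, v = x.mean(), x.var()
k_hat = m**2 / v
theta_hat = v / m
print(k_hat, theta_hat)
```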
Statistics
- sampling distributions
- standard errors
- the $t$ distribution
- confidence intervals
- heteroskedasticity
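What "95% confidence" means can itself be checked by simulation; a sketch that computes the normal-theory interval $\bar x \pm 1.96 \cdot \text{SE}$ on many simulated data sets and counts how often it covers the true mean (values here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # seeded for reproducibility

true_mean, n, reps = 10.0, 50, 2_000
covered = 0
for _ in range(reps):
    x = rng.normal(true_mean, 3.0, size=n)
    se = x.std(ddof=1) / np.sqrt(n)
    lo, hi = x.mean() - 1.96 * se, x.mean() + 1.96 * se
    covered += (lo <= true_mean <= hi)

print(covered / reps)  # should be close to 0.95
```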
Methods
- PCA
- diagnostics for linear models
- transformations
- cross-validation
- the bootstrap
- the other bootstrap
Details
Relationships:
Definition(s) of variance: $$ \var[X] = \E[(X - \E[X])^2] = \E[X^2] - \E[X]^2 $$ and covariance: $$ \cov[X,Y] = \E[XY] - \E[X] \E[Y] $$ and correlation: $$ \cor[X,Y] = \frac{\cov[X,Y]}{\sqrt{\var[X]\var[Y]}} .$$
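The two forms of the variance definition agree exactly for sample moments too, since the identity is pure algebra; a quick numerical check (distribution chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(seed=8)  # seeded for reproducibility

X = rng.exponential(2.0, size=1_000_000)

# E[X^2] - E[X]^2, with expectations replaced by sample means,
# equals np.var (which uses ddof=0) up to floating-point error.
lhs = np.mean(X**2) - np.mean(X) ** 2
print(lhs, np.var(X))
```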
If $X$ and $Y$ are independent: $$ \E[XY] = \E[X] \E[Y] $$ and so $$ \var[X + Y] = \var[X] + \var[Y]. $$
Bilinearity of covariance: $$ \cov[a X + Y, Z] = a \cov[X, Z] + \cov[Y, Z] . $$
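Both facts can be seen in simulation: variance is additive for independent variables (approximately, since sample covariance is only near zero), while bilinearity of covariance holds exactly even for sample covariances, because it is an algebraic identity. A sketch, with arbitrarily chosen distributions:

```python
import numpy as np

rng = np.random.default_rng(seed=9)  # seeded for reproducibility

n = 1_000_000
X = rng.normal(1, 2, size=n)
Y = rng.exponential(3, size=n)        # independent of X
Z = X + rng.normal(0, 1, size=n)      # correlated with X
a = 2.5

# Independence: var[X + Y] ≈ var[X] + var[Y]
print(np.var(X + Y), np.var(X) + np.var(Y))

# Bilinearity: cov[aX + Y, Z] = a cov[X, Z] + cov[Y, Z] (exact)
lhs = np.cov(a * X + Y, Z)[0, 1]
rhs = a * np.cov(X, Z)[0, 1] + np.cov(Y, Z)[0, 1]
```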
Law of total variance: $$ \var[X] = \var[\E[X|Y]] + \E[\var[X|Y]] . $$
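The law of total variance can be checked on a hierarchical model; a sketch in which $Y \sim$ Exponential(mean 1) and $X \mid Y \sim$ Normal(mean $Y$, sd 1), so $\E[X|Y] = Y$, $\var[X|Y] = 1$, and the law predicts $\var[X] = \var[Y] + 1 = 2$:

```python
import numpy as np

rng = np.random.default_rng(seed=5)  # seeded for reproducibility

n = 1_000_000
y = rng.exponential(1.0, size=n)     # E[X|Y] = Y has variance 1
x = rng.normal(y, 1.0)               # var[X|Y] = 1 for every Y

# Law of total variance predicts var[X] = var[E[X|Y]] + E[var[X|Y]] = 1 + 1 = 2
print(x.var())
```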