Review¶

Fall 2022: Peter Ralph

https://uodsci.github.io/dsci345

Uncertainty: (how to) deal with it¶

When we're doing data science, we

  • look at data
  • make predictions about future values
  • infer aspects of the underlying process

Fundamental to all stages are randomness and uncertainty.

For instance: randomized algorithms (e.g., stochastic gradient descent).

For another instance:

Computing a statistic gives you a number that describes a data set.

Doing statistics helps you understand how reliable that description is and how well it applies to the wider world.

We understand uncertainty, conceptually and quantitatively, with randomness, i.e., through probability.

Goals of this class¶

  • Become familiar with different types of probability models.

  • Calculate properties of probability models.

  • Construct and simulate from realistic models of data generation.

  • Be able to test estimation and prediction methods with simulation.

  • Gain familiarity with fundamental statistical concepts.

We'll spend a lot of time on probability models, for applications from classical statistics to machine learning.

Tools: Distributions¶

  • Uniform
  • Exponential
  • Normal
  • multivariate Normal
  • Binomial
  • Gamma
  • Student's $t$
  • Beta
  • Cauchy

Probability and Random Variables¶

  • probability rules: like area
  • rules for means and variances
  • algebra with random variables: $$ \mathbb{E}[f(X)] = \sum_x \mathbb{P}\{X=x\} f(x) $$
  • independence
  • means minimize sum-of-squared-errors
  • covariance and correlation
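
As a small illustration of two items above, here is a sketch that evaluates $\mathbb{E}[f(X)]$ directly from the sum formula and checks numerically that the mean minimizes the sum of squared errors. The fair die, the toy data, and the names `pmf`, `expectation`, and `sse` are assumptions made up for this example.

```python
from fractions import Fraction

# Hypothetical example: X is the roll of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(f, pmf):
    """E[f(X)] = sum over x of P{X = x} * f(x), for a discrete X."""
    return sum(p * f(x) for x, p in pmf.items())

mean_x = expectation(lambda x: x, pmf)                  # E[X]
var_x = expectation(lambda x: (x - mean_x) ** 2, pmf)   # Var[X] = E[(X - E[X])^2]

def sse(data, c):
    """Sum of squared errors when every data point is predicted by the constant c."""
    return sum((y - c) ** 2 for y in data)

# Toy data (made up): compare the SSE at the mean to the SSE at nearby values.
data = [2.0, 3.0, 7.0, 8.0]
data_mean = sum(data) / len(data)
candidates = [data_mean + d for d in (-1.0, -0.1, 0.0, 0.1, 1.0)]
best = min(candidates, key=lambda c: sse(data, c))

print("E[X] =", mean_x, " Var[X] =", var_x)
print("candidate with smallest SSE:", best, " sample mean:", data_mean)
```

Using `Fraction` keeps the die calculation exact; with floating-point probabilities the same function works but returns rounded values.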

Modeling:¶

  • Central Limit Theorem: sums of little things get you Normals
  • rare events get you Poisson counts and Exponential waiting times
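
Both heuristics above can be checked by simulation. The following is a rough sketch, not a proof: the seed, the sample sizes, and the choice of Uniform summands and of independent trials with a small success probability are all assumptions picked for illustration.

```python
import random
import statistics

rng = random.Random(345)  # arbitrary seed, for reproducibility only

# Central Limit Theorem: a sum of many small independent terms looks Normal.
n_terms = 50
sums = [sum(rng.random() for _ in range(n_terms)) for _ in range(5000)]
theory_mean = n_terms / 2            # each Uniform(0, 1) term has mean 1/2
theory_sd = (n_terms / 12) ** 0.5    # and variance 1/12
clt_mean = statistics.fmean(sums)
clt_sd = statistics.stdev(sums)
# For a Normal, about 68.3% of the mass lies within one SD of the mean.
frac_within_1sd = sum(abs(s - theory_mean) <= theory_sd for s in sums) / len(sums)

# Rare events: many trials, each with a small chance of an event.
n_trials, p = 500, 0.006             # expected number of events: 3
counts = [sum(rng.random() < p for _ in range(n_trials)) for _ in range(2000)]
count_mean = statistics.fmean(counts)
count_var = statistics.variance(counts)   # Poisson: variance is close to the mean

def wait_for_event():
    """Number of trials until the first event occurs."""
    t = 1
    while rng.random() >= p:
        t += 1
    return t

waits = [wait_for_event() for _ in range(2000)]
wait_mean = statistics.fmean(waits)
wait_sd = statistics.stdev(waits)         # Exponential: SD is close to the mean

print(f"sum of uniforms: mean {clt_mean:.2f}, sd {clt_sd:.2f}, within 1 sd {frac_within_1sd:.3f}")
print(f"event counts: mean {count_mean:.2f}, variance {count_var:.2f}")
print(f"waiting times: mean {wait_mean:.1f}, sd {wait_sd:.1f}")
```

The summaries compared here (mean versus variance for counts, mean versus SD for waiting times) are only quick signatures of the Poisson and Exponential distributions; a histogram or Q-Q plot would be a fuller check.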

Linear models:

  • the usual
  • Poisson-exponential
  • Binomial-logistic
  • robust
  • nonlinear linear models

Model fitting:¶

  • method of moments
  • maximum likelihood
  • penalized (or, regularized) likelihood
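
To make the first two fitting methods concrete, here is a minimal sketch comparing them on a model where they give different answers: data from a Uniform$(0, \theta)$ distribution. The model, the true value `theta_true`, the seed, and the sample size are assumptions chosen for illustration; penalized likelihood is not shown.

```python
import random

rng = random.Random(1)  # arbitrary seed
theta_true = 10.0       # hypothetical true parameter
sample = [rng.uniform(0, theta_true) for _ in range(200)]

# Method of moments: the mean of Uniform(0, theta) is theta / 2,
# so set the sample mean equal to theta / 2 and solve for theta.
theta_mom = 2 * sum(sample) / len(sample)

# Maximum likelihood: the likelihood is theta**(-n) when theta >= max(sample)
# and zero otherwise, so it is maximized at the largest observation.
theta_mle = max(sample)

print(f"method of moments: {theta_mom:.3f}")
print(f"maximum likelihood: {theta_mle:.3f}")
```

Note that the maximum likelihood estimate can never exceed the true value here, so it is biased downward, while the method of moments estimate is unbiased but more variable.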

Statistics¶

  • sampling distributions
  • standard errors
  • the $t$ distribution
  • confidence intervals
  • heteroskedasticity
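
The first few ideas in this list can be sketched by simulation: draw many data sets, look at the spread of the sample mean across them (its sampling distribution), and compare to the standard error $\sigma/\sqrt{n}$. The Normal model and the values of `mu`, `sigma`, and `n` are assumptions for this example. Because the Python standard library has no $t$ quantile function, the interval below uses a Normal quantile as an approximation, which makes it slightly narrower than the $t$-based interval with $n - 1$ degrees of freedom.

```python
import random
import statistics

rng = random.Random(2022)      # arbitrary seed
mu, sigma, n = 5.0, 2.0, 25    # hypothetical true mean, SD, and sample size

def sample_mean():
    """Mean of one simulated data set of size n."""
    return statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])

# Sampling distribution of the mean: its SD should be close to sigma / sqrt(n).
means = [sample_mean() for _ in range(4000)]
empirical_se = statistics.stdev(means)
theoretical_se = sigma / n ** 0.5

# In practice we have one data set, and estimate the standard error from it.
data = [rng.gauss(mu, sigma) for _ in range(n)]
xbar = statistics.fmean(data)
est_se = statistics.stdev(data) / n ** 0.5

# Approximate 95% confidence interval (Normal quantile in place of the t quantile).
z = statistics.NormalDist().inv_cdf(0.975)
ci = (xbar - z * est_se, xbar + z * est_se)

print(f"SD of simulated means: {empirical_se:.3f} (theory: {theoretical_se:.3f})")
print(f"one data set: mean {xbar:.2f}, estimated SE {est_se:.3f}")
print(f"approximate 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```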

Methods¶

  • PCA
  • diagnostics for linear models
  • transformations
  • crossvalidation
  • the bootstrap
  • the other bootstrap
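
As a reminder of how the bootstrap works, here is a minimal sketch of the nonparametric version: resample the data with replacement, recompute the statistic each time, and use the spread of the results as a standard error. The simulated Exponential data, the seed, the number of resamples, and the helper name `bootstrap_reps` are assumptions for illustration; this shows only one of the two bootstraps listed above.

```python
import random
import statistics

rng = random.Random(7)  # arbitrary seed

# Hypothetical skewed data set, simulated for this example.
data = [rng.expovariate(1.0) for _ in range(100)]

def bootstrap_reps(data, stat, n_boot=2000):
    """Recompute `stat` on n_boot resamples of `data` drawn with replacement."""
    return [stat(rng.choices(data, k=len(data))) for _ in range(n_boot)]

reps = bootstrap_reps(data, statistics.median)

# Bootstrap standard error: the SD of the statistic across resamples.
se_median = statistics.stdev(reps)

# Percentile interval: the middle 95% of the bootstrap replicates.
reps_sorted = sorted(reps)
ci = (reps_sorted[int(0.025 * len(reps))], reps_sorted[int(0.975 * len(reps)) - 1])

print(f"sample median: {statistics.median(data):.3f}")
print(f"bootstrap SE of the median: {se_median:.3f}")
print(f"95% percentile interval: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The median is used here because, unlike the mean, it has no simple formula for its standard error, which is where resampling is most useful.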