Review
- K-NN Algorithm: if points are close, they likely have similar labels
- Perceptron: Linearly Separable Data
Estimating Probabilities from data
INTRO
- Bayes Optimal Classifier
- known: $P(\bold X, Y)$
- choose $y = h_{opt}(\bold x) = \argmax\limits_y P(Y = y \mid \bold X = \bold x)$
- Goal of Machine Learning
    - find $P_\theta(\bold X, Y)$ → derive $P_\theta(Y\mid\bold X)$ from it : Generative Learning
        - $P(\bold X, Y) = P(\bold X\mid Y)\, P(Y)$ : estimating $P(y)$ and $P(\mathbf x \mid y)$
        - the other factorization is $P(\bold X, Y) = P(Y\mid\bold X)\, P(\bold X)$
    - find $P_\theta(Y\mid\bold X)$ directly : Discriminative Learning
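- why the generative factorization is enough for the Bayes optimal prediction (a standard Bayes-rule step, spelled out here for completeness): $\argmax\limits_y P(Y = y\mid \bold x) = \argmax\limits_y \frac{P(\bold x\mid Y = y)\,P(Y = y)}{P(\bold x)} = \argmax\limits_y P(\bold x\mid Y = y)\,P(Y = y)$, since $P(\bold x)$ does not depend on $y$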
- 🧐 How to Estimate Probabilities from Data?
- ESTIMATION LOGIC
- GOAL: $P(X)$, $P(X,Y)$, $\cdots$
- MEANS: estimating $P_\theta(X, Y)$, i.e. choosing $\theta$
    - Maximum Likelihood Estimation (MLE): $\theta = \argmax\limits_\theta P(D;\theta)$
    - Maximum A Posteriori Estimation (MAP): $\theta = \argmax\limits_\theta P(\theta\mid D)$
MLE
by Frequentists
- MLE Prediction: $\hat{\bold\theta}_{MLE} = \argmax\limits_{\bold\theta} P(D; \theta)$
    - where $D = \{(\bold{X_1}, Y_1), \cdots, (\bold{X_N}, Y_N)\}$ and $\theta$ is purely a model parameter (an unknown constant)
- THOUGHTS on Parameters: Random Variable?
    - no probabilistic event, no sample space ∴ not a Random Variable, but an UNKNOWN CONSTANT
- Example (coin toss with $n_H$ heads and $n_T$ tails)
    - model: $P_\theta(X = H) = \theta$ : each outcome follows a Bernoulli distribution
    - find the $\theta$ that maximizes $P(D; \theta)$ : the independent Bernoulli outcomes follow a Binomial distribution
        - $P(D; \theta) = P(X_1, \cdots, X_N; \theta) = \binom{n_H + n_T}{n_H} \theta^{n_H} (1 - \theta)^{n_T}$
        - equivalently, maximize $\log P(D; \theta)$ (the log is monotonic) and set the derivative w.r.t. $\theta$ to zero
        - ∴ $\hat\theta_{MLE} = \frac{n_H}{n_H + n_T}$
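A minimal sketch (not from the original notes; the counts are made up for illustration) that checks the closed-form MLE against a brute-force search over the log-likelihood:

```python
import numpy as np

# illustrative coin-flip counts (hypothetical data)
n_H, n_T = 7, 3

# closed-form MLE from the derivation above
theta_mle = n_H / (n_H + n_T)

# brute-force check: log-likelihood is n_H*log(theta) + n_T*log(1 - theta)
# (the binomial coefficient is constant in theta, so it can be dropped)
grid = np.linspace(0.001, 0.999, 999)
log_lik = n_H * np.log(grid) + n_T * np.log(1.0 - grid)
theta_grid = grid[np.argmax(log_lik)]

print(theta_mle, theta_grid)  # both ≈ 0.7
```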
- Problem: MLE can overfit the data when $n$ is small
    - Solution (Frequentist): smoothing with hallucinated samples, $m$ per outcome: $\hat{\theta} = \frac{n_H + m}{n_H + n_T + 2m}$
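- a worked check (numbers made up for illustration): with $n_H = 3$, $n_T = 0$ the MLE is $\hat\theta_{MLE} = 1$, whereas smoothing with $m = 1$ gives $\hat\theta = \frac{3 + 1}{3 + 0 + 2} = 0.8$, a far less extreme estimate; $m = 1$ is the familiar add-one (Laplace) smoothing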
MAP
by Bayesians
- MAP Prediction: $\hat{\bold\theta}_{MAP} = \argmax\limits_{\bold\Theta} P(\bold \Theta\mid D)$
    - where $D = \{(\bold{X_1}, Y_1), \cdots, (\bold{X_N}, Y_N)\}$ and $\bold \Theta$ is a random variable
- THOUGHTS on Parameters: Random Variable!
    - I don't care about "no probabilistic event, no sample space" ∴ a Random Variable
- Bayes Rule: $P(\Theta \mid D) = \frac{P(D\mid \Theta) P(\Theta)}{P(D)}$
- $P(D \mid \bold \Theta )$: Likelihood
- $P(\bold \Theta)$: Prior, encodes your belief (it does not come from the samples)
- $P(\bold \Theta\mid D )$: Posterior
- TRUE BAYESIAN WAY
- $P(Y\mid X, D) = \int_{\theta}P(Y,\theta \mid X, D) d\theta = \int_{\theta} P(Y \mid X, \theta, D) P(\theta | D) d\theta$
- REFERENCE
- INTERPRETATION
- generally: intractable in closed form
- exceptions
- Gaussian Process
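- one tractable instance, spelled out here for concreteness (it uses the Beta prior from the example below): for the coin model the posterior over $\theta$ is again a Beta distribution, and the integral has the closed form $P(X = H \mid D) = \frac{n_H + \alpha}{n_H + n_T + \alpha + \beta}$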
- Example
- model: $P_\theta(X = H) = \theta$, as in the MLE example
- find $\hat\theta_{MAP} = \argmax\limits_\theta P(\Theta = \theta \mid D)$
    - Bayes rule: $P(\Theta \mid D) \propto P(D \mid \Theta) P(\Theta)$
    - $P(D \mid \Theta) = P(X_1, \cdots, X_N\mid \Theta) = \binom{n_H + n_T}{n_H} \Theta^{n_H} (1 - \Theta)^{n_T}$
    - $P(\Theta = \theta) = \frac{\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}$ : a natural choice for the prior $P(\Theta)$ is the Beta distribution, since it has the same functional form as the likelihood
    - so that $P(\Theta \mid D) \propto \Theta^{n_H + \alpha -1} (1 - \Theta)^{n_T + \beta -1}$
- find the $\theta$ that maximizes $P(\Theta \mid D)$
    - equivalently, maximize $\log P(\Theta \mid D) = \log[P(D\mid\Theta)] + \log[P(\Theta)] - \log[P(D)]$, where the last term does not depend on $\Theta$
    - compared to MLE, we only add the term $\log[P(\Theta)]$ to the objective
    - ∴ $\hat\theta_{MAP} = \frac{n_H+\alpha -1 }{n_H + n_T + \alpha + \beta - 2}$
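A minimal sketch (not from the original notes; the counts and hyperparameters are made up) comparing the MAP estimate above with the MLE when data is scarce:

```python
# illustrative data: only two flips, both heads -> the MLE overfits to 1.0
n_H, n_T = 2, 0
# hypothetical Beta(2, 2) prior, peaked at 0.5
alpha, beta = 2.0, 2.0

theta_mle = n_H / (n_H + n_T)
theta_map = (n_H + alpha - 1) / (n_H + n_T + alpha + beta - 2)

print(theta_mle)  # 1.0
print(theta_map)  # (2 + 1) / (2 + 2) = 0.75 -- pulled toward the prior mean 0.5
```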
Summary
In supervised ML, you are provided with training data $D$. You use this data to train a model, represented by its parameters $\theta$. With this model, you want to make predictions on a test point $x_t$.
- MLE Prediction: $P(y\mid x_t; \theta)$
- Learning: $\theta = \argmax\limits_\theta P(D;\theta)$. Here $\theta$ is purely a model parameter.
- Maximize $\log [P(D;\theta)]$
- MAP Prediction: $P(y\mid x_t, \theta)$
- Learning: $\theta = \argmax\limits_\theta P(\theta\mid D) \propto P(D\mid \theta)P(\theta)$. Here $\theta$ is a random variable.
- Maximize $\log [P(\theta \mid D)] = \log[P(D \mid \theta)] + \log[P(\theta)]$
- the term $\log[P(\theta)]$ is independent of the data and penalizes values of $\theta$ that are unlikely under the prior → it acts as a regularizer
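A minimal sketch (illustrative numbers, not from the original notes) of this regularization view: the prior term is fixed, so its influence fades as more data arrives and the MAP estimate approaches the MLE.

```python
# hypothetical Beta(5, 5) prior centered at 0.5 (a fairly strong belief)
alpha, beta = 5.0, 5.0

for n in (10, 100, 10_000):
    n_H = int(0.9 * n)            # pretend the data has a 90% heads rate
    n_T = n - n_H
    theta_mle = n_H / n
    theta_map = (n_H + alpha - 1) / (n + alpha + beta - 2)
    print(n, round(theta_mle, 3), round(theta_map, 3))

# MAP: ≈0.722, ≈0.870, ≈0.900 -- pulled toward 0.5 at first, then ≈ MLE (0.9)
```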