Review
- K-NN Algorithm: if points are close, they likely have similar labels
- Perceptron: Linearly Separable Data
Estimating Probabilities from data
INTRO
- Bayes Optimal Classifier
- known: $P(\bold X, Y)$
- choose $y = h_{opt}(\bold x) = \argmax\limits_y P(Y = y \mid \bold X = \bold x)$
- Goal of Machine Learning
    - find $P_\theta(\bold X, Y)$ → derive $P_\theta(Y\mid\bold X)$ from it : Generative Learning
        - $P(\bold X, Y) = P(\bold X\mid Y)\, P(Y)$ : estimating $P(y)$ and $P(\mathbf x \mid y)$
        - the other factorization is $P(\bold X, Y) = P(Y\mid\bold X)\, P(\bold X)$
    - find $P_\theta(Y\mid\bold X)$ directly : Discriminative Learning
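- why the generative factorization is enough for the Bayes optimal prediction (a standard Bayes-rule step, spelled out here for completeness): $\argmax\limits_y P(Y = y\mid \bold x) = \argmax\limits_y \frac{P(\bold x\mid Y = y)\,P(Y = y)}{P(\bold x)} = \argmax\limits_y P(\bold x\mid Y = y)\,P(Y = y)$, since $P(\bold x)$ does not depend on $y$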
- 🧐 How to Estimate Probabilities from Data?
- ESTIMATION LOGIC
- GOAL: $P(X)$, $P(X,Y)$, $\cdots$
- MEANS: estimating $P_\theta(X, Y)$, i.e. choosing $\theta$
    - Maximum Likelihood Estimation (MLE): $\theta = \argmax\limits_\theta P(D;\theta)$
    - Maximum A Posteriori Estimation (MAP): $\theta = \argmax\limits_\theta P(\theta\mid D)$
MLE
by Frequentists
- MLE Prediction: $\hat{\bold\theta}_{MLE} = \argmax\limits_{\bold\theta} P(D; \theta)$
    - where $D = \{(\bold{X_1}, Y_1), \cdots, (\bold{X_N}, Y_N)\}$ and $\theta$ is purely a model parameter (an unknown constant)
- THOUGHTS on Parameters: Random Variable?
    - no probabilistic event, no sample space ∴ not a Random Variable, but an UNKNOWN CONSTANT
- Example (coin toss with $n_H$ heads and $n_T$ tails)
    - model: $P_\theta(X = H) = \theta$ : each outcome follows a Bernoulli distribution
    - find the $\theta$ that maximizes $P(D; \theta)$ : the independent Bernoulli outcomes follow a Binomial distribution
        - $P(D; \theta) = P(X_1, \cdots, X_N; \theta) = \binom{n_H + n_T}{n_H} \theta^{n_H} (1 - \theta)^{n_T}$
        - equivalently, maximize $\log P(D; \theta)$ (the log is monotonic) and set the derivative w.r.t. $\theta$ to zero
        - ∴ $\hat\theta_{MLE} = \frac{n_H}{n_H + n_T}$
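A minimal sketch (not from the original notes; the counts are made up for illustration) that checks the closed-form MLE against a brute-force search over the log-likelihood:

```python
import numpy as np

# illustrative coin-flip counts (hypothetical data)
n_H, n_T = 7, 3

# closed-form MLE from the derivation above
theta_mle = n_H / (n_H + n_T)

# brute-force check: log-likelihood is n_H*log(theta) + n_T*log(1 - theta)
# (the binomial coefficient is constant in theta, so it can be dropped)
grid = np.linspace(0.001, 0.999, 999)
log_lik = n_H * np.log(grid) + n_T * np.log(1.0 - grid)
theta_grid = grid[np.argmax(log_lik)]

print(theta_mle, theta_grid)  # both ≈ 0.7
```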
- Problem: MLE can overfit the data when $n$ is small
    - Solution (Frequentist): smoothing with hallucinated samples, $m$ per outcome: $\hat{\theta} = \frac{n_H + m}{n_H + n_T + 2m}$
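- a worked check (numbers made up for illustration): with $n_H = 3$, $n_T = 0$ the MLE is $\hat\theta_{MLE} = 1$, whereas smoothing with $m = 1$ gives $\hat\theta = \frac{3 + 1}{3 + 0 + 2} = 0.8$, a far less extreme estimate; $m = 1$ is the familiar add-one (Laplace) smoothing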
MAP
by Bayesians
- MAP Prediction: $\hat{\bold\theta}_{MAP} = \argmax\limits_{\bold\Theta} P(\bold \Theta\mid D)$
    - where $D = \{(\bold{X_1}, Y_1), \cdots, (\bold{X_N}, Y_N)\}$ and $\bold \Theta$ is a random variable
- THOUGHTS on Parameters: Random Variable!
    - I don't care about "no probabilistic event, no sample space" ∴ a Random Variable
- Bayes Rule: $P(\Theta \mid D) = \frac{P(D\mid \Theta) P(\Theta)}{P(D)}$
- $P(D \mid \bold \Theta )$: Likelihood
- $P(\bold \Theta)$: Prior, encodes your belief (it does not come from the samples)
- $P(\bold \Theta\mid D )$: Posterior
- TRUE BAYESIAN WAY
- $P(Y\mid X, D) = \int_{\theta}P(Y,\theta \mid X, D) d\theta = \int_{\theta} P(Y \mid X, \theta, D) P(\theta | D) d\theta$
- REFERENCE
- INTERPRETATION
- generally: intractable in closed form
- exceptions
- Gaussian Process
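- one tractable instance, spelled out here for concreteness (it uses the Beta prior from the example below): for the coin model the posterior over $\theta$ is again a Beta distribution, and the integral has the closed form $P(X = H \mid D) = \frac{n_H + \alpha}{n_H + n_T + \alpha + \beta}$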
- Example
- model: $P_\theta(X = H) = \theta$, as in the MLE example
- find $\hat\theta_{MAP} = \argmax\limits_\theta P(\Theta = \theta \mid D)$
    - Bayes rule: $P(\Theta \mid D) \propto P(D \mid \Theta) P(\Theta)$
    - $P(D \mid \Theta) = P(X_1, \cdots, X_N\mid \Theta) = \binom{n_H + n_T}{n_H} \Theta^{n_H} (1 - \Theta)^{n_T}$
    - $P(\Theta = \theta) = \frac{\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}$ : a natural choice for the prior $P(\Theta)$ is the Beta distribution, since it has the same functional form as the likelihood
    - so that $P(\Theta \mid D) \propto \Theta^{n_H + \alpha -1} (1 - \Theta)^{n_T + \beta -1}$
- find the $\theta$ that maximizes $P(\Theta \mid D)$
    - equivalently, maximize $\log P(\Theta \mid D) = \log[P(D\mid\Theta)] + \log[P(\Theta)] - \log[P(D)]$, where the last term does not depend on $\Theta$
    - compared to MLE, we only add the term $\log[P(\Theta)]$ to the objective
    - ∴ $\hat\theta_{MAP} = \frac{n_H+\alpha -1 }{n_H + n_T + \alpha + \beta - 2}$
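A minimal sketch (not from the original notes; the counts and hyperparameters are made up) comparing the MAP estimate above with the MLE when data is scarce:

```python
# illustrative data: only two flips, both heads -> the MLE overfits to 1.0
n_H, n_T = 2, 0
# hypothetical Beta(2, 2) prior, peaked at 0.5
alpha, beta = 2.0, 2.0

theta_mle = n_H / (n_H + n_T)
theta_map = (n_H + alpha - 1) / (n_H + n_T + alpha + beta - 2)

print(theta_mle)  # 1.0
print(theta_map)  # (2 + 1) / (2 + 2) = 0.75 -- pulled toward the prior mean 0.5
```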
Summary
In supervised ML, you are provided with training data $D$. You use this data to train a model, represented by its parameters $\theta$. With this model, you want to make predictions on a test point $x_t$.
- MLE Prediction: $P(y\mid x_t; \theta)$
- Learning: $\theta = \argmax\limits_\theta P(D;\theta)$. Here $\theta$ is purely a model parameter.
- Maximize $\log [P(D;\theta)]$
- MAP Prediction: $P(y\mid x_t, \theta)$
- Learning: $\theta = \argmax\limits_\theta P(\theta\mid D) \propto P(D\mid \theta)P(\theta)$. Here $\theta$ is a random variable.
- Maximize $\log [P(\theta \mid D)] = \log[P(D \mid \theta)] + \log[P(\theta)]$
- the term $\log[P(\theta)]$ is independent of the data and penalizes values of $\theta$ that are unlikely under the prior → it acts as a regularizer
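A minimal sketch (illustrative numbers, not from the original notes) of this regularization view: the prior term is fixed, so its influence fades as more data arrives and the MAP estimate approaches the MLE.

```python
# hypothetical Beta(5, 5) prior centered at 0.5 (a fairly strong belief)
alpha, beta = 5.0, 5.0

for n in (10, 100, 10_000):
    n_H = int(0.9 * n)            # pretend the data has a 90% heads rate
    n_T = n - n_H
    theta_mle = n_H / n
    theta_map = (n_H + alpha - 1) / (n + alpha + beta - 2)
    print(n, round(theta_mle, 3), round(theta_map, 3))

# MAP: ≈0.722, ≈0.870, ≈0.900 -- pulled toward 0.5 at first, then ≈ MLE (0.9)
```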