The workhorse of every Bayesian model is the equation:
\[ posterior \propto likelihood \times prior \]
The likelihood is the probability of observing the data, given our current set of weights.
Suppose you have a fixed set of weights \(\vec w\).
For example, you have just initialized your weights by drawing them from a multivariate Gaussian distribution.
Now you want to know how likely it is that the data comes from a model specified by \(\vec w\). It helps to think of \(\vec w\) as a null hypothesis which you want to test. So you ask: how likely is it to observe this data under the null hypothesis \(\vec w\)?
For now, think of the likelihood as a probability. Technically, the likelihood does not have to be a probability (for continuous targets it is a density), but this doesn’t change the idea behind it.
Think of the dataset as a series of random events, like coin flips. Suppose you flip a coin 3 times and you observe HEAD, HEAD, TAILS. How likely is it that you see such a sequence under the null hypothesis that the coin is fair? It is the probability of observing HEAD, times the probability of observing HEAD, times the probability of observing TAILS.
\[ Pr(\text{Head}\rightarrow \text{Head}\rightarrow \text{Tails}|\text{fair coin})=0.5\times 0.5 \times 0.5=0.5^3 \]
Think of the results of the coin flips as our targets. The three coin flips correspond to targets \(t_1, t_2, t_3\). And we were looking for
\[ Pr(t_1 \cap t_2 \cap t_3 | \text{fair coin})=Pr(t_1| \text{fair coin})\times Pr(t_2| \text{fair coin}) \times Pr(t_3| \text{fair coin}) \]
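To make the arithmetic concrete, here is a minimal Python sketch (the variable names are mine) that multiplies the per-flip probabilities:

```python
# Probability of each outcome under the null hypothesis of a fair coin
flips = ["H", "H", "T"]          # the observed sequence HEAD, HEAD, TAILS
pr_fair = {"H": 0.5, "T": 0.5}   # fair coin: both outcomes equally likely

# Joint probability = product of the individual probabilities
likelihood = 1.0
for t in flips:
    likelihood *= pr_fair[t]

print(likelihood)  # 0.125, i.e. 0.5 ** 3
```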
We can apply this idea to our scenario. The likelihood of observing the kind of data that we observed under null hypothesis \(\vec w\) is given by:
\[ \begin{aligned} L(D|\vec w) =& \left(\text{probability of observing }t_1 \text{ given }\vec w \right)\\ & \times \left(\text{probability of observing }t_2 \text{ given }\vec w \right) \\ & \ \ \vdots \\ & \times \left(\text{probability of observing }t_N \text{ given }\vec w \right) \\ =& p(t_1|\vec w, \vec x_1) \times p(t_2|\vec w, \vec x_2) \times \dots \times p(t_N|\vec w, \vec x_N) \\ =& \prod_{i=1}^N p(t_i|\vec w, \vec x_i) \\ =& \prod_{i=1}^N \underbrace{N(t_i|\vec w^T\vec x_i, \alpha^{-1})}_{ \text{Gaussian density} } \\ \end{aligned} \]
\(N(t_i|\vec w^T \vec x_i, \alpha^{-1})\) is the density at the point \(t_i\) of the Gaussian distribution with mean \(\vec w^T \vec x_i\) and variance \(\alpha^{-1}\). In R, you would calculate that density with dnorm(x, mean, sd); note that dnorm expects the standard deviation, so you would pass sd = \(\alpha^{-1/2}\), the square root of the variance.
The likelihood tells us how likely it is to draw the data that we observed from the distribution \(p(t_i|\vec w, \vec x_i)\) for some fixed values of our model parameters \(\vec w\) (and precision \(\alpha\)).
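The same product-of-densities idea can be sketched in Python. The helper function and all numbers below are toy assumptions of mine, not from the text:

```python
import math

def gauss_density(t, mean, var):
    """Density at point t of a Gaussian with the given mean and variance."""
    return math.exp(-(t - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy dataset (x_i, t_i) and fixed weights -- illustrative numbers only
xs = [0.0, 1.0, 2.0]
ts = [0.5, 1.5, 2.8]
w = [0.41, 1.23]      # [intercept, slope]
alpha = 2.0           # precision, so the variance is 1 / alpha

# Likelihood = product of per-point Gaussian densities
likelihood = 1.0
for x, t in zip(xs, ts):
    mean = w[0] + w[1] * x          # w^T x_i with x_i = (1, x)
    likelihood *= gauss_density(t, mean, 1.0 / alpha)

print(likelihood)
```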
To make things easier, we usually work with the log-likelihood instead of the likelihood:
\[ \begin{aligned} LL(D|\vec w) =& \log L(D|\vec w) \\ =& \log \left(\prod_{i=1}^N p(t_i|\vec w, \vec x_i)\right) \\ =& \sum_{i=1}^N \log p(t_i|\vec w, \vec x_i) \\ =& \sum_{i=1}^N \log N(t_i|\vec w^T\vec x_i, \alpha^{-1}) \end{aligned} \]
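The step from the log of a product to a sum of logs is easy to verify numerically. A quick Python check, with made-up per-point densities:

```python
import math

# Pretend per-point densities p(t_i | w, x_i) -- arbitrary toy values
densities = [0.45, 0.30, 0.12]

log_of_product = math.log(densities[0] * densities[1] * densities[2])
sum_of_logs = sum(math.log(d) for d in densities)

print(log_of_product, sum_of_logs)  # numerically identical
```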
Note that the (log-)likelihood changes if your weights change.
Note: Sometimes people write \(L_{\vec w}(D)\), which you would read as “the likelihood function under the parametrization \(\vec w\)”.
data <- data.frame(y = ..., x = ...)   # observed targets and inputs
w <- c(0.41, 1.23)                     # c(intercept, slope)
alpha <- 2                             # precision of predictive distribution

loglik <- 0
for (i in 1:nrow(data)) {
  t_i <- data$y[i]
  x_i <- c(1, data$x[i])               # prepend a 1 for the intercept
  wTx_i <- sum(w * x_i)                # dot product w^T x_i
  # dnorm expects the standard deviation, i.e. sqrt of the variance 1/alpha
  density_i <- dnorm(t_i, mean = wTx_i, sd = sqrt(1 / alpha))
  loglik <- loglik + log(density_i)
}
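The same computation can be done without an explicit loop. Here is a vectorized NumPy sketch, with toy stand-ins for the elided data (my numbers, not from the text):

```python
import numpy as np

# Toy inputs and targets -- placeholders for the dataset above
x = np.array([0.0, 1.0, 2.0])
t = np.array([0.5, 1.5, 2.8])
w = np.array([0.41, 1.23])     # [intercept, slope]
alpha = 2.0                    # precision of the predictive distribution

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept column
mean = X @ w                                # w^T x_i for every i at once
var = 1.0 / alpha

# sum_i log N(t_i | mean_i, var), using the Gaussian log-density directly
loglik = np.sum(-0.5 * np.log(2 * np.pi * var) - (t - mean) ** 2 / (2 * var))
print(loglik)
```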