Linear Regression

Linear function:

$$
\begin{align*}
f(\bm{x}; \bm{w}) &\doteq {\color{magenta} w_0} + w_1 x_1 + \ldots + w_d x_d \\
&= \bm{w} \cdot \bm{x}
\end{align*}
$$

Note that we set $x_0 = 1$, resulting in $\bm{x} \in \mathbb{R}^{d+1}$ and $\bm{w} \in \mathbb{R}^{d+1}$.
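As a quick sanity check, here is a minimal NumPy sketch of this linear function using the bias trick; the names and numbers below are made up for illustration.

```python
import numpy as np

def linear_model(x, w):
    """Evaluate f(x; w) = w . x, where x already includes the bias entry x_0 = 1."""
    return np.dot(w, x)

# Toy example with d = 2 features.
x_raw = np.array([3.0, -1.0])        # original feature vector in R^d
x = np.concatenate(([1.0], x_raw))   # prepend x_0 = 1, giving a vector in R^(d+1)
w = np.array([0.5, 2.0, -1.0])       # weights [w_0, w_1, w_2]

print(linear_model(x, w))            # 0.5 + 2.0*3.0 + (-1.0)*(-1.0) = 7.5
```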

Given datapoints $X$ and labels $\bm{y}$:

$$
\begin{align*}
\bm{x}_1, \bm{x}_2, \ldots, \bm{x}_N &\in X && (\bm{x}_i \in \mathbb{R}^{d+1}) \\
y_1, y_2, \ldots, y_N &\in \bm{y} && (y_i \in \mathbb{R})
\end{align*}
$$

Squared loss $L(\bm{w}, X, \bm{y})$ can be defined as:

$$
\begin{align*}
L(\bm{w}, X, \bm{y}) &\doteq \frac{1}{N} \sum_{i=1}^N \left( y_i - f(\bm{x}_i; \bm{w}) \right)^2 \\
&= \frac{1}{N} \sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right)^2
\end{align*}
$$
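In NumPy, this loss might look like the following sketch (the tiny dataset is made up and chosen so the loss is exactly zero):

```python
import numpy as np

def squared_loss(w, X, y):
    """Mean squared loss: (1/N) * sum_i (y_i - w . x_i)^2."""
    residuals = y - X @ w        # X @ w computes w . x_i for every row x_i
    return np.mean(residuals ** 2)

# Rows of X already contain the leading 1 (bias trick); d = 1 here.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
w = np.array([1.0, 2.0])         # y = 1 + 2x fits the data exactly

print(squared_loss(w, X, y))     # 0.0
```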

Since this loss is convex, the first-order necessary condition for the minimum is

$$
\frac{\partial}{\partial \bm{w}} L(\bm{w}, X, \bm{y}) = \bm{0}
$$

Let's calculate this for $w_0$ first:

$$
\begin{align*}
\frac{\partial}{\partial w_0} L(\bm{w}) &= \frac{\partial}{\partial w_0} \Big[ \frac{1}{N} \sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right)^2 \Big] \\
&= -\frac{2}{N} \sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right) = 0
\end{align*}
$$

Therefore, a necessary condition for $w_0$ is:

$$
\sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right) = 0
$$

Interpretation: the average of the residuals (also called errors) is zero.

In the same way, for $w_j$ ($j \neq 0$):

$$
\begin{align*}
\frac{\partial}{\partial w_j} L(\bm{w}) &= \frac{\partial}{\partial w_j} \Big[ \frac{1}{N} \sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right)^2 \Big] \\
&= -\frac{2}{N} \sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right) x_{ij} = 0
\end{align*}
$$

Hence, another necessary condition for $w_j$ ($j \neq 0$) is:

$$
\sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right) x_{ij} = 0
$$

Interpretation: the residuals (or errors) are uncorrelated with the data.

In summary:

$$
\left\{ \,
\begin{aligned}
&\sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right) \hphantom{x_{ij}} = 0 \\
&\sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right) x_{ij} = 0 ~~ (j = 1, \ldots, d)
\end{aligned}
\right.
$$

Least squares in matrix form

$$
\begin{align*}
X &\doteq \begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1d} \\
1 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N1} & x_{N2} & \cdots & x_{Nd}
\end{bmatrix}
= \begin{bmatrix}
\text{------} & \bm{x}_1^T & \text{------} \\
\text{------} & \bm{x}_2^T & \text{------} \\
\vdots & \vdots & \vdots \\
\text{------} & \bm{x}_N^T & \text{------}
\end{bmatrix} \\
\bm{y} &\doteq \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \quad
\bm{w} \doteq \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix}
\end{align*}
$$

Note that $\bm{w}$ includes the bias $w_0$, matching the leading column of ones in $X$.
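A sketch of how we might assemble these arrays in NumPy; the raw features and labels below are invented purely for illustration.

```python
import numpy as np

# Raw features (N = 4 datapoints, d = 2 features) and labels -- invented numbers.
X_raw = np.array([[0.0, 1.0],
                  [1.0, 0.0],
                  [2.0, 3.0],
                  [3.0, 5.0]])
y = np.array([1.0, 2.0, 4.0, 7.0])

# Design matrix: prepend a column of ones so that w = [w_0, w_1, ..., w_d].
N = X_raw.shape[0]
X = np.hstack([np.ones((N, 1)), X_raw])

print(X.shape)   # (4, 3) -> N rows, d + 1 columns
```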

Now, how can we express the errors (residuals)?

$$
\bm{y} - X \bm{w}
$$

Hence, the loss $L(\bm{w})$ can be expressed as:

$$
L(\bm{w}) = \frac{1}{N} \left( \bm{y} - X \bm{w} \right)^T \left( \bm{y} - X \bm{w} \right)
$$
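As a quick, illustrative check (with random data) that the matrix form agrees with the earlier per-example sum:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # design matrix with bias column
y = rng.normal(size=N)
w = rng.normal(size=d + 1)

r = y - X @ w                              # residual vector y - Xw
loss_matrix = (r @ r) / N                  # (1/N) (y - Xw)^T (y - Xw)
loss_sum = np.mean((y - X @ w) ** 2)       # (1/N) sum_i (y_i - w . x_i)^2

print(np.isclose(loss_matrix, loss_sum))   # True
```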

Let's find the minimum of $L(\bm{w})$. Again, we use $\frac{\partial}{\partial \bm{w}} L(\bm{w}) = \bm{0}$.
Also note that, given two vectors $\bm{a}, \bm{b} \in \mathbb{R}^d$, $\bm{a}^T \bm{b} = \bm{b}^T \bm{a}$.

$$
\begin{align*}
\frac{\partial}{\partial \bm{w}} L(\bm{w})
&= \frac{1}{N} \frac{\partial}{\partial \bm{w}} \big[ \bm{y}^T \bm{y} - \bm{y}^T X \bm{w} - \bm{w}^T X^T \bm{y} + \bm{w}^T X^T X \bm{w} \big] \\
&= \frac{1}{N} \frac{\partial}{\partial \bm{w}} \big[ \bm{y}^T \bm{y} - (\bm{y}^T X) \bm{w} - \bm{w}^T (X^T \bm{y}) + \bm{w}^T X^T X \bm{w} \big] ~~ (\text{associativity}) \\
&= \frac{1}{N} \frac{\partial}{\partial \bm{w}} \big[ \bm{y}^T \bm{y} - {\color{magenta} \bm{w}^T (\bm{y}^T X)^T} - \bm{w}^T (X^T \bm{y}) + \bm{w}^T X^T X \bm{w} \big] ~~ (\text{dot product in the form of matrix mul.}) \\
&= \frac{1}{N} \frac{\partial}{\partial \bm{w}} \big[ \bm{y}^T \bm{y} - {\color{magenta} \bm{w}^T (X^T \bm{y})} - \bm{w}^T (X^T \bm{y}) + \bm{w}^T X^T X \bm{w} \big] \\
&= \frac{1}{N} \frac{\partial}{\partial \bm{w}} \big[ \bm{y}^T \bm{y} - 2 \bm{w}^T (X^T \bm{y}) + \bm{w}^T X^T X \bm{w} \big]
\end{align*}
$$

Now, we have these simple derivative rules:

$$
\frac{\partial}{\partial \bm{w}} \bm{w}^T \bm{a} = \bm{a}, \qquad \frac{\partial}{\partial \bm{w}} \bm{w}^T A \bm{w} = (A + A^T) \bm{w}
$$
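These rules can be verified numerically; below is an illustrative finite-difference check, where `num_grad` is a hypothetical helper, not a library function.

```python
import numpy as np

def num_grad(f, w, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at w."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
d = 4
w = rng.normal(size=d)
a = rng.normal(size=d)
A = rng.normal(size=(d, d))

# d/dw (w^T a) = a
print(np.allclose(num_grad(lambda v: v @ a, w), a))
# d/dw (w^T A w) = (A + A^T) w
print(np.allclose(num_grad(lambda v: v @ A @ v, w), (A + A.T) @ w))
```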

Using these properties,

$$
\begin{align*}
\frac{\partial}{\partial \bm{w}} L(\bm{w})
&= \frac{1}{N} \big[ - 2 X^T \bm{y} + 2 X^T X \bm{w} \big] ~~ (\because X^T X \text{ is symmetric}) \\
&= -\frac{2}{N} \big[ X^T \bm{y} - X^T X \bm{w} \big] = \bm{0}
\end{align*}
$$

Therefore,

$$
X^T \bm{y} - X^T X \bm{w} = \bm{0}.
$$

We can also write this as follows, which recovers the conditions we derived before moving to matrix notation.

$$
X^T \left( \bm{y} - X \bm{w} \right) = \bm{0}.
$$

Solving for $\bm{w}$ (assuming $X^T X$ is invertible) gives us

$$
\bm{w}^* = \underbrace{{\color{magenta} \left( X^T X \right)^{-1} X^T}}_{X^\dagger \text{: Moore-Penrose pseudoinverse}} \bm{y}
$$
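Putting it all together, here is an illustrative NumPy sketch of the closed-form solution, assuming $X^T X$ is invertible (the data is random and made up); it also checks the residual conditions derived earlier. In practice, a least-squares solver is usually preferred over forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 50, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # design matrix with bias column
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=N)                  # noisy labels

# Closed-form solution w* = (X^T X)^{-1} X^T y, computed by solving the
# normal equations rather than forming the inverse explicitly.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-checks: the pseudoinverse and a library least-squares solver agree.
print(np.allclose(w_star, np.linalg.pinv(X) @ y))                 # True
print(np.allclose(w_star, np.linalg.lstsq(X, y, rcond=None)[0]))  # True

# The optimality conditions from earlier: X^T (y - X w*) = 0, i.e. the residuals
# have zero mean and are uncorrelated with every feature column.
print(np.allclose(X.T @ (y - X @ w_star), 0.0))                   # True
```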

Let's use this to predict $\hat{y}_t$ for a new datapoint $\bm{x}_t$:

$$
\begin{align*}
\hat{y}_t &= \bm{w}^{*T} \bm{x}_t \\
&= \bm{y}^T {\color{magenta} (X^\dagger)^T} \bm{x}_t
\end{align*}
$$
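A quick, illustrative check (random, made-up data) that both expressions give the same prediction:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 20, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = X @ np.array([0.5, 1.5, -1.0]) + 0.05 * rng.normal(size=N)

X_pinv = np.linalg.pinv(X)            # X^dagger
w_star = X_pinv @ y                   # w* = X^dagger y

x_t = np.array([1.0, 0.3, -0.7])      # new datapoint, with the leading 1 for the bias
y_hat_direct = w_star @ x_t           # w*^T x_t
y_hat_rewritten = y @ X_pinv.T @ x_t  # y^T (X^dagger)^T x_t

print(np.isclose(y_hat_direct, y_hat_rewritten))  # True
```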
