# Linear Regression

Nov 22, 2023

Linear function:
$$
\begin{align*}
f(\bm{x}; \bm{w}) &\doteq {\color{magenta} w_0} + w_1 x_1 + \ldots + w_d x_d \\
&= \bm{w} \cdot \bm{x}
\end{align*}
$$
Note that we set $x_0 = 1$, resulting in $\bm{x} \in \mathbb{R}^{d+1}$ and $\bm{w} \in \mathbb{R}^{d+1}$.
Given datapoints $X$ and labels $\bm{y}$:
$$
\begin{align*}
\bm{x}_1, \bm{x}_2, \ldots, \bm{x}_N &\in X~&&(\bm{x}_i \in \mathbb{R}^{d+1}) \\
y_1, y_2, \ldots, y_N &\in \bm{y}~&&(y_i \in \mathbb{R})
\end{align*}
$$
The squared loss $L(\bm{w}, X, \bm{y})$ can be defined as:
$$
\begin{align*}
L(\bm{w}, X, \bm{y}) &\doteq \frac{1}{N} \sum_{i=1}^N \left( y_i - f(\bm{x}_i; \bm{w}) \right)^2 \\
&= \frac{1}{N} \sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right)^2
\end{align*}
$$
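Before minimizing it, here is a minimal NumPy sketch of this loss (the function name `squared_loss` and the convention that `X` already carries the leading column of ones are my own assumptions, not from the lecture):

```python
import numpy as np

def squared_loss(w, X, y):
    """Mean squared loss (1/N) * sum_i (y_i - w . x_i)^2.

    Assumes X has shape (N, d+1) with a leading column of ones (x_0 = 1),
    w has shape (d+1,), and y has shape (N,).
    """
    residuals = y - X @ w           # r_i = y_i - w . x_i
    return np.mean(residuals ** 2)  # average of squared residuals
```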
Since the loss is convex, the first-order necessary condition for the minimum is
$$
\frac{\partial}{\partial \bm{w}} L(\bm{w}, X, \bm{y}) = \bm{0}
$$
Let's calculate this for $w_0$ first:
$$
\begin{align*}
\frac{\partial}{\partial w_0} L(\bm{w}) &= \frac{\partial}{\partial w_0} \big[ \frac{1}{N} \sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right)^2 \big] \\
&= -\frac{2}{N} \sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right) = 0
\end{align*}
$$
Therefore, a necessary condition for $w_0$ is:
$$
\sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right) = 0
$$
Interpretation: the average of the residuals (also called *errors*) is zero.
In the same way, for $w_j~(j \neq 0)$:
$$
\begin{align*}
\frac{\partial}{\partial w_j} L(\bm{w}) &= \frac{\partial}{\partial w_j} \big[ \frac{1}{N} \sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right)^2 \big] \\
&= -\frac{2}{N} \sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right) x_{ij} = 0
\end{align*}
$$
Hence, another necessary condition for $w_j~(j \neq 0)$ is:
$$
\sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right) x_{ij} = 0
$$
Interpretation: the residuals (or errors) are uncorrelated with the data.
In summary:
$$
\left\{ \,
\begin{align*}
&\sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right) \hphantom{x_{ij}} = 0 \\
&\sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right) x_{ij} = 0~~(j = 1, \ldots, d)
\end{align*}
\right.
$$
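These two conditions are easy to check numerically. The sketch below fits synthetic data with NumPy's `np.linalg.lstsq` (used here as a stand-in for the closed-form solution derived in the next section) and verifies that the residuals sum to roughly zero and are orthogonal to every feature column; the data and all names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # leading x_0 = 1 column
y = X @ np.array([1.0, 2.0, -3.0, 0.5]) + rng.normal(scale=0.1, size=N)

w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit
r = y - X @ w                              # residuals

print(r.sum())         # ~0: residuals average to zero (w_0 condition)
print(X[:, 1:].T @ r)  # ~0 vector: residuals orthogonal to each feature (w_j conditions)
```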
## Least squares in matrix form
$$
\begin{align*}
X &\doteq
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1d} \\
1 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N1} & x_{N2} & \cdots & x_{Nd} \\
\end{bmatrix}
=
\begin{bmatrix}
\text{------} & \bm{x}_1^T & \text{------} \\
\text{------} & \bm{x}_2^T & \text{------} \\
\vdots & \vdots & \vdots \\
\text{------} & \bm{x}_N^T & \text{------} \\
\end{bmatrix}\\
\bm{y} &\doteq
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_N
\end{bmatrix},
~\bm{w} \doteq
\begin{bmatrix}
w_0 \\
w_1 \\
\vdots \\
w_d
\end{bmatrix}
\end{align*}
$$
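In code, the only non-obvious part of this definition is remembering the column of ones; here is a small helper, assuming raw features of shape `(N, d)` (the name `design_matrix` is mine):

```python
import numpy as np

def design_matrix(X_raw):
    """Prepend a column of ones (x_0 = 1) to raw features of shape (N, d),
    producing the (N, d+1) design matrix X defined above."""
    N = X_raw.shape[0]
    return np.hstack([np.ones((N, 1)), X_raw])
```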
Now, how can we express errors (residuals)?
$$
\bm{y} - X \bm{w}
$$
Hence, the loss $L(\bm{w})$ can be expressed as:
$$
L(\bm{w}) = \frac{1}{N} \left( \bm{y} - X \bm{w} \right)^T \left( \bm{y} - X \bm{w} \right)
$$
Let's find the minimum of $L(\bm{w})$. Again, we use $\frac{\partial}{\partial \bm{w}} L(\bm{w}) = \bm{0}$.
Also note that given two vectors $\bm{a}, \bm{b} \in \mathbb{R}^d$, $\bm{a}^T \bm{b} = \bm{b}^T \bm{a}$.
$$
\begin{align*}
\frac{\partial}{\partial \bm{w}} L(\bm{w}) &= \frac{1}{N} \frac{\partial}{\partial \bm{w}} \big[ \bm{y}^T \bm{y} - \bm{y}^T X \bm{w} - \bm{w}^T X^T \bm{y} + \bm{w}^T X^T X \bm{w} \big] \\
&= \frac{1}{N} \frac{\partial}{\partial \bm{w}} \big[ \bm{y}^T \bm{y} - (\bm{y}^T X) \bm{w} - \bm{w}^T (X^T \bm{y}) + \bm{w}^T X^T X \bm{w} \big]~~(\text{associativity}) \\
&= \frac{1}{N} \frac{\partial}{\partial \bm{w}} \big[ \bm{y}^T \bm{y} - {\color{magenta} \bm{w}^T (\bm{y}^T X)^T} - \bm{w}^T (X^T \bm{y}) + \bm{w}^T X^T X \bm{w} \big]~~(\text{dot product in the form of matrix mul}) \\
&= \frac{1}{N} \frac{\partial}{\partial \bm{w}} \big[ \bm{y}^T \bm{y} - {\color{magenta} \bm{w}^T (X^T \bm{y})} - \bm{w}^T (X^T \bm{y}) + \bm{w}^T X^T X \bm{w} \big] \\
&= \frac{1}{N} \frac{\partial}{\partial \bm{w}} \big[ \bm{y}^T \bm{y} - 2 \bm{w}^T (X^T \bm{y}) + \bm{w}^T X^T X \bm{w} \big]
\end{align*}
$$
Now, we have these simple derivative rules:
$$
\frac{\partial}{\partial \bm{w}} \bm{w}^T \bm{a} = \bm{a},~\frac{\partial}{\partial \bm{w}} \bm{w}^T A \bm{w} = (A + A^T) \bm{w}
$$
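If these rules look unfamiliar, they are easy to sanity-check with finite differences; the sketch below compares the analytic gradient of $\bm{w}^T A \bm{w}$ with a numerical one on random inputs (the whole test setup is my own, not from the lecture).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.normal(size=(d, d))   # deliberately non-symmetric
w = rng.normal(size=d)
eps = 1e-6

analytic = (A + A.T) @ w  # claimed gradient of w^T A w

# central finite differences, one coordinate at a time
numeric = np.zeros(d)
for j in range(d):
    e = np.zeros(d)
    e[j] = eps
    numeric[j] = ((w + e) @ A @ (w + e) - (w - e) @ A @ (w - e)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # ~0 up to floating-point error
```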
Using these properties,
$$
\begin{align*}
\frac{\partial}{\partial \bm{w}} L(\bm{w}) &= \frac{1}{N}
\big[ - 2 X^T \bm{y} + 2 X^T X \bm{w} \big]~~(\because X^TX \text{ is symmetric})\\
&= - \frac{2}{N} \big[ X^T \bm{y} - X^T X \bm{w} \big] = \bm{0}
\end{align*}
$$
Therefore,
$$
X^T \bm{y} - X^T X \bm{w} = \bm{0}.
$$
We can also rewrite this in the following way, which recalls the conditions we derived before moving to matrix notation:
$$
X^T \left( \bm{y} - X \bm{w} \right) = \bm{0}.
$$
Solving for $\bm{w}$ gives us
$$
\bm{w}^* = \underbrace{{\color{magenta} \left( X^T X \right)^{-1} X^T}}_{X^\dagger \text{: Moore-Penrose pseudoinverse}} \bm{y}
$$
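In NumPy this is essentially a one-liner. A sketch, assuming a full-rank design matrix; in practice, solving the normal equations (or using `np.linalg.lstsq` / `np.linalg.pinv`) is preferred over forming the inverse explicitly:

```python
import numpy as np

def fit_least_squares(X, y):
    """Closed-form least-squares weights w* = (X^T X)^{-1} X^T y.

    Solves the normal equations X^T X w = X^T y instead of inverting X^T X.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, and more robust when X^T X is ill-conditioned:
# w_star = np.linalg.pinv(X) @ y
```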
Let's use this to predict $\hat{y}_t$ for a new datapoint $\bm{x}_t$:
$$
\begin{align*}
\hat{y}_t &= \bm{w}^{*T} \bm{x}_t \\
&= \bm{y}^T {\color{magenta} X^\dagger}^T \bm{x}_t
\end{align*}
$$
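Putting the pieces together, a brief end-to-end sketch on synthetic data (all numbers and names are illustrative): fit $\bm{w}^*$ via the pseudoinverse, then predict for a new point.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 50, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])     # design matrix with x_0 = 1
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(scale=0.1, size=N)

w_star = np.linalg.pinv(X) @ y                                # w* = X^dagger y

x_t = np.concatenate([[1.0], rng.normal(size=d)])             # new datapoint, x_0 = 1 prepended
y_hat_t = w_star @ x_t                                        # \hat{y}_t = w*^T x_t
print(y_hat_t)
```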
## References
- Introduction to Machine Learning, Lecture 2 (TTIC 31020, 2018)
- The Matrix Cookbook
- Jacobian matrix (Wikipedia)