# Linear Regression

Nov 22, 2023

Linear function:

$$
\begin{align*}
f(\bm{x}; \bm{w}) &\doteq {\color{magenta} w_0} + w_1 x_1 + \ldots + w_d x_d \\
&= \bm{w} \cdot \bm{x}
\end{align*}
$$

Note that we set $x_0 = 1$, so that $\bm{x} \in \mathbb{R}^{d+1}$ and $\bm{w} \in \mathbb{R}^{d+1}$.
Given datapoints $X$ and targets $\bm{y}$,

$$
\begin{align*}
\bm{x}_1, \bm{x}_2, \ldots, \bm{x}_N &\in X~&&(\bm{x}_i \in \mathbb{R}^{d+1}) \\
y_1, y_2, \ldots, y_N &\in \bm{y}~&&(y_i \in \mathbb{R}),
\end{align*}
$$

we define the squared loss $L(\bm{w}, X, \bm{y})$ as
$$
\begin{align*}
L(\bm{w}, X, \bm{y}) &\doteq \frac{1}{N} \sum_{i=1}^N \left( y_i - f(\bm{x}_i; \bm{w}) \right)^2 \\
&= \frac{1}{N} \sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right)^2
\end{align*}
$$
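A minimal NumPy sketch of this loss (the toy data and the helper `squared_loss` are only for illustration and are not from the lecture):

```python
import numpy as np

def squared_loss(w, X, y):
    """Mean squared loss L(w, X, y) = (1/N) * sum_i (y_i - w . x_i)^2."""
    residuals = y - X @ w          # y_i - w . x_i for every datapoint
    return np.mean(residuals ** 2)

# Toy data: N = 4 points in d = 1 dimension, with x_0 = 1 prepended.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

w = np.array([1.0, 2.0])           # here f(x; w) = 1 + 2x fits y exactly
print(squared_loss(w, X, y))       # -> 0.0
```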
Knowing that this loss is convex, the first-order necessary condition for a minimum is

$$
\frac{\partial}{\partial \bm{w}} L(\bm{w}, X, \bm{y}) = \bm{0}.
$$

Let's calculate this partial derivative for $w_0$:
$$
\begin{align*}
\frac{\partial}{\partial w_0} L(\bm{w}) &= \frac{\partial}{\partial w_0} \Big[ \frac{1}{N} \sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right)^2 \Big] \\
&= -\frac{2}{N} \sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right) = 0
\end{align*}
$$

Therefore, a necessary condition for $w_0$ is
$$
\sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right) = 0
$$

Interpretation: the average of the residuals (also called errors) is zero.
In the same way, for $w_j$ ($j \neq 0$):
$$
\begin{align*}
\frac{\partial}{\partial w_j} L(\bm{w}) &= \frac{\partial}{\partial w_j} \Big[ \frac{1}{N} \sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right)^2 \Big] \\
&= -\frac{2}{N} \sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right) x_{ij} = 0
\end{align*}
$$

Hence, another necessary condition, for each $w_j$ with $j \neq 0$, is
$$
\sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right) x_{ij} = 0
$$

Interpretation: the residuals (or errors) are uncorrelated with the data.
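As a quick numerical sanity check of these two conditions (a minimal sketch; `np.linalg.lstsq` is used here as a black-box least-squares solver ahead of the closed-form solution derived below, and the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N = 100 points, d = 2 features, plus a column of ones for x_0 = 1.
N, d = 100, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])
y = X @ np.array([0.5, 2.0, -1.0]) + rng.normal(scale=0.1, size=N)

# Fit w by least squares (treated as a black box for now).
w, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ w
print(residuals.sum())         # ~0: residuals average to zero (the w_0 condition)
print(X[:, 1:].T @ residuals)  # ~0 for each feature: the w_j conditions
```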
In summary: 
$$
\left\{ \,
\begin{aligned}
&\sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right) \hphantom{x_{ij}} = 0 \\
&\sum_{i=1}^N \left( y_i - \bm{w} \cdot \bm{x}_i \right) x_{ij} = 0~~(j = 1, \ldots, d)
\end{aligned}
\right.
$$

## Least squares in matrix form
$$
\begin{align*}
X &\doteq
\begin{bmatrix}
  1 & x_{11} & x_{12} & \cdots & x_{1d} \\
  1 & x_{21} & x_{22} & \cdots & x_{2d} \\
  \vdots  & \vdots & \vdots & \ddots & \vdots \\
  1 & x_{N1} & x_{N2} & \cdots & x_{Nd} \\
\end{bmatrix}
=
\begin{bmatrix}
  \text{------} & \bm{x}_1^T & \text{------} \\
  \text{------} & \bm{x}_2^T & \text{------} \\
  \vdots & \vdots & \vdots \\
  \text{------} & \bm{x}_N^T & \text{------} \\
\end{bmatrix}\\
\bm{y} &\doteq
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_N
\end{bmatrix},
\quad
\bm{w} \doteq
\begin{bmatrix}
w_0 \\
w_1 \\
\vdots \\
w_d
\end{bmatrix}
\end{align*}
$$
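In code, constructing this design matrix amounts to prepending a column of ones to the raw features (a minimal NumPy sketch with arbitrary values):

```python
import numpy as np

# Raw features: N = 5 datapoints with d = 3 features each (arbitrary values).
X_raw = np.arange(15.0).reshape(5, 3)

# Design matrix: prepend the constant feature x_0 = 1 to every row.
X = np.column_stack([np.ones(len(X_raw)), X_raw])
print(X.shape)  # (5, 4), i.e. N x (d + 1)
```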
Now, how can we express the errors (residuals)? In matrix form:

$$
\bm{y} - X \bm{w}
$$

Hence, the loss $L(\bm{w})$ is
$$
L(\bm{w}) = \frac{1}{N} \left( \bm{y} - X \bm{w} \right)^T \left( \bm{y} - X \bm{w} \right)
$$

Let's find the minimum of $L(\bm{w})$ by solving $\frac{\partial}{\partial \bm{w}} L(\bm{w}) = \bm{0}$. Recall that for $\bm{a}, \bm{b} \in \mathbb{R}^d$, $\bm{a}^T \bm{b} = \bm{b}^T \bm{a}$. Expanding the loss:
$$
\begin{align*}
\frac{\partial}{\partial \bm{w}} L(\bm{w}) &= \frac{1}{N} \frac{\partial}{\partial \bm{w}} \big[ \bm{y}^T \bm{y} - \bm{y}^T X \bm{w} - \bm{w}^T X^T \bm{y} + \bm{w}^T X^T X \bm{w} \big] \\
&= \frac{1}{N} \frac{\partial}{\partial \bm{w}} \big[ \bm{y}^T \bm{y} - (\bm{y}^T X) \bm{w} - \bm{w}^T (X^T \bm{y}) + \bm{w}^T X^T X \bm{w} \big]~~(\text{associativity}) \\
&= \frac{1}{N} \frac{\partial}{\partial \bm{w}} \big[ \bm{y}^T \bm{y} - {\color{magenta} \bm{w}^T (\bm{y}^T X)^T} - \bm{w}^T (X^T \bm{y}) + \bm{w}^T X^T X \bm{w} \big]~~(\text{writing the dot product as a matrix product}) \\
&= \frac{1}{N} \frac{\partial}{\partial \bm{w}} \big[ \bm{y}^T \bm{y} - {\color{magenta} \bm{w}^T (X^T \bm{y})} - \bm{w}^T (X^T \bm{y}) + \bm{w}^T X^T X \bm{w} \big] \\
&= \frac{1}{N} \frac{\partial}{\partial \bm{w}} \big[ \bm{y}^T \bm{y} - 2 \bm{w}^T (X^T \bm{y}) + \bm{w}^T X^T X \bm{w} \big]
\end{align*}
$$

Now, we have these simple derivative rules:
$$
\frac{\partial}{\partial \bm{w}} \bm{w}^T \bm{a} = \bm{a}, \qquad \frac{\partial}{\partial \bm{w}} \bm{w}^T A \bm{w} = (A + A^T) \bm{w}
$$
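These identities are easy to spot-check numerically; the finite-difference sketch below, with an arbitrary `A` and `a`, is only an illustration and not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
a = rng.normal(size=n)
A = rng.normal(size=(n, n))
w = rng.normal(size=n)
eps = 1e-6

def num_grad(f, w):
    """Central finite-difference gradient of a scalar function f at w."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

print(np.allclose(num_grad(lambda w: w @ a, w), a))                  # d/dw (w^T a) = a
print(np.allclose(num_grad(lambda w: w @ A @ w, w), (A + A.T) @ w))  # d/dw (w^T A w) = (A + A^T) w
```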
Using these properties,

$$
\begin{align*}
\frac{\partial}{\partial \bm{w}} L(\bm{w}) &= \frac{1}{N} \big[ - 2 X^T \bm{y} + 2 X^T X \bm{w} \big]~~(\because X^T X \text{ is symmetric}) \\
&= - \frac{2}{N} \big[ X^T \bm{y} - X^T X \bm{w} \big] = \bm{0}
\end{align*}
$$

Therefore,
$$
X^T \bm{y} - X^T X \bm{w} = \bm{0}.
$$

We can also write this as follows, which recalls the conditions we derived before moving to matrix notation:
$$
X^T \left( \bm{y} - X \bm{w} \right) = \bm{0}.
$$

Solving for $\bm{w}$:
$$
\bm{w}^* = \underbrace{{\color{magenta} \left( X^T X \right)^{-1} X^T}}_{X^\dagger \text{: Moore-Penrose pseudoinverse}} \bm{y}
$$

Let's use this to predict $\hat{y}_t$ for a new input $\bm{x}_t$:
$$
\begin{align*}
\hat{y}_t &= \bm{w}^{*T} \bm{x}_t \\
&= \bm{y}^T {\color{magenta} X^\dagger}^T \bm{x}_t
\end{align*}
$$
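Putting the closed-form solution into code (a minimal NumPy sketch on synthetic data; it solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, and cross-checks against `np.linalg.pinv`):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic training data with the constant feature x_0 = 1 prepended.
N, d = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=N)

# Closed-form solution: w* = (X^T X)^{-1} X^T y, via the normal equations.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent route via the Moore-Penrose pseudoinverse: w* = X^+ y.
w_pinv = np.linalg.pinv(X) @ y
assert np.allclose(w_star, w_pinv)

# Predict y_hat for a new input x_t (with its leading 1).
x_t = np.array([1.0, 0.2, -0.7, 1.5])
y_hat = w_star @ x_t
print(w_star, y_hat)
```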
## References

- Introduction to Machine Learning, Lecture 2 (TTIC 31020; 2018)
- Matrix Cookbook
- Jacobian matrix definition (Wikipedia)