Least Squares

안녕하세요. 오태호입니다.

이 글에서는 Least Squares와 관련된 몇 가지 수식을 유도하여 Least Squares에 대해 이해를 해 보도록 하겠습니다.

이 글을 이해하기 위해서는 Linear Algebra, Statistics에 대한 기초지식이 필요합니다.

Matrix나 Vector는 굵은 글꼴로 표현하도록 하겠습니다. 그리고 Vector는 특별히 언급이 없으면 Column Vector를 의미합니다.

A는

m \times n

의 Matrix이고, b는

m \times 1

의 Vector일 때, b의 A의 Column Space

C S(\mathbf{A})

에 대한 Projection Vector

\operatorname{proj}_{C S(\mathbf{A})} \mathbf{b}

는 다음과 같이 계산합니다.

x는

n \times 1

의 Vector이고 다음과 같이 정의합니다.

\operatorname{proj}_{C S(\mathbf{A})} \mathbf{b}=\mathbf{A} \mathbf{x}

\operatorname{proj}_{C S(\mathbf{A})} \mathbf{b}

는 A의 Column Vector들을 Linear Combination한 Vector이므로 Ax와 같이 표현이 가능합니다.

z는

m \times 1

의 Vector이고 다음과 같이 정의합니다.

\mathbf{z}=\mathbf{b}-\operatorname{proj}_{C S(\mathbf{A})} \mathbf{D}

아래 그림을 살펴보면 z는

C S(\mathbf{A})

의 Orthogonal Complement에 속해 있는 Vector라는 사실을 알 수 있습니다.

\mathbf{b}-\operatorname{proj}_{C S(\mathbf{A})} \mathbf{b}=\mathbf{z} \in C S(\mathbf{A})^{\perp}=\operatorname{Null}\left(\mathbf{A}^{T}\right)

C S(\mathbf{A})^{\perp}

는 A의 Column Space의 Orthogonal Complement를 의미합니다. 즉, A의 모든 Column Vector들과 Orthogonal한 Vector들로 이루어진 Vector Space를 의미합니다. 다시 말하면, A의 모든 Column Vector들과 Dot Product가 0이 되는 Vector들로 이루어진 Vector Space를 의미합니다.

\operatorname{Null}\left(\mathbf{A}^{T}\right)

는

\mathbf{A}^{T} \mathbf{h}=\mathbf{0}

을 만족하는 h들로 이루어진 Vector Space를 의미합니다. 즉,

\mathbf{A}^{T}

의 모든 Row Vector들과 Dot Product가 00가 되는 Vector들로 이루어진 Vector Space를 의미하고, 이것은 AA의 모든 Column Vector들과 Dot Product가 00이 되는 Vector들로 이루어진 Vector Space를 의미하며, 결국에

C S(\mathbf{A})^{\perp}

과 동일한 Vector Space를 의미합니다.

\begin{array}{l}\mathbf{A}^{T} \mathbf{z}=\mathbf{0} \\\mathbf{A}^{T}\left(\mathbf{b}-\operatorname{proj}_{C S(\mathbf{A})} \mathbf{b}\right)=\mathbf{0} \\\mathbf{A}^{T}(\mathbf{b}-\mathbf{A} \mathbf{x})=\mathbf{0} \\\mathbf{A}^{T} \mathbf{b}-\mathbf{A}^{T} \mathbf{A} \mathbf{x}=\mathbf{0} \\\mathbf{x}=\left(\mathbf{A}^{T} \mathbf{A}\right)^{-1} \mathbf{A}^{T} \mathbf{b} \\\operatorname{proj}_{C S(\mathbf{A})} \mathbf{b}=\mathbf{A} \mathbf{x} \\\quad=\mathbf{A}\left(\mathbf{A}^{T} \mathbf{A}\right)^{-1} \mathbf{A}^{T} \mathbf{b}\end{array}

정리하면 다음과 같습니다.

\operatorname{proj}_{C S(\mathbf{A})} \mathbf{b}=\mathbf{A}\left(\mathbf{A}^{T} \mathbf{A}\right)^{-1} \mathbf{A}^{T} \mathbf{b}

A는

m \times n

의 Matrix, x는

n \times 1

의 Vector, b는

m \times 1

의 Vector라고 정의합니다. A와 b의 값이 주어졌을 때

\min _{\mathbf{x}}\|\mathbf{A} \mathbf{x}-\mathbf{b}\|_{2}^{2}

을 만족하는 x는 다음과 같이 계산합니다.

\mathbf{A} \mathbf{x}

에서 x는 임의의 값을 가질 수 있는 Vector이고,

\mathbf{A} \mathbf{x}

는 A의 Column Space상의 임의의 Vector를 표현할 수 있습니다. b와

\mathbf{A} \mathbf{x}

의 거리가 가장 가까운 지점은, b와 A의 Column Space상에서 가장 거리가 가까운 지점이 됩니다. 이 지점은 b에서 A의 Column Space에 수직이 되는 지점으로, 즉, b를 A의 Column Space에 Projection한 지점입니다. 이것을 정리해서 표현하면

\operatorname{proj}_{C S(\mathbf{A})} \mathbf{b}

이 됩니다. 이 사실을 바탕으로

\mathbf{A} \mathbf{x}

를 구하고 x를 구합니다.

\begin{aligned}\mathbf{A x} & =\operatorname{proj}_{C S(\mathbf{A})}^{\mathbf{b}} \\& =\mathbf{A}\left(\mathbf{A}^{T} \mathbf{A}\right)^{-1} \mathbf{A}^{T} \mathbf{b} \\\mathbf{x}= & \left(\mathbf{A}^{T} \mathbf{A}\right)^{-1} \mathbf{A}^{T} \mathbf{b}\end{aligned}

정리하면 다음과 같습니다.

\underset{\mathbf{x}}{\operatorname{argmin}}\|\mathbf{A} \mathbf{x}-\mathbf{b}\|_{2}^{2}=\left(\mathbf{A}^{T} \mathbf{A}\right)^{-1} \mathbf{A}^{1} \mathbf{b}

A는

m \times n

의 Matrix, x는

n \times 1

의 Vector, b는

m \times 1

의 Vector라고 정의합니다. A와 b와 λ의 값이 주어졌을 때

\min _{\mathbf{x}}\left(\|\mathbf{A} \mathbf{x}-\mathbf{b}\|_{2}^{2}+\lambda\|\mathbf{x}\|_{2}^{2}\right)

을 만족하는 x는 다음과 같이 계산합니다.

직관적으로는 다음과 같이 이해할 수 있습니다.

A는 Model 학습을 위한 Input Data이고 b는 Model 학습을 위한 Output Data입니다. Input Data가 주어졌을 때 Output Data를 Prediction하는 Model을 만들고 싶습니다. Model의 Parameter는 x입니다. A의 각 Column은 각각이 Feature입니다. 특정 Feature에 지나치게 의존하게 되면 Overfit이 발생할 우려가 있으니 x Vector의 특정 Element가 지나치게 커지는 것을 방지합니다. 그러기 위해서

\lambda\|\mathbf{x}\|_{2}^{2}

이 지나치게 커지는 것을 방지하여 여러 Feature를 골고루 바라보도록 유도합니다. 극단적으로 λ가 ∞의 값을 가지면 x가 0이 되면서 Underfit이 발생하고 λ가 0의 값을 가지면 Ordinary Least Squares와 동일한 형태가 되면서 Overfit이 발생할 위험이 있습니다.

\begin{aligned}\min _{\mathbf{x}}\left(\|\mathbf{A} \mathbf{x}-\mathbf{b}\|_{2}^{2}+\lambda\|\mathbf{x}\|_{2}^{2}\right)= & \min _{\mathbf{x}}\left\|\left[\begin{array}{c}\mathbf{A} \\\sqrt{\lambda} \mathbf{I}\end{array}\right] \mathbf{x}-\left[\begin{array}{l}\mathbf{b} \\\mathbf{0}\end{array}\right]\right\|_{2}^{2} \\\underset{\mathbf{x}}{\operatorname{argmin}}\left\|\left[\begin{array}{c}\mathbf{A} \\\sqrt{\lambda} \mathbf{I}\end{array}\right] \mathbf{x}-\left[\begin{array}{l}\mathbf{b} \\\mathbf{0}\end{array}\right]\right\|_{2}^{2} & =\left(\left[\begin{array}{c}\mathbf{A} \\\sqrt{\lambda} \mathbf{I}\end{array}\right]^{T}\left[\begin{array}{c}\mathbf{A} \\\sqrt{\lambda} \mathbf{I}\end{array}\right]\right)^{-1}\left[\begin{array}{c}\mathbf{A} \\\sqrt{\lambda} \mathbf{I}\end{array}\right]^{T}\left[\begin{array}{l}\mathbf{b} \\\mathbf{0}\end{array}\right] \\& =\left(\left[\begin{array}{ll}\mathbf{A}^{T} & \sqrt{\lambda} \mathbf{I}\end{array}\right]\left[\begin{array}{c}\mathbf{A} \\\sqrt{\lambda} \mathbf{I}\end{array}\right]\right)^{-1}\left[\begin{array}{ll}\mathbf{A}^{T} & \sqrt{\lambda} \mathbf{I}\end{array}\right]\left[\begin{array}{l}\mathbf{b} \\\mathbf{0}\end{array}\right] \\& =\left(\mathbf{A}^{T} \mathbf{A}+\lambda \mathbf{I}\right)^{-1} \mathbf{A}^{T} \mathbf{b}\end{aligned}

정리하면 다음과 같습니다.

\underset{\mathbf{x}}{\operatorname{argmin}}\left(\|\mathbf{A x}-\mathbf{b}\|_{2}^{2}+\lambda\|\mathbf{x}\|_{2}^{2}\right)=\left(\mathbf{A}^{T} \mathbf{A}+\lambda \mathbf{I}\right)^{-1} \mathbf{A}^{1} \mathbf{b}

정확한 정의는 조금씩 차이가 있지만 이 방법은 L2 Regularization, Tikhonov Regularization, Ridge Regression라고도 불립니다.

\mathbf{A}_{i}

는

m \times n

의 Matrix, x는

n \times 1

의 Vector,

\mathbf{b}_{i}

는 m×1m×1의 Vector라고 정의합니다.

\mathbf{A}_{i}

와

\mathbf{b}_{i}

와

\mathbf{λ}_{i}

의 값이 주어졌을 때

\min _{\mathbf{x}}\left(\sum_{i=1}^{k} \lambda_{i}\left\|\mathbf{A}_{i} \mathbf{x}-\mathbf{b}_{i}\right\|_{2}^{2}\right)

을 만족하는 x는 다음과 같이 계산합니다.

직관적으로는 다음과 같이 이해할 수 있습니다.

\mathbf{A}_{i}

는 Model 학습을 위한 Input Data이고

\mathbf{b}_{i}

는 Model 학습을 위한 Output Data입니다. Input Data가 주어졌을 때 Output Data를 Prediction하는 Model을 만들고 싶습니다. Model의 Parameter는 x입니다. Input Data와 Output Data를 여러 Group으로 나누고 Group마다

\mathbf{λ}_{i}

로 가중치를 줘서 특정 Group의 Data를 더 중요하게 생각해서 Model을 만듭니다. 특정 Group의

\mathbf{A}_{i}

를 I로 설정하고

\mathbf{b}_{i}

를 0으로 설정하고

\mathbf{λ}_{i}

를 적절하게 설정해 주면 Regularized Least Squares와 동일한 형태가 됩니다.

\begin{array}{l} \min _{\mathbf{x}}\left(\sum_{i=1}^{k} \lambda_{i}\left\|\mathbf{A}_{i} \mathbf{x}-\mathbf{b}_{i}\right\|_{2}^{2}\right)=\min _{\mathbf{x}}\left\|\left[\begin{array}{c}\sqrt{\lambda_{1}} \mathbf{A}_{1} \\\sqrt{\lambda_{2}} \mathbf{A}_{2} \\\vdots \\\sqrt{\lambda_{k}} \mathbf{A}_{k}\end{array}\right] \mathbf{x}-\left[\begin{array}{c}\sqrt{\lambda_{1}} \mathbf{b}_{1} \\\sqrt{\lambda_{2}} \mathbf{b}_{2} \\\vdots \\\sqrt{\lambda_{k}} \mathbf{b}_{k}\end{array}\right]\right\|_{2}^{2} \\\underset{\mathbf{x}}{\operatorname{argmin}}\left\|\left[\begin{array}{c}\sqrt{\lambda_{1}} \mathbf{A}_{1} \\\sqrt{\lambda_{2}} \mathbf{A}_{2} \\\vdots \\\sqrt{\lambda_{k}} \mathbf{A}_{k}\end{array}\right] \mathbf{x}-\left[\begin{array}{c}\sqrt{\lambda_{1}} \mathbf{b}_{1} \\\sqrt{\lambda_{2}} \mathbf{b}_{2} \\\vdots \\\sqrt{\lambda_{k}} \mathbf{b}_{k}\end{array}\right]\right\|_{2}^{2}\left.=\left(\begin{array}{c}\sqrt{\lambda_{1}} \mathbf{A}_{1} \\\sqrt{\lambda_{2}} \mathbf{A}_{2} \\\vdots \\\sqrt{\lambda_{k}} \mathbf{A}_{k}\end{array} \|^{T}\left[\begin{array}{c}\sqrt{\lambda_{1}} \mathbf{A}_{1} \\\sqrt{\lambda_{2}} \mathbf{A}_{2} \\\vdots \\\sqrt{\lambda_{k}} \mathbf{A}_{k}\end{array}\right]\right)\right]^{-1}\left[\begin{array}{c}\sqrt{\lambda_{1}} \mathbf{A}_{1} \\\sqrt{\lambda_{2}} \mathbf{A}_{2} \\\vdots \\\sqrt{\lambda_{k}} \mathbf{A}_{k}\end{array}\right]^{T}\left[\begin{array}{c}\sqrt{\lambda_{1}} \mathbf{b}_{1} \\\sqrt{\lambda_{2}} \mathbf{b}_{2} \\\vdots \\\sqrt{\lambda_{k}} \mathbf{b}_{k}\end{array}\right] \\=\left(\sum_{i=1}^{k} \lambda_{i} \mathbf{A}_{i}^{T} \mathbf{A}_{i}\right)^{k}\left(\sum_{i=1}^{k} \lambda_{i} \mathbf{A}_{i}^{T} \mathbf{b}_{i}\right)\end{array}

정리하면 다음과 같습니다.

\underset{\mathbf{x}}{\operatorname{argmin}}\left(\sum_{i=1}^{k} \lambda_{i}\left\|\mathbf{A}_{i} \mathbf{x}-\mathbf{b}_{i}\right\|_{2}^{2}\right)=\left(\sum_{i=1}^{k} \lambda_{i} \mathbf{A}_{i}^{T} \mathbf{A}_{i}\right)^{-1}\left(\sum_{i=1}^{k} \lambda_{i} \mathbf{A}_{i}^{T} \mathbf{b}_{i}\right)

A는

m \times n

의 Matrix, x는

n \times 1

의 Vector, b는

m \times 1

의 Vector, W는

m \times m

의 Diagonal Matrix라고 정의합니다. A와 b와 W의 값이 주어졌을 때

\min _{\mathbf{x}}(\mathbf{A} \mathbf{x}-\mathbf{b})^{T} \mathbf{W}(\mathbf{A} \mathbf{x}-\mathbf{b})

을 만족하는 x는 다음과 같이 계산합니다.

W와

\sqrt{ } \mathbf{W}

는 다음과 같이 정의합니다.

\begin{array}{l} \mathbf{W}=\left[\begin{array}{cccc}w_{1} & 0 & \cdots & 0 \\0 & w_{2} & \cdots & 0 \\\vdots & \vdots & \ddots & \vdots \\0 & 0 & \cdots & w_{m}\end{array}\right] \\\sqrt{\mathbf{W}}=\left[\begin{array}{cccc}\sqrt{w_{1}} & 0 & \cdots & 0 \\0 & \sqrt{w_{2}} & \cdots & 0 \\\vdots & \vdots & \ddots & \vdots \\0 & 0 & \cdots & \sqrt{w_{m}}\end{array}\right]\end{array}

직관적으로는 다음과 같이 이해할 수 있습니다.

A는 Model 학습을 위한 Input Data이고 b는 Model 학습을 위한 Output Data입니다. Input Data가 주어졌을 때 Output Data를 Prediction하는 Model을 만들고 싶습니다. Model의 Parameter는 x입니다. AA의 각 Row는 각각이 Data Point입니다. Model의 경우에는 Input으로 동일한 Data Point가 주어지면 Model의 Output이 동일하지만, 실제 수집된 Output Data b의 Data Point는 여러가지 이유로(측정오류 등) 값이 흔들리게 됩니다. 어떤 Input Data의 Data Point의 경우에는 Output이 크게 흔들리고 어떤 Input Data의 Data Point의 경우에는 Output이 작게 흔들릴 수 있습니다. 이런 상황을 Heteroskedasticity가 존재한다고 말합니다. 이런 경우에 그대로 Model을 학습하게 되면 Output이 크게 흔들리는 경우를 좀 더 중요하게 생각해서 학습이 이루어지게 됩니다. 그래서 이런 상황에는 Output이 크게 흔들리는 경우에는 Weight를 작게 주고 Output이 작게 흔들리는 경우에는 Weight를 크게 줘서 Model을 균형있게 학습시킵니다. Weight는 W에 설정합니다.

\begin{array}{l}\begin{aligned}\min _{\mathbf{x}}(\mathbf{A} \mathbf{x}-\mathbf{b})^{T} \mathbf{W}(\mathbf{A} \mathbf{x}-\mathbf{b}) & =\min _{\mathbf{x}}(\mathbf{A} \mathbf{x}-\mathbf{b})^{T}(\sqrt{\mathbf{W}})^{T}(\sqrt{\mathbf{W}})(\mathbf{A} \mathbf{x}-\mathbf{b}) \\& =\min _{\mathbf{x}}(\sqrt{\mathbf{W}} \mathbf{A} \mathbf{x}-\sqrt{\mathbf{W}} \mathbf{b})^{T}(\sqrt{\mathbf{W}} \mathbf{A} \mathbf{x}-\sqrt{\mathbf{W}} \mathbf{b}) \\& =\min _{\mathbf{x}}\|\sqrt{\mathbf{W}} \mathbf{A} \mathbf{x}-\sqrt{\mathbf{W}} \mathbf{b}\|_{2}^{2}\end{aligned}\\\begin{aligned}\underset{\mathbf{x}}{\operatorname{argmin}}\|\sqrt{\mathbf{W}} \mathbf{A} \mathbf{x}-\sqrt{\mathbf{W}} \mathbf{b}\|_{2}^{2} & =\left((\sqrt{\mathbf{W}} \mathbf{A})^{T}(\sqrt{\mathbf{W}} \mathbf{A})\right)^{-1}(\sqrt{\mathbf{W}} \mathbf{A})^{T}(\sqrt{\mathbf{W}} \mathbf{b}) \\& =\left(\mathbf{A}^{T} \mathbf{W} \mathbf{A}\right)^{-1} \mathbf{A}^{T} \mathbf{W} \mathbf{b}\end{aligned}\end{array}

정리하면 다음과 같습니다.

\underset{\mathbf{x}}{\operatorname{argmin}}(\mathbf{A x}-\mathbf{b})^{T} \mathbf{W}(\mathbf{A} \mathbf{x}-\mathbf{b})=\left(\mathbf{A}^{T} \mathbf{W} \mathbf{A}\right)^{-1} \mathbf{A}^{T} \mathbf{W} \mathbf{b}

A는

m \times n

의 Matrix, x는

n \times 1

의 Vector, b는

m \times 1

의 Vector, CC는

m \times m

의 Positive Definite Matrix라고 정의합니다. A와 b와 C의 값이 주어졌을 때

\min _{\mathbf{x}}(\mathbf{A x}-\mathbf{b})^{T} \mathbf{C}^{-1}(\mathbf{A x}-\mathbf{b})

을 만족하는 xx는 다음과 같이 계산합니다.

CC는 Positive Definite Matrix이므로 Cholesky Decomposition을 이용하여 아래와 같이 Decomposition할 수 있습니다.

\mathbf{C}=\mathbf{L L}^{T}

직관적으로는 다음과 같이 이해할 수 있습니다.

A는 Model 학습을 위한 Input Data이고 b는 Model 학습을 위한 Output Data입니다. Input Data가 주어졌을 때 Output Data를 Prediction하는 Model을 만들고 싶습니다. Model의 Parameter는 x입니다. A의 각 Row는 각각이 Data Point입니다. 첫 번째 Data Point에 대한 Output과 Model의 Output과 차이하고, 두 번째 Data Point에 대한 Output과 Model의 Output이 서로 Correlation이 있을 수 있습니다. 예를 들어, 첫 번째 Data Point에 대한 Output은 서울의 2021년 1월 1일의 온도이고 두 번째 Data Point에 대한 Output은 서울의 2021년 1월 2일의 온도라면, 첫 번째 Data Point의 Output과 Model의 Output과 차이하고, 두 번째 Data Point의 Output과 Model의 Output과 차이하고는, 서로 Correlation이 되어 있을 가능성이 높다는 것을 예상할 수 있습니다. 이런 경우에

\mathbf{L}^{-1}

Matrix를 A와 b에 곱해서 Correlation이 제거된 Vector Space로 Transform한 후 Model을 학습시킵니다. C는 Matrix는 Data Point의 Output과 Model의 Output과의 차이 값들의 Covariance Matrix로 설정합니다. 참고로 C Matrix를 구하는 것이 현실적으로 힘든 경우가 많이 있는데 이런 경우에는 Generalized Least Squares을 사용하기가 어렵습니다. C Matrix가 Diagonal Matrix라면 Correlation이 존재하지 않는 것을 의미하며 Weighted Least Squares와 동일한 형태가 됩니다.

\begin{aligned}\min _{\mathbf{x}}(\mathbf{A} \mathbf{x}-\mathbf{b})^{T} \mathbf{C}^{-1}(\mathbf{A} \mathbf{x}-\mathbf{b}) & =\min _{\mathbf{x}}(\mathbf{A} \mathbf{x}-\mathbf{b})^{T}\left(\mathbf{L L}^{T}\right)^{-1}(\mathbf{A} \mathbf{x}-\mathbf{b}) \\& =\min _{\mathbf{x}}(\mathbf{A} \mathbf{x}-\mathbf{b})^{T}\left(\mathbf{L}^{-1}\right)^{T}\left(\mathbf{L}^{-1}\right)(\mathbf{A} \mathbf{x}-\mathbf{b}) \\& =\min _{\mathbf{x}}\left(\mathbf{L}^{-1} \mathbf{A} \mathbf{x}-\mathbf{L}^{-1} \mathbf{b}\right)^{T}\left(\mathbf{L}^{-1} \mathbf{A} \mathbf{x}-\mathbf{L}^{-1} \mathbf{b}\right) \\& =\min _{\mathbf{x}}\left\|\mathbf{L}^{-1} \mathbf{A} \mathbf{x}-\mathbf{L}^{-1} \mathbf{b}\right\|_{2}^{2} \\\underset{\mathbf{x}}{\operatorname{argmin}}\left\|\mathbf{L}^{-1} \mathbf{A} \mathbf{x}-\mathbf{L}^{-1} \mathbf{b}\right\|_{2}^{2}= & \left(\left(\mathbf{L}^{-1} \mathbf{A}\right)^{T}\left(\mathbf{L}^{-1} \mathbf{A}\right)\right)^{-1}\left(\mathbf{L}^{-1} \mathbf{A}\right)^{T}\left(\mathbf{L}^{-1} \mathbf{b}\right) \\= & \left(\mathbf{A}^{T}\left(\mathbf{L} \mathbf{L}^{T}\right)^{-1} \mathbf{A}\right)^{-1} \mathbf{A}^{T}\left(\mathbf{L} \mathbf{L}^{T}\right)^{-1} \mathbf{b} \\= & \left(\mathbf{A}^{T} \mathbf{C}^{-1} \mathbf{A}\right)^{-1} \mathbf{A}^{T} \mathbf{C}^{-1} \mathbf{b}\end{aligned}

정리하면 다음과 같습니다.

\underset{\mathbf{x}}{\operatorname{argmin}}(\mathbf{A x}-\mathbf{b})^{T} \mathbf{C}^{-1}(\mathbf{A x}-\mathbf{b})=\left(\mathbf{A}^{T} \mathbf{C}^{-1} \mathbf{A}\right)^{-1} \mathbf{A}^{T} \mathbf{C}^{-1} \mathbf{b}

C Matrix에 대해서 조금 더 자세히 살펴보겠습니다.

Output Data인 b와 Model의 Output인 Ax와의 차이를 ε으로 정의하면 다음과 같이 표현할 수 있습니다.

\varepsilon=\mathbf{b}-\mathbf{A x}

여기서 A와 x는 고정되어 있는데 b 값이 (측정할 때마다) 흔들리고 이로 인해 ε 값도 흔들리는 상황을 생각해 봅니다. C Matrix는 다음과 같이 정의합니다.

\mathbf{C}=\operatorname{Var}(\varepsilon)

C는 Positive Definite Matrix이므로 Cholesky Decomposition을 하면 다음과 같이 Decomposition할 수 있습니다.

\mathbf{C}=\mathbf{L} \mathbf{L}^{T}

위에서 언급된 수식에 아래와 같이

\mathbf{L}^{-1}

을 곱해서 ε의 Correlation이 어떻게 변하는지 살펴봅니다.

\begin{array}{l}\boldsymbol{\varepsilon}=\mathbf{b}-\mathbf{A x} \\\mathbf{L}^{-1} \varepsilon=\mathbf{L}^{-1} \mathbf{b}-\mathbf{L}^{-1} \mathbf{A} \mathbf{x}\end{array}

\operatorname{Var}\left(\mathbf{L}^{-1} \boldsymbol{\varepsilon}\right)

은 Covariance Matrix의 성질을 이용하여 정리하면 다음과 같습니다.

\begin{aligned}\operatorname{Var}\left(\mathbf{L}^{-1} \varepsilon\right) & =\mathbf{L}^{-1} \operatorname{Var}(\varepsilon)\left(\mathbf{L}^{-1}\right)^{T} \\& =\mathbf{L}^{-1} \mathbf{C}\left(\mathbf{L}^{-1}\right)^{T} \\& =\mathbf{L}^{-1} \mathbf{L} \mathbf{L}^{T}\left(\mathbf{L}^{-1}\right)^{T} \\& =\mathbf{L}^{-1} \mathbf{L L}^{T}\left(\mathbf{L}^{T}\right)^{-1} \\& =\mathbf{I}\end{aligned}

즉, A와 b에

\mathbf{L}^{-1}

을 곱해 주면 ε에 있던 Correlation이 제거된 Vector Space로 Transform되는 것을 확인할 수 있습니다.

Weighted Least Squares의 경우와 비교해 보면 C가 Diagonal Matrix인 경우 W와 다음과 같은 관계가 있습니다.

\begin{array}{l}\sqrt{\mathbf{W}}=\mathbf{L}^{-1} \\(\sqrt{\mathbf{W}})^{-1}=\mathbf{L} \\\mathbf{C}=\mathbf{L} \mathbf{L}^{T} \\\quad=(\sqrt{\mathbf{W}})^{-1}\left((\sqrt{\mathbf{W}})^{-1}\right)^{T} \\\quad=(\sqrt{\mathbf{W}})^{-1}(\sqrt{\mathbf{W}})^{-1} \\\quad=\mathbf{W}^{-1}\end{array}

몇 가지 Least Squares와 관련된 수식을 유도해 보았습니다. 이 글을 통해 Least Squares에 대해 이해하는데 조금이라도 도움이 되기를 바랍니다.

Least Squares

Projection

Ordinary Least Squares

Regularized Least Squares

Multi Objective Least Squares

Weighted Least Squares

Generalized Least Squares

Conclusion