공돌이의 수학정리노트 (Angelo’s Math Notes)

Affine Transformation

2024-06-28T00:00:00+00:00

How can a transformation that translates (shifts) an object be described mathematically using a matrix?

Prerequisites

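To get the most out of this post, it helps to be familiar with the following topics.

Review

A vector can be thought of as a point in space. It can be drawn as an arrow that shows both position and direction, but drawing many vectors as arrows quickly becomes cluttered, so we sometimes mark only the positions.

Figure 1. A vector can be drawn as an arrow or marked as a point.

If we treat points on the 2D plane as pixels, a collection of vectors can be thought of as a picture.

Linear transformations in 2D space

As covered in the post on matrices and linear transformations, a matrix represents a linear transformation. Applying a linear transformation to a picture changes its geometric shape, which is why such transformations are also called geometric transformations. Some representative examples are shown below.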

Scaling (2D)

\[\begin{bmatrix}2 & 0\\ 0& 1\end{bmatrix} % 식 (1)\]

Figure 2. Scaling in 2D space

Shear (2D)

\[\begin{bmatrix}2 & 1\\ 1& 2\end{bmatrix} % 식 (2)\]

Figure 3. Shear in 2D space

Rotation (2D)

\[\begin{bmatrix}\cos(\pi/3) & -\sin(\pi/3)\\ \sin(\pi/3)& \cos(\pi/3)\end{bmatrix} % 식 (3)\]

Figure 4. Rotation in 2D space

Permutation (2D)

\[\begin{bmatrix}0 & 1\\ 1& 0\end{bmatrix} % 식 (4)\]

Figure 5. Permutation in 2D space
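As a rough illustration, the four matrices in Equations (1) to (4) can be applied to a handful of points with NumPy. The point set `pts` and the dictionary names are assumptions made for this sketch, and each column of `pts` is one point. Whatever the matrix, the origin maps to itself.

```python
import numpy as np

# The four example matrices from Equations (1)-(4).
theta = np.pi / 3
transforms = {
    "scaling": np.array([[2.0, 0.0], [0.0, 1.0]]),
    "shear": np.array([[2.0, 1.0], [1.0, 2.0]]),
    "rotation": np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]]),
    "permutation": np.array([[0.0, 1.0], [1.0, 0.0]]),
}

# Hypothetical "picture": each column is one point (x, y); the first is the origin.
pts = np.array([[0.0, 1.0, 0.0, 1.0],
                [0.0, 0.0, 1.0, 1.0]])

results = {name: A @ pts for name, A in transforms.items()}
for name, out in results.items():
    print(name, out.round(3).tolist())
```

Since the first column of every result stays at (0, 0), none of these matrices can shift the whole picture.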

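Linear transformations in 3D space

Linear transformations are not limited to two dimensions; they apply in three dimensions as well. The following transformations of 3D space are written with 3x3 matrices.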

Scaling (3D)

\[\begin{bmatrix}2 & 0 & 0 \\ 0 & 1 & 0 \\ 0& 0&1\end{bmatrix}% 식 (5)\]

Figure 6. Scaling in 3D space

Shear (3D)

\[\begin{bmatrix}2 & 1 & 0 \\ 1& 2 & 0 \\ 0& 0&1\end{bmatrix}% 식 (6)\]

Figure 7. Shear in 3D space

Rotation (3D)

\[\begin{bmatrix}\cos(\pi/3) & -\sin(\pi/3) & 0 \\ \sin(\pi/3)& \cos(\pi/3) & 0 \\ 0 & 0 & 1\end{bmatrix} % 식 (7)\]

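Figure 8. Rotation in 3D space

3D linear transformations can behave in many other ways, so try them out with the demo applet at the top of this post.

Affine Transform

Translation needs an addition

One thing the 2D and 3D examples make clear is that none of the listed transformations can express a translation. Put simply, there seems to be no way to move the picture up, down, left, or right. The reason is that multiplying by a matrix always maps the zero vector to itself, so matrix multiplication alone cannot add a fixed offset to every vector. To express a translation we therefore need an explicit addition, as follows.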

\[\begin{bmatrix}x_{new} \\ y_{new} \end{bmatrix} =\begin{bmatrix}A_{11} & A_{12} \\ A_{21} & A_{22}\end{bmatrix} \begin{bmatrix}x \\ y\end{bmatrix} + \begin{bmatrix}b_1 \\ b_2\end{bmatrix} % 식 (8)\]

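In the equation above, a linear transformation is applied to the vector $\begin{bmatrix} x & y\end{bmatrix}^T$, and the result is then shifted by $b_1$ along the $x$ axis and by $b_2$ along the $y$ axis. The amount of translation is often called the “bias”, which is why the letter $b$ is used.

Expressing translation with a single matrix

If we append one extra dimension for the “bias” to each 2D vector, the translation can conveniently be folded into a single matrix multiplication. This is easier to see in an equation than in words: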

\[\begin{bmatrix}x_{new} \\ y_{new} \\ 1\end{bmatrix} =\begin{bmatrix}A_{11} & A_{12} & b_1 \\ A_{21} & A_{22} & b_2 \\ 0 & 0 & 1\end{bmatrix} \begin{bmatrix}x \\ y \\ 1\end{bmatrix} % 식 (9)\]

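In Equation (9), the third component of the vector carries no geometric information of its own; it is fixed to 1 purely so that the bias can be computed. The matrix is extended by appending $b_1$ and $b_2$ as a third column and $[0, 0, 1]$ as a third row, giving a 3x3 matrix.

A transformation of this kind, which accounts for position as well as direction and magnitude, is called an affine transformation. The coordinate system in Equation (9), in which an extra dimension is appended to the original $N$-dimensional vector, is called homogeneous coordinates.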
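The equivalence of Equations (8) and (9) can be checked numerically. The sketch below uses NumPy; the helper name `to_homogeneous` and the values of `A`, `b`, and `p` are assumptions chosen only for illustration.

```python
import numpy as np

def to_homogeneous(A, b):
    """Pack a 2x2 matrix A and a translation b into the 3x3 matrix of Equation (9)."""
    M = np.eye(3)
    M[:2, :2] = A      # linear part in the upper-left block
    M[:2, 2] = b       # translation in the third column
    return M

A = np.array([[2.0, 0.0], [0.0, 1.0]])   # example linear part (scaling)
b = np.array([3.0, -1.0])                # example translation (assumed values)
p = np.array([1.0, 1.0])                 # a point (x, y)

direct = A @ p + b                                  # Equation (8)
lifted = to_homogeneous(A, b) @ np.append(p, 1.0)   # Equation (9)
print(direct, lifted)
```

The first two components of `lifted` match `direct`, and the third component stays at 1, so the result can be fed straight into another 3x3 matrix.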

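What an affine transformation really is

When I first learned about affine transformations, much of it puzzled me. That the transformation in Equation (8), which is not linear, could be written like a linear transformation just by adding a dimension, as in Equation (9), felt like a clever trick.

As covered in the post on matrices and linear transformations, a linear transformation must preserve the origin, so the existence of affine transformations made what I knew feel as if it rested on a shaky foundation.

From the calculation alone, I could see that adding a dimension lets us describe translation. But where does that extra dimension live? And since the added dimension is simply dropped afterwards, can affine transformations not be understood from the viewpoint of linear transformations at all? Is this makeshift mathematics, or is it accepted simply because it is practical? Questions like these made it hard to organize my thoughts.

However, once we picture how a linear transformation acts in the coordinate system one dimension higher, the nature of an affine transformation becomes clear, and the fact that “a linear transformation preserves the origin” ends up on firmer ground. Below is a bird's-eye view of the demo at the top of this post. Consider what happens when the entry in row 1, column 3 is changed.

Increasing the entry in row 1, column 3 pushes the top face of the cube to the right in 3D space. Projected onto the bird's-eye view, this looks the same as the picture moving to the right.

Figure 9. Changing the entry in row 1, column 3 in 3D space looks, from a bird's-eye view, like a translation along the x axis.

In the same way, changing the entry in row 2, column 3 looks like a translation along the y axis.

In other words, the matrix used in an affine transformation is still a linear transformation that preserves the origin of the higher-dimensional space. An affine transformation amounts to observing what happens on the plane at height 1 in the added dimension and projecting it onto the 2D plane.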
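This picture can be reproduced in a few lines of NumPy. In the sketch below, the matrix `M` and the square `square_z1` are assumed example values: the 3D origin stays fixed, while every point on the plane at height 1 is shifted by the amounts stored in column 3.

```python
import numpy as np

# 3x3 matrix whose (1,3) and (2,3) entries hold the shift amounts.
M = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.5],
              [0.0, 0.0, 1.0]])

origin_3d = np.zeros(3)
square_z1 = np.array([[0.0, 1.0, 1.0, 0.0],   # x
                      [0.0, 0.0, 1.0, 1.0],   # y
                      [1.0, 1.0, 1.0, 1.0]])  # height fixed at 1

moved = M @ square_z1
shift = moved[:2] - square_z1[:2]   # displacement seen from the bird's-eye view

print(M @ origin_3d)   # the 3D origin does not move
print(shift)           # every column is (2, 0.5)
```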

Mahalanobis Distance

2022-09-28T00:00:00+00:00

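※ In this post, vectors are written as row vectors by default. For details, see the first section, “Representing data with row vectors”.

Prerequisites

To get the most out of this post, it helps to be familiar with the following topics.

Matrices and linear transformations

If you need a gentler explanation of the covariance matrix, see the following post.

Principal Component Analysis (PCA)

Representing data with row vectors

In mathematics, vectors are more commonly written as column vectors. That is, an arbitrary $n$-dimensional vector $x$ is usually written as follows.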

\[\vec{x}=\begin{bmatrix}x_1 \\ x_2 \\ \vdots \\x_n\end{bmatrix} % 식 (1)\]

In this case a matrix multiplies the vector from the left. The product of an $n\times n$ matrix $A$ and an $n$-dimensional column vector $x$ is written as follows.

\[Ax % 식 (2)\]

The dot product of column vectors can be written with the transpose. For any $n$-dimensional vectors $\vec x$ and $\vec y$,

\[dot(\vec x, \vec y)=\vec x^T\vec y % 식 (3)\]

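In data science, however, a single data point is conventionally written as a row vector, for reasons that are not obvious. That is, an arbitrary $d$-dimensional vector $x$ is written as follows.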

\[\vec{x}=\begin{bmatrix}x_1 & x_2 & \cdots & x_d\end{bmatrix}% 식 (4)\]

Now a matrix must multiply the vector from the right. The product of a $d\times d$ matrix $R$ and a $d$-dimensional row vector $x$ is written as follows.

\[x R % 식 (5)\]

The dot product of row vectors also uses the transpose, but now the transposed vector is the one on the right. For any $d$-dimensional row vectors $\vec x$ and $\vec y$, we have

\[dot(\vec x, \vec y)=\vec x \vec y^T % Equation (6)\]

Going further, when there are $n$ samples and $d$ features, data science usually stores the dataset $\mathcal D$ as an $n\times d$ matrix. Adding another sample adds another row; that is, each data point is treated as a “row vector”.

In this post, vectors are row vectors by default.
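The two conventions describe the same computation. A small NumPy sketch, with `x`, `y`, and `R` as assumed example values, shows that $xR$ for a row vector equals the transpose of $R^Tx^T$ for the corresponding column vector, and that Equation (6) gives the ordinary dot product.

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0]])      # one sample as a 1 x d row vector
y = np.array([[4.0, 5.0, 6.0]])
R = np.diag([2.0, 1.0, 0.5])         # example d x d matrix (assumed values)

row_result = x @ R                   # Equation (5): matrix on the right
col_result = R.T @ x.T               # the same map written for a column vector
dot_xy = (x @ y.T).item()            # Equation (6)

print(row_result, col_result.T, dot_xy)
```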

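Relative distance that takes context into account

Consider two vectors $\vec x$ and $\vec y$ as shown below.

Figure 1. The distance between two vectors in space can be computed with a dot product.

What formula gives the distance between the points $\vec x$ and $\vec y$? It can be computed from the difference of the two vectors and a dot product. This distance is called the Euclidean distance.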

\[d_E = \sqrt{(\vec x-\vec y)(\vec x-\vec y)^T} % 식 (7)\]

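However, when we consider not just the two vectors $\vec x$ and $\vec y$ but also the other data around them, it is worth asking whether an absolute distance is always the right measure between two points.

Figure 2. When the context of the other data is taken into account, the distance between two points may need to be computed differently.

In the figure above, the points in (a) lie well outside the distribution of the blue data, whereas the points in (b) deviate from it much less. In other words, given the “context” of the distribution, the two vectors $\vec x$ and $\vec y$ in Figure 2 (a) can be regarded as farther apart than the two vectors in Figure 2 (b).

The vague word “context” can be stated more mathematically as “standard deviation”. If the data can be assumed to follow a normal distribution, we can use the fact that about 68%, 95%, and 99.7% of the data fall within 1, 2, and 3 standard deviations of the mean (center), respectively.

Figure 3. How much of the data is covered within 1, 2, and 3 standard deviations of the center of a normal distribution? (68–95–99.7 rule)

In other words, contours can be drawn in units of the standard deviation, as in the figure below. These contours serve as the yardstick for “context-aware” distance.

Figure 4. Contours marking distances of 1, 2, and 3 standard deviations from the mean (the 68, 95, and 99.7% levels of Figure 3)

Just as a normal distribution can be replaced by the standard normal distribution, shrinking the ellipses in Figure 4 (b) to the circles in Figure 4 (a) normalizes the “context”, that is, the standard deviation. As in Figure 5 below, place new axes with ticks at 1, 2, 3, ... standard deviations, then deform the vector space so that the ellipse turns back into a unit circle.

Figure 5. Representing the "context" of the data, and deforming the data (vector) space to "normalize" that "context"

This process is what the applet at the top of this post performs. Look at the left side of Figure 6. Given the “context” of the data, the yellow points should be judged to be farther apart than the orange points. Computing a Euclidean distance while keeping the context in mind is complicated. However, once the “context” is normalized as on the right side of Figure 6, the plain Euclidean distance between the yellow points is already the larger one. This is because the “normalization” has already deformed the original data (vector) space according to the “context” of the given data.

Figure 6. A Euclidean distance measured after "normalizing" the "context" already takes the context into account.

The Mahalanobis distance is exactly this: examine the context through the distribution of the given data, normalize it, and then compute the Euclidean distance.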

\[d_M = \sqrt{(\vec x-\vec y)\Sigma^{-1}(\vec x-\vec y)^T} % 식 (8)\]

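A deformation of the vector space can be written as a matrix. In particular, the matrix that expresses the “context” of the data is related to the covariance matrix ($\Sigma$), and the matrix that undoes it is related to the inverse of the covariance matrix ($\Sigma^{-1}$). From here on, we look at how to capture the “context” of the data in equations, and at how the “normalization” of the “context” is carried out in more detail.

The meaning of the covariance matrix and its inverse

A basic look at iid normal samples

Before looking at the structure of the data, we need to understand the properties of iid (independent and identically distributed) samples from a normal distribution. The term sounds difficult, but there is nothing hard about it once it is unpacked. iid is one of the simplest assumptions about how random samples are drawn.

Spelled out, iid is the following pair of assumptions:

The samples are drawn independently of one another.
The samples are all drawn from the same probability distribution.

If, in addition, the distribution being sampled is a normal distribution, the samples are said to be “independent and identically distributed normal random variables”.

Now stack several iid normal random variables $z_1, \cdots ,z_d$ side by side as the columns of $Z\in\mathbb{R}^{n\times d}$, where each column $z_i$ holds $n$ samples. For convenience, assume the standard normal distribution.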

\[Z =\begin{bmatrix} | & | & & |\\ z_1 & z_2 & \cdots & z_d\\ | & | & & |\end{bmatrix} % 식 (9)\] \[\text{where } z_1, z_2, \cdots, z_d \text{ are i.i.d. normal random variables with mean 0 and variance 1}\notag\]

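Because the samples come from a standard normal distribution, the following facts hold. Since the mean of the distribution is 0,

\[\mathbb{E}\left[z_i\right]=0 \text{ for } i = 1, 2, \cdots, d % Equation (10)\]

Next, keeping in mind that the variance of the distribution is 1, consider the following.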

\[\mathbb{E}\left[Z^TZ\right] = \mathbb{E}\left [\begin{bmatrix} z_1^T z_1 & z_1^T z_2 & \cdots & z_1^Tz_d \\ z_2^T z_1 & z_2^T z_2 & \cdots & z_2^T z_d \\ \vdots & \vdots & \ddots & \vdots \\ z_d^T z_1 & z_d^T z_2 & \cdots & z_d^Tz_d \end{bmatrix}\right ] % 식 (11)\]

Here, for $i=1,2,\cdots, d$, $\mathbb{E}\left[z_i^T z_i \right]$ is a sum of $n$ variances of 1, so $\mathbb{E}\left[z_i^T z_i \right]=n$. Also, because the columns are drawn independently and each has mean 0, $\mathbb E \left[z_i^T z_j \right]=0$ for $i\neq j$.

Therefore, Equation (11) becomes

\[\text{Equation (11)} \Rightarrow \begin{bmatrix} n & 0 & \cdots & 0 \\ 0 & n & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & n \end{bmatrix} = n I % Equation (12)\]

where $I$ is the $d\times d$ identity matrix.
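Equation (12) is a statement about an expectation, so a single random draw only satisfies it approximately. A quick simulation, with the sample size, dimension, and seed chosen arbitrarily for this sketch, shows $Z^TZ/n$ landing close to the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 3
Z = rng.standard_normal((n, d))      # each column is one z_i holding n samples

gram = Z.T @ Z                       # Equation (11) without the expectation
print((gram / n).round(2))           # close to the d x d identity matrix
```

The off-diagonal entries shrink toward zero as $n$ grows, which is why the later derivation writes “$\approx$” rather than “$=$”.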

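Another way to understand the given data

Suppose we randomly selected 1000 aliens living on Mars, measured their heights and weights, and put the results in a table. Surprisingly, the average height was 10cm and the average weight was 8kg. Suppose the table looks roughly like this.

Figure 8. Table of the heights and weights of the Martian aliens (only the first four aliens are shown, with rounded values)

The height and weight data for the 1000 aliens can be downloaded here.

Call the height and weight data $\mathcal D$. If $n$ is the number of sampled aliens and $d$ is the number of features such as height and weight, then $\mathcal D$ can be viewed as the following matrix.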

\[\mathcal D\in\mathbb{R}^{n\times d} % 식 (13)\]

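This example uses the heights and weights of 1000 randomly chosen aliens, but the distribution of any dataset can be examined in the same way. To look at the data distribution from a “new” perspective, shift every feature so that its mean is zero, and call the resulting zero-mean data $X$.

Plotting the distributions of $\mathcal D$ and $X$ gives the following.

Figure 9. Distribution of the height and weight data of the Martian aliens

Now let's understand the data $X$ in a “new” way: there is some raw data $Z$, and $X$ is the result of applying a linear transformation to it. The matrix $R$ appears to the right of $Z$ because vectors are treated as row vectors, as in Equation (5).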

\[X = ZR % 식 (14)\] \[\text{ where }Z \in \mathbb{R}^{n\times d} \text{ and } R \in \mathbb{R}^{d\times d}\notag\]

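Assume that every column of $Z$ consists of iid (independent and identically distributed) samples drawn from the standard normal distribution.

Figure 10. View $X$, a slightly modified version of the given data, as the result of applying a linear transformation to raw data $Z$.

Next, let's examine how similar the features are to one another. Doing so reveals the “context”, or structure, of the data, because if feature 1 and feature 2 resemble each other closely, they are highly correlated. To do this, compute $X^TX$. It has dimensions $d\times d$ and holds the inner products between features. If we computed $XX^T$ instead, we would expect it to capture the similarity between data points. The figure below shows how $X^TX$ is computed.

Figure 11. Computing how similar the variations of the features are to one another, in order to obtain the covariance matrix.

Using Equation (14) once more,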

\[X^TX=(ZR)^TZR=R^TZ^TZR=R^T(Z^TZ)R % 식 (15)\]

By Equation (12), we then have

\[X^TX \approx R^T(nI)R=nR^TR % Equation (16)\]

We write “$\approx$” because for actual data $Z^TZ$ does not exactly equal its expected value. Rearranging gives the following.

\[R^TR\approx \frac{1}{n}X^TX % 식 (17)\]

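What does Equation (17) ultimately mean? It gives a way to obtain the structure, or “context”, of the data $X$: the quantity $\frac{1}{n}X^TX$ is approximately $R^TR$, where $R$ is the linear transformation that maps the raw data $Z$ to the given data $X$. The matrix that captures this structure is called the covariance matrix, which we denote by $\Sigma$.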

\[\Sigma = \frac{1}{n}X^TX % 식 (18)\]

Note that one can also divide by $n-1$ instead of $n$. The covariance matrix obtained by dividing by $n-1$ is called the sample covariance matrix.
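Both normalizations are easy to compare in NumPy. The sketch below uses synthetic data as a stand-in for the alien dataset, so the means, spreads, and seed are assumptions. It centers the data, applies Equation (18), and checks the results against `np.cov`, whose `bias=True` option divides by $n$ and whose default divides by $n-1$.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in data: 1000 samples, 2 features (assumed means and spreads).
D = rng.normal(loc=[10.0, 8.0], scale=[2.0, 1.0], size=(1000, 2))

X = D - D.mean(axis=0)               # move every feature mean to zero
n = X.shape[0]
sigma_n = X.T @ X / n                # Equation (18)
sigma_n1 = X.T @ X / (n - 1)         # sample covariance matrix

match_n = np.allclose(sigma_n, np.cov(D, rowvar=False, bias=True))
match_n1 = np.allclose(sigma_n1, np.cov(D, rowvar=False))
print(match_n, match_n1)
```

`rowvar=False` tells `np.cov` that rows are samples and columns are features, which matches the row-vector convention of this post.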

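The covariance matrix is a convenient description of the overall structure of a dataset and is closely tied to the multivariate normal distribution. If a dataset with two features follows a bivariate normal distribution, its shape falls broadly into one of the three types shown below.

Figure 12. The three most representative shapes of a bivariate normal distribution

Each element of the covariance matrix is the variance of a feature or the covariance of a pair of features. With two features, as in Figure 12, the diagonal elements describe how widely the data are spread along the x axis (feature 1) and the y axis (feature 2), and the off-diagonal elements describe how strongly features 1 and 2 vary together.

Figure 13. What each element of the covariance matrix means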

The inverse matrix and normalizing the context

Let $x$ be an arbitrary data point and $z$ its raw form. According to Equation (14), the “context” of the given data can be undone, returning the point to its raw form, as follows.

\[z=xR^{-1}\]

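The linear transformation given by the inverse matrix restores the vector space deformed by $R$ to its original shape. That is, if the original transformation $R$ takes the left side of Figure 10 to the right side, the inverse transformation $R^{-1}$ takes the right side of Figure 10 back to the left.

Figure 14. The linear transformations represented by the matrix R and its inverse

Applying Equation (7) in the raw data space, the distance $d_z$ of $z$ from the origin is as follows.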

\[d_z=\sqrt{zz^T}=\sqrt{(xR^{-1})(xR^{-1})^T}\] \[=\sqrt{xR^{-1}(R^{-1})^Tx^T}=\sqrt{x(R^TR)^{-1}x^T}=\sqrt{x\Sigma^{-1}x^T}\]

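Here $\Sigma$ is the covariance matrix of the entire dataset, and the last step uses $\Sigma \approx R^TR$ from Equations (17) and (18).

Carrying out the same steps for the distance between arbitrary vectors $x$ and $y$ changes the expression as follows, which is the Mahalanobis distance introduced earlier.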

\[\Rightarrow \sqrt{(x-y)\Sigma^{-1}(x-y)^T}\]
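The derivation can be checked numerically. The matrix $R$ is not unique; the sketch below assumes one convenient choice, the transposed Cholesky factor, which satisfies $R^TR=\Sigma$ exactly. The covariance matrix and the points `x` and `y` are example values.

```python
import numpy as np

sigma = np.array([[1.0, 0.5], [0.5, 1.5]])   # example covariance matrix
R = np.linalg.cholesky(sigma).T              # one choice of R with R.T @ R == sigma

x = np.array([[1.0, 2.0]])                   # row vectors (assumed example points)
y = np.array([[0.0, 0.5]])

z_diff = (x - y) @ np.linalg.inv(R)          # map the difference back to the raw space
d_whitened = np.sqrt((z_diff @ z_diff.T).item())
d_formula = np.sqrt(((x - y) @ np.linalg.inv(sigma) @ (x - y).T).item())
print(d_whitened, d_formula)
```

The two numbers agree: a plain Euclidean distance measured after undoing $R$ is the Mahalanobis distance.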

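Contours and principal axes: eigenvalues and eigenvectors

※ This last section is somewhat advanced; you can skip it and still understand the main idea of the Mahalanobis distance.

※ To better understand what follows, it helps to be familiar with the following topics.

Let's say more about the standard deviation and the “contours” introduced in Figures 3 and 4 for capturing the “context” of the data. The “contours” are one of the key ideas for understanding the Mahalanobis distance.

First, look at Figure 12 again. It shows three representative shapes that a bivariate normal distribution can take. But are these three the only possible shapes? Probably not. If we describe a shape by how much it is rotated and how much it is stretched, there are infinitely many possibilities. In other words, any distribution that can be written as a multivariate normal distribution can be obtained by stretching and rotating a standard normal distribution.

Eigenvalue decomposition (eigendecomposition) expresses a given linear transformation in terms of how much it rotates and how much it stretches. The rotation gives the directions of the new axes shown on the right of Figure 5, and the stretch gives the length of one tick on those axes. As discussed in the post on eigendecomposition, the directions are given by the eigenvectors and the amounts of stretch by the eigenvalues.

Let's eigendecompose the covariance matrix as follows.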

\[\Sigma = Q\Lambda Q^{-1}=Q\Lambda Q^T\]

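Because the covariance matrix is always symmetric, its eigenvectors can be chosen to be orthonormal, so $Q$ is an orthogonal matrix and $Q^{-1}$ can be replaced with $Q^T$.

Here $Q$ and $\Lambda$ are the matrices holding the eigenvectors and the eigenvalues, respectively.

For example, eigendecomposing the covariance matrix of the first panel of Figure 12 gives the following result.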

\[\begin{bmatrix}1 & 0.5\\0.5 & 1.5\end{bmatrix}=\begin{bmatrix}-0.8507 & 0.5257 \\ 0.5257 & 0.8507\end{bmatrix}\begin{bmatrix}0.6910 & 0 \\ 0 & 1.8090\end{bmatrix}\begin{bmatrix}-0.8507 & 0.5257 \\ 0.5257 & 0.8507\end{bmatrix}^T\]

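Each column of $Q$ shows how the standard normal distribution has been rotated; more precisely, it gives the direction of a principal component (PC). The diagonal elements of $\Lambda$ show how much the distribution is stretched along each principal component direction. Figure 15 makes this more visual.

Figure 15. The eigendecomposition of the covariance matrix expresses, as vectors, how much the standard normal distribution has been stretched and rotated. Here $\sigma_1$ and $\sigma_2$ denote how much PC1 and PC2 are stretched.

To restate, this result is a more mathematical version of Figures 3 to 5. The principal component directions in $Q$ are the two most representative directions along which to measure the standard deviation. The diagonal elements of $\Lambda$ are the variances along those directions, so their square roots are the corresponding standard deviations.

Therefore, for data lying on a principal axis (or data projected onto the principal axes), the Mahalanobis distance can be understood as follows: rotate the principal axes back onto the original xy axes, then divide by the standard deviations obtained from $\Lambda$ (the square roots of its diagonal elements) to normalize the distance.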
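The example decomposition above can be reproduced with `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order. This is only a verification sketch; the signs of the eigenvector columns may differ from those printed in the equation, since an eigenvector is defined only up to sign.

```python
import numpy as np

sigma = np.array([[1.0, 0.5], [0.5, 1.5]])   # covariance matrix from the example
eigvals, Q = np.linalg.eigh(sigma)           # eigenvalues in ascending order

reconstructed = Q @ np.diag(eigvals) @ Q.T   # Q Lambda Q^T
print(eigvals.round(4))                      # approximately 0.691 and 1.809
print(np.allclose(reconstructed, sigma))
print(np.allclose(Q.T @ Q, np.eye(2)))       # Q is orthogonal, so Q^{-1} = Q^T
```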

Mahalanobis Distance

2022-09-28T00:00:00+00:00

※ In this post, vectors are represented using “row vectors” as the default direction. For more detailed explanation of this, please refer to the first section “Data representation using row vectors”.

Prerequisites

To better understand this post, it is recommended that you be familiar with the following content:

Matrix and Linear Transformations

If you need a more detailed explanation of the covariance matrix, please refer to the following post:

Principal Component Analysis (PCA)

Data representation using row vectors

In mathematics, it is more common to view column vectors as the default direction when representing vectors. In other words, a vector $x$ of arbitrary dimension $n$ is usually represented as follows.

\[\vec{x}=\begin{bmatrix}x_1 \\ x_2 \\ \vdots \\x_n\end{bmatrix} % Equation (1)\]

In this case, the matrix must go to the left of the vector. The product of an arbitrary $n\times n$ dimensional matrix $A$ and an $n$ dimensional column vector $x$ is represented as follows.

\[Ax % Equation (2)\]

Furthermore, the dot product between column vectors can be expressed using the transpose operator as follows. For any $n$-dimensional vectors $\vec{x}$ and $\vec{y}$,

\[dot(\vec x, \vec y)=\vec x^T\vec y % Equation (3)\]

In data science, however, a single data point is conventionally written as a row vector, for reasons that are not obvious. In other words, an arbitrary $d$-dimensional vector $x$ is represented as follows.

\[\vec{x}=\begin{bmatrix}x_1 & x_2 & \cdots & x_d\end{bmatrix}% Equation (4)\]

In this case, the matrix must go to the right of the vector. The product of an arbitrary $d\times d$ matrix $R$ and a $d$-dimensional row vector $x$ can be written as follows.

\[x R % Equation (5)\]

Furthermore, the dot product between row vectors also uses the transpose operation, but the transposed vector is on the right. In other words, for any $d$-dimensional row vectors $\vec{x}$ and $\vec{y}$,

\[dot(\vec x, \vec y)=\vec x \vec y^T % Equation (6)\]

Going further, in data science a dataset $\mathcal D$ with $n$ samples and $d$ features is usually represented as an $n\times d$ matrix. Adding one more sample adds one more row; that is, each data point is treated as a “row vector.”

In this post, the default direction of vectors is set to “row vectors.”

Contextual relative distance

Consider two vectors $\vec x$ and $\vec y$ as shown below.

Figure 1. The distance between two vectors in space can be calculated using the dot product of the vectors.

What formula gives the distance between the points $\vec x$ and $\vec y$? It can be computed from the difference of the two vectors and a dot product. This distance is called the Euclidean distance.

\[d_E = \sqrt{(\vec x-\vec y)(\vec x-\vec y)^T} % Equation (7)\]

However, if we consider other data points in the vicinity, we may need to reconsider whether to use an absolute distance between two points.

Figure 2. The distance between two points, taking into account the context of other data points, may need to be calculated differently.

In the figure above, the points in (a) lie well outside the distribution of the blue data, while the points in (b) deviate from it much less. In other words, once the “context” of the other data points is taken into account, the two vectors in Figure 2 (a) can be regarded as farther apart than the two vectors in Figure 2 (b).

The vague word “context” can be stated more mathematically as “standard deviation.” If the data can be assumed to follow a normal distribution, we can use the fact that about 68%, 95%, and 99.7% of the data fall within 1, 2, and 3 standard deviations of the mean (center), respectively.

Figure 3. How much data is included when moving 1, 2, and 3 standard deviations away from the center in a normal distribution? (68–95–99.7 rule)

In other words, contours can be drawn in units of the standard deviation, as in the figure below. These contours serve as the yardstick for “context-aware” distance.

Figure 4. Contours marking distances of 1, 2, and 3 standard deviations from the mean (the 68, 95, and 99.7% levels of Figure 3)

Just as a normal distribution can be replaced by the standard normal distribution, shrinking the ellipses in Figure 4 (b) to the circles in Figure 4 (a) normalizes the “context”, that is, the standard deviation. As in Figure 5 below, place new axes with ticks at 1, 2, 3, ... standard deviations, then deform the vector space so that the ellipse turns back into a unit circle.

Figure 5. Representation of the "context" of the data and transformation of the data (vector) space to "normalize" the "context"

This process is what the applet at the top of this post performs. Look at the left side of Figure 6. Given the “context” of the data, the yellow points should be judged to be farther apart than the orange points. Computing a Euclidean distance while keeping the “context” in mind is complicated. However, once the “context” is normalized as on the right side of Figure 6, the plain Euclidean distance between the yellow points is already the larger one. This is because the “normalization” has already deformed the original data (vector) space according to the “context” of the given data.

Figure 6. The Euclidean distance measured after "normalizing" the "context" already becomes a distance that takes into account the "context".

Investigating the “context” through the distribution of the given data and normalizing it before calculating the Euclidean distance is the Mahalanobis distance.

\[d_M = \sqrt{(\vec x-\vec y)\Sigma^{-1}(\vec x-\vec y)^T} % Equation (8)\]
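To see the effect of Equation (8), the following sketch compares two points that are equally far from the origin in the Euclidean sense. The function name `mahalanobis` and the diagonal covariance matrix are assumptions chosen so that the arithmetic is easy to follow; here the vectors are plain 1-D arrays rather than explicit row vectors.

```python
import numpy as np

def mahalanobis(x, y, sigma):
    """Equation (8) for 1-D arrays x, y and covariance matrix sigma."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(sigma) @ diff))

sigma = np.array([[4.0, 0.0], [0.0, 0.25]])  # assumed: wide spread in x, narrow in y
origin = np.zeros(2)
along_x = np.array([2.0, 0.0])               # Euclidean distance 2 from the origin
along_y = np.array([0.0, 2.0])               # also Euclidean distance 2

d_x = mahalanobis(along_x, origin, sigma)
d_y = mahalanobis(along_y, origin, sigma)
print(d_x, d_y)                              # 1.0 4.0
```

The point along the widely spread direction is 1 standard deviation away, while the point along the narrow direction is 4 standard deviations away, even though both have Euclidean distance 2.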

The transformation of the vector space can be represented by a matrix. Specifically, the matrix that represents the “context” of the data is related to the covariance matrix ($\Sigma$), and the matrix that undoes it is related to the inverse of the covariance matrix ($\Sigma^{-1}$). From here on, we look at how to capture the “context” of the data mathematically, and at how the “normalization” of the “context” is carried out in more detail.

The Meaning of the Covariance Matrix and its Inverse Matrix

Basic Understanding of iid Gaussian Distribution Samples

Before looking at the structure of the data, we first need to understand the properties of iid (independent and identically distributed) samples from a normal distribution. The terminology may seem difficult, but there is nothing hard about it once it is unpacked. iid is one of the simplest assumptions about how random samples are drawn.

Spelled out, iid is the following pair of assumptions:

The samples are drawn independently of one another.
The samples are all drawn from the same probability distribution.

If, in addition, the distribution being sampled is a normal distribution, the samples are said to be “independent and identically distributed normal random variables.”

Now let's stack several iid normal random variables $z_1, \cdots, z_d$ side by side as the columns of $Z\in\mathbb{R}^{n\times d}$, where each column $z_i$ holds $n$ samples. For convenience of calculation, assume a standard normal distribution.

\[Z =\begin{bmatrix} | & | & & |\\ z_1 & z_2 & \cdots & z_d\\ | & | & & |\end{bmatrix} % Equation (9)\] \[\text{where } z_1, z_2, \cdots, z_d \text{ are i.i.d. normal random variables with mean 0 and variance 1}\notag\]

Because the samples are drawn from a standard normal distribution, we can confirm the following. Since the mean of the distribution is 0,

\[\mathbb{E}[z_i]=0 \text{ for } i = 1, 2, \cdots, d % Equation (10)\]

Next, given that the variance of the distribution is 1, consider the following.

\[\mathbb{E}\left[Z^TZ\right] = \mathbb{E}\left [\begin{bmatrix} z_1^T z_1 & z_1^T z_2 & \cdots & z_1^Tz_d \\ z_2^T z_1 & z_2^T z_2 & \cdots & z_2^T z_d \\ \vdots & \vdots & \ddots & \vdots \\ z_d^T z_1 & z_d^T z_2 & \cdots & z_d^Tz_d \end{bmatrix}\right ] % Equation (11)\]

Here, for $i=1,2,\cdots, d$, $\mathbb{E}\left[z_i^T z_i\right]$ is a sum of $n$ variances of 1, so $\mathbb{E}\left[z_i^T z_i\right]=n$. Furthermore, because the columns are drawn independently and each has mean 0, $\mathbb E \left[z_i^T z_j \right]=0$ for $i\neq j$.

Therefore, equation (11) can be represented as

\[\text{Equation (11)} \Rightarrow \begin{bmatrix} n & 0 & \cdots & 0 \\ 0 & n & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & n \end{bmatrix} = n I\]

Here, $I$ is the $d \times d$ identity matrix.

Another way to understand the given data

Suppose that 1,000 aliens living on Mars were randomly selected and their height and weight were measured. Amazingly, the average height was 10cm and the average weight was 8kg. Suppose that the data is arranged in a table, which roughly looks like the following.

Figure 8. Table of the heights and weights of the Martian aliens (only the first four aliens are shown, with rounded values)

The height and weight data of 1,000 aliens can be found here.

Let’s call the data that arranges the height and weight $\mathcal D$. Also, if the number of aliens sampled is denoted as $n$, and the number of features such as height and weight is represented as $d$, then $\mathcal D$ can also be viewed as the following matrix.

\[\mathcal D\in\mathbb{R}^{n\times d}\]

This example uses the heights and weights of 1,000 randomly chosen aliens, but the distribution of any dataset can be examined in the same way. To look at the data distribution from a “new” perspective, shift every feature so that its mean is zero, and call the resulting zero-mean data $X$.

If we plot the distributions of $\mathcal D$ and $X$, it is shown as below.

Figure 9. Distribution of height and weight data of aliens on Mars

Now let’s understand the data $X$ in a “new” way: as the result of applying a linear transformation to some raw data $Z$. The matrix $R$ appears to the right of $Z$ because vectors are treated as row vectors, as in Equation (5).

\[X = ZR % Equation (14)\] \[\text{where }Z \in \mathbb{R}^{n\times d} \text{ and } R \in \mathbb{R}^{d\times d}\notag\]

Let’s assume that every column of $Z$ consists of iid (independent and identically distributed) samples drawn from the standard normal distribution.

Figure 10. View $X$, a slightly modified version of the given data, as the result of applying a linear transformation to raw data $Z$.

From now on, let’s investigate the similarity between features. By investigating the similarity between features, we can understand the “context” or structural form of the data. That’s because, for example, if feature 1 and feature 2 are very similar, it means that they are highly correlated with each other. To do this, let’s calculate $X^TX$. $X^TX$ will have dimensions of $d \times d$, and it represents the inner product between features. If we calculate $XX^T$, we can expect that this will indicate the similarity between the data. Let’s see the process of calculating $X^TX$ in the figure below.

Figure 11. The process of calculating how similar each data feature is in order to compute the covariance matrix.

Here, using Equation (14) again, we have:

\[X^TX = (ZR)^TZR = R^TZ^TZR = R^T(Z^TZ)R % Equation (15)\]

Here, according to Equation (12),

\[X^TX \approx R^T(nI)R = nR^TR % Equation (16)\]

holds. We write “$\approx$” because for actual data $Z^TZ$ does not exactly equal its expected value. Rearranging gives the following:

\[R^TR \approx \frac{1}{n}X^TX % Equation (17)\]

What does Equation (17) ultimately mean? It gives a way to obtain the structure, or “context”, of the data $X$: the quantity $\frac{1}{n}X^TX$ is approximately $R^TR$, where $R$ is the linear transformation that maps the raw data $Z$ to the given data $X$. The matrix that captures this structure is called the covariance matrix, which we denote by $\Sigma$.

\[\Sigma = \frac{1}{n}X^TX % Equation (18)\]

Note that one can also divide by $n-1$ instead of $n$. The covariance matrix obtained by dividing by $n-1$ is called the sample covariance matrix.

The covariance matrix is a convenient description of the overall structure of a dataset and is closely tied to the multivariate normal distribution. If a dataset with two features follows a bivariate normal distribution, its shape falls broadly into one of the three types shown below.

Figure 12. The three most representative forms of bivariate normal distributions.

Each element of the covariance matrix is the variance of a feature or the covariance of a pair of features. With two features, as in Figure 12, the diagonal elements describe how widely the data are spread along the x axis (feature 1) and the y axis (feature 2), and the off-diagonal elements describe how strongly features 1 and 2 vary together.

Figure 13. What each element of the covariance matrix represents.

Inverse matrix and normalization of context

Let $x$ be an arbitrary data point and $z$ its raw form. According to Equation (14), the “context” of the given data can be undone, returning the point to its raw form, as follows:

\[z=xR^{-1}\]

Here, the linear transformation using the inverse matrix is a method of restoring the vector space transformed by the given linear transformation $R$ to its original form. That is, if the process of changing from left to right in Figure 10 is the transformation performed by the original linear transformation $R$, the inverse transformation $R^{-1}$ can be seen as the transformation from right to left in Figure 10.

Figure 14. Linear transformation represented by matrix R and its inverse transformation.

Applying Equation (7) in the raw data space, the distance $d_z$ of $z$ from the origin is:

\[d_z=\sqrt{zz^T}=\sqrt{(xR^{-1})(xR^{-1})^T}\] \[=\sqrt{xR^{-1}(R^{-1})^Tx^T}=\sqrt{x(R^TR)^{-1}x^T}=\sqrt{x\Sigma^{-1}x^T}\]

Here, $\Sigma$ is the covariance matrix of the entire dataset, and the last step uses $\Sigma \approx R^TR$ from Equations (17) and (18).

If we perform the same process as above for the distance between arbitrary vectors $x$ and $y$, we can modify the equation as follows, and this is the same as the Mahalanobis distance mentioned earlier:

\[\Rightarrow \sqrt{(x-y)\Sigma^{-1}(x-y)^T}\]

Contour lines and Principal Axes: Eigenvalues, Eigenvectors

※ This last section is somewhat advanced; you can skip it and still understand the main idea of the Mahalanobis distance.

※ To better understand what follows, it helps to be familiar with the following topics:

Now let’s say more about the standard deviation and the “contours” introduced in Figures 3 and 4 for capturing the “context” of the data. The “contours” are one of the key ideas for understanding the Mahalanobis distance.

First, let’s take another look at Figure 12. Figure 12 shows three representative forms of bivariate normal distributions that can be found. However, are these three shapes the only possible shapes for the distribution? Probably not. If we express the shape of the distribution with the two factors being how much it has rotated and how much it has been stretched, we can see that there will be countless shapes of the distribution. In other words, any distribution that can be represented by a multivariate normal distribution can be obtained by stretching and rotating a standard normal distribution.

Eigenvalue decomposition (eigendecomposition) expresses a given linear transformation in terms of how much it rotates and how much it stretches. The rotation gives the directions of the new axes shown on the right of Figure 5, and the stretch gives the length of one tick on those axes. As discussed in the post on eigendecomposition, the directions are given by the eigenvectors and the amounts of stretch by the eigenvalues.

Let’s try to eigen-decompose the covariance matrix as follows:

\[\Sigma = Q\Lambda Q^{-1}=Q\Lambda Q^T\]

Here, because the covariance matrix is always symmetric, its eigenvectors can be chosen to be orthonormal, so $Q$ is an orthogonal matrix and $Q^{-1}$ can be replaced with $Q^T$.

Here, $Q$ and $\Lambda$ are matrices with eigenvectors and eigenvalues, respectively.

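For example, if we eigendecompose the covariance matrix of the first panel of Figure 12, we get the following result:

\[\begin{bmatrix}1 & 0.5\\0.5 & 1.5\end{bmatrix}=\begin{bmatrix}-0.8507 & 0.5257 \\ 0.5257 & 0.8507\end{bmatrix}\begin{bmatrix}0.6910 & 0 \\ 0 & 1.8090\end{bmatrix}\begin{bmatrix}-0.8507 & 0.5257 \\ 0.5257 & 0.8507\end{bmatrix}^T\]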

Each column of $Q$ shows how the standard normal distribution has been rotated; more precisely, it gives the direction of a principal component (PC). The diagonal elements of $\Lambda$ show how much the distribution is stretched along each principal component direction. Figure 15 makes this more visual.

Figure 15. The eigendecomposition of the covariance matrix expresses, as vectors, how much the standard normal distribution has been stretched and rotated. Here, $\sigma_1$ and $\sigma_2$ represent how much PC1 and PC2 have been stretched, respectively.

To restate, this result is a more mathematical version of Figures 3 to 5. The principal component directions in $Q$ are the two most representative directions along which to measure the standard deviation. The diagonal elements of $\Lambda$ are the variances along those directions, so their square roots are the corresponding standard deviations.

Therefore, for data lying on a principal axis (or data projected onto the principal axes), the Mahalanobis distance can be understood as follows: rotate the principal axes back onto the original xy axes, then divide by the standard deviations obtained from $\Lambda$ (the square roots of its diagonal elements) to normalize the distance.
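This reading of the distance can be checked directly. The sketch below treats the eigenvalues as variances along the principal axes, so the scaling uses their square roots. The points `x` and `y` are assumed example values, and the covariance matrix is the one from the decomposition example.

```python
import numpy as np

sigma = np.array([[1.0, 0.5], [0.5, 1.5]])
eigvals, Q = np.linalg.eigh(sigma)           # columns of Q are the principal axes

x = np.array([1.0, 2.0])                     # assumed example points
y = np.array([0.0, 0.5])

coords = (x - y) @ Q                         # coordinates along the principal axes
scaled = coords / np.sqrt(eigvals)           # divide by the per-axis standard deviation
d_axes = float(np.linalg.norm(scaled))
d_formula = float(np.sqrt((x - y) @ np.linalg.inv(sigma) @ (x - y)))
print(d_axes, d_formula)
```

Both values agree because $\Sigma^{-1}=Q\Lambda^{-1}Q^T$, so the quadratic form reduces to a sum of squared principal-axis coordinates, each divided by its eigenvalue.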

Chernoff Bound

2022-09-13T00:00:00+00:00

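Prerequisites

To get the most out of this post, it helps to be familiar with the following topics.

Markov's inequality and Chebyshev's inequality

Proof

The Chernoff bound takes different forms for the lower tail and the upper tail. The proof is presented below.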

Lower-Tail Chernoff Bound

$X$가 $N$ 개의 독립적인 랜덤변수의 합이라고 하자. 또, 이때 이 랜덤 변수들은 베르누이 분포를 따르며 $p_i$의 확률로 1의 값을 갖는다고 하자.

\[X = \sum_{i=1}^{N}X_i\]

이 때, 임의의 $\delta\in (0, 1)$에 대해 다음이 성립한다.

\[P(X \lt (1-\delta)E[X]) \lt e^{-E[X]\cdot \delta^2/2}\]

여기서 $e$는 자연로그의 밑이다.

(증명)

우선 아래의 부등식이 성립한다는 것을 증명하자.

\[P(X \lt (1-\delta) E[X]) \lt \left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{E[X]}\]

이를 증명하기 위해 임의의 매개변수 $t>0$을 도입하자. 이 $t$를 이용해 우리는 $X$에 대한 식을 $e^{-tX}$에 대한 식으로 변환할 것이다. 이 방법은 적률생성함수(moment generating function)을 활용하는 원리와 유사하게 볼 수 있는데, 원래의 $X$ domain에서 풀기 어려운 문제를 매개변수 $t$ 도메인으로 옮겨 문제를 상대적으로 쉽게 풀기 위함이라고 볼 수도 있다.

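To prove Equation (3), let's adapt Markov's inequality to $e^{-tX}$ instead of $X$. For a non-negative random variable and $\alpha\gt 0$, the original Markov's inequality is as follows:

\[P(X\geq \alpha) \leq \frac{E[X]}{\alpha}\]

Since $t\gt 0$, the event $X\lt\alpha$ is the same as the event $e^{-tX}\gt e^{-t\alpha}$, so applying Markov's inequality to the non-negative random variable $e^{-tX}$ gives

\[P(X\lt\alpha) \leq \frac{E[e^{-tX}]}{e^{-t\alpha}}\]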

식 (5)를 식 (2)의 좌변에 맞춰 적용하면 결과는 아래와 같다. 여기서 $\alpha = (1-\delta)E[X]$이므로,

\[\Rightarrow P(X\lt (1-\delta)E[X]) \leq \frac{E[e^{-tX}]}{e^{-t(1-\delta)E[X]}}\]

이 성립한다. 또 $X$를 구성하고 있는 $X_i$ 들은 독립적으로 발생한 사건이다. 위 식의 우변의 분자를 보면,

\[E[e^{-tX}]=E[e^{-t\cdot\sum_{i}X_i}]=E[e^{-t(X_1+X_2+\cdots+X_N)}]\notag\] \[=E[e^{-tX_1}\cdot e^{-tX_2}\cdot e^{-tX_3}\cdot \cdots \cdot e^{-t X_N}]\]

과 같이 쓸 수 있는데, 독립 랜덤 변수의 곱의 기댓값은 기댓값들의 곱이므로 위 식은 아래와 같이 고쳐쓸 수 있다.

\[\Rightarrow E[e^{-tX_1}\cdot e^{-tX_2}\cdots e^{-tX_N}]=\prod_{i=1}^{N}E[e^{-tX_i}]\]

위 식의 $E[e^{-tX_i}]$를 자세히 보면 베르누이 분포를 따르는 시행 $X_i$에 대한 변환식 $e^{-tX_i}$의 기댓값임을 알 수 있다. $X_i$는 $(1-p_i)$ 혹은 $p_i$의 확률로 0 또는 1의 값을 가지므로 $X_i$의 기댓값은

\[E[X_i] = (1-p_i)\cdot 0 + p_i \cdot 1 = p_i\]

이며, $e^{-tX_i}$의 기댓값은

\[E[e^{-tX_i}]=(1-p_i)e^{-t\cdot 0}+p_i e^{-t\cdot 1}\notag\] \[=1-p_i + p_i e^{-t}=1+p_i(e^{-t}-1)\notag\] \[= 1+E[X_i](e^{-t}-1)\]

임을 알 수 있다. 또한 위 식의 마지막 결과물은 $\exp(E[X_i]\cdot(e^{-t}-1))$의 테일러 급수 두 항과 일치한다는 점을 고려하면 다음이 성립함을 알 수 있다.

\[E[e^{-tX_i}]=1+E[X_i](e^{-t}-1) \lt e^{E[X_i](e^{-t}-1)}\]

식 (11)을 식 (8)에 다시 대입하면,

\[\prod_{i=1}^{N}E[e^{-tX_i}]\lt\prod_{i=1}^{N}e^{E[X_i](e^{-t}-1)}\]

이 성립하게 됨을 알 수 있는데, 위 식의 우변을 또 다시 쓰면,

\[\prod_{i=1}^{N}\exp(E[X_i](e^{-t}-1))=\exp\left(\sum_{i=1}^{N}E[X_i]\cdot (e^{-t}-1)\right)\notag\] \[=\exp\left(E[X](e^{-t}-1)\right)\]

이다. 따라서 식 (13)의 결과를 식 (6)에 대입하면 아래와 같은 식을 얻을 수 있다.

\[P(X<(1-\delta)E[X]) \leq \frac{\exp\left(E[X](e^{-t}-1)\right)}{\exp\left(-t(1-\delta)E[X]\right)}\]

위 식은 어떤 $t>0$에 대해서라도 성립하는 식이다. 이제는 식 (14)가 최대한 tight한 boundary에 대해 성립할 수 있도록 식 (14)의 최소값을 내주는 $t=t^\ast$ 값을 찾자. 이 과정은 식 (14)를 미분하고 미분한 값이 $0$이 되는 $t^\ast$를 찾음으로써 해결할 수 있다.

식 (14)의 우변에 지수법칙을 적용하여 한줄로 쓰면 다음과 같다.

\[\exp(E[X](e^{-t}-1)+t(1-\delta)E[X])\]

이를 조금만 더 정리하고 $f(t)$라고 이름 붙이자.

\[f(t) = \exp\left(E[X]e^{-t}-E[X]+tE[X]-t\delta E[X]\right)\notag\] \[=\exp\left(E[X](e^{-t}+t-t\delta -1)\right)\]

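Differentiating $f(t)$ with respect to $t$ gives

\[f'(t) = \exp\left(E[X](e^{-t}+t-t\delta -1)\right)(E[X])(-e^{-t}+1-\delta)\]

In Equation (17), the leading $\exp()$ factor is always positive, and $E[X]$ is positive as well. Therefore, $t=t^\ast$ is found by setting only the rightmost parenthesized factor to zero. That factor increases with $t$, so $f'$ changes sign from negative to positive at its zero, which is therefore the minimizer:

\[-e^{-t}+1-\delta = 0\]

The value $t=t^\ast$ that satisfies this is

\[t=t^\ast = \ln\left(\frac{1}{1-\delta}\right)\]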

식 (19)를 식 (14)에 대입하면 식 (3)을 얻을 수 있게 된다. 식 (19)를 식 (14)에 대입하면,

\[\Rightarrow P(X<(1-\delta) E[X])\leq \frac{\exp\left(E[X]\left(e^{-\ln\left(1/(1-\delta)\right)}-1\right)\right)}{\exp\left(-\ln(1/(1-\delta))(1-\delta)E[X]\right)}\]

여기서 우변만 보면 다음과 같다.

\[\text{(RHS)}\Rightarrow \frac{\exp(E[X](1-\delta -1))}{\exp(-(1-\delta)E[X]\ln\left(1/(1-\delta)\right))}\] \[=\frac{\exp(E[X](-\delta))}{\exp\left(\ln\left((1-\delta)^{(1-\delta) E[X]}\right)\right)}=\frac{\exp(-\delta E[X])}{(1-\delta)^{(1-\delta)E[X]}}\] \[=\left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{E[X]}\]

한편,

\[\ln(1-x)=-x-\frac{x^2}{2}-\frac{x^3}{3}-\cdots = -\sum_{n=1}^{\infty}\frac{x^n}{n}\]

이므로,

\[(1-\delta)\ln(1-\delta) = - (1-\delta)\delta - (1-\delta)\frac{\delta^2}{2}\cdots\notag\] \[=-\delta+\delta^2-\frac{\delta^2}{2}+\frac{\delta^3}{2}\cdots\notag\] \[=-\delta+\delta^2/2+\cdots\]

과 같다. 따라서

\[(1-\delta)\ln(1-\delta) \gt -\delta +\frac{\delta^2}{2}\]

가 성립하며 로그의 성질에 따라

\[(1-\delta)^{(1-\delta)}\gt\exp\left(-\delta + \frac{\delta^2}{2}\right)\]

가 성립함을 알 수 있다.

따라서, 식 (27)을 식(23)에 대입하면,

\[\left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{E[X]}\lt \left(\frac{e^{-\delta}}{e^{(-\delta+\delta^2/2)}}\right)^{E[X]}\] \[\Rightarrow \left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{E[X]}\lt \left(e^{-\delta^2/2}\right)^{E[X]}\]

이다. 따라서, 이 결과를 식 (20)과 식 (23)에 대입하면,

\[\Rightarrow P(X\lt (1-\delta)E[X])\lt \exp(-E[X]\delta^2/2)\]

이다.

(증명 끝)

Upper-Tail Chernoff Bound

Upper-Tail 부분에 대한 증명은 Lower-tail에 대한 증명과 거의 유사한 방식으로 진행된다. 따라서 Upper-tail에 대한 증명은 더 빠르게 진행되며 빠르게 넘어간 부분은 Lower-tail 파트 증명에서 참고하기 바란다. 식 (1)과 같은 랜덤변수 $X$에 대해 임의의 $\delta\in(0, 1)$¹을 선정하면 다음이 성립한다.

\[P(X\gt(1+\delta)E[X]) \lt e^{-E[X]\cdot \delta^2/3}\]

여기서 $e$는 자연로그의 밑이다.

(증명)

Lower-tail에 대한 증명에서와 마찬가지로 아래를 증명하는 것이 첫 스텝이다.

\[P(X\gt(1+\delta)E[X]) \lt \left(\frac{e^\delta}{(1+\delta)^{(1+\delta)}}\right)^{E[X]}\]

Introduce a parameter $t\gt 0$ and rewrite the statement about $X$ as a statement about $e^{tX}$. Applying Markov's inequality to the left-hand side of Equation (32) gives the following.

\[P(X\gt(1+\delta)E[X]) \leq \frac{E[e^{tX}]}{e^{t(1+\delta)E[X]}}\]

식 (1)의 $X$의 정의에 따라 식 (33) 우변의 분자만 따로 떼서 생각하면 다음과 같이 쓸 수도 있다는 것을 알 수 있다.

\[E[e^{tX}]=E[e^{t\sum_i X_i}]=E\left[\prod_{i=1}^{N}e^{tX_i}\right]=\prod_{i=1}^{N}E[e^{tX_i}]\]

이다.

한편, $X_i$는 베르누이 시행이므로, 식 (9)가 성립하고, $e^{tX_i}$의 기댓값은 다음과 같다.

\[E[e^{tX_i}]=(1-p_i)e^{t\cdot 0}+p_i e^{t\cdot 1}=1-p_i+p_ie^t\notag\] \[=1+p_i(e^t-1)=1+E[X_i](e^t-1)\]

Also, the last expression in Equation (35) matches the first two terms of the Taylor series of $\exp(E[X_i]\cdot(e^t-1))$. That is, since

\[\exp(x) = 1+\frac{x}{1!}+\frac{x^2}{2!}+\cdots\]

we have

\[\exp(E[X_i]\cdot(e^t-1))=1+E[X_i](e^t-1)+\frac{1}{2!}(E[X_i](e^t-1))^2+\cdots \notag\] \[\gt 1+E[X_i](e^t-1)\]

Therefore, the final result of Equation (34) can be bounded as follows.

\[\prod_{i=1}^{N}E[e^{tX_i}]=\prod_{i=1}^{N}(1+E[X_i](e^t-1))\lt\prod_{i=1}^{N}\exp(E[X_i](e^t-1))\]

한편,

\[\prod_{i=1}^{N}\exp(E[X_i](e^t-1))=\exp\left(E\left[\sum_{i=1}^{N}X_i\right](e^t-1)\right)=\exp(E[X](e^t-1))\]

이므로 위 결과를 식 (33)에 대입하면,

\[P(X\gt(1+\delta)E[X])\leq \frac{\exp(E[X](e^t-1))}{e^{t(1+\delta)E[X]}}\]

임을 알 수 있다. Lower-tail boundary에서와 마찬가지로 식 (40)의 우변을 미분하여, 미분 계수를 0으로 만들어줄 수 있는 가장 tight한 $t=t^\ast$를 찾으면 다음과 같다.

\[t^\ast=\ln(1+\delta)\]

식 (41)을 식 (40)에 대입하면 식 (32)를 얻을 수 있게 된다.

\[P(X\gt(1+\delta)E[X]) \leq \frac{\exp(E[X](e^{\ln(1+\delta)}-1))}{\exp((1+\delta)E[X]\ln(1+\delta))}\] \[\Rightarrow P(X\gt(1+\delta)E[X]) \leq \frac{\exp(E[X]\delta)}{(1+\delta)^{(1+\delta)E[X]}}\] \[\Rightarrow P(X\gt(1+\delta)E[X]) \leq \left(\frac{e^\delta}{(1+\delta)^{(1+\delta)}}\right)^{E[X]}\]

식 (44)로부터 식 (31)을 증명하기 위해선 아래의 수식이 사실인지 확인하면 된다.

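\[\frac{e^\delta}{(1+\delta)^{(1+\delta)}} \lt e^{-\delta^2/3}\]

Taking the logarithm of both sides of Equation (45) and moving every term to the left-hand side gives the following inequality.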

\[f(\delta) = \delta - (1+\delta)\ln (1+\delta) + \frac{\delta^2}{3}< 0\]

$f(\delta)$의 미분계수를 구하면 아래와 같다.

\[f'(\delta) = 1-\frac{1+\delta}{1+\delta}-\ln(1+\delta)+\frac{2}{3}\delta = -\ln(1+\delta)+\frac{2}{3}\delta\] \[f''(\delta) = - \frac{1}{1+\delta}+\frac{2}{3}\]

여기서 2계 도함수로부터 알 수 있는 것은 아래와 같다.

\[\begin{cases}f''(\delta)\lt 0\text{ for } 0\leq \delta \lt 1/2 \\ f''(\delta) > 0 \text{ for } \delta >1/2\end{cases}\]

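In other words, $f'(\delta)$ first decreases and then increases on the interval $(0,1)$. Also, from the expression for the first derivative, $f'(0)=0$ and $f'(1)=\frac{2}{3}-\ln 2\lt 0$. Therefore, $f'(\delta)\lt 0$ on the interval $(0,1)$. Finally, since $f(0)=0$ and $f$ is decreasing there, $f(\delta)$ is always negative on the interval $(0,1)$.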

그러므로, 식 (46)은 사실임을 알 수 있고 식 (31) 또한 성립하는 것이다.

(증명 끝)

Reference

Outlier Analysis (2nd ed.), Charu C. Aggarwal, Springer
Probability and Computing (2nd ed.), Michael Mitzenmacher and Eli Upfal, Cambridge University Press

The book Outlier Analysis (Charu C. Aggarwal) presents a Chernoff bound that holds on the range $(0, 2e-1)$, but I have not yet worked out how to prove it. Instead, this post presents the Chernoff bound for the range $(0, 1)$ given in other textbooks. ↩

Chernoff Bound

2022-09-13T00:00:00+00:00

Prerequisites

To fully understand this post, it is recommended that you have knowledge of the following:

Markov and Chebyshev Inequalities

Proof

The Chernoff bound takes different forms for the lower tail and the upper tail. This post walks through the proof of each.

Lower-Tail Chernoff Bound

Let $X$ be the sum of $N$ independent random variables $X_i$, where each $X_i$ follows a Bernoulli distribution and takes the value 1 with probability $p_i$.

\[X = \sum_{i=1}^{N}X_i\]

For any $\delta\in (0, 1)$, the following inequality holds:

\[P(X \lt (1-\delta)E[X]) \lt e^{-E[X]\cdot \delta^2/2}\]

Here, $e$ represents Euler’s number.
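Before going through the proof, the statement can be checked against an exact computation. The sketch below assumes the special case where all $p_i$ are equal, so that $X$ is binomial; the values `n = 100` and `p = 0.3` and the helper names are illustrative choices, not part of the theorem.

```python
from math import comb, exp

def lower_tail_exact(n, p, delta):
    """Exact P(X < (1 - delta) * E[X]) for X ~ Binomial(n, p)."""
    mean = n * p
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if k < (1 - delta) * mean)

def lower_tail_bound(n, p, delta):
    """Chernoff lower-tail bound exp(-E[X] * delta^2 / 2)."""
    return exp(-n * p * delta**2 / 2)

for delta in (0.1, 0.3, 0.5, 0.9):
    exact = lower_tail_exact(100, 0.3, delta)
    bound = lower_tail_bound(100, 0.3, delta)
    print(f"delta={delta}: exact={exact:.3e}, bound={bound:.3e}")
```

The exact probability should sit below the bound for every $\delta$ in $(0,1)$; the gap between the two columns gives a feel for how loose the bound is.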

(Proof)

First, let’s prove the following inequality:

\[P(X \lt (1-\delta) E[X]) \lt \left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{E[X]}\]

To prove this, let's introduce an arbitrary parameter $t>0$. We will use this $t$ to rewrite the statement about $X$ as a statement about $e^{-tX}$. The idea is similar to that of moment generating functions: a problem that is hard to solve in the original domain of $X$ is moved to the domain of the parameter $t$, where it is relatively easier to solve.

To prove Equation (3), let's adapt Markov's inequality to $e^{-tX}$ instead of $X$. For a non-negative random variable and $\alpha\gt 0$, the original Markov's inequality is as follows:

\[P(X\geq \alpha) \leq \frac{E[X]}{\alpha}\]

Since $t\gt 0$, the event $X\lt\alpha$ is the same as the event $e^{-tX}\gt e^{-t\alpha}$, so applying Markov's inequality to the non-negative random variable $e^{-tX}$ gives:

\[P(X\lt\alpha) \leq \frac{E[e^{-tX}]}{e^{-t\alpha}}\]

Applying Equation (5) to the left-hand side of Equation (2), that is, setting $\alpha = (1-\delta)E[X]$, gives

\[\Rightarrow P(X\lt (1-\delta)E[X]) \leq \frac{E[e^{-tX}]}{e^{-t(1-\delta)E[X]}}\]

Moreover, the random variables $X_i$ that make up $X$ are independent. The numerator on the right-hand side of the above equation can be written as

\[E[e^{-tX}] = E[e^{-t\cdot\sum_{i}X_i}] = E[e^{-t(X_1+X_2+\cdots+X_N)}]\notag\] \[=E[e^{-tX_1}\cdot e^{-tX_2}\cdot e^{-tX_3}\cdot \cdots \cdot e^{-t X_N}]\]

Because the expected value of a product of independent random variables is the product of their expected values, this can be rewritten as follows:

\[\Rightarrow E[e^{-tX_1}\cdot e^{-tX_2}\cdots e^{-tX_N}]=\prod_{i=1}^{N}E[e^{-tX_i}]\]

If we take a closer look at $E[e^{-tX_i}]$, we can see that it is the expected value of $e^{-tX_i}$, a transformation of the Bernoulli random variable $X_i$. Since $X_i$ takes the value 0 with probability $(1-p_i)$ and the value 1 with probability $p_i$, the expected value of $X_i$ is

\[E[X_i] = (1-p_i)\cdot 0 + p_i \cdot 1 = p_i\]

Similarly, the expected value of $e^{-tX_i}$ is

\[E[e^{-tX_i}]=(1-p_i)e^{-t\cdot 0}+p_i e^{-t\cdot 1}\notag\] \[=1-p_i + p_i e^{-t}=1+p_i(e^{-t}-1)\notag\] \[= 1+E[X_i](e^{-t}-1)\]

The last expression matches the first two terms of the Taylor series of $\exp(E[X_i]\cdot(e^{-t}-1))$. Since $1+x \lt e^x$ for every $x\neq 0$, and $E[X_i](e^{-t}-1)\neq 0$ whenever $p_i\gt 0$, we can conclude that:

\[E[e^{-tX_i}]=1+E[X_i](e^{-t}-1) \lt e^{E[X_i](e^{-t}-1)}\]

Substituting equation (11) into equation (8), we have

\[\prod_{i=1}^{N}E[e^{-tX_i}]\lt\prod_{i=1}^{N}e^{E[X_i](e^{-t}-1)}\]

Thus, it follows that the right-hand side of the above equation can be rewritten as:

\[\prod_{i=1}^{N}\exp(E[X_i](e^{-t}-1))=\exp\left(\sum_{i=1}^{N}E[X_i]\cdot (e^{-t}-1)\right)\notag\] \[=\exp\left(E[X](e^{-t}-1)\right)\]
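The two steps above (bounding each factor by an exponential, then collapsing the product into a single exponential of $E[X]$) can be checked numerically. This is only a sketch: the list `probs` of success probabilities is a made-up example with unequal $p_i$.

```python
from math import exp, prod

# Hypothetical success probabilities p_i (illustration only).
probs = [0.1, 0.25, 0.5, 0.7, 0.9]
mean = sum(probs)  # E[X] is the sum of the p_i

def mgf_exact(t):
    """prod_i E[exp(-t X_i)] = prod_i (1 + p_i (exp(-t) - 1))."""
    return prod(1 + p_i * (exp(-t) - 1) for p_i in probs)

def mgf_bound(t):
    """The upper bound exp(E[X] (exp(-t) - 1))."""
    return exp(mean * (exp(-t) - 1))

for t in (0.1, 0.5, 1.0, 3.0):
    print(f"t={t}: exact={mgf_exact(t):.6f}, bound={mgf_bound(t):.6f}")
```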

Hence, substituting the result of equation (13) into equation (6) yields the following equation.

\[P(X<(1-\delta)E[X]) \leq \frac{\exp\left(E[X](e^{-t}-1)\right)}{\exp\left(-t(1-\delta)E[X]\right)}\]

Equation (14) holds for any $t>0$. To make the bound as tight as possible, let's find the value $t=t^\ast$ that minimizes the right-hand side of Equation (14). This can be done by differentiating the right-hand side and finding the $t^\ast$ at which the derivative is zero.

Applying the laws of exponents to the right-hand side of Equation (14), it can be written as a single exponential:

\[\exp(E[X](e^{-t}-1)+t(1-\delta)E[X])\]

Simplifying this a little more and naming it $f(t)$, we have:

\[f(t) = \exp\left(E[X]e^{-t}-E[X]+tE[X]-t\delta E[X]\right)\notag\] \[=\exp\left(E[X](e^{-t}+t-t\delta -1)\right)\]

Differentiating $f(t)$ with respect to $t$ gives:

\[f'(t) = \exp\left(E[X](e^{-t}+t-t\delta -1)\right)(E[X])(-e^{-t}+1-\delta)\]

The exponential factor at the front of Equation (17) is always positive, and $E[X]$ is also positive. Therefore, $t=t^\ast$ can be found by setting only the rightmost parenthesized factor to zero. That factor increases with $t$, so $f'$ changes sign from negative to positive at its zero, which is therefore the minimizer:

\[-e^{-t}+1-\delta = 0\]

The value $t=t^\ast$ that satisfies this is:

\[t=t^\ast = \ln\left(\frac{1}{1-\delta}\right)\]
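The closed-form $t^\ast$ can be compared with a brute-force search. Because $E[X]>0$ and $\exp$ is increasing, minimizing $f(t)$ is the same as minimizing the bracketed term $e^{-t}+t-t\delta-1$, so the sketch below works with that term only. The grid range, the grid resolution, and the value `delta = 0.4` are arbitrary choices for illustration.

```python
from math import exp, log

def exponent(t, delta):
    """The term multiplied by E[X] inside f(t): e^{-t} + t - t*delta - 1."""
    return exp(-t) + t - t * delta - 1

def argmin_on_grid(delta, t_max=10.0, steps=100_000):
    """Brute-force minimizer of the exponent over a uniform grid on (0, t_max]."""
    grid = (t_max * i / steps for i in range(1, steps + 1))
    return min(grid, key=lambda t: exponent(t, delta))

delta = 0.4
t_star = log(1 / (1 - delta))
print(argmin_on_grid(delta), t_star)
```

The two printed values should agree to within the grid spacing.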

Substituting Equation (19) into Equation (14) yields Equation (3):

\[\Rightarrow P(X<(1-\delta) E[X]) \leq \frac{\exp\left(E[X]\left(e^{-\ln\left(1/(1-\delta)\right)}-1\right)\right)}{\exp\left(-\ln(1/(1-\delta))(1-\delta)E[X]\right)}\]

Looking only at the right-hand side of the above equation,

\[\text{(RHS)}\Rightarrow \frac{\exp(E[X](1-\delta -1))}{\exp(-(1-\delta)E[X]\ln\left(1/(1-\delta)\right))}\] \[=\frac{\exp(E[X](-\delta))}{\exp\left(\ln\left((1-\delta)^{(1-\delta) E[X]}\right)\right)}=\frac{\exp(-\delta E[X])}{(1-\delta)^{(1-\delta)E[X]}}\] \[=\left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{E[X]}\]

Meanwhile,

\[\ln(1-x)=-x-\frac{x^2}{2}-\frac{x^3}{3}-\cdots = -\sum_{n=1}^{\infty}\frac{x^n}{n}\]

Thus,

\[(1-\delta)\ln(1-\delta) = - (1-\delta)\delta - (1-\delta)\frac{\delta^2}{2}\cdots\notag\] \[=-\delta+\delta^2-\frac{\delta^2}{2}+\frac{\delta^3}{2}\cdots\notag\] \[=-\delta+\delta^2/2+\cdots\]

Therefore,

\[(1-\delta)\ln(1-\delta) \gt -\delta +\frac{\delta^2}{2}\]

holds, because every omitted term of the series is positive for $\delta\in(0,1)$. Exponentiating both sides gives

\[(1-\delta)^{(1-\delta)}\gt\exp\left(-\delta + \frac{\delta^2}{2}\right)\]
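The step from the series to the lower bound can be made explicit. As a sketch, multiplying the series for $\ln(1-\delta)$ by $(1-\delta)$ and collecting powers of $\delta$ gives

\[(1-\delta)\ln(1-\delta) = -\sum_{n=1}^{\infty}\frac{\delta^n}{n}+\sum_{n=1}^{\infty}\frac{\delta^{n+1}}{n} = -\delta+\sum_{n=2}^{\infty}\left(\frac{1}{n-1}-\frac{1}{n}\right)\delta^n = -\delta+\frac{\delta^2}{2}+\sum_{n=3}^{\infty}\frac{\delta^n}{n(n-1)}\notag\]

Every term of the last sum is positive for $\delta\in(0,1)$, which is why dropping that sum leaves a strict lower bound.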

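Substituting Equation (27) into Equation (23), we get:

\[\left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{E[X]}\lt \left(\frac{e^{-\delta}}{e^{(-\delta+\delta^2/2)}}\right)^{E[X]}\] \[\Rightarrow \left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{E[X]}\lt \left(e^{-\delta^2/2}\right)^{E[X]}\]

Therefore, combining this result with Equation (20) and Equation (23), we get: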

\[\Rightarrow P(X\lt (1-\delta)E[X])\lt \exp(-E[X]\delta^2/2)\]

(Proof completed)

Upper-Tail Chernoff Bound

The proof for the upper tail is very similar to the one for the lower tail, so it moves more quickly; for any step that is skipped, refer to the lower-tail proof. For a random variable $X$ defined as in Equation (1) and any $\delta \in (0, 1)$¹, the following holds:

\[P(X\gt(1+\delta)E[X]) \lt e^{-E[X]\cdot \delta^2/3}\]

Here, $e$ denotes Euler’s number.
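The theorem allows the $p_i$ to differ, so the upper-tail statement can be checked against the exact distribution of a sum of non-identical Bernoulli variables. The sketch below builds that distribution by dynamic programming; the particular list `probs` (200 probabilities spread evenly between 0.05 and 0.95) and the helper names are hypothetical choices for illustration.

```python
from math import exp

def poisson_binomial_pmf(probs):
    """pmf[k] = P(X = k) for X = sum of independent Bernoulli(p_i) variables."""
    pmf = [1.0]
    for p_i in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p_i)   # case X_i = 0
            nxt[k + 1] += mass * p_i     # case X_i = 1
        pmf = nxt
    return pmf

# Hypothetical, unequal success probabilities (illustration only).
probs = [0.05 + 0.9 * i / 199 for i in range(200)]
mean = sum(probs)
pmf = poisson_binomial_pmf(probs)

def upper_tail_exact(delta):
    """Exact P(X > (1 + delta) * E[X])."""
    return sum(mass for k, mass in enumerate(pmf) if k > (1 + delta) * mean)

def upper_tail_bound(delta):
    """Chernoff upper-tail bound exp(-E[X] * delta^2 / 3)."""
    return exp(-mean * delta**2 / 3)

for delta in (0.1, 0.3, 0.6, 0.9):
    print(f"delta={delta}: exact={upper_tail_exact(delta):.3e}, "
          f"bound={upper_tail_bound(delta):.3e}")
```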

(Proof)

As in the proof for the lower-tail part, the first step is to prove the following:

\[P(X\gt(1+\delta)E[X]) \lt \left(\frac{e^\delta}{(1+\delta)^{(1+\delta)}}\right)^{E[X]}\]

Introduce a parameter $t>0$ and rewrite the statement about $X$ as a statement about $e^{tX}$. Applying Markov's inequality to the left-hand side of Equation (32), as in the lower-tail proof, gives:

\[P(X\gt(1+\delta)E[X]) \leq \frac{E[e^{tX}]}{e^{t(1+\delta)E[X]}}\]

Using the definition of $X$ in Equation (1) and the independence of the $X_i$, the numerator on the right-hand side of Equation (33) can be written as:

\[E[e^{tX}]=E[e^{t\sum_i X_i}]=E\left[\prod_{i=1}^{N}e^{tX_i}\right]=\prod_{i=1}^{N}E[e^{tX_i}]\]

On the other hand, since $X_i$ is a Bernoulli trial, equation (9) holds, and the expected value of $e^{tX_i}$ is as follows.

\[E[e^{tX_i}]=(1-p_i)e^{t\cdot 0}+p_i e^{t\cdot 1}=1-p_i+p_ie^t\notag\] \[=1+p_i(e^t-1)=1+E[X_i](e^t-1)\]

Also, the last expression in Equation (35) coincides with the first two terms of the Taylor series of $\exp(E[X_i]\cdot(e^t-1))$. Recall that

\[\exp(x) = 1+\frac{x}{1!}+\frac{x^2}{2!}+\cdots\]

Therefore, because $E[X_i](e^t-1)\gt 0$ (for $p_i\gt 0$) makes every omitted term positive, $\exp(E[X_i]\cdot(e^t-1))$ satisfies

\[\exp(E[X_i]\cdot(e^t-1))=1+E[X_i](e^t-1)+\frac{1}{2!}(E[X_i](e^t-1))^2+\cdots \notag\] \[\gt 1+E[X_i](e^t-1)\]

Thus, the final result in Equation (34) can be bounded as follows.

\[\prod_{i=1}^{N}E[e^{tX_i}]=\prod_{i=1}^{N}(1+E[X_i](e^t-1))\lt\prod_{i=1}^{N}\exp(E[X_i](e^t-1))\]

Meanwhile,

\[\prod_{i=1}^{N}\exp(E[X_i](e^t-1))=\exp\left(E\left[\sum_{i=1}^{N}X_i\right](e^t-1)\right)=\exp(E[X](e^t-1))\]

If we substitute the above result into equation (33) from above, we get

\[P(X\gt(1+\delta)E[X])\leq \frac{\exp(E[X](e^t-1))}{e^{t(1+\delta)E[X]}}\]

As in the lower-tail case, differentiating the right-hand side of Equation (40) and finding the value $t=t^\ast$ at which the derivative is zero gives the tightest bound:

\[t^\ast=\ln(1+\delta)\]

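Substituting Equation (41) into Equation (40) gives us Equation (32).

\[P(X\gt(1+\delta)E[X]) \leq \frac{\exp(E[X](e^{\ln(1+\delta)}-1))}{\exp((1+\delta)E[X]\ln(1+\delta))}\] \[\Rightarrow P(X\gt(1+\delta)E[X]) \leq \frac{\exp(E[X]\delta)}{(1+\delta)^{(1+\delta)E[X]}}\] \[\Rightarrow P(X\gt(1+\delta)E[X]) \leq \left(\frac{e^\delta}{(1+\delta)^{(1+\delta)}}\right)^{E[X]}\]

To prove Equation (31) from Equation (44), we just need to verify that the following inequality holds.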

\[\frac{e^\delta}{(1+\delta)^{(1+\delta)}} \lt e^{-\delta^2/3}\]

Taking the logarithm of both sides of Equation (45) and moving every term to the left-hand side, we obtain the following inequality.

\[f(\delta) = \delta - (1+\delta)\ln (1+\delta) + \frac{\delta^2}{3}< 0\]

The first and second derivatives of $f(\delta)$ are

\[f'(\delta) = 1-\frac{1+\delta}{1+\delta}-\ln(1+\delta)+\frac{2}{3}\delta = -\ln(1+\delta)+\frac{2}{3}\delta\] \[f''(\delta) = - \frac{1}{1+\delta}+\frac{2}{3}\]

From the second derivative, we can see that

\[\begin{cases}f''(\delta)\lt 0\text{ for } 0\leq \delta \lt 1/2 \\ f''(\delta) > 0 \text{ for } \delta >1/2\end{cases}\]

In other words, $f'(\delta)$ first decreases and then increases on the interval $(0,1)$. Also, from the expression for the first derivative, $f'(0)=0$ and $f'(1)=\frac{2}{3}-\ln 2\lt 0$. Therefore, $f'(\delta)\lt 0$ on the interval $(0,1)$. Finally, since $f(0)=0$ and $f$ is decreasing there, $f(\delta)$ is always negative on the interval $(0,1)$.

Therefore, we can see that equation (46) is true and so is equation (31).
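The sign argument can also be spot-checked numerically. The sketch below evaluates $f$ and $f'$ on a grid of points strictly inside $(0,1)$; the grid spacing is an arbitrary choice, and a numerical check like this supports the calculus argument rather than replacing it.

```python
from math import log

def f(delta):
    """f(delta) = delta - (1 + delta) ln(1 + delta) + delta^2 / 3."""
    return delta - (1 + delta) * log(1 + delta) + delta**2 / 3

def f_prime(delta):
    """f'(delta) = -ln(1 + delta) + 2 delta / 3."""
    return -log(1 + delta) + 2 * delta / 3

grid = [i / 1000 for i in range(1, 1000)]  # points strictly inside (0, 1)
print(max(f(d) for d in grid), max(f_prime(d) for d in grid))
```

Both printed maxima should be negative, consistent with $f(\delta)\lt 0$ and $f'(\delta)\lt 0$ on the interval.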

(End of proof)

Reference

Outlier Analysis (2nd ed.), Charu C. Aggarwal, Springer
Probability and Computing (2nd ed.), Michael Mitzenmacher and Eli Upfal, Cambridge University Press

The book Outlier Analysis (Charu C. Aggarwal) presents a Chernoff bound that holds on the range $(0, 2e-1)$, but I have not yet worked out how to prove it. Instead, this post presents the Chernoff bound for the range $(0, 1)$ given in other textbooks. ↩

마르코프 부등식과 체비셰프 부등식

2022-09-12T00:00:00+00:00

마르코프 부등식 (Markov Inequality)

마르코프 부등식은 음수가 아닌 랜덤 변수에 대해 성립하는 부등식이다. 마르코프 부등식의 정의부터 보면 다음과 같다.

$X$가 음수가 아닌 값을 가지는 랜덤 변수라고 했을때, $\alpha\gt 0$¹를 만족하는 임의의 상수 $\alpha$에 다음이 성립한다.

\[P(X\geq \alpha) \leq E[X]/\alpha\]

위 식의 의미를 간단히 살펴보기 위해 아래의 그림을 보도록 하자.

그림 1. 마르코프 부등식이 의미하는 것은 전체 데이터 분포에서 기댓값을 기준으로 랜덤변수 $x$가 어떤 극값 $\alpha$ 보다 클 확률에 관한 것이다.

그림 1을 보면 어떤 랜덤 변수 $X$에 대한 pdf가 그려져 있는 것을 알 수 있다. 여기서 $X$는 음수가 아닌 값이라는 점을 강조하고자 가로축 왼쪽에 0을 표시했다. 또, 기댓값은 임의로 분포 중앙으로 설정하였으며 $P(X\geq \alpha)$는 파란색 영역으로 표시되어 있다.

정리하자면 마르코프 부등식은 음수가 아닌 랜덤 변수에 대해 기댓값을 기준으로 $X$가 극값 $\alpha$ 보다 클 확률에 관한 정리(theorem)라고 할 수 있다.

증명은 아주 간단하다. $X$에 관한 pdf가 $f_X(x)$라고 하면 이의 기댓값은 다음과 같다.

\[E[X]=\int_x x f_X(x)dx\]

위 적분에 대한 적분 범위를 $\alpha$를 기준으로 나눠 다음과 같이 쓸 수 있다.

\[\Rightarrow \int_{0\leq x \lt \alpha}xf_X(x)dx + \int_{X\geq \alpha}xf_X(x)dx\]

Here, $x$ is never negative, so the first term of the above expression is non-negative. Therefore, the following relation holds.

\[\Rightarrow \int_{0\leq x \lt \alpha}xf_X(x)dx + \int_{X\geq \alpha}xf_X(x)dx \geq \int_{X\geq \alpha}xf_X(x)dx\]

또한, 위 식의 우변의 값에서 $X\geq \alpha$이므로 다음이 성립한다.

\[\Rightarrow \int_{X\geq \alpha}xf_X(x)dx\geq \int_{X\geq \alpha}\alpha f_X(x)dx % 식 (5)\]

따라서, 결론적으로 식 (2)와 식 (5)만을 가져오면,

\[\Rightarrow E[X]\geq \int_{x\geq \alpha} \alpha f_X(x)dx % 식 (6)\]

이며, pdf와 확률의 정의에 따라 위 식은 다음과 같이 쓸 수도 있다.

\[\text{Equation (6)}\Rightarrow E[X]\geq \alpha P(X\geq \alpha) % Equation (7)\]

여기서 위 식을 정리하면 식 (1)을 얻을 수 있다.

체비셰프 부등식 (Chebyshev Inequality)

체비셰프 부등식 유도

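Chebyshev's inequality states that for any random variable $X$ with finite variance and any constant $\alpha\gt 0$, the following holds.

\[P(\left|X-E[X]\right|\geq\alpha)\leq Var[X]/\alpha^2 % Equation (8)\]

Unlike Markov's inequality, Chebyshev's inequality does not require $X$ to be non-negative, and, as the absolute value shows, it is a two-sided inequality about the extreme values $E[X]\pm\alpha$.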

그림 2. 체비셰프 부등식이 의미하는 것은 전체 데이터 분포에서 기댓값을 기준으로 랜덤변수 $x$가 어떤 양쪽 극값 $E[X]\pm\alpha$ 보다 크거나 작은 확률에 관한 것이다.

The proof is simpler than one might expect and relies on Markov's inequality above. The event $|X-E[X]|\geq\alpha$ in Equation (8) can be rewritten as follows.

\[|X-E[X]|\geq\alpha \Leftrightarrow (X-E[X])^2\geq \alpha^2\]

이에 아래와 같은 새로운 랜덤 변수 $Y$를 정의하자.

\[Y=(X-E[X])^2\]

그러면 분산의 정의 상 $E[Y]=Var[X]$이다. 따라서, $Y$와 $\alpha^2$에 대한 마르코프 부등식을 확인하면 다음과 같다.

\[P(Y\geq\alpha^2)\leq E[Y]/\alpha^2\]

이를 $X$에 대한 식으로 다시 바꾸면,

\[P((X-E[X])^2\geq\alpha^2)\leq Var[X]/\alpha^2 % 식 (12)\]

가 되고 식 (9)에 의해 식 (8)을 얻을 수 있게 된다.

체비셰프 부등식의 의의

식 (8)에서 $\alpha$ 대신에 $k\sigma$를 대입해보자. 여기서 $\sigma$는 표준편차를 의미한다. 그러면 아래와 같은 식으로 변형할 수 있다.

\[\text{Equation (8)} \Rightarrow P(\left|X-E[X]\right|\geq k\sigma)\leq Var[X]/{k^2\sigma^2} % Equation (13)\] \[\Rightarrow P(\left|X-E[X]\right|\geq k\sigma)\leq \frac{1}{k^2} % Equation (14)\]

우리가 식 (14)를 통해서 얻을 수 있는 인사이트는 무엇일까? 바로, 임의의 확률 분포에 대해 평균과 분산을 통해 얻을 수 있는 정보에 관한 것이라고 할 수 있다. 다시 말하면, 어떤 분포라도 거의 대부분의 데이터들은 평균에 가깝게 붙어있다는 의미이며, 표준 편차를 기준으로 얼마만큼 떨어져있는지를 알려주는 것이다.

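For example, when $k=2$, Equation (14) says that at most 1/4, that is 25%, of the distribution of the random variable $X$ lies $2$ or more standard deviations away from the expected value $E[X]$. In other words, at least 75% of the data lies within $\pm$ 2 standard deviations of the mean. (For comparison, about 95% of the data falls within $\pm$ 2 standard deviations of the mean for a normal distribution. Note that Chebyshev's inequality is a “loose” bound precisely because it holds for a distribution of any shape.)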

일반적으로, 체비셰프 부등식이 말해주는 것은 데이터들 중 평균값으로부터 $k$ 표준편차 이상 떨어진 것들은 $1/k^2$ 이상 차지하지 않는다는 것을 의미한다.

아마, 이런 이유로 통계학자들은 확률 분포에 관한 가장 유용한 대푯값들로 평균 & 표준편차를 사용하는 것일 수 있겠다. 평균과 표준편차만 제시되어 있다면 데이터 전체의 분포를 확인하지 않고도

“대략적으로 평균 $\pm$ 2 표준편차 안에 대부분의 데이터가 들어오긴 하겠군.”

하고 생각할 수 있기 때문이다.

Reference

Outlier Analysis (2nd ed.), Charu C. Aggarwal, Springer
The Power of Statistics that Rules Big Data (Practical Use Edition), Nishiuchi Hiromu, Vision Korea

The reference book Outlier Analysis by Charu C. Aggarwal states the condition as $\alpha\gt E[X]$, but this does not appear to be the general condition. Wikipedia also gives $\alpha \gt 0$. ↩

Markov Inequality and Chebyshev Inequality

2022-09-12T00:00:00+00:00

Markov’s Inequality

Markov's inequality holds for non-negative random variables. It is stated as follows:

Let $X$ be a non-negative random variable and let $\alpha$ be any constant satisfying $\alpha\gt 0$¹. Then the following inequality holds:

\[P(X\geq \alpha) \leq E[X]/\alpha\]

To understand the meaning of the above equation, let’s look at the figure below.

Figure 1. Markov's inequality concerns the probability that a random variable $X$ is at least some extreme value $\alpha$, expressed relative to the expected value of the distribution.

Figure 1 shows the pdf of a random variable $X$. To emphasize that $X$ is non-negative, 0 is marked at the left end of the horizontal axis. The expected value is placed, arbitrarily, at the middle of the distribution, and $P(X\geq \alpha)$ is shown as the blue region.

In summary, Markov’s inequality can be regarded as a theorem that describes the probability that a non-negative random variable $X$ is greater than some extreme value $\alpha$ with respect to the expected value.

The proof is very simple. If the pdf of $X$ is denoted by $f_X(x)$, then its expected value is as follows:

\[E[X]=\int_x x f_X(x)dx\]

The integral range can be divided into two parts with $\alpha$ as the boundary as follows:

\[\Rightarrow \int_{0\leq x \lt \alpha}xf_X(x)dx + \int_{X\geq \alpha}xf_X(x)dx\]

Here, $x$ is never negative, so the first term of the above expression is non-negative. Therefore, the following relation holds:

\[\Rightarrow \int_{0\leq x \lt \alpha}xf_X(x)dx + \int_{X\geq \alpha}xf_X(x)dx \geq \int_{X\geq \alpha}xf_X(x)dx\]

Furthermore, since $x\geq \alpha$ throughout the range of integration on the right-hand side, the following holds:

\[\Rightarrow \int_{X\geq \alpha}xf_X(x)dx\geq \int_{X\geq \alpha}\alpha f_X(x)dx % Equation (5)\]

Therefore, if we only take Equation (2) and Equation (5), we have:

\[\Rightarrow E[X]\geq \int_{x\geq \alpha} \alpha f_X(x)dx % Equation (6)\]

According to the definition of pdf and probability, the above equation can also be written as follows.

\[\text{Equation (6)}\Rightarrow E[X]\geq \alpha P(X\geq \alpha) % Equation (7)\]

Dividing both sides of the above inequality by $\alpha$ gives Equation (1).
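As a quick sanity check of Markov's inequality, the bound can be compared with a distribution whose tail is known in closed form. The sketch below assumes an exponential distribution, for which $P(X\geq\alpha)=e^{-\alpha/E[X]}$; the choice of distribution and the value `mean = 2.0` are illustrative only.

```python
from math import exp

def exponential_tail(alpha, mean):
    """Exact P(X >= alpha) for an exponential random variable with the given mean."""
    return exp(-alpha / mean)

def markov_bound(alpha, mean):
    """Markov's bound E[X] / alpha (it exceeds 1, and says nothing, when alpha < E[X])."""
    return mean / alpha

mean = 2.0  # hypothetical expected value
for alpha in (0.5, 2.0, 5.0, 20.0):
    print(f"alpha={alpha}: exact={exponential_tail(alpha, mean):.4f}, "
          f"bound={markov_bound(alpha, mean):.4f}")
```

The bound holds for every $\alpha>0$, but it only becomes informative once $\alpha$ exceeds the expected value.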

Chebyshev Inequality

Derivation of Chebyshev Inequality

Chebyshev's inequality states that for any random variable $X$ with finite variance and any constant $\alpha\gt 0$, the following holds:

\[P(\left|X-E[X]\right|\geq\alpha)\leq Var[X]/\alpha^2 % Equation (8)\]

Unlike Markov's inequality, Chebyshev's inequality does not require $X$ to be non-negative, and, as the absolute value shows, it is a two-sided inequality about the extreme values $E[X]\pm\alpha$.

Figure 2. Chebyshev's inequality concerns the probability that a random variable $X$ falls above $E[X]+\alpha$ or below $E[X]-\alpha$, the two extreme values on either side of the expected value.

The proof is simpler than one might expect and relies on Markov's inequality above. The event $|X-E[X]|\geq\alpha$ in Equation (8) can be rewritten as follows:

\[|X-E[X]|\geq\alpha \Leftrightarrow (X-E[X])^2\geq \alpha^2\]

Let us define a new random variable $Y$ as follows:

\[Y=(X-E[X])^2\]

Then, by the definition of variance, $E[Y]=Var[X]$. Since $Y$ is non-negative, applying Markov's inequality to $Y$ and $\alpha^2$ yields:

\[P(Y\geq\alpha^2)\leq E[Y]/\alpha^2\]

Substituting back to $X$, we obtain:

\[P((X-E[X])^2\geq\alpha^2)\leq Var[X]/\alpha^2 % Equation (12)\]

Using Equation (9), we can obtain Equation (8).

Significance of Chebyshev Inequality

Let us substitute $k\sigma$ for $\alpha$ in Equation (8), where $\sigma$ denotes the standard deviation. Then we can transform Equation (8) as follows:

\[\text{Equation (8)} \Rightarrow P(\left|X-E[X]\right|\geq k\sigma)\leq Var[X]/{k^2\sigma^2} % Equation (13)\] \[\Rightarrow P(\left|X-E[X]\right|\geq k\sigma)\leq \frac{1}{k^2} % Equation (14)\]

What insight can we gain from Equation (14)? It tells us what the mean and variance alone reveal about an arbitrary probability distribution. In other words, for any distribution with finite variance, most of the data lies close to the mean, and the inequality bounds how much of it can lie far away, measured in units of the standard deviation.

For example, when $k=2$, Equation (14) says that at most 1/4, that is 25%, of the distribution of the random variable $X$ lies $2$ or more standard deviations away from the expected value $E[X]$. In other words, at least 75% of the data lies within $\pm$ 2 standard deviations of the mean. (For comparison, about 95% of the data falls within $\pm$ 2 standard deviations of the mean for a normal distribution. Note that Chebyshev's inequality is a “loose” bound precisely because it holds for a distribution of any shape.)
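The comparison with the normal distribution can be made concrete. For a normal random variable, the two-sided tail is $P(|X-E[X]|\geq k\sigma)=\operatorname{erfc}(k/\sqrt{2})$, so the sketch below prints that exact value next to the Chebyshev bound $1/k^2$. The function names and the list of $k$ values are illustrative choices.

```python
from math import erfc, sqrt

def normal_two_sided_tail(k):
    """Exact P(|X - E[X]| >= k * sigma) when X is normally distributed."""
    return erfc(k / sqrt(2))

def chebyshev_bound(k):
    """Chebyshev's distribution-free bound 1 / k^2."""
    return 1 / k**2

for k in (1.5, 2, 3, 4):
    print(f"k={k}: normal tail={normal_two_sided_tail(k):.5f}, "
          f"Chebyshev bound={chebyshev_bound(k):.5f}")
```

The gap between the two columns is the price paid for a bound that assumes nothing about the shape of the distribution.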

Generally speaking, Chebyshev’s inequality implies that no more than $1/k^2$ of the data falls beyond $k$ standard deviations from the mean.

Perhaps for this reason, statisticians may use the mean and standard deviation as the most useful representative values for probability distributions. By providing only the mean and standard deviation, it is possible to roughly estimate that:

“Most of the data falls within the range of the mean $\pm$ 2 standard deviations.”

without examining the overall distribution of the data.

Reference

Outlier Analysis (2nd ed.), Charu C. Aggarwal, Springer
The Power of Statistics that Rules Big Data (Practical Use Edition), Nishiuchi Hiromu, Vision Korea

The reference book Outlier Analysis by Charu C. Aggarwal states the condition as $\alpha\gt E[X]$, but this does not appear to be the general condition. Wikipedia also gives $\alpha \gt 0$. ↩

연속 신호의 샘플링

2022-01-14T00:00:00+00:00

※ 섀넌-나이퀴스트의 샘플링 이론의 증명은 이 포스팅을 확인하세요.

샘플링 전 연속 신호(흰색)와 샘플링하여 복원한 신호(파란색)의 차이 비교

연속 신호, 이산 신호, 디지털 신호의 관계

요즘에는 디지털 기기가 보편화 되었다. 카세트 테이프 보다는 MP3 플레이어를 사용하게 되었고, 종이책과 e-book이 공존하며, 아날로그 TV 방송이 모두 디지털 방송으로 전환되었다.

일상 생활에서 디지털 기기는 ‘편의성을 고려했다’ 혹은 ‘최신 기술이 적용되었다’는 이미지가 많이 떠오른다. 그만큼 연구가 많이 이루어졌고, 실생활과 맞닿아 있는 유용한 기술이라 할 수 있다.

디지털 신호 처리 과목에서는 이러한 디지털 신호를 분석하는데 필요한 기술과 이론들을 다루게 된다.

그렇다면, 디지털 신호는 아날로그 신호와 어떻게 다른 것일까?

아래 그림에서 볼 수 있듯이 디지털 신호는 아날로그 신호를 디지털 변환한 것이다.

이런 변환기를 Analog-to-Digital Converter라고 부른다. 반대로 Digital에서 Analog로 복원하는 과정도 있다. 이를 처리하는 변환기를 Digital-to-Analog Converter라고 부른다.

그림 1. 아날로그 신호의 디지털 처리 시스템

디지털 신호를 잘 살펴보면 시간 간격이 일정하게 신호를 받아온 것을 알 수 있다. 이렇게 시간 간격을 두고 신호를 저장하는 이유는 디지털 기기의 메모리는 유한하기 때문이다.

여기서 일정 시간 간격으로 연속 신호를 저장하는 과정을 시간 샘플링(time sampling)이라고 부른다.

대부분의 아날로그 신호들은 실수 함수(real function)인 경우가 많은데, 실수 체는 무한하다. 컴퓨터는 이를 받아들일 수가 없어 유한한 갯수의 함수값만을 받아오게 된다.

가로축이 샘플링되어 있는 것은 눈으로 쉽게 보이지만 세로축에 대해서는 신경을 쓰지 않는 경우도 종종 있다.

세로축도 일종의 샘플링된 것 처럼 이산적인 값만 가지도록 변환된다. 이를 양자화(quantization)라고 한다.

양자화 이론은 간단해 보이지만, 생각보다 이론은 복잡하고 하드웨어로 구성하기 위한 아이디어도 꽤 복잡하다. 이 블로그에서는 양자화에 대해서는 깊게 다루지 않을 것이다.

아무튼 시간 샘플링과 양자화가 모두 수행된 신호를 비로소 ‘디지털 신호’라고 부른다.

추가로 보통 시간 샘플링만 수행된 신호를 ‘이산 신호’라고 많이 부르고 양자화 여부에 따라 디지털 신호와 구분해 부르기도 한다.

시간 샘플링의 부수 효과(side effects)

어떤 신호든지 서로 다른 주파수를 가지는 정현파의 선형 결합으로 표현할 수 있기 때문에 우리는 정현파에 대해 생각한다.

그리고, 정현파는 원 위의 움직임으로부터 나오는 신호이기 때문에 주기를 가지며, 이 주기성으로 인해 샘플링에서 몇 가지 고려해야할 점들이 생긴다.

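Identical discrete sinusoids from continuous sinusoids of different frequencies

Consider an arbitrary sinusoid $x(t)$.

\[x(t) = A\cos(\omega_0 t)\]

Sampling this signal with period $T_s$ gives the following discrete signal.

\[x[n]=x(nT_s) = A\cos(\omega_0 nT_s) = A \cos(\Omega_0 n)\]

Here, $\Omega_0=\omega_0 T_s \text{[rad]}$ is the angular frequency of the discrete sinusoid. It differs from the angular frequency $\omega_0 \text{[rad/sec]}$ of the continuous sinusoid.

(For reference, angular frequency is the frequency multiplied by $2\pi$. For example, a sinusoid obtained from a circle that completes one rotation per second has an angular frequency of $2\pi$.)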

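Note that the unit of $\Omega_0$ is radians, while the unit of $\omega_0$ is radians per second. In other words, the time information disappears in $\Omega_0$.

As a result, whether $\omega_0$ is large and $T_s$ is small, or $\omega_0$ is small and $T_s$ is large, a suitable combination can produce the same discrete signal even though the continuous signals have different frequencies.

Figure 2. The same discrete signal can be obtained even with different frequencies and sampling periods.

In particular, sampling a sinusoid of frequency $f_0+ k f_s$ (where $k$ is an integer) at the sampling frequency $f_s$ gives the same result as sampling a sinusoid of frequency $f_0$, because $f_s T_s = 1$.

\[\cos(2\pi(f_0+kf_s)nT_s)=\cos(2\pi f_0nT_s+2\pi knf_s T_s)\] \[=\cos(2\pi f_0 nT_s + 2\pi k n) = \cos(2\pi f_0 nT_s)\]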
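The identity above can be checked by sampling two sinusoids and comparing the samples directly. In this minimal sketch the sampling frequency `fs = 10.0` Hz, the base frequency `f0 = 1.0` Hz, and the helper names are hypothetical values chosen for illustration.

```python
from math import cos, isclose, pi

def sample_cosine(freq_hz, fs_hz, num_samples):
    """Return x[n] = cos(2*pi*freq*n*T_s) for n = 0..num_samples-1, with T_s = 1/fs."""
    ts = 1.0 / fs_hz
    return [cos(2 * pi * freq_hz * n * ts) for n in range(num_samples)]

fs = 10.0  # hypothetical sampling frequency f_s [Hz]
f0 = 1.0   # hypothetical base frequency f_0 [Hz]

def is_alias(k, num_samples=50):
    """True if f0 + k*fs yields the same samples as f0 (up to rounding error)."""
    base = sample_cosine(f0, fs, num_samples)
    shifted = sample_cosine(f0 + k * fs, fs, num_samples)
    return all(isclose(a, b, abs_tol=1e-9) for a, b in zip(base, shifted))

for k in (1, 2, -1):
    print(f"f = {f0 + k * fs:+.1f} Hz gives the same samples as f0: {is_alias(k)}")
```

A non-integer shift such as `f0 + 0.5 * fs` does not reproduce the same samples, which can be confirmed by calling `is_alias(0.5)`.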

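The problem with restoring a discrete sinusoid to a continuous one: aliasing

Turning the above problem around, restoring an arbitrary discrete sinusoid to a continuous signal does not necessarily recover the original signal.

Because sampling sinusoids of different frequencies yields a discrete signal with the same frequency $f_0$, the frequency $f_0 + k f_s$ is called an alias of the frequency $f_0$ with respect to the sampling frequency $f_s$,

and the phenomenon in which the sampling process makes it impossible to tell what the original signal was is called aliasing¹.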

그림 3. 에일리어싱 현상

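To prevent aliasing, the signal must be sampled at a sufficiently high frequency.

As the applet at the top of this post also shows, sampling faster than a certain rate lets the discrete signal be restored to a continuous signal close to the original.

Figure 4. To prevent aliasing, the signal must be sampled at a sufficiently high frequency.

The theory that answers the question ‘how fast must we sample?’ mathematically is the Shannon-Nyquist sampling theorem. Understanding it requires Fourier series and transforms, so it will be covered in more detail later. Stated without proof, its conclusion is that a signal can be restored to the original if it is sampled at a frequency greater than twice the highest frequency contained in the signal.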

이산 신호의 주파수 특성

이산 신호의 가로축을 잘 보면 순번만 표시되어 있는 것을 볼 수 있다.

그리고, 순번의 간격은 항상 1이기 때문에 이 신호를 표현할 수 있는 최소 주기는 1이다.

다시 말하면 최대 주파수는 1, 최소 주파수는 0이 될 것이다.

보통은 음의 주파수까지도 포함해서 주파수를 표현하므로 이산 신호의 주파수 구간은

\[-0.5\lt F \lt 0.5\]

혹은

\[-\pi \lt \Omega \lt \pi\]

와 같다.

또 한편, 이산 정현파만의 주파수 특성에 대해서도 생각해볼 수 있다.

어떤 신호든지 서로 다른 주파수를 갖는 정현파의 선형결합으로 표현할 수 있다. 이 내용은 연속 신호에만 해당하는 것은 아니고 이산 신호의 경우에도 마찬가지로 적용할 수 있다. 따라서, 이산 신호를 분석할 때도 당연히 정현파를 이용하는 것이 유용한 점이 많다.

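A discrete sinusoid differs slightly from a continuous one: it is not always a periodic signal. In other words, depending on the sampling period, a discrete sinusoid may or may not be periodic.

Consider an arbitrary discrete sinusoid $x[n]$ as follows.

\[x[n]=A\cos(\Omega_0 n)\]

If $x[n]$ is a periodic signal with period $N$, the following must hold (where $N$ is an integer).

\[x[n]=x[N+n]\] \[\Rightarrow A\cos(\Omega_0 n) = A\cos(\Omega_0 (n+N))\]

Therefore, the digital angular frequency $\Omega_0$ must satisfy

\[\Omega_0 N = 2\pi k \Rightarrow \Omega_0 = \frac{2\pi k}{N}\]

(where $k$ is an integer), or equivalently the digital frequency $F_0 = \Omega_0/2\pi$ must be a rational number satisfying

\[\frac{\Omega_0}{2\pi}=F_0=\frac{k}{N}\]

Figure 5. A discrete signal is periodic only when its digital frequency is a rational number.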
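The rationality condition also tells us the period: writing $F_0=k/N$ in lowest terms, the smallest period is the denominator $N$. The sketch below uses Python's `fractions.Fraction` for this; the value $F_0 = 3/8$ and the helper name `fundamental_period` are hypothetical choices for illustration.

```python
from fractions import Fraction
from math import cos, isclose, pi

def fundamental_period(digital_freq):
    """Smallest N with F0 = k/N in lowest terms, for a non-zero rational F0."""
    return Fraction(digital_freq).denominator

F0 = Fraction(3, 8)  # hypothetical digital frequency F_0 = k/N
N = fundamental_period(F0)

# Generate three periods of x[n] = cos(2*pi*F0*n) and compare x[n] with x[n + N].
x = [cos(2 * pi * float(F0) * n) for n in range(3 * N)]
repeats = all(isclose(x[n], x[n + N], abs_tol=1e-9) for n in range(2 * N))
print(N, repeats)
```

An irrational digital frequency, such as $F_0 = 1/\sqrt{2}$, has no such denominator, so the corresponding discrete sinusoid never repeats exactly.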

이 꼭지의 내용을 요약해서 정리하면 이산 정현파 신호는 $-\pi$에서 $\pi$ 내에 모든 주파수 성분을 다 표시할 수 있는데 $2\pi$ 만큼의 주기성을 동시에 갖는다.

따라서, 이산 신호는 디지털 각주파수 $-\pi$에서 $\pi$ 사이의 주파수 스펙트럼이 $2\pi$의 주기를 갖고 복사되어 있는 주기성을 띈다.

아래 그림은 주기 신호의 스펙트럼과 그것을 시간 샘플링하여 이산화 했을 때의 결과물을 비교한 것이다.

그림 6. 이산 주기 신호의 주파수 스펙트럼은 원래 연속 주기 신호의 복사물이 $2\pi$ 간격으로 표시되게 된다.

또, 아래 그림은 비주기 신호의 스펙트럼과 그것을 시간 샘플링하여 이산화 했을 때의 결과물을 비교한 것이다.

그림 7. 이산 비주기 신호의 주파수 스펙트럼은 원래 연속 신호의 복사물이 $2\pi$ 간격으로 표시되게 된다.

참고 문헌

Hello! 신호 처리, James H. McClellan 등, 홍릉과학출판사
디지털 신호 처리, 이철희, 한빛아카데미

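The word aliasing comes from alias, which means ‘a false name used to hide one's true identity’. In this sense, the term appears to have been adopted in signal processing to describe the case where the result restored as a continuous signal differs from the original signal. ↩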

Sampling Continuous Signal to Discrete Signal

2022-01-14T00:00:00+00:00

※ Please check this post for the proof of Shannon-Nyquist sampling theory.

Comparison of a continuous signal (white) with the signal reconstructed from its samples (blue)

Relationship between continuous signals, discrete signals, and digital signals

These days, digital devices are everywhere. We listen to MP3 players instead of cassette tapes, e-books sit alongside paper books, and all analog TV broadcasts have been converted to digital broadcasts.

In daily life, digital devices tend to carry an image of convenience or of the latest technology. Digital technology has been studied extensively, and it is closely tied to everyday life.

The Digital Signal Processing course covers the techniques and theories needed to analyze these digital signals.

Then, how is a digital signal different from an analog signal?

As the figure below shows, a digital signal is obtained by converting an analog signal into digital form.

The device that performs this conversion is called an Analog-to-Digital Converter (ADC). The reverse process, restoring an analog signal from a digital one, is performed by a Digital-to-Analog Converter (DAC).

Figure 1. Digital processing system of an analog signal

If you look closely at a digital signal, you can see that its values are recorded at constant time intervals. The signal is stored only at fixed intervals because the memory of a digital device is finite.

The process of recording a continuous signal at fixed time intervals is called time sampling.

Most analog signals are real-valued functions of time, and any time interval contains infinitely many real numbers. A computer cannot store infinitely many values, so it keeps only a finite number of function values.

Sampling along the horizontal (time) axis is easy to see, so it is easy to overlook that the vertical axis is also converted to discrete values, through quantization.

Quantization may look simple, but the theory is in fact intricate, and the ideas behind its hardware implementation are complicated as well. This blog will not go deeply into quantization.

In any case, a signal that has undergone both time sampling and quantization is called a “digital signal”.

A signal that has undergone only time sampling is usually called a “discrete signal”, and is sometimes distinguished from a digital signal by the absence of quantization.
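The distinction between a discrete signal and a digital signal can be sketched in a few lines. The post does not specify a quantizer, so the uniform rounding used here, along with the function names and the numeric values, is an assumption made only for illustration.

```python
import math

def sample(f0_hz, fs_hz, num_samples, amplitude=1.0):
    """Time sampling: evaluate A*cos(2*pi*f0*t) at the instants t = n / fs."""
    ts = 1.0 / fs_hz
    return [amplitude * math.cos(2 * math.pi * f0_hz * n * ts)
            for n in range(num_samples)]

def quantize(values, step):
    """Uniform quantization (an assumed scheme): snap to the nearest multiple of step."""
    return [step * round(v / step) for v in values]

# Discrete signal: only the time axis has been sampled.
discrete = sample(f0_hz=1.0, fs_hz=8.0, num_samples=8)

# Digital signal: the amplitude axis has been quantized as well.
digital = quantize(discrete, step=0.25)

for n, (x, q) in enumerate(zip(discrete, digital)):
    print(n, round(x, 4), q)
```

With this rounding scheme each quantized value differs from the sampled value by at most half a step.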

Side effects of time sampling

Since any signal can be expressed as a linear combination of sinusoids with different frequencies, it is enough to consider what happens to a single sinusoid.

A sinusoid arises from motion around a circle, so it is periodic, and this periodicity is what causes complications in sampling.

A discrete sinusoidal wave coming from continuous sinusoidal waves with different frequencies

Let’s consider an arbitrary sinusoidal wave $x(t)$.

\[x(t) = A\cos(\omega_0 t)\]

If we sample this signal with a period of $T_s$, we get the following discrete signal:

\[x[n]=x(nT_s) = A\cos(\omega_0 nT_s) = A \cos(\Omega_0 n)\]

Here, $\Omega_0=\omega_0 T_s \text{[rad]}$ is the angular frequency of the discrete sinusoidal wave signal. This differs from the angular frequency $\omega_0 \text{[rad/sec]}$ of the continuous sinusoidal wave signal.

(Incidentally, the angular frequency is the frequency multiplied by $2\pi$. For example, a sinusoid generated by a point going around a circle once per second has an angular frequency of $2\pi$ rad/sec.)

Note that $\Omega_0$ has a unit of radians and $\omega_0$ has a unit of radians per second. In other words, time information disappears in $\Omega_0$.

Therefore, a large $\omega_0$ paired with a small $T_s$ and a small $\omega_0$ paired with a large $T_s$ can produce exactly the same discrete signal, even though the continuous signals have different frequencies. All that matters is the product $\omega_0 T_s$.

Figure 2. Signals with different frequencies and sampling periods can still yield the same discrete signal.
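A small numerical check makes this concrete. The two frequency and sampling-period pairs below are illustrative values, chosen so that the product $\omega_0 T_s$ is $0.2\pi$ in both cases.

```python
import math

def sample_cosine(omega0, ts, num_samples):
    """Return x[n] = cos(omega0 * n * ts) for n = 0 .. num_samples - 1."""
    return [math.cos(omega0 * n * ts) for n in range(num_samples)]

# A fast sinusoid (10 Hz) sampled with a short period (0.01 s).
fast = sample_cosine(omega0=2 * math.pi * 10.0, ts=0.01, num_samples=16)

# A slow sinusoid (1 Hz) sampled with a long period (0.1 s).
slow = sample_cosine(omega0=2 * math.pi * 1.0, ts=0.1, num_samples=16)

# Both have Omega_0 = omega0 * ts = 0.2 * pi, so the samples coincide.
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(fast, slow))
print(same)
```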

When sampling a sinusoidal wave with frequency $f_0+kf_s$ (where $k$ is an integer) at a sampling frequency of $f_s$, the result is the same as sampling a sinusoidal wave with frequency $f_0$.

\[\cos(2\pi(f_0+kf_s)nT_s)=\cos(2\pi f_0nT_s+2\pi knf_s T_s)\] \[=\cos(2\pi f_0 nT_s + 2\pi kn) = \cos(2\pi f_0 nT_s)\]
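The identity can also be checked numerically. The sketch below uses assumed values $f_0 = 1$ Hz and $f_s = 8$ Hz, and compares the samples of $f_0$ with those of $f_0 + kf_s$ for a few integers $k$.

```python
import math

def sample_cosine_hz(f_hz, fs_hz, num_samples):
    """Sample cos(2*pi*f*t) at t = n / fs for n = 0 .. num_samples - 1."""
    ts = 1.0 / fs_hz
    return [math.cos(2 * math.pi * f_hz * n * ts) for n in range(num_samples)]

fs = 8.0   # sampling frequency (illustrative value)
f0 = 1.0   # base frequency (illustrative value)
base = sample_cosine_hz(f0, fs, 32)

# For each integer k, check whether f0 + k*fs gives the same samples as f0.
aliases_match = {
    k: all(math.isclose(a, b, abs_tol=1e-9)
           for a, b in zip(base, sample_cosine_hz(f0 + k * fs, fs, 32)))
    for k in (-2, -1, 1, 2, 3)
}
print(aliases_match)
```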

The problem of aliasing when restoring discrete sinusoidal waves to continuous sinusoidal waves

Viewed in reverse, this means that an arbitrary discrete sinusoid cannot always be restored to the continuous signal it came from.

Sinusoids with the different frequencies $f_0+kf_s$ all produce the same discrete signal as the sinusoid with frequency $f_0$. These frequencies $f_0+kf_s$ are called aliases of $f_0$ with respect to the sampling frequency $f_s$.

Aliasing is the phenomenon in which the original signal can no longer be identified after sampling, so that a signal different from the original is restored¹.

Figure 3. Aliasing phenomenon

To prevent the phenomenon of aliasing, it is necessary to sample at a sufficiently high frequency.

As the applet at the top of this post shows, if we sample fast enough we can reconstruct a continuous signal from the discrete signal that is close to the original.

Figure 4. To prevent aliasing, we need to sample at a sufficiently high frequency.
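One way to see what "sufficiently high" means is to compute the frequency a sampled sinusoid appears to have. The helper below is a hypothetical function, not something from the post: it removes the nearest integer multiple of $f_s$, which follows from the $f_0 + kf_s$ identity above, and the numbers are illustrative.

```python
def apparent_frequency(f_hz, fs_hz):
    """Fold f into the band [-fs/2, fs/2] by removing the nearest multiple of fs."""
    return f_hz - fs_hz * round(f_hz / fs_hz)

fs = 8.0
for f in (1.0, 3.0, 7.0, 9.0):
    # A cosine is even, so a negative result shows up with the same magnitude.
    print(f, abs(apparent_frequency(f, fs)))
```

In this sketch the 1 Hz and 3 Hz tones lie below $f_s/2 = 4$ Hz and keep their frequency, while the 7 Hz and 9 Hz tones both show up as 1 Hz.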

The Shannon-Nyquist sampling theorem answers the mathematical question of “how fast do we need to sample?” Understanding the theorem requires some background in Fourier series and transforms, so it will be covered in more detail later. In short, its conclusion is that if we sample a signal at more than twice the frequency of its maximum frequency component, we can reconstruct the original signal.

Frequency Characteristics of Discrete Signals

If we look at the horizontal axis of a discrete signal, we can see that only the sample index is shown.

Since the interval between indices is always 1, the shortest period this axis can express is 1.

In other words, the digital frequency spans a band of width 1, from 0 up to 1 cycle per sample.

Negative frequencies are usually included as well, so this band is centered on zero. The frequency range of a discrete signal is then written in terms of the digital frequency $F$ or, equivalently, the digital angular frequency $\Omega = 2\pi F$:

\[-0.5 \lt F \lt 0.5\]

\[-\pi \lt \Omega \lt \pi\]

On the other hand, we can also consider the frequency characteristics of a discrete sinusoid.

Any signal can be expressed as a linear combination of sinusoids with different frequencies, and this applies not only to continuous signals but also to discrete signals. Therefore, when analyzing a discrete signal, it is often useful to use sinusoids.

There are some differences between discrete sinusoids and continuous sinusoids. Discrete sinusoids are not always periodic signals. In other words, depending on the sampling period, a discrete sinusoid may or may not be a periodic signal.

Let’s assume an arbitrary discrete sinusoidal signal $x[n]$ as follows:

\[x[n]=A\cos(\Omega_0 n)\]

If $x[n]$ is periodic with period $N$, then the following must hold for every $n$ (where $N$ is a positive integer):

\[x[n]=x[N+n]\] \[\Rightarrow A\cos(\Omega_0 n) = A\cos(\Omega_0 (n+N))\]

Therefore, the digital angular frequency $\Omega_0$ must satisfy

\[\Omega_0 N = 2\pi k \Rightarrow \Omega_0 = \frac{2\pi k}{N}\]

(where $k$ is an integer),

or the digital frequency $F_0 = \Omega_0/2\pi$ must be a rational number satisfying

\[\frac{\Omega_0}{2\pi}=F_0=\frac{k}{N}\]

Figure 5. A discrete signal becomes a periodic signal only when the digital frequency is a rational number.
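The rationality condition translates directly into a small computation. In the sketch below, `fundamental_period` is a hypothetical helper: it takes the digital frequency as an exact fraction and returns the smallest $N$, which is the denominator of $F_0$ in lowest terms. An irrational $F_0$ cannot be written as a `Fraction` at all, which mirrors the fact that no integer period exists in that case.

```python
import math
from fractions import Fraction

def fundamental_period(digital_frequency):
    """Smallest positive integer N such that F0 * N is an integer."""
    # Fraction reduces k/N to lowest terms, so the denominator is the period.
    return Fraction(digital_frequency).denominator

F0 = Fraction(3, 8)            # illustrative value: k = 3, N = 8
N = fundamental_period(F0)

# Numerical confirmation that x[n + N] == x[n].
x = [math.cos(2 * math.pi * float(F0) * n) for n in range(3 * N)]
repeats = all(math.isclose(x[n], x[n + N], abs_tol=1e-9) for n in range(2 * N))
print(N, repeats)
```

Note that `Fraction(2, 8)` reduces to 1/4, so its fundamental period is 4 rather than 8.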

In summary, a discrete sinusoid can represent every frequency component within $-\pi$ to $\pi$, and it is periodic in frequency with period $2\pi$.

Therefore, the frequency spectrum of a discrete signal consists of the spectrum between digital angular frequencies $-\pi$ and $\pi$, replicated with period $2\pi$.

The following figure compares the spectrum of a periodic signal with its sampled and discretized result:

Figure 6. The frequency spectrum of a discrete periodic signal shows copies of the spectrum of the original continuous periodic signal at intervals of $2\pi$.

Additionally, the following figure compares the spectrum of a non-periodic signal with that of its discrete counterpart obtained through time sampling.

Figure 7. The frequency spectrum of a discrete non-periodic signal shows copies of the spectrum of the original continuous signal at intervals of $2\pi$.
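The $2\pi$ periodicity of the spectrum can be checked numerically. This sketch assumes the standard discrete-time Fourier transform $X(\Omega)=\sum_n x[n]e^{-j\Omega n}$, which this post has not introduced yet, and uses an arbitrary short signal as test data.

```python
import cmath
import math

def dtft(x, omega):
    """X(omega) = sum_n x[n] * exp(-j*omega*n) for a finite signal x[0..len-1]."""
    return sum(xn * cmath.exp(-1j * omega * n) for n, xn in enumerate(x))

x = [1.0, 0.5, -0.25, 2.0, 0.0, -1.0]   # arbitrary test signal
omega = 0.7                             # arbitrary digital angular frequency

# Shifting omega by any integer multiple of 2*pi leaves the spectrum unchanged.
periodic = all(
    cmath.isclose(dtft(x, omega), dtft(x, omega + 2 * math.pi * k), abs_tol=1e-9)
    for k in (-2, -1, 1, 2)
)
print(periodic)
```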


The word aliasing comes from alias, which means ‘a false name used to conceal one’s identity’. In that spirit, the signal processing field uses the term aliasing for the case where the continuous signal reconstructed from the samples differs from the original signal. ↩

Signal Space

2022-01-12T00:00:00+00:00

Prerequisites

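To get the most out of this post, it helps to be familiar with the following topic.

Basic operations on vectors

signals as vectors

※ Parts of this section are taken from an earlier post, Linear Operators and Signal Space.

In an earlier post on the basics of linear algebra, Basic Operations on Vectors (scalar multiplication and addition), we looked at vectors from three points of view.

These were: a vector is something like an arrow, a vector is a list of numbers, and a vector is an element of a vector space.

Of these, defining a vector as an element of a vector space was described as the most mathematical definition. It matters because, as noted there, 'defining vectors this way emphasizes that anything with these properties can be treated as a vector'.

In other words, once we find a concept that has the properties of a vector, the techniques and ideas of linear algebra can be extended to it.

More concretely, for a collection of mathematical objects (vectors, matrices, signals, and so on) to count as vectors, the collection must be closed under the following two operations.

Scalar multiplication of a vector
Addition of vectors

Too simple?

Just as paying the 2,900 won Rocket WOW membership fee¹ gives you access to every Rocket Delivery service Coupang offers, a mathematical object that is confirmed to satisfy the two rules above receives a membership called 'vector'.

It can then benefit from all the concepts and techniques that linear algebra has worked so hard to build.

Figure 1. Benefits available to Rocket WOW members on Coupang (source: Coupang)

This is not a rigorous proof, but even a quick look shows that signals qualify as vectors.

The scalar multiple of a discrete signal and the sum of two signals are written as follows.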

\[(c\cdot x)[n] = c\cdot x[n] % 식 (1)\] \[(x+z)[n] = x[n]+z[n] % 식 (2)\]

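In other words, multiplying any signal $x[n]$ by an arbitrary constant $c$ still gives a signal, $cx[n]$,

and adding any two signals $x[n]$ and $z[n]$ gives $x[n]+z[n]$, which is again a signal.

Figure 2. A scalar multiple of an arbitrary discrete signal is still a discrete signal.

Figure 3. The sum of two arbitrary discrete signals is still a discrete signal.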
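Equations (1) and (2) can be sketched directly in code. Representing a finite discrete signal as a Python list is an assumption made here for illustration, and the sample values are arbitrary.

```python
def scale(c, x):
    """Eq. (1): (c * x)[n] = c * x[n]."""
    return [c * xn for xn in x]

def add(x, z):
    """Eq. (2): (x + z)[n] = x[n] + z[n], for signals of equal length."""
    if len(x) != len(z):
        raise ValueError("signals must have the same length")
    return [xn + zn for xn, zn in zip(x, z)]

x = [1.0, 2.0, 3.0]
z = [0.5, -1.0, 4.0]

scaled = scale(2.0, x)   # still a list of samples, i.e. still a signal
summed = add(x, z)       # likewise still a signal
print(scaled, summed)
```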

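This holds not only for discrete signals: scaling a continuous signal or adding two continuous signals likewise yields a continuous signal.

Figure 4. The sum of two continuous signals
Image source: Function space, Wikipedia

Just as a vector was defined as an element of a vector space, a signal thereby becomes a vector, an element of a vector space, and the vector space that contains signals is called the “signal space”.

We have now seen that extending the concept of a vector leads to the concept of a signal space.

The important question now is which of the concepts and techniques of linear algebra should be applied to signals, and how.

When extending a concept, it pays to question even the most basic ideas first. Starting with the notion of a vector's 'coordinates' seems like a sensible first step.

A signal is a point in signal space

One of the first things that comes to mind about vectors is the definition that a vector is something like an arrow. People often think of a vector as something that 'has magnitude and direction'.

This definition holds only for Euclidean vectors, so it is not a fully general definition, but it is a great help in understanding vectors visually.

(Once more, keep in mind that what makes something a vector is scalar multiplication and addition, not having a magnitude and a direction. For magnitude and direction to make sense, an inner product must be defined.)

In any case, consider a point in two-dimensional space with coordinates (3,4).

Saying that a vector has coordinates (3,4) is shorthand for stating how many of each of the two basis vectors of a two-dimensional vector space are combined.

The figure below shows the vector with coordinates (3,4) together with the two basis vectors $\hat{i}$ and $\hat{j}$ of the two-dimensional vector space.

Figure 5. The vector with coordinates (3,4) and the standard basis vectors $\hat{i}$ and $\hat{j}$

The next figure shows that the vector with coordinates (3,4) can be built by adding 3 copies of one basis vector and 4 copies of the other.

Figure 6. Having coordinates (3,4) means that the vector can be expressed as the sum of 3 copies of one basis vector and 4 copies of the other.

Do we always have to use these standard basis vectors? In fact, any two linearly independent two-dimensional vectors can serve as basis vectors.

The figure below shows a new coordinate system obtained by rotating the original one counterclockwise by 10°. Its basis vectors are denoted $\hat{i}_{new}$ and $\hat{j}_{new}$.

Figure 7. Can the vector written above as (3,4) be expressed again in the coordinate system of the new basis vectors?

Expressed in the new basis vectors, the original vector has coordinates of approximately (3.6, 3.4). Again, the coordinates state how many of each basis vector are used.

Figure 8. With the new basis vectors, the original vector can be expressed using 3.6 of one and 3.4 of the other.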
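The coordinates (3.6, 3.4) can be reproduced with a short calculation. The sketch assumes the new basis is the standard basis rotated counterclockwise by 10°, as described above. Because the rotated basis is orthonormal, each new coordinate is simply the dot product of the vector with the corresponding basis vector.

```python
import math

def coordinates_in_rotated_basis(v, angle_deg):
    """Coordinates of v in the standard basis rotated counterclockwise by angle_deg."""
    t = math.radians(angle_deg)
    i_new = (math.cos(t), math.sin(t))    # rotated i-hat
    j_new = (-math.sin(t), math.cos(t))   # rotated j-hat
    d1 = v[0] * i_new[0] + v[1] * i_new[1]
    d2 = v[0] * j_new[0] + v[1] * j_new[1]
    return d1, d2, i_new, j_new

d1, d2, i_new, j_new = coordinates_in_rotated_basis((3.0, 4.0), 10.0)
print(round(d1, 1), round(d2, 1))   # 3.6 3.4

# Rebuilding the vector from the new coordinates returns (3, 4).
rebuilt = (d1 * i_new[0] + d2 * j_new[0], d1 * i_new[1] + d2 * j_new[1])
```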

Thus a vector is a point in a vector space, but the way it is written down depends on the basis.

In symbols, an arbitrary vector $\vec{v}$ can be written as a linear combination of basis vectors as follows.

\[\vec{v}=c_1 \hat{i} + c_2 \hat{j} = d_1 \hat{i}_{new} + d_2\hat{j}_{new}\]

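Some bases give a simpler and more compact representation of the same vector than others.

In the example above, $c_1$ and $c_2$ were simply 3 and 4, whereas $d_1$ and $d_2$ became the slightly messier 3.6 and 3.4.

Choosing a good basis for representing a given vector is therefore very important.

Signals are no different: an arbitrary signal can be expressed as a linear combination of basis signals.

If the signal space containing an arbitrary signal $x[k], k = 1, 2,\cdots n$ has the basis signals $\lbrace \phi_i[k] | i = 1,2,\cdots, n\rbrace$, then $x[k]$ can be expressed as the following linear combination of basis signals.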

\[x[k]=\sum_{i=1}^{n}p_i \phi_i[k]\]

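The same goes for continuous signals: if the signal space containing an arbitrary signal $x(t)$ has the basis signals $\lbrace \psi _i(t)\rbrace$, the signal can be expressed as the following linear combination of basis signals.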

\[x(t) = \sum_i q_i \psi_i(t)\]

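Meanwhile, the number of each basis vector needed to express a vector can be found with the vector inner product, at least when the basis is orthonormal. That is, $p_i$ and $q_i$ in the expressions above can be computed once an inner product of signals is defined in the same way as the inner product of vectors.

Inner product of vectors → inner product of signals

In linear algebra, the inner product of vectors was defined as follows.

For arbitrary $n$-dimensional real vectors $\vec{a}$ and $\vec{b}$ of the form below,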

\[\vec{a} = \begin{bmatrix}a_1\\ a_2 \\ \vdots \\ a_n\end{bmatrix} % 식 (6)\] \[\vec{b} = \begin{bmatrix}b_1\\ b_2 \\ \vdots \\ b_n\end{bmatrix} % 식 (7)\] \[\text{dot}(\vec{a}, \vec{b})=a_1b_1 + a_2b_2 +\cdots + a_nb_n % 식 (8)\]

If $\vec{a}$ and $\vec{b}$ are complex vectors, the inner product is defined as follows.

\[\text{dot}(\vec{a}, \vec{b})=a_1^*b_1 + a_2^*b_2 +\cdots + a_n^*b_n % 식 (9)\]

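Here $*$ denotes the complex conjugate.

The reason a complex conjugate appears for complex vectors is that the inner product is what defines length in a complex vector space.

The magnitude (usually the L2-norm) of a real vector $\vec{a}$ is defined as follows.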

\[\text{norm}_2(\vec{a}) = \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2} % 식 (10)\]

That is,

\[\text{norm}_2(\vec{a}) = \sqrt{\text{dot}(\vec{a}, \vec{a})}=\sqrt{a_1a_1+a_2a_2+\cdots+a_na_n} % 식 (11)\]

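Extending this idea to complex vectors, the length of a complex vector $\vec{a}$ must be a non-negative real number, so each $a_i^2$ is replaced by $|a_i|^2 = a_i^* a_i$:

\[\text{norm}_2(\vec{a})=\sqrt{|a_1|^2+|a_2|^2 + \cdots + |a_n|^2}=\sqrt{a_1^*a_1+a_2^*a_2+\cdots +a_n^*a_n}=\sqrt{\text{dot}(\vec{a},\vec{a})} % 식 (12)\]

This is why the inner product of complex vectors has to be defined as in Eq. (9).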
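A short numerical example shows why the conjugate is needed. The vector below is an arbitrary illustrative choice. Summing plain squares gives a complex number, which cannot serve as a squared length, whereas the conjugated product of Eq. (9) gives a real, non-negative value.

```python
import math

def dot(a, b):
    """Complex dot product as in Eq. (9): conjugate the first argument."""
    return sum(ai.conjugate() * bi for ai, bi in zip(a, b))

def norm2(a):
    """L2 norm defined through the dot product, as in Eq. (12)."""
    return math.sqrt(dot(a, a).real)

a = [3 + 4j, 1 - 2j]

# Without the conjugate the "squared length" is complex: (-10+20j).
naive = sum(ai * ai for ai in a)

# With the conjugate it is |3+4j|^2 + |1-2j|^2 = 25 + 5 = 30.
squared_length = dot(a, a)
length = norm2(a)
print(naive, squared_length, length)
```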

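Now let us extend Eq. (9) to define the inner product of signals.

Signals are not limited to real values; their values can also be complex. The inner product of signals is therefore defined by extending the inner product of complex vectors. (Below, the conjugate is placed on the second argument rather than the first as in Eq. (9); both conventions are in common use.)

For discrete signals the definition is as follows. For arbitrary complex discrete signals $x[k]$ and $z[k]$, $k = 1, 2, \cdots, n$,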

\[\langle x[k], z[k] \rangle \equiv \sum_{k=1}^n x[k]z^*[k]\]

Here $z^*[k]$ is the complex conjugate of $z[k]$.

Likewise, for arbitrary complex continuous signals $x(t)$ and $z(t)$ defined on the interval $(a, b)$, the inner product $\langle x(t), z(t)\rangle$ of the two signals is

\[\langle x(t), z(t)\rangle \equiv \int_a^b x(t)z^*(t) dt % 식 (10)\]

Here $z^*(t)$ is the complex conjugate of $z(t)$.
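With the discrete inner product in hand, the coefficients $p_i$ of the expansion $x[k]=\sum_i p_i\phi_i[k]$ can be computed as $p_i=\langle x, \phi_i\rangle$ whenever the basis signals are orthonormal. The sketch below picks normalized complex exponentials as the basis; that choice, the test signal, and the 0-based indexing (the text counts from 1) are assumptions made for illustration.

```python
import cmath
import math

def inner(x, z):
    """Discrete inner product <x, z> = sum_k x[k] * conj(z[k])."""
    return sum(complex(xk) * complex(zk).conjugate() for xk, zk in zip(x, z))

def exponential_basis(n):
    """An orthonormal basis (assumed here): phi_i[k] = exp(j*2*pi*i*k/n) / sqrt(n)."""
    return [[cmath.exp(2j * math.pi * i * k / n) / math.sqrt(n) for k in range(n)]
            for i in range(n)]

x = [1.0, 2.0, 0.0, -1.0]
basis = exponential_basis(len(x))

# Coefficients: how much of each basis signal is contained in x.
coeffs = [inner(x, phi) for phi in basis]

# Linear combination of the basis signals rebuilds the original signal.
rebuilt = [sum(p * phi[k] for p, phi in zip(coeffs, basis)) for k in range(len(x))]
print([round(r.real, 6) for r in rebuilt])
```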

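Eigenfunctions

To better understand eigenfunctions, it helps to be familiar with the following topic.

Understanding eigenfunctions explains why the signals and systems field describes signals in terms of complex sinusoids.

So far we have seen that a signal (that is, a function) is a vector. We have also partly confirmed that, because signals are vectors, the terminology of linear algebra can be extended to them and even the methods of linear algebra can be put to use.

Eigenvalues and eigenvectors, one of the most important topics in linear algebra, can also be applied in part to signal processing.

To understand eigenvectors better, we first need the relationship between vectors and matrices.

A matrix can be regarded as a function of vectors: it takes a vector as input and produces another vector as output.

Figure 9. A matrix is a function that takes a vector as input and outputs a vector.

It can happen that, when a matrix takes in a vector, the output vector differs from the input only in magnitude, with its direction unchanged.

Figure 10. A case where the input vector ($x$) and the output vector ($Ax$) have the same direction and differ only in magnitude

In this case, the unit vector pointing in the direction of $x$ is called an eigenvector of the matrix $A$, and the factor by which the magnitude changes is called the eigenvalue.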
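A minimal numerical illustration follows, with a matrix and vectors chosen only as an example (they are not taken from Figure 10). For the first input the output is 3 times the input, so that direction is an eigenvector direction with eigenvalue 3. For the second input the direction changes, so it is not an eigenvector.

```python
def matvec(A, x):
    """Multiply the matrix A (a list of rows) by the vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

A = [[2.0, 1.0],
     [1.0, 2.0]]

x = [1.0, 1.0]
Ax = matvec(A, x)    # [3.0, 3.0]: same direction as x, scaled by 3

y = [1.0, 0.0]
Ay = matvec(A, y)    # [2.0, 1.0]: direction changed, so y is not an eigenvector

print(Ax, Ay)
```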

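What about the signals and systems we are studying? If a signal is a vector, then a system corresponds to a matrix.

Figure 11. If a signal corresponds to a vector, a system corresponds to a matrix.

Then do the systems we deal with also have a counterpart of the eigenvector?

In signals and systems, the counterpart of an eigenvector is usually called an eigenfunction. (It is not usually called an eigensignal.)

For linear time-invariant (LTI) systems, usually the most important class, complex sinusoids are the eigenfunctions.

Figure 12. In an LTI system, complex sinusoids are eigenfunctions.

In more detail, if the input is $x(t)=e^{j\omega t}$ and the impulse response of the system is $h(t)$, the output is given by the convolution of the two: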

\[y(t) = \int_{-\infty}^{\infty}e^{j\omega (t-\tau)}h(\tau)d\tau\] \[=e^{j\omega t}\int_{-\infty}^{\infty}h(\tau)e^{-j\omega\tau}d\tau\]

Here we define $H(\omega)$ as below; it is called the Fourier transform of $h(t)$.

\[H(\omega) = \int_{-\infty}^{\infty}h(\tau)e^{-j\omega\tau}d\tau\]

The important point is that the original expression can now be rewritten as

\[y(t)=H(\omega)e^{j\omega t}\]

Looking at the output, the original input $e^{j\omega t}$ appears unchanged, merely multiplied by $H(\omega)$.
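The same property can be checked numerically in discrete time. The sketch below is a discrete-time analogue of the derivation above rather than the continuous-time case itself: the integral becomes a finite sum, and the three-tap impulse response `h` is a hypothetical example.

```python
import cmath

def fir_output(h, x_func, n):
    """Convolution sum y[n] = sum_m h[m] * x[n - m] for a finite impulse response."""
    return sum(hm * x_func(n - m) for m, hm in enumerate(h))

def frequency_response(h, omega):
    """H(omega) = sum_m h[m] * exp(-j*omega*m), the discrete counterpart of H."""
    return sum(hm * cmath.exp(-1j * omega * m) for m, hm in enumerate(h))

h = [0.5, 0.3, 0.2]   # hypothetical impulse response
omega = 0.9           # arbitrary frequency

def x(n):
    """Complex sinusoid input e^{j*omega*n}."""
    return cmath.exp(1j * omega * n)

H = frequency_response(h, omega)

# The output equals the input multiplied by the single complex number H.
is_eigenfunction = all(
    cmath.isclose(fir_output(h, x, n), H * x(n), abs_tol=1e-9)
    for n in range(-5, 6)
)
print(is_eigenfunction, H)
```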

Because $e^{j\omega t}$ falls out so naturally, it may be hard to see what is so special about it. So this time, let us feed in a cosine instead.

By Euler's formula, the cosine can be rewritten as follows.

\[x(t) = \cos(\omega t)=\frac{1}{2}(e^{j\omega t}+e^{-j\omega t})\]

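Writing the system as $\mathfrak L$, the following holds because our system is linear.

\[y(t) = (\mathfrak{L}x)(t)=\frac{1}{2}\left(\mathfrak{L}(e^{j\omega t}) + \mathfrak{L}(e^{-j\omega t})\right)\]

Writing $H(\omega)$ in polar form, and assuming the impulse response $h(t)$ is real-valued so that $H(-\omega) = H^*(\omega)$,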

\[H(\omega) = |H(\omega)|e^{j \angle H(\omega)}\] \[H(-\omega) = H^*(\omega) = |H(\omega)|e^{-j\angle H(\omega)}\]

so, since $\mathfrak{L}(e^{\pm j\omega t}) = H(\pm\omega)e^{\pm j\omega t}$, $y(t)$ can be rewritten as follows.

\[y(t) = \frac{1}{2}|H(\omega)|\left(e^{j(\omega t +\angle H(\omega))} + e^{-j(\omega t +\angle H(\omega))}\right)\] \[=|H(\omega)|\cos(\omega t + \angle H(\omega))\]

Therefore, when a cosine is fed in, the system not only scales its magnitude by $|H(\omega)|$ but also shifts its phase by $\angle H(\omega)$.
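This result can also be checked in discrete time, again as an analogue of the continuous-time derivation. The real three-tap impulse response is a hypothetical example. The output matches $|H|\cos(\Omega n+\angle H)$, and the ratio of output to input is not constant, which is what rules the cosine out as an eigenfunction here.

```python
import cmath
import math

h = [0.5, 0.3, 0.2]   # hypothetical real impulse response
omega = 0.9           # arbitrary frequency

# Frequency response at omega, split into gain and phase.
H = sum(hm * cmath.exp(-1j * omega * m) for m, hm in enumerate(h))
gain, phase = abs(H), cmath.phase(H)

def y(n):
    """Output for a cosine input: y[n] = sum_m h[m] * cos(omega * (n - m))."""
    return sum(hm * math.cos(omega * (n - m)) for m, hm in enumerate(h))

# The output is a cosine scaled by |H| and shifted in phase by angle(H).
matches = all(
    math.isclose(y(n), gain * math.cos(omega * n + phase), abs_tol=1e-9)
    for n in range(-5, 6)
)

# Output/input ratio at two sample indices: it is not the same number.
ratio_at_0 = y(0) / math.cos(0.0)
ratio_at_1 = y(1) / math.cos(omega)
print(matches, ratio_at_0, ratio_at_1)
```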

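So when a cosine is the input, the output is not simply a scalar multiple of the input (unless the phase shift happens to vanish), and the cosine is therefore not an eigenfunction of the LTI system.

What we learn from this is that the signals and systems field expresses signals in terms of complex sinusoids, and the reason is that once the input is expressed this way, describing the output only requires the characteristic of the system (the Fourier transform of its impulse response), which keeps the description of the output compact.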


The fee has gone up a bit since then. ↩