PAC Learning

PAC Learning

카테고리 없음 2019. 8. 4. 17:40
https://www.youtube.com/watch?v=qOMOYM0WCzU

PAC = Probably Approximately Correct

Notation

Given:

Set of instances $X$

Set of hypotheses $H$

Set of possible target concepts $C$

Training instances generated by a fixed, unknown probability distribution $\mathcal{D}$ over $X$

Learner observes sequence $D$ of training examples $\langle x,c(x) \rangle$, for some target concept $c\in C$

Instances $x$ are drawn from distribution $\mathcal{D}$

$\mathcal{D}$는 데이터가 뽑히는 잠재적 분포 / $D$는 실제로 뽑혀서 우리에게 나타난 관찰 데이터.

Concept class(set) $C$: label을 만들어내는 mapping $y=c(x), c\in C$.

Wiki에는 mapping이 아니라, 전체 $X$ 중 label에 대응하는 subset이라고 나와있음. (확인)

Hypothesis space(set) $H$: $\hat{y}=h(x), h\in H$

실제 상황: Concept $c$가 Hypothesis space의 멤버가 아닌 상황 ($c\notin H$)

https://en.wikipedia.org/wiki/Version_space_learning

초록색 +는 positive examples / 빨간 o은 negative examples

GB: maximally general positive hypothesis boundary

SB: maximally specific positive hypothesis boundary

GB~SB(초록색 rectangles) = hypothesis in the VS(version space)

VS에서의 PAC Learning

VS에서는 Training data에 대한 에러($err$)는 0이다.

Concept $c$가 Hypothesis space의 멤버인 경우를 가정 ($c\in H$)

Test data에 대한 에러($Err$)를 $\epsilon$이라고 할 때,

Theorem: $P(Err>\epsilon)<|H|e^{-\epsilon m}$

증명:

어떤 hypothesis에서 1개의 test data에 대하여 틀릴 확률: $\epsilon$

어떤 hypothesis에서 1개의 test data에 대하여 맞출 확률: $1-\epsilon$

어떤 hypothesis에서 m개의 test data에 대하여 맞출 확률: $(1-\epsilon)$

모든 $x$에 대하여 $1+x\leq e^{x}$이므로: $(1-\epsilon)^m \leq e^{-\epsilon m}$

$(1-\epsilon)^m \leq e^{-\epsilon m}$

$|H|$개의 멤버를 가지는 hypothesis space $H$에서 $m$개의 testdata에 대하여 맞출 확률?

$P(Err=\epsilon)=(1-\epsilon)^m \leq |H|e^{-\epsilon m}$

위 그래프에 $m$승을 해서 구한 확률은 여전히 단조감소 함수가 되고,

$\epsilon$이 증가하면 $P$는 감소하므로 다음과 같은 inequality로 표현 가능:

$P(Err>\epsilon)<|H|e^{-\epsilon m}$

이때의 sample complexity(필요한 데이터 크기)?

Generalization을 못하는 나쁜 classifier가 나올 확률 $P(Err>\epsilon)$의 최대값을 $\delta$로 바운드하면:

$|H|e^{-\epsilon m} \geq \delta$

이때의 m은:

(양변에 로그) $\log|H|-\epsilon m \leq \log\delta$

(양변에 로그) $\log|H|-\log\delta \leq \epsilon m$

(양변에 $\frac{1}{\epsilon}$) $\frac{1}{\epsilon} (\log|H|-\log\delta) \leq m$

우리의 classifier의 에러가 $\epsilon$보다 클(approximately correct) 확률이 $\delta$보다 작으려면(probably)

적어도 $\frac{1}{\epsilon} (\log|H|+\log\frac{1}{\delta})$ 개의 데이터 갯수가 필요하다!

$|H|e^{-\epsilon m} \geq \delta$, $m \geq \frac{1}{\epsilon} (\log|H|+\log\frac{1}{\delta})$

위 두가지 사실에서 알 수 있는 사실은?

좋은 뉴스: 데이터 포인트가 많은 수록 나쁜 classifier의 확률에 대한 바운드 $\delta$는 exponentially 낮아진다.

나쁜 뉴스: Hypothesis space가 rich할 수록 $|H|$가 커지며, 이는 크기 때문에 바운드를 linear하게 높히게 된다.

실제 상황: 이 기준을 만족하는 $m$을 확보하는 것은 불가능에 가깝지만 실제 바운드는 더 loose하므로 괜찮다.

Agnostic Learning

더이상 VS에 제한되어 있지 않은 경우

실제 문제에서는 Concept $c$가 Hypothesis space의 멤버가 아님 ($c\notin H$)

Hoeffding bound를 사용하자!

(Bernoulli case)앞면이 $p$의 확률로 나오는 동전을 $n$번 던졌을 때의 실제 확률 $\overline{p}_n$

이때의 차이 $|p-\overline{p}_n|$에 대한 확률은 어떤 범위에 있을까?

$P(\overline{X}-X[\overline{X}]\geq t)\leq e^{-2nt^2}$

현재 상황에 Hoeffding bound를 적용해보자!

$P(Err(h)-err(h)+\epsilon]\geq t)\leq e^{-2m\epsilon^2}$

Generalization gap이 가장 큰 hypothesis $h_{best}$에 경우 전체 space를 고려:

$P(Err(h_{best})-err(h_{best})+\epsilon]\geq t)\leq |H|e^{-2m\epsilon^2}$

위에 똑같은 방법으로 sample complexity를 구할 수 있다:

$m \geq \frac{1}{2\epsilon^2} (\log|H|+\log\frac{1}{\delta})$

Infinite Hypothesis Set Case

지금까지는 finite hypothesis set을 가정했다.

그러나 대부분의 실제 학습 상황에서는 hypothesis set은 무한대의 크기를 가진다.

간단한 2차원에서 직선으로 이진 분류를 하는 경우만 생각해도 무한개의 직선이 VS에 존재할 수 있다.

그러므로, $|H|=\infty$인 상황에서, 기존의 식을 그대로 사용하는 것은 타당하지 않다.

이 문제를 어떻게 해결할까?

바로 $|H|$를 $VC(H)$로 대체하는 것이다.

$VC$ space는 hypothesis space의 capacity를 처리 가능한 training data의 갯수로써 표현한 수치이다.

실제 크기로 표현할 수 없으니, 주어진 training data에 종속적인 capacity로 대신 표현해보자는 것이다.

어차피 우리는 training set에 기반하여 hypothesis를 결정하기 때문에, 전체 space를 전부 고려할 필요는 없다.

가능한 hypothesis set은 무한개이지만, 우리가 classfiy하고자 하는 셋은 유한개라는 점을 이용하는 것이다.

전체를 고려하는 대신 필요한 만큼의 flexibility만 고려하는 방법으로 해석할 수 있다.

결론부터 얘기하면 새로운 질문에 대한 답은 다음과 같다:

How many randomly drawn examples suffice to guarantee error of at most $\epsilon$ with probability at least $(1-\delta)$?

$m \geq \frac{1}{}\epsilon(4\log_2(2/\delta)+8VC(H)log_2(13/\epsilon))$

VC를 이해하기 위해 필요한 개념은 아래와 같다.

Dichotomy (dai·kaa·tuh·mee / 양분, 이분)

Definition: a dichotomy of a set $S$ is a partition of $S$ into two disjoint subsets.

전체 셋을 두 개의 disjoint 셋으로 나누는 행위 자체를 dichotomy라고 한다.
전체 셋에 대해 여러 개의 dichotomy가 존재할 수 있다. (다른 분할 경계)

Shattering

Definition: a set of instances $S$ is shattered by hypothesis space $H$ if and only if for every dichotomy of $S$ there exists some hypothesis in $H$ consistent with this dichotomy.

전체 셋을 나눌 수 있는 dichotomy의 모든 경우의 수에 대해서, hypothesis space $H$에 존재하는 한 개 이상의 hypothesis $h$로 나눌 수 있는 경우, $S$가 $H$에 대해서 shatter되었다고 말한다.

예를 들어, hypothesis space가 모든 타원을 포함할때, 데이터 셋이 가질 수 있는 모든 이분할 조합에 대해 타원으로써 결정경계를 표시할 수 있어야 shatter된다고 말할 수 있다.

VC dimension (The Vapnik-Chervonenkis Dimension)

Definition: The Vapnik-Chervonenkis dimension, $VC(H)$, of hypothesis space $H$ defined over instance space $X$ is the size the largest finite subset of $X$ shattered by $H$. If arbitrarily large finite sets of $X$ can be shattered by $H$, then $VC(H)\equiv \infty$.

Hypothesis space $H$가 shatter할 수 있는 가장 큰 데이터 $X$의 개수를 $H$의 VC dimension이라고 부른다. 예를 들어, 2차원 데이터를 straight line으로 shatter할 수 있는 가장 큰 데이터 개수는 3개 이다. 이는 다음과 같이 일반화 가능하다: $d$차원의 hyper-plane의 VC dimension은 $d+1$이다.

https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension
댓글

ABOUT ME

ML Tech Blog ML Tech Blog

Notation

VS에서의 PAC Learning

Agnostic Learning

Infinite Hypothesis Set Case

티스토리툴바