Cross Entropy Loss

Information Theory

: mainly used in coding theory, e.g. Huffman coding

  • Surprise = unexpectedness = disorder = low probability
  • Information = level of surprise
    if event X has a high probability P(X),
    then X = "No Surprise".
    else if event X has a low probability P(X),
    then X = "Surprise".
  • e.g. rain in summer has a high probability P(rain_summer),
    while snow in summer has a low probability P(snow_summer),
    so rain_summer = "No Surprise",
    but snow_summer = "Surprise".

1. When is the information = level of surprise high?

: when P(X) has low probability, Information(X) ∝ 1/P(X)


2. How can we measure this Information = level of surprise?

: Coding Length(P(X)) = log(1/P(X)) (whatever the log base: 2, 3, 10, e, ...)
: Coding Length(P(X)) = -log(P(X))

Assume P(rain_summer) = 99/100, and P(snow_summer) = 1/100.

  • Coding_Length(P(rain_summer)) = log(100/99) ≈ 0.014, assume log base=2
  • Coding_Length(P(snow_summer)) = log(100/1) ≈ 6.64, assume log base=2
    >> P(snow_summer) carries a higher level of surprise = Information than P(rain_summer)
The smaller the probability, the more information, the higher the surprise!
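A minimal Python sketch of the coding-length numbers above (assuming log base 2 and the made-up probabilities P(rain_summer) = 99/100, P(snow_summer) = 1/100):

    import math

    def information(p: float) -> float:
        """Self-information = optimal coding length in bits: -log2(p)."""
        return -math.log2(p)

    print(information(99 / 100))  # ~0.014 bits -> almost no surprise
    print(information(1 / 100))   # ~6.64 bits  -> high surprise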

3. When sending some events, how much Coding Length is required on average?

: Entropy = summation(P(X) * Coding Length) = summation(P(X) * log(1/P(X)))

  • Fair Coin (when every event has the same probability)
    • P(H=T) = 1/2
    • I(P(H)) = -log(1/2) = log(2) = 1, assume log base=2
    • I(P(T)) = -log(1/2) = log(2) = 1, assume log base=2
    • entropy = P(H) * I(P(H)) + P(T) * I(P(T)) = 1/2 + 1/2 = 1
  • Unfair Coin
    • P(H) = 3/4, P(T) = 1/4
    • I(P(H)) = -log(3/4) = log(4/3) ≈ 0.415, assume log base=2
    • I(P(T)) = -log(1/4) = log(4) = 2, assume log base=2
    • entropy = P(H) * I(P(H)) + P(T) * I(P(T)) = 0.75 * 0.415 + 0.25 * 2 ≈ 0.81
  • The Fair Coin has more entropy = more disorder = less predictable
  • The Unfair Coin has less entropy = more predictable

Entropy = average Information = average Surprise = average Unexpectedness
: : the average coding length required when using the optimal coding scheme
entropy = H(X) = Expectation_P(X)[Coding Length = Information = I(X)]
= summation(P(X) * I(X)) = summation(P(X) * log(1/P(X)))
= (-1) * summation(P(X) * log(P(X)))
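A small sketch of the entropy formula above, applied to the fair and unfair coins (log base 2 assumed):

    import math

    def entropy(probs):
        """Average coding length in bits under the optimal coding scheme."""
        return sum(p * math.log2(1 / p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))    # fair coin   -> 1.0 bit (most unpredictable)
    print(entropy([0.75, 0.25]))  # unfair coin -> ~0.81 bits (more predictable)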


4. What is Cross Entropy?

: measures the average number of bits needed to identify an event drawn from the set

Cross Entropy
: : measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an "unnatural" (guessed) probability distribution rather than the "true" distribution
: : i.e. the average coding length obtained by using a non-optimal coding scheme

  • p : true distribution
  • q : guessed distribution
  • Cross Entropy = H(p,q) = Ep[log(1/q)] = (-1) * Ep[log(q)]
    = summation(p(X) * log(1/q(X))) = non-optimal coding length
    (binary case: p(X) * log(1/q(X)) + (1 - p(X)) * log(1/(1 - q(X))))
  • H(p,q) >= H(p), equality holds when p = q
    = Cross Entropy is never smaller than Entropy

    Assume that there are two kinds of entities - blue ball and red ball.

    • p(blue ball) = 2/5, p(red ball) = 3/5
    • and I guess, "blue ball is 4/5, and red ball is 1/5".
    • and compare it with the true distribution
    • Cross Entropy = H(p,q) = summation(true * log(1/guess))
      = 2/5 * log(5/4) + 3/5 * log(5) = 1.52, assume log base=2
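A quick sketch of the ball example, assuming log base 2 and the probabilities given above:

    import math

    def cross_entropy(p, q):
        """Average coding length when events from p are coded with a scheme optimal for q."""
        return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q))

    p = [2 / 5, 3 / 5]  # true distribution (blue, red)
    q = [4 / 5, 1 / 5]  # guessed distribution (blue, red)

    print(cross_entropy(p, q))  # ~1.52 bits = H(p, q), the non-optimal coding length
    print(cross_entropy(p, p))  # ~0.97 bits = H(p), the optimal lower bound (H(p,q) >= H(p))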

5. Cross Entropy in Softmax Neuron

  • prediction distribution (guess = q) = [0.3, 0.3, 0.4]
  • target distribution (true = p) = [0, 0, 1]
  • H(p, q) = 0 + 0 + 1 * log(1/0.4) = log(2.5) ≈ 1.32, assume log base=2
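A sketch of the same computation, showing that with a one-hot target the cross entropy reduces to -log of the probability assigned to the correct class (log base 2 assumed):

    import math

    q = [0.3, 0.3, 0.4]  # softmax output = prediction = guess
    p = [0, 0, 1]        # one-hot label = target = true

    # full definition: sum over classes of p * log2(1/q) ...
    h_pq = sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)
    # ... reduces to -log2(q[correct class]) because p is one-hot
    print(h_pq, -math.log2(q[2]))  # both ~1.32 bits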

6. What is KL Divergence?

: is used for measuring how far apart two distributions are.

How can we use Entropy and Cross Entropy to measure the distance between two distributions?

  • H(true, guess) = non-optimal = optimal + extra Coding Length
  • H(true) = optimal
  • distance(true, guess) = H(true, guess) - H(true) = extra Coding Length
    = KL(true||guess) = KL(p||q)
    = H(p, q) - H(p) = Ep[log(1/q)] - Ep[log(1/p)] = Ep[log(1/q)] + Ep[log(p)] = Ep[log(p/q)]
    = summation(p * log(p/q)) = summation(true*log(true/guess))

Assume a true distribution p and two guessed distributions q and r.

  • p = [0.25, 0.25, 0.25, 0.25], fair (uniform) distribution
  • q = [0.25, 0.25, 0.26, 0.24]
  • r = [0.4, 0.15, 0.05, 0.4]
  • KL(p||q) = 0.25 * log(0.25/0.25) + 0.25 * log(0.25/0.25) + 0.25 * log(0.25/0.26) + 0.25 * log(0.25/0.24)
    = 0 + 0 + 0.25 * log(25/26) + 0.25 * log(25/24)
    = 0.0004, assume log base=e
  • KL(p||r) = 0.25 * log(0.25/0.4) + 0.25 * log(0.25/0.15) + 0.25 * log(0.25/0.05) + 0.25 * log(0.25/0.4)
    = 0.5 * log(5/8) + 0.25 * log(5/3) + 0.25 * log(5)
    = 0.295, assume log base=e
  • KL(p||q) != KL(q||p): KL Divergence is asymmetric, so KL cannot be used as a true distance metric.
    • KL(p||q) = 0.00040032, KL(q||p) = 0.00040011
  • KL(p||q) >= 0, since KL(p||q) = H(p,q) - H(p) = the Extra coding length!
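A sketch that reproduces the KL numbers above (natural log, base e, as assumed in the example):

    import math

    def kl(p, q):
        """KL(p || q) = sum(p * ln(p / q)) = H(p, q) - H(p) = extra coding length."""
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.25, 0.25, 0.25, 0.25]
    q = [0.25, 0.25, 0.26, 0.24]
    r = [0.4, 0.15, 0.05, 0.4]

    print(kl(p, q))            # ~0.0004 -> q is close to p
    print(kl(p, r))            # ~0.295  -> r is far from p
    print(kl(p, q), kl(q, p))  # slightly different values -> KL is asymmetric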

Does using Cross Entropy really make sense when we deal with probabilities?
Shouldn't we use the KL Divergence between the predicted probability = guess
and the label = true, instead of Cross Entropy?

  • t = label = true
  • y = prediction = guess
  • minimizing KL(t||y) = minimizing (Cross Entropy - Entropy)
    = minimizing (H(t,y) - H(t))
  • only the guess y changes during training, so the result depends only on the Cross Entropy.
    ... the Entropy H(t) is a constant and has no effect on the prediction.
  • we try to minimize Cross Entropy, but we are actually minimizing KL Divergence (see the sketch below)!
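A minimal sketch of this point, with a hypothetical soft label t: since H(t) is constant, the cross entropy and the KL divergence decrease together as the guess y approaches t (log base 2 assumed):

    import math

    def cross_entropy(p, q):
        return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

    def entropy(p):
        return cross_entropy(p, p)

    t = [0.1, 0.2, 0.7]                             # hypothetical soft label (true)
    for y in ([0.3, 0.3, 0.4], [0.15, 0.2, 0.65]):  # two candidate predictions (guess)
        h_ty = cross_entropy(t, y)
        kl_ty = h_ty - entropy(t)                   # KL(t || y)
        print(h_ty, kl_ty)                          # both shrink as y gets closer to t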
