Cross Entropy Loss

Information Theory

: mainly used in coding theory, e.g. Huffman coding

  • Surprise = unexpectedness = disorder = low probability
  • Information = level of surprise
    if event X has a high probability P(X),
    then X = "No Surprise".
    else if event X has a low probability P(X),
    then X = "Surprise".
  • e.g. rain in summer has a high probability P(rain_summer),
    while snow in summer has a low probability P(snow_summer),
    so rain_summer = "No Surprise",
    but snow_summer = "Surprise".

1. When is the information = level of surprise high?

: when P(X) has low probability, Information(X) ∝ 1/P(X)


2. How can we measure this Information = level of surprise?

: Coding Length(P(X)) = log(1/P(X)) (whatever the log base: 2, 3, 10, e, ...)
: Coding Length(P(X)) = -log(P(X))

Assume P(rain_summer) = 99/100, and P(snow_summer) = 1/100.

  • Coding_Length(P(rain_summer)) = log(100/99) ≈ 0.014, assume log base=2
  • Coding_Length(P(snow_summer)) = log(100/1) ≈ 6.64, assume log base=2
    >> P(snow_summer) carries a higher level of surprise = Information than P(rain_summer)
The smaller the probability, the more information, the higher the surprise!
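A minimal Python sketch of the coding-length numbers above (assuming log base 2 and the made-up probabilities P(rain_summer) = 99/100, P(snow_summer) = 1/100):

    import math

    def information(p: float) -> float:
        """Self-information = optimal coding length in bits: -log2(p)."""
        return -math.log2(p)

    print(information(99 / 100))  # ~0.014 bits -> almost no surprise
    print(information(1 / 100))   # ~6.64 bits  -> high surprise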

3. When sending some events, how much Coding Length is required on average?

: Entropy = summation(P(X) * Coding Length) = summation(P(X) * log(1/P(X)))

  • Fair Coin (when every event has the same probability)
    • P(H=T) = 1/2
    • I(P(H)) = -log(1/2) = log(2) = 1, assume log base=2
    • I(P(T)) = -log(1/2) = log(2) = 1, assume log base=2
    • entropy = P(H) * I(P(H)) + P(T) * I(P(T)) = 1/2 + 1/2 = 1
  • Unfair Coin
    • P(H) = 3/4, P(T) = 1/4
    • I(P(H)) = -log(3/4) = log(4/3) ≈ 0.415, assume log base=2
    • I(P(T)) = -log(1/4) = log(4) = 2, assume log base=2
    • entropy = P(H) * I(P(H)) + P(T) * I(P(T)) = 0.75 * 0.415 + 0.25 * 2 ≈ 0.81
  • The Fair Coin has more entropy = more disorder = less predictable
  • The Unfair Coin has less entropy = more predictable

Entropy = average Information = average Surprise = average Unexpectedness
: : the average coding length required when using the optimal coding scheme
entropy = H(X) = Expectation_P(X)[Coding Length = Information = I(X)]
= summation(P(X) * I(X)) = summation(P(X) * log(1/P(X)))
= (-1) * summation(P(X) * log(P(X)))
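A small sketch of the entropy formula above, applied to the fair and unfair coins (log base 2 assumed):

    import math

    def entropy(probs):
        """Average coding length in bits under the optimal coding scheme."""
        return sum(p * math.log2(1 / p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))    # fair coin   -> 1.0 bit (most unpredictable)
    print(entropy([0.75, 0.25]))  # unfair coin -> ~0.81 bits (more predictable)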


4. What is Cross Entropy?

: measures the average number of bits needed to identify an event drawn from the set

Cross Entropy
: : measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an "unnatural" (guessed) probability distribution rather than the "true" distribution
: : i.e. the average coding length obtained by using a non-optimal coding scheme

  • p : true distribution
  • q : guessed distribution
  • Cross Entropy = H(p,q) = Ep[log(1/q)] = (-1) * Ep[log(q)]
    = summation(p(X) * log(1/q(X))) = non-optimal coding length
    (binary case: p(X) * log(1/q(X)) + (1 - p(X)) * log(1/(1 - q(X))))
  • H(p,q) >= H(p), equality holds when p = q
    = Cross Entropy is never smaller than Entropy

    Assume that there are two kinds of entities - blue ball and red ball.

    • p(blue ball) = 2/5, p(red ball) = 3/5
    • and I guess, "blue ball is 4/5, and red ball is 1/5".
    • and compare it with the true distribution
    • Cross Entropy = H(p,q) = summation(true * log(1/guess))
      = 2/5 * log(5/4) + 3/5 * log(5) = 1.52, assume log base=2
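A quick sketch of the ball example, assuming log base 2 and the probabilities given above:

    import math

    def cross_entropy(p, q):
        """Average coding length when events from p are coded with a scheme optimal for q."""
        return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q))

    p = [2 / 5, 3 / 5]  # true distribution (blue, red)
    q = [4 / 5, 1 / 5]  # guessed distribution (blue, red)

    print(cross_entropy(p, q))  # ~1.52 bits = H(p, q), the non-optimal coding length
    print(cross_entropy(p, p))  # ~0.97 bits = H(p), the optimal lower bound (H(p,q) >= H(p))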

5. Cross Entropy in Softmax Neuron

  • prediction distribution (guess = q) = [0.3, 0.3, 0.4]
  • target distribution (true = p) = [0, 0, 1]
  • H(p, q) = 0 + 0 + 1 * log(1/0.4) = log(2.5) ≈ 1.32, assume log base=2
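A sketch of the same computation, showing that with a one-hot target the cross entropy reduces to -log of the probability assigned to the correct class (log base 2 assumed):

    import math

    q = [0.3, 0.3, 0.4]  # softmax output = prediction = guess
    p = [0, 0, 1]        # one-hot label = target = true

    # full definition: sum over classes of p * log2(1/q) ...
    h_pq = sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)
    # ... reduces to -log2(q[correct class]) because p is one-hot
    print(h_pq, -math.log2(q[2]))  # both ~1.32 bits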

6. What is KL Divergence?

: is used for measuring how far apart two distributions are.

How can we use Entropy and Cross Entropy to measure the distance between two distributions?

  • H(true, guess) = non-optimal = optimal + extra Coding Length
  • H(true) = optimal
  • distance(true, guess) = H(true, guess) - H(true) = extra Coding Length
    = KL(true||guess) = KL(p||q)
    = H(p, q) - H(p) = Ep[log(1/q)] - Ep[log(1/p)] = Ep[log(1/q)] + Ep[log(p)] = Ep[log(p/q)]
    = summation(p * log(p/q)) = summation(true*log(true/guess))

Assume a true distribution p and two guessed distributions q and r.

  • p = [0.25, 0.25, 0.25, 0.25], fair (uniform) distribution
  • q = [0.25, 0.25, 0.26, 0.24]
  • r = [0.4, 0.15, 0.05, 0.4]
  • KL(p||q) = 0.25 * log(0.25/0.25) + 0.25 * log(0.25/0.25) + 0.25 * log(0.25/0.26) + 0.25 * log(0.25/0.24)
    = 0 + 0 + 0.25 * log(25/26) + 0.25 * log(25/24)
    = 0.0004, assume log base=e
  • KL(p||r) = 0.25 * log(0.25/0.4) + 0.25 * log(0.25/0.15) + 0.25 * log(0.25/0.05) + 0.25 * log(0.25/0.4)
    = 0.5 * log(5/8) + 0.25 * log(5/3) + 0.25 * log(5)
    = 0.295, assume log base=e
  • KL(p||q) != KL(q||p): KL Divergence is asymmetric, so KL cannot be used as a true distance metric.
    • KL(p||q) = 0.00040032, KL(q||p) = 0.00040011
  • KL(p||q) >= 0, since KL(p||q) = H(p,q) - H(p) = the Extra coding length!
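A sketch that reproduces the KL numbers above (natural log, base e, as assumed in the example):

    import math

    def kl(p, q):
        """KL(p || q) = sum(p * ln(p / q)) = H(p, q) - H(p) = extra coding length."""
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.25, 0.25, 0.25, 0.25]
    q = [0.25, 0.25, 0.26, 0.24]
    r = [0.4, 0.15, 0.05, 0.4]

    print(kl(p, q))            # ~0.0004 -> q is close to p
    print(kl(p, r))            # ~0.295  -> r is far from p
    print(kl(p, q), kl(q, p))  # slightly different values -> KL is asymmetric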

Does using Cross Entropy really make sense when we deal with probabilities?
Shouldn't we use the KL Divergence between the predicted probability = guess
and the label = true, instead of Cross Entropy?

  • t = label = true
  • y = prediction = guess
  • minimizing KL(t||y) = minimizing (Cross Entropy - Entropy)
    = minimizing (H(t,y) - H(t))
  • only the guess y changes during training, so the result depends only on the Cross Entropy.
    ... the Entropy H(t) is a constant and has no effect on the prediction.
  • we try to minimize Cross Entropy, but we are actually minimizing KL Divergence (see the sketch below)!
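A minimal sketch of this point, with a hypothetical soft label t: since H(t) is constant, the cross entropy and the KL divergence decrease together as the guess y approaches t (log base 2 assumed):

    import math

    def cross_entropy(p, q):
        return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

    def entropy(p):
        return cross_entropy(p, p)

    t = [0.1, 0.2, 0.7]                             # hypothetical soft label (true)
    for y in ([0.3, 0.3, 0.4], [0.15, 0.2, 0.65]):  # two candidate predictions (guess)
        h_ty = cross_entropy(t, y)
        kl_ty = h_ty - entropy(t)                   # KL(t || y)
        print(h_ty, kl_ty)                          # both shrink as y gets closer to t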
