[draft] Cross Entropy Loss
Information Theory
: it is mainly used in coding theory, such as Huffman coding
- Surprise = unexpectedness = disorder = low probability
- Information = level of surprise
if event X has high probability P(X), then X = "No Surprise";
else if event X has low probability P(X), then X = "Surprise".
if rain_summer has high probability P(rain_summer), and snow_summer has low probability P(snow_summer),
then rain_summer = "No Surprise", but snow_summer = "Surprise".
1. When is the information = level of surprise high?
: when P(X) has low probability; Information(X) ∝ 1/P(X)
2. How can we measure this Information = level of surprise?
: Coding Length(P(X)) = log(1/P(X)) (whatever the log base: 2, 3, 10, e ...)
: Coding Length(P(X)) = -log(P(X))
Assume P(rain_summer) = 99/100, and P(snow_summer) = 1/100.
Coding_Length(P(rain_summer)) = log(100/99) = 0.01, assume log base=2
Coding_Length(P(snow_summer)) = log(100/1) = 6.64, assume log base=2
>> P(snow_summer) has a higher level of surprise = Information than P(rain_summer).
The smaller the probability, the more information, the higher the surprise!
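The same calculation as a small Python sketch (the helper name `coding_length` is just an illustrative choice):

```python
import math

def coding_length(p, base=2):
    # Information content / optimal coding length of an event with probability p:
    # I(p) = log_base(1/p) = -log_base(p)
    return -math.log(p, base)

p_rain_summer = 99 / 100
p_snow_summer = 1 / 100

print(coding_length(p_rain_summer))  # ~0.014 bits: high probability, almost no surprise
print(coding_length(p_snow_summer))  # ~6.64 bits: low probability, high surprise
```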
3. When sending some events, how much Coding Length is required on average?
: Entropy = summation(P(X) * Coding Length) = summation(P(X) * log(1/P(X)))
Fair Coin (when every event has the same probability)
- P(H) = P(T) = 1/2
- I(P(H)) = -log(1/2) = log(2) = 1, assume log base=2
- I(P(T)) = -log(1/2) = log(2) = 1, assume log base=2
entropy = P(H) * I(P(H)) + P(T) * I(P(T)) = 1/2 + 1/2 = 1
Unfair Coin
- P(H) = 3/4, P(T) = 1/4
- I(P(H)) = -log(3/4) = log(4/3) = 0.42, assume log base=2
- I(P(T)) = -log(1/4) = log(4) = 2, assume log base=2
entropy = P(H) * I(P(H)) + P(T) * I(P(T)) = 0.75 * 0.42 + 0.25 * 2 = 0.815
The Fair Coin has more entropy = disorder = unpredictable.
The Unfair Coin has less entropy = predictable.
Entropy = average Information = average Surprise = average Unexpectedness
: : required by using an optimal coding scheme
entropy = H(X) = Expectation_p(x)[Coding Length = Information = I(X)]
= summation(P(X) * I(X)) = summation(P(X) * log(1/P(X)))
= (-1) * summation(P(X) * log(P(X)))
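A small Python sketch of the fair vs. unfair coin entropies above (the helper name `entropy` is illustrative):

```python
import math

def entropy(dist, base=2):
    # H(X) = sum_x P(x) * log(1/P(x)) = -sum_x P(x) * log(P(x))
    return sum(p * math.log(1 / p, base) for p in dist if p > 0)

fair_coin = [1/2, 1/2]
unfair_coin = [3/4, 1/4]

print(entropy(fair_coin))    # 1.0 bit: most disorder, least predictable
print(entropy(unfair_coin))  # ~0.811 bits (0.815 above comes from rounding 0.415 to 0.42)
```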
4. What is Cross Entropy?
: it measures the average number of bits needed to identify an event drawn from the set
Cross Entropy
: : measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an "unnatural" probability distribution, rather than the "true" distribution
: : i.e. by using a non-optimal coding scheme
- p : true distribution
- q : guessed distribution
Cross Entropy = H(p,q) = Ep[log(1/q)] = (-1) * Ep[log(q)]
= p(X) * log(1/q(X)) + (1 - p(X)) * log(1/(1 - q(X))) in the binary case = non-optimal
H(p,q) >= H(p); equality holds when p = q
= Cross Entropy is always higher than (or equal to) Entropy
Assume that there are two kinds of entities - blue ball and red ball.
- p(blue ball) = 2/5, p(red ball) = 3/5
- and I guess, "blue ball is 4/5, and red ball is 1/5"
- and compare it with the true distribution
Cross Entropy = H(p,q) = summation(true * log(1/guess))
= 2/5 * log(5/4) + 3/5 * log(5) = 1.52, assume log base=2
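The blue/red ball example as a small Python sketch (the helper name `cross_entropy` is illustrative); it also shows that the optimal case H(p,p) = H(p) is smaller:

```python
import math

def cross_entropy(p_true, q_guess, base=2):
    # H(p, q) = sum_x p(x) * log(1/q(x))
    return sum(p * math.log(1 / q, base) for p, q in zip(p_true, q_guess) if p > 0)

p = [2/5, 3/5]   # true:  blue ball, red ball
q = [4/5, 1/5]   # guess: blue ball, red ball

print(cross_entropy(p, q))  # ~1.52 bits: non-optimal coding scheme
print(cross_entropy(p, p))  # ~0.97 bits = H(p): optimal, so H(p,q) >= H(p)
```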
5. Cross Entropy in Softmax Neuron
- prediction distribution (guess = q) = [0.3, 0.3, 0.4]
- target distribution (true = p) = [0, 0, 1]
- H(p, q) = 0 + 0 + 1 * log(1/0.4) = log(2.5) = 1.32, assume log base=2
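A small Python sketch of the softmax case above: with a one-hot target, the cross entropy reduces to the negative log of the predicted probability of the true class (helper name is illustrative):

```python
import math

def cross_entropy(p_true, q_guess, base=2):
    # H(p, q) = sum_x p(x) * log(1/q(x)); terms with p(x) = 0 contribute nothing
    return sum(p * math.log(1 / q, base) for p, q in zip(p_true, q_guess) if p > 0)

q = [0.3, 0.3, 0.4]  # softmax prediction (guess)
p = [0, 0, 1]        # one-hot target (true)

print(cross_entropy(p, q))  # ~1.32 bits
print(-math.log2(q[2]))     # same value: the usual negative-log-likelihood form
```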
6. What is KL Divergence?
: it is used for measuring the distance between two distributions.
How can we use Entropy and Cross Entropy for measuring the distance between two distributions?
- H(true, guess) = non-optimal = optimal + extra Coding Length
- H(true) = optimal
- distance(true, guess) = H(true, guess) - H(true) = extra Coding Length = KL(true||guess) = KL(p||q)
= H(p, q) - H(p) = Ep[log(1/q)] - Ep[log(1/p)] = Ep[log(1/q)] + Ep[log(p)] = Ep[log(p/q)]
= summation(p * log(p/q)) = summation(true * log(true/guess))
Assume the true distribution p, and guess distributions q and r.
p = [0.25, 0.25, 0.25, 0.25], uniform (fair)
q = [0.25, 0.25, 0.26, 0.24]
r = [0.4, 0.15, 0.05, 0.4]
- KL(p||q) = 0.25 * log(0.25/0.25) + 0.25 * log(0.25/0.25) + 0.25 * log(0.25/0.26) + 0.25 * log(0.25/0.24)
= 0.25 * log(1) + 0.25 * log(1) + 0.25 * log(25/26) + 0.25 * log(25/24)
= 0 + 0 + 0.25 * log(25/26) + 0.25 * log(25/24)
= 0.0004, assume log base=e
- KL(p||r) = 0.25 * log(0.25/0.4) + 0.25 * log(0.25/0.15) + 0.25 * log(0.25/0.05) + 0.25 * log(0.25/0.4)
= 0.25 * log(5/8) + 0.25 * log(5/3) + 0.25 * log(5) + 0.25 * log(5/8)
= 0.5 * log(5/8) + 0.25 * log(5/3) + 0.25 * log(5)
= 0.295, assume log base=e
- KL(p||q) != KL(q||p), KL Divergence is asymmetric = KL cannot be used as a distance metric.
- KL(p||q) = 0.00040032, KL(q||p) = 0.000400107
- KL(p||q) >= 0 (KL(p||q) = H(p,q) - H(p) = Extra!)
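A small Python sketch reproducing the KL numbers above (natural log, as assumed; the helper name `kl_divergence` is illustrative):

```python
import math

def kl_divergence(p_true, q_guess):
    # KL(p || q) = sum_x p(x) * log(p(x) / q(x)), natural log
    return sum(p * math.log(p / q) for p, q in zip(p_true, q_guess) if p > 0)

p = [0.25, 0.25, 0.25, 0.25]  # true (uniform)
q = [0.25, 0.25, 0.26, 0.24]  # close guess
r = [0.40, 0.15, 0.05, 0.40]  # far guess

print(kl_divergence(p, q))  # ~0.00040032 : q is close to p
print(kl_divergence(p, r))  # ~0.295      : r is far from p
print(kl_divergence(q, p))  # ~0.0004001  : != KL(p||q), KL is asymmetric
```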
Does using Cross Entropy really make sense when we deal with probability?
Shouldn't we use KL Divergence between the predicted probability = guess and the label = true, instead of Cross Entropy?
t = label = true
y = prediction = guess
- minimizing KL(t||y) = minimizing (Cross Entropy - Entropy)
= minimizing (H(t, y) - H(t))
- only the guess y is changing, so the result depends only on the Cross Entropy.
... Entropy = H(t) has no effect on the prediction.
- we try to minimize Cross Entropy, but actually we are minimizing KL Divergence!
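A small Python sketch of this point: since H(t) is constant with respect to the prediction, H(t,y) and KL(t||y) decrease together as the guess y gets closer to the label t (the soft label t below is just an illustrative choice):

```python
import math

def entropy(p):
    return sum(pi * math.log(1 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return sum(pi * math.log(1 / qi) for pi, qi in zip(p, q) if pi > 0)

t = [0.1, 0.9]  # label (true); a soft label so that H(t) > 0
for y in ([0.5, 0.5], [0.2, 0.8], [0.1, 0.9]):  # predictions getting closer to t
    ce = cross_entropy(t, y)
    kl = ce - entropy(t)  # KL(t || y) = H(t, y) - H(t)
    print(f"y={y}  H(t,y)={ce:.4f}  KL(t||y)={kl:.4f}")

# Both columns shrink together: minimizing cross entropy minimizes KL,
# because the H(t) term is a constant w.r.t. the prediction y.
```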