Cross entropy#

Cross-entropy is a common method for evaluating classification quality. Because it is differentiable with respect to the predicted probabilities, it is also widely used as a loss function.

This page uses the following notation:

  • \(c \in \overline{1,C}\): class index.

  • \(o \in \overline{1,N}\): observation index.

  • \(p_{o,c}\): predicted probability that the \(o\)-th observation belongs to the \(c\)-th class.

  • \(y_{o,c}\): a variable indicating whether the \(o\)-th observation belongs to the \(c\)-th class:

\[\begin{split} y_{o,c} = \begin{cases} 1, & \text{if the } o\text{-th observation belongs to class } c, \\ 0, & \text{otherwise}. \end{cases} \end{split}\]
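With this notation, cross-entropy averaged over all \(N\) observations takes the form that appears throughout the literature (stated here for completeness, since the rest of the page works with a single observation):

\[-\frac{1}{N}\sum_{o=1}^{N} \sum_{c=1}^{C} y_{o,c} \log\left[p_{o,c}\right]\]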

Logits transformation#

The typical output of a classification model is a vector of raw scores, one per class, that tend to be higher for the correct class. These values are called logits. However, for cross-entropy to work, the outputs must behave like probabilities: they must be non-negative and satisfy

\[\sum_{c=1}^C p_c = 1\]

A common approach is to apply a transformation that maps the raw outputs to values with these properties; this is known as the softmax transformation. It can be built into the model itself, in which case no explicit transformation is needed at the loss stage.

However, it is typical to apply softmax during the loss computation instead. If the raw model outputs (logits) are denoted \(\hat{y}_{o,c}\), they are transformed as follows:

\[p_{o, c} = \frac{\exp(\hat{y}_{o,c})}{\sum_{c'=1}^C \exp(\hat{y}_{o,c'})}, \quad c = \overline{1, C}.\]
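As a minimal sketch of this transformation in plain numpy (the `softmax` helper and the example logits below are illustrative, not taken from any particular library):

```python
import numpy as np

def softmax(logits):
    """Map raw model outputs (logits) to probabilities.

    Subtracting the maximum logit before exponentiating does not change
    the result but avoids overflow for large logit values.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# logits for 2 observations and 3 classes (arbitrary example values)
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, -1.0]])
probabilities = softmax(logits)
print(probabilities)               # each entry is non-negative
print(probabilities.sum(axis=1))   # each row sums to 1
```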

Under these conditions, the following representation of cross-entropy for a single observation appears in many sources:

\[- \sum_{c=1}^C y_{o,c} \log\left[\frac{\exp(\hat{y}_{o,c})}{\sum_{c'=1}^C \exp(\hat{y}_{o,c'})}\right]\]

For example, PyTorch uses this definition.
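The snippet below checks this with made-up logits and targets: `torch.nn.functional.cross_entropy` accepts raw logits together with class indices, and it matches a manual log-softmax computation of the formula above.

```python
import torch
import torch.nn.functional as F

# raw model outputs for 2 observations and 3 classes (illustrative values)
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, -1.0]])
targets = torch.tensor([0, 1])  # class index of each observation

# manual computation: log-softmax, then the negative log-probability
# of the true class, averaged over observations
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(len(targets)), targets].mean()

# PyTorch's built-in loss applied directly to the logits
builtin = F.cross_entropy(logits, targets)

print(manual, builtin)  # both give the same value
```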

Binary cross entropy#

A popular special case is cross-entropy for binary classification:

\[-\left(y_o \log[p_o] + [1-y_o] \log[1-p_o]\right)\]

where

\[\begin{split} y_o=\begin{cases} 1, & \text{if the } o\text{-th observation exhibits the trait under consideration}, \\ 0, & \text{otherwise}, \end{cases} \end{split}\]

  • \(p_o\): predicted probability that the trait under consideration manifests in the \(o\)-th observation.
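A minimal PyTorch sketch of this formula (the probabilities and labels below are made up for illustration); note that `torch.nn.functional.binary_cross_entropy` expects probabilities, whereas `binary_cross_entropy_with_logits` would take raw scores:

```python
import torch
import torch.nn.functional as F

# predicted probabilities and binary labels for 4 observations (illustrative)
p = torch.tensor([0.9, 0.2, 0.7, 0.4])
y = torch.tensor([1.0, 0.0, 1.0, 1.0])

# manual binary cross-entropy, averaged over observations
manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()

# PyTorch's built-in binary cross-entropy applied to probabilities
builtin = F.binary_cross_entropy(p, y)

print(manual, builtin)  # both give the same value
```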