The purpose of this post is to summarize the content of the cs231n lecture for myself, so it may be a bit unkind to people who haven't watched the video. In addition, I omitted some content that I don't think is important enough, so use this article only as a supplement.
Loss Function
Below is the general numerical expression of loss functions: the total loss is the average of the per-example losses, $L = \frac{1}{N}\sum_i L_i(f(x_i; W), y_i)$.
Multiclass SVM Loss
There are many, many, many kinds of loss functions, but this class only deals with two of them : SVM loss and softmax.
Above is the expression of the SVM loss function.
The meaning of the expression is that the score of the correct class has to be larger than every other class's score by at least one for the loss to be zero.
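For reference (my transcription of the slide), the per-example SVM (hinge) loss is usually written as follows, where $s = f(x_i; W)$ are the class scores and $y_i$ is the index of the correct class:
$$L_i = \sum_{j \neq y_i} \max(0,\; s_j - s_{y_i} + 1)$$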
It is easier to understand with an example.
First, get the loss of each class (each training example) as in the image above. I'm not omitting the explanation because it's annoying to write; it's because the explanation on the lecture slide is kind enough.
And then, get the total loss by calculating the mean of the three losses. In fact, there is no need to calculate the mean; it's enough to take the sum instead.
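A made-up mini example (my own numbers, not the slide's): suppose one training example has scores $s = [2.0, 4.0, -1.0]$ and its correct class is the first one. Then
$$L_i = \max(0,\ 4.0 - 2.0 + 1) + \max(0,\ -1.0 - 2.0 + 1) = 3.0 + 0 = 3.0.$$
Doing the same for each of the three training examples and then averaging (or summing) gives the total loss.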
Shape of The Graph
The graph of SVM loss has a ‘hinge shape’. You can see that the minimum value of this loss function is zero and the maximum is infinity. This is one of the key properties of ‘good’ loss functions. According to the mentor 강성희, we might have to design our own loss functions to solve certain problems when we deal with real-world problems.
Question : Why 1?
The ‘+1’ is called the margin. It is a hyperparameter, and it is actually an arbitrary value, but in most cases people just use +1 as the margin. You may be able to find a mathematical justification for this near the end of Lecture 7 of the Coursera Machine Learning class.
What If All Scores Are Nearly Zero?
Let the number of classes be C. Since every incorrect class then contributes max(0, 0 - 0 + 1) = 1, the answer to the question is about C - 1. This is useful in debugging: in the initial stage of learning, if the loss is very different from C - 1, the code may contain a bug.
Numpy Code
import numpy as np

# Get the loss for one example
def L_i_vectorized(x, y, W):
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0  # ignore the correct class; without this line, every loss just gains a constant +1
    loss_i = np.sum(margins)
    return loss_i
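A quick usage check (my own made-up numbers, assuming the code above has been run): with very small random weights, all scores are near zero, so the per-example loss should come out close to C - 1, which is 2 for three classes.
np.random.seed(0)
W = np.random.randn(3, 4) * 0.001   # 3 classes, 4 features, tiny made-up weights
x = np.random.randn(4)              # one made-up example
y = 1                               # pretend class 1 is the correct one
print(L_i_vectorized(x, y, W))      # prints roughly 2.0, i.e. C - 1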
Regularization
Is loss=0 The Best?
No. Overfitting! We can avoid the overfitting problem by adding a regularization term R(W) to the loss function, so that W becomes simpler. I don't know the reason why it works. Don't ask me.
So, the completed expression is composed of the two parts below (the combined formula follows the list).
- data loss : $\frac{1}{N}\sum_{i} L_i(f(x_i; W), y_i)$
- regularization loss : $\lambda R(W)$
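Putting the two parts together (my reconstruction of the slide's formula; $\lambda$ is a hyperparameter that trades off the two terms):
$$L(W) = \frac{1}{N}\sum_{i=1}^{N} L_i(f(x_i; W), y_i) + \lambda R(W)$$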
Occam’s Razor
Among competing hypotheses, the simplest is the best.
- William of Ockham, 1285 - 1347 -
The rule above can be applied to the world of deep learning, too. But I think it was a kind of humorous metaphor from the professor, so there's no need to take it too seriously.
Question : How Does Regularization Loss Make W Linear?
Though I had said don't ask me, one student asked this question. (Actually, he asked the professor, not me.) The professor didn't answer clearly. Instead, he said there are two ways to go about fitting W.
1. restrict the model class so that it doesn't contain high-degree polynomials
2. let the model class contain high-degree polynomials, but penalize them with a regularization loss (see the sketch after this list)
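A minimal sketch of the second option, in the same numpy style as the code above (my own example; it assumes L_i_vectorized from the Numpy Code section and uses made-up data shapes):
np.random.seed(0)
W = np.random.randn(3, 4) * 0.01        # 3 classes, 4 features (made-up shapes)
xs = np.random.randn(5, 4)              # 5 made-up training examples
ys = np.array([0, 2, 1, 1, 0])          # their (made-up) correct classes
lam = 0.1                               # regularization strength, a hyperparameter

data_loss = np.mean([L_i_vectorized(x, y, W) for x, y in zip(xs, ys)])
reg_loss = lam * np.sum(W * W)          # L2 regularization: penalize large weights
loss = data_loss + reg_loss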
Various Kinds of Regularization
- L2 regularization
- L1 regularization
- Elastic net (L1 + L2)
- Max norm regularization
- Dropout
- Batch normalization
- Stochastic Depth
L1 vs L2
I don't think the professor's explanation was enough, but the fact is that L2 loss is preferred when each datum of the dataset x has similar values, while L1 loss is preferred when one or a few entries of x have much larger values than the others. The reason I infer for this is that L2 loss is sensitive to small differences because it contains a square (L2 uses $R(W) = \sum_k \sum_l W_{k,l}^2$, while L1 uses $R(W) = \sum_k \sum_l |W_{k,l}|$).
Softmax Classifier (= Cross-Entropy Loss1)
It converts scores into probabilities. The loss should be a monotonically decreasing function of the correct class's probability, because the loss should be low when the correct class's probability is high.
The reason it raises e to the power of the score is 1) to turn negative numbers into positive ones and 2) to cancel the log.
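For reference, the softmax (cross-entropy) loss for example $i$ with scores $s$ is usually written as
$$L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right),$$
so the loss is small exactly when the correct class receives most of the probability mass.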
Debugging Skill (What If All Scores Are Nearly Zero?)
In the initial stage, the loss value should be about log(C) (for example, -log(1/10) ≈ 2.3 for 10 classes). If not, it means something is going wrong.
SVM vs Softmax
Let's look at these three examples of scores (a quick numeric comparison follows the list).
[10, -2, 3]
[10, 9, 9]
[10, -100, -100]
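A quick numeric check (my own sketch; I'm assuming the first entry of each vector is the correct class's score):
import numpy as np

def svm_loss(s, y):
    margins = np.maximum(0, s - s[y] + 1)   # hinge loss against the correct score
    margins[y] = 0
    return np.sum(margins)

def softmax_loss(s, y):
    p = np.exp(s - np.max(s))               # shift by the max for numerical stability
    p /= np.sum(p)
    return -np.log(p[y])                    # cross-entropy with the correct class

for s in ([10, -2, 3], [10, 9, 9], [10, -100, -100]):
    s = np.array(s, dtype=float)
    print(svm_loss(s, 0), softmax_loss(s, 0))
# SVM prints 0 for all three, while softmax still penalizes the second case
# (about 0.55) much more than the first (about 0.001).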
Unlike softmax, SVM regards these three cases as the same (loss zero for all of them), which means that once it reaches the second case above, it doesn't learn any more. However, I heard it's not true that softmax is absolutely better than SVM.
Optimization
Strategy #1 : Random Search
Yes, random search. I'll skip the explanation.
Strategy #2 : Gradient Descent
The way of following the slope.
Naive Way : Numerical Gradient
Just follow the definition of the derivative: nudge each coordinate by a small amount and measure how the loss changes (the slope). It takes too long!
But we can use it as a debugging method : Gradient Check.
First, we write the code with the analytic gradient method that I'll explain next, and then compute the same gradient numerically to check that the two results agree.
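A minimal sketch of such a gradient check (my own toy example, not from the lecture): approximate the gradient with centered differences and compare it to the analytic gradient of a function whose gradient we know, here f(w) = sum(w**2):
import numpy as np

def numerical_gradient(f, w, h=1e-5):
    # centered-difference approximation of df/dw, one coordinate at a time
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w.flat[i]
        w.flat[i] = old + h; f_plus = f(w)
        w.flat[i] = old - h; f_minus = f(w)
        w.flat[i] = old                      # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

w = np.random.randn(5)
numeric = numerical_gradient(lambda v: np.sum(v ** 2), w)
analytic = 2 * w                             # known analytic gradient of sum(w**2)
print(np.max(np.abs(numeric - analytic)))    # should be tiny, around 1e-10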
Using Differentiation : Analytic Gradient
# Vanilla Gradient Descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights -= step_size * weights_grad  # step_size == learning_rate
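To make the loop concrete, here is a self-contained toy version (my own example) that minimizes f(w) = sum((w - 3)**2), whose analytic gradient is 2 * (w - 3):
import numpy as np

w = np.random.randn(5)           # made-up starting point
step_size = 0.1                  # a.k.a. learning rate
for _ in range(100):
    grad = 2 * (w - 3)           # analytic gradient of sum((w - 3)**2)
    w -= step_size * grad        # step in the direction opposite to the gradient
print(w)                         # every entry ends up close to 3, the minimum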
Updated Gradient Descent
The purple one is gradient descent with momentum, and the blue one is called ‘Adam’. But I don't know them well. Let's pass.
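For reference (standard forms, not taken from these slides): momentum keeps a running ‘velocity’ of past gradients, and Adam additionally keeps a running average of squared gradients to scale each parameter's step. In the same pseudocode style as above:
# SGD with momentum
v = 0
rho = 0.9                                   # momentum coefficient
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    v = rho * v + weights_grad              # accumulate a velocity
    weights -= step_size * v
# Adam also tracks a moving average of squared gradients and divides each step
# by its square root (bias-correction terms omitted here).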
Stochastic Gradient Descent (SGD)
Using mini-batches.
# Vanilla Minibatch Gradient Descent
while True:
    data_batch = sample_training_data(data, 256)  # 32, 64, 128 are commonly used
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights -= step_size * weights_grad
In fact, SGD isn't that much faster than normal gradient descent. But GPU memory is quite small, so we should use it because we can't load the full dataset into memory at once.
Image Features
As the title of the slide says, this is an aside from the main stream of the lecture, so I won't explain it much.
The point is that in the past people extracted image features manually, but now, because deep learning has advanced so much, image features are obtained automatically.
Past
- Color Histogram
- Histogram of Oriented Gradients (HoG)
- Bag of Words
- It appears here because of Fei-Fei. No need to understand it.
Now
“Convolutional Network is fucking good, Deep learning is the best.”
Summary
- We learned two kinds of loss functions : SVM / Softmax
- We use regularization to avoid overfitting, so the expression looks like $L = \frac{1}{N}\sum_i L_i + \lambda R(W)$.
- To optimize, we use gradient descent.
- It's not exactly the same as cross-entropy loss, but in practice it behaves the same. ↩