The purpose of this post is to summarize the content of the cs231n lecture for myself, so it may be a bit unkind to people who haven't watched the video. In addition, I omitted some content that I don't think is important enough, so use this article only as a supplement.
Loss Function
Below is the general numerical expression of loss functions: the total loss is the average of the per-example losses, $L = \frac{1}{N}\sum_i L_i(f(x_i; W), y_i)$.
Multiclass SVM Loss
There are many, many, many kinds of loss functions, but this class only deals with two of them : SVM loss and softmax.
Above is the expression of the SVM loss function.
The meaning of the expression is that the score of the correct class has to be larger than every other class's score by at least one for the loss to be zero.
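For reference (my transcription of the slide), the per-example SVM (hinge) loss is usually written as follows, where $s = f(x_i; W)$ are the class scores and $y_i$ is the index of the correct class:
$$L_i = \sum_{j \neq y_i} \max(0,\; s_j - s_{y_i} + 1)$$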
It is easier to understand with an example.
First, get the loss of each class (each training example) as in the image above. I'm not omitting the explanation because it's annoying to write; it's because the explanation on the lecture slide is kind enough.
And then, get the total loss by calculating the mean of the three losses. In fact, there is no need to calculate the mean; it's enough to take the sum instead.
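A made-up mini example (my own numbers, not the slide's): suppose one training example has scores $s = [2.0, 4.0, -1.0]$ and its correct class is the first one. Then
$$L_i = \max(0,\ 4.0 - 2.0 + 1) + \max(0,\ -1.0 - 2.0 + 1) = 3.0 + 0 = 3.0.$$
Doing the same for each of the three training examples and then averaging (or summing) gives the total loss.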
Shape of The Graph
The graph of SVM loss has a ‘hinge shape’. You can see that the minimum value of this loss function is zero and the maximum is infinity. This is one of the key properties of ‘good’ loss functions. According to the mentor 강성희, we might have to design our own loss functions to solve certain problems when we deal with real-world problems.
Question : Why 1?
The ‘+1’ is called the margin. It is a hyperparameter, and it is actually an arbitrary value, but in most cases people just use +1 as the margin. You may be able to find a mathematical justification for this near the end of Lecture 7 of the Coursera Machine Learning class.
What If All Scores Are Nearly Zero?
Let the number of classes be C. Since every incorrect class then contributes max(0, 0 - 0 + 1) = 1, the answer to the question is about C - 1. This is useful in debugging: in the initial stage of learning, if the loss is very different from C - 1, the code may contain a bug.
Numpy Code
import numpy as np

# Get the loss for one example
def L_i_vectorized(x, y, W):
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0  # ignore the correct class; without this line, every loss just gains a constant +1
    loss_i = np.sum(margins)
    return loss_i
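A quick usage check (my own made-up numbers, assuming the code above has been run): with very small random weights, all scores are near zero, so the per-example loss should come out close to C - 1, which is 2 for three classes.
np.random.seed(0)
W = np.random.randn(3, 4) * 0.001   # 3 classes, 4 features, tiny made-up weights
x = np.random.randn(4)              # one made-up example
y = 1                               # pretend class 1 is the correct one
print(L_i_vectorized(x, y, W))      # prints roughly 2.0, i.e. C - 1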
Regularization
Is loss=0 The Best?
No. Overfitting! We can avoid the overfitting problem by adding a regularization term R(W) to the loss function, so that W becomes simpler. I don't know the reason why it works. Don't ask me.
So, the completed expression is composed of the two parts below (the combined formula follows the list).
- data loss : $\frac{1}{N}\sum_{i} L_i(f(x_i; W), y_i)$
- regularization loss : $\lambda R(W)$
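Putting the two parts together (my reconstruction of the slide's formula; $\lambda$ is a hyperparameter that trades off the two terms):
$$L(W) = \frac{1}{N}\sum_{i=1}^{N} L_i(f(x_i; W), y_i) + \lambda R(W)$$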
Occam’s Razor
Among competing hypotheses, the simplest is the best.
- William of Ockham, 1285 - 1347 -
The rule above can be applied to the world of deep learning, too. But I think it was a kind of humorous metaphor from the professor, so there's no need to take it too seriously.
Question : How Does Regularization Loss Make W Linear?
Though I had said don't ask me, one student asked this question. (Actually, he asked the professor, not me.) The professor didn't answer clearly. Instead, he said there are two ways to go about fitting W.
1. restrict the model class so that it doesn't contain high-degree polynomials
2. let the model class contain high-degree polynomials, but penalize them with a regularization loss (see the sketch after this list)
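A minimal sketch of the second option, in the same numpy style as the code above (my own example; it assumes L_i_vectorized from the Numpy Code section and uses made-up data shapes):
np.random.seed(0)
W = np.random.randn(3, 4) * 0.01        # 3 classes, 4 features (made-up shapes)
xs = np.random.randn(5, 4)              # 5 made-up training examples
ys = np.array([0, 2, 1, 1, 0])          # their (made-up) correct classes
lam = 0.1                               # regularization strength, a hyperparameter

data_loss = np.mean([L_i_vectorized(x, y, W) for x, y in zip(xs, ys)])
reg_loss = lam * np.sum(W * W)          # L2 regularization: penalize large weights
loss = data_loss + reg_loss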
Various Kinds of Regularization
- L2 regularization
- L1 regularization
- Elastic net (L1 + L2)
- Max norm regularization
- Dropout
- Batch normalization
- Stochastic Depth
L1 vs L2
I don't think the professor's explanation was enough, but the fact is that L2 loss is preferred when each datum of the dataset x has similar values, while L1 loss is preferred when one or a few entries of x have much larger values than the others. The reason I infer for this is that L2 loss is sensitive to small differences because it contains a square (L2 uses $R(W) = \sum_k \sum_l W_{k,l}^2$, while L1 uses $R(W) = \sum_k \sum_l |W_{k,l}|$).
Softmax Classifier (= Cross-Entropy Loss1)
It converts scores into probabilities. The loss should be a monotonically decreasing function of the correct class's probability, because the loss should be low when the correct class's probability is high.
The reason it raises e to the power of the score is 1) to turn negative numbers into positive ones and 2) to cancel the log.
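For reference, the softmax (cross-entropy) loss for example $i$ with scores $s$ is usually written as
$$L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right),$$
so the loss is small exactly when the correct class receives most of the probability mass.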
Debugging Skill (What If All Scores Are Nearly Zero?)
In the initial stage, the loss value should be about log(C) (for example, -log(1/10) ≈ 2.3 for 10 classes). If not, it means something is going wrong.
SVM vs Softmax
Let's look at these three examples of scores (a quick numeric comparison follows the list).
[10, -2, 3]
[10, 9, 9]
[10, -100, -100]
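A quick numeric check (my own sketch; I'm assuming the first entry of each vector is the correct class's score):
import numpy as np

def svm_loss(s, y):
    margins = np.maximum(0, s - s[y] + 1)   # hinge loss against the correct score
    margins[y] = 0
    return np.sum(margins)

def softmax_loss(s, y):
    p = np.exp(s - np.max(s))               # shift by the max for numerical stability
    p /= np.sum(p)
    return -np.log(p[y])                    # cross-entropy with the correct class

for s in ([10, -2, 3], [10, 9, 9], [10, -100, -100]):
    s = np.array(s, dtype=float)
    print(svm_loss(s, 0), softmax_loss(s, 0))
# SVM prints 0 for all three, while softmax still penalizes the second case
# (about 0.55) much more than the first (about 0.001).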
Unlike softmax, SVM regards these three cases as the same (loss zero for all of them), which means that once it reaches the second case above, it doesn't learn any more. However, I heard it's not true that softmax is absolutely better than SVM.
Optimization
Strategy #1 : Random Search
Yes, random search. I'll skip the explanation.
Strategy #2 : Gradient Descent
The way of following the slope.
Naive Way : Numerical Gradient
Just follow the definition of the derivative: nudge each coordinate by a small amount and measure how the loss changes (the slope). It takes too long!
But we can use it as a debugging method : Gradient Check.
First, we write the code with the analytic gradient method that I'll explain next, and then compute the same gradient numerically to check that the two results agree.
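A minimal sketch of such a gradient check (my own toy example, not from the lecture): approximate the gradient with centered differences and compare it to the analytic gradient of a function whose gradient we know, here f(w) = sum(w**2):
import numpy as np

def numerical_gradient(f, w, h=1e-5):
    # centered-difference approximation of df/dw, one coordinate at a time
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w.flat[i]
        w.flat[i] = old + h; f_plus = f(w)
        w.flat[i] = old - h; f_minus = f(w)
        w.flat[i] = old                      # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

w = np.random.randn(5)
numeric = numerical_gradient(lambda v: np.sum(v ** 2), w)
analytic = 2 * w                             # known analytic gradient of sum(w**2)
print(np.max(np.abs(numeric - analytic)))    # should be tiny, around 1e-10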
Using Differentiation : Analytic Gradient
# Vanilla Gradient Descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights -= step_size * weights_grad  # step_size == learning_rate
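To make the loop concrete, here is a self-contained toy version (my own example) that minimizes f(w) = sum((w - 3)**2), whose analytic gradient is 2 * (w - 3):
import numpy as np

w = np.random.randn(5)           # made-up starting point
step_size = 0.1                  # a.k.a. learning rate
for _ in range(100):
    grad = 2 * (w - 3)           # analytic gradient of sum((w - 3)**2)
    w -= step_size * grad        # step in the direction opposite to the gradient
print(w)                         # every entry ends up close to 3, the minimum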
Updated Gradient Descent
The purple one is gradient descent with momentum, and the blue one is called ‘Adam’. But I don't know them well. Let's pass.
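For reference (standard forms, not taken from these slides): momentum keeps a running ‘velocity’ of past gradients, and Adam additionally keeps a running average of squared gradients to scale each parameter's step. In the same pseudocode style as above:
# SGD with momentum
v = 0
rho = 0.9                                   # momentum coefficient
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    v = rho * v + weights_grad              # accumulate a velocity
    weights -= step_size * v
# Adam also tracks a moving average of squared gradients and divides each step
# by its square root (bias-correction terms omitted here).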
Stochastic Gradient Descent (SGD)
Using mini-batches.
# Vanilla Minibatch Gradient Descent
while True:
    data_batch = sample_training_data(data, 256)  # 32, 64, 128 are commonly used
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights -= step_size * weights_grad
In fact, SGD isn't that much faster than normal gradient descent. But GPU memory is quite small, so we should use it because we can't load the full dataset into memory at once.
Image Features
As the title of the slide says, this is an aside from the main stream of the lecture, so I won't explain it much.
The point is that in the past people extracted image features manually, but now, because deep learning has advanced so much, image features are obtained automatically.
Past
- Color Histogram
- Histogram of Oriented Gradients (HoG)
- Bag of Words
- It appears here because of Fei-Fei. No need to understand it.
Now
“Convolutional Network is fucking good, Deep learning is the best.”
Summary
- We learned two kinds of loss functions : SVM / Softmax
- We use regularization to avoid overfitting, so the expression looks like $L = \frac{1}{N}\sum_i L_i + \lambda R(W)$.
- To optimize, we use gradient descent.
- It's not exactly the same as cross-entropy loss, but in practice it behaves the same. ↩