
Neural Network [cs231n - week 3 : Loss Functions and Optimization]

The purpose of this post is to summarize the cs231n lecture for myself, so it may be hard to follow for people who haven't watched the video. I also left out content that I didn't think was important enough, so use this article only as a supplement.

Loss Function

The general form of the loss over a dataset of $N$ examples is:

$$L = \frac{1}{N}\sum_{i=1}^{N} L_i(f(x_i, W), y_i)$$

Multiclass SVM Loss

There are many many many kinds of loss functions, but this class only deals with two of them : SVM loss and softmax.

The SVM loss for one example, with scores $s = f(x_i, W)$, is:

$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$$

The meaning of the expression is that the score of the correct class must be at least 1 larger than every other score for the loss to be zero.
It is easier to understand with an example.

First, compute the loss of each example, as on the lecture slide. I'm not skipping the explanation because it's annoying to write. It's because the lecture slide explains it well enough.

And then, get the total loss by taking the mean of the three losses. In fact, there is no need to take the mean. The sum is enough, because dividing by a constant doesn't change which W gives the smallest loss.
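Here is a small sketch of that calculation. The score values are the cat / car / frog numbers as I recall them from the lecture slide, so treat them as illustrative rather than authoritative.

```python
import numpy as np

# Rows = class scores (cat, car, frog), columns = the three example images.
# Values recalled from the lecture slide; treat them as illustrative.
scores = np.array([[3.2, 1.3, 2.2],
                   [5.1, 4.9, 2.5],
                   [-1.7, 2.0, -3.1]])
labels = np.array([0, 1, 2])  # correct class index of each image

losses = []
for i, y in enumerate(labels):
    s = scores[:, i]
    margins = np.maximum(0, s - s[y] + 1)
    margins[y] = 0            # the correct class does not count
    losses.append(margins.sum())
losses = np.array(losses)     # approximately [2.9, 0.0, 12.9]

mean_loss = losses.mean()     # approximately 5.27
sum_loss = losses.sum()       # same minimizer, just scaled by 3
```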

Shape of The Graph

The graph of the SVM loss has a 'hinge' shape, which is why it is also called hinge loss.

You can see that the minimum value of this loss function is zero, and that it has no maximum (it can grow toward infinity). Having a lower bound is one of the key properties of a 'good' loss function. According to mentor 강성희, we may have to design our own loss functions to solve certain real-world problems.

Question : Why 1?

The '+1' is called the margin. It is a hyperparameter, and its value is actually arbitrary. In most cases, people just use +1 as the margin. You may be able to find the mathematical justification near the end of lecture 7 of the Coursera Machine Learning class.

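What If All Scores Are About Zero?

Let the number of classes be $C$. Then the answer to the question is about $C - 1$, because each of the $C - 1$ wrong classes contributes $\max(0, 0 - 0 + 1) = 1$.
This is useful in debugging. If the loss in the initial stage of learning is very different from $C - 1$, the code may contain a bug.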
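A minimal sketch of this sanity check, assuming 10 classes and tiny random scores standing in for a freshly initialized model (both are my own choices for illustration):

```python
import numpy as np

def svm_loss_single(scores, y):
    # Multiclass SVM loss for one example with margin 1.
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0
    return np.sum(margins)

num_classes = 10                       # assumed number of classes
rng = np.random.default_rng(0)
scores = 1e-3 * rng.normal(size=num_classes)  # scores close to zero, as at initialization

initial_loss = svm_loss_single(scores, y=3)
expected = num_classes - 1             # C - 1
```

If `initial_loss` is far away from `expected`, I would look for a bug before training any further.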

Numpy Code

# Get the SVM loss of a single example
import numpy as np

def L_i_vectorized(x, y, W):
    scores = W.dot(x)                                # one score per class
    margins = np.maximum(0, scores - scores[y] + 1)  # hinge with margin 1
    margins[y] = 0  # ignore the correct class. Without this line the loss is only shifted up by 1.
    loss_i = np.sum(margins)
    return loss_i

Regularization

Is loss=0 The Best?

No. Overfitting!
We can reduce the overfitting problem by adding a regularization term $\lambda R(W)$ to the loss function, which pushes $W$ toward a simpler model. I don't know the reason why it works. Don't ask me.
So, the completed expression is composed of two parts, as below.
  • data loss : $\frac{1}{N}\sum_{i=1}^{N} L_i(f(x_i, W), y_i)$
  • regularization loss : $\lambda R(W)$
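To see the two parts side by side, here is a rough sketch that assumes the SVM loss as the data loss and L2 as $R(W)$. The toy `X`, `y`, `W` and `lam` values are made up. Both `W` and `2 * W` reach zero data loss, and only the regularization part tells them apart.

```python
import numpy as np

def full_loss(W, X, y, lam):
    # Returns (data loss, regularization loss) for SVM loss + L2 regularization.
    n = X.shape[0]
    rows = np.arange(n)
    scores = X.dot(W.T)                                   # shape (n, C)
    margins = np.maximum(0, scores - scores[rows, y][:, None] + 1)
    margins[rows, y] = 0
    data_loss = margins.sum() / n
    reg_loss = lam * np.sum(W * W)                        # lambda * R(W), with R = L2
    return data_loss, reg_loss

# Made-up toy problem: 2 examples, 2 features, 3 classes.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 1])
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

data_a, reg_a = full_loss(W, X, y, lam=0.1)       # data loss 0, reg loss about 0.2
data_b, reg_b = full_loss(2 * W, X, y, lam=0.1)   # data loss 0, reg loss about 0.8
```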

Occam’s Razor

Among competing hypotheses, the simplest is the best
- William of Ockham, 1285 - 1347 -
The rule above can be applied to the world of deep learning. But I think it's a kind of humorous metaphor from the professor, so there's no need to take it too seriously.

Question : How Does the Regularization Loss Make W Linear?

Though I said don't ask me, one student asked this question. Actually he asked the professor, not me.
The professor didn't answer clearly. Instead, he said there are two ways to keep the model simple.
1. Restrict the model class so that it does not contain high-degree polynomials.
2. Let the model class contain high-degree polynomials, and penalize them with the regularization loss.

Various Kinds of Regularization

  • L2 regularization
  • L1 regularization
  • Elastic net (L1 + L2)
  • Max norm regularization
  • Dropout
  • Batch normalization
  • Stochastic Depth

L1 vs L2

I don't think the professor's explanation was enough, but the point is that L2 regularization prefers weight vectors whose entries are all small and spread out, while L1 regularization prefers sparse weight vectors in which a few entries are large and the rest are (close to) zero. My guess at the reason is that L2 contains a square, so it punishes one large entry much more than many small ones.
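A small sketch with the kind of example I remember from the lecture: two weight vectors that give the same score for the same input. The exact numbers are my assumption.

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])        # sparse: one large entry
w2 = np.array([0.25, 0.25, 0.25, 0.25])    # spread out: many small entries

score1, score2 = w1.dot(x), w2.dot(x)      # both are 1.0, so the data loss can't tell them apart

l2_w1, l2_w2 = np.sum(w1 ** 2), np.sum(w2 ** 2)            # 1.0 vs 0.25
l1_w1, l1_w2 = np.sum(np.abs(w1)), np.sum(np.abs(w2))      # 1.0 vs 1.0
```

L2 clearly prefers the spread-out `w2`. In this particular example L1 gives a tie, so it only shows that L1 does not punish the sparse `w1`. It does not demonstrate the general claim that L1 favours sparsity.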

Softmax Classifier (= Cross-Entropy Loss [1])

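It converts scores into probabilities:

$$P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}, \qquad L_i = -\log P(Y = y_i \mid X = x_i)$$

The loss should be a monotonically decreasing function of the probability of the correct class, because the loss should be low when that probability is high. That is why it uses $-\log$.

The reason it exponentiates the scores is 1) to turn negative numbers into positive ones and 2) to cancel the log, since the scores are treated as unnormalized log probabilities.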
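A minimal sketch of the softmax loss for one example. Subtracting the max score before exponentiating is a common numerical-stability trick that I added. It does not change the result. The score values are illustrative.

```python
import numpy as np

def softmax_loss(scores, y):
    # Shift by the max so np.exp never overflows; the probabilities are unchanged.
    shifted = scores - np.max(scores)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    probs = np.exp(log_probs)
    return -log_probs[y], probs

scores = np.array([3.2, 5.1, -1.7])        # illustrative scores, correct class is index 0
loss, probs = softmax_loss(scores, y=0)
# probs is roughly [0.13, 0.87, 0.00] and loss is roughly 2.04 (natural log)
```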

Debugging Skill (What If All Scores Are About Zero?)

In the initial stage, the loss value should be about $\log C$, where $C$ is the number of classes. If not, something is going wrong.
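The same check as a sketch, assuming 10 classes and all-zero scores. Here the loss is written in its log-sum-exp form, $L_i = \log \sum_j e^{s_j} - s_{y_i}$, which is the same quantity as $-\log$ of the softmax probability.

```python
import numpy as np

num_classes = 10                     # assumed number of classes
scores = np.zeros(num_classes)       # all scores equal, as at initialization
y = 3                                # any class index gives the same loss here

initial_loss = np.log(np.sum(np.exp(scores))) - scores[y]
expected = np.log(num_classes)       # log C, about 2.3 for C = 10
```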

SVM vs Softmax

Let's look at these three score vectors, where the first entry is the score of the correct class.
[10, -2, 3]
[10, 9, 9]
[10, -100, -100]
Unlike softmax, SVM treats these three states as the same (the loss is zero in all of them), which means that once it reaches the second state it doesn't learn any more from that example. However, I heard that softmax is not strictly better than SVM, and I don't know why.
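A quick sketch that evaluates both losses on the three score vectors above, assuming index 0 is the correct class:

```python
import numpy as np

def svm_loss(scores, y):
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0
    return float(np.sum(margins))

def softmax_loss(scores, y):
    shifted = scores - np.max(scores)          # stability shift, result unchanged
    return float(np.log(np.sum(np.exp(shifted))) - shifted[y])

examples = [np.array([10.0, -2.0, 3.0]),
            np.array([10.0, 9.0, 9.0]),
            np.array([10.0, -100.0, -100.0])]

svm_losses = [svm_loss(s, 0) for s in examples]          # all exactly 0.0
softmax_losses = [softmax_loss(s, 0) for s in examples]  # the second one is the largest, about 0.55
```

The SVM is already satisfied in all three cases, while softmax still sees room for improvement in the second one.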

Optimization

Strategy #1 is random search. I'll skip the explanation.

Strategy #2 : Gradient Descent

Follow the slope downhill.

Naive Way : Numerical Gradient

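Just follow the definition of the derivative (the slope), nudging one weight at a time by a small $h$:

$$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

It takes too long, because the loss has to be evaluated again for every single weight!
But we can use it as a debugging method : Gradient Check.
First, we write the code with the analytic gradient method that I'll explain next, and then compute the same gradient numerically and check that the two results agree.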
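Here is a rough sketch of a gradient check on the single-example SVM loss. The analytic gradient formula, the centered difference (instead of the one-sided definition), and the values of `W`, `x`, `y` are my own choices for illustration.

```python
import numpy as np

def svm_loss(W, x, y):
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0
    return np.sum(margins)

def svm_grad_analytic(W, x, y):
    # Each wrong class with a positive margin adds x to its own row
    # and subtracts x from the row of the correct class.
    scores = W.dot(x)
    active = (scores - scores[y] + 1 > 0).astype(float)
    active[y] = 0
    dW = np.outer(active, x)
    dW[y] = -np.sum(active) * x
    return dW

def numerical_grad(f, W, h=1e-5):
    # Centered difference, one weight at a time (slow, only for checking).
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + h
        f_plus = f(W)
        W[idx] = old - h
        f_minus = f(W)
        W[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

W = np.array([[0.1, 0.2, -0.3, 0.4],
              [0.5, -0.6, 0.7, 0.8],
              [-0.9, 1.0, 0.2, -0.1]])
x = np.array([1.0, 2.0, -1.0, 0.5])
y = 0

analytic = svm_grad_analytic(W, x, y)
numeric = numerical_grad(lambda w: svm_loss(w, x, y), W)
max_diff = np.max(np.abs(analytic - numeric))   # should be tiny if the analytic code is right
```

One caveat: the hinge has a kink where a margin is exactly zero, so a check done right at the kink can disagree even when the code is correct.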

Using Differentiation : Analytic Gradient

# Vanilla Gradient Descent

while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights -= step_size * weights_grad   # step_size == learning_rate
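To make the loop above concrete, here is a runnable sketch on a tiny made-up dataset. `svm_loss_and_grad` is my own hypothetical helper that stands in for `loss_fun` and `evaluate_gradient`. It is not code from the lecture. I also replaced `while True` with a fixed number of steps.

```python
import numpy as np

def svm_loss_and_grad(W, X, y):
    # Mean multiclass SVM loss over all examples and its analytic gradient w.r.t. W.
    n = X.shape[0]
    rows = np.arange(n)
    scores = X.dot(W.T)                                   # shape (n, C)
    margins = np.maximum(0, scores - scores[rows, y][:, None] + 1)
    margins[rows, y] = 0
    loss = margins.sum() / n
    coeff = (margins > 0).astype(float)                   # +1 for each active wrong class
    coeff[rows, y] = -coeff.sum(axis=1)                   # minus the count for the correct class
    grad = coeff.T.dot(X) / n                             # shape (C, D)
    return loss, grad

# Made-up data: three points in three different directions, one per class.
X = np.array([[2.0, 0.0], [-1.0, 1.732], [-1.0, -1.732]])
y = np.array([0, 1, 2])

weights = np.zeros((3, 2))
step_size = 0.1                                           # step_size == learning_rate
history = []
for step in range(10):
    loss, weights_grad = svm_loss_and_grad(weights, X, y)
    history.append(loss)
    weights -= step_size * weights_grad
```

On this toy data the loss starts at 2.0 (which is C - 1) and drops to 0 within a few steps, after which the gradient is zero and the weights stop changing.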

Updated Gradient Descent


In the lecture's animation, the purple path is gradient descent with momentum, and the blue one is called 'Adam'. I don't know them well yet, so let's move on.

Stochastic Gradient Descent (SGD)

Use a mini-batch: estimate the gradient from a small random sample of the data instead of the whole dataset.
# Vanilla Minibatch Gradient Descent

while True:
    data_batch = sample_training_data(data, 256) # 32, 64, 128 are also commonly used.
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights -= step_size * weights_grad
A single SGD step is much cheaper than a full gradient descent step because it only looks at one mini-batch, although the gradient it gets is only a noisy estimate. Also, GPU memory is small, so we often can't load the full dataset at once anyway.
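A runnable sketch of the mini-batch loop. To keep it short I swapped the classifier for a made-up noise-free least-squares problem, and `sample_training_data` / `evaluate_gradient` are my own hypothetical implementations of the names used in the pseudocode (without the `loss_fun` argument).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up regression data: targets = inputs . true_weights, no noise.
true_weights = np.array([1.0, -2.0, 0.5])
inputs = rng.normal(size=(1000, 3))
targets = inputs.dot(true_weights)
data = (inputs, targets)

def sample_training_data(data, batch_size):
    # Draw a random mini-batch without replacement.
    X, t = data
    idx = rng.choice(len(t), size=batch_size, replace=False)
    return X[idx], t[idx]

def evaluate_gradient(data_batch, weights):
    # Analytic gradient of 0.5 * mean squared error on the mini-batch.
    X, t = data_batch
    return X.T.dot(X.dot(weights) - t) / len(t)

weights = np.zeros(3)
step_size = 0.1
for step in range(500):
    data_batch = sample_training_data(data, 32)
    weights_grad = evaluate_gradient(data_batch, weights)
    weights -= step_size * weights_grad
```

Each step only touches 32 of the 1000 examples, yet the weights still end up at `true_weights`.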

Image Features


As the title of the slide says, this is an aside from the main flow of the lecture, so I won't explain it much.
The point is that people used to design image features by hand, but now that deep learning has advanced so much, image features are learned automatically.

Past

  • Color Histogram
  • Histogram of Oriented Gradients (HoG)
  • Bag of Words
    • It appears because of Fei-Fei Li. No need to understand it.
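As a feel for what a hand-made feature looks like, here is a simplified color histogram sketch. As far as I remember, the lecture bins pixels by hue. This version just counts intensities per RGB channel, and the image is random noise standing in for a real picture.

```python
import numpy as np

def color_histogram(image, bins=8):
    # Concatenate one intensity histogram per channel into a single feature vector.
    features = []
    for channel in range(image.shape[2]):
        hist, _ = np.histogram(image[:, :, channel], bins=bins, range=(0, 256))
        features.append(hist)
    return np.concatenate(features)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in for a real image

feature = color_histogram(image)   # 3 channels * 8 bins = 24 numbers
```

A linear classifier would then be trained on `feature` instead of on the raw pixels.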

Now


“Convolutional Network is fucking good, Deep learning is the best.”

Summary

  • We learned two kinds of loss functions : SVM / Softmax
  • We use regularization to avoid overfitting, so the expression will look like : $L = \frac{1}{N}\sum_{i=1}^{N} L_i(f(x_i, W), y_i) + \lambda R(W)$
  • To optimize, we use gradient descent.

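  1. Strictly speaking, cross-entropy is the more general quantity $H(p, q) = -\sum_x p(x) \log q(x)$. The softmax loss is what it reduces to when $p$ puts all of its probability on the correct class, so in practice the two names are used for the same thing.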
