
Neural Network [cs231n - week 3 : Loss Functions and Optimization]

The purpose of this post is to summarize the cs231n lecture for myself, so it may be a little unfriendly to people who haven't watched the video. In addition, I omitted some content that I don't think is important enough, so use this article only as a supplement.

Loss Function

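Below is the general form of a loss function: the average of the per-example losses over the $N$ training examples.

$$L = \frac{1}{N} \sum_{i=1}^{N} L_i(f(x_i, W), y_i)$$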

Multiclass SVM Loss

There are many many many kinds of loss functions, but this class only deals with two of them : SVM loss and softmax.

The SVM loss for a single example, with scores $s = f(x_i, W)$ and correct class $y_i$, is:

$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$$

The expression means that the score of the correct class must be larger than every other score by at least one for the loss to be zero.
It is easier to understand with an example.

First, compute the loss for each example, as on the lecture slide. I'm not omitting the explanation because it's annoying to write; it's because the explanation on the lecture slide is kind enough.

Then get the final loss by taking the mean of the three losses. In fact, there is no need to take the mean. The sum works just as well, because it only rescales the loss by a constant.
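The per-example losses and their mean can be checked with a few lines of NumPy. This is only a minimal sketch: the three score vectors and labels are illustrative values picked for this example (not taken from this post), and `svm_loss_from_scores` is a helper name I made up.

```python
import numpy as np

def svm_loss_from_scores(scores, y, margin=1.0):
    """Multiclass SVM (hinge) loss for one example, given its class scores."""
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0  # the correct class does not contribute to the sum
    return float(np.sum(margins))

# Illustrative scores (assumed values): one row per example, one column per class.
scores = np.array([
    [3.2, 5.1, -1.7],   # example 0, correct class 0
    [1.3, 4.9, 2.0],    # example 1, correct class 1
    [2.2, 2.5, -3.1],   # example 2, correct class 2
])
labels = [0, 1, 2]

losses = [svm_loss_from_scores(s, y) for s, y in zip(scores, labels)]
mean_loss = float(np.mean(losses))  # the mean of the per-example losses
sum_loss = float(np.sum(losses))    # the sum differs only by the factor N
print(losses, mean_loss, sum_loss)
```

With these scores only the second example has zero loss, because it is the only one whose correct-class score beats the others by at least the margin.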

Shape of The Graph

The graph of the SVM loss has a 'hinge' shape.

Notice that the min value of this loss function is zero and the max value is infinity. It is one of the key properties of 'good' loss functions. According to mentor 강성희, we might have to design our own loss functions to solve certain problems when we deal with real-world problems.

Question : Why 1?

The '+1' is called the margin. It is a hyperparameter, but its exact value is actually arbitrary, so in most cases people just use +1. You may be able to find the mathematical argument for this near the end of lecture 7 of the Coursera Machine Learning class.

What If All Scores Are ≈ 0?

Let the number of classes be $C$. Then the answer to the question is about $C - 1$.
This is useful for debugging. In the initial stage of training, if the loss is very different from $C - 1$, the code may contain a bug.
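This sanity check is easy to try out. Below is a rough sketch, assuming CIFAR-10-like sizes (10 classes, 3072 input dimensions) and a small random initialization; the sizes, the seed and the helper name `svm_loss` are my own assumptions.

```python
import numpy as np

def svm_loss(x, y, W, margin=1.0):
    """Multiclass SVM loss for one example x with correct class y."""
    scores = W.dot(x)
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0
    return float(np.sum(margins))

num_classes, dim = 10, 3072                          # assumed sizes
rng = np.random.default_rng(0)
W = 1e-4 * rng.standard_normal((num_classes, dim))   # small weights, so scores are close to 0
x = rng.standard_normal(dim)

loss = svm_loss(x, y=3, W=W)
print(loss, num_classes - 1)  # the two numbers should be close
```

If the first number is far from the second at the start of training, I would look for a bug before tuning anything else.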

Numpy Code

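import numpy as np

# Get the loss for a single example x with correct class index y
def L_i_vectorized(x, y, W):
    scores = W.dot(x)                                # class scores
    margins = np.maximum(0, scores - scores[y] + 1)  # hinge for every class
    margins[y] = 0  # ignore the correct class. Without this line the loss is just larger by a constant 1.
    loss_i = np.sum(margins)
    return loss_i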

Regularization

Is loss=0 The Best?

No. Overfitting!
We can avoid the overfitting problem by adding a regularization term $\lambda R(W)$ to the loss function, which pushes the model toward a simpler $W$. I don't know the reason why it works. Don't ask me.
So, the complete expression is composed of two parts, as below.
  • data loss : $\frac{1}{N} \sum_{i=1}^{N} L_i(f(x_i, W), y_i)$
  • regularization loss : $\lambda R(W)$
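The two parts can be computed separately and then added. Here is a minimal sketch, assuming the multiclass SVM loss as the data loss and an L2 penalty as the regularizer; the toy `X`, `y`, `W`, the value of `lam` and the function name `full_loss` are all made up for illustration.

```python
import numpy as np

def full_loss(X, y, W, lam):
    """Mean multiclass SVM loss over the rows of X, plus lam * sum(W**2)."""
    rows = np.arange(len(y))
    scores = X.dot(W.T)                         # shape (N, num_classes)
    correct = scores[rows, y][:, None]          # correct-class score of each row
    margins = np.maximum(0.0, scores - correct + 1.0)
    margins[rows, y] = 0.0                      # the correct class is not counted
    data_loss = float(np.mean(np.sum(margins, axis=1)))
    reg_loss = float(lam * np.sum(W * W))       # L2 penalty, one possible choice of R(W)
    return data_loss, reg_loss, data_loss + reg_loss

X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]])   # 3 toy examples, 2 features
y = np.array([0, 1, 2])
W = np.array([[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]])   # 3 classes x 2 features
data_loss, reg_loss, total = full_loss(X, y, W, lam=0.1)
print(data_loss, reg_loss, total)
```

The hyperparameter `lam` trades the two parts off against each other: a larger value favors smaller weights over fitting the training data.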

Occam’s Razor

Among competing hypotheses, the simplest is the best
- William of Ockham, 1285 - 1347 -
The rule above can be applied to the world of deep learning. But I think it's a kind of humorous metaphor from the professor, so there is no need to take it too seriously.

Question : How Does the Regularization Loss Make W Linear?

Though I had said don't ask me, one student asked this question. Actually, he asked the professor, not me.
The professor didn't answer clearly. Instead, he said there are two ways to get a simple W.
1. restrict the model class so that it does not contain high-degree polynomials
2. let the model class contain high-degree polynomials and penalize them with the regularization loss

Various Kinds of Regularization

  • L2 regularization
  • L1 regularization
  • Elastic net (L1 + L2)
  • Max norm regularization
  • Dropout
  • Batch normalization
  • Stochastic Depth

L1 vs L2

I don't think the professor's explanation was enough, but the gist is that L2 regularization prefers weights that are spread out over all the input dimensions, while L1 regularization tends to prefer sparse weights, where a few entries are large and the rest are zero or close to it. The reason, as far as I can infer, is that the L2 penalty squares each weight, so one large weight costs more than several small ones.
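A tiny numerical example makes the difference visible. The vectors below are illustrative values of my own choosing: two weight vectors that give exactly the same score on the same input but are penalized differently.

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])       # all the weight on one input (sparse)
w2 = np.array([0.25, 0.25, 0.25, 0.25])   # the same total weight, spread evenly

score1, score2 = float(w1.dot(x)), float(w2.dot(x))    # identical scores
l2_w1, l2_w2 = float(np.sum(w1 ** 2)), float(np.sum(w2 ** 2))
l1_w1, l1_w2 = float(np.sum(np.abs(w1))), float(np.sum(np.abs(w2)))
print(score1, score2)   # same data loss for both
print(l2_w1, l2_w2)     # the L2 penalty is smaller for the spread-out w2
print(l1_w1, l1_w2)     # the L1 penalty is the same for both
```

So L2 actively prefers `w2`, while L1 does not punish the sparse `w1` any more than the spread-out one.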

Softmax Classifier (= Cross-Entropy Loss [1])

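It converts scores into probabilities, and the loss is the negative log of the probability of the correct class.

$$P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}, \qquad L_i = -\log P(Y = y_i \mid X = x_i)$$

The loss should be a monotonically decreasing function of the correct-class probability, because the loss should be low when that probability is high.

The scores are exponentiated 1) to turn negative numbers into positive ones and 2) so that the exponential cancels with the log.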
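The conversion from scores to probabilities and then to a loss can be sketched as follows. The score vector is an illustrative assumption, `softmax_loss_from_scores` is a made-up name, and subtracting the max score is a common numerical-stability trick that I added; it does not change the result.

```python
import numpy as np

def softmax_loss_from_scores(scores, y):
    """Softmax probabilities and cross-entropy loss for one example."""
    shifted = scores - np.max(scores)   # same probabilities, but exp cannot overflow
    exp_scores = np.exp(shifted)        # every entry becomes positive
    probs = exp_scores / np.sum(exp_scores)
    loss = -np.log(probs[y])            # low when the correct class is likely
    return probs, float(loss)

scores = np.array([3.2, 5.1, -1.7])    # illustrative scores; class 0 is assumed correct
probs, loss = softmax_loss_from_scores(scores, y=0)
print(probs, loss)
```

Here the correct class gets a probability of only about 0.13, so the loss (natural log) comes out around 2.04.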

Debugging Skill (What If All Scores Are ≈ 0?)

In the initial stage, the loss value should be about $\log C$, where $C$ is the number of classes. If not, something is going wrong.
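This check can be reproduced directly: when all scores are equal, every class gets probability 1/C, so the loss is log C. This is a small sketch, assuming 10 classes; the helper name is hypothetical.

```python
import numpy as np

def softmax_loss_from_scores(scores, y):
    """Cross-entropy loss of the softmax of `scores` for correct class y."""
    shifted = scores - np.max(scores)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[y])

num_classes = 10                     # assumed, e.g. a 10-class dataset
scores = np.zeros(num_classes)       # all scores equal, as at initialization
initial_loss = softmax_loss_from_scores(scores, y=0)
expected = float(np.log(num_classes))
print(initial_loss, expected)
```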

SVM vs Softmax

Let's look at these three examples of scores, where the first score belongs to the correct class.
[10, -2, 3]
[10, 9, 9]
[10, -100, -100]
Unlike softmax, SVM treats these three states as the same (the loss is zero in all of them), which means that once it reaches the second state above, it doesn't learn any more. However, I heard that softmax is not strictly better than SVM, and I don't know why.
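The difference is easy to see by evaluating both losses on the three score vectors. This is a minimal sketch that assumes class 0 (the score of 10) is the correct class; the function names are my own.

```python
import numpy as np

def svm_loss(scores, y):
    margins = np.maximum(0.0, scores - scores[y] + 1.0)
    margins[y] = 0.0
    return float(np.sum(margins))

def softmax_loss(scores, y):
    shifted = scores - np.max(scores)   # numerical stability only
    return float(np.log(np.sum(np.exp(shifted))) - shifted[y])

examples = [
    np.array([10.0, -2.0, 3.0]),
    np.array([10.0, 9.0, 9.0]),
    np.array([10.0, -100.0, -100.0]),
]
svm = [svm_loss(s, 0) for s in examples]       # assumes class 0 is correct
soft = [softmax_loss(s, 0) for s in examples]
print(svm)    # identical for all three
print(soft)   # different for each, largest for the second example
```

The SVM loss is already satisfied once the margin is met, whereas the softmax loss keeps shrinking as the correct score pulls further ahead.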

Optimization

Strategy #1 is random search. Yes, random search. I'll skip the explanation.

Strategy #2 : Gradient Descent

Follow the slope.

Naive Way : Numerical Gradient

Just follow the definition of the derivative, i.e. the slope.

$$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

It takes too long!
But we can use it as a debugging method : Gradient Check.
First, we write the code with the analytic gradient that I'll explain next, and then compute the same gradient numerically and compare the two.
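A gradient check might look like the sketch below. Note that it uses a centered difference rather than the one-sided definition, which is a common swap because it is more accurate. The toy loss, its analytic gradient and all the names are assumptions for illustration.

```python
import numpy as np

def numerical_gradient(f, w, h=1e-5):
    """Centered-difference estimate of the gradient of f at w (slow: 2 calls per entry)."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w.flat[i]
        w.flat[i] = old + h
        f_plus = f(w)
        w.flat[i] = old - h
        f_minus = f(w)
        w.flat[i] = old                       # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

def f(w):
    """Toy loss with a known gradient: f(w) = sum(w**2)."""
    return float(np.sum(w ** 2))

def analytic_gradient(w):
    return 2 * w

w = np.array([[0.5, -1.0], [2.0, 0.25]])
num = numerical_gradient(f, w)
ana = analytic_gradient(w)
rel_error = float(np.max(np.abs(num - ana) / np.maximum(1e-8, np.abs(num) + np.abs(ana))))
print(rel_error)  # should be tiny if the analytic gradient is right
```

The loop needs two loss evaluations per weight, which is exactly why this is only used for checking and not for training.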

Using Differentiation : Analytic Gradient

# Vanilla Gradient Descent

while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights -= step_size * weights_grad   # step_size == learning_rate
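To actually run the loop above, the helpers need a concrete definition. Here is a toy sketch on a simple quadratic loss: `loss_fun` and `evaluate_gradient` are stand-ins I wrote with simplified signatures (no `data` argument), and the loop runs for a fixed number of steps instead of `while True`.

```python
import numpy as np

TARGET = np.array([3.0, -2.0])   # assumed minimizer of the toy loss

def loss_fun(weights):
    """Toy convex loss, minimized at weights == TARGET."""
    return float(np.sum((weights - TARGET) ** 2))

def evaluate_gradient(weights):
    """Analytic gradient of loss_fun."""
    return 2.0 * (weights - TARGET)

weights = np.zeros(2)
step_size = 0.1                  # the learning rate
for _ in range(100):             # fixed number of steps instead of `while True`
    weights_grad = evaluate_gradient(weights)
    weights -= step_size * weights_grad   # step against the gradient

print(weights, loss_fun(weights))
```

With this step size each update shrinks the distance to the minimum by a constant factor, so the weights end up very close to `TARGET`.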

Updated Gradient Descent


In the figure, the purple one is gradient descent with momentum and the blue one is called 'Adam'. But I don't know them well. Let's pass.

Stochastic Gradient Descent (SGD)

Using mini batch.
# Vanilla Minibatch Gradient Descent

while True:
    data_batch = sample_training_data(data, 256)  # sample 256 examples; 32, 64, 128 are also commonly used
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights -= step_size * weights_grad  # step_size == learning_rate
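In fact, the point of SGD is not that it converges in fewer steps than normal gradient descent. Computing the gradient over the full dataset at every step is very expensive when the dataset is large, and GPU memory is usually too small to hold all the data at once, so we estimate the gradient on a minibatch instead.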
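A runnable version of the minibatch loop could look like this. It is only a sketch on a toy linear-regression problem with noise-free targets: the dataset, the squared-error loss, the step size and the helper signatures are my own assumptions, not something from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])        # weights used to generate the toy data
X = rng.standard_normal((10_000, 3))       # toy dataset
y = X.dot(true_w)                          # noise-free targets

def sample_training_data(X, y, batch_size):
    """Pick a random minibatch without replacement."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    return X[idx], y[idx]

def evaluate_gradient(X_batch, y_batch, weights):
    """Gradient of the mean squared error on one minibatch."""
    residual = X_batch.dot(weights) - y_batch
    return 2.0 * X_batch.T.dot(residual) / len(y_batch)

weights = np.zeros(3)
step_size = 0.05
for _ in range(500):
    X_batch, y_batch = sample_training_data(X, y, 256)
    weights_grad = evaluate_gradient(X_batch, y_batch, weights)
    weights -= step_size * weights_grad

print(weights)  # should end up close to true_w
```

Each step only touches 256 of the 10,000 rows, so an update costs a small fraction of a full-batch gradient.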

Image Features


As the title of the slide says, it's an aside from the main flow of the lecture, so I won't explain it much.
The point is that people used to design image features manually in the past, but now, because deep learning has advanced a lot, image features are learned automatically.

Past

  • Color Histogram
  • Histogram of Oriented Gradients (HoG)
  • Bag of Words
    • It appears because of FeiFei. You don't need to understand it.

Now


“Convolutional Network is fucking good, Deep learning is the best.”

Summary

  • We learned two kinds of loss functions : SVM / Softmax
  • We use regularization to avoid overfitting, so the expression will look like : $L(W) = \frac{1}{N} \sum_{i=1}^{N} L_i(f(x_i, W), y_i) + \lambda R(W)$
  • To optimize, we use gradient descent.

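  [1] Strictly speaking, the cross-entropy $H(p, q) = -\sum_x p(x) \log q(x)$ is defined between two distributions. The softmax loss is the cross-entropy between the one-hot distribution of the true label and the predicted probabilities, so in practice the two names are used interchangeably.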
