Deep Neural Networks step by step

Posted May 7, 2019 by Rokas Balsys



Improving Deep Neural Networks - Initialization:

Training a neural network requires specifying initial values for the weights. A well-chosen initialization method improves learning and accuracy.

If you were reading my previous tutorial parts, you probably followed my instructions for weight initialization, and it has worked out so far. But how do we choose the initialization for a new neural network? In this part I will show you that different initialization methods can lead to different results.

A well-chosen initialization can speed up the convergence of gradient descent and increase the odds of it converging to a lower training error.

I will use the code from my last tutorial, where we classified circles. Previously we used random weight initialization; now we'll try "He initialization", named after the first author of He et al., 2015. (If you have heard of "Xavier initialization", this is similar, except Xavier initialization uses a scaling factor of sqrt(1./layers_dims[l-1]) for the weights $W^{[l]}$, where He initialization uses sqrt(2./layers_dims[l-1]).)
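For a quick feel of the difference, here is a minimal sketch comparing the two scaling factors (the layers_dims values below are just an example, not the network from this tutorial):

import numpy as np

layers_dims = [2, 10, 1]   # example: 2 inputs, 10 hidden units, 1 output
l = 1                      # the first hidden layer

xavier_scale = np.sqrt(1. / layers_dims[l - 1])  # Xavier: sqrt(1 / previous layer size)
he_scale = np.sqrt(2. / layers_dims[l - 1])      # He:     sqrt(2 / previous layer size)

print(xavier_scale, he_scale)  # the He factor is sqrt(2) times larger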

Previously we used the following code:

import numpy as np

def initialize_parameters_deep(layer_dimension):
    parameters = {}

    L = len(layer_dimension)  # number of layers in the network

    for l in range(1, L):
        # small random weights, zero biases
        parameters["W" + str(l)] = np.random.randn(layer_dimension[l], layer_dimension[l-1]) * 0.01
        parameters["b" + str(l)] = np.zeros((layer_dimension[l], 1))

    return parameters

The only difference is that instead of multiplying np.random.randn(..,..) by 0.01, we will multiply it by $\sqrt{\frac{2}{\text{dimension of the previous layer}}}$, which is what "He" initialization recommends for layers with a ReLU activation:

def initialize_parameters_he(layer_dimension):
    parameters = {}

    L = len(layer_dimension)  # number of layers in the network

    for l in range(1, L):
        # "He" scaling: multiply by sqrt(2 / size of previous layer)
        parameters["W" + str(l)] = np.random.randn(layer_dimension[l], layer_dimension[l-1]) * np.sqrt(2./layer_dimension[l-1])
        parameters["b" + str(l)] = np.zeros((layer_dimension[l], 1))

    return parameters
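A quick sanity check of the returned shapes (the layer sizes here are just an example):

parameters = initialize_parameters_he([2, 10, 5, 1])
print(parameters["W1"].shape)  # (10, 2)
print(parameters["b1"].shape)  # (10, 1)
print(parameters["W3"].shape)  # (1, 5)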

First, let's test our results with random initialization so that we can compare the difference:

Cost after iteration 14600: 0.495909
Cost after iteration 14700: 0.496088
Cost after iteration 14800: 0.496531
Cost after iteration 14900: 0.496314
train accuracy: 72.33333333333333 %

[Figure 1: cost vs. training iterations with random initialization]

From the plot above we can see that the cost only starts to drop after about 8000 training steps.

[Figure 2: classification results with random initialization]

The classification results look reasonable. Now let's run the same network with "He" initialization:

Cost after iteration 14600: 0.456906
Cost after iteration 14700: 0.457210
Cost after iteration 14800: 0.463999
Cost after iteration 14900: 0.467439
train accuracy: 73.66666666666667 %

[Figure 3: cost vs. training iterations with "He" initialization]

From the plot above we can see that the cost starts dropping almost instantly; we don't need to wait 8000 steps to see improvement.

[Figure 4: classification results with "He" initialization]

Classification results are about 1% better than with random initialization.


We should remember that different parameter initialization methods lead to different results. Random initialization is used to break symmetry and to make sure different hidden units can learn different things, as the sketch below illustrates.
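To see why symmetry breaking matters, here is a small sketch; initialize_parameters_zeros is a hypothetical variant used only for illustration, it is not part of the tutorial code:

def initialize_parameters_zeros(layer_dimension):
    # Hypothetical all-zeros initialization, for illustration only.
    # Every unit in a layer starts out identical, so during backpropagation
    # they all receive the same gradient and never learn different features.
    parameters = {}
    L = len(layer_dimension)
    for l in range(1, L):
        parameters["W" + str(l)] = np.zeros((layer_dimension[l], layer_dimension[l-1]))
        parameters["b" + str(l)] = np.zeros((layer_dimension[l], 1))
    return parameters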

Don't initialize weights to values that are too large; overly large random values slow down optimization.

As you saw, "He" initialization works well for networks with ReLU activations.

This is the last deep learning tutorial in this series. To get better results you can try to optimize this deep neural network by implementing L2 regularization, dropout, and gradient checking. You can also try optimizing with momentum, the Adam optimizer, or training with mini-batches.
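As a starting point, here is a rough sketch of how an L2 penalty could be added to the cost; compute_cost and the lambd parameter are assumptions about how your code is organized, not this tutorial's actual implementation:

def compute_cost_with_l2(AL, Y, parameters, lambd):
    # Cross-entropy cost plus an L2 penalty on all weight matrices.
    # compute_cost is assumed to be the cross-entropy cost from earlier parts;
    # lambd controls the regularization strength.
    m = Y.shape[1]
    cross_entropy_cost = compute_cost(AL, Y)

    L = len(parameters) // 2  # number of layers with parameters
    l2_cost = 0
    for l in range(1, L + 1):
        l2_cost += np.sum(np.square(parameters["W" + str(l)]))
    l2_cost *= lambd / (2 * m)

    return cross_entropy_cost + l2_cost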

Full tutorial code and the cats vs. dogs image dataset can be found on my GitHub page.