Neural Network Backward Propagation and Parameters Update

In this tutorial part, we'll write the backward propagation and parameter update functions using the cache computed during forward propagation.

In our previous tutorial, we wrote the forward propagation and cost functions. Next, we need to write a backpropagation function, and for this we'll use the cache computed during forward propagation.

Backpropagation is usually the hardest (most mathematical) part of deep learning. Here again is the picture with the six mathematical equations we'll use; since we are building a vectorized implementation, we'll work with the six equations on the right side of that image.
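
In case the image doesn't come through here, these are the six vectorized equations, written with the same variable names used in the code below (the term 1 - A1^2 is the derivative of tanh, since A1 = tanh(Z1), and ⊙ denotes element-wise multiplication):

\begin{aligned}
dZ_2 &= A_2 - Y \\
dW_2 &= \frac{1}{m}\, dZ_2\, A_1^{T} \\
db_2 &= \frac{1}{m} \sum_{i=1}^{m} dZ_2^{(i)} \\
dZ_1 &= W_2^{T}\, dZ_2 \odot \left(1 - A_1^{2}\right) \\
dW_1 &= \frac{1}{m}\, dZ_1\, X^{T} \\
db_1 &= \frac{1}{m} \sum_{i=1}^{m} dZ_1^{(i)}
\end{aligned}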

Code for our backward propagation function:

Arguments:

parameters - Python dictionary containing our parameters;
cache - a dictionary containing "Z1", "A1", "Z2" and "A2";
X - input data of shape (input size, number of examples);
Y - "true" labels vector of shape (1, number of examples).

Return:

grads - Python dictionary containing our gradients with respect to the different parameters.

def backward_propagation(parameters, cache, X, Y):
    # number of examples
    m = X.shape[1]
    
    # Retrieve W1 and W2 from the "parameters" dictionary
    W1 = parameters["W1"]
    W2 = parameters["W2"]

    # Retrieve A1 and A2 from "cache" dictionary
    A1 = cache["A1"]
    A2 = cache["A2"]

    # Backward propagation for dW1, db1, dW2, db2
    dZ2 = A2 - Y
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.dot(W2.T, dZ2) * (1 - np.power(A1, 2))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims=True)
    
    grads = {"dW1": dW1,
             "db1": db1,
             "dW2": dW2,
             "db2": db2}
    
    return grads
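
As a quick sanity check (a minimal sketch, not part of the tutorial pipeline), the gradients returned by backward_propagation should have exactly the same shapes as the parameters they correspond to. The snippet below assumes numpy is imported as np and reuses the initialize_parameters and forward_propagation functions from the previous parts (they are repeated in the full code further down); the batch of 5 examples and the hidden layer size of 7 are made up just for this check, while 12288 = 64*64*3 is our flattened image size:

# minimal shape check on random data
np.random.seed(1)
X_check = np.random.randn(12288, 5)            # 5 fake flattened images
Y_check = (np.random.rand(1, 5) > 0.5) * 1.0   # 5 fake 0/1 labels

params = initialize_parameters(12288, 7, 1)
A2, cache = forward_propagation(X_check, params)
grads = backward_propagation(params, cache, X_check, Y_check)

print(grads["dW1"].shape)   # (7, 12288), same as W1
print(grads["db1"].shape)   # (7, 1), same as b1
print(grads["dW2"].shape)   # (1, 7), same as W2
print(grads["db2"].shape)   # (1, 1), same as b2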

Code for our update parameters function:

We'll implement the update rule using gradient descent: we'll use (dW1, db1, dW2, db2) to update (W1, b1, W2, b2).

The general gradient descent rule is θ = θ - α (∂J/∂θ), where α is the learning rate and θ represents a parameter.
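
For example, with a learning rate of α = 0.1, a parameter currently at θ = 2.0 whose gradient is ∂J/∂θ = 0.5 would be updated to θ = 2.0 - 0.1 * 0.5 = 1.95 (the numbers here are made up purely to illustrate the rule).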

Arguments:

parameters - Python dictionary containing our parameters;
grads - Python dictionary containing our gradients;
learning_rate - learning rate of the gradient descent update rule.

Return:

parameters - Python dictionary containing our updated parameters.

def update_parameters(parameters, grads, learning_rate=0.1):
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    b1 = parameters["b1"]
    b2 = parameters["b2"]

    # Retrieve each gradient from the "grads" dictionary
    dW1 = grads["dW1"]
    db1 = grads["db1"]
    dW2 = grads["dW2"]
    db2 = grads["db2"]

    # Update rule for each parameter
    W1 = W1 - dW1 * learning_rate
    b1 = b1 - db1 * learning_rate
    W2 = W2 - dW2 * learning_rate
    b2 = b2 - db2 * learning_rate
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters
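
A minimal usage sketch, continuing the hypothetical shape-check snippet from above: one gradient descent step is just a call to update_parameters with the gradients from backward_propagation.

# one gradient descent step on the same made-up data
params = update_parameters(params, grads, learning_rate=0.1)
print(params["W1"].shape)   # still (7, 12288); only the values have changed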

Full tutorial code:

import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
import scipy

ROWS = 64
COLS = 64
CHANNELS = 3

#TRAIN_DIR = 'Train_data/'
#TEST_DIR = 'Test_data/'

#train_images = [TRAIN_DIR+i for i in os.listdir(TRAIN_DIR)]
#test_images =  [TEST_DIR+i for i in os.listdir(TEST_DIR)]

def read_image(file_path):
    img = cv2.imread(file_path, cv2.IMREAD_COLOR)
    return cv2.resize(img, (ROWS, COLS), interpolation=cv2.INTER_CUBIC)

def prepare_data(images):
    m = len(images)
    X = np.zeros((m, ROWS, COLS, CHANNELS), dtype=np.uint8)
    y = np.zeros((1, m))
    for i, image_file in enumerate(images):
        X[i,:] = read_image(image_file)
        if 'dog' in image_file.lower():
            y[0, i] = 1
        elif 'cat' in image_file.lower():
            y[0, i] = 0
    return X, y

def sigmoid(z):
    s = 1/(1+np.exp(-z))
    return s
'''
train_set_x, train_set_y = prepare_data(train_images)
test_set_x, test_set_y = prepare_data(test_images)

train_set_x_flatten = train_set_x.reshape(train_set_x.shape[0], ROWS*COLS*CHANNELS).T
test_set_x_flatten = test_set_x.reshape(test_set_x.shape[0], -1).T

train_set_x = train_set_x_flatten/255
test_set_x = test_set_x_flatten/255
'''
#train_set_x_flatten shape: (12288, 6002)
#train_set_y shape: (1, 6002)

def initialize_parameters(input_layer, hidden_layer, output_layer):
    # initialize 1st layer weights of shape (hidden_layer, input_layer) with small random values
    W1 = np.random.randn(hidden_layer, input_layer) * 0.01
    # initialize 1st layer bias with zeros
    b1 = np.zeros((hidden_layer, 1))
    # initialize 2nd layer weights of shape (output_layer, hidden_layer) with small random values
    W2 = np.random.randn(output_layer, hidden_layer) * 0.01
    # initialize 2nd layer bias with zeros
    b2 = np.zeros((output_layer, 1))
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters

def forward_propagation(X, parameters):
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]

    # Implement forward propagation to calculate A2 (the output probabilities)
    Z1 = np.dot(W1, X) + b1
    A1 =  np.tanh(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)

    # Values needed in the backpropagation are stored in "cache"
    cache = {"Z1": Z1,
             "A1": A1,
             "Z2": Z2,
             "A2": A2}
    
    return A2, cache

def compute_cost(A2, Y, parameters):
    # number of examples
    m = Y.shape[1]

    # Compute the cross-entropy cost
    logprobs = np.multiply(np.log(A2), Y) + np.multiply(np.log(1 - A2), 1 - Y)
    cost = -1./m * np.sum(logprobs)

    # make sure cost is the dimension we expect, e.g. turns [[51]] into 51
    cost = np.squeeze(cost)
    
    return cost

def backward_propagation(parameters, cache, X, Y):
    # number of examples
    m = X.shape[1]
    
    # Retrieve W1 and W2 from the "parameters" dictionary
    W1 = parameters["W1"]
    W2 = parameters["W2"]

    # Retrieve A1 and A2 from "cache" dictionary
    A1 = cache["A1"]
    A2 = cache["A2"]

    # Backward propagation for dW1, db1, dW2, db2
    dZ2 = A2 - Y
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.dot(W2.T, dZ2) * (1 - np.power(A1, 2))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims=True)
    
    grads = {"dW1": dW1,
             "db1": db1,
             "dW2": dW2,
             "db2": db2}
    
    return grads

def update_parameters(parameters, grads, learning_rate=0.1):
    # Retrieve each parameter from "parameters" dictionary
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    b1 = parameters["b1"]
    b2 = parameters["b2"]

    # Retrieve each gradient from the "grads" dictionary
    dW1 = grads["dW1"]
    db1 = grads["db1"]
    dW2 = grads["dW2"]
    db2 = grads["db2"]

    # Update rule for each parameter
    W1 = W1 - dW1 * learning_rate
    b1 = b1 - db1 * learning_rate
    W2 = W2 - dW2 * learning_rate
    b2 = b2 - db2 * learning_rate
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters
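
To convince ourselves that the pieces fit together before the next tutorial, here is a rough sanity-check sketch (not the model function we'll write next time) that chains forward propagation, cost, backward propagation, and the parameter update for a few iterations on small random data; the sizes and hidden layer width are made up, and if backpropagation is correct the printed cost should go down:

# rough sanity-check loop on random data (not our actual image data)
np.random.seed(2)
X_demo = np.random.randn(12288, 10)             # 10 fake flattened images
Y_demo = (np.random.rand(1, 10) > 0.5) * 1.0    # 10 fake 0/1 labels

demo_parameters = initialize_parameters(12288, 7, 1)
for i in range(100):
    A2, cache = forward_propagation(X_demo, demo_parameters)
    cost = compute_cost(A2, Y_demo, demo_parameters)
    grads = backward_propagation(demo_parameters, cache, X_demo, Y_demo)
    demo_parameters = update_parameters(demo_parameters, grads, learning_rate=0.1)
    if i % 20 == 0:
        print("iteration %i, cost: %f" % (i, cost))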

Conclusion:

Up to this point, we have already written the parameter initialization, forward propagation, backward propagation, cost, and parameter update functions. In the next tutorial, we'll connect all of them into a model and start training a neural network with one hidden layer.