Understanding Logistic Regression

Posted March 26, 2019 by Rokas Balsys

##### Reshaping arrays:

In the previous tutorial we learned how to code the sigmoid function and its gradient. In this tutorial we'll learn how to reshape arrays, normalize rows, what broadcasting is, and softmax.

Two numpy features used constantly in deep learning are the shape attribute and np.reshape(). X.shape gives the shape (dimensions) of a matrix or vector X. np.reshape(...) rearranges a matrix or vector into some other dimensions.
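As a quick illustration of both (a minimal sketch using a small made-up array, not part of the tutorial's image example):

```python
import numpy as np

# A 2x3 matrix to illustrate shape and reshape.
X = np.array([[1, 2, 3],
              [4, 5, 6]])

print(X.shape)                # (2, 3)
print(X.reshape(3, 2))        # same 6 elements, laid out as 3 rows of 2
print(X.reshape(6, 1).shape)  # (6, 1) - a column vector
```

Note that reshape never changes the number of elements, only how they are laid out.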

For example, in computer science a standard image is represented by a 3D array of shape (length, height, depth). However, when you feed an image to an algorithm, you often convert it to a vector of shape (length\*height\*depth, 1). In other words, you "unroll", or reshape, the 3D array into a 1D vector.

So we will implement a function that takes an input of shape (length, height, depth) and returns a vector of shape (length\*height\*depth, 1). As a related example, if you would like to reshape an array A of shape (a, b, c) into a matrix of shape (a\*b, c), you would do:

```python
A = A.reshape((A.shape[0] * A.shape[1], A.shape[2]))  # A.shape[0] = a ; A.shape[1] = b ; A.shape[2] = c
```

Implementing the function itself takes only a few lines of code:

```python
def image2vector(image):
    A = image.reshape(image.shape[0] * image.shape[1] * image.shape[2], 1)
    return A
```

To test our function we will create a 3 by 3 by 2 array. Typical images have shape (num_px_x, num_px_y, 3), where 3 represents the RGB channels:

```python
image = np.array([[[11, 12],
                   [13, 14],
                   [15, 16]],
                  [[21, 22],
                   [23, 24],
                   [25, 26]],
                  [[31, 32],
                   [33, 34],
                   [35, 36]]])

print(image2vector(image))
print(image.shape)
print(image2vector(image).shape)
```

As a result we receive:

```
[[11]
 [12]
 [13]
 [14]
 [15]
 [16]
 [21]
 [22]
 [23]
 [24]
 [25]
 [26]
 [31]
 [32]
 [33]
 [34]
 [35]
 [36]]
(3, 3, 2)
(18, 1)
```

As you can see, the shape of the image is (3, 3, 2), and after we call our function it is reshaped into a column vector of shape (18, 1).
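The same unrolling is often written with -1, which tells numpy to infer that dimension automatically. A short equivalent sketch (the function name `image2vector_alt` is ours, not part of the tutorial):

```python
import numpy as np

def image2vector_alt(image):
    # -1 lets numpy compute length*height*depth for us.
    return image.reshape(-1, 1)

image = np.ones((3, 3, 2))
print(image2vector_alt(image).shape)  # (18, 1)
```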

##### Normalizing rows:

Another common technique used in machine learning and deep learning is to normalize our data. It often leads to better performance because gradient descent converges faster after normalization. Here, by normalization we mean changing x to \( \frac{x}{\| x\|} \) (dividing each row vector of x by its norm).

For example, if: $$x = \begin{bmatrix} 0 & 3 & 4 \\ 2 & 6 & 4 \\ \end{bmatrix}\tag{1}$$ then: $$\| x\| = np.linalg.norm(x, axis = 1, keepdims = True) = \begin{bmatrix} 5 \\ \sqrt{56} \\ \end{bmatrix}\tag{2} $$ and: $$ x\_normalized = \frac{x}{\| x\|} = \begin{bmatrix} 0 & \frac{3}{5} & \frac{4}{5} \\ \frac{2}{\sqrt{56}} & \frac{6}{\sqrt{56}} & \frac{4}{\sqrt{56}} \\ \end{bmatrix}\tag{3}$$ If you ask how we received division by \( 5 \) or division by \( \sqrt{56} \), answer: $$ \sqrt{0^2+3^2+4^2} = 5 $$ and: $$ \sqrt{2^2+6^2+4^2} = \sqrt{56} $$
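We can sanity-check these norms with numpy itself (a quick verification of the arithmetic above, assuming the same matrix):

```python
import numpy as np

x = np.array([[0., 3., 4.],
              [2., 6., 4.]])

# Row norms: sqrt(0^2+3^2+4^2) = 5 and sqrt(2^2+6^2+4^2) = sqrt(56).
norms = np.linalg.norm(x, axis=1, keepdims=True)
print(norms)  # [[5.], [sqrt(56) = 7.4833...]]
```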

Next we will implement a function that normalizes each row of the matrix x (to have unit length). After applying this function to an input matrix x, each row of x should be a vector of unit length:

```python
def normalizeRows(x):
    x_norm = np.linalg.norm(x, ord=2, axis=1, keepdims=True)
    x = x / x_norm
    return x
```

To test our function we'll call it with a simple array:

```python
x = np.array([[0, 3, 4],
              [1, 6, 4]])
print(normalizeRows(x))
```

We can try printing the shapes of x_norm and x. You'll find that they are different. This is expected, given that x_norm takes the norm of each row of x, so x_norm has the same number of rows but only one column. So how did the division of x by x_norm work? This is called broadcasting.
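Broadcasting stretches the (2, 1) column of norms across the columns of the (2, 3) matrix so the element-wise division works. A minimal sketch of the shapes involved:

```python
import numpy as np

x = np.array([[0., 3., 4.],
              [1., 6., 4.]])
x_norm = np.linalg.norm(x, axis=1, keepdims=True)

print(x.shape)       # (2, 3)
print(x_norm.shape)  # (2, 1)

# Broadcasting: the single column of x_norm is virtually replicated
# across the 3 columns of x during the division.
normalized = x / x_norm
print(normalized.shape)  # (2, 3)
```

After the division, every row of `normalized` has unit length, which is exactly what normalizeRows produces.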

##### Softmax function:

Now we will implement a softmax function using numpy. You can think of softmax as a normalizing function used when your algorithm needs to classify two or more classes. You will learn more about softmax in future tutorials.

Mathematically, the softmax function is defined as follows:
$$
\text{for } x \in \mathbb{R}^{1\times n} \text{, } softmax(x) = softmax(\begin{bmatrix}
x_1 &&
x_2 &&
... &&
x_n
\end{bmatrix}) =
$$
$$
= \begin{bmatrix}
\frac{e^{x_1}}{\sum_{j}e^{x_j}} &&
\frac{e^{x_2}}{\sum_{j}e^{x_j}} &&
... &&
\frac{e^{x_n}}{\sum_{j}e^{x_j}}
\end{bmatrix}
$$
$$\text{for a matrix } x \in \mathbb{R}^{m \times n} \text{, $x_{ij}$ maps to the element in the $i^{th}$ row and $j^{th}$ column of $x$, thus we have: }$$
$$softmax(x) = softmax\begin{bmatrix}
x_{11} & x_{12} & x_{13} & \dots & x_{1n} \\
x_{21} & x_{22} & x_{23} & \dots & x_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & x_{m3} & \dots & x_{mn}
\end{bmatrix} =
$$
$$
= \begin{bmatrix}
\frac{e^{x_{11}}}{\sum_{j}e^{x_{1j}}} & \frac{e^{x_{12}}}{\sum_{j}e^{x_{1j}}} & \frac{e^{x_{13}}}{\sum_{j}e^{x_{1j}}} & \dots & \frac{e^{x_{1n}}}{\sum_{j}e^{x_{1j}}} \\
\frac{e^{x_{21}}}{\sum_{j}e^{x_{2j}}} & \frac{e^{x_{22}}}{\sum_{j}e^{x_{2j}}} & \frac{e^{x_{23}}}{\sum_{j}e^{x_{2j}}} & \dots & \frac{e^{x_{2n}}}{\sum_{j}e^{x_{2j}}} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{e^{x_{m1}}}{\sum_{j}e^{x_{mj}}} & \frac{e^{x_{m2}}}{\sum_{j}e^{x_{mj}}} & \frac{e^{x_{m3}}}{\sum_{j}e^{x_{mj}}} & \dots & \frac{e^{x_{mn}}}{\sum_{j}e^{x_{mj}}}
\end{bmatrix} = \begin{pmatrix}
softmax\text{(first row of x)} \\
softmax\text{(second row of x)} \\
... \\
softmax\text{(last row of x)} \\
\end{pmatrix}
$$

We will create a softmax function that calculates the softmax for each row of the input x:

```python
def softmax(x):
    # Apply exp() element-wise to x.
    x_exp = np.exp(x)
    # Create a vector x_sum that sums each row of x_exp.
    x_sum = np.sum(x_exp, axis=1, keepdims=True)
    # Compute softmax(x) by dividing x_exp by x_sum; numpy broadcasting
    # handles the shape mismatch automatically.
    s = x_exp / x_sum
    return s
```

To test our function we'll call it with a simple array:

```python
x = np.array([[7, 4, 5, 1, 0],
              [4, 9, 1, 0, 5]])
print(softmax(x))
```

If we print the shapes of x_exp, x_sum, and s above, we'll see that x_sum has shape (2, 1) while x_exp and s have shape (2, 5). x_exp / x_sum works thanks to numpy broadcasting.
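One practical caveat worth knowing: for large inputs, np.exp can overflow to infinity. A common remedy (a variant sketch, not the implementation used in this tutorial) is to subtract each row's maximum before exponentiating; this leaves the result unchanged because softmax is invariant to shifting each row by a constant:

```python
import numpy as np

def softmax_stable(x):
    # Subtracting the row-wise max does not change the result
    # (the shift cancels in numerator and denominator),
    # but it keeps np.exp from overflowing on large values.
    shifted = x - np.max(x, axis=1, keepdims=True)
    x_exp = np.exp(shifted)
    return x_exp / np.sum(x_exp, axis=1, keepdims=True)

x = np.array([[1000., 1001., 1002.],
              [0., 1., 2.]])
print(softmax_stable(x))  # finite values; each row sums to 1
```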

We now have a pretty good understanding of the python numpy library and have implemented a few useful functions that we will use in future deep learning tutorials.

From this tutorial we must remember that np.exp(x) works for any np.array x and applies the exponential function to every element. The sigmoid function and its gradient, as well as image2vector, are commonly used functions in deep learning, and np.reshape is also widely used. In the future, you'll see that keeping your matrix and vector dimensions straight goes a long way toward eliminating bugs. You'll also see that numpy has efficient built-in mechanisms such as broadcasting, which is extremely useful in machine learning.

Up to this point we've learned a lot of nice things about the numpy library. In the next tutorial we'll learn about vectorization, and then we'll start coding our first gradient descent algorithm.