NumPy is an excellent package for numerical computation in Python, and as such it forms the foundation for building many machine learning and deep learning algorithms from scratch.

Knowledge of NumPy also proves useful when it comes to data manipulation and statistical modeling.

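In this guide we explore the core operations a data scientist can perform with the NumPy package in Python: creating and slicing arrays, computing summary statistics, sorting, concatenating, and analysing a real dataset.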

Let’s start by creating a random array of numbers with 6 rows and 8 columns. We can do this using the code shown below:

    # Import the required package
    import numpy as np

    # Create a 6 x 8 array of random integers between 0 and 99
    newarray = np.random.randint(100, size=(6, 8))

    # Display the array
    newarray

The output of the code above is illustrated below:

In the code above we first import the NumPy package and then use the np.random.randint() function to create an array of random integers between 0 and 99 (the upper bound of 100 is excluded). The ‘size’ argument specifies 6 rows and 8 columns, giving 48 values in total.
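Because the values are random, the array differs on every run. As a minimal sketch, assuming we want repeatable output, we can fix the seed first. The seed value 0 and the name `sample` are our own choices for illustration and are not part of the original example:

```python
import numpy as np

# Assumption: the seed 0 is arbitrary and only makes the run repeatable
np.random.seed(0)

# 6 x 8 array of integers drawn from 0 to 99 (100 itself is excluded)
sample = np.random.randint(100, size=(6, 8))

print(sample.shape)  # (6, 8)
print(sample.size)   # 48
```

The shape and size confirm that the first argument sets the range of the values, while ‘size’ sets how many values are drawn.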

Learning how to slice arrays and extract the values we want is a critical skill when it comes to NumPy. In the array illustrated above, let’s slice the contents of the 4th row to the 6th row and the 6th column to the 8th column. Because NumPy indexing starts at 0, the 4th row has index 3 and the 6th column has index 5. We can do this using the code shown below:

    newarray[3:, 5:]

This results in an output as illustrated below:

Now let’s select only the first row of data and all the columns. We can do that using the code shown below:

    newarray[:1, :]

This results in an output as shown below:

Let’s now select all rows but only the 2nd and 3rd columns of the array:

    newarray[:, 1:3]

This results in an output as shown below:

Let’s now select the 4th row only and the 1st to the 4th column. We can do this using the code shown below:

    newarray[3:4, :4]

This results in an output as shown below:
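Since the contents of the random array change on every run, the four slices above are easier to verify on an array with known values. The sketch below uses a hypothetical stand-in named `demo`, filled with the numbers 0 to 47, rather than the original `newarray`:

```python
import numpy as np

# Hypothetical stand-in for newarray: values 0..47 in 6 rows and 8 columns
demo = np.arange(48).reshape(6, 8)

rows_4_to_6_cols_6_to_8 = demo[3:, 5:]  # shape (3, 3)
first_row = demo[:1, :]                 # shape (1, 8), still two-dimensional
cols_2_and_3 = demo[:, 1:3]             # shape (6, 2)
row_4_cols_1_to_4 = demo[3:4, :4]       # shape (1, 4)

print(row_4_cols_1_to_4)  # [[24 25 26 27]]
```

Note that slicing with a range such as `3:4` keeps the result two-dimensional, whereas a plain index such as `demo[3, :4]` would return a one-dimensional array.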

We are now going to calculate column statistics such as the mean, standard deviation, minimum and maximum for each column of our NumPy array. We can do this using the code shown below:

    # Mean of each column
    newarray.mean(0)

Notice that the argument passed to .mean() is 0, which refers to axis 0 and gives one mean per column. If we replaced it with 1, we would get the mean of each row instead. The output of the code above is illustrated below:

Each element of the array gives you the mean of a column. For example, 24.5 is the mean of the 2nd column.
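To make the difference between the two axis values concrete, here is a small sketch on a hand-written 2 x 3 array. The name `stats_demo` and its values are illustrative only:

```python
import numpy as np

# Illustrative 2 x 3 array, small enough to check by hand
stats_demo = np.array([[1, 2, 3],
                       [4, 5, 6]])

col_means = stats_demo.mean(0)  # one value per column
row_means = stats_demo.mean(1)  # one value per row

print(col_means)  # [2.5 3.5 4.5]
print(row_means)  # [2. 5.]
```

The result for axis 0 has as many entries as there are columns, and the result for axis 1 has as many entries as there are rows.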

To compute all other summary statistics we would use the code shown below:

    # Standard deviation of each column
    newarray.std(0)
    # Minimum of each column
    newarray.min(0)
    # Maximum of each column
    newarray.max(0)

We can sort each column of the array in ascending order using the code shown below. Note that .sort() sorts the array in place, so newarray itself is modified:

    newarray.sort(0)
    newarray

This results in an output as shown below:

We can now see how each column is arranged from the lowest to the highest number.
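Two details of column-wise sorting are worth checking on a small example: each column is sorted independently of the others, and the method form changes the array itself. The sketch below, with the illustrative name `sort_demo`, compares it with `np.sort`, which returns a sorted copy instead:

```python
import numpy as np

sort_demo = np.array([[3, 7],
                      [1, 9],
                      [2, 8]])

# np.sort returns a new array and leaves sort_demo untouched
sorted_copy = np.sort(sort_demo, axis=0)
unchanged_first_row = sort_demo[0].tolist()  # still [3, 7]

# The method form sorts in place
sort_demo.sort(0)

print(sorted_copy.tolist())  # [[1, 7], [2, 8], [3, 9]]
```

Because the columns are sorted separately, the first row of the result, [1, 7], never existed as a row in the input. If the rows need to stay together, sorting each column this way is not the right tool.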

In the section below we are going to learn how to concatenate two arrays in different ways.

To do this, let’s create three arrays, “a”, “b” and “arr”. We start with “arr”, filling it row by row with a for loop as shown below:

    # Generate an array with 15 rows and 4 columns using a loop.
    # Create an uninitialized array with 15 rows and 4 columns
    arr = np.empty((15, 4))
    # Fill row i with the value i, for i from 0 to 14
    for i in range(15):
        arr[i] = i
    arr

This results in an array as shown below:

We create array “a” and “b” in a similar fashion:

    # Array a: a single row of four 15s
    a = np.array([[15., 15., 15., 15.]])
    a

    # Array b: 16 rows and 2 columns, with row i filled with the value i
    b = np.empty((16, 2))
    for i in range(16):
        b[i] = i
    b

This results in two arrays “a” and “b” respectively as shown below:

Let’s now concatenate the arrays “arr” and “a” along axis 0, which stacks them vertically. We can do this using the code shown below:

    np.concatenate((arr, a), axis=0)

The output of the code is illustrated below:

Notice how the array “a” has been appended to the array “arr” as its last row.

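Let’s now concatenate the array “b” along axis 1, which places arrays side by side. Both arrays must have the same number of rows, so we first store the 16-row result of joining “arr” and “a” as “arra” and then join it with “b”. We can do this using the code shown below:

    arra = np.concatenate((arr, a), axis=0)
    np.concatenate((arra, b), axis=1)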

This results in an output as shown below:

Notice how the array “b” has been added on the right-hand side, occupying the last two columns of the combined array.
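The shape requirements for the two kinds of concatenation can be checked with a short sketch. The arrays `top`, `extra_row` and `extra_cols` are hypothetical stand-ins that only mirror the shapes of “arr”, “a” and “b”:

```python
import numpy as np

# Hypothetical stand-ins with the same shapes as arr, a and b
top = np.zeros((15, 4))
extra_row = np.ones((1, 4))
extra_cols = np.full((16, 2), 2.0)

# axis=0 stacks vertically: the column counts must match (4 and 4)
stacked = np.concatenate((top, extra_row), axis=0)
print(stacked.shape)  # (16, 4)

# axis=1 joins side by side: the row counts must match (16 and 16)
widened = np.concatenate((stacked, extra_cols), axis=1)
print(widened.shape)  # (16, 6)

# Joining 15 rows with 16 rows along axis=1 is rejected by NumPy
try:
    np.concatenate((top, extra_cols), axis=1)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)  # True
```

This is why the 15-row array has to gain a 16th row before the two-column array can be attached to it.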

In the section below we will illustrate how we can use the NumPy package to work on a real dataset. For the purpose of this example we are going to be working with the popular Titanic Dataset found on Kaggle.

    import csv

    # Read in the gender data
    with open("gender.csv", "r") as gender_fd:
        csv_data = csv.reader(gender_fd)
        gender = list(csv_data)

    # Convert the list to an array
    gender_array = np.array(gender)

    # Check the shape
    gender_array.shape

In the code above we have used the csv module in Python to read the “gender.csv” file. We then convert the resulting list of rows into an array and name it gender_array. Using gender_array.shape we can see how many rows and columns the file has. This output is illustrated below:

Thus we see that the file has 1 column and 891 rows.
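To see why a single-column file produces a two-dimensional array, the sketch below runs the same steps on an in-memory stand-in for the file. The three sample values and the name `fake_file` are made up for illustration and are not taken from the Titanic data:

```python
import csv
import io

import numpy as np

# Hypothetical stand-in for gender.csv: one value per line, no header row
fake_file = io.StringIO("male\nfemale\nfemale\n")

# csv.reader yields one list per line, so each row is a one-element list
rows = list(csv.reader(fake_file))
demo_array = np.array(rows)

print(rows)              # [['male'], ['female'], ['female']]
print(demo_array.shape)  # (3, 1)
```

Each line becomes a list with one element, so the array has one row per line and a single column. The values are stored as strings, which matters later when we convert the survival flags to integers.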

To count the male and female passengers on board the Titanic, we use the code shown below:

    np.unique(gender_array, return_counts=True)

This returns an output as shown below:
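With return_counts=True, np.unique returns two arrays: the distinct values in sorted order and the number of times each occurs. A small sketch on made-up data (the five values in `genders` are illustrative, not the real passenger list) shows how the two arrays pair up:

```python
import numpy as np

# Made-up sample in the same (n, 1) layout as gender_array
genders = np.array([["male"], ["female"], ["female"], ["male"], ["female"]])

values, counts = np.unique(genders, return_counts=True)

# Pair each distinct value with its count
counts_by_value = dict(zip(values.tolist(), counts.tolist()))
print(counts_by_value)  # {'female': 3, 'male': 2}
```

Pairing the values with their counts avoids having to remember which position belongs to which label.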

Next, let’s read in the data of the passengers who survived the Titanic disaster and convert it into an array:

    # Read in the survived data
    with open("survived.csv", "r") as survived_fd:
        csv_data = csv.reader(survived_fd)
        survived = list(csv_data)

    # Convert the list to an array
    survived_array = np.array(survived)

    # Check the shape
    survived_array.shape

The output of the code shows us there is one column and 891 rows.

Since the survived data is a single binary column, with ‘1’ indicating that the passenger survived and ‘0’ indicating that the passenger did not, we can count the passengers who survived using the code shown below. np.unique returns the distinct values in sorted order, so counts[1] is the count for ‘1’:

    unique, counts = np.unique(survived_array, return_counts=True)

    print("The number of survivors is: " + str(counts[1]))

This returns an output as shown below:

We can combine arrays to extract more meaningful insights. For example, by using the “gender_array” we defined earlier as a filter on the “survived_array”, we can extract the total number of female passengers who survived using the code shown below:

    # Select the survival flags of the female passengers and convert them to integers
    female_survival = survived_array[gender_array == 'female'].astype('int64')

    # Count the female survivors
    uniquef, countsf = np.unique(female_survival, return_counts=True)

    print("The number of female survivors is: " + str(countsf[1]))

The output of the code is illustrated below:
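The filtering step relies on a boolean mask: comparing the gender array with 'female' produces an array of True and False values, and indexing the survived array with it keeps only the matching entries. The sketch below walks through this on made-up data. The arrays `gender_demo` and `survived_demo` are illustrative, and summing the flags is our own alternative to reading the second count from np.unique:

```python
import numpy as np

# Made-up data in the same (n, 1) string layout as the arrays above
gender_demo = np.array([["male"], ["female"], ["female"], ["male"], ["female"]])
survived_demo = np.array([["0"], ["1"], ["0"], ["1"], ["1"]])

# Boolean mask with the same shape as gender_demo
is_female = gender_demo == "female"

# Indexing with the mask returns a 1-D array of the matching entries
female_flags = survived_demo[is_female].astype("int64")

# Summing the 0/1 flags counts the survivors directly
female_survivors = int(female_flags.sum())
print(female_survivors)  # 2
```

Summing the flags also works when no one in the group survived, whereas indexing the second count from np.unique would fail in that case because only one distinct value would be present.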

This guide serves as an introduction to NumPy and how it can be applied to extract useful insights and perform mathematical calculations.

Happy NumPy!