The Data Scientist’s guide to SQL

0
The Data Scientist’s guide to SQL

A data scientist will often have to work with large volumes of data and such data can be extracted from databases using the Structured Query Language or SQL. Therefore, it is vital to know how to effectively use the various tools that SQL offers and how you can use them to your advantage when you want to SELECT the right data for the job at hand.

SQL is generally used to extract data which are contained within relational databases. A relational database is a collection of tables. A table is simply a data frame with rows and columns.

The first command that you would need to know is:

This would return the names of all the tables present in the Relational Database (RD).

For the purpose of this guide let’s assume we have two tables:

Table 1 is called Football and has 3 columns called Player, Goals and Matches.

Table 2 is called Baseball and has 3 columns called Player, Average and Matches.

If we wanted to display all the rows and columns of Table 1 we would use the code shown below:

if we wanted to select the Player column from the table Football we would use the code shown below:

Relational databases usually contain tables that contains thousands if not millions of rows. If we want to only view the column names and the first ten rows of a table we would use the code shown below:

Sometimes a column might have many repeating values. For example, the Matches column in the Football table might have 10 players who have played 100 games and another 15 players who have played 98 games. So if we want to filter out the distinct or unique values from a column we would use the code shown below:

If we wanted to find out the total number of players in the Baseball table we would use the COUNT() function as shown below:

This will create a new column in the Baseball table called COUNT with one row containing the numerical value of the total number of players. We can customize the above query to create a custom column name called Total_players so that it makes more sense to us:

If we wanted to count the distinct number of matches played by players in the Baseball table we would use the code shown below:

If we wanted know the names of the players as well as the goals scored by the players who have played 100 games alone we would have structure our query with a WHERE statement. The WHERE statement is used to assign a condition as shown below:

But what if we wanted to select the names of the players who played a 100 games and scored more than 50 goals? We would have to use the AND statement as shown below:

Maybe we want the names of the players who played a 100 matches or who scored more than 50 goals. In this case we make use of the OR statement as shown below:

When combining AND and OR statements in a single query we need to be careful to use parenthesis so that we don’t confuse the database. Let’s take a look at the query below:

When we want to select a range of values such as the names of all the players who scored between 50 to 100 goals we can use the BETWEEN statement:

If we wanted select the names of the players who had played 50 or 54 or 56 or 60, or 100 matches from the baseball table we would need a lot of OR statements. We can avoid the multiple OR statements with a simple IN statement as shown below:

In SQL missing values are represented by a NULL. If we wanted to write a query to count the number of Players that have a NULL in the Football table we would use the code shown below:

Conversely we can count the number of Players where the Players is not a NULL value:

If we wanted to search for names that start with “Dav” like Dave or David we would use the LIKE statement as follows:

We could also fill in blanks with queries and get SQL to return a word that matches the blank as shown below

This query would return names like David, Daved and so on.

Aggregate functions return calculations of columns in SQL. Example of aggregate functions are SUM(), AVG(), MIN() and MAX()

Would return a column called SUM having the total number of goals scored by all the players in the football table.

AVG() would return the average of all the goals, while MIN() would return the lowest number of goals scored by a player and MAX() would return the largest number of goals scored by a player.

You can use SQL to perform simple mathematical calculations:

Would return 2 instead of 2.5 because SQL will only recognize that expression as a decimal. If we wanted 2.5 we would:

If we wanted a query to order our final results by the number of goals scored by the players in ascending order we would:

Alternatively we can order it by descending order using the code shown below:

We could also ORDER the players by multiple columns. For example we can order it first by the number of Goals they scored and then by the number of matches they played.

The GROUP BY statement is very useful for grouping unique entities and displaying the counts of them. For example we can use the GROUP BY statement to find out how many players scored 100 goals, 99 goals, 98 goals and so on using the code shown below:

One of the main drawbacks of the WHERE statement is that we cannot use functions like COUNT() or SUM() after the WHERE statement. We can do this however with the HAVING statement which is a conditional statement like WHERE but we can use functions after the statement.

If we wanted to write a query to find the number of players who scored a specific number of goal number like 100, 99, 98 using the GROUP BY but we only want Goals greater than 50 we can do so using the code shown below:

This guide is useful because it has now given you a fundamental understanding of how you can use SQL to extract data from Databases.

In a more advanced guide we will talk about how we can use JOINS to Join multiple tables and extract useful information from them.

Happy Querying!

LEAVE A REPLY