SQL 101: Introduction to SQL for Data Analysis

Mastering the Basics: Harnessing the Power of Structured Query Language for Analysis

SQL is like a paintbrush for data analysis. It is an essential tool that every data analyst should have in their arsenal.

What is SQL?

SQL (Structured Query Language) is a programming language used in relational databases to manage and manipulate data.

It allows the user to interact with databases through a set of commands, or statements. It is used to create tables; insert, update, and delete records; and query the database to extract information.

SQL has also become a widely used language in data analysis and data science, because it provides a powerful set of tools for manipulating and analyzing large datasets.

How is SQL used for Data Analysis?

1. Data Extraction

One of the primary responsibilities of a Data Analyst is to extract and analyze data from various sources. SQL comes in handy to achieve this.

With SQL queries, data can be retrieved from one or more databases. This process is known as data extraction.

Here are the steps for data extraction in SQL:

Log in to the MySQL server using a user account:

Image description

Start by creating the database:

Image description

After that, select the newly created database in order to use it:

Image description

Then, create a table within the Bookstore database:

Image description

Next, input data into the table:

Example 1: Populate one record:

Image description

Example 2: Populate multiple records:

Image description

To extract data, use the SELECT statement:

Image description

Output:

Image description
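Since the screenshots do not carry over as text, here is a minimal sketch of the same steps. It assumes Python's built-in sqlite3 module as a stand-in for the MySQL server, so the login, CREATE DATABASE and USE steps collapse into opening an in-memory database. The Bookstores table, its columns and the sample rows are hypothetical, chosen only for illustration.

```python
import sqlite3

# An in-memory SQLite database stands in for the MySQL "Bookstore" database.
# In MySQL the equivalent steps would be: CREATE DATABASE Bookstore; USE Bookstore;
conn = sqlite3.connect(":memory:")

# Hypothetical table and column names, for illustration only.
conn.execute("""
    CREATE TABLE Bookstores (
        store_id   INTEGER PRIMARY KEY,
        store_name TEXT NOT NULL,
        county     TEXT NOT NULL
    )
""")

# Example 1: populate one record.
conn.execute(
    "INSERT INTO Bookstores (store_id, store_name, county) VALUES (?, ?, ?)",
    (1, "Bookworms Haven", "Kisumu"),
)

# Example 2: populate multiple records in one call (made-up sample rows).
conn.executemany(
    "INSERT INTO Bookstores (store_id, store_name, county) VALUES (?, ?, ?)",
    [(2, "Page Turners", "Nairobi"), (3, "Readers Corner", "Mombasa")],
)
conn.commit()

# Extract the data with a SELECT statement.
rows = conn.execute(
    "SELECT store_id, store_name, county FROM Bookstores ORDER BY store_id"
).fetchall()
for row in rows:
    print(row)

conn.close()
```

The SQL statements themselves (CREATE TABLE, INSERT, SELECT) are written so that they should carry over to MySQL with little or no change, although column types may need adjusting.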

In this case, our scenario is a Bookstore database. Additional tables, namely Books, Stock and Categories, were also created.

Below is an example of creating a table that has a relationship with another.

Books Table:

Image description

Stock Table:

Image description
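A relationship between two tables is usually expressed with a foreign key. The sketch below is an assumption about what such a pair of tables might look like, not the exact schema from the screenshots: the column names are hypothetical, and SQLite (via Python's sqlite3 module) stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical columns: each Stock row points at one row in Books.
conn.executescript("""
    CREATE TABLE Books (
        book_id INTEGER PRIMARY KEY,
        title   TEXT NOT NULL,
        author  TEXT NOT NULL
    );
    CREATE TABLE Stock (
        stock_id INTEGER PRIMARY KEY,
        book_id  INTEGER NOT NULL,
        copies   INTEGER NOT NULL,
        FOREIGN KEY (book_id) REFERENCES Books (book_id)
    );
""")

conn.execute("INSERT INTO Books VALUES (1, 'The Notebook', 'Nicholas Sparks')")
conn.execute("INSERT INTO Stock VALUES (1, 1, 12)")  # made-up copy count

# A stock entry for a book that does not exist violates the relationship.
try:
    conn.execute("INSERT INTO Stock VALUES (2, 99, 5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stock_count = conn.execute("SELECT COUNT(*) FROM Stock").fetchone()[0]
print(rejected, stock_count)
conn.close()
```

The foreign key is what later makes it safe to join Stock to Books on book_id.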

This is a database entity relationship diagram that will be useful while analyzing the data:

Image description

2. Joins

SQL joins are a powerful tool for data analysts because they allow them to combine data from multiple tables into a single result set. This is crucial because data is often spread across multiple tables, and combining this data is necessary to answer complex business questions.

There are several types of joins that can be used in SQL:

  • INNER JOIN

  • LEFT JOIN

  • RIGHT JOIN

  • FULL OUTER JOIN

Below are Venn diagrams that provide a visual illustration of SQL joins.

Image description

As data analysts, we will try to answer a business question. Each type of SQL join will give us a different result.

Business Question: Which book categories are available for purchase in our database?

1. Inner Join

Also written simply as JOIN, it returns only the rows that have matching values in both tables being joined.

Image description

Results:

Image description
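As a rough sketch of an inner join answering the business question, assume hypothetical Categories and Books tables linked by category_id, with made-up sample rows, running on SQLite through Python's sqlite3 module rather than MySQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and made-up sample rows, for illustration only.
conn.executescript("""
    CREATE TABLE Categories (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE Books (book_id INTEGER PRIMARY KEY, title TEXT, category_id INTEGER);
    INSERT INTO Categories VALUES (1, 'Romance'), (2, 'Thriller'), (3, 'Poetry');
    INSERT INTO Books VALUES
        (1, 'The Notebook', 1), (2, 'A Walk to Remember', 1), (3, 'Gone Girl', 2);
""")

# Only categories with at least one matching book survive the INNER JOIN,
# so 'Poetry' (no books) is left out.
inner_rows = conn.execute("""
    SELECT DISTINCT c.category_name
    FROM Categories AS c
    INNER JOIN Books AS b ON b.category_id = c.category_id
    ORDER BY c.category_name
""").fetchall()
print(inner_rows)  # [('Romance',), ('Thriller',)]
conn.close()
```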

2. Left Join

It returns all rows from the left table (table1) together with the matched rows from the right table (table2). Where no match is found, the right table's columns are returned as NULL.

Image description

Results:

Image description
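A left join on the same kind of data might look like the sketch below. The Categories and Books tables, their columns and the rows are assumptions for illustration, and SQLite (through Python's sqlite3 module) stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and made-up sample rows, for illustration only.
conn.executescript("""
    CREATE TABLE Categories (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE Books (book_id INTEGER PRIMARY KEY, title TEXT, category_id INTEGER);
    INSERT INTO Categories VALUES (1, 'Romance'), (2, 'Thriller'), (3, 'Poetry');
    INSERT INTO Books VALUES
        (1, 'The Notebook', 1), (2, 'A Walk to Remember', 1), (3, 'Gone Girl', 2);
""")

# Every category is kept; a category without books gets NULL (None) as title.
left_rows = conn.execute("""
    SELECT c.category_name, b.title
    FROM Categories AS c
    LEFT JOIN Books AS b ON b.category_id = c.category_id
    ORDER BY c.category_id, b.book_id
""").fetchall()
for row in left_rows:
    print(row)
conn.close()
```

Here the 'Poetry' row comes back with None in the title column, which is how Python's sqlite3 module represents SQL NULL.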

3. Right Join

It returns all rows from the right table (table2) together with the matched rows from the left table (table1). Where there is no match, the left table's columns are returned as NULL.

Image description

Results:

Image description

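4. Full Outer Join

It returns all rows from both tables, with NULL values filling the columns of any unmatched rows. Note that MySQL does not support the FULL OUTER JOIN keyword directly; the same result is usually obtained by combining a LEFT JOIN and a RIGHT JOIN with UNION.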

Image description

Results:

Image description
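The right join and full outer join can be sketched as follows, with two swapped-in techniques stated plainly. The sketch runs on SQLite through Python's sqlite3 module, and older SQLite versions lack RIGHT JOIN and FULL OUTER JOIN, so the right join is written as a LEFT JOIN with the tables swapped, and the full outer join as a UNION of two LEFT JOINs. The tables, columns and rows (including the uncategorised book 'Field Notes') are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and made-up sample rows, for illustration only.
conn.executescript("""
    CREATE TABLE Categories (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE Books (book_id INTEGER PRIMARY KEY, title TEXT, category_id INTEGER);
    INSERT INTO Categories VALUES (1, 'Romance'), (2, 'Thriller'), (3, 'Poetry');
    INSERT INTO Books VALUES
        (1, 'The Notebook', 1), (2, 'Gone Girl', 2), (3, 'Field Notes', NULL);
""")

# "Categories RIGHT JOIN Books" expressed as "Books LEFT JOIN Categories":
# every book is kept, even one without a category.
right_rows = conn.execute("""
    SELECT c.category_name, b.title
    FROM Books AS b
    LEFT JOIN Categories AS c ON b.category_id = c.category_id
    ORDER BY b.book_id
""").fetchall()

# Full outer join emulated as the UNION of both directions.
# UNION also removes the matched rows that appear in both halves.
full_rows = conn.execute("""
    SELECT c.category_name, b.title
    FROM Categories AS c
    LEFT JOIN Books AS b ON b.category_id = c.category_id
    UNION
    SELECT c.category_name, b.title
    FROM Books AS b
    LEFT JOIN Categories AS c ON b.category_id = c.category_id
""").fetchall()

print(right_rows)
print(sorted(full_rows, key=repr))
conn.close()
```

One caveat of the UNION approach is that it also collapses genuinely duplicated rows, which may or may not matter for a given analysis.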

3. Data Filtering

This is an important technique for data analysis that allows a subset of data to be selected based on certain criteria. SQL uses the WHERE clause to filter data.

Case 1: Filtering by a single condition.

Business question: Which Nicholas Sparks novels are available for purchase?

Image description

Results:

Image description

Business question: Which bookshops are located in Kisumu county?

Image description

Result:

Image description
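Both single-condition questions come down to one comparison in the WHERE clause. The following is a sketch under assumptions: the Books and Bookstores tables, their columns, and most of the sample rows are hypothetical, and SQLite (through Python's sqlite3 module) stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and made-up sample rows, for illustration only.
conn.executescript("""
    CREATE TABLE Books (book_id INTEGER PRIMARY KEY, title TEXT, author TEXT);
    CREATE TABLE Bookstores (store_id INTEGER PRIMARY KEY, store_name TEXT, county TEXT);
    INSERT INTO Books VALUES
        (1, 'The Notebook', 'Nicholas Sparks'),
        (2, 'A Walk to Remember', 'Nicholas Sparks'),
        (3, 'Gone Girl', 'Gillian Flynn');
    INSERT INTO Bookstores VALUES
        (1, 'Bookworms Haven', 'Kisumu'),
        (2, 'Lakeside Books', 'Kisumu'),
        (3, 'Page Turners', 'Nairobi');
""")

# Which Nicholas Sparks novels are available?
sparks_titles = conn.execute("""
    SELECT title FROM Books
    WHERE author = 'Nicholas Sparks'
    ORDER BY title
""").fetchall()

# Which bookshops are located in Kisumu county?
kisumu_stores = conn.execute("""
    SELECT store_name FROM Bookstores
    WHERE county = 'Kisumu'
    ORDER BY store_name
""").fetchall()

print(sparks_titles)
print(kisumu_stores)
conn.close()
```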

Case 2: Filtering by multiple conditions.

Business question: Is there a bookstore called Bookworms Haven in Kisumu County?

Image description

Result:

Image description
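Multiple conditions are combined with AND (all must hold) or OR (any may hold). A minimal sketch, assuming a hypothetical Bookstores table with made-up rows and using SQLite through Python's sqlite3 module in place of MySQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and made-up sample rows, for illustration only.
conn.executescript("""
    CREATE TABLE Bookstores (store_id INTEGER PRIMARY KEY, store_name TEXT, county TEXT);
    INSERT INTO Bookstores VALUES
        (1, 'Bookworms Haven', 'Kisumu'),
        (2, 'Bookworms Haven', 'Nairobi'),
        (3, 'Lakeside Books', 'Kisumu');
""")

# Both conditions must be true for a row to be returned.
matches = conn.execute("""
    SELECT store_id, store_name, county
    FROM Bookstores
    WHERE store_name = 'Bookworms Haven' AND county = 'Kisumu'
""").fetchall()
print(matches)  # [(1, 'Bookworms Haven', 'Kisumu')]
conn.close()
```

Neither the Nairobi branch nor the other Kisumu shop is returned, because each satisfies only one of the two conditions.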

4. Data Aggregation

This is the process of summarizing and grouping data to obtain useful insights and metrics. It is a common task in data analysis and is often used to generate reports, perform statistical analysis, and identify trends.

In SQL, this is achieved with aggregate functions. Here are some common functions used for aggregation:

  • SUM() Function

  • AVG() Function

  • MIN() Function

  • MAX() Function

  • COUNT() Function

SUM() Function

It returns the total sum of a numeric column.

Business question: How many book copies are in stock?

Image description

Results:

Image description

AVG() Function

It calculates the average of a numeric column.

Business question: What is the average number of books in stock?

Image description

Results:

Image description

MIN() Function

It returns the lowest (smallest) value of a numerical column.

Business question: What is the minimum number of book copies available in a bookstore?

Image description

Results:

Image description

MAX() Function

It returns the largest value of a numerical column.

Business question: What is the maximum number of book copies available in a bookstore?

Image description

Results:

Image description

COUNT() Function

It counts the number of rows in a table or the number of non-null values in a column.

Business question: What is the count of the stock entries?

Image description

Results:

Image description
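The five aggregate functions can be sketched in a single query. This assumes a hypothetical Stock table with a copies column and made-up numbers, running on SQLite through Python's sqlite3 module instead of MySQL. One row deliberately has a NULL copy count to show the difference between COUNT(*) and COUNT(column).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and made-up sample rows, for illustration only.
conn.executescript("""
    CREATE TABLE Stock (stock_id INTEGER PRIMARY KEY, book_id INTEGER, copies INTEGER);
    INSERT INTO Stock VALUES
        (1, 1, 12), (2, 2, 7), (3, 3, 20), (4, 4, 5), (5, 5, NULL);
""")

# Aggregate functions skip NULLs; only COUNT(*) counts every row.
total, average, lowest, highest, row_count, known_count = conn.execute("""
    SELECT SUM(copies), AVG(copies), MIN(copies), MAX(copies),
           COUNT(*), COUNT(copies)
    FROM Stock
""").fetchone()

print(total, average, lowest, highest, row_count, known_count)
# 44 11.0 5 20 5 4
conn.close()
```

Note that the average is taken over the four known values only, which is why it comes out as 11.0 rather than 8.8.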

5. Data Transformation

This is the process of converting data from one format or structure to another to make it more suitable for analysis. In SQL, data transformation can be achieved using various techniques such as filtering, aggregating, joining, and grouping.

Business question: How many different book titles are there in each book category?

Image description

Result:

Image description
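One plausible way to answer this question is to join the tables, group by category and count the titles in each group. The sketch below assumes hypothetical Categories and Books tables with made-up rows, and uses SQLite through Python's sqlite3 module as a stand-in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and made-up sample rows, for illustration only.
conn.executescript("""
    CREATE TABLE Categories (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE Books (book_id INTEGER PRIMARY KEY, title TEXT, category_id INTEGER);
    INSERT INTO Categories VALUES (1, 'Romance'), (2, 'Thriller'), (3, 'Poetry');
    INSERT INTO Books VALUES
        (1, 'The Notebook', 1), (2, 'A Walk to Remember', 1), (3, 'Gone Girl', 2);
""")

# LEFT JOIN keeps categories with no books; COUNT(b.book_id) counts only
# matched books, so an empty category reports 0 instead of disappearing.
titles_per_category = conn.execute("""
    SELECT c.category_name, COUNT(b.book_id) AS title_count
    FROM Categories AS c
    LEFT JOIN Books AS b ON b.category_id = c.category_id
    GROUP BY c.category_id, c.category_name
    ORDER BY c.category_name
""").fetchall()
print(titles_per_category)
# [('Poetry', 0), ('Romance', 2), ('Thriller', 1)]
conn.close()
```

Using an INNER JOIN instead would drop the empty 'Poetry' category from the report, which is a design choice worth making deliberately.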

6. Data Cleaning

Data cleaning is an important step in preparing data for analysis in SQL. Here are some common techniques used for data cleaning in SQL:

1. Removing duplicates

Image description

This query filters duplicate rows out of the result set, returning only unique rows based on the columns specified in the SELECT statement. The table itself is not modified.

Result:

Image description
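A common way to do this is SELECT DISTINCT, which is assumed in the sketch below; the original query is not available as text. The Books table and its rows are hypothetical, and SQLite (through Python's sqlite3 module) stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; the second row is a deliberate duplicate entry.
conn.executescript("""
    CREATE TABLE Books (book_id INTEGER PRIMARY KEY, title TEXT, author TEXT);
    INSERT INTO Books VALUES
        (1, 'The Notebook', 'Nicholas Sparks'),
        (2, 'The Notebook', 'Nicholas Sparks'),
        (3, 'Gone Girl', 'Gillian Flynn');
""")

# DISTINCT applies to the combination of the selected columns.
unique_books = conn.execute("""
    SELECT DISTINCT title, author
    FROM Books
    ORDER BY title
""").fetchall()

# The underlying table still holds all three rows.
stored_rows = conn.execute("SELECT COUNT(*) FROM Books").fetchone()[0]

print(unique_books, stored_rows)
conn.close()
```

To delete the duplicates permanently, a DELETE statement would be needed; DISTINCT only affects what the query returns.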

2. Handling missing values

Image description

This query will return only the rows that do not contain null values in the specified columns. One can also replace null values with a default value or with values from other rows or columns.

Result:

Image description
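Both approaches can be sketched briefly: IS NOT NULL to exclude rows with missing values, and COALESCE to substitute a default. The Stock table, its columns, the rows and the default of 0 are assumptions for illustration, and SQLite (through Python's sqlite3 module) stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and made-up rows; stock entry 2 has no copy count.
conn.executescript("""
    CREATE TABLE Stock (stock_id INTEGER PRIMARY KEY, book_id INTEGER, copies INTEGER);
    INSERT INTO Stock VALUES (1, 1, 12), (2, 2, NULL), (3, 3, 7);
""")

# Option 1: keep only rows where the value is present.
# NULL must be tested with IS NULL / IS NOT NULL, not with = or <>.
complete_rows = conn.execute("""
    SELECT stock_id, copies
    FROM Stock
    WHERE copies IS NOT NULL
    ORDER BY stock_id
""").fetchall()

# Option 2: keep every row and replace NULL with a default of 0.
filled_rows = conn.execute("""
    SELECT stock_id, COALESCE(copies, 0) AS copies
    FROM Stock
    ORDER BY stock_id
""").fetchall()

print(complete_rows)  # [(1, 12), (3, 7)]
print(filled_rows)    # [(1, 12), (2, 0), (3, 7)]
conn.close()
```

Whether 0 is a sensible default depends on what a missing count means in the data; treating "unknown" as "none in stock" can distort sums and averages.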

In conclusion, SQL lets data analysts retrieve, transform, and analyze large datasets stored in relational databases, which makes it a must-have tool for any data analyst.