SQL 101: Introduction to SQL for Data Analysis
Mastering the Basics: Harnessing the Power of Structured Query Language for Analysis
SQL is like a paintbrush for data analysis. It is an essential tool that every data analyst should have in their arsenal.
What is SQL?
SQL (Structured Query Language) is a programming language used in relational databases to manage and manipulate data.
It allows the user to interact with a database through a set of commands, or statements. These statements are used to create tables; insert, update, and delete records; and query the database to extract information.
SQL has also become a widely used language in data analysis and data science, because it provides a powerful set of tools for manipulating and analyzing large datasets.
How is SQL used for Data Analysis?
1. Data Extraction
One of the primary responsibilities of a Data Analyst is to extract and analyze data from various sources. SQL comes in handy to achieve this.
With SQL queries, data can be retrieved from one or more databases. This process is known as data extraction.
Here are the steps for data extraction in SQL:
Log in to the MySQL Server using a user account
Start by creating the database:
After that, select the newly created database so that it can be used:
Then, create a table within the Bookstore database:
Next, input data into the table:
Example 1: Populate one record:
Example 2: Populate multiple records:
To extract data, use the SELECT statement:
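The steps above can be sketched end to end as follows. This is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for the MySQL server so that it runs without any setup, and the `bookshops` table, its columns, and the sample rows (apart from Bookworms Haven in Kisumu) are assumptions made for illustration. On MySQL itself, the first two steps would be `CREATE DATABASE Bookstore;` followed by `USE Bookstore;`.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the Bookstore
# database on MySQL. Connecting both creates and selects it.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table (hypothetical name and columns).
cur.execute("""
    CREATE TABLE bookshops (
        shop_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        county  TEXT NOT NULL
    )
""")

# Example 1: populate one record.
cur.execute("""
    INSERT INTO bookshops (shop_id, name, county)
    VALUES (1, 'Bookworms Haven', 'Kisumu')
""")

# Example 2: populate multiple records with a single INSERT.
cur.execute("""
    INSERT INTO bookshops (shop_id, name, county) VALUES
        (2, 'Page Turners', 'Nairobi'),
        (3, 'Lakeside Reads', 'Kisumu')
""")
conn.commit()

# Extract the data with SELECT.
rows = cur.execute(
    "SELECT shop_id, name, county FROM bookshops ORDER BY shop_id"
).fetchall()
for row in rows:
    print(row)
```

The multi-row `INSERT` and the `SELECT` are written the same way in MySQL; only the connection step differs.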
Output:
In this case, our scenario is a Bookstore database. In addition, three more tables were created: Books, Stock, and Categories.
Below is an example of creating a table that has a relationship with another.
Books Table:
Stock Table:
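A relationship between the two tables is usually expressed with a foreign key. The sketch below is one possible shape for it, not the exact schema of our Bookstore database: the column names are hypothetical, and sqlite3 is again used as a stand-in for MySQL. The `FOREIGN KEY ... REFERENCES` clause is the part that links Stock to Books.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Parent table (hypothetical columns).
conn.execute("""
    CREATE TABLE books (
        book_id INTEGER PRIMARY KEY,
        title   TEXT NOT NULL,
        author  TEXT NOT NULL
    )
""")

# Child table: every stock row must point at an existing book.
conn.execute("""
    CREATE TABLE stock (
        stock_id INTEGER PRIMARY KEY,
        book_id  INTEGER NOT NULL,
        copies   INTEGER NOT NULL,
        FOREIGN KEY (book_id) REFERENCES books (book_id)
    )
""")

conn.execute("INSERT INTO books VALUES (1, 'The Notebook', 'Nicholas Sparks')")
conn.execute("INSERT INTO stock VALUES (1, 1, 12)")  # valid: book 1 exists

try:
    conn.execute("INSERT INTO stock VALUES (2, 99, 5)")  # book 99 does not exist
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print("orphan stock row rejected:", rejected)
```

Because of the foreign key, a stock entry for a book that is not in the Books table is refused instead of silently stored.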
This is a database entity relationship diagram that will be useful while analyzing the data:
2. Joins
SQL joins are a powerful tool for data analysts because they allow them to combine data from multiple tables into a single result set. This is crucial because data is often spread across multiple tables, and combining this data is necessary to answer complex business questions.
There are several joins that can be used in SQL:
INNER JOIN
LEFT JOIN
RIGHT JOIN
FULL OUTER JOIN
Below are Venn diagrams that provide a visual illustration of SQL joins.
As data analysts, we will try to answer a business question. Each type of join gives a different result for the same question.
Business Question: Which book categories are available for purchase in our database?
1. Inner Join
Also written simply as JOIN, it returns only the rows that have matching values in both tables being joined.
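A minimal sketch of an inner join for this question, assuming hypothetical `categories` and `books` tables linked by `category_id`, with invented sample rows and sqlite3 standing in for MySQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (category_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT, category_id INTEGER);
    INSERT INTO categories VALUES (1, 'Romance'), (2, 'Poetry');
    INSERT INTO books VALUES (1, 'The Notebook', 1), (2, 'Untitled Draft', NULL);
""")

# Only categories that have at least one matching book are returned.
inner_rows = conn.execute("""
    SELECT c.name, b.title
    FROM categories AS c
    INNER JOIN books AS b ON b.category_id = c.category_id
    ORDER BY c.name
""").fetchall()
print(inner_rows)
```

Poetry has no books and the draft has no category, so neither appears in the result.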
Results:
2. Left Join
It returns all rows from the left table (table1) along with the matched rows from the right table (table2). Where no match is found, the right table's columns are returned as NULL.
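The same hypothetical tables show how a left join differs. In this sketch (sqlite3 as a stand-in for MySQL, invented rows), `categories` is the left table, so every category is kept even when it has no book:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (category_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT, category_id INTEGER);
    INSERT INTO categories VALUES (1, 'Romance'), (2, 'Poetry');
    INSERT INTO books VALUES (1, 'The Notebook', 1), (2, 'Untitled Draft', NULL);
""")

# Every category is returned; unmatched ones get NULL (None in Python) for the title.
left_rows = conn.execute("""
    SELECT c.name, b.title
    FROM categories AS c
    LEFT JOIN books AS b ON b.category_id = c.category_id
    ORDER BY c.name
""").fetchall()
print(left_rows)
```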
Results:
3. Right Join
It returns all rows from the right table (table2) along with the matched rows from the left table (table1). Where there is no match, the left table's columns are returned as NULL.
Results:
4. Full Outer Join
It returns all rows from both tables, with NULL in the columns of whichever side has no match.
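Not every engine accepts these two keywords: MySQL has no FULL OUTER JOIN, and older SQLite versions lack RIGHT JOIN as well. The sketch below therefore uses equivalent forms rather than the keywords themselves. A right join is written as a left join with the table order swapped, and a full outer join is emulated as the UNION of the two left joins. The tables and rows are hypothetical, and sqlite3 stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (category_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT, category_id INTEGER);
    INSERT INTO categories VALUES (1, 'Romance'), (2, 'Poetry');
    INSERT INTO books VALUES (1, 'The Notebook', 1), (2, 'Untitled Draft', NULL);
""")

# Equivalent to: categories RIGHT JOIN books (every book is kept).
right_rows = conn.execute("""
    SELECT c.name, b.title
    FROM books AS b
    LEFT JOIN categories AS c ON b.category_id = c.category_id
    ORDER BY b.book_id
""").fetchall()

# Emulated full outer join: all categories plus all books.
# UNION drops the matched rows that both halves return.
full_rows = conn.execute("""
    SELECT c.name, b.title
    FROM categories AS c
    LEFT JOIN books AS b ON b.category_id = c.category_id
    UNION
    SELECT c.name, b.title
    FROM books AS b
    LEFT JOIN categories AS c ON b.category_id = c.category_id
""").fetchall()

print(right_rows)
print(full_rows)
```

On an engine that supports the keywords directly, `RIGHT JOIN` and `FULL OUTER JOIN` can be written in place of these workarounds.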
Results:
3. Data Filtering
Data filtering is an important technique for data analysis that selects a subset of the data based on certain criteria. SQL uses the WHERE clause to filter data.
Case 1: Filtering by a single condition.
Business question: Which Nicholas Sparks novels are available for purchase?
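A minimal sketch of a single-condition filter for this question, assuming a hypothetical `books` table with an `author` column (sqlite3 as a stand-in for MySQL, sample rows chosen for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT, author TEXT);
    INSERT INTO books VALUES
        (1, 'The Notebook', 'Nicholas Sparks'),
        (2, 'A Walk to Remember', 'Nicholas Sparks'),
        (3, 'Things Fall Apart', 'Chinua Achebe');
""")

# WHERE keeps only the rows whose author matches the value passed in.
sparks_titles = conn.execute(
    "SELECT title FROM books WHERE author = ? ORDER BY title",
    ("Nicholas Sparks",),
).fetchall()
print(sparks_titles)
```

The `?` placeholder is how sqlite3 passes a value into the query safely; in a MySQL client the same condition would read `WHERE author = 'Nicholas Sparks'`.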
Results:
Business question: Which bookshops are located in Kisumu county?
Result:
Case 2: Filtering by multiple conditions.
Business question: Is there a bookstore called Bookworms Haven in Kisumu County?
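Both bookshop questions can be sketched against one hypothetical `bookshops` table: the county question uses a single condition, and the Bookworms Haven question combines two conditions with AND. The table layout and the extra shops are assumptions, and sqlite3 stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookshops (shop_id INTEGER PRIMARY KEY, name TEXT, county TEXT);
    INSERT INTO bookshops VALUES
        (1, 'Bookworms Haven', 'Kisumu'),
        (2, 'Page Turners', 'Nairobi'),
        (3, 'Lakeside Reads', 'Kisumu');
""")

# Single condition: every bookshop in Kisumu county.
kisumu_shops = conn.execute(
    "SELECT name FROM bookshops WHERE county = ? ORDER BY name",
    ("Kisumu",),
).fetchall()

# Multiple conditions: both must hold for a row to be returned.
haven_in_kisumu = conn.execute(
    "SELECT name, county FROM bookshops WHERE name = ? AND county = ?",
    ("Bookworms Haven", "Kisumu"),
).fetchall()

print(kisumu_shops)
print(haven_in_kisumu)
```

An empty result for the second query would mean that no such bookstore exists in that county.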
Result:
4. Data Aggregation
This is the process of summarizing and grouping data to obtain useful insights and metrics. It is a common task in data analysis and is often used to generate reports, perform statistical analysis, and identify trends.
In SQL, this is achieved with aggregate functions. Here are some common functions used for aggregation:
SUM() Function
AVG() Function
MIN() Function
MAX() Function
COUNT() Function
SUM() Function
It returns the total sum of a numeric column.
Business question: How many book copies are in stock?
Results:
AVG() Function
It calculates the average of a numeric column.
Business question: What is the average number of books in stock?
Results:
MIN() Function
It returns the lowest (smallest) value of a numerical column.
Business question: What is the minimum number of book copies available in a bookstore?
Results:
MAX() Function
It returns the largest value of a numerical column.
Business question: What is the maximum number of book copies available in a bookstore?
Results:
COUNT() Function
It counts the number of rows in a table or the number of non-null values in a column.
Business question: How many stock entries are there?
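The five functions in this section can be tried together in one query. The sketch below assumes a hypothetical `stock` table with a `copies` column and four invented rows, with sqlite3 standing in for MySQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stock (stock_id INTEGER PRIMARY KEY, book_id INTEGER, copies INTEGER);
    INSERT INTO stock VALUES (1, 1, 12), (2, 2, 5), (3, 3, 8), (4, 1, 3);
""")

# Each aggregate collapses the whole column into a single value.
total, average, lowest, highest, entries = conn.execute("""
    SELECT SUM(copies), AVG(copies), MIN(copies), MAX(copies), COUNT(*)
    FROM stock
""").fetchone()

print("total copies:", total)      # 12 + 5 + 8 + 3 = 28
print("average copies:", average)  # 28 / 4 = 7.0
print("fewest copies:", lowest)
print("most copies:", highest)
print("stock entries:", entries)
```

Each function can also be used on its own, for example `SELECT SUM(copies) FROM stock;` for the first business question.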
Results:
5. Data Transformation
This is the process of converting data from one format or structure to another to make it more suitable for analysis. In SQL, data transformation can be achieved using various techniques such as filtering, aggregating, joining, and grouping.
Business question: How many different book titles are there in each book category?
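One way to answer this is to join the categories to the books and then group by category. This is a sketch under assumptions: the `categories` and `books` tables, their columns, and the sample rows are hypothetical, and sqlite3 stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (category_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT, category_id INTEGER);
    INSERT INTO categories VALUES (1, 'Romance'), (2, 'Fiction');
    INSERT INTO books VALUES
        (1, 'The Notebook', 1),
        (2, 'A Walk to Remember', 1),
        (3, 'Things Fall Apart', 2);
""")

# Join, then group: one output row per category with its number of distinct titles.
titles_per_category = conn.execute("""
    SELECT c.name, COUNT(DISTINCT b.title) AS title_count
    FROM categories AS c
    INNER JOIN books AS b ON b.category_id = c.category_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(titles_per_category)
```

`COUNT(DISTINCT b.title)` is used so that a title listed twice in the same category is only counted once.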
Result:
6. Data Cleaning
Data cleaning is an important step in preparing data for analysis in SQL. Here are some common techniques used for data cleaning in SQL:
1. Removing duplicates
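A minimal sketch using SELECT DISTINCT, assuming a hypothetical `books` table in which one row was entered twice (sqlite3 as a stand-in for MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (title TEXT, author TEXT);
    INSERT INTO books VALUES
        ('The Notebook', 'Nicholas Sparks'),
        ('The Notebook', 'Nicholas Sparks'),
        ('Things Fall Apart', 'Chinua Achebe');
""")

# DISTINCT collapses rows that are identical across the selected columns.
unique_rows = conn.execute(
    "SELECT DISTINCT title, author FROM books ORDER BY title"
).fetchall()

# The table itself still holds all three rows.
stored_rows = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]

print(unique_rows)
print("rows still stored:", stored_rows)
```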
This query filters duplicate rows out of the result, returning only unique rows based on the columns specified in the SELECT statement. The rows stored in the table are left unchanged.
Result:
2. Handling missing values:
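The sketch below shows two common options, assuming a hypothetical `books` table where one author is missing: filtering the incomplete rows out with IS NOT NULL, and substituting a default with COALESCE. The default text 'Unknown' is just an example value, and sqlite3 stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT, author TEXT);
    INSERT INTO books VALUES
        (1, 'The Notebook', 'Nicholas Sparks'),
        (2, 'Untitled Draft', NULL);
""")

# Option 1: keep only the rows where the author is present.
complete_rows = conn.execute("""
    SELECT title, author FROM books
    WHERE author IS NOT NULL
    ORDER BY book_id
""").fetchall()

# Option 2: keep every row, replacing a missing author with a default.
filled_rows = conn.execute("""
    SELECT title, COALESCE(author, 'Unknown') FROM books
    ORDER BY book_id
""").fetchall()

print(complete_rows)
print(filled_rows)
```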
This query will return only the rows that do not contain null values in the specified columns. One can also replace null values with a default value or with values from other rows or columns.
Result:
In conclusion, SQL lets data analysts retrieve, transform, and analyze large datasets stored in relational databases, which makes it a must-have tool for any data analyst.