SQL, short for Structured Query Language is a programming language used to communicate with databases and do various types of Data wrangling operations. This is an essential skillset for any type of Data related jobs. In this tutorial, let’s get started with the basics of SQL.
1. Why you should learn SQL?
First, let’s understand what is SQL.
SQL, pronounced “see-quel” or “S-Q-L”, is a programming language specifically designed for managing databases.
SQL is used to communicate with databases to retrieve and manipulate data.
Application or websites use databases to store and access data, like user information, transaction data, product details, etc. So, if you have to create such applications, you will need to know at least basic SQL.
It is particularly effective for handling structured data, i.e., data incorporating relations among entities and variables.
The History of SQL
SQL was first developed at IBM by Donald D. Chamberlin and Raymond F. Boyce in the early 1970s.
It was initially called SEQUEL (Structured English Query Language) and was designed to manipulate and retrieve data stored in IBM’s original quasi-relational database management system, System R, which a group at IBM San Jose Research Laboratory had developed.
Over time, SQL became a standard tool for managing data, and it has been maintained and standardized by the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO) since 1986.
It is still effective and in-practice to this day, even modern databases adopting SQL at it’s core.
Why Learn SQL?
It is the most popular, largely accepted way to query data from large databases and do all sorts of data wrangling. It is a basic knowledge requirement expected out of Data professionals.
SQL and relational databases are everywhere!
Helpful in pursuing a Data Science Career
Almost every data related technology supports SQL in some form
Handle large amounts of structured data
Easy to learn. Get good with practice.
Learning SQL (Structured Query Language) can be advantageous for a number of reasons:
1) Data Management: SQL is a standard language for managing data held in a relational database management system (RDBMS) or for stream processing in a relational data stream management system (RDSMS).
2) In Demand Skill: Makes you more marketable for jobs. From data analysts and data scientists to software engineers and database administrators, many jobs require SQL skills.
3) Data Analysis: SQL is designed not just to manage data, but to retrieve and analyze it. You can use SQL to perform complex queries that help you understand data trends, patterns, and insights.
4) Interoperability: Almost every RDBMS (like MySQL, Oracle, SQL Server, SQLite, Postgres) uses SQL, even with some differences in dialect. This means that learning SQL gives you skills that you can use across various database systems.
5) Integrations with Other Languages: SQL integrates with other programming languages such as Python, Java, and R.
6) Handling Large Databases: SQL can handle databases of all sizes, from a few hundred entries to millions or even billions. This makes it a valuable for companies dealing with big data.
7) Long-term Relevance: SQL has been around since the 1970s and is still going strong. The importance of SQL in data management ensures its relevance for many years to come.
8) Ease of Learning: While SQL has depth and complexity. It has a short learning curve but more you practice the better you get at it.
SQL is a powerful and robust tool for extracting relevant and useful data from a large dataset.
While SQL has traditionally been the specialty of highly-trained data analysts and programmers, it’s finding greater acceptance among non-technical folks.
2. Types of SQL servers
SQL servers generally refer to a category of software that provides relational database management system (RDBMS) services, which are often used in conjunction with the SQL (Structured Query Language) programming language.
Here are several types of SQL servers provided by different software vendors:
- Microsoft SQL Server: A widely used commercial SQL server from Microsoft.
-
MySQL: An open-source RDBMS popular for web applications, now owned by Oracle.
-
PostgreSQL: An open-source object-relational database system known for its robustness and complexity.
-
Oracle Database: A comprehensive and powerful commercial RDBMS from Oracle Corporation.
-
MariaDB: An open-source RDBMS, and a community-developed fork of MySQL.
-
IBM DB2: An enterprise-level, high-performance DBMS from IBM.
-
Amazon RDS: A scalable and managed RDBMS service from Amazon Web Services.
-
Google Cloud SQL: A fully-managed RDBMS service offered by Google Cloud.
-
Azure SQL Database: A fully managed RDBMS service provided by Microsoft’s Azure Cloud platform.
-
SQLite: A lightweight, disk-based database with an embedded processing model.SQLite is a C library that provides a lightweight disk-based database. It doesn’t have a separate server process like most other SQL servers.
-
Apache HIVE: Manage petabytes of data residing in distributed storage using SQL.
The sqlite3 module in Python provides an SQL interface for interacting with SQLite databases. This module is an embedded SQL database engine and does not have a separate server process.
4. What is a Database and Table?
Before you start learning SQL, it is very important to understand what is Database and Table?
1) Database: Imagine a library. It’s a place where lots of information is stored. This information is organized in a way that you can find and access what you’re looking for.
A database is the equivalent of this library in the digital world. It’s a structured set of data. So, a database might be used to store information for a website, a business, a game, and so on. It helps us store, manage, and retrieve that information efficiently.
2) Tables: Now within this library, books are organized into different sections or shelves based on categories like ‘Science Fiction’, ‘History’, ‘Biography’ and so forth. Tables are just like these sections in our library analogy.
In a database, a table is a set of data elements (values) that is organized using a model of vertical columns (which are identified by their name) and horizontal rows, the cell being the unit where a row and column intersect.
A table has a specified number of columns, but can have any number of rows.
SQL Syntax example
Typically we write SQL Commands in CAPS. However, SQL is NOT case sensitive. So, SELECT
is same as select
in SQL.
SELECT ProductID, CustomerID, COUNT(*)
FROM Orders
WHERE Quantity > 0
GROUP BY ProductID, CustomerID
HAVING COUNT(*) > 5
ORDER BY ProductID ASC, CustomerID DESC;
Explanation:
- SELECT: Retrieve specific columns or expressions from the table.
- FROM: Specify the table from which to retrieve data.
- WHERE: Filter the data based on specific conditions.
- GROUP BY: Group the result set based on one or more columns.
- HAVING: Filter the grouped data based on conditions.
- ORDER BY: Sort the result set based on one or more columns.
5. Setup Python Environment for SQL
- Install Anaconda
- Or Use Google Colab
Check instuctions in respective video lessons in Python Course
6. Import necessary libraries and create sample data
Here is the plan:
We are going to use pandas to load a CSV file as a Pandas Dataframe. Then we will load this dataframe to SQL Database. From then on, we will do all the work / data processing in SQL.
Note:
Typically, in workplace scenarios, this step is done for you and not required. Your org will typically already have the databases with tables. Nevertheless, let’s see how to create one if not already given so you get the full picture.
#!pip install pandas
import pandas as pd
sample_df = pd.read_csv('DataSQL.csv')
sample_df.head()
Current_id | Dep_Var | Logit_Var | Var_1 | Var_2 | Var_3 | Var_4 | Var_5 | |
---|---|---|---|---|---|---|---|---|
0 | 301897853 | 16428 | 0 | -113 | 40000 | 436 | 105 | 13 |
1 | 301934672 | 8678 | 1 | -120 | 35000 | 446 | 113 | 15 |
2 | 301902540 | 25561 | 1 | -99 | 36000 | 456 | 132 | 14 |
3 | 301864713 | 8528 | 0 | -85 | 21600 | 408 | 142 | 14 |
4 | 301937240 | 10128 | 0 | -118 | 38000 | 396 | 139 | 13 |
Basic outline to follow while learning SQL in a structured manner
- Create a Database
- Create and Load Table in Database
- Create New Table from existing table
- Copy data from one table and insert it into already exisitng table
- Get list of tables in SQL DB
- Drop Tables in DB
- Delete rows in a existing Dataset using where condition
- Update rows/columns in a existing Dataset
- Select data / required vaiables from a dataset
- SQL Comments
- Limit the number of rows returned from a query
- Identify and Delete duplicates in a dataset
- Filter rows in SQL dataset
- Sort the data in ascending or descending order
- Aggregate identical data into groups
- Filter rows after the aggregated
- SQL Joins
- 17(a) – Inner Join
- 17(b) – Full Join
- 17(c) – Left Join
- 17(d) – RIght Join
1. Create DB and Run SQL in Jupyter Environment
Database in SQL is used to manage and manipulate the data.
In this Database, data is organized into tables which are related to each other based on common keys or concepts; hence the term “relational database”.
Each table consists of columns (fields) and rows (records). A unique key identifies each record in a table, and relationships between tables are created through these keys.
Python ships with a default SQL database called ‘sqlite’. Let’s connect to the db that comes with this course called jupyter_sql_tutorial.db
.
STEP 1: CONNECT or CREATE
# !pip install db-sqlite3
import sqlite3
# Use sqlite3 library to create a database connection
Test_db = sqlite3.connect('jupyter_sql_tutorial.db')
If the database does not exist, it will create a new one.
# Create new database if not exist
Test_db2 = sqlite3.connect('new_database.db')
If you check the work directory, it should contain a new file named new_database.db
.
Important: Install ipython-sql for the %sql
function to work
# Common error: https://stackoverflow.com/questions/37149748/ipython-notebook-and-sql-importerror-no-module-named-sql-when-running-load
# !pip install ipython-sql
STEP 2: LOAD DATABASE and ATTACH
In database management systems, the “ATTACH DATABASE” statement is used to connect or link an external database to the current database session.
%load_ext sql
%sql sqlite:///jupyter_sql_tutorial.db
Important note, if you want to run this code in Jupyter notebook:
- Add ‘%%sql’ in the beginning of each cell so Jupyter knows the cell contains SQL code.
- Do ‘Attach Database’ so we can refer ‘jupyter_sql_tutorial.db’ as ‘test_db’.
ATTACH DATABASE 'jupyter_sql_tutorial.db' AS Test_db;
STEP 3: Save Pandas DataFrame to SQL Database (Test_db)
# Load the dataset if not already loaded.
try:
sample_df.to_sql('sample_df', Test_db)
except:
pass
Note: If above statement gives error that the table sample_df
already exists, then it means the table has already been loaded, so no need to re-run.
View / Print sample_df
Loaded into Test_db
Now, let’s read the data from the Sample_df
table stored in Test_db
Notice the syntax.
SELECT *
FROM Test_db.Sample_df
This prints the entire table. Not showing here because of long output.
The above output was too long, and can easily run out of memory for large tables.
So, let’s get only 5 rows.
SELECT *
FROM Test_db.Sample_df
LIMIT 5;
index | Current_id | Dep_Var | Logit_Var | Var_1 | Var_2 | Var_3 | Var_4 | Var_5 |
---|---|---|---|---|---|---|---|---|
0 | 301897853 | 16428 | 0 | -113 | 40000 | 436 | 105 | 13 |
1 | 301934672 | 8678 | 1 | -120 | 35000 | 446 | 113 | 15 |
2 | 301902540 | 25561 | 1 | -99 | 36000 | 456 | 132 | 14 |
3 | 301864713 | 8528 | 0 | -85 | 21600 | 408 | 142 | 14 |
4 | 301937240 | 10128 | 0 | -118 | 38000 | 396 | 139 | 13 |
How to count the number of records in the table?
SELECT count(*)
FROM Test_db.Sample_df;
14989
How to count the columns?
That is not the right way to think about it. Let me explain.
In Python, you normally get the count of columns to reference the column by column index or do some iterations.
Whereas in SQL, we reference the columns by the respective column names. All the operations we do is vectorised (applied on the entire data).
Nevertheless, SQL does not prevent you from finding the count of columns, but how you compute it depends on what RDBMS you work on.
Here are couple of ways to do it: MySQL, SQLite
SELECT COUNT(*)
FROM pragma_table_info('Sample_df');
COUNT(*) |
---|
9 |
2. Data wrangling in SQL vs Python
A key difference in Data Wrangling between SQL and Python is: In Python, you can create your own functions, iterate in a customized way by writing For Loops. But in SQL, we accomplish all tasks by the limited set of predefined SQL commands.
3. Setup MySQL Workbench
Helpful Article:
1. How to install MySQL on Windows
2. How to install MySQL on MacOS
3. Official Guide: Windows, MacOS
With time, these article links may change. To get the latest just google search “Install MySQL in Windows” (or Mac) and look into the latest article.
Steps
- Download MYSQL installer: https://dev.mysql.com/downloads/installer/
- Once installation completes select the ‘Full’ setup option.
- In ‘Type and Networking’ screen, keep it as ‘Development Computer’ if you are using a personal PC.
Keep the default port or you can give your own.
- Continue default method until it asks you to give root account password.
- Apply configuration and Finish.
Done! The next step is to connect to the server. Type the root account’s password and click Check
- In MySQL Router configuration, uncheck this box and click next.
- Once everything is set, you will be greeted with the workbench screen. Choose the connection to the server you had created and log into it.
- You can now start writing your queries.
Information on Sakila Database we use for Exercises
The sakila database is very popular, basic set of tables that comes preinstalled by default when you install MySQL server in the workbench. If it’s not present in anycase, you can find out more about the database and set it up using the following links:
- Structure of DB: https://dev.mysql.com/doc/sakila/en/sakila-structure.html
- Installation: https://dev.mysql.com/doc/sakila/en/sakila-installation.html
- Download link: https://dev.mysql.com/doc/index-other.html
- Alternate Download Link: https://www.sqliz.com/sakila/installation/
2. Create Table manually in DB
Tables are an essential component of databases as they hold all the data. When a table is created, the name of the table, names of columns, and the type of data that can be stored in the columns are defined.
The CREATE TABLE
statement in SQL is used to create a new table in a database with “0 Rows”.
CREATE TABLE Students (
StudentID int PRIMARY KEY,
FirstName varchar(255),
LastName varchar(255),
Age int,
Grade int
);
SELECT * FROM Students
StudentID | FirstName | LastName | Age | Grade |
---|
Add rows manually to Students table
INSERT INTO Students (StudentID, FirstName, LastName, Age, Grade)
VALUES (1, 'John', 'Doe', 15, 10),
(2, 'Jane', 'Doe', 14, 9),
(3, 'Jim', 'Beam', 16, 10),
(4, 'Jack', 'Daniels', 15, 9);
SELECT * FROM Students
StudentID | FirstName | LastName | Age | Grade |
---|---|---|---|---|
1 | John | Doe | 15 | 10 |
2 | Jane | Doe | 14 | 9 |
3 | Jim | Beam | 16 | 10 |
4 | Jack | Daniels | 15 | 9 |
Get Table Variables types
PRAGMA table_info(Students);
cid | name | type | notnull | dflt_value | pk |
---|---|---|---|---|---|
0 | StudentID | INT | 0 | None | 1 |
1 | FirstName | varchar(255) | 0 | None | 0 |
2 | LastName | varchar(255) | 0 | None | 0 |
3 | Age | INT | 0 | None | 0 |
4 | Grade | INT | 0 | None | 0 |
PRAGMA
is a sqlite specific command. If you are working on other databases, say MySQL, use DESCRIBE
command.
DESCRIBE Students
SQL Variables Types:
Creating a table in SQL requires defining not only the column names but also the data types for each column. The most common types of data used in SQL are:
A) INT
: For integer numbers
B) FLOAT
or DOUBLE
: For floating point numbers
C) DATE
: For date values
D) TIMESTAMP
: For date and time values
E) CHAR(size)
: For fixed length strings (size = number of characters)
F) VARCHAR(size)
: For variable length strings (size = maximum number of characters)
G) TEXT
: For storing large amount of text
I) BIT
: For storing either true or false logical values.
J) BLOB
Here is a sample SQL command to create a table with various types of variables:
CREATE TABLE Employee (
EmployeeID INT PRIMARY KEY,
FirstName VARCHAR(50),
LastName VARCHAR(50),
BirthDate DATE,
HireDate TIMESTAMP,
Salary DOUBLE,
Bio TEXT,
Photo BLOB,
IsActive BIT
);
2b. Exercise: Create Orders table from raw data.
Try the following in Jupyter or MySQL Workbench.
- Create a database ‘homework’.
-
Create an orders table in ‘homework’ containing the following fields:
- OrderID
- OrderDate
- Qty
- Customer ID
- Product category
- Price
- Then insert the following entries.
- 1, 2023-01-01, 2, 10, “Electronics”, 110
- 2, 2023-01-01, 2, 11, “Home”, 10
- 3, 2023-01-01, 2, 11, “Kitchen”, 80
- 4, 2023-01-01, 2, 12, “Electronics”, 11
- 5, 2023-01-01, 2, 12, “Outdoor”, 19
- 6, 2023-01-01, 2, 13, “Outdoor”, 217
- 7, 2023-01-01, 2, 13, “Home”, 8
3. Create New Table from existing table
The “CREATE TABLE” statement copies data from one table and inserts it into a new table.
CREATE TABLE Students_test AS
SELECT * FROM Students
WHERE Grade < 10;
SELECT * FROM Students_test
StudentID | FirstName | LastName | Age | Grade |
---|---|---|---|---|
2 | Jane | Doe | 14 | 9 |
4 | Jack | Daniels | 15 | 9 |
4. Show Tables from existing Database
Different RDBMS (like SQLite, MySQL.. ) have different ways of doing this. The two most common methods are shown below.
For SQLite, this info is stored in an in-built database called sqlite_master
. You can query that table using the following command.
SELECT name FROM sqlite_master WHERE type='table';
name |
---|
sample_df |
Students |
Employee |
Students_test |
In MYSQL databases, you can use the SHOW TABLES
command. This is very common in most RDBMS.
SHOW TABLES FROM {DB_NAME};
Ref: https://dev.mysql.com/doc/refman/8.0/en/show-tables.html
5. Insert Into Select: copy data from one table to another table
Insert Into Select
is a versatile feature that can be utilized to copy data from one table and insert it into already exisitng table.
It will not copy to a new table.
INSERT INTO Students_test
SELECT *
FROM Students;
SELECT * FROM Students_test
StudentID | FirstName | LastName | Age | Grade |
---|---|---|---|---|
2 | Jane | Doe | 14 | 9 |
4 | Jack | Daniels | 15 | 9 |
1 | John | Doe | 15 | 10 |
2 | Jane | Doe | 14 | 9 |
3 | Jim | Beam | 16 | 10 |
4 | Jack | Daniels | 15 | 9 |
Create one more
INSERT INTO Students_test2
SELECT *
FROM Students;
SELECT * FROM Students_test2
StudentID | FirstName | LastName | Age | Grade |
---|---|---|---|---|
1 | John | Doe | 15 | 10 |
2 | Jane | Doe | 14 | 9 |
3 | Jim | Beam | 16 | 10 |
4 | Jack | Daniels | 15 | 9 |
This did not work.
Why?
Because Student_test2
table did not exist. So, you need to first create it.
CREATE TABLE Students_test2 AS
SELECT * FROM Students
WHERE Grade < 10;
SELECT * FROM Students_test2
* sqlite:///jupyter_sql_tutorial.db
(sqlite3.OperationalError) table Students_test2 already exists
[SQL: CREATE TABLE Students_test2 AS
SELECT * FROM Students
WHERE Grade < 10;]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
Check if new table was created
SELECT name
FROM sqlite_master
WHERE type='table';
name |
---|
sample_df |
Students |
Employee |
Students_test |
Students_test2 |
6. SQL DELETE: Delete rows in a existing Dataset using where condition
DELETE
command can be used to delete a single record, multiple records, or all records from a table.
DELETE FROM Students_test2
WHERE Grade = 9;
SELECT * FROM Students_test2
StudentID | FirstName | LastName | Age | Grade |
---|---|---|---|---|
1 | John | Doe | 15 | 10 |
3 | Jim | Beam | 16 | 10 |
You can give more complex conditions as well.
DELETE FROM Students_test2
WHERE Grade = 9 AND AGE = 15;
SELECT * FROM Students_test2
StudentID | FirstName | LastName | Age | Grade |
---|---|---|---|---|
1 | John | Doe | 15 | 10 |
3 | Jim | Beam | 16 | 10 |
7. Delete a Table
You can delete / drop an entire table in a single DROP TABLE
command. So, be absolutely sure before running this.
The DROP TABLE
command in SQL is a Data Definition Language (DDL) statement used to remove a table definition along with all the data, indexes, triggers, constraints, and permission specifications for that table. Essentially, the command completely eliminates the table from the database.
It is important to exercise caution when using this command because once a table is dropped, all the information in the table is lost and cannot be recovered (unless you have a backup).
DROP TABLE Students_test2;
Check the DROP TABLE Command after dropping table (Students_test2) in Database
It will give an error, because, it does not exist anymore.
DROP TABLE Students_test2;
* sqlite:///jupyter_sql_tutorial.db
(sqlite3.OperationalError) no such table: Students_test2
[SQL: DROP TABLE Students_test2;]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
Check tables in the Database
SELECT name
FROM sqlite_master
WHERE type='table';
name |
---|
sample_df |
Students |
Employee |
Students_test |
8. SQL Update : Update rows/columns in a existing Dataset
The UPDATE
statement is used to modify existing records in a table
UPDATE Students
SET Grade = Grade * 10,
Age = Age + 2
WHERE FirstName = 'Jane';
SELECT * FROM Students
StudentID | FirstName | LastName | Age | Grade |
---|---|---|---|---|
1 | John | Doe | 15 | 10 |
2 | Jane | Doe | 16 | 90 |
3 | Jim | Beam | 16 | 10 |
4 | Jack | Daniels | 15 | 9 |
8b. Exercise 2: Updating Table
Try the following in MySQL Workbench on the Sakila Database
- Set to use the database as Sakila.
- Update Sakila.FILM to increase all values ‘Rental_Duration’ by 5
Note:
- The Sakila db is present as a default database in MySQL workbench.
If not present, follow the instructions in the ‘Setup MySQL WorkBench’ lesson above to set it up. -
Basically, you need to run the 2 SQL scripts inside ‘Sakila-db’ directory: sakila-schema.sql, then, sakila-data.sql.
9. SQL Select : Select data / required vaiables from a dataset
The SELECT statement is used to select data from a database.
Select *
is used to select all colums in the table
SELECT *
FROM Students;
StudentID | FirstName | LastName | Age | Grade |
---|---|---|---|---|
1 | John | Doe | 15 | 10 |
2 | Jane | Doe | 16 | 90 |
3 | Jim | Beam | 16 | 10 |
4 | Jack | Daniels | 15 | 9 |
Select only defined columns SELECT FirstName, LastName
SELECT FirstName, LastName
FROM Students;
FirstName | LastName |
---|---|
John | Doe |
Jane | Doe |
Jim | Beam |
Jack | Daniels |
Select Only ‘Jane’ and ‘Jack’
SELECT FirstName, LastName
FROM Students
WHERE FirstName in ("Jane", "Jack");
FirstName | LastName |
---|---|
Jane | Doe |
Jack | Daniels |
Select all except ‘Jane’ and ‘Jack’
SELECT FirstName, LastName
FROM Students
WHERE FirstName not in ("Jane", "Jack");
FirstName | LastName |
---|---|
John | Doe |
Jim | Beam |
Select all where LastName startd with “D”
SELECT FirstName, LastName
FROM Students
WHERE LastName LIKE ("D%");
FirstName | LastName |
---|---|
John | Doe |
Jane | Doe |
Jack | Daniels |
More on such patterns in an upcoming section.
10. SQL Comments
Comments are lines of text in your SQL code that the SQL interpreter or compiler ignores
SQL offers two types of comment syntax:
- Single-Line Comments
-
Multi-Line Comments
-- This is a single-line comment
/*
This is a multi-line comment.
The SELECT statement below retrieves all records from the Customers table.
*/
SELECT * FROM Students;
StudentID | FirstName | LastName | Age | Grade |
---|---|---|---|---|
1 | John | Doe | 15 | 10 |
2 | Jane | Doe | 16 | 90 |
3 | Jim | Beam | 16 | 10 |
4 | Jack | Daniels | 15 | 9 |
11. Limit the number of rows returned from a query
The SELECT TOP
clause is a SQL command used to limit the number of rows returned from a query. This is particularly useful when you’re dealing with large databases, and you want a subset of records.
Fetching Top N Records
If you want to select the top 3 orders
The LIMIT
clause is used to limit the number of records returned by the query. For example, if you set LIMIT 3
, it will return only the first 3 records from the result set.
First, let’s print all rows
SELECT *
FROM Students_test;
Only 3 rows
SELECT *
FROM Students_test LIMIT 3;
StudentID | FirstName | LastName | Age | Grade |
---|---|---|---|---|
2 | Jane | Doe | 14 | 9 |
4 | Jack | Daniels | 15 | 9 |
1 | John | Doe | 15 | 10 |
The OFFSET
clause is used to specify the starting point for the rows to be returned. If you set OFFSET 2
, it means that the first 2 rows will be skipped and the query will start returning records from the 3rd row.
So, when you combine LIMIT 3 OFFSET 2
in a SQL query, it means that the SQL statement will skip the first 2 records and then return the next 3 records.
SELECT *
FROM Students_test
LIMIT 3 OFFSET 2;
StudentID | FirstName | LastName | Age | Grade |
---|---|---|---|---|
1 | John | Doe | 15 | 10 |
2 | Jane | Doe | 14 | 9 |
3 | Jim | Beam | 16 | 10 |
12. SELECT DISTINCT : Identify and Delete duplicates in a dataset
The primary function of the SELECT DISTINCT
statement is to eliminate all duplicate rows and present a table of unique rows.
Sample data for understanding SELECT DISTINCT
SELECT *
FROM Students_test
StudentID | FirstName | LastName | Age | Grade |
---|---|---|---|---|
2 | Jane | Doe | 14 | 9 |
4 | Jack | Daniels | 15 | 9 |
1 | John | Doe | 15 | 10 |
2 | Jane | Doe | 14 | 9 |
3 | Jim | Beam | 16 | 10 |
4 | Jack | Daniels | 15 | 9 |
SELECT DISTINCT *
Get distinct of entire row
SELECT DISTINCT *
FROM Students_test;
StudentID | FirstName | LastName | Age | Grade |
---|---|---|---|---|
2 | Jane | Doe | 14 | 9 |
4 | Jack | Daniels | 15 | 9 |
1 | John | Doe | 15 | 10 |
3 | Jim | Beam | 16 | 10 |
SELECT DISTINCT StudentID
Get distinct of Selected column(s)
SELECT DISTINCT StudentID
FROM Students_test;
StudentID |
---|
2 |
4 |
1 |
3 |
13. SQL Where: Filtering rows
The SQL WHERE clause is the filtering mechanism in SQL. It specifies a condition while fetching data from a single table or by joining with multiple tables. If the given condition is satisfied, only then it returns a specific value from the table.
You could say the WHERE clause is the gatekeeper of data – only letting through the data you specifically request.
Select *
FROM Students_test
WHERE Grade < 10;
StudentID | FirstName | LastName | Age | Grade |
---|---|---|---|---|
2 | Jane | Doe | 14 | 9 |
4 | Jack | Daniels | 15 | 9 |
2 | Jane | Doe | 14 | 9 |
4 | Jack | Daniels | 15 | 9 |
Multiple conditions
Select *
FROM Students_test
WHERE Age > 14 AND Grade < 10;
StudentID | FirstName | LastName | Age | Grade |
---|---|---|---|---|
4 | Jack | Daniels | 15 | 9 |
4 | Jack | Daniels | 15 | 9 |
Select specific STUDENTIDs
Select *
FROM Students_test
WHERE StudentID in (2,3,4);
StudentID | FirstName | LastName | Age | Grade |
---|---|---|---|---|
2 | Jane | Doe | 14 | 9 |
4 | Jack | Daniels | 15 | 9 |
2 | Jane | Doe | 14 | 9 |
3 | Jim | Beam | 16 | 10 |
4 | Jack | Daniels | 15 | 9 |
13b. Exercise 3: Filtering Data
Try the following in MySQL Workbench on the Sakila Database
- From SAKILA.FILM, select only the film titles and description with rental rate more than 2.00 and Rating is ‘PG’.
-
From SAKILA.PAYMENTS, count how many payments were make more than $5.
Note:
- The Sakila db is present as a default database in MySQL workbench.
If not present, follow the instructions in the ‘Setup MySQL WorkBench’ lesson above to set it up. -
Basically, you need to run the 2 SQL scripts inside ‘Sakila-db’ directory: sakila-schema.sql, then, sakila-data.sql.
14. SQL Order By: Sort the data in ascending or descending order
The SQL ORDER BY
keyword is used to sort the result-set in ascending or descending order. The ORDER BY
keyword sorts the records in ascending order by default. If you want to sort the records in descending order, you can use the DESC keyword.
SELECT *
FROM Students_test
ORDER BY Grade ASC;
StudentID | FirstName | LastName | Age | Grade |
---|---|---|---|---|
2 | Jane | Doe | 14 | 9 |
4 | Jack | Daniels | 15 | 9 |
2 | Jane | Doe | 14 | 9 |
4 | Jack | Daniels | 15 | 9 |
1 | John | Doe | 15 | 10 |
3 | Jim | Beam | 16 | 10 |
ORDER BY Grade ASC, Age DESC;
Sort / Order using multiple variables and diffrent order types (ASC, DESC)
SELECT *
FROM Students_test
ORDER BY Grade ASC, Age DESC;
StudentID | FirstName | LastName | Age | Grade |
---|---|---|---|---|
4 | Jack | Daniels | 15 | 9 |
4 | Jack | Daniels | 15 | 9 |
2 | Jane | Doe | 14 | 9 |
2 | Jane | Doe | 14 | 9 |
3 | Jim | Beam | 16 | 10 |
1 | John | Doe | 15 | 10 |
15. SQL Group By : Aggregate identical data into groups
‘GROUP BY’ is an SQL command that arranges identical data into groups.
‘GROUP BY’ goes hand in hand with aggregate functions like COUNT, AVG, SUM, MAX, or MIN to provide meaningful insights from the data. It enables you to perform calculations on each group of rows rather than individual rows.
SELECT StudentID,
FirstName,
AVG(Age) as AvgAge,
SUM(Grade) as SumGrade
FROM Students_test
GROUP BY StudentID, FirstName;
StudentID | FirstName | AvgAge | SumGrade |
---|---|---|---|
1 | John | 15.0 | 10 |
2 | Jane | 14.0 | 18 |
3 | Jim | 16.0 | 10 |
4 | Jack | 15.0 | 18 |
16. SQL Having: Filter rows after aggregating
Consider the following SQL statement.
SELECT StudentID,
FirstName,
AVG(Age) as Age,
SUM(Grade) as Grade
FROM Students_test
WHERE Grade <= 10
GROUP BY StudentID, FirstName;
Now consider this with the ‘HAVING’ clause
SELECT StudentID,
FirstName,
AVG(Age) as Age,
SUM(Grade) as SGrade
FROM Students_test
GROUP BY StudentID, FirstName
HAVING SGrade <= 100;
StudentID | FirstName | Age | SGrade |
---|---|---|---|
1 | John | 15.0 | 10 |
2 | Jane | 14.0 | 18 |
3 | Jim | 16.0 | 10 |
4 | Jack | 15.0 | 18 |
What is the difference between “WHERE” and “HAVING” in SQL?
This is a very important interview question.
Answer:
The HAVING
clause is used in SQL to filter the results of a GROUP BY
clause.
While the WHERE
clause can filter rows before they are aggregated, the HAVING
clause filters the rows after they’ve been grouped and aggregated. This is the key distinction between the WHERE
and HAVING
clauses in SQL.
SELECT StudentID,
FirstName,
AVG(Age) as Age,
SUM(Grade) as SGrade
FROM Students_test
GROUP BY StudentID, FirstName
HAVING SGrade <= 100;
StudentID | FirstName | Age | SGrade |
---|---|---|---|
1 | John | 15.0 | 10 |
2 | Jane | 14.0 | 18 |
3 | Jim | 16.0 | 10 |
4 | Jack | 15.0 | 18 |
The column name in the HAVING clause refers to the name the comes after the grouping is done.
17. Joining Datasets
Let’s say, you have two tables: Orders and Customer info. Both tables have the customerid in common. And you want to add Customer info to the orders data so you can see the details of each customer against every order.
How do you achieve this?
SQL Joins are used to combine rows from two or more tables based on a common column between them.
This common column is often a primary key in one table that corresponds to a foreign key in another.
Types of SQL Joins
There are four main types of SQL Joins
17(a) – INNER JOIN: Returns records that have matching values in both tables.
17(b) – FULL (OUTER) JOIN: Returns all records when there is a match in either the left or the right table.
17(c) – LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right table.
17(d) – RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left table.
Let’s illustrate these concepts using a simple database with two tables, Customers
and Orders
.
Let’s illustrate these concepts using a simple database with two tables, Customers and Orders.
Customer Table:
CREATE TABLE Customers (
CustomerID INT PRIMARY KEY,
Name VARCHAR(255)
);
INSERT INTO Customers (CustomerID, Name)
VALUES (1, 'John'),
(2, 'Peter'),
(3, 'Mary'),
(4, 'Sally');
View Customers
table
select * from Customers;
CustomerID | Name |
---|---|
1 | John |
2 | Peter |
3 | Mary |
4 | Sally |
Orders Table:
CREATE TABLE Orders (
OrderID INT PRIMARY KEY,
CustomerID INT,
Product VARCHAR(255)
);
INSERT INTO Orders (OrderID, CustomerID, Product)
VALUES (1, 1, 'Apples'),
(2, 2, 'Bananas'),
(3, 4, 'Grapes'),
(4, 5, 'Oranges');
View Orders
table
select * from Orders;
OrderID | CustomerID | Product |
---|---|---|
1 | 1 | Apples |
2 | 2 | Bananas |
3 | 4 | Grapes |
4 | 5 | Oranges |
17(a) – INNER JOIN
The INNER JOIN
keyword selects records that have matching values in both tables.
SQL Statement:
SELECT Orders.*, Customers.Name
FROM Customers
INNER JOIN Orders ON Customers.CustomerID = Orders.CustomerID;
OrderID | CustomerID | Product | Name |
---|---|---|---|
1 | 1 | Apples | John |
2 | 2 | Bananas | Peter |
3 | 4 | Grapes | Sally |
17(b) – FULL JOIN
The FULL JOIN
keyword returns all records when there is a match in the left (Customers
) or the right (Orders
) table. Records that do not have a match in the other table are shown as NULL
.
SQL Statement:
SELECT Customers.Name, Orders.Product
FROM Customers
FULL OUTER JOIN Orders ON Customers.CustomerID = Orders.CustomerID;
Name | Product |
---|---|
John | Apples |
Peter | Bananas |
Mary | None |
Sally | Grapes |
None | Oranges |
Name | Product |
---|---|
John | Apples |
Peter | Bananas |
Mary | NULL |
Sally | Grapes |
NULL | Oranges |
%%html
<style>
table {float:left}
</style>
17(c) – LEFT JOIN
The LEFT JOIN
keyword returns all records from the left table (Customers
), and the matched records from the right table (Orders
). The result is NULL
from the right side if there is no match.
SQL Statement:
SELECT Customers.Name, Orders.Product
FROM Customers
LEFT JOIN Orders ON Customers.CustomerID = Orders.CustomerID;
Name | Product |
---|---|
John | Apples |
Peter | Bananas |
Mary | None |
Sally | Grapes |
17(d) – RIGHT JOIN
The RIGHT JOIN
keyword returns all records from the right table (Orders
), and the matched records from the left table (Customers
). The result is NULL from the left side when there is no match.
SQL Statement:
SELECT Customers.Name, Orders.Product
FROM Customers
RIGHT JOIN Orders ON Customers.CustomerID = Orders.CustomerID;
Name | Product |
---|---|
John | Apples |
Peter | Bananas |
Sally | Grapes |
None | Oranges |
Name | Product |
---|---|
John | Apples |
Peter | Bananas |
Sally | Grapes |
NULL | Oranges |
18. Exercise 4: Joining Exercise
Try the following in MySQL Workbench on the Sakila Database
- From Sakila Database, create a new table that contains:
- Film name
- Corresponding Actor name
- Release Year
- Rating
- Rental_Rate
Make sure none of the films get missed out in the result.
Hint: Use Film table and Film Actor Table.
- What is the average rental rate for each Film actor?
Note:
- The Sakila db is present as a default database in MySQL workbench.
If not present, follow the instructions in the ‘Setup MySQL WorkBench’ lesson above to set it up. -
Basically, you need to run the 2 SQL scripts inside ‘Sakila-db’ directory: sakila-schema.sql, then, sakila-data.sql.
19. SQL DateTime Operations
SQL provides a set of functions for working with dates and times. These functions let you manipulate and format date and time values in your SQL queries.
1) Extract the Date and time from a DateTime
2) Extract Year, Month, and Day
3) Extract Hours, Minutes, and Seconds
4) Increment and decrement Date Column by Days, Months and Years
5) Current date, time, and timestamp
6) Calculate Date Difference in Days, Years and Months
CREATE TABLE Employees (
ID INT PRIMARY KEY,
Name VARCHAR(100),
HireDate DATETIME
);
INSERT INTO Employees (ID, Name, HireDate)
VALUES
(1, 'John Doe', '2017-03-12 09:30:10'),
(2, 'Jane Smith', '2018-05-14 15:20:08'),
(3, 'Emily Johnson', '2019-07-30 13:15:15'),
(4, 'Robert Brown', '2020-10-20 10:00:35');
Select * from Employees;
ID | Name | HireDate |
---|---|---|
1 | John Doe | 2017-03-12 09:30:10 |
2 | Jane Smith | 2018-05-14 15:20:08 |
3 | Emily Johnson | 2019-07-30 13:15:15 |
4 | Robert Brown | 2020-10-20 10:00:35 |
1) Extract the Date and time from a DateTime
You can extract the date part from a DateTime using the DATE()
function.
SELECT *,
DATE(HireDate) AS 'Hire Date',
TIME(HireDate) AS 'Hire Time'
FROM Employees;
ID | Name | HireDate | Hire Date | Hire Time |
---|---|---|---|---|
1 | John Doe | 2017-03-12 09:30:10 | 2017-03-12 | 09:30:10 |
2 | Jane Smith | 2018-05-14 15:20:08 | 2018-05-14 | 15:20:08 |
3 | Emily Johnson | 2019-07-30 13:15:15 | 2019-07-30 | 13:15:15 |
4 | Robert Brown | 2020-10-20 10:00:35 | 2020-10-20 | 10:00:35 |
2. Extract Year, Month, and Day
You can use the strftime('%Y', date_col)
, strftime('%m', date_col)
, and strftime('%d', date_col)
functions to extract the respective date components from a DateTime value.
SELECT *,
strftime('%Y', HireDate) as 'HireYear',
strftime('%m', HireDate) as 'HireMonth',
strftime('%d', HireDate) as 'HireDay'
FROM Employees;
ID | Name | HireDate | HireYear | HireMonth | HireDay |
---|---|---|---|---|---|
1 | John Doe | 2017-03-12 09:30:10 | 2017 | 03 | 12 |
2 | Jane Smith | 2018-05-14 15:20:08 | 2018 | 05 | 14 |
3 | Emily Johnson | 2019-07-30 13:15:15 | 2019 | 07 | 30 |
4 | Robert Brown | 2020-10-20 10:00:35 | 2020 | 10 | 20 |
3. Extract Hours, Minutes, and Seconds
You can use the strftime('%H', date_col)
, strftime('%M', date_col)
, and strftime('%S', date_col)
functions to extract the respective date components from a DateTime value.
SELECT *,
strftime('%H', HireDate) as 'HireHour',
strftime('%M', HireDate) as 'HireMinute',
strftime('%S', HireDate) as 'HireSecond'
FROM Employees;
ID | Name | HireDate | HireHour | HireMinute | HireSecond |
---|---|---|---|---|---|
1 | John Doe | 2017-03-12 09:30:10 | 09 | 30 | 10 |
2 | Jane Smith | 2018-05-14 15:20:08 | 15 | 20 | 08 |
3 | Emily Johnson | 2019-07-30 13:15:15 | 13 | 15 | 15 |
4 | Robert Brown | 2020-10-20 10:00:35 | 10 | 00 | 35 |
4. Increment and decrement Date Column by Days, Months and Years
SELECT *,
DATETIME(HireDate, '+10 days') AS Add_Days,
DATETIME(HireDate, '-1 months') AS Add_Months,
DATETIME(HireDate, '+1 years') AS Add_Years
FROM Employees;
ID | Name | HireDate | Add_Days | Add_Months | Add_Years |
---|---|---|---|---|---|
1 | John Doe | 2017-03-12 09:30:10 | 2017-03-22 09:30:10 | 2017-02-12 09:30:10 | 2018-03-12 09:30:10 |
2 | Jane Smith | 2018-05-14 15:20:08 | 2018-05-24 15:20:08 | 2018-04-14 15:20:08 | 2019-05-14 15:20:08 |
3 | Emily Johnson | 2019-07-30 13:15:15 | 2019-08-09 13:15:15 | 2019-06-30 13:15:15 | 2020-07-30 13:15:15 |
4 | Robert Brown | 2020-10-20 10:00:35 | 2020-10-30 10:00:35 | 2020-09-20 10:00:35 | 2021-10-20 10:00:35 |
5. Current date, time, and timestamp: CURRENT_DATE
, CURRENT_TIME
, CURRENT_TIMESTAMP
functions will return the current date, time, and timestamp respectively.
SELECT CURRENT_DATE, CURRENT_TIME, CURRENT_TIMESTAMP;
CURRENT_DATE | CURRENT_TIME | CURRENT_TIMESTAMP |
---|---|---|
2023-08-20 | 07:27:36 | 2023-08-20 07:27:36 |
6. Calculate Date Difference in Days, Years and Months
SELECT *,
round(julianday('now') - julianday(HireDate),0) As Date_Diff_Days,
round((julianday('now') - julianday(HireDate))/30.44,0) As Date_Diff_Months,
round((julianday('now') - julianday(HireDate))/365.25,0) As Date_Diff_Years
FROM Employees;
ID | Name | HireDate | Date_Diff_Days | Date_Diff_Months | Date_Diff_Years |
---|---|---|---|---|---|
1 | John Doe | 2017-03-12 09:30:10 | 2352.0 | 77.0 | 6.0 |
2 | Jane Smith | 2018-05-14 15:20:08 | 1924.0 | 63.0 | 5.0 |
3 | Emily Johnson | 2019-07-30 13:15:15 | 1482.0 | 49.0 | 4.0 |
4 | Robert Brown | 2020-10-20 10:00:35 | 1034.0 | 34.0 | 3.0 |