Menu

SQL Window Functions – Made simple and intuitive

SQL window functions is one of the advanced concepts, understanding which will give you the ability to do complex data wrangling and transformations in SQL.

In this guide, we will intuitively understand how window functions work in a way you will never forget. Don’t memorize anything, just read through and you will clearly understand how the window functions works. Then you will know when and how to use it.

Once you complete this guide, head over to SQL Window Function Exercises and solve all the questions there.

Let’s get started.

In this on we will cover:

  1. Understanding Window Functions
  2. Type 1: Aggregation
    2.1. Aggregation functions used with Window functions
  3. Type 2: Ranking
    3.1. Ranking Functions
  4. Difference between rank() and dense_rank()?
  5. Type 3: Comparing boundaries with Offsets
    5.1. Offset Functions you will need
  6. Type 4: Rolling Aggregates with ROWS BETWEEN clause
    6.1. Understanding ROWS BETWEEN Clause

1. Understanding Window Functions

Window function performs a calculation across a ‘window of rows’ in relation to the current row. You get to define how these Windows, also called partitions, are created. You will see how to do that shortly.

This is comparable to ‘group by’, but instead of returning one row for a group, it will return result for each row based on the ‘window’ partitions that you define.

Initially, it will appears like the applications of window functions is vast and difficult to learn. But essentially, there are only four types of window functions operations.

By understanding these 4 types, you will understand nearly all applications of Window functions:

  1. Aggregation: You can create sums, averages, counts, max, min, etc. across related sets of data. For example, calculating a running total of sales.

  2. Ranking: You can rank items in your data set. For example, listing top salespeople, top 10 students, etc.

  3. Value comparison: You can compare values in a data set. For example, compare the current value with the previous value to find trends or patterns.

  4. Rolling Aggregates: Comparing specific ranks of values and aggregating if needed. You will learn the ROWS BETWEEN clause here.

Let’s use the students Table created below to illustrate the Window Function concept. You can create this table using the following schema.

Consider practicing the examples in this tutorial hands-on here.

select * from students;
student_id subject score
1 Math 85
1 English 92
1 Science 87
2 Math 91
2 English 88
2 Science 93
3 Math 78
3 English 85
3 Science 90
4 Math 92
4 English 86
4 Science 89
5 Math 90
5 English 88
5 Science 85
6 Math 93
6 English 82
6 Science 90

2. Type 1: Aggregation

This is similar to grouping using Group By, but instead of aggregating, we want to create a new column, without reducing the number of rows in the dataset.

Let’s understand with a solid example.

Task:

How to compute the average score for each student (and have the average shown as a separate column)?

Solution:

The usual way to do this is to do a ‘group by’ using student_id to get the average score for each student. Then, ‘left join’ it with the base table. This is what you typically do if you are not comfortable with Window functions.

You can achieve the same effect using Window function in one step. In the process let’s also see the syntax.

Notice the AVG(SCORE) OVER W part where W is the window you can define later.

SELECT *,
Avg(score) over w as Average_Score
FROM STUDENTS
WINDOW w as (PARTITION BY student_id);
student_id subject score Average_Score
1 Math 85 88.0
1 English 92 88.0
1 Science 87 88.0
2 Math 91 90.66666666666667
2 English 88 90.66666666666667
2 Science 93 90.66666666666667
3 Math 78 84.33333333333333
3 English 85 84.33333333333333
3 Science 90 84.33333333333333
4 Math 92 89.0
4 English 86 89.0
4 Science 89 89.0
5 Math 90 87.66666666666667
5 English 88 87.66666666666667
5 Science 85 87.66666666666667
6 Math 93 88.33333333333333
6 English 82 88.33333333333333
6 Science 90 88.33333333333333

The above command effectively creates partitions for each ‘student_id’. This is similar to groups, but without actually reducing the group to one row.

The AVG(SCORE) calculation is done on the entire partition (student_id) but is repeatedly done for every row iteratively, instead of grouping by student_id and returning one value per student_id.

Hope this is clear? If not, please read that one more time.

Now, The ‘window’ statement in the last line after ‘FROM’ is not accepted in certain databases, such as MS SQL or Oracle SQL. So, better take it to the select statement like below.

SELECT *,
Avg(score) over (PARTITION BY student_id) as Average_Score
FROM STUDENTS
LIMIT 7;
student_id subject score Average_Score
1 Math 85 88.0
1 English 92 88.0
1 Science 87 88.0
2 Math 91 90.66666666666667
2 English 88 90.66666666666667
2 Science 93 90.66666666666667
3 Math 78 84.33333333333333

2.1. Aggregation functions used with Window functions

Here are the other Aggregation functions used with Window functions. You probably know how they work, else refer to the SQL

  1. min(value): returns the minimum value across all window rows
  2. max(value): returns the maximum value
  3. count(value): returns the count of non-null values
  4. avg(value): returns the average value
  5. sum(value): returns the sum total value
  6. group_concat(value, separator): returns a string combining values using separator (SQLite and MySQL only)

3. Type 2: Ranking

This is the second type of usecase for window functions. This is easy to understand with an example as well.

Task:

How to rank the students in descending order of their Total Score?. Where, Total score is the sum of scores in all three subjects.

Solution:

First compute the total score and put in a separate table.

DROP TABLE IF EXISTS TEMP;

CREATE TABLE TEMP AS
SELECT student_id, 
       subject, 
       score,
       AVG(score) OVER (PARTITION BY student_id) as average_score,
       SUM(score) OVER (PARTITION BY student_id) as total_score
FROM students;

SELECT * FROM TEMP LIMIT 5;
student_id subject score average_score total_score
1 Math 85 88.0 264
1 English 92 88.0 264
1 Science 87 88.0 264
2 Math 91 90.66666666666667 272
2 English 88 90.66666666666667 272

Notice what is happening above in SUM(score) OVER (PARTITION BY student_id) as total_score.

We sum(score) for each student while iterating through the rows. For each row, the window includes all the rows that belong to the current row’s student_id. This window is specified using the PARTITION BY student_id command.

Now, in a similar way, compute the rank in descending order of Total Score.

SELECT *, 
RANK() OVER (order by TOTAL_SCORE DESC) as srank
FROM TEMP;
student_id subject score average_score total_score srank
2 Math 91 90.66666666666667 272 1
2 English 88 90.66666666666667 272 1
2 Science 93 90.66666666666667 272 1
4 Math 92 89.0 267 4
4 English 86 89.0 267 4
4 Science 89 89.0 267 4
6 Math 93 88.33333333333333 265 7
6 English 82 88.33333333333333 265 7
6 Science 90 88.33333333333333 265 7
1 Math 85 88.0 264 10
1 English 92 88.0 264 10
1 Science 87 88.0 264 10
5 Math 90 87.66666666666667 263 13
5 English 88 87.66666666666667 263 13
5 Science 85 87.66666666666667 263 13
3 Math 78 84.33333333333333 253 16
3 English 85 84.33333333333333 253 16
3 Science 90 84.33333333333333 253 16

See how the rows containing the same student_id contains the same rank?

What if you want to increase the rank number for every row? So that the result becomes: 1,2,3,4 instead of 1,1,1,4?

To do this, we need to state another unique column besides the total score based on which RANK() can assign the rank.

SELECT *, 
RANK() OVER (ORDER BY TOTAL_SCORE DESC, SCORE DESC) as iRANK
FROM TEMP;
student_id subject score average_score total_score iRANK
2 Science 93 90.66666666666667 272 1
2 Math 91 90.66666666666667 272 2
2 English 88 90.66666666666667 272 3
4 Math 92 89.0 267 4
4 Science 89 89.0 267 5
4 English 86 89.0 267 6
6 Math 93 88.33333333333333 265 7
6 Science 90 88.33333333333333 265 8
6 English 82 88.33333333333333 265 9
1 English 92 88.0 264 10
1 Science 87 88.0 264 11
1 Math 85 88.0 264 12
5 Math 90 87.66666666666667 263 13
5 English 88 87.66666666666667 263 14
5 Science 85 87.66666666666667 263 15
3 Science 90 84.33333333333333 253 16
3 English 85 84.33333333333333 253 17
3 Math 78 84.33333333333333 253 18

3.1. Ranking Functions

  1. row_number(): returns the row ordinal number
  2. dense_rank(): returns row rank
  3. rank(): returns row rank with possible gaps (see below)
  4. ntile(n): splits all rows into n groups and returns the index of the group that the row belongs to

4. Difference between rank() and dense_rank()?

When items have same value, rank() will increment to the row_number() value when the next value occurs. Whereas, dense_rank() will increment by only 1.

Example: Trying to rank the following – 10, 10, 10, 17, 17, 29

  • rank() output: 1,1,1,4,4,6
  • dense_rank() output: 1,1,1,2,2,3

5. Type 3: Comparing boundaries with Offsets

Let’s suppose you want to compare the salaries of adjacent members, that is, you want to find what percentage the current value is larger than the previous value.

Let’t first create the salaries table.

DROP TABLE IF EXISTS SALARIES;

CREATE TABLE salaries (
    id INT,
    name VARCHAR(20),
    city VARCHAR(20),
    dept VARCHAR(20),
    salary FLOAT
);

INSERT INTO SALARIES (id, name, city, dept, salary) VALUES
(21, 'Dhanya', 'Chennai', 'hr', 75),
(22, 'Bob', 'London', 'hr', 71),
(31, 'Akira', 'Chennai', 'it', 89),
(32, 'Grace', 'Berlin', 'it', 60),
(33, 'Steven', 'London', 'it', 103),
(34, 'Ramesh', 'Chennai', 'it', 103),
(35, 'Frank', 'Berlin', 'it', 120),
(41, 'Cindy', 'Berlin', 'sales', 95),
(42, 'Yetunde', 'London', 'sales', 95),
(43, 'Alice', 'Berlin', 'sales', 100);

Practice Hands-On Here. The table schema is created, you will need to write the queries on the right hand side panel and run.

Task:

How to find the percentage diff increase in salaries of neighbours?

Solution:

We need to have the salaries sorted (by salary ascending) and then look up the previous salary value. Then, use the current and previous value to compute the difference.

Let’s first get the previous value using lag function.

select id, 
        name, 
        city, 
        dept, 
        salary,
        lag(salary) over (order by id asc) as prev_salary
from salaries;
id name city dept salary prev_salary
21 Dhanya Chennai hr 75.0 None
22 Bob London hr 71.0 75.0
31 Akira Chennai it 89.0 71.0
32 Grace Berlin it 60.0 89.0
33 Steven London it 103.0 60.0
34 Ramesh Chennai it 103.0 103.0
35 Frank Berlin it 120.0 103.0
41 Cindy Berlin sales 95.0 120.0
42 Yetunde London sales 95.0 95.0
43 Alice Berlin sales 100.0 95.0

The previous salary is created, we can now create the percentage difference.

select *, 
        round((salary - prev_salary)/prev_salary, 4) * 100 as perc_diff 
from (select id, 
        name, 
        city, 
        dept, 
        salary,
        lag(salary) over (order by id asc) as prev_salary
from salaries);
id name city dept salary prev_salary perc_diff
21 Dhanya Chennai hr 75.0 None None
22 Bob London hr 71.0 75.0 -5.33
31 Akira Chennai it 89.0 71.0 25.35
32 Grace Berlin it 60.0 89.0 -32.58
33 Steven London it 103.0 60.0 71.67
34 Ramesh Chennai it 103.0 103.0 0.0
35 Frank Berlin it 120.0 103.0 16.5
41 Cindy Berlin sales 95.0 120.0 -20.830000000000002
42 Yetunde London sales 95.0 95.0 0.0
43 Alice Berlin sales 100.0 95.0 5.26

This can be written using a with statement as well to improve the readability.

with sal2 as (
select id, 
        name, 
        city, 
        dept, 
        salary,
        lag(salary) over (order by id asc) as prev_salary
from salaries)
select *, 
        round((salary - prev_salary)/prev_salary, 4) * 100 as perc_diff 
from sal2;
id name city dept salary prev_salary perc_diff
21 Dhanya Chennai hr 75.0 None None
22 Bob London hr 71.0 75.0 -5.33
31 Akira Chennai it 89.0 71.0 25.35
32 Grace Berlin it 60.0 89.0 -32.58
33 Steven London it 103.0 60.0 71.67
34 Ramesh Chennai it 103.0 103.0 0.0
35 Frank Berlin it 120.0 103.0 16.5
41 Cindy Berlin sales 95.0 120.0 -20.830000000000002
42 Yetunde London sales 95.0 95.0 0.0
43 Alice Berlin sales 100.0 95.0 5.26

Alternately, we could’ve computed the lead and then computed the differences as well.

Let’s look at one more example.

Task 2:

How to compute the perc difference between the min and max values of salary in each department, while keeping the entries ordered by salary?

That is, you want to compute the percentage difference between the first (min) and last (max) values in each department.

Solution:

Let’s try using the min() and max() as window functions.

select id, 
        name, 
        city, 
        dept, 
        salary,
        min(salary) over (partition by dept order by id asc) as min_salary,
        max(salary) over (partition by dept order by id asc) as max_salary
from salaries
id name city dept salary min_salary max_salary
21 Dhanya Chennai hr 75.0 75.0 75.0
22 Bob London hr 71.0 71.0 75.0
31 Akira Chennai it 89.0 89.0 89.0
32 Grace Berlin it 60.0 60.0 89.0
33 Steven London it 103.0 60.0 103.0
34 Ramesh Chennai it 103.0 60.0 103.0
35 Frank Berlin it 120.0 60.0 120.0
41 Cindy Berlin sales 95.0 95.0 95.0
42 Yetunde London sales 95.0 95.0 95.0
43 Alice Berlin sales 100.0 95.0 100.0

There is a problem here.

The min values look fine but the max values aren’t.

Can we try using the first_value() and last_value() functions?

select id, 
        name, 
        city, 
        dept, 
        salary,
        first_value(salary) over (partition by dept order by id asc) as min_salary,
        last_value(salary) over (partition by dept order by id asc) as max_salary
from salaries
id name city dept salary min_salary max_salary
21 Dhanya Chennai hr 75.0 75.0 75.0
22 Bob London hr 71.0 75.0 71.0
31 Akira Chennai it 89.0 89.0 89.0
32 Grace Berlin it 60.0 89.0 60.0
33 Steven London it 103.0 89.0 103.0
34 Ramesh Chennai it 103.0 89.0 103.0
35 Frank Berlin it 120.0 89.0 120.0
41 Cindy Berlin sales 95.0 95.0 95.0
42 Yetunde London sales 95.0 95.0 95.0
43 Alice Berlin sales 100.0 95.0 100.0

This does not work as well. Can you figure what is happening here?

Because, eventhough we have defined the window parition by department, since we are using order by, the processing is valid ONLY for the ‘Window Frame’, where the ‘frame’ goes from the first row to the current iteration row.

This is the default behaviour of the Frame. And this can be controlled, which we will do shortly.

But, let’s first understand the ‘Frame’ within the ‘Window’ clearly.

Ok. Have a look at the above ‘Salaries’ table.

When the iteration starts, the current row is the first row, whichever it is.

As the iteration proceeds for a given window (department), the frame size expands cumulatively from the beginning row up to the current row.

That is, when current row = 1, frame = 1st row only. But when the current row is 4, the frame includes the first 4 rows.

Look, how it is progressing below:

Window 1: Dept = HR
– Current row = 1, Frame rows = 1
– Current row = 2, Frame rows = 1,2

Window 2: Dept = IT
– Current row = 1, Frame rows = 1
– Current row = 2, Frame rows = 1,2
– Current row = 3, Frame rows = 1,2,3
– Current row = 4, Frame rows = 1,2,3,4

And so on..

We can abosolutely control the size of the frame in each iteration. I cover this in the next section.

But, let’s start by solving this usecase: Computing the min and max salary for each department, while keeping the order of neighbours.

Here it is.

select id, 
        name, 
        city, 
        dept, 
        salary,
        first_value(salary) over (partition by dept order by salary asc) as min_salary,
        last_value(salary) over (partition by dept 
                                 order by salary 
                                 rows between unbounded preceding and unbounded following) as max_salary
from salaries;
id name city dept salary min_salary max_salary
22 Bob London hr 71.0 71.0 75.0
21 Dhanya Chennai hr 75.0 71.0 75.0
32 Grace Berlin it 60.0 60.0 120.0
31 Akira Chennai it 89.0 60.0 120.0
33 Steven London it 103.0 60.0 120.0
34 Ramesh Chennai it 103.0 60.0 120.0
35 Frank Berlin it 120.0 60.0 120.0
41 Cindy Berlin sales 95.0 95.0 100.0
42 Yetunde London sales 95.0 95.0 100.0
43 Alice Berlin sales 100.0 95.0 100.0

The key part that makes the frame extend to the full width of the window is: rows between unbounded preceding and unbounded following. We will break this down in detail in the following section.

5.1. Offset Functions you will need

  • lag(value, offset): returns the value from the record that is offset rows behind the current one
  • lead(value, offset): returns the value from the record that is offset rows ahead of the current one
  • first_value(value): returns the value from the first row of the frame
  • last_value(value): returns the value from the last row of the frame
  • nth_value(value, n): returns the value from the n-th row of the frame

6. Type 4: Rolling Aggregates with ROWS BETWEEN clause

Let’s continue and take our understanding of window functions up a notch.

To get a complete understanding of window functions, you need to fully understand the rows between command.

Again, let’s understand using an example.

Task:

Compute difference between the minimum and the second last value.

Solution:

Previously we saw how to compute the difference between the maximum and minimum. Let’s use those ideas to get the diff between second max and the minimum value.

First, to get the second last value we need to know the number of values in the group. Once you know this, you can get whichever value you want, be it second last, third from last and so on.

So, let’s get the number of items in each group first.

We are extending on top of the previous query, so look at the count(*) statement in the below query.

select id, 
        name, 
        city, 
        dept, 
        salary,
        first_value(salary) over (partition by dept order by salary asc) as min_salary,
        last_value(salary) over (partition by dept 
                                 order by salary 
                                 rows between unbounded preceding and unbounded following) as max_salary,
        count(*) over (partition by dept
                       rows between unbounded preceding and unbounded following) as count
from salaries;
id name city dept salary min_salary max_salary count
22 Bob London hr 71.0 71.0 75.0 2
21 Dhanya Chennai hr 75.0 71.0 75.0 2
32 Grace Berlin it 60.0 60.0 120.0 5
31 Akira Chennai it 89.0 60.0 120.0 5
33 Steven London it 103.0 60.0 120.0 5
34 Ramesh Chennai it 103.0 60.0 120.0 5
35 Frank Berlin it 120.0 60.0 120.0 5
41 Cindy Berlin sales 95.0 95.0 100.0 3
42 Yetunde London sales 95.0 95.0 100.0 3
43 Alice Berlin sales 100.0 95.0 100.0 3

6.1. Understanding ROWS BETWEEN Clause

Note here the general structure of rows between is as follows.

The syntax is:

ROWS BETWEEN <lower_bound> AND <upper_bound>

The purpose of the ROWS clause is to specify the starting and the ending row of the frame in relation to the ‘current row’.

Starting and ending rows might be fixed or relative to the current row based on the following keywords:

  1. CURRENT ROW: the current row
  2. UNBOUNDED PRECEDING: all rows before the current row -> fixed
  3. UNBOUNDED FOLLOWING: all rows after the current row -> fixed
  4. ‘n’ PRECEDING: ‘n’ rows before the current row -> relative
  5. ‘n’ FOLLOWING: ‘n’ rows after the current row -> relative

Here are few examples of how you can use the ROWS BETWEEN clause:

Example 1: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: Frame would include the entire window for each iteration.
Example 2: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: would include from 1st row in window to current row.
Example 3: ROWS BETWEEN UNBOUNDED PRECEDING AND 2 PRECEDING: would include from 1st row in window to 2 rows BEFORE the current row.
Example 4: ROWS BETWEEN UNBOUNDED PRECEDING AND 3 FOLLOWING: would include from 1st row in window to 3 rows AFTER the current row.

Questions:

What would it be for:

  1. Two rows before and two rows after the current row? (Easy)
  2. Two rows before to one before the last row? (Intermediate). Hint: find the count.

So, how to get the second last value?

–> Sort by salary first, then, use nth_value() and count column to get the n-1’th item.

select * from salaries;
id name city dept salary
21 Dhanya Chennai hr 75.0
22 Bob London hr 71.0
31 Akira Chennai it 89.0
32 Grace Berlin it 60.0
33 Steven London it 103.0
34 Ramesh Chennai it 103.0
35 Frank Berlin it 120.0
41 Cindy Berlin sales 95.0
42 Yetunde London sales 95.0
43 Alice Berlin sales 100.0

Count the number of rows in each partition. We will later use this along with nth_value function next.

select id, 
        name, 
        city, 
        dept, 
        salary,
        first_value(salary) over (partition by dept order by salary asc) as min_salary,
        last_value(salary) over (partition by dept 
                                 order by salary 
                                 rows between unbounded preceding and unbounded following) as max_salary,
        count(*) over (partition by dept 
                                 order by salary 
                                 rows between unbounded preceding and unbounded following) as count
from salaries;

id name city dept salary min_salary max_salary count
22 Bob London hr 71.0 71.0 75.0 2
21 Dhanya Chennai hr 75.0 71.0 75.0 2
32 Grace Berlin it 60.0 60.0 120.0 5
31 Akira Chennai it 89.0 60.0 120.0 5
33 Steven London it 103.0 60.0 120.0 5
34 Ramesh Chennai it 103.0 60.0 120.0 5
35 Frank Berlin it 120.0 60.0 120.0 5
41 Cindy Berlin sales 95.0 95.0 100.0 3
42 Yetunde London sales 95.0 95.0 100.0 3
43 Alice Berlin sales 100.0 95.0 100.0 3

Extract the second last value

This can be done as a subquery using the nth_value(salary, count-1) call.

DROP TABLE If EXISTS TMP2;
CREATE TABLE TMP2 AS
with tmp as (
select id, 
        name, 
        city, 
        dept, 
        salary,
        first_value(salary) over (partition by dept order by salary asc) as min_salary,
        last_value(salary) over (partition by dept 
                                 order by salary 
                                 rows between unbounded preceding and unbounded following) as max_salary,
        count(*) over (partition by dept 
                                 order by salary 
                                 rows between unbounded preceding and unbounded following) as count
from salaries)
select id,name,city,dept,salary,
        min_salary, 
        nth_value(salary, count-1) over (partition by dept 
                                         order by salary
                                        rows between unbounded preceding and unbounded following) as last_minus_1_salary
from tmp;        

Finally, compute the difference between last_minus_1 and minimum salary.

SELECT *, (last_minus_1_salary - min_salary) as DIFF 
FROM TMP2;
id name city dept salary min_salary last_minus_1_salary DIFF
22 Bob London hr 71.0 71.0 71.0 0.0
21 Dhanya Chennai hr 75.0 71.0 71.0 0.0
32 Grace Berlin it 60.0 60.0 103.0 43.0
31 Akira Chennai it 89.0 60.0 103.0 43.0
33 Steven London it 103.0 60.0 103.0 43.0
34 Ramesh Chennai it 103.0 60.0 103.0 43.0
35 Frank Berlin it 120.0 60.0 103.0 43.0
41 Cindy Berlin sales 95.0 95.0 95.0 0.0
42 Yetunde London sales 95.0 95.0 95.0 0.0
43 Alice Berlin sales 100.0 95.0 95.0 0.0

Congratulations if you have made it till here! You should have a fair understanding of Window functions now.

Next step is to try and practice SQL Window functions. I also teach complete SQL for Data Science and more, I look forward to see you inside.

Course Preview

Machine Learning A-Z™: Hands-On Python & R In Data Science

Free Sample Videos:

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science