Post

stratascratch 2054: Consecutive Days

stratascratch 2054: Consecutive Days

stratascratch 2054 Consecutive Days

Intuition & Approach

The question requires to write a sql query to find any user who has 3 consecutive days with activity.

Step 1: Simple Intutive Solution

The very basic or intutive solution is, join in 2 levels.

1
2
3
SELECT DISTINCT a.user_id FROM sf_events AS a
INNER JOIN sf_events AS b ON a.user_id = b.user_id AND b.record_date = a.record_date + '1 days'
INNER JOIN sf_events AS c ON a.user_id = c.user_id AND c.record_date = b.record_date + '1 days';

Step2: Analysis Complexity

What is the worst case? Assume there’s very terrible situation, most user has 2 consecutive days within the database. Then you should compare almost $\Theta(n^3)$ times! It’s a disaster.

Optimization & Extension

Step3: Consider general case

If you’re asked to find 4,5,6 even 20 consecutive days, will you write 20 level join clause? That’s impossible.

What about: group the first date together?

This is written in postgresql. Let’s try to get ROW_NUMBER() as the baseline of first date. For example, you have Jan 1 asthe baseline, then Jan 2 will be on the second row, minus 1 more days compare with Jan 1, they’re in the same group! Otherwise, if the next row is Jan 3, it won’t be in the same group with Jan 1.

1
2
3
4
5
6
sf_processed AS (
    SELECT
        user_id,
        record_date - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY record_date ASC))::INTEGER AS group_id
    FROM sf_cleaned
)

So we only need to count how many dumplicates in the table.

1
2
3
4
SELECT user_id
FROM grouped_events
GROUP BY user_id, group_id
HAVING COUNT(*) >= 3;

Additionally, don’t forget to clean your data! There may be multiple rows for one user on the same day!

1
2
with sf_cleaned as(
select distinct user_id, record_date from sf_events),

Step4: Complexity Analysis

This only counts $O(n\log{n})$. Why? There’s a sort algorithm!

When you’re asked by your boss to count 21 consecutive days, this idea saves a lot of time!

This post is licensed under CC BY 4.0 by the author.