Session Groups Problem
You run an ecommerce site called shoesfordogs.com. You want to analyze your visitors, so you compile a DataFrame called hits that records each time a visitor hit some page on your site.
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
generator = np.random.default_rng(90)
products = ['iev','pys','vae','dah','yck','axl','apx','evu','wqv','tfg','aur','rgy','kef','lzj','kiz','oma']
hits = pd.DataFrame({
'visitor_id':generator.choice(5, size=20, replace=True) + 1,
'session_id':generator.choice(4, size=20, replace=True),
'date_time':pd.to_datetime('2020-01-01') + pd.to_timedelta(generator.choice(60, size=20), unit='m'),
'page_url':[f'shoesfordogs.com/product/{x}' for x in generator.choice(products, size=20, replace=True)]
})
hits['session_id'] = hits.visitor_id * 100 + hits.session_id
print(hits)
# visitor_id session_id date_time page_url
# 0 4 400 2020-01-01 00:05:00 shoesfordogs.com/product/pys
# 1 2 200 2020-01-01 00:18:00 shoesfordogs.com/product/oma
# 2 1 102 2020-01-01 00:48:00 shoesfordogs.com/product/evu
# 3 4 403 2020-01-01 00:21:00 shoesfordogs.com/product/oma
# 4 2 201 2020-01-01 00:40:00 shoesfordogs.com/product/yck
# 5 3 302 2020-01-01 00:33:00 shoesfordogs.com/product/pys
# 6 2 203 2020-01-01 00:37:00 shoesfordogs.com/product/rgy
# 7 3 302 2020-01-01 00:54:00 shoesfordogs.com/product/tfg
# 8 3 302 2020-01-01 00:48:00 shoesfordogs.com/product/kef
# 9 4 402 2020-01-01 00:24:00 shoesfordogs.com/product/apx
# 10 3 300 2020-01-01 00:49:00 shoesfordogs.com/product/kef
# 11 1 101 2020-01-01 00:52:00 shoesfordogs.com/product/iev
# 12 3 302 2020-01-01 00:01:00 shoesfordogs.com/product/dah
# 13 4 403 2020-01-01 00:02:00 shoesfordogs.com/product/lzj
# 14 4 401 2020-01-01 00:42:00 shoesfordogs.com/product/evu
# 15 5 500 2020-01-01 00:39:00 shoesfordogs.com/product/apx
# 16 5 503 2020-01-01 00:31:00 shoesfordogs.com/product/dah
# 17 3 303 2020-01-01 00:01:00 shoesfordogs.com/product/lzj
# 18 2 200 2020-01-01 00:16:00 shoesfordogs.com/product/aur
# 19 1 100 2020-01-01 00:11:00 shoesfordogs.com/product/apx
You suspect that the undocumented third-party tracking system on your website is buggy and sometimes splits one session into two or more session_ids. You want to correct this behavior by creating a field called session_group_id that stitches broken session_ids together.
Two sessions, A and B, should belong to the same session group if
- They have the same visitor_id, and
  - Their hits overlap in time, or
  - The latest hit from A is within five minutes of the earliest hit from B, or vice versa
Transitivity applies. So, if A is grouped with B, and B is grouped with C, then A should be grouped with C as well.
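To make the chaining rule concrete, here is a minimal plain-Python sketch for a single visitor. It is an illustration, not a reference solution. The function name group_sessions, the gap parameter, and the sample sessions A to D are all hypothetical. It assumes that "overlap in time" means the [first hit, last hit] intervals of two sessions intersect, and that "within five minutes" includes a gap of exactly five minutes.

```python
from datetime import datetime, timedelta


def group_sessions(sessions, gap=timedelta(minutes=5)):
    """Map each session_id to a group number for ONE visitor.

    sessions: list of (session_id, first_hit, last_hit) tuples.
    Assumption: a gap of exactly `gap` still joins two sessions.
    """
    groups = {}
    group = -1
    latest_end = None  # latest hit seen so far in the current group
    # Sweep through the sessions in order of their first hit.
    for session_id, first_hit, last_hit in sorted(sessions, key=lambda s: s[1]):
        if latest_end is None or first_hit > latest_end + gap:
            # Too far from everything seen so far: start a new group.
            group += 1
            latest_end = last_hit
        else:
            # Overlaps or is close enough: extend the current group.
            latest_end = max(latest_end, last_hit)
        groups[session_id] = group
    return groups


def minute(m):
    """Hypothetical helper: a timestamp m minutes past midnight on 2020-01-01."""
    return datetime(2020, 1, 1, 0, m)


sessions = [
    ("A", minute(0), minute(10)),
    ("B", minute(14), minute(20)),  # starts 4 minutes after A ends
    ("C", minute(25), minute(30)),  # starts 5 minutes after B ends
    ("D", minute(40), minute(45)),  # starts 10 minutes after C ends
]
print(group_sessions(sessions))
# {'A': 0, 'B': 0, 'C': 0, 'D': 1}
```

A and C are 15 minutes apart, yet they land in the same group because each is linked to B. D is more than five minutes from every other session, so it starts a new group.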
Create a column in hits called session_group_id that identifies which hits belong to the same session group.
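One possible way to approach this in pandas is sketched below on a small hand-built frame, so that the result is easy to check by eye. This is an assumption-laden sketch, not the intended solution. The function name add_session_group_id, the gap parameter, the helper columns first_hit and last_hit, and the toy data are all hypothetical. It reads "overlap in time" as overlapping [first hit, last hit] intervals, treats a gap of exactly five minutes as close enough, and assigns arbitrary integer group ids.

```python
import pandas as pd


def add_session_group_id(hits, gap=pd.Timedelta(minutes=5)):
    """Return a copy of hits with a session_group_id column (sketch)."""
    # One row per session: its visitor plus first and last hit times.
    sessions = (
        hits.groupby(['visitor_id', 'session_id'])['date_time']
        .agg(first_hit='min', last_hit='max')
        .reset_index()
        .sort_values(['visitor_id', 'first_hit'])
    )
    # Latest hit among the same visitor's earlier sessions (NaT for the first).
    prev_end = sessions.groupby('visitor_id')['last_hit'].transform(
        lambda s: s.cummax().shift()
    )
    # A new group starts at a visitor's first session, or when the gap
    # since the latest earlier hit exceeds the threshold.
    new_group = prev_end.isna() | (sessions['first_hit'] - prev_end > gap)
    sessions['session_group_id'] = new_group.cumsum()
    # Attach the group id back onto every hit, keeping the original row order.
    return hits.merge(
        sessions[['visitor_id', 'session_id', 'session_group_id']],
        on=['visitor_id', 'session_id'],
        how='left',
    )


# Hypothetical toy data, not the hits DataFrame from the problem.
toy = pd.DataFrame({
    'visitor_id': [1, 1, 1, 1, 2, 2],
    'session_id': [100, 100, 101, 102, 200, 201],
    'date_time': pd.to_datetime([
        '2020-01-01 00:00', '2020-01-01 00:10',  # session 100
        '2020-01-01 00:13',                      # session 101, 3 minutes after 100 ends
        '2020-01-01 00:30',                      # session 102, 17 minutes after 101
        '2020-01-01 00:05',                      # session 200
        '2020-01-01 00:07',                      # session 201, 2 minutes after 200
    ]),
})
result = add_session_group_id(toy)
print(result['session_group_id'].tolist())
# [1, 1, 1, 2, 3, 3]
```

Tracking the running maximum of last_hit, rather than only the previous session's last hit, matters when one long session fully contains a shorter one: a later session may still be within five minutes of the long session's final hit.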