Tinder Coach Solution
import numpy as np
from scipy import sparse
# build visitor_id:row_index mapping
row_keys = {}
for idx, val in enumerate(visitor_ids):
    row_keys[val] = idx
# build page:col_index mapping
col_keys = {}
for idx, val in enumerate(pages):
    col_keys[val] = idx
# determine the row & col index for each data point
row_idxs = []
col_idxs = []
for key in visits.keys():
    visitor_id = key[0]
    page = key[1]
    row_idxs.append(row_keys[visitor_id])
    col_idxs.append(col_keys[page])
# Build the csc matrix
data = list(visits.values())
mat = sparse.csc_matrix((data, (row_idxs, col_idxs)))
# print subset
print_visitor_ids = [1443, 6584, 7040]
print_pages = ['tindercoach.com/chl', 'tindercoach.com/nky', 'tindercoach.com/zmr']
print_row_idxs = [row_keys[x] for x in print_visitor_ids]
print_col_idxs = [col_keys[x] for x in print_pages]
print(mat[np.ix_(print_row_idxs, print_col_idxs)].todense())
# [[0 1 0]
#  [2 0 1]
#  [1 0 0]]
Explanation
Our strategy is to build three lists:
- data: contains the non-zero elements of the matrix
- row_idxs: contains the row index of each non-zero element
- col_idxs: contains the column index of each non-zero element
Then we can instantiate a CSC matrix with sparse.csc_matrix((data, (row_idxs, col_idxs))).
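To make the three-list idea concrete, here is a minimal sketch on toy data. The values and indices below are made up for illustration and are not part of the exercise's dataset.

```python
from scipy import sparse

# Toy data (hypothetical): three non-zero entries of a 3x3 matrix.
data = [5, 7, 9]      # the non-zero values
row_idxs = [0, 2, 1]  # row index of each value
col_idxs = [1, 0, 2]  # column index of each value

# Entry k of the matrix is data[k] at position (row_idxs[k], col_idxs[k]).
toy = sparse.csc_matrix((data, (row_idxs, col_idxs)))
print(toy.toarray())
# [[0 5 0]
#  [0 0 9]
#  [7 0 0]]
```

Each value lands at the position given by the matching entries of the two index lists; every other cell is an implicit zero.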
- Build mappings for visitor_id:row_index and page:col_index.

    # build visitor_id:row_index mapping
    row_keys = {}
    for idx, val in enumerate(visitor_ids):
        row_keys[val] = idx

    print(row_keys)
    # {
    #     7040: 0,
    #     4545: 1,
    #     ...,
    #     6584: 8,
    #     5502: 9
    # }

    # build page:col_index mapping
    col_keys = {}
    for idx, val in enumerate(pages):
        col_keys[val] = idx

    print(col_keys)
    # {
    #     'tindercoach.com/gqy': 0,
    #     'tindercoach.com/yez': 1,
    #     ...,
    #     'tindercoach.com/eaw': 98,
    #     'tindercoach.com/eow': 99
    # }

These mappings allow us to get the row / column index for any visitor_id / page.
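As an aside, the same kind of mapping can be written more compactly with a dict comprehension. The sketch below uses short, hypothetical stand-ins for visitor_ids and pages, and assumes each id and page appears only once in its list.

```python
# Hypothetical sample inputs standing in for the exercise's visitor_ids and pages.
visitor_ids = [7040, 4545, 1443]
pages = ['tindercoach.com/gqy', 'tindercoach.com/yez']

# Map each value to its position in the list.
row_keys = {val: idx for idx, val in enumerate(visitor_ids)}
col_keys = {val: idx for idx, val in enumerate(pages)}

print(row_keys)  # {7040: 0, 4545: 1, 1443: 2}
print(col_keys)  # {'tindercoach.com/gqy': 0, 'tindercoach.com/yez': 1}
```

This builds the same dictionaries as the explicit loops; which form to use is a matter of taste.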
- Determine the row and column index for each data point.

    row_idxs = []
    col_idxs = []
    for key in visits.keys():
        visitor_id = key[0]                    # (1)
        page = key[1]                          # (2)
        row_idxs.append(row_keys[visitor_id])  # (3)
        col_idxs.append(col_keys[page])        # (4)

  (1) The keys in visits are (visitor_id, page) tuples. Here we fetch the visitor_id from the first position of the current key tuple.
  (2) Likewise, we fetch the page from the second position of the current key tuple.
  (3) row_keys[visitor_id] gives us the row index for the current visitor_id. We append this to row_idxs.
  (4) col_keys[page] gives us the column index for the current page. We append this to col_idxs.
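One possible variation, sketched here on a tiny hypothetical visits dictionary, is to iterate over visits.items() and collect the indices and the counts in a single pass. This is an alternative to the approach above rather than a correction of it; it simply makes the pairing of each count with its (row, col) position explicit.

```python
# Hypothetical sample data standing in for the exercise's inputs.
visits = {
    (7040, 'tindercoach.com/gqy'): 2,
    (4545, 'tindercoach.com/yez'): 1,
    (7040, 'tindercoach.com/yez'): 1,
}
row_keys = {7040: 0, 4545: 1}
col_keys = {'tindercoach.com/gqy': 0, 'tindercoach.com/yez': 1}

row_idxs, col_idxs, data = [], [], []
for (visitor_id, page), count in visits.items():
    row_idxs.append(row_keys[visitor_id])  # row index for this visitor
    col_idxs.append(col_keys[page])        # column index for this page
    data.append(count)                     # number of visits for this pair

print(row_idxs, col_idxs, data)  # [0, 1, 0] [0, 1, 1] [2, 1, 1]
```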
- Build the csc_matrix.

    data = list(visits.values())  # (1)
    mat = sparse.csc_matrix((data, (row_idxs, col_idxs)))

  (1) visits.values() returns the number of times each visitor visited each page.

    print(visits.values())
    # dict_values([1, 1, 1, 1, 2, 1, 1, 2, ...])

  The values come back in the same order as visits.keys(), so each count lines up with the row and column index computed for it. However, they are returned as a dict_values instance, so we convert them to a list that csc_matrix() can handle properly on the next line.
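One caveat worth sketching: when no shape is given, csc_matrix infers the dimensions from the largest row and column index it sees. If the last visitor or the last page happened to have no visits, the matrix would come out smaller than expected. The toy example below (hypothetical data, 3 visitors and 4 pages) shows the difference when the shape keyword is passed explicitly.

```python
from scipy import sparse

# Hypothetical setup: 3 visitors and 4 pages, but the last visitor
# and the last page have no recorded visits.
data = [1, 2]
row_idxs = [0, 1]
col_idxs = [0, 2]

inferred = sparse.csc_matrix((data, (row_idxs, col_idxs)))
explicit = sparse.csc_matrix((data, (row_idxs, col_idxs)), shape=(3, 4))

print(inferred.shape)  # (2, 3)
print(explicit.shape)  # (3, 4)
```

In the exercise's terms, this would presumably mean passing shape=(len(visitor_ids), len(pages)); whether it is needed depends on whether every visitor and page appears in visits.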
- Print the sub matrix showing visitors 1443, 6584, and 7040 and pages tindercoach.com/chl, tindercoach.com/nky, and tindercoach.com/zmr.

  We can fetch the appropriate row and column indices as follows:

    print_visitor_ids = [1443, 6584, 7040]
    print_pages = ['tindercoach.com/chl', 'tindercoach.com/nky', 'tindercoach.com/zmr']
    print_row_idxs = [row_keys[x] for x in print_visitor_ids]
    print_col_idxs = [col_keys[x] for x in print_pages]

    print(print_row_idxs)  # [2, 8, 0]
    print(print_col_idxs)  # [38, 17, 86]

  To fetch the sub matrix indexed by these row and column indices, we can do

    submat = mat[np.ix_(print_row_idxs, print_col_idxs)]
    print(submat.todense())
    # [[0 1 0]
    #  [2 0 1]
    #  [1 0 0]]

Info
If we simply did mat[print_row_idxs, print_col_idxs], scipy would fetch three elements from the matrix: the elements at positions (2, 38), (8, 17), and (0, 86). This is consistent with NumPy array indexing behavior, but it's not the behavior we want.

Rather, we want to fetch all combinations of (row, col) indices from our lists print_row_idxs and print_col_idxs (9 combinations in total, forming a 3x3 sub matrix).

We use np.ix_() to accomplish this. np.ix_(print_row_idxs, print_col_idxs) constructs an open mesh from the input lists / arrays.

    np.ix_(print_row_idxs, print_col_idxs)
    # (array([[2],
    #         [8],
    #         [0]]), array([[38, 17, 86]]))

When these arrays are used to index mat, they are essentially broadcast into the 3x3 i and j index arrays.

    i index array (after broadcasting)
    [[2 2 2]
     [8 8 8]
     [0 0 0]]

    j index array (after broadcasting)
    [[38 17 86]
     [38 17 86]
     [38 17 86]]
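The difference between the two indexing styles is easier to see on a small matrix. The sketch below uses a made-up 3x4 matrix (not the exercise's data) and compares pairwise indexing with np.ix_() indexing, then shows the broadcast index arrays via np.broadcast_arrays.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x4 matrix:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
mat = sparse.csc_matrix(np.arange(12).reshape(3, 4))
rows = [2, 0]
cols = [1, 3]

# Pairwise indexing: picks only the elements at (2, 1) and (0, 3).
pairwise = np.asarray(mat[rows, cols]).ravel().tolist()
print(pairwise)  # [9, 3]

# Open-mesh indexing: picks every (row, col) combination as a 2x2 block.
block = mat[np.ix_(rows, cols)].toarray()
print(block)
# [[ 9 11]
#  [ 1  3]]

# The open mesh broadcasts to full i and j index arrays.
i, j = np.broadcast_arrays(*np.ix_(rows, cols))
print(i.tolist())  # [[2, 2], [0, 0]]
print(j.tolist())  # [[1, 3], [1, 3]]
```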