
Tinder Coach Solution


import numpy as np
from scipy import sparse

# build visitor_id:row_index mapping
row_keys = {}
for idx, val in enumerate(visitor_ids):
  row_keys[val] = idx

# build page:col_index mapping
col_keys = {}
for idx, val in enumerate(pages):
  col_keys[val] = idx

# determine the row & col index for each data point
row_idxs = []
col_idxs = []
for key in visits.keys():
    visitor_id = key[0]
    page = key[1]
    row_idxs.append(row_keys[visitor_id])
    col_idxs.append(col_keys[page])

# Build the csc matrix
data = list(visits.values())
mat = sparse.csc_matrix((data, (row_idxs, col_idxs)))

# print subset
print_visitor_ids = [1443, 6584, 7040]
print_pages = ['tindercoach.com/chl', 'tindercoach.com/nky', 'tindercoach.com/zmr']
print_row_idxs = [row_keys[x] for x in print_visitor_ids]
print_col_idxs = [col_keys[x] for x in print_pages]
print(mat[np.ix_(print_row_idxs, print_col_idxs)].todense())
# [[0 1 0]
#  [2 0 1]
#  [1 0 0]]

Explanation

Our strategy is to build three lists:

  • data: contains the non-zero elements of the matrix
  • row_idxs: contains the row index of each non-zero element
  • col_idxs: contains the column index of each non-zero element

Then we can instantiate a CSC matrix with sparse.csc_matrix((data, (row_idxs, col_idxs))).
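To make the three-list idea concrete, here is a minimal sketch on made-up data. The names toy_data, toy_rows, and toy_cols and their values are hypothetical and not part of the original problem. It also illustrates one thing worth keeping in mind: when no shape is passed, csc_matrix infers the shape from the largest row and column index it sees, so trailing visitors or pages with no visits would be dropped. Passing shape explicitly avoids that.

```python
from scipy import sparse

# Hypothetical toy triplets: value, row index, column index
toy_data = [1, 3, 2]
toy_rows = [0, 1, 2]
toy_cols = [1, 1, 0]

# Shape is inferred from the largest indices: 3 rows, 2 columns
toy = sparse.csc_matrix((toy_data, (toy_rows, toy_cols)))
print(toy.shape)

# Passing shape keeps columns 2 and 3 even though they hold no data
toy_full = sparse.csc_matrix((toy_data, (toy_rows, toy_cols)), shape=(3, 4))
print(toy_full.toarray())
```

In the solution, the analogous explicit shape would presumably be (len(visitor_ids), len(pages)), assuming every row and column should be kept.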

  1. Build mappings for visitor_id:row_index and page:col_index.

    # build visitor_id:row_index mapping
    row_keys = {}
    for idx, val in enumerate(visitor_ids):
      row_keys[val] = idx
    
    print(row_keys)
    # {
    #  7040: 0, 
    #  4545: 1, 
    #  ..., 
    #  6584: 8, 
    #  5502: 9
    # }
    
    # build page:col_index mapping
    col_keys = {}
    for idx, val in enumerate(pages):
      col_keys[val] = idx
    
    print(col_keys)
    # {
    #  'tindercoach.com/gqy': 0, 
    #  'tindercoach.com/yez': 1, 
    #  ..., 
    #  'tindercoach.com/eaw': 98, 
    #  'tindercoach.com/eow': 99
    # }
    

    These mappings allow us to get the row / column index for any visitor_id / page.

  2. Determine the row and column index for each data point.

    row_idxs = []
    col_idxs = []
    for key in visits.keys():
        visitor_id = key[0]                   # (1)!
        page = key[1]                         # (2)!
        row_idxs.append(row_keys[visitor_id]) # (3)!
        col_idxs.append(col_keys[page])       # (4)!
    
    1. The keys in visits are (visitor_id, page) tuples. Here we fetch the visitor_id from the first position of the current key.
    2. Likewise, we fetch the page from the second position of the current key.
    3. row_keys[visitor_id] gives us the row index for the current visitor_id. We append this to row_idxs.
    4. col_keys[page] gives us the column index for the current page. We append this to col_idxs.
  3. Build the csc_matrix.

    data = list(visits.values())  # (1)!
    mat = sparse.csc_matrix((data, (row_idxs, col_idxs)))
    
    1. visits.values() returns the number of times each visitor visited each page.

      print(visits.values())
      # dict_values([1, 1, 1, 1, 2, 1, 1, 2, ...])
      

      ...but it returns them as a dict_values instance. We convert it to a list so that csc_matrix() can handle it properly.

  4. Print the sub matrix showing visitors 1443, 6584, and 7040 and pages tindercoach.com/chl, tindercoach.com/nky, and tindercoach.com/zmr.

    We can fetch the appropriate row and column indices as follows:

    print_visitor_ids = [1443, 6584, 7040]
    print_pages = ['tindercoach.com/chl', 'tindercoach.com/nky', 'tindercoach.com/zmr']
    
    print_row_idxs = [row_keys[x] for x in print_visitor_ids]
    print_col_idxs = [col_keys[x] for x in print_pages]
    
    print(print_row_idxs)
    # [2, 8, 0]
    
    print(print_col_idxs)
    # [38, 17, 86]
    

    To fetch the sub matrix indexed by these row and column indices, we can do

    submat = mat[np.ix_(print_row_idxs, print_col_idxs)]
    
    print(submat.todense())
    # [[0 1 0]
    #  [2 0 1]
    #  [1 0 0]]
    

    Info

    If we simply did mat[print_row_idxs, print_col_idxs], scipy would fetch three elements from the matrix: those at positions (2, 38), (8, 17), and (0, 86). This is consistent with NumPy array indexing behavior, but it's not the behavior we want.

    Rather, we want to fetch all combinations of (row, col) indices from our lists print_row_idxs and print_col_idxs. (9 combinations in total, forming a 3x3 sub matrix.)

    We use np.ix_() to accomplish this. np.ix_(print_row_idxs, print_col_idxs) constructs an open mesh from the input lists / arrays.

    np.ix_(print_row_idxs, print_col_idxs)
    # (array([[2],
    #        [8],
    #        [0]]), array([[38, 17, 86]]))
    

    When these arrays are used to index mat, they are effectively broadcast against each other into 3x3 i and j index arrays.

    i index array (after broadcasting)
    [[2 2 2]
     [8 8 8]
     [0 0 0]]
    
    j index array (after broadcasting)
    [[38 17 86]
     [38 17 86]
     [38 17 86]]
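
The difference between paired indexing and np.ix_() indexing is easier to see on a small matrix. The sketch below uses a hypothetical 4x5 matrix (toy, rows, and cols are illustrative names, not from the original data) and uses np.broadcast_arrays() to show the i and j index arrays explicitly.

```python
import numpy as np
from scipy import sparse

# Hypothetical 4x5 toy matrix holding the values 0..19, row by row
toy = sparse.csc_matrix(np.arange(20).reshape(4, 5))

rows = [2, 0, 3]
cols = [4, 1, 0]

# Paired indexing: fetches only the elements at (2, 4), (0, 1), and (3, 0)
paired = toy[rows, cols]

# Open-mesh indexing: fetches every (row, col) combination as a 3x3 block
block = toy[np.ix_(rows, cols)]
print(block.toarray())
# [[14 11 10]
#  [ 4  1  0]
#  [19 16 15]]

# The broadcast index arrays that the open mesh stands for
i_idx, j_idx = np.broadcast_arrays(*np.ix_(rows, cols))
print(i_idx)
# [[2 2 2]
#  [0 0 0]
#  [3 3 3]]
print(j_idx)
# [[4 1 0]
#  [4 1 0]
#  [4 1 0]]
```

Here paired holds the three values 14, 1, and 15, which form the diagonal of block.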