Dog Shelter Problem¶
You operate a dog shelter , and you're building a model to predict the probability of a dog getting adopted based on its age, weight, and breed. You have the following data.
import numpy as np
import pandas as pd
# dog breeds list
breeds = [
'english pointer', 'english setter', 'kerry blue terrier', 'cairn terrier', 'english cocker spaniel',
'gordon setter', 'airedale terrier', 'australian terrier', 'bedlington terrier', 'border terrier',
'bull terrier', 'fox terrier (smooth)', 'english toy terrier (black &tan)', 'swedish vallhund',
'belgian shepherd dog', 'old english sheepdog', 'griffon nivernais', 'briquet griffon vendeen',
'ariegeois', 'gascon saintongeois', 'great gascony blue', 'poitevin', 'billy', 'artois hound',
'porcelaine', 'small blue gascony', 'blue gascony griffon', 'grand basset griffon vendeen',
'norman artesien basset', 'blue gascony basset'
]
# random generator
rng = np.random.default_rng(1)
# training DataFrame
Ndogs = 10
dogs = pd.DataFrame({
'id': rng.choice(1000000, size=Ndogs, replace=False),
'age': rng.uniform(low=0, high=16, size=Ndogs).round(0),
'weight': rng.uniform(low=10, high=115, size=Ndogs).round(1),
'breed': pd.Categorical(rng.choice(breeds[:25], size=Ndogs, replace=True), categories=breeds),
'adopted': rng.choice([True, False], size=Ndogs, replace=True)
})
# Insert some NaNs
dogs.iloc[rng.choice(Ndogs, size=int(Ndogs * 0.25), replace=False), 1] = np.nan
print(dogs)
# id age weight breed adopted
# 0 311831 NaN 88.8 english pointer False
# 1 473184 9.0 39.4 bull terrier False
# 2 822941 5.0 60.9 english toy terrier (black &tan) False
# 3 34852 13.0 113.0 australian terrier True
# 4 948647 NaN 111.0 kerry blue terrier True
# 5 511817 7.0 86.1 bull terrier False
# 6 144159 2.0 66.8 old english sheepdog False
# 7 755162 6.0 39.1 fox terrier (smooth) False
# 8 950457 3.0 26.9 gascon saintongeois True
# 9 249228 4.0 111.8 border terrier True
Build a compressed sparse column matrix to represent the training features. Be sure to one-hot-encode the dog breeds into indicator columns, one for each possible breed (not just the observed breeds ).
The output matrix should look something like this:
age weight is_english_pointer is_english_setter ...
0 NaN 88.8 1.0 0.0
1 9.0 39.4 0.0 0.0
2 5.0 60.9 0.0 0.0
3 13.0 113.0 0.0 0.0
4 NaN 111.0 0.0 0.0
...