Potholes Problem
Fed up with your city’s roads, you go around collecting data on potholes in your area. Due to an unfortunate coffee spill, you lost bits and pieces of your data.
import numpy as np
import pandas as pd
potholes = pd.DataFrame({
'length':[5.1, np.nan, 6.2, 4.3, 6.0, 5.1, 6.5, 4.3, np.nan, np.nan],
'width':[2.8, 5.8, 6.5, 6.1, 5.8, np.nan, 6.3, 6.1, 5.4, 5.0],
'depth':[2.6, np.nan, 4.2, 0.8, 2.6, np.nan, 3.9, 4.8, 4.0, np.nan],
'location':pd.Series(['center', 'north edge', np.nan, 'center', 'north edge', 'center', 'west edge',
'west edge', np.nan, np.nan], dtype='string')
})
print(potholes)
#    length  width  depth    location
# 0     5.1    2.8    2.6      center
# 1     NaN    5.8    NaN  north edge
# 2     6.2    6.5    4.2        <NA>
# 3     4.3    6.1    0.8      center
# 4     6.0    5.8    2.6  north edge
# 5     5.1    NaN    NaN      center
# 6     6.5    6.3    3.9   west edge
# 7     4.3    6.1    4.8   west edge
# 8     NaN    5.4    4.0        <NA>
# 9     NaN    5.0    NaN        <NA>
Given your DataFrame of pothole measurements, discard rows where more than half the values are NaN. In the remaining rows, impute NaNs with the column's average value, unless the column is non-numeric, in which case use the column's mode.
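One possible approach is sketched below; it is not necessarily the intended solution. It rests on two assumptions the problem statement leaves open: the column averages and modes are computed after the sparse rows are discarded, and when a column has several modes, the first one returned by `Series.mode()` is used. The names `nan_frac` and `cleaned` are illustrative.

```python
import numpy as np
import pandas as pd

potholes = pd.DataFrame({
    'length': [5.1, np.nan, 6.2, 4.3, 6.0, 5.1, 6.5, 4.3, np.nan, np.nan],
    'width': [2.8, 5.8, 6.5, 6.1, 5.8, np.nan, 6.3, 6.1, 5.4, 5.0],
    'depth': [2.6, np.nan, 4.2, 0.8, 2.6, np.nan, 3.9, 4.8, 4.0, np.nan],
    'location': pd.Series(['center', 'north edge', np.nan, 'center', 'north edge',
                           'center', 'west edge', 'west edge', np.nan, np.nan],
                          dtype='string')
})

# Fraction of missing values in each row (isna() also catches <NA> in string columns)
nan_frac = potholes.isna().mean(axis=1)

# Keep rows where at most half the values are missing
cleaned = potholes.loc[nan_frac <= 0.5].copy()

# Assumption: fill values are computed from the remaining rows only
for col in cleaned.columns:
    if pd.api.types.is_numeric_dtype(cleaned[col]):
        # Numeric column: fill with the column mean (NaNs are skipped by default)
        cleaned[col] = cleaned[col].fillna(cleaned[col].mean())
    else:
        # Non-numeric column: fill with the mode; ties resolve to the first mode
        modes = cleaned[col].mode()
        if not modes.empty:
            cleaned[col] = cleaned[col].fillna(modes.iloc[0])

print(cleaned)
```

With this data, only row 9 has more than half of its four values missing, so it is the only row discarded. Rows with exactly two missing values are kept because the condition is "more than half".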