Potholes Problem

Fed up with your city’s roads, you go around collecting data on potholes in your area. Due to an unfortunate coffee spill, you have lost bits and pieces of your data.

import numpy as np
import pandas as pd

potholes = pd.DataFrame({
    'length':[5.1, np.nan, 6.2, 4.3, 6.0, 5.1, 6.5, 4.3, np.nan, np.nan],
    'width':[2.8, 5.8, 6.5, 6.1, 5.8, np.nan, 6.3, 6.1, 5.4, 5.0],
    'depth':[2.6, np.nan, 4.2, 0.8, 2.6, np.nan, 3.9, 4.8, 4.0, np.nan],
    'location':pd.Series(['center', 'north edge', np.nan, 'center', 'north edge', 'center', 'west edge',
                          'west edge', np.nan, np.nan], dtype='string')
})

print(potholes)
#    length  width  depth    location
# 0     5.1    2.8    2.6      center
# 1     NaN    5.8    NaN  north edge
# 2     6.2    6.5    4.2        <NA>
# 3     4.3    6.1    0.8      center
# 4     6.0    5.8    2.6  north edge
# 5     5.1    NaN    NaN      center
# 6     6.5    6.3    3.9   west edge
# 7     4.3    6.1    4.8   west edge
# 8     NaN    5.4    4.0        <NA>
# 9     NaN    5.0    NaN        <NA>

Given your DataFrame of pothole measurements, discard rows where more than half the values are missing (NaN or <NA>). In the remaining rows, impute each missing value with the column mean if the column is numeric, or with the column mode if it is non-numeric.
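One possible approach is sketched below. It is not necessarily the intended solution. The problem does not say whether the means and mode should be computed before or after discarding rows, so this sketch assumes they are computed on the rows that remain. The names `keep`, `cleaned`, and `fill_values` are illustrative only.

```python
import numpy as np
import pandas as pd

potholes = pd.DataFrame({
    'length': [5.1, np.nan, 6.2, 4.3, 6.0, 5.1, 6.5, 4.3, np.nan, np.nan],
    'width': [2.8, 5.8, 6.5, 6.1, 5.8, np.nan, 6.3, 6.1, 5.4, 5.0],
    'depth': [2.6, np.nan, 4.2, 0.8, 2.6, np.nan, 3.9, 4.8, 4.0, np.nan],
    'location': pd.Series(['center', 'north edge', np.nan, 'center', 'north edge',
                           'center', 'west edge', 'west edge', np.nan, np.nan],
                          dtype='string')
})

# Fraction of missing values per row; keep rows where it is at most one half.
keep = potholes.isna().mean(axis=1) <= 0.5
cleaned = potholes.loc[keep].copy()

# Assumption: fill values are computed from the remaining rows only.
fill_values = {}
for col in cleaned.columns:
    if pd.api.types.is_numeric_dtype(cleaned[col]):
        fill_values[col] = cleaned[col].mean()          # mean ignores NaN by default
    else:
        fill_values[col] = cleaned[col].mode().iloc[0]  # first mode if there is a tie

# fillna accepts a dict mapping column name to fill value.
cleaned = cleaned.fillna(value=fill_values)

print(cleaned)
```

With this data, only row 9 is discarded, since three of its four values are missing. Rows 1, 5, and 8 are missing exactly half of their values, which is not more than half, so they are kept and imputed.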
