Advanced
Strings¶
pandas has tons of features for manipulating Series of strings.
import pandas as pd

cajun = pd.Series([
'gumbo',
'crawfish boil',
'Mardi Gras',
'pirogue',
pd.NA,
'Zatarains'
], dtype='string') # (1)!
print(cajun)
# 0 gumbo
# 1 crawfish boil
# 2 Mardi Gras
# 3 pirogue
# 4 <NA>
# 5 Zatarains
# dtype: string
dtype='string'
tells pandas to use the StringDtype extension type, which allows a Series of strings to include missing values (shown as <NA>).
String Methods¶
cajun.str
returns a StringMethods object from which we can call numerous string processing methods.
String Indexing¶
You can use string indexing to select characters by position.
For example, cajun.str[0]
selects the first character in each string.
cajun.str[0]
# 0 g
# 1 c
# 2 M
# 3 p
# 4 <NA>
# 5 Z
# dtype: string
cajun.str[:2]
uses slicing to select all characters up to (but excluding) the third character in each string.
cajun.str[:2]
# 0 gu
# 1 cr
# 2 Ma
# 3 pi
# 4 <NA>
# 5 Za
# dtype: string
cajun.str[-1]
uses negative indexing to select the last character in each string.
cajun.str[-1]
# 0 o
# 1 l
# 2 s
# 3 e
# 4 <NA>
# 5 s
# dtype: string
upper()
¶
converts a Series of strings to upper case.
cajun.str.upper()
# 0 GUMBO
# 1 CRAWFISH BOIL
# 2 MARDI GRAS
# 3 PIROGUE
# 4 <NA>
# 5 ZATARAINS
# dtype: string
lower()
¶
converts a Series of strings to lower case.
cajun.str.lower()
# 0 gumbo
# 1 crawfish boil
# 2 mardi gras
# 3 pirogue
# 4 <NA>
# 5 zatarains
# dtype: string
len()
¶
returns the number of characters in each string.
cajun.str.len()
# 0 5
# 1 13
# 2 10
# 3 7
# 4 <NA>
# 5 9
# dtype: Int64
split()
¶
splits each string on a specified delimiter and puts the resulting substrings in a list.
cajun.str.split(' ')
# 0 [gumbo]
# 1 [crawfish, boil]
# 2 [Mardi, Gras]
# 3 [pirogue]
# 4 <NA>
# 5 [Zatarains]
# dtype: object
Chaining that with .str.get()
lets us pick out the ith substring in each list. For example,
# get the first substring after splitting by ' '
cajun.str.split(' ').str.get(0)
# 0 gumbo
# 1 crawfish
# 2 Mardi
# 3 pirogue
# 4 <NA>
# 5 Zatarains
# dtype: object
replace()
¶
replaces part of a string with another string. Here we replace all spaces with dashes.
cajun.str.replace(pat=' ', repl='-', regex=False)
# 0 gumbo
# 1 crawfish-boil
# 2 Mardi-Gras
# 3 pirogue
# 4 <NA>
# 5 Zatarains
# dtype: string
In older versions of pandas, replace() treated the pattern as a regular expression by default; since pandas 2.0 the default is regex=False. Either way, it's clearest to pass the regex argument explicitly, as we did above.
What's a regular expression?
A regular expression is a widely used pattern-matching syntax that lets you do advanced string matching.
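For instance (a small sketch), the pattern \s+ matches one or more whitespace characters, so we can collapse any run of whitespace to a single dash.
cajun.str.replace(r'\s+', repl='-', regex=True)
# 0 gumbo
# 1 crawfish-boil
# 2 Mardi-Gras
# 3 pirogue
# 4 <NA>
# 5 Zatarains
# dtype: string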
cat()
¶
concatenates a Series of strings together using a specified separator
cajun.str.cat(sep='_')
# 'gumbo_crawfish boil_Mardi Gras_pirogue_Zatarains'
or concatenates a Series of strings with another same-sized Series or list of strings.
cajun.str.cat(['1', '2', '3', '4', '5', '6'], sep=' ')
# 0 gumbo 1
# 1 crawfish boil 2
# 2 Mardi Gras 3
# 3 pirogue 4
# 4 <NA>
# 5 Zatarains 6
# dtype: string
startswith()
¶
checks if each string starts with a specified string.
cajun.str.startswith("p")
# 0 False
# 1 False
# 2 False
# 3 True
# 4 <NA>
# 5 False
# dtype: boolean
endswith()
¶
checks if each string ends with a specified string.
cajun.str.endswith("s")
# 0 False
# 1 False
# 2 True
# 3 False
# 4 <NA>
# 5 True
# dtype: boolean
contains()
¶
checks whether each string contains a specified string or regular expression.
cajun.str.contains('bo', regex=False)
# 0 True
# 1 True
# 2 False
# 3 False
# 4 <NA>
# 5 False
# dtype: boolean
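Since contains() also accepts regular expressions, here's a small sketch checking which strings contain an uppercase letter anywhere.
cajun.str.contains(r'[A-Z]', regex=True)
# 0 False
# 1 False
# 2 True
# 3 False
# 4 <NA>
# 5 True
# dtype: boolean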
extract()
¶
extracts the first matching substring using a regular expression with at least one capture group.
For example, here we extract the first word that starts with a capital letter.
cajun.str.extract(r'(\b[A-Z][a-z]+\b)')
# 0
# 0 <NA>
# 1 <NA>
# 2 Mardi
# 3 <NA>
# 4 <NA>
# 5 Zatarains
extractall()
¶
extracts all matching substrings using a regular expression with at least one capture group.
cajun.str.extractall(r'(\b[A-Z][a-z]+\b)')
# 0
# match
# 2 0 Mardi
# 1 Gras
# 5 0 Zatarains
String Addition¶
You can add a prefix or suffix to each element in a Series of strings by adding a string directly to the Series with the +
operator.
'i like ' + cajun + ' a lot'
# 0 i like gumbo a lot
# 1 i like crawfish boil a lot
# 2 i like Mardi Gras a lot
# 3 i like pirogue a lot
# 4 <NA>
# 5 i like Zatarains a lot
# dtype: string
Dates and Times¶
The workhorse datetime type in pandas is Timestamp, which is really just a wrapper of NumPy's datetime64 type. Like NumPy, pandas doesn't have a proper date type, so if you want
to represent a date like "January 3rd, 2001", you'd typically use the Timestamp 2001-01-03 00:00:00.
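For example, constructing the Timestamp directly shows the date stored as midnight.
pd.Timestamp('2001-01-03')
# Timestamp('2001-01-03 00:00:00')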
to_datetime()
¶
You can use the to_datetime()
function to make a Timestamp in pandas. Usually, you'll pass in a string like this.
pd.to_datetime('2020-03-01')
# Timestamp('2020-03-01 00:00:00')
Timestamp Limits¶
Note that unlike NumPy, pandas Timestamps are stored with nanosecond units by default, so you can't create a (nanosecond-resolution) Timestamp before 1678 or after 2262. If you try it, you'll get an error
pd.to_datetime('2263-01-01')
# Out of bounds nanosecond timestamp: 2263-01-01 00:00:00
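You can check the exact bounds yourself via the Timestamp class attributes (a quick check; values shown are for the default nanosecond resolution).
pd.Timestamp.min
# Timestamp('1677-09-21 00:12:43.145224193')
pd.Timestamp.max
# Timestamp('2262-04-11 23:47:16.854775807')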
Timezone Aware Timestamps¶
Unlike NumPy, pandas Timestamps can be timezone aware. To make a timezone aware Timestamp, use
the tz_localize()
method.
mar8 = pd.to_datetime('March 8, 2001 3PM').tz_localize('America/Los_Angeles')
mar8
# Timestamp('2001-03-08 15:00:00-0800', tz='America/Los_Angeles')
The timezone information is stored in the .tz
attribute of the Timestamp.
mar8.tz
# <DstTzInfo 'America/Los_Angeles' PST-1 day, 16:00:00 STD>
To change the timezone, use the tz_convert()
method.
mar8.tz_convert('Europe/London')
# Timestamp('2001-03-08 23:00:00+0000', tz='Europe/London')
strptime format¶
The to_datetime() function smartly interprets various datetime formats. For example,
pd.to_datetime('March 8, 2001 3PM') # Timestamp('2001-03-08 15:00:00')
pd.to_datetime('2001-03-08 3PM') # Timestamp('2001-03-08 15:00:00')
pd.to_datetime('2001-03-08 15') # Timestamp('2001-03-08 15:00:00')
Some formats cannot be interpreted automatically, and even when they can, it's better to state the format explicitly. To do
this, pass in a strptime format using the format
parameter.
pd.to_datetime('March 8, 2001 3PM', format='%B %d, %Y %I%p')
# Timestamp('2001-03-08 15:00:00')
Series of datetimes¶
If we pass a list of datetime strings into the to_datetime()
function, we get back a DatetimeIndex
that we could use as the index of a Series or DataFrame.
pd.to_datetime(['2020-01-01', pd.NA, '2020-01-03'])
# DatetimeIndex(['2020-01-01', 'NaT', '2020-01-03'], dtype='datetime64[ns]', freq=None)
If we wrap that with pd.Series()
, we get back a Series of datetime64 values.
pd.Series(pd.to_datetime(['2020-01-01', pd.NA, '2020-01-03']))
# 0 2020-01-01
# 1 NaT
# 2 2020-01-03
# dtype: datetime64[ns]
Add time to a Series of datetimes¶
If you want to add or subtract time from a Series of datetimes, you can do so using an instance of Timedelta, specifying a value and a unit.
x = pd.Series(pd.to_datetime(['2020-01-01', pd.NA, '2020-01-03']))
x + pd.Timedelta(value=12, unit='H')
# 0 2020-01-01 12:00:00
# 1 NaT
# 2 2020-01-03 12:00:00
# dtype: datetime64[ns]
Tip
If you subtract two Timestamps, you get back a Timedelta
pd.to_datetime('2020-03-02') - pd.to_datetime('2020-03-01')
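# Timedelta('1 days 00:00:00')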
Sequence of datetimes¶
You can use the date_range()
function to build a sequence of datetimes based on a regular interval. For
example, you can make a DatetimeIndex with 10 values starting on March 8th 2020, iterating by 12 hours like so.
pd.date_range('2020-03-08', periods=10, freq='12H')
# DatetimeIndex(['2020-03-08 00:00:00', '2020-03-08 12:00:00',
# '2020-03-09 00:00:00', '2020-03-09 12:00:00',
# '2020-03-10 00:00:00', '2020-03-10 12:00:00',
# '2020-03-11 00:00:00', '2020-03-11 12:00:00',
# '2020-03-12 00:00:00', '2020-03-12 12:00:00'],
# dtype='datetime64[ns]', freq='12H')
If we did the same thing with tz='America/Denver'
, notice the output changes slightly because Daylight Saving Time in Denver started at 2AM on March 8th, 2020, meaning clocks in Denver jumped from 1:59AM to 3:00AM, skipping the 2 o'clock hour entirely.
pd.date_range('2020-03-08', periods=10, freq='12H', tz='America/Denver')
# DatetimeIndex(['2020-03-08 00:00:00-07:00', '2020-03-08 13:00:00-06:00',
# '2020-03-09 01:00:00-06:00', '2020-03-09 13:00:00-06:00',
# '2020-03-10 01:00:00-06:00', '2020-03-10 13:00:00-06:00',
# '2020-03-11 01:00:00-06:00', '2020-03-11 13:00:00-06:00',
# '2020-03-12 01:00:00-06:00', '2020-03-12 13:00:00-06:00'],
# dtype='datetime64[ns, America/Denver]', freq='12H')
DatetimeIndex
¶
Here we make a series of stock prices with a DatetimeIndex.
import numpy as np

foostock = pd.Series(
data = np.round(np.random.normal(size=10), 2),
index = pd.date_range('2020-03-08', periods=10, freq='12H', tz='America/Denver')
)
print(foostock)
# 2020-03-08 00:00:00-07:00 -1.30
# 2020-03-08 13:00:00-06:00 -0.74
# 2020-03-09 01:00:00-06:00 -0.51
# 2020-03-09 13:00:00-06:00 -0.39
# 2020-03-10 01:00:00-06:00 -1.01
# 2020-03-10 13:00:00-06:00 0.46
# 2020-03-11 01:00:00-06:00 1.13
# 2020-03-11 13:00:00-06:00 -1.23
# 2020-03-12 01:00:00-06:00 0.15
# 2020-03-12 13:00:00-06:00 0.38
# Freq: 12H, dtype: float64
We can index the series in powerful ways. For example,
Get stock prices on March 9th, 2020
foostock.loc['2020-03-09']
# 2020-03-09 01:00:00-06:00 -0.51
# 2020-03-09 13:00:00-06:00 -0.39
# Freq: 12H, dtype: float64
Get all stock prices in March
foostock.loc['2020-03']
# 2020-03-08 00:00:00-07:00 -1.30
# 2020-03-08 13:00:00-06:00 -0.74
# 2020-03-09 01:00:00-06:00 -0.51
# 2020-03-09 13:00:00-06:00 -0.39
# 2020-03-10 01:00:00-06:00 -1.01
# 2020-03-10 13:00:00-06:00 0.46
# 2020-03-11 01:00:00-06:00 1.13
# 2020-03-11 13:00:00-06:00 -1.23
# 2020-03-12 01:00:00-06:00 0.15
# 2020-03-12 13:00:00-06:00 0.38
# Freq: 12H, dtype: float64
Get stock prices at exactly March 9th at 1AM Denver time
foostock.loc[pd.to_datetime('2020-03-09 1:00:00').tz_localize('America/Denver')]
# -0.51
Get stock prices between March 9th and March 10th
foostock.loc['2020-03-09':'2020-03-10'] # (1)!
# 2020-03-09 01:00:00-06:00 -0.51
# 2020-03-09 13:00:00-06:00 -0.39
# 2020-03-10 01:00:00-06:00 -1.01
# 2020-03-10 13:00:00-06:00 0.46
# Freq: 12H, dtype: float64
- Note that both endpoints are included.
Categoricals¶
One of pandas' best features is its ability to represent categorical data using the Categorical type, much like the factor data type in R. A Categorical takes on a limited, and usually fixed, number of possible values (i.e. categories).
Setup
Suppose we have a dataset representing four cars. For each car, we have its VIN number, color, and size classification:
| VIN | color | size |
|:---|:---|:---|
| AX193Q43 | red | standard |
| Z11RTV201 | blue | mini |
| WA4Q3371 | red | standard |
| QWP77491 | green | extended |
How should we store these data?
Unique Ids¶
Categoricals are not appropriate for storing unique identifiers. In the case of our cars dataset, we should store the VINs using a plain ole Series of strings.
VINs = pd.Series(['AX193Q43', 'Z11RTV201', 'WA4Q3371', 'QWP77491'], dtype='string')
print(VINs)
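# 0 AX193Q43
# 1 Z11RTV201
# 2 WA4Q3371
# 3 QWP77491
# dtype: string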
Unordered Categories¶
colors
is a classic candidate for categorical data since we have a limited set of colors and a single color can be a
feature of multiple cars.
To build a Categorical, use the pd.Categorical()
function, very similar to the pd.Series()
function.
colors = pd.Categorical(values = ['red', 'blue', 'red', 'green'])
print(colors) # (1)!
# ['red', 'blue', 'red', 'green']
# Categories (3, object): ['blue', 'green', 'red']
- Printing a categorical displays its values and its unique categories.
Note that this is not a Series, but you could make it a Series just by wrapping it in pd.Series(colors)
.
pd.Series(colors)
# 0 red
# 1 blue
# 2 red
# 3 green
# dtype: category
# Categories (3, object): ['blue', 'green', 'red']
By default, when you build a Categorical, pandas sets the categories as the unique, non-missing values in the data. To
explicitly define the categories, use the categories
argument. This is especially useful if your collection includes
categories not observed in the data.
colors = pd.Categorical(
values = ['red', 'blue', 'red', 'green'],
categories = ['black', 'blue', 'green', 'orange', 'red', 'yellow']
)
print(colors)
# ['red', 'blue', 'red', 'green']
# Categories (6, object): ['black', 'blue', 'green', 'orange', 'red', 'yellow']
The categories
parameter also lets you control the order in which categories should be displayed, which could be
handy for things like plots or reports. Without specifying the categories
parameter, pandas displays them in lexical (
alphabetical) order. Alternatively, you can define the categories in a specific order. For example, here we specify
colors in the order "bright to dark".
colors = pd.Categorical(
values = ['red', 'blue', 'red', 'green'],
categories = ['yellow', 'orange', 'red', 'green', 'blue', 'black']
)
print(colors)
# ['red', 'blue', 'red', 'green']
# Categories (6, object): ['yellow', 'orange', 'red', 'green', 'blue', 'black']
Ordered Categories¶
sizes
is similar to colors
, but unlike colors
, sizes
have an inherent order: 'mini' is less than 'standard', which is less than 'extended'. To define an ordered Categorical,
- pass the ordered categories into the categories argument, and
- set ordered=True
sizes = pd.Categorical(
values = ['standard', 'mini', 'standard', 'extended'],
categories = ['mini', 'standard', 'extended'],
ordered = True
)
print(sizes) # (1)!
# ['standard', 'mini', 'standard', 'extended']
# Categories (3, object): ['mini' < 'standard' < 'extended']
- Categories are reported with < ("less than") symbols, indicating that they have a meaningful order.
With an ordered Categorical, you can do things like compare sizes < 'extended'
and get back a boolean array.
sizes < 'extended'
# array([ True, True, True, False])
Categoricals don't have .loc
or .iloc
accessors. To subset sizes to the elements less than 'extended', you must use square-bracket notation, like indexing a NumPy array.
sizes[sizes < 'extended']
# ['standard', 'mini', 'standard']
# Categories (3, object): ['mini' < 'standard' < 'extended']
Alternatively, you could create a Series with a CategoricalIndex
.
sizesSeries = pd.Series(
data = [0,1,2,3],
index = pd.CategoricalIndex(sizes)
)
print(sizesSeries)
# standard 0
# mini 1
# standard 2
# extended 3
# dtype: int64
In which case you could do
sizesSeries.loc['mini']
# 1
get_dummies()
¶
Another really cool benefit of using Categoricals is that you can one-hot-encode them using pandas' get_dummies()
function. For example,
pd.get_dummies(sizes, prefix = 'size') # (1)!
# size_mini size_standard size_extended
# 0 0 1 0
# 1 1 0 0
# 2 0 1 0
# 3 0 0 1
- This is a common input format for machine learning models.
MultiIndex¶
A MultiIndex, or hierarchical index, is an index that uses multiple values in its keys.
There are a few different ways to make a MultiIndex from scratch. Perhaps the most intuitive way is to use
the pd.MultiIndex.from_tuples()
constructor, passing in a list of tuples. We'll do that here
to generate a MultiIndex of (store, product)
pairs.
store_products = pd.DataFrame(
data = {'Price': [35.25, 45.00, 23.50, 1.95, 29.99, 35.65]},
index = pd.MultiIndex.from_tuples([
('super store', 'basketball'), ('super store', 'football'), ('super store', 'soccerball'),
('sports dorks', 'golfball'), ('sports dorks', 'basketball'), ('sports dorks', 'football')
], names=['store', 'product'])
)
print(store_products)
# Price
# store product
# super store basketball 35.25
# football 45.00
# soccerball 23.50
# sports dorks golfball 1.95
# basketball 29.99
# football 35.65
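As an aside, from_tuples() isn't the only constructor. Here's a quick sketch of pd.MultiIndex.from_product(), which builds a MultiIndex from the cartesian product of the values at each level.
pd.MultiIndex.from_product(
[['super store', 'sports dorks'], ['basketball', 'football']],
names=['store', 'product']
)
# MultiIndex([('super store', 'basketball'),
# ('super store', 'football'),
# ('sports dorks', 'basketball'),
# ('sports dorks', 'football')],
# names=['store', 'product'])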
Levels¶
It's important to understand that our store_products
MultiIndex has two levels.
- Level 0 is the "store" level
- Level 1 is the "product" level
We could have more, but in this example we have two.
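To inspect the levels (a quick check), look at the index's names attribute, or pull out the values at a given level with get_level_values().
store_products.index.names
# FrozenList(['store', 'product'])
store_products.index.get_level_values('store')
# Index(['super store', 'super store', 'super store', 'sports dorks',
# 'sports dorks', 'sports dorks'], dtype='object', name='store')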
MultiIndex to RangeIndex and back¶
You can use the reset_index() method to convert the index levels to columns and go back to using a plain ole
RangeIndex.
store_products.reset_index(inplace=True)
print(store_products)
# store product Price
# 0 super store basketball 35.25
# 1 super store football 45.00
# 2 super store soccerball 23.50
# 3 sports dorks golfball 1.95
# 4 sports dorks basketball 29.99
# 5 sports dorks football 35.65
To revert to a DataFrame with a MultiIndex, use the set_index()
method passing in a list of column names
to use as the MultiIndex levels.
store_products.set_index(['store', 'product'], inplace=True)
print(store_products)
# Price
# store product
# super store basketball 35.25
# football 45.00
# soccerball 23.50
# sports dorks golfball 1.95
# basketball 29.99
# football 35.65
Indexing a MultiIndex¶
To select rows with the key ('sports dorks', 'golfball') or ('super store', 'football'), you can
use DataFrame.loc
, passing in a list of tuples.
store_products.loc[[('sports dorks', 'golfball'), ('super store', 'football')]]
# Price
# store product
# sports dorks golfball 1.95
# super store football 45.00
To select every row for 'sports dorks', you can use DataFrame.loc
passing in the key 'sports dorks'.
store_products.loc[['sports dorks']]
# Price
# store product
# sports dorks golfball 1.95
# basketball 29.99
# football 35.65
By contrast, store_products.loc[['football']]
does not return all rows where product equals 'football'.
store_products.loc[['football']]
# KeyError: "['football'] not in index"
That's because DataFrame.loc
only scans the first MultiIndex level.
To fetch all rows where product equals 'football', use the DataFrame.xs()
method.
store_products.xs(key='football', level='product')
# Price
# store
# super store 45.00
# sports dorks 35.65
To retain both the store
and product
indexes in the result, set the drop_level
argument to False
.
store_products.xs(key='football', level='product', drop_level=False)
# Price
# store product
# super store football 45.00
# sports dorks football 35.65
MultiIndex after groupby aggregate¶
Now let's take a look at a groupby aggregate operation since this is a more common way MultiIndexes arise in practice. We'll start by building a DataFrame with 7 rows and 4 columns.
df = pd.DataFrame({
'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
'B': [False, True, False, True, True, True, True],
'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8],
'D': [50, 30, 30, 90, 10, 20, 10]
})
print(df)
# A B C D
# 0 foo False 2.1 50
# 1 bar True 1.9 30
# 2 foo False 3.6 30
# 3 bar True 4.0 90
# 4 bar True 1.9 10
# 5 foo True 7.8 20
# 6 foo True 2.8 10
Then we'll set stew
equal to the following groupby aggregate operation.
stew = df.groupby(by=['A', 'B']).agg({'C': ['sum'], 'D': ['sum', 'mean']})
print(stew)
# C D
# sum sum mean
# A B
# bar True 7.8 130 43.333333
# foo False 5.7 80 40.000000
# True 10.6 30 15.000000
In this case, the row index is a MultiIndex
stew.index
# MultiIndex([('bar', True),
# ('foo', False),
# ('foo', True)],
# names=['A', 'B'])
and so is the column index.
stew.columns
# MultiIndex([('C', 'sum'),
# ('D', 'sum'),
# ('D', 'mean')],
# )
We can use DataFrame.xs()
to select all columns representing a sum.
stew.xs(
key='sum', # (1)!
axis=1, # (2)!
level=1, # (3)!
drop_level=False # (4)!
)
# C D
# sum sum
# A B
# bar True 7.8 130
# foo False 5.7 80
# True 10.6 30
- Select columns with the key 'sum'
- Select along the columns (axis=1), not the rows (axis=0)
- Search for the key within level 1 (the second level of the column MultiIndex)
- Retain this level in the result
We could select D columns by specifying key='D'
and level=0
.
stew.xs(key='D', axis=1, level=0, drop_level=False)
# D
# sum mean
# A B
# bar True 130 43.333333
# foo False 80 40.000000
# True 30 15.000000
There are lots of ways to flatten the columns of a MultiIndex into a regular index. Perhaps the easiest is to use the
to_flat_index()
method. For example,
stew.columns.to_flat_index()
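# Index([('C', 'sum'), ('D', 'sum'), ('D', 'mean')], dtype='object')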
This returns a generic Index object that behaves like a list of tuples. With a basic list comprehension, we can convert it to a list of strings
['_'.join(s) for s in stew.columns.to_flat_index()]
# ['C_sum', 'D_sum', 'D_mean']
... which we can set equal to stew.columns
. This flattens stew
's column MultiIndex.
stew.columns = ['_'.join(s) for s in stew.columns.to_flat_index()]
print(stew)
# C_sum D_sum D_mean
# A B
# bar True 7.8 130 43.333333
# foo False 5.7 80 40.000000
# True 10.6 30 15.000000
DataFrame Reshaping¶
Suppose you have DataFrame like this
| | row | col | val |
|---:|:------|:------|------:|
| 0 | row0 | col1 | 44 |
| 1 | row1 | col1 | 47 |
| 2 | row2 | col1 | 64 |
| 3 | row0 | col0 | 67 |
| 4 | row1 | col0 | 67 |
| 5 | row2 | col0 | 9 |
There's a good chance you'll want to reshape like this
| row | col0 | col1 |
|:------|-------:|-------:|
| row0 | 67 | 44 |
| row1 | 67 | 47 |
| row2 | 9 | 64 |
Here, we converted the data from long format to wide format. Some people would say we converted the data from
stacked format to unstacked format. For better or worse, there are a ton of ways to do this with pandas. Perhaps the
simplest method is to use the pivot()
method of a DataFrame...
pivot()
¶
Given the following dataframe in long format,
df = pd.DataFrame({
'row': ['row0', 'row1', 'row2', 'row0', 'row1', 'row2'],
'col': ['col1', 'col1', 'col1', 'col0', 'col0', 'col0'],
'val': [44, 47, 64, 67, 67, 9]
})
print(df)
# row col val
# 0 row0 col1 44
# 1 row1 col1 47
# 2 row2 col1 64
# 3 row0 col0 67
# 4 row1 col0 67
# 5 row2 col0 9
We can reshape it to wide format using df.pivot()
, telling pandas which columns we want to use for the row index,
column index, and values.
df.pivot(index='row', columns='col', values='val')
# col col0 col1
# row
# row0 67 44
# row1 67 47
# row2 9 64
Now suppose df
had slightly different data.
df = pd.DataFrame({
'row': ['row0', 'row0', 'row2', 'row2', 'row1', 'row1'],
'col': ['col1', 'col1', 'col1', 'col0', 'col0', 'col0'],
'val': [44, 47, 64, 67, 67, 9]
})
# row col val
# 0 row0 col1 44
# 1 row0 col1 47
# 2 row2 col1 64
# 3 row2 col0 67
# 4 row1 col0 67
# 5 row1 col0 9
This time, df.pivot()
generates an error.
df.pivot(index='row', columns='col', values='val')
# ValueError: Index contains duplicate entries, cannot reshape
The error occurs because there are two instances where multiple values in the input map to the same exact position in
the output, and we haven't told pandas how to deal with that. This is where the more general pivot_table()
method
comes in.
pivot_table()
¶
pivot_table()
works just like pivot()
, but you can specify an aggregation function to tell Pandas how to deal with
multiple values mapping to the same place.
For example, here we convert df
from tall to wide format using pivot_table()
with aggfunc=list
.
print(df) # (1)!
# row col val
# 0 row0 col1 44
# 1 row0 col1 47
# 2 row2 col1 64
# 3 row2 col0 67
# 4 row1 col0 67
# 5 row1 col0 9
pivotted = df.pivot_table(
index='row',
columns='col',
values='val',
aggfunc=list
)
print(pivotted)
# col col0 col1
# row
# row0 NaN [44, 47]
# row1 [67, 9] NaN
# row2 [67] [64]
-
df = pd.DataFrame({ 'row': ['row0', 'row0', 'row2', 'row2', 'row1', 'row1'], 'col': ['col1', 'col1', 'col1', 'col0', 'col0', 'col0'], 'val': [44, 47, 64, 67, 67, 9] })
Notice that 67 and 9 mapped to the same cell in the output because they had the same row and col values in the
input. aggfunc
tells Pandas how to deal with those values, in this case wrapping them into a list. Also, note that the
combinations (row0, col0) and (row1, col1) didn't exist in df
, so Pandas fills those entries with NaN by default.
You're probably not interested in pivoting data just to build lists of values. More interestingly, you can use
aggregation functions like mean()
, sum()
, size()
, count()
, or a combination of all of them.
This works just like doing normal DataFrame aggregation. So, for example, to get the count and sum of elements mapping
to each cell in the output table, use aggfunc=['count', 'sum']
.
df.pivot_table(
index='row',
columns='col',
values='val',
aggfunc=['count', 'sum']
)
# count sum
# col col0 col1 col0 col1
# row
# row0 NaN 2.0 NaN 91.0
# row1 2.0 NaN 76.0 NaN
# row2 1.0 1.0 67.0 64.0
As you could probably guess, Pandas supports this kind of pivot operation using multiple columns for the row index, column index, and values. So if you have input like this
toomuch = pd.DataFrame({
'row_major': ['A', 'A', 'B', 'A', 'B', 'B'],
'row_minor': ['x', 'x', 'y', 'y', 'z', 'z'],
'col_major': ['MAMMAL', 'MAMMAL', 'MAMMAL', 'FISH', 'FISH', 'FISH'],
'col_minor': ['dog', 'cat', 'dog', 'tuna', 'tuna', 'shark'],
'val0': [44, 47, 64, 67, 67, 9],
'val1': [91, 52, 86, 83, 79, 92]
})
print(toomuch)
# row_major row_minor col_major col_minor val0 val1
# 0 A x MAMMAL dog 44 91
# 1 A x MAMMAL cat 47 52
# 2 B y MAMMAL dog 64 86
# 3 A y FISH tuna 67 83
# 4 B z FISH tuna 67 79
# 5 B z FISH shark 9 92
you can do crazy stuff like this
toomuch.pivot_table(
index=['row_major', 'row_minor'],
columns=['col_major', 'col_minor'],
values=['val0', 'val1'],
aggfunc=['count', 'sum']
)
# count sum
# val0 val1 val0 val1
# col_major FISH MAMMAL FISH MAMMAL FISH MAMMAL FISH MAMMAL
# col_minor shark tuna cat dog shark tuna cat dog shark tuna cat dog shark tuna cat dog
# row_major row_minor
# A x NaN NaN 1.0 1.0 NaN NaN 1.0 1.0 NaN NaN 47.0 44.0 NaN NaN 52.0 91.0
# y NaN 1.0 NaN NaN NaN 1.0 NaN NaN NaN 67.0 NaN NaN NaN 83.0 NaN NaN
# B y NaN NaN NaN 1.0 NaN NaN NaN 1.0 NaN NaN NaN 64.0 NaN NaN NaN 86.0
# z 1.0 1.0 NaN NaN 1.0 1.0 NaN NaN 9.0 67.0 NaN NaN 92.0 79.0 NaN NaN
melt()
¶
Suppose we have a wide DataFrame like this..
wide = pd.DataFrame({
'test': [1, 2, 3, 4],
'john': [95, 81, 47, 99],
'patty': [90, 85, 93, 97]
})
| | test | john | patty |
|---:|-------:|-------:|--------:|
| 0 | 1 | 95 | 90 |
| 1 | 2 | 81 | 85 |
| 2 | 3 | 47 | 93 |
| 3 | 4 | 99 | 97 |
and we wanted to reshape it into a long DataFrame like this
| | test | student | score |
|---:|-------:|:----------|--------:|
| 0 | 1 | john | 95 |
| 1 | 2 | john | 81 |
| 2 | 3 | john | 47 |
| 3 | 4 | john | 99 |
| 4 | 1 | patty | 90 |
| 5 | 2 | patty | 85 |
| 6 | 3 | patty | 93 |
| 7 | 4 | patty | 97 |
The tool for the job is melt()
.
wide.melt(
id_vars='test',
value_vars=['john', 'patty'],
var_name='student',
value_name='score'
)
# test student score
# 0 1 john 95
# 1 2 john 81
# 2 3 john 47
# 3 4 john 99
# 4 1 patty 90
# 5 2 patty 85
# 6 3 patty 93
# 7 4 patty 97
The most important parameter here is value_vars
- a list of the columns you want to stack. In fact, that's literally
all you need to provide for this function to work (although the output isn't very friendly without other information).
wide.melt(value_vars=['john', 'patty'])
# variable value
# 0 john 95
# 1 john 81
# 2 john 47
# 3 john 99
# 4 patty 90
# 5 patty 85
# 6 patty 93
# 7 patty 97
var_name
and value_name
tell pandas how to name the new variable and value columns, and id_vars
tells Pandas what
other columns from the original DataFrame you want to keep in the result.
stack()
¶
If you have a DataFrame of daily stock prices like this
df = pd.DataFrame({
'ABC': [105.30, 107.17, 101.52],
'DEF': [40.29, 40.97, 51.00],
'GHI': [70.05, 64.13, 64.88]
},
index = pd.DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'])
)
print(df)
# ABC DEF GHI
# 2020-01-01 105.30 40.29 70.05
# 2020-01-02 107.17 40.97 64.13
# 2020-01-03 101.52 51.00 64.88
You can use df.stack()
to reshape the DataFrame into a Series, literally stacking the values of ABC, DEF, and GHI
within each original row index, resulting in a 2-level row MultiIndex.
df.stack()
# 2020-01-01 ABC 105.30
# DEF 40.29
# GHI 70.05
# 2020-01-02 ABC 107.17
# DEF 40.97
# GHI 64.13
# 2020-01-03 ABC 101.52
# DEF 51.00
# GHI 64.88
# dtype: float64
Things get a little trickier when df has a column MultiIndex. Consider this DataFrame.
pets = pd.DataFrame(
data=[[497, 1056, 2047, 3595], [352, 922, 3800, 1048], [217, 2017, 3571, 4521]],
index=pd.Index(['AL', 'MS', 'LA'], name='state'),
columns=pd.MultiIndex.from_tuples([
('cat', 'persian'), ('cat', 'calico'), ('dog', 'corgie'), ('dog', 'lab')
], names=['type', 'subtype'])
)
print(pets)
# type cat dog
# subtype persian calico corgie lab
# state
# AL 497 1056 2047 3595
# MS 352 922 3800 1048
# LA 217 2017 3571 4521
stack()
's most important parameter is level
. It tells Pandas which level(s) from the column index you want to stack into the row index.
Let's see some examples, keeping in mind that our pets DataFrame has two levels in its column index:
- type, with values dog and cat
- subtype, with values persian, calico, corgie, and lab
If we call pets.stack(level='type')
, you can see how the first level, type, gets extracted from the column index
and inserted into the row index, and the data re-aligns itself accordingly
pets.stack(level='type')
# subtype calico corgie lab persian
# state type
# AL cat 1056.0 NaN NaN 497.0
# dog NaN 2047.0 3595.0 NaN
# MS cat 922.0 NaN NaN 352.0
# dog NaN 3800.0 1048.0 NaN
# LA cat 2017.0 NaN NaN 217.0
# dog NaN 3571.0 4521.0 NaN
Pretty much the same thing happens when you call stack(level='subtype')
.
pets.stack(level='subtype')
# type cat dog
# state subtype
# AL calico 1056.0 NaN
# corgie NaN 2047.0
# lab NaN 3595.0
# persian 497.0 NaN
# MS calico 922.0 NaN
# corgie NaN 3800.0
# lab NaN 1048.0
# persian 352.0 NaN
# LA calico 2017.0 NaN
# corgie NaN 3571.0
# lab NaN 4521.0
# persian 217.0 NaN
level=-1
By default, stack()
uses level=-1
which automatically stacks the last column level into the row index.
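For instance (a quick sanity check), with our two column levels the default is equivalent to stacking 'subtype'.
pets.stack().equals(pets.stack(level='subtype'))
# True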
level
can also be a list like ['type', 'subtype']
in which case both column levels are stacked into the row index
and the result is a Series, not a DataFrame.
pets.stack(level=['type', 'subtype'])
# state type subtype
# AL cat calico 1056.0
# persian 497.0
# dog corgie 2047.0
# lab 3595.0
# MS cat calico 922.0
# persian 352.0
# dog corgie 3800.0
# lab 1048.0
# LA cat calico 2017.0
# persian 217.0
# dog corgie 3571.0
# lab 4521.0
# dtype: float64
unstack()
¶
Just as Pandas provides stack()
for stacking column levels into row levels, it provides unstack()
for
unstacking row levels into column levels.
So, where we have pets.stack(level='type')
, we can do almost the inverse operation with .unstack(level='type')
,
pets.stack(level='type').unstack(level='type')
# subtype calico corgie lab persian
# type cat dog cat dog cat dog cat dog
# state
# AL 1056.0 NaN NaN 2047.0 NaN 3595.0 497.0 NaN
# LA 2017.0 NaN NaN 3571.0 NaN 4521.0 217.0 NaN
# MS 922.0 NaN NaN 3800.0 NaN 1048.0 352.0 NaN
although in this case 'type' becomes the second level of the column index, and all the NaNs created by stack()
get retained as columns with unstack()
.