Series
What's a Series?¶
Series is one of the fundamental data structures in pandas. It's essentially an array with an index. Because it's an array, every value in a Series must be of the same type. You can have a Series of ints, a Series of floats, or a Series of booleans, but you can't have a Series of ints, floats and booleans together.
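To make the single-type rule concrete, here is a minimal sketch (the variable name `mixed` is just for illustration) showing that pandas settles on one common dtype when a list mixes ints and floats.

```python
import pandas as pd

# A list that mixes an int and a float
mixed = pd.Series([1, 2.5])

# pandas stores both values as floats so that every element shares one dtype
print(mixed.dtype)    # float64
print(mixed.iloc[0])  # 1.0
```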
Series Documentation¶
You'll want to familiarize yourself with pandas' documentation. Here's the documentation for Series. It's the first place you should look when you have questions about a Series or Series method.
Series Creation¶
How to make a Series from a list¶
The easiest way to make a series is from a list.
import pandas as pd
x = pd.Series([5, 10, 15, 20, 25, 30, 35])
If we print the series, we get back something like this
print(x)
# 0 5
# 1 10
# 2 15
# 3 20
# 4 25
# 5 30
# 6 35
# dtype: int64
Notice how it already looks a bit different from a NumPy array. The column of values on the left is the Series index, which you can use to access the Series elements in creative and meaningful ways. More on that later.
Also notice the output includes 'dtype: int64', which tells us the data type of the elements in the Series.
How to check if an object is a Series¶
You can use Python's type() function to check that x is indeed a Series object.
type(x) # pandas.core.series.Series
How to check the type of data stored in a Series¶
If you want to check the internal data type of the Series elements without printing the whole Series, you can use the Series.dtype attribute.
x.dtype # int64
How to access the underlying NumPy array¶
Most pandas Series store the underlying data as a NumPy array. You can access the underlying NumPy array via Series.to_numpy().
x.to_numpy()
# array([ 5, 10, 15, 20, 25, 30, 35])
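You might also see people using the Series.values attribute here, but this technique is not recommended because its return type depends on the dtype (sometimes a NumPy array, sometimes a pandas extension array).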
How to access the first N elements of a Series¶
You can use the highly popular Series.head() method to pick out the first N elements of a Series. For example, x.head(6) returns the first 6 elements of x as a new Series.
x.head(6)
# 0 5
# 1 10
# 2 15
# 3 20
# 4 25
# 5 30
# dtype: int64
How to access the last N elements of a Series¶
You can use Series.tail() to pick out the last N elements of a Series. For example, x.tail(3) returns the last 3 elements of x as a new Series.
x.tail(3)
# 4 25
# 5 30
# 6 35
# dtype: int64
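Two details of head() and tail() that the examples above don't show: N defaults to 5 when omitted, and a negative N means "everything except the last (or first) N elements". A small sketch, with illustrative variable names:

```python
import pandas as pd

x = pd.Series([5, 10, 15, 20, 25, 30, 35])

first_five = x.head()        # n defaults to 5
all_but_last3 = x.head(-3)   # negative n drops the last 3 elements
last_five = x.tail()         # n defaults to 5

print(first_five.tolist())     # [5, 10, 15, 20, 25]
print(all_but_last3.tolist())  # [5, 10, 15, 20]
print(last_five.tolist())      # [15, 20, 25, 30, 35]
```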
How to make a Series from a dictionary¶
You can make a Series from a Python dictionary, like this:
data = {'a' : 0., 'b' : 1., 'c' : 2., 'd': 3.}
y = pd.Series(data)
print(y)
# a 0.0
# b 1.0
# c 2.0
# d 3.0
# dtype: float64
In this case, pandas uses the dictionary keys for the series index and the dictionary values for the series values. Again, we'll cover the index and its purpose shortly. For now, just know it's a thing.
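If you also pass an index alongside the dictionary, pandas looks up each requested label among the dictionary keys. The sketch below reuses the data dictionary from above; the label 'z' is a made-up key that isn't in the dictionary, so its value comes back as NaN.

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2., 'd': 3.}

# Pick and reorder keys; 'z' is not in the dictionary, so it becomes NaN
y = pd.Series(data, index=['c', 'a', 'z'])
print(y)
# c    2.0
# a    0.0
# z    NaN
# dtype: float64
```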
How to make a Series of strings¶
If we wanted to make a Series of strings, we could do that too.
z = pd.Series(['frank', 'dee', 'dennis']) # (1)!
- This is the old, bad way. See the new, good way below.
If we print(z), notice the dtype is listed as "object".
print(z)
# 0 frank
# 1 dee
# 2 dennis
# dtype: object
Why?
The short answer is, this is not a Series of strings. Rather, this is a Series of pointers. Since strings are objects that vary in size, but arrays (and thus Series) use fixed-size memory blocks to store their data, pandas implements a common trick: it stores each string wherever it fits in memory and puts the address of each string in the underlying array. (Memory addresses are fixed-size objects, usually just 64-bit integers.) If you're confused by this, don't worry. It's a tricky concept that'll make more sense later on.
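One way to see the "array of pointers" idea in action is to look at what an object Series actually holds. In this sketch, dtype=object is requested explicitly (an assumption made so the example behaves the same regardless of a given pandas version's default for strings), and grab_bag is a made-up name.

```python
import pandas as pd

# Explicitly request the object dtype
z = pd.Series(['frank', 'dee', 'dennis'], dtype=object)

print(z.dtype)          # object
print(type(z.iloc[0]))  # <class 'str'>

# Because the array only holds references, elements of different types can coexist
grab_bag = pd.Series(['frank', 42, 3.5], dtype=object)
print([type(v).__name__ for v in grab_bag])  # ['str', 'int', 'float']
```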
The newer and better approach to creating a Series of strings is to specify dtype='string'.
z = pd.Series(['frank', 'dee', 'dennis'], dtype='string')
Now when we print(z), pandas reports the dtype as 'string'.
print(z)
# 0 frank
# 1 dee
# 2 dennis
# dtype: string
(There's a lot to discuss here, but we'll cover these things later.)
How to make a Series from a NumPy array¶
Perhaps the most powerful way to make a Series from scratch is to make it from a NumPy array.
# import numpy and pandas
import numpy as np
import pandas as pd
If you have a NumPy array like this
x = np.array([10, 20, 30, 40])
you can convert it to a Series just by passing x into pd.Series()
pd.Series(x)
# 0 10
# 1 20
# 2 30
# 3 40
# dtype: int64
Why is this so "powerful"?
Well, suppose you wanted to make a complex Series from scratch, like a random sample of values from a normal distribution. The somewhat lame but practical way to do this is to use NumPy. NumPy has lots of great tools for making arrays from scratch, and converting them into a Series is a piece of cake.
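For instance, the random-sample case might look like the sketch below. The seed and the name sample are arbitrary choices for illustration, and np.random.default_rng is just one of several ways NumPy can generate random values.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sample is reproducible
rng = np.random.default_rng(seed=0)

# Draw 5 values from a standard normal distribution, then wrap them in a Series
sample = pd.Series(rng.normal(loc=0.0, scale=1.0, size=5))
print(sample.dtype)  # float64
```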
Is your NumPy rusty?
Check out our NumPy problem set.
Series Basic Indexing¶
Suppose we have the following Series, x.
x = pd.Series([5, 10, 15, 20, 25])
print(x)
# 0 5
# 1 10
# 2 15
# 3 20
# 4 25
# dtype: int64
If you wanted to access the ith element of the Series, you might be inclined to use square-bracket indexing notation just like accessing elements from a Python list or a NumPy array.
x[0] # 5
x[1] # 10
x[0] returns the 1st element, x[1] returns the 2nd element, and so on.
This appears to work like list indexing, but don't be fooled! x[0] actually returns the element(s) of the Series with index label 0. In this example, that element happens to be the first element in the Series, but if we shuffle the index like this
x.index = [3,1,4,0,2]
print(x)
# 3 5
# 1 10
# 4 15
# 0 20
# 2 25
# dtype: int64
now x[0] returns 20 instead of 5.
x[0] # 20
However, if we change the index to ['a','b','c','d','e']
x.index = ['a','b','c','d','e']
print(x)
# a 5
# b 10
# c 15
# d 20
# e 25
# dtype: int64
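This time, x[0] does return the first value in the Series, because pandas falls back to a positional lookup when the index holds no integer labels. (Recent pandas releases deprecate this fallback and emit a FutureWarning.)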
x[0] # 5
Caution
The takeaway here is that square-bracket indexing in pandas isn't straightforward. Its behavior changes depending on characteristics of the Series. For this reason, we recommend using more explicit indexing techniques: Series.iloc and Series.loc.
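The difference between the two explicit indexers is easiest to see on a Series with a shuffled integer index. A minimal sketch, with variable names chosen for illustration:

```python
import pandas as pd

x = pd.Series([5, 10, 15, 20, 25], index=[3, 1, 4, 0, 2])

by_label = x.loc[0]      # the element whose index label is 0
by_position = x.iloc[0]  # the element in the first position

print(by_label)     # 20
print(by_position)  # 5
```

Unlike x[0], neither expression changes meaning when the index changes type.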
Indexing by position¶
x = pd.Series([5, 10, 15, 20, 25])
print(x)
# 0 5
# 1 10
# 2 15
# 3 20
# 4 25
# dtype: int64
How to access the ith value of a Series¶
Use the Series.iloc property to access the ith value in a Series.
x.iloc[0] # 5, get the first value in the Series
x.iloc[1] # 10, get the second value in the Series
Negative Indexing¶
Series.iloc supports negative indexing like Python lists and NumPy arrays.
x.iloc[-1] # 25 | last element
x.iloc[-2] # 20 | second-to-last element
x.iloc[-3] # 15 | third-to-last element
Positional Slicing¶
Series.iloc supports slicing like Python lists and NumPy arrays.
x.iloc[1:4:2] # get values from position 1 up to (but excluding) position 4, stepping by 2
# 1 10
# 3 20
# dtype: int64
Notice the result is a Series object whereas in the previous examples the results were scalars.
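A few more slice patterns, sketched under the assumption that x is the five-element Series from above. They follow the same start:stop:step rules as Python lists.

```python
import pandas as pd

x = pd.Series([5, 10, 15, 20, 25])

first_two = x.iloc[:2]      # positions 0 and 1
last_two = x.iloc[-2:]      # the final two positions
reversed_x = x.iloc[::-1]   # every element, in reverse order

print(first_two.tolist())   # [5, 10]
print(last_two.tolist())    # [20, 25]
print(reversed_x.tolist())  # [25, 20, 15, 10, 5]
```

Each slice keeps the original index labels, so reversed_x has index 4, 3, 2, 1, 0.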
How to select multiple elements by position¶
Series.iloc can receive a list, array, or Series of integers to select multiple values in x.
x.iloc[[0, 2, 3]] # 5, 15, 20
x.iloc[np.array([0, 2, 3])] # 5, 15, 20
x.iloc[pd.Series([0, 2, 3])] # 5, 15, 20
Indexing by label¶
Let's talk about the index. Every Series has an index and its purpose is to provide a label for each element in the Series. When you make a Series from scratch, it automatically gets an index of sequential values starting from 0.
For example, here we make a Series to represent the test grades of six students, and you can see how the index automatically gets created.
grades = pd.Series([82, 94, 77, 89, 91, 54])
print(grades)
# 0 82
# 1 94
# 2 77
# 3 89
# 4 91
# 5 54
# dtype: int64
We can change the index pretty easily, just by setting it equal to another array, list, or Series of values with the proper length. The index values don't even need to be integers, and in fact, they're often represented as strings.
grades.index = ['homer', 'maggie', 'grandpa', 'bart', 'lisa', 'marge']
print(grades)
# homer 82
# maggie 94
# grandpa 77
# bart 89
# lisa 91
# marge 54
# dtype: int64
How to access the value of a Series with label¶
To fetch the Series value(s) with a specific label, use the Series.loc indexer. For example, to get bart's grade in the Series above, we can do grades.loc['bart'].
grades.loc['bart'] # 89
Label Slicing¶
Series.loc supports slicing by label. For example, to fetch the grades from homer through grandpa, we could do grades.loc['homer':'grandpa'].
grades.loc['homer':'grandpa']
# homer 82
# maggie 94
# grandpa 77
# dtype: int64
Notice that the slice 'homer':'grandpa' includes homer and grandpa. By contrast, the equivalent positional slice 0:2 would exclude the right endpoint (grandpa).
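The endpoint difference is easy to verify side by side. This sketch rebuilds the grades Series from above and compares the two slices:

```python
import pandas as pd

grades = pd.Series(
    [82, 94, 77, 89, 91, 54],
    index=['homer', 'maggie', 'grandpa', 'bart', 'lisa', 'marge']
)

by_label = grades.loc['homer':'grandpa']  # right endpoint included
by_position = grades.iloc[0:2]            # right endpoint excluded

print(by_label.index.tolist())     # ['homer', 'maggie', 'grandpa']
print(by_position.index.tolist())  # ['homer', 'maggie']
```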
How to select multiple elements by label¶
Just like Series.iloc[], we can pass a list, array, or Series of labels into Series.loc[] to retrieve multiple elements.
grades.loc[['homer', 'grandpa', 'bart']]
# homer 82
# grandpa 77
# bart 89
# dtype: int64
RangeIndex¶
When you make a Series without specifying its index, pandas automatically gives it a RangeIndex.
x = pd.Series(np.random.normal(size=5))
print(x)
# 0 0.651743
# 1 0.311423
# 2 0.103382
# 3 -3.614402
# 4 -0.046355
# dtype: float64
print(x.index)
# RangeIndex(start=0, stop=5, step=1)
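By contrast, when you explicitly set the index as a list of integers, pandas gives it an Int64Index. (In pandas 2.0 and later, Int64Index no longer exists as a separate class; the same index prints as Index([0, 1, 2, 3, 4], dtype='int64').)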
x = pd.Series(np.random.normal(size=5), index=[0,1,2,3,4])
print(x)
# 0 -0.091815
# 1 -0.823428
# 2 1.394426
# 3 1.263174
# 4 -0.421659
# dtype: float64
print(x.index)
# Int64Index([0, 1, 2, 3, 4], dtype='int64')
For most situations, the difference is irrelevant. However, note that the RangeIndex is more memory efficient and has faster access times.
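A rough way to see the memory difference is to compare the two kinds of index with Index.memory_usage(). This is only a sketch: the size n and the variable names are arbitrary, and exact byte counts vary by platform, so only the comparison is shown.

```python
import pandas as pd

n = 1000

# Default index: a RangeIndex that only stores start, stop and step
default_idx = pd.Series(range(n)).index

# Explicit list of integers: one stored label per element
explicit_idx = pd.Series(range(n), index=list(range(n))).index

print(isinstance(default_idx, pd.RangeIndex))                    # True
print(isinstance(explicit_idx, pd.RangeIndex))                   # False
print(default_idx.memory_usage() < explicit_idx.memory_usage())  # True
```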
Modifying Series Data¶
Consider this Series foo.
foo = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
Basic Series Modifications¶
We can change the second element to 200.
foo.iloc[1] = 200 # by position
foo.loc['b'] = 200 # by label (same element)
We can set the 1st, 2nd and 3rd elements to 99.
foo.iloc[[0, 1, 2]] = 99
or with slicing
foo.iloc[:3] = 99 # (1)!
- [:3] means "select every element from the start of the Series up to but excluding position 3".
The same update by label:
foo.loc[['a', 'b', 'c']] = 99
or with slicing
foo.loc['a':'c'] = 99 # (1)!
- ['a':'c'] means "select every element from label 'a' to label 'c', including 'a' and 'c'".
How to update a Series with an array¶
Suppose you have a Series foo and a NumPy array bar
foo = pd.Series([2, 3, 5, 7, 11], index=[2, 4, 6, 8, 10])
bar = np.array([5, 10, 15, 20, 25])
and your goal is to update foo's values with bar. If you overwrite foo, you'll lose its index.
foo = pd.Series(bar)
print(foo)
# 0 5
# 1 10
# 2 15
# 3 20
# 4 25
# dtype: int64
Instead, starting from the original foo, use slicing to overwrite foo's values without overwriting its index.
foo.iloc[:] = bar
print(foo)
# 2 5
# 4 10
# 6 15
# 8 20
# 10 25
# dtype: int64
How to update a Series with another Series¶
Suppose you have a Series x and a Series y whose indices are different but share a few common values.
x = pd.Series([10, 20, 30, 40])
y = pd.Series([1, 11, 111, 1111], index=[7,3,2,0])
print(x)
# 0 10
# 1 20
# 2 30
# 3 40
# dtype: int64
print(y)
# 7 1
# 3 11
# 2 111
# 0 1111
# dtype: int64
Predict the result of x.loc[[0, 1]] = y.
x.loc[[0, 1]] = y
print(x)
# 0 1111.0
# 1 NaN
# 2 30.0
# 3 40.0
# dtype: float64
You may be surprised.
Index Alignment
When you assign a Series y to a Series x, pandas uses index alignment to insert values from y into x based on matching index labels.
In the previous example, pandas starts by searching x for the values with index labels 0 and 1. Then it looks for matching labels in y to use to overwrite x. Since x's label 1 doesn't match any elements in y, pandas assigns it the value NaN. And since NaN only exists as a floating point value in NumPy, pandas casts the entire Series from ints to floats.
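Alignment also means the order of y's elements doesn't matter, only their labels. In this sketch (same x and y as above, but a different set of target labels chosen for illustration), every target label exists in y, so no NaN appears.

```python
import pandas as pd

x = pd.Series([10, 20, 30, 40])
y = pd.Series([1, 11, 111, 1111], index=[7, 3, 2, 0])

# Labels 0, 2 and 3 all exist in y, so each one receives y's value for that label:
# label 0 receives 1111, label 2 receives 111, label 3 receives 11
x.loc[[0, 2, 3]] = y

# y's label 7 has no counterpart among the targets, so it is simply unused
print(x.tolist())
```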
How to update a Series with a NumPy array¶
Given x and y from the previous section,
x = pd.Series([10, 20, 30, 40])
y = pd.Series([1, 11, 111, 1111], index=[7,3,2,0])
print(x)
# 0 10
# 1 20
# 2 30
# 3 40
# dtype: int64
print(y)
# 7 1
# 3 11
# 2 111
# 0 1111
# dtype: int64
If we do x.loc[[0, 1]] = y.to_numpy() we'll get the error:
ValueError: cannot set using a list-like indexer with a different length than the value
When you assign a NumPy array to a Series, pandas assigns the ith element of the array to the ith value of the Series.
In this case, x.loc[[0, 1]] = y.to_numpy() attempts to assign a 4-element array to a 2-element subseries, hence the error.
If we restrict the numpy array to its first two elements, the assignment works.
x.loc[[0, 1]] = y.to_numpy()[:2] # (1)!
print(x)
# 0 1
# 1 11
# 2 30
# 3 40
# dtype: int64
- Select the first two elements of x and overwrite their values with the first two elements of y.to_numpy(), the NumPy array version of y.
Series Basic Operations¶
It's important to understand how pandas handles basic operations between Series. Here we'll look at addition, although the core concepts apply to other operations such as subtraction, multiplication, etc.
Adding a scalar to a Series¶
When you add a scalar to a Series, pandas uses broadcasting to add the scalar to each element of the Series.
x = pd.Series([1, 2, 3, 4])
x + 1
# 0 2
# 1 3
# 2 4
# 3 5
# dtype: int64
Adding a Series to a Series¶
Series arithmetic is fundamentally different from NumPy arithmetic. When you add two Series x and y, pandas only combines elements with the same index label.
x = pd.Series([1, 2, 3, 4])
y = pd.Series(1)
x + y
# 0 2.0
# 1 NaN
# 2 NaN
# 3 NaN
# dtype: float64
In this example, x has index labels 0, 1, 2, 3, and y has index label 0.
print(x)
# 0 1
# 1 2
# 2 3
# 3 4
# dtype: int64
print(y)
# 0 1
# dtype: int64
The result of x + y will be a Series whose index labels are the union of x's index labels and y's index labels. In this case, the label 0 is in both Series, so the corresponding elements are added together. However, labels 1, 2, and 3 in x don't have matching elements in y, so pandas converts these to NaN in the result. Since NaN only exists as a floating point constant in NumPy (i.e. you can't have an integer array with NaNs), pandas casts the entire Series from int64 to float64.
Add two Series' elements by position¶
If you want to add two Series' elements by position, convert them to NumPy arrays before adding them. For example,
A = pd.Series([10, 20, 30, 40, 50], index=[4, 3, 2, 1, 0])
B = pd.Series([1, 2, 3, 4, 5])
print(A)
# 4 10
# 3 20
# 2 30
# 1 40
# 0 50
# dtype: int64
print(B)
# 0 1
# 1 2
# 2 3
# 3 4
# 4 5
# dtype: int64
If we add A + B, pandas uses index alignment to add elements by matching index label.
A + B
# 0 51
# 1 42
# 2 33
# 3 24
# 4 15
# dtype: int64
If we add the NumPy arrays underlying each Series, their elements are added by position.
A.to_numpy() + B.to_numpy()
# array([11, 22, 33, 44, 55])
To convert the resulting NumPy array back to a Series, just wrap it with pd.Series().
pd.Series(A.to_numpy() + B.to_numpy())
# 0 11
# 1 22
# 2 33
# 3 44
# 4 55
# dtype: int64
This technique drops A's index labels. If you want to retain A's labels, only convert B to an array.
A + B.to_numpy()
# 4 11
# 3 22
# 2 33
# 1 44
# 0 55
# dtype: int64
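Another way to get positional addition, offered here as an alternative rather than the approach used above, is to discard A's labels with reset_index(drop=True) so that both Series share the default 0..n-1 index before adding.

```python
import pandas as pd

A = pd.Series([10, 20, 30, 40, 50], index=[4, 3, 2, 1, 0])
B = pd.Series([1, 2, 3, 4, 5])

# After the reset, A's labels are 0..4 in positional order, so alignment matches positions
by_position = A.reset_index(drop=True) + B
print(by_position.tolist())  # [11, 22, 33, 44, 55]
```

Like pd.Series(A.to_numpy() + B.to_numpy()), this drops A's original labels.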
Add Series by label, prevent NaNs in the result¶
If you add two Series by index label, you'll often get NaNs in the result where an index label didn't exist in both Series.
x = pd.Series([1, 2, 3, 4])
y = pd.Series([10, 20], index=[1,3])
print(x)
# 0 1
# 1 2
# 2 3
# 3 4
# dtype: int64
print(y)
# 1 10
# 3 20
# dtype: int64
x + y
# 0 NaN
# 1 12.0
# 2 NaN
# 3 24.0
# dtype: float64
If you wish to add y to x by matching label without introducing NaNs in the result, you can use x.loc[y.index] to select elements of x with a matching index label in y, combined with += y.
x.loc[y.index] += y
print(x) # (1)!
# 0 1
# 1 12
# 2 3
# 3 24
# dtype: int64
- This operation modifies x, unlike x + y which creates a new Series.
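If you'd rather not modify x, one alternative (not the technique shown above) is Series.add() with fill_value=0, which treats a label missing from one side as 0 instead of producing NaN. Note that the result comes back as floats.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4])
y = pd.Series([10, 20], index=[1, 3])

# Labels 0 and 2 are missing from y, so they are treated as 0 for the addition
total = x.add(y, fill_value=0)
print(total.tolist())  # [1.0, 12.0, 3.0, 24.0]

# x itself is left untouched
print(x.tolist())      # [1, 2, 3, 4]
```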
Boolean Indexing¶
You can use a boolean Series x to subset a different Series, y, via y.loc[x].
For example, given a Series of integers, foo,
foo = pd.Series([20, 50, 11, 45, 17, 31])
print(foo)
# 0 20
# 1 50
# 2 11
# 3 45
# 4 17
# 5 31
# dtype: int64
you can set mask = foo < 20 to build a boolean Series, mask, that identifies whether each element of foo is less than 20.
mask = foo < 20
print(mask)
# 0 False
# 1 False
# 2 True
# 3 False
# 4 True
# 5 False
# dtype: bool
Then you can pass mask into foo.loc[] to select elements of foo which are less than 20.
foo.loc[mask]
# 2 11
# 4 17
# dtype: int64
Boolean Index Alignment
pandas aligns the boolean Series to the target Series by index label, then selects the target elements whose matching label has the value True. For example, if we shuffle mask's index (but not mask's values), foo.loc[mask] produces a different result.
mask.index=[0,1,3,2,4,5]
print(mask)
# 0 False
# 1 False
# 3 True
# 2 False
# 4 True
# 5 False
# dtype: bool
foo.loc[mask]
# 3 45
# 4 17
# dtype: int64
Boolean Indexing by Position¶
If you want to select elements from a Series based on the position of True values from another Series, convert the boolean index Series to a NumPy array.
x = pd.Series([10, 20, 30, 40, 50])
mask = pd.Series([True, True, False, False, False], index=[4,3,2,1,0])
# boolean index by label
x.loc[mask]
# 3 40
# 4 50
# dtype: int64
# boolean index by position
x.loc[mask.to_numpy()]
# 0 10
# 1 20
# dtype: int64
Combining Boolean Series¶
You can combine two boolean Series to create a third boolean Series. For example, given a Series of person ages
ages = pd.Series(
data = [42, 43, 14, 18, 1],
index = ['peter', 'lois', 'chris', 'meg', 'stewie']
)
print(ages)
# peter 42
# lois 43
# chris 14
# meg 18
# stewie 1
# dtype: int64
and a series of person genders
genders = pd.Series(
data = ['female', 'female', 'male', 'male', 'male'],
index = ['lois', 'meg', 'chris', 'peter', 'stewie'],
dtype = 'string'
)
print(genders)
# lois female
# meg female
# chris male
# peter male
# stewie male
# dtype: string
you can create a boolean Series identifying males younger than 18 like this.
mask = (genders == 'male') & (ages < 18)
print(mask)
# chris True
# lois False
# meg False
# peter False
# stewie True
# dtype: bool
When you combine two logical expressions in this way, each expression must be wrapped in parentheses. In this case, genders == 'male' & ages < 18 would raise an error, because & binds more tightly than == and <, so Python would try to evaluate 'male' & ages first.
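The precedence problem can be reproduced with a single Series. The sketch below uses only the ages data from above; the range check (older than 10 and younger than 18) and the names teens and raised are made up for illustration.

```python
import pandas as pd

ages = pd.Series(
    [42, 43, 14, 18, 1],
    index=['peter', 'lois', 'chris', 'meg', 'stewie']
)

# With parentheses: each comparison is evaluated first, then combined element-wise
teens = (ages > 10) & (ages < 18)
print(teens.tolist())  # [False, False, True, False, False]

# Without parentheses, Python reads the expression as the chained comparison
# ages > (10 & ages) < 18, which needs the truth value of a whole Series
try:
    ages > 10 & ages < 18
    raised = False
except ValueError:
    raised = True

print(raised)  # True
```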
Logical Operators¶
| x | y | x & y |
| ----- | ----- | ----- |
| True | True | True |
| True | False | False |
| False | True | False |
| False | False | False |

| x | y | x \| y |
| ----- | ----- | ----- |
| True | True | True |
| True | False | True |
| False | True | True |
| False | False | False |

| x | y | x ^ y |
| ----- | ----- | ----- |
| True | True | False |
| True | False | True |
| False | True | True |
| False | False | False |

| x | ~x |
| ----- | ----- |
| True | False |
| False | True |
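The truth tables above can be reproduced directly with two boolean Series. In this sketch, a and b (illustrative names) enumerate the four combinations of True and False, row for row.

```python
import pandas as pd

a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])

print((a & b).tolist())  # and: [True, False, False, False]
print((a | b).tolist())  # or:  [True, True, True, False]
print((a ^ b).tolist())  # xor: [False, True, True, False]
print((~a).tolist())     # not: [False, False, True, True]
```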
Missing Values (NaN)¶
You can use NaN to represent missing or invalid values in a Series.
NaN before pandas 1.0.0¶
Prior to pandas version 1.0.0, if you wanted to represent missing or invalid data, you had to use NumPy's special floating point constant, np.nan. If you had a Series of integers
roux = pd.Series([1, 2, 3])
print(roux)
# 0 1
# 1 2
# 2 3
# dtype: int64
and you set the second element to np.nan
roux.iloc[1] = np.nan
print(roux)
# 0 1.0
# 1 NaN
# 2 3.0
# dtype: float64
the Series would get cast to floats because NaN only exists in NumPy as a floating point constant.
NaN after 1.0.0¶
pandas' release of version 1.0.0 included a Nullable integer data type. If you want to make a Series of integers with NaNs, you can specify the Series dtype as "Int64" with a capital "I" as opposed to NumPy's "int64" with a lower case "i".
roux = pd.Series([1, 2, 3], dtype='Int64')
print(roux)
# 0 1
# 1 2
# 2 3
# dtype: Int64
Now if you set the second element to NaN, the Series retains its Int64 data type.
roux.iloc[1] = np.nan
print(roux)
# 0 1
# 1 <NA>
# 2 3
# dtype: Int64
A better way to insert NaNs in modern pandas is to use pd.NA.
roux.iloc[1] = pd.NA
Pandas Nullable Data Types¶
pd.Series([True, pd.NA, False], dtype="boolean")
# 0 True
# 1 <NA>
# 2 False
# dtype: boolean
pd.Series([10, pd.NA, 30], dtype="Int64")
# 0 10
# 1 <NA>
# 2 30
# dtype: Int64
pd.Series([1.2, pd.NA, 3.4], dtype="Float64")
# 0 1.2
# 1 <NA>
# 2 3.4
# dtype: Float64
pd.Series(["dog", pd.NA, "cat"], dtype="string")
# 0 dog
# 1 <NA>
# 2 cat
# dtype: string
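An existing Series can also be converted to a nullable dtype after the fact. This is a small sketch using astype(); the names floats and ints are illustrative.

```python
import pandas as pd

# Built the default way: the missing value forces a float64 Series
floats = pd.Series([1, 2, None])
print(floats.dtype)  # float64

# Convert to the nullable integer dtype; NaN becomes <NA>
ints = floats.astype('Int64')
print(ints)
# 0       1
# 1       2
# 2    <NA>
# dtype: Int64
```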
NaN Tips and Tricks¶
Given a Series, x, with some NaN values,
x = pd.Series([1, pd.NA, 3, pd.NA], dtype='Int64')
print(x)
# 0 1
# 1 <NA>
# 2 3
# 3 <NA>
# dtype: Int64
You can use pd.isna() to check whether each value is NaN.
pd.isna(x)
# 0 False
# 1 True
# 2 False
# 3 True
# dtype: bool
You can use pd.notna() to check whether each value is not NaN.
pd.notna(x)
# 0 True
# 1 False
# 2 True
# 3 False
# dtype: bool
If you want to replace NaN values in a Series with a fill value, you can use the Series.fillna() method.
# replace NaNs with -1
x.fillna(-1) # (1)!
# 0 1
# 1 -1
# 2 3
# 3 -1
# dtype: Int64
- This creates a copy of x with NaNs filled in. (x remains unmodified.) If you want to modify x, use x.fillna(-1, inplace=True).
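A few related helpers, sketched on the same x: Series.dropna() removes missing values instead of filling them, isna().sum() counts them, and sum() skips them by default.

```python
import pandas as pd

x = pd.Series([1, pd.NA, 3, pd.NA], dtype='Int64')

kept = x.dropna()           # drop the missing values, keep the original labels
n_missing = x.isna().sum()  # count the missing values
total = x.sum()             # missing values are skipped by default

print(kept.index.tolist())  # [0, 2]
print(n_missing)            # 2
print(total)                # 4
```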
Boolean Indexing with NaN¶
It's important to understand how NaNs work with boolean indexing.
Suppose you have a Series of integers, goo, and a corresponding Series of booleans, choo, with some NaN values.
goo = pd.Series([10,20,30,40])
choo = pd.Series([True, False, pd.NA, True])
If you attempt to index goo with choo, pandas throws an error.
goo.loc[choo]
"ValueError: Cannot mask with non-boolean array containing NA / NaN values"
Notice that choo has dtype 'object'.
print(choo)
# 0 True
# 1 False
# 2 <NA>
# 3 True
# dtype: object
This happens because pandas relies on NumPy's handling of NaNs by default, and NumPy doesn't "play nicely" with NaN values unless you happen to be working with an array of floats. In this case, dtype='object' is an indication that the underlying NumPy array is really just an array of pointers.
To overcome this issue, we can rebuild choo with dtype = "boolean".
choo = pd.Series([True, False, pd.NA, True], dtype = "boolean")
print(choo)
# 0 True
# 1 False
# 2 <NA>
# 3 True
# dtype: boolean
Now the boolean index goo.loc[choo] returns a 2-element Series as you might expect.
goo.loc[choo] # (1)!
# 0 10
# 3 40
# dtype: int64
-
|   | goo | choo |
| - | --- | ----- |
| 0 | 10 | True |
| 1 | 20 | False |
| 2 | 30 | <NA> |
| 3 | 40 | True |
In this case, the NaN value in choo is essentially ignored.
Note that the negation of NaN is NaN, so goo.loc[~choo] does not return the complement of goo.loc[choo].
goo.loc[~choo] # (1)!
# 1 20
# dtype: int64
-
|   | goo | choo | ~choo |
| - | --- | ----- | ----- |
| 0 | 10 | True | False |
| 1 | 20 | False | True |
| 2 | 30 | <NA> | <NA> |
| 3 | 40 | True | False |
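If you want the two selections to be true complements, one option is to decide explicitly how missing mask values should count before indexing. This sketch treats them as False via fillna(False); that choice is an assumption about what you want, not something pandas does for you.

```python
import pandas as pd

goo = pd.Series([10, 20, 30, 40])
choo = pd.Series([True, False, pd.NA, True], dtype='boolean')

# Treat the missing mask value as False before selecting
filled = choo.fillna(False)

selected = goo.loc[filled]     # rows where the mask is True
complement = goo.loc[~filled]  # every other row, including the formerly-NA one

print(selected.tolist())    # [10, 40]
print(complement.tolist())  # [20, 30]
```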