groupby parameters#
This page discusses the parameters of the pandas.DataFrame.groupby and pandas.Series.groupby methods.
import pandas as pd
Groups as index#
The as_index parameter specifies whether the columns used for grouping become the index of the output or stay regular columns.
The following cell creates a table that will serve as an example.
df = pd.DataFrame({
    "A": ["a", "a", "b", "b"],
    "B": [1, 2, 3, 4]
})
df
| | A | B |
---|---|---|
0 | a | 1 |
1 | a | 2 |
2 | b | 3 |
3 | b | 4 |
The following two cells aggregate by the “A” column, once for each value of as_index.
df.groupby("A", as_index=True).sum()
A | B |
---|---|
a | 3 |
b | 7 |
df.groupby("A", as_index=False).sum()
| | A | B |
---|---|---|
0 | a | 3 |
1 | b | 7 |
If as_index=True, the grouping column is placed in the index of the result. If as_index=False, it stays a regular column and the result gets a default integer index.
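As a quick check of the two layouts above, the sketch below rebuilds the example table and compares as_index=False with calling reset_index() on the default result. The variable names (with_index, flat, same) are illustrative only, and the equivalence is shown for this simple sum aggregation rather than claimed for every aggregation.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["a", "a", "b", "b"],
    "B": [1, 2, 3, 4],
})

# as_index=True: the grouping column becomes the index of the result.
with_index = df.groupby("A", as_index=True).sum()

# as_index=False: "A" stays a regular column with a fresh integer index.
flat = df.groupby("A", as_index=False).sum()

# For this aggregation, moving the index back into a column gives the same table.
same = flat.equals(with_index.reset_index())
print(list(with_index.index), list(flat.columns), same)
```

Which form is more convenient depends on what follows: an index is handy for lookups by group label, while a flat table is often easier to merge or export.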
By external#
You can group by an arbitrary array that is not a column of the dataframe being grouped by passing it to the by argument. The array must contain one label per row.
The following cell creates an example dataframe.
df = pd.DataFrame({
    "g": ['a', 'a', 'a', 'b', 'b'],
    "val": [1, 2, 3, 4, 5]
})
df
| | g | val |
---|---|---|
0 | a | 1 |
1 | a | 2 |
2 | a | 3 |
3 | b | 4 |
4 | b | 5 |
The following cell groups the dataframe by a given list of labels. Because sum is applied to every column, the strings in g are concatenated.
df.groupby(["x", "x", "y", "y", "y"]).sum()
| | g | val |
---|---|---|
x | aa | 3 |
y | abb | 12 |
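Concatenated strings such as those in the g column are usually not the goal. As a small sketch, the external key can also be a NumPy array, and selecting the numeric column before aggregating limits the output to the sums of val. The names keys and totals are placeholders introduced for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "g": ["a", "a", "a", "b", "b"],
    "val": [1, 2, 3, 4, 5],
})

# External labels, one per row of df; they are not stored in the dataframe.
keys = np.array(["x", "x", "y", "y", "y"])

# Select the numeric column first so that only "val" is aggregated.
totals = df.groupby(keys)["val"].sum()
print(totals)
```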
The next cell shows how to use several external grouping arrays at the same time.
df.groupby([
    ["x", "x", "y", "y", "y"],
    ["m", "k", "m", "k", "m"]
]).sum()
| | | g | val |
---|---|---|---|
x | k | a | 2 |
x | m | a | 1 |
y | k | b | 4 |
y | m | ab | 8 |
External arrays and internal columns can also be mixed: the following cell groups by an external list and by the g column.
df.groupby(
    [["x", "x", "y", "y", "y"], "g"]
).sum()
| | g | val |
---|---|---|
x | a | 3 |
y | a | 3 |
y | b | 9 |
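In the output above, the index level that comes from the external list has no name. One possible way to label it, sketched here under the assumption that the external labels share the dataframe's index, is to wrap them in a named Series. The name ext is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "g": ["a", "a", "a", "b", "b"],
    "val": [1, 2, 3, 4, 5],
})

# A named Series with the same index as df; its name labels the index level.
ext = pd.Series(["x", "x", "y", "y", "y"], index=df.index, name="ext")

# Mix the external Series with the internal column "g".
result = df.groupby([ext, "g"])["val"].sum()
print(result)
```

A Series key is aligned with the dataframe by index rather than by position, so the index=df.index argument matters when the dataframe does not use the default integer index.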
observed#
A categorical column can declare a category that never appears in the data. The observed parameter controls whether such unobserved categories are included in the groupby results (False) or only the observed categories are used (True).
The following cell sets the column A to the categorical datatype and adds a new category lost_cat that doesn’t appear in any observation.
df = pd.DataFrame({
    "A": pd.Categorical(
        ['a', 'a', 'b', 'b'],
        categories=['a', 'b', 'lost_cat']
    ),
    "B": [1, 2, 3, 4]
})
df
| | A | B |
---|---|---|
0 | a | 1 |
1 | a | 2 |
2 | b | 3 |
3 | b | 4 |
The key feature of the created dataframe is that it declares more categories than it uses. The following cell prints the A column of the dataframe:
df["A"]
0 a
1 a
2 b
3 b
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'lost_cat']
The Categories line mentions lost_cat, but it never appears in the dataframe.
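To see the difference between declared and observed categories directly, the following sketch compares the categories of the dtype with the values that actually occur. The names s, declared, present and unused are illustrative.

```python
import pandas as pd

s = pd.Series(
    pd.Categorical(
        ["a", "a", "b", "b"],
        categories=["a", "b", "lost_cat"],
    ),
    name="A",
)

# Every category the dtype knows about, whether or not it occurs.
declared = list(s.cat.categories)

# Only the values that actually occur in the data.
present = list(s.unique())

# Categories that are declared but never used.
unused = [c for c in declared if c not in present]
print(unused)
```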
If observed=True, only the categories that actually occur in the dataframe appear in the result.
df.groupby("A", observed=True).sum()
A | B |
---|---|
a | 3 |
b | 7 |
If observed=False, the result includes all categories declared for the categorical column, even those that do not appear in the Series.
df.groupby("A", observed=False).sum()
A | B |
---|---|
a | 3 |
b | 7 |
lost_cat | 0 |
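How an unobserved category is filled depends on the aggregation. As a sketch using the same example data: sum yields 0 as shown above, size reports an empty group, and mean has nothing to average, so it returns NaN. The names grouped, sizes and means are illustrative. Passing observed explicitly also avoids relying on the default, which may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "A": pd.Categorical(
        ["a", "a", "b", "b"],
        categories=["a", "b", "lost_cat"],
    ),
    "B": [1, 2, 3, 4],
})

# Keep unobserved categories in the result.
grouped = df.groupby("A", observed=False)["B"]

sizes = grouped.size()  # number of rows per category
means = grouped.mean()  # mean of "B" per category

print(sizes)
print(means)
```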