groupby parameters#

This page discusses the parameters of the pandas.DataFrame.groupby and pandas.Series.groupby methods.

import pandas as pd

Groups as index#

The as_index parameter specifies whether the grouping columns are used as the index of the output or kept as regular columns.


The following cell creates a table that will serve as an example.

df = pd.DataFrame({
    "A": ["a", "a", "b", "b"],
    "B": [1, 2, 3, 4]
})
df
A B
0 a 1
1 a 2
2 b 3
3 b 4

The following two cells aggregate by the “A” column, once for each value of as_index.

df.groupby("A", as_index=True).sum()
B
A
a 3
b 7

df.groupby("A", as_index=False).sum()
A B
0 a 3
1 b 7

If as_index=True, the grouping column is placed in the index of the result. If as_index=False, it stays a regular column and the result gets a default integer index.
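
As a small sketch of how the two options relate, the following cell rebuilds the example table and checks that as_index=False gives the same table as calling reset_index() on the as_index=True result. The variable names with_index and as_columns are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["a", "a", "b", "b"],
    "B": [1, 2, 3, 4]
})

# The grouping column ends up in the index.
with_index = df.groupby("A", as_index=True).sum()

# The grouping column stays a regular column.
as_columns = df.groupby("A", as_index=False).sum()

# Moving the index back into the columns should give the same table.
print(with_index.reset_index().equals(as_columns))
```

Choosing as_index=False is mostly a convenience: it saves a separate reset_index() call when the grouping keys are needed as ordinary columns later on.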

By external#

You can group by an arbitrary array (one that is not a column of the dataframe being grouped) by passing it to the by argument. The array must contain one label per row of the dataframe.


The following cell creates an example dataframe.

df = pd.DataFrame({
    "g": ['a', 'a', 'a', 'b', 'b'],
    "val": [1, 2, 3, 4, 5]
})
df
g val
0 a 1
1 a 2
2 a 3
3 b 4
4 b 5

The following cell groups the dataframe by a given list.

df.groupby(["x", "x", "y", "y", "y"]).sum()
g val
x aa 3
y abb 12

The next cell shows how to use several external arrays at the same time.

df.groupby([
    ["x", "x", "y", "y", "y"],
    ["m", "k", "m", "k", "m"]
]).sum()
      g  val
x k   a    2
  m   a    1
y k   b    4
  m  ab    8

External arrays and column names of the dataframe can be mixed in the same by list.

df.groupby(
    [["x", "x", "y", "y", "y"], "g"]
).sum()
     val
  g
x a    3
y a    3
  b    9
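
The external keys do not have to be plain lists. As a hedged sketch, the following cell groups the same dataframe by a NumPy array and by a named pandas Series. The names keys_array and keys_series and the label "bucket" are made up for this illustration. Only the val column is aggregated, so the string column g is not concatenated.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "g": ['a', 'a', 'a', 'b', 'b'],
    "val": [1, 2, 3, 4, 5]
})

# An external NumPy array with one label per row.
keys_array = np.array(["x", "x", "y", "y", "y"])
by_array = df.groupby(keys_array)["val"].sum()

# An external Series: it is aligned with df by index, and its name
# ("bucket", a hypothetical label) becomes the name of the result index.
keys_series = pd.Series(["x", "x", "y", "y", "y"], name="bucket")
by_series = df.groupby(keys_series)["val"].sum()

print(by_array)
print(by_series)
```

A named Series can be handy when the result is printed or merged later, because the group labels then carry a meaningful index name.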

observed#

A categorical column can declare a category that never appears in the data. The observed parameter determines whether such unobserved categories are included in the groupby result (False) or only the observed categories are used (True).


The following cell creates a dataframe whose column A has the categorical datatype, with an extra category lost_cat that doesn’t appear in any observation.

df = pd.DataFrame({
    "A": pd.Categorical(
        ['a', 'a', 'b', 'b'],
        categories=['a', 'b', 'lost_cat']
    ),
    "B": [1, 2, 3, 4]
})
df
A B
0 a 1
1 a 2
2 b 3
3 b 4

The key feature of this dataframe is that column A declares more categories than it actually uses. The following cell prints the A column of the dataframe:

df["A"]
0    a
1    a
2    b
3    b
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'lost_cat']

The Categories line lists lost_cat, but that value never appears in the data.

If observed=True, the result contains only the categories that actually appear in the dataframe.

df.groupby("A", observed=True).sum()
B
A
a 3
b 7

If observed=False, then the result includes all categories specified in the categorical columns, even if some of them do not appear in the Series.

df.groupby("A", observed=False).sum()
B
A
a 3
b 7
lost_cat 0
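
The difference is also easy to see with size(), which counts the rows in each group. The following cell is a minimal sketch that rebuilds the dataframe above; the names observed_sizes and all_sizes are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": pd.Categorical(
        ['a', 'a', 'b', 'b'],
        categories=['a', 'b', 'lost_cat']
    ),
    "B": [1, 2, 3, 4]
})

# Only categories that occur in the data form groups.
observed_sizes = df.groupby("A", observed=True).size()

# Every declared category forms a group, including empty ones.
all_sizes = df.groupby("A", observed=False).size()

print(observed_sizes.to_dict())
print(all_sizes.to_dict())
```

Passing observed explicitly, as in these cells, avoids relying on the default value, which has not been the same in every pandas release.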