Groupby#

pd.DataFrame.groupby is a very useful tool, but sometimes working with it can be a bit confusing. So in this page I want to pay more attention to some functions and cases.

The most useful page for learning is GroupBy object in pandas documentation.

import pandas as pd
from IPython.display import HTML

groupby parameters#

Here are described general parameters of the groupby method, that affect the result regardless of the specific transformations.

Argument

Description

by

Mapping, function, label, or list of labels. What to group by (e.g., a column name, a list of column names, or a function to transform the index or columns).

axis

{0 or ‘index’, 1 or ‘columns’}, default 0. Whether to group rows (0) or columns (1).

level

If the axis is a MultiIndex, group by a specific level or levels.

as_index

bool, default True. If True, the group labels become the index; if False, they remain columns.

sort

bool, default True. Sort group keys.

group_keys

bool, default True. If True, adds the group keys to the result index.

observed

bool, default False. For categorical groupers: if True, only show observed groups.

dropna

bool, default True. If True, do not include groups whose key is NaN.

Check some description for some of them in the corresponding page.

Usage options#

This section considers ways to compute things over groups and practical ways to overate with “gropby” objects.

Method

Description

Example

agg / aggregate

Apply one or more aggregation functions.

df.groupby('team').agg({'points': 'sum', 'assists': 'mean'})

apply

Apply any custom function to each group.

df.groupby('team').apply(lambda g: g.head(1))

transform

Return same-shaped output with function applied per group.

df['mean'] = df.groupby('team')['points'].transform('mean')

filter

Filter groups based on a condition.

df.groupby('team').filter(lambda g: g['points'].mean() > 15)

get_group

Extract a single group by key.

df.groupby('team').get_group('A')

expanding

Create an expanding window for each group — like cumulative operations but allows applying other functions too.

df.groupby('team')['points'].expanding().mean()

Check more details on the corresponding page.


The following cell defines the groupby object that will be used by all the following examples to make sure there are only different ways to use the same tool.

basic_frame = pd.DataFrame({
    'A': ['a', 'a', 'b', 'b', 'c', 'c'],
    'B': [2, 1, 3, 4, 6, 5],
    'C': [10, 20, 30, 40, 50, 60]
})
gb = basic_frame.groupby("A")

Classic option - just apply the function from the groupby object:

gb.sum()
B C
A
a 3 30
b 7 70
c 11 110

Iterating over the groupby object allows you to work with subsets that correspond to the particular group.

pd.DataFrame({sub_frame[0]: sub_frame[1].sum() for sub_frame in gb})
a b c
A aa bb cc
B 3 7 11
C 30 70 110

The agg method allows to set a separate transformation for each column by dictionary.

gb.agg({"B": "sum", "C": "sum"})
B C
A
a 3 30
b 7 70
c 11 110

The apply method of the groupby object allows to specify a function to be applied to each subset.

gb.apply(lambda sub_frame: sub_frame.sum(), include_groups=False)
B C
A
a 3 30
b 7 70
c 11 110

The transform function allows to compute values without collapsing the result dataframe into rows by groups.

gb.transform("sum")
B C
0 3 30
1 3 30
2 7 70
3 7 70
4 11 110
5 11 110

Arrays functions#

Consider the functions that can be applied to the corresponding group. All of the functions I am familiar yet, works in the same way with regular pandas.Series. Here are some cases that were useful for me in the groupby context.

Concatenation#

Use the sum function to concatenate strings of subsets.


The following cell defines an example dataframe.

df = pd.DataFrame({
    "group" : ["a", "a", "b", "b"],
    "text" : ["hello", "test", "line3", "superline"]
})
df
group text
0 a hello
1 a test
2 b line3
3 b superline

The applicaiton of the sum funciton to the groupby("group") object is displayed below.

df.groupby("group").sum()
text
group
a hellotest
b line3superline

Shift#

The shift functions allows each value to be matched with the previous/next value in the same group. Check corresponding documentation page.


The following cell creates an example dataframe.

df = pd.DataFrame({
    "group" : ["a", "a", "b", "b"],
    "variable": [1, 2, 3, 4]
})
df
group variable
0 a 1
1 a 2
2 b 3
3 b 4

Applying the shift function to the groupby("group") returns the dataframe that is indexed just like the original dataframe, but all the variable values are shifted inside the groups determined by the group column.

df.groupby("group").shift()
value
0 NaN
1 1.0
2 NaN
3 3.0