Groupby

Groupby#

pd.DataFrame.groupby is a very useful tool, but sometimes working with it can be a bit confusing. So in this page I want to pay more attention to some functions and cases.

The most useful page for learning is GroupBy object in pandas documentation.

import pandas as pd
from IPython.display import HTML

`groupby` parameters#

Here are described general parameters of the groupby method, that affect the result regardless of the specific transformations.

Argument	Description
`by`	Mapping, function, label, or list of labels. What to group by (e.g., a column name, a list of column names, or a function to transform the index or columns).
`axis`	`{0 or ‘index’, 1 or ‘columns’}`, default `0`. Whether to group rows (`0`) or columns (`1`).
`level`	If the axis is a MultiIndex, group by a specific level or levels.
`as_index`	`bool`, default `True`. If `True`, the group labels become the index; if `False`, they remain columns.
`sort`	`bool`, default `True`. Sort group keys.
`group_keys`	`bool`, default `True`. If `True`, adds the group keys to the result index.
`observed`	`bool`, default `False`. For categorical groupers: if `True`, only show observed groups.
`dropna`	`bool`, default `True`. If `True`, do not include groups whose key is `NaN`.

Check some description for some of them in the corresponding page.

Usage options#

This section considers ways to compute things over groups and practical ways to overate with “gropby” objects.

Method	Description	Example
`agg` / `aggregate`	Apply one or more aggregation functions.	`df.groupby('team').agg({'points': 'sum', 'assists': 'mean'})`
`apply`	Apply any custom function to each group.	`df.groupby('team').apply(lambda g: g.head(1))`
`transform`	Return same-shaped output with function applied per group.	`df['mean'] = df.groupby('team')['points'].transform('mean')`
`filter`	Filter groups based on a condition.	`df.groupby('team').filter(lambda g: g['points'].mean() > 15)`
`get_group`	Extract a single group by key.	`df.groupby('team').get_group('A')`
`expanding`	Create an expanding window for each group — like cumulative operations but allows applying other functions too.	`df.groupby('team')['points'].expanding().mean()`

Check more details on the corresponding page.

The following cell defines the groupby object that will be used by all the following examples to make sure there are only different ways to use the same tool.

basic_frame = pd.DataFrame({
    'A': ['a', 'a', 'b', 'b', 'c', 'c'],
    'B': [2, 1, 3, 4, 6, 5],
    'C': [10, 20, 30, 40, 50, 60]
})
gb = basic_frame.groupby("A")

Classic option - just apply the function from the groupby object:

gb.sum()

	B	C
A
a	3	30
b	7	70
c	11	110

Iterating over the groupby object allows you to work with subsets that correspond to the particular group.

pd.DataFrame({sub_frame[0]: sub_frame[1].sum() for sub_frame in gb})

	a	b	c
A	aa	bb	cc
B	3	7	11
C	30	70	110

The agg method allows to set a separate transformation for each column by dictionary.

gb.agg({"B": "sum", "C": "sum"})

	B	C
A
a	3	30
b	7	70
c	11	110

The apply method of the groupby object allows to specify a function to be applied to each subset.

gb.apply(lambda sub_frame: sub_frame.sum(), include_groups=False)

	B	C
A
a	3	30
b	7	70
c	11	110

The transform function allows to compute values without collapsing the result dataframe into rows by groups.

gb.transform("sum")

	B	C
0	3	30
1	3	30
2	7	70
3	7	70
4	11	110
5	11	110

Arrays functions#

Consider the functions that can be applied to the corresponding group. All of the functions I am familiar yet, works in the same way with regular pandas.Series. Here are some cases that were useful for me in the groupby context.

Concatenation#

Use the sum function to concatenate strings of subsets.

The following cell defines an example dataframe.

df = pd.DataFrame({
    "group" : ["a", "a", "b", "b"],
    "text" : ["hello", "test", "line3", "superline"]
})
df

	group	text
0	a	hello
1	a	test
2	b	line3
3	b	superline

The applicaiton of the sum funciton to the groupby("group") object is displayed below.

df.groupby("group").sum()

	text
group
a	hellotest
b	line3superline

Shift#

The shift functions allows each value to be matched with the previous/next value in the same group. Check corresponding documentation page.

The following cell creates an example dataframe.

df = pd.DataFrame({
    "group" : ["a", "a", "b", "b"],
    "variable": [1, 2, 3, 4]
})
df

	group	variable
0	a	1
1	a	2
2	b	3
3	b	4

Applying the shift function to the groupby("group") returns the dataframe that is indexed just like the original dataframe, but all the variable values are shifted inside the groups determined by the group column.

df.groupby("group").shift()

	value
0	NaN
1	1.0
2	NaN
3	3.0