Groupby#
pd.DataFrame.groupby is a very useful tool, but sometimes working with it can be a bit confusing. So in this page I want to pay more attention to some functions and cases.
The most useful page for learning is GroupBy object in pandas documentation.
import pandas as pd
from IPython.display import HTML
groupby parameters#
Here are described general parameters of the groupby method, that affect the result regardless of the specific transformations.
Argument |
Description |
|---|---|
|
Mapping, function, label, or list of labels. What to group by (e.g., a column name, a list of column names, or a function to transform the index or columns). |
|
|
|
If the axis is a MultiIndex, group by a specific level or levels. |
|
|
|
|
|
|
|
|
|
|
Check some description for some of them in the corresponding page.
Usage options#
This section considers ways to compute things over groups and practical ways to overate with “gropby” objects.
Method |
Description |
Example |
|---|---|---|
|
Apply one or more aggregation functions. |
|
|
Apply any custom function to each group. |
|
|
Return same-shaped output with function applied per group. |
|
|
Filter groups based on a condition. |
|
|
Extract a single group by key. |
|
|
Create an expanding window for each group — like cumulative operations but allows applying other functions too. |
|
Check more details on the corresponding page.
The following cell defines the groupby object that will be used by all the following examples to make sure there are only different ways to use the same tool.
basic_frame = pd.DataFrame({
'A': ['a', 'a', 'b', 'b', 'c', 'c'],
'B': [2, 1, 3, 4, 6, 5],
'C': [10, 20, 30, 40, 50, 60]
})
gb = basic_frame.groupby("A")
Classic option - just apply the function from the groupby object:
gb.sum()
| B | C | |
|---|---|---|
| A | ||
| a | 3 | 30 |
| b | 7 | 70 |
| c | 11 | 110 |
Iterating over the groupby object allows you to work with subsets that correspond to the particular group.
pd.DataFrame({sub_frame[0]: sub_frame[1].sum() for sub_frame in gb})
| a | b | c | |
|---|---|---|---|
| A | aa | bb | cc |
| B | 3 | 7 | 11 |
| C | 30 | 70 | 110 |
The agg method allows to set a separate transformation for each column by dictionary.
gb.agg({"B": "sum", "C": "sum"})
| B | C | |
|---|---|---|
| A | ||
| a | 3 | 30 |
| b | 7 | 70 |
| c | 11 | 110 |
The apply method of the groupby object allows to specify a function to be applied to each subset.
gb.apply(lambda sub_frame: sub_frame.sum(), include_groups=False)
| B | C | |
|---|---|---|
| A | ||
| a | 3 | 30 |
| b | 7 | 70 |
| c | 11 | 110 |
The transform function allows to compute values without collapsing the result dataframe into rows by groups.
gb.transform("sum")
| B | C | |
|---|---|---|
| 0 | 3 | 30 |
| 1 | 3 | 30 |
| 2 | 7 | 70 |
| 3 | 7 | 70 |
| 4 | 11 | 110 |
| 5 | 11 | 110 |
Arrays functions#
Consider the functions that can be applied to the corresponding group. All of the functions I am familiar yet, works in the same way with regular pandas.Series. Here are some cases that were useful for me in the groupby context.
Concatenation#
Use the sum function to concatenate strings of subsets.
The following cell defines an example dataframe.
df = pd.DataFrame({
"group" : ["a", "a", "b", "b"],
"text" : ["hello", "test", "line3", "superline"]
})
df
| group | text | |
|---|---|---|
| 0 | a | hello |
| 1 | a | test |
| 2 | b | line3 |
| 3 | b | superline |
The applicaiton of the sum funciton to the groupby("group") object is displayed below.
df.groupby("group").sum()
| text | |
|---|---|
| group | |
| a | hellotest |
| b | line3superline |
Shift#
The shift functions allows each value to be matched with the previous/next value in the same group. Check corresponding documentation page.
The following cell creates an example dataframe.
df = pd.DataFrame({
"group" : ["a", "a", "b", "b"],
"variable": [1, 2, 3, 4]
})
df
| group | variable | |
|---|---|---|
| 0 | a | 1 |
| 1 | a | 2 |
| 2 | b | 3 |
| 3 | b | 4 |
Applying the shift function to the groupby("group") returns the dataframe that is indexed just like the original dataframe, but all the variable values are shifted inside the groups determined by the group column.
df.groupby("group").shift()
| value | |
|---|---|
| 0 | NaN |
| 1 | 1.0 |
| 2 | NaN |
| 3 | 3.0 |