Groupby#
pd.DataFrame.groupby
is a very useful tool, but sometimes working with it can be a bit confusing. So in this page I want to pay more attention to some functions and cases.
The most useful page for learning is GroupBy object in pandas documentation.
import pandas as pd
from IPython.display import HTML
groupby
parameters#
Here are described general parameters of the groupby
method, that affect the result regardless of the specific transformations.
Argument |
Description |
---|---|
|
Mapping, function, label, or list of labels. What to group by (e.g., a column name, a list of column names, or a function to transform the index or columns). |
|
|
|
If the axis is a MultiIndex, group by a specific level or levels. |
|
|
|
|
|
|
|
|
|
|
Check some description for some of them in the corresponding page.
Usage options#
This section considers ways to compute things over groups and practical ways to overate with “gropby” objects.
Method |
Description |
Example |
---|---|---|
|
Apply one or more aggregation functions. |
|
|
Apply any custom function to each group. |
|
|
Return same-shaped output with function applied per group. |
|
|
Filter groups based on a condition. |
|
|
Extract a single group by key. |
|
|
Create an expanding window for each group — like cumulative operations but allows applying other functions too. |
|
Check more details on the corresponding page.
The following cell defines the groupby
object that will be used by all the following examples to make sure there are only different ways to use the same tool.
basic_frame = pd.DataFrame({
'A': ['a', 'a', 'b', 'b', 'c', 'c'],
'B': [2, 1, 3, 4, 6, 5],
'C': [10, 20, 30, 40, 50, 60]
})
gb = basic_frame.groupby("A")
Classic option - just apply the function from the groupby
object:
gb.sum()
B | C | |
---|---|---|
A | ||
a | 3 | 30 |
b | 7 | 70 |
c | 11 | 110 |
Iterating over the groupby
object allows you to work with subsets that correspond to the particular group.
pd.DataFrame({sub_frame[0]: sub_frame[1].sum() for sub_frame in gb})
a | b | c | |
---|---|---|---|
A | aa | bb | cc |
B | 3 | 7 | 11 |
C | 30 | 70 | 110 |
The agg
method allows to set a separate transformation for each column by dictionary.
gb.agg({"B": "sum", "C": "sum"})
B | C | |
---|---|---|
A | ||
a | 3 | 30 |
b | 7 | 70 |
c | 11 | 110 |
The apply
method of the groupby
object allows to specify a function to be applied to each subset.
gb.apply(lambda sub_frame: sub_frame.sum(), include_groups=False)
B | C | |
---|---|---|
A | ||
a | 3 | 30 |
b | 7 | 70 |
c | 11 | 110 |
The transform
function allows to compute values without collapsing the result dataframe into rows by groups.
gb.transform("sum")
B | C | |
---|---|---|
0 | 3 | 30 |
1 | 3 | 30 |
2 | 7 | 70 |
3 | 7 | 70 |
4 | 11 | 110 |
5 | 11 | 110 |
Arrays functions#
Consider the functions that can be applied to the corresponding group. All of the functions I am familiar yet, works in the same way with regular pandas.Series
. Here are some cases that were useful for me in the groupby context.
Concatenation#
Use the sum function to concatenate strings of subsets.
The following cell defines an example dataframe.
df = pd.DataFrame({
"group" : ["a", "a", "b", "b"],
"text" : ["hello", "test", "line3", "superline"]
})
df
group | text | |
---|---|---|
0 | a | hello |
1 | a | test |
2 | b | line3 |
3 | b | superline |
The applicaiton of the sum
funciton to the groupby("group")
object is displayed below.
df.groupby("group").sum()
text | |
---|---|
group | |
a | hellotest |
b | line3superline |
Shift#
The shift
functions allows each value to be matched with the previous/next value in the same group. Check corresponding documentation page.
The following cell creates an example dataframe.
df = pd.DataFrame({
"group" : ["a", "a", "b", "b"],
"variable": [1, 2, 3, 4]
})
df
group | variable | |
---|---|---|
0 | a | 1 |
1 | a | 2 |
2 | b | 3 |
3 | b | 4 |
Applying the shift
function to the groupby("group")
returns the dataframe that is indexed just like the original dataframe, but all the variable
values are shifted inside the groups determined by the group
column.
df.groupby("group").shift()
value | |
---|---|
0 | NaN |
1 | 1.0 |
2 | NaN |
3 | 3.0 |